# Optimal arrangements of hyperplanes for multiclass classification

In this paper, we present a novel approach to construct multiclass classifiers by means of arrangements of hyperplanes. We propose different mixed integer non linear programming formulations for the problem, using extensions of widely used measures for misclassified observations. We prove that kernel tools can be extended to these models. We also detail some strategies that help to solve the associated mathematical programming problems more efficiently. An extensive battery of experiments has been run which reveals the power of our proposal in contrast to other previously proposed methods.

## 1. Introduction

Support Vector Machine (SVM) is a widely used methodology in supervised binary classification, first proposed by Cortes and Vapnik [6]. Given a set of data together with their labels, the general idea underlying SVM methodologies is to find a partition of the feature space, and an assignment rule from data to the cells of that partition, that maximizes the separation between the classes of a training sample while minimizing a certain measure of the misclassification errors. At this point, convex optimization tools come into play: the shape of the resulting dual problem allows one to project the data onto a higher dimensional space, where the separation of the classes can be performed more adequately, while the problem can still be solved with the same computational effort as the original one. This fact is the so-called kernel trick, and it has motivated the successful use of this tool in a wide range of applications [2, 14, 10, 20, 25].

Most of the SVM proposals and extensions concern instances with only two different classes. Several extensions have been proposed for this case by means of choosing different measures for the separation between classes [12, 13, 5, 19], regularization strategies [18], etc. However, the analysis of SVM-based methods for instances with more than two classes has been, from our point of view, only partially investigated. To construct a k-label classification rule (k > 2), one is provided with a training sample of observations together with a label for each of the observations in such a sample. The goal is to find a decision rule which is able to classify out-of-sample data into a single class, learning from the training data.

The most common approaches to construct multiclass classifiers are based on extensions of the methodologies for the binary case, such as Deep Learning tools [1], k-Nearest Neighbors [7, 26] or Naïve Bayes [16]. A few approaches have also been proposed for multiclass classification using the power of binary SVMs. The most popular are the one-versus-all (OVA) and one-versus-one (OVO) methods. The first, OVA, computes, for each class, a binary SVM classifier by labeling the observations as +1 if the observation is in that class and −1 otherwise. This is repeated for all the classes (k times), and each observation is classified into the class whose constructed hyperplane is furthest from it. In the OVO approach, each class is separated from every other class by computing k(k−1)/2 hyperplanes (one for each pair of classes). Although OVA and OVO share the advantages that they can be solved efficiently and that they inherit all the good properties of binary SVM, they are not able to correctly classify many multiclass instances, such as datasets with different, separated clouds of observations which belong to the same class. Some attempts have been made to construct multiclass SVMs by solving a compact optimization problem which involves all the classes at the same time, such as WW [28], CS [8] or LLW [15], where the authors consider different choices for the hinge loss in multicategory classification. Some of them (OVO, CS and WW) are implemented in some of the most popular machine learning software packages, such as R [23] or Python [24]. However, as far as we know, there are no multiclass classification methods that keep the essence of binary SVM, which stems from finding a globally optimal partition of the feature space.
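The OVA and OVO decision rules described above can be sketched as follows. This is a minimal numpy illustration under our own conventions (hyperplanes given as `(w, b)` pairs, classes encoded as integers); it is not the paper's implementation, and the binary classifiers themselves are assumed to be trained elsewhere.

```python
import numpy as np

def ova_predict(X, hyperplanes):
    """One-versus-all: one (w, b) per class; assign each point to the
    class with the largest decision value w.x + b (the class whose
    hyperplane leaves the point deepest on its positive side)."""
    scores = np.array([X @ w + b for (w, b) in hyperplanes])  # shape (k, n)
    return np.argmax(scores, axis=0)

def ovo_predict(X, pair_classifiers):
    """One-versus-one: one (w, b, class_pos, class_neg) per pair of
    classes, i.e. k(k-1)/2 classifiers; each one votes, majority wins."""
    n = X.shape[0]
    k = 1 + max(max(cp, cn) for (_, _, cp, cn) in pair_classifiers)
    votes = np.zeros((n, k), dtype=int)
    for w, b, c_pos, c_neg in pair_classifiers:
        side = X @ w + b >= 0
        votes[np.arange(n), np.where(side, c_pos, c_neg)] += 1
    return np.argmax(votes, axis=1)
```

Note how both rules only combine independently trained binary separators; neither optimizes a global partition of the feature space, which is the gap the approach of this paper addresses.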

Here, we propose a novel approach to handle multiclass classification that extends the paradigm of binary SVM classifiers. In particular, our method finds a polyhedral partition of the feature space and an assignment of classes to the cells of the partition, by maximizing the separation between classes and minimizing the misclassification errors. For biclass instances, and using a single separating hyperplane, the method coincides with the classical SVM, although new alternatives appear even for biclass datasets when more than one hyperplane is allowed to separate the data. We propose different mathematical programming formulations for the problem. These models share the same modelling idea, and they allow us to consider different measures for the misclassification errors (hinge or ramp-based losses). The models belong to the family of Mixed Integer Non Linear Programming (MINLP) problems, in which the nonlinearities come from the representation of the margin distances between classes, which can be modeled as a set of second order cone constraints [4]. This type of constraint can be handled by any of the available off-the-shelf optimization solvers (CPLEX, Gurobi, XPress, SCIP, …). However, the number of binary variables in the model may become an inconvenience when trying to compute classifiers for medium to large size instances. For the sake of obtaining the classifiers in lower computational times, we detail some strategies which can be applied to reduce the dimensionality of the problem. Recently, a few new approaches have been proposed for different classification problems using discrete optimization tools. For instance, in [27] the authors construct classification hyperboxes for multiclass classification, and in [9, 19, 21] new mixed integer linear programming tools are provided for feature selection in SVM.

In case the data are, by nature, nonlinearly separable, in classical SVM one can apply the so-called kernel trick to project the data onto a higher dimensional space where linear separation performs better. The magic there is that one does not need to know which specific transformation is performed on the data, and that the decision space of the mathematical programming problem to be solved is the same as that of the original one. Here, we will show that the kernel trick can be extended to our framework, which also allows us to find nonlinear classifiers with our methodology.

Finally, we successfully apply our method to some well-known datasets in multiclass classification, and compare the results with those obtained with the best SVM-based classifiers (OVO, CS and WW).

The paper is organized as follows. In Sections 2 and 3 we describe and set up the elements of the problem, and afterwards introduce the MINLP formulation of our model. We also present a linear version, in which the margin is measured with a norm that admits a linear representation. Moreover, we point out that a ramp loss version of the model can be derived with very few modifications. In Section 3.2 we show that this model admits the use of kernels, as the binary SVM does. In Section 4 we introduce some heuristics that we have developed to obtain an initial solution of the MINLP. Finally, Section 5 contains computational results on real data sets and their comparison with the available methods mentioned above.

## 2. Multiclass Support Vector Machines

In this section, we introduce the problem under study and set the notation used throughout this paper.

Given a training sample, the goal of supervised classification is to find a separation rule to assign labels (classes) to data (observations), to be applied afterwards to out-of-sample data. We assume that a given number, m, of linear separators have to be built to obtain a partition of the feature space into polyhedral cells. The linear separators are nothing but hyperplanes in ℝp of the form Hr = {z ∈ ℝp : ωrᵗz + ωr0 = 0}, for r = 1, …, m. Each of the subdivisions obtained with such an arrangement of hyperplanes will then be assigned a label in {1, …, k}. Figure 1 shows the framework of our approach: a subdivision of the feature space induced by an arrangement of hyperplanes, together with an assignment of cells to classes. In this case such a subdivision and assignment reach a perfect classification of the given training data.

Under these settings, our goal is to construct such an arrangement of hyperplanes, induced by their coefficient vectors (the first component of each vector accounts for the intercept), and to assign a single label to each one of the cells in the subdivision it induces. First, observe that each cell in the subdivision can be uniquely identified with a sign-vector in {−1, +1}^m, each of the signs representing on which side of the corresponding hyperplane the cell lies. Hence, a suitable assignment will be a function which maps cells (equivalently, sign-patterns) to labels in {1, …, k}. For a given observation x, which belongs to a cell, we identify it with its sign-pattern s(x) = (s1(x), …, sm(x)), where sr(x) = sign(ωrᵗx + ωr0) for r = 1, …, m. Then, the classification rule assigns to x the label of the cell with sign-pattern s(x). The goodness of the fit will be based on comparing predictions and actual labels on a training sample, but also on maximally separating the classes, in order to find good predictions and avoid undesired overfitting.
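The sign-pattern decision rule can be sketched in a few lines. This is an illustrative fragment under our own conventions (rows of `W` are the hyperplane coefficients, `b` the intercepts, and `cell_to_class` the learned assignment from sign-patterns to labels); ties on a hyperplane are broken toward +1 here, a choice the paper does not specify.

```python
import numpy as np

def sign_pattern(x, W, b):
    """Sign-pattern s(x) of x w.r.t. the m hyperplanes w_r.x + b_r = 0;
    one sign in {-1, +1} per hyperplane (ties broken toward +1)."""
    return tuple(1 if w @ x + b0 >= 0 else -1 for w, b0 in zip(W, b))

def classify(x, W, b, cell_to_class):
    """Decision rule: look up the class assigned to the cell of x."""
    return cell_to_class[sign_pattern(x, W, b)]
```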

In binary classification datasets, SVM is a particular case of our approach when m = 1, i.e., when a single hyperplane is used to partition the feature space. In such a case there are two sign-patterns and two classes, so whenever there are observations in both classes, the assignment is one-to-one. However, even for biclass instances, if more than one hyperplane is used, one may find better classifiers. In Figure 2 we draw the same dataset of labeled (red and blue) observations and the result of applying a classical SVM (left) and our approach with two hyperplanes (right). In that picture one may see that not only are the misclassification errors smaller with two hyperplanes, as expected, but also the separation between classes is larger, improving the predictive power of the classifier.

This approach is particularly useful for datasets in which there are several separated “clouds” of observations that belong to the same class. In Figure 3, we show two different instances in which, again, the colors indicate the class of the observations. The classes in both instances cannot be appropriately separated using any of the linear SVM-based methods, whereas we were able to perfectly separate the classes using 5 hyperplanes.

In Figure 4 we compare our approach and the one-versus-one (OVO) approach on a given instance. In the left picture we show the result of separating the classes with four hyperplanes, reaching a perfect classification of the training sample. In the right picture we show the best linear OVO classifier, in which only part of the data were correctly classified. We would also like to highlight that, although nonlinear SVM approaches may separate the data more conveniently, our approach may help to avoid using kernels and ease the interpretation of the results.

Different choices are possible to compute multiclass classifiers under the proposed framework. We will consider two different models which share the same paradigm but differ in the way they account for misclassification errors. Recall that in SVM-based methods, two functions are simultaneously minimized when constructing a classifier: (1) a measure of the generalization performance of the separation rule on out-of-sample observations, based on finding a maximum separation between classes; and (2) a measure of the misclassification errors on the training set of observations. Both criteria are adequately weighted in order to find a good compromise between the goals.

In what follows we describe the way we account for the two criteria in our multiclass classification framework.

### 2.1. Separation between classes

Concerning the first criterion, we measure the separation between classes as usual in SVM-based methods. Let ωr and ωr0 be the coefficients and intercept of the rth hyperplane in the arrangement. The Euclidean distance between the shifted hyperplanes {z : ωrᵗz + ωr0 = −1} and {z : ωrᵗz + ωr0 = +1} is given by 2/∥ωr∥, where ∥·∥ is the Euclidean norm (see [22]).

Hence, in order to find globally optimal hyperplanes with maximum separation, we maximize the minimum separation between classes, that is, min_{r=1,…,m} 2/∥ωr∥. This measure conveniently keeps the minimum separation between classes as large as possible. Observe that finding the maximum min-separation is equivalent to minimizing max_{r=1,…,m} ½∥ωr∥². For a given arrangement of hyperplanes H = {H1, …, Hm}, we will denote this term by hH(H1, …, Hm).
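In display form, the max–min margin criterion and its reformulation (with the notation above, where the ωr are the hyperplane coefficients) can be written as:

```latex
\max_{H_1,\dots,H_m}\; \min_{r=1,\dots,m} \frac{2}{\|\omega_r\|}
\quad\Longleftrightarrow\quad
\min_{H_1,\dots,H_m}\; \max_{r=1,\dots,m} \frac{1}{2}\|\omega_r\|^2,
\qquad\text{so that } h_H(H_1,\dots,H_m) = \max_{r=1,\dots,m} \tfrac{1}{2}\|\omega_r\|^2 .
```

The equivalence follows because 2/∥ωr∥ is a decreasing function of ∥ωr∥, so maximizing the smallest margin is the same as minimizing the largest (squared) coefficient norm.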

Here, we observe that different criteria could be used to model the separation between classes. For instance, one may consider maximizing the summation of all the separations, namely ∑_{r=1}^{m} 2/∥ωr∥. However, although mathematically possible, this approach does not capture the original concept of classical SVM, and we leave it to be developed by the interested reader.

### 2.2. Misclassifying errors

The performance of the classifier on the training set is usually measured with some type of misclassification error. Classical SVMs with hinge-loss errors use, for observations that are not well classified, a penalty proportional to the distance to the side on which they would be well classified. Then the overall sum of these errors is minimized. We extend the notion of hinge-loss errors to the multiclass setting as follows:

###### Definition 2.1 (Multiclass Hinge-Losses).

Let H = {H1, …, Hm} be an arrangement of hyperplanes and (x̄, ȳ) an observation/label pair, with s(x̄) the sign-pattern of x̄ with respect to the hyperplanes in H. Let Φ be a function that assigns a class to each cell induced by H, and let s̄(x̄) be the sign-pattern of the closest cell whose class assigned by Φ is ȳ.

• x̄ is said to be incorrectly classified with respect to H if Φ(s(x̄)) ≠ ȳ; otherwise, it is said that x̄ is well classified.

• The multiclass in-margin hinge-loss for (x̄, ȳ) with respect to the hyperplane Hr is defined as:

 hI(x̄, ȳ, Hr) = { max{0, min{1, 1 − sr(x̄)(ωrᵗx̄ + ωr0)}}   if x̄ is well classified through Hr,
                { 0                                        otherwise.

• The multiclass out-margin hinge-loss for (x̄, ȳ) with respect to the hyperplane Hr is defined as:

 hO(x̄, ȳ, Hr) = { 1 − s̄r(x̄)(ωrᵗx̄ + ωr0)   if x̄ is not well classified through Hr,
                { 0                         otherwise.

The losses hI and hO account for misclassification errors with different causes. On the one hand, hI models the errors due to observations that, although adequately classified with respect to Hr, belong to the margin between the shifted hyperplanes {z : ωrᵗz + ωr0 = −1} and {z : ωrᵗz + ωr0 = +1}. On the other hand, hO measures, for incorrectly classified observations, how far x̄ is from being well classified. Note that if an observation, aside from being wrongly classified, belongs to the margin, then only hO should be accounted for. In Figure 5 we illustrate the differences between the two types of losses.
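Definition 2.1 can be evaluated pointwise as follows. This is a hedged sketch under our own conventions: one hyperplane `(w, b)` at a time, the sign `s_r` (or `s_bar_r` for the target cell) and the well-classified flag computed elsewhere from the cell-to-class assignment.

```python
import numpy as np

def in_margin_loss(x, w, b, s_r, well_classified):
    """Multiclass in-margin hinge loss h_I: penalizes a well-classified
    point lying inside the margin strip of hyperplane r, capped at 1."""
    if not well_classified:
        return 0.0
    return max(0.0, min(1.0, 1.0 - s_r * (w @ x + b)))

def out_margin_loss(x, w, b, s_bar_r, well_classified):
    """Multiclass out-margin hinge loss h_O: for a misclassified point,
    a distance-like penalty toward the correct side of hyperplane r."""
    if well_classified:
        return 0.0
    return 1.0 - s_bar_r * (w @ x + b)
```

For a well-classified point at signed value 0.5 the in-margin loss is 0.5 (inside the strip), while at signed value 2 it is 0 (outside the strip), matching the capped hinge in the definition.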

## 3. Mixed Integer Non Linear Programming Formulations

In this section we describe the two mathematical optimization models that we propose for the multiclass separation problem. With the above notation, the problem can be mathematically stated as follows:

 min  hH(H1, …, Hm) + C1 ∑_{i=1}^{n} ∑_{r=1}^{m} hI(xi, yi, Hr) + C2 ∑_{i=1}^{n} ∑_{r=1}^{m} hO(xi, yi, Hr)      (1)
 s.t. Hr is a hyperplane in ℝp, for r = 1, …, m,

where C1 and C2 are constants which model the costs of the strip-related (in-margin) and misclassification (out-margin) errors, respectively. Usually these constants will be considered equal; nevertheless, in practice, studying different values for them might lead to better predictions. A case of particular interest results from considering both constants equal to a single parameter C.

Observe that the problem above consists of finding the arrangement of hyperplanes minimizing a combination of the three quality measures described in the previous section: the maximum margin between classes and the overall sums of the in-margin and out-margin misclassification errors. In what follows, we describe how the above problem can be rewritten as a mixed integer non linear programming problem by means of adequate decision variables and constraints. Furthermore, the proposed model will consist of a set of continuous and binary variables, a linear objective function, a set of linear constraints and a set of second order cone constraints. This allows us to feed the model to an off-the-shelf solver in order to solve, at least, small to medium instances.

First, we describe the variables and constraints needed to model the first term in the objective function. We consider the continuous variables ωr ∈ ℝp and ωr0 ∈ ℝ to represent the coefficients and intercept of hyperplane Hr, for r = 1, …, m. Since there is no distinction between hyperplanes, we can assume, without loss of generality, that they are non-increasingly sorted with respect to the norms of their coefficients, i.e., ∥ω1∥ ≥ ∥ω2∥ ≥ … ≥ ∥ωm∥. Then, it is straightforward to see that the term max_{r} ½∥ωr∥² can be replaced in the objective function by ½∥ω1∥², and the following set of constraints models the desired term:

 ½∥ω1∥² ≥ ½∥ωr∥²,   ∀r = 2, …, m.      (2)

For the second term, each of the in-margin misclassification errors will be identified with the continuous variables eir ≥ 0, for i ∈ N, r ∈ M. Observe that, to determine each of these errors, one first has to determine whether the observation is well classified or not with respect to the rth hyperplane. We consider the following two sets of binary variables:

 tir = { 1  if ωrᵗxi + ωr0 ≥ 0,
       { 0  otherwise.

 zis = { 1  if observation i is assigned to class s,
       { 0  otherwise.

for i ∈ N, r ∈ M, s ∈ K. The t-variables model the sign-patterns of the observations, while the z-variables give the allocation profile of observations to classes. As mentioned above, the classification rule is based on assigning sign-patterns to classes.

The adequate definition of the t-variables is assured with the following constraints:

 ωrᵗxi + ωr0 ≥ −T(1 − tir),   ∀i ∈ N, r ∈ M,      (3)
 ωrᵗxi + ωr0 ≤ T tir,          ∀i ∈ N, r ∈ M,      (4)

where T is a big enough constant. Observe that T can be easily and accurately estimated from the data set under consideration.
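One common way to estimate such a big-M from the data is via Cauchy–Schwarz: if the coefficient norms and intercepts can be bounded, then |ωrᵗxi + ωr0| is bounded over the sample. The sketch below is our own illustration (the bounds `omega_bound` and `intercept_bound` are assumptions, not quantities the paper specifies):

```python
import numpy as np

def estimate_big_M(X, omega_bound=1.0, intercept_bound=1.0):
    """Hypothetical estimate of T for constraints (3)-(4): if
    ||w_r|| <= omega_bound and |w_r0| <= intercept_bound, then by
    Cauchy-Schwarz |w_r.x_i + w_r0| <= omega_bound*||x_i|| + intercept_bound,
    so the maximum of this bound over the data is a valid T."""
    norms = np.linalg.norm(X, axis=1)
    return omega_bound * norms.max() + intercept_bound
```

A tight T matters in practice: overly large big-M constants weaken the continuous relaxation of the MINLP and slow down the solver.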

The following constraints assure the adequate relationships between the variables:

 ∑_{s=1}^{k} zis = 1,              ∀i ∈ N,        (5)
 ∥zi − zj∥1 ≤ 2 ∥ti − tj∥1,        ∀i, j ∈ N,     (6)

Observe that (5) enforces that a single class is assigned to each observation, while (6) assures that the assignments of two observations must coincide if their sign-patterns are the same. Also, the set of z-variables automatically determines whether an observation is incorrectly classified, through the amount ξi = ½∥zi − δi∥1, where δi is the binary encoding of the class of the ith observation (δis = 1 if yi = s and δis = 0 otherwise), which is part of the input data. Observe that ξi equals zero if and only if the predicted and the actual class coincide.
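The misclassification indicator can be computed directly from the one-hot vectors. This small helper is our own reconstruction of that amount (the half-ℓ1-distance between two one-hot encodings is 0 when they match and 1 otherwise):

```python
import numpy as np

def misclassification_indicator(z_i, delta_i):
    """xi_i = (1/2) * ||z_i - delta_i||_1 for one-hot vectors z_i (predicted
    class) and delta_i (actual class): 0 iff they coincide, else 1."""
    return int(0.5 * np.abs(np.asarray(z_i) - np.asarray(delta_i)).sum())
```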

Now, we will model whether the ith observation is well classified or not with respect to the rth hyperplane. Observe that measuring how far an incorrectly classified observation is from being well classified needs further analysis. One may have an incorrectly classified observation and several training observations in its same class. We assume that the error for this observation is the misclassification error with respect to the closest cell containing well-classified observations of its class. Thus, we need to model the choice of a well-classified representative observation for each incorrectly classified observation. We consider the following set of binary variables:

 hij = { 1  if xj, which is well classified and verifies δj = δi, is the representative of xi in its closest cell through the hyperplanes,
       { 0  otherwise.

These variables are correctly defined by imposing the following constraints:

 ∑_{j∈N: yi=yj} hij = 1,     ∀i ∈ N,                 (7)
 ξj + hij ≤ 1,                ∀i, j ∈ N (yi = yj),    (8)
 hii = 1 − ξi,                ∀i ∈ N,                 (9)

The first set of constraints, (7), imposes a single assignment between observations belonging to the same class. Constraints (8) avoid choosing incorrectly classified representative observations. The set of constraints (9) avoids self-assignments for incorrectly classified data, and also enforces well-classified observations to be represented by themselves.

With these variables, we can model the in-margin errors by means of the following constraints:

 ωrᵗxi + ωr0 ≥ 1 − eir − T(3 − tir − tjr − hij),     ∀i, j ∈ N (yi = yj), r ∈ M,    (10)
 ωrᵗxi + ωr0 ≤ −1 + eir + T(1 + tir + tjr − hij),    ∀i, j ∈ N (yi = yj), r ∈ M,    (11)

These constraints model, by using the sign-patterns given by the t-variables, the in-margin errors eir. Note that constraint (10) is activated if tir = tjr = 1 and hij = 1, i.e., if the well-classified observation xj is the representative of xi and both are on the positive side of the rth hyperplane; analogously, (11) is activated if tir = tjr = 0 and hij = 1, i.e., if both are on the negative side of the rth hyperplane. Thus, constraints (10) and (11) adequately model the in-margin errors for all observations. Furthermore, because of (3) and (4), and those described above, the eir variables always take values smaller than or equal to one.
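The big-M activation logic of (10) and (11) can be checked numerically. The snippet below is a sanity-check sketch with our own variable names (it verifies a candidate solution against the two inequalities; it does not solve anything):

```python
def in_margin_bounds(w_dot_x, e, T, t_ir, t_jr, h_ij):
    """True iff constraints (10)-(11) hold for the given values.
    (10) binds only when t_ir = t_jr = h_ij = 1 (both points on the
    positive side, x_j the representative); (11) binds only when
    t_ir = t_jr = 0 and h_ij = 1; otherwise the big-M term relaxes them."""
    c10 = w_dot_x >= 1 - e - T * (3 - t_ir - t_jr - h_ij)
    c11 = w_dot_x <= -1 + e + T * (1 + t_ir + t_jr - h_ij)
    return c10 and c11
```

For instance, a point at signed value 0.4 on the positive side with its representative there too needs e ≥ 0.6, exactly the in-margin hinge loss 1 − 0.4.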

Finally, the third addend, the out-margin errors, will be modeled through the continuous variables dir ≥ 0, for i ∈ N, r ∈ M. With the sets of variables described above, the out-margin misclassification errors can be adequately modeled through the following constraints:

 dir ≥ 1 − ωrᵗxi − ωr0 − T(2 + tir − tjr − hij),    ∀i, j ∈ N (yi = yj), r ∈ M,    (12)
 dir ≥ 1 + ωrᵗxi + ωr0 − T(2 − tir + tjr − hij),    ∀i, j ∈ N (yi = yj), r ∈ M,    (13)

There, constraints (12) are active only in case tir = 0, tjr = 1 and hij = 1, that is, if xj is a well-classified observation on the positive side of Hr while xi is incorrectly classified on the negative side of Hr, with xj being the representative observation for xi (note that in case xi is well classified, then hii = 1 by (9), and the constraint cannot be activated). The main difference with respect to (10) and (11) is that the constraints are activated only in case xi is incorrectly classified. The second set of constraints, namely (13), can be justified analogously in terms of the negative side of Hr.

According to the above constraints, a misclassified observation is penalized in two ways with respect to each hyperplane Hr. In case the observation is well classified with respect to Hr but belongs to the margin, then the corresponding e-variable is positive (and the d-variable is zero). Otherwise, if the observation is wrongly classified with respect to Hr, then the corresponding d-variable is positive (and the e-variable is zero).

We illustrate the convenience of the proposed constraints on the data drawn in Figure 6. Observe that A is not correctly classified, since it lies within a cell to which the blue class is not assigned. Suppose that B, a well-classified observation, is the representative of A; then the model has to penalize two types of errors. With respect to the hyperplane through which A is wrongly classified, constraint (12) is activated and the corresponding d-variable becomes positive. On the other hand, even though A is well classified with respect to the other hyperplane, its margin violation also has to be penalized: constraint (10) is activated and the corresponding e-variable becomes positive.

The above comments can be summarized in the following mathematical programming formulation for Problem (1):

 min  ∥ω1∥² + C1 ∑_{i=1}^{n} ∑_{r=1}^{m} eir + C2 ∑_{i=1}^{n} ∑_{r=1}^{m} dir        (MCSVM)
 s.t. constraints (2)–(13),
      ωr ∈ ℝp, ωr0 ∈ ℝ,            ∀r ∈ M,
      dir, eir ≥ 0, tir ∈ {0, 1},   ∀i ∈ N, r ∈ M,
      hij ∈ {0, 1},                 ∀i, j ∈ N,
      zis ∈ {0, 1},                 ∀i ∈ N, s ∈ K,

(MCSVM) is a mixed integer non linear programming model whose nonlinear terms come from the norm minimization in the objective function, so they are second order cone representable. Therefore, the model is suitable to be solved using any of the available commercial solvers, such as Gurobi, CPLEX, etc. The main bottleneck of the formulation stems from the number of binary variables, which is of the order of n², driven mainly by the h-variables.

### 3.1. Building the classification rule

Recall that the main goal of multiclass classification is to determine a decision rule such that, given any observation, it is able to assign it a class. Hence, once the solution of (MCSVM) is obtained, the decision rule has to be derived. Given a new observation x, two different situations are possible: (a) x belongs to a cell with an assigned class; and (b) x belongs to a cell with no training observations inside, and hence with no assigned class. In the first case, x is assigned to its cell's class. In the second case, different strategies to determine a class for x are possible. We propose the following assignment rule, based on the same allocation method used in (MCSVM): observations are assigned to their closest well-classified representatives. More specifically, let s(x) be the sign-pattern of x with respect to the optimal arrangement obtained from (MCSVM), given by the optimal vectors ω*r and ω*r0. Then, among all the well-classified observations in the training sample, we assign x to the class of the one whose cell is, on average, the closest (least separated from x). Such a classification of x can be performed by enumerating all the possible assignments and computing the distance measure over all of them. Equivalently, one can solve the following mathematical programming problem:

 min  ∑_{j∈J} ( ∑_{r=1,…,m : s(xj)r + s(x)r = 0} |(ω*r)ᵗx + ω*r0| ) Hj
 s.t. ∑_{j∈J} Hj = 1,
      Hj ∈ {0, 1},   ∀j ∈ J,

where J is the index set of well-classified observations in the training sample. The integrality condition in the problem above can be relaxed, since the constraint matrix is totally unimodular, and thus the problem is a linear programming problem. Clearly, the solution of the above problem gives the optimal labelling of x with respect to the existing cells in the arrangement.
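Since the problem above only selects one well-classified training point, it can equally be solved by direct enumeration. The sketch below does exactly that, under our own conventions (rows of `W` are the optimal coefficients ω*r, `b` the intercepts; `well_X`, `well_y` are the well-classified training points and their classes):

```python
import numpy as np

def assign_empty_cell(x, W, b, well_X, well_y):
    """Assignment rule for a point x in a cell with no assigned class:
    for each well-classified training point x_j, sum |w_r.x + b_r| over
    the hyperplanes where the sign-patterns of x and x_j disagree, and
    return the class of the x_j minimizing that total."""
    s_x = np.sign(W @ x + b)
    best_cost, best_label = float("inf"), None
    for xj, yj in zip(well_X, well_y):
        s_j = np.sign(W @ xj + b)
        disagree = s_x != s_j
        cost = np.abs(W @ x + b)[disagree].sum()
        if cost < best_cost:
            best_cost, best_label = cost, yj
    return best_label
```

This mirrors the LP objective: the cost of choosing x_j is the total separation between the cell of x and the cell of x_j, so the minimizer is the "closest" labeled cell.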

One could also consider other robust measures for such an assignment following the same paradigm, as min-max error or the like.

###### Remark 3.1 (Ramp Loss Missclassifying errors).

An alternative measure of misclassification errors on the training sample is the ramp loss. The ramp loss version of the model is interesting for certain instances, since it allows one to improve robustness against potential outliers. Instead of using out-margin hinge loss errors, the ramp-loss measure consists of penalizing wrongly classified observations by a constant, independently of how far they are from being well classified. Given an observation/label pair (x̄, ȳ), the ramp loss is defined as:

 RL((x̄, ȳ), H) = { 0   if x̄ is well classified,
                  { 1   otherwise.

Note that, for our training sample, the ramp loss is already modeled in our formulation through the ξ-variables: more specifically, RL((xi, yi), H) = ξi for all i ∈ N. In order to use it, we just need to make the following few modifications to the MINLP problem:

 min  ∥ω1∥² + C1 ∑_{i=1}^{n} ∑_{r=1}^{m} eir + C2 ∑_{i=1}^{n} ξi        (MCSVMRL)
 s.t. ωr ∈ ℝp, ωr0 ∈ ℝ,     ∀r ∈ M,
      eir ≥ 0,               ∀i ∈ N, r ∈ M,
      hij ∈ {0, 1},          ∀i, j ∈ N,
      zis ∈ {0, 1},          ∀i ∈ N, s ∈ K,
      tir ∈ {0, 1},          ∀i ∈ N, r ∈ M,
      ξi ∈ {0, 1},           ∀i ∈ N.

### 3.2. Nonlinear Multiclass Classification

Finally, we analyze a crucial question in any SVM-based methodology: whether the theory of kernels can be applied in our framework. Using kernels means being able to apply a transformation of the features, φ, to a higher dimensional space where the separation of the data can be performed more adequately. In case the desired transformation φ is known, one could transform the data and solve problem (MCSVM) with a higher number of variables. However, in binary SVMs, by formulating the dual of the classification problem, one can observe that it depends on the observations only via the inner products of each pair of observations, i.e., through the amounts xiᵗxj for i, j ∈ N. If the transformation φ is applied to the data, the observations only appear in the problem as φ(xi)ᵗφ(xj) for i, j ∈ N. Thus, kernels are defined as generalized inner products, K(xi, xj) = φ(xi)ᵗφ(xj) for each i, j ∈ N, and they can be provided using any of the well-known families of kernel functions (see, e.g., [11]). Moreover, Mercer's theorem gives sufficient conditions for a function to be a kernel function (one which is constructed as the inner product of a transformation of the features), which allows one to construct kernel measures that induce transformations. The main advantage of using kernels, apart from a probably better separation in the projected space, is that in binary SVM the complexity of the transformed problem is the same as that of the original problem. More specifically, the dual problems have the same structure and the same number of variables.

Although problem (MCSVM) is a MINLP, and hence duality results do not hold, one can apply decomposition techniques to separate the binary and the continuous variables, and then iterate over the binary variables by recursively solving certain continuous and easier problems (see, e.g., the literature on Benders decomposition). The following result, whose proof is detailed in the Appendix, states that our approach also allows us to find nonlinear classifiers via kernel tools.

###### Theorem 3.1.

Let φ be a transformation of the feature space. Then, one can obtain a multiclass classifier which depends on the original data only by means of the inner products φ(xi)ᵗφ(xj), for i, j ∈ N.

Proof. See the Appendix. ∎

## 4. A Math-Heuristic Algorithm

As mentioned above, the computational burden of solving (MCSVM), which is a mixed integer non linear programming problem (in which the nonlinearities come from the norm minimization in the objective function), is due to the combination of the discrete aspects and the nonlinearities of the model. In this section we provide some heuristic strategies that allow us to cut down the computational effort by fixing some of the variables. They also allow us to provide good-quality initial feasible solutions when solving (MCSVM) exactly using a commercial solver. Two different strategies are provided. The first one consists of applying a variable-fixing strategy to reduce the number of h-variables in the model; note that, in principle, n² variables of this type are considered in the model. The second approach consists of fixing to zero some of the z-variables; these variables model assignments between observations and classes. The proposed approach is a math-heuristic, since after applying the adequate dimensionality reductions, Problem (MCSVM) (or (MCSVMRL)) still has to be solved. Also, although our strategies do not ensure any kind of optimality measure, they exhibit a very good performance, as will be shown in our computational experiments. Observe that when classifying data sets, the efficiency of a decision rule, such as ours, is usually measured by means of the accuracy of the classification on out-of-sample data, and the objective value of the proposed model is just an approximate measure of such an accuracy which cannot be computed with the training data alone.

### 4.1. Reducing the h-variables

Our first strategy comes from the fact that, for a given observation xi, there may be several possible choices of representative xj, with hij = 1, leading to the same final result. Recall that hij could be equal to one whenever xj is a well-classified observation in the same class as xi, and that the errors eir and dir are then computed by using the cell of xj rather than the observation itself. Thus, if a set of well-classified observations of the same class is close enough, a single one of them can be taken to act as the representative element of the group. In order to illustrate the procedure, we show in Figure 7 (left) an instance in which the classes form clouds that are easily identified by applying any clustering strategy. In such a case, (MCSVM) has one h-variable per pair of observations, but if we allow hij to take value one only at a single point of each cluster, we obtain the same result while considerably reducing the number of variables. In the right picture we show the clusters computed from the data, and a (random) selection of a unique point at each cluster for which the h-values are allowed to be one.

This strategy is summarized in Algorithm 1.
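Since Algorithm 1 is not reproduced here, the idea can be sketched as follows, assuming a simple greedy radius-based grouping (the paper allows any clustering strategy; the function name, the `eps` threshold and the data layout are our own illustration):

```python
def select_representatives(points, labels, eps):
    """Greedy radius-based grouping: within each class, a point closer than
    eps to an existing representative joins that representative's group;
    otherwise it starts a new group.  Only the representatives would keep
    free h-variables in the reduced model."""
    reps = []          # list of (point, label) representatives
    assignment = []    # index of the representative chosen for every point
    for p, y in zip(points, labels):
        for idx, (r, ry) in enumerate(reps):
            # Euclidean distance to an existing same-class representative
            if ry == y and sum((a - b) ** 2 for a, b in zip(p, r)) ** 0.5 <= eps:
                assignment.append(idx)
                break
        else:
            reps.append((p, y))
            assignment.append(len(reps) - 1)
    return reps, assignment
```

On the example of Figure 7, this reduces the number of free -variables from one per observation to one per cluster.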

### 4.2. Reducing the z-variables

The second strategy consists of fixing to zero some of the point-to-class assignments (-variables). In the dataset drawn in Figure 8 one may observe that, by proximity, points in the black class will never be assigned to the red class. Following this idea, we derive a procedure to set some of the -variables to zero; it is described in Algorithm 2. In that figure one can observe that for the red cluster we obtain the following set of distances: . This set is reduced to because we do not take into account the distance to the green cluster on the far right (). Thus, we fix to zero all variables that relate each cluster with the cluster attaining the maximum of its minimum-distance set; in this case, we fix to zero the -variables associating the black cluster with the red cluster ( and ).

This strategy fixes a number of variables in advance, making the problem easier to solve.
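A minimal sketch of this fixing rule, under our own simplifications (clusters are given as lists of points, distances are Euclidean, and each cluster excludes only the farthest cluster, i.e. the one attaining the maximum of its minimum-distance set; all names are hypothetical):

```python
import math

def fix_far_assignments(clusters):
    """For each cluster, compute the minimum point-to-point distance to every
    other cluster, and return the index of the farthest of those clusters:
    the z-variables assigning this cluster's points to that cluster's class
    would be fixed to zero in the reduced model."""
    def min_dist(A, B):
        return min(math.dist(a, b) for a in A for b in B)

    fixed = {}
    for i, A in enumerate(clusters):
        dists = {j: min_dist(A, B) for j, B in enumerate(clusters) if j != i}
        fixed[i] = max(dists, key=dists.get)  # farthest cluster index
    return fixed
```

Note that `math.dist` requires Python 3.8 or later.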

## 5. Experiments

### 5.1. Real Datasets

In this section we report the results of our computational experience. We ran a series of experiments to analyze the performance of our model on several real-world datasets widely used in the classification literature, all available from the UCI machine learning repository [17]. Summary information about the datasets is given in Table 1. For each dataset we report the number of observations in the training sample () and the test sample (), the number of features (), the number of classes (), the number of hyperplanes used in our separation (), and the number of hyperplanes that the OVO methodology needs to obtain its classifier ().

For these datasets, we ran both the hinge-loss () and the ramp-loss () models, using as margin distances those based on the and norms. We performed a 5-fold cross-validation scheme to test each approach: the datasets were partitioned into 5 random train-test partitions, the models were solved on the training sample, and the resulting classification rule was applied to the test sample. We report the average test accuracy (in percentage) over the repetitions of the experiment, defined as:

 ACC = \dfrac{\#\,\text{well-classified test observations}}{n_{Te}} \cdot 100
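In code, this accuracy measure is simply (a trivial sketch; `y_true` and `y_pred` are our own names):

```python
def accuracy(y_true, y_pred):
    """ACC: percentage of well-classified test observations,
    where n_Te = len(y_true) is the test-sample size."""
    hits = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return 100.0 * hits / len(y_true)
```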

The parameters of our models were also chosen by a grid-based cross-validation scheme. In particular, we vary the parameter m (the number of hyperplanes to locate) and the misclassification costs C1 and C2 in:

 m \in \{2, \dots, k\}, \qquad C_1, C_2 \in \{0.1, 0.5, 1, 5, 10\}.

For the hinge-loss models we take , while for the ramp-loss models we consider , to impose a high penalty on badly misclassified observations. As a result we obtain a misclassification error for each grid point, and we select the parameters providing the lowest error, which are then used to evaluate on the test set. The same methodology was applied to the classical methods, OVO, Weston-Watkins (WW) and Crammer-Singer (CS), by moving their single misclassification cost in .
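This grid search can be sketched as follows, assuming a hypothetical `solve_and_score` callback that trains one of the models on a fold with the given parameters and returns its validation accuracy (the callback and all names are our own illustration; in the paper the models are solved with Gurobi):

```python
from itertools import product

def grid_search(train_folds, solve_and_score, k):
    """Grid-based cross-validation over the number of hyperplanes m and the
    misclassification costs C1, C2, following the experimental setup:
    m in {2, ..., k} and C1, C2 in {0.1, 0.5, 1, 5, 10}."""
    grid_m = range(2, k + 1)
    grid_C = [0.1, 0.5, 1, 5, 10]
    best, best_acc = None, float("-inf")
    for m, C1, C2 in product(grid_m, grid_C, grid_C):
        # average validation accuracy of this grid point across the folds
        acc = sum(solve_and_score(f, m, C1, C2) for f in train_folds) / len(train_folds)
        if acc > best_acc:
            best, best_acc = (m, C1, C2), acc
    return best, best_acc
```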

The mathematical programming models were coded in Python 3.6, and solved using Gurobi 7.5.2 on a PC Intel Core i7-7700 processor at 2.81 GHz and 16GB of RAM.

In Table 2 we report the average accuracies obtained with our 4 models and compare them with those obtained with OVO, WW and CS. The first two columns (RL and HL) provide the average accuracies of our two approaches (Ramp Loss -RL- and Hinge Loss -HL-) using the -norm as the distance measure. The third and fourth columns (RL and HL) provide the corresponding results for the -norm. In the last three columns, we report the average accuracies obtained with the classical methodologies (OVO, WW and CS). The best accuracy obtained for each dataset is boldfaced in the table. One can observe that our models always replicate or improve the scores of the former models, as expected. When comparing the results obtained by our approaches with the different norms, the Euclidean norm seems to provide slightly better results than the -norm. However, the models for the norm are mixed integer linear programming problems, while the -norm based models are mixed integer nonlinear programming problems, which may imply a higher computational difficulty when solving large-size instances.

## 6. Conclusions

In this paper we propose a novel modeling framework for multiclass classification based on the Support Vector Machine paradigm, in which the different classes are linearly separated and the separation between classes is maximized. We propose two approaches whose goal is to compute an arrangement of hyperplanes that subdivides the space into cells, each cell being assigned to a class based on the training data. The models result in Mixed Integer (Non Linear) Programming problems. Some strategies are presented to help solvers find the optimal solution of the problem. We also prove that the kernel trick can be extended to this framework. The power of the approach is illustrated on some well-known datasets from the multicategory classification literature as well as on some small synthetic examples.

## Acknowledgements

The authors were partially supported by the research project MTM2016-74983-C2-1-R (MINECO, Spain). The first author has been also supported by project PP2016-PIP06 (Universidad de Granada) and the research group SEJ-534 (Junta de Andalucía).

## References

• [1] Agarwal, N., Balasubramanian, V.N., Jawahar, C.: Improving multiclass classification by deep networks using dagsvm and triplet loss. Pattern Recognition Letters (2018).
• [2] Bahlmann, C., Haasdonk, B., Burkhardt, H.: On-line handwriting recognition with support vector machines - a kernel approach. In: Proceedings of the Eighth International Workshop on Frontiers in Handwriting Recognition (IWFHR'02), pp. 49–. IEEE Computer Society, Washington, DC, USA (2002).
• [3] Benders, J.F.: Partitioning procedures for solving mixed-variables programming problems. Numerische mathematik 4(1), 238–252 (1962)
• [4] Blanco, V., Ben Ali, S. and Puerto, J. (2014). Revisiting several problems and algorithms in Continuous Location with norms. Computational Optimization and Applications 58(3): 563-595.
• [5] Blanco, V., Puerto, J., and Rodríguez-Chía, A. M. (2017). On -Support Vector Machines and Multidimensional Kernels. arXiv preprint arXiv:1711.10332.
• [6] Cortes, C., Vapnik, V.: Support-vector networks. Machine learning 20(3), 273–297 (1995)
• [7] Cover, T., Hart, P.: Nearest neighbor pattern classification. IEEE transactions on information theory 13(1), 21–27 (1967)
• [8] Crammer, K., Singer, Y.: On the algorithmic implementation of multiclass kernel-based vector machines. Journal of Machine Learning Research 2, 265–292 (2001)
• [9] Ghaddar, B., Naoum-Sawaya, J.: High dimensional data classification and feature selection using support vector machines. European Journal of Operational Research 265(3), 993–1004 (2018)

• [10] Harris, T.: Quantitative credit risk assessment using support vector machines: Broad versus narrow default definitions. Expert Systems with Applications 40(11), 4404–4413 (2013)
• [11] Horn, D., Demircioglu, A., Bischl, B., Glasmachers, T., and Weihs, C. (2016). A comparative study on large scale kernelized support vector machines. Advances in Data Analysis and Classification, 1-17.
• [12] K. Ikeda and N. Murata (2005). Geometrical Properties of Nu Support Vector Machines with Different Norms. Neural Computation 17(11), 2508-2529.
• [13] K. Ikeda and N. Murata (2005). Effects of norms on learning properties of support vector machines. ICASSP (5), 241-244
• [14] Kašćelan, V., Kašćelan, L., Novović Burić, M.: A nonparametric data mining approach for risk prediction in car insurance: a case study from the montenegrin market. Economic research-Ekonomska istraživanja 29(1), 545–558 (2016)
• [15] Lee, Y., Lin, Y., Wahba, G.: Multicategory support vector machines: Theory and application to the classification of microarray data and satellite radiance data. Journal of the American Statistical Association 99(465), 67–81 (2004)
• [16] Lewis, D.D.: Naive (Bayes) at forty: The independence assumption in information retrieval. In: European Conference on Machine Learning, pp. 4–15. Springer (1998)
• [17] Lichman, M.: UCI machine learning repository (2013).
• [18] López, J., Maldonado, S., and Carrasco, M. (2018). Double regularization methods for robust feature selection and SVM classification via DC programming. Information Sciences, 429, 377-389.
• [19] Labbé, M., Martínez-Merino, L. I., and Rodríguez-Chía, A. M. (2018). Mixed Integer Linear Programming for Feature Selection in Support Vector Machine. arXiv preprint arXiv:1808.02435.
• [20] Majid, A., Ali, S., Iqbal, M., Kausar, N.: Prediction of human breast and colon cancers from imbalanced data using nearest neighbor and support vector machines. Computer methods and programs in biomedicine 113(3), 792–808 (2014)
• [21] Maldonado, S., Pérez, J., Weber, R., Labbé, M. (2014). Feature selection for support vector machines via mixed integer linear programming. Information sciences, 279, 163-175.
• [22] Mangasarian, O.L. Arbitrary-norm separating plane. Oper. Res. Lett., 24 (1– 2):15–23 (1999).
• [23] Meyer, D., Dimitriadou, E., Hornik, K., Weingessel, A., Leisch, F.: e1071: Misc Functions of the Department of Statistics, Probability Theory Group (Formerly: E1071), TU Wien. R package version 1.6-8. https://CRAN.R-project.org/package=e1071 (2017)
• [24] Pedregosa, F., Varoquaux, G. , Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M. and Duchesnay, E., Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research 12, 2825–2830, 2011.
• [25] Radhimeenakshi, S.: Classification and prediction of heart disease risk using data mining techniques of support vector machine and artificial neural network. In: Computing for Sustainable Global Development (INDIACom), 2016 3rd International Conference on, pp. 3107–3111. IEEE (2016)
• [26] Tang, X., Xu, A.: Multi-class classification using kernel density estimation on k-nearest neighbours. Electronics Letters 52(8), 600–602 (2016)
• [27] Üney, F., Türkay, M. (2006). A mixed-integer programming approach to multi-class data classification problem. European journal of operational research, 173(3), 910-920.
• [28] Weston, J., Watkins, C.: Support vector machines for multi-class pattern recognition. In: European Symposium on Artificial Neural Networks, pp. 219–224 (1999)

## Proof of Theorem 3.1

Note that once the binary variables of our model are fixed, the problem becomes polynomial-time solvable: it reduces to finding the coefficients and intercepts of the hyperplanes and the different misclassification errors. In particular, it is clear that the MINLP formulation for the problem is equivalent to:

 \min_{h,z,t,\xi}\ \Phi(h,z,t,\xi)
 \text{s.t. } h_{ij} \in \{0,1\},\ \forall i,j \in N,
 \qquad t_{ir} \in \{0,1\},\ \forall i \in N,\ r \in M,
 \qquad z_{is} \in \{0,1\},\ \forall i \in N,\ s \in K,
 \qquad \xi_{i} \in \{0,1\},\ \forall i \in N,

where \Phi(h,z,t,\xi) is the evaluation of the margin and hinge-loss errors for the assignment provided by the binary variables. That is,

 \Phi(h,z,t,\xi) = \min_{\omega,\omega_0,e,d}\ \tfrac{1}{2}\|\omega_1\|^2 + C_1 \sum_{i=1}^{n}\sum_{r=1}^{m} e_{ir} + C_2 \sum_{i=1}^{n}\sum_{r=1}^{m} d_{ir}
 \text{s.t. } (???),\ (???)-(???),
 \qquad \omega_r \in \mathbb{R}^d,\ \omega_{r0} \in \mathbb{R},\ \forall r \in M,
 \qquad d_{ir}, e_{ir} \ge 0,\ \forall i \in N,\ r \in M,

The above problem would be separable provided that the first constraints (2) were relaxed. For the sake of simplicity in the notation, we consider the following functions, for , defined as:

 \kappa_1(t) := T(1-t), \qquad \kappa_2(t) := Tt, \qquad \kappa_3(t) := -1 + Tt,

for . Note that , , and that , .
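Evaluating these functions at t \in \{0,1\} gives (a direct computation from the definitions above):

```latex
\kappa_1(0) = T, \quad \kappa_1(1) = 0, \qquad
\kappa_2(0) = 0, \quad \kappa_2(1) = T, \qquad
\kappa_3(0) = -1, \quad \kappa_3(1) = T - 1.
```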

Based on the separability mentioned above, we introduce another instrumental family of problems for all , namely,

 \Phi_r(h,z,t,\xi,\omega_1) = \min_{\omega_r,\omega_{r0},e,d}\ C_1 \sum_{i=1}^{n} e_{ir} + C_2 \sum_{i=1}^{n} d_{ir}
 \text{s.t. } \tfrac{1}{2}\|\omega_r\|^2 - \tfrac{1}{2}\|\omega_1\|^2 \le 0,
 \qquad -\omega_r^t x_i - \omega_{r0} \le \kappa_1(t_{ir}),\ \forall i,
 \qquad \omega_r^t x_i + \omega_{r0} \le \kappa_2(t_{ir}),\ \forall i,
 \qquad -\omega_r^t x_i - \omega_{r0} - e_{ir} \le \kappa_3(u^{+}_{ijr}),\ \forall i,j,
 \qquad \omega_r^t x_i + \omega_{r0} - e_{ir} \le \kappa_3(u^{-}_{ijr}),\ \forall i,j,
 \qquad -\omega_r^t x_i - \omega_{r0} - d_{ir} \le \kappa_3(q^{+}_{ijr}),\ \forall i,j\ (y_i = y_j),
 \qquad \omega_r^t x_i + \omega_{r0} - d_{ir} \le \kappa_3(q^{-}_{ijr}),\ \forall i,j\ (y_i = y_j),
 \qquad d_{ir} \ge 0,\ e_{ir} \ge 0,\ \forall i,
 \qquad \omega_r \in \mathbb{R}^d,\ \omega_{r0} \in \mathbb{R},

where, to simplify the notation, we have introduced the auxiliary variables , , and , for and .

Observe that , apart from the first constraint, only involves variables associated with the -th hyperplane.

Moreover, we need another problem that accounts for the first part of .

 \Phi_1(h,z,t,\xi) = \min_{\omega_1,\omega_{10},e,d}\ \tfrac{1}{2}\|\omega_1\|^2 + C_1 \sum_{i=1}^{n} e_{i1} + C_2 \sum_{i=1}^{n} d_{i1}
 \text{s.t. } -\omega_1^t x_i - \omega_{10} \le \kappa_1(t_{i1}),\ \forall i,
 \qquad \omega_1^t x_i + \omega_{10} \le \kappa_2(t_{i1}),\ \forall i,
 \qquad -\omega_1^t x_i - \omega_{10} - e_{i1} \le \kappa_3(u^{+}_{ij1}),\ \forall i,j,
 \qquad \omega_1^t x_i + \omega_{10} - e_{i1} \le \kappa_3(u^{-}_{ij1}),\ \forall i,j,
 \qquad -\omega_1^t x_i - \omega_{10} - d_{i1} \le \kappa_3(q^{+}_{ij1}),\ \forall i,j\ (y_i = y_j),
 \qquad \omega_1^t x_i + \omega_{10} - d_{i1} \le \kappa_3(q