A Mathematical Programming approach to Binary Supervised Classification with Label Noise

04/21/2020 · by Víctor Blanco, et al.

In this paper we propose novel methodologies to construct Support Vector Machine-based classifiers that take into account that label noise may occur in the training sample. We propose different alternatives based on solving Mixed Integer Linear and Nonlinear programming models, incorporating decisions on relabeling some of the observations in the training dataset. The first method incorporates relabeling directly in the SVM model, while a second family of methods combines clustering and classification at the same time, giving rise to a model that simultaneously applies similarity measures and SVM. Extensive computational experiments are reported based on a battery of standard datasets taken from the UCI Machine Learning Repository, showing the effectiveness of the proposed approaches.


1. Introduction

The primary goal of supervised classification is to find patterns in a training sample of labeled data in order to predict the labels of out-of-sample data, in the case where the number of possible labels is finite. Among the most relevant applications of classification methods are those related to security, as in spam filtering or intrusion detection. The main difference between these applications and other uses of classification approaches is that malicious adversaries can adaptively manipulate their data to mislead the outcome of an automatic analysis. For instance, spammers often modify their emails by obfuscating words which typically appear in known spam or by adding words which are likely to appear in legitimate emails. Thus, one has not only to derive a classification rule from a training sample that is able to adequately classify out-of-sample data, but also to take into account that some of the labels might be incorrect. Analyzing the vulnerabilities of classifiers and their robustness against attacks, to better understand how their security may be improved, has recently received growing interest from the scientific community. In [5] the authors propose robust alternatives for the case in which the features of the training sample observations are corrupted. On the other hand, [6] provides an algorithmic approach to handle adversarial modifications of the labels, in the case where labels are independently flipped with the same probability, by correcting the kernel matrix.

As just discussed, doubting the reliability of the labels of the target variable is usual when one suspects an intentional flip of some of these labels. However, that is by far not the only case in which one must consider this possibility. Nowadays, it is commonly said that a data scientist spends around 80% of their time collecting and preprocessing data, while the other 20% is used to model and extract information from databases. Thus, mistakes that result in wrong label assignments are very likely to happen. For instance, data can be wrongly identified at the very beginning of the data collection phase, or code errors can occur when preprocessing a database, leading to a dataset with label noise.

In this paper, we propose a methodology to construct a classification rule by means of an ad hoc adaptation of a Support Vector Machine classifier that incorporates the detection and correction of label noise in the dataset. The Support Vector Machine (SVM) is a widely used methodology in supervised binary classification, first proposed by Cortes and Vapnik [12]. Given a number of observations with their corresponding labels, the SVM technique consists, in its simplest form, of finding a hyperplane in the feature space such that each class belongs to a different half-space, maximizing the separation between the classes (in a training sample) and minimizing some measure of the misclassification errors. This problem can be cast within the class of convex optimization problems, and its dual has very good properties that allow one to extend the methodology to construct nonlinear classifiers as well. Most of the SVM literature concentrates on binary classification, where several extensions are available: one can use different measures for the separation between classes [9, 22, 23], select important features [28], apply regularization strategies [27, 38], use twin separators [37], etc.

One of the main reasons for the success of SVM tools in classification may be that one can project the original data onto a higher-dimensional space where the separation of the classes can be performed more adequately, and still with the same computational effort that was required in the original problem. This property is the so-called kernel trick, and it is very likely one of the reasons behind the successful use of this tool in a wide range of applications [2, 19, 24, 29, 39].

According to [35], three main groups of approaches for dealing with noisy datasets have already been proposed in the literature: (1) design of algorithms that filter noisy and/or mislabeled vectors from the input data [17, 18]; (2) construction of classifiers that are robust against noisy labeling [14]; and (3) use of noise models (typically, the noise model is estimated in parallel with the classifier, and both are finally coupled for a higher-quality classification) [40, 41].

Our proposal falls within the third group of the above approaches. We provide a method to simultaneously construct the SVM-based classifier and relabel observations, which allows us to obtain separating hyperplanes that would have been impossible to obtain through standard SVM and that can report much better results for many different problems.

The construction of SVM-based classifiers that simultaneously relabel observations has many advantages when dealing with label-noise datasets, but also when working on problems in which false positives and false negatives have different misclassification costs, and in problems with unbalanced classes (as, for instance, in datasets on credit card fraud, in which the vast majority of the observations are non-fraudulent transactions [15, 31], or in the number of claims in non-life insurance [11]). In Figure 1 we illustrate this situation. One can observe in the left picture the projection on the plane of a set of observations labeled as fraudulent (red) and non-fraudulent (green) transactions. Linear separators seem impossible to construct for this instance, and nonlinear classifiers would result in overfitting. However, as shown in the right picture, if one allows a few of the labels to be changed, one can obtain better classifiers. Note that in this case false positives are more costly than false negatives (since asking for a little more information via a text message normally resolves those cases). It is also important to remark that this separating hyperplane could not have been obtained through standard SVM, since all the support vectors belong to the same class (green points).

 

Figure 1. Original data (left) and optimal hyperplane separating re-labeled classes with our method (right).

In this paper we propose two different approaches. First, we present a model in which the relabeling of observations depends on the errors of the SVM-based method itself, searching for a compromise between the gain obtained in misclassification error and margin and the penalty paid for each change of labels. On the other hand, we also introduce two models in which the relabeled observations come from similarity measures on the data. To assess the validity of these methods we have performed a battery of computational experiments on 6 different real datasets. For these datasets we have repeated the experiments for 5 different scenarios, by randomly flipping 0%, 20%, 30%, 40% or 50% of the labels in the original data. When comparing our methods with classical SVM, we observe that we obtain better results on label-noise datasets.

The rest of the paper is organized as follows. In section 2 we set up and describe the elements of the problem under consideration. Afterwards, in section 3 we introduce the different formulations of our models, and in section 4 we present our computational experiments. Finally, we close the article in section 5 with some conclusions and an outline of our future work.

2. Preliminaries

In this section we introduce the problem under study and set the notation used throughout the paper.
Given a training sample $\{(x_1, y_1), \ldots, (x_n, y_n)\} \subseteq \mathbb{R}^d \times \{-1, +1\}$, the goal of linear SVM is to obtain a hyperplane separating the data into their two classes, $\{-1, +1\}$. Among all possible hyperplanes that achieve such a separation between the classes, SVM looks for the one with maximum margin (maximum distance from the classes to the separating hyperplane) while minimizing the misclassification errors. Let us denote by $\mathcal{H} = \{z \in \mathbb{R}^d : \omega^t z + b = 0\}$ a hyperplane in $\mathbb{R}^d$, for some $\omega \in \mathbb{R}^d$ and $b \in \mathbb{R}$ (the vector $\omega^t$ is the result of the transpose operator applied to the vector $\omega$). This hyperplane induces a subdivision of the data space into three regions: the (positive) half-space $\{z : \omega^t z + b \geq 1\}$, the (negative) half-space $\{z : \omega^t z + b \leq -1\}$ and the strip $\{z : -1 < \omega^t z + b < 1\}$. In the SVM model, positive-class observations ($y_i = +1$) are forced to lie on the positive half-space, and the analogous constraint is imposed for the negative-class observations ($y_i = -1$) on the negative half-space. When these constraints are violated for an observation, a penalization error is accounted for in the optimization problem. The separation (margin) between the classes is computed as the width of the strip, $2/\|\omega\|_2$. As mentioned before, the SVM separating hyperplane is obtained from an equilibrium between maximizing the separation between the classes and minimizing these penalization errors. Denoting by $\xi_i$ the misclassification error of observation $i$, and by $C$ the constant of penalization of these errors, the SVM can be formulated as the following Nonlinear Problem (NLP):

(SVM)   $\min_{\omega, b, \xi} \ \ \frac{1}{2}\|\omega\|_2^2 + C \sum_{i=1}^{n} \xi_i$

s.t. $y_i(\omega^t x_i + b) \geq 1 - \xi_i$, for $i = 1, \ldots, n$,
     $\xi_i \geq 0$, for $i = 1, \ldots, n$,
     $\omega \in \mathbb{R}^d$, $b \in \mathbb{R}$.
In Figure 2 we can see a set of points belonging to two different classes, blue and green (left picture), and its optimal SVM solution for a given penalty parameter $C$ (right picture). The black line is the separating hyperplane, while the other two parallel lines delimit the strip between the classes. The points that lie on these parallel lines, the boundary of the strip, are the so-called support vectors, and they verify that $y_i(\omega^t x_i + b) = 1$. Finally, we represent in red the magnitude of the errors induced by margin violations.

 

Figure 2. Original set of points (left) and optimal SVM solution on these points (right).
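For illustration purposes, a minimal sketch of the soft-margin SVM formulation above, written as a quadratic program with gurobipy, could look as follows (the data matrix X, an n × d numpy array, the label vector y with entries in {-1, +1}, and the helper name svm_train are assumptions made for this example):

import numpy as np
import gurobipy as gp
from gurobipy import GRB

def svm_train(X, y, C=1.0):
    """Soft-margin linear SVM: min 1/2 ||w||^2 + C * sum(xi)."""
    n, d = X.shape
    m = gp.Model("SVM")
    w = m.addVars(d, lb=-GRB.INFINITY, name="w")
    b = m.addVar(lb=-GRB.INFINITY, name="b")
    xi = m.addVars(n, lb=0.0, name="xi")  # misclassification errors
    # separability constraints: y_i (w' x_i + b) >= 1 - xi_i
    for i in range(n):
        m.addConstr(y[i] * (gp.quicksum(w[j] * X[i, j] for j in range(d)) + b) >= 1 - xi[i])
    m.setObjective(0.5 * gp.quicksum(w[j] * w[j] for j in range(d)) + C * xi.sum(), GRB.MINIMIZE)
    m.optimize()
    return np.array([w[j].X for j in range(d)]), b.X

A new observation $x$ is then classified according to the sign of $\omega^t x + b$.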

If we further analyze the above dataset, we can see that there are four blue observations at the very right of the dataset, and two green observations on the left, that have a strong impact when building the classifier. These observations do not allow one to construct an SVM separator of the dataset like the one shown in Figure 3, since that would lead to very big misclassification errors with a very tiny margin.

Figure 3. A non-optimal solution of the SVM problem.

Moreover, there are another two green observations, besides the two on the left, that are closer to the blue cloud of points than to the green one. Hence, if we could consider that these four green points and the four blue ones on the right were wrongly labeled (because of their closeness to the rest of the points), we might consider a separating hyperplane with a slope like the one presented on the left of Figure 4 as a better classifier. However, this separating hyperplane would be impossible to obtain with the SVM model, since all the support vectors would belong to the same class and, in order to avoid huge misclassification errors, the model would forbid such a slope.

Motivated by the above kind of configurations, we have studied different models in which a separating hyperplane is obtained based not only on the original labels but also on the possibility of relabeling some of the original observations of the training sample at a given penalty cost. We say that an observation $i$ is relabeled if one of the following two situations occurs:

  • $y_i = +1$, but our model considers that its class is $-1$,

  • $y_i = -1$, but our model considers that its class is $+1$.

We will use the notation $\hat{y}_i$ to represent the class that the model considers for observation $i$. Hence, an observation is said to be relabeled if $\hat{y}_i \neq y_i$.

Following the example shown in Figures 2 and 3, we can see on the right of Figure 4 the solution of our model, with a separating hyperplane with the desired slope. Considering the original classes (blue and green), purple points represent the points that the model considers to be blue (despite their actual label), and orange points represent the points that the model considers to be green. This separating hyperplane is optimal in our problem: the model considers that the support points belong to different classes (even though that is not true with respect to the original labels) and no misclassification errors appear in the solution (which is also not true for the original labels).

Figure 4. Optimal solution after re-labeling.

The underlying idea in these models is that, based on the geometry of the problem, relabeling some observations can lead to more robust/accurate classifiers. These classifiers can be very useful when dealing with datasets with outliers, and also with datasets in which some noise is known to have been added to the labels.

3. Mathematical Programming models

In this section we present the three mathematical optimization models that we propose to solve the problem consisting of building a hyperplane for binary classification and, simultaneously, relabeling potentially noisy observations. In the first model, the relabeling of the original observations will be based on the errors with respect to the separating hyperplane. On the other hand, besides considering the errors with respect to the separating hyperplane, the other two models will also take into account information from the data based on the geometry of the points, through the k-means and the k-medians methods. Nevertheless, despite the fact that some observations are relabeled in our models, in order to make predictions we will maintain the usual decision rule for out-of-sample data: observations that lie on the positive half-space of the separating hyperplane will be predicted as positive-class observations, while observations that lie on the negative half-space will be predicted as negative-class observations.

3.1. Model 1: Re-label SVM (RE-SVM)


The first model that we propose relies on a very basic idea: observations will be relabeled based on their error with respect to the separating hyperplane, i.e., a penalty for each relabeling will be considered and the model will determine whether this cost compensates for the reduction of the global misclassification error. Let $\hat{y}_i$ be the final label of observation $i$ (after relabeling), for $i = 1, \ldots, n$. Hence, using the notation introduced before, the model can be synthetically summarized in the following way:

(RE-SVM)   $\min_{\omega, b, \xi, \hat{y}} \ \ \frac{1}{2}\|\omega\|_2^2 + c_1 \sum_{i=1}^n \xi_i + c_2\,\#\{i : \hat{y}_i \neq y_i\}$

s.t. $\hat{y}_i(\omega^t x_i + b) \geq 1 - \xi_i$, $\xi_i \geq 0$, $\hat{y}_i \in \{-1, +1\}$, for $i = 1, \ldots, n$.

The model above is an SVM model in which observations can be relabeled, and thus, instead of considering the original labels $y_i$ in the separability constraints, the relabeled classes $\hat{y}_i$ are used. In what follows we describe how to incorporate the relabeling into the constraints and the objective function. Observe that if no cost is assumed for relabeling, the model will relabel most of the observations in order to obtain a null misclassification error, resulting in senseless classifiers. Thus, we model this cost with a penalty, so that the model will try to maintain the original labels of the data and will only relabel observations when a strong gain in the margin or a strong reduction of the errors is obtained.

In order to derive a suitable mathematical programming formulation for the problem, we consider the following set of binary variables to model relabelings:

$\gamma_i = 1$ if observation $i$ is relabeled ($\hat{y}_i \neq y_i$), and $\gamma_i = 0$ otherwise, for $i = 1, \ldots, n$.

With these variables, the relabeling cost reads $c_2 \sum_{i=1}^n \gamma_i$, where $c_2$ is the unitary cost of relabeling. Also, to construct the classifier, we consider the following auxiliary set of continuous variables:

$\bar{\omega}_i \in \mathbb{R}^d$, for $i = 1, \ldots, n$, intended to coincide with $\gamma_i\,\omega$,

and by $\bar{b}_i \in \mathbb{R}$ the analogous variables for the intercept, intended to coincide with $\gamma_i\, b$, for $i = 1, \ldots, n$.

Observe that, with the above notation, $\hat{y}_i = y_i(1 - 2\gamma_i)$ and hence $\hat{y}_i(\omega^t x_i + b) = y_i\big((\omega - 2\bar{\omega}_i)^t x_i + b - 2\bar{b}_i\big)$.

Based on the discussion above, our problem can be formulated as follows:

(RE-SVM)   $\min \ \ \frac{1}{2}\|\omega\|_2^2 + c_1 \sum_{i=1}^n \xi_i + c_2 \sum_{i=1}^n \gamma_i$

s.t. (1) $y_i\big((\omega - 2\bar{\omega}_i)^t x_i + b - 2\bar{b}_i\big) \geq 1 - \xi_i$, for $i = 1, \ldots, n$,
     (2) $\bar{\omega}_i = \gamma_i\,\omega$, $\bar{b}_i = \gamma_i\, b$, for $i = 1, \ldots, n$,
     (3) $\omega \in \mathbb{R}^d$, $b \in \mathbb{R}$,
     (4) $\bar{\omega}_i \in \mathbb{R}^d$, $\bar{b}_i \in \mathbb{R}$, for $i = 1, \ldots, n$,
     (5) $\xi_i \geq 0$, $\gamma_i \in \{0, 1\}$, for $i = 1, \ldots, n$.

In the formulation above, constraints (1) and (2) allow us to model the relabeled observations, whereas (3) declares that the coefficients of the hyperplane are continuous variables. The variables declared in (4) will be equal to the coefficients of the hyperplane when an observation is relabeled, and zero otherwise, by virtue of (2). With these new coefficients, if an observation is not relabeled, constraints (1) coincide with those of the classical SVM, which, together with the objective function and (5), allows one to model the misclassification errors as $\xi_i = \max\{0, 1 - \hat{y}_i(\omega^t x_i + b)\}$.

Note that (RE-SVM) is a Mixed Integer Nonlinear Problem only because of its objective function: even though constraints (2) are written in a nonlinear way, they can be linearized as follows:

$-M\gamma_i \leq \bar{\omega}_i \leq M\gamma_i$,  $\omega - M(1 - \gamma_i)\mathbf{1} \leq \bar{\omega}_i \leq \omega + M(1 - \gamma_i)\mathbf{1}$,
$-M\gamma_i \leq \bar{b}_i \leq M\gamma_i$,  $b - M(1 - \gamma_i) \leq \bar{b}_i \leq b + M(1 - \gamma_i)$,  for $i = 1, \ldots, n$,

for a big enough constant $M > 0$.
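Under the formulation above, a tentative gurobipy sketch of (RE-SVM), including the big-M linearization of constraints (2), could be the following (the cost parameters c1 and c2 and the constant M are assumptions to be calibrated, and the bounds on the hyperplane coefficients are imposed so that the big-M constraints remain valid):

import gurobipy as gp
from gurobipy import GRB

def re_svm_train(X, y, c1=1.0, c2=1.0, M=1e3):
    """RE-SVM sketch: SVM with binary relabeling decisions and big-M linearization."""
    n, d = X.shape
    m = gp.Model("RE-SVM")
    w = m.addVars(d, lb=-M, ub=M, name="w")
    b = m.addVar(lb=-M, ub=M, name="b")
    xi = m.addVars(n, lb=0.0, name="xi")                  # misclassification errors
    gam = m.addVars(n, vtype=GRB.BINARY, name="gamma")    # 1 if observation i is relabeled
    wbar = m.addVars(n, d, lb=-M, ub=M, name="wbar")      # intended to equal gamma_i * w
    bbar = m.addVars(n, lb=-M, ub=M, name="bbar")         # intended to equal gamma_i * b
    for i in range(n):
        # constraint (1): separability with the (possibly flipped) label y_i (1 - 2 gamma_i)
        m.addConstr(y[i] * (gp.quicksum((w[j] - 2 * wbar[i, j]) * X[i, j] for j in range(d))
                            + b - 2 * bbar[i]) >= 1 - xi[i])
        # big-M linearization of constraints (2)
        for j in range(d):
            m.addConstr(wbar[i, j] <= M * gam[i])
            m.addConstr(wbar[i, j] >= -M * gam[i])
            m.addConstr(wbar[i, j] <= w[j] + M * (1 - gam[i]))
            m.addConstr(wbar[i, j] >= w[j] - M * (1 - gam[i]))
        m.addConstr(bbar[i] <= M * gam[i])
        m.addConstr(bbar[i] >= -M * gam[i])
        m.addConstr(bbar[i] <= b + M * (1 - gam[i]))
        m.addConstr(bbar[i] >= b - M * (1 - gam[i]))
    m.setObjective(0.5 * gp.quicksum(w[j] * w[j] for j in range(d))
                   + c1 * xi.sum() + c2 * gam.sum(), GRB.MINIMIZE)
    m.optimize()
    return m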

Remark 3.1.

In the same manner that we formulated the problem above using a hinge-loss point of view for the misclassification errors, it can easily be adapted to other loss functions, such as the ramp loss [21]. This latter case results in the following mathematical programming model:

(RE-RL-SVM)   $\min \ \ \frac{1}{2}\|\omega\|_2^2 + c_1 \sum_{i=1}^n (\xi_i + 2 u_i) + c_2 \sum_{i=1}^n \gamma_i$

s.t. $y_i\big((\omega - 2\bar{\omega}_i)^t x_i + b - 2\bar{b}_i\big) \geq 1 - \xi_i - M u_i$, for $i = 1, \ldots, n$,
     constraints (2)-(4) of (RE-SVM),
     $0 \leq \xi_i \leq 2$, $u_i \in \{0, 1\}$, $\gamma_i \in \{0, 1\}$, for $i = 1, \ldots, n$.

Here, the observations that lie outside the margin on the wrong side of the separating hyperplane are equally penalized in the objective function, regardless of their misclassification distance.

3.2. Cluster-SVM models

The second family of models that we propose for detecting label noise in the data is based on using similarity measures on the observations. These models will be called Cluster-SVM methods, since they perform two tasks simultaneously: clustering and classification by SVM. On the one hand, the clustering phase of these methods induces relabelings based on the heterogeneity of the information, whereas the SVM phase computes the classifier after relabeling. We present here two different alternatives for clustering the data into two groups and linking them to a classification system: the 2-median and the 2-mean problems.

The goal of these methods is to find two clusters for a given set of observations, considering that each observation will belong to exactly one cluster. These clusters are built by finding two distinguished points (centroids or medians), each representing the group of observations closest to it, in such a way that the overall sum of distances from the points to their respective distinguished points is minimum. We distinguish two models under these settings by using two different distance measures: the $\ell_1$ and the $\ell_2$ norms.

Let us denote by $c^1, c^2 \in \mathbb{R}^d$ the two (unknown) distinguished points, and by $d_i$ the distance from observation $x_i$ to its closest distinguished point, for $i = 1, \ldots, n$ (here the distance is measured with either the $\ell_1$ or the $\ell_2$ norm). The representation of such a closest distance to the distinguished points is incorporated into the mathematical programming model using the following set of binary variables: $z_{ik} = 1$ if observation $i$ is assigned to the distinguished point $c^k$, and $z_{ik} = 0$ otherwise, for $i = 1, \ldots, n$ and $k = 1, 2$.

These clusters represent similar observations and will help the SVM methodology, together with the relabeling, to find more accurate classifiers.

Combining the ideas presented for RE-SVM with the clustering-based methods, we can derive a new family of models that assign observations to two groups based on the clusters obtained by minimizing the overall sum of the norm-based distances from the data points to their corresponding reference points. Moreover, the model also tries to separate these two clusters as much as possible by means of a hyperplane. Each one of the clusters is assigned to one of the two classes of our classification problem. Finally, this hyperplane will induce a subdivision of the data space in such a way that the decision rule for out-of-sample data is the same as the one used in standard SVM. We present below the MIP formulation of this problem. Let $M$ be a big enough positive constant and $\|\cdot\|$ represent either the $\ell_1$ or the $\ell_2$ norm.

(CL-SVM)   $\min \ \ \|\omega\| + c_1 \sum_{i=1}^n \xi_i + c_2 \sum_{i=1}^n r_i + c_3 \sum_{i=1}^n d_i$

s.t. (6)  $y_i(\omega^t x_i + b) \geq 1 - \xi_i - 2M r_i$, for $i = 1, \ldots, n$,
     (7)  $d_i \geq \|x_i - c^1\| - M(1 - z_{i1})$, for $i = 1, \ldots, n$,
     (8)  $d_i \geq \|x_i - c^2\| - M(1 - z_{i2})$, for $i = 1, \ldots, n$,
     (9)  $\omega^t x_i + b \geq 1 - \xi_i - M(1 - z_{i1})$, for $i = 1, \ldots, n$,
     (10) $-(\omega^t x_i + b) \geq 1 - \xi_i - M(1 - z_{i2})$, for $i = 1, \ldots, n$,
     (11) $z_{i1} + z_{i2} = 1$, for $i = 1, \ldots, n$,
     (12) $r_i \geq z_{i2}$ if $y_i = +1$ and $r_i \geq z_{i1}$ if $y_i = -1$, for $i = 1, \ldots, n$,
     (13) $\omega \in \mathbb{R}^d$, $b \in \mathbb{R}$, $c^1, c^2 \in \mathbb{R}^d$, $\xi_i, d_i \geq 0$, for $i = 1, \ldots, n$,
     (14) $z_{i1}, z_{i2}, r_i \in \{0, 1\}$, for $i = 1, \ldots, n$,

where the binary variable $r_i$ takes value 1 if observation $i$ is relabeled (i.e., it is assigned to the cluster of the class opposite to $y_i$) and 0 otherwise.

The objective function of (CL-SVM) aggregates the following four elements to be simultaneously optimized:

  • The margin (measured with the $\ell_1$ or $\ell_2$ norm) has to be maximized.

  • The errors of classification with respect to the separating hyperplane have to be minimized.

  • Relabeled observations have to be penalized.

  • Distances from observations to their reference points have to be minimized.

The aggregation of these four terms leads to a hyperplane with a good margin, separating two homogeneous clusters with respect to distances and classes. Constraint (6) enforces the positive (resp. negative) class observations to be located on the positive (resp. negative) half-space of the separating hyperplane, unless they are relabeled. Each relabeled observation is penalized by $c_2$ units, not allowing a large number of relabelings unless they compensate for large misclassification errors or lead to a gain in the margin. This methodology allows us to keep the same decision rule for out-of-sample data as the one used in standard SVM. Constraints (7) and (8) permit determining the closest centroid to each observation, whereas constraints (9) and (10) enforce the misclassification errors to be computed with respect to the cluster, i.e., the classification is performed with respect to the classes that have been created based on the similarity of the observations. Finally, constraints (11) and (12) assign each observation to exactly one cluster and identify the relabeled observations, and (13) and (14) declare the domains of the variables.

The above model results in two different problems depending on the norm-based distances applied.

2-Median SVM Model:

This model results from (CL-SVM) using the $\ell_1$ norm. It will be referred to as the 2-median SVM model. The problem turns out to be a mixed integer linear problem and can be solved using any off-the-shelf MIP solver.
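As an illustration, a possible gurobipy sketch of this MILP is given below; it follows the reconstruction of (CL-SVM) above, measures the margin with the $\ell_1$ norm of $\omega$ so that the model remains linear, and treats the cost parameters c1, c2, c3 and the big-M constant as assumptions to be calibrated:

import gurobipy as gp
from gurobipy import GRB

def two_median_svm(X, y, c1=1.0, c2=1.0, c3=1.0, M=1e3):
    """2-median-SVM sketch: l1 clustering combined with SVM and cluster-driven relabeling."""
    n, d = X.shape
    m = gp.Model("2median-SVM")
    w = m.addVars(d, lb=-M, ub=M, name="w")
    b = m.addVar(lb=-M, ub=M, name="b")
    u = m.addVars(d, lb=0.0, name="u")                # u_j >= |w_j| (l1 margin term)
    xi = m.addVars(n, lb=0.0, name="xi")              # misclassification errors
    z = m.addVars(n, 2, vtype=GRB.BINARY, name="z")   # cluster assignments
    r = m.addVars(n, vtype=GRB.BINARY, name="r")      # 1 if observation i is relabeled
    c = m.addVars(2, d, lb=-M, ub=M, name="c")        # the two median (distinguished) points
    dist = m.addVars(n, lb=0.0, name="dist")          # l1 distance to the assigned median
    t = m.addVars(n, 2, d, lb=0.0, name="t")          # t[i,k,j] >= |x_ij - c_kj|
    for j in range(d):
        m.addConstr(u[j] >= w[j])
        m.addConstr(u[j] >= -w[j])
    for i in range(n):
        m.addConstr(z[i, 0] + z[i, 1] == 1)           # exactly one cluster per observation
        score = gp.quicksum(w[j] * X[i, j] for j in range(d)) + b
        # original-label half-space constraint, relaxed when relabeled (constraint (6))
        m.addConstr(y[i] * score >= 1 - xi[i] - 2 * M * r[i])
        # half-space constraints w.r.t. the assigned cluster (cluster 0 = positive class)
        m.addConstr(score >= 1 - xi[i] - M * (1 - z[i, 0]))
        m.addConstr(-score >= 1 - xi[i] - M * (1 - z[i, 1]))
        # an observation is relabeled when its cluster class differs from its original label
        m.addConstr(r[i] >= (z[i, 1] if y[i] == 1 else z[i, 0]))
        for k in range(2):
            for j in range(d):
                m.addConstr(t[i, k, j] >= X[i, j] - c[k, j])
                m.addConstr(t[i, k, j] >= c[k, j] - X[i, j])
            m.addConstr(dist[i] >= gp.quicksum(t[i, k, j] for j in range(d)) - M * (1 - z[i, k]))
    m.setObjective(u.sum() + c1 * xi.sum() + c2 * r.sum() + c3 * dist.sum(), GRB.MINIMIZE)
    m.optimize()
    return m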

2-Mean SVM Model:

This is the version of model (CL-SVM) using the $\ell_2$ norm. Since we are using a nonlinear norm, the 2-mean SVM results in a Mixed Integer Nonlinear Programming problem, which can be reformulated as a Mixed Integer Second Order Cone Optimization (MISOCO) problem. As in the MIP case, off-the-shelf commercial optimization solvers nowadays implement routines for its efficient solution.
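For instance, with the big-M form of constraints (7) and (8), one possible route for this reformulation is to introduce auxiliary variables $t_{ik} \geq 0$ and rewrite those constraints as

$t_{ik} \geq \|x_i - c^k\|_2$ (a second-order cone constraint),   $d_i \geq t_{ik} - M(1 - z_{ik})$,   for $i = 1, \ldots, n$, $k = 1, 2$,

so that, apart from the binary variables, the model only involves linear and second-order cone constraints.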

Remark 3.2 (2-$\ell_p$ Cluster SVM Model).

One could also consider other $\ell_p$-norms ($p \geq 1$) for both the margin measure and the cluster similarity measures. In this case, the problem also becomes a MINLP problem but, based on the results provided in [7], it can likewise be efficiently reformulated as a MISOCO problem.

4. Experiments

In this section we report the results of our computational experience. We have studied six real datasets from the UCI Machine Learning Repository (see [26]); all of them are binary classification problems coming from different topics. The datasets used are: Statlog - Australian Credit Approval (Australian), Breast Cancer (BreastCancer), Statlog - Heart (Heart), Parkinson Dataset with replicated acoustic features (Parkinson), Vertebral Column (Vertebral) and Wholesale Customers (Wholesale). Summarized information about these datasets is detailed in Table 1. For each dataset we report in this table its size ($n$) and the dimension of the problem ($d$).

Dataset        n     d
Australian     690   14
BreastCancer   683    9
Heart          270   13
Parkinson      240   40
Vertebral      310    6
Wholesale      440    7
Table 1. Datasets used in our computational experiments.

For each of these datasets we have performed five different experiments. The goal in these experiments is to make predictions as accurate as possible on out-of-sample data. The first experiment consists of making predictions by training the models with the original data. On the other hand, in order to represent attacks on the training data, we have considered four additional scenarios in which a random amount of labels, within the set {20%, 30%, 40%, 50%}, have been flipped in the training data, i.e., four scenarios in which we have added some label noise to the training data.
We have performed a 5-fold cross-validation scheme. Thus, the data have been split into 5 random train-test partitions. In each of these folds we have trained our models and we have used the other four folds for testing. Moreover, we have repeated this 5-fold cross-validation 5 times for each dataset, in order to avoid beneficial starting partitions, and we report the average results obtained. For all the instances we have trained our three models and we have compared them with our benchmark, which is the standard SVM. The measure used to evaluate the performance of the models has been the accuracy, in percentage, on out-of-sample data:

Acc = (number of well-classified test observations / number of test observations) × 100.
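A rough sketch of this evaluation protocol (random label flipping at a given rate plus the fold scheme described above) is shown below; the helpers train_fn and predict_fn stand for any of the trained models and are placeholders for this example:

import numpy as np
from sklearn.model_selection import KFold

def flip_labels(y, rate, rng):
    """Randomly flip a given fraction of the labels in {-1, +1}."""
    y_noisy = y.copy()
    idx = rng.choice(len(y), size=int(rate * len(y)), replace=False)
    y_noisy[idx] *= -1
    return y_noisy

def evaluate(train_fn, predict_fn, X, y, rate, seed=0):
    """Average out-of-sample accuracy: each fold is used for training, the other four for testing."""
    rng = np.random.default_rng(seed)
    accs = []
    for rest_idx, fold_idx in KFold(n_splits=5, shuffle=True, random_state=seed).split(X):
        y_train = flip_labels(y[fold_idx], rate, rng)   # label noise only on training data
        model = train_fn(X[fold_idx], y_train)
        y_pred = predict_fn(model, X[rest_idx])
        accs.append(100.0 * np.mean(y_pred == y[rest_idx]))
    return np.mean(accs)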
In each of the instances we have used a grid on the cost parameters, and the best result obtained on the test data among these parameters is the one reported. The grids used in the experiments are the following:

  • SVM: a grid for the cost parameter $C$.

  • RE-SVM: grids for the parameters $c_1$ and $c_2$.

  • 2-medians-SVM: grids for the parameters $c_1$, $c_2$ and $c_3$.

  • 2-means-SVM: grids for the parameters $c_1$, $c_2$ and $c_3$.

The mathematical programming models were coded in Python 3.6 and solved using Gurobi 7.5.2 on a PC with an Intel Core i7-7700 processor at 2.81 GHz and 16GB of RAM. We have not been able to solve all the instances to optimality, especially those of the 2-means-SVM, in which the problem becomes nonlinear; hence we have established a time limit of 30 seconds for all the experiments. Moreover, in order to help the solver on the 2-means-SVM, we have warm-started it with an initial feasible solution obtained from the 2-medians-SVM problem.
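In gurobipy terms, the time limit and the warm start can be set roughly as follows (a sketch; model stands for the 2-means-SVM Gurobi model and start_values for a hypothetical dictionary mapping variable names to the values of the corresponding 2-medians-SVM solution):

model.Params.TimeLimit = 30                 # 30-second time limit used for all experiments
for v in model.getVars():
    if v.VarName in start_values:
        v.Start = start_values[v.VarName]   # warm start from the 2-medians-SVM solution
model.optimize()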
In Table 2 we report the average accuracy results obtained in all the experiments for the different models and the different levels of label noise. In that table, colors were used to highlight the results that improve upon the benchmark by increasingly large amounts (yellow-green, green and cyan, respectively). Regarding the results, we can conclude that our three models perform better than SVM when an attack on the training data is produced. Besides, the stronger the attack, the bigger the difference between our models' results and SVM's results. We can also point out that 2-medians-SVM and 2-means-SVM perform better than RE-SVM for heavy attacks (high percentages of flipped observations); however, these models require more time to be trained since they have one more hyperparameter to calibrate. To illustrate this, we show in Figure 5 the accuracy boxplots of the 500 instances per dataset (5 partitions × 5 scenarios × 5 folds × 4 models), in which we see how the SVM model has lower tails and wider boxes than RE-SVM, and RE-SVM has wider boxes than 2-medians-SVM and 2-means-SVM, which is explained by the behavior of these models against the attacks.

Dataset Method 0% 20% 30% 40% 50%
Australian SVM 86.11 85.43 79.23 68.13 59.47
RE-SVM 86.42 85.68 83.37 76.97 66.13
2-medians-SVM 86.08 85.84 84.67 78.95 69.54
2-means-SVM 85.97 85.74 82.65 77.14 67.70
BreastCancer SVM 96.49 93.47 89.96 85.94 68.16
RE-SVM 96.88 96.20 94.97 90.36 77.00
2-medians-SVM 96.63 95.31 94.46 91.10 87.31
2-means-SVM 96.96 95.93 95.39 93.11 90.01
Heart SVM 82.23 76.86 69.68 63.79 56.90
RE-SVM 82.84 78.38 73.16 68.86 61.25
2-medians-SVM 82.01 78.75 77.29 75.38 71.99
2-means-SVM 82.06 78.81 77.40 75.97 72.90
Parkinson SVM 81.66 74.74 70.17 62.28 57.82
RE-SVM 82.43 77.64 73.22 67.29 62.97
2-medians-SVM 80.32 78.62 78.12 77.51 76.28
2-means-SVM 80.47 79.22 78.78 78.20 77.03
Vertebral SVM 84.51 75.43 71.34 66.78 57.47
RE-SVM 85.10 79.61 74.83 72.33 67.92
2-medians-SVM 85.31 82.62 80.80 78.30 76.31
2-means-SVM 86.28 84.32 81.77 79.91 76.76
Wholesale SVM 90.08 85.30 79.74 72.23 57.73
RE-SVM 90.39 88.77 85.97 80.12 69.07
2-medians-SVM 90.58 89.54 87.79 82.78 73.54
2-means-SVM 91.23 89.56 87.39 85.88 82.92
Table 2. Accuracy results of our computational experiments.
Figure 5. Boxplots of the obtained accuracies.

5. Conclusions

This paper presents a methodology to construct a classification rule that, at the same time, incorporates the detection of label noise in the dataset. Our methodology combines the power of SVM and the features of cluster analysis to simultaneously identify wrongly labeled observations and build a separating hyperplane that maximizes the margin, minimizes the misclassification errors and penalizes relabeling. The rationale is simple: observations identified as wrongly labeled will be relabeled only if the gain in margin or the decrease in misclassification error compensates for the flipping. In spite of its conceptual simplicity, we show the exceptional performance of our methodology on a number of databases taken from the UCI repository.

These models are implemented using mathematical programming formulations with some integer variables (MIP). In all cases, they give rise to models that are simple and that enjoy the quality of being solvable by today's off-the-shelf commercial solvers (Gurobi, CPLEX, XPRESS, …).

Our findings are not only of theoretical interest; their practical performance when applied to real databases is remarkable. In all tested cases, our methods are superior to the considered benchmark, which in our case is the standard SVM. Thus, they are directly applicable to datasets in which flipped labels are suspected, resulting in classifiers that are robust to noisy labels.

Further research on the topic includes, among others, the application of alternative clustering strategies, such as those based on ordered median objective functions, and the extension of the proposed models to the multiclass SVM framework or to the twin SVM methodology. Also, the use of kernel tools in our approaches, in order to be able to construct nonlinear classifiers, has to be investigated.

Acknowledgements

This research has been partially supported by Spanish Ministry of Education and Science/FEDER grant number MTM2016-74983-C02-(01-02), and projects FEDER-US-1256951, CEI-3-FQM331 and NetmeetData: Ayudas Fundación BBVA a equipos de investigación científica 2019.

References

  • [1] Agarwal, N., Balasubramanian, V.N., Jawahar, C.: Improving multiclass classification by deep networks using dagsvm and triplet loss. Pattern Recognition Letters (2018). DOI 10.1016/j.patrec.2018.06.034. URL http://dx.doi.org/10.1016/j.patrec.2018.06.034
  • [2] Bahlmann, C., Haasdonk, B., Burkhardt, H.: On-line handwriting recognition with support vector machines – a kernel approach. In: Proceedings of the Eighth International Workshop on Frontiers in Handwriting Recognition (IWFHR'02), pp. 49–. IEEE Computer Society, Washington, DC, USA (2002). URL http://dl.acm.org/citation.cfm?id=851040.856840
  • [3] Benders, J.F.: Partitioning procedures for solving mixed-variables programming problems. Numerische mathematik 4(1), 238–252 (1962)
  • [4] Bennett, K. P., Demiriz, A. (1999). Semi-supervised support vector machines. Advances in Neural Information processing systems 11, 368–374.
  • [5] Bi, J., Zhang, T. (2005). Support vector classification with input data uncertainty. In Advances in neural information processing systems (pp. 161-168).
  • [6] Biggio, B., Nelson, B., Laskov, P. (2011, November). Support vector machines under adversarial label noise. In Asian Conference on Machine Learning (pp. 97-112).
  • [7] Blanco, V., Ben Ali, S., Puerto, J. (2014). Revisiting several problems and algorithms in continuous location with $\ell_\tau$ norms. Computational Optimization and Applications 58(3): 563-595.
  • [8] Blanco, V., Puerto, J., Salmerón, R. (2018). Locating hyperplanes to fitting set of points: A general framework. Computers & Operations Research, 95, 172-193.
  • [9] Blanco, V., Puerto, J., Rodríguez-Chía, A. M. (2020). On $\ell_p$-Support Vector Machines and Multidimensional Kernels. Journal of Machine Learning Research 21 (2020).
  • [10] Blanco, V., Japón, A., Puerto, J. (2020). Optimal arrangements of hyperplanes for multiclass classification. Advances in Data Analysis and Classification 14, 175–199.
  • [11] Boucher, J‐P., Denuit, M. and Guillen, M. Number of accidents or number of claims? An approach with zero‐inflated Poisson models for panel data. Journal of Risk and Insurance 76(4), 821–846 (2009).
  • [12] Cortes, C., Vapnik, V.: Support-vector networks. Machine learning 20(3), 273–297 (1995)
  • [13] Cover, T., Hart, P.: Nearest neighbor pattern classification. IEEE transactions on information theory 13(1), 21–27 (1967)
  • [14] Duan, Y., Wu, O. (2018). Learning with auxiliary less-noisy labels. IEEE Transactions on Neural Networks and Learning Systems, 28(7), 1716-1721.

  • [15] Federal Trade Commission. Consumer sentinel network data book for January-December 2016. March 2017.
  • [16] Ghaddar, B., Naoum-Sawaya, J. (2018). High dimensional data classification and feature selection using support vector machines. European Journal of Operational Research, 265(3), 993-1004.

  • [17] Ghoggali, N., Melgani, F. (2009). Automatic ground-truth validation with genetic algorithms for multispectral image classification. IEEE Transactions on Geoscience and Remote Sensing, 47(7), 2172-2181.

  • [18] Han, X., Chang, X. (2013). An intelligent noise reduction method for chaotic signals based on genetic algorithms and lifting wavelet transforms. Information Sciences, 218, 103-118.
  • [19] Harris, T.: Quantitative credit risk assessment using support vector machines: Broad versus narrow default definitions. Expert Systems with Applications 40(11), 4404–4413 (2013)
  • [20] Horn, D., Demircioglu, A., Bischl, B., Glasmachers, T., and Weihs, C. (2016). A comparative study on large scale kernelized support vector machines. Advances in Data Analysis and Classification, 1-17.
  • [21] Huang, X.L., Shi, L. and Suykens, J.A.K. Ramp loss linear programming support vector machine J. Mach. Learn. Res., 15 (2014), 2185-2211.
  • [22] K. Ikeda and N. Murata (2005). Geometrical Properties of Nu Support Vector Machines with Different Norms. Neural Computation 17(11), 2508-2529.
  • [23] K. Ikeda and N. Murata (2005). Effects of norms on learning properties of support vector machines. ICASSP (5), 241-244
  • [24] Kašćelan, V., Kašćelan, L., Novović Burić, M.: A nonparametric data mining approach for risk prediction in car insurance: a case study from the montenegrin market. Economic research-Ekonomska istraživanja 29(1), 545–558 (2016)
  • [25] Lewis, D.D.: Naive (Bayes) at forty: The independence assumption in information retrieval. In: European Conference on Machine Learning, pp. 4–15. Springer (1998)
  • [26] Lichman, M.: UCI Machine Learning Repository (2013). URL http://archive.ics.uci.edu/ml
  • [27] López, J., Maldonado, S., and Carrasco, M. (2018). Double regularization methods for robust feature selection and SVM classification via DC programming. Information Sciences, 429, 377-389.
  • [28] Labbé, M., Martínez-Merino, L. I., and Rodríguez-Chía, A. M. (2018). Mixed Integer Linear Programming for Feature Selection in Support Vector Machine. Discrete Applied Mathematics, https://doi.org/10.1016/j.dam.2018.10.025.
  • [29] Majid, A., Ali, S., Iqbal, M., Kausar, N.: Prediction of human breast and colon cancers from imbalanced data using nearest neighbor and support vector machines. Computer methods and programs in biomedicine 113(3), 792–808 (2014)
  • [30] Maldonado, S., Pérez, J., Weber, R., Labbé, M. (2014). Feature selection for support vector machines via mixed integer linear programming. Information sciences, 279, 163-175.
  • [31] S. Maldonado, C. Bravo, J. López, J. Pérez (2017) Integrated framework for profit-based feature selection and SVM classification in credit scoring. Decision Support Systems, 104, 113-121.
  • [32] Mangasarian, O.L. Arbitrary-norm separating plane. Oper. Res. Lett., 24 (1– 2):15–23 (1999).
  • [33] Martínez, D., Millerioux, G. (2000). Support vector committee machines. European Symposium on Artificial Neural Networks-ESSANN’2000.
  • [34] Meyer, D., Dimitriadou, E., Hornik, K., Weingessel, A., Leisch, F.: e1071: Misc Functions of the Department of Statistics, Probability Theory Group (Formerly: E1071), TU Wien. R package version 1.6-8. https://CRAN.R-project.org/package=e1071 (2017)
  • [35] Nalepa, J., Kawulok, M. (2018). Selecting training sets for support vector machines: a review. Artificial Intelligence Review, 1-44.

  • [36] Pedregosa, F., Varoquaux, G. , Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M. and Duchesnay, E., Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research 12, 2825–2830, 2011.
  • [37] Peng, X, Chen, D. (2018). PTSVRs: Regression models via projection twin support vector machine Information Sciences 435, 1–14.
  • [38] Peng, X. Xu, D., Kong, L. and Chen, D. (2016). L1-norm loss based twin support vector machine for data recognition Information Sciences 340–341, 86-103.
  • [39] Radhimeenakshi, S.: Classification and prediction of heart disease risk using data mining techniques of support vector machine and artificial neural network. In: Computing for Sustainable Global Development (INDIACom), 2016 3rd International Conference on, pp. 3107–3111. IEEE (2016)
  • [40] Xiao, H., Biggio, B., Nelson, B., Xiao, H., Eckert, C., Roli, F. (2015). Support vector machines under adversarial label contamination. Neurocomputing, 160, 53-62.
  • [41] Xu, L., Crammer, K., Schuurmans, D. (2006, July). Robust support vector machine training via convex outlier ablation. In AAAI (Vol. 6, pp. 536-542).