1 Introduction
One of the current challenges in machine learning is the lack of sufficient data [Forman and Cohen 2004]. In this scenario, overfitting becomes hard to avoid, outliers and noise become an important issue, and the model generally has high variance. Several approaches have been proposed to deal with small datasets, although work in this area is still scarce. Among the proposed approaches, synthetic sample generation or data augmentation techniques [Li and Wen 2014, Wong et al. 2016, Yang et al. 2011] have shown competitive performance, acting as a regulariser [Zhang et al. 2017], preventing overfitting and improving the robustness of both classifiers and regressors.
The generation of virtual examples is highly non-trivial and has been studied from different perspectives. Proposed methods use prior information [Niyogi, Girosi, and Poggio 2002], add noise [Zhang et al. 2017], apply simple transformations [Cireşan et al. 2010, Simard, Steinkraus, and Platt 2003, Krizhevsky, Sutskever, and Hinton 2012, Szegedy et al. 2015] or use data oversampling approaches [Chawla et al. 2002, Pérez-Ortiz et al. 2016].
The most straightforward oversampling approach is to randomly replicate data. However, this can lead to overfitting [Galar et al. 2012]. Another common approach is to oversample taking the data distribution into account: a convex combination of patterns that are close in the input space has been successfully used for this purpose [Chawla et al. 2002, Zhang et al. 2017, Pérez-Ortiz et al. 2016].
In this paper we investigate the benefits and limitations of this simple data augmentation technique coupled with semi-supervised (SSL) support vector machines. The motivations for such an approach are: i) when performing oversampling, one of the biggest challenges is how to label the synthetic examples (a problem potentially alleviated by SSL, since no label is assumed); and ii) the hypothesis that oversampling by convex combination of patterns can support the cluster assumption in SSL and help to simplify the classification task. The cluster assumption states that high-density regions with different class labels must be separated by a low-density region. Given this, two patterns are likely to have the same class label if they can be connected by a path passing through high-density regions. The method proposed here is based on the synthetic generation of high-density regions as an inductive bias for the classifier. We perform a thorough set of experiments over synthetic and benchmark binary datasets, showing how this approach helps to mitigate the effect of small, high-dimensional and imbalanced datasets.
2 Methodology
2.1 Data oversampling by convex combination
Assume that the data form a finite sample S drawn i.i.d. from a distribution P and that our aim is to compute a finite-sample function of S. Resampling approximates the finite-sample distribution of this function computed over S by the exact distribution of the function over a sample

S* ∼ P*(· | S),    (1)

where P* is defined as the resampling distribution and explicitly depends on the observations in S. Resampling is commonly used in machine learning for data augmentation.
In the case of binary classification we also have access to a labelling y for each pattern. When dealing with small or imbalanced datasets, appropriately capturing the joint probability function P(x, y) might be unrealistic. Because of this, most oversampling approaches are rather simple. Usually, synthetic patterns are generated by convex combination of two seed patterns belonging to the same class and are labelled directly with that same class label [Chawla et al. 2002]. The first seed pattern is chosen randomly, and the second one is chosen among its k nearest neighbours. The nearest-neighbour restriction is responsible for avoiding label inconsistencies and exploiting the local information of the data, but it can also significantly limit the diversity of the synthetic patterns.

Limitations
Figure 1 shows a toy imbalanced dataset where the classes are not convex (left) and some examples of synthetic patterns that could be created for the minority class in order to balance the class distributions (right). This illustrates the main problem encountered when using this oversampling approach, especially when the number of nearest neighbours is not properly optimised: synthetic patterns are created in the region of the majority class, and if we naively label these patterns as minority-class patterns, we introduce what we denote a label inconsistency.
Different reformulations of SMOTE have been proposed to solve this. One of the most recent proposals is to perform synthetic oversampling in the feature space induced by a kernel function [Pérez-Ortiz et al. 2016]. In this new feature space the classes should ideally be linearly separable (and convex), so that label inconsistencies can be avoided when using a convex combination of patterns. This technique has been shown to significantly improve on standard SMOTE.
Effect on the data distribution
We now study the effect of oversampling by means of a convex combination of patterns. At every step we create a synthetic instance x_s by selecting two patterns x_i and x_j at random:

x_s = λ x_i + (1 − λ) x_j,  λ ∼ U(0, 1),    (2)

where we restrict x_j ∈ NN_k(x_i), with NN_k(x_i) a function that returns the k nearest neighbours of x_i. Note that when oversampling within a classification framework the choice is usually also restricted so that y_i = y_j.
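The generation step of Eq. 2 can be sketched as follows. This is a minimal illustration, not the authors' code: the function name and the brute-force nearest-neighbour search are our own, and we assume a small dataset where pairwise distances fit in memory.

```python
import numpy as np

def oversample_convex(X, n_synthetic, k=5, rng=None):
    """Generate synthetic patterns by convex combination of a random seed
    pattern and one of its k nearest neighbours (SMOTE-style, Eq. 2)."""
    rng = np.random.default_rng(rng)
    n = len(X)
    # Brute-force pairwise distances: fine for the small datasets considered.
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)                    # a pattern is not its own neighbour
    neighbours = np.argsort(d, axis=1)[:, :k]      # k nearest neighbours per pattern
    synthetic = np.empty((n_synthetic, X.shape[1]))
    for s in range(n_synthetic):
        i = rng.integers(n)                        # first seed, chosen at random
        j = rng.choice(neighbours[i])              # second seed, one of its neighbours
        lam = rng.uniform()                        # convex weight lambda ~ U(0, 1)
        synthetic[s] = lam * X[i] + (1 - lam) * X[j]
    return synthetic
```

For class-dependent oversampling, the function would simply be applied separately to the patterns of each class.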
For simplicity, let us first assume x ∈ ℝ and that x_i and x_j come independently from the same Normal distribution N(μ, σ²). The characteristic function of the Normal distribution is:

φ(t) = exp(iμt − σ²t²/2).    (3)

For a fixed λ, the new random variable Z = λx_i + (1 − λ)x_j will have the characteristic function:

φ_Z(t) = φ(λt) φ((1 − λ)t) = exp(iμt − (λ² + (1 − λ)²)σ²t²/2),    (4)

meaning that the convex combination of these two patterns will follow the distribution N(μ, (λ² + (1 − λ)²)σ²), which for λ ∈ [0, 1] translates into a variance within [σ²/2, σ²]. This means that the resampled distribution will most probably have a lower variance, yielding synthetic data more concentrated around the mean.
If the seed patterns do not come from the same distribution, i.e. x_i ∼ N(μ_i, σ_i²) and x_j ∼ N(μ_j, σ_j²) with different parameters, then the above derivation no longer holds. We assume, however, that given that these patterns are neighbours, they do come from the same distribution.
The density function of Z when λ ∼ U(0, 1) is:

f_Z(z) = ∫₀¹ f_N(z; μ, (λ² + (1 − λ)²)σ²) dλ,    (5)

with f_N(·; μ, σ²) the density function of the Normal distribution N(μ, σ²). Since the mean of Z does not depend on λ, the variance of Z can be evaluated as:

Var(Z) = ∫₀¹ (λ² + (1 − λ)²) σ² dλ = (2/3) σ².    (6)

The original variance is therefore always reduced by a factor of one third: the resampled variance is 2σ²/3.
Given that oversampling is applied independently per dimension, we have Var(Z⁽ⁱ⁾) = 2σᵢ²/3, where Z⁽ⁱ⁾ is the ith dimension of Z.
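The variance shrinkage of Eq. 6 is easy to verify numerically. The sketch below treats the two seeds as independent draws, exactly as the characteristic-function argument above does (it ignores the nearest-neighbour restriction); all variable names are our own.

```python
import numpy as np

# Monte Carlo check: for Z = lam*x + (1-lam)*y with x, y ~ N(mu, sigma^2)
# independent and lam ~ U(0, 1), the mean is preserved and Var(Z) = 2*sigma^2/3.
rng = np.random.default_rng(0)
mu, sigma, n = 3.0, 2.0, 1_000_000
x = rng.normal(mu, sigma, n)
y = rng.normal(mu, sigma, n)
lam = rng.uniform(size=n)
z = lam * x + (1 - lam) * y
# z.mean() stays close to mu; z.var() shrinks towards (2/3) * sigma**2.
```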
Let us now analyse the multivariate case where x ∈ ℝᵈ and x_i, x_j ∼ N(μ, Σ). For that, let us first assume a matrix P for changing the basis such that x′ = P x. If we choose P to be a basis formed by the unit eigenvectors of Σ, then it is easy to show that the covariance matrix of x′ is a diagonal matrix Λ formed by the eigenvalues associated to Σ, i.e. the ith diagonal value is the variance of the data along the ith eigenvector of Σ. In the rotated axes, Eq. 2 can be rewritten as:

x′_s = P x_s = λ P x_i + (1 − λ) P x_j,    (7)

since P is a linear operator. Convex combinations of patterns are thus invariant to rotations of the coordinate axes. In these axes, the data coming from our transformed resampling distribution will have the diagonal covariance matrix:

(2/3) Λ.    (8)

It follows that when oversampling through convex combinations of patterns using the uniform distribution, the mean of the data will remain unchanged and so will the eigenvectors of the covariance matrix, but the eigenvalues will shrink, each by a factor of 2/3.
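The multivariate claim can also be checked empirically. The sketch below pairs patterns at random (dropping the nearest-neighbour restriction of Eq. 2, which matches the i.i.d. analysis above); the covariance matrix used is an arbitrary illustration of our own.

```python
import numpy as np

rng = np.random.default_rng(0)
cov = np.array([[4.0, 1.5],
                [1.5, 1.0]])                       # arbitrary full covariance
X = rng.multivariate_normal([0.0, 0.0], cov, size=200_000)

# Convex combinations of randomly paired patterns.
i = rng.integers(len(X), size=len(X))
j = rng.integers(len(X), size=len(X))
lam = rng.uniform(size=(len(X), 1))
Z = lam * X[i] + (1 - lam) * X[j]

# Eigenvalues of the resampled covariance shrink uniformly towards 2/3 of the
# originals, while the eigenvectors (principal axes) are preserved.
ratio = np.linalg.eigvalsh(np.cov(Z.T)) / np.linalg.eigvalsh(np.cov(X.T))
```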
Figure 2 shows the result of oversampling two Normal distributions, each representing the data associated to one class in our classification problem. It can be seen that by performing convex combinations of patterns we change the data distribution. We use this to induce high-density regions that are later used by the SSL algorithm.
2.2 Semi-supervised learning (SSL)
In semi-supervised learning (SSL), we not only have access to labelled data drawn i.i.d. according to the joint distribution P(x, y), but also to additional unlabelled data drawn i.i.d. according to the marginal P(x).
To date, theoretical analysis of SSL fails to provide solid evidence for the benefits of using unlabelled patterns in a supervised learning task [Ben-David, Lu, and Pál 2008]. Generally, the consensus reached in the literature is that unlabelled data: (i) should be used with care, because it has been seen to degrade classifier performance in some cases (e.g. when we assume incorrect data models [Cozman, Cohen, and Cirelo 2003] or there are outliers or samples of unknown classes [Shahshahani and Landgrebe 1994]); (ii) is mostly beneficial in the presence of few labelled samples [Singh, Nowak, and Zhu 2008, Shahshahani and Landgrebe 1994, Cozman, Cohen, and Cirelo 2003]; (iii) can help to mitigate the effect of the Hughes phenomenon (i.e. the curse of dimensionality) [Shahshahani and Landgrebe 1994]; (iv) can help only if there exists a link between the marginal data distribution and the target function to be learnt, and both labelled and unlabelled data are generated from the same data distribution [Huang et al. 2006]; and finally (v) can improve on the performance of supervised learning when density sets are discernible from unlabelled but not from labelled data [Singh, Nowak, and Zhu 2008].

SSL algorithms can be classified using the following taxonomy [Chapelle, Schölkopf, and Zien 2010]: i) generative models, which estimate the conditional density; ii) low-density separators, which maximise the class margin; iii) graph-based models, which propagate information through a graph; and iv) algorithms based on a change of representation. The most widely used SSL algorithms belong to the low-density separator or graph-based groups. Generative approaches are said to solve a more complex problem than discriminative ones and to require more data, while algorithms based on a change of representation do not use the full potential of unlabelled data. Because of this, we focus on low-density separators.
2.3 Exploiting the cluster assumption
Labelling synthetically generated patterns without knowledge about the underlying joint distribution is a highly non-trivial problem. Instead, we approach this by using SSL, assuming that every synthetic pattern belongs to the set of unlabelled data.
We exploit the cluster assumption by artificially connecting labelled patterns x_i and x_j belonging to the same class (y_i = y_j) through unlabelled samples. Two patterns x_i and x_j are said to be connected if there exists a sequence of relatively dense patterns such that the marginal density varies smoothly along the sequence between x_i and x_j [Singh, Nowak, and Zhu 2008]. We have shown in Section 2.1 that oversampling two patterns x_i and x_j by convex combination makes the density function more compact in the region that connects them. This property is maintained for all random variables that are a linear combination of two patterns coming from the same distribution (independently of whether that distribution has the reproductive property). The cluster assumption is the basis for different low-density semi-supervised learners: if two patterns are linked by a path of high density (e.g. if they belong to the same cluster), then their outputs are likely to be similar [Chapelle, Schölkopf, and Zien 2010]. Our proposal of using synthetic patterns as unlabelled samples can thus be seen as synthetically generating high-density regions for each class as an inductive bias for the classifier. An example of oversampling can be found in Figure 3, where oversampled patterns are plotted in black.
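The overall construction of the semi-supervised training sets can be sketched as follows. This is our own illustration of the idea, not the authors' code: synthetic patterns are generated per class by convex combination (here from random same-class pairs, omitting the nearest-neighbour restriction for brevity) and added as unlabelled data rather than inheriting the seed label.

```python
import numpy as np

def make_ssl_sets(X, y, n_per_class, rng=None):
    """Keep (X, y) as the labelled set; generate class-dependent convex
    combinations of same-class pattern pairs and return them as the
    *unlabelled* set X_u (no label is assigned to synthetic patterns)."""
    rng = np.random.default_rng(rng)
    X_u = []
    for c in np.unique(y):
        Xc = X[y == c]                              # patterns of one class
        i = rng.integers(len(Xc), size=n_per_class)
        j = rng.integers(len(Xc), size=n_per_class)
        lam = rng.uniform(size=(n_per_class, 1))    # convex weights
        X_u.append(lam * Xc[i] + (1 - lam) * Xc[j])
    return X, y, np.vstack(X_u)
```

The labelled set and X_u would then be handed to the semi-supervised learner, which treats the labels of X_u as optimisation variables.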
Our objective is thus to seek a classifier f and a labelling ȳ of the synthetic patterns by minimising:

Ω(f) + λ₁ Σᵢ ℓ(f(xᵢ), yᵢ) + λ₂ Σⱼ ℓ(f(x̄ⱼ), ȳⱼ),    (9)

where ℓ are loss functions, Ω(f) is a regulariser, and λ₁ and λ₂ are real-valued parameters which reflect confidence in the labels and in the cluster assumption, respectively. The labels of the synthetic data are treated as additional optimisation variables, as is common in SSL [Sindhwani, Keerthi, and Chapelle 2006, Sindhwani and Keerthi 2006]. An effective loss function over an unlabelled pattern x̄ is max(0, 1 − |f(x̄)|), which corresponds to making the optimal choice for the unknown label and promotes decision boundaries that pass through low-density regions.

Choice of low-density separator
The most common approach for constructing an SSL low-density separator is to use a maximum margin approach (e.g. using Support Vector Machines, SVMs). However, the formulation in Eq. 9 results in a hard optimisation problem when unlabelled data is abundant. In the semi-supervised SVM classification setting (S³VM), this minimisation problem is solved over both the hyperplane parameters (w, b) and the label vector ȳ:

min over (w, b, ȳ) of  (1/2)‖w‖² + C Σᵢ V(yᵢ, oᵢ) + C* Σⱼ V(ȳⱼ, ōⱼ),    (10)

where oᵢ = w·xᵢ + b and V is a loss function. This problem is solved under the class balancing constraint:

(1/u) Σⱼ max(ȳⱼ, 0) = r,    (11)

where u is the number of unlabelled patterns and r is a user-specified ratio of unlabelled data to be assigned to the positive class. Unlike SVMs, this S³VM formulation leads to a non-convex optimisation problem, which is solved either by combinatorial or continuous optimisation [Chapelle, Sindhwani, and Keerthi 2008].
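As a concrete illustration, the S³VM objective of Eq. 10 can be evaluated for a candidate solution as follows. This is a sketch with the hinge loss as V; the function name and signature are ours, not from the cited implementations, and the hard part (the non-convex search over w, b and ȳ) is deliberately left out.

```python
import numpy as np

def s3vm_objective(w, b, X_l, y_l, X_u, y_u, C, C_star):
    """Evaluate the S3VM objective (Eq. 10) for a given hyperplane (w, b) and
    a candidate labelling y_u in {-1, +1} of the unlabelled patterns,
    using the hinge loss V(y, o) = max(0, 1 - y*o)."""
    o_l = X_l @ w + b                       # outputs on labelled patterns
    o_u = X_u @ w + b                       # outputs on unlabelled patterns
    margin = 0.5 * np.dot(w, w)             # (1/2) ||w||^2
    loss_l = C * np.sum(np.maximum(0.0, 1.0 - y_l * o_l))
    loss_u = C_star * np.sum(np.maximum(0.0, 1.0 - y_u * o_u))
    return margin + loss_l + loss_u
```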
The method chosen in this paper is an S³VM based on label switching, which has shown promising performance and is robust to changes in the hyperparameters [Chapelle, Sindhwani, and Keerthi 2008]. This technique performs a local combinatorial search guided by a label-switching procedure. The vector ȳ is initialised with the labelling given by an SVM trained only on the labelled set, restricted to maintain the class ratio previously defined by r. Subsequent steps of the algorithm comprise switching the labels of two unlabelled patterns x̄ᵢ and x̄ⱼ with ȳᵢ = 1 and ȳⱼ = −1 (in order to maintain class proportions) whenever they satisfy the following condition:

V(1, ōᵢ) + V(−1, ōⱼ) > V(−1, ōᵢ) + V(1, ōⱼ),    (12)

i.e. the loss after switching these labels is lower.
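The switching criterion can be sketched in isolation as follows. This is our own simplified illustration with the hinge loss and fixed classifier outputs o_u: the full algorithm retrains the S³VM between switching passes, which is omitted here.

```python
import numpy as np

def hinge(y, o):
    """V(y, o) = max(0, 1 - y*o)."""
    return np.maximum(0.0, 1.0 - y * o)

def switch_labels(y_u, o_u):
    """Greedy label switching: repeatedly swap the labels of one positive and
    one negative unlabelled pattern whenever the swap lowers the summed loss
    (the condition of Eq. 12). Swapping a +1 with a -1 preserves the class
    proportions. Terminates because each swap strictly decreases the loss."""
    y = y_u.copy()
    improved = True
    while improved:
        improved = False
        pos = np.where(y == 1)[0]
        neg = np.where(y == -1)[0]
        for i in pos:
            for j in neg:
                if hinge(1, o_u[i]) + hinge(-1, o_u[j]) > \
                   hinge(-1, o_u[i]) + hinge(1, o_u[j]):
                    y[i], y[j] = -1, 1
                    improved = True
                    break
            if improved:
                break
    return y
```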
Concerning the computational complexity of our proposal, the main bottleneck is the SSL part, as the complexity of oversampling is linear. The complexity of the S³VM is of the same order as that of a standard SVM; however, it is trained with more data (i.e. real plus synthetic).
Ensemble of synthetic hypotheses
Since the estimation of the resampling distribution is a stochastic process, we also consider the use of different resampling distributions in an ensemble framework. The application is straightforward: each member of the ensemble is formed by a resampling distribution and an S³VM model. Final labels are computed by majority voting.
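The voting step is elementary; a minimal sketch (our own helper, assuming labels in {−1, +1} and an odd number of ensemble members to avoid ties):

```python
import numpy as np

def majority_vote(votes):
    """votes: array of shape (n_members, n_patterns) with entries in {-1, +1}.
    Returns the majority label per pattern (use an odd number of members
    so the column sums cannot be zero)."""
    return np.sign(votes.sum(axis=0)).astype(int)
```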
Table 1: Characteristics of the benchmark datasets (number of patterns N and number of attributes d).

Dataset             N     d   Dataset             N     d
haberman (HA)      306    3   hepatitis (HE)     155   19
listeria (LI)      539    4   bands (BA)         365   19
mammog. (MA)       830    5   heart-c (HC)       302   22
monk-2 (MO)        432    6   labor (LA)          57   29
appendicitis (AP)  106    7   pima (PI)          768    8
glassG2 (GL)       163    9   credit-a (CR)      690   43
saheart (SA)       462    9   specfth. (SP)      267   44
breast-w (BW)      699    9   card (CA)          690   51
heart-Y (HY)       270   13   sonar (SO)         156   60
breast (BR)        286   15   colic (CO)         368   60
housevot. (HO)     232   16   credit-g (CG)     1000   61
banana            5300    2   ionosphere         351   34
liver              583   10   wisconsin          569   32

All nominal variables are transformed into binary ones.
3 Experimental results
In our experiments we try to answer the following questions:

1. What are the largest contributing factors to the degradation in performance when dealing with small datasets?

2. Does oversampling prevent the need for collecting further data in small and imbalanced scenarios?

3. How does our approach of using SSL and not labelling data compare to other approaches in the literature?

4. In the context of classification, is class-dependent oversampling better than class-independent oversampling?

To answer the first question, we perform a first experiment using synthetically generated datasets. To answer questions 2–4, we perform two additional experiments, in which we test a wide range of approaches on real-world benchmark datasets, changing the percentage of missing patterns to study the influence of the data sample size (second experiment) and imbalanced class distributions (third experiment).
All the methodologies have been tested considering the paradigm of Support Vector Machines (SVMs) [Cortes and Vapnik 1995]. The benchmark datasets are extracted from the UCI repository [Lichman 2013] (characteristics shown in Table 1). These datasets are not originally imbalanced or extremely small; instead, these characteristics are induced synthetically by removing a percentage of patterns at random, so that performance can be compared against that obtained with the original full dataset.
Because of space restrictions, we only show mean test results and rankings; all results can be accessed online (https://doi.org/10.17863/CAM.32312).
3.1 Methodologies tested
In order to address the difference between using real vs. synthetic data, we compare standard supervised SVMs (with no oversampling or missing data) to different approaches with data Missing Completely At Random (MCAR). Note that this comparison is not strictly fair, but it provides a useful baseline to evaluate our oversampling approaches. Thus, our objective is not to surpass the performance achieved with real data by the use of synthetic data, but rather to reach a similar performance. We also compare our proposed approach to: 1) previous oversampling approaches that use naive labelling [Chawla et al. 2002, Pérez-Ortiz et al. 2016] and 2) transductive graph-based SSL, as another alternative for labelling synthetic data. Within our proposed methods we have different approaches: class-dependent and class-independent oversampling (i.e. oversampling classes separately or not) and an ensemble of S³VM models using unlabelled synthetically generated patterns. Note that the optimisation procedures of SVM and S³VM are different, which may influence the results (the S³VM is said to be more prone to reaching local optima). Because of this, we include another baseline: an S³VM model that reintroduces the real data removed at random as the unsupervised set. The main purpose here is to compare oversampled vs. real data within the S³VM framework.
3.2 Experimental setup
A stratified k-fold technique has been performed to divide all datasets. Each experiment is repeated 3 times in order to obtain robust results (except for deterministic methods), and results are reported as the mean and standard deviation of the selected measures. The same seed is used for random number generation, meaning that the same patterns are removed from the dataset and created by oversampling. The cost parameter of SVM-based methods was selected by means of a nested fold method on the training set, and the kernel width parameter was also cross-validated for the SVM-based methods. For all the methods using large-scale semi-supervised SVMs [Sindhwani and Keerthi 2006], the regularisation parameters were likewise optimised by means of a nested cross-validation. To ease comparisons, the number of synthetically generated patterns is set to the number initially removed from the dataset, and several numbers of nearest neighbours were evaluated to generate the synthetic samples. The Euclidean distance has been used for all distance computations. The parameter used by the oversampling method in [Pérez-Ortiz et al. 2016] to control the dimensionality of the feature space has been cross-validated within the values {0.25, 0.5, 0.75}. The kernel width parameter associated to transductive methods (to construct the graph) has been set to the same value as that of the SVM kernel used. The rest of the parameters have been set to default values.
There are several minor modifications of these algorithms when using them for either small or imbalanced datasets. As stated before, in the case of imbalanced data, we introduce a new parameter r for S³VM methods, which controls the ratio of patterns assigned to the minority class. This class balancing parameter has been fixed to the initial class distribution in the first and second experiments (where the data is balanced) and cross-validated within the values {0.5, 0.7, 0.9} for the imbalanced datasets (where all the synthetically generated patterns are supposed to belong to the minority class, but where we need to allow a certain number of errors to fix label inconsistencies). Moreover, for graph-based algorithms, several issues have been noticed in imbalanced domains [Zheng and Skillicorn 2016]. To prevent this, we also use a class mass normalisation procedure to adjust the class distribution so that it matches the priors [Zhu, Ghahramani, and Lafferty 2003].
3.3 Evaluation metrics
The results have been reported in terms of two metrics:

1. Accuracy. However, given that for imbalanced cases this metric is not the best option, we also use the mean of the sensitivities per class.

2. The Geometric Mean of the sensitivities, GM = √(Sp · Sn) [Kubat and Matwin 1997], where Sp is the sensitivity for the positive class (ratio of correctly classified patterns considering only this class) and Sn is the sensitivity for the negative one.

The GM measure was used for parameter selection given its robustness [Kubat and Matwin 1997].
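The GM metric is straightforward to compute; a minimal sketch (the function name is ours), assuming binary labels in {−1, +1}:

```python
import numpy as np

def geometric_mean_score(y_true, y_pred):
    """GM = sqrt(Sp * Sn): geometric mean of the per-class sensitivities
    for binary labels in {-1, +1}."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    sp = np.mean(y_pred[y_true == 1] == 1)     # sensitivity, positive class
    sn = np.mean(y_pred[y_true == -1] == -1)   # sensitivity, negative class
    return float(np.sqrt(sp * sn))
```

Unlike accuracy, GM collapses to zero when either class is completely misclassified, which is why it behaves well under class imbalance.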
3.4 Results
Firstly, we test the influence of the number of patterns removed at random. Figure 4 shows the mean degradation in test performance for S-MCAR when changing the number of patterns removed from the benchmark datasets. As can be seen, all metrics experience a relatively large degradation.
First experiment: Synthetically generated datasets
Synthetic datasets generated with the tool of [Sánchez-Monedero et al. 2013] are used. All of these datasets represent binary and perfectly balanced classification tasks, in which the data has been generated using a Normal distribution while varying different parameters: 1) the dimensionality of the input space (d, set to 2, 10 and 100), 2) the number of patterns (N, set to 50, 100 and 1000), and 3) the variance of the data (V, controlling the overlap between the classes and set to 0.167, 0.25 and 0.5). All combinations of these parameters have been explored, and all classes have been designed to be bimodal. Figure 5 shows two examples of the generated synthetic datasets. We test three ratios of patterns removed at random (MCAR): 0.2, 0.5 and 0.8.
For this experiment, we use four approaches: SVM (with the original dataset), S-MCAR (MCAR, no oversampling), S³VM with real unlabelled data (S³VM-Real, for which the removed data is included again as unlabelled in the model) and our proposal using class-dependent oversampling (S³VM-OvS). Note that the comparison against SVM and S³VM-Real is only for reference and not strictly fair, since those classifiers have access to all the real data, which is not the case for S-MCAR and S³VM-OvS.
Table 2: Mean test results of the four methods (SVM, S-MCAR, S³VM-Real and S³VM-OvS) on the synthetic datasets, for every combination of d ∈ {2, 10, 100}, N ∈ {50, 100, 1000}, V ∈ {0.167, 0.25, 0.5} and MCAR ratio ∈ {0.2, 0.5, 0.8} (individual values omitted here; see the online results).
From this experiment, we obtained results for datasets with different characteristics, three MCAR levels and four methods (a total of 324 individual results), shown in Table 2. To analyse these properly, we summarised the results independently per factor in Figure 6 using boxplots. Some conclusions can be drawn. Firstly, the overlap of the classes (variance factor) is the main factor contributing to performance degradation. If the data does not overlap (small variance), a high performance can be achieved even if we remove data (compare SVM to S-MCAR). The same applies when the data dimensionality is low: for d = 2 and d = 10, removing data is not problematic, but an important degradation is seen for d = 100. The removal of data especially affects small datasets (N = 50 and N = 100) but not N = 1000. Concerning the proposed approach (S³VM-OvS), results similar to those obtained with real unlabelled data (S³VM-Real) can be achieved, which is a positive outcome. Both are also close to the performance obtained with the complete dataset (compare S³VM-Real and S³VM-OvS to SVM), which means that oversampled data can replace real data, even when the real data is labelled. In some cases, such as high-dimensional datasets, the performance even surpasses that obtained with the original data. The proposal not only helps with small datasets, but also with relatively large ones (N = 1000), perhaps because in this scenario the amount of data simplifies the oversampling task by better exploiting local information. Thus, we can conclude that the proposed methodology helps especially for high-dimensional datasets, independently of their size and class overlap, and that its performance is stable with respect to the percentage of data removed (last factor).
Table 3: Mean test rankings (the lower the better) for the second experiment at the three MCAR ratios. Methods compared: SVM; S-MCAR; SVM+OvS; SVM+kOvS; transductive graph-based approaches (real unlabelled data, class-dependent OvS, class-independent OvS); and the proposed S³VM approaches (real unlabelled data, class-dependent OvS, class-independent OvS, ensemble). Friedman's test statistics, confidence intervals and F-values accompany the rankings (values omitted here; see the online results).
Figure 7 shows the effect of changing the number of synthetically generated patterns for S³VM-OvS. It can be seen that similar results are obtained independently of the number of generated patterns. The percentage of data missing at random has, however, an important impact.
Second experiment: Small sample size
For this experiment, we artificially reduce the size of the benchmark datasets (again testing the same proportions of removed patterns). Because of the amount of results, we only provide the mean test ranking (the lower the better) in Table 3. The test rejects the null hypothesis that all algorithms perform similarly in mean ranking in all cases. As mentioned before, here we also include two oversampling approaches from the literature, SVM+OvS [Chawla et al. 2002] and SVM+kOvS [Pérez-Ortiz et al. 2016], and test transductive approaches to label the synthetic data. Again, we compare several strategies: class-dependent and class-independent oversampling, the introduction of real unlabelled data in the S³VM model for comparison purposes, and an ensemble. Note that both SVM and the methods based on real unlabelled data are unrealistic and only used as baselines. Several conclusions can be drawn. Comparing all oversampling approaches to S-MCAR, it can be seen that a convex combination of patterns can be successfully used to generate synthetic data. The use of part of the real data as unlabelled also improves the results to a reasonable extent: it is better than standard data oversampling and, if the number of patterns is not extremely low, even better than using the original dataset, which may indicate that there is some noise in the labels. The combination of oversampling and semi-supervised learning is promising and can be applied within each class or using all data independently of the labels, in most cases reaching the baseline performance obtained with the entire dataset. Observing individual results, we noticed that for the smallest datasets it is better to use all patterns for oversampling, while for bigger datasets the best approach is class-dependent oversampling. In general, transductive graph-based approaches do not report acceptable results, maybe because they depend strongly on the design of the graph or because these techniques require a larger amount of data. Finally, the introduction of diversity in an ensemble by the use of a stochastic convex combination of patterns is very promising, improving in most cases the results achieved with the original complete dataset.

Table 4: Mean test rankings (the lower the better) for the third (imbalanced) experiment at two MCAR ratios. Methods compared: SVM; S-MCAR; SVM+OvS; SVM+kOvS; transductive graph-based approaches (class-dependent OvS); and the proposed S³VM approaches (class-dependent OvS, class-independent OvS). Friedman's test statistics, confidence intervals and F-values accompany the rankings (values omitted here; see the online results).
To quantify whether statistical differences exist among the studied algorithms, a procedure for comparing multiple classifiers over multiple datasets is employed [Demsar 2006]. Table 3 shows the result of applying the non-parametric Friedman's test (for the pre-specified significance level) to the mean rankings of both metrics. The test rejects the null hypothesis that all algorithms perform similarly in mean ranking for both metrics. The best results are, as expected, obtained by the SVM technique (which does not entail any loss of information). This method is followed by the semi-supervised SVMMCAR+USP and SVMMCAR+LSPFS variants. These results demonstrate that standard SMOTE (SVMMCAR+LSP) creates label inconsistencies and that these can be avoided by the use of oversampling in the feature space or a semi-supervised approach. Moreover, as the literature has suggested, graph-based methods (SVMMCAR+LSPH) are not ideal for imbalanced datasets, although they still represent a better option than standard SMOTE. These effects are accentuated when increasing the amount of data missing at random.
Third experiment: Imbalanced samples
We also study the effect of our proposal in imbalanced classification setups. For this, we artificially induce imbalance in our data by removing a percentage of the patterns of the minority class. In this case, we test a subset of the methods used in the previous experiment (results shown in Table 4). Again, we can see that SMOTE (SVM+OvS) can be improved upon, either by optimising the patterns to generate (SVM+kOvS) or the labels of the synthetic patterns (proposed approaches). It can also be seen that it is better to oversample only the minority class (i.e. class-dependent oversampling).
4 Conclusions
We explored the idea of introducing synthetic data as unsupervised information in semi-supervised support vector machines, where the labels of synthetic data are treated as additional optimisation variables. Our experimental study has shown that: 1) synthetic patterns help when data is scarce with respect to the data dimensionality and can in a variety of cases be used as an alternative to collecting more data; 2) convex combinations of the input training data can be used to generate those synthetic samples, which do not necessarily have to be labelled; and 3) the introduction of synthetic data as unsupervised knowledge can help to improve classification in small, high-dimensional or imbalanced scenarios by acting as an inductive bias for the classifier.
Future work comprises testing this approach in a regression setting and with other semi-supervised learning approaches (e.g. the use of synthetic imaging data with autoencoders or deep belief networks).
5 Acknowledgments
This work was supported by the project TIN201785887C21P of the Spanish Ministry of Economy and Competitiveness (MINECO) and FEDER funds (EU). PT was supported by the European Commission Horizon 2020 Innovative Training Network SUNDIAL (SUrvey Network for Deep Imaging Analysis and Learning), Project ID: 721463.
References
 [Ben-David, Lu, and Pál2008] Ben-David, S.; Lu, T.; and Pál, D. 2008. Does unlabeled data provably help? Worst-case analysis of the sample complexity of semi-supervised learning. In 21st Annual Conference on Learning Theory.
 [Chapelle, Schölkopf, and Zien2010] Chapelle, O.; Schölkopf, B.; and Zien, A. 2010. Semi-Supervised Learning. The MIT Press, 1st edition.
 [Chapelle, Sindhwani, and Keerthi2008] Chapelle, O.; Sindhwani, V.; and Keerthi, S. S. 2008. Optimization techniques for semi-supervised support vector machines. Journal of Machine Learning Research 9:203–233.
 [Chawla et al.2002] Chawla, N. V.; Bowyer, K. W.; Hall, L. O.; and Kegelmeyer, W. P. 2002. SMOTE: Synthetic minority over-sampling technique. Journal of Artificial Intelligence Research 16:321–357.
 [Cireşan et al.2010] Cireşan, D. C.; Meier, U.; Gambardella, L. M.; and Schmidhuber, J. 2010. Deep, big, simple neural nets for handwritten digit recognition. Neural Comput. 22(12):3207–3220.
 [Cortes and Vapnik1995] Cortes, C., and Vapnik, V. 1995. Support-vector networks. Machine Learning 20(3):273–297.
 [Cozman, Cohen, and Cirelo2003] Cozman, F. G.; Cohen, I.; and Cirelo, M. C. 2003. Semi-supervised learning of mixture models. In Proceedings of the Twentieth International Conference on Machine Learning, 99–106.
 [Demsar2006] Demsar, J. 2006. Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research 7:1–30.
 [Forman and Cohen2004] Forman, G., and Cohen, I. 2004. Learning from Little: Comparison of Classifiers Given Little Training. Berlin, Heidelberg: Springer Berlin Heidelberg. 161–172.
 [Galar et al.2012] Galar, M.; Fernández, A.; Barrenechea, E.; Bustince, H.; and Herrera, F. 2012. A review on ensembles for the class imbalance problem: Bagging, boosting, and hybrid-based approaches. IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews 42(4):463–484.
 [Huang et al.2006] Huang, J.; Smola, A. J.; Gretton, A.; Borgwardt, K. M.; and Schölkopf, B. 2006. Correcting sample selection bias by unlabeled data. In Proceedings of the 19th International Conference on Neural Information Processing Systems, 601–608. Cambridge, MA, USA: MIT Press.
 [Krizhevsky, Sutskever, and Hinton2012] Krizhevsky, A.; Sutskever, I.; and Hinton, G. E. 2012. ImageNet classification with deep convolutional neural networks. In Pereira, F.; Burges, C. J. C.; Bottou, L.; and Weinberger, K. Q., eds., Advances in Neural Information Processing Systems 25. Curran Associates, Inc. 1097–1105.
 [Kubat and Matwin1997] Kubat, M., and Matwin, S. 1997. Addressing the curse of imbalanced training sets: One-sided selection. In Proceedings of the International Conference on Machine Learning, 179–186.
 [Li and Wen2014] Li, D.-C., and Wen, I.-H. 2014. A genetic algorithm-based virtual sample generation technique to improve small data set learning. Neurocomputing 143:222–230.
 [Lichman2013] Lichman, M. 2013. UCI machine learning repository.
 [Niyogi, Girosi, and Poggio2002] Niyogi, P.; Girosi, F.; and Poggio, T. 2002. Incorporating prior information in machine learning by creating virtual examples. Proceedings of the IEEE 86(11):2196–2209.
 [Pérez-Ortiz et al.2016] Pérez-Ortiz, M.; Gutiérrez, P. A.; Tino, P.; and Hervás-Martínez, C. 2016. Oversampling the minority class in the feature space. IEEE Transactions on Neural Networks and Learning Systems 27(9):1947–1961.
 [Sánchez-Monedero et al.2013] Sánchez-Monedero, J.; Gutiérrez, P. A.; Pérez-Ortiz, M.; and Hervás-Martínez, C. 2013. An n-spheres based synthetic data generator for supervised classification. In Advances in Computational Intelligence, 613–621. Berlin, Heidelberg: Springer Berlin Heidelberg.
 [Shahshahani and Landgrebe1994] Shahshahani, B. M., and Landgrebe, D. A. 1994. The effect of unlabeled samples in reducing the small sample size problem and mitigating the Hughes phenomenon. IEEE Transactions on Geoscience and Remote Sensing 32(5):1087–1095.
 [Simard, Steinkraus, and Platt2003] Simard, P. Y.; Steinkraus, D.; and Platt, J. C. 2003. Best practices for convolutional neural networks applied to visual document analysis. In Proceedings of the Seventh International Conference on Document Analysis and Recognition - Volume 2, ICDAR '03, 958–. Washington, DC, USA: IEEE Computer Society.
 [Sindhwani and Keerthi2006] Sindhwani, V., and Keerthi, S. S. 2006. Large scale semi-supervised linear SVMs. In Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval, 477–484. ACM.
 [Sindhwani, Keerthi, and Chapelle2006] Sindhwani, V.; Keerthi, S. S.; and Chapelle, O. 2006. Deterministic annealing for semi-supervised kernel machines. In Proceedings of the 23rd international conference on Machine learning, 841–848. ACM.
 [Singh, Nowak, and Zhu2008] Singh, A.; Nowak, R. D.; and Zhu, X. 2008. Unlabeled data: Now it helps, now it doesn’t. In Koller, D.; Schuurmans, D.; Bengio, Y.; and Bottou, L., eds., NIPS, 1513–1520. Curran Associates, Inc.
 [Szegedy et al.2015] Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; and Rabinovich, A. 2015. Going deeper with convolutions. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 1–9.
 [Wong et al.2016] Wong, S. C.; Gatt, A.; Stamatescu, V.; and McDonnell, M. D. 2016. Understanding data augmentation for classification: when to warp? CoRR abs/1609.08764.
 [Yang et al.2011] Yang, J.; Yu, X.; Xie, Z.-Q.; and Zhang, J.-P. 2011. A novel virtual sample generation method based on Gaussian distribution. Knowledge-Based Systems 24(6):740–748.
 [Zhang et al.2017] Zhang, H.; Cissé, M.; Dauphin, Y. N.; and Lopez-Paz, D. 2017. mixup: Beyond empirical risk minimization. CoRR abs/1710.09412.
 [Zheng and Skillicorn2016] Zheng, Q., and Skillicorn, D. 2016. Spectral graph-based semi-supervised learning for imbalanced classes. In 2016 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM), 960–967.
 [Zhu, Ghahramani, and Lafferty2003] Zhu, X.; Ghahramani, Z.; and Lafferty, J. 2003. Semi-supervised learning using Gaussian fields and harmonic functions. In ICML, 912–919.