Exploiting Synthetically Generated Data with Semi-Supervised Learning for Small and Imbalanced Datasets

03/24/2019 ∙ by Maria Perez-Ortiz, et al. ∙ University of Cambridge, University of Birmingham, Universidad de Córdoba, UCL

Data augmentation is rapidly gaining attention in machine learning. Synthetic data can be generated by simple transformations or through the data distribution. In the latter case, the main challenge is to estimate the label associated to new synthetic patterns. This paper studies the effect of generating synthetic data by convex combination of patterns and the use of these as unsupervised information in a semi-supervised learning framework with support vector machines, avoiding thus the need to label synthetic examples. We perform experiments on a total of 53 binary classification datasets. Our results show that this type of data over-sampling supports the well-known cluster assumption in semi-supervised learning, showing outstanding results for small high-dimensional datasets and imbalanced learning problems.




1 Introduction

One of the current challenges in machine learning is the lack of sufficient data [Forman and Cohen2004]. In this scenario, over-fitting becomes hard to avoid, outliers and noise represent an important issue and the model generally has high variance. Several approaches have been proposed to deal with small datasets, although work on this matter is still scarce. Of the proposed approaches, synthetic sample generation or data augmentation techniques [Li and Wen2014, Wong et al.2016, Yang et al.2011] have shown competitive performance, acting as a regulariser [Zhang et al.2017], preventing over-fitting and improving the robustness of both classifiers and regressors.

The generation of virtual examples is highly nontrivial and has been studied from different perspectives. Proposed methods use prior information [Niyogi, Girosi, and Poggio2002], add noise [Zhang et al.2017], apply simple transformations [Cireşan et al.2010, Simard, Steinkraus, and Platt2003, Krizhevsky, Sutskever, and Hinton2012, Szegedy et al.2015] or use data over-sampling approaches [Chawla et al.2002, Pérez-Ortiz et al.2016].

The most straightforward over-sampling approach is to randomly replicate data. However, this can lead to over-fitting [Galar et al.2012]. Another common approach is to do over-sampling taking into account the data distribution. A convex combination of patterns close in the input space has been successfully used for that purpose [Chawla et al.2002, Zhang et al.2017, Pérez-Ortiz et al.2016].

In this paper we investigate the benefits and limitations of this simple data augmentation technique coupled with semi-supervised learning (SSL) support vector machines. The motivations for such an approach are: i) when performing over-sampling, one of the biggest challenges is how to label synthetic examples (potentially alleviated when using SSL, as no label is assumed) and ii) the hypothesis that over-sampling by convex combination of patterns can support the cluster assumption in SSL and help to simplify the classification task. The cluster assumption states that high density regions with different class labels must be separated by a low density region. Given this, two patterns are likely to have the same class label if they can be connected by a path passing through high density regions. The method proposed here is based on the synthetic generation of high density regions as an inductive bias for the classifier. We perform a thorough set of experiments over synthetic and benchmark binary datasets, showing how this approach helps to mitigate the effect of small, high-dimensional and imbalanced datasets.

2 Methodology

2.1 Data over-sampling by convex combination

Assume that data forms a finite sample $X = \{x_1, \ldots, x_n\}$ drawn i.i.d. from a distribution $F$ and that our aim is to construct a finite-sample function of $X$. Resampling approximates the finite-sample distribution of the function computed over $X$ by the exact distribution of the function over $X^*$:

$$X^* = \{x^*_1, \ldots, x^*_m\} \sim F^*(x_1, \ldots, x_n), \qquad (1)$$

where $F^*$ is defined as the resampling distribution and explicitly depends on the observations in $X$. Resampling is commonly used in machine learning for data augmentation.

In the case of binary classification we also have access to a labelling $Y = \{y_1, \ldots, y_n\}$, $y_i \in \{-1, +1\}$. When dealing with small or imbalanced datasets, appropriately capturing the joint probability function $P(X, Y)$ might be unrealistic. Because of this, most over-sampling approaches are rather simple. Usually, synthetic patterns are generated by convex combination of two seed patterns belonging to the same class and labelled directly using the same class label [Chawla et al.2002]. The first seed pattern is chosen randomly, and the second one is chosen as one of its $k$-nearest neighbours. The parameter $k$ is responsible for avoiding label inconsistencies and exploiting the local information of the data, but it can also significantly limit the diversity of synthetic patterns.


Figure 1 shows a toy imbalanced dataset where the classes are not convex (left) and some examples of synthetic data patterns that could be created for the minority class in order to balance the class distributions (right). This illustrates the main problem encountered when using this over-sampling approach, especially when the parameter $k$ of the $k$-nearest neighbour search is not properly optimised: synthetic patterns are created in the region of the majority class, and if we naively label these patterns as minority class patterns, we introduce what we denote as a label inconsistency.

Figure 1: Example of an over-sampled imbalanced dataset, in which naively labelling synthetic data as minority class patterns might not be suitable.

Different reformulations of this over-sampling approach, known as SMOTE [Chawla et al.2002], have been proposed to solve this. One recent proposal is to perform synthetic over-sampling in the feature space induced by a kernel function [Pérez-Ortiz et al.2016]. In this new feature space the classes would ideally be linearly separable (and convex), and label inconsistencies can be avoided when using a convex combination of patterns. This technique has been seen to significantly improve on standard SMOTE.

Effect on the data distribution

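We now study the effect of over-sampling by means of a convex combination of patterns. At every step we create a synthetic instance $\tilde{x}$ by selecting at random two patterns $x_i, x_j \in X$:

$$\tilde{x} = x_i + \delta\,(x_j - x_i), \qquad \delta \sim U(0, 1), \qquad (2)$$

where we restrict $x_j \in k\text{-nn}(x_i)$, and $k\text{-nn}(x_i)$ represents a function that returns the $k$-nearest neighbours of $x_i$. Note that when over-sampling within a classification framework $x_j$ is usually also restricted so that $y_i = y_j$.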
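The sampling step described above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function names `k_nearest` and `oversample`, the default `k=3`, the toy seed points and the use of a single scalar interpolation weight per synthetic pattern are all assumptions made for the example, and no class restriction is applied.

```python
import math
import random


def k_nearest(points, index, k):
    """Return the indices of the k nearest neighbours of points[index] (Euclidean)."""
    others = [j for j in range(len(points)) if j != index]
    others.sort(key=lambda j: math.dist(points[index], points[j]))
    return others[:k]


def oversample(points, n_new, k=3, rng=None):
    """Create n_new synthetic points as convex combinations of neighbouring seeds."""
    rng = rng or random.Random(0)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(points))           # first seed: chosen at random
        j = rng.choice(k_nearest(points, i, k))  # second seed: one of its k neighbours
        delta = rng.random()                     # weight drawn uniformly from [0, 1)
        synthetic.append(
            tuple(a + delta * (b - a) for a, b in zip(points[i], points[j]))
        )
    return synthetic


# Hypothetical toy data: four corners of the unit square plus its centre.
seeds = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (0.5, 0.5)]
new_points = oversample(seeds, n_new=10, k=2)
```

Because every synthetic point lies on the segment between two seeds, it can never leave the convex hull of the original data, which is why a small `k` keeps new patterns close to existing ones.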

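For simplicity, let us first assume $d = 1$ and that $x_i$ and $x_j$ are independent draws from the same Normal distribution, $X_i, X_j \sim N(\mu, \sigma^2)$. The characteristic function of the Normal distribution is:

$$\varphi_X(t) = \exp\left(i \mu t - \tfrac{1}{2} \sigma^2 t^2\right). \qquad (3)$$

For a given $\delta$, the new random variable $\tilde{X} = (1 - \delta) X_i + \delta X_j$ will have the characteristic function:

$$\varphi_{\tilde{X}}(t) = \varphi_{X_i}\big((1 - \delta) t\big)\, \varphi_{X_j}(\delta t) = \exp\left(i \mu t - \tfrac{1}{2} \big((1 - \delta)^2 + \delta^2\big) \sigma^2 t^2\right), \qquad (4)$$

meaning that the convex combination of these two patterns will follow the distribution $\tilde{X} \sim N\big(\mu, ((1 - \delta)^2 + \delta^2)\,\sigma^2\big)$, which for $\delta \in [0, 1]$ translates into the variance $\tilde{\sigma}^2$ being within $[0.5\,\sigma^2, \sigma^2]$. This means that the resampled distribution will most probably have a lower variance, yielding synthetic data more concentrated around the mean.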

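If seed patterns do not come from the same distribution, i.e. $X_i \sim N(\mu_i, \sigma_i^2)$ and $X_j \sim N(\mu_j, \sigma_j^2)$, then $\tilde{X} \sim N\big((1 - \delta) \mu_i + \delta \mu_j,\ (1 - \delta)^2 \sigma_i^2 + \delta^2 \sigma_j^2\big)$. We assume, however, that given that these patterns are neighbours, they do come from the same distribution.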

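The density function of $\tilde{X}$ assuming $\delta \sim U(0, 1)$ is:

$$f_{\tilde{X}}(x) = \int_0^1 f_N\big(x;\ \mu,\ ((1 - \delta)^2 + \delta^2)\,\sigma^2\big)\, f_U(\delta)\, d\delta, \qquad (5)$$

$f_N$ being the density function of the Normal distribution and $f_U$ the density function of $U(0, 1)$. The variance of $\tilde{X}$ can thus be evaluated as:

$$\mathrm{Var}(\tilde{X}) = \int_{-\infty}^{\infty} (x - \mu)^2 f_{\tilde{X}}(x)\, dx. \qquad (6)$$

This integral can be numerically evaluated. When doing so we see that the original variance is always reduced by a third, i.e. $\mathrm{Var}(\tilde{X}) \approx 0.667\,\sigma^2$.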
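The size of this reduction can also be checked analytically. The short derivation below is a sketch under the assumptions already made in the text (two independent seeds from the same Normal distribution with mean $\mu$ and variance $\sigma^2$, and a weight $\delta \sim U(0, 1)$ independent of both); it relies on the law of total variance, which is not part of the original argument.

```latex
\begin{align*}
\mathrm{Var}(\tilde{X})
  &= \mathbb{E}\big[\mathrm{Var}(\tilde{X} \mid \delta)\big]
   + \mathrm{Var}\big(\mathbb{E}[\tilde{X} \mid \delta]\big) \\
  &= \sigma^2\, \mathbb{E}\big[(1-\delta)^2 + \delta^2\big] + \mathrm{Var}(\mu) \\
  &= \sigma^2 \left( \int_0^1 (1-\delta)^2\, d\delta + \int_0^1 \delta^2\, d\delta \right) + 0 \\
  &= \sigma^2 \left( \tfrac{1}{3} + \tfrac{1}{3} \right) = \tfrac{2}{3}\, \sigma^2 .
\end{align*}
```

The synthetic data therefore keeps two thirds of the original variance, a reduction of about $0.333\,\sigma^2$, in agreement with the numerical evaluation.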

Given that over-sampling is applied independently per dimension, we have $\mathrm{Var}(\tilde{x}^{(i)}) \approx 0.667\,\mathrm{Var}(x^{(i)})$, where $x^{(i)}$ is the $i$-th dimension of $x$.

Let us now analyse the multivariate case where $x \in \mathbb{R}^d$, $X \sim N(\mu, \Sigma)$ and $\delta \sim U(0, 1)$. For that let us first assume a matrix $A$ for changing the basis such that $Z = A X$. If we choose $A$ to be a basis formed by the unit eigenvectors of $\Sigma$ then it is easy to show that $\Sigma_Z$ (i.e. the covariance matrix of $Z$) is a diagonal matrix formed by the eigenvalues associated to $\Sigma$, i.e. the $i$-th diagonal value $\lambda_i$ is the variance of $X$ along the $i$-th eigenvector of $\Sigma$. In the rotated axis Eq. 2 can be rewritten as:

$$A \tilde{x} = A x_i + \delta\,(A x_j - A x_i) = z_i + \delta\,(z_j - z_i), \qquad (7)$$

since $A$ is a linear operator. Convex combinations of patterns are thus invariant to rotations of the co-ordinate axis. In this axis, the data coming from our transformed resampling distribution will have the diagonal covariance matrix:

$$\Sigma_{\tilde{Z}} = \mathrm{diag}(0.667\,\lambda_1, \ldots, 0.667\,\lambda_d). \qquad (8)$$

It follows that when over-sampling through convex combinations of patterns using the uniform distribution the mean of the data will remain unchanged and so will the eigenvectors of the covariance matrix, but the eigenvalues will shrink.
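A small simulation can make the multivariate statement concrete. The sketch below is only an illustration under simplifying assumptions that are not taken from the paper: a hypothetical two-dimensional covariance matrix, seed pairs drawn independently from the same Normal distribution (the nearest-neighbour restriction is ignored) and one scalar weight shared by both dimensions. Under these assumptions every entry of the empirical covariance should be close to two thirds of the original one, so the eigenvectors are preserved and the eigenvalues shrink by the same factor.

```python
import random

rng = random.Random(7)
n_draws = 100_000

# Assumed example covariance matrix (zero mean): Sigma = [[4, 2], [2, 2]].
sigma = [[4.0, 2.0], [2.0, 2.0]]


def draw_seed():
    """Draw one pattern from N(0, Sigma) using x = (2*z1, z1 + z2)."""
    z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
    return (2.0 * z1, z1 + z2)


samples = []
for _ in range(n_draws):
    x_i, x_j = draw_seed(), draw_seed()
    delta = rng.random()  # one scalar weight shared by both dimensions
    samples.append(tuple(a + delta * (b - a) for a, b in zip(x_i, x_j)))


def covariance(data, p, q):
    """Empirical covariance between dimensions p and q."""
    mean_p = sum(s[p] for s in data) / len(data)
    mean_q = sum(s[q] for s in data) / len(data)
    return sum((s[p] - mean_p) * (s[q] - mean_q) for s in data) / len(data)


# Ratio between each over-sampled covariance entry and the original entry.
ratios = [covariance(samples, p, q) / sigma[p][q] for p in range(2) for q in range(2)]
```

Since the whole matrix is scaled by a single constant, diagonalising it yields the same eigenvectors as the original covariance.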

Figure 2 shows the result of over-sampling two Normal distributions, where $X_c$ represents the data associated to class $c$ in our classification problem. It can be seen that by performing convex combinations of patterns we change the data distribution. We use this to induce high-density regions that are later used by the SSL algorithm.

Figure 2: Normal distributions and class dependent over-sampled distributions (dotted line).

2.2 Semi-supervised learning (SSL)

In semi-supervised learning (SSL), we not only have access to $n$ labelled data $\mathcal{L} = \{(x_i, y_i)\}_{i=1}^{n}$ drawn i.i.d. according to $P(X, Y)$, but also to $m$ additional unlabelled data $\mathcal{U} = \{x_j\}_{j=1}^{m}$ drawn i.i.d. according to $P(X)$.

To date, theoretical analysis of SSL fails to provide solid evidence for the benefits of using unlabelled patterns in a supervised learning task [Ben-David, Lu, and Pál2008]. Generally, the consensus reached in the literature is that unlabelled data: (i) should be used with care because it has been seen to degrade classifier performance in some cases (e.g. when we assume incorrect data models [Cozman, Cohen, and Cirelo2003] or there are outliers or samples of unknown classes [Shahshahani and Landgrebe1994]); (ii) is mostly beneficial in the presence of a few labelled samples [Singh, Nowak, and Zhu2008, Shahshahani and Landgrebe1994, Cozman, Cohen, and Cirelo2003]; (iii) can help to mitigate the effect of the Hughes phenomenon (i.e. the curse of dimensionality) [Shahshahani and Landgrebe1994]; (iv) can help only if there exists a link between the marginal data distribution and the target function to be learnt and both labelled and unlabelled data are generated from the same data distribution [Huang et al.2006]; and finally (v) can improve on the performance of supervised learning when density sets are discernible from unlabelled but not from labelled data [Singh, Nowak, and Zhu2008].

SSL algorithms can be classified using the following taxonomy [Chapelle, Schölkopf, and Zien2010]: i) generative models, which estimate the conditional density $P(x \mid y)$; ii) low density separators, which maximise the class margin; iii) graph-based models, which propagate information through a graph; and finally, iv) algorithms based on a change of representation. The most widely used SSL algorithms belong to the low density separator or graph-based groups. Generative approaches are said to solve a more complex problem than discriminative ones and require more data, while algorithms based on a change of representation do not use the full potential of unlabelled data. Because of this, we focus on low density separators.

Figure 3: Over-sampling in the banana dataset. Left figure shows the original dataset, where colours indicate the class. The figures in the middle show the dataset where data is missing at random (MCAR) for one of the classes with percentages of missing patterns of 50% and 90%. The figures on the right show the over-sampled datasets.

2.3 Exploiting the cluster assumption

Labelling synthetically generated patterns without knowledge about $P(X, Y)$ is a highly nontrivial problem. Instead, we approach this by using SSL, assuming that every synthetic pattern belongs to the set of unlabelled data, $\tilde{x} \in \mathcal{U}$.

We exploit the cluster assumption by artificially connecting labelled patterns $x_i$ and $x_j$ belonging to the same class ($y_i = y_j$) through unlabelled samples. Two patterns $x_i$ and $x_j$ are said to be connected if there exists a sequence of relatively dense patterns such that the marginal density $P(X)$ varies smoothly along the sequence of patterns between $x_i$ and $x_j$ [Singh, Nowak, and Zhu2008]. We have shown in Section 2.1 that over-sampling two patterns $x_i$ and $x_j$ by convex combination makes the density function more compact in the region that connects them. This property is maintained for all random variables that are a linear combination of two patterns $x_i$ and $x_j$ that come from the same distribution (independently of whether their distribution has the reproductive property). The cluster assumption is the basis for different low-density semi-supervised learners. This assumption implies that if two patterns are linked by a path of high density (e.g., if they belong to the same cluster), then their outputs are likely to be similar [Chapelle, Schölkopf, and Zien2010]. Our proposal of using $\tilde{X}$ as unlabelled samples can thus be seen as synthetically generating high density regions for each class as an inductive bias for the classifier. An example of over-sampling can be found in Figure 3, where over-sampled patterns are plotted in black.

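Our objective is thus to seek a classifier $f$ and a labelling $\tilde{y} = \{\tilde{y}_1, \ldots, \tilde{y}_m\}$ of the synthetic patterns by minimising:

$$\min_{f,\ \tilde{y}}\ C \sum_{i=1}^{n} V\big(y_i, f(x_i)\big) + C^* \sum_{j=1}^{m} V^*\big(\tilde{y}_j, f(\tilde{x}_j)\big), \qquad (9)$$

where $V$ and $V^*$ are loss functions and $C$ and $C^*$ are real-valued parameters which reflect confidence in labels and the cluster assumption respectively. The labels of synthetic data are treated as additional optimisation variables, as is common in SSL [Sindhwani, Keerthi, and Chapelle2006, Sindhwani and Keerthi2006]. An effective loss function over an unlabelled pattern $\tilde{x}$ is $\min\{V^*(+1, f(\tilde{x})),\ V^*(-1, f(\tilde{x}))\}$, which corresponds to making the optimal choice for the unknown label $\tilde{y}$ and promotes decision boundaries that pass through low-density regions.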
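To see why this effective loss pushes the decision boundary away from unlabelled points, it helps to evaluate it for a concrete loss. The sketch below assumes the hinge loss, which is the usual choice for SVMs but is an assumption here since the text only requires a generic loss function; the function names are illustrative.

```python
def hinge(label, output):
    """Hinge loss for a label in {-1, +1} and a real-valued classifier output."""
    return max(0.0, 1.0 - label * output)


def effective_unlabelled_loss(output):
    """Loss of an unlabelled pattern under the better of its two possible labels."""
    return min(hinge(+1, output), hinge(-1, output))


# The loss is largest on the decision boundary (output 0) and vanishes
# once the pattern is outside the margin on either side.
on_boundary = effective_unlabelled_loss(0.0)
inside_margin = effective_unlabelled_loss(-0.5)
outside_margin = effective_unlabelled_loss(2.0)
```

With the hinge loss this quantity equals max(0, 1 - |output|), so an unlabelled pattern is only penalised when it falls inside the margin, regardless of which side it is on.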

Choice of low density separator

The most common approach for constructing an SSL low density separator is to use a maximum margin approach (e.g. using Support Vector Machines, SVMs). However, the formulation in Eq. 9 results in a hard optimisation problem when unlabelled data is abundant. In the semi-supervised SVM classification setting (S3VM), this minimisation problem is solved over both the hyperplane parameters $(w, b)$ and the label vector $\tilde{y}$,

$$\min_{w,\ b,\ \tilde{y}}\ \frac{1}{2} \|w\|^2 + C \sum_{i=1}^{n} V(y_i, o_i) + C^* \sum_{j=1}^{m} V(\tilde{y}_j, \tilde{o}_j), \qquad (10)$$

where $o_i = w^\top x_i + b$, $\tilde{o}_j = w^\top \tilde{x}_j + b$ and $V$ is a loss function. This problem is solved under the class balancing constraint:

$$\frac{1}{m} \sum_{j=1}^{m} \max(\tilde{y}_j, 0) = r, \qquad (11)$$

where $r$ is a user-specified ratio of unlabelled data to be assigned to the positive class. Unlike SVMs, this S3VM formulation leads to a non-convex optimisation problem, which is solved either by combinatorial or continuous optimisation [Chapelle, Sindhwani, and Keerthi2008].

The method chosen in this paper is S3VM$^{light}$, which has shown promising performance and is robust to changes in the hyperparameters [Chapelle, Sindhwani, and Keerthi2008]. This technique is based on a local combinatorial search guided by a label switching procedure. The vector $\tilde{y}$ is initialised as the labelling given by an SVM trained only on the labelled set. This labelling is restricted to maintain the class ratios previously defined by $r$. Subsequent steps of the algorithm consist of switching the labels of two unlabelled patterns $\tilde{x}_i$ and $\tilde{x}_j$ with opposite labels (in order to maintain class proportions) that satisfy the following condition:

$$\tilde{y}_i = 1,\ \tilde{y}_j = -1, \qquad V(1, \tilde{o}_i) + V(-1, \tilde{o}_j) > V(-1, \tilde{o}_i) + V(1, \tilde{o}_j), \qquad (12)$$

i.e. the loss after switching these labels is lower ($\tilde{o}_i$ denotes the classifier output for $\tilde{x}_i$).
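The switching rule can be illustrated with a short sketch. It is deliberately simplified and should not be read as the actual algorithm: the classifier outputs are held fixed (the real procedure retrains the SVM between rounds of switches), the hinge loss is assumed, and the names `switch_labels`, `outputs` and `initial` are hypothetical.

```python
def hinge(label, output):
    """Hinge loss for a label in {-1, +1} and a real-valued classifier output."""
    return max(0.0, 1.0 - label * output)


def switch_labels(outputs, labels):
    """Swap (+1, -1) label pairs for as long as a swap lowers the total loss."""
    labels = list(labels)
    improved = True
    while improved:
        improved = False
        for i in range(len(labels)):
            for j in range(len(labels)):
                if labels[i] == 1 and labels[j] == -1:
                    before = hinge(1, outputs[i]) + hinge(-1, outputs[j])
                    after = hinge(-1, outputs[i]) + hinge(1, outputs[j])
                    if before > after:
                        # Swapping one positive with one negative label keeps
                        # the number of positives (the class ratio) unchanged.
                        labels[i], labels[j] = -1, 1
                        improved = True
    return labels


# Fixed classifier outputs for four unlabelled patterns and a poor initial labelling.
outputs = [-0.8, 0.9, 0.3, -0.2]
initial = [1, -1, 1, -1]
final = switch_labels(outputs, initial)
```

Each accepted swap strictly decreases the total loss, so the loop terminates; in this toy case only the first two patterns exchange labels.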

Concerning the computational complexity of our proposal, the main bottleneck is the SSL part, as the complexity of over-sampling is linear. The complexity of S3VM is of the same order as that of a standard SVM. However, it will be trained with more data (i.e. real plus synthetic).

Ensemble of synthetic hypotheses

Since the estimation of the resampling distribution is a stochastic process, we also consider the use of different resampling distributions in an ensemble framework. The application is straightforward: each member of the ensemble is formed by a resampling distribution and an S3VM model trained with the resulting synthetic patterns. Final labels are computed by majority voting.
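The voting step can be sketched as follows. This is an illustrative helper rather than the authors' code: it assumes labels in {-1, +1} and an odd number of ensemble members so that ties cannot occur, and the name `majority_vote` is hypothetical.

```python
def majority_vote(predictions_per_model):
    """Combine one list of +1/-1 predictions per model into one label per pattern."""
    combined = []
    for votes in zip(*predictions_per_model):
        # With an odd number of +1/-1 votes the sum is never zero.
        combined.append(1 if sum(votes) > 0 else -1)
    return combined


# Three hypothetical ensemble members predicting three test patterns.
member_predictions = [
    [1, -1, 1],
    [1, 1, -1],
    [-1, -1, -1],
]
ensemble_labels = majority_vote(member_predictions)
```

Each member would see a different random set of synthetic patterns, which is the source of diversity that the vote then averages out.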

Dataset              N     d     Dataset              N     d
haberman (HA)        306   3     hepatitis (HE)       155   19
listeria (LI)        539   4     bands (BA)           365   19
mammog. (MA)         830   5     heart-c (HC)         302   22
monk-2 (MO)          432   6     labor (LA)           57    29
appendicitis (AP)    106   7     pima (PI)            768   8
glassG2 (GL)         163   9     credit-a (CR)        690   43
saheart (SA)         462   9     specfth. (SP)        267   44
breast-w (BW)        699   9     card (CA)            690   51
heartY (HY)          270   13    sonar (SO)           156   60
breast (BR)          286   15    colic (CO)           368   60
housevot. (HO)       232   16    credit-g (CG)        1000  61
banana               5300  2     ionosphere           351   34
liver                583   10    wisconsin            569   32
All nominal variables are transformed into binary ones.
Table 1: Characteristics of the benchmark datasets (N: number of patterns, d: number of input dimensions).

3 Experimental results

In our experiments we try to answer the following questions:

  1. What are the largest contributing factors to the degradation in performance when dealing with small datasets?

  2. Does over-sampling prevent the need for collecting further data in small and imbalanced scenarios?

  3. How does our approach of using SSL and not labelling data compare to other approaches in the literature?

  4. In the context of classification, is class-dependent over-sampling better than class-independent over-sampling?

To answer the first question, we do a first experiment using synthetically generated datasets. To answer questions 2-4, we perform two additional experiments, in which we test a wide range of approaches with real-world benchmark datasets, changing the percentage of missing patterns to study the influence of the data sample size (second experiment) and imbalanced class distributions (third experiment).

All the methodologies have been tested considering the paradigm of Support Vector Machines (SVM) [Cortes and Vapnik1995]. The benchmark datasets are extracted from the UCI repository [Lichman2013] (characteristics shown in Table 1). These datasets are not originally imbalanced or extremely small. Instead, these characteristics are generated synthetically by removing a percentage of patterns at random, so that the performance can be compared against the one with the original full dataset.

Because of space restrictions, we only show mean test results and rankings, but all results can be accessed online (https://doi.org/10.17863/CAM.32312).

3.1 Methodologies tested

In order to address the difference between using real vs. synthetic data, we compare standard supervised SVMs (with no over-sampling or missing data) to different approaches with data Missing Completely At Random (MCAR). Note that this comparison is not strictly fair, but it provides a useful baseline performance to evaluate our over-sampling approaches. Thus, our objective is not to surpass the performance achieved with real data by using synthetic data, but rather to reach a similar performance. We also compare our proposed approach to: 1) previous over-sampling approaches that use naive labelling [Chawla et al.2002, Pérez-Ortiz et al.2016] and 2) transductive graph-based SSL, as another alternative for labelling synthetic data. Within our proposed methods we have different approaches: class-dependent and class-independent over-sampling (i.e. over-sampling classes separately or not) and an ensemble of S3VM models using unlabelled synthetically generated patterns. Note that the optimisation procedures of SVM and S3VM are different, which may influence the results (S3VM is said to be more prone to reaching local optima). Because of this, we include another approach as a baseline: an S3VM model that reintroduces the real data removed at random as unlabelled data. The main purpose here is to compare over-sampled vs. real data within the S3VM framework.

Figure 4: Mean test performance across all benchmark datasets for S-MCAR. In the left plot patterns are removed from both classes, whereas in the right plot patterns are removed only for the minority class.
Figure 5: Examples of synthetic datasets generated. For the plot on the right only the first three dimensions are shown.

3.2 Experimental setup

A stratified k-fold cross-validation technique has been used to divide all datasets. Each experiment is repeated 3 times in order to obtain robust results (except for deterministic methods). The results are reported as the mean and standard deviation of the selected measures. The same seed is used for random number generation, meaning that the same patterns are removed from the dataset and created by over-sampling. The cost parameter of SVM-based methods was selected from a set of candidate values by means of a nested cross-validation over the training set. The kernel parameter has been cross-validated over a set of candidate values for the SVM-based methods. For all the methods using large-scale semi-supervised SVMs [Sindhwani and Keerthi2006], the two regularisation parameters were optimised over a set of candidate values (also by means of a nested cross-validation). To ease comparisons, the number of synthetically generated patterns is set equal to the number of patterns removed initially from the dataset. Nearest neighbours were evaluated to generate synthetic samples. The Euclidean distance has been used for all the distance computations.

The parameter used for the over-sampling method in [Pérez-Ortiz et al.2016] to control the dimensionality of the feature space has been cross-validated within the values {0.25, 0.5, 0.75}. The kernel width parameter associated to transductive methods (to construct the graph) has been set to the same value of the SVM kernel used. The rest of parameters have been set to default values.

There are several minor modifications of these algorithms when using them for either small or imbalanced datasets. As stated before, in the case of imbalanced data, we introduce a new parameter for S3VM methods, which controls the ratio of patterns assigned to the minority class. This class balancing parameter has been fixed to the initial class distribution (in the first and second experiments, where the data is balanced) and cross-validated within the values {0.5, 0.7, 0.9} for the imbalanced datasets (where all the synthetically generated patterns are supposed to belong to the minority class, but where we need to allow a certain amount of errors to fix label inconsistencies). Moreover, for the case of graph-based algorithms, several issues have been noticed in imbalanced domains [Zheng and Skillicorn2016]. To prevent this, we also use a class mass normalisation procedure to adjust the class distribution so that it matches the priors [Zhu, Ghahramani, and Lafferty2003].

Figure 6: Box-plot of the mean test accuracy performance across different factors for the synthetic datasets (first experiment).

3.3 Evaluation metrics

The results have been reported in terms of two metrics:

  1. Accuracy ($Acc$). However, given that for imbalanced cases this metric is not the best option, we also use the mean of the sensitivities per class.

  2. The Geometric Mean of the sensitivities ($GM = \sqrt{S_p \cdot S_n}$) [Kubat and Matwin1997], where $S_p$ is the sensitivity for the positive class (ratio of correctly classified patterns considering only this class), and $S_n$ is the sensitivity for the negative one.

The measure used for parameter selection was $GM$, given its robustness [Kubat and Matwin1997].
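The difference between accuracy and the geometric mean of the sensitivities is easiest to see on an imbalanced toy example. The sketch below is illustrative only; the function names and the toy labels are assumptions, and labels are taken to be in {-1, +1} with +1 as the positive (minority) class.

```python
import math


def accuracy(y_true, y_pred):
    """Fraction of patterns whose predicted label matches the true label."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


def sensitivities(y_true, y_pred):
    """Return (positive-class sensitivity, negative-class sensitivity)."""
    n_pos = sum(1 for t in y_true if t == 1)
    n_neg = sum(1 for t in y_true if t == -1)
    hit_pos = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    hit_neg = sum(1 for t, p in zip(y_true, y_pred) if t == -1 and p == -1)
    return hit_pos / n_pos, hit_neg / n_neg


def geometric_mean(y_true, y_pred):
    """Geometric mean of the two per-class sensitivities."""
    s_pos, s_neg = sensitivities(y_true, y_pred)
    return math.sqrt(s_pos * s_neg)


# Two positives and eight negatives; a trivial classifier predicts all negative.
y_true = [1, 1, -1, -1, -1, -1, -1, -1, -1, -1]
all_negative = [-1] * 10
one_positive_found = [1, -1] + [-1] * 8
```

The trivial classifier reaches 80% accuracy while its geometric mean is zero, which is why the latter is the safer criterion when classes are imbalanced.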

3.4 Results

Firstly, we test the influence of the number of patterns removed at random. Figure 4 shows the mean degradation in test performance for S-MCAR when changing the number of patterns removed from the benchmark datasets. As can be seen, all metrics experience a relatively large degradation.

First experiment: Synthetically generated datasets

27 synthetic datasets generated with [Sánchez-Monedero et al.2013] are used. All of these datasets represent binary and perfectly balanced classification tasks, in which the data has been generated using a Normal distribution while changing different parameters: 1) the dimensionality of the input space (d, which is set to 2, 10 and 100 dimensions), 2) the number of patterns (N, set to 50, 100 and 1000) and 3) the variance of the data (V, controlling the overlapping between the classes and set to 0.167, 0.25 and 0.5). All combinations of these parameters have been explored. All the classes have been designed to be bi-modal. Figure 5 shows two examples of the synthetic datasets generated. We test three ratios of patterns removed at random (MCAR): 20%, 50% and 80%.

For this experiment, we use four approaches: SVM (with the original dataset), S-MCAR (MCAR, no over-sampling), S3VM with real unlabelled data (S3VM-Real, for which the data that we remove is included again as unlabelled in the model) and our proposal using class-dependent over-sampling (S3VM-OvS). Note that SVM and S3VM-Real are included only as reference points and the comparison is not strictly fair, since these classifiers have access to all the real data, which is not the case for S-MCAR and S3VM-OvS.
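The size of this experimental grid can be enumerated directly. The sketch below is just bookkeeping under the factor levels stated in the text, with the three MCAR ratios taken from the values listed in Table 2 (0.2, 0.5 and 0.8); the method labels are descriptive placeholders, not the names used in the paper.

```python
from itertools import product

dimensions = [2, 10, 100]        # d: dimensionality of the input space
sizes = [50, 100, 1000]          # N: number of patterns
variances = [0.167, 0.25, 0.5]   # V: controls the class overlap
mcar_ratios = [0.2, 0.5, 0.8]    # fraction of patterns removed at random
methods = ["full data", "MCAR only", "real unlabelled", "over-sampled"]

# One synthetic dataset per combination of d, N and V.
datasets = list(product(dimensions, sizes, variances))

# One individual result per dataset, MCAR ratio and method.
runs = list(product(datasets, mcar_ratios, methods))
```

This gives 27 datasets and 27 x 3 x 4 = 324 individual results for the first experiment.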

Table 2: Mean performance for the synthetic datasets considered (one row per combination of d ∈ {2, 10, 100}, N ∈ {50, 100, 1000}, V ∈ {0.167, 0.25, 0.5} and MCAR ratio ∈ {0.2, 0.5, 0.8}, i.e. 81 settings).

From this experiment, we obtained results for datasets with different characteristics, for three different MCAR levels and four methods (a total of 324 individual results). These results can be seen in Table 2. To analyse them properly, we summarise the results independently per factor in Figure 6 using box-plots, numbering the methods in the order in which they were introduced: (1) SVM, (2) S-MCAR, (3) S3VM-Real and (4) S3VM-OvS. Some conclusions can be drawn. Firstly, the overlapping of the classes (variance factor) is the main factor contributing to performance degradation. If the data does not overlap (small variance), a high performance can be achieved even if we remove data (compare method (1) to (2)). The same applies when data dimensionality is low, e.g. for d=2 and d=10 removing data is not problematic (again, compare method (1) to (2)). However, an important degradation is seen when d=100. The removal of data especially affects small datasets (N=50 and N=100), but not datasets with N=1000. Concerning the proposed approach (S3VM-OvS), its results are similar to those achieved using real unlabelled data (S3VM-Real), which is a positive outcome. Both results are also close to the performance using the complete dataset (compare approaches (3) and (4) to (1)), which means that over-sampled data can replace real data, even when the real data is labelled. In some cases, such as in high-dimensional datasets, the performance even surpasses the one obtained with the original data. The proposal not only helps with small datasets, but also with relatively large ones (N=1000), perhaps because in this scenario the amount of data helps simplify the over-sampling task by better exploiting the local information. Thus, we can conclude that the proposed methodology helps especially for high-dimensional datasets, independently of their size and class overlapping, and that its performance is stable with respect to the percentage of data removed (last factor).

Table 3: Mean ranking results for all the methods considered in the small sample size experiment (second experiment). Method groups: transductive graph-based approaches (real unlabelled data, class-dependent OvS, class-independent OvS) and S3VM approaches (proposed: real unlabelled data, class-dependent OvS, class-independent OvS). The table also reports Friedman's test (confidence interval and F-value) for each MCAR level.

Figure 7 shows the effect of changing the number of synthetically generated patterns for S3VM-OvS. It can be seen that similar results are obtained independently of the number of generated patterns. The percentage of data missing at random has, however, an important impact.

Figure 7: Mean performance for the dataset with parameters d=100, N=50 and V=0.5 using the S3VM-OvS method when changing the number of synthetically generated patterns (x0.5, x1, x2, …, x10).

Second experiment: Small sample size

For this experiment, we artificially reduce the size of the benchmark datasets (again testing a reduction of 20%, 50% and 80%). Because of the amount of results we only provide the test mean ranking (the lower the better) in Table 3. It can be seen that Friedman's test rejects the null hypothesis that all of the algorithms perform similarly in mean ranking for all cases. As mentioned before, here we also include two over-sampling approaches from the literature, SVM+OvS [Chawla et al.2002] and SVM+kOvS [Pérez-Ortiz et al.2016], and test transductive approaches to label synthetic data. Again, we compare several strategies: class-dependent and class-independent over-sampling, the introduction of real unlabelled data in the S3VM model for comparison purposes and an ensemble. Note that both SVM and the methods based on real unlabelled data are unrealistic and only used as a baseline. Several conclusions can be drawn. Comparing all over-sampling approaches and S-MCAR, it can be seen that a convex combination of patterns can be successfully used to generate synthetic data. The use of part of the real data as unlabelled also improves the result to a reasonable extent: it is better than standard data over-sampling and, if the amount of data is not extremely low, even better than using the original dataset, which may indicate that there is some noise in the labels. The combination of over-sampling and semi-supervised learning approaches is promising and can be applied within each class or using all data independently of their labels, reaching in most cases the baseline performance obtained with the entire dataset. Observing individual results, we noticed that for the smallest datasets it is better to use all patterns for over-sampling, while for bigger datasets the best approach is class-dependent over-sampling. In general, transductive graph-based approaches do not report acceptable results, maybe because they depend heavily on the design of a graph or because these techniques require a larger amount of data. Finally, the introduction of diversity in an ensemble by the use of a stochastic convex combination of patterns is very promising, improving in most cases the results achieved with the original complete dataset.

Table 4: Mean test ranking results for all the methods considered in the imbalanced experiment (third experiment). Method groups: transductive graph-based approaches (class-dependent OvS) and proposed S3VM approaches (class-dependent OvS, class-independent OvS). The table also reports Friedman's test (confidence interval and F-value) for each MCAR level.

To quantify whether a statistical difference exists among the algorithms studied, a procedure is employed to compare multiple classifiers over multiple datasets [Demsar2006]. Table 3 shows the result of applying the non-parametric Friedman's test to the mean rankings of both metrics. The test rejects the null hypothesis that all algorithms perform similarly in mean ranking for both metrics. The table shows that the best results for both metrics are obtained, as expected, by the SVM technique (which does not entail any loss of information). This method is followed by the semi-supervised SVM-MCAR+USP for one metric and by SVM-MCAR+LSPFS for the other. These results indicate that standard SMOTE (SVM-MCAR+LSP) creates label inconsistencies and that these can be avoided by over-sampling in the feature space or by a semi-supervised approach. Moreover, as the literature has suggested, graph-based approaches (SVM-MCAR+LSPH) are not ideal for imbalanced datasets, although they still represent a better option than standard SMOTE. These differences are accentuated when increasing the amount of data missing at random.

Third experiment: Imbalanced samples

We also study the effect of our proposal in imbalanced classification setups. To do so, we artificially induce imbalance in our data by removing a percentage of the minority-class patterns. In this case, we test a subset of the methods used in the previous experiment (results shown in Table 4). Again, we can see that SMOTE (SVM+OvS) can be improved, either by optimising the patterns to generate (SVM+kOvS) or the labels of the synthetic patterns (proposed approaches). It can also be seen that it is better to over-sample only the minority class (i.e. class-dependent over-sampling).
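The way imbalance is induced can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `induce_imbalance` and the parameter `removal_fraction` are hypothetical, the minority class is taken to be the least frequent label, and the patterns to remove are chosen uniformly at random.

```python
import random


def induce_imbalance(X, y, removal_fraction, seed=0):
    """Remove a fraction of the minority-class patterns (hypothetical helper)."""
    rng = random.Random(seed)
    counts = {}
    for label in y:
        counts[label] = counts.get(label, 0) + 1
    minority = min(counts, key=counts.get)  # least frequent label
    minority_idx = [i for i, label in enumerate(y) if label == minority]
    n_remove = int(removal_fraction * len(minority_idx))
    removed = set(rng.sample(minority_idx, n_remove))
    keep = [i for i in range(len(y)) if i not in removed]
    return [X[i] for i in keep], [y[i] for i in keep]
```

The majority class is left untouched, so the imbalance ratio grows with `removal_fraction`.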

4 Conclusions

We explored the idea of introducing synthetic data as unsupervised information in semi-supervised support vector machines, where the labels of synthetic data are treated as additional optimisation variables. Our experimental study has shown that: 1) synthetic patterns help when data is scarce with respect to the data dimensionality and can be used in a variety of cases as an alternative to collecting more data; 2) convex combinations of input training data can be used to generate those synthetic samples, which do not necessarily have to be labelled; and 3) the introduction of synthetic data as unsupervised knowledge can help to improve classification in small, high-dimensional or imbalanced scenarios by acting as an inductive bias for the classifier.

Future work comprises testing this approach in a regression setting and with other semi-supervised learning approaches (e.g. the use of synthetic imaging data with autoencoders or deep belief networks).

5 Acknowledgments

This work was supported by the project TIN2017-85887-C2-1-P of the Spanish Ministry of Economy and Competitiveness (MINECO) and FEDER funds (EU). PT was supported by the European Commission Horizon 2020 Innovative Training Network SUNDIAL (SUrvey Network for Deep Imaging Analysis and Learning), Project ID: 721463.


References

  • [Ben-David, Lu, and Pál2008] Ben-David, S.; Lu, T.; and Pál, D. 2008. Does unlabeled data provably help? Worst-case analysis of the sample complexity of semi-supervised learning. In 21st Annual Conference on Learning Theory.
  • [Chapelle, Schölkopf, and Zien2010] Chapelle, O.; Schölkopf, B.; and Zien, A. 2010. Semi-Supervised Learning. The MIT Press, 1st edition.
  • [Chapelle, Sindhwani, and Keerthi2008] Chapelle, O.; Sindhwani, V.; and Keerthi, S. S. 2008. Optimization techniques for semi-supervised support vector machines. Journal of Machine Learning Research 9:203–233.
  • [Chawla et al.2002] Chawla, N. V.; Bowyer, K. W.; Hall, L. O.; and Kegelmeyer, W. P. 2002. SMOTE: Synthetic minority over-sampling technique. Journal of Artificial Intelligence Research.
  • [Cireşan et al.2010] Cireşan, D. C.; Meier, U.; Gambardella, L. M.; and Schmidhuber, J. 2010. Deep, big, simple neural nets for handwritten digit recognition. Neural Comput. 22(12):3207–3220.
  • [Cortes and Vapnik1995] Cortes, C., and Vapnik, V. 1995. Support-vector networks. Machine Learning 20(3):273–297.
  • [Cozman, Cohen, and Cirelo2003] Cozman, F. G.; Cohen, I.; and Cirelo, M. C. 2003. Semi-supervised learning of mixture models. In Proceedings of the Twentieth International Conference on Machine Learning, 99–106.
  • [Demsar2006] Demsar, J. 2006. Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research 7:1–30.
  • [Forman and Cohen2004] Forman, G., and Cohen, I. 2004. Learning from Little: Comparison of Classifiers Given Little Training. Berlin, Heidelberg: Springer Berlin Heidelberg. 161–172.
  • [Galar et al.2012] Galar, M.; Fernández, A.; Barrenechea, E.; Bustince, H.; and Herrera, F. 2012. A review on ensembles for the class imbalance problem: Bagging-, boosting-, and hybrid-based approaches. IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews 42(4):463–484.
  • [Huang et al.2006] Huang, J.; Smola, A. J.; Gretton, A.; Borgwardt, K. M.; and Schölkopf, B. 2006. Correcting sample selection bias by unlabeled data. In Proceedings of the 19th International Conference on Neural Information Processing Systems, 601–608. Cambridge, MA, USA: MIT Press.
  • [Krizhevsky, Sutskever, and Hinton2012] Krizhevsky, A.; Sutskever, I.; and Hinton, G. E. 2012. Imagenet classification with deep convolutional neural networks. In Pereira, F.; Burges, C. J. C.; Bottou, L.; and Weinberger, K. Q., eds., Advances in Neural Information Processing Systems 25. Curran Associates, Inc. 1097–1105.
  • [Kubat and Matwin1997] Kubat, M., and Matwin, S. 1997. Addressing the curse of imbalanced training sets: One-sided selection. In Proceedings of the International Conference on Machine Learning, 179–186.
  • [Li and Wen2014] Li, D.-C., and Wen, I.-H. 2014. A genetic algorithm-based virtual sample generation technique to improve small data set learning. Neurocomputing 143:222–230.
  • [Lichman2013] Lichman, M. 2013. UCI machine learning repository.
  • [Niyogi, Girosi, and Poggio2002] Niyogi, P.; Girosi, F.; and Poggio, T. 2002. Incorporating prior information in machine learning by creating virtual examples. Proceedings of the IEEE 86(11):2196–2209.
  • [Pérez-Ortiz et al.2016] Pérez-Ortiz, M.; Gutiérrez, P. A.; Tino, P.; and Hervás-Martínez, C. 2016. Oversampling the minority class in the feature space. IEEE Transactions on Neural Networks and Learning Systems 27(9):1947–1961.
  • [Sánchez-Monedero et al.2013] Sánchez-Monedero, J.; Gutiérrez, P. A.; Pérez-Ortiz, M.; and Hervás-Martínez, C. 2013. An n-spheres based synthetic data generator for supervised classification. In Advances in Computational Intelligence, 613–621. Berlin, Heidelberg: Springer Berlin Heidelberg.
  • [Shahshahani and Landgrebe1994] Shahshahani, B. M., and Landgrebe, D. A. 1994. The effect of unlabeled samples in reducing the small sample size problem and mitigating the hughes phenomenon. IEEE Transactions on Geoscience and Remote Sensing 32(5):1087–1095.
  • [Simard, Steinkraus, and Platt2003] Simard, P. Y.; Steinkraus, D.; and Platt, J. C. 2003. Best practices for convolutional neural networks applied to visual document analysis. In Proceedings of the Seventh International Conference on Document Analysis and Recognition - Volume 2, ICDAR ’03, 958–. Washington, DC, USA: IEEE Computer Society.
  • [Sindhwani and Keerthi2006] Sindhwani, V., and Keerthi, S. S. 2006. Large scale semi-supervised linear svms. In Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval, 477–484. ACM.
  • [Sindhwani, Keerthi, and Chapelle2006] Sindhwani, V.; Keerthi, S. S.; and Chapelle, O. 2006. Deterministic annealing for semi-supervised kernel machines. In Proceedings of the 23rd international conference on Machine learning, 841–848. ACM.
  • [Singh, Nowak, and Zhu2008] Singh, A.; Nowak, R. D.; and Zhu, X. 2008. Unlabeled data: Now it helps, now it doesn’t. In Koller, D.; Schuurmans, D.; Bengio, Y.; and Bottou, L., eds., NIPS, 1513–1520. Curran Associates, Inc.
  • [Szegedy et al.2015] Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; and Rabinovich, A. 2015. Going deeper with convolutions. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 1–9.
  • [Wong et al.2016] Wong, S. C.; Gatt, A.; Stamatescu, V.; and McDonnell, M. D. 2016. Understanding data augmentation for classification: when to warp? CoRR abs/1609.08764.
  • [Yang et al.2011] Yang, J.; Yu, X.; Xie, Z.-Q.; and Zhang, J.-P. 2011. A novel virtual sample generation method based on Gaussian distribution. Knowledge-Based Systems 24(6):740–748.
  • [Zhang et al.2017] Zhang, H.; Cissé, M.; Dauphin, Y. N.; and Lopez-Paz, D. 2017. mixup: Beyond empirical risk minimization. CoRR abs/1710.09412.
  • [Zheng and Skillicorn2016] Zheng, Q., and Skillicorn, D. 2016. Spectral graph-based semi-supervised learning for imbalanced classes. 2016 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM) 960–967.
  • [Zhu, Ghahramani, and Lafferty2003] Zhu, X.; Ghahramani, Z.; and Lafferty, J. 2003. Semi-supervised learning using gaussian fields and harmonic functions. In ICML, 912–919.