One of the technological challenges facing data analytics is the enormous amount of data. This challenge is well known, and the term “big data” was recently coined to bring attention to it and to encourage the development of new solutions. However, in many important application areas an excess of data is not the problem; quite the opposite, there just isn’t enough data available. There are several reasons for this: the data may be inherently scarce (rare diseases, faults in complex systems, rare grammatical structures…), difficult to obtain (due to proprietary systems, confidentiality of business contracts, privacy of records…), expensive (obtainable only with expensive equipment, requiring significant investment of human or material resources…), or the distribution of the events of interest may be highly imbalanced (fraud detection, outlier detection, distributions with long tails…). For machine learning approaches the lack of data causes problems in model selection, reliable performance estimation, development of specialized algorithms, and tuning of learning model parameters. While certain problems caused by scarce data are inherent to underrepresentation of the problem and cannot be solved, some aspects can be alleviated by generating artificial data similar to the original. For example, similar artificial data sets can be of great help in tuning parameters, developing specialized solutions, running simulations, and handling imbalanced problems, as they prevent overfitting to the original data set yet allow sound comparison of different approaches.
Generating new data similar to a general data set is not an easy task. If no background knowledge about the problem is available, we have to use the precious scarce data we possess to extract some of its properties and generate new semi-artificial data with similar properties. Whether this is acceptable in the context of the problem is not a matter of the proposed approach; we assume that we can afford to set aside at least a small part of the data for this purpose. This data need not be lost for modeling, but we should be aware of the extracted properties when considering the possibility of overfitting.
The approaches used in existing data generators are limited to low dimensional data (up to 6 variables) or assume a certain probability distribution, mostly normal; we review them in Section 2. Our approach is limited to classification problems. We first construct an RBF network prediction model. RBF networks consist of Gaussian kernels which estimate probability density from training instances. Due to the properties of Gaussian kernels (discussed in Section 3), the learned kernels can be used in a generative mode to produce new data. In this way we overcome the limitation to low dimensional spaces. We show that our approach can be successfully used for data sets with several hundred attributes and also with mixed data (numerical and categorical).
The paper is organized as follows. In Section 2 we review existing work on generating semi-artificial data. In Section 3 we present RBF neural networks and the properties which allow us to generate data based on them. In Section 4 we present the actual implementation based on the RSNNS package and explain the details of handling nominal and numeric data. In Section 5 we discuss evaluation of the generated data and its similarity to the original data. We propose evaluation based on statistical properties of the data, as well as on similarity between original and generated data estimated with supervised and unsupervised learning methods. In Section 6 we present the quality of the generated data and try to determine working conditions for the proposed method as well as a suitable set of parameters. We briefly present an application of the generator for benchmarking a cloud-based big data analytics tool. In Section 7 we conclude with a summary, critical analysis, and ideas for further work.
2 Related work
The area of data generators is full of interesting approaches. We cover only general approaches to data generation and do not cover methods specific for a certain problem or a class of problems.
The largest group of data generators is based on an assumption about the probability distribution the generated data shall be drawn from. Most scientific computational engines and tools contain random number generators for univariate data drawn from standard distributions. For example, the R system [R Core Team, 2013] supports uniform, normal, log-normal, Student’s t, F, Chi-squared, Poisson, exponential, beta, binomial, Cauchy, gamma, geometric, hypergeometric, multinomial, negative binomial, and Weibull distributions. Additional, less-known univariate distribution-based random number generators are accessible through add-on packages. If we need univariate data from these distributions, we fit the parameters of the distributions and then use the obtained parameters to generate new data. For example, the R package MASS [Venables and Ripley, 2002] provides the function fitdistr to obtain the parameters of several univariate distributions.
Random vector generators based on multivariate probability distributions are far less common. Effective random number generators exist for multivariate t and normal distributions with up to 6 variables. Simulating data from a multivariate normal distribution is possible via a matrix decomposition of a given symmetric positive definite matrix $\Sigma$ containing variable covariances. Using the decomposed matrix and a sequence of univariate normally distributed random variables, one can generate data from a multivariate normal distribution as discussed in Sect. 4. The approach proposed in this paper relies on the multivariate normal distribution data generator but does not assume that the whole data set is normally distributed. Instead, it finds subspaces which can be successfully approximated with Gaussian kernels and uses the extracted distribution parameters to generate new data in proportion to the requirements.
To generate data from a nonnormal multivariate distribution, several transformational approaches have been proposed which start by generating data from a multivariate normal distribution and then transform it to the desired final distribution. For example, [Ruscio and Kaczetow, 2008] propose an iterative approximation scheme. In each iteration the approach generates multivariate normal data that is subsequently replaced with nonnormal data sampled from the specified target population. After each iteration, discrepancies between the generated and desired correlation matrices are used to update the intermediate correlation matrix. A similar approach for ordinal data is proposed by [Ferrari and Barbiero, 2012]. The transformational approaches are limited to low dimensional spaces where the covariance matrix capturing data dependencies can be successfully estimated. In contrast, our method is not limited to a specific data type. The problem space is split into subspaces where dependencies are more clearly expressed and subsequently captured.
Kernel density estimation is a method to estimate the probability density function of a random variable with a kernel function. The inferences about the population are made based on a finite data sample. Several approaches for kernel-based parameter estimation exist. The most frequently used kernels are Gaussian kernels. These methods are intended for low dimensional spaces with up to 6 variables [Härdle and Müller, 2000].
An interesting approach to data simulation is the use of copulas [Nelsen, 1999]. A copula is a multivariate probability distribution for which the marginal probability distribution of each variable is uniform. Copulas are estimated from empirical observations and describe the dependence between random variables. They are based on Sklar’s theorem, which states that any multivariate joint distribution can be written in terms of univariate marginal distribution functions and a copula which describes the dependence structure between the variables. To generate new data one has to first select the correct copula family, estimate the parameters of the copula, and then generate the data. The process is not trivial and requires in-depth knowledge of the data being modeled. In principle the number of variables used in a copula is not limited, but in practice a careful selection of appropriate attributes and copula family is required [Bandara and Jayasumana, 2011, Mair et al., 2012]. Copulas for both numeric and categorical data exist, but not for mixed types, whereas our approach is not limited in this sense.
3 RBF networks
RBF (Radial Basis Function) networks have been proposed as a function approximation tool using locally tuned processing units, mostly Gaussian kernels [Moody and Darken, 1989, Zell et al., 1995], and their development still continues [Han and Qiao, 2012, Xie et al., 2012]. The network consists of three layers; see Figure 1 for an example. The input layer has $a$ input units, corresponding to input features. The hidden layer contains kernel functions. The output layer consists of a single unit in the case of regression, or of as many units as there are output classes in the case of classification. We assume a classification problem described with $n$ pairs of $a$-dimensional training instances $(\mathbf{x}_j, y_j)$, where $\mathbf{x}_j \in \mathbb{R}^a$ and $y_j$ is one of the class labels $1, 2, \dots, C$. Hidden unit computations in the RBF network estimate the probability of each class $u$:
$$p(u \mid \mathbf{x}) = \sum_{i=1}^{k} c_i h_i(\mathbf{x})$$
The weights $c_i$ are multiplied by radial basis functions $h_i$, which are usually Gaussian kernels:
$$h_i(\mathbf{x}) = \exp\left(-\frac{\lVert \mathbf{x} - \boldsymbol{\mu}_i \rVert^2}{2\sigma_i^2}\right)$$
Vectors $\boldsymbol{\mu}_i$ represent the $k$ centers and $\sigma_i$ are the widths of the kernels. The centers and kernel widths have to be learned or set in advance. The kernel function is applied to the Euclidean distance between each center $\boldsymbol{\mu}_i$ and the given instance $\mathbf{x}$. Kernel functions have a maximum at zero distance from the center, while the activation is close to zero for instances which are further away from the center.
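To make the computation concrete, the following is a minimal Python sketch of a weighted sum of Gaussian kernels followed by a winner-takes-all decision. It assumes the common Gaussian form exp(-||x - center||^2 / (2 width^2)); the function names and the tuple layout of `kernels` are our own illustrative choices and do not correspond to any RSNNS interface.

```python
import math

def gaussian_kernel(x, center, width):
    """Gaussian activation exp(-||x - center||^2 / (2 * width^2)) (assumed form)."""
    sq_dist = sum((xi - ci) ** 2 for xi, ci in zip(x, center))
    return math.exp(-sq_dist / (2.0 * width ** 2))

def class_scores(x, kernels):
    """Sum weighted kernel activations per class.

    kernels: list of (center, width, weight, class_label) tuples
    (a hypothetical layout chosen for this illustration).
    """
    scores = {}
    for center, width, weight, label in kernels:
        scores[label] = scores.get(label, 0.0) + weight * gaussian_kernel(x, center, width)
    return scores

def predict(x, kernels):
    """Winner-takes-all: return the class with the highest summed activation."""
    scores = class_scores(x, kernels)
    return max(scores, key=scores.get)

# Two toy kernels, one per class.
kernels = [
    ((0.2, 0.2), 0.1, 1.0, "a"),
    ((0.8, 0.8), 0.1, 1.0, "b"),
]
print(predict((0.25, 0.2), kernels))
```

The activation equals 1 exactly at a center and decays quickly with distance, which is why an instance is typically dominated by the nearest kernel.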
Most algorithms used to train RBF networks require a fixed architecture in which the number of units in the hidden layer must be determined before the training starts. To avoid manual setting of this parameter and to automatically learn kernel centers, weights, and standard deviations, several solutions have been proposed [Reilly et al., 1982, Berthold and Diamond, 1995], among them RBF with Dynamic Decay Adjustment (DDA) [Berthold and Diamond, 1995], which we use in this work. The RBF DDA builds a network by incrementally adding an appropriate number of RBF units. Each unit encodes instances of only one class. During the process of adding new units the kernel widths are dynamically adjusted (decayed) based on information about neighbors. RBFs trained with the DDA algorithm often achieve classification accuracy comparable to Multi Layer Perceptrons (MLPs), but training is significantly faster [Berthold and Diamond, 1995, Zell et al., 1995].
An example of RBF-DDA network for classification problem with 4 features and a binary class is presented in Fig. 1. The hidden layer of RBF-DDA network contains Gaussian units, which are added to this layer during training. The input layer is fully connected to the hidden layer. The output layer consists of one unit for each possible class. Each hidden unit encodes instances of one class and is therefore connected to exactly one output unit. For classification of a new instance a winner-takes-all approach is used, i.e. the output unit with the highest activation determines the class value.
Our data generator uses the function rbfDDA implemented in the R package RSNNS [Bergmeir and Benítez, 2012], which is an R port of the SNNS software [Zell et al., 1995]. The implementation uses two parameters: a positive threshold $\theta^+$ and a negative threshold $\theta^-$, as illustrated in Fig. 2. The two thresholds define an upper and lower bound for the activation of training instances. Default values of the thresholds are $\theta^+ = 0.4$ and $\theta^- = 0.2$. The thresholds define a safety area where no other center of a conflicting class is allowed. In this way a good separability of classes is achieved. In addition, each training instance has to be in the inner circle of at least one center of the correct class.
4 Data generator
The idea of the proposed data generation scheme is to extract local Gaussian kernels from the learned RBF-DDA network and generate data from each of them in proportion to the desired class value distribution. When class distribution different from the empirically observed is desired, the distribution has to be specified as an input parameter.
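A notable property of Gaussian kernels is that they can be used not only as discriminative models but also as generative models. To generate data from a multivariate normal distribution $N(\boldsymbol{\mu}, \Sigma)$, one can exploit the following property of the multivariate Gaussian distribution:
$$\text{if } X \sim N(\boldsymbol{\mu}, \Sigma), \text{ then } Y = AX + \mathbf{b} \sim N(A\boldsymbol{\mu} + \mathbf{b}, A \Sigma A^T). \tag{1}$$
When we want to simulate a multidimensional $Z \sim N(\boldsymbol{\mu}, \Sigma)$ for a given symmetric positive definite matrix $\Sigma$, we first construct a sample $X \sim N(\mathbf{0}, I)$ of the same dimensionality. $X$ can easily be constructed using independent variables $X_i \sim N(0, 1)$. Next we decompose $\Sigma = AA^T$ (using Cholesky or eigenvalue decomposition). With the obtained matrix $A$ and $X$ we use Eq. (1) to get
$$Z = AX + \boldsymbol{\mu} \sim N(A\mathbf{0} + \boldsymbol{\mu}, A I A^T) = N(\boldsymbol{\mu}, \Sigma).$$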
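The decomposition-based sampling can be sketched in a few lines of Python. This is only an illustration under our own naming (`cholesky`, `sample_mvn`), using the Cholesky variant and the standard library; it is not the mvrnorm implementation, and the mean vector and covariance matrix below are arbitrary example values.

```python
import math
import random

def cholesky(sigma):
    """Return lower-triangular A with A * A^T == sigma (sigma must be symmetric positive definite)."""
    n = len(sigma)
    a = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(a[i][k] * a[j][k] for k in range(j))
            if i == j:
                a[i][j] = math.sqrt(sigma[i][i] - s)
            else:
                a[i][j] = (sigma[i][j] - s) / a[j][j]
    return a

def sample_mvn(mu, a, rng):
    """Draw one vector Z = A X + mu, where X has independent N(0, 1) components."""
    x = [rng.gauss(0.0, 1.0) for _ in mu]
    return [m + sum(a[i][k] * x[k] for k in range(i + 1)) for i, m in enumerate(mu)]

mu = [1.0, -2.0]                      # example mean vector
sigma = [[4.0, 2.0], [2.0, 3.0]]      # example covariance matrix
a = cholesky(sigma)
rng = random.Random(0)                # fixed seed for reproducibility
draws = [sample_mvn(mu, a, rng) for _ in range(20000)]
```

With enough draws, the sample mean and sample covariance of `draws` approach `mu` and `sigma`, which is a quick sanity check of the construction.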
4.1 Construction of generator
The pseudo code of the proposed generator is given in Figure 3. The input to the generator is the available data set and two parameters. The first parameter controls the minimal acceptable kernel weight. The weight of a kernel is defined as the number of training instances which achieve maximal activation with that kernel. All the learned kernels with weight below this minimum are discarded by the data generator to prevent overfitting of the training data. The second, boolean parameter controls the treatment of nominal attributes as described in Sect. 4.2.
Due to the specific demands of the RBF-DDA algorithm the data has to be preprocessed first (line 2 in Fig. 3). This preprocessing includes normalization of attributes to $[0, 1]$ and preparation of nominal attributes (see Sect. 4.2). Function rbfPrepareData returns the normalized data and the normalization parameters, which are used later when generating new instances. The learning algorithm takes the preprocessed data and returns the classification model in the form of Gaussian kernels (line 3). We store the learned parameters of the Gaussian kernels, namely their centers, weights, and class values (lines 4, 5, and 6). The kernel weight equals the proportion of training instances which are activated by the corresponding Gaussian unit. The class value of the unit corresponds to the output unit connected to the Gaussian unit (see Fig. 1 for an illustration). Theoretically, this extracted information would be sufficient to generate new data; however, there are several practical considerations which have to be taken into account if one is to generate new data comparable to the original.
The task of RBF-DDA is to discriminate between instances with different class values, therefore the widths of the kernels are set during the learning phase in such a way that the majority of instances are activated by exactly one kernel. The widths of the learned kernels therefore prevent overlapping of competing classes. For the purpose of generating new data the width of the kernel shall be different (not so narrow), or we would only generate instances in the near proximity of kernel centers, i.e., of existing training instances. The approach we adopted is to take the training instances that activate the particular kernel (lines 7 and 8) and estimate their empirical variance in each dimension (lines 9, 10, and 11), which is later, in the generation phase, used as the width of the Gaussian kernel. The covariance matrix extracted from the network is diagonal, with elements representing the spread of training instances in each dimension. The algorithm returns the data generator consisting of the list of kernel parameters and the normalization parameters (line 12).
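The width re-estimation step can be illustrated with a short Python sketch. We assume here that an instance "activates" the kernel for which its activation is maximal, and we return `None` for kernels with fewer than two instances; both choices, as well as all names, are our own simplifications and not taken from the implementation.

```python
import math
from statistics import variance

def activation(x, center, width):
    """Gaussian activation of instance x for a kernel (assumed form)."""
    sq_dist = sum((xi - ci) ** 2 for xi, ci in zip(x, center))
    return math.exp(-sq_dist / (2.0 * width ** 2))

def kernel_spreads(instances, kernels):
    """Estimate per-dimension empirical variance of the instances assigned to each kernel.

    kernels: list of (center, width) pairs (hypothetical layout).
    Each instance is assigned to the kernel with maximal activation (our assumption).
    """
    members = [[] for _ in kernels]
    for x in instances:
        best = max(range(len(kernels)),
                   key=lambda k: activation(x, kernels[k][0], kernels[k][1]))
        members[best].append(x)
    spreads = []
    for group in members:
        if len(group) < 2:
            spreads.append(None)  # too few instances to estimate a variance
        else:
            spreads.append([variance(column) for column in zip(*group)])
    return spreads

instances = [(0.1, 0.1), (0.3, 0.1), (0.9, 0.9), (0.7, 0.9)]
kernels = [((0.2, 0.1), 0.2), ((0.8, 0.9), 0.2)]
spreads = kernel_spreads(instances, kernels)
```

A spread of zero in some dimension, as in the second attribute of this toy example, is exactly the case where a fallback width is needed during generation.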
4.2 Preprocessing the data
Function rbfPrepareData does three tasks: it imputes missing values, prepares nominal attributes, and normalizes the data. The pseudo code of data preprocessing is in Fig. 4.
The rbfDDA function in R does not accept missing values, so we have to impute them (line 3). While several advanced imputation strategies exist, classification accuracy is not of the utmost importance in our case, so we resorted to median-based imputation for numeric attributes, while for nominal attributes we use the most frequent category.
Gaussian kernels are defined only for numeric attributes, so rbfDDA treats all the attributes, including nominal ones, as numeric. Each nominal attribute is converted to numeric (lines 4-8). We can simply assign each category a unique integer from 1 to the number of categories (line 8). This may be problematic, as the transformation establishes an order of categorical values in the converted attribute which does not exist in the original attribute. For example, if the three categories of a nominal attribute are converted into the values 1, 2, and 3, respectively, the first category is now closer to the second than to the third. To solve this problem we use a binary parameter (line 5) and encode nominal attributes with several binary attributes when this parameter is set to true (line 6). Nominal attributes with more than two categories are encoded with a number of binary attributes equal to the number of categories. Each category is encoded by one binary attribute. If the value of the nominal attribute equals the given category, the value of the corresponding binary attribute is set to 1, while the values of the other encoding binary attributes equal 0. E.g., an attribute with three categories would be encoded with three binary attributes. If the value of the attribute is the second category, then the binary encoding of this value is $(0, 1, 0)$. The same binary encoding is also required for class values (line 11).
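The two encodings of nominal attributes can be sketched as follows. The category names and function names are hypothetical and serve only to illustrate the difference between the integer and the binary encoding.

```python
def encode_as_integer(values, categories):
    """Map each category to a unique integer from 1 to the number of categories."""
    codes = {category: i + 1 for i, category in enumerate(categories)}
    return [codes[v] for v in values]

def encode_as_binary(values, categories):
    """Encode each value with one binary attribute per category (exactly one is set to 1)."""
    return [[1 if v == category else 0 for category in categories] for v in values]

categories = ["circle", "square", "triangle"]   # hypothetical nominal attribute
values = ["square", "circle", "triangle"]

print(encode_as_integer(values, categories))
print(encode_as_binary(values, categories))
```

In the integer encoding, "circle" (1) ends up numerically closer to "square" (2) than to "triangle" (3), an ordering the original attribute does not have; in the binary encoding all pairs of categories are equally distant.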
The rbfDDA function in R expects data to be normalized to $[0, 1]$ (line 9). As we want to generate new data in the original, unnormalized form, we have to store the computed normalization parameters (line 10) and, together with the attribute encoding information, pass them back to the calling rbfGen function.
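A minimal sketch of the normalization round trip is given below. It assumes min-max scaling with a stored minimum and span per attribute; the function names are ours, and the handling of a constant attribute (span 0) is an assumption made for the illustration.

```python
def fit_min_max(column):
    """Return the normalization parameters of one numeric attribute: (minimum, span)."""
    lo, hi = min(column), max(column)
    return lo, hi - lo

def normalize(column, lo, span):
    """Scale values to [0, 1]; a constant attribute (span 0) maps to 0.0 (our assumption)."""
    return [(v - lo) / span if span > 0 else 0.0 for v in column]

def denormalize(column, lo, span):
    """Transform normalized values back to the original scale."""
    return [v * span + lo for v in column]

column = [2.0, 4.0, 6.0]
lo, span = fit_min_max(column)
scaled = normalize(column, lo, span)
restored = denormalize(scaled, lo, span)
```

Storing `lo` and `span` alongside the kernels is what allows newly generated instances to be reported on the original scale.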
4.3 Generating new data
Once we have a generator (produced by function rbfGen), we can use it to generate new instances. By default the method generates instances with class values in proportion to the number of class values in the training set of the generator, but the user can specify the desired class distribution as a parameter.
A data generator consists of a list of parameters describing the Gaussian kernels and information on attribute transformations. Recall that the information for each kernel contains the location of the kernel’s center, the weight of the kernel, its class value, and the estimated standard deviation. The input to the newdata function also includes parameters specifying the number of instances to be generated, the desired distribution of class values, the way the width of the kernels is determined, and a default width used for a kernel if its estimated width is 0.
The function starts by creating an empty data set (line 2) and then generates instances with each of the kernels stored in the kernel list (lines 2-11). The weight of the kernel $w_k$, the desired probability of its class $y_k$, and the overall number of instances to be generated determine the number of instances to be generated with each kernel (line 4). The weight of the kernel is normalized with the weights of the same class kernels, $w_k / \sum_{i} w_i I(y_i = y_k)$, where $I$ represents an indicator function. The width of the kernel determines the spread of the generated values around the center. By default we use the spread as estimated from the training data (line 5). Zeros in individual dimensions are optionally replaced by the value of the default width parameter. For the kernel width it is also possible to use the generalization of Silverman’s rule of thumb for the multivariate case (line 6) [Härdle and Müller, 2000]. In this case the covariance matrix used is diagonal, and the kernel width $h_j$ in dimension $j$ is set to
$$h_j = \left(\frac{4}{a+2}\right)^{1/(a+4)} n^{-1/(a+4)} \sigma_j$$
where $a$ is the number of dimensions, $n$ is the sample size (in our case the number of training instances that activate the particular kernel), and $\sigma_j$ is the estimated spread in that dimension.
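The allocation of instances to kernels and the rule-of-thumb width can be sketched as below. The sketch assumes that the per-kernel count is the class-normalized kernel weight times the desired class probability times the total size, rounded to an integer, and that the multivariate Silverman rule has the usual form (4/(d+2))^(1/(d+4)) n^(-1/(d+4)) times the spread; the names and the rounding are our own choices.

```python
def instances_per_kernel(kernels, class_probs, size):
    """Number of instances to generate with each kernel.

    kernels: list of (weight, class_label) pairs (hypothetical layout).
    class_probs: desired probability of each class.
    Rounding to the nearest integer is our simplification.
    """
    class_weight = {}
    for weight, label in kernels:
        class_weight[label] = class_weight.get(label, 0.0) + weight
    return [round(weight / class_weight[label] * class_probs[label] * size)
            for weight, label in kernels]

def silverman_width(spread, n, dims):
    """Multivariate generalization of Silverman's rule of thumb for one dimension."""
    return (4.0 / (dims + 2)) ** (1.0 / (dims + 4)) * n ** (-1.0 / (dims + 4)) * spread

kernels = [(3.0, "a"), (1.0, "a"), (2.0, "b")]
counts = instances_per_kernel(kernels, {"a": 0.5, "b": 0.5}, 200)
print(counts)
```

Note that the rule-of-thumb width shrinks as the number of instances behind a kernel grows, so well-supported kernels generate data closer to their estimated spread-scaled neighborhoods.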
The data is generated by the mvrnorm function (line 7). The function takes as input the number of instances to generate, the center of the kernel, and the diagonal covariance matrix. The function exploits the property of Gaussian kernels from Eq. (1) and decomposes the covariance matrix with eigenvalue decomposition. The generated data has to be checked for consistency (line 8), i.e., generated attribute values have to be in the $[0, 1]$ interval, nominal attributes have to be rounded to values encoding existing categories, etc. As some instances are rejected during this process, in practice we generate more instances than needed with mvrnorm but retain only the desired number of them. We assign the class value to the generated instances (line 9) and append them to the generated data set (line 10).
When the data has been generated with all kernels, we have to transform the generated instances to the original scale and encodings. For each nominal attribute we check its encoding (either as a set of binary attributes or as an integer) and transform it back to the original form (lines 11-18). Numeric attributes are denormalized and transformed back to the original scale using the minimums and spans stored with the generator (line 18). The function returns the generated data set (line 19).
4.4 Visual inspection of generated data
As a demonstration of the generator we graphically present the generated data on two simple data sets. The first data set forms a two dimensional grid where the two attributes are generated with three Gaussian kernels, each with its own center. Each group of 500 instances is assigned a unique class value (red, blue, and green, respectively) as illustrated in Fig. 6a. The generator based on this data consists of eight Gaussian kernels (two each for the red and blue classes, and four for the green class). We illustrate 1500 instances generated with this generator in Fig. 6b. As the rbfDDA learner did not find the exact locations of the original centers, it approximated the data with several kernels, so there is some difference between the original and generated data, but individual statistics are close, as shown in Table 1.
Another simple example is the well known Iris data set, which consists of 50 samples from each of three species of Iris (Iris setosa, Iris virginica, and Iris versicolor). Four features were measured from each sample: the length and the width of the sepals and petals, in centimeters. The scatter plots of the original data set are shown in Fig. 7a, where class values are marked with different colors. The generator based on this data, consisting of 31 Gaussian units, generated the 150 instances shown in Fig. 7b. The graphs show considerable similarity between matching pairs of scatter plots.
5 Data quality
We are not aware of any other data generator capable of generating data similar to existing data sets with no limitations in the number and type of attributes. The quality of existing data generators is mostly evaluated by comparing standard statistics: mean, median, standard deviation, skewness, and kurtosis. While these statistics are important indicators of the quality of generated data, they are insufficient for data sets with more attributes (e.g., more than 6). They are computed for each attribute separately, thereby not presenting an overall view of the data; they do not take possible interactions between attributes into account; and they are difficult to compare when the number of attributes increases. These statistics may also not convey any information about how appropriate and similar the generated data is for machine learning and data mining tasks. To resolve these difficulties and quantify the similarity between original and generated data, we developed several data quality measures described below. We use measures incorporating standard statistics, measures based on clustering, and measures based on classification performance.
5.1 Standard statistics
The standard statistics we use for numeric attributes are the mean, standard deviation, skewness, and kurtosis. We also compare value distributions of attributes from the original and generated data. For this we use the Hellinger distance and the Kolmogorov-Smirnov (KS) test. The workflow of the whole process is illustrated in Fig. 8.
We normalize each numeric attribute to $[0, 1]$ to make comparison between attributes sensible. The input to each comparison are two data sets (original and generated). Comparison of attributes’ statistics computed on both data sets is tedious, especially for data sets with a large number of attributes. We therefore first compute standard statistics on attributes and then subtract the statistics of the second data set from the statistics of the first data set. To summarize the results we report only the average difference for each of the statistics. To compare distributions of attribute values we use the Hellinger distance for discrete attributes and the KS test for numerical attributes. The Hellinger distance between two discrete univariate distributions $P = (p_1, \dots, p_k)$ and $Q = (q_1, \dots, q_k)$ is defined as
$$H(P, Q) = \frac{1}{\sqrt{2}} \sqrt{\sum_{i=1}^{k} \left(\sqrt{p_i} - \sqrt{q_i}\right)^2}$$
The maximal Hellinger distance between two distributions is 1. For numerical attributes we use the two sample KS test, which tests whether two one-dimensional probability distributions differ. The statistic uses the maximal difference between two empirical cumulative distribution functions $F_{1,n_1}$ and $F_{2,n_2}$, on samples of size $n_1$ and $n_2$, respectively:
$$D = \sup_x \left| F_{1,n_1}(x) - F_{2,n_2}(x) \right|$$
To get an overall picture we again report only the average Hellinger distance over all discrete attributes, and the percentage of numeric attributes for which the p-value of the KS test was below 0.05. We set the null hypothesis that attributes’ values in both data sets are drawn from the same distribution. While these averages do not have a strict statistical meaning, they do illustrate the similarity of two data sets.
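Both distribution comparisons are easy to state in code. The sketch below assumes the usual definitions: the Hellinger distance scaled by 1/sqrt(2) so that its maximum is 1, and the two-sample KS statistic as the largest gap between the two empirical cumulative distribution functions. It computes only the statistic, not the p-value, and the function names are ours.

```python
import math

def hellinger(p, q):
    """Hellinger distance between two discrete distributions given as probability lists."""
    total = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(total) / math.sqrt(2.0)

def ks_statistic(sample1, sample2):
    """Two-sample KS statistic: maximal difference between the empirical CDFs."""
    n1, n2 = len(sample1), len(sample2)
    d = 0.0
    # The empirical CDFs only change at observed values, so it suffices to check those.
    for x in sorted(set(sample1) | set(sample2)):
        f1 = sum(1 for v in sample1 if v <= x) / n1
        f2 = sum(1 for v in sample2 if v <= x) / n2
        d = max(d, abs(f1 - f2))
    return d

print(hellinger([0.5, 0.5], [0.9, 0.1]))
print(ks_statistic([1, 2, 3, 4], [3, 4, 5, 6]))
```

Identical distributions give 0 for both measures, while distributions with disjoint support give the maximal value 1.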
5.2 Comparing clustering
Cluster information provides important insight into the structure of the data. We compare the similarity of the clusterings obtained for two data sets (original and generated). To estimate the similarity based on clusterings we use the Adjusted Rand Index (ARI) [Hubert and Arabie, 1985].
5.2.1 Similarity of two clusterings
Starting from the data set $D$ with $n$ data points, we assume two different clusterings of $D$, namely $U = \{U_1, U_2, \dots, U_R\}$ and $V = \{V_1, V_2, \dots, V_C\}$, where $U_i \cap U_j = \emptyset$ and $V_i \cap V_j = \emptyset$ for $i \neq j$, $\bigcup_{i=1}^{R} U_i = D$, and $\bigcup_{j=1}^{C} V_j = D$. The information on the overlap between the clusters of $U$ and $V$ can be expressed with a contingency table as shown in Table 2.
There are several measures comparing clusterings based on counting the pairs of points on which two clusterings agree or disagree [Vinh et al., 2009]. Any pair of data points from the total of $\binom{n}{2}$ distinct pairs in $D$ falls into one of the following 4 categories.
$N_{11}$, the number of pairs that are in the same cluster in both $U$ and $V$;
$N_{00}$, the number of pairs that are in different clusters in both $U$ and $V$;
$N_{01}$, the number of pairs that are in the same cluster in $U$ but in different clusters in $V$;
$N_{10}$, the number of pairs that are in different clusters in $U$ but in the same cluster in $V$.
The values $N_{11}$, $N_{00}$, $N_{01}$, and $N_{10}$ can be computed from the contingency table [Hubert and Arabie, 1985]. Values $N_{11}$ and $N_{00}$ indicate agreement between clusterings $U$ and $V$, while values $N_{01}$ and $N_{10}$ indicate disagreement between $U$ and $V$. The original Rand Index [Rand, 1971] is defined as
$$RI(U, V) = \frac{N_{00} + N_{11}}{\binom{n}{2}}$$
The Rand Index lies between 0 and 1 and takes the value 1 when the two clusterings are identical, and the value 0 when no pair of points appears either in the same cluster or in different clusters in both $U$ and $V$. It is desirable that a similarity indicator takes a value close to zero for two random clusterings, which is not true for RI. The Adjusted Rand Index [Hubert and Arabie, 1985] fixes this by using the generalized hypergeometric distribution as a model of randomness and computing the expected number of entries in the contingency table. It is defined as
$$ARI(U, V) = \frac{2 (N_{00} N_{11} - N_{01} N_{10})}{(N_{00} + N_{01})(N_{01} + N_{11}) + (N_{00} + N_{10})(N_{10} + N_{11})}$$
The ARI has an expected value of 0 for a random distribution of clusters, and the value 1 for perfectly matching clusterings. The ARI can also be negative.
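Both indices can be computed directly by counting pairs. The sketch below uses the pair-count form of the ARI, 2(n00 n11 - n01 n10) / ((n00 + n01)(n01 + n11) + (n00 + n10)(n10 + n11)), where n01 counts pairs joined only in the first clustering and n10 pairs joined only in the second. The quadratic pair enumeration and the convention of returning 1 when the denominator is zero are our own choices for the illustration.

```python
from itertools import combinations

def pair_counts(u, v):
    """Count pairs by agreement; u and v are cluster labels of the same instances."""
    n11 = n00 = n01 = n10 = 0
    for i, j in combinations(range(len(u)), 2):
        same_u = u[i] == u[j]
        same_v = v[i] == v[j]
        if same_u and same_v:
            n11 += 1          # together in both clusterings
        elif not same_u and not same_v:
            n00 += 1          # separated in both clusterings
        elif same_u:
            n01 += 1          # together in u only
        else:
            n10 += 1          # together in v only
    return n11, n00, n01, n10

def rand_index(u, v):
    n11, n00, n01, n10 = pair_counts(u, v)
    return (n11 + n00) / (n11 + n00 + n01 + n10)

def adjusted_rand_index(u, v):
    n11, n00, n01, n10 = pair_counts(u, v)
    denom = (n00 + n01) * (n01 + n11) + (n00 + n10) * (n10 + n11)
    if denom == 0:
        return 1.0  # degenerate case in which the clusterings agree on every pair
    return 2.0 * (n00 * n11 - n01 * n10) / denom

print(adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1]))
```

The last call illustrates that the ARI can be negative when two clusterings agree less than expected by chance.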
5.2.2 A workflow for comparing clusterings on two data sets
The ARI is used to compare two different clusterings on the same set of instances, while we want to compare the similarity of two different sets of instances. To overcome this obstacle, we cluster both data sets separately and extract the medoids of the clusters for each clustering. The medoid of a cluster is an existing instance in the cluster whose average similarity to all instances in the cluster is maximal. For each instance in the first data set, we find the nearest medoid in the second clustering and assign the instance to that cluster, thereby getting a joint clustering of both data sets based on the cluster structure of the second data set. We repeat the analogous procedure for the second data set and get a joint clustering based on the first data set. These two joint clusterings are defined on the same set of instances (the union of the original and generated data), therefore we can use the ARI to assess the similarity of the clusterings and compare the structure of both data sets. The workflow of the cluster based comparison of two data sets is illustrated in Fig. 9.
As we need to assign new instances to an existing clustering, we selected the partitioning around medoids (PAM) clustering algorithm [Kaufman and Rousseeuw, 1990], which, besides partitions, also outputs medoids. Distance to the medoids is the criterion we use to assign new instances to existing clusters. PAM clustering is implemented in the R package cluster [Maechler et al., 2013]. To use this method we first computed distances between the instances of each data set using Gower’s method [Gower, 1971]. The method normalizes numeric attributes to $[0, 1]$ and uses 0-1 scoring of dissimilarity between nominal attributes (0 for the same, 1 for different categories). The distance is a sum of dissimilarities over all attributes.
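The assignment of instances to the nearest medoid can be sketched as follows. The sketch assumes that medoids have already been found (for example by PAM) and implements only the Gower-style dissimilarity described above, summed over attributes; the names and the `numeric_ranges` layout are hypothetical.

```python
def gower_distance(a, b, numeric_ranges):
    """Sum of per-attribute dissimilarities between instances a and b.

    numeric_ranges[j] is (min, max) for a numeric attribute and None for a nominal one.
    Numeric differences are scaled by the attribute range; nominal ones score 0 or 1.
    """
    total = 0.0
    for x, y, value_range in zip(a, b, numeric_ranges):
        if value_range is None:
            total += 0.0 if x == y else 1.0
        else:
            lo, hi = value_range
            total += abs(x - y) / (hi - lo) if hi > lo else 0.0
    return total

def assign_to_medoids(instances, medoids, numeric_ranges):
    """Return, for each instance, the index of its nearest medoid."""
    return [min(range(len(medoids)),
                key=lambda k: gower_distance(x, medoids[k], numeric_ranges))
            for x in instances]

ranges = [(0.0, 10.0), None]            # one numeric and one nominal attribute
medoids = [(1.0, "a"), (9.0, "b")]      # medoids of an existing clustering
labels = assign_to_medoids([(0.0, "a"), (8.0, "b"), (6.0, "a")], medoids, ranges)
print(labels)
```

Applying this assignment in both directions yields two label vectors over the union of both data sets, which can then be compared with the ARI.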
5.3 Comparing classification
Classification is probably the most important task in machine learning and data mining. The judgment of how good a substitute for the original data set the generated instances are is therefore largely dependent on the classification similarity between the data sets. The scenario we propose to measure the similarity of classification performance is shown in Fig. 10.
The basic idea is to train models with the original data and with the generated data. Both models are tested on yet unseen original and generated data and the performances are compared. If the performance of a model trained on original data is comparable for original and generated data this is an indicator that the generated data is within the original distribution (i.e., there are no significant outliers and all the aspects of the original data are captured). If performance of a model trained on the generated data is comparable for original and generated data this shows that the generated data enables comparable learning and has a good coverage of the original distribution, therefore the generator is able to produce good substitutes for original data concerning machine learning and data mining. Additionally, if the model trained on original data achieves better performance on the generated data than on original data, this indicates that the generator is oversimplified and does not cover all peculiarities of the original data.
In our testing workflow (Fig. 10) we start with two data sets d1 and d2 (e.g., the original and the generated one) and split them randomly but stratified into two halves (d1 produces d1a and d1b, d2 is split into d2a and d2b). Each of the four splits is used to train a classifier, and we name the resulting models m1a, m1b, m2a, and m2b, respectively. We evaluate the performance of these models on data unseen during training, so m1a is tested on d1b and d2, m1b is tested on d1a and d2, m2a uses d2b and d1, and m2b uses d2a and d1 as the testing set. Each test produces a performance score (e.g., classification accuracy, AUC…) which we can average as in a 2-fold cross-validation to get the following estimates:
performance of m1 on d1 (model built on original data and tested on original data) is an average performance of m1a on d1b and m1b on d1a,
performance of m1 on d2 (classifier built on original data and tested on generated data) is an average performance of m1a on d2 and m1b on d2,
performance of m2 on d2 (model built on generated data and tested on generated data) is an average performance of m2a on d2b and m2b on d2a,
performance of m2 on d1 (model built on generated data and tested on original data) is an average performance of m2a on d1 and m2b on d1.
These estimates already convey important information, as discussed above, but we can also take the difference between the performances of m1 on d1 and m2 on d1 (models built on original and generated data, both tested on original data). This difference is, in our opinion, the most important indicator of how suitable the generated data is for the development of classification methods. If, in the ideal case, this difference is close to zero, the generated data is a good substitute for the missing original data, as we can expect that the performance of the developed methods will be comparable when they are used on the original data.
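The testing workflow above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, a 1-nearest-neighbour learner stands in for the classifier (the experiments in this paper use random forests), classification accuracy is used as the score, and the sign of the difference (m1d1 minus m2d1) is our choice for the sketch.

```python
import random
from collections import defaultdict


def stratified_halves(data, rng):
    """Split (features, label) pairs into two halves, keeping class proportions."""
    by_class = defaultdict(list)
    for row in data:
        by_class[row[1]].append(row)
    half_a, half_b = [], []
    for label in sorted(by_class):
        rows = by_class[label][:]
        rng.shuffle(rows)
        mid = len(rows) // 2
        half_a.extend(rows[:mid])
        half_b.extend(rows[mid:])
    return half_a, half_b


def train_1nn(train):
    """Placeholder learner (assumption): predict the label of the nearest training row."""
    def predict(x):
        nearest = min(train, key=lambda row: sum((a - b) ** 2 for a, b in zip(row[0], x)))
        return nearest[1]
    return predict


def accuracy(model, test):
    """Fraction of test rows whose label the model predicts correctly."""
    return sum(model(x) == y for x, y in test) / len(test)


def cross_evaluate(d1, d2, seed=0):
    """Return the four averaged estimates and the difference of m1 and m2 on d1."""
    rng = random.Random(seed)
    d1a, d1b = stratified_halves(d1, rng)
    d2a, d2b = stratified_halves(d2, rng)
    m1a, m1b = train_1nn(d1a), train_1nn(d1b)
    m2a, m2b = train_1nn(d2a), train_1nn(d2b)
    scores = {
        "m1d1": (accuracy(m1a, d1b) + accuracy(m1b, d1a)) / 2,
        "m1d2": (accuracy(m1a, d2) + accuracy(m1b, d2)) / 2,
        "m2d2": (accuracy(m2a, d2b) + accuracy(m2b, d2a)) / 2,
        "m2d1": (accuracy(m2a, d1) + accuracy(m2b, d1)) / 2,
    }
    # Sign convention is an assumption of this sketch.
    scores["diff_d1"] = scores["m1d1"] - scores["m2d1"]
    return scores
```

Each data set is expected as a list of (feature tuple, class label) pairs. Repeating the call with different seeds and averaging the results would approximate a repeated 2-fold scheme.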
We want to verify whether the proposed generator produces data that is consistent with the original and whether it covers the whole space the original data set does. We try to determine the working conditions of the generator: on which data sets it works and where it fails, and on what sort of problems it faithfully reproduces the original and where it is less successful. We first describe the evaluation scenario and compare original and generated data. Afterwards we examine the parameters of the generator and propose reasonable defaults.
To evaluate the generator we performed a large-scale empirical evaluation using 51 data sets from the UCI repository [Bache and Lichman, 2014] with great variability in the number of attributes, types of attributes, and number of class values. We used the R package readMLData [Savicky, 2012], which provides a uniform interface for manipulating UCI data sets. Assuming that one would mostly want to generate semi-artificial data when the number of original instances is rather small, and to keep the computational load of the evaluation low, we limited the number of original instances to between 50 and 1000 (the lower limit is necessary to ensure sufficient data for both the generator and the testing set). Taking these conditions into account, we extracted 51 classification data sets from a collection of 92 data sets kindly provided by the author of the package readMLData. The characteristics of these data sets are presented in Table 3.
For each data set a generator based on the rbfDDA learner was constructed with the function rbfGen (Fig. 3). We used the value of parameter for all data sets and compared both variants of encoding for nominal attributes. The produced generator was used to generate the same number of instances as in the original data set, with the same distribution of classes, using the function newdata (Fig. 5). The width of the kernels was estimated from the training instances by setting the parameter var="estimated".
We compared the original data set with the generated one using the three workflows described in Sect. 5.
- Statistics of attributes:
we compared standard statistics of numeric attributes (Fig. 8): mean, standard deviation, skewness, and kurtosis. The interpretation of skewness and kurtosis is difficult and relevant only for each data set and attribute separately, so we exclude them from the summary comparisons in this section. For numeric attributes we computed p-values of KS tests under the null hypothesis that the attribute values from both compared data sets are drawn from the same distribution. We report the percentage of numeric attributes for which this hypothesis was rejected at the 0.05 level (a lower value indicates higher similarity). For discrete attributes we compared the similarity of value distributions using the Hellinger distance. The response variable was excluded from the data sets for this comparison, as the similarity of its distributions was enforced by the generator. We report the average Hellinger distance over all discrete attributes in each data set in the results below.
- Clustering:
the structure of the original and constructed data sets was compared with k-medoids clustering, using ARI (Eq. (2)) as presented in the workflow in Fig. 9. The response variable was excluded from the data sets. For some data sets ARI exhibits high variance, so we report the average ARI over 100 repetitions of generating new data.
- Classification performance:
we compared the predictive similarity of the data sets using the classification accuracy of random forests, as illustrated in the workflow in Fig. 10. We selected random forests due to the robust performance of this learning algorithm under various conditions [Verikas et al., 2011]. The implementation used comes from the R package CORElearn [Robnik-Šikonja and Savicky, 2012]. The default parameters were used: we built 100 random trees with the number of randomly selected attributes in nodes set to the square root of the number of attributes. We report 5x2 cross-validated performances of models trained and tested on both data sets.
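The two attribute-level comparisons in the first item can be illustrated with a short Python sketch. The function names are hypothetical. The Hellinger distance is normalized to the interval [0, 1], which is an assumption about the exact definition used here, and the KS rejection rule uses the large-sample critical value rather than a p-value, so it only approximates what a statistical package would report for small samples.

```python
import math
from collections import Counter


def hellinger(values_a, values_b):
    """Hellinger distance between the empirical distributions of two discrete samples."""
    pa, pb = Counter(values_a), Counter(values_b)
    na, nb = len(values_a), len(values_b)
    total = 0.0
    for v in set(pa) | set(pb):
        total += (math.sqrt(pa[v] / na) - math.sqrt(pb[v] / nb)) ** 2
    return math.sqrt(total / 2)


def ks_statistic(xs, ys):
    """Two-sample KS statistic: the largest gap between the two empirical CDFs."""
    xs, ys = sorted(xs), sorted(ys)
    i = j = 0
    d = 0.0
    while i < len(xs) and j < len(ys):
        v = min(xs[i], ys[j])
        # Step both empirical CDFs past the current value (handles ties).
        while i < len(xs) and xs[i] <= v:
            i += 1
        while j < len(ys) and ys[j] <= v:
            j += 1
        d = max(d, abs(i / len(xs) - j / len(ys)))
    return d


def ks_rejects(xs, ys, alpha=0.05):
    """Reject the 'same distribution' hypothesis using the asymptotic critical value."""
    n, m = len(xs), len(ys)
    critical = math.sqrt(-0.5 * math.log(alpha / 2)) * math.sqrt((n + m) / (n * m))
    return ks_statistic(xs, ys) > critical
```

Applying `hellinger` to each matching pair of discrete attributes and averaging gives a per-data-set summary; counting how often `ks_rejects` returns true over the numeric attributes gives a rejection percentage of the kind reported in the results.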
The results of these comparisons are presented in Table 4 for integer encoding of nominal attributes. The table reports the number of Gaussian kernels in the constructed generator; a relatively large number of units is needed to adequately represent the training data. Nevertheless, the generator construction time (function rbfDataGen), reported in seconds in the same table, is low. For the measurements we used a single core of an Intel i7 CPU running at 2.67 GHz. The time to generate the data (function newdata) was below 1 second for 1000 instances in all cases, so we do not report it.
Table 4 also gives the percentage of generated instances that are exactly equal to original instances. This mostly happens in data sets with only discrete attributes, where the whole problem space is small and identical instances are to be expected. Exceptions are the data sets horse-colic, primary-tumor, and breast-cancer, where the majority of Gaussian units in the generators have only one activation instance. The reason for this is the large number of attributes and, consequently, poor generalization of the rbfDDA algorithm (compare the number of instances and the number of attributes in Table 3 with the number of Gaussian units in Table 4).
Two further columns report the average difference in mean and standard deviation for normalized attributes. In 38 out of 39 cases the difference is below and in 33 out of 39 cases it is below which shows that the moments of the distributions of individual attributes are close to the originals. The distributions of individual attributes are compared with the KS test for numeric attributes and with the Hellinger distance for discrete attributes. One column gives the percentage of p-values below 0.05 in KS tests comparing matching numeric attributes under the null hypothesis that original and generated data are drawn from the same distribution. For most of the data sets and most of the attributes the KS test detects differences in distributions. Another column presents the average Hellinger distance for matching discrete attributes. While for many data sets the distances are low, there are also some data sets where the distances are relatively high, indicating that distribution differences can be considerable for discrete attributes.
The suitability of the generator as a development, simulation, or benchmarking tool in data mining is evidenced by comparing clustering and classification performance. The column labeled ARI presents the adjusted Rand index. We can observe that the clustering similarity is considerable for many data sets (high ARI), but there are also some data sets where it is low.
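For readers who want to reproduce the ARI column, the index can be computed directly from two cluster labelings of the same instances. The sketch below uses the standard Hubert and Arabie formulation, which we assume matches Eq. (2); the function name is hypothetical, and the clustering step itself (k-medoids in this paper) is not shown.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_u, labels_v):
    """Adjusted Rand index of two labelings of the same instances."""
    n = len(labels_u)
    # Contingency counts: how many instances share each (cluster in U, cluster in V) pair.
    pairs = Counter(zip(labels_u, labels_v))
    sum_ij = sum(comb(c, 2) for c in pairs.values())
    sum_u = sum(comb(c, 2) for c in Counter(labels_u).values())
    sum_v = sum(comb(c, 2) for c in Counter(labels_v).values())
    expected = sum_u * sum_v / comb(n, 2)
    max_index = (sum_u + sum_v) / 2
    if max_index == expected:
        # Degenerate case (e.g., a single cluster in both labelings).
        return 1.0
    return (sum_ij - expected) / (max_index - expected)
```

Identical partitions give 1 regardless of how the clusters are named, while partitions that agree no better than chance give values near 0; negative values are possible.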
The columns m1d1, m1d2, m2d1, and m2d2 report the 5x2 cross-validated classification accuracy of random forest models trained on either original (m1) or generated (m2) data and tested on both original (d1) and generated (d2) data. A general trend is that on the majority of data sets the model trained on original data (m1) performs better on generated data than on original data (m1d2 is larger than m1d1 for 41 of 51 data sets). This indicates that some of the complexity of the original data is lost in the generated data. This is also confirmed by models built on generated data (m2), which mostly perform better on generated than on original data (m2d2 is larger than m2d1 in 45 out of 51 data sets). Nevertheless, the generated data can be a satisfactory substitute for data mining in many cases: models built on generated data outperform models built on original data in about half the cases when both are tested on original data (m2d1 is larger than m1d1 in 25 cases out of 51, in 26 cases m1d1 is larger and there is 1 draw).
An overall conclusion is therefore that for a considerable number of data sets the proposed generator can generate semi-artificial data which is a reasonable substitute in development of data mining algorithms.
6.1 Binary encoding of attributes
As discussed in Sect. 4.2, we can encode each nominal attribute with a set of binary attributes instead of a single integer attribute and thus avoid making an unjustified assumption about the order of the attribute’s values. We report the results of the tests (statistical, clustering, and classification) using binary encoding in Table 5. As the binary encoding is used only for nominal non-binary attributes, we report results only for data sets that include at least one such attribute.
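The difference between the two encodings can be made concrete with a small Python sketch. It is a generic illustration with hypothetical function names, not the code used inside the generator: the integer variant maps each value to its position in the list of levels, which imposes an order, while the binary variant creates one 0/1 attribute per level.

```python
def integer_encode(values, levels):
    """Encode each nominal value as a single integer (its 1-based position in levels)."""
    index = {level: i + 1 for i, level in enumerate(levels)}
    return [[index[v]] for v in values]


def binary_encode(values, levels):
    """Encode each nominal value as one binary attribute per level (no implied order)."""
    return [[1 if v == level else 0 for level in levels] for v in values]


colors = ["red", "blue", "green"]
levels = ["red", "green", "blue"]
as_integers = integer_encode(colors, levels)  # one column, ordered codes
as_binary = binary_encode(colors, levels)     # three columns, one per level
```

With integer codes, "red" and "blue" end up two units apart while "red" and "green" are one unit apart, a distance that has no meaning for unordered values; in the binary encoding all pairs of distinct values are equally far apart.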
The number of Gaussian kernels is mostly larger with binary encoding of attributes (in 19 of 29 cases), and so is the generator construction time (in 28 out of 34 cases), but the differences are relatively small (on average 20 more Gaussian units are created, and the generator needs 3.7 seconds more). We tested the significance of the differences between integer and binary encodings using the Wilcoxon rank sum test at the 0.05 level. The binary encoding produces significantly more equal instances, but also a lower Hellinger distance, higher ARI, and a lower difference between m1d1 and m2d1, all of which indicate improved similarity to the original data. The differences in numeric attributes were not significant. As a result of these findings we recommend using binary encoding of nominal attributes.
6.2 Correction of estimated spread
In several generators we observed a low number of instances activated per kernel, with many kernels being formed around a single instance. For such cases the estimated variance is zero, which might cause overfitting of the training data. We try to alleviate the problem with the parameter defaultSpread in the function newdata (see Fig. 5), which, in the case of zero estimated variance for a certain dimension, replaces this unrealistic value with a (small) constant variance, typically between 0.01 and 0.20. The results for the default value defaultSpread=0.05 are reported in Table 6.
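The correction can be sketched as follows in Python. This is a simplified illustration and not the implementation of newdata: the function names are hypothetical, each kernel is represented only by the instances that activate it, and the sketch treats the default value as a per-dimension standard deviation on normalized attributes, which is an assumption.

```python
import random
import statistics


def kernel_spreads(instances, default_spread=0.05):
    """Per-dimension spread of the instances activating one kernel.

    A zero estimate (e.g., a kernel formed around a single instance) is
    replaced by default_spread.
    """
    spreads = []
    for column in zip(*instances):
        s = statistics.pstdev(column)
        spreads.append(s if s > 0 else default_spread)
    return spreads


def sample_from_kernel(center, spreads, rng):
    """Draw one new instance from an axis-aligned Gaussian around the kernel center."""
    return [rng.gauss(mu, s) for mu, s in zip(center, spreads)]


# A kernel built around a single instance: both estimated spreads are zero.
single = [(0.2, 0.5)]
spreads = kernel_spreads(single)
new_instance = sample_from_kernel(single[0], spreads, random.Random(0))
```

Without the replacement, every instance sampled from such a kernel would coincide with the single training instance, which is how exact copies of original instances arise.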
We compared the similarity obtained with binary encoding of nominal attributes and defaultSpread=0.05 (Table 6) against binary encoding without this correction (Table 5, combined with Table 4 for data sets missing in Table 5). The Wilcoxon paired rank sum test at the 0.05 level shows that this setting produces a significantly lower proportion of equal instances, a lower difference between means for numeric attributes, and a lower difference between m1d1 and m2d1. Other differences were not significant at this level. The approximation of the original data is nevertheless better, so we recommend some experimentation with this setting or using a safe default.
6.3 When does the RBF-based generator work?
We tried to determine the conditions under which RBF-based data generation works well. The first hypothesis we tested is whether the success of the RBF classification algorithm is related to the success of RBF-based data generation. For this we compared the classification performance of rbfDDA with the performance of random forests. We selected random forests because they are among the most successful classifiers, known for their robust performance (see for example [Verikas et al., 2011]). Using cross-validation we compared the classification accuracy and AUC of the rbfDDA algorithm from the RSNNS package [Bergmeir and Benítez, 2012] with the random forest implemented in the CORElearn package [Robnik-Šikonja and Savicky, 2012], using the default parameters for both classifiers. Unsurprisingly, random forests produced significantly higher accuracy and AUC. We report the results for accuracy in the four left-hand columns of Table 7. The results for AUC are highly similar, so we skip them.
The difference in accuracy between random forests and RBF networks is significant at the 0.05 level for 34 of 51 data sets (for 3 data sets RBF is significantly better; the other differences are not significant).
We tried to identify the main factors affecting the success of the proposed RBF-based data generator. As a measure of success we use the difference in classification accuracy of models trained on original data (m1) and generated data (m2) and tested on original data (d1). This difference is reported in Table 7. The factors possibly affecting this indicator are the difference in classification accuracy between RBF and RF, the number of instances, the number of attributes, the number of instances per attribute, the number of Gaussians in the generator, the average number of instances per Gaussian kernel, and the number of attributes per Gaussian kernel. These factors are collected in Table 7. In Table 8 we show their correlation with this difference.
The Pearson correlation coefficients indicate that performance is most strongly correlated with the difference in classification accuracy between RBF and RF, the number of attributes, and the number of Gaussian kernels. All these factors are indicators of the difficulty of the problem for the RBF classifier, hinting that the usability of the proposed generator depends on the ability of the learning method to capture the structure of the problem.
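As a reminder of how such a coefficient is obtained, the following Python sketch computes the Pearson correlation between one factor and the success indicator across data sets. The function name is hypothetical and the numbers in the example are invented for illustration only; they are not taken from Table 7 or Table 8.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Invented toy values: one entry per (hypothetical) data set.
accuracy_gap_rf_vs_rbf = [0.02, 0.10, 0.05, 0.20]
accuracy_loss_on_d1 = [0.01, 0.06, 0.02, 0.15]
r = pearson(accuracy_gap_rf_vs_rbf, accuracy_loss_on_d1)
```

A value of r close to 1 or -1 indicates a strong linear relationship between the factor and the indicator, while a value near 0 indicates none.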
We tried to predict the success of the data generator using a stepwise linear model with the independent variables listed above, but it turned out that the difference in classification accuracy between RBF and RF is the only variable needed. Other prediction methods were also not successful.
6.4 Development of big data tools
During the development of the cloud-based big data framework ClowdFlows [Kranjc et al., 2014], we wanted to test several classification algorithms that are components of the framework, as well as the capabilities and scalability of the framework. Though several examples of public big data problems are freely available, each requires (sometimes tedious) preprocessing and adaptation to the specifics of the problem. As the development already required a significant effort from everyone involved, such additional effort was undesirable. The use of rbfDataGen, contained in the open-source R package semiArtificial [Robnik-Sikonja, 2014], turned out to require little additional work while providing the required testing data with the desired characteristics. We generated several data sets with different characteristics (varying the number of features, instances, and proportions of classes), which were needed during the development and evaluation of the framework.
7 Conclusions and further work
We present an original and practically useful generator of semi-artificial data, which was successfully tested in the development of big data tools. The generator captures the structure of the problem using an RBF classifier and exploits properties of the Gaussian kernel to generate new data similar to the original. We expect such a tool to be useful in the development and adaptation of data analytics tools to the specifics of data sets. Other possible uses are data randomization to ensure privacy, simulations requiring large amounts of data, testing of big data tools, benchmarking, and scenarios with huge amounts of data.
We developed a series of evaluation tools that can provide an estimate of the generator’s performance for specific data sets. Using a large collection of UCI data sets we were able to show that the generator was in most cases successful in generating artificial data similar to the original. The success of the generator is related to the success of the RBF classifier: where the RBF classifier can successfully capture the properties of the original data, the generator based on it will also be successful, and vice versa. Nevertheless, we were unable to create a successful prediction model for the quality of the generator. The user is therefore advised to use the provided evaluation tools on the specific data set. The results should give a good indication of the usability of the generated data for the intended use. The proposed generator, together with the statistical, clustering, and classification performance indicators, was turned into an open-source R package semiArtificial [Robnik-Sikonja, 2014].
In the future we plan to extend the generator with new modules that use different learning algorithms to capture the data structure and generate new data. Another interesting direction is a rejection approach that uses probability density estimates based on various learning algorithms.
Many thanks to Petr Savicky for interesting discussions on the subject and for the preparation of UCI data sets used in this paper. The author was supported by the Slovenian Research Agency (ARRS) through research programme P2-0209.
- Bache and Lichman  K. Bache and M. Lichman. UCI machine learning repository, 2014. URL http://archive.ics.uci.edu/ml.
- Bandara and Jayasumana  H. D. Bandara and A. P. Jayasumana. On characteristics and modeling of p2p resources with correlated static and dynamic attributes. In IEEE Global Telecommunications Conference (GLOBECOM 2011), pages 1–6. IEEE, 2011.
- Bergmeir and Benítez  C. Bergmeir and J. M. Benítez. Neural networks in R using the Stuttgart neural network simulator: RSNNS. Journal of Statistical Software, 46(7):1–26, 2012.
- Berthold and Diamond  M. R. Berthold and J. Diamond. Boosting the performance of RBF networks with dynamic decay adjustment. In Advances in Neural Information Processing Systems, pages 521–528. MIT Press, 1995.
- Ferrari and Barbiero  P. A. Ferrari and A. Barbiero. Simulating ordinal data. Multivariate Behavioral Research, 47(4):566–589, 2012.
- Fraley et al.  C. Fraley, A. E. Raftery, T. B. Murphy, and L. Scrucca. mclust version 4 for R: Normal mixture modeling for model-based clustering, classification, and density estimation. Technical Report No. 597, University of Washington, Department of Statistics, 2012. URL http://CRAN.R-project.org/package=mclust. R package, version 4.2.
- Gower  J. C. Gower. A general coefficient of similarity and some of its properties. Biometrics, pages 857–871, 1971.
- Han and Qiao  H.-G. Han and J.-F. Qiao. Adaptive computation algorithm for RBF neural network. IEEE Transactions on Neural Networks and Learning Systems, 23(2):342–347, 2012.
- Härdle and Müller  W. Härdle and M. Müller. Multivariate and semiparametric kernel regression. In M. G. Schimek, editor, Smoothing and regression: approaches, computation, and application, pages 357–392. John Wiley & Sons, 2000.
- Hennig  C. Hennig. fpc: Flexible procedures for clustering, 2013. URL http://CRAN.R-project.org/package=fpc. R package, version 2.1-7.
- Hubert and Arabie  L. Hubert and P. Arabie. Comparing partitions. Journal of Classification, 2(1):193–218, 1985.
- Kaufman and Rousseeuw  L. Kaufman and P. J. Rousseeuw. Finding groups in data: an introduction to cluster analysis. John Wiley & Sons, 1990.
- Kranjc et al.  J. Kranjc, R. Orač, V. Podpečan, M. Robnik-Šikonja, and N. Lavrač. ClowdFlows: Workflows for big data on the cloud. Technical report, Jožef Stefan Institute, Ljubljana, Slovenia, 2014.
- Maechler et al.  M. Maechler, P. Rousseeuw, A. Struyf, M. Hubert, and K. Hornik. cluster: Cluster Analysis Basics and Extensions, 2013. URL http://CRAN.R-project.org/package=cluster. R package, version 1.14.4.
- Mair et al.  P. Mair, A. Satorra, and P. M. Bentler. Generating nonnormal multivariate data using copulas: Applications to SEM. Multivariate Behavioral Research, 47(4):547–565, 2012.
- Moody and Darken  J. Moody and C. J. Darken. Fast learning in networks of locally-tuned processing units. Neural Computation, 1(2):281–294, 1989.
- Nelsen  R. B. Nelsen. An introduction to copulas. Springer, 1999.
- R Core Team  R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, 2013. URL http://www.R-project.org/.
- Rand  W. M. Rand. Objective criteria for the evaluation of clustering methods. Journal of the American Statistical Association, 66(336):846–850, 1971.
- Reilly et al.  D. L. Reilly, L. N. Cooper, and C. Elbaum. A neural model for category learning. Biological cybernetics, 45(1):35–41, 1982.
- Ripley  B. D. Ripley. Stochastic Simulation. Wiley, New York, 1987.
- Robnik-Sikonja  M. Robnik-Sikonja. semiArtificial: Generator of semi-artificial data, 2014. URL http://CRAN.R-project.org/package=semiArtificial. R package version 1.2.0.
- Robnik-Šikonja and Savicky  M. Robnik-Šikonja and P. Savicky. CORElearn - classification, regression, feature evaluation and ordinal evaluation, 2012. URL http://CRAN.R-project.org/package=CORElearn. R package version 0.9.39.
- Ruscio and Kaczetow  J. Ruscio and W. Kaczetow. Simulating multivariate nonnormal data using an iterative algorithm. Multivariate Behavioral Research, 43(3):355–381, 2008.
- Savicky  P. Savicky. readMLData: Reading machine learning benchmark data sets from different sources in their original format, 2012. URL http://CRAN.R-project.org/package=readMLData. R package, version 0.9-6.
- Venables and Ripley  W. N. Venables and B. D. Ripley. Modern Applied Statistics with S. Springer, New York, 4th edition, 2002.
- Verikas et al.  A. Verikas, A. Gelzinis, and M. Bacauskiene. Mining data with random forests: A survey and results of new tests. Pattern Recognition, 44(2):330–349, 2011.
- Vinh et al.  N. X. Vinh, J. Epps, and J. Bailey. Information theoretic measures for clusterings comparison: is a correction for chance necessary? In Proceedings of the 26th International Conference on Machine Learning, ICML 2009, pages 1073–1080, 2009.
- Xie et al.  T. Xie, H. Yu, J. Hewlett, P. Rozycki, and B. Wilamowski. Fast and efficient second-order method for training radial basis function networks. IEEE Transactions on Neural Networks and Learning Systems, 23(4):609–619, 2012.
- Zell et al.  A. Zell, G. Mamier, M. Vogt, N. Mache, R. Huebner, S. Doering, K.-U. Herrmann, T. Soyez, M. Schmalzl, T. Sommer, A. Hatzigeorgiou, D. Posselt, T. Schreiner, B. Kett, G. Clemente, J. Wieland, and J. Gatter. SNNS: Stuttgart neural network simulator. User manual, version 4.2. Technical report, University of Stuttgart and University of Tuebingen, 1995.