1 Introduction
Feature vectors carry numerical patterns that characterize the original domain (or a subset of it, the input domain) formed by the feature vectors themselves. Machine learning algorithms generally use these patterns, through supervised or unsupervised learning techniques, to generate classifiers that help make decisions from data (Suthaharan, 2015). However, certain data science applications, such as data privacy and data security (Whitworth and Suthaharan, 2014), require these feature patterns to be altered so that the original patterns are difficult to recover from the altered ones (Little, 1993). Perturbation models have been studied and developed for this purpose (Muralidhar and Sarathy, 2003; Fienberg and Steele, 1998). These models generally transform the feature vectors from an original domain into a new set of feature vectors in a transform domain where data privacy can be protected. On the other hand, the performance of machine learning algorithms can degrade in the transform domain because of the alterations of the patterns. Hence, significant research has been devoted to developing efficient perturbation models that minimize this degradation while providing robust protection of data privacy.
Perturbation models may be categorized into two top-level groups: parametric models and nonparametric models. The parametric models may be further divided into two subgroups: vector space (or original domain) models and feature space (or transform domain) models. The vector space models include those proposed by Muralidhar and Sarathy (2003), who showed that their models perform well in the original domain. Alternatively, Oliveira and Zaïane (2004) proposed a feature space model constructed using a matrix rotation, and Lasko and Vinterbo (2010) developed a feature space model based on spectral analysis. They showed that their techniques perform well in the transform domain. These types of models make parametric statistical assumptions that in practice can easily be violated for different types of data. As a consequence, the current techniques may not perform as desired. A thorough review was presented by Qian and Xie (2015), who summarized the possible violations of parametric assumptions, including uncertainty in the marginal distributional properties of independent variables and possible nonlinear relationships that linear models cannot fully capture (e.g., an inverted-U shape (Aghion et al., 2005)). They proposed a nonparametric model based on density ratios to address these problems and reported that nonparametric models can in general perform better than parametric models.
In this paper, we also consider a nonparametric perturbation model that imposes no parametric assumptions on the marginal distribution of features. The main idea is to construct a transform domain (or feature space) from the original domain using parametrized elliptical patterns, with the goals of making the restoration of the original patterns very difficult while maintaining similar performance for the machine learning algorithms in the original and transform domains. Our proposed approach, Elliptical Pattern Analysis (EPA), bases its privacy-strength criterion on the blind source separation attack (Zarzoso and Nandi, 1999), because it uses the mutual interaction between variables to construct the transform domain.
Our key contribution includes the use of mutual interaction between two variables (or features); however, this type of aggregation may jeopardize the performance of classification algorithms through the loss of some data characteristics (or patterns). To solve this problem, we propose an additional data aggregation step, a random projection in the feature space, before applying any machine learning algorithm. The main idea is to search over possible ways of combining pairs (or blocks) of variables to achieve efficient dimension reduction while retaining useful predictive information for the later-stage machine learning algorithms. In particular, we consider classification algorithms and use random forest classification on the reduced feature space. By aggregating feature variables, the proposed method significantly enhances the protection of data privacy and reduces computational cost.
2 A Perturbation Model
We define the proposed EPA approach as a model that transforms a subset of the original domain (the input domain) through a perturbation process, such that the feature vector is altered in the transform domain to achieve a set of specific recommended goals: the goals that lead to the protection of data privacy and the generation of classifiers. In this section, the perturbation model is defined using a mathematical transformation and recommended quantitative measures for the strength of data privacy and for the misclassification error.
2.1 Mathematical Definition
Suppose x is a feature vector with dimension in the input domain , and y is its perturbed feature vector with dimension (where ) in the transform domain , then we define the mathematical relationship between x and y as follows:
(2.1) 
where the mathematical transformation defines the proposed perturbation model, and its intent is to satisfy a condition stated in terms of some quantitative measure. In other words, this condition describes the difficulty of recovering the feature vector x from the feature vector y, given the transformation and the quantitative measure. One application that fits this type of modeling is data privacy, where the owner of the data wants to share the data with an intended user while protecting its privacy, provided that the transformation and the measure are chosen appropriately.
2.2 Problem Definition
The condition imposed on the proposed perturbation model can adversely affect applications that, along with data privacy, need the feature vector in the transform domain to achieve classification results similar to or better than those obtained with the feature vector of the input domain. Suppose we have a performance measure (e.g., misclassification error) of an application; then the performance degradation of the perturbation model can be defined as follows:
(2.2) 
where and we define the degradation measure as follows: . While it is expected that for a perturbation model, it is also possible that we get ; that is better performance with y for a perturbation model. The application that we consider in this paper is a classification technique  in particular the random forest technique  with the misclassification error as the performance measure .
3 The Proposed Methodology
As per the definitions and problems stated in the previous section, this study requires a perturbation model with its condition measure and an application with its performance degradation measure. Both are presented in this section with a detailed discussion.
3.1 Elliptical Perturbation Model
Our feature vector x in the input domain may be represented by variables (or features), . We also assume is an even integer without loss of generality. We use the proposed perturbation on consecutive pairs of variables: , , , to generate the feature vector y which is represented by new variables ; , respectively. Take as an example, we consider
(3.1) 
where and are unknown parameters, and determines the strength of noise degradation. To further simplify the process, we can assume and . The model reduces to the standard linear model when or . The nonlinear transformation defines the elliptical perturbation model and describes the nonlinear mutual interaction between the feature variables and .
On one hand, we can choose the value for such that the classification results using y and x are significantly close to each other (i.e., ). On the other hand, we can choose to minimize the absolute value of correlations between and . Meanwhile, noise strength will be tuned to achieve the intended goal (e.g. data privacy determined by ) of the perturbation model. In the model building process, we will use this correlationminimization to tune the parameter .
3.2 Elliptical Patterns Visualization
The visual interpretation of the studied model in equation (3.1) is presented in Fig. 1. We have illustrated the elliptical characteristics of the model by fixing the variable to a single value and varying the values of the parameters , , and . For simplicity, we have selected , and a set of values (0.22, 0.78, 0.03), (0.32, 0.68, 0.04), and (0.1, 0.9, 0.05) for the parameters , , , respectively. The model in equation (3.1), with these values, provides the three elliptical patterns with interference characteristics as illustrated in Fig. 1. In order to generate these elliptical patterns, we transform equation (3.1) as follows:
(3.2) 
It clearly shows the difficulty of finding a pair of for a given value of under a scaled noise degradation due to elliptical interference. To illustrate the strength of the model visually, we increased the values of from 0.03, 0.04, and 0.05 to 0.05, 0.1, and 0.15, respectively, and generated the values of . The results are presented in Fig. 2. It clearly displays a stronger interference (or cross talk) between the elliptical models with respect to the values of . The measure of this interference will help to determine parameters of the model for the protection of data privacy. We treat this interference as signal interference and apply blind signal separation approaches (Zarzoso and Nandi, 1999) to determine the strength of data privacy.
3.3 Blind Source Separation
Blind source separation (BSS) is a classical technique capable of separating original signals from copies of their modulated signals without any prior information about the original signals (Zarzoso and Nandi, 1999). Recent studies show that BSS can even handle multidimensional data, such as images and video (image sequences) (Sørensen and De Lathauwer, 2013). Therefore, we adopted this technique as an attack approach (Liu et al., 2008) against the proposed perturbation model and used it to derive robust parameters for the model. The standard measure used with the BSS technique (or the attack) is called the Signal Interference Ratio (SIR), which is defined by the following fraction:
(3.3) 
where the numerator and the denominator stand for the power of the modulated signal and the power of the crosstalk between the co-channels, respectively. The ratio is measured in decibels (dB). When the denominator (the power of the crosstalk) increases, the ratio decreases, and it becomes harder to recover the source signals from the modulated signals. This fraction is defined based on the information available at https://cran.r-project.org/web/packages/JADE/index.html. It means that the lower the SIR, the higher the strength of modulation. According to the BSS literature, if the SIR value is greater than 20 dB then the source signals are recoverable from the modulated signal, and if the SIR value is less than or equal to 20 dB then the source signals are not recoverable (Boscolo et al., 2004; Caiafa and Proto, 2005). We use this criterion to validate the proposed perturbation model.
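As a small illustration of the decision rule above, the following sketch converts a power ratio to decibels and applies the 20 dB threshold. It assumes the standard decibel conversion for a power ratio, 10 log10(signal power / crosstalk power); the function names and the example power values are hypothetical and are not taken from the JADE package.

```python
import math

SIR_THRESHOLD_DB = 20.0  # recoverability threshold quoted in the text


def sir_db(signal_power: float, crosstalk_power: float) -> float:
    """SIR in decibels, assuming the standard power-ratio conversion."""
    if signal_power <= 0 or crosstalk_power <= 0:
        raise ValueError("powers must be positive")
    return 10.0 * math.log10(signal_power / crosstalk_power)


def is_recoverable(sir_value_db: float) -> bool:
    """Sources are treated as recoverable only when SIR exceeds 20 dB."""
    return sir_value_db > SIR_THRESHOLD_DB


# Hypothetical powers: more crosstalk means a lower SIR.
low_crosstalk = sir_db(1.0, 0.001)
high_crosstalk = sir_db(1.0, 0.05)
print(round(low_crosstalk, 2), is_recoverable(low_crosstalk))
print(round(high_crosstalk, 2), is_recoverable(high_crosstalk))
```

Under this rule, a perturbation is considered protective when the attack's SIR stays at or below the threshold.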
3.4 Random Forest Classification
Among the many classification techniques in machine learning, we selected the random forest technique (Breiman, 2001) for our research because it addresses multiclass classification problems better than many other machine learning techniques, including the support vector machine (Jeyakumar et al., 2014; Suthaharan, 2016) and the decision tree (Murthy, 1998). Random forest classifiers divide the data domain efficiently using bootstrapping (to generate random decision trees) and the Gini index (to split the tree nodes). Hence, the technique is highly suitable for classifying large, imbalanced data sets with many features.
3.5 Misclassification and OOB Errors
Several measures have been used to quantify the performance of classification techniques in machine learning; among them, the out-of-bag (OOB) error and the misclassification error are the most commonly used for random forest classifiers (Breiman, 1996). The OOB error is defined as the ratio between the total number of misclassified items in a set and the total number of items in the set. Similarly, the misclassification error of a class is defined as the ratio between the number of misclassified items in the class and the total number of items in the class. We used both of these quantitative measures to evaluate the performance of the random forest classification algorithm in the input domain as well as in the transform domain with the proposed perturbation model, and to compare the results.
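The two ratios defined above can be sketched as follows. The (correctly classified, misclassified) counts are the input-domain counts reported in Section 4 for three of the classes; the function names are hypothetical, and the set-level ratio computed here covers only these three classes, so it is not the OOB error reported in the paper.

```python
from typing import Dict, Tuple

Counts = Dict[str, Tuple[int, int]]  # class -> (correct, misclassified)


def misclassification_error(correct: int, wrong: int) -> float:
    """Misclassified items in a class divided by all items in the class."""
    return wrong / (correct + wrong)


def set_level_error(counts: Counts) -> float:
    """Total misclassified items divided by total items in the set."""
    total_wrong = sum(wrong for _, wrong in counts.values())
    total_items = sum(correct + wrong for correct, wrong in counts.values())
    return total_wrong / total_items


counts: Counts = {
    "Normal": (13379, 70),
    "Neptune": (8256, 26),
    "pod": (31, 7),
}
per_class = {k: misclassification_error(c, w) for k, (c, w) in counts.items()}
overall = set_level_error(counts)
print({k: round(v, 3) for k, v in per_class.items()}, round(overall, 4))
```

The example shows why both measures are reported: a small class such as pod can have a high per-class error while barely moving the set-level ratio.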
4 Experimental Results
We studied the performance degradation of random forest classifiers using the proposed elliptical perturbation model and the highly imbalanced NSL-KDD data set (http://www.unb.ca/cic/research/datasets/nsl.html), which we had downloaded and used in previous research (Suthaharan and Panchagnula, 2012). This data set has 25,192 observations with 41 network traffic features and 22 network traffic classes. We labeled the entire feature vector and later reduced it to a lower-dimensional feature vector based on the importance of the features to random forest classification. This data set forms the original domain, and we refer to it as “datasetO”. In this data set, the normal traffic class and the Neptune attack class have a large number of observations compared with the other attack classes; hence, the data set is highly imbalanced.
Table 1: Network traffic classes of the NSL-KDD data set and their numbers of observations.
Label  Traffic  #Obs.  Label  Traffic  #Obs.
0  Normal  13449  11  guess_pwd  10
1  Neptune  8282  12  ftp_write  1
2  back  196  13  multihop  2
3  warezclient  181  14  warezmaster  7
4  ipsweep  710  15  loadmodule  1
5  portsweep  587  16  spy  1
6  teardrop  188  17  imap  5
7  nmap  301  18  buf_ovrflow  6
8  satan  691  19  land  1
9  smurf  529  20  phf  2
10  pod  38  21  rootkit  4
The network traffic details in Table 1 clearly show the imbalanced nature of the data set, both between the normal and attack traffic classes and among the attack traffic classes. The first 11 traffic classes (labeled 0 to 10) have more than 30 observations each, and the remaining 11 traffic classes (labeled 11 to 21) have far fewer than 30 observations each. One of the goals is to study the effect of the proposed perturbation model on the performance of random forest classifiers using the first 11 traffic classes only; however, we also use the other 11 traffic classes to understand the imbalanced nature of the data and its significance to random forest classification.
4.1 Feature Selection using Random Forest
There are 41 features in datasetO, as noted earlier, and this feature vector determines the dimensionality (41) of the original domain; however, not all of these features necessarily contribute to the classification performance of random forest. To prepare the data set for our experiments and select the important features for classification, we first removed the categorical variables (or features) along with the features that overshadow the other features due to outliers. We then applied random forest classification to determine the importance of the features, ordering them by their misclassification errors.
Following the approach suggested by Zumel et al. (2014), we removed the least important feature from the feature vector one at a time, repeating random forest classification until a change in misclassification error was observed. This process resulted in a lower-dimensional data set with 16 features, ordered by decreasing importance. Hence, we reduced the data set to one containing only the features that contribute most to random forest classification. For simplicity, we relabeled these features in that order. Therefore, the dimension of the input domain of the proposed perturbation model is 16, with 25,192 observations, 16 network traffic features, and 22 network traffic classes. We refer to this dimension-reduced data set for the input domain as “datasetI”.
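The elimination loop described above can be sketched generically as follows. This is only an illustration under stated assumptions: `error_fn` is a hypothetical callback standing in for fitting a random forest on a feature subset and returning its misclassification error, the tolerance `tol` that defines "a change" is assumed, and the toy error function exists only to make the example runnable.

```python
from typing import Callable, List, Sequence


def backward_eliminate(
    features_by_importance: Sequence[str],
    error_fn: Callable[[Sequence[str]], float],
    tol: float = 1e-3,
) -> List[str]:
    """Drop the least important feature repeatedly until the error changes.

    features_by_importance: feature names, most important first.
    error_fn: classification error for a feature subset (hypothetical
              stand-in for a random forest fit).
    tol: smallest difference in error that counts as a change (assumed).
    """
    kept = list(features_by_importance)
    baseline = error_fn(kept)
    while len(kept) > 1:
        candidate = kept[:-1]  # remove the least important remaining feature
        if abs(error_fn(candidate) - baseline) > tol:
            break  # the error changed, so keep the current feature set
        kept = candidate
    return kept


# Toy error function: the error jumps once fewer than 3 features remain.
def toy_error(subset: Sequence[str]) -> float:
    return 0.01 if len(subset) >= 3 else 0.05


selected = backward_eliminate(["f1", "f2", "f3", "f4", "f5"], toy_error)
print(selected)
```

Passing the classifier in as a callback keeps the loop independent of any particular random forest implementation.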
4.2 Transform Domain Pattern Analysis
The next step is to build the perturbation model using datasetI as the input domain and to construct the transform domain so that the random forest classifiers can be evaluated. Because the features are paired, multiple elliptical perturbation models were generated by selecting suitable parameters for each model; they are discussed in the subsections below.
4.2.1 Multiple Model Generation
The proposed theoretical model for a single pair of features was presented in equation (3.1). It is applied to every consecutive pair of features of the input domain; however, one can apply different techniques to select and combine the features. Pairing the sixteen features of the input domain gives 8 models with new features for the transform domain, as follows:
(4.1) 
where ; hence, we have 8 different models with elliptical patterns that form the transform domain with dimension 8. It is obvious that the parameters , together, and contribute to the elliptical patterns and their distortion, and in turn contribute to the robustness of the proposed perturbation model to privacy attacks. They also contribute to the performance degradation of random forest classifiers in the transform domain. Therefore, a tradeoff mechanism is required to achieve a strong privacy protection and a low misclassification error. The SIR measure is a flexible quantifier that allows a wide range of values to quantify the strength of privacy protection against BSS attack. The next subsection describes the empirical approach where we utilized this measure to find a set of values for the parameters , by fixing .
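The pairing step can be sketched independently of the specific transformation. In the sketch below, `transform` is a hypothetical placeholder for the per-pair model; the lambda used in the example simply adds the two values so that the code runs, and it is not the elliptical model of equation (3.1).

```python
from typing import Callable, List, Sequence, Tuple


def consecutive_pairs(x: Sequence[float]) -> List[Tuple[float, float]]:
    """Split (x1, ..., xp) into (x1, x2), (x3, x4), ...; p must be even."""
    if len(x) % 2 != 0:
        raise ValueError("the number of features must be even")
    return [(x[i], x[i + 1]) for i in range(0, len(x), 2)]


def perturb(
    x: Sequence[float],
    transform: Callable[[float, float], float],
) -> List[float]:
    """Map each consecutive feature pair to one transform-domain feature."""
    return [transform(a, b) for a, b in consecutive_pairs(x)]


# Sixteen input features give eight transform-domain features.
x = [float(i) for i in range(1, 17)]
y = perturb(x, lambda a, b: a + b)  # placeholder transform, not the EPA model
print(len(x), len(y))
```

Because each pair collapses to a single value, this construction always halves the dimension, which matches the fixed 50% reduction noted later in the paper.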
4.2.2 Parameter Selection for the Models
We used a Monte Carlo approach with the JADE implementation of the SIR computation to assess the BSS attack empirically. In this implementation, multiple copies of modulated source signals are generated using random weights, and an SIR value is then calculated to determine whether the source signals are recoverable from the multiple modulated signals (if the SIR is greater than 20 dB the source signals are recoverable; otherwise they are not). In our implementation, each feature pair is treated as a pair of source signals, and the corresponding transformed feature is treated as their modulated signal. To create multiple copies of the modulated signal from a feature pair, we generated several parameter values randomly from a uniform distribution and used them in equation (4.1). We then used the Monte Carlo approach to achieve the desired results.
The Monte Carlo approach, combined with the JADE application of SIR and the BSS attack, provided us with the three parameter values 0.042, 0.021, and 0.096. To cut down the computational cost of the Monte Carlo approach, we reused these values across the 8 models. The resulting SIR values for the 8 models are 14.289, 10.983, 7.873, 11.483, 11.758, 12.608, 14.675, and 16.235 dB, respectively. Values below 20 dB indicate that source signal separation is difficult; hence, the BSS attack is not feasible. We can also see that the models have different privacy strengths: a model with a lower SIR is stronger against the BSS attack than one with a higher SIR. In this step, therefore, we generated a data set for the transform domain; it has 25,192 observations with 8 newly defined traffic features and 22 network traffic classes. We refer to this transform domain data set as “datasetT”.
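The reported SIR values can be checked against the 20 dB criterion and ranked by privacy strength with a few lines of code. The numbering of the models from 1 to 8 in the order the values are listed is an assumption of this sketch.

```python
SIR_THRESHOLD_DB = 20.0  # at or below this value, sources are not recoverable

# SIR values (dB) reported for the 8 models, in the order listed in the text.
sir_values = [14.289, 10.983, 7.873, 11.483, 11.758, 12.608, 14.675, 16.235]
models = {index + 1: value for index, value in enumerate(sir_values)}

# Every model should stay at or below the threshold to resist the BSS attack.
all_protected = all(value <= SIR_THRESHOLD_DB for value in sir_values)

# Lower SIR means stronger protection, so sort ascending by SIR.
ranking = sorted(models, key=models.get)
print(all_protected, ranking)
```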
4.3 Performance Degradation Evaluation
We divided the performance degradation evaluation task into two experiments: an experiment with full-imbalanced data sets and an experiment with reduced-imbalanced data sets. In the first experiment, we used datasetI and datasetT to compare the performance of random forest in the input domain and the transform domain. These two data sets contain all 22 network traffic types with their full imbalanced traffic nature. As listed in Table 1, there are 11 traffic types with far fewer than 30 observations each (totaling 40 observations), and the removal of these traffic types may influence the classification results. Hence, for the second experiment, we created two new data sets, datasetIR and datasetTR, from datasetI and datasetT, respectively, by removing the 40 observations related to these 11 traffic types. Thus datasetIR has 25,152 observations with dimension 16 and 11 traffic classes, and datasetTR has 25,152 observations with dimension 8 and 11 traffic classes.
4.3.1 Experiment with full-imbalanced data sets
We used datasetI and datasetT to compare the performance of random forest classifiers in the input domain and the transform domain, respectively. We conducted this experiment to evaluate the classification performance of random forest with the original variables (unprotected features) and the transformed variables (protected features). The idea is to analyze the performance of random forest when the training is performed on these two full-imbalanced data sets. We used both the OOB error and the misclassification error to compare the performances.
OOB error
The OOB errors and misclassification errors are presented in the second and third columns, respectively, of Tables 2 and 3. The tables also give, for each class, the tuple of correctly classified and misclassified numbers of observations in the input domain (Table 2) and the transform domain (Table 3). The OOB error is a single measure of classification performance on the whole set, so there is a single value for each domain: 0.0098 for the input variables (unprotected features) and 0.0169 for the transformed variables (protected features). Rounded to two decimal places, these are 0.01 and 0.02, a difference of about one percentage point in OOB error between the input domain and the transform domain. We can see that the perturbation model increases the OOB error slightly while protecting data privacy.
Misclassification error
Similarly, by comparing the misclassification errors in Table 2 and Table 3, we observed that the perturbation model yields higher misclassification errors, as expected of a perturbation model. The misclassification errors increase for all traffic types except ipsweep, teardrop, and pod. However, the differences in error are small; hence, the perturbation model helps achieve both the protection of data privacy and good classification performance of random forest.
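Table 2: OOB and misclassification errors in the input domain with the full-imbalanced data (datasetI); tuples are (correctly classified, misclassified).
Label  OOB errors  Misclassification errors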
Normal  0.0098  0.005 (13379, 70) 
Neptune  0.0098  0.003 (8256, 26) 
back  0.0098  0.025 (191, 5) 
warezclient  0.0098  0.127 (158, 23) 
ipsweep  0.0098  0.026 (691, 19) 
portsweep  0.0098  0.017 (577, 10) 
teardrop  0.0098  0.010 (186, 2) 
nmap  0.0098  0.086 (275, 26) 
satan  0.0098  0.041 (662, 29) 
smurf  0.0098  0.015 (521, 8) 
pod  0.0098  0.184 (31, 7) 
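Table 3: OOB and misclassification errors in the transform domain with the full-imbalanced data (datasetT); tuples are (correctly classified, misclassified).
Label  OOB errors  Misclassification errors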
Normal  0.0169  0.009 (13322, 127) 
Neptune  0.0169  0.009 (8205, 77) 
back  0.0169  0.041 (188, 8) 
warezclient  0.0169  0.232 (139, 42) 
ipsweep  0.0169  0.021 (695, 15) 
portsweep  0.0169  0.063 (550, 37) 
teardrop  0.0169  0.005 (187, 1) 
nmap  0.0169  0.116 (266, 35) 
satan  0.0169  0.063 (647, 44) 
smurf  0.0169  0.045 (505, 24) 
pod  0.0169  0.053 (36, 2) 
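Table 4: OOB and misclassification errors in the input domain with the reduced-imbalanced data (datasetIR); tuples are (correctly classified, misclassified).
Label  OOB errors  Misclassification errors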
Normal  0.0088  0.005 (13381, 68) 
Neptune  0.0088  0.003 (8253, 29) 
back  0.0088  0.025 (191, 5) 
warezclient  0.0088  0.127 (158, 23) 
ipsweep  0.0088  0.025 (692, 18) 
portsweep  0.0088  0.013 (579, 8) 
teardrop  0.0088  0.010 (186, 2) 
nmap  0.0088  0.093 (273, 28) 
satan  0.0088  0.044 (660, 31) 
smurf  0.0088  0.015 (521, 8) 
pod  0.0088  0.210 (30, 8) 
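Table 5: OOB and misclassification errors in the transform domain with the reduced-imbalanced data (datasetTR); tuples are (correctly classified, misclassified).
Label  OOB errors  Misclassification errors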
Normal  0.0156  0.009 (13322, 127) 
Neptune  0.0156  0.009 (8207, 75) 
back  0.0156  0.040 (188, 8) 
warezclient  0.0156  0.220 (141, 40) 
ipsweep  0.0156  0.022 (694, 16) 
portsweep  0.0156  0.061 (551, 36) 
teardrop  0.0156  0.005 (187, 1) 
nmap  0.0156  0.102 (270, 31) 
satan  0.0156  0.059 (650, 41) 
smurf  0.0156  0.039 (508, 21) 
pod  0.0156  0.053 (36, 2) 
4.3.2 Experiment with reduced-imbalanced data sets
We used datasetIR and datasetTR to compare the performance of random forest classifiers in the input and transform domains in this experiment. That is, only the 11 traffic types with more than 30 observations were classified, to study whether eliminating the traffic types with far fewer observations had any significant effect. The results are presented in Tables 4 and 5, and we can observe similar patterns between the input domain and transform domain results. Comparing the results in Tables 2 and 4, we can see that the OOB error decreased slightly (from 0.0098 to 0.0088) due to the reduced-imbalanced nature of the traffic types, as expected. Similarly, comparing the results in Tables 3 and 5, we can see a reduction in the OOB error (from 0.0169 to 0.0156) and an overall reduction in the misclassification errors.
4.4 Overall Performance Degradation
Although the results presented in the previous section allow us to compare the performance degradation of the random forest classifiers between the input domain and the transform domain, it is important to understand the overall performance degradation in order to conclude whether the proposed perturbation is meaningful. Therefore, to estimate the percentage performance degradation, we defined a simple measure:
(4.2) 
For example, the number of transform domain misclassifications for traffic type “normal” is 127 (from Table 3), and the number of input domain misclassifications for the same traffic type is 70 (from Table 2). The total number of observations in the “normal” traffic class is 13449 (Table 1). Therefore, the percentage degradation of random forest under the proposed perturbation model for the “normal” class is 100 × (127 - 70) / 13449 ≈ 0.4238233. Similarly, we calculated the percentage degradations for the other 10 traffic types with the full-imbalanced data sets and listed them in column 2 of Table 6. We also calculated the same for the reduced-imbalanced data sets and listed the results in column 3 of Table 6. Note that a positive value indicates a degradation from the input domain to the transform domain, whereas a negative value indicates an improvement. The average degradations over all the class types are 1.05% for the full-imbalanced data sets and 0.45% for the reduced-imbalanced data sets; the difference suggests that including the additional imbalanced data affects the performance negatively.
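The worked example above is consistent with a measure of the form 100 × (transform-domain misclassifications - input-domain misclassifications) / (class size). The sketch below applies that form to the per-class counts from Tables 1, 2, and 3 for the full-imbalanced data; the function and variable names are hypothetical.

```python
from statistics import mean

# class -> (observations, misclassified in input domain,
#           misclassified in transform domain), from Tables 1, 2 and 3
full_imbalanced = {
    "Normal": (13449, 70, 127),
    "Neptune": (8282, 26, 77),
    "back": (196, 5, 8),
    "warezclient": (181, 23, 42),
    "ipsweep": (710, 19, 15),
    "portsweep": (587, 10, 37),
    "teardrop": (188, 2, 1),
    "nmap": (301, 26, 35),
    "satan": (691, 29, 44),
    "smurf": (529, 8, 24),
    "pod": (38, 7, 2),
}


def percent_degradation(n_obs: int, wrong_input: int, wrong_transform: int) -> float:
    """Positive: more errors after perturbation; negative: fewer errors."""
    return 100.0 * (wrong_transform - wrong_input) / n_obs


degradation = {
    label: percent_degradation(*counts)
    for label, counts in full_imbalanced.items()
}
average = mean(degradation.values())
print(round(degradation["Normal"], 7), round(average, 6))
```

With these counts, ipsweep, teardrop, and pod come out negative, in line with the three classes whose misclassification errors did not increase.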
Table 6: Percentage performance degradation of random forest under the proposed EPA approach, per traffic class (positive: degradation; negative: improvement).
Label (t)  Full-Imb.  Reduced-Imb.
Normal  0.4238233  0.4386943
Neptune  0.6157933  0.5554214
back  1.5306122  1.5306122
warezclient  10.4972376  9.3922652
ipsweep  -0.5633803  -0.2816901
portsweep  4.5996593  4.7700170
teardrop  -0.5319149  -0.5319149
nmap  2.9900332  0.9966777
satan  2.1707670  1.4471780
smurf  3.0245747  2.4574669
pod  -13.1578947  -15.7894737
AVG. ERR.  1.054483  0.4532049
5 Competing Methods and Discussion
We selected PCA as the competing method against which to evaluate the proposed EPA approach. PCA is a classical linear transformation that transforms the original features into principal components (PCs) and hence achieves effective dimension reduction (Du and Swamy, 2014). It has been used extensively in modern applications, including atmospheric science (Jolliffe and Cadima, 2016), neuroscience (Lee et al., 2016), and neuroimaging (Jones et al., 2007). It became popular in the last two decades because developments in computer technology made it practical to apply PCA to large, high-dimensional data sets. However, it generally suffers from two major drawbacks, as reported by Bruce and Bruce (2017): its strong statistical assumptions, and the difficulty of selecting a number of PCs that achieves both dimensionality reduction and data utility.
5.1 Comparative Analysis
The results of the PCA transformation applied to the full-imbalanced NSL-KDD data are presented in Table 7; they can be compared with the results of the proposed EPA approach (applied to the same data) in the second column of Table 6. We adopted two criteria to choose the number of PCs: the eigenvalue-greater-than-1 criterion (i.e., the Kaiser-Guttman criterion), as used in (Hung et al., 2016), and the 80% cumulative variance rule, as stated in (Bruce and Bruce, 2017). The numbers of PCs selected by these criteria are 5 and 6, respectively. The random forest classification results using the first 5 PCs and the first 6 PCs of this data are presented in the second and third columns of Table 7.
5.1.1 General Analysis
The results in the second columns of Tables 6 and 7 show that the average performance degradation caused by PCA with 5 PCs (2.11%) is about double the degradation caused by the proposed EPA approach (1.05%). In contrast, the results in the third column of Table 7 suggest that a smaller degradation (0.97%) is possible if 6 PCs are used. These results indicate that PCA can achieve better classification accuracy with a higher number of PCs; however, they also suggest that the proposed approach is competitive.
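The two criteria used to choose the number of PCs can be sketched as follows. The eigenvalues below are hypothetical and are not those of the NSL-KDD data; the sketch also assumes they come from a correlation matrix, which is the usual setting for the eigenvalue-greater-than-1 criterion.

```python
from itertools import accumulate
from typing import Sequence


def kaiser_guttman(eigenvalues: Sequence[float]) -> int:
    """Number of components whose eigenvalue is greater than 1."""
    return sum(1 for value in eigenvalues if value > 1.0)


def cumulative_variance_rule(
    eigenvalues: Sequence[float], threshold: float = 0.80
) -> int:
    """Smallest number of leading components explaining >= threshold of variance."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    for count, running in enumerate(accumulate(ordered), start=1):
        if running / total >= threshold:
            return count
    return len(ordered)


# Hypothetical eigenvalues, for illustration only.
eigenvalues = [4.0, 2.5, 1.6, 1.1, 0.4, 0.2, 0.1, 0.1]
print(kaiser_guttman(eigenvalues), cumulative_variance_rule(eigenvalues))
```

As in the paper, the two rules need not agree on the number of components, which is one reason the choice of the number of PCs is listed as a drawback of PCA.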
5.1.2 Specific Analysis
In network security, the Denial-of-Service (DoS) attack is generally considered a major threat to network users and servers. Therefore, the classification of Normal traffic and DoS attacks is very important. DoS attacks include attacks such as Neptune, Back, Teardrop, Smurf, and Pod (Jin et al., 2007), and these are included in the NSL-KDD data set as well. Therefore, we calculated the average performance degradation for the Normal and DoS classes separately and obtained -1.35, 1.97, and 0.67 for EPA, PCA with 5 PCs, and PCA with 6 PCs, respectively. A negative value, as stated earlier, indicates an improvement in performance; thus, the proposed EPA is superior to PCA when the classification of DoS attacks is considered.
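The three averages quoted above can be reproduced from the per-class degradations in Tables 6 and 7. Two points in this sketch are inferences rather than statements from the paper: the teardrop and pod entries are taken as negative (improvements), and the average is taken over the Normal class together with the five DoS classes; both choices are the ones under which the tabulated magnitudes reproduce the reported numbers.

```python
from statistics import mean

# Per-class percentage degradations (full-imbalanced data). The signs of the
# teardrop and pod entries are inferred, as explained in the lead-in.
epa = {
    "Normal": 0.4238233, "Neptune": 0.6157933, "back": 1.5306122,
    "teardrop": -0.5319149, "smurf": 3.0245747, "pod": -13.1578947,
}
pca_5 = {
    "Normal": 0.3345974, "Neptune": 0.4829751, "back": 13.7755102,
    "teardrop": -0.5319149, "smurf": 5.6710775, "pod": -7.8947368,
}
pca_6 = {
    "Normal": 0.1487099, "Neptune": 0.4346776, "back": 10.7142857,
    "teardrop": -0.5319149, "smurf": 6.4272212, "pod": -13.1578947,
}

averages = {
    name: round(mean(values.values()), 2)
    for name, values in (("EPA", epa), ("PCA-5", pca_5), ("PCA-6", pca_6))
}
print(averages)
```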
In terms of invertibility, according to (Geiger, 2014), it is possible to invert PCA with an estimate of the covariance matrix; hence, PCA is relatively weaker than the proposed EPA approach for applications such as data privacy and security. However, in terms of dimension reduction, PCA can be superior to the proposed method because it can reduce the dimension by more than 50%, whereas the proposed EPA approach has a fixed 50% dimension reduction.
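Table 7: Percentage performance degradation of random forest under PCA with 5 and 6 PCs, per traffic class, for the full-imbalanced data (positive: degradation; negative: improvement).
Label (t)  Full-Imb. 5 PCs  Full-Imb. 6 PCs
Normal  0.3345974  0.1487099
Neptune  0.4829751  0.4346776
back  13.7755102  10.7142857
warezclient  7.1823204  3.3149171
ipsweep  0.4225352  0.7042254
portsweep  1.8739353  1.7035775
teardrop  -0.5319149  -0.5319149
nmap  0.9966777  0.3322259
satan  0.8683068  0.5788712
smurf  5.6710775  6.4272212
pod  -7.8947368  -13.1578947
AVG. ERR.  2.107389  0.9699002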
5.2 Evaluation using IRIS plant data set
We also used the iris plant data set to evaluate and compare the EPA and PCA transformations. It is a simple yet effective data set that has been used extensively in machine learning for several decades (Chaudhary et al., 2016; Timón et al., 2016; Lin et al., 2017). We obtained the data from the UCI Machine Learning Repository (Lichman, 2013). Random forest was applied to the original iris data, and the OOB errors are presented in the second column of Table 8. The data were then transformed into PCs using PCA; random forest classification was applied using all the PCs, and the OOB results are presented in the third column of Table 8. We also transformed the data set using the proposed EPA transformation and then applied random forest classification; the OOB results of the proposed approach are presented in the fourth column of the table. The first column of the table lists the three classes of the iris plant. Comparing the results in Table 8, we can say that the proposed transformation gives classification results closer to those of random forest applied to the original data than the principal components do.
6 Conclusion
This study allowed us to understand how perturbation models change the characteristics, or numerical patterns, of the data between the input domain and the transform domain. This knowledge helped us construct a parametric perturbation model using an elliptical transformation along with an additive Gaussian noise degradation. The degradation performance analysis, using random forest classifiers together with a blind source separation attack and quantitative measures (signal interference ratio, OOB error, and misclassification error), showed that the parametric elliptical perturbation model performed very well in the classification of network intrusion and biological data, while protecting the privacy of the patterns in the feature vectors.
Compared with classical linear transformations such as PCA, the proposed method requires fewer statistical assumptions on the data and is highly suitable for applications such as data privacy and security, because the elliptical patterns are difficult to invert from the transform domain to the input domain. In addition, we adopted a flexible blockwise dimension reduction step in the proposed method to accommodate the possible high-dimensional data () in modern applications, in which PCA is not directly applicable. The empirical results also confirmed the superior performance of the proposed EPA approach over the widely used PCA.
Several future directions remain of interest in our research agenda. First, the current paper mainly discusses the pairing of two features (block size 2) and fixed projections. It is possible to consider larger block sizes and random projections to reduce computational complexity. Second, model (3) can be extended by replacing the constraint with flexible alternatives, and by considering diagonal elliptical models.
Class  OOB: RF  OOB: RF-PCA  OOB: RF-EPA 

Setosa  0.00  0.00  0.00 
Versicolor  0.08  0.12  0.12 
Virginica  0.06  0.10  0.06 
Acknowledgments
The research of the first author was partially supported by the Department of Statistics, University of California at Irvine.
References
 Aghion et al. (2005) Aghion, P., Bloom, N., Blundell, R., Griffith, R., Howitt, P., 2005. Competition and innovation: An inverted-U relationship. The Quarterly Journal of Economics 120, 701–728.

 Boscolo et al. (2004) Boscolo, R., Pan, H., Roychowdhury, V.P., 2004. Independent component analysis based on nonparametric density estimation. IEEE Transactions on Neural Networks 15, 55–65.
 Breiman (1996) Breiman, L., 1996. Bagging predictors. Machine learning 24, 123–140.
 Breiman (2001) Breiman, L., 2001. Random forests. Machine learning 45, 5–32.
 Bruce and Bruce (2017) Bruce, P., Bruce, A., 2017. Practical Statistics for Data Scientists: 50 Essential Concepts. O’Reilly Media, Inc.
 Caiafa and Proto (2005) Caiafa, C.F., Proto, A.N., 2005. A non-Gaussianity measure for blind source separation. Proc. of SPARS05.
 Chaudhary et al. (2016) Chaudhary, A., Kolhe, S., Kamal, R., 2016. A hybrid ensemble for classification in multiclass datasets: An application to oilseed disease dataset. Computers and electronics in agriculture 124, 65–72.
 Du and Swamy (2014) Du, K.L., Swamy, M., 2014. Principal component analysis, in: Neural Networks and Statistical Learning. Springer, pp. 355–405.
 Fienberg and Steele (1998) Fienberg, S.E., Steele, R.J., 1998. Disclosure limitation using perturbation and related methods for categorical data. Journal of Official Statistics 14(4), 485–502.
 Geiger (2014) Geiger, B.C., 2014. Information loss in deterministic systems. Ph. D. Thesis, Graz University of Technology, Graz, Austria.
 Hung et al. (2016) Hung, C.C., Liu, H.C., Lin, C.C., Lee, B.O., 2016. Development and validation of the simulation-based learning evaluation scale. Nurse Education Today 40, 72–77.
 Jeyakumar et al. (2014) Jeyakumar, V., Li, G., Suthaharan, S., 2014. Support vector machine classifiers with uncertain knowledge sets via robust optimization. Optimization 63, 1099–1116.
 Jin et al. (2007) Jin, S., Yeung, D.S., Wang, X., 2007. Network intrusion detection in covariance feature space. Pattern Recognition 40, 2185–2197.
 Jolliffe and Cadima (2016) Jolliffe, I.T., Cadima, J., 2016. Principal component analysis: a review and recent developments. Phil. Trans. R. Soc. A 374, 20150202.
 Jones et al. (2007) Jones, D.G., Beston, B.R., Murphy, K.M., 2007. Novel application of principal component analysis to understanding visual cortical development. BMC neuroscience 8, P188.
 Lasko and Vinterbo (2010) Lasko, T.A., Vinterbo, S.A., 2010. Spectral anonymization of data. IEEE transactions on knowledge and data engineering 22, 437–446.
 Lee et al. (2016) Lee, S., Habeck, C., Razlighi, Q., Salthouse, T., Stern, Y., 2016. Selective association between cortical thickness and reference abilities in normal aging. NeuroImage 142, 293–300.
 Lichman (2013) Lichman, M., 2013. UCI machine learning repository. URL: http://archive.ics.uci.edu/ml.
 Lin et al. (2017) Lin, Z., Ma, D., Meng, J., Chen, L., 2017. Relative ordering learning in spiking neural network for pattern recognition. Neurocomputing, 1–13.
 Little (1993) Little, R.J., 1993. Statistical analysis of masked data. Journal of Official statistics 9(2), 407–426.
 Liu et al. (2008) Liu, K., Giannella, C., Kargupta, H., 2008. A survey of attack techniques on privacy-preserving data perturbation methods. Privacy-Preserving Data Mining, 359–381.
 Muralidhar and Sarathy (2003) Muralidhar, K., Sarathy, R., 2003. A theoretical basis for perturbation methods. Statistics and Computing 13, 329–335.
 Murthy (1998) Murthy, S.K., 1998. Automatic construction of decision trees from data: A multidisciplinary survey. Data mining and knowledge discovery 2, 345–389.
 Oliveira and Zaïane (2004) Oliveira, S.R., Zaïane, O.R., 2004. Achieving privacy preservation when sharing data for clustering, in: Workshop on Secure Data Management, Springer. pp. 67–82.
 Qian and Xie (2015) Qian, Y., Xie, H., 2015. Drive more effective databased innovations: Enhancing the utility of secure databases. Management Science 61, 520–541.

 Sørensen and De Lathauwer (2013) Sørensen, M., De Lathauwer, L., 2013. Blind signal separation via tensor decomposition with Vandermonde factor: Canonical polyadic decomposition. IEEE Transactions on Signal Processing 61, 5507–5519.
 Suthaharan (2015) Suthaharan, S., 2015. Machine Learning Models and Algorithms for Big Data Classification: Thinking with Examples for Effective Learning. volume 36. Springer.
 Suthaharan (2016) Suthaharan, S., 2016. Support vector machine, in: Machine Learning Models and Algorithms for Big Data Classification. Springer, pp. 207–235.

 Suthaharan and Panchagnula (2012) Suthaharan, S., Panchagnula, T., 2012. Relevance feature selection with data cleaning for intrusion detection system, in: Southeastcon, 2012 Proceedings of IEEE, IEEE. pp. 1–6.
 Timón et al. (2016) Timón, I., Soto, J., Pérez-Sánchez, H., Cecilia, J.M., 2016. Parallel implementation of fuzzy minimals clustering algorithm. Expert Systems with Applications 48, 35–41.
 Whitworth and Suthaharan (2014) Whitworth, J., Suthaharan, S., 2014. Security problems and challenges in a machine learning-based hybrid big data processing network systems. ACM SIGMETRICS Performance Evaluation Review 41, 82–85.
 Zarzoso and Nandi (1999) Zarzoso, V., Nandi, A., 1999. Blind source separation, in: Blind Estimation Using Higher-Order Statistics. Springer, pp. 167–252.
 Zumel et al. (2014) Zumel, N., Mount, J., Porzak, J., 2014. Practical data science with R. Manning.