Efficient privacy preservation of big data for accurate data mining

by   M. A. P. Chamikara, et al.
RMIT University

Computing technologies pervade physical spaces and human lives, and produce a vast amount of data that is available for analysis. However, there is a growing concern that potentially sensitive data may become public if the collected data are not appropriately sanitized before being released for investigation. Although numerous privacy-preserving methods are available, they are inefficient, do not scale, or compromise data utility and/or privacy. This paper addresses these issues by proposing an efficient and scalable nonreversible perturbation algorithm, PABIDOT, for privacy preservation of big data via optimal geometric transformations. PABIDOT was tested for efficiency, scalability, attack resistance, and accuracy using nine datasets and five classification algorithms. Experiments show that PABIDOT excels in execution speed, scalability, attack resistance and accuracy in large-scale privacy-preserving data classification when compared with two other related privacy-preserving algorithms.

1 Introduction

Recent advances in computer technologies have drastically increased the amount of data collected from cyber, physical and human worlds. Data collection at large scale makes sense only if the data are actionable and can be used in decision making  Witten et al. (2016). Data mining helps at this point by investigating unsuspected relationships in the data and providing useful insights to data owners. Moreover, such capabilities may often need to be shared with external parties for further analysis. In this process, various kinds of information may be revealed, which can lead to a privacy breach. The ability to share information while preventing the disclosure of personally identifiable information (PII) thus becomes an important aspect of information privacy, and it is one of the most significant technical, legal, ethical and social challenges. In fact, various governmental and commercial organizations collect vast amounts of user data, including credit information, health records, financial status, and personal preferences. Social networking, banking and healthcare systems are examples of systems that handle such private information Chamikara et al. (2018), and they often overlook privacy due to indirect use of private information. Other information systems use massive amounts of sensitive private information (also called big data) for modeling and predicting human-related phenomena such as crime  Helbing et al. (2015), epidemics  Jalili & Perc (2017) and grand challenges in social physics  Capraro & Perc (2018). Hence, privacy preservation (a.k.a. sanitization)  Vatsalan et al. (2017) can become a very complex problem that requires robust solutions Wen et al. (2018).

Privacy-preserving data mining (PPDM) offers the possibility of using data mining methods without disclosing private information. PPDM approaches include data perturbation (data modification)  Chen & Liu (2005, 2011) and encryption  Kerschbaum & Härterich (2017). Cryptographic methods are renowned for securing data, and the literature provides many examples where PPDM effectively utilizes them  Li et al. (2017). For example, we can find applications of homomorphic encryption in domains including, but not limited to, e-health, cloud computing and sensor networks Zhou et al. (2015). Secure sum, secure set union, scalar product and set intersection are a few other operations that can be used as building blocks in distributed data mining  Clifton et al. (2002). However, due to their high computational complexity, cryptographic methods cannot provide sufficient data utility Gai et al. (2016) and are impractical for PPDM. Data perturbation has lower computational complexity than cryptographic methods for privacy preservation  Chamikara et al. (2018). It maintains individual record confidentiality by applying a systematic modification to the data elements of a database  Chamikara et al. (2018). The perturbed dataset is often indistinguishable from the original dataset; e.g. an age value maps to a reasonable number, so that a third party cannot differentiate between original and perturbed ages. Examples of perturbation techniques include adding noise to the original data (additive perturbation)  Muralidhar et al. (1999), rotating the original data using a random rotation matrix (random rotation)  Chen & Liu (2005), applying both rotation and translation using a random rotation matrix and a random translation matrix (geometric perturbation)  Chen & Liu (2011), and randomizing the outputs of user responses using a random algorithm (randomized response)  Dwork et al. (2014).
A major disadvantage of these techniques is that they cannot process high volumes of data efficiently; e.g. random rotation and geometric perturbation consume a considerable amount of time to provide good results while enforcing sufficient privacy Chen & Liu (2005, 2011). Additive perturbation takes less time but provides a weaker privacy guarantee  Okkalioglu et al. (2015). The existing methods thus struggle to maintain a proper balance between privacy and utility.
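The contrast between additive and matrix-multiplicative (rotation) perturbation can be sketched with a hypothetical 2-D example; the records and noise scale below are illustrative, not from the paper. Rotation is isometric, so pairwise distances survive exactly, which is why distance-based classifiers keep their accuracy on rotation-perturbed data:

```python
import math
import random

random.seed(0)

# Toy 2-D records (already on a z-score-like scale); values are illustrative.
data = [(0.2, 1.1), (-0.5, 0.3), (1.4, -0.9)]

# Additive perturbation: independent bounded noise per value.
noisy = [(x + random.uniform(-0.3, 0.3), y + random.uniform(-0.3, 0.3))
         for x, y in data]

# Rotation perturbation: multiply each record by a 2-D rotation matrix.
theta = math.radians(35)
rotated = [(x * math.cos(theta) - y * math.sin(theta),
            x * math.sin(theta) + y * math.cos(theta)) for x, y in data]

def dist(a, b):
    return math.hypot(a[0] - b[0], a[1] - b[1])

# Rotation is isometric: pairwise distances are preserved up to rounding,
# while additive noise distorts them.
assert abs(dist(data[0], data[1]) - dist(rotated[0], rotated[1])) < 1e-9
```

Additive noise, by contrast, generally changes the distances between records, which is one source of the utility loss mentioned above.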

It is essential to define the effectiveness of a privacy-preserving approach using a privacy model, and to identify the limits of private information protection and disclosure  Chamikara et al. (2018). Earlier privacy models show vulnerability to different attacks, e.g. minimality, composition and foreground knowledge attacks Chamikara et al. (2018). Differential privacy (DP) is a privacy model known to render maximum privacy by minimizing the chance of individual record identification Dwork et al. (2014). Local differential privacy (LDP), achieved by input perturbation  Dwork et al. (2014), allows full or partial data release to analysts  Kairouz et al. (2014) by randomizing the individual instances of a database Tang et al. (2017). Global differential privacy (GDP), also called the trusted curator model, allows analysts only to request the curator to run queries on the database; the curator applies carefully calibrated noise to the query results to provide differential privacy  Dwork et al. (2014); Kairouz et al. (2014). However, both GDP and LDP fail for small datasets, as accurate estimation of the statistics shows poor results when the number of tuples is small. Although differential privacy has been studied thoroughly, only a few viable solutions exist towards full/partial data release using LDP. Most of these are solutions for categorical data, such as RAPPOR Erlingsson et al. (2014) and Local, Private, Efficient Protocols for Succinct Histograms Qin et al. (2016). DP's solid, theoretically appealing foundation for privacy protection has limited the practicality of implementing efficient solutions for big data. Furthermore, existing LDP algorithms involve a significant amount of noise addition (i.e. randomization), resulting in low data utility. Accordingly, utility and privacy often appear as conflicting factors, and improved privacy usually entails reduced utility.

The main contribution of this paper is a new Privacy preservation Algorithm for Big Data Using Optimal geometric Transformations (PABIDOT). PABIDOT is an irreversible input perturbation mechanism with a new privacy model which facilitates full data release. We prove that the proposed privacy model provides an empirical privacy guarantee against data reconstruction attacks. PABIDOT is substantially faster than comparable methods; it sequentially applies random axis reflection, noise translation, and multidimensional concatenated subplane rotation, followed by randomized expansion and random tuple shuffling for further randomization. Randomized expansion is a novel method to increase the positiveness or the negativeness of a particular data instance. PABIDOT's memory overhead is comparatively close to that of other solutions, and it provides better attack resistance, classification accuracy, and excellent efficiency on big data. We tested PABIDOT using nine generic datasets retrieved from the UCI machine learning data repository (http://archive.ics.uci.edu/ml/index.php) and the OpenML machine learning data repository (https://www.openml.org); the results were compared against two alternatives, random rotation perturbation (RP)  Chen & Liu (2005) and geometric perturbation (GP)  Chen & Liu (2011), which are known to provide high utility in terms of privacy-preserving classification. Our study shows that PABIDOT always converges to an approximately optimal perturbation. PABIDOT produces the best empirical privacy possible by determining the globally optimal perturbation parameters for the dataset of interest. The source code of the PABIDOT project is available at https://github.com/chamikara1986/PABIDOT.

The rest of the paper is organized as follows. Section 2 provides a summary of related work. The technical details of PABIDOT are described in Section 3, which also presents the basic flow of the algorithm, referred to as PABIDOT_basic for convenience. The efficiency optimization of PABIDOT is discussed in Section 4, at the end of which the main algorithm (PABIDOT) with optimized efficiency is introduced. Section 5 presents the experimental settings and provides a comparative analysis of the performance and attack resistance of PABIDOT. The results are discussed in Section 6, and the paper is concluded in Section 7.

2 Literature Review

Privacy protection of individuals has become a challenging task with the proliferation of Internet-enabled consumer technologies. The literature shows different approaches to this challenge: while some concentrate on increasing awareness Buccafurri et al. (2016), others employ different techniques to enforce individual privacy Wei et al. (2018). Above all, the massive volumes of big data introduce many challenges to privacy preservation Cuzzocrea (2015). Although the security and privacy concerns of big data are not entirely new, they require attention due to the specifics of the environments and dynamics put forward by the devices used Kieseberg & Weippl (2018). The advancement of these environments and the diversity of devices introduce increased complexity and make security and privacy preservation harder. To counter these diverse challenges and complexities, three different technological approaches can be observed: disclosure control, privacy-preserving data mining (PPDM) and privacy-enhancing technologies  Torra (2017a). Attribute-based encryption, controlling access via authentication, temporal and location-based access control and constraint-based protocols are some mechanisms used for improving the privacy of systems in dynamic environments Chamikara et al. (2018). Among the various approaches to privacy-preserving data mining, data perturbation is often preferred due to its simplicity and efficiency  Aldeen et al. (2015). Both input and output perturbation are used: output perturbation is based on noise addition and rule hiding, while input perturbation is conducted either by noise addition Muralidhar et al. (1999) or multiplication  Chamikara et al. (2018). Input perturbation can be divided further into unidimensional and multidimensional perturbation Okkalioglu et al. (2015). Additive perturbation  Muralidhar et al. (1999), randomized response Dwork et al. (2014), swapping  Hasan et al. (2016) and microaggregation Torra (2017b) are examples of unidimensional input perturbation, whereas condensation Aggarwal & Yu (2004), random rotation  Chen & Liu (2005), geometric perturbation Chen & Liu (2011), random projection  Liu et al. (2006), and hybrid perturbation are multidimensional Aldeen et al. (2015).

In additive perturbation, random noise is added to the original data in such a way that the underlying statistical properties of the attributes are preserved. A significant problem with this approach is the low utility of the resulting data  Agrawal & Srikant (2000). Additionally, effective noise reconstruction techniques developed in response can significantly reduce the level of privacy  Okkalioglu et al. (2015). Randomization techniques such as randomized response are another approach  Dwork et al. (2014), e.g. randomizing the responses of interviewees in order to preserve their privacy. Due to the high randomization of input data, such techniques often provide high privacy, whereas the utility in terms of estimating statistics or conducting analyses can be low  Dwork et al. (2014). Microaggregation is based on confidentiality rules that allow the publication of microdata sets. It divides the dataset into clusters of elements and replaces the values in each cluster with the centroid of the cluster. Microaggregation applied to a single variable (univariate microaggregation) is vulnerable to transparency attacks when the published data include information about the protection method and its parameters  Torra (2017b). Multivariate microaggregation has also been proposed, but it is complex and has been proven to be NP-hard  Torra (2017b). In condensation, the input dataset is divided into multiple groups of a pre-defined size in such a way that the difference between records within a group is minimal and a certain level of statistical information about the records is maintained in each group. Sanitized data are then generated from a uniform random distribution based on the eigenvectors obtained from the eigendecomposition of the covariance matrix of each homogeneous group  Aggarwal & Yu (2004). Condensation has a significant shortcoming in that it may degrade the quality of data considerably.
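The classic Warner design behind randomized response can be sketched in a few lines of Python; the population size, truth rate, flip probability and seed below are illustrative, not taken from any cited work. Each respondent reports the truth with probability p and the opposite otherwise, and the analyst later unbiases the observed rate:

```python
import random

def randomized_response(truth, p, rng):
    # Report the true answer with probability p, otherwise flip it.
    return truth if rng.random() < p else not truth

rng = random.Random(1)
truths = [True] * 300 + [False] * 700      # true population rate: 30%
p = 0.75
reports = [randomized_response(t, p, rng) for t in truths]

# Unbias the observed "yes" rate using P(yes) = p*pi + (1-p)*(1-pi).
obs = sum(reports) / len(reports)
estimate = (obs - (1 - p)) / (2 * p - 1)
```

This illustrates the utility trade-off noted above: each individual answer is deniable, but aggregate statistics remain estimable only with extra variance.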

Random rotation perturbation, geometric data perturbation, and random projection perturbation are three types of matrix-multiplicative methods  Okkalioglu et al. (2015). In random rotation, the original data matrix is multiplied by a random rotation matrix that has the properties of an orthogonal matrix. The application of rotation is repeated until the algorithm converges to the desired level of privacy  Chen & Liu (2005). In geometric data perturbation, a random translation matrix is incorporated into the perturbation process in order to enhance privacy. The method combines three components: rotation perturbation, translation perturbation, and distance perturbation  Chen & Liu (2011). The main idea of random projection perturbation is to project data from a high-dimensional space onto a randomly chosen low-dimensional subspace  Liu et al. (2006). Due to the isometric nature of the transformations, random rotation perturbation, geometric data perturbation, and random projection perturbation are capable of preserving the distances between tuples in a dataset  Chen & Liu (2005, 2011); Liu et al. (2006); accordingly, they provide high utility w.r.t. classification and clustering. Hybrid perturbation uses both matrix-multiplicative and matrix-additive properties and is quite similar to geometric perturbation  Aldeen et al. (2015). These algorithms have high computational complexity and are time-consuming, which makes them unsuitable for big datasets.

Due to its explicit notion of a strong privacy guarantee, differential privacy has attracted much attention. Although LDP permits full or partial data release and the analysis of privacy-protected data Dwork et al. (2014); Kairouz et al. (2014), LDP algorithms are still at a fundamental stage when it comes to the privacy preservation of real-valued numerical data. The complexity of selecting the domain of randomization with respect to a single data instance is still a challenge Erlingsson et al. (2014). In GDP, the requirement of a trusted curator who enforces differential privacy by applying noise or randomization can be considered a primary issue Dwork et al. (2014). The fundamental mechanisms used to obtain differential privacy include the Laplace mechanism, the Gaussian mechanism  Dwork et al. (2014), the geometric mechanism, randomized response, and staircase mechanisms  Kairouz et al. (2014). The necessity of a trusted third party in GDP and the application of extremely high noise in LDP are inherent shortcomings that directly affect the balance between privacy and utility of these practical differentially private approaches.
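As a concrete illustration of the GDP/trusted-curator setting, the Laplace mechanism for a counting query (sensitivity 1, noise scale 1/epsilon) can be sketched as follows; the dataset, epsilon value and seed are hypothetical examples, not from the paper:

```python
import math
import random

def laplace_noise(scale, rng):
    # Inverse-CDF sampling from Laplace(0, scale).
    u = rng.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1 - 2 * abs(u))

def private_count(values, predicate, epsilon, rng):
    # A counting query has sensitivity 1, so the noise scale is 1/epsilon.
    true_count = sum(1 for v in values if predicate(v))
    return true_count + laplace_noise(1.0 / epsilon, rng)

rng = random.Random(42)
ages = [23, 35, 41, 29, 52, 61, 38]
noisy_count = private_count(ages, lambda a: a > 40, epsilon=0.5, rng=rng)
```

The curator returns only the noisy answer; smaller epsilon means a larger noise scale and hence lower utility, which is exactly the privacy-utility tension described above.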

Many previously proposed privacy preservation methods, including data perturbation, perform poorly with high dimensional datasets. The necessary computing resources grow fast as the number of attributes and number of instances increase even though the performance is good for low dimensional data. This quality is often referred to as “The Dimensionality Curse”  Chamikara et al. (2018). Large datasets also provide extra information to attackers, as higher dimensions help in utilizing background knowledge to identify individuals  Bettini & Riboni (2015).

Most privacy-preserving algorithms have problems with balancing privacy and utility. Data privacy focuses on the difficulty of estimating the original data from the sanitized data, while utility concentrates on preserving application-specific properties/information  Aggarwal (2015). A generic way of measuring the utility of a privacy-preserving method is to investigate perturbation biases  Wilson & Rosen (2008). Data perturbation bias means that the result of a query on the perturbed data is significantly different from the result generated for the same query on the original data. Wilson and Rosen examined different data perturbation methods against various bias measures  Wilson & Rosen (2008), namely Type A, B, C, D, and Data Mining (DM) bias. Type A bias occurs when the perturbation of a given attribute causes summary measures to change. Type B bias results from the perturbation changing the relationships between confidential attributes, while in the case of Type C bias the relationship between confidential and non-confidential attributes changes. Type D bias means that the underlying distribution of the data was affected by the sanitization process. If Type DM bias exists, data mining tools perform less accurately on the perturbed data than they would on the original dataset. It has been noted that privacy preservation mechanisms decrease utility in general, and finding a trade-off between privacy protection and data utility for big data is an important issue  Xu et al. (2015).

In the literature, there is a dearth of efficient privacy preservation methods that provide reliable data utility and are scalable enough to handle rapidly growing data. Existing methods also suffer from high levels of uncertainty, bias, and low attack resistance. To address the issues presented by big data, there is an urgent need for methods that are scalable, efficient and robust. New methods should overcome the aforementioned weaknesses of existing PPDM methods and provide solutions for large-scale privacy-preserving data mining.

3 Proposed Algorithm: PABIDOT

PABIDOT perturbs a dataset by using multidimensional geometric transformations (reflection, translation, and rotation), followed by randomized expansion (a new noise addition mechanism explained later in this section) and random tuple shuffling. Figure 1 shows the basic flow and architecture of the proposed perturbation algorithm. Based on the proposed privacy model, the algorithm aims at optimal privacy protection against data reconstruction attacks. PABIDOT achieves this by selecting the best possible perturbation parameters based on the properties of the input dataset. Figure 1 also shows the position of PABIDOT in a privacy-preserving big data release scenario. PABIDOT assumes that the original data can be accessed only by the owner/administrator of that dataset. There can be complementary releases of perturbed versions of the original dataset, but the original dataset will not be released to third-party users under any circumstances.

Figure 1: Basic flow and architecture of PABIDOT. In this setting, the data owner is the trusted curator who owns the original dataset and is located at the local edge of a cloud computing scenario. The orange boxes represent the main steps of the algorithm, whereas the green boxes represent the intermediate data-generation steps that support the corresponding main steps.
Rationale and technical novelty

PABIDOT applies geometric transformations with optimal perturbation parameters and increases randomness using randomized expansion followed by a random tuple shuffle. It defines privacy in such a way that the resulting dataset has an optimal difference from the original dataset, as prescribed by the privacy model used by PABIDOT. This property helps minimize the search space and find the best possible perturbation for a particular dataset. Consequently, the efficiency and reliability of PABIDOT in big data perturbation increase while providing better resistance to data reconstruction. Figure 1 and Algorithm 3 depict the proposed perturbation algorithm, and Table 1 provides a summary of the notations used in Algorithms 2 and 3. As shown, the original dataset and the standard deviation of the normal random noise used in randomized expansion are the only inputs to Algorithm 3, and the perturbed dataset is its only output.

Data matrix (D)

The dataset to be perturbed is represented as a matrix D of size m × n, where the n columns represent the attributes and the m rows represent the records. For example, the personal information of a patient can be represented as a record with attributes such as age, weight, height, and gender. The data matrix is assumed to contain numerical data only.


In the process of perturbation, the data matrix is subjected to multidimensional geometric composite transformations. During these transformations, a record (row) in the data matrix will be considered as a point in the multi-dimensional Cartesian coordinate system.

Multidimensional isometric transformations

Geometric translation, rotation and reflection are isometric transformations in Euclidean space. A transformation T is said to be isometric if it preserves distances, i.e. d(T(x), T(y)) = d(x, y) for all points x and y  Maruskin (2012).


All matrices and Cartesian points are represented in homogeneous coordinate form so that all transformations can be expressed as matrix multiplications. A homogeneous coordinate point in n-dimensional space can be written as an (n+1)-dimensional position vector with an additional term equal to 1. The introduction of homogeneous coordinates enables composite transformations to be formed by multiplying the transformation matrices, without having to perform the individual transformations as a sequential process. Therefore, multidimensional geometric translation, reflection and rotation can be represented in their generalized matrix forms  Jones (2012).

A composite operation is performed when several transformation matrices have to be applied in a particular transformation. If transformation matrices T_1, T_2, ..., T_k are sequentially applied to a homogeneous matrix H, the composite operation is given by T_k ··· T_2 T_1 H.
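A minimal 2-D sketch of this composite-operation idea (the translation offsets, angle and point are arbitrary): build homogeneous translation and rotation matrices, multiply them into one composite matrix, and check that applying the composite matrix equals applying the transformations sequentially:

```python
import math

def matmul(A, B):
    # Plain-Python matrix product for small homogeneous matrices.
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

theta = math.radians(30)
tx, ty = 0.4, -0.2

# Homogeneous 3x3 matrices for 2-D translation and rotation.
T = [[1, 0, tx], [0, 1, ty], [0, 0, 1]]
R = [[math.cos(theta), -math.sin(theta), 0],
     [math.sin(theta),  math.cos(theta), 0],
     [0, 0, 1]]

# One composite matrix: translation applied first, then rotation.
M = matmul(R, T)

point = [[0.5], [1.0], [1.0]]            # homogeneous column vector (x, y, 1)
composite = matmul(M, point)
stepwise = matmul(R, matmul(T, point))   # same result, applied sequentially
```

The homogeneous last component stays 1 throughout, which is what lets translation be folded into a single matrix product alongside rotation and reflection.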

Homogeneous data matrix

All records in the input data matrix D are converted to homogeneous coordinates by adding a new column of ones after the nth column. The resulting homogeneous representation of the data matrix is given by Equation 4.

The input dataset is first subjected to z-score normalization  Kabir et al. (2015) in order to give all attributes equal weight in the transformations. Next, the n-dimensional translation matrix is generated according to Equation 5, in which the translational coefficients are drawn from uniformly distributed random noise and n equals the number of attributes in the input dataset. Due to z-score normalization, each attribute mean becomes 0 and each standard deviation becomes 1; the noise generated by the uniform random noise function is therefore bounded accordingly, with each translational coefficient lying within those bounds.
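The column-wise z-score normalization described above can be sketched in pure Python (the example matrix is illustrative; population standard deviation is assumed, matching the stated unit-variance property):

```python
from statistics import mean, pstdev

def zscore(matrix):
    # Column-wise z-score: each attribute ends up with mean 0 and unit
    # (population) standard deviation, giving all attributes equal weight.
    cols = list(zip(*matrix))
    mus = [mean(c) for c in cols]
    sds = [pstdev(c) for c in cols]
    return [[(v - mu) / sd for v, mu, sd in zip(row, mus, sds)]
            for row in matrix]

normalized = zscore([[20, 150], [30, 160], [40, 170]])
```

After this step, every attribute contributes on the same scale, so a single rotation angle and translation bound apply uniformly across attributes.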

n-Dimensional translation matrix

The homogeneous translation matrix can be derived as shown in Equation 5  Jones (2012).

n-Dimensional reflection matrix

Next, for each of the axes, PABIDOT generates the corresponding reflection matrix according to Equation 7. The homogeneous matrix for reflection across axis 1 can be derived as shown in Equation 6  Jones (2012).


The (n+1) axis reflection matrix can be written as shown in Equation 7.

n-Dimensional rotation matrix

After creating the reflection matrix for one of the n axes, PABIDOT generates the n-dimensional concatenated subplane rotation matrices (for the current axis) using Algorithm 1 for each rotation angle. In order to derive a single matrix that represents the entire orientation, the concatenated subplane rotation method is used: the rotation in the plane of a pair of coordinate axes can be written as a block matrix, as in Equation 8  Paeth (2014). The distinct subplane rotation matrices are then concatenated in the preferred order to produce the final composite n-dimensional rotation matrix, which is obtained using Equation 9 and parameterized by the rotation angle.

Algorithm for generating the multidimensional concatenated sub-plane rotation matrix

Based on Equations 8 and 9, Algorithm 1 can be used to generate the multidimensional concatenated subplane rotational matrix of the desired rotation angle.

The resulting rotation matrix R has the properties of an orthogonal matrix: the columns and rows of the concatenated subplane rotation matrix are orthonormal, and hence R preserves the relationship R R^T = R^T R = I, where R^T is the transpose of R and I is the identity matrix.

1: Inputs:
2:    n : number of attributes of the dataset
3:    θ : rotation angle
4: Output:
5:    R : multidimensional concatenated subplane rotation matrix of size (n+1)
6: R = identity matrix of size (n+1)
7: for each pair of coordinate axes (i, j) with 1 ≤ i < j ≤ n do
8:    G = identity matrix of size (n+1)   ▷ subplane rotation for this axis pair
9:    G(i, i) = cos θ                     ▷ assignment set 1
10:   G(i, j) = −sin θ                    ▷ assignment set 2
11:   G(j, i) = sin θ                     ▷ assignment set 3
12:   G(j, j) = cos θ                     ▷ assignment set 4
13:   R = G × R   ▷ iterative multiplication of the subplane rotations to form R according to Equations 8 and 9
14: end for
15: return R      ▷ R follows Equations 8 and 9
End Algorithm
Algorithm 1 Algorithm for generating the multidimensional concatenated subplane rotation matrix
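Algorithm 1 can be sketched in pure Python as a product of Givens-style subplane rotations, one per distinct axis pair, in homogeneous (n+1)-dimensional form; the concatenation order used here is one plausible choice and the helper names are ours, not the paper's. The orthogonality property R R^T = I stated above can then be checked numerically:

```python
import math

def identity(size):
    return [[1.0 if r == c else 0.0 for c in range(size)] for r in range(size)]

def matmul(A, B):
    return [[sum(A[r][k] * B[k][c] for k in range(len(B)))
             for c in range(len(B[0]))] for r in range(len(A))]

def subplane_rotation(n, i, j, theta):
    # Homogeneous (n+1)x(n+1) matrix rotating only in the (i, j) plane.
    G = identity(n + 1)
    G[i][i] = math.cos(theta); G[i][j] = -math.sin(theta)
    G[j][i] = math.sin(theta); G[j][j] =  math.cos(theta)
    return G

def concatenated_rotation(n, theta):
    # Multiply the subplane rotations for every distinct axis pair i < j.
    R = identity(n + 1)
    for i in range(n):
        for j in range(i + 1, n):
            R = matmul(subplane_rotation(n, i, j, theta), R)
    return R

R = concatenated_rotation(4, math.radians(45))
```

Because every factor is a rotation, the product is orthogonal regardless of the concatenation order, so the distance-preserving property of the perturbation is unaffected by that choice.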
Privacy metric for generating the optimal perturbation parameters

Since the proposed method is based on multidimensional isometric transformations, it is important to use a multi-column privacy metric to evaluate its privacy. Assuming that all attributes of the dataset are equally important, z-score normalization is applied to the data matrix as the initial step of the perturbation process. The higher the privacy of the perturbed data, the more difficult it is to estimate the original data  Chen & Liu (2005). To capture this idea, the variance of the difference between the perturbed and non-perturbed datasets is considered: the higher this variance, the higher the privacy. Hence, it provides a measure of the privacy of the perturbed data, i.e. the level of difficulty of estimating the original data without prior knowledge about it  Chen & Liu (2005), which is often called naive inference/estimation  Chen & Liu (2005). The variance of the difference has long been used to measure the level of privacy of perturbed data  Muralidhar et al. (1999). In the proposed method, the attribute that returns the minimum variance of the difference is considered to define the minimum privacy guarantee. If D_j^p is the perturbed data series of attribute D_j, the level of privacy of the perturbation method can be measured using Var(D_j − D_j^p). Therefore, the privacy metric can be written as in Equation 10.

We propose a new privacy definition as follows. Assume we have a dataset D of m instances, each instance having n attributes. We perturb this dataset in a number of different ways, producing a collection of perturbed datasets D^p. We then calculate the difference D − D^p between the original data and the perturbed data, calculate the variance of each attribute of D − D^p, and select the minimum of these variances.

From the candidate perturbations, we choose the one that has the largest value of this minimum variance, i.e. we choose the perturbation that produces the most significant difference between the original dataset and the perturbed dataset.
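The per-attribute variance-of-difference measure and the resulting minimum privacy guarantee can be sketched as follows; the toy matrices are illustrative and population variance is assumed:

```python
from statistics import pvariance

def min_privacy_guarantee(original, perturbed):
    # Per-attribute variance of (original - perturbed); the attribute with
    # the smallest variance is the easiest to estimate, so it defines the
    # minimum privacy guarantee of this perturbed instance.
    n_attrs = len(original[0])
    variances = [pvariance([row[j] - prow[j]
                            for row, prow in zip(original, perturbed)])
                 for j in range(n_attrs)]
    return min(variances)

orig = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]]
pert = [[1.5, 1.0], [2.5, 2.0], [2.0, 3.0]]
guarantee = min_privacy_guarantee(orig, pert)
```

Taking the minimum over attributes ties the guarantee to the most vulnerable column, which is exactly the conservative choice the definition above makes.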

Definition 3.0.

Given the dataset D and a perturbation algorithm, we shall denote a perturbed instance of D as D_i^p for i = 1, ..., w, where w represents the number of possible ways of perturbing D. Then, the minimum privacy guarantee of D_i^p is defined as the minimum, over all attributes, of the variance of the difference between the original and perturbed values of that attribute, and the optimal privacy guarantee is the maximum of these minimum privacy guarantees over all w perturbed instances. A perturbed dataset that attains the optimal privacy guarantee provides the best privacy achievable under the algorithm.

For a perturbed data matrix D^p of size m × n, the minimum privacy guarantee is the minimum of the variances calculated for the n attributes based on the differences between the original and perturbed z-score normalized values, as shown in Equation 14, where n is the total number of attributes in D or D^p.

Identifying the best perturbation parameters

In each iteration, the algorithm maximizes the minimum privacy guarantee to generate the optimal perturbation, as given in Equation 17. With the axis of reflection varying from 1 to n (the number of attributes) and the angle of rotation θ varying from 0 to 179 degrees, the perturbation returns a collection of perturbed data matrices with different levels of perturbation. These form the matrix of local minimum privacy guarantees given in Equation 15.

To get the global minimum guarantee values for each angle, we take the minimum of each column, as given in Equation 16.

The perturbed data matrix with optimal privacy is the one that returns the largest value of the minimum privacy guarantee. Therefore, the largest global minimum privacy guarantee is selected from Equation 16 to obtain the optimal perturbation (Equation 17). In other words, the best perturbation parameters are selected using the highest privacy guarantee attainable for the most vulnerable attribute.
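The parameter search described above can be illustrated with a hypothetical 2-D perturbation (reflection of one axis followed by rotation) standing in for the full pipeline; the data, the stand-in perturbation and the helper names are ours, while the grid mirrors the axes 1..n and angles 0..179 described in the text:

```python
import math
from statistics import pvariance

def perturb2d(data, axis, theta):
    # Hypothetical 2-D stand-in for the pipeline: reflect one axis, rotate.
    out = []
    for x, y in data:
        if axis == 0:
            x = -x
        else:
            y = -y
        out.append((x * math.cos(theta) - y * math.sin(theta),
                    x * math.sin(theta) + y * math.cos(theta)))
    return out

def min_guarantee(orig, pert):
    # Minimum per-attribute variance of the difference (the weakest column).
    return min(pvariance([o[j] - p[j] for o, p in zip(orig, pert)])
               for j in range(2))

data = [(0.3, 1.2), (-0.7, 0.4), (1.1, -0.8), (0.0, 0.5)]

# Exhaustive grid over reflection axis and rotation angle in degrees; keep
# the pair that maximizes the minimum per-attribute guarantee.
best = max(((min_guarantee(data, perturb2d(data, a, math.radians(t))), a, t)
            for a in (0, 1) for t in range(180)), key=lambda r: r[0])
```

The exhaustive grid is what makes the selected parameters globally optimal for the dataset at hand, at the cost of evaluating every axis-angle pair.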

Generating the rotation and reflection matrices from the best perturbation parameters

Next, PABIDOT records the angle of rotation and the axis of reflection at the optimum. The algorithm then uses the recorded axis to generate the reflection matrix according to Equation 7, and the recorded angle to generate the rotation matrix according to Algorithm 1. The composite transformation of reflection, translation, and rotation is then applied to the z-score normalized matrix, using the optimal reflection, translation, and optimal rotation matrices, to generate the perturbed dataset.

Application of z-score normalization and the transformations: reflection, translation and rotation

After generating the concatenated subplane rotation matrix, PABIDOT applies the composite transformation of reflection, translation, and rotation to the z-score normalized input data matrix. Rotation is applied after reflection and noise translation because the effect of rotation is proportional to the distance from the origin, and we want to reduce the probability of points close to the origin being attacked due to weaker perturbation Chen & Liu (2011). An instance of the application of the transformation to the input data matrix ($X$), in the order of application, can be represented using Equation 18:

$$X^p = R_{\theta_{opt}} \times T \times A_{a_{opt}} \times X^T \qquad (18)$$
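A minimal sketch of the composite transformation, assuming homogeneous coordinates so that translation is also a matrix product, and with a single-subplane rotation standing in for the concatenated subplane rotation produced by Algorithm 1 (all names here are illustrative):

```python
import numpy as np

def composite_perturb(X, theta_deg, axis, noise):
    """Composite transformation (Equation 18, sketch): reflection
    about the chosen axis, translation by uniform random noise, then
    rotation, applied in homogeneous coordinates."""
    n, m = X.shape
    A = np.eye(m + 1)
    A[axis, axis] = -1.0               # reflection (cf. Equation 7)
    T = np.eye(m + 1)
    T[:m, m] = noise                   # translation (cf. Equation 5)
    R = np.eye(m + 1)                  # one subplane rotation only;
    t = np.deg2rad(theta_deg)          # Algorithm 1 concatenates one
    c, s = np.cos(t), np.sin(t)        # rotation per attribute pair
    R[0, 0], R[0, 1], R[1, 0], R[1, 1] = c, -s, s, c
    Xh = np.hstack([X, np.ones((n, 1))])   # homogeneous coordinates
    Xp = (R @ T @ A @ Xh.T).T              # rotation applied last
    return Xp[:, :m]

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 3))
noise = rng.uniform(-0.2, 0.2, size=3)
Xp = composite_perturb(X, theta_deg=45.0, axis=0, noise=noise)
```

Applying the rotation last means the translational noise has already moved points away from the origin, which is exactly the ordering rationale given above.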

Randomized Expansion

A privacy-preserving algorithm is composable if it keeps satisfying the privacy requirements of the privacy model in use after repeated independent applications of the algorithm Soria-Comas & Domingo-Ferrer (2016). To improve the composability and the randomness of the perturbation algorithm, noise drawn from a random normal distribution (with a mean of 0 and a predefined standard deviation, defaulting to 0.3) is added to the data according to a novel approach named randomized expansion. Here, we introduce noise in such a way that it further enhances the positiveness or the negativeness of a particular value, while zeros are not subjected to any change, as depicted in Figure 2.


Figure 2: Effect of randomized expansion. The red arrows on the right-hand side show a positive shift, where a calibrated positive random value is added to a positive value to increase its positiveness. The blue arrows on the left-hand side show a negative shift, where a calibrated negative random value is added to a negative value to increase its negativeness.

In order to generate the noise for randomized expansion, we first generate the sign matrix ($S$) based on the values of $X^p$: a value in $S$ will be 1 if the corresponding element of $X^p$ is greater than 0, 0 if the corresponding element equals 0, and -1 if the corresponding element is less than 0 (for a complex element $x$, the sign is taken as $x/|x|$). Next, the absolute values of $X^p$ and the absolute values of the random normal noise matrix ($N$) are added together, and the element-wise product with $S$ is calculated, as denoted in Equation 19:

$$X^p = S \circ \big(|X^p| + |N|\big) \qquad (19)$$
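Randomized expansion as described above can be sketched as follows; the function name and default noise level mirror the description, but this is an illustrative sketch rather than the reference implementation:

```python
import numpy as np

def randomized_expansion(Xp, sigma=0.3, rng=None):
    """Equation 19 (sketch): push every value further from zero by
    the magnitude of zero-mean normal noise; exact zeros stay zero."""
    rng = rng or np.random.default_rng()
    S = np.sign(Xp)                          # sign matrix: 1, 0, or -1
    N = rng.normal(0.0, sigma, Xp.shape)     # random normal noise
    return S * (np.abs(Xp) + np.abs(N))      # element-wise product

Xp = np.array([[1.0, -2.0, 0.0]])
out = randomized_expansion(Xp, 0.3, np.random.default_rng(0))
```

Because `|noise|` is always added to `|value|`, a positive entry can only grow and a negative entry can only shrink, matching the shifts depicted in Figure 2.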

Application of reverse z-score normalization and random tuple swapping

At this stage, the attributes of the perturbed data matrix are not within the ranges of the attributes of the original dataset. Therefore, reverse z-score normalization is applied to the current matrix, using the standard deviations ($\sigma$) and means ($\mu$) of the attributes of the original dataset. Finally, the tuples of the resulting matrix are randomly swapped to generate the final perturbed dataset $D^p$.
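These two final steps can be sketched as follows (the helper name is illustrative):

```python
import numpy as np

def finalize(Xp, mu, sigma, rng=None):
    """Final steps (sketch): reverse z-score normalization using the
    original attribute means and standard deviations, then randomly
    swap (shuffle) the tuples of the perturbed matrix."""
    rng = rng or np.random.default_rng()
    D = Xp * sigma + mu               # reverse z-score normalization
    rng.shuffle(D)                    # random tuple (row) swapping
    return D

mu = np.array([5.0, 10.0])
sigma = np.array([2.0, 3.0])
Xp = np.zeros((4, 2))                 # degenerate toy input
D = finalize(Xp, mu, sigma, np.random.default_rng(0))
```

Shuffling permutes whole rows only, so the per-attribute statistics restored by the reverse normalization are untouched; only the tuple order is randomized.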

Notation | Summary/Constraints
$D$ | original dataset
$\Phi_{opt}$ | optimal privacy guarantee
$a_{opt}$ | optimal axis of reflection, selected at $\Phi_{opt}$
$\theta_{opt}$ | optimal angle of rotation, selected at $\Phi_{opt}$
$n$ | number of tuples
$m$ | number of attributes
$X$ | z-score normalized dataset of $D$
$C$ | covariance matrix of $X$
$T$ | transformation matrix with uniform random noise
$A$ | reflection matrix for the current axis (generated according to Equation 7); $a \in \{1,\dots,m\}$ and $\theta \in \{0,\dots,179\}$
$R_\theta$ | concatenated subplane rotation matrix generated using Algorithm 1 for the current angle $\theta$
$R_{\theta_{opt}}$ | concatenated subplane rotation matrix generated using Algorithm 1 for $\theta_{opt}$
$A_{a_{opt}}$ | reflection matrix generated according to Equation 7 for $a_{opt}$
$\sigma$ | vector of standard deviations of all the attributes of $D$
$\mu$ | vector of mean values of all the attributes of $D$
Table 1: Summary of the notations used in the proposed algorithm.

3.1 Overall process of PABIDOT

Algorithm 2 summarizes the overall set of steps of the basic perturbation algorithm. A summary of notations used in this algorithm is provided in Table 1. For convenience, we refer to Algorithm 2 as PABIDOT_basic in subsequent sections.

1: Inputs:
2:   $D$: original dataset
3:   $\sigma_n$: input noise standard deviation (default value = 0.3)
4: Outputs:
5:   $D^p$: perturbed dataset
6: generate $X$ by applying z-score normalization on $D$
7: generate $T$ according to Equation 5, using uniform random noise as the translational coefficients
8: for each axis $a$ in $X$ do (assume there are $m$ attributes in $X$)
9:    generate $A$ according to Equation 7
10:   for each $\theta \in \{0, \dots, 179\}$ do
11:      generate $R_\theta$ using Algorithm 1 (refer Equations 8 and 9)
12:      $X^p = R_\theta \times T \times A \times X^T$ (follows Equation 18)
13:      compute $\Phi_{min}(a, \theta)$ and store it in $\Phi$, according to Equation 14
14:   end for
15: end for
16: for each $\theta$ do
17:    $\Phi^g_\theta = \min_a \Phi_{min}(a, \theta)$ (follows Equation 16)
18: end for
19: $\Phi_{opt} = \max_\theta \Phi^g_\theta$ (according to Equation 17)
20: $\theta_{opt} = \theta$ at $\Phi_{opt}$
21: $a_{opt} = a$ at $\Phi_{opt}$
22: generate $R_{\theta_{opt}}$ according to Algorithm 1, using $\theta_{opt}$
23: generate $A_{a_{opt}}$ according to Equation 7, using $a_{opt}$
24: $X^p = R_{\theta_{opt}} \times T \times A_{a_{opt}} \times X^T$ (follows Equation 18)
25: $X^p = S \circ (|X^p| + |N|)$ (according to Equation 19)
26: $D^p = X^p \sigma + \mu$ (reverse z-score normalization)
27: randomly swap the tuples of $D^p$
28: End Algorithm
Algorithm 2 Basic steps of PABIDOT

4 Optimizing the efficiency of PABIDOT

During the execution of PABIDOT, the composite transformations in steps 12 and 13 of Algorithm 2 need to be applied to the whole dataset a total of $m \times 180$ times. This can be infeasible for high-dimensional datasets. To accelerate the computations, a method based on the covariance matrix ($C$) of the input dataset was used to generate the minimum privacy guarantees. This eliminates the necessity of searching through a large number of tuples of a big dataset in each loop to find $\Phi_{min}$, as it only needs the covariance matrix of $X$. The steps of this derivation are provided below.


Since $X$ is z-score normalized, its covariance matrix is equal to $C = \frac{1}{n-1} X^T X$. For a large number of records, $C \approx \frac{1}{n} X^T X$.
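This approximation is easy to check numerically; the sketch below assumes z-score normalization with the population standard deviation:

```python
import numpy as np

# For z-score normalized X, the sample covariance matrix is
# X^T X / (n - 1), which approaches X^T X / n for a large number of
# records -- the identity the efficiency shortcut relies on.
rng = np.random.default_rng(0)
X = rng.standard_normal((10000, 4))
X = (X - X.mean(axis=0)) / X.std(axis=0)      # z-score normalize
cov_exact = np.cov(X, rowvar=False)           # (n - 1) denominator
cov_fast = (X.T @ X) / X.shape[0]             # n denominator
```

With $n = 10000$ the two matrices agree to roughly four decimal places, and the fast form needs only one matrix product over the data.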





Consider the z-score normalized matrix $X$ to be perturbed, with columns $x_1, x_2, \dots, x_m$.
The perturbation of a single tuple $x$ can be given as $x^p = R \times T \times A \times x^T$ (where $R$, $T$, and $A$ are the rotation, translation, and reflection matrices, respectively).
Assume that axis 1 has been selected for the reflection.

















Let $X$ be the dataset of attributes $x_1, x_2, \dots, x_m$, and let $C$ be the variance-covariance matrix of $X$, an $m \times m$ square matrix defined by

$$C_{jk} = \mathrm{cov}(x_j, x_k), \quad j, k = 1, \dots, m,$$

where the diagonal elements are the variances, $C_{jj} = \mathrm{Var}(x_j)$.

Let $w$ be a row vector of weight factors (in this case, $m$ elements). Then

$$\mathrm{Var}(X w^T) = w\,C\,w^T,$$
written out in component form,

$$\mathrm{Var}(X w^T) = \sum_{j=1}^{m} \sum_{k=1}^{m} w_j\, w_k\, \mathrm{cov}(x_j, x_k).$$

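The identity $\mathrm{Var}(Xw^T) = w\,C\,w^T$ can be verified numerically on arbitrary data and weights (a sketch, with illustrative values):

```python
import numpy as np

# Numerical check that the variance of a weighted combination of
# attributes equals the quadratic form of the weight vector with the
# variance-covariance matrix.
rng = np.random.default_rng(0)
X = rng.standard_normal((5000, 3))
w = np.array([0.5, -1.0, 2.0])            # arbitrary weight factors
C = np.cov(X, rowvar=False)               # variance-covariance matrix
lhs = np.var(X @ w, ddof=1)               # Var of the combination
rhs = w @ C @ w                           # quadratic form w C w^T
```

This is why the perturbed attribute variances needed for $\Phi_{min}$ can be computed from $C$ alone, without revisiting the $n$ tuples in every loop iteration.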

Since, and