Efficient privacy preservation of big data for accurate data mining

06/19/2019
by M. A. P. Chamikara, et al.
RMIT University

Computing technologies pervade physical spaces and human lives, and produce a vast amount of data that is available for analysis. However, there is a growing concern that potentially sensitive data may become public if the collected data are not appropriately sanitized before release. Although many privacy-preserving methods are available, they are either inefficient or unscalable, or they compromise data utility and/or privacy. This paper addresses these issues by proposing PABIDOT, an efficient and scalable, non-reversible perturbation algorithm for the privacy preservation of big data via optimal geometric transformations. PABIDOT was tested for efficiency, scalability, attack resistance, and accuracy using nine datasets and five classification algorithms. The experiments show that PABIDOT excels in execution speed, scalability, attack resistance, and accuracy in large-scale privacy-preserving data classification when compared with two related privacy-preserving algorithms.


1 Introduction

Recent advances in computer technologies have drastically increased the amount of data collected from the cyber, physical and human worlds. Large-scale data collection makes sense only if the data are actionable and can be used in decision making  Witten et al. (2016). Data mining helps at this point by uncovering unsuspected relationships in the data and providing useful insights to the data owners. Moreover, such capabilities may often need to be shared with external parties for further analysis. In this process, various kinds of information may be revealed, which can lead to a privacy breach. The ability to share information while preventing the disclosure of personally identifiable information (PII) is thus an important aspect of information privacy, and one of the most significant technical, legal, ethical and social challenges. In fact, various governmental and commercial organizations collect vast amounts of user data, including credit information, health records, financial status, and personal preferences. Social networking, banking and healthcare systems are examples of systems that handle such private information Chamikara et al. (2018), and they often overlook privacy due to the indirect use of private information. Other information systems use massive amounts of sensitive private information (also called big data) for modeling and predicting human-related phenomena such as crime  Helbing et al. (2015), epidemics  Jalili & Perc (2017) and grand challenges in social physics  Capraro & Perc (2018). Hence, privacy preservation (a.k.a. sanitization)  Vatsalan et al. (2017) can become a very complex problem that requires robust solutions Wen et al. (2018).

Privacy-preserving data mining (PPDM) offers the possibility of using data mining methods without disclosing private information. PPDM approaches include data perturbation (data modification)  Chen & Liu (2005, 2011) and encryption  Kerschbaum & Härterich (2017). Cryptographic methods are renowned for securing data, and the literature provides many examples where PPDM effectively utilizes them  Li et al. (2017). For example, homomorphic encryption has been applied in domains including, but not limited to, e-health, cloud computing and sensor networks Zhou et al. (2015). Secure sum, secure set union, scalar product and set intersection are a few other operations that can be used as building blocks in distributed data mining  Clifton et al. (2002). However, due to their high computational complexity, cryptographic methods cannot provide sufficient data utility Gai et al. (2016) and are impractical for PPDM. Data perturbation is known to have lower computational complexity than cryptographic methods  Chamikara et al. (2018). It maintains the confidentiality of individual records by applying a systematic modification to the data elements of a database  Chamikara et al. (2018). The perturbed dataset is often indistinguishable from the original dataset; e.g. an age value maps to a reasonable number, so that a third party cannot differentiate between original and perturbed ages. Examples of perturbation techniques include adding noise to the original data (additive perturbation)  Muralidhar et al. (1999), applying rotation using a random rotation matrix (random rotation)  Chen & Liu (2005), applying both rotation and translation using a random rotation matrix and a random translation matrix (geometric perturbation)  Chen & Liu (2011), and randomizing the outputs of user responses using a random algorithm (randomized response)  Dwork et al. (2014). A major disadvantage of these techniques is that they cannot process high volumes of data efficiently; e.g. random rotation and geometric perturbation consume a considerable amount of time to provide good results while enforcing sufficient privacy Chen & Liu (2005, 2011). Additive perturbation takes less time but provides a lower privacy guarantee  Okkalioglu et al. (2015). Existing methods thus struggle to maintain a proper balance between privacy and utility.
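
For intuition, here is a small numpy sketch (ours, not the paper's code) contrasting additive perturbation with random rotation perturbation; rotation is isometric, which is why it retains utility for distance-based classifiers:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))                # toy dataset: 1000 records, 3 attributes

# Additive perturbation: independent noise on every value.
X_add = X + rng.normal(scale=0.5, size=X.shape)

# Random rotation perturbation: multiply by a random orthogonal matrix.
Q, _ = np.linalg.qr(rng.normal(size=(3, 3)))  # QR of a Gaussian matrix gives orthogonal Q
X_rot = X @ Q

# Rotation preserves pairwise distances; additive noise does not.
d = np.linalg.norm(X[0] - X[1])
print(np.isclose(np.linalg.norm(X_rot[0] - X_rot[1]), d))   # True
print(np.isclose(np.linalg.norm(X_add[0] - X_add[1]), d))   # generally False
```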

It is essential to define the effectiveness of a privacy-preserving approach using a privacy model, and to identify the limits of private information protection and disclosure  Chamikara et al. (2018). k-anonymity, l-diversity, and t-closeness are some of the earlier privacy models, and they are vulnerable to different attacks, e.g. minimality, composition and foreground knowledge attacks Chamikara et al. (2018). Differential privacy (DP) is a privacy model known to render maximum privacy by minimizing the chance of individual record identification Dwork et al. (2014). Local differential privacy (LDP), achieved by input perturbation  Dwork et al. (2014), allows full or partial data release to analysts  Kairouz et al. (2014) by randomizing the individual instances of a database Tang et al. (2017). Global differential privacy (GDP), also called the trusted curator model, allows analysts only to request the curator to run queries on the database; the curator applies carefully calibrated noise to the query results to provide differential privacy  Dwork et al. (2014); Kairouz et al. (2014). However, GDP and LDP fail for small datasets, as accurate estimation of the statistics performs poorly when the number of tuples is small. Although differential privacy has been studied thoroughly, only a few viable solutions exist for full/partial data release using LDP. Most of these are solutions for categorical data, such as RAPPOR Erlingsson et al. (2014) and Local, Private, Efficient Protocols for Succinct Histograms Qin et al. (2016). Despite DP's solid, theoretically appealing foundation for privacy protection, implementing efficient DP solutions for big data has proven difficult in practice. Furthermore, existing LDP algorithms involve a significant amount of noise addition (i.e. randomization), resulting in low data utility. Accordingly, utility and privacy often appear as conflicting factors, and improved privacy usually entails reduced utility.
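
As an illustration of the LDP idea, here is a toy sketch of randomized response, the classic input-perturbation primitive mentioned above (illustrative code, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(1)
true_bits = rng.integers(0, 2, size=100_000)   # one sensitive bit per respondent

p = 0.5                                        # probability of answering truthfully
truthful = rng.random(true_bits.size) < p
reported = np.where(truthful, true_bits, rng.integers(0, 2, size=true_bits.size))

# E[reported] = p * q + (1 - p) / 2, where q is the true proportion of ones,
# so the collector can invert the relation to estimate q without trusting
# any single report.
q_hat = (reported.mean() - (1 - p) / 2) / p
print(q_hat, true_bits.mean())                 # close, despite per-record noise
```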

The main contribution of this paper is a new Privacy preservation Algorithm for Big Data Using Optimal geometric Transformations (PABIDOT). PABIDOT is an irreversible input perturbation mechanism with a new privacy model (Φ-separation) which facilitates full data release. We prove that Φ-separation provides an empirical privacy guarantee against data reconstruction attacks. PABIDOT is substantially faster than comparable methods; it sequentially applies random axis reflection, noise translation, and multidimensional concatenated subplane rotation, followed by randomized expansion and random tuple shuffling for further randomization. Randomized expansion is a novel method to increase the positiveness or negativeness of a particular data instance. PABIDOT's memory overhead is close to that of other solutions, and it provides better attack resistance, better classification accuracy, and excellent efficiency for big data. We tested PABIDOT using nine generic datasets retrieved from the UCI machine learning repository (http://archive.ics.uci.edu/ml/index.php) and the OpenML machine learning repository (https://www.openml.org), and compared the results against two alternatives: random rotation perturbation (RP)  Chen & Liu (2005) and geometric perturbation (GP)  Chen & Liu (2011), which are known to provide high utility in privacy-preserving classification. Our study shows that PABIDOT always attains an approximately optimal perturbation: it produces the best empirical privacy possible by determining the globally optimal perturbation parameters, adhering to Φ-separation, for the dataset of interest. The source code of the PABIDOT project is available at https://github.com/chamikara1986/PABIDOT.

The rest of the paper is organized as follows. Section 2 provides a summary of related work. The technical details of PABIDOT are described in Section 3, which also presents the basic flow of the algorithm, referred to as PABIDOT_basic for convenience. The efficiency optimization of PABIDOT is discussed in Section 4, at the end of which the main algorithm (PABIDOT) with optimized efficiency is introduced. Section 5 presents the experimental settings and provides a comparative analysis of the performance and attack resistance of PABIDOT. The results are discussed in Section 6, and the paper is concluded in Section 7.

2 Literature Review

Privacy protection of individuals has become a challenging task with the proliferation of Internet-enabled consumer technologies, and the literature shows different approaches to this challenge. While some approaches concentrate on increasing awareness Buccafurri et al. (2016), others employ different techniques to enforce individual privacy Wei et al. (2018). Above all, the massive volumes of big data introduce many challenges to privacy preservation Cuzzocrea (2015). Although the security and privacy concerns of big data are not entirely new, they require attention due to the specifics of the environments and dynamics introduced by the devices used Kieseberg & Weippl (2018); the evolution of these environments and the diversity of devices keep adding complexity to security and privacy preservation. To counter these diverse challenges, three technological approaches can be observed: disclosure control, privacy-preserving data mining (PPDM) and privacy-enhancing technologies  Torra (2017a). Attribute-based encryption, access control via authentication, temporal and location-based access control, and constraint-based protocols are some mechanisms used for improving the privacy of systems in dynamic environments Chamikara et al. (2018). Among the various approaches to privacy-preserving data mining, data perturbation is often preferred due to its simplicity and efficiency  Aldeen et al. (2015). Both input and output perturbation are used: output perturbation is based on noise addition and rule hiding, while input perturbation is conducted either by noise addition Muralidhar et al. (1999) or multiplication  Chamikara et al. (2018). Input perturbation can be divided further into unidimensional and multidimensional perturbation Okkalioglu et al. (2015). Additive perturbation  Muralidhar et al. (1999), randomized response Dwork et al. (2014), swapping  Hasan et al. (2016) and microaggregation Torra (2017b) are examples of unidimensional input perturbation, whereas condensation Aggarwal & Yu (2004), random rotation  Chen & Liu (2005), geometric perturbation Chen & Liu (2011), random projection  Liu et al. (2006), and hybrid perturbation are multidimensional Aldeen et al. (2015).

In additive perturbation, random noise is added to the original data in such a way that the underlying statistical properties of the attributes are preserved. A significant problem with this approach is the low utility of the resulting data  Agrawal & Srikant (2000); additionally, effective noise reconstruction techniques developed in response can significantly reduce the level of privacy  Okkalioglu et al. (2015). Randomization techniques such as randomized response are another approach  Dwork et al. (2014), e.g. randomizing the responses of interviewees in order to preserve the privacy of respondents. Due to the heavy randomization of the input data, such techniques often provide high privacy, whereas the utility in terms of estimating statistics or conducting analyses can be low  Dwork et al. (2014). Microaggregation is based on confidentiality rules that allow the publication of micro datasets. It divides the dataset into clusters of elements and replaces the values in each cluster with the centroid of the cluster. Microaggregation of a single variable (univariate microaggregation) is vulnerable to transparency attacks when the published data includes information about the protection method and its parameters  Torra (2017b). Multivariate microaggregation has also been proposed, but it is complex and has been proven to be NP-hard  Torra (2017b). In condensation, the input dataset is divided into multiple groups of a pre-defined size in such a way that the difference between records within a group is minimal and a certain level of statistical information about the records is maintained in each group. Sanitized data are then generated from a uniform random distribution based on the eigenvectors obtained by eigendecomposition of the characteristic covariance matrix of each homogeneous group  Aggarwal & Yu (2004). Condensation has a significant shortcoming in that it may degrade the quality of data significantly.
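
A minimal sketch of univariate microaggregation as described above (our illustrative code; for simplicity it assumes the record count is a multiple of k):

```python
import numpy as np

def microaggregate(values: np.ndarray, k: int) -> np.ndarray:
    """Univariate microaggregation: sort, group k consecutive values,
    and replace each value by the centroid (mean) of its group."""
    order = np.argsort(values)
    out = np.empty_like(values, dtype=float)
    for start in range(0, len(values), k):
        idx = order[start:start + k]
        out[idx] = values[idx].mean()
    return out

ages = np.array([23, 25, 24, 47, 45, 46, 61, 63, 59])
print(microaggregate(ages, k=3))   # each age replaced by its group's mean
```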

Random rotation perturbation, geometric data perturbation, and random projection perturbation are three types of matrix multiplicative methods  Okkalioglu et al. (2015). In random rotation, the original data matrix is multiplied by a random rotation matrix, which has the properties of an orthogonal matrix; the rotation is repeated until the algorithm converges at the desired level of privacy  Chen & Liu (2005). In geometric data perturbation, a random translation matrix is additionally incorporated into the perturbation process in order to enhance privacy. The method comprises three components: rotation perturbation, translation perturbation, and distance perturbation  Chen & Liu (2011). The main idea of random projection perturbation is to project data from a high-dimensional space to a randomly chosen low-dimensional subspace  Liu et al. (2006). Due to the isometric nature of the transformations, random rotation perturbation and geometric data perturbation preserve the distances between the tuples of a dataset, and random projection preserves them approximately  Chen & Liu (2005, 2011); Liu et al. (2006). Accordingly, they provide high utility w.r.t. classification and clustering. Hybrid perturbation uses both matrix multiplicative and matrix additive properties, which makes it quite similar to geometric perturbation  Aldeen et al. (2015). These algorithms have high computational complexity and are time-consuming, which makes them unsuitable for big datasets.

Due to its explicit notion of a strong privacy guarantee, differential privacy has attracted much attention. Although LDP permits full or partial data release and the analysis of privacy-protected data Dwork et al. (2014); Kairouz et al. (2014), LDP algorithms are still at a fundamental stage when it comes to the privacy preservation of real-valued numerical data; the complexity of selecting the domain of randomization with respect to a single data instance remains a challenge Erlingsson et al. (2014). In GDP, the requirement of a trusted curator who enforces differential privacy by applying noise or randomization can be considered a primary issue Dwork et al. (2014). The fundamental mechanisms used to obtain differential privacy include the Laplace mechanism, the Gaussian mechanism  Dwork et al. (2014), the geometric mechanism, randomized response, and staircase mechanisms  Kairouz et al. (2014). The necessity of a trusted third party in GDP and the application of extremely high noise in LDP are inherent shortcomings that directly affect the balance between privacy and utility in these practical, differentially private approaches.
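
As an illustration of the trusted-curator (GDP) setting, here is a minimal sketch of the Laplace mechanism for a counting query; the helper and data are ours, not the paper's:

```python
import numpy as np

def laplace_count(data, predicate, epsilon: float, rng) -> float:
    """Global DP via the Laplace mechanism: a counting query has
    sensitivity 1, so Laplace noise with scale 1/epsilon suffices."""
    true_count = sum(predicate(x) for x in data)
    return true_count + rng.laplace(scale=1.0 / epsilon)

rng = np.random.default_rng(4)
ages = rng.integers(18, 90, size=10_000)
# The curator answers "how many records have age > 65?" with calibrated noise.
print(laplace_count(ages, lambda a: a > 65, epsilon=0.5, rng=rng))
```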

Many previously proposed privacy preservation methods, including data perturbation, perform poorly on high dimensional datasets: even when performance is good for low dimensional data, the necessary computing resources grow quickly as the numbers of attributes and instances increase. This quality is often referred to as "the dimensionality curse"  Chamikara et al. (2018). Large datasets also provide extra information to attackers, as higher dimensionality helps in exploiting background knowledge to identify individuals  Bettini & Riboni (2015).

Most privacy-preserving algorithms have problems with balancing privacy and utility. Data privacy focuses on the difficulty of estimating the original data from the sanitized data, while utility concentrates on preserving application-specific properties and information  Aggarwal (2015). A generic way of measuring the utility of a privacy-preserving method is to investigate perturbation biases  Wilson & Rosen (2008). Data perturbation bias means that the result of a query on the perturbed data is significantly different from the result of the same query on the original data. Wilson et al. have examined different data perturbation methods against various bias measures  Wilson & Rosen (2008), namely Type A, B, C, D, and Data Mining (DM) bias. Type A bias occurs when the perturbation of a given attribute causes summary measures to change. Type B bias results from the perturbation changing the relationships between confidential attributes, while in the case of Type C bias the relationships between confidential and non-confidential attributes change. Type D bias means that the underlying distribution of the data is affected by the sanitization process. If Type DM bias exists, data mining tools perform less accurately on the perturbed data than they would on the original dataset. It has been noted that privacy preservation mechanisms decrease utility in general, and finding a trade-off between privacy protection and data utility for big data is an important issue  Xu et al. (2015).

The literature thus shows a dearth of efficient privacy preservation methods that provide reliable data utility and are scalable enough to handle rapidly growing data. Existing methods also suffer from uncertainty, bias, and low attack resistance. To address the issues presented by big data, there is an urgent need for methods that are scalable, efficient and robust. New methods should overcome the aforementioned weaknesses of existing PPDM methods and provide solutions for large-scale privacy-preserving data mining.

3 Proposed Algorithm: PABIDOT

PABIDOT perturbs a dataset using the multidimensional geometric transformations reflection, translation, and rotation, followed by randomized expansion (a new noise addition mechanism explained later in this section) and random tuple shuffling. Figure 1 shows the basic flow and architecture of the proposed perturbation algorithm. Based on the proposed privacy model, called Φ-separation, the algorithm aims at optimum privacy protection against data reconstruction attacks. PABIDOT achieves this by selecting the best possible perturbation parameters based on the properties of the input dataset. Figure 1 also shows the position of PABIDOT in a privacy-preserving big data release scenario. PABIDOT assumes that the original data can be accessed only by the owner/administrator of the dataset. There can be complementary releases of perturbed versions of the original dataset, but the original dataset is never released to third-party users under any circumstances.

Figure 1: Basic flow and architecture of PABIDOT. In this setting, the data owner is considered to be the trusted curator who owns the original dataset. The owner is located at the local edge of a cloud computing scenario. The orange boxes represent the main steps of the algorithm, whereas the green boxes represent the intermediate data-generating steps that support the corresponding main steps.
Rationale and technical novelty

PABIDOT applies geometric transformations with optimal perturbation parameters and increases randomness using randomized expansion followed by a random tuple shuffle. The privacy model used by PABIDOT (Φ-separation) defines privacy in such a way that the resulting dataset has an optimal difference from the original dataset. This property helps to minimize the search space and find the best possible perturbation for a particular dataset. Consequently, the efficiency and reliability of PABIDOT in big data perturbation increase while providing better resistance to data reconstruction. Figure 1 and Algorithm 3 depict the proposed perturbation algorithm, and Table 1 provides a summary of the notations used in Algorithm 2 and Algorithm 3. As shown, the original dataset and the standard deviation of the normal random noise used by randomized expansion are the only inputs to Algorithm 3, and the perturbed dataset is its only output.

Data matrix (D)

The dataset to be perturbed is represented as a matrix D of size m × n, where the columns represent the attributes (n attributes) and the rows represent the records (m records). For example, the personal information of a patient can be represented as a record with attributes such as age, weight, height, and gender. The data matrix is assumed to contain numerical data only.

D = [x_{ij}], i = 1, …, m; j = 1, …, n.   (1)

In the process of perturbation, the data matrix is subjected to multidimensional geometric composite transformations. During these transformations, a record (row) of the data matrix is treated as a point in the multidimensional Cartesian coordinate system.

Multidimensional isometric transformations

Geometric translation, rotation and reflection are isometric transformations in Euclidean space. A transformation T is said to be isometric if it preserves distances, so that  Maruskin (2012)

d(T(p), T(q)) = d(p, q) for all points p, q.   (2)

All matrices and Cartesian points are represented in homogeneous coordinate form so that all transformations can be expressed as matrix multiplications. A homogeneous coordinate point in n-dimensional space is written as an (n + 1)-dimensional position vector with an additional term equal to 1. Homogeneous coordinates enable composite transformations to be assembled from individual transformation matrices without performing the transformations as a sequential process. Therefore, multidimensional geometric translation, reflection and rotation can be represented in their generalized matrix forms  Jones (2012).

A composite operation is performed when several transformation matrices have to be applied within a single transformation. If transformation matrices M_1, M_2, …, M_k are sequentially applied to a homogeneous matrix X, the composite operation is given by

X' = M_k ⋯ M_2 M_1 X.   (3)
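
A small numpy sketch of homogeneous coordinates and a composite transformation as in Equation 3 (our code; the values are arbitrary):

```python
import numpy as np

n = 3                                   # attributes; points live in R^n
p = np.array([2.0, -1.0, 4.0, 1.0])     # homogeneous point: n coords plus a trailing 1

T = np.eye(n + 1); T[:n, n] = [0.1, -0.2, 0.3]   # translation in homogeneous form
F = np.eye(n + 1); F[0, 0] = -1.0                # reflection across axis 1

# Composite operation: one matrix product replaces a sequence of transforms.
M = T @ F
print(np.allclose(M @ p, T @ (F @ p)))           # True
```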
Homogeneous data matrix

All records in the input data matrix (D) are converted to homogeneous coordinates by adding a new column of ones (i.e. x_{i,n+1} = 1) after the n-th column. The resulting homogeneous representation of the data matrix is given by Equation 4.

D_h = [D | 1], i.e. D with a column of ones appended.   (4)

The input dataset is first subjected to z-score normalization  Kabir et al. (2015) in order to give all attributes equal weight in the transformations. Next, the n-dimensional translation matrix is generated according to Equation 5, in which the translational coefficients are drawn from random noise with uniform distribution and n equals the number of attributes of the input dataset. Due to z-score normalization, each attribute's mean becomes 0 and its standard deviation becomes 1; the noise generated by the uniform random noise function is therefore bounded by the normalized scale and satisfies −1 ≤ t_i ≤ 1, where t_i denotes a translational coefficient.
n-Dimensional translation matrix

The homogeneous translation matrix can be derived as shown in Equation 5  Jones (2012).

T = ( I_n  t ; 0  1 ), a (n + 1) × (n + 1) matrix: the identity with the translational coefficients t = (t_1, …, t_n)^T in the last column.   (5)
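
A helper sketch for Equation 5, reused in later snippets (the [-1, 1] coefficient range is our assumption, based on the normalized attribute scale):

```python
import numpy as np

def translation_matrix(n: int, rng) -> np.ndarray:
    """Homogeneous (n+1)x(n+1) translation matrix (Equation 5): identity with
    the translational coefficients t_1..t_n in the last column, drawn from
    uniform random noise (range assumed to be [-1, 1])."""
    T = np.eye(n + 1)
    T[:n, n] = rng.uniform(-1.0, 1.0, size=n)
    return T
```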
n-Dimensional reflection matrix

Next, for each of the n axes, PABIDOT generates the corresponding reflection matrix according to Equation 7. The homogeneous matrix for reflection across axis one can be derived as shown in Equation 6  Jones (2012).

F_1 = diag(−1, 1, …, 1), of size (n + 1) × (n + 1).   (6)

The (n + 1) × (n + 1) reflection matrix for an arbitrary axis a can be written as shown in Equation 7.

F_a = diag(1, …, 1, −1, 1, …, 1), with the −1 in position a.   (7)
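
A matching helper sketch for Equation 7 (ours, reused below):

```python
import numpy as np

def reflection_matrix(n: int, axis: int) -> np.ndarray:
    """Homogeneous (n+1)x(n+1) reflection matrix across the given axis
    (1-based, per Equation 7): negates that coordinate, identity elsewhere."""
    F = np.eye(n + 1)
    F[axis - 1, axis - 1] = -1.0
    return F
```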
n-Dimensional rotation matrix

After creating the reflection matrix for one of the n axes, PABIDOT generates the n-dimensional concatenated subplane rotation matrix (R_θ for the current angle θ) using Algorithm 1, for each θ where 0° ≤ θ ≤ 179°. In order to derive a single matrix that represents the entire orientation, the concatenated subplane rotation method is used: the rotation in the plane of a pair of coordinate axes (i, j) can be written as a block matrix, as in Equation 8  Paeth (2014).

R_{ij}(θ) = identity of size (n + 1), except that (R_{ij})_{ii} = cos θ, (R_{ij})_{jj} = cos θ, (R_{ij})_{ij} = −sin θ and (R_{ij})_{ji} = sin θ.   (8)

Thus, the n(n − 1)/2 distinct R_{ij}(θ) are concatenated in the preferred order to produce the final composite n-dimensional rotation matrix, which can be obtained using Equation 9; the general form has n(n − 1)/2 degrees of freedom, parameterized here by the single angle θ.

R_θ = ∏_{1 ≤ i < j ≤ n} R_{ij}(θ).   (9)
Algorithm for generating the multidimensional concatenated subplane rotation matrix

Based on Equations 8 and 9, Algorithm 1 generates the multidimensional concatenated subplane rotation matrix for the desired rotation angle. The resulting matrix R_θ has the properties of an orthogonal matrix: its columns and rows are orthonormal, and hence it preserves the relationship R_θ R_θ^T = I, where R_θ^T is the transpose of R_θ and I is the identity matrix.

1:Inputs :
n — number of attributes of the dataset
θ — rotation angle
2:Outputs:
R_θ — multidimensional concatenated subplane rotation matrix of size (n + 1) × (n + 1)
3: c = cos θ
4: s = sin θ and s' = −sin θ
5: R_θ = identity matrix of size (n + 1)
6: q = n(n − 1)/2, the total number of sin θ and cos θ assignment sets necessary
7:
8:for k = 1, …, q do
9:   (i, j) = the next pair of coordinate axes, 1 ≤ i < j ≤ n, where
10:  (i, j) is the coordinates of the sin θ and cos θ assignments for an instance of R_ij
11:  G = identity matrix of size (n + 1), initializing G with the identity matrix
12:  G[i, i] = c, the assignment set number 1 of R_ij
13:  G[j, j] = c, the assignment set number 2 of R_ij
14:  G[i, j] = s', the assignment set number 3 of R_ij
15:  G[j, i] = s, the assignment set number 4 of R_ij
16:  R_θ = G · R_θ, iterative multiplication of G and R_θ to form R_θ according to Equations 8 and 9
17:end for
18:return R_θ, which follows Equations 8 and 9
19:End Algorithm
Algorithm 1 Algorithm for generating multidimensional concatenated subplane rotation matrix
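
As a cross-check of Algorithm 1, here is a small numpy sketch (our code, not the authors') that builds the same product of subplane (Givens) rotations, with one shared angle θ as the single-angle search in PABIDOT suggests, and verifies orthogonality:

```python
import numpy as np
from itertools import combinations

def subplane_rotation_matrix(n: int, theta: float) -> np.ndarray:
    """Concatenated subplane rotation: multiply the n*(n-1)/2 plane rotations
    R_ij(theta), one per pair of coordinate axes, into a single homogeneous
    (n+1)x(n+1) matrix (Equations 8 and 9)."""
    R = np.eye(n + 1)
    c, s = np.cos(theta), np.sin(theta)
    for i, j in combinations(range(n), 2):
        G = np.eye(n + 1)
        G[i, i], G[j, j] = c, c        # the four Givens entries for plane (i, j)
        G[i, j], G[j, i] = -s, s
        R = G @ R                      # iterative multiplication
    return R

R = subplane_rotation_matrix(4, np.deg2rad(35.0))
print(np.allclose(R @ R.T, np.eye(5)))   # orthogonality: R R^T = I
```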
Privacy metric for generating the optimal perturbation parameters

Since the proposed method is based on multidimensional isometric transformations, it is important to use a multi-column privacy metric to evaluate its privacy. Assuming that all attributes of the dataset are equally important, z-score normalization is applied to the data matrix as the initial step of the perturbation process. The higher the privacy of the perturbed data, the more difficult it is to estimate the original data  Chen & Liu (2005). To capture this idea, the variance of the difference between the perturbed and non-perturbed datasets, Var(X − X^p), is considered: the higher Var(X − X^p), the higher the privacy. Hence, Var(X − X^p) measures the privacy of the perturbed data, i.e. the difficulty of estimating the original data without prior knowledge about it, which is often called naive inference/estimation  Chen & Liu (2005). Var(X − X^p) has long been used to measure the level of privacy of perturbed data  Muralidhar et al. (1999). In the proposed method, the attribute that returns the minimum variance of the difference is taken to define the minimum privacy guarantee. If X^p_j is the perturbed data series of attribute X_j, the level of privacy of the perturbation method can be measured using Var(X_j − X^p_j) for j = 1, …, n, written as in Equation 10.

Var(X_j − X^p_j) = (1/(m − 1)) Σ_{i=1}^{m} ((x_{ij} − x^p_{ij}) − mean(X_j − X^p_j))², j = 1, …, n.   (10)

We propose a new privacy definition, called Φ-separation, as follows. Assume we have a dataset D of m instances, each having n attributes. We perturb this dataset in k different ways, producing datasets D^p_i for i = 1, …, k. The perturbed value of an attribute X_j is denoted X^p_j. We then calculate the difference D − D^p_i between the original and the perturbed data, which contains the values (x − x^p) for each attribute. We calculate the variance of each attribute in D − D^p_i and select the minimum of these variances,

Φ_i = min_{j=1,…,n} Var(X_j − X^p_j).   (11)

From the k perturbations, we choose the one with the largest value of Φ_i, i.e. the perturbation that produced the most significant difference between the original and the perturbed dataset.

Definition 3.1 (Φ-separation).

Given the dataset D and a perturbation algorithm, we denote a perturbed instance of D as D^p_i for i = 1, …, k, where k represents the number of possible ways of perturbing D. Then, the minimum privacy guarantee is defined as

Φ_i = min_{j=1,…,n} Var(X_j − X^p_j),   (12)

and the optimal privacy guarantee is

Φ_opt = max_{i=1,…,k} Φ_i.   (13)

A perturbed dataset that attains the optimal privacy guarantee provides Φ-separation.

For a perturbed data matrix D^p of size m × n, the minimum privacy guarantee (Φ) is the minimum of the variances calculated for the n attributes from the differences between the original and perturbed z-score normalized values, as shown in Equation 14. Here Φ is the minimum privacy guarantee and n is the total number of attributes of D or D^p.

Φ = min_{j=1,…,n} Var(X_j − X^p_j).   (14)
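
In code, the minimum privacy guarantee of Equation 14 is a one-liner (our sketch):

```python
import numpy as np

def min_privacy_guarantee(X: np.ndarray, Xp: np.ndarray) -> float:
    """Phi (Equation 14): the minimum, over attributes, of the variance of
    the difference between the z-score normalized original and perturbed
    columns; the most easily reconstructed attribute bounds the guarantee."""
    return float((X - Xp).var(axis=0, ddof=1).min())
```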
Identifying the best perturbation parameters for Φ_opt

In each iteration the algorithm maximizes the value of Φ to generate Φ_opt, as given in Equation 17. With the axis of reflection a varying from 1 to n (the number of attributes) and the angle of rotation θ varying from 0 to 179 degrees, the perturbation returns n × 180 perturbed data matrices with different levels of perturbation. This forms the matrix of local minimum privacy guarantees given in Equation 15.

Φ_local = [Φ_{a,θ}], a = 1, …, n; θ = 0, …, 179.   (15)

To get the global minimum privacy guarantee values for each angle, we take the minimum of each column, as given in Equation 16.

Φ_GMPG(θ) = min_{a=1,…,n} Φ_{a,θ}.   (16)

The perturbed data matrix with optimal privacy is the one that returns the largest value of the minimum privacy guarantee. Therefore, the largest global minimum privacy guarantee (Φ_opt) is selected from Equation 16, as shown in Equation 17. In other words, the best perturbation parameters are selected using the highest privacy guarantee achievable for the most vulnerable attribute.

Φ_opt = max_{θ=0,…,179} Φ_GMPG(θ).   (17)
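
To make the selection concrete, here is a small numpy sketch of the reduction in Equations 15-17 (our code; the Φ_local values below are placeholders, and recording the minimizing axis at the optimum is our reading of the text):

```python
import numpy as np

# Phi_local[a - 1, theta] = local minimum privacy guarantee for reflection
# axis a and rotation angle theta (Equation 15); placeholder values here.
rng = np.random.default_rng(2)
Phi_local = rng.random((5, 180))            # n = 5 axes, angles 0..179

per_angle = Phi_local.min(axis=0)           # Equation 16: min over axes per angle
theta_opt = int(per_angle.argmax())         # Equation 17: largest global minimum
axis_opt = int(Phi_local[:, theta_opt].argmin()) + 1   # axis recorded at the optimum
print(theta_opt, axis_opt, per_angle[theta_opt])
```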
Generating the rotation and reflection matrices using the best perturbation parameters

Next, PABIDOT records the angle of rotation (θ_opt) and the axis of reflection (a_opt) at Φ_opt. The algorithm then uses a_opt to generate the reflection matrix according to Equation 7 and θ_opt to generate the rotation matrix using Algorithm 1. The composite transformation of reflection, translation, and rotation is subsequently applied to the z-score normalized matrix, using the optimal reflection, translation, and optimal rotation matrices, to generate X^p.

Application of z-score normalization and the transformations: reflection, translation and rotation

After generating the concatenated subplane rotation matrix, PABIDOT applies the composite transformation of reflection, translation, and rotation to the z-score normalized input data matrix. Rotation is applied after reflection and noise translation because the effect of rotation is proportional to the distance from the origin, and we want to reduce the probability of points close to the origin being attacked due to weaker perturbation  Chen & Liu (2011). An instance of the application of the transformations to the input data matrix, in order of application, can be represented by Equation 18.

X^p = (R_θ T F_a X̂^T)^T, where X̂ is the homogeneous form of the z-score normalized data matrix.   (18)
Randomized Expansion

A privacy-preserving algorithm is composable if it keeps satisfying the privacy requirements of the privacy model in use after repeated independent applications of the algorithm Soria-Comas & Domingo-Ferrer (2016). To improve the composability and randomness of the perturbation algorithm, noise drawn from a random normal distribution (with a mean of 0 and a predefined standard deviation, defaulting to 0.3) is added to the data using a novel approach named randomized expansion. The noise is introduced in such a way that it further enhances the positiveness or negativeness of a particular value, while zeros are not subjected to any change, as depicted in Figure 2.

Figure 2: Effect of randomized expansion. The red arrows on the right-hand side show a positive shift, where a calibrated positive random value is added to a positive value to increase its positiveness. The blue arrows on the left-hand side show a negative shift, where a calibrated negative random value is added to a negative value to increase its negativeness.

To generate the noise for randomized expansion, we first generate the sign matrix (S) from the values of X^p: an element of S is 1 if the corresponding element of X^p is greater than 0, 0 if it equals 0, and −1 if it is less than 0 (the data matrix is assumed to be real, so the complex case of the sign function does not arise). Next, the absolute values of X^p and the absolute values of the random normal noise matrix N are added together, and the elementwise product with S is calculated, as denoted in Equation 19.

X^p ← S ∘ (|X^p| + |N|), where ∘ denotes the elementwise product.   (19)
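
A sketch of Equation 19 (our code; note that sign(0) = 0 keeps zeros unchanged):

```python
import numpy as np

def randomized_expansion(X: np.ndarray, sigma: float = 0.3, rng=None) -> np.ndarray:
    """Equation 19: push every value further from zero by adding the absolute
    value of N(0, sigma) noise to |X| and restoring the original sign.
    sign(0) = 0, so zeros remain exactly zero."""
    if rng is None:
        rng = np.random.default_rng()
    noise = rng.normal(0.0, sigma, size=X.shape)
    return np.sign(X) * (np.abs(X) + np.abs(noise))
```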
Application of reverse z-score normalization and random tuple swapping

At this stage, the attributes of the perturbed data matrix are not within the ranges of the attributes of the original dataset. Therefore, reverse z-score normalization is applied to the current matrix, using the standard deviations (σ) and means (μ) of the attributes of the original dataset. Finally, the tuples of the resulting matrix are randomly swapped to generate the final perturbed dataset.
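
A matching sketch of this final step (our code):

```python
import numpy as np

def reverse_zscore_and_shuffle(Xp: np.ndarray, mu: np.ndarray,
                               sigma: np.ndarray, rng) -> np.ndarray:
    """Map each perturbed attribute back to the original scale
    (x * sigma_j + mu_j), then randomly swap the order of the tuples."""
    out = Xp * sigma + mu        # reverse z-score normalization, column-wise
    rng.shuffle(out)             # in-place shuffle of rows (tuples)
    return out
```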

Notation : Summary/Constraints
Φ_opt : the optimal privacy guarantee
θ_opt : angle of rotation at Φ_opt
a_opt : optimal axis for reflection, selected at Φ_opt
m : number of tuples
n : number of attributes
X : z-score normalized dataset of D
Σ : covariance matrix of X
T : translation matrix with uniform random noise
F_a : reflection matrix for the current axis a (generated according to Equation 7)
θ, a : rotation angle (0° ≤ θ ≤ 179°) and reflection axis (1 ≤ a ≤ n)
R_θ : concatenated subplane rotation matrix generated using Algorithm 1 for θ
R_θ_opt : concatenated subplane rotation matrix generated using Algorithm 1 for θ_opt
F_a_opt : reflection matrix generated according to Equation 7 for a_opt
σ : vector of standard deviations of all the attributes of D
μ : vector of mean values of all the attributes of D
Table 1: Summary of the notations used in the proposed algorithm.

3.1 Overall process of PABIDOT

Algorithm 2 summarizes the overall set of steps of the basic perturbation algorithm. A summary of notations used in this algorithm is provided in Table 1. For convenience, we refer to Algorithm 2 as PABIDOT_basic in subsequent sections.

1:Inputs :
D — original dataset
σ_n — input noise standard deviation (default value = 0.3)
2:Outputs:
D^p — perturbed dataset
3: m = number of tuples of D
4: n = number of attributes of D
5: Φ_opt = 0
6:generate X by applying z-score normalization on D
7:generate T according to Equation 5, using uniform random noise as the translational coefficients
8:for each axis a in 1, …, n do (assume there are n attributes in D)
9:   generate F_a according to Equation 7
10:  for each θ in 0, …, 179 do
11:     generate R_θ using Algorithm 1 (refer to Equations 8 and 9)
12:     X^p = (R_θ T F_a X̂^T)^T, which follows Equation 18
13:     Φ_{a,θ} = min_j Var(X_j − X^p_j), according to Equation 14
14:  end for
15:end for
16:for each θ do
17:   Φ_GMPG(θ) = min_a Φ_{a,θ}, which follows Equation 16
18:end for
19: Φ_opt = max_θ Φ_GMPG(θ), according to Equation 17
20: θ_opt = θ at Φ_opt
21: a_opt = axis of reflection at Φ_opt
22:generate R_θ_opt according to Algorithm 1, using θ_opt
23:generate F_a_opt according to Equation 7, using a_opt
24: X^p = (R_θ_opt T F_a_opt X̂^T)^T, which follows Equation 18
25: X^p = S ∘ (|X^p| + |N|) according to Equation 19
26: D^p = X^p σ + μ (reverse z-score normalization)
27:randomly swap the tuples of D^p
28:End Algorithm
Algorithm 2 Basic steps of PABIDOT
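
Putting the pieces together, here is a compact end-to-end sketch of Algorithm 2 that reuses the helper functions from the earlier snippets (our code; the axis recorded at the optimum follows our reading of Equations 15-17):

```python
import numpy as np

def pabidot_basic(D: np.ndarray, sigma: float = 0.3, seed: int = 0) -> np.ndarray:
    """Sketch of Algorithm 2 (PABIDOT_basic) using translation_matrix,
    reflection_matrix, subplane_rotation_matrix and randomized_expansion
    from the earlier snippets."""
    rng = np.random.default_rng(seed)
    m, n = D.shape
    mu, sd = D.mean(axis=0), D.std(axis=0)
    X = (D - mu) / sd                                   # step 6: z-score normalization
    Xh = np.hstack([X, np.ones((m, 1))])                # homogeneous coordinates
    T = translation_matrix(n, rng)                      # step 7: Equation 5

    Phi = np.empty((n, 180))                            # Equation 15
    for a in range(1, n + 1):                           # steps 8-15: the search
        F = reflection_matrix(n, a)
        for deg in range(180):
            R = subplane_rotation_matrix(n, np.deg2rad(deg))
            Xp = (R @ T @ F @ Xh.T).T[:, :n]            # Equation 18
            Phi[a - 1, deg] = (X - Xp).var(axis=0, ddof=1).min()  # Equation 14

    per_angle = Phi.min(axis=0)                         # Equation 16
    theta = int(per_angle.argmax())                     # Equation 17
    a_opt = int(Phi[:, theta].argmin()) + 1             # axis at the optimum

    R = subplane_rotation_matrix(n, np.deg2rad(theta))  # steps 22-24
    F = reflection_matrix(n, a_opt)
    Xp = (R @ T @ F @ Xh.T).T[:, :n]
    Xp = randomized_expansion(Xp, sigma, rng)           # step 25: Equation 19
    out = Xp * sd + mu                                  # step 26: reverse z-score
    rng.shuffle(out)                                    # step 27: random tuple swap
    return out
```

Note the n × 180 full-dataset transforms inside the search loop; this is exactly the cost that Section 4 removes.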

4 Optimizing the efficiency of PABIDOT

During the execution of PABIDOT, the composite transformations in steps 12 and 13 of Algorithm 2 must be applied to the whole dataset a total of n × 180 times. This can be infeasible for high dimensional datasets. To accelerate the computation, a method based on the covariance matrix (Σ) of the input dataset is used to generate Var(X − X^p): it eliminates the need to scan the large number of tuples of a big dataset in each loop iteration, as only Σ is required. The steps of this derivation are provided below.

Proof:

For an original attribute X and its perturbed counterpart X^p,

Var(X − X^p) = (1/(m − 1)) Σ_{i=1}^{m} ((x_i − x_i^p) − (x̄ − x̄^p))².   (20)

Since X is z-score normalized, x̄ is equal to 0. For a large number of records, m ≈ m − 1.

Therefore,

Var(X − X^p) ≈ (1/m) Σ_{i=1}^{m} ((x_i − x_i^p) − x̄^p)².   (21)

Hence,

Var(X − X^p) = Var(X) + Var(X^p) − 2 Cov(X, X^p).   (22)

Let us consider the following table as the table to be perturbed, where X, Y, Z are the columns of D:

X Y Z
x_1 y_1 z_1
x_2 y_2 z_2
⋮ ⋮ ⋮

The perturbation of a single tuple (x, y, z) is given by the composite product R T F (where R, T and F are the rotation, translation and reflection matrices respectively), which is found using

(x^p, y^p, z^p, 1)^T = R T F (x, y, z, 1)^T.   (23)

Assume that axis 1 has been selected for the reflection. Therefore,

F (x, y, z, 1)^T = (−x, y, z, 1)^T   (24)

and, after applying the translation,

T F (x, y, z, 1)^T = (−x + t_1, y + t_2, z + t_3, 1)^T,   (25)

where R = (r_{ij}) denotes the 3 × 3 rotation block of R_θ.

So,

x^p = r_{11}(−x + t_1) + r_{12}(y + t_2) + r_{13}(z + t_3).   (26)

Similarly,

y^p = r_{21}(−x + t_1) + r_{22}(y + t_2) + r_{23}(z + t_3)   (27)
z^p = r_{31}(−x + t_1) + r_{32}(y + t_2) + r_{33}(z + t_3)   (28)

and

X − X^p = (1 + r_{11}) X − r_{12} Y − r_{13} Z − (r_{11} t_1 + r_{12} t_2 + r_{13} t_3).   (29)

Therefore, each difference column is a linear combination of the original attributes plus a constant,

X − X^p = w (X, Y, Z)^T + c, with w = (1 + r_{11}, −r_{12}, −r_{13}),   (30)

or, since adding a constant does not change the variance,

Var(X − X^p) = Var(w (X, Y, Z)^T).   (31)

Let D be the dataset of attributes X_1, …, X_n, and let Σ be the variance-covariance matrix of D, a size n × n square matrix defined by

Σ_{ij} = Cov(X_i, X_j),   (32)

where the diagonal elements are Σ_{ii} = Var(X_i). Let w be a row vector of weight factors (in this case n elements). Then

Var(wD) = w Σ w^T,   (33)

written out in component form,

Var(Σ_i w_i X_i) = Σ_i Σ_j w_i w_j Σ_{ij}.   (34)

Since w is fully determined by the reflection axis and the rotation angle, and Σ is computed once from the normalized dataset,

Var(X − X^p) = w Σ w^T   (35)

for every attribute, and hence

Φ = min_{j=1,…,n} w_j Σ w_j^T,   (36)

where w_j is the weight vector corresponding to attribute j. Therefore, Var(X − X^p), and with it the minimum privacy guarantee, can be evaluated directly from Σ without scanning the tuples of the dataset. ∎
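
As a numerical sanity check of the identity in Equation 33 (our sketch with synthetic data):

```python
import numpy as np

rng = np.random.default_rng(3)
D = rng.normal(size=(10_000, 4))           # synthetic dataset, n = 4 attributes
Sigma = np.cov(D, rowvar=False)            # attribute covariance matrix (Equation 32)
w = rng.normal(size=4)                     # weight vector from the transform coefficients

direct = (D @ w).var(ddof=1)               # variance of the linear combination, O(m n)
via_cov = w @ Sigma @ w                    # w Sigma w^T (Equation 33), O(n^2)
print(np.isclose(direct, via_cov))         # True: Sigma alone suffices
```

Because w depends only on the reflection axis and the rotation angle, every (axis, θ) candidate can be scored through this quadratic form, turning each scoring step from a pass over all m tuples into a product over the n × n covariance matrix.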