PPaaS: Privacy Preservation as a Service

Personally identifiable information (PII) can find its way into cyberspace through various channels, and many potential sources can leak such information. To preserve user privacy, researchers have devised different privacy-preserving approaches; however, their usability in practice needs careful analysis due to their high diversity and complexity. This paper presents a framework named PPaaS (Privacy Preservation as a Service) to maintain usability by employing selective privacy preservation. PPaaS includes a pool of privacy preservation methods, and for each application it selects the most suitable one after rigorous evaluation. PPaaS enhances the usability of the privacy-preserving methods within its pool; it is a generic platform that can be used to sanitize big data in a granular, application-specific manner by employing a suitable combination of diverse privacy-preserving algorithms to provide a proper balance between privacy and utility.

1 Introduction

Cyberspace users cannot easily avoid the possibility of their identity being incorporated in data that exposes various aspects of their lives [1, 2]. Our day-to-day activities are constantly tracked by smart devices, and the unavoidable exposure of personally identifiable information (PII) such as fingerprints and facial features can lead to massive privacy loss. The heavy use of PII in social networks, the health-care industry, insurance companies, and smart grids makes privacy protection of PII extremely complex. The literature offers numerous methods to address the growing concerns related to user privacy. Among these methods, disclosure control of microdata has become widely popular in the domain of data mining [1, 3]; it works by applying different privacy-preserving mechanisms to the data before releasing them for analysis. Privacy-preserving data mining (PPDM) applies disclosure control to data mining in order to preserve privacy while generating knowledge [1].

The main approaches to PPDM are data perturbation (modification) and encryption; literature shows a plethora of privacy preservation approaches under these two categories [4]. There has been more interest in data perturbation due to its lower complexity compared to encryption. Additive perturbation, random rotation, geometric perturbation, randomized response, random projection, microaggregation, hybrid perturbation, data condensation, data wrapping, data rounding, and data swapping are some examples of basic data perturbation algorithms, which show different behavior on different applications and datasets [5, 6, 7, 8, 9, 10, 11]. We can also find a number of hybrid approaches that combine basic perturbation approaches.

Fig. 1: Complexity of selecting the best privacy preservation approach for a particular application/database

As shown in Figure 1, the availability of many privacy preservation approaches has a drawback: the selection of the optimal perturbation algorithm for a particular problem can be quite complex. The figure shows the different constraints that need to be considered in choosing the best possible privacy preservation algorithm for a particular application and dataset. The different characteristics of privacy models (e.g. k-anonymity, l-diversity, t-closeness, differential privacy ([4])), the different properties of privacy preservation algorithms (e.g. geometric perturbation, data condensation, randomized response), the different dynamics of the input data (e.g. the statistical properties, the dimensions), and the different types of applications at hand (e.g. data clustering, deep learning) are examples of the attributes that influence the effectiveness of privacy preservation and the usability of the results. At the same time, this diversity enables the selection of the privacy preservation algorithm that best suits a particular application. However, there is no generic approach to identify the exact levels of privacy loss vs. utility loss, given a list of privacy preservation algorithms, on particular applications and datasets.

Furthermore, many privacy preservation approaches fall out of favour because their applicability is not properly identified. We introduce a new approach named “Privacy Preservation as a Service” (PPaaS) that employs a novel strategy to apply customized perturbation based on the requirements of the problem at hand and the characteristics of the input dataset.

PPaaS presents a unified service that understands data requesters’ needs and data owners’ requirements (data owners have full access privileges to the raw input databases, represented by the lowest layer in Figure 2); it can facilitate privacy-preserving data sharing and can identify the best privacy preservation approach. An appropriate set of performance and security metrics describes the quality of such a service and is used to tailor the best privacy preservation to stakeholders’ needs. The proposed framework collects efficient privacy preservation methods into a pool and applies the approach that best suits both data owner and data requester to the data before making the data available.

1.1 Rationale and technical novelty

Developing generic privacy-preserving methods for data mining and statistics purposes is challenging due to the large number of constraints that need to be considered. As the complexity of the applications increases, generic approaches often end up with low utility or low privacy ([12]). Many researchers try to overcome this by focusing on a distinct objective (e.g. privacy in deep learning) ([13, 14, 12]). As a result, some areas, such as deep learning, now have many viable privacy preservation solutions ([15]). Since each algorithm has unique features and characteristics, choosing the best one for a particular case can be highly complex.

PPaaS reduces the burden of choosing the optimal privacy-preserving algorithm, and of providing the best protection for the application and dataset at hand, by introducing a unified service for the purpose. Since more than one method may be appropriate for a particular application and dataset, empirical evaluation is utilised in this process. PPaaS manages pools of algorithms suitable for particular applications. When a certain application/dataset is presented, PPaaS assesses the privacy-preserving algorithms and produces a unified metric named the fuzzy index (FI), derived from a fuzzy model (fuzzy models capture the vagueness and imprecision of real-world information using fuzzy sets). We use quantitative definitions of utility and privacy as inputs to the fuzzy model. The higher the fuzzy index, the better the balance between privacy and utility under the given circumstances. The release of a particular output depends on a configurable threshold value of the corresponding FI. If the required threshold is not reached, the algorithms of the corresponding pool are assessed until one of them generates a satisfactory FI (one that meets the threshold) for the application and dataset. With this approach, users are guaranteed the best possible privacy preservation while optimal utility is maintained.
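To make the threshold-based release decision concrete, the following minimal Python sketch shows how the FI gate described above could be implemented; the function name, the dictionary of FI values, and the threshold handling are illustrative assumptions, not the framework's reference implementation.

```python
# Hypothetical sketch of the FI-threshold gate: release the perturbed dataset
# produced by the algorithm with the highest fuzzy index only if it meets the
# data owner's configured threshold.

def release_if_satisfactory(fi_by_algorithm: dict, fi_threshold: float):
    """Return the best algorithm and its FI if the threshold is met, else signal re-assessment."""
    best_algorithm = max(fi_by_algorithm, key=fi_by_algorithm.get)
    best_fi = fi_by_algorithm[best_algorithm]
    if best_fi >= fi_threshold:
        return best_algorithm, best_fi   # release the corresponding perturbed dataset
    return None, best_fi                 # keep assessing / re-running the pool


# Example with FI values for a hypothetical pool of four algorithms.
print(release_if_satisfactory({"RP": 0.51, "GP": 0.64, "PABIDOT": 0.62, "SEAL": 0.66}, 0.6))
```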

2 Literature

Data privacy focuses on impeding the estimation of the original data from the sanitized data, while utility concentrates on preserving application-specific properties and information ([16]). It has been noted that privacy preservation mechanisms decrease utility in general, i.e. they reduce utility to improve privacy, and finding a trade-off between privacy protection and data utility is an important issue ([17]). In fact, privacy and utility are often conflicting requirements: privacy-preserving algorithms provide privacy at the expense of utility. Privacy is often preserved by modifying or perturbing the original data, and a common way of measuring the utility of a privacy-preserving method is to investigate perturbation biases ([18]). This bias is the difference between the result of a query on the perturbed data and the result of the same query on the original data. Wilson et al. examined different data perturbation methods and identified Type A, B, C, and D biases, along with an additional bias named Data Mining (DM) bias ([18]). Type A bias occurs when the perturbation of a given attribute causes summary measures to change. Type B bias is the result of the perturbation changing the relationships between confidential attributes, while in the case of Type C bias, the relationship between confidential and non-confidential attributes changes. Type D bias means that the underlying distribution of the data was affected by the sanitization process. If Type DM bias exists, data mining tools will perform less accurately on the perturbed data than they would on the original dataset.
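As an illustration only (not Wilson et al.'s exact procedure), the sketch below quantifies two of these bias types for a single numeric attribute: Type A is approximated as the shift in a summary measure (the mean), and Type D as a distributional shift detected with a two-sample Kolmogorov-Smirnov test; the noise model and thresholds are assumptions.

```python
# Illustrative bias check for one attribute under additive-noise perturbation.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
original = rng.normal(loc=50.0, scale=5.0, size=1_000)
perturbed = original + rng.normal(loc=0.0, scale=3.0, size=1_000)  # assumed perturbation

type_a_bias = abs(perturbed.mean() - original.mean())   # change in a summary measure (Type A)
ks_stat, p_value = ks_2samp(original, perturbed)        # small p-value hints at Type D bias

print(f"Type A bias (mean shift): {type_a_bias:.4f}")
print(f"Type D indicator (KS statistic): {ks_stat:.4f}, p-value: {p_value:.4f}")
```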

An investigation of existing privacy preservation approaches also suggests that they often suffer from utility or privacy issues when they are considered for generic applications ([4]). Methods such as additive noise perturbation can produce low utility due to the highly randomized nature of the added noise ([19, 8]). Randomized response, another privacy preservation approach, has the same issue and produces low-utility data due to high randomization ([9]). Methods such as multivariate microaggregation provide low usability due to the complexity introduced by their NP-hard nature ([5]). Data condensation provides an efficient solution to privacy preservation of data streams; however, the quality of the data degrades as the data grow, eventually leading to low utility ([20]). Many multi-dimensional approaches, such as rotation perturbation and geometric perturbation, introduce high computational complexity and take an unacceptably long time to execute ([21, 22]). This means that such methods in their default settings are not feasible for high-dimensional data such as big data and data streams. A structured approach is needed that can provide a practically applicable solution for selecting the best privacy preservation approach for a given application or dataset.

Several works have looked at the connection between privacy, utility, and usability. Bertino et al. proposed a framework for evaluating privacy-preserving data mining algorithms; for each algorithm, they focused on assessing the quality of the sanitized data ([20]). Other frameworks aim at providing environments for dealing with sensitive data. Sharemind is a shared multi-party computation environment allowing secret data-sharing ([23]). FRAPP is a matrix-theoretic framework aimed at helping the design of privacy-preserving random perturbation schemes ([24]). Thuraisingham et al. went one step further; they provide a vision for designing a framework that measures both the privacy and utility of multiple privacy-preserving techniques. They also provide insight into balancing privacy and utility in order to provide better privacy preservation ([25]). However, these frameworks neither solve the problem of dealing with numerous privacy preservation algorithms nor provide proper quantification of their utility and privacy for the particular application and dataset at hand.

Fig. 2: Privacy preservation as a service (PPaaS) for big data.

3 Privacy Preservation as a Service

We propose a novel approach named “Privacy Preservation as a Service (PPaaS)”, a generic framework that can be used to sanitize big data in a granular and application-specific manner. In this section, we give a detailed outline of the concept. The high diversity and specificity of privacy preservation methods present complexities, such as finding a trade-off between security, utility, and usability. As noted previously, privacy preservation algorithms can suffer from different types of biases. For example, a particular sanitization algorithm used for privacy-preserving classification may not have DM bias, but it may suffer from Type B and D biases, while another one has only Type B bias, and a third one has DM bias. Different applications may tolerate different types of bias, and there is no general rule. This means that different privacy preservation algorithms suit different data owner requirements (privacy and performance) and different data requester needs (utility and usability).

Fig. 3: Flow of events in application-specific privacy preservation of PPaaS

A unified service of data sanitization for big data can provide an interactive solution to this problem. PPaaS can choose the most suitable privacy preservation algorithm for the particular analysis at hand. The architecture of PPaaS is presented in Figure 2. It is implemented as a web-based framework that can operate in a web service cluster. The scalability necessary for big data processing is achieved using APIs such as Spark/PySpark ([26]) (the primary implementation language is Python), with a clean build design based on a Model-View-Controller (MVC) web framework. As the figure shows, the framework consists of three distinct components: (1) the raw datasets/databases, (2) the PPaaS privacy preservation module, and (3) the users (e.g. analysts), who work with the sanitized (perturbed) data.

The privacy preservation module consists of pools of application logic (e.g. classification and association mining) and pools of privacy preservation algorithms (e.g. matrix multiplication, additive perturbation). The PPaaS privacy preservation module integrates a collection of privacy preservation algorithms into a collection of pools, where each pool represents a particular class of data mining/analysis algorithms. The enlargement of the red circle in Figure 2 shows a possible collection of sub-pools of privacy preservation algorithms for classification. For instance, rotation perturbation (RP) ([27]) can be integrated into the “Generic” sub-pool of pool 1: Classification (refer to the red circle in Figure 2), as it provides better accuracy across a collection of classification algorithms. A particular pool may have several subdivisions to enable the synthesis of new data sanitization methods that are tailored to more specific requirements. The database management layer provides the necessary services for uniform data formatting. It also represents a common platform for the application of different privacy preservation algorithms. (In the proposed concept, privacy preservation is discussed in terms of data perturbation; the following sections use “privacy preservation” and “perturbation” interchangeably, referring to the same objective.) The blue arrows in Figure 2 show the data flow from data owners through the database management layer to the sanitization algorithm.
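A minimal sketch of how such pools and sub-pools could be organised as a data structure is shown below; the application names, sub-pool labels, and algorithm identifiers are illustrative assumptions rather than the framework's actual registry.

```python
# Hypothetical registry mapping applications to sub-pools of perturbation algorithms.
PERTURBATION_POOLS = {
    "classification": {
        "generic": ["rotation_perturbation", "geometric_perturbation", "PABIDOT", "SEAL"],
        "deep_learning": ["local_differential_privacy", "random_projection"],
    },
    "association_mining": {
        "generic": ["randomized_response"],
    },
}

def algorithms_for(application: str, sub_pool: str = "generic") -> list[str]:
    """Return the candidate privacy preservation algorithms for an application."""
    return PERTURBATION_POOLS.get(application, {}).get(sub_pool, [])

print(algorithms_for("classification"))
```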

A data owner/curator can utilize the framework to impose privacy on a particular dataset for a particular application by using the best privacy preservation approach from a pool of available algorithms. In the proposed setting, PPaaS requires a trusted curator to identify the query or the analysis requests for a given dataset, and run the PPaaS logic for the corresponding application (e.g. deep learning ([28])). The curator/data owner accesses the data and applies privacy preservation (perturbation) to the data or dataset according to the users’ requirements.

Fig. 4: Fuzzy membership functions of the input/output variables

The proposed framework has three key aspects: (1) understanding the data owner/producer requirements (privacy), (2) understanding the data requester/consumer needs (utility), and (3) selecting and applying the optimum privacy-preserving algorithm to the data. Finally, the progress of applying privacy preservation to a particular dataset is assessed using a fuzzy metric (named the fuzzy index or FI), a single metric that evaluates the balance between privacy and utility provided by the corresponding privacy preservation algorithm. Figure 3 shows the main flow of PPaaS in releasing a perturbed dataset with a customized application of privacy preservation. The data curator receives a request for a certain operation on the underlying dataset. For example, this request can be for deep learning on a medical dataset that is maintained by the corresponding data owner. The data owner forwards the request to the PPaaS framework, which selects the corresponding pool/sub-pool of privacy preservation algorithms allocated under deep learning. In the example, this pool may include local differentially private approaches, geometric data perturbation approaches, and random projection-based data perturbation approaches, all of which are suitable for producing high utility for deep learning. Next, PPaaS sequentially applies the corresponding pool of privacy preservation algorithms and generates a fuzzy index for each perturbation algorithm. If a particular pool has four privacy preservation algorithms, PPaaS will produce four FI values. PPaaS then selects the perturbed dataset with the highest FI, because the corresponding dataset provides the best balance between privacy and utility. The data curator is able to handle different data sources and sanitize them for requests based on the specific needs of a particular requester.

PPaaS uses a fuzzy inference system (FIS) to generate the fuzzy index. Privacy and utility are the only inputs to the FIS, which generates a final score: the fuzzy index (FI). FI is a quantitative rank that rates the complete process of privacy preservation applied to a particular dataset for a given application. A heuristic approach was followed in defining the fuzzy rules, focusing on maintaining a balance between privacy and utility. The universe of discourse of the inputs and the output ranges from 0 to 1. A higher FI value suggests that the final dataset has high privacy and utility with a good balance between them. The PPaaS dispatcher investigates the FI value corresponding to a particular sanitization process and compares it with a user-defined balance guarantee (a threshold), which is taken as an input parameter from the data owner. If the maximum FI generated by the pool reaches this threshold, the corresponding dataset is released to the data requester. Otherwise, PPaaS reapplies the randomized perturbation algorithms to find a better solution that satisfies the threshold requirement.

IF privacy is LOW AND utility is LOW THEN FI is LOW
IF privacy is LOW AND utility is MEDIUM THEN FI is LOW
IF privacy is LOW AND utility is HIGH THEN FI is LOW
IF privacy is MEDIUM AND utility is LOW THEN FI is LOW
IF privacy is MEDIUM AND utility is MEDIUM THEN FI is MEDIUM
IF privacy is MEDIUM AND utility is HIGH THEN FI is MEDIUM
IF privacy is HIGH AND utility is LOW THEN FI is LOW
IF privacy is HIGH AND utility is MEDIUM THEN FI is MEDIUM
IF privacy is HIGH AND utility is HIGH THEN FI is HIGH    (1)

A fuzzy inference system (FIS) takes several inputs and generates an output by evaluating a collection of specified rules, named fuzzy rules. In the proposed framework (PPaaS), we define a FIS that takes the two inputs, privacy and utility, and produces an output named the fuzzy index (FI). FI provides an impression of the quality of the balance between privacy and utility obtained after perturbing a dataset using a privacy preservation algorithm. From domain knowledge, we already know that a good privacy preservation algorithm should enforce high privacy while producing good utility (e.g. accuracy). Following this notion, FI should ideally provide high values only when both privacy and utility are high. In case one is high and the other is low, the FI should be a lower value. Hence, the fuzzy model should produce a rule surface as presented in Figure 5. Considering these dynamics between privacy, utility, and FI, we introduced three membership functions (LOW, MEDIUM, HIGH) for each variable. Next, we used Gaussian functions for all the membership functions of the two input variables and the output variable, as shown in Figure 4. Finally, we defined the nine rules given in Equation 1 to obtain the rule surface depicted in Figure 5.
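The following minimal sketch builds a Mamdani-style FIS with three Gaussian membership functions per variable and min-consistent rules of the kind described above, using the scikit-fuzzy package; the library choice, the membership-function centres and widths, and the exact rule consequents are assumptions for illustration, not the authors' configuration.

```python
# Sketch of a privacy/utility -> FI fuzzy model (assumed parameters).
import numpy as np
import skfuzzy as fuzz
from skfuzzy import control as ctrl

universe = np.linspace(0, 1, 101)
privacy = ctrl.Antecedent(universe, 'privacy')
utility = ctrl.Antecedent(universe, 'utility')
fi = ctrl.Consequent(universe, 'fi')

# Three Gaussian membership functions (LOW, MEDIUM, HIGH) per variable.
for var in (privacy, utility, fi):
    var['LOW'] = fuzz.gaussmf(var.universe, 0.0, 0.17)
    var['MEDIUM'] = fuzz.gaussmf(var.universe, 0.5, 0.17)
    var['HIGH'] = fuzz.gaussmf(var.universe, 1.0, 0.17)

# Nine rules: FI is HIGH only when both privacy and utility are HIGH.
levels = ['LOW', 'MEDIUM', 'HIGH']
rules = [
    ctrl.Rule(privacy[p] & utility[u], fi[min(p, u, key=levels.index)])
    for p in levels for u in levels
]

fis = ctrl.ControlSystemSimulation(ctrl.ControlSystem(rules))
fis.input['privacy'] = 0.95
fis.input['utility'] = 0.85
fis.compute()
print(round(fis.output['fi'], 4))   # a single fuzzy index in [0, 1]
```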

Figure 5 shows the rule surface of the fuzzy inference system (FIS), which is used to generate FI. As shown in the figure, the FIS generates higher values for FI when both utility and privacy are high, whereas for lower values of privacy and utility, FI also stays at a lower level. The rule surface also ensures that a high value of only one parameter (privacy or utility) does not result in a high value for FI. This property guarantees that the proposed PPaaS framework maintains a good balance between privacy and utility.

Fig. 5: Rule surface of the FIS

Privacy Quantification.

During the application of each privacy preservation algorithm, the privacy is quantified empirically using a multi-column privacy metric, considering that the input datasets are n-dimensional matrices. In the proposed setting, we assume that all the attributes of a particular dataset are equally important, and we ensure this by applying z-score normalization to the input datasets. Then we calculate the variance of the difference between the perturbed and non-perturbed datasets, Var(X − Y). The higher Var(X − Y), the higher the privacy, as it indicates the difficulty of estimating the original data from the perturbed data ([4]). Var(X − Y) is a well-established approach used to measure the level of privacy of perturbed data ([4]). If Y is a perturbed data series of attribute X, the level of privacy of the perturbation method can be measured using Var(D), where D = X − Y. Var(D) can be given by Equation 2.

$\mathrm{Var}(D) = \frac{1}{n-1}\sum_{i=1}^{n}\big((x_i - y_i) - (\bar{x} - \bar{y})\big)^2$    (2)

Given that there are m attributes in a particular dataset, we consider the minimum privacy guarantee to be the minimum variance (Var_min) across all the attributes in the corresponding dataset. Var_min is the level of privacy of the weakest attribute in a perturbed dataset. Equation 3 shows the generation of the minimum privacy guarantee (Var_min) for a particular dataset.

$\mathrm{Var}_{\min} = \min_{j \in \{1,\dots,m\}} \mathrm{Var}(X_j - Y_j)$    (3)

Assuming that a particular pool has K privacy preservation algorithms, we scale the Var_min values to lie between 0 and 1 by applying Equation 4 to the corresponding pool. The value returned from Equation 4 is used as the privacy input to the FIS (which accepts inputs in the range [0, 1]).

$privacy_i = \dfrac{\mathrm{Var}_{\min}^{(i)}}{\max_{k \in \{1,\dots,K\}} \mathrm{Var}_{\min}^{(k)}}$    (4)
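A minimal NumPy sketch of this privacy quantification step (Equations 2 to 4) is given below; the function names, the use of the sample variance, and the assumption that the perturbed data were produced from the normalised original data are illustrative choices, not the paper's implementation.

```python
# Sketch of the privacy side of the FIS input (assumed helper names).
import numpy as np

def min_privacy_guarantee(original: np.ndarray, perturbed: np.ndarray) -> float:
    """Minimum, over attributes, of Var(X_j - Y_j) on z-score-normalised data."""
    z = (original - original.mean(axis=0)) / original.std(axis=0)  # z-score normalisation
    diff = z - perturbed                       # per-attribute difference X_j - Y_j
    return float(np.var(diff, axis=0, ddof=1).min())

def scale_privacy(guarantees: list[float]) -> list[float]:
    """Equation 4: scale each pool member's guarantee by the pool maximum."""
    top = max(guarantees)
    return [g / top for g in guarantees]
```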

Utility Quantification.

The accuracy of the results produced by the requested service is evaluated experimentally to generate the empirical utility. If the application being examined is classification, the classification accuracy is generated for all the privacy preservation algorithms in the pool for the corresponding type of data classification. All the accuracy (utility) values are scaled between 0 and 1, as the range of inputs accepted by the FIS is bounded by the window [0, 1].
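As a hedged sketch of this step, the snippet below trains a classifier on the perturbed data and uses its accuracy (already in [0, 1]) as the utility input; the choice of scikit-learn, 10-fold cross-validation, and the k-nearest-neighbour classifier are illustrative assumptions, not the paper's exact evaluation protocol.

```python
# Sketch of the utility side of the FIS input for a classification application.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

def empirical_utility(perturbed_features: np.ndarray, labels: np.ndarray) -> float:
    """Mean cross-validated classification accuracy on the perturbed dataset."""
    clf = KNeighborsClassifier(n_neighbors=5)   # stands in for MLP/IBK/SVM/Naive Bayes/J48
    scores = cross_val_score(clf, perturbed_features, labels, cv=10, scoring="accuracy")
    return float(scores.mean())
```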

Algorithm for generating FI

Algorithm 1 is used for generating the FI values for a particular pool of privacy preservation algorithms.

Algorithm 1: Generating FI for a pool of privacy preservation algorithms
Input: input dataset D; pool of privacy preservation algorithms {A_1, ..., A_K}
Output: selected perturbed dataset D_s; selected privacy-preserving algorithm A_s
1. Perturb D using each algorithm in the pool to generate the perturbed datasets D_1, ..., D_K;
2. Generate the privacy inputs (p_1, ..., p_K) using Equation 4;
3. Generate the utility inputs (u_1, ..., u_K) by running the corresponding application on D_1, ..., D_K;
4. Generate the fuzzy indices (FI_1, ..., FI_K) by feeding the privacy and utility inputs into the fuzzy model;
5. Select the dataset (D_s) and the privacy-preserving algorithm (A_s) that return the highest FI.
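A runnable sketch of Algorithm 1 under stated assumptions follows: the perturbation algorithms are modelled as callables, compute_fi wraps a fuzzy model such as the one sketched earlier, and the privacy and utility helpers follow the sketches above; none of these names come from the paper's implementation.

```python
# Hypothetical end-to-end selection loop corresponding to Algorithm 1.
import numpy as np
from typing import Callable

def select_best_perturbation(
    dataset: np.ndarray,
    labels: np.ndarray,
    pool: dict[str, Callable[[np.ndarray], np.ndarray]],
    compute_fi: Callable[[float, float], float],
    utility_fn: Callable[[np.ndarray, np.ndarray], float],
    privacy_fn: Callable[[np.ndarray, np.ndarray], float],
):
    perturbed = {name: algo(dataset) for name, algo in pool.items()}              # step 1
    guarantees = {name: privacy_fn(dataset, d) for name, d in perturbed.items()}
    top = max(guarantees.values())
    privacy_in = {name: g / top for name, g in guarantees.items()}                # step 2 (Eq. 4)
    utility_in = {name: utility_fn(d, labels) for name, d in perturbed.items()}   # step 3
    fi = {name: compute_fi(privacy_in[name], utility_in[name]) for name in pool}  # step 4
    best = max(fi, key=fi.get)                                                    # step 5
    return best, perturbed[best], fi
```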

4 Case Studies and Results

In this section, we demonstrate how PPaaS selects the best perturbation algorithm and perturbed dataset from a particular pool of algorithms. During the experiments, we consider five classification algorithms: Multilayer Perceptron (MLP), k-nearest neighbor (IBK), Sequential Minimal Optimization (SVM), Naive Bayes, and J48 ([29]). We use four privacy preservation algorithms: rotation perturbation (RP), geometric perturbation (GP), PABIDOT, and SEAL ([4]), which are benchmarked for utility on the selected classification algorithms ([4]). The algorithms were tested on five different datasets retrieved from the UCI machine learning data repository (http://archive.ics.uci.edu/ml/index.php). Table I provides a summary of the datasets. All the experiments were run on a Windows 7 (Enterprise 64-bit, Build 7601) computer with an Intel(R) i7-4790 (4th generation) CPU (8 cores, 3.60 GHz) and 8 GB RAM.

Dataset | Abbreviation | Number of Records | Number of Attributes | Number of Classes
Wholesale customers (https://archive.ics.uci.edu/ml/datasets/Wholesale+customers) | WCDS | 440 | 8 | 2
Wine Quality (https://archive.ics.uci.edu/ml/datasets/Wine+Quality) | WQDS | 4898 | 12 | 7
Page Blocks Classification (https://archive.ics.uci.edu/ml/datasets/Page+Blocks+Classification) | PBDS | 5473 | 11 | 5
Letter Recognition (https://archive.ics.uci.edu/ml/datasets/Letter+Recognition) | LRDS | 20000 | 17 | 26
Statlog (Shuttle) (https://archive.ics.uci.edu/ml/datasets/Statlog+%28Shuttle%29) | SSDS | 58000 | 9 | 7
TABLE I: A summary of the datasets used for the experiments.

In the proposed experimental setting, we consider 25 case studies, where each case study considers one of the five classification algorithms and one of the five datasets. We consider a pool of four data perturbation algorithms, RP, GP, PABIDOT, and SEAL, under each of the case studies, represented as CS (CS stands for “case study”) in Table II. Next, we evaluated the performance of each privacy preservation algorithm in each case to generate the ranks (fuzzy indices: FIs) and recorded them in Table III. Table II shows the classification accuracy and the minimum privacy guarantee produced for each pool of privacy preservation algorithms. In each pool, the input datasets were perturbed using the four privacy preservation algorithms. The perturbed data were then analysed by each classification algorithm to generate classification accuracy (utility) values. Table II also includes the min(std(P)) values generated as explained before. We considered the standard deviation (std) of the difference between the original normalized data and the perturbed data. To keep the values within the 0 to 1 range for the fuzzy privacy input, we applied Equation 4 to the min(std(P)) values, where the denominator is the maximum standard deviation value returned by the corresponding pool of privacy preservation algorithms.

Dataset | Privacy-preserving algorithm | MLP (CS 1) | IBK (CS 2) | SVM (CS 3) | Naive Bayes (CS 4) | J48 (CS 5) | min(std(P)) | Scaled
LRDS | RP | 0.7404 | 0.8719 | 0.7107 | 0.4841 | 0.6489 | 0.8750 | 0.6223
LRDS | GP | 0.7912 | 0.9305 | 0.7792 | 0.5989 | 0.7054 | 1.3248 | 0.9422
LRDS | PABIDOT | 0.7822 | 0.9224 | 0.7848 | 0.6280 | 0.7262 | 1.4046 | 0.9989
LRDS | SEAL | 0.8059 | 0.9367 | 0.8171 | 0.6310 | 0.8528 | 1.4061 | 1.0000
PBDS | RP | 0.9200 | 0.9552 | 0.8999 | 0.3576 | 0.9561 | 0.7261 | 0.5149
PBDS | GP | 0.9024 | 0.9567 | 0.8993 | 0.4310 | 0.9549 | 0.2845 | 0.2017
PBDS | PABIDOT | 0.9583 | 0.9476 | 0.9209 | 0.8968 | 0.9492 | 1.4102 | 1.0000
PBDS | SEAL | 0.9634 | 0.9673 | 0.9559 | 0.8697 | 0.9634 | 1.3900 | 0.9857
SSDS | RP | 0.9626 | 0.9980 | 0.8821 | 0.6904 | 0.9951 | 1.2820 | 0.8847
SSDS | GP | 0.9873 | 0.9981 | 0.7841 | 0.7918 | 0.9959 | 1.4490 | 1.0000
SSDS | PABIDOT | 0.9865 | 0.9867 | 0.9280 | 0.9134 | 0.9874 | 1.4058 | 0.9702
SSDS | SEAL | 0.9970 | 0.9921 | 0.9851 | 0.8994 | 0.9987 | 1.4065 | 0.9707
WCDS | RP | 0.8909 | 0.8500 | 0.8227 | 0.8455 | 0.8682 | 1.0105 | 0.6912
WCDS | GP | 0.9182 | 0.8659 | 0.8500 | 0.8432 | 0.8886 | 1.4620 | 1.0000
WCDS | PABIDOT | 0.9045 | 0.8545 | 0.8841 | 0.8886 | 0.8841 | 1.3680 | 0.9357
WCDS | SEAL | 0.8932 | 0.8682 | 0.8909 | 0.8841 | 0.8659 | 1.3130 | 0.8981
WQDS | RP | 0.4765 | 0.5329 | 0.4488 | 0.3232 | 0.4553 | 1.2014 | 0.8570
WQDS | GP | 0.4886 | 0.5688 | 0.4488 | 0.3216 | 0.4643 | 1.3463 | 0.9603
WQDS | PABIDOT | 0.5412 | 0.6182 | 0.5147 | 0.4657 | 0.4916 | 1.4019 | 1.0000
WQDS | SEAL | 0.5392 | 0.6402 | 0.5202 | 0.4783 | 0.8415 | 1.3834 | 0.9868
TABLE II: Classification accuracies (utility after privacy preservation) returned by four privacy-preserving algorithms under five classification algorithms, and the minimum privacy guarantees (min(std(P)) and its scaled value) generated according to Equations 3 and 4 using the differences between the original and perturbed data. (CS: case study)

The values in Table II are evaluated using the proposed fuzzy model to generate the FI ranks for each privacy preservation algorithm and perturbed dataset, as given in Table III. The highest rank generated in each pool of algorithms marks the selected method. Although SEAL has the best performance results in many cases, the table clearly shows that the input dataset and the choice of application (e.g. classification) play a major role in selecting the best privacy preservation approach.

5 Discussion

In this paper, we proposed a new paradigm named privacy preservation as a service (PPaaS) to improve the application of privacy preservation to a dataset for a given application, eventually improving the utility of existing and new privacy preservation approaches. The domain of data privacy contains a plethora of privacy preservation approaches proposed for different types of applications. Consequently, it is a highly complex process to identify the best possible privacy preservation approach for a particular application. PPaaS provides a solution by introducing a service-oriented framework that collects existing privacy preservation approaches and semantically categorizes them into pools of applications. Developers of new privacy preservation algorithms can introduce their methods to the PPaaS framework and add them to the corresponding pools of applications. When a data owner/curator wants to apply privacy preservation to a particular dataset, PPaaS ranks the methods in the relevant pools of applications with respect to the dataset. The ranks are expressed in the form of a fuzzy index (FI). FI is generated using a fuzzy inference system that takes two inputs: privacy and utility. PPaaS quantifies privacy in terms of the variance of the difference between the input data and the perturbed data, Var(X − Y). PPaaS considers the concept of a minimum privacy guarantee (Var_min), which is the minimum of the per-attribute variances Var(X_1 − Y_1) to Var(X_m − Y_m). Var_min represents the strength of the weakest attribute in a perturbed dataset and is called the minimum privacy guarantee. The utility is the accuracy measured under the corresponding application. For example, when the application is data classification, PPaaS considers classification accuracy as the utility measurement. PPaaS selects the privacy preservation approach or the perturbed dataset that returns the highest FI, which represents the case with the best balance between privacy and utility.

Dataset | Privacy-preserving algorithm | FI under MLP (CS 1) | FI under IBK (CS 2) | FI under SVM (CS 3) | FI under Naive Bayes (CS 4) | FI under J48 (CS 5)
LRDS | RP | 0.5107 | 0.5068 | 0.5091 | 0.4999 | 0.5072
LRDS | GP | 0.6382 | 0.8156 | 0.6203 | 0.5036 | 0.5391
LRDS | PABIDOT | 0.6247 | 0.8093 | 0.6286 | 0.5078 | 0.5560
LRDS | SEAL | 0.6608 | 0.8201 | 0.6782 | 0.5083 | 0.7315
PBDS | RP | 0.5001 | 0.5001 | 0.5001 | 0.4891 | 0.5001
PBDS | GP | 0.3509 | 0.3509 | 0.3509 | 0.3509 | 0.3509
PBDS | PABIDOT | 0.8334 | 0.8272 | 0.8081 | 0.7856 | 0.8282
PBDS | SEAL | 0.8360 | 0.8379 | 0.8321 | 0.7541 | 0.8360
SSDS | RP | 0.7723 | 0.7723 | 0.7693 | 0.5296 | 0.7723
SSDS | GP | 0.8462 | 0.8499 | 0.6275 | 0.6391 | 0.8492
SSDS | PABIDOT | 0.8393 | 0.8393 | 0.8137 | 0.8016 | 0.8393
SSDS | SEAL | 0.8395 | 0.8395 | 0.8395 | 0.7882 | 0.8395
WCDS | RP | 0.5301 | 0.5301 | 0.5301 | 0.5301 | 0.5301
WCDS | GP | 0.8058 | 0.7492 | 0.7275 | 0.7178 | 0.7767
WCDS | PABIDOT | 0.7933 | 0.7339 | 0.7716 | 0.7767 | 0.7716
WCDS | SEAL | 0.7818 | 0.7522 | 0.7793 | 0.7716 | 0.7492
WQDS | RP | 0.4998 | 0.5003 | 0.4992 | 0.4773 | 0.4994
WQDS | GP | 0.5000 | 0.5014 | 0.4993 | 0.4765 | 0.4997
WQDS | PABIDOT | 0.5004 | 0.5061 | 0.5001 | 0.4997 | 0.5000
WQDS | SEAL | 0.5004 | 0.5103 | 0.5001 | 0.4999 | 0.7153
TABLE III: The best choice of perturbation in each pool based on the highest FI rank values returned.

We ran experiments with PPaaS using five different datasets, five different classification algorithms, and four different privacy preservation algorithms that are benchmarked to produce good utility over the corresponding classification algorithms. Our experiments show that the four privacy preservation algorithms are ranked differently based on the application and the input dataset. The highest values of FI indicate the highest privacy and utility with the best balance between them. After comparing the FI values (available in Table III) generated from the values available in Table II, we can conclude that FI provides high values if and only if both the utility and the privacy returned by the corresponding method are high. In all other cases, the fuzzy inference system (FIS) produces lower values for the FI. Hence, FI enables PPaaS to identify the best perturbed dataset generated by the most suitable privacy preservation algorithm for the corresponding pool of algorithms and for the input dataset.

6 Conclusion

This paper introduced a novel framework named Privacy Preservation as a Service (PPaaS), which tailors privacy preservation to stakeholders’ needs. PPaaS reduces the complexity of choosing the best data perturbation algorithm from a large number of privacy preservation algorithms. The ability to apply the best perturbation while preserving enough utility makes PPaaS an excellent solution for big data perturbation. In order to select the best privacy preservation method, PPaaS uses a fuzzy inference system (FIS) that enables PPaaS to generate ranks that are expressed as fuzzy indices for the privacy preservation algorithms applied to a dataset for a given application. The experimental results show that the fuzzy indices are a good indication of the capability of a particular privacy preservation algorithm to maintain a good balance between privacy and utility.

References

  • [1] M. A. P. Chamikara, P. Bertok, D. Liu, S. Camtepe, and I. Khalil, “Efficient data perturbation for privacy preserving and accurate data stream mining,” Pervasive and Mobile Computing, vol. 48, pp. 1–19, 2018.
  • [2] M. A. P. Chamikara, P. Bertok, I. Khalil, D. Liu, S. Camtepe, and M. Atiquzzaman, “A trustworthy privacy preserving framework for machine learning in industrial iot systems,” IEEE Transactions on Industrial Informatics, vol. 16, no. 9, pp. 6092–6102, 2020.
  • [3] M. A. P. Chamikara, P. Bertok, I. Khalil, D. Liu, and S. Camtepe, “Privacy preserving face recognition utilizing differential privacy,” Computers & Security, 2020.
  • [4] M. A. P. Chamikara, P. Bertok, D. Liu, S. Camtepe, and I. Khalil, “Efficient privacy preservation of big data for accurate data mining,” Information Sciences, 2019.
  • [5] V. Torra, “Fuzzy microaggregation for the transparency principle,” Journal of Applied Logic, vol. 23, pp. 70–80, 2017.
  • [6] A. Hasan, Q. Jiang, J. Luo, C. Li, and L. Chen, “An effective value swapping method for privacy preserving data publishing,” Security and Communication Networks, vol. 9, no. 16, pp. 3219–3228, 2016.
  • [7] Y. A. A. S. Aldeen, M. Salleh, and M. A. Razzaque, “A comprehensive review on privacy preserving data mining,” SpringerPlus, vol. 4, no. 1, p. 694, 2015.
  • [8] B. D. Okkalioglu, M. Okkalioglu, M. Koc, and H. Polat, “A survey: deriving private information from perturbed data,” Artificial Intelligence Review, vol. 44, no. 4, pp. 547–569, 2015.
  • [9] C. Dwork, A. Roth et al., “The algorithmic foundations of differential privacy,” Foundations and Trends® in Theoretical Computer Science, vol. 9, no. 3–4, pp. 211–407, 2014.
  • [10] M. A. P. Chamikara, P. Bertok, I. Khalil, D. Liu, and S. Camtepe, “Privacy preserving distributed machine learning with federated learning,” arXiv preprint arXiv:2004.12108, 2020.
  • [11] M. A. P. Chamikara, P. Bertók, D. Liu, S. Camtepe, and I. Khalil, “An efficient and scalable privacy preserving algorithm for big data and data streams,” Computers & Security, vol. 87, p. 101570, 2019.
  • [12] M. A. P. Chamikara, P. Bertok, I. Khalil, D. Liu, S. Camtepe, and M. Atiquzzaman, “Local differential privacy for deep learning,” IEEE Internet of Things Journal, 2019.
  • [13] M. Abadi, A. Chu, I. Goodfellow, H. B. McMahan, I. Mironov, K. Talwar, and L. Zhang, “Deep learning with differential privacy,” in Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security.   ACM, 2016, pp. 308–318.
  • [14] R. Shokri and V. Shmatikov, “Privacy-preserving deep learning,” in Proceedings of the 22nd ACM SIGSAC conference on computer and communications security.   ACM, 2015, pp. 1310–1321.
  • [15] J. Zhao, Y. Chen, and W. Zhang, “Differential privacy preservation in deep learning: Challenges, opportunities and solutions,” IEEE Access, vol. 7, pp. 48 901–48 911, 2019.
  • [16] C. C. Aggarwal, “Privacy-preserving data mining,” in Data Mining.   Springer, 2015, pp. 663–693.
  • [17] L. Xu, C. Jiang, Y. Chen, Y. Ren, and K. R. Liu, “Privacy or utility in data collection? a contract theoretic approach,” IEEE Journal of Selected Topics in Signal Processing, vol. 9, no. 7, pp. 1256–1269, 2015.
  • [18] R. L. Wilson and P. A. Rosen, “Protecting data through ‘perturbation’ techniques: The impact on knowledge discovery in databases,” in Information Security and Ethics: Concepts, Methodologies, Tools, and Applications.   IGI Global, 2008, pp. 1550–1561.
  • [19] R. Agrawal and R. Srikant, “Privacy-preserving data mining,” in ACM Sigmod Record, vol. 29, no. 2.   ACM, 2000, pp. 439–450.
  • [20] E. Bertino, I. N. Fovino, and L. P. Provenza, “A framework for evaluating privacy preserving data mining algorithms,” Data Mining and Knowledge Discovery, vol. 11, no. 2, pp. 121–154, 2005.
  • [21] K. Chen and L. Liu, “A random rotation perturbation approach to privacy preserving data classification,” The Ohio Center of Excellence in Knowledge-Enabled Computing, 2005. [Online]. Available: https://corescholar.libraries.wright.edu/knoesis/916/
  • [22] ——, “Geometric data perturbation for privacy preserving outsourced data mining,” Knowledge and Information Systems, vol. 29, no. 3, pp. 657–695, 2011.
  • [23] D. Bogdanov, S. Laur, and J. Willemson, “Sharemind: A framework for fast privacy-preserving computations,” Computer Security-ESORICS 2008, pp. 192–206, 2008.
  • [24] S. Agrawal and J. R. Haritsa, “A framework for high-accuracy privacy-preserving mining,” in Data Engineering, 2005. ICDE 2005. Proceedings. 21st International Conference on.   IEEE, 2005, pp. 193–204.
  • [25] B. Thuraisingham, M. Kantarcioglu, E. Bertino, and C. Clifton, “Towards a framework for developing cyber privacy metrics: A vision paper,” in Big Data (BigData Congress), 2017 IEEE International Congress on.   IEEE, 2017, pp. 256–265.
  • [26] T. Drabas and D. Lee, Learning PySpark.   Packt Publishing Ltd, 2017.
  • [27] K. Chen and L. Liu, “Privacy preserving data classification with rotation perturbation,” in Data Mining, Fifth IEEE International Conference on.   IEEE, 2005, 4 pp.
  • [28] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” nature, vol. 521, no. 7553, pp. 436–444, 2015.
  • [29] I. H. Witten, E. Frank, M. A. Hall, and C. J. Pal, Data Mining: Practical machine learning tools and techniques.   Morgan Kaufmann, 2016. [Online]. Available: https://books.google.com.au/books?isbn=0128043571