Privacy Preserving Distributed Machine Learning with Federated Learning

04/25/2020
by M. A. P. Chamikara, et al.
RMIT University

Edge computing and distributed machine learning have advanced to a level that can revolutionize entire organizations. Distributed devices such as the Internet of Things (IoT) often produce large amounts of data, eventually resulting in big data that can be vital in uncovering hidden patterns and other insights in numerous fields such as healthcare, banking, and policing. Data related to areas such as healthcare and banking can contain potentially sensitive data that can become public if they are not appropriately sanitized. Federated learning (FedML) is a recently developed distributed machine learning (DML) approach that tries to preserve privacy by bringing the learning of an ML model to the data owners. However, the literature shows different attack methods, such as membership inference, that exploit the vulnerabilities of ML models as well as of the coordinating servers to retrieve private data. Hence, FedML needs additional measures to guarantee data privacy. Furthermore, big data often requires more resources than are available in a standard computer. This paper addresses these issues by proposing a distributed perturbation algorithm named DISTPAB for the privacy preservation of horizontally partitioned data. DISTPAB alleviates computational bottlenecks by distributing the task of privacy preservation, utilizing the asymmetry of resources of a distributed environment, which can have resource-constrained devices as well as high-performance computers. Experiments show that DISTPAB provides high accuracy, high efficiency, high scalability, and high attack resistance. Further experiments on privacy-preserving FedML show that DISTPAB is an excellent solution for stopping privacy leaks in DML while preserving high data utility.


1 Introduction

The amalgamation of different technologies such as edge computing, IoT, cloud computing, and machine learning has contributed to a rapid proliferation of technological development in many areas such as healthcare and banking tegegne2014enriching; kim2017information; serban2019real. The increase of cheap pervasive sensing devices has contributed to the rapid growth of IoT, making it one of the main sources of big data khan2018iot. In the broader spectrum of sensor systems, cyber-physical systems and advanced analysis tools are converged to provide consolidated services. As a result, a particular system (e.g. healthcare, banking) can now benefit from multiple sources of data, in addition to what is accumulated by conventional means arachchige2020trustworthy. This growing availability of different sources of data has enabled leading fields such as healthcare to make remarkable advances in areas such as drug discovery, early outbreak detection, and epidemic control analysis, which were once considered to be complicated kim2017information; serban2019real. However, data related to fields such as healthcare, banking, and policing are massively convoluted with sensitive private data arachchige2019local; chamikara2019efficient; chamikara2016fuzzy. It is essential to take extreme measures to protect sensitive data while analyzing them to generate meaningful insights alabdulatif2018real; alabdulatif2019secure. This is an extremely challenging task, as systems related to fields such as healthcare and banking are often densely distributed. This paper examines the issues related to distributed data sharing and analysis in order to devise an optimal privacy preservation solution for distributed machine learning bonawitz2019towards in environments such as the one presented in Figure 1, which represents a typical distributed industry setup (e.g. smart healthcare, open banking) that runs on IoT, edge, fog, and cloud computing.

Privacy violations in fields such as healthcare and banking can be catastrophic due to the availability of highly sensitive person-specific data chamikara2019efficient. Among different definitions, privacy for data sharing and analysis can be defined as "controlled information release" bertino2008survey. It has been shown that it is easy to identify patients in a database by combining several quasi-identifiers such as age, postcode, and sex samarati2001protecting. Removing just the identifiers from a dataset before releasing it is not enough to protect the individuals' privacy, and leaking personal information to untrusted third parties can be catastrophic chamikara2018efficient; lopez2013privacy; bilge2013scalable; li2020voluntary. Privacy-preserving data mining (PPDM) is the area that applies privacy-preserving approaches to data mining methods to protect the private information of the users of the underlying input data during the data mining processes chamikara2018efficient. In this paper, we investigate the PPDM solutions that can be applied to limit privacy leaks in distributed machine learning (DML) under big data settings. The area of homomorphic encryption is widely explored for PPDM. However, in terms of big data and DML, homomorphic encryption cannot address three challenges: (i) efficiency, (ii) high volume, and (iii) massive distribution of data. Furthermore, homomorphic encryption increases the data size during encryption (e.g. a single bit can be expanded to 16 bits), which is unreliable for big data and increases the data storage burden zhou2017security. Compared to encryption, data perturbation (data modification) can provide efficient solutions for privacy preservation with a predetermined error resulting from the data modification chamikaraprocal; yargic2019privacy.

Federated Learning (FedML) is a distributed machine learning approach developed to provide efficient privacy-preserving machine learning in a distributed environment yang2019federated. In FedML, the machine learning model generation is done at the data owners' computers, and a coordinating server (e.g. a cloud server) is used to generate a global model and share the ML knowledge among the distributed entities (e.g. edge devices). Since the original data never leave the data owners' devices, FedML is assumed to provide privacy to the raw data. However, ML models show vulnerability to privacy inference attacks such as model memorizing attacks and membership inference, which focus on retrieving sensitive data from trained ML models even under black-box settings song2017machine; shokri2017membership. Model inversion attacks that recover images from a facial recognition system fredrikson2015model are another example of the vulnerability of ML to advanced adversarial attacks. If adversaries gain access to the central/coordinating server, they can deploy attacks such as model memorizing attacks, which can memorize and extract raw data from the trained models song2017machine. Hence, DML approaches such as FedML need additional measures to guarantee that there will not be any unanticipated privacy leaks.

Differential privacy is a privacy definition (privacy model) that offers a strong notion of privacy compared to previous models arachchige2019local. However, due to the application of heavy noise, previous attempts to enforce differential privacy on big data have resulted in low utility for advanced analytics, which can be catastrophic for applications such as healthcare akgun2015privacy. A major disadvantage of other techniques, such as random rotation and geometric perturbation, is their inability to process high-dimensional data (big data) efficiently: these perturbation approaches spend an excessive amount of time generating results with good utility and privacy chen2005random; chen2011geometric. In terms of efficiency, additive perturbation provides good performance; however, the perturbed data end up with a low privacy guarantee okkalioglu2015survey. Another issue that is often ignored when developing privacy-preserving mechanisms for big data is data capacity: applying privacy preservation (algorithms based on encryption or perturbation) to a large database can be prohibitively expensive, or even infeasible, if resource allocation is not handled correctly. A recently developed algorithm named PABIDOT for big data privacy preservation promises high efficiency for big data and shows high classification accuracy while providing high privacy. However, PABIDOT cannot be used for DML as it does not address data distribution and data perturbation using resource-constrained devices (as depicted in Figure 1).

Figure 1: An example of a distributed organizational setting: a healthcare ecosystem which is geographically distributed among multiple locations. A healthcare system may have many distributed branches, each facilitating services and collecting healthcare data, including IoT sensor data. The central body coordinates the distributed hospitals in maintaining data integrity to support a wide range of analytics. The central authority/research centre is also responsible for sharing data with cloud-based third parties for enhanced intelligence and quality of service for their patients.

We propose a new DIStributed Privacy-preserving Approach for distributed machine learning in geographically distributed data and systems (DISTPAB). DISTPAB is a distributed privacy-preserving algorithm that employs a data perturbation mechanism. The distributed scenario of data perturbation in DISTPAB allows the perturbation of extensive datasets that need to be shared among distributed entities without leaking privacy. The actual data perturbation is conducted at the distributed entities (in edge/fog devices) locally, using the global perturbation parameters generated in a central coordinating node, before conducting FedML. In this way, DISTPAB prevents original (pre-perturbation) data from being communicated via the network, where they could be attacked by adversaries. The global perturbation parameter generation of DISTPAB ensures that there is no degradation of the accuracy or attack resistance of the perturbed data. DISTPAB was first tested using six datasets obtained from the UCI Machine Learning Repository (http://archive.ics.uci.edu/ml/index.php). The results show that DISTPAB provides excellent classification accuracy, attack resistance, and efficiency for distributed machine learning under big data settings.

The rest of the paper is set out as follows. Section 2 includes a summary of related work. Section 3 provides a summary of the fundamentals used in developing DISTPAB. Section 4 provides background information related to DISTPAB. Section 5 presents the experimental evaluations of the performance and the attack resistance of DISTPAB. Section 6 provides a discussion of the results presented in Section 5. The paper is concluded in Section 7.

2 Literature Review

Distributed systems such as those in healthcare have become vastly complex due to the amalgamation of different technologies such as IoT, edge computing, and cloud computing. Due to these advanced capabilities, a modern system can utilize a myriad of data sources to facilitate improved capabilities for essential services. The nature of distributed data platforms introduces a plethora of complexities in preserving the privacy of users without compromising data utility. Extremely high dimensionality and massive distribution of data sources are two of the main complexities that need to be addressed when designing robust privacy-preserving approaches for DML systems. For these reasons, privacy-preserving data mining (PPDM) approaches should be efficient and should be able to work in distributed settings. Approaches based on secure multi-party computation oleshchuk2009internet, attribute access control khan2018iot, lightweight cryptographic procedures zhou2017security, and homomorphic encryption zhou2017security are a few examples of encryption-based approaches that can provide good privacy. However, the high computational complexity of such encryption scenarios can seriously jeopardize the performance of a distributed ML setup. The inherent high computational complexity of cryptographic approaches requires excessive amounts of computational resources, including large storage facilities such as cloud computing, which are often controlled by third parties. Furthermore, encryption approaches such as homomorphic encryption for PPDM increase the size of the data after the application of encryption, which is not suitable for the domain of big data as the data sizes can already be extensive. Hence, the implementation of cryptographic scenarios for distributed databases can be unrealistic and unaffordable akgun2015privacy. Compared to encryption, data perturbation/modification provides efficient and lightweight solutions for big data privacy okkalioglu2015survey, and hence data perturbation is a better-fitting solution for PPDM on distributed data.

Previous data perturbation approaches include swapping hasan2016effective, additive perturbation muralidhar1999general, condensation aggarwal2004condensation, randomized response fox2015randomized, microaggregation soria2015t, random rotation chen2005random, random projection liu2006random, geometric perturbation chen2011geometric, and hybrid perturbation aldeen2015comprehensive. Due to the modifications, the data end up with reduced utility. The relationship between the utility and the privacy granted by a particular perturbation algorithm is defined via a privacy model machanavajjhala2015designing. More than a few privacy models have been introduced, each trying to overcome the defects of another: k-anonymity niu2014achieving; navarro2012user, l-diversity machanavajjhala2006diversity, t-closeness li2007t, (α,k)-anonymity wong2006alpha, and kθ-affinity carpineto2015ktheta are some examples of privacy models. However, these models exhibit vulnerabilities to attacks such as composition attacks ganta2008composition, minimality attacks zhang2007information, and foreground knowledge attacks wong2011can. Moreover, the perturbation approaches that use these models do not scale to big data with high dimensions: when the dimensions of the input dataset grow, the computational cost necessary to conduct data perturbation grows exponentially. The literature refers to this phenomenon as "the dimensionality curse" aggarwal2008privacy. Another issue in high-dimensional data is the leak of extra information that can be effectively used by attackers to misuse sensitive private information bettini2015privacy. Differential privacy (DP) is another privacy model, developed more recently to render a strong privacy guarantee compared to prior approaches. Nevertheless, due to the high noise levels imposed by DP algorithms, DP might not be the optimal choice for big data privacy akgun2015privacy.

The distribution of data sources and infrastructures also introduces massive complexity in the development of robust privacy-preserving approaches for distributed databases. Modern distributed privacy-preserving approaches include federated learning (FedML) and split learning (SplitNN), which can provide a certain level of privacy to distributed databases. In FedML, the distributed clients use their local data to train local ML models and communicate the locally trained ML model parameters to the central server to train a global representation of the local models. The server then distributes the global model parameters to the clients so that the clients can generalize their local models based on the model parameters federated by the server. SplitNN has a similar distributed setup. However, instead of communicating the model parameters, SplitNN splits the ML model between the clients and the server. During the training process, a client holds one portion of the ML model, whereas the server holds the other portion. As a result, a client transfers activations from its split layer to the split layer of the server during both the forward and backward passes of the training process. When many clients need to connect to the server for ML model training, the clients will have either peer-to-peer or client-server communication for the model parameter exchange in order for each client to have a synchronous ML model training process. However, both FedML and SplitNN have the same problem of a central point of failure, as the server in both cases has too much control over the model learning process. If the server is attacked, the whole framework becomes vulnerable, and as a result, user privacy cannot be guaranteed. In particular, attack methods such as membership inference and model memorization can exploit the vulnerabilities of this central point of failure to retrieve the raw data used for the model training process. The other distributed approaches, which work on horizontally and vertically partitioned data, also tend to produce issues with efficiency or privacy hardy2017private. As most of these methods use approaches based on homomorphic encryption to provide privacy, they become inefficient when working with big data, and the need to share too much information with the server makes most of them untrusted and yields the same issue of a central point of failure hardy2017private.

Among data perturbation approaches, matrix multiplicative approaches have proven to provide high utility for data clustering and classification chamikara2018efficient. Examples of matrix multiplicative perturbation approaches include random rotation, geometric, and projection perturbation methods okkalioglu2015survey. For example, random rotation perturbation repeatedly multiplies the input data by a randomly generated orthogonal matrix until the perturbed data satisfy a pre-anticipated level of privacy chen2005random. Geometric data perturbation combines random rotation with an additional translation matrix and distance perturbation; the added randomness of the random translation matrix improves the privacy of the perturbed data. However, geometric data perturbation also follows the same repeated approach until the perturbed data satisfy an expected level of privacy chen2011geometric. Random projection perturbation follows a different approach by projecting the high-dimensional input data to a low-dimensional space liu2006random. One of the important properties of matrix multiplicative data perturbation approaches is their ability to preserve the distances between the tuples of the input dataset chen2005random; chen2011geometric; liu2006random. As a result of this property, matrix multiplicative approaches provide high utility for data classification and clustering, which are based on distance calculations. However, due to the inefficient approaches utilized for optimal perturbation parameter generation, rotation perturbation, geometric perturbation, and projection perturbation fail to provide enough scalability. PABIDOT is a recently developed perturbation approach for the privacy preservation of big data classification using the properties of geometric data perturbation. PABIDOT provides high efficiency for big data while maintaining high privacy and classification accuracy. However, for distributed healthcare scenarios such as the one shown in Figure 1, PABIDOT cannot be applied as it is not a distributed algorithm. As shown in Figure 1, a distributed system can be composed of branches that are geographically dispersed, and the data coming from each branch should be perturbed before they leave the local network. A distributed data perturbation algorithm that can provide high utility while maintaining high privacy for distributed healthcare is therefore essential.

3 Fundamentals

The proposed method uses multidimensional isometric transformations and randomized expansion chamikara2018efficient, which improve the randomness of the data. DISTPAB considers the input dataset as a data matrix in which each tuple is regarded as a column vector for applying the transformations. This section explains how the data are transformed for perturbation and how the randomization of the final output is improved using randomized expansion.

3.1 Data matrix (D)

We consider the input dataset as a data matrix. The row vectors represent the data records of the input dataset. Each row vector will be subjected to multidimensional transformations in an n-dimensional Cartesian coordinate system (n = number of attributes).

3.2 Multidimensional isometric transformations

A multidimensional transformation $T$ is isometric if it preserves distances, that is, if it holds Equation 1 for all points $x$ and $y$. Reflection, translation, and rotation are examples of multidimensional isometric transformations maruskin2012essential.

$d(T(x), T(y)) = d(x, y)$    (1)

3.3 Homogeneous coordinate form and composite operations

In n-dimensional space, we can write a homogeneous coordinate point as an $(n+1)$-dimensional position vector $[x_1, x_2, \dots, x_n, h]^T$, where $h$ is an additional term. This coordinate form allows representing all transformations as matrix multiplications of size $(n+1) \times (n+1)$ jones2012computer.

A transformation operation is called a composite transformation when it involves operations between several matrices. Equation 2 shows the composite transformation of the sequential application of $T_1, T_2, \dots, T_k$ to a homogeneous matrix $D$.

$D' = T_k \times \dots \times T_2 \times T_1 \times D$    (2)

We can use a new column of ones, where $h = 1$, to convert a data matrix into its homogeneous matrix form chamikara2018efficient.
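For illustration, a minimal NumPy sketch (variable names are ours) of the homogeneous coordinate form and the composite transformation of Equation 2: each record becomes a column vector with an extra term $h = 1$, so that reflection, translation, and rotation can all be applied through plain matrix multiplications.

import numpy as np

def to_homogeneous(D):
    # D: (N, n) data matrix -> (n+1, N) homogeneous column-vector form
    return np.vstack([D.T, np.ones((1, D.shape[0]))])

def composite_transform(D_h, transforms):
    # Apply T1, T2, ..., Tk in sequence: D' = Tk ... T2 T1 D (Equation 2)
    out = D_h
    for T in transforms:
        out = T @ out
    return out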

3.4 z-score normalization of the input dataset

The attributes of an input dataset can have different scales, which require different levels of perturbation for each attribute. Consequently, applying the same level of perturbation would not provide equal protection to all the attributes, and the final version of the perturbed dataset would have attributes with insufficient levels of perturbation. We apply z-score normalization (also referred to as standardization) to impose equal weights on all the attributes during the transformations. The z-score normalization scales the input dataset to a standard deviation equal to 1 and a mean equal to 0 kabir2015novel.
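A minimal sketch of this step and of its inverse (the population standard deviation, ddof=0, is our assumption; the paper does not specify it):

import numpy as np

def zscore(D):
    # Scale every attribute to mean 0 and standard deviation 1
    mu, sigma = D.mean(axis=0), D.std(axis=0)
    return (D - mu) / sigma, mu, sigma

def reverse_zscore(D_z, mu, sigma):
    # Used at the end of the perturbation to restore the original attribute ranges
    return D_z * sigma + mu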

3.5 Generating transformation matrices for perturbation

For the perturbation of the input data matrix, we need to generate the n-dimensional homogeneous translation, reflection, and rotation matrices. We can generate the translation matrix according to jones2012computer (refer to Appendix .1). Since the input data matrix is z-score normalized, the translational coefficients of the translation matrix are sampled from a uniform random distribution that lies in the interval $[-1, 1]$ chamikara2018efficient. We can generate the $(n+1)$-axis reflection matrix utilizing the n-dimensional reflection matrix (refer to Appendix .2) chamikara2018efficient. The rotation matrix can be generated using the concept of concatenated subplane rotation (refer to Appendix .3) chamikara2018efficient.
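The following sketch illustrates how the random translation and $(n+1)$-axis reflection matrices can be generated under our reading of Appendices .1 and .2 (the $[-1, 1]$ sampling interval follows the z-score normalization argument; the exact reflection convention is an assumption):

import numpy as np

def translation_matrix(n, rng=np.random.default_rng()):
    T = np.eye(n + 1)
    T[:n, n] = rng.uniform(-1.0, 1.0, size=n)  # random translational coefficients
    return T

def reflection_matrix(n):
    F = -np.eye(n + 1)  # negate all n data coordinates ((n+1)-axis reflection)
    F[n, n] = 1.0       # keep the homogeneous term h intact
    return F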

3.6 Randomized expansion

Randomized expansion is a noise addition approach that improves the randomness of the perturbed data chamikara2018efficient. Equation 3 shows the randomized expansion noise generation, where $\Delta$ is the sign matrix generated based on the values of the intermediate perturbed data matrix $D'$ and $\sigma$ is the input noise standard deviation. As the equation shows, the positiveness or negativeness of the data is amplified by randomized expansion to provide high utility with improved randomization chamikara2018efficient.

$noise = \Delta \circ |\mathcal{N}(0, \sigma^2)|$    (3)
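A minimal sketch of randomized expansion under our reading of Equation 3 (the treatment of exact zeros is an assumption):

import numpy as np

def randomized_expansion_noise(Y, sigma=0.3, rng=np.random.default_rng()):
    # Sign-aligned absolute normal noise: positive values are pushed up and
    # negative values pushed down, amplifying the sign while adding randomness.
    delta = np.sign(Y)      # sign matrix from the intermediate perturbed data
    delta[delta == 0] = 1   # arbitrary sign for exact zeros (assumption)
    return delta * np.abs(rng.normal(0.0, sigma, Y.shape))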

3.7 Quantification of privacy

$Var(X - Y)$ is one approach to measure the level of privacy of the perturbed data, where $X$ is the original data and $Y$ is the perturbed data muralidhar1999general. The higher the $Var(X - Y)$, the higher the difficulty of estimating the original data, and hence the higher the privacy. Take $Y_j$ to be a perturbed data series of attribute $X_j$. Now, $Var(X_j - Y_j)$ can be written as in Equation 4, where $Z_j = X_j - Y_j$ and $\bar{Z}_j$ is its mean.

$Var(X_j - Y_j) = \frac{1}{N-1} \sum_{i=1}^{N} (z_{ij} - \bar{Z}_j)^2$    (4)
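A sketch of this estimate, computed per attribute (the sample variance with ddof=1 is our assumption):

import numpy as np

def var_xy(X, Y):
    # X, Y: (N, n) original and perturbed matrices -> length-n vector of Var(X_j - Y_j)
    return np.var(X - Y, axis=0, ddof=1)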

3.8 Φ-separation

Φ-separation is a privacy model that allows the selection of optimal perturbation parameters for a perturbation algorithm in a given instance of data. Φ-separation allows the determination of the best perturbation parameters while providing an optimal empirical privacy guarantee chamikara2018efficient.

Definition 1 (Φ-separation).

Apply a perturbation algorithm $P$ to the dataset $D$ to generate the perturbed instances of $D$ as $D_i^p$ for $i = 1, \dots, m$, where $m$ represents the number of all feasible ways to apply $P$ to $D$ to generate $D_i^p$. Then,

$\Phi_i = \min_j(Var(X_j - Y_j^i))$    (5)

where $\Phi_i$ is the minimum privacy guarantee.

We can maximize all the possible minimum privacy guarantees ($\Phi_i$) to obtain the optimal privacy guarantee as

$\Phi_{optimal} = \max_{i=1,\dots,m}(\Phi_i)$    (6)

A perturbed dataset provides Φ-separation if it satisfies Equation 6.
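The max-min selection behind Φ-separation can be sketched as follows, where perturb is a stand-in for the actual transformation pipeline and candidates enumerates the m feasible parameter choices:

import numpy as np

def phi_optimal(D, candidates, perturb):
    best_phi, best_params = -np.inf, None
    for params in candidates:
        Y = perturb(D, params)
        phi_i = np.var(D - Y, axis=0, ddof=1).min()  # minimum privacy guarantee (Equation 5)
        if phi_i > best_phi:                         # maximize over instances (Equation 6)
            best_phi, best_params = phi_i, params
    return best_phi, best_params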

3.9 Federated Learning

Federated learning provides the capability to train a machine learning model using distributed data without sharing the original data between the participating entities. Assume that $D_1, \dots, D_K$ are datasets distributed among data owners $O_1, \dots, O_K$. In a federated learning setup, each of these datasets ($D_i$) never has to leave the corresponding owner ($O_i$), and each owner trains an ML model locally without exposing the corresponding data to any external party. The model parameters of the locally trained models are then collected by a server (a central entity/authority), which federates the parameters to generate a global representation of all the models, $M_{FED}$. The accuracy of $M_{FED}$ should be very close to the accuracy of the model $M_{SUM}$ trained centrally with all the data yang2019federated. This relationship can be represented using Equation 7, where $\delta$ is a non-negative real number.

$|accuracy(M_{FED}) - accuracy(M_{SUM})| < \delta$    (7)
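As a sketch, the federation step can be realized as a weighted average of the locally trained parameters (FedAvg-style; weighting by partition size is our assumption, as the text does not commit to a specific aggregation rule):

import numpy as np

def federate(param_list, sizes):
    # param_list: one list of layer arrays per client; sizes: tuple counts per client
    w = np.asarray(sizes, dtype=float) / np.sum(sizes)
    return [sum(wi * layer for wi, layer in zip(w, layers))
            for layers in zip(*param_list)]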

3.9.1 Horizontal and vertical federated learning

In horizontal federated learning, all the clients share the same feature space but differ in the sample space. Consider a feature space $X$, a sample ID space $I$, and a label space $Y$, which together constitute the complete training dataset $(I, X, Y)$. Then a particular distributed client $i$ holds the dataset $(I_i, X, Y)$, where $I_i \neq I_j$ for any two clients $i$ and $j$. Several regional banks with different user groups but similar business characteristics (and hence the same feature spaces) are an example of a horizontally partitioned scenario. In vertical federated learning, all the clients share the same sample ID space ($I$) but have different feature spaces ($X_i$). Then a particular distributed client $i$ holds the dataset $(I, X_i, Y_i)$, where $X_i \neq X_j$. Different companies with different businesses (and hence different feature spaces) serving the same user group are an example of a vertically partitioned scenario. In this work, we consider only the horizontal federated learning scenario.
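A toy illustration of horizontal partitioning (names are hypothetical): two clients share the same two-attribute feature space but hold disjoint sample IDs.

import numpy as np

X = np.arange(12).reshape(6, 2)   # 6 samples in a shared 2-feature space
ids = np.arange(6)
ids_a, ids_b = ids[:3], ids[3:]   # disjoint sample ID spaces
D_a, D_b = X[ids_a], X[ids_b]     # same features, different samples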

In the next section, we provide a detailed description of how these fundamentals are used in developing the proposed approach (DISTPAB) for distributed machine learning.

4 Our Approach: DISTPAB

This section explains the steps of developing a distributed data perturbation algorithm (named DISTPAB) for distributed data and machine learning. Figure 2 shows the application of privacy to the distributed healthcare ecosystem shown in Figure 1. As shown in the figure, the main goal of DISTPAB is to shift the perturbation to the distributed branches before the data leave the local edge and fog layer. However, in doing so, the algorithm should not lose global utility. To achieve that, the main branch (the research centre) and the distributed branches conduct a perturbation parameter exchange. As the perturbation is done using the globally optimal parameters, DISTPAB can maintain a proper balance between utility and privacy. In our explanations, we use the words node and entity interchangeably, representing the same concept; more specifically, we consider an entity to be a fully populated institution, whereas a node is one or more computing/processing devices within an entity.

Figure 2: An example scenario where DISTPAB is integrated into a distributed healthcare system to preserve the privacy of data. The lowest layer represents the distributed set of entities (hospitals) and their interaction with DISTPAB within the fog and edge bounds. As the figure shows, the distributed components of DISTPAB are integrated into each of the distributed hospitals for performing the data perturbations. The central controlling entity (research centre) communicates with the cloud layer, which supports third-party analytical services. In this setup, we assume that no party saves the original IoT data; only the perturbed data are stored.

The proposed algorithm delegates data maintenance and perturbation to the distributed entities, leaving only the global perturbation parameter generation to the central entity. Figure 3 shows the distributed architecture of DISTPAB, which can be used for the perturbation of healthcare data. In this way, the distributed branches do not have to share the original data with any untrusted third party. Moreover, due to the coordination of the proposed algorithm in generating the global perturbation parameters, the perturbed data can still provide high utility.

4.1 Centralized perturbation paradigm

DISTPAB first applies geometric transformations in the order of (1) reflection, (2) translation, and (3) rotation to an input dataset. Next, DISTPAB performs randomized expansion and random tuple shuffling to enhance the randomness of the perturbed data chamikara2018efficient. The perturbation is repeated until the perturbed dataset satisfies Φ-separation. In order to devise the distributed paradigm in DISTPAB, we first discuss the centralized approach, which is given in Algorithm 1. Algorithm 1 takes only two inputs: (1) the original dataset and (2) the standard deviation of the normal random noise generated for randomized expansion.

Before applying the composite geometric transformations, the algorithm derives the best perturbation parameters using the z-score normalized input data matrix. However, obtaining the best perturbation parameters for a big dataset can be extremely inefficient, as it involves evaluating $Var(X - Y)$ for multiple perturbation instances of the original dataset (under the nested loops of Algorithm 1). We can prove that this evaluation can be carried out on the corresponding covariance matrix instead chamikara2018efficient, which is much simpler in time complexity as it avoids browsing through the whole dataset in each iteration.

The algorithm maximizes the $\Phi$ value in each iteration until it obtains $\Phi_{optimal}$, as given in Equation 10. The number of perturbed instances of $D$ can be given as $m = n \times |\Theta|$, where $n$ represents the number of attributes (consequently, the axis of reflection varies from 1 to $n$) and $\Theta$ is the set of candidate rotation angles $\theta$. This will result in an $n \times m$ matrix of $\Phi$ values (local minimum privacy guarantees), as represented in Equation 8.

$\Phi_{n \times m} = [\phi_{ji}], \quad j = 1, \dots, n, \quad i = 1, \dots, m$    (8)

As given in Equation 9, we obtain the minimum value from each column of Equation 8 to identify the best $\Phi$ for each instance of $D$.

$\Phi_i = \min_{j}(\phi_{ji})$    (9)

Now we choose $\Phi_{optimal}$ as the largest of the global minimum privacy guarantees ($\Phi_i$), as represented in Equation 10. This can also be explained as finding the highest privacy guarantee that can be rendered by the most vulnerable attribute in the database.

$\Phi_{optimal} = \max_{i}(\Phi_i)$    (10)

After determining the best perturbation parameters, the geometric transformations are conducted on the normalized input dataset following the order of reflection, noise translation, and rotation. We follow this order of transformations to avoid the points (in the n-dimensional space) closer to the origin receiving lower levels of perturbation: rotation has a larger effect on the points that are far away from the origin than on those close to it, so the points close to the origin could otherwise be easily attacked. Equation 11 shows the order of the application of the three transformations on the normalized input data matrix $D_z$, where rotation is conducted as the last transformation for the highest perturbation possible. As the next step, we add noise using randomized expansion. Reverse z-score normalization on the perturbed data is used to generate an output with attribute ranges similar to the original dataset. Next, the tuples are randomly swapped in order to limit vulnerability to data linkage attacks.

$D' = R_{\theta} \times T \times F \times D_z$    (11)
Input: original dataset $D$; input noise standard deviation $\sigma$ (default value = 0.3)
Output: perturbed dataset $D^p$
1   $n$ = number of attributes of $D$;
2   $N$ = number of tuples of $D$;
3   $D_z$ = z-score normalized $D$ in homogeneous form;
4   generate the covariance matrix of $D_z$;
5   generate the translation matrix $T$;
6   generate the $(n+1)$-axis reflection matrix $F$;
7   for each reflection axis $a$ in $1, \dots, n$ do
8       generate the reflection matrix $F_a$;
9       for each rotation angle $\theta \in \Theta$ do
10          generate $R_\theta$ using Algorithm 4;
11          compute the local minimum privacy guarantees $\phi$ (Equation 8);
12      end
13  end
14  for each perturbation instance $i$ do $\Phi_i = \min_j(\phi_{ji})$, where $j = 1, \dots, n$; end
15  $\Phi_{optimal} = \max_i(\Phi_i)$, where $i = 1, \dots, m$;
16  $\theta_{best}$ = $\theta$ at $\Phi_{optimal}$;
17  $a_{best}$ = reflection axis at $\Phi_{optimal}$;
18  generate $F_{a_{best}}$;
19  generate $R_{\theta_{best}}$;
20  $D' = R_{\theta_{best}} \times T \times F_{a_{best}} \times D_z$ (Equation 11);
21  $noise = \Delta \circ |\mathcal{N}(0, \sigma^2)|$ (randomized expansion, Equation 3);
22  $D^p$ = reverse z-score normalization of $(D' + noise)$;
23  randomly swap the tuples of $D^p$;
Algorithm 1: Centralized perturbation algorithm

4.2 Distributing the perturbation

In order to distribute the perturbation among distributed entities, we break the steps of Algorithm 1 into three main steps, as shown in Figure 3: (1) generate the global perturbation parameters using the local properties forwarded to the central entity (research centre); (2) conduct the perturbation of data at the distributed entities using the global parameters; and (3) conduct machine learning on the perturbed data. In order to achieve these three goals, first, the variance-covariance matrix of each partition of the dataset is passed to the central node, which runs Algorithm 3. Assume that there are $k$ distributed branches which hold the data partitions $D_1, D_2, \dots, D_k$. Take $D$ to be the dataset created by merging all the partitions, as given in Equation 12.

$D = D_1 \cup D_2 \cup \dots \cup D_k$    (12)
Figure 3: Distributed architecture of the perturbation algorithm. The figure shows the abstract view of the distributed perturbation scenario over distributed machine learning. There can be any number of distributed branches communicating with the central entity (coordinating server). There are two phases of parameter communication: Phase 1 involves the distributed branches sending local parameters to the central entity, whereas Phase 2 involves the central node sending the optimal perturbation parameters (calculated based on the local parameters sent by the distributed branches) to the distributed branches for perturbation.

To merge the covariance matrices, the pairwise covariance update formula introduced in bennett2009numerically is adapted. The pairwise covariance update formula for two merged two-column ($j$ and $k$) data partitions, $D_1$ and $D_2$, can be written as shown in Equation 13, where the merged dataset is denoted as $D_{1,2}$ and $N_1$ and $N_2$ are the numbers of tuples of $D_1$ and $D_2$.

$cov(D_{1,2}) = \dfrac{C_1 + C_2 + (\bar{j}_1 - \bar{j}_2)(\bar{k}_1 - \bar{k}_2)\frac{N_1 N_2}{N_1 + N_2}}{N_1 + N_2 - 1}$    (13)

where $\bar{j}_1, \bar{k}_1$ and $\bar{j}_2, \bar{k}_2$ are the means of $j$ and $k$ of the two data partitions $D_1$ and $D_2$ respectively. $C_1$ and $C_2$ are the co-moments of the two data partitions $D_1$ and $D_2$, where the co-moment of a two-column ($j$ and $k$) dataset of $N$ tuples is represented as

$C = \sum_{i=1}^{N} (j_i - \bar{j})(k_i - \bar{k})$    (14)

Therefore, the variance-covariance matrix update formula of the two data partitions $D_1$ and $D_2$ can be written as shown in Equation 15,

$\Sigma_{1,2} = \dfrac{(N_1 - 1)\Sigma_1 + (N_2 - 1)\Sigma_2 + (\mu_1 - \mu_2)(\mu_1 - \mu_2)^T \frac{N_1 N_2}{N}}{N - 1}$    (15)

where $\Sigma_{1,2}$, $\Sigma_1$, and $\Sigma_2$ are the covariance matrices of the merged dataset and the data partitions $D_1$ and $D_2$ respectively, $\mu_1$ and $\mu_2$ are the mean vectors of $D_1$ and $D_2$ respectively, and

$N = N_1 + N_2$    (16)
$\mu_{1,2} = \dfrac{N_1 \mu_1 + N_2 \mu_2}{N}$    (17)

After generating the global covariance matrix ($\Sigma^g$) by iteratively applying Equation 15 over all the partitions (refer to Equation 18), we can produce the vector of standard deviations ($\sigma^g$) from the diagonal of $\Sigma^g$. We can generate the vector of means ($\mu^g$) using Equation 19, where $N_i = |D_i|$.

$\Sigma^g = \Sigma_1 \oplus \Sigma_2 \oplus \dots \oplus \Sigma_k$, where $\oplus$ denotes the pairwise merge of Equation 15    (18)
$\mu^g = \dfrac{\sum_{i=1}^{k} N_i \mu_i}{\sum_{i=1}^{k} N_i}$    (19)
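A sketch of the statistics merge (Equations 15-19), folding the pairwise update over all partitions:

import numpy as np

def merge_two(S1, m1, n1, S2, m2, n2):
    # Pairwise covariance/mean merge (Equations 15-17)
    n = n1 + n2
    d = (m1 - m2).reshape(-1, 1)
    S = ((n1 - 1) * S1 + (n2 - 1) * S2 + (n1 * n2 / n) * (d @ d.T)) / (n - 1)
    m = (n1 * m1 + n2 * m2) / n
    return S, m, n

def merge_all(stats):
    # stats: list of (covariance matrix, mean vector, tuple count) per partition
    S, m, n = stats[0]
    for S2, m2, n2 in stats[1:]:
        S, m, n = merge_two(S, m, n, S2, m2, n2)
    sigma_g = np.sqrt(np.diag(S))  # global standard deviations (Equation 18)
    return S, m, sigma_g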

Each distributed branch needs to generate the covariance matrix ($\Sigma_i$) and the mean vector ($\mu_i$) of the corresponding data partition and pass them to the central entity for the execution of Algorithm 3. After completing the execution of Algorithm 3, the central node passes the required parameters back to the distributed entities according to Algorithm 2, as shown in Figure 3. Table 1 shows the parameter exchange between the central entity and the distributed entities during the perturbation process.

Phase 1: from distributed nodes to central node (local parameters related to each data partition)
1. covariance matrix
2. vector of means
3. number of attributes
4. standard deviation for randomized expansion

Phase 2: from central node to distributed nodes (global parameters related to all the partitions)
1. random reflection matrix
2. random translation matrix
3. random rotation matrix

Table 1: Data (parameter values) passed between the central entity and the distributed entities in the two phases of communication. As shown in Figure 3, in Phase 1, the distributed entities send the local perturbation parameters to the central entity. In Phase 2, the central entity sends the global parameters (which are calculated using the local parameters) to the distributed entities.

4.3 Workload of a distributed entity

Algorithm 2 shows the workload of a distributed node (e.g. a hospital), where the actual data perturbation is conducted. As the algorithm shows, the data are not transmitted to the central entity as part of the perturbation process. First, the local parameters determined from the local dataset are sent to the central node. Next, the global parameters calculated and returned by the central node are used for the perturbation of the local dataset. However, the randomized expansion step of Algorithm 2 does not involve any interaction with the central node, as randomized expansion needs to evaluate the noise based on each individual instance. Consequently, in the distributed setting, randomized expansion increases the randomization of the perturbed data compared to the centralized approach.

Input: data partition $D_i$; input noise standard deviation $\sigma$
Output: perturbed data partition $D_i^p$
1   generate the covariance matrix $\Sigma_i$ of $D_i$;
2   generate the mean vector $\mu_i$ of $D_i$;
3   generate $n$ = number of attributes of $D_i$;
4   send Phase 1 data to the central entity;
5   receive Phase 2 data from the central entity;
6   $D_z$ = z-score normalized $D_i$ in homogeneous form;
7   $D' = R_{\theta_{best}} \times T \times F_{a_{best}} \times D_z$;
8   $D_i^p$ = reverse z-score normalization of $D'$ + randomized expansion noise;
9   randomly swap the tuples of $D_i^p$;
Algorithm 2: Task of a distributed entity
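An end-to-end sketch of a distributed entity's workload, reusing the ideas above; F, T, R are the Phase 2 matrices received from the central entity, and the use of the global mean/std vectors mu_g and sigma_g for normalization is our assumption:

import numpy as np

def perturb_partition(D, F, T, R, mu_g, sigma_g, sigma_noise=0.3,
                      rng=np.random.default_rng()):
    Dz = (D - mu_g) / sigma_g                      # z-score with global parameters
    Dh = np.vstack([Dz.T, np.ones((1, len(Dz)))])  # homogeneous form
    Dp = (R @ T @ F @ Dh)[:-1].T                   # Equation 11; drop the h row
    delta = np.where(Dp >= 0, 1.0, -1.0)           # randomized expansion noise
    Dp = Dp + delta * np.abs(rng.normal(0, sigma_noise, Dp.shape))
    Dp = Dp * sigma_g + mu_g                       # reverse z-score normalization
    return rng.permutation(Dp)                     # random tuple swap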

4.4 Workload of the central node

Algorithm 3 shows the workload of the central entity. As shown in the algorithm, the central entity calculates the global perturbation parameters that need to be transmitted to the distributed entities. We assume that the distributed nodes are selected through a prior handshake mechanism that confirms their validity and reliability. However, since the proposed algorithm does not share the original data with any party before perturbation, there will not be any privacy leak during the communication of the parameters.

Input: covariance matrices $\Sigma_1, \dots, \Sigma_k$ and mean vectors $\mu_1, \dots, \mu_k$ of the data partitions
Output: $F_{a_{best}}$, $T$, $R_{\theta_{best}}$, $\mu^g$, $\sigma^g$
1   receive Phase 1 data from the distributed entities;
2   $n$ = number of attributes;
3   initialize $\Sigma^g$ and $\mu^g$ with the parameters of the first partition;
4   for each remaining covariance matrix $\Sigma_i$ do
5       merge $\Sigma_i$ into $\Sigma^g$ according to Equation 15;
6       update $\mu^g$ according to Equation 17;
7   end
8   $\sigma^g$ = square roots of the diagonal of $\Sigma^g$;
9   generate the translation matrix $T$;
10  for each reflection axis $a$ in $1, \dots, n$ do
11      generate the reflection matrix $F_a$;
12      for each rotation angle $\theta \in \Theta$ do
13          generate $R_\theta$ using Algorithm 4;
14          compute the local minimum privacy guarantees $\phi$ using $\Sigma^g$;
15      end
16  end
17  for each perturbation instance $i$ do $\Phi_i = \min_j(\phi_{ji})$, where $j = 1, \dots, n$; end
18  $\Phi_{optimal} = \max_i(\Phi_i)$;
19  $\theta_{best}$ = $\theta$ at $\Phi_{optimal}$;
20  $a_{best}$ = reflection axis at $\Phi_{optimal}$;
21  generate $F_{a_{best}}$;
22  generate $R_{\theta_{best}}$;
23  send Phase 2 data to the distributed entities;
Algorithm 3: Task of the central entity

4.5 Federated learning over perturbed data

As shown in Figure 3

, the federated module comes into action after the perturbation is completed (in the first two phases). Each of the distributed entities will use the local data (perturbed) to train a local ML model (e.g. deep neural network) for a certain number of local epochs. After finishing the local epochs, the distributed entities will send the model parameters to the central repository to generate a global representation of the model by aggregating the model parameters (refer to section

3.9). The server (central authority) then passes the aggregated parameters back to the distributed entities to update the local models to generalize the models. This is called one round of federation. The federated learning setup will conduct a sufficient number of local epochs and federation rounds to train the ML models based on the requirements of the organization (e.g. production of data streams or big data), ML model architecture, and the properties of the input dataset.
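A sketch of this loop with hypothetical client/server interfaces; server.aggregate could be the weighted average sketched in Section 3.9:

def run_federated_training(clients, server, rounds=20, local_epochs=3):
    global_params = server.initial_params()
    for _ in range(rounds):
        local = []
        for c in clients:
            c.set_params(global_params)          # start from the federated model
            c.train(epochs=local_epochs)         # local training on perturbed data
            local.append(c.get_params())
        global_params = server.aggregate(local)  # one round of federation
    return global_params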

5 Results

This section provides the experimental results on the performance of DISTPAB. We tested DISTPAB on six datasets to compare and evaluate its performance against three algorithms: RP (rotation perturbation), GP (geometric perturbation), and PABIDOT. We considered 0.3 as the default value of $\sigma$ for the experiments unless specified otherwise. Next, we measured the performance of DISTPAB on FedML to examine the utility loss due to the perturbation. More details on the FedML setup are provided in Section 5.4.1. For the experiments, we used a Windows 7 (Enterprise 64-bit, Build 7601) computer with an Intel(R) i7-4790 (4th generation) CPU (8 cores, 3.60 GHz) and 8GB RAM. We declared a virtual distributed environment (refer to Section 5.1) to run DISTPAB under the computer settings mentioned above. Table 2 gives a summary of the datasets used for the experiments; we selected the datasets by considering a diverse range of domains. We first perturbed the data using RP, GP, PABIDOT, and DISTPAB. Next, we used the perturbed data to evaluate and compare the attack resistance and the classification accuracy of RP, GP, PABIDOT, and DISTPAB. For classification accuracy analysis, we used Weka 3.6 witten2016data, which is a data mining tool that packages a collection of data mining algorithms. We used the following classification algorithms: Naive Bayes, k-nearest neighbor (kNN), Sequential Minimal Optimization (SMO), Multilayer Perceptron (MLP), and J48 witten2016data to investigate the utility of the perturbed data.

Dataset Abbreviation Number of Records Number of Attributes Number of Classes
Wholesale customers (https://archive.ics.uci.edu/ml/datasets/Wholesale+customers) WCDS 440 8 2
Wine Quality (https://archive.ics.uci.edu/ml/datasets/Wine+Quality) WQDS 4898 12 7
Page Blocks Classification (https://archive.ics.uci.edu/ml/datasets/Page+Blocks+Classification) PBDS 5473 11 5
Letter Recognition (https://archive.ics.uci.edu/ml/datasets/Letter+Recognition) LRDS 20000 17 26
Statlog (Shuttle) (https://archive.ics.uci.edu/ml/datasets/Statlog+%28Shuttle%29) SSDS 58000 9 7
HEPMASS (https://archive.ics.uci.edu/ml/datasets/HEPMASS#) HPDS 3310816 28 2
Table 2: Information about the generic datasets used for the experiments. We selected datasets with dimensions varying from small to large in order to test the behavior of DISTPAB on different dynamics of the input data dimensions.

5.1 Distributed computing setup

To test DISTPAB in a distributed setting, we used the "parpool" function in MATLAB leon2016controlling. "parpool(N)" creates N parallel processes (named parallel workers) which can perform distributed computing. For performance testing and comparison, we distributed an input dataset among four workers by dividing the input dataset equally.

5.2 Time Complexity

We used both theoretical and runtime analyses to evaluate the computational complexity of DISTPAB. First, we conducted theoretical evaluations of the computational complexity of DISTPAB. Next, we conducted runtime analyses to provide empirical evidence for the estimated time complexities. The perturbation parameter search of Algorithm 1 contains two nested loops that have a significant effect on the computational complexity: one is constrained by the number of attributes ($n$), whereas the other is controlled by the range of the rotation angle, which is constant. Because the search is conducted on the covariance matrix, the complexity of this search grows with $n$ but is independent of the number of tuples ($N$). The data transformation steps of Algorithm 1 contribute a complexity of $O(Nn)$. In the distributed version, the parameter search steps of Algorithm 1 are moved to Algorithm 3, which runs on the central node, and the transformation steps of Algorithm 1 are moved to Algorithm 2, which runs on the distributed nodes. Consequently, the central node has a computational complexity that depends only on $n$, while the distributed nodes have a computational complexity of $O(Nn)$. However, the communication delay can increase the data perturbation time in a distributed setting.

Figures 6 and 9 show the empirical trends of time consumption by the central node and the distributed nodes in perturbing the LRDS dataset. During the time calculations, we considered only the time consumed for the perturbations by the corresponding processing unit (central or distributed) to get an absolute understanding of time consumption; we ignored all other factors, such as communication delays and I/O operation delays. For this experiment, we considered four distributed nodes communicating with the central node (refer to Section 5.1). During the study of how the time consumption is affected by the number of attributes, we distributed the LRDS dataset equally among the distributed nodes (each processing 5000 tuples). Next, we evaluated the effect of the number of distributed nodes on the classification accuracy and attack resistance by gradually increasing the number of distributed nodes from 1 to 20. For this experiment too, we used the LRDS dataset, and in each case the dataset was distributed equally among the distributed nodes.

(a) The time consumption by the central entity for perturbation against the number of attributes. The plot confirms the theoretical evaluation of the central entity's time complexity, which depends only on the number of attributes ($n$).
(b) The time consumption by the central entity for perturbation against the number of tuples. The plot confirms the theoretical evaluation of the central entity's time complexity, which is constant in the number of tuples: the central node consumes a constant amount of time when $n$ (the number of attributes) is constant.
Figure 6: Time consumption by the central entity. The plots do not include communication delays, and they show the time consumption for the perturbation of the LRDS dataset.
(a) The average time consumption of the distributed nodes for the perturbation against the number of attributes. According to the time complexity of $O(Nn)$, as $N$ remains constant, the plot should show a time complexity of $O(n)$, where $N$ and $n$ represent the number of tuples and the number of attributes respectively.
(b) The average runtime consumption of the distributed nodes for the perturbation against the number of instances. The plot confirms the theoretical evaluation of the time complexity, which is $O(N)$, where $N$ is the number of tuples.
Figure 9: Average time consumption by the distributed nodes (for 4 nodes; refer to Section 5.1 for the specifications of the distributed setup). The plots do not include communication delays, and they show the time consumption for the perturbation of the LRDS dataset.

5.3 Impact of communication delay on distributed data perturbation by DISTPAB

In this section, we investigate the impact of communication delay on data perturbation. For this analysis, we use a large enough portion (24055028) of the HPDS dataset to obtain a significant trend while avoiding excessive time consumption. Figure 12(a) shows the time consumption of DISTPAB and the centralized version for an increasing number of attributes (the number of instances is kept constant). Figure 12(b) shows the time consumption of DISTPAB for an increasing number of instances (while the number of attributes is kept constant). In both cases, DISTPAB consumes more time than the centralized version due to communication delays.

(a) Time consumption of the centralized version and DISTPAB against the number of attributes. Both algorithms show similar (exponential) trends for the number of attributes.
(b) Time consumption of the centralized algorithm and DISTPAB against the number of tuples. The plots confirm the linear complexity of both versions for the number of tuples.
Figure 12: Impact of communication delay on distributed perturbation in DISTPAB. In a real-world scenario, network communication can be influenced by many factors, such as network latency and network congestion, so it is complex to obtain an exact estimate of the communication delays. As the figures show, DISTPAB follows the expected time complexity (as shown in the plot for the centralized algorithm) while adding the communication delays to its patterns.

5.4 Classification Accuracy

Although the perturbation is distributed, DISTPAB generates optimal global perturbation parameters in the same way as the perturbation parameter generation of the centralized algorithm (Algorithm 1). However, the randomized expansion step is carried out by each distributed entity separately to improve the randomness of the data. Consequently, the distributed version produces data with slightly higher randomization than its centralized counterpart, which also slightly reduces the utility of the data produced by the distributed version. Figure 13 shows the box plots of the average accuracies produced by each of the algorithms against the five classification algorithms. The classification accuracy was generated using 10-fold cross-validation. As shown in the figure, DISTPAB and PABIDOT produce similar classification accuracies, and both outperform RP and GP.

Figure 13: Classification Accuracy. The figure shows the variation of the classification accuracy produced by the original dataset and the datasets perturbed by the four perturbation approaches.

5.4.1 Federated learning setup

We simulated the federated learning setup using the Python socket programming interface and the _thread interface. In the default configuration, we considered 4 distributed clients. Each client trained a fully connected neural network with activation='relu', batch size=64, hidden layer sizes=(10, 200, 200), a final layer of size equal to the number of classes with 'softmax' activation, shuffle=True, optimizer='SGD', a constant learning rate initialized to 0.0001, momentum=0.5, and verbose=True. The models were trained for 20 federation rounds, while each model was locally trained for 3 epochs. For this experiment, we used the SSDS dataset, which has a sufficient number of tuples to show a noticeable impact on performance while placing a low burden on the computer resources (mentioned at the beginning of Section 5). We used a train/test split of 75%/25% during the experiments.
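The stated hyper-parameters map naturally onto scikit-learn's MLPClassifier; the paper does not name the library, so the following local-model sketch is our assumption:

from sklearn.neural_network import MLPClassifier

clf = MLPClassifier(hidden_layer_sizes=(10, 200, 200), activation='relu',
                    solver='sgd', batch_size=64, learning_rate='constant',
                    learning_rate_init=0.0001, momentum=0.5,
                    shuffle=True, verbose=True)
# clf.fit(X_train_perturbed, y_train); MLPClassifier adds the softmax output
# layer (of size equal to the number of classes) automatically.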

Figure 14: Classification accuracy during federated learning. The plots show the performance of federated learning on DISTPAB perturbed data under different numbers of distributed clients. The higher the number of clients, the higher the amount of time taken for ML model convergence.

Figure 14 shows the performance of the federated model using data perturbed by DISTPAB against different numbers of distributed branches (clients). During the experiment, the input dataset was divided equally between the distributed entities; consequently, the higher the number of clients, the lower the number of tuples in each client, and the model needs more federation rounds for convergence when the number of clients is higher. We can also notice a sudden drop in accuracy in the second round, after which the accuracy increases as the federation continues. This is due to the extensive parameter modification that takes place during the second round: the higher the number of clients, the more significant this effect, resulting in a more substantial drop in accuracy.

5.5 Attack Resistance

We investigated the attack resistance of DISTPAB against known I/O (IO) attacks, naive inference (NI), and Independent Component Analysis (ICA) okkalioglu2015survey, which are considered to be three potential attacks on matrix multiplicative approaches. We considered the default values of ten iterations and a sigma (noise factor) of 0.3 for the experiments on GP and RP. We obtained $std(D - D^p)$ (std represents the standard deviation, $D$ the input data matrix, and $D^p$ the perturbed data of $D$), which provides an estimate of the resistance against NI. Next, we applied ICA and IO on the perturbed data to generate reconstructed data. We assumed 10% of the original data to be the background knowledge of an adversary during the IO attack investigation. We obtained the minimum values (NImin, ICAmin, and IOmin) of the $std(D - D^p)$ values under each attack type and plotted them in Figure 15. As shown in the figure, both PABIDOT and DISTPAB provide better attack resistance than RP and GP. We can see a slight increment in the minimum attack resistance of DISTPAB compared to PABIDOT due to the enhanced randomization, as the randomized expansion step is carried out by each distributed entity separately.
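A sketch of the resistance estimate used here: the per-attribute standard deviation of the difference between the original matrix and the perturbed (or attack-reconstructed) matrix, minimized over attributes.

import numpy as np

def attack_resistance(D, D_rec):
    # D: original data matrix; D_rec: perturbed or reconstructed matrix
    return np.std(D - D_rec, axis=0, ddof=1).min()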

Figure 15: Attack Resistance. The figure shows the plots of the minimum values (NImin, ICAmin, and IOmin) of $std(D - D^p)$ under NI, ICA, and IO attacks. The higher the bar, the better the resistance to attacks.

6 Discussion

This paper proposed an efficient distributed privacy preservation mechanism (named DISTPAB) for distributed machine learning. DISTPAB applies randomized geometric transformations followed by randomized expansion, a noise application mechanism that amplifies the positiveness or negativeness of the input data to further improve randomization without harming utility chamikara2018efficient. DISTPAB uses Φ-separation chamikara2018efficient as the underlying privacy model to obtain the optimal perturbation parameters and generate optimal privacy for the input data. We tested and compared the performance of DISTPAB against PABIDOT, RP, and GP for classification accuracy, time complexity, and attack resistance. Additionally, we investigated the impact of communication delay on distributed perturbation in DISTPAB, and further analyzed the impact of the number of distributed nodes on classification during federated machine learning.

According to our time complexity analysis, DISTPAB introduces a time complexity to the central entity that depends only on the number of attributes ($n$). Since $n$ remains constant for a given setting, the central entity consumes a constant amount of time, no matter how many instances (tuples) are introduced. A distributed entity shows a time complexity of $O(Nn) = O(N)$, as $n$ (the number of attributes) is a constant for a given dataset ($N$ represents the number of tuples). In a fixed distributed setup, new parameters/sensors are rarely added at the speed at which the data grow; more probably, the number of attributes will remain constant. Consequently, for a given scenario, the amount of time consumed for perturbation will grow only linearly with the data, which is optimal for privacy-preserving distributed machine learning.

The empirical evidence on data classification shows that DISTPAB provides classification accuracy similar to the centralized algorithm and better performance than RP and GP. DISTPAB provides slightly lower classification accuracy than the centralized approach, as DISTPAB imposes an increased level of randomization (the randomized expansion step is carried out by each distributed entity separately). The empirical evidence (refer to Figure 14) shows that the number of distributed clients does not have a noticeable impact on classification accuracy during federated learning (distributed machine learning). However, the amount of time necessary for model convergence increases with the number of clients. This feature makes DISTPAB an excellent solution for privacy-preserving federated learning where a large number of distributed entities is involved.

DISTPAB uses Φ-separation chamikara2018efficient as its underlying privacy model, which allows optimal data perturbation for a given instance of data. Geometric data transformations and randomized expansion noise addition, followed by random shuffling, allow DISTPAB to impose high privacy by reducing the probability of data reconstruction attacks. Reverse z-score normalization of the final dataset takes the data back to their original attribute value ranges, making attackers unable to distinguish the original data from the perturbed data; this reduces the chance of success of an attack trying to reconstruct the original input from a perturbed dataset. DISTPAB provides better attack resistance than RP and GP. We can observe that DISTPAB provides slightly better attack resistance than PABIDOT because DISTPAB adds an increased level of randomization, as the randomized expansion step is carried out by each distributed entity separately. However, in a federated learning setup, we do not share the perturbed data among the distributed clients or the server. Consequently, there will not be any data reconstruction attacks on the perturbed data, and due to the strong notion of privacy of the perturbed data, the privacy of the trained models will also be high.

In essence, DISTPAB can be an optimal privacy preservation solution for data privacy and machine learning in systems that control extensive amounts of data manogaran2017big and are deployed in geographically distributed settings.

7 Conclusions

Many modern systems, such as healthcare and open banking, are often geographically distributed yet lack proper mechanisms for privacy-preserving data sharing for analytics. This paper proposed a distributed perturbation algorithm named DISTPAB that can enforce privacy for distributed machine learning. In the proposed setup of DISTPAB, a central/coordinating entity controls the global perturbation parameter generation, whereas the distributed entities conduct the local data perturbation. The computational complexity of the algorithm that runs in the central entity is constant in the number of instances (as the number of attributes is constant in a given setting). The computational complexity of the algorithm that runs on a distributed entity is $O(N)$ in the number of tuples, as the number of attributes often remains constant in a given setting ($N$ is the number of instances). Consequently, the operations on distributed entities have a low computational complexity, resulting in excellent efficiency. DISTPAB provides high classification accuracy, close to the accuracy of classification performed with the original data. DISTPAB provides high attack resistance, outperforming rotation perturbation and geometric perturbation. It was also shown that the data produced by DISTPAB are not subject to utility degradation as the number of distributed entities grows. DISTPAB can be an excellent privacy preservation algorithm for distributed machine learning.

As future work, we are interested in improving the efficiency of the proposed work with respect to the number of attributes. To achieve this, we will investigate the vertical federated learning scheme, where the distributed clients have different feature spaces. Vertical federated learning allows dividing a particular dataset into partitions by attributes, leaving a specific client to work only on a smaller number of attributes, which can improve efficiency.

.1 n-Dimensional translation matrix generation

Equation 20 represents the homogeneous translation matrix $T$ jones2012computer. The translational coefficients $t_1, \dots, t_n$ are drawn from a uniform random distribution bounded within $[-1, 1]$. The uniform random noise is restricted to this interval as the input dataset's attribute standard deviations and means become 1 and 0 after the z-score normalization.

$T = \begin{bmatrix} 1 & 0 & \cdots & 0 & t_1 \\ 0 & 1 & \cdots & 0 & t_2 \\ \vdots & & \ddots & & \vdots \\ 0 & 0 & \cdots & 1 & t_n \\ 0 & 0 & \cdots & 0 & 1 \end{bmatrix}$    (20)

.2 n-Dimensional reflection matrix

Equation 21 shows the homogeneous reflection matrix $F$; it represents the reflection across axis one jones2012computer. The $(n+1)$-axis reflection for the matrix given in Equation 21 can be written as shown in Equation 22, which provides a low level of bias in the perturbation.

$F_1 = diag(1, -1, \dots, -1, 1)$    (21)

Equation 22 represents the $(n+1)$-axis reflection matrix.

$F_{n+1} = diag(-1, -1, \dots, -1, 1)$    (22)

.3 Generating the rotational matrix

Algorithm 4 provides the steps for generating the rotational matrix using the concept of concatenated subplane rotation, which we use to represent the entire orientation. The block matrix given in Equation 23 shows the rotation on the subplane represented by a pair of coordinate axes $(x_i, x_j)$ paeth2014graphics. Thus, $n(n-1)/2$ distinct subplane rotations $G_{i,j}(\theta)$ should be concatenated in a particular order to generate the composite n-dimensional orthonormal matrix, as represented in Equation 24, parameterized by the angle $\theta$. We generate $R_\theta$, the n-dimensional concatenated subplane rotation matrix for an angle $\theta$, using Algorithm 4 chamikara2018efficient.

$G_{i,j}(\theta) = \begin{bmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{bmatrix}$    (23)

$R_\theta = \prod_{i < j} G_{i,j}(\theta)$    (24)
Input: $n$ = number of attributes of the input dataset; $\theta$ = angle of rotation
Output: multidimensional rotation matrix $R_\theta$ of size $(n+1) \times (n+1)$
1   $m = n + 1$;
2   $R_\theta$ = identity matrix of size $m \times m$;
3   for each pair of coordinate axes $(i, j)$, $1 \le i < j \le n$ do
4       $G$ = identity matrix of size $m \times m$;
5       $G_{i,i} = \cos\theta$ (set number 1 of the subplane rotation entries);
6       $G_{i,j} = -\sin\theta$ (set number 2);
7       $G_{j,i} = \sin\theta$ (set number 3);
8       $G_{j,j} = \cos\theta$ (set number 4);
9       $R_\theta = G \times R_\theta$;
10  end
11  return $R_\theta$;
Algorithm 4: Rotation matrix generation
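A sketch of the idea behind Algorithm 4: concatenating 2x2 subplane (Givens) rotations (Equation 23) over every pair of coordinate axes to build the composite orthonormal matrix of Equation 24. The ordering of the subplanes is our assumption.

import numpy as np
from itertools import combinations

def rotation_matrix(n, theta):
    m = n + 1                    # homogeneous dimension
    R = np.eye(m)
    c, s = np.cos(theta), np.sin(theta)
    for i, j in combinations(range(n), 2):  # each subplane (x_i, x_j)
        G = np.eye(m)
        G[i, i], G[i, j] = c, -s
        G[j, i], G[j, j] = s, c
        R = G @ R                # concatenate the subplane rotation
    return R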

References