AnomiGAN: Generative adversarial networks for anonymizing private medical data

01/31/2019
by   Ho Bae, et al.
Seoul National University

Typical personal medical data contains sensitive information about individuals, so storing or sharing such data is often risky. For example, a short DNA sequence can provide information that identifies not only an individual but also his or her relatives. Nonetheless, most countries and researchers agree on the necessity of collecting personal medical data, because medical data, including genomic data, are an indispensable resource for further research and development in disease prevention and treatment. To prevent personal medical data from being misused, techniques that reliably preserve sensitive information must be developed for real-world applications. In this paper, we propose a framework called anonymized generative adversarial networks (AnomiGAN) to improve the privacy of personal medical data while maintaining high prediction performance. We compared our method to state-of-the-art techniques and observed that it preserves the same level of privacy as differential privacy (DP) while achieving better prediction results. We also observed a trade-off between privacy and performance depending on the degree to which the original data are preserved. Here, we provide a mathematical overview of our proposed model and validate it on datasets from the UCI machine learning repository to highlight its utility in practice. Experimentally, our approach delivers better performance than the DP approach.


1 Introduction

To restrain the use of personal data for illegal practices, the right to privacy has been introduced and is being adaptively amended. The right to privacy of medical data should be enforced because each individual’s genetic information is static; a leak of such information would therefore be very dangerous. Genetic markers, which are short DNA sequences, constitute a very sensitive piece of information. Using a genetic marker, it is feasible to uniquely identify individuals and their relatives. If genetic information is not properly secured, there is a risk of genetic discrimination (e.g., denial of insurance) or blackmail (e.g., planting fake evidence at crime scenes) (Wagner and Eckhoff, 2018).

The advent of next-generation sequencing technology has advanced DNA sequencing at an unprecedented rate, thereby enabling impressive scientific achievements (Schuster, 2007). Using information gathered from the Human Genome Project, an international effort has been made to identify the hereditary components of disease, which will allow for earlier detection and more effective treatment strategies (Collins and Mansoura, 2001).

Data sharing between medical institutions is essential for the development of novel treatments for rare genetic diseases. Researchers are capable of identifying similar occurrence patterns of certain rare diseases on the basis of shared, more general data (Weir et al., 2004). Therefore, seamless progress in genomic research largely depends on the ability to share data among different institutions (Oprisanu and De Cristofaro, 2018). The public database GenBank has been generated by international contributors under the collaboration known as the Human Genome Project. Various institutions are responsible for collecting health and genetic data in several countries. In the US, a national research program to collect health and genetic data was launched in 2015. In the UK, Genomics England sequenced the genomes of 100,000 patients with rare diseases and cancer, and the NIH’s Genomic Data Commons (GDC) serves as a unified data repository for the cancer research community (https://gdc.cancer.gov/).

Patient portals and telehealth programs have gained popularity among patients, allowing them to interact with their healthcare system through online health services (Crotty and Slack, 2016). Although these online health services provide convenience, since patients can order prescriptions at home or remotely, patients are required to transmit their private data over the Internet. Most health services follow the guidelines of the Health Insurance Portability and Accountability Act of 1996 (HIPAA), which requires that genetic data, by definition linked to an identifiable person, not be disclosed or made accessible to third parties (in particular employers, insurance companies, educational institutions, or government agencies) except as required by law or with the separate express consent of the person concerned. These guidelines protect patient records, but they may not be upheld when data is shared with a third party.

Online health services are extremely useful tools, but they have introduced privacy vulnerabilities. There are two significant privacy threats involved in online health services: a) hijacking during transmission and b) privacy leaks from storage. To encourage the use of online health services and the provision of sensitive genetic information, a strong privacy shield should be guaranteed for all users and donors, both during transmission and in storage.

There are two approaches to improving the protection of personal medical data: 1) statistics-based anonymization and 2) encryption. The statistics-based approach relies on strong assumptions about the background population (Simmons and Sahinalp, 2019). Differential privacy (DP) (Dwork, 2011) is a state-of-the-art method used to provide strong privacy guarantees. In addition to DP, a number of methods have been suggested to enable the sharing of aggregate personal medical data while preserving participants’ privacy (Erlich and Narayanan, 2014; Homer et al., 2008; Zhou et al., 2011; Sankararaman et al., 2009; Simmons and Berger, 2015). However, this first approach is limited to a small number of genomic loci, which can lead to inaccuracies when the number of loci is scaled up (Simmons and Berger, 2016; Simmons et al., 2016).

Second, cryptography-based methods (homomorphic encryption) enable computation on encrypted data via simple operations such as summation and multiplication. This property of homomorphic encryption allows personal medical data to be shared with accurate results. However, the latency of the computation is on the order of hundreds of seconds (Gilad-Bachrach et al., 2016). In addition, the restriction to simple homomorphic operations limits adaptation to complex models such as neural networks (Bae et al., 2018).

As a number of deep learning-based disease prediction tools have been developed and have demonstrated outstanding prediction results (Min et al., 2017), deep learning techniques have been extended to the design of privacy-preserving deep neural networks (Gilad-Bachrach et al., 2016; Hesamifard et al., 2017; Sanyal et al., 2018; Kim et al., 2016). For example, CryptoNets (Gilad-Bachrach et al., 2016) applies neural networks to infer on encrypted data such as DNA sequences. However, the accuracy and efficiency of this model are low because the non-polynomial activation functions are replaced by polynomial approximations and the weights are converted to lower precision (Bae et al., 2018). Recently, Sanyal et al. modified CryptoNets in a way that allows for parallelization (Sanyal et al., 2018).

The first approach to privacy-preserved sharing incurs no computational overhead, but it suffers a significant trade-off between privacy and accuracy. The second approach guarantees prediction accuracy while preserving privacy, but the server-side computational complexity becomes a bottleneck. To address these issues, we propose a method based on generative adversarial networks (GANs) that preserves privacy while maintaining similar prediction performance.

1.1 Problem Statement

We focus on privacy issues related to two types of adversaries that may be encountered through the use of online health services: active and passive adversaries. An active adversary targets an individual’s private data while users interact with the service on the fly, and a passive adversary targets any information that is stored via online or offline services. For the passive adversary, two major privacy issues may be involved: a) a privacy leak occurring because of the level of security of the service system, and b) propagation of data after a patient consents to the use of their medical information for research purposes; the patient’s record may then be passed to a third party with only minimal anonymization following the de-identification guideline (Berhane Russom, 2012).

Figure 1: The scenario of this study. The scenario consists of a trusted zone and an untrusted zone; the untrusted zone can be further divided into two groups. The first group comprises online medical services, and the second group includes third parties (Google, Dropbox, and Amazon). The user’s medical data is transferred to the online medical service, and the service provides diagnosis results to the user. Upon user consent for data sharing, the user’s data may be propagated to the third parties.

1.2 Solution Intuition

Our proposed framework, AnomiGAN, allows a user to control the anonymization level. For strong anonymization, the parameter that controls the privacy level can be set to zero. The model then anonymizes the data purely based on the target classifier, which keeps the prediction results the same as those for the original data while preserving the participant’s privacy.

AnomiGAN also provides functionality that relaxes the level of confidentiality. This functionality can be used to ensure minimal anonymity when sharing data between institutions that follow the same privacy guidelines. The confidence level depends on the privacy parameter, which acts similarly to the ε of the ε-differential privacy method (Dwork and Pottenger, 2013).

Unlike other deep neural networks, AnomiGAN is designed as a non-deterministic model: variances estimated from parameters stored during the training process are added to a randomly selected layer of the trained model.

1.3 Contributions

In this paper, we propose anonymized generative adversarial networks (AnomiGAN) to protect an individual’s disease information from institutions with the ability to aggregate data across different institutions. Our framework is a generic method that exploits a target classifier, simulating its prediction efforts to preserve the original prediction result. We explore whether a generative model can be constructed to produce meaningful synthetic data that simultaneously preserves the original information while protecting private disease information. We evaluated the proposed method on prediction classifiers for two diseases (breast cancer and chronic kidney disease) and found that the degradation in prediction performance is minimal compared to the original prediction results. Finally, we compared our proposed method to a state-of-the-art privacy-preserving technique and provide an analysis of the privacy parameter and accuracy. Our analysis of AnomiGAN reveals that the model becomes unstable if it employs a non-robust target classifier.

The remainder of this paper is organized as follows. In the following section, we define the entities, operations, and the adversary’s objectives. In Section 3, we present a mathematical overview of our proposed model and its architecture. In Section 4, we present our experimental results and compare them with DP. Finally, we conclude the paper in Section 5.

2 Background

We assume that the service providers use supervised machine learning classifiers to make predictions on personal medical data. A machine learning-based classifier attempts to find a function that maps medical data points to a label such as benign or malignant. Details of the target classifiers are described in Section 4.3. In the following sections, we define the goals and capabilities of adversaries and their relation to the security bound. We provide a preliminary security definition in Section 2.7 for the proof of our model described in Section 3.2.

2.1 Scenario of Our Study

Figure 1 illustrates the overall workflow of our study. The scenario consists of a trusted zone and an untrusted zone. In the trusted zone, patients are not exposed to any threats from an adversary. Within the untrusted zone, the scenario assumes that two groups exist: a) a group that follows the Health Insurance Portability and Accountability Act of 1996 (HIPAA) guidelines to protect patient records, and b) a group that follows the guidelines applicable to cloud services (Google, Dropbox, Amazon, etc.). Upon the user’s consent for data sharing, the user’s data may be propagated to third parties. Two possible types of adversaries exist when private information is sent from the trusted zone to the untrusted zone for online health services. Private information is a candidate target for an active adversary during transmission, and any portion of an individual’s private information that is stored is also a suitable target for a passive adversary. To prevent any compromise of sensitive information, AnomiGAN can be deployed in the trusted zone to anonymize personal medical data.

2.2 Adversarial Goal and Capabilities

It is important to define the capability of a particular adversary in order to measure the relevant privacy aspects. The adversary’s goal in our case is to compromise an individual’s private medical data, including any sensitive information. An adversary may be present in both online and offline health services, as well as in any third party that works closely with medical institutions.

An adversary often makes an effort to estimate a posterior probability distribution with the resources available for breaching privacy. The available resources can be combinations of computational power, time, bandwidth, or physical nodes (Wagner and Eckhoff, 2018). The adversary’s success probability can be quantified over many trials of the adversary’s choice of input. This probability is then used to quantify privacy, with a low success probability corresponding to high privacy.

2.3 Security Bound

Modern cryptography introduces two relaxations to the notion of perfect security (Lindell and Katz, 2014). The first relaxation limits consideration to polynomial-time adversaries, meaning that security is only guaranteed against adversaries that run in polynomial time. The second relaxation allows the adversary a small (negligible) additional success probability. As such, we have designed our scheme as a probabilistic polynomial-time algorithm, whereby the output of the model must be randomized so that an adversary who repeats the procedure with the same input cannot observe an information leak. We assume that the adversary runs in polynomial time and gains at most a negligible additional winning probability. For example, an algorithm that runs in polynomial time completes its computation on every input x within at most p(|x|) steps for some polynomial p. Note that a probabilistic algorithm in a cryptographic system can be viewed as having the capability of tossing coins: the algorithm has access to a source of randomness such that each coin toss independently equals 1 with probability 1/2 and 0 with probability 1/2.

2.4 Generative Adversarial Networks

Generative adversarial networks (GANs) (Goodfellow et al., 2014) are designed to overcome the limitations of other generative models by introducing the concept of adversarial learning between a generator and a discriminator instead of maximizing a likelihood. The generator produces realistic samples via a transformation function that maps a prior distribution over a latent space into the data space. The discriminator acts as an adversary that distinguishes whether samples produced by the generator derive from the real data distribution. Notably, the application of GANs has extended to various fields of study; for example, GANs have recently been employed in cryptography and steganography (Baluja, 2017).

2.5 Differential Privacy

DP is a privacy-preserving model (Dwork, 2008) that protects individuals from privacy loss to an adversary. Intuitively, DP promises that the probability of harm can be minimized by adding noise to the output as follows:

    \mathcal{M}(D) = f(D) + \eta,    (1)

where \mathcal{M} is a randomized function that adds noise η to the output, D is the target database, and f(D) is the deterministic original-query response.

Definition 2.1.

(Differential Privacy). A randomized algorithm \mathcal{M} with domain 𝒟 is (ε, δ)-differentially private if, for all S ⊆ Range(\mathcal{M}) and for all D, D′ ∈ 𝒟 such that ‖D − D′‖₁ ≤ 1:

    \Pr[\mathcal{M}(D) \in S] \leq e^{\epsilon} \Pr[\mathcal{M}(D') \in S] + \delta,    (2)

where ε bounds the absolute value of the privacy loss with probability at least 1 − δ (Dwork et al., 2014).
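
As a concrete illustration of the mechanism in Eq. (1), the following minimal Python sketch adds unbounded Laplacian noise scaled by the sensitivity and ε; the function name and the count-query example are illustrative and not part of the original work.

```python
import numpy as np

def laplace_mechanism(true_answer, sensitivity, epsilon, rng=None):
    """Return a differentially private answer by adding Laplace noise.

    The noise scale is sensitivity / epsilon, so a smaller epsilon means
    more noise and therefore stronger privacy.
    """
    rng = rng or np.random.default_rng()
    return true_answer + rng.laplace(loc=0.0, scale=sensitivity / epsilon)

# Example: privatize a count query, whose sensitivity is 1.
noisy_count = laplace_mechanism(true_answer=42, sensitivity=1.0, epsilon=0.5)
```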

2.6 Notations

The notations used in this paper are as follows:

  • x is the input.

  • x' is the anonymized output given input x.

  • Enc is an encryption function; it takes an input x and returns the encrypted output Enc(x).

  • G is a random generator; it takes a seed s and returns a random output string.

  • M is a trained model.

  • M(x) is the output score given by the trained model for input x.

  • M(x') is the output score given by the trained model for input x'.

  • A is a probabilistic polynomial-time adversary; the adversary is an attacker that queries inputs to the oracle model.

  • σ is the standard deviation of the score M(x).

  • λ is a privacy parameter that controls the confidence level.

2.7 Threat Model

A random oracle model (Canetti et al., 2004) posits a randomly chosen function that can be evaluated only by querying an oracle, which returns the function’s value for a given input x. The security of the random oracle is based on an experiment involving an adversary A and A’s ability to distinguish encryptions. Assume that we have a random oracle that acts like the current anonymization scheme, against which an adversary has only a negligible success probability.

The experiment can be defined for any encryption scheme over input space X and for any adversary A. The experiment is defined as follows:

  1. The random oracle chooses a random anonymization scheme. The scheme maps a medical record x of length n to a transformed medical record x' as the output. The process of mapping can be considered as a table that indicates, for each possible input x, the corresponding output value x'.

  2. The adversary A then chooses a pair of medical records x_0 and x_1.

  3. The random oracle selects a bit b ∈ {0, 1} and sends the encrypted medical record of x_b to the adversary.

  4. The adversary outputs a bit b'.

  5. The output of the experiment is defined as 1 if b' = b, and 0 otherwise. A succeeds in the experiment if it distinguishes which record was encrypted.

Figure 2: Architecture of the model presented in this study.

Given this experiment, the definition of perfect security takes the following general form (Canetti et al., 2004):

Definition 2.2.

The scheme is perfectly secure over input space X if, for every adversary A, it satisfies

    \Pr[\mathrm{Exp}_{A} = 1] = \frac{1}{2}.    (3)

In this encryption scheme, A cannot distinguish between x_0 and x_1. Furthermore, A obtains no information about the presence of a hidden message. In the real world, most systems do not have access to a random oracle. Thus, pseudorandom functions are typically applied in place of the random function (Canetti et al., 2004). With this assumption, the oracle is replaced by a fixed encryption scheme, which corresponds to the transformation of a real system (an implementation of the encryption scheme). The implementation of a random oracle is deemed secure if the probability of success of a random oracle attack is negligible. Moreover, the encryption scheme is soundness secure if the adversary A has a success probability such that

    \Pr[\mathrm{Exp}_{A} = 1] \leq \frac{1}{2} + \mathsf{negl}(n).    (4)
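
For concreteness, the indistinguishability experiment of Section 2.7 can be sketched in a few lines of Python; anonymize and adversary are placeholder callables standing in for the oracle’s scheme and the attacker.

```python
import secrets

def indistinguishability_experiment(anonymize, adversary, x0, x1):
    """One run of the experiment from Section 2.7: the oracle picks a random
    bit b, anonymizes the corresponding record, and the adversary guesses b.
    Returns 1 if the guess is correct and 0 otherwise."""
    b = secrets.randbelow(2)                 # oracle's secret bit
    challenge = anonymize(x1 if b else x0)   # anonymized record of x_b
    guess = adversary(x0, x1, challenge)     # adversary outputs a bit
    return int(guess == b)

# The scheme is considered secure if, over many runs, no efficient adversary
# wins with probability noticeably better than 1/2 (Eqs. (3) and (4)).
```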

3 Methods

Figure 3: Model training. The encoder accepts x and a random string as input, which are fed into the neural network. The discriminator takes the original input x and the output of the encoder, and outputs probabilities from the logits (last fully connected layer). The target classifier takes the input x' and outputs the prediction score.

Our training involves three parties: an encoder, a discriminator, and a target classifier (a pre-trained discriminator). The encoder generates synthetic data close to the form of the input data, and the target classifier produces a prediction score for the synthetic data generated by the encoder. The discriminator then generates a confidence score indicating whether a piece of data is synthetic or original. The encoder is trained with random noise to learn to generate synthetic data such that the target model’s prediction for the synthetic data is the same as for the original data.

3.1 Anonymization using GANs

The architecture of the model is illustrated in Figure 2. The encoder takes an input x and produces the output x', which is given to both the discriminator and the target classifier. The discriminator outputs the probability that the input is original rather than synthesized, given input x'. The target classifier outputs a prediction score given input x'. The learning objective of the encoder is to minimize the discriminator’s ability to distinguish x' from x while maximizing the prediction score of the target classifier.

The encoder accepts a message of length n as input and generates a random string of the same length; both are fed into the neural network. The first layer consists of 64 filters with a length of 4; the second layer consists of 32 filters with a length of 2; the third layer consists of 16 filters with a length of 2; the fourth layer consists of 8 filters with a length of 2. Additional layers are then added in reverse order back to the input length. Batch normalization (Ioffe and Szegedy, 2015) is used at each layer, and tanh (LeCun et al., 2012) is used as the activation at each layer except for the final layer, where ReLU (Nair and Hinton, 2010) is used as the activation function. The discriminator takes the output of the encoder as input to determine whether the output is real or generated. A sigmoid activation function is used to output probabilities from the logits.
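
A minimal Keras sketch of the encoder and discriminator described above is given below. The filter counts and lengths follow the text; the channel-wise concatenation of the record and random string, the mirrored layer widths, and the discriminator’s hidden layer are assumptions made for illustration rather than the original implementation.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

def build_encoder(n_features):
    """Encoder sketch: 64/32/16/8 Conv1D filters of length 4/2/2/2, mirrored
    back toward the input, batch normalization and tanh at every layer,
    and ReLU on the final layer."""
    record = layers.Input(shape=(n_features, 1))
    noise = layers.Input(shape=(n_features, 1))
    x = layers.Concatenate(axis=-1)([record, noise])
    for filters, width in [(64, 4), (32, 2), (16, 2), (8, 2),   # contracting path
                           (16, 2), (32, 2), (64, 4)]:          # mirrored path (assumed)
        x = layers.Conv1D(filters, width, padding="same")(x)
        x = layers.BatchNormalization()(x)
        x = layers.Activation("tanh")(x)
    synthetic = layers.Conv1D(1, 1, padding="same", activation="relu")(x)
    return Model([record, noise], synthetic, name="encoder")

def build_discriminator(n_features):
    """Discriminator sketch: sigmoid over the logit of a fully connected layer."""
    x_in = layers.Input(shape=(n_features, 1))
    h = layers.Conv1D(32, 2, padding="same", activation="tanh")(x_in)
    h = layers.Flatten()(h)
    logit = layers.Dense(1)(h)
    return Model(x_in, layers.Activation("sigmoid")(logit), name="discriminator")
```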

To define the learning objective, let θ_E, θ_D, and θ_C denote the parameters of the encoder, the discriminator, and the target classifier, respectively. Let E(x) be the output of the encoder on x, C(x') be the output of the target classifier on x', and D(x, x') be the output of the discriminator on x and x'. Let L_E, L_D, and L_C denote the losses of the encoder, the discriminator, and the target classifier. Let α and β denote the weight parameters for the encoder and the discriminator. The parameter λ can be used to control the anonymization level; for a strong anonymization level, λ can be set to its minimum value. The encoder then has the following objective function:

(5)

where d(x, x') is the Euclidean distance between the synthetic and original data and λ controls the confidence level. The discriminator has the sigmoid cross-entropy loss:

(6)

where the label is 1 if the input is the original data x and 0 if it is the synthetic data x', and D(x, x') is the score of the discriminator given inputs x and x'. The target classifier has a classification score of:

(7)

where L_C is the cost function of the pre-defined classifier.
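
Because the exact forms of Eqs. (5)-(7) are not reproduced here, the following sketch shows one plausible instantiation of the encoder and discriminator losses; the mean-squared terms are illustrative rather than the paper’s exact objective, and the default weights simply reuse the values reported in Section 4.7.

```python
import tensorflow as tf

bce = tf.keras.losses.BinaryCrossentropy()

def encoder_loss(d_fake, c_orig, c_fake, x, x_anon, alpha=0.5, beta=0.5, lam=0.3):
    """Encoder objective in the spirit of Eq. (5): fool the discriminator, keep the
    target classifier's score on x' close to its score on x, and (weighted by
    lambda) keep x' close to x in Euclidean distance."""
    fool_term = bce(tf.ones_like(d_fake), d_fake)
    utility_term = tf.reduce_mean(tf.square(c_orig - c_fake))
    diff = tf.reshape(x - x_anon, [tf.shape(x)[0], -1])
    distance_term = tf.reduce_mean(tf.norm(diff, axis=1))
    return alpha * fool_term + beta * utility_term + lam * distance_term

def discriminator_loss(d_real, d_fake):
    """Sigmoid cross-entropy in the spirit of Eq. (6): originals labeled 1,
    synthetics labeled 0."""
    return bce(tf.ones_like(d_real), d_real) + bce(tf.zeros_like(d_fake), d_fake)
```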

3.2 Security Principle of Anonymized GANs

In this section, we show that AnomiGAN yields a scheme that is indistinguishable to an adversary. AnomiGAN has a structure similar to the one-time pad encryption scheme, except that a probabilistic model is used to generate the one-time-pad output. Since a probabilistic model generates a pseudorandom output that appears random to any polynomial-time adversary, AnomiGAN can be shown to be computationally secure. A simple intuition for the indistinguishability of the scheme is that the adversary is allowed to choose inputs and observe the corresponding synthesized data: the adversary can freely interact with an encryption oracle, regarded as a black box that anonymizes data chosen by the adversary. For AnomiGAN, the synthesized data can be viewed as encrypted data. Formally, the model takes as input a seed s and a medical record x, where s is chosen uniformly at random, and outputs the synthesized data. We define our anonymization scheme as follows for a medical record of length n:

  • Select a seed s uniformly at random and output it.

  • On input a seed s, the pseudorandom generator G outputs a random string r of length n.

  • AnomiGAN: on input a random string r and a medical record x, the model outputs the synthesized medical record x':

    (8)

The random string r is replaced at each learning step, and the variances of each layer are stored during the learning process. The variance of each layer is added to randomly selected layers at inference time to ensure that the generator does not produce the same output for the same input. Intuitively, a generative model is a probabilistic model; thus x' appears completely random to an adversary who observes a medical record, operating similarly to the one-time pad. Note that a similar operation can be expected when the XOR operation is replaced by modular addition.
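
A procedural sketch of this anonymization step is given below; the encoder.apply(...) hook that perturbs one layer with its stored variance is a hypothetical interface used only to illustrate the flow.

```python
import secrets
import numpy as np

def anonymize_record(record, encoder, layer_variances):
    """Sketch of Section 3.2: draw a fresh seed, expand it into a pseudorandom
    string of the record's length, pick one layer, and add that layer's stored
    variance so repeated queries on the same record differ."""
    seed = secrets.randbits(128)                                             # uniform seed s
    rand_string = np.random.default_rng(seed).standard_normal(len(record))  # G(s)
    layer = secrets.randbelow(len(layer_variances))                         # layer chosen at inference
    sigma = layer_variances[layer]                                          # variance stored during training
    # Hypothetical hook: run the encoder with the chosen layer perturbed by sigma.
    return encoder.apply(record, rand_string, perturb_layer=layer, sigma=sigma)
```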

Theorem 3.1.

If G is a pseudorandom generator and the model is probabilistic, then AnomiGAN yields a scheme that is indistinguishable to an adversary.

Proof.

The rationale for the proof is that if the model is probabilistic and G is a pseudorandom generator, then the resulting scheme is identical to the one-time pad encryption scheme and satisfies Definition 2.2. We know that r has the same length as x, and that the output length of x' is equal to that of both r and x. Having r, x, and x' of the same length, combined with the XOR operation, is identical to the one-time pad encryption scheme, which has the formal Definition 2.2. Let a polynomial-time adversary construct a distinguisher for G whose success probability is defined in Equation (10). The distinguisher is given an input w, and its goal is to determine whether w is truly random or generated by G. The distinguisher emulates the experiment described in Section 2.7 and distinguishes two cases. If the input w is truly random, then the distinguisher has a success probability of

    \Pr[\mathrm{Exp}_{A} = 1] = \frac{1}{2},    (9)

which follows Definition 2.2. If the input w is equal to G(s), where the seed s is chosen uniformly at random, then the distinguisher has a success probability of

    \Pr[\mathrm{Exp}_{A} = 1] = \frac{1}{2} + \mu(n).    (10)

By the assumption that G is a pseudorandom generator and the model is probabilistic, μ(n) must be negligible. ∎

4 Results

Figure 4: Learning performance of AnomiGAN. (a) Loss of encoder and discriminator for breast cancer dataset. (b) Loss of encoder and discriminator for chronic kidney disease dataset.

4.1 Experimental Environment

We performed experiments using Ubuntu 14.04 (3.5 GHz Intel i7-5930K and GTX Titan X Maxwell, 12 GB). For the implementation, we exploited the scikit-learn library (version 0.18) for convolutional neural networks, the Keras library (version 2.0.6) for neural networks, TensorFlow (1.11.0) for the generative adversarial networks, and Biopython (1.72) for input representation.

4.2 Datasets

We simulated our approach using the Wisconsin breast cancer dataset from the UCI machine learning repository (Blake, 1998) and the chronic kidney disease dataset from the same repository (Rubini and Eswaran, 2015). The Wisconsin breast cancer and chronic kidney disease datasets consist of 30 and 24 features, respectively. The datasets were randomly partitioned into training and test sets of 90% and 10%, respectively.
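
For the breast cancer data, the split described above can be reproduced with scikit-learn, which ships the Wisconsin diagnostic dataset (30 features); the chronic kidney disease table would instead be downloaded from the UCI repository. The random seed is arbitrary.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)   # Wisconsin breast cancer, 30 features
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.10, random_state=0)
```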

4.3 Target Classifiers

Many services incorporate disease classifiers built with machine learning techniques. For our experiments, we selected breast cancer and chronic kidney disease models from Kaggle competitions as the target classifiers. The classifiers are used with black-box access as the target classifiers in our method. We selected these classifiers for two reasons: a) both achieve high disease-detection accuracy on their test datasets, and b) both are open-source implementations, which makes them easily accessible as target classifiers.

4.4 Model Training

For model training, we used the Adam optimizer (Kingma and Ba, 2014) with a multi-class logarithmic loss function, a learning rate of 0.001, a beta rate of 0.5, 50,000 epochs, and a mini-batch size of 10. The objective function to be minimized is described in Eq. (5). Most of these parameters and the network structure were determined experimentally to achieve optimal performance. Figure 4 shows the training loss of each model. For the breast cancer dataset, the discriminator achieves its optimal loss after 3,000 steps, while the encoder requires more steps to generate samples resembling the original data. For the chronic kidney disease dataset, the optimal loss is achieved after 10,000 steps.
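
The training configuration above translates into roughly the following alternating update, assuming the encoder, discriminator, and target classifier are Keras models and reusing the illustrative encoder_loss and discriminator_loss sketched after Eq. (7).

```python
import tensorflow as tf

enc_opt = tf.keras.optimizers.Adam(learning_rate=0.001, beta_1=0.5)   # lr and beta from the text
disc_opt = tf.keras.optimizers.Adam(learning_rate=0.001, beta_1=0.5)
EPOCHS, BATCH_SIZE = 50_000, 10                                       # values quoted in the text

@tf.function
def train_step(x, noise, encoder, discriminator, target_classifier):
    """One alternating encoder/discriminator update against a frozen target classifier."""
    with tf.GradientTape() as e_tape, tf.GradientTape() as d_tape:
        x_anon = encoder([x, noise], training=True)
        d_real = discriminator(x, training=True)
        d_fake = discriminator(x_anon, training=True)
        c_orig = target_classifier(x, training=False)       # black box, never updated
        c_fake = target_classifier(x_anon, training=False)
        e_loss = encoder_loss(d_fake, c_orig, c_fake, x, x_anon)
        d_loss = discriminator_loss(d_real, d_fake)
    enc_opt.apply_gradients(zip(e_tape.gradient(e_loss, encoder.trainable_variables),
                                encoder.trainable_variables))
    disc_opt.apply_gradients(zip(d_tape.gradient(d_loss, discriminator.trainable_variables),
                                 discriminator.trainable_variables))
    return e_loss, d_loss
```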

4.5 Evaluation Process

We exploited DP, in particular the Laplacian mechanism (Dwork and Pottenger, 2013), to compare anonymization performance against the corresponding accuracy and AUC. For the evaluation metrics, the accuracy, defined as (TP + TN) / (TP + FP + FN + TN), where TP, FP, FN, and TN represent the numbers of true positives, false positives, false negatives, and true negatives, respectively, and the AUC are used to measure performance between original and anonymized samples as the model parameters change. The correlation coefficient is used to measure the linear relationship between the original and anonymized samples as the privacy parameters change. We generated anonymized data for each setting of the privacy parameter (λ for our method and ε for DP) by randomly selecting 1000 cases, and obtained the average accuracy, AUC, and correlation coefficient against the corresponding original data. In the next step, we fixed the test data and repeatedly generated anonymized data to validate the probabilistic behavior of our model. As shown in Table 1, the variance of each encoder layer is added to the corresponding encoder layer at inference time. The process was repeated 1000 times with the fixed test data.
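
The three evaluation metrics can be computed as in the sketch below, where the accuracy follows the definition above and the correlation coefficient is taken as the Pearson correlation between the flattened original and anonymized feature matrices; the 0.5 decision threshold is an assumption.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def evaluate(y_true, scores_anon, x_orig, x_anon, threshold=0.5):
    """Accuracy, AUC, and correlation coefficient for anonymized samples."""
    y_pred = (scores_anon >= threshold).astype(int)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    tn = np.sum((y_pred == 0) & (y_true == 0))
    accuracy = (tp + tn) / len(y_true)                  # (TP + TN) / (TP + FP + FN + TN)
    auc = roc_auc_score(y_true, scores_anon)
    corr = np.corrcoef(x_orig.ravel(), x_anon.ravel())[0, 1]
    return accuracy, auc, corr
```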

Figure 5: Anonymization performance on the breast cancer dataset: a fixed test set was selected from the UCI machine learning repository, and the correlation coefficient, accuracy, and AUC were measured while varying the privacy parameter in increments of 0.1.
Figure 6: Anonymization performance on the chronic kidney disease dataset: a fixed test set was selected from the UCI machine learning repository, and the correlation coefficient, accuracy, and AUC were measured while varying the privacy parameter in increments of 0.1.

4.6 Comparison to Differential Privacy

DP achieves plausible privacy by adding Laplacian noise to a statistic (Dwork and Pottenger, 2013). A minimal value of the parameter ε provides the strongest privacy, and the risk to privacy increases as ε increases. The amount of noise presents a trade-off between accuracy and privacy. Note that the standard, unbounded-noise version of the Laplacian mechanism was applied in the experiments.

Figure 5 shows an experiment comparing our proposed algorithm and the DP algorithm on a fixed breast cancer test set. The experiments were conducted by increasing the parameter in increments of 0.1. DP showed good performance in the correlation coefficient but a significant drop in both accuracy and AUC. In contrast, our proposed method’s correlation coefficient drops slowly while maintaining good accuracy and AUC.

Figure 6 shows an experiment for our proposed algorithm and the DP algorithm on the kidney disease test set. Behavior similar to the breast cancer dataset was observed, but the correlation coefficient did not degrade beyond a certain point. In the case of AnomiGAN, we noticed from Figure 5 that the correlation coefficient only drops to a certain level due to the additional loss from the discriminator, as described in Equation (5).

Figure 7 details the features and their correlations. Figure 7 (a) shows the feature correlations of breast cancer, and Figure 7 (b) shows the feature correlations of chronic kidney disease. In the case of DP, we noticed that the trade-off between privacy and accuracy is stronger when the features are strongly correlated with each other.

Figure 7: Feature correlation (best viewed in color). Correlations between features of breast cancer (a) and chronic kidney disease datasets (b).
Figure 8: Comparison of the privacy parameter λ. The correlation coefficient, accuracy, and AUC are measured by varying the privacy parameter in increments of 0.1 for fixed test data.

Table 1: Performance results of the model upon adding variance to each layer.

                          Layer1   Layer2   Layer3   Layer4   Layer5   Layer6   Layer7
(Breast cancer)
Correlation coefficient    0.88     0.86     0.87     0.83     0.83     0.86     0.86
Accuracy (%)              88.00    90.83    88.33    91.67    85.83    91.67    90.00
AUC                       0.864    0.916    0.858    0.906    0.931    0.921    0.906

(Chronic kidney disease)
Correlation coefficient    0.86     0.89     0.88     0.89     0.88     0.89     0.86
Accuracy (%)              96.70   100.00   100.00   100.00   100.00   100.00   100.00
AUC                       1.000    1.000    1.000    1.000    1.000    1.000    1.000

4.7 Performance Comparison

We evaluated the performance of our proposed method on the two classifiers (breast cancer and chronic kidney disease), measuring prediction performance and correlation coefficients between original and anonymized data. The experiments were conducted by varying the parameters α, β, and λ in increments of 0.1. Figure 8 shows an experiment for the privacy parameter λ. The term λ is directly associated with the Euclidean distance, indicating that the privacy level decreases as λ increases. As shown in Figure 8 (a), the correlation coefficient indicates a stronger association between original and anonymized data as λ is increased for both datasets. Figure 8 (b) and (c) indicate that the averaged accuracy and AUC do not degrade as λ is increased. This is expected behavior, as the term weighted by λ keeps the anonymized data close to the original while the encoder still maximizes the discriminator loss with respect to the target classifier.

To validate the probabilistic behavior of our model with respect to the variance added to each layer, we measured the mean correlation coefficient for each of the seven encoder layers, as shown in Table 1. The results indicate that adding variance to different layers produces differences in the correlation coefficient while having limited effect on both accuracy and AUC.

Table 2 shows the training and running time for each dataset. The training time varies depending on the hyperparameters and was measured with the optimal hyperparameters. The loss weight parameters α and β were set to 0.5, and the privacy parameter λ was set to 0.3. The training time includes measuring the variance of each layer, which is then used at running time.

Table 2: Training and running time of the proposed method for the two classifiers.

Dataset                   No. of features   Training time (hours)   Running time (secs)
Breast cancer                   30                    2                    1.00
Chronic kidney disease          24                    1                    1.00

5 Discussion

Here, we have introduced a novel approach for anonymizing private data while preserving the original prediction accuracy. We showed that, for a certain range of privacy parameters, our approach preserves privacy while maintaining better accuracy and AUC than differential privacy. Moreover, we provided a mathematical overview showing that our model is secure against an efficient adversary, demonstrated the expected behavior of the model, and evaluated its performance against a state-of-the-art privacy-preserving method.

One of our primary motivations for this study was that many companies are providing new services based on deep neural networks, and we believe this will extend to online medical services. The potential risks regarding the security of medical information (including genomic data) are higher than the current risks to other private information, as demonstrated by Facebook’s recent privacy scandal. In addition, it is difficult to notice a privacy breach even when privacy policies are in place. For example, when a patient consents to the use of medical diagnostic techniques, there is no guarantee that a third party to which that information is propagated will adhere to the same privacy policies. Finally, machine learning as a service (MLaaS) is mostly provided by Google, Microsoft, or Amazon due to hardware constraints, and it is even more challenging to maintain user data privacy when using such services.

Applying traditional security to deep learning requires encryption and decryption phases, which makes it impractical in the real world due to the enormous computational complexity. As a result, other privacy-preserving techniques such as DP are likely to be exploited in deep learning. Toward this objective, we developed a new privacy-preserving approach based on deep learning. Our method is not limited to medical data, and our framework can be extended in many ways around the concept of exploiting a target classifier as a discriminator.

Unlike a statistics-based approach, our method does not require a background population to achieve good prediction results. AnomiGAN also provides the ability to share data while minimizing privacy risks. We believe that online medical services using deep neural network technology will become part of our daily lives, and it will no longer be possible to overlook issues regarding the privacy of medical data. We believe that our methodology will encourage the anonymization of personal medical data. As part of future studies, we plan to extend our model to genomic data. The continuous investigation of privacy in medical data will benefit human health and enable the development of various diagnostic tools for early disease detection.

References

  • Bae et al. (2018) Bae, H., et al. (2018). Security and Privacy Issues in Deep Learning. arXiv preprint arXiv:1807.11655.
  • Baluja (2017) Baluja, S. (2017). Hiding images in plain sight: Deep steganography. In Advances in Neural Information Processing Systems, pages 2069–2079.
  • Berhane Russom (2012) Berhane Russom, M. (2012). Concepts of Privacy at the Intersection of Technology and Law. Ph.D. thesis.
  • Blake (1998) Blake, C. (1998). UCI repository of machine learning databases. http://www.ics.uci.edu/~mlearn/MLRepository.html.
  • Canetti et al. (2004) Canetti, R., et al. (2004). The random oracle methodology, revisited. Journal of the ACM (JACM), 51(4), 557–594.
  • Collins and Mansoura (2001) Collins, F. S. et al. (2001). The human genome project: revealing the shared inheritance of all humankind. Cancer: Interdisciplinary International Journal of the American Cancer Society, 91(S1), 221–225.
  • Crotty and Slack (2016) Crotty, B. H. et al. (2016). Designing online health services for patients. Israel journal of health policy research, 5(1), 22.
  • Dwork (2008) Dwork, C. (2008). Differential privacy: A survey of results. In International Conference on Theory and Applications of Models of Computation, pages 1–19. Springer.
  • Dwork (2011) Dwork, C. (2011). Differential privacy. In Encyclopedia of Cryptography and Security, pages 338–340. Springer.
  • Dwork and Pottenger (2013) Dwork, C. et al. (2013). Toward practicing privacy. Journal of the American Medical Informatics Association, 20(1), 102–108.
  • Dwork et al. (2014) Dwork, C., et al. (2014). The algorithmic foundations of differential privacy. Foundations and Trends® in Theoretical Computer Science, 9(3–4), 211–407.
  • Erlich and Narayanan (2014) Erlich, Y. et al. (2014). Routes for breaching and protecting genetic privacy. Nature Reviews Genetics, 15(6), 409.
  • Gilad-Bachrach et al. (2016) Gilad-Bachrach, R., et al. (2016). Cryptonets: Applying neural networks to encrypted data with high throughput and accuracy. In International Conference on Machine Learning, pages 201–210.
  • Goodfellow et al. (2014) Goodfellow, I., et al. (2014). Generative adversarial nets. In Advances in neural information processing systems, pages 2672–2680.
  • Hesamifard et al. (2017) Hesamifard, E., et al. (2017). CryptoDL: Deep Neural Networks over Encrypted Data. arXiv preprint arXiv:1711.05189.
  • Homer et al. (2008) Homer, N., et al. (2008). Resolving individuals contributing trace amounts of DNA to highly complex mixtures using high-density SNP genotyping microarrays. PLoS genetics, 4(8), e1000167.
  • Ioffe and Szegedy (2015) Ioffe, S. et al. (2015). Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167.
  • Kim et al. (2016) Kim, J., et al. (2016). Collaborative analytics for data silos. In Data Engineering (ICDE), 2016 IEEE 32nd International Conference on, pages 743–754. IEEE.
  • Kingma and Ba (2014) Kingma, D. et al. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
  • LeCun et al. (2012) LeCun, Y. A., et al. (2012). Efficient backprop. In Neural networks: Tricks of the trade, pages 9–48. Springer.
  • Lindell and Katz (2014) Lindell, Y. et al. (2014). Introduction to modern cryptography. Chapman and Hall/CRC.
  • Min et al. (2017) Min, S., et al. (2017). Deep learning in bioinformatics. Briefings in bioinformatics, 18(5), 851–869.
  • Nair and Hinton (2010) Nair, V. et al. (2010). Rectified linear units improve restricted boltzmann machines. In Proceedings of the 27th international conference on machine learning (ICML-10), pages 807–814.
  • Oprisanu and De Cristofaro (2018) Oprisanu, B. et al. (2018). AnoniMME: Bringing Anonymity to the Matchmaker Exchange Platform for Rare Disease Gene Discovery. bioRxiv, page 262295.
  • Rubini and Eswaran (2015) Rubini, L. J. et al. (2015). Generating comparative analysis of early stage prediction of Chronic Kidney Disease. International Journal of Modern Engineering Research (IJMER), 5(7), 49–55.
  • Sankararaman et al. (2009) Sankararaman, S., et al. (2009). Genomic privacy and limits of individual detection in a pool. Nature genetics, 41(9), 965.
  • Sanyal et al. (2018) Sanyal, A., et al. (2018). TAPAS: Tricks to Accelerate (encrypted) Prediction As a Service. arXiv preprint arXiv:1806.03461.
  • Schuster (2007) Schuster, S. C. (2007). Next-generation sequencing transforms today’s biology. Nature methods, 5(1), 16.
  • Simmons and Sahinalp (2019) Simmons, S., et al. (2019). Protecting Genomic Data Privacy with Probabilistic Modeling. Proceedings of the 24th Pacific Symposium on Biocomputing.
  • Simmons and Berger (2015) Simmons, S. et al. (2015). One size doesn’t fit all: measuring individual privacy in aggregate genomic data. In Proceedings. IEEE Symposium on Security and Privacy. Workshops, volume 2015, page 41. NIH Public Access.
  • Simmons and Berger (2016) Simmons, S. et al. (2016). Realizing privacy preserving genome-wide association studies. Bioinformatics, 32(9), 1293–1300.
  • Simmons et al. (2016) Simmons, S., et al. (2016). Enabling privacy-preserving GWASs in heterogeneous human populations. Cell systems, 3(1), 54–61.
  • Wagner and Eckhoff (2018) Wagner, I. et al. (2018). Technical privacy metrics: a systematic survey. ACM Computing Surveys (CSUR), 51(3), 57.
  • Weir et al. (2004) Weir, R. F., et al. (2004). The stored tissue issue: Biomedical research, ethics, and law in the era of genomic medicine. Oxford University Press.
  • Zhou et al. (2011) Zhou, X., et al. (2011). To release or not to release: evaluating information leaks in aggregate human-genome data. In European Symposium on Research in Computer Security, pages 607–627. Springer.