A Fully Private Pipeline for Deep Learning on Electronic Health Records

Edward Chou, et al. (November 25, 2018)

We introduce an end-to-end private deep learning framework, applied to the task of predicting 30-day readmission from electronic health records. By using differential privacy during training and homomorphic encryption during inference, we demonstrate that our proposed pipeline can maintain high performance while providing robust privacy guarantees against information leakage during data transmission or attacks against the model. We also explore several techniques to address the privacy-utility trade-off in deploying neural networks with privacy mechanisms, improving the accuracy of differentially private training and the computational cost of encrypted operations using ideas from both machine learning and cryptography.


1 Introduction

Deep neural networks have been applied to a variety of clinical tasks to great success. Medical imaging diagnosis gulshan2016development , genome processing nguyen2017metagenomic , and disease onset predictions liu2018deepehr are domains where deep learning could help uncover patterns in data and greatly improve quality of care and treatment. Unfortunately, since medical data is also extremely privacy-sensitive, the healthcare industry is subject to stringent patient protection regulations such as HIPAA and GINA, impeding the widespread adoption of data mining techniques in the medical community asghar2017hipaaprivacyhealth .

Without addressing these privacy concerns, it is unlikely that machine learning as a service (MLaaS) platforms will be adopted by the healthcare community due to the risk of information leakage during data transmission or to cloud providers bae2018securityprivacyissues . Anonymized data is vulnerable to de-anonymization attacks, as shown in the Netflix deanonymization demonstration narayanan2008netflixdeanonymization , and thus is not an adequate solution for data sharing in the healthcare domain gymrek2013identifyinggenomesurname . The alternative scheme of deploying only the trained models is also insufficient, as recent works yeom2017overfittingMI ; carlini2018secretsharer have demonstrated how neural networks could memorize training data even when they are not overfitting. Attacks like membership inference shokri2016membershipinference and model inversion fredrickson2015modelinversion can reveal population information or recover training inputs from a neural network, in some cases with only black-box access to the model.

Figure 1: The healthcare provider trains a neural network using differential privacy and hosts the encrypted model on a third-party host, allowing end users to send and receive encrypted data.

Machine-learning services that are private and secure by design will allow healthcare practitioners to benefit from advances in deep learning. In this work, we explore compatible constructs for private and secure machine learning, using differential privacy during training abadi2017privatetwoapproaches and homomorphic encryption for inference gilad2016cryptonets, in order to provide a fully private pipeline. In addition to our work being the first to combine and apply these techniques to realistic clinical tasks, we also propose several guidelines to improve accuracy and computational performance. In particular, we demonstrate the importance of standardizing EHR data to enhance differentially private learning in terms of privacy costs and training stability. We also use state-of-the-art techniques to improve network performance and computational overhead through parameterized activation functions with coefficients quantized to leverage sparse polynomial multiplication.

2 Related Work

Differential privacy is a privacy construct which guarantees that the inclusion of any single individual does not substantially change the overall statistics of the population dwork2008differential. Formally, a randomized algorithm M is (ε, δ)-differentially private if, for all datasets D and D′ differing in a single record and all sets of outputs S, Pr[M(D) ∈ S] ≤ e^ε · Pr[M(D′) ∈ S] + δ. Applying differential privacy to neural networks helps defend against membership inference and model inversion attacks abadi2017privatetwoapproaches. This can be achieved either by applying noise to gradients while training a single model abadi2016dpsgd ; song2013stochastic or by segregating data and adding noise in a collaborative learning setting papernot2018scalable ; shokri2015privacy .

Gentry et al. gentry2009fully introduced fully homomorphic encryption (FHE), which allows anyone to perform computation over encrypted data without having to decrypt it. A weaker version of FHE, called leveled homomorphic encryption (LHE), permits a subset of arithmetic operations on a depth-bounded arithmetic circuit brakerski2014efficient . CryptoNets gilad2016cryptonets was one of the first works to apply LHE to a neural network setting. More recently, chabanne2017privacy and hesamifard2017cryptodl extended this technique to deeper network architectures and developed low-degree polynomial approximations to common activation functions (i.e., ReLU, Sigmoid), in addition to leveraging batch normalization for stability.

Several previous works have attempted to apply privacy techniques to the healthcare setting. joppe2014encryptedmedical uses homomorphic encryption to encrypt a linear regression model trained on medical databases. A good deal of literature also studies the use of differential privacy (DP) in medicine shaked2016publishingdpmedical ; dankar2012dphealth ; dankar2013dphealthcare , although the focus is mainly on applying DP to the datasets rather than to the ML algorithm.

3 Methods

3.1 Differentially Private Stochastic Gradient Descent (DP-SGD)

DP-SGD optimization was developed by abadi2016dpsgd and involves adding Gaussian noise to, and clipping, the gradients of neural networks during training with stochastic gradient descent. It also keeps track of the privacy loss through a privacy accountant mcsherry2009privacyaccounting, which prematurely terminates training when the total privacy cost of accessing the training data exceeds a predetermined budget. Differential privacy is attained because clipping bounds the L2-norm of individual gradients, thus limiting the influence of each example on the learning updates. An outline of the DP-SGD algorithm is included in the appendix.

Through standardization, we scale and translate each feature so that its values lie between 0 and 1. As seen from Table 1, this greatly reduces the L2 norm of the gradients and, equivalently, the clipping bound and the amount of noise required for the privacy guarantees. We also observe that standardization improves both AUC and recall, which are especially important given the scarcity of positive labels.
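A minimal sketch of this scaling step, assuming scikit-learn's MinMaxScaler and hypothetical feature columns (not the authors' preprocessing code), is:

```python
# Minimal sketch of the [0, 1] feature scaling described above,
# using scikit-learn's MinMaxScaler. Feature values are hypothetical.
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X_train = np.array([[54, 3, 11], [82, 1, 4], [67, 7, 25]], dtype=float)  # e.g. age, prior visits, num medications
X_test = np.array([[49, 2, 8]], dtype=float)

scaler = MinMaxScaler(feature_range=(0, 1))
X_train_std = scaler.fit_transform(X_train)   # fit only on training data
X_test_std = scaler.transform(X_test)         # reuse the same min/max at test time

# Smaller feature ranges translate into smaller per-example gradient norms,
# so a tighter clipping bound (and less injected noise) suffices for DP-SGD.
print(X_train_std.max(axis=0), X_train_std.min(axis=0))
```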

3.2 CryptoNets - Inference on Encrypted Data

We use a leveled HE scheme with a pretrained network, as outlined in gilad2016cryptonets, to support inference on encrypted inputs. We use the FV-RNS scheme proposed by bajard2016full. This is a residue number system (RNS) variant of the FV encryption scheme fan2012somewhat and is implemented in SEAL, a library for homomorphic encryption seal23 . We choose the ring dimension and two plaintext moduli to accommodate our network, and each coefficient modulus is decomposed into four 64-bit moduli for efficient use of FV-RNS. Further details of the encryption operations are given in the appendix in the interest of space.

Neural networks mostly consist of HE-compatible multiplicative and additive operations, with the exception of non-linear activation functions, which the original CryptoNets paper gilad2016cryptonets substitutes with a square activation. However, the activation function of a neural network is critical for convergence glorot2011deep , and it has been shown that polynomial approximations of activation functions retain much of the performance of their nonlinear counterparts gautier2016globally ; livni2014computational . We polynomially approximate the Swish activation using a minimax fit (details provided in the appendix), giving us a low-degree polynomial whose coefficients are then quantized to powers of two. Extra multiplicative operations are expensive in HE schemes, so more complex polynomial activations require more computation. However, while a brute-force implementation of the underlying polynomial multiplication would require O(n²) time, HE methods accomplish it in O(n log n) via the Number Theoretic Transform harvey2014faster when the coefficient modulus q is chosen such that q − 1 is divisible by 2n. Thus, we use the quantized approximation, whose power-of-two coefficients encode to sparse plaintext polynomials, which helps lower the computational cost of our chosen activation layer.
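To make the HE-compatibility concrete, the sketch below evaluates a degree-2 polynomial activation using only additions and multiplications; the power-of-two coefficients shown are hypothetical placeholders, not the fitted values from our minimax procedure.

```python
# Sketch: evaluating a polynomial activation using only additions and
# multiplications (the operations an HE scheme supports). The degree-2
# coefficients below are hypothetical power-of-two values for illustration.
def quantized_swish_approx(x):
    a2, a1, a0 = 2.0**-3, 2.0**-1, 2.0**-2   # hypothetical coefficients, each a power of 2
    return a2 * x * x + a1 * x + a0          # one ciphertext-ciphertext multiply, plus cheap plaintext multiplies

# In the encrypted domain, x would be a ciphertext and a2, a1, a0 plaintexts;
# power-of-two coefficients encode to sparse plaintext polynomials, making the
# plaintext multiplications inexpensive.
print(quantized_swish_approx(1.5))
```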

4 Experiments

4.1 Dataset and Model

Our dataset strack2014impact is obtained from the UCI Machine Learning Repository and contains 10 years (1999-2008) of medical records for more than 101,000 patients from 130 US hospitals. The data consists of demographics and clinical metrics associated with diabetes risk, in addition to the readmission outcome. Features with about 40% missing values, such as medical specialty, payer code, and weight, are removed from our analysis. We aggregated ICD-9 codes that represent similar diagnoses into 10 groups and converted each categorical feature into a one-hot encoding. We then randomly split the dataset into train and test sets with a 75:25 ratio. Our goal is to predict whether a diabetic patient will be readmitted within 30 days of discharge. Being a key indicator of quality of care, this task has been widely studied in the existing literature, and for this dataset in particular by works such as chopra2017recurrent and bhuvan2016ehrdiabetic . Our network consists of one hidden layer of size 32 and one output layer, each followed by the approximated, quantized Swish activation function. We use a mean squared error loss weighted by the class imbalance ratio (8:1). The model was trained with a batch size of 256 and the Adam optimizer on an Intel Core i7-5930K CPU at 3.50GHz with 48GB RAM.
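As a concrete reference, the following is a minimal PyTorch sketch of this architecture; the feature count, learning rate, and polynomial coefficients are illustrative assumptions rather than the authors' settings.

```python
# Minimal sketch of the network described above (one hidden layer of 32 units,
# one output unit, each followed by the approximated Swish activation). This is
# an illustrative reconstruction, not the authors' original code; the polynomial
# coefficients reuse the hypothetical values from the earlier sketch.
import torch
import torch.nn as nn

class PolyActivation(nn.Module):
    def forward(self, x):
        return (2.0**-3) * x * x + (2.0**-1) * x + (2.0**-2)  # hypothetical quantized polynomial

class ReadmissionNet(nn.Module):
    def __init__(self, n_features):
        super().__init__()
        self.hidden = nn.Linear(n_features, 32)
        self.out = nn.Linear(32, 1)
        self.act = PolyActivation()

    def forward(self, x):
        return self.act(self.out(self.act(self.hidden(x))))

model = ReadmissionNet(n_features=40)                       # feature count is a placeholder
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)   # Adam optimizer, batch size 256 per the text

# Class-weighted MSE (8:1 imbalance ratio), as described in the text.
def weighted_mse(pred, target, pos_weight=8.0):
    w = torch.where(target > 0.5, torch.full_like(target, pos_weight), torch.ones_like(target))
    return (w * (pred - target) ** 2).mean()
```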

4.2 Prediction performance

| DP noise injected | Standardization | Median L2-Norm | Accuracy | AUC | Recall |
|---|---|---|---|---|---|
| No | No | 50.6 | 0.802 | 0.659 | 0.313 |
| No | Yes | 2.34 | 0.610 | 0.677 | 0.642 |
| Yes | No | N/A | 0.765 | 0.615 | 0.356 |
| Yes | Yes | N/A | 0.588 | 0.638 | 0.662 |
Table 1: Effects of input standardization and DP training (with a small noise level) on different metrics. The median L2-norm is much smaller with standardization (L2-norm results are not reported for DP training because gradients are clipped). We find that overall DP does not hurt the AUC much and even improves the recall in our experiments.

We fix the privacy parameter δ by estimating 1/N, where N is the number of training rows, and set the clipping bound using the L2-norm estimate from Table 1. We report test accuracy, area under the Receiver Operating Characteristic curve (AUC), and recall of our model's predictions on the test set, with more emphasis on the last two metrics due to the imbalanced nature of the training data. AUC measures discrimination at different classification thresholds, while high recall is necessary because we consider the cost of misses (i.e., discharging patients when they are not ready) to be more serious than that of false alarms. In fact, we find high accuracy to often be correlated with poor performance, as it is trivial to reach roughly 90% accuracy by always predicting the majority class. From Table 1, standard training of our HE-compatible model (with polynomial activations) gives better performance (AUC) than the baselines obtained from conventional neural networks in chopra2017recurrent and bhuvan2016ehrdiabetic . Even with a large amount of noise injected (eps = 1.0 in Figure 2), our fully private network still yields an AUC of 0.63, higher than the aforementioned existing works on this dataset. The best performance is achieved by injecting a moderate level of noise.
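For clarity, the three metrics can be computed as sketched below; the labels and scores are toy placeholders, not results from the paper.

```python
# Sketch of the evaluation metrics discussed above (accuracy, AUC, recall),
# computed with scikit-learn on hypothetical predictions.
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score, recall_score

y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])       # roughly 8:1 class imbalance, as in the dataset
y_score = np.array([0.1, 0.2, 0.3, 0.2, 0.4, 0.1, 0.6, 0.3, 0.7, 0.4])
y_pred = (y_score >= 0.5).astype(int)

print("accuracy:", accuracy_score(y_true, y_pred))       # can look good by favoring the majority class
print("AUC:", roc_auc_score(y_true, y_score))            # threshold-free discrimination
print("recall:", recall_score(y_true, y_pred))           # fraction of true readmissions caught
```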

Figure 2: Left: AUC over time for different noise levels. We note that the privacy budget is spent faster for smaller noise (i.e., larger eps), leading to early termination. Right: AUC for different noise values on the test set. Both plots show that the best AUC is achieved with a moderate noise level, where we obtain AUC = 0.66 and recall = 0.60.

4.3 Comparison of Activation Functions

Before settling on the approximated Swish activation, we also experimented with other functions, both a low-degree polynomial (the square activation) and non-linear functions (ReLU and Sigmoid). Test results from training our network with each activation combination, without differentially private noise, are shown in Table 2, with confidence intervals obtained from 10 repeated trials. We observe that square activations produce more instability in performance and high variance across different runs. For the approximated Swish activation, the AUC is fairly stable and even surpasses the performance of the non-linear activations.

| Activations | Accuracy | AUC | Recall |
|---|---|---|---|
| Square | 0.650 ± 0.141 | 0.654 ± 0.014 | 0.552 ± 0.167 |
| ReLU-Sigmoid | 0.633 ± 0.125 | 0.668 ± 0.003 | 0.596 ± 0.163 |
| Approximated Swish | 0.618 ± 0.153 | 0.678 ± 0.003 | 0.645 ± 0.181 |
Table 2: Performance of activation functions. We see that the square activation produces the lowest AUC and recall. We also find that the approximated Swish function produces higher AUC and recall values than the nonlinear ReLU-Sigmoid activations.
| Activation | Wallclock Runtime (s) | Multiplicative Operations |
|---|---|---|
| Square | 21.700657 | 6780 |
| Approx. Swish (without quantization) | 21.900602 | 6912 |
| Approx. Swish (with quantization) | 21.797248 | 6912 |

Table 3: Computational costs of activation functions. The non-quantized Swish has a longer runtime due to the higher number of multiplicative operations, while the quantized Swish approximation reduces the runtime without reducing the number of operations.

Homomorphic encryption is computationally expensive, raising the runtime of a neural network inference from the order of milliseconds to roughly 20 seconds. We can see in Table 3 that the approximated Swish function without quantization adds runtime due to the extra multiplicative operations, but with quantization the increased cost becomes negligible. With larger networks that contain more activations and multiplicative operations, this time saving will be even more pronounced.

5 Conclusion

For deep learning to be widely adopted in the healthcare community, the privacy and security of patient data will have to be ensured at every step of deployment. We utilized differentially private learning and homomorphic encryption to protect privacy at both the training and inference stages, and demonstrated the deployment of our framework on a representative clinical prediction task. We also discussed several techniques to minimize the training instability and computational overhead incurred by those privacy measures. We hope this work will inspire future efforts to build machine learning systems that prioritize patient privacy by design.

References

  • [1] M. Abadi, A. Chu, I. Goodfellow, H. B. McMahan, I. Mironov, K. Talwar, and L. Zhang. Deep learning with differential privacy. pages 308–318, 2016.
  • [2] M. Abadi, Úlfar Erlingsson, I. Goodfellow, H. B. McMahan, N. Papernot, I. Mironov, K. Talwar, and L. Zhang. On the protection of private information in machine learning systems: Two recent approaches. In IEEE 30th Computer Security Foundations Symposium (CSF), pages 1–6, 2017.
  • [3] M. R. Asghar, T. Lee, M. M. Baig, E. Ullah, G. Russello, and G. Dobbie. A review of privacy and consent management in healthcare: A focus on emerging data sources. CoRR, abs/1711.00546, 2017.
  • [4] H. Bae, J. Jang, D. Jung, H. Jang, H. Ha, and S. Yoon. Security and Privacy Issues in Deep Learning. ArXiv e-prints, July 2018.
  • [5] J.-C. Bajard, J. Eynard, A. Hasan, and V. Zucca. A full rns variant of fv like somewhat homomorphic encryption schemes. In Selected Areas in Cryptography, 2016.
  • [6] M. S. Bhuvan, A. Kumar, A. Zafar, and V. Kishore. Identifying diabetic patients with high risk of readmission. CoRR, abs/1602.04257, 2016.
  • [7] J. W. Bos, K. Lauter, and M. Naehrig. Private predictive analysis on encrypted medical data. Journal of Biomedical Informatics, 50:234 – 243, 2014. Special Issue on Informatics Methods in Medical Privacy.
  • [8] Z. Brakerski and V. Vaikuntanathan. Efficient fully homomorphic encryption from (standard) lwe. Journal on Computing, 2014.
  • [9] N. Brisebarre, J.-M. Muller, and A. Tisserand. Computing machine-efficient polynomial approximations. Transactions on Mathematical Software, 2006.
  • [10] N. Carlini, C. Liu, J. Kos, Ú. Erlingsson, and D. Song. The secret sharer: Measuring unintended neural network memorization & extracting secrets. CoRR, abs/1802.08232, 2018.
  • [11] H. Chabanne, A. de Wargny, J. Milgram, C. Morel, and E. Prouff. Privacy-preserving classification on deep neural network. Cryptology ePrint Archive, 2017.
  • [12] H. Chen, K. Han, Z. Huang, A. Jalali, and K. Laine. Simple encrypted arithmetic library v2.3.0. Microsoft Research TechReport, 2017.
  • [13] C. Chopra, S. Sinha, S. Jaroli, A. Shukla, and S. Maheshwari. Recurrent neural networks with non-sequential data to predict hospital readmission of diabetic patients. In Proceedings of the 2017 International Conference on Computational Biology and Bioinformatics, pages 18–23. ACM, 2017.
  • [14] F. K. Dankar and K. El Emam. Practicing differential privacy in health care: A review. Trans. Data Privacy, 6(1):35–67, Apr. 2013.
  • [15] F. K. Dankar and K. E. Emam. The application of differential privacy to health data. In EDBT/ICDT Workshops, 2012.
  • [16] C. Dwork. Differential privacy: A survey of results. In International Conference on Theory and Applications of Models of Computation. Springer, 2008.
  • [17] J. Fan and F. Vercauteren. Somewhat practical fully homomorphic encryption. Cryptology ePrint Archive, 2012.
  • [18] M. Fredrikson, S. Jha, and T. Ristenpart. Model inversion attacks that exploit confidence information and basic countermeasures. In Proceedings of the 22Nd ACM SIGSAC Conference on Computer and Communications Security, CCS ’15, pages 1322–1333, New York, NY, USA, 2015. ACM.
  • [19] A. Gautier, Q. N. Nguyen, and M. Hein. Globally optimal training of generalized polynomial neural networks with nonlinear spectral methods. In NIPS, 2016.
  • [20] C. Gentry et al. Fully homomorphic encryption using ideal lattices. In STOC, 2009.
  • [21] R. Gilad-Bachrach, N. Dowlin, K. Laine, K. Lauter, M. Naehrig, and J. Wernsing. Cryptonets: Applying neural networks to encrypted data with high throughput and accuracy. In ICML, 2016.
  • [22] X. Glorot, A. Bordes, and Y. Bengio. Deep sparse rectifier neural networks. In AISTATS, 2011.
  • [23] V. Gulshan, L. Peng, M. Coram, M. C. Stumpe, D. Wu, A. Narayanaswamy, S. Venugopalan, K. Widner, T. Madams, J. Cuadros, et al. Development and validation of a deep learning algorithm for detection of diabetic retinopathy in retinal fundus photographs. Jama, 316(22):2402–2410, 2016.
  • [24] M. Gymrek, A. L. McGuire, D. Golan, E. Halperin, and Y. Erlich. Identifying Personal Genomes by Surname Inference. Science, 339:321, Jan. 2013.
  • [25] D. Harvey. Faster arithmetic for number-theoretic transforms. Journal of Symbolic Computation, 2014.
  • [26] E. Hesamifard, H. Takabi, and M. Ghasemi. Cryptodl: Deep neural networks over encrypted data. arXiv, 2017.
  • [27] J. Liu, Z. Zhang, and N. Razavian. Deep EHR: Chronic Disease Prediction Using Medical Notes. ArXiv e-prints, Aug. 2018.
  • [28] R. Livni, S. Shalev-Shwartz, and O. Shamir. On the computational efficiency of training neural networks. In NIPS, 2014.
  • [29] F. D. McSherry. Privacy integrated queries: An extensible platform for privacy-preserving data analysis. SIGMOD, 2009.
  • [30] A. Narayanan and V. Shmatikov. Robust de-anonymization of large sparse datasets. In Proceedings of the 2008 IEEE Symposium on Security and Privacy, SP ’08, pages 111–125, Washington, DC, USA, 2008. IEEE Computer Society.
  • [31] T. H. Nguyen, Y. Chevaleyre, E. Prifti, N. Sokolovska, and J. Zucker. Deep learning for metagenomic data: using 2d embeddings and convolutional neural networks. CoRR, abs/1712.00244, 2017.
  • [32] N. Papernot, S. Song, I. Mironov, A. Raghunathan, K. Talwar, and Ú. Erlingsson. Scalable private learning with pate. arXiv preprint arXiv:1802.08908, 2018.
  • [33] S. Shaked and L. Rokach. Publishing differentially private medical events data. In F. Buccafurri, A. Holzinger, P. Kieseberg, A. M. Tjoa, and E. Weippl, editors, Availability, Reliability, and Security in Information Systems, pages 219–235, Cham, 2016. Springer International Publishing.
  • [34] R. Shokri and V. Shmatikov. Privacy-preserving deep learning. In Proceedings of the 22nd ACM SIGSAC conference on computer and communications security, pages 1310–1321. ACM, 2015.
  • [35] R. Shokri, M. Stronati, and V. Shmatikov. Membership inference attacks against machine learning models. CoRR, abs/1610.05820, 2016.
  • [36] S. Song, K. Chaudhuri, and A. D. Sarwate. Stochastic gradient descent with differentially private updates. In Global Conference on Signal and Information Processing (GlobalSIP), 2013 IEEE, pages 245–248. IEEE, 2013.
  • [37] B. Strack, J. P. DeShazo, C. Gennings, J. L. Olmo, S. Ventura, K. J. Cios, and J. N. Clore. Impact of hba1c measurement on hospital readmission rates: analysis of 70,000 clinical database patient records. BioMed research international, 2014, 2014.
  • [38] S. Yeom, M. Fredrikson, and S. Jha. The unintended consequences of overfitting: Training data inference attacks. CoRR, abs/1709.01604, 2017.

Appendix

Appendix A Differentially Private Gradient Optimization

input : Examples {x_1, ..., x_N}, loss function L(θ) = (1/N) Σ_i L(θ, x_i). Parameters: learning rate η_t, noise scale σ, group size L, gradient norm bound C.
output : θ_T, and calculate the overall privacy cost (ε, δ) using a privacy accountant method
1 Initialize θ_0 randomly;
2 for t ∈ [T] do
3       Take a random sample L_t with sampling probability L/N;
4       Compute gradient;
5       For each i ∈ L_t, compute g_t(x_i) ← ∇_θ L(θ_t, x_i);
6       Clip gradient;
7       ḡ_t(x_i) ← g_t(x_i) / max(1, ||g_t(x_i)||_2 / C);
8       Add noise;
9       g̃_t ← (1/L) (Σ_i ḡ_t(x_i) + N(0, σ²C²I));
10      Descent;
11      θ_{t+1} ← θ_t − η_t g̃_t;
12 end
Algorithm 1 Differentially private SGD
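A minimal NumPy sketch of one update of Algorithm 1 is given below; the hyperparameters and toy gradients are illustrative assumptions, not the training configuration used in the paper.

```python
# Illustrative sketch of one DP-SGD update (Algorithm 1): per-example gradient
# clipping followed by Gaussian noise, then a plain gradient descent step.
import numpy as np

def dp_sgd_step(theta, per_example_grads, lr=0.1, clip_norm=1.0, noise_scale=1.1):
    clipped = []
    for g in per_example_grads:
        norm = np.linalg.norm(g)
        clipped.append(g / max(1.0, norm / clip_norm))        # clip to L2-norm bound C
    L = len(clipped)
    noise = np.random.normal(0.0, noise_scale * clip_norm, size=theta.shape)
    noisy_mean = (np.sum(clipped, axis=0) + noise) / L        # add N(0, sigma^2 C^2 I) noise, then average
    return theta - lr * noisy_mean                            # descent step

theta = np.zeros(3)
grads = [np.array([0.5, -2.0, 1.0]), np.array([3.0, 0.1, -0.4])]  # toy per-example gradients
theta = dp_sgd_step(theta, grads)
print(theta)
```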

Appendix B Leveled Homomorphic Encryption

The leveled homomorphic encryption scheme is a structure-preserving transformation between two algebraic structures, which can be leveraged by cryptosystems to allow arithmetic operations on encrypted data. Let R denote the polynomial ring Z[x]/(x^n + 1), and let R_q denote the same ring with coefficients reduced modulo q. We let a ← S denote uniformly random sampling of a from an arbitrary set S, and ⌊(t/q)·a⌉ denote a coefficient-wise division and rounding of the polynomial a with respect to the integer moduli t and q. Let [a]_q denote the reduction of the coefficients of the polynomial a modulo q, and let Δ denote ⌊q/t⌋.

Encryption Scheme. Bajard et al. [5] proposed an encryption scheme, FV-RNS, which is a residue number system (RNS) variant of the FV encryption scheme. In FV-RNS, plaintexts are elements of the polynomial ring R_t = Z_t[x]/(x^n + 1), where t is the plaintext modulus and n is the maximum degree of the polynomial, commonly selected to be a power of two (e.g., 2048, 4096, or 8192). The plaintext elements are mapped to pairs of polynomials in R_q by the encryption scheme, with q as the ciphertext coefficient modulus. For a logarithm base w, let ℓ = ⌊log_w(q)⌋, so that polynomials in R_q have ℓ + 1 terms in their base-w decomposition; this decomposition is used for relinearization.

Let χ denote the truncated discrete Gaussian distribution. The secret key is generated as s ← R_2, a polynomial with coefficients in {0, 1}. The public key is generated by sampling a ← R_q and e ← χ and constructing pk = (p_0, p_1) = ([−(a·s + e)]_q, a). The evaluation keys are generated by sampling a_i ← R_q and e_i ← χ and constructing evk_i = ([−(a_i·s + e_i) + w^i·s²]_q, a_i) for each i ∈ {0, ..., ℓ}.

A plaintext m ∈ R_t is encrypted by sampling u ← R_2 (with coefficients in {0, 1}) and e_1, e_2 ← χ, and letting ct = (c_0, c_1) = ([Δ·m + p_0·u + e_1]_q, [p_1·u + e_2]_q). A ciphertext (c_0, c_1) is decrypted as m = [⌊(t/q)·[c_0 + c_1·s]_q⌉]_t.

Arithmetic. The addition of two ciphertexts ct = (c_0, c_1) and ct′ = (c_0′, c_1′) is ct_add = ([c_0 + c_0′]_q, [c_1 + c_1′]_q). The multiplication of two ciphertexts ct and ct′ occurs by constructing

d_0 = [⌊(t/q)·(c_0·c_0′)⌉]_q,  d_1 = [⌊(t/q)·(c_0·c_1′ + c_1·c_0′)⌉]_q,  d_2 = [⌊(t/q)·(c_1·c_1′)⌉]_q.

We express d_2 in base w as d_2 = Σ_{i=0}^{ℓ} d_2^{(i)} w^i. We then let c_0″ = [d_0 + Σ_{i=0}^{ℓ} evk_i[0]·d_2^{(i)}]_q and c_1″ = [d_1 + Σ_{i=0}^{ℓ} evk_i[1]·d_2^{(i)}]_q, which forms the product ciphertext (c_0″, c_1″).

The addition of a ciphertext (c_0, c_1) and a plaintext m′ is the ciphertext ([c_0 + Δ·m′]_q, c_1). The multiplication of a ciphertext (c_0, c_1) and a plaintext m′ is the ciphertext ([c_0·m′]_q, [c_1·m′]_q).

The advantage of the residue number system variant is that the coefficient modulus can be decomposed into several small moduli to avoid multiple-precision operations on the polynomial coefficients in the homomorphic operations, which improves the efficiency of evaluation.
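A toy sketch of the RNS idea is given below; the moduli are tiny placeholder values, and the example illustrates only the general Chinese Remainder Theorem mechanism, not the FV-RNS implementation in SEAL.

```python
# Sketch of the residue number system (RNS) idea: arithmetic modulo a large q is
# replaced by independent arithmetic modulo several small coprime moduli, with the
# result recovered by the Chinese Remainder Theorem.
from math import prod

moduli = [97, 101, 103]                     # toy coprime moduli; q = 97 * 101 * 103
q = prod(moduli)

def to_rns(x):
    return [x % m for m in moduli]          # one residue per small modulus

def from_rns(residues):
    x = 0
    for r, m in zip(residues, moduli):
        M = q // m
        x += r * M * pow(M, -1, m)          # CRT reconstruction
    return x % q

a, b = 123456, 654321
prod_rns = [(ra * rb) % m for ra, rb, m in zip(to_rns(a), to_rns(b), moduli)]  # only word-sized multiplies
assert from_rns(prod_rns) == (a * b) % q
print(from_rns(prod_rns))
```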

Integer Encoder. To encode the real numbers involved in the computation, we choose a fixed precision for the values (15 bits) and scale each value by the corresponding power of 2 to get an integer for use with the encoder described below. After decryption, we can divide by the accumulated scaling factor to obtain a real value for the prediction. The encoder is a base-2 integer encoder [12]. For a given integer y, consider the binary expansion y = Σ_i y_i·2^i. The coefficients a_i of the corresponding polynomial Σ_i a_i·x^i in the plaintext ring are a_i = 1 if y_i = 1, and a_i = 0 otherwise.
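A small Python sketch of this encoding step is given below; it follows the description above (15-bit fixed-point scaling, base-2 coefficients) but is an illustrative reconstruction rather than SEAL's encoder.

```python
# Sketch of the fixed-point scaling plus base-2 integer encoding described above.
PRECISION_BITS = 15

def encode_real(value):
    """Scale a real value to an integer, then list its binary polynomial coefficients."""
    scaled = int(round(value * (1 << PRECISION_BITS)))     # fixed-point scaling by 2^15
    coeffs, y = [], abs(scaled)
    while y:
        coeffs.append(y & 1)                               # coefficient a_i = i-th bit of y
        y >>= 1
    return scaled, coeffs                                  # polynomial sum_i a_i * x^i

def decode_scaled(scaled, accumulated_scale=1 << PRECISION_BITS):
    return scaled / accumulated_scale                      # divide out the scaling factor

scaled, coeffs = encode_real(0.625)
print(scaled, coeffs, decode_scaled(scaled))
```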

Polynomials. Let n ∈ N and let f denote the activation function. Our task is to approximate f with a polynomial p(x) = Σ_{i=0}^{n} a_i·x^i, subject to the constraint that each coefficient a_i is a power of 2. Define P_n as the set of all polynomials of degree less than or equal to n such that all coefficients are powers of 2; that is, P_n = { Σ_{i=0}^{n} a_i·x^i : a_i ∈ {0} ∪ {±2^k, k ∈ Z} }. Let p* be the minimax approximation to f on some interval [a, b]. Let p̂ be the same as p*, but with all coefficients rounded to the nearest 2^k, k ∈ Z. Note that p̂ ∈ P_n.

Maximum Error & Minimax The maximum difference (i.e., error) between two functions and is . This provides a strong bound on the optimal polynomial approximation error where We state minimax problem as follows. For a given activation function , we seek to find the best polynomial such that,

(1)

subject to the constraint,

(2)

Finite Number of Solutions. Let m = ||f − p̂||_∞, and consider candidate polynomials whose error does not exceed m. Requiring |p(x_j) − f(x_j)| ≤ m at a set of sample points x_j in [a, b] lets us construct a bounded polyhedron, where each tuple (a_0, a_1, ..., a_n) represents a polynomial p(x) = Σ_i a_i·x^i, and a_i represents the degree-i coefficient. [9] show that the number of polynomials satisfying Equation 2 is finite if the polynomials are contained in this polyhedron. They also proposed an efficient scanning method to find the optimal polynomial approximation. Equipped with our new-found approximation, we can evaluate its effectiveness as an activation function in both the non-encrypted and encrypted domains.
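The overall procedure can be sketched as follows; a least-squares fit on a grid stands in for the true minimax approximation and the polyhedron scan of [9], and the degree, interval, and resulting coefficients are illustrative assumptions.

```python
# Illustrative sketch of the approximation pipeline described above: fit a
# low-degree polynomial to Swish on an interval, then round each coefficient to
# the nearest power of two, and measure the resulting maximum error on a grid.
import numpy as np

def swish(x):
    return x / (1.0 + np.exp(-x))

def round_to_power_of_two(c):
    if c == 0.0:
        return 0.0
    k = np.round(np.log2(abs(c)))                    # nearest exponent
    return np.sign(c) * 2.0**k

xs = np.linspace(-4.0, 4.0, 2001)                    # illustrative interval [a, b]
coeffs = np.polyfit(xs, swish(xs), deg=2)            # stand-in for the minimax approximation p*
quantized = np.array([round_to_power_of_two(c) for c in coeffs])  # p-hat with power-of-two coefficients

max_err = np.max(np.abs(np.polyval(quantized, xs) - swish(xs)))   # ||f - p_hat||_inf on the grid
print("fitted:", coeffs, "quantized:", quantized, "max error:", max_err)
```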