Faster CryptoNets: Leveraging Sparsity for Real-World Encrypted Inference

11/25/2018 ∙ by Edward Chou, et al. ∙ Stanford University

Homomorphic encryption enables arbitrary computation over data while it remains encrypted. This privacy-preserving feature is attractive for machine learning, but requires significant computational time due to the large overhead of the encryption scheme. We present Faster CryptoNets, a method for efficient encrypted inference using neural networks. We develop a pruning and quantization approach that leverages sparse representations in the underlying cryptosystem to accelerate inference. We derive an optimal approximation for popular activation functions that achieves maximally-sparse encodings and minimizes approximation error. We also show how privacy-safe training techniques can be used to reduce the overhead of encrypted inference for real-world datasets by leveraging transfer learning and differential privacy. Our experiments show that our method maintains competitive accuracy and achieves a significant speedup over previous methods. This work increases the viability of deep learning systems that use homomorphic encryption to protect user privacy.


I Introduction

As cloud-based machine learning services become more widespread, there is a strong need to ensure the confidentiality of sensitive healthcare records, financial data, and other information that enters third-party pipelines. Traditional machine learning algorithms require access to raw data, which opens up potential security and privacy risks. For some fields such as healthcare, regulations may preclude the use of external prediction services if the technology cannot provide the necessary privacy guarantees.

In this work, we address the task of encrypted inference for secure machine learning services. We make the assumption that the third-party provider already has a trained model, as is common in “machine learning as a service” paradigms. Using cryptographic techniques, an organization such as a research hospital or fraud detection company will be able to offer prediction services to users while ensuring security guarantees for all parties involved. We follow the procedure set by previous work [29, 68] and employ homomorphic encryption (HE) to convert a trained machine learning model into a HE-enabled model.

Homomorphic encryption [56] allows a machine learning model to perform calculations over encrypted data. By design, the output prediction is also encrypted, which prevents the input or output from leaking information to the model’s host. As shown in Figure 1, the model does not decrypt the data, nor is the private key needed [12].

Several challenges prevent widespread adoption of encrypted machine learning. A major bottleneck is computational complexity. Inference on plain networks is performed on the order of milliseconds, while encrypted networks require minutes or hours per example [29, 38]. Also, the reduced arithmetic set of HE prevents the use of modern activation functions [15], necessitating the use of simpler, lower-performance functions.

Fig. 1: Encrypted machine learning as a service paradigm. Dashed lines indicate data transfer. The end-user (Alice) encrypts her sensitive data and sends it to a third-party host (Eve). Since Alice owns the private key, Eve cannot decrypt the input nor output prediction. Eve produces an encrypted prediction which is returned to Alice. Privacy is preserved in the entire pipeline for both inputs and outputs.

I-A Contributions

We propose Faster CryptoNets – a method for encrypted inference on the order of seconds. This is a significant improvement over existing state-of-the-art, which performs inference on the order of minutes. Our contributions accelerate the homomorphic evaluation of deep learning models on encrypted data using sparse representations throughout the neural network. Additionally, we are able to efficiently approximate modern activation functions. Finally, we show how this technique can be combined with private training techniques in a plausible real-world scenario.

By intelligently pruning the network parameters, we can avoid many multiplication operations – a major contributor to computational complexity. We can progressively quantize the remaining network parameters such that the plaintext encodings achieve maximum sparsity. Also, given that the activation function is the single most expensive operation of the network, we derive an optimal, quantized polynomial approximation to the activation function also with maximally-sparse encodings. We empirically show a significant improvement in the runtime of the network on MNIST. We perform additional experiments on larger datasets to demonstrate the viability and performance gain on practical tasks. We use a feature-extraction based framework to reduce the number of layers requiring encrypted computation, while using differentially private training to achieve competitive accuracy on real-world datasets.

II Related Work

Privacy-preserving machine learning models attempt to address computation and statistical modeling of private data [4]. Privacy is preserved when two conditions are met: (i) the end-user learns nothing about the model and (ii) the model learns nothing from the data [13]. Differential privacy, multi-party computation (MPC), and homomorphic encryption are different methods to preserve privacy.

Differential privacy allows statistics to be computed over a dataset without revealing information about individual records [22, 16]. A common method is to apply noise to individual examples to obfuscate statistical differences that might be distinguishable [52]. However, differential privacy is better suited for the training phase. During test-time, adding noise to a single example may change the prediction.

Secure multi-party computation enables multiple parties to jointly compute a function over their inputs while keeping their inputs private. This has been explored using Garbled Circuits [69] in the works of [57, 42] and [50]. These methods often involve a high communication complexity with significant bandwidth costs.

Fully homomorphic encryption (FHE) was proposed by [26] and allows anyone to compute over encrypted data without decrypting it [51]. A weaker version of FHE, termed leveled homomorphic encryption (LHE), permits a subset of arithmetic operations on a depth-bounded arithmetic circuit [12]. While HE has been explored for machine learning applications, many works focus on simpler models such as linear [34], logistic [18], and ridge regression [28]. CryptoNets [29] was one of the first works to implement HE in a neural network setting. More recently, [15] and [38] extended this to deeper network architectures and developed additional polynomial approximations to the activation function that leveraged batch normalization for stability.

Other works have explored the broader use of polynomial activation functions. [54] and [47] used a polynomial function in the non-encrypted domain to some success. The original theory dates back to [40] who argues that as long as the activation function is arbitrarily bounded and non-constant, the neural network is a universal approximator. Some prior work even suggests that neural networks equipped with polynomial functions have the same representational power as their non-polynomial counterparts [25, 45]. In §IV and §V, we explore these ideas in greater detail.

Recent works have proposed techniques that accelerate neural network inference on encrypted data. Sanyal et al. [59] use sparsification techniques via binarized neural networks, which achieve a speedup of around 30x in wall-clock time on MNIST, similar to our technique. Florian et al. [11] opt for an approach that leverages scale invariance to allow unrestricted depth of neural networks. The technique we propose is distinct from these approaches due to its use of the encoding scheme to accelerate multiplicative operations, in contrast to the previous approaches, which bypass expensive operations using the sign activation function. Our approach is advantageous in that it is more compatible with common neural network components; sign activation functions are known to cause difficulty with convergence, and the scale-invariant approach of [11] precludes the use of convolutional layers. We do not present detailed comparisons to these works in our analysis due to these fundamental architectural differences, and opt for a direct comparison to CryptoNets to demonstrate clearly which layers and operations our speedups derive from.

III Threat Model

Machine Learning as a Service (MLaaS) [8] is a framework where cloud providers offer machine learning training or inference hosted on the cloud. In our scenario, we consider an MLaaS inference pipeline, where users send data to a remote server and receive predictions produced by machine learning models. The machine learning model is pre-trained on a proprietary dataset.

A universal threat in multi-party situations is the inherent risk of data transmission, either by interception or side-channel attacks. This threat can be mitigated to a large extent by using strong cryptographic and signature protocols to protect the data in transit. However, a concern that is much harder to alleviate involves the threat of the cloud host collecting and utilizing the transmitted data without authorization [8]. In a naive scheme, a user sends encrypted data to the cloud, but also has to provide a key to the server so that it can decrypt the data and compute an output with a machine learning algorithm before sending the encrypted prediction back to the user. The cloud host must have access to the plain data, and it is hard to guarantee or prove to the user that the data is not kept on the server, where it can either be sold to third parties or be stolen by attackers who gain access to the server.

Homomorphic encryption provides a solution to both problems. By design, the transmitted data is protected using a strong encryption scheme. It also enables “oblivious inference”, where a cloud host operates on data that it is oblivious to. If the service provider is only allowed to compute on the encrypted data and produce an encrypted output without ever decrypting the data at any step, it will never have access to the plain data, guaranteeing data privacy from the cloud provider.

IV Preliminaries

A homomorphism is a structure-preserving transformation between two algebraic structures, which can be leveraged by cryptosystems to allow for arithmetic operations on encrypted data. Let $G$ be a cyclic group of order $q$ with generator $g$. Let $h \in G$ be randomly sampled as the public key. Consider the ElGamal encryption scheme [23], which uses a map $E: G \to G \times G$ such that $E(m) = (g^r, m \cdot h^r)$ for random $r \in \{0, \ldots, q-1\}$. The map preserves the multiplicative structure of the integers such that $E(m_1) \cdot E(m_2) = E(m_1 \cdot m_2)$, where $\cdot$ is the multiplication operation in $G$.

The leveled homomorphic encryption scheme that we present below has a more complex algebraic structure, and supports both additive and multiplicative homomorphisms, but this example can serve as a basis for understanding the role of homomorphic encryption in our network design.
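The multiplicative homomorphism of ElGamal can be checked numerically. The following is a minimal sketch, not a secure implementation: the group (integers modulo the prime 23 with generator 5) and all function names are assumptions chosen only for illustration.

```python
import random

# Toy parameters (assumptions for illustration): the multiplicative group
# modulo the prime 23, for which 5 is a generator. Far too small to be secure.
P = 23
G = 5

def keygen(rng):
    """Return (secret exponent x, public key h = g^x mod p)."""
    x = rng.randrange(1, P - 1)
    return x, pow(G, x, P)

def encrypt(h, m, rng):
    """ElGamal encryption of m in {1, ..., p-1}: (g^r, m * h^r)."""
    r = rng.randrange(1, P - 1)
    return pow(G, r, P), (m * pow(h, r, P)) % P

def decrypt(x, ct):
    """Recover m = c2 * c1^(-x) mod p, using c1^(p-1) = 1."""
    c1, c2 = ct
    return (c2 * pow(c1, P - 1 - x, P)) % P

def multiply(ct_a, ct_b):
    """Component-wise product of two ciphertexts."""
    return (ct_a[0] * ct_b[0]) % P, (ct_a[1] * ct_b[1]) % P

rng = random.Random(0)
x, h = keygen(rng)
ct = multiply(encrypt(h, 3, rng), encrypt(h, 5, rng))
print(decrypt(x, ct))  # 15, the product of the two plaintexts
```

The host only ever calls `multiply`; it never sees `x`, which is the property the encrypted inference pipeline relies on.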

IV-A Notation

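Let $R_k$ denote the polynomial ring $\mathbb{Z}_k[x]/(x^n + 1)$. We let $a \overset{\$}{\leftarrow} S$ denote uniformly random sampling of $a$ from an arbitrary set $S$, and $\lfloor \frac{t}{q} p \rceil$ denote a coefficient-wise division and rounding of the polynomial $p$ with respect to integer moduli $t$ and $q$. Let $[p]_q$ denote the reduction of the coefficients of the polynomial $p$ modulo $q$, and let $\Delta$ denote $\lfloor q/t \rfloor$.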

IV-B Encryption Scheme

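Bajard et al. [9] proposed an encryption scheme, FV-RNS, which is a residue number system (RNS) variant of the FV encryption scheme. In FV-RNS, plaintexts are elements of the polynomial ring $R_t = \mathbb{Z}_t[x]/(x^n + 1)$, where $t$ is the plaintext modulus and $n$ is the maximum degree of the polynomial, which is commonly selected to be a power of 2. The plaintext elements are mapped to multiple ciphertexts in $R_q$ in the encryption scheme, with $q$ as the ciphertext coefficient modulus. For any logarithm base $w$, let $\ell + 1 = \lfloor \log_w q \rfloor + 1$ be the number of terms in the base-$w$ decomposition of polynomials in $R_q$ that is used for relinearization.

Let $\chi$ denote the truncated discrete Gaussian distribution. The secret key is generated as a polynomial $s$ with coefficients in $\{-1, 0, 1\}$. The public key is generated by sampling $a \overset{\$}{\leftarrow} R_q$ and $e \leftarrow \chi$ and constructing $pk = (p_0, p_1) = ([-(as + e)]_q, a)$. The evaluation keys are generated by sampling $a_i \overset{\$}{\leftarrow} R_q$ and $e_i \leftarrow \chi$ and constructing $evk_i = ([-(a_i s + e_i) + w^i s^2]_q, a_i)$ for each $i \in \{0, \ldots, \ell\}$.

A plaintext $m \in R_t$ is encrypted by sampling $u$ with coefficients in $\{-1, 0, 1\}$ and $e_1, e_2 \leftarrow \chi$, and letting $ct = ([\Delta m + p_0 u + e_1]_q, [p_1 u + e_2]_q)$, where $\Delta = \lfloor q/t \rfloor$. A ciphertext $ct = (c_0, c_1)$ is decrypted as $m = [\lfloor \frac{t}{q} [c_0 + c_1 s]_q \rceil]_t$.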

IV-C Arithmetic

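The addition of two ciphertexts $ct_0 = (c_0, c_1)$ and $ct_1 = (d_0, d_1)$ is $([c_0 + d_0]_q, [c_1 + d_1]_q)$. The multiplication of two ciphertexts $ct_0$ and $ct_1$ occurs by constructing

$$\tilde{c}_0 = \left[\left\lfloor \tfrac{t}{q}\, c_0 d_0 \right\rceil\right]_q, \quad \tilde{c}_1 = \left[\left\lfloor \tfrac{t}{q}\, (c_0 d_1 + c_1 d_0) \right\rceil\right]_q, \quad \tilde{c}_2 = \left[\left\lfloor \tfrac{t}{q}\, c_1 d_1 \right\rceil\right]_q.$$

We express $\tilde{c}_2$ in base $w$ as $\tilde{c}_2 = \sum_{i=0}^{\ell} \tilde{c}_2^{(i)} w^i$. We then let $c_0' = [\tilde{c}_0 + \sum_{i=0}^{\ell} evk_i[0]\, \tilde{c}_2^{(i)}]_q$ and $c_1' = [\tilde{c}_1 + \sum_{i=0}^{\ell} evk_i[1]\, \tilde{c}_2^{(i)}]_q$, which forms the product ciphertext $(c_0', c_1')$.

The addition of ciphertext $ct = (c_0, c_1)$ and plaintext $m$ is the ciphertext $([c_0 + \Delta m]_q, c_1)$, where $\Delta = \lfloor q/t \rfloor$. The multiplication of ciphertext $ct$ and plaintext $m$ is the ciphertext $([c_0 m]_q, [c_1 m]_q)$.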

The advantage of the residue number system variant is that the coefficient modulus can be decomposed into several small moduli to avoid multiple-precision operations on the polynomial coefficients in the homomorphic operations, which improves the efficiency of evaluation.
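To make the key generation, encryption, decryption, and plaintext arithmetic concrete, the following is a toy sketch of a textbook FV-style scheme. It is not the FV-RNS implementation used in this work: the parameters are tiny and insecure, ternary noise stands in for the truncated discrete Gaussian, there is no RNS decomposition, and ciphertext-ciphertext multiplication with relinearization is omitted. All names and parameter values are assumptions made for illustration.

```python
import random

# Toy parameters (assumptions, far too small to be secure).
N = 8           # ring degree: polynomials live in Z[x]/(x^N + 1)
Q = 1 << 20     # ciphertext coefficient modulus q
T = 16          # plaintext modulus t
DELTA = Q // T  # scaling factor floor(q / t)

def poly_mul(a, b, mod):
    """Schoolbook product in Z_mod[x]/(x^N + 1); x^N wraps around to -1."""
    out = [0] * N
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            k = i + j
            if k < N:
                out[k] += ai * bj
            else:
                out[k - N] -= ai * bj
    return [c % mod for c in out]

def poly_add(a, b, mod):
    return [(x + y) % mod for x, y in zip(a, b)]

def small(rng):
    """Ternary polynomial; stands in for both key and error sampling."""
    return [rng.choice((-1, 0, 1)) for _ in range(N)]

def keygen(rng):
    s = small(rng)
    a = [rng.randrange(Q) for _ in range(N)]
    e = small(rng)
    p0 = [(-c) % Q for c in poly_add(poly_mul(a, s, Q), e, Q)]
    return s, (p0, a)

def encrypt(pk, m, rng):
    p0, p1 = pk
    u, e1, e2 = small(rng), small(rng), small(rng)
    scaled = [(DELTA * c) % Q for c in m]
    c0 = poly_add(poly_add(scaled, poly_mul(p0, u, Q), Q), e1, Q)
    c1 = poly_add(poly_mul(p1, u, Q), e2, Q)
    return c0, c1

def decrypt(s, ct):
    c0, c1 = ct
    v = poly_add(c0, poly_mul(c1, s, Q), Q)
    # Round t * v / q to the nearest integer, then reduce modulo t.
    return [((T * c + Q // 2) // Q) % T for c in v]

def add(ct_a, ct_b):
    return poly_add(ct_a[0], ct_b[0], Q), poly_add(ct_a[1], ct_b[1], Q)

def mul_plain(ct, m):
    return poly_mul(ct[0], m, Q), poly_mul(ct[1], m, Q)

rng = random.Random(1)
sk, pk = keygen(rng)
m1 = [3, 0, 1, 0, 0, 0, 0, 0]
m2 = [5, 2, 0, 0, 0, 0, 0, 0]
x_squared = [0, 0, 1, 0, 0, 0, 0, 0]
print(decrypt(sk, add(encrypt(pk, m1, rng), encrypt(pk, m2, rng))))
print(decrypt(sk, mul_plain(encrypt(pk, m1, rng), x_squared)))
```

With these parameters the noise stays far below $\Delta/2$, so decryption is exact for the operations shown; a real deployment would choose $n$, $q$, and $t$ from a security standard instead.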

IV-D Integer Encoder

To encode real numbers involved in the computation, we choose a fixed precision for the values (15 bits) and scale each value by the corresponding power of 2 to get an integer for use with the encoder described below. After decryption, we can divide by the accumulated scaling factor to obtain a real value for the prediction. The encoder consists of a base-2 integer encoder [17]. For a given integer $a$, consider the binary expansion $|a| = \sum_i a_i 2^i$. The coefficients of the polynomial in the plaintext ring are $a_i$ if $a \geq 0$, otherwise $-a_i$.
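The encoding can be sketched in a few lines. This is an illustrative sketch rather than the SEAL encoder itself: the function names are hypothetical, and negative coefficients are kept as signed integers instead of being reduced modulo the plaintext modulus.

```python
FRACTIONAL_BITS = 15  # fixed precision named in the text

def to_fixed(value):
    """Scale a real value to an integer with 15 fractional bits."""
    return round(value * (1 << FRACTIONAL_BITS))

def encode(a):
    """Coefficients (lowest degree first): the bits of |a|, negated when a < 0."""
    sign = -1 if a < 0 else 1
    mag = abs(a)
    coeffs = []
    while mag:
        coeffs.append(sign * (mag & 1))
        mag >>= 1
    return coeffs or [0]

def decode(coeffs):
    """Evaluate the polynomial at x = 2."""
    return sum(c * (1 << i) for i, c in enumerate(coeffs))

print(encode(-13))  # [-1, 0, -1, -1]
print(encode(8))    # [0, 0, 0, 1]: a power of 2 encodes to a single monomial
print(decode(encode(to_fixed(0.75))) / (1 << FRACTIONAL_BITS))  # 0.75
```

The second call shows why power-of-2 weights matter later: their encodings have exactly one nonzero coefficient.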

V Method

V-A Sparse Polynomial Multiplication

The convolutional and fully connected layers of a neural network require a substantial number of multiplications involving both the ciphertext inputs and the plaintext parameters of the model. Each operation involves computing the product of two polynomials with up to $n$ nonzero coefficients. While a brute-force implementation would require $O(n^2)$ time to complete, homomorphic encryption methods are able to accomplish this in $O(n \log n)$ when certain conditions are met. Assuming that the coefficient modulus $q$ is chosen such that $q - 1$ is divisible by $2n$, we can invoke the Number Theoretic Transform to achieve $O(n \log n)$ [36].

Our contributions leverage the following insight: a substantial improvement in efficiency occurs when the plaintext multiplier is $m = 2^k$ for some $k$. The polynomial that encodes this integer is $x^k$, a monomial multiplier. For such parameters, sparse polynomial multiplication [5] has been shown to use $O(n)$ coefficient multiplications and modular reductions (see Algorithm 1).

  Input: ciphertext plaintext
  for  to  do
     Initialize .
     if  then
        
     else
        
     end if
  end for
  Output: ciphertext
Algorithm 1 Sparse Plaintext-Ciphertext Multiplication
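The idea behind Algorithm 1 can be illustrated with a short sketch that skips zero plaintext coefficients and compares the result against a dense schoolbook product. This is an assumption-laden illustration rather than a transcription of Algorithm 1: the ring degree, modulus, and function names are hypothetical, and the ciphertext is modeled as a tuple of coefficient lists.

```python
N = 8        # ring degree (toy value, an assumption)
Q = 1 << 20  # coefficient modulus (toy value, an assumption)

def dense_mul(c, p):
    """Reference O(n^2) product in Z_q[x]/(x^N + 1)."""
    out = [0] * N
    for i, ci in enumerate(c):
        for j, pj in enumerate(p):
            k = i + j
            if k < N:
                out[k] += ci * pj
            else:
                out[k - N] -= ci * pj  # x^N wraps around to -1
    return [v % Q for v in out]

def sparse_mul(c, p):
    """Skip zero plaintext coefficients; each nonzero one costs O(n)."""
    out = [0] * N
    for j, pj in enumerate(p):
        if pj == 0:
            continue
        for i, ci in enumerate(c):
            k = i + j
            if k < N:
                out[k] = (out[k] + ci * pj) % Q
            else:
                out[k - N] = (out[k - N] - ci * pj) % Q
    return out

def mul_ciphertext(ct, p):
    """Apply the sparse product to every polynomial of the ciphertext."""
    return tuple(sparse_mul(c, p) for c in ct)

c = [1, 2, 3, 4, 5, 6, 7, 8]
x_cubed = [0, 0, 0, 1, 0, 0, 0, 0]  # encoding of the weight 2^3
print(sparse_mul(c, x_cubed) == dense_mul(c, x_cubed))  # True
```

For a monomial multiplier the outer loop body runs once, so the whole product reduces to a signed cyclic shift of the ciphertext coefficients.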

V-B Network Pruning and Quantization

The parameters of a neural network can be iteratively removed and clustered without affecting accuracy. [35] developed a compression method that leverages these techniques. Since then, new pruning and quantization techniques have been proposed [46]. We leverage these techniques to reduce the number of weights that contribute to the multiplication count, and convert the weights to powers of 2, which have sparse polynomial representations that reduce the cost of each multiplication. Together, these lead to significant reductions in inference time.

We first train a pruned version of the network with Dynamic Network Surgery (DNS) [33] that incorporates connection splicing. The remaining network parameters are quantized to powers of 2 following the incremental network quantization (INQ) procedure proposed by [71]. The INQ method consists of an iterative quantization strategy to preserve the original inference accuracy.

For each layer , the layer’s parameters have a corresponding binary pruning mask . The elements of the binary pruning mask get updated during gradient descent according to a discriminative measure of parameter importance , typically incorporating a magnitude-based measure such as . We define and , which will help bound our quantized values for each layer:

where is used to restrict the set of powers for our desired bitwidth. We use . Note that and

is the set of possible quantized values for the parameters of layer in the network. We define a monotonically increasing weight partition schedule using the discriminative measure to progressively quantize the weights. For example, one can quantize 50% of the weights, then 75%, then 87.5%, then 100%, retraining the other non-quantized weights at each step of the quantization procedure.
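As a rough illustration of quantizing to powers of 2 under a partition schedule, the sketch below snaps a chosen fraction of the weights to the nearest value in $\{0, \pm 2^{n_2}, \ldots, \pm 2^{n_1}\}$. It is a simplification under stated assumptions: the exponent bounds are passed in directly, weight magnitude is used as the importance measure, and the retraining of the remaining weights between steps is omitted. The function names are hypothetical.

```python
def quantize_pow2(w, n1, n2):
    """Map w to the nearest value in {0, +/-2^n2, ..., +/-2^n1}."""
    candidates = [0.0] + [2.0 ** e for e in range(n2, n1 + 1)]
    mag = min(candidates, key=lambda c: abs(abs(w) - c))
    return -mag if (w < 0 and mag) else mag

def incremental_quantize(weights, n1, n2, fraction):
    """Quantize the largest-magnitude `fraction` of weights; leave the rest."""
    order = sorted(range(len(weights)), key=lambda i: -abs(weights[i]))
    chosen = set(order[: int(round(fraction * len(weights)))])
    return [quantize_pow2(w, n1, n2) if i in chosen else w
            for i, w in enumerate(weights)]

weights = [0.9, -0.3, 0.05, 0.6]
for fraction in (0.5, 0.75, 1.0):
    # In the full procedure, the non-quantized weights would be retrained here.
    weights = incremental_quantize(weights, n1=0, n2=-3, fraction=fraction)
print(weights)
```

Every nonzero weight that comes out of the final step is a signed power of 2, so its encoding is a single monomial and qualifies for the sparse multiplication described in §V-A.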

V-C Approximating the Activation Function

Using our pruning and quantization scheme from §V-B, our next contribution lies in finding the optimal polynomial approximation for any activation function given the constraint that the coefficients must be a power of 2. The activation function of a neural network is critical for convergence [30] and has been thoroughly explored in literature [55]. With the goal of encrypted network inference, we must find an approximation which balances approximation error with practical usability. Inspired by [14], we find the best polynomial approximation.

Polynomials. Let and let denote the activation function. Our task is to approximate with a polynomial where subject to the constraint that each coefficient is a power of 2. Define as the set of all polynomials of degree less than or equal to , such that all coefficients are base-2. That is, . Let be the minimax approximation to on some interval . Let be the same as , but with all coefficients rounded to the nearest where . Note, .

Maximum Error & Minimax. The maximum difference (i.e., error) between two functions and is . This provides a strong bound on the optimal polynomial approximation error where We state minimax problem as follows. For a given activation function , we seek to find the best polynomial such that,

(1)

subject to the constraint,

(2)

Finite Number of Solutions. Let , and . For , let such that if . We can construct a bounded polyhedron,

where each tuple represents any polynomial , and where represents the degree coefficient. [14] show that the number of polynomials satisfying Equation 2 is finite if the polynomials are contained in . They also proposed an efficient scanning method to find the optimal polynomial approximation . Equipped with our new-found approximation , we can evaluate the effectiveness of as an activation function in both non-encrypted and encrypted domains.
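The constrained minimax problem can be illustrated with a brute-force search, which is a stand-in for, not an implementation of, the scanning method of [14]. The sketch assumes Swish in the form $x \cdot \mathrm{sigmoid}(x)$, a quadratic approximation, the interval $[-3, 3]$ sampled on a grid, and exponents restricted to $[-6, 1]$; none of these values come from the text.

```python
import itertools
import math

def swish(x):
    return x / (1.0 + math.exp(-x))

# Assumed search space: coefficients are 0 or +/-2^k for k in [-6, 1].
COEFFS = [0.0] + [s * 2.0 ** k for k in range(-6, 2) for s in (1, -1)]
# Assumed interval [-3, 3], sampled on a uniform grid of 121 points.
GRID = [-3.0 + 0.05 * i for i in range(121)]
TARGET = [swish(x) for x in GRID]

def max_error(coeffs):
    """Largest absolute gap between a2*x^2 + a1*x + a0 and Swish on the grid."""
    a2, a1, a0 = coeffs
    return max(abs((a2 * x + a1) * x + a0 - y) for x, y in zip(GRID, TARGET))

def best_quantized_quadratic():
    """Exhaustively pick the power-of-2 coefficients with the smallest max error."""
    return min(itertools.product(COEFFS, repeat=3), key=max_error)

best = best_quantized_quadratic()
print(best, max_error(best))
```

Exhaustive search is only feasible here because the degree and exponent range are tiny; the polyhedron-based scan exists precisely to avoid enumerating every candidate.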

VI Experiments

VI-A Wall-clock Runtime

The runtime refers to the wall-clock time required to perform inference on an encrypted image. This is the default metric reported in previous work on encrypted inference. However, wall-clock time is an imperfect metric for measuring improvements in encrypted inference. It is hardware-dependent, varying greatly with the available memory and computational power of the device, and it may also depend on the encryption scheme, since even the same encryption algorithm is implemented differently across libraries. In the next section, we introduce a hardware-independent metric to evaluate our methods.

VI-B Explanation of HOPs

We report the number of homomorphic operations (HOPs) of our inference network. This is in contrast to previous work [29, 15, 38], which measured either throughput or wall-clock time – both of which are highly dependent on hardware specifications and software parallelization, and are not entirely reliable measures. The HOPs metric is similar to the FLOPs (floating point) metric used in scientific computing.

A homomorphic operation is defined as an addition or multiplication involving a ciphertext, a plaintext, or both. The four classes of HOPs are (i) plaintext-ciphertext addition, (ii) ciphertext-ciphertext addition, (iii) plaintext-ciphertext multiplication, and (iv) ciphertext-ciphertext multiplication. While the exact implementation of HOPs may vary, we believe HOPs are a hardware-independent metric for performance analysis that enables a better comparison of models for encrypted inference, demonstrating whether a speedup occurs due to a decreased number of operations or due to algorithmic speedup.

It is important to note that the different HOPs classes vary in cost. In general, multiplicative operations are significantly more expensive than additive operations, with ciphertext-ciphertext multiplications being the most costly operations found in neural networks. Throughout our analysis, we break down our HOPs into separate operations, following the rule-of-thumb that reducing multiplicative HOPs outweighs the cost of adding additive HOPs.

VI-C Datasets

We use the MNIST dataset of handwritten digits [44], which contains grayscale images of Arabic numerals 0 to 9 (i.e., a 10-class classification task) and has a standard split of 50,000 training images and 10,000 test images. While MNIST is arguably a simple dataset, it has remained the standard benchmark for homomorphic inference tasks [29, 38].

VI-D Network Architecture

The network architecture used for MNIST inference is presented below. The architecture itself is a slight variant of the CryptoNets [29] architecture that incorporates batch normalization layers to support a greater variety of activation functions. The multiplicative depth is unchanged. As shown in Figure 2, our approximation error is minimized close to zero. Batch normalization encourages the pre-activation values to fit in this range. As confirmed in [15] and [38], by reducing the variance in the input values to the activation layer, the approximation error of the network decreases. Overall, our model is a convolutional neural network [44] consisting of convolutional layers, activation functions, scaled average pooling, batch normalization, and fully-connected layers.

1. Convolutional Layer. The input image is 28 x 28. There are 20 kernels of size 5 x 5, with stride 2, and padding of 1.

2. Batch Normalization Layer. This layer applies the batch normalization weights and biases to each input value.

3. Activation Layer. This layer applies the approximate activation function to each input value.

4. Scaled Average Pool Layer. This layer has 3 x 3 windows, with a stride of 2, padding of 1, and output size of 5 x 13 x 13.

5. Convolutional Layer. This layer has 50 kernels of size 20 x 5 x 5, with a stride of 1, and zero padding.

6. Scaled Average Pool Layer. This layer has 3 x 3 windows and a stride of 2, padding of 1, and output size of 50 x 5 x 5.

7. Fully-Connected Layer. This layer has parameters of size 1250 x 100 for matrix multiplication with respect to inputs.

8. Batch Normalization Layer. This layer applies the batch normalization weights and biases to each input value.

9. Activation Layer. This layer applies the approximate activation function to each input value.

10. Fully-Connected Layer. This layer has parameters of size 100 x 10 for matrix multiplication with respect to inputs.

VI-E Encryption Scheme

. The parameters for the FV-RNS encryption scheme are: coefficient count of , plaintext moduli of = 1099511922689 and = 1099512004609. The values of are selected for 128-bit security ( = 219). This choice of coefficient modulus meets the security standards established by the Homomorphic Encryption Standardization Workshop [6].

VI-F Hardware/Software Setup

The machine used for the MNIST experiments has an Intel Core i7-5930K CPU at 3.5 GHz with 48 GB RAM on Ubuntu 17.10. The HE library was SEAL v2.3.0-4 [17], modified by us to support our proposed method.

VI-G Optimization Hyperparameters

We provide the hyperparameter settings used to train our non-encrypted network. A batch size of 64 was used and the model was trained for 30 epochs. The learning rate schedule was initialized at

with a step size of 10 epochs and

. The model was trained with stochastic gradient descent with a momentum of 0.9. For the square function, gradients were clipped at 0.25. He weight initialization was used for the convolutional layers.

VI-H Dynamic Network Surgery Hyperparameters

We report the hyperparameters of our dynamic network surgery operations. The sparsity denotes the final fraction of non-pruned connections over the total connections, and c-rate denotes the compression rate used to set the threshold of importance before removing a connection.

We report metrics for each layer. The conv1 layer had a sparsity of 0.1440 and c-rate of 1.5. The conv2 layer had a sparsity of 0.0701 and c-rate of 1.65. The dense-fc1 layer had a sparsity of 0.0568 and c-rate of 1.65. The dense-fc2 layer had a sparsity of 0.1480 and c-rate of 1.5. All layers stopped at iteration 10,000.

VI-I Approximation Results

Fig. 2: Approximation results (non-encrypted) for (a) Swish, (b) ReLU, and (c) Softplus. (Top) Different approximation methods. The original activation function is plotted with three approximations: the minimax estimate, the rounded minimax estimate, and our method, the quantized minimax approximation. (Bottom) Error of our method. Our method is compared to the baseline. The blue shaded area corresponds to the post-batch normalization region during the training procedure.

Prior work suggests that neural networks equipped with polynomial functions have the same representational power as their non-polynomial counterparts [25, 45]. Faster CryptoNets uses quadratic activation functions that approximate modern activations with varying degrees of complexity and expressivity. Our proposed method allows us to construct an optimal, quantized polynomial approximation of any arbitrary function. In our experiments, we consider ReLU [30], Softplus [21], and Swish [55]. We model all activation functions with a 2nd degree polynomial. While higher-degree polynomials can decrease the approximation error, higher-degree polynomials also require more HOPs.

We note that [27] showed how the gradient of the square function can be large. Their solution was to apply gradient clipping to improve model convergence. While this is a viable solution for recurrent networks [19], clipping gradients in a shallow network (such as ours) may indicate model instability and may not work for deeper variants. To avoid this, we do not use the square activation function.

VI-J Polynomial Approximation Equations

We list the polynomial approximations to the Swish, Softplus, and ReLU activation functions.

Swish

  • Minimax:

  • Rounded Minimax:

  • Quantized:

Softplus

  • Minimax:

  • Rounded Minimax:

  • Quantized:

ReLU

  • Minimax:

  • Rounded Minimax:

  • Quantized:

VI-K Error Minimization

Fig. 3: Distribution of pre- and post-activation values for (a) the convolution layer, (b) the convolution layer with BN, (c) the dense (fc) layer, and (d) the dense (fc) layer with BN. (Top) The x-axis denotes the pre-activation value. (Bottom) The x-axis denotes the post-activation value. (Both) The y-axis denotes a normalized frequency. Each panel shows the original activation function, the baseline minimax estimate, the baseline rounded minimax estimate, and our method. BN denotes that batch normalization was applied after the convolution but before the activation function; this is reflected in the pre-activation value. Pre-activation values are the values after convolution but before applying the activation function.

The purpose of the error minimization experiment is to determine which activation function produces the lowest approximation error under our quantization constraints. We evaluate the effectiveness of multiple approximation schemes including our method.

VI-L Activation Approximation Accuracy

We present Table I which contains accuracy values for all the layers and all of the activation functions over three trials. Activation layers we considered include ReLU, square, Swish, and softplus, using the original function, approximated function, and quantized approximation function.

| Activation | Trial 1 Train | Trial 1 Test | Trial 2 Train | Trial 2 Test | Trial 3 Train | Trial 3 Test | Mean Train | Mean Test | Stddev Train | Stddev Test |
|---|---|---|---|---|---|---|---|---|---|---|
| Square | 99.80 | 99.08 | 99.81 | 99.14 | 99.8 | 99.29 | 99.80 | 99.17 | 0.01 | 0.11 |
| ReLU | 99.65 | 99.20 | 99.59 | 99.14 | 99.62 | 99.05 | 99.62 | 99.13 | 0.03 | 0.08 |
| ReLU-approx | 99.57 | 99.07 | 99.60 | 99.14 | 99.58 | 99.07 | 99.58 | 99.09 | 0.02 | 0.04 |
| Softplus | 99.42 | 99.17 | 99.37 | 99.06 | 99.41 | 99.05 | 99.4 | 99.09 | 0.03 | 0.07 |
| Softplus-A | 99.34 | 99.05 | 99.39 | 98.98 | 99.38 | 98.98 | 99.37 | 99.00 | 0.03 | 0.04 |
| Softplus-AQ | 99.17 | 98.92 | 99.13 | 98.92 | 99.17 | 98.87 | 99.16 | 98.9 | 0.16 | 0.03 |
| Swish | 99.63 | 99.16 | 99.64 | 99.22 | 99.64 | 99.02 | 99.64 | 99.13 | 0.01 | 0.10 |
| Swish-A | 99.56 | 99.07 | 99.59 | 99.13 | 99.58 | 99.07 | 99.58 | 99.09 | 0.02 | 0.03 |
| Swish-AQ | 99.56 | 99.09 | 99.60 | 99.12 | 99.60 | 99.08 | 99.59 | 99.10 | 0.02 | 0.02 |

TABLE I: Multiple trials for the activation function ablation study. Values denote accuracy. Minimax approximation is denoted by A and polynomial approximation with quantized coefficients is AQ (our method). For each activation function, three models were trained with different random seeds. The mean accuracy and standard deviation are shown.

Figure 2 shows our approximation methods applied to Swish, ReLU, and softplus. The functions are plotted on the top row. Most approximations fit the original function closely near zero. The bottom row of Figure 2 shows the approximation error of our method and the baseline for different pre-activation values. Overall, Swish has lower error than ReLU and softplus. If we can constrain the pre-activation values to fall within this range, our model will have better approximations. Conveniently, batch norm transforms the pre-activation values into a normal distribution with zero mean and unit variance [41], which reduces the overall error of the approximation [15]. The shaded area under the curve in Figure 2 shows the approximation error within the post-batch normalization region. Swish has lower error than both ReLU and softplus.

In Figure 3, we investigate the correctness of our proposed activation approximation method by plotting the pre-activation and post-activation values of different layers for both the regular and approximated Swish functions. The post-activation graphs in Figure 3 for Swish show a small negative minimum value. We analytically compute the theoretical minimum value for Swish, $f(x) = x\,\sigma(x)$, by setting the first-order derivative $f'(x) = \sigma(x)\,(1 + x(1 - \sigma(x)))$ to zero. This gives us the equation $1 + x + e^{x} = 0$, from which we can derive $x \approx -1.278$. Using $x$ to compute $f(x)$, we get an approximate minimum value of $-0.278$, which corroborates our empirical minimum values shown in Figure 3. We find that this minimum value remains consistent for the approximated Swish function as well, validating the correctness of our approximation method.
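The analytic minimum can be checked numerically. The sketch below assumes Swish in the form $f(x) = x \cdot \mathrm{sigmoid}(x)$, for which setting the derivative to zero reduces to $1 + x + e^{x} = 0$, and solves that equation by bisection. The helper names are hypothetical.

```python
import math

def swish(x):
    return x / (1.0 + math.exp(-x))

def stationarity(x):
    """Zero exactly where the derivative of x * sigmoid(x) vanishes."""
    return 1.0 + x + math.exp(x)

def bisect(f, lo, hi, tol=1e-12):
    """Find a root of f in [lo, hi], assuming f changes sign on the interval."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if f(lo) * f(mid) <= 0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)

x_min = bisect(stationarity, -2.0, -1.0)
print(round(x_min, 3), round(swish(x_min), 3))  # -1.278 -0.278
```

At the stationary point the identity $x\,\sigma(x) = x + 1$ holds, so the minimum value is simply $x_{\min} + 1$.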

VI-M Detailed Breakdown of Homomorphic Operations

| Layer | HOPs | PT-CT Adds | CT-CT Adds | PT-CT Mults | CT-CT Mults |
|---|---|---|---|---|---|
| Conv-1 | 42,757 | 845 | 20,956 | 20,956 | - |
| Act-1 | 845 | - | - | - | 845 |
| Pool-1 | 6,845 | - | 6,845 | - | - |
| Conv-2 | 309,950 | 1,250 | 154,350 | 154,350 | - |
| Pool-2 | 8,450 | - | 8,450 | - | - |
| FC-1 | 241,192 | 100 | 120,546 | 120,546 | - |
| Act-2 | 100 | - | - | - | 100 |
| FC-2 | 1,990 | 10 | 990 | 990 | - |
| Total | 612,129 | 2,205 | 312,137 | 296,842 | 945 |

TABLE II: CryptoNets HOPs. More detailed breakdown of HOPs for each layer. Plaintext is denoted by PT and ciphertext is denoted by CT. Adds and mults refer to the number of homomorphic addition and multiplication operations, respectively. Dashes indicate zero operations. FC refers to the dense (fully-connected) layer.

| Layer | HOPs | PT-CT Adds | CT-CT Adds | PT-CT Mults | CT-CT Mults |
|---|---|---|---|---|---|
| Conv-1 | 8,619 | 1,690 | 3,042 | 3,887 | - |
| Act-1 | 5,070 | 845 | 1,690 | 1,690 | 845 |
| Pool-1 | 6,845 | - | 6,845 | - | - |
| Conv-2 | 22,950 | 1,250 | 10,850 | 10,850 | - |
| Pool-2 | 8,450 | - | 8,450 | - | - |
| FC-1 | 14,354 | 100 | 7,077 | 7,177 | - |
| Act-2 | 600 | 100 | 200 | 200 | 100 |
| FC-2 | 306 | 10 | 148 | 148 | - |
| Total | 67,194 | 3,995 | 38,302 | 23,952 | 945 |

TABLE III: Faster CryptoNets HOPs. More detailed breakdown of HOPs for each layer. Plaintext is denoted by PT and ciphertext is denoted by CT. Adds and mults refer to the number of homomorphic addition and multiplication operations, respectively. Dashes indicate zero operations. FC refers to the dense (fully-connected) layer.

In Table II, we list the HOPs for CryptoNets at a more granular level than those presented in Table IV. In Table III, we list the HOPs for our Faster CryptoNets method. The number of HOPs is greatly reduced for the overall network and for every layer except the activation layers.

VI-N Comparison with Prior Work

| Criteria | Faster CryptoNets | CryptoNets | CryptoDL-1 | CryptoDL-2 |
|---|---|---|---|---|
| PT-CT Adds | 3,995 | 2,205 | 30,750 | 161,546 |
| CT-CT Adds | 38,302 | 312,137 | | |
| PT-CT Mults | 23,952 | 296,842 | | |
| CT-CT Mults | 945 | 945 | 1,600 | 64,512 |
| Total HOPs | 67,194 | 612,129 | | |
| Encrypt+Decrypt Time | 6.7 sec | 47.5 sec | 16.7 sec | 16.7 sec |
| Inference Time | 39.1 sec | 249.6 sec | 148.9 sec | 320.0 sec |
| Test Set Accuracy (%) | 98.71 | 98.95 | 98.52 | 99.52 |
| Message Size | 411.1 MB | 367.5 MB | 336.7 MB | 336.7 MB |
| Encryption Scheme | FV-RNS | YASHE | BGV | BGV |

TABLE IV: Comparison of State-of-the-Art Methods (Encrypted). Plaintext is denoted by PT and ciphertext is denoted by CT. Adds and mults refer to the number of homomorphic addition and multiplication operations, respectively. The total number of homomorphic operations is denoted by HOPs. Message size is the size of a single encrypted image. Faster CryptoNets uses Swish-AQ while CryptoNets uses the square function as the activation function. References: CryptoNets [29], CryptoDL [38].

The target use case of our work is inference on a single encrypted image (Figure 1). We believe this setting is closer to practical use cases, where the third-party host runs asynchronous inference for individual users. Additionally, [49] suggests that batching has very significant drawbacks, including having to select more numerous and restricted NTT points, forcing specific computations away from NTT, and adding large computational cost. Works focusing on accelerating neural networks neglect batching for similar reasons ([59] does not use batching, and [11] uses batching to compress messages but not to improve throughput). Works that do batch inputs use schemes that are not very efficient in practice (discussed in [49]) and do not report the batching cost. A thorough performance analysis of batching binary vs. scalar messages across different libraries is beyond the scope of our paper but would be a valuable direction for future work. As such, we do not implement ciphertext batching techniques in this paper, although our technique does not preclude their use: [49] introduces the Karatsuba algorithm, which supports batching with binary encoding and preserves the benefits of our method.

We refer to Table IV for accuracy and runtime results. The test set accuracy of our original model is 99.12%, and is slightly reduced to 98.71% after pruning and quantization. Evaluation of the network layers in Faster CryptoNets takes 39.1 seconds for one input, compared to 249.6 seconds for CryptoNets. We achieve a roughly 6.4× improvement in wall-clock time while maintaining accuracy comparable to that of CryptoNets, which achieved 98.95% test accuracy. We also find that our method uses roughly 9.1× fewer HOPs (67,194 vs. 612,129), a larger improvement than raw wall-clock time suggests. In Faster CryptoNets, encoding/encryption takes 6.63 seconds, while decryption of the final layer’s output takes 0.02 seconds. CryptoNets takes 44.5 seconds for encoding/encryption and 3 seconds for decryption. Our method is roughly 6.7× and 150× faster for these operations, respectively.

MNIST images are 28 × 28 pixels. Each ciphertext consists of 2 polynomials, resulting in 65,544 (64-bit) integers. Therefore, our message consists of 28 × 28 × 65,544 × 8 bytes, or 411.1 MB. The output of the network consists of the 10 outputs of the final dense layer, which gives us a result of 10 × 65,544 × 8 bytes, or 5.24 MB. In CryptoNets, the authors’ encryption scheme results in each image consuming 367.5 MB in encrypted form. Our scheme results in message sizes comparable to previous work.
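The message-size arithmetic above can be written out as a short calculation. This is a minimal sketch, assuming one ciphertext per pixel and per output score, with the 65,544 integers per ciphertext taken from the text; the helper name `message_bytes` is hypothetical.

```python
INTS_PER_CIPHERTEXT = 65_544   # 64-bit integers per ciphertext, as stated in the text
BYTES_PER_INT = 8

def message_bytes(num_values: int, ints_per_ct: int = INTS_PER_CIPHERTEXT) -> int:
    """Size in bytes when every value is encrypted as its own ciphertext."""
    return num_values * ints_per_ct * BYTES_PER_INT

mnist_in = message_bytes(28 * 28)   # one ciphertext per pixel
mnist_out = message_bytes(10)       # one ciphertext per class score
print(f"input:  {mnist_in / 1e6:.1f} MB")
print(f"output: {mnist_out / 1e6:.2f} MB")
```

Under these assumptions the input and output sizes round to the 411.1 MB and 5.24 MB figures quoted above.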

VI-O Ablation Studies

| Layer | Faster CryptoNets Time (s) | Faster CryptoNets HOPs | CryptoNets Time (s) | CryptoNets HOPs | Relative Time | Relative HOPs |
|---|---|---|---|---|---|---|
| Conv-1 | 3.9 | 8,619 | 30.0 | 42K | | |
| Act-1 | 23.4 | 5,070 | 81.0 | 845 | | |
| Mid | 9.1 | 53K | 127.0 | 566K | | |
| Act-2 | 2.7 | 600 | 10.0 | 100 | | |
| FC-2 | 0.1 | 306 | 1.6 | 1,990 | | |
| Total | 39.1 | 67K | 249.6 | 612K | | |

TABLE V: Layer-Wise Analysis (Encrypted). Wall-clock time (seconds) and HOPs required for inference on a single encrypted image. K denotes thousands. Act refers to the activation function. Mid denotes a combination of pool1, conv2, pool2, and fc1, as reported by [29].

Faster CryptoNets differs from the CryptoNets model in that we use the Swish activation instead of the square function. While both methods use a 2nd degree polynomial of the form , our approximations use for increased expressivity whereas the square function set , resulting in fewer HOPs for the square function. This is shown in Table V in the rows Act-1 and Act-2. Despite our method requiring more HOPs for Act-1 and Act-2, we still achieve a faster inference time than CryptoNets. At a per-layer level, our method yields up to and improvements for wall-clock and HOPs, respectively.
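The per-layer ratios can be recomputed from the reported figures. The sketch below is illustrative only: it takes the times from Table V and the exact HOP counts from Tables II and III (with Mid summed over Pool-1, Conv-2, Pool-2, and FC-1), and the ratios it prints are derived here rather than copied from the paper.

```python
# (layer, faster_time_s, faster_hops, cryptonets_time_s, cryptonets_hops)
ROWS = [
    ("Conv-1", 3.9, 8_619, 30.0, 42_757),
    ("Act-1", 23.4, 5_070, 81.0, 845),
    ("Mid", 9.1, 52_599, 127.0, 566_437),
    ("Act-2", 2.7, 600, 10.0, 100),
    ("FC-2", 0.1, 306, 1.6, 1_990),
    ("Total", 39.1, 67_194, 249.6, 612_129),
]

def relative(rows):
    """Return {layer: (time_ratio, hops_ratio)}: CryptoNets divided by Faster CryptoNets."""
    return {name: (ct / ft, ch / fh) for name, ft, fh, ct, ch in rows}

ratios = relative(ROWS)
for name, (time_ratio, hops_ratio) in ratios.items():
    print(f"{name:7s} time x{time_ratio:5.1f}   HOPs x{hops_ratio:5.2f}")
```

A HOPs ratio below 1 for the activation rows reflects the extra operations of the Swish approximation, while the time ratio stays above 1 for every row.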

We compare the performance of different activation functions when approximated with our proposed polynomial approximation and quantization (AQ) method. For MNIST, Swish-AQ produced a test accuracy of 99.10%, while ReLU-AQ achieves an equivalent test accuracy of 99.09%. We note the similarity of ReLU and Swish. This finding is corroborated by the similarity of the approximations in Figure 2. The polynomial coefficients we calculate for the ReLU and Swish approximations turn out to be the same, except for a constant factor.

We evaluate the inference quality and runtime of pruning and quantization separately. Pruning alone produced a test set accuracy of 98.73% and an inference time of 104.7 seconds. Quantization alone produced 99.06% and 162.5 seconds. When combined, pruning and quantization produced 98.71% and 45.7 seconds. We record the accuracies during each INQ step for a non-pruned network in Table VI and for a DNS-pruned network in Table VII. The accuracy is largely preserved as the network is successively quantized. Overall, accuracy was preserved, or slightly improved in the case of quantization.

| INQ Step | Partition | Quantized % | Accuracy |
|---|---|---|---|
| 1 | 0.7 | 30% | 99.00 |
| 2 | 0.4 | 60% | 99.02 |
| 3 | 0.2 | 80% | 98.99 |
| 4 | 0.0 | 100% | 99.06 |

TABLE VI: INQ-only quantization schedule. Accuracies collected for each Incremental Network Quantization (INQ) step are reported in the Accuracy column. In each step, a progressively larger set of the weights is partitioned and quantized, as reported in columns 2 and 3.

| INQ Step | Partition | Accuracy |
|---|---|---|
| 1 | 0.98 | 98.74 |
| 2 | 0.96 | 98.78 |
| 3 | 0.94 | 98.69 |
| 4 | 0.92 | 98.68 |
| 5 | 0.90 | 98.69 |
| 6 | 0.88 | 98.79 |
| 7 | 0.86 | 98.68 |
| 8 | 0.00 | 98.71 |

TABLE VII: DNS+INQ quantization schedule. Similar analysis is performed on a Dynamic Network Surgery (DNS) pruned network. Accuracies collected for each INQ step are reported in the Accuracy column, as well as the weight partitioning in the Partition column.
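The incremental schedule in Table VI can be illustrated with a small sketch. It assumes that each INQ step freezes the largest-magnitude weights that are still unquantized and rounds them to signed powers of two; the rounding rule, the selection criterion, and the function names are assumptions made for illustration, and the retraining between steps is omitted.

```python
import math

def nearest_power_of_two(w: float) -> float:
    """Round a weight to the closest signed power of two (in log scale)."""
    if w == 0.0:
        return 0.0
    exponent = round(math.log2(abs(w)))
    return math.copysign(2.0 ** exponent, w)

def inq_step(weights, frozen, quantized_fraction):
    """Quantize largest-magnitude weights until `quantized_fraction` are frozen.

    `frozen` marks weights quantized in earlier steps; they are left untouched.
    Returns new (weights, frozen) lists; unfrozen weights would be retrained next.
    """
    target = round(quantized_fraction * len(weights))
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]), reverse=True)
    new_w, new_f = list(weights), list(frozen)
    count = sum(new_f)
    for i in order:
        if count >= target:
            break
        if not new_f[i]:
            new_w[i] = nearest_power_of_two(new_w[i])
            new_f[i] = True
            count += 1
    return new_w, new_f

weights = [0.9, -0.3, 0.05, 0.6, -0.11, 0.2, 0.01, -0.7, 0.4, 0.15]
frozen = [False] * len(weights)
for fraction in (0.3, 0.6, 0.8, 1.0):   # quantized fractions from Table VI
    weights, frozen = inq_step(weights, frozen, fraction)
    # retraining of the still-unfrozen weights would happen here
print(weights)
```

After the final step every weight is a signed power of two, which is the kind of value the sparse encodings discussed in this paper are designed to exploit.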

VII Experimental Correctness

We make sure our parameters are selected properly so that our decrypted outputs are correct. We run encrypted inference on the 10,000-image MNIST test set, and find no accuracy loss relative to our method’s plaintext results (98.71%). We also find a precision error of around 0.05% when comparing the plaintext and decrypted outputs. Upon further examination, we found that this error is introduced during the floating-point to fixed-point conversion prior to the encoding scheme, and that this error does not affect the accuracy of our model.

VIII Scaling Up

To evaluate how well our method works in real-world settings, we implement our techniques on larger datasets. First, we focus on CIFAR-10 as a larger practical image classification task. Next, we consider a diabetic retinopathy dataset as a real-world medical imaging use case where very deep neural networks would be used in practice. For both experiments, we upgrade our machines to n1-megamem-96 instances offered by the Google Cloud Platform, each of which has 96 Intel Skylake 2.0 GHz vCPUs and 1433.6 GB RAM.

VIII-A FV-RNS Parameters

We use a ring dimension with fifteen plaintext moduli : 40961, 65537, 114689, 147457, 188417, 270337, 286721, 319489, 417793, 557057, 638977, 737281, 778241, 786433, 925697. The values of the coefficient moduli are selected to provide 128-bit security, such that . Furthermore, each coefficient modulus is decomposed into four 64-bit moduli for efficient use of the RNS variant of the FV encryption scheme.
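To illustrate why several plaintext moduli are useful, the sketch below assumes that they are combined with the Chinese Remainder Theorem (CRT), as is common in CryptoNets-style systems; this section does not state the combination rule, so treat it as an assumption. The function names are hypothetical, and plain integers stand in for encrypted values.

```python
from math import prod

PLAINTEXT_MODULI = [40961, 65537, 114689, 147457, 188417, 270337, 286721, 319489,
                    417793, 557057, 638977, 737281, 778241, 786433, 925697]

def to_residues(value, moduli=PLAINTEXT_MODULI):
    """Split an integer into one residue per plaintext modulus."""
    return [value % t for t in moduli]

def from_residues(residues, moduli=PLAINTEXT_MODULI):
    """Recombine residues with the CRT and map the result to a signed integer.

    Assumes the moduli are pairwise coprime; pow() raises ValueError otherwise.
    """
    big_t = prod(moduli)
    total = 0
    for r, t in zip(residues, moduli):
        partial = big_t // t
        total += r * partial * pow(partial, -1, t)  # modular inverse (Python 3.8+)
    total %= big_t
    return total - big_t if total > big_t // 2 else total

value = -123_456_789_012_345_678_901_234_567_890
recovered = from_residues(to_residues(value))
print(recovered == value)
print(f"combined modulus has {prod(PLAINTEXT_MODULI).bit_length()} bits")
```

Each residue would be processed under its own modulus, and the true integer result is recovered only after decryption, provided its magnitude stays below half of the combined modulus.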

IX CIFAR-10

MNIST is a relatively easy dataset, with simple machine learning algorithms like linear regression or KNN producing high-accuracy results [64]. CIFAR-10 [43] is a more complicated task where CNNs perform notably better than other methods. We evaluated the CIFAR-10 performance of our method on the model used in CryptoDL [38], which consists of eight convolutional layers and which we denote as CNN-8 from now on.

| Activation | CIFAR-10 Train Acc. | CIFAR-10 Test Acc. |
|---|---|---|
| ReLU | | 86.76 |
| Square | | 59.88 |
| Softplus | | |
| Swish | | |
| ReLU-A | | |
| Softplus-A | | |
| Swish-A | | |
| ReLU-AQ | 77.95 | 75.99 |
| Softplus-AQ | | |
| Swish-AQ | 78.20 | 75.66 |

TABLE VIII: Approximation results. Minimax approximation is denoted by A and polynomial approximation with quantized coefficients is AQ (our method). The training accuracy and test accuracy are shown for CIFAR-10.

| Layer | PT-CT Adds | CT-CT Adds | PT-CT Mults | CT-CT Mults |
|---|---|---|---|---|
| Conv-1 | 36,864 | 460,800 | 479,232 | - |
| Conv-2 | 36,864 | 13,294,908 | 13,313,340 | - |
| Activ-1 | 18,432 | 36,864 | 55,296 | 36,864 |
| Pool-1 | - | 18,432 | - | - |
| Conv-3 | 18,432 | 5,968,347 | 5,977,563 | - |
| Conv-4 | 18,432 | 11,931,014 | 11,940,230 | - |
| Activ-2 | 9,216 | 18,432 | 27,648 | 18,432 |
| Pool-2 | - | 9,216 | - | - |
| Conv-5 | 9,216 | 4,713,389 | 4,717,997 | - |
| Conv-6 | 9,216 | 9,421,992 | 9,426,600 | - |
| Activ-3 | 4,608 | 9,216 | 13,824 | 9,216 |
| Pool-3 | - | 4,608 | - | - |
| FC-1 | 256 | 294,644 | 294,644 | - |
| FC-2 | 10 | 2,560 | 2,560 | - |
| Total | 161,546 | 46,184,422 | 46,248,934 | 64,512 |

TABLE IX: CNN-8 HOPs. More detailed breakdown of HOPs for each layer. Plaintext is denoted by PT and ciphertext is denoted by CT. Adds and mults refer to the number of homomorphic addition and multiplication operations, respectively. Dashes indicate zero operations. FC refers to the dense (fully-connected) layer.

IX-A Activation Comparisons

Table VIII contains accuracy values for all of the activation functions. The activations we considered include ReLU, square, Swish, and softplus, each evaluated using the original function, the approximated function, and the quantized approximation. In Table VIII, we can see that training this model with the square activation function resulted in significantly worse test accuracy (59.88%) than training it with the ReLU activation function (86.76%), confirming the theoretical loss of accuracy from the instability of the square function in deeper neural networks. Furthermore, we find that ReLU-AQ and Swish-AQ offer comparable levels of performance (77.95% and 78.20% training accuracy, 75.99% and 75.66% test accuracy), while significantly improving on the accuracy achieved with the square activation function.

IX-B Pruning and Quantization

The pruning and quantization procedure results in a model with slightly improved accuracy (76.72%) that requires an order of magnitude fewer HOPS for inference ( HOPs vs. HOPs for the baseline method). The inference time for the model was 22,372 seconds with our method.

IX-C Message Size

The message size for the input image is 32 × 32 × 3 × 65,544 × 8 bytes, or 1,610.8 MB.

X Medical Imaging

Fig. 4: Illustration of our proposed method for encrypted inference on retinal fundus images. The cloud provider delegates the computation of the pretrained layers of the neural network to the end-user and evaluates the task-specific layers using the homomorphic encryption scheme. The output of the network is returned in encrypted form to the end-user, who can decrypt to determine the prediction of the severity of diabetic retinopathy.

| Model | Layers Retrained | Test Accuracy |
|---|---|---|
| CNN-8 | All Layers | 63.23 |
| DFE-RN-50 | Top Block | 69.89 |
| DP-DFE-RN-50 | All Layers (with DP) | 76.47 |
| ResNet-50 | All Layers | |

TABLE X: Accuracies and layers trained of each model. RN denotes ResNet, DFE denotes delegated feature extraction, and DP denotes differentially private. A ResNet-50 model is trained in a standard setting for benchmarking. We observe that DFE-RN-50 has significantly higher accuracy than the baseline CNN model and that DP-DFE-RN-50 further increases the test accuracy close to a plain ResNet-50 model.

| Model | Accuracy | HOPs | Runtime (s) |
|---|---|---|---|
| CNN-8 (sparse) | 63.23 | 1.33E8 | 3325 |
| DP-DFE-RN-50 | 76.04 | 3.95E8 | |
| DP-DFE-RN-50 (sparse) | | | |

TABLE XI: Comparison of performance metrics for transfer learning vs. fully-trained models. We show that our privacy-safe delegated feature extraction model results in both higher accuracy and fewer HOPs/runtime, and also show that our sparsity techniques maintain high accuracy.

| Method | Acc. | HOPs | Inference Time (s) | Speedup |
|---|---|---|---|---|
| Original | 70.47 | | 12493 | |
| Pruned | 70.98 | | 1924 | 6.4x |
| Pruned/Quantized | 70.55 | | 1590 | 7.8x |

TABLE XII: Ablation study of methods for improving performance of DFE-ResNet-152.

Fig. 5: Illustration of retinal fundus image, graded with ‘none’ rating for diabetic retinopathy. The right image shows the result of the procedure for preprocessing.

A significant limiting factor in levelled encryption schemes used with neural networks is multiplicative depth: only a set number of HE operations can be performed sequentially. Increasing the multiplicative depth by choosing larger parameters incurs prohibitive cost, limiting us to neural networks with three activation functions under our current settings. However, state-of-the-art real-world applications of deep learning, such as medical imaging, commonly use very deep neural networks. To mitigate this issue, we propose the use of models trained with transfer learning, where the computations involved in the pretrained layers of the model can be delegated to the client, and encryption is applied only for the evaluation of the fine-tuned layers on the server. Using this technique, which we call Delegated Feature Extraction (DFE), together with the Faster CryptoNets optimizations to speed up the computation, we achieve practical runtimes for large input sizes. An illustration of our technique is provided in Figure 4. We show how this technique can be improved with private training, demonstrating a viable framework where private, efficient, and powerful machine learning services can be provided.

X-A Data

We chose the diabetic retinopathy dataset introduced by [32] both for its clinical impact and the privacy-sensitive nature of retinal data. The dataset consists of macula-centered retinal fundus images primarily sourced from EyePACS and was graded by 54 ophthalmologists or ophthalmologist trainees using the International Clinical Diabetic Retinopathy scale [7] into ‘none’, ‘mild’, ‘moderate’, ‘severe’, or ‘proliferative’ ratings for the severity of the condition.

We were able to obtain a subset of around 35,126 images of the dataset, with a label distribution of 25,810 ‘none’, 2,443 ‘mild’, 5,292 ‘moderate’, 873 ‘severe’, and 708 ‘proliferative’ diagnoses. Scans from both the left and right eye were sourced from each patient. To compare our results to the replication study performed by [66], we group the ‘none’ and ‘mild’ labels to a ‘0’ label and the ‘moderate’, ‘severe’, and ‘proliferative’ labels to a ‘1’ label to reframe our problem into a binary classification task. We randomly subsampled our dataset to get an even split between our two labels, and following the guidelines recommended by [32] we use an 80-20 split for training and test data.
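The label grouping and balanced split described above can be sketched as follows. The label counts come from the text; the random seed, the image-level (rather than patient-level) split, and the helper name `balanced_split` are assumptions made for illustration.

```python
import random

COUNTS = {"none": 25_810, "mild": 2_443, "moderate": 5_292,
          "severe": 873, "proliferative": 708}
BINARY = {"none": 0, "mild": 0, "moderate": 1, "severe": 1, "proliferative": 1}

def balanced_split(labels, train_frac=0.8, seed=0):
    """Subsample to equal class counts, then split indices into train/test."""
    rng = random.Random(seed)
    by_class = {0: [], 1: []}
    for idx, grade in enumerate(labels):
        by_class[BINARY[grade]].append(idx)
    n = min(len(v) for v in by_class.values())   # size of the minority class
    keep = rng.sample(by_class[0], n) + rng.sample(by_class[1], n)
    rng.shuffle(keep)
    cut = int(train_frac * len(keep))
    return keep[:cut], keep[cut:]

labels = [grade for grade, count in COUNTS.items() for _ in range(count)]
train_idx, test_idx = balanced_split(labels)
print(len(labels), len(train_idx), len(test_idx))
```

With these counts the positive class is the minority, so the balanced subset contains twice that many images before the 80-20 split.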

Before using the retinal images with our network, we perform some preprocessing on the raw images. The scans are scaled to the standard ImageNet input size, with cropping performed using edge detection to reframe the images. To normalize the colors and lighting, the local average color is subtracted from each image, after which the local average is mapped to grayscale. Random rotation is applied to the image to make the model invariant to left/right eye positioning and for general augmentation. Samples of the data we use are provided in Figure 5.

X-B Transfer Learning

Transfer learning is a useful way to learn an accurate model with limited data [10], and requires retraining (fine-tuning) of only a few of the layers of the network rather than training a full network from scratch [39]. The base layers are commonly trained on ImageNet [20, 58], and the final layers are retrained on a specific proprietary dataset. This practice is common in the healthcare setting, where large datasets are expensive to acquire and transfer learning can simplify and/or improve the training of the model, especially when using Machine Learning as a Service (MLaaS) offerings from cloud computing providers to expedite development of the application [63, 31]. As we describe in more detail in our methods section, we can leverage transfer learning such that only a small number of fine-tuned layers of the network require evaluation under the encryption scheme, while the generic feature extraction of the base network layers is delegated to the client.

X-C Network Architectures

We first implement a baseline model to train on the retinal dataset. Our baseline model (CNN-8) resembles the CNN architecture presented in [38] for CIFAR-10, but is designed to support the larger input image size of the retinal scans. It uses the same multiplicative depth budget as our transfer learning models in realizing the privacy guarantees. In particular, it contains eight convolutional layers and three approximate activation functions.

Modern deep neural networks like ResNet-152 [37] and Inception-v3 [62] can contain hundreds of layers. The multiplicative depth of the fully-trained ResNet-152 or the Inception-v3 in [32] is at least an order of magnitude greater and would have a prohibitively large runtime in the encrypted setting. Additionally, there exist challenges in achieving strong accuracy when training all layers of a deep neural network with approximate activation functions. Our proposed model (DFE-ResNet-152) only requires retraining of the top block of the ResNet-152 network, which contains three convolutional layers and three activation functions. In this top block, we replace the ReLU activation functions with our approximate activation function, and we use the rest of the ImageNet-pretrained model as a delegated feature extractor on the client.

X-D Model Adaptations

We use the approximation of Swish [55] derived in previous sections. This approximation is required only for the activation functions in the retrained block, which reduces difficulties with convergence in training.

Since the encryption scheme only supports addition and multiplication, some minor modifications are required to support average pooling. While other work has used scaled average pooling, in this work we encode the reciprocal of the size of the pooling window. Furthermore, to support the addition operation in the residual block, the scaling factor must be encoded as well, to scale the encrypted input image for the addition operation.
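The pooling adaptation described above can be sketched with plain integers standing in for ciphertexts. The fixed-point scale and the function names are hypothetical; the point is only that an average can be computed with additions followed by one multiplication by an encoded reciprocal.

```python
SCALE = 2 ** 10  # hypothetical fixed-point scale; not a value from the paper

def encode(x: float) -> int:
    """Fixed-point encoding: round x * SCALE to an integer."""
    return round(x * SCALE)

def avg_pool_window(encoded_values, window_size):
    """Average a pooling window using only additions and one multiplication.

    The result carries a scale of SCALE**2, since both the inputs and the
    reciprocal are fixed-point encoded.
    """
    total = 0
    for v in encoded_values:                   # stands in for CT-CT additions
        total += v
    return total * encode(1.0 / window_size)   # stands in for one PT-CT multiplication

window = [0.5, 1.25, -0.75, 2.0]
pooled = avg_pool_window([encode(v) for v in window], len(window))
print(pooled / SCALE ** 2)
```

The decoded result matches the ordinary mean of the window whenever the inputs and the reciprocal are exactly representable at the chosen scale, as they are in this example.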

X-E Client-Server Interaction

The client uses a standard deep learning framework to evaluate the base network layers on a single RGB retinal fundus image. Each element of the activation volume of the final base layer can be converted to a fixed-point value and encoded using the integer encoder described above. Each value is encrypted on the client and transmitted to the server. The server returns encrypted output values, which the client decrypts, converts to floating point, and applies the sigmoid function to learn the final predictions of the model. Given the support for deep learning operations in major mobile platforms, the client could even be the patient’s own mobile device, allowing for direct service models in developing countries.
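The client-side conversions described above can be sketched as follows. The number of fractional bits and all function names are hypothetical, and encryption and decryption are omitted, so the example only shows the fixed-point round trip and the final sigmoid.

```python
import math

FRACTIONAL_BITS = 16  # hypothetical precision; the paper's setting is not given here

def to_fixed(x: float, bits: int = FRACTIONAL_BITS) -> int:
    """Convert a float to a fixed-point integer before encoding and encryption."""
    return round(x * (1 << bits))

def to_float(q: int, bits: int = FRACTIONAL_BITS) -> float:
    """Convert a decrypted fixed-point integer back to a float."""
    return q / (1 << bits)

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def client_postprocess(decrypted_logit: int, bits: int = FRACTIONAL_BITS) -> float:
    """Turn a decrypted fixed-point logit into a probability."""
    return sigmoid(to_float(decrypted_logit, bits))

features = [0.1234567, -2.5, 3.14159]
encoded = [to_fixed(f) for f in features]
max_err = max(abs(to_float(q) - f) for q, f in zip(encoded, features))
print(f"max round-trip error: {max_err:.2e}")
print(f"probability for a zero logit: {client_postprocess(to_fixed(0.0)):.2f}")
```

The round-trip error is bounded by half of the smallest representable step, which is the kind of small floating-point to fixed-point discrepancy noted in the correctness experiments.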

X-F Evaluation Metrics

The natural alternative to our proposed technique is to have the entire fully-trained model stored on the server and for clients to transmit their encrypted images for diagnosis. We want to demonstrate that our proposed technique of using fine-tuned layers of a much deeper model has higher accuracy, greater performance, and a smaller message size, without compromising any security or privacy guarantees. Our primary points of comparison will be between a standard model that is fully-trained on our dataset and a transfer learning model that uses pretrained ImageNet weights and is fine-tuned on our dataset.

We will need to consider the multiplicative depth of the models, which corresponds to the length of the deepest path of ciphertext multiplications through the network. We keep the multiplicative depth fixed between the two methods to enable a fair comparison. We will also analyze the count of homomorphic operations (HOPs), which serves as a hardware-independent and implementation-agnostic measure of the complexity of evaluation.

X-G DP-SGD

One concern with the earlier approach is that users are limited to fine-tuning only a few layers of a network. However, training the entire model and releasing the upper portion of the network as a feature extractor could pose a privacy risk, as the feature extractor could potentially leak sensitive information. We explore the use of private training techniques to fine-tune the feature extractor further to improve accuracy while preserving end-to-end privacy.

Differential privacy is a privacy construct which guarantees that an individual will not change the overall statistics of the population [22]. Formally, a randomized algorithm $\mathcal{A}$ is $(\epsilon, \delta)$-differentially private if, for any two datasets $D$ and $D'$ that differ in a single record and for any set of outputs $S$, $\Pr[\mathcal{A}(D) \in S] \le e^{\epsilon} \Pr[\mathcal{A}(D') \in S] + \delta$. Applying differential privacy to neural networks helps defend against membership inference and model inversion attacks [3]. This can be achieved either by applying noise to gradients while training a single model [2] [61] or by segregating data and adding noise in a collaborative learning setting [53] [60].

DP-SGD optimization was developed by [2] and involves adding Gaussian noise and clipping gradients of neural networks during training with stochastic gradient descent. It also keeps track of the privacy loss through a privacy accountant [48], which prematurely terminates training when the total privacy cost of accessing training data exceeds a predetermined budget. Differential privacy is attained as clipping bounds the L2-norm of individual gradients, thus limiting the influence of each example on the learning updates. We outline the DP-SGD algorithm briefly below in Algorithm 2.

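Input: Examples $\{x_1, \ldots, x_N\}$, loss function $\mathcal{L}(\theta) = \frac{1}{N} \sum_i \mathcal{L}(\theta, x_i)$. Parameters: learning rate $\eta_t$, noise scale $\sigma$, group size $L$, gradient norm bound $C$.
Output: $\theta_T$, and calculate the privacy cost using a privacy accountant method
Initialize $\theta_0$ randomly;
for $t \in [T]$ do
      Take a random sample $L_t$ with sampling probability $L/N$;
      Compute gradient: for each $i \in L_t$, compute $g_t(x_i) \leftarrow \nabla_{\theta_t} \mathcal{L}(\theta_t, x_i)$;
      Clip gradient: $\bar{g}_t(x_i) \leftarrow g_t(x_i) / \max(1, \|g_t(x_i)\|_2 / C)$;
      Add noise: $\tilde{g}_t \leftarrow \frac{1}{L} \left( \sum_i \bar{g}_t(x_i) + \mathcal{N}(0, \sigma^2 C^2 I) \right)$;
      Descent: $\theta_{t+1} \leftarrow \theta_t - \eta_t \tilde{g}_t$;
Algorithm 2 Differentially private SGD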
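The steps of Algorithm 2 (sample, compute per-example gradients, clip, add noise, descend) can be sketched in plain Python. This is a toy illustration on a linear regression problem, not the training code used in the paper: the loss, the data, and the hyperparameter values are invented for the example, the noisy sum is averaged over the realized batch size rather than the nominal group size, and no privacy accountant is included.

```python
import math
import random

def clip(grad, bound):
    """Scale grad so that its L2 norm is at most `bound`."""
    norm = math.sqrt(sum(g * g for g in grad))
    factor = max(1.0, norm / bound)
    return [g / factor for g in grad]

def dp_sgd_step(theta, batch, grad_fn, lr, noise_scale, bound, rng):
    """One DP-SGD update: per-example clipping, Gaussian noise, averaged descent."""
    summed = [0.0] * len(theta)
    for example in batch:
        clipped = clip(grad_fn(theta, example), bound)
        summed = [s + c for s, c in zip(summed, clipped)]
    # Noise standard deviation is noise_scale * bound for each coordinate.
    noisy = [s + rng.gauss(0.0, noise_scale * bound) for s in summed]
    return [t - lr * n / len(batch) for t, n in zip(theta, noisy)]

def squared_loss_grad(theta, example):
    """Gradient of 0.5 * (theta . x - y)^2 for a linear model (toy loss)."""
    x, y = example
    err = sum(t * xi for t, xi in zip(theta, x)) - y
    return [err * xi for xi in x]

rng = random.Random(0)
data = [([1.0, x], 2.0 * x + 1.0) for x in (rng.uniform(-1, 1) for _ in range(256))]
theta = [0.0, 0.0]
for _ in range(300):
    # Include each example independently with probability L/N (here L = 32).
    batch = [ex for ex in data if rng.random() < 32 / len(data)]
    if batch:
        theta = dp_sgd_step(theta, batch, squared_loss_grad, 0.1, 1.0, 1.0, rng)
print(theta)
```

Clipping bounds the contribution of any single example to the update, which is what allows the added Gaussian noise to be calibrated to the bound.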

We use a modified DP-SGD algorithm [2] to fine-tune the entire network, using the following techniques introduced by [70]: warm-starting, where a public dataset is used to initialize the weights of the model, and weight clustering, where the same dataset is used to estimate the gradient L2 norms of each parameter before a hierarchical clustering algorithm groups parameters with similar clipping bounds (Algorithm 3). The public dataset we use is a smaller retinal scan dataset from the STARE project [1]. We train the network privacy settings of and giving us a value of . Finally, we retrain the layers to be encrypted as before, leaving us with our final model (DP-DFE-RN-50).

input : k - target number of groups; - parameter-specific gradient clipping bounds
output :  - grouping of parameters
1 ;
2 while  k do
3       ;
4       merge and w. clipping bound
5return ;
Algorithm 3 Weight Clustering
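A possible reading of the weight clustering procedure is sketched below. Because the formulas of Algorithm 3 are not recoverable from this text, two choices here are assumptions: groups are merged when the ratio of their clipping bounds is closest to 1, and the merged bound is the root of the sum of squares. The parameter names and bound values are invented for the example.

```python
import math

def cluster_clipping_bounds(bounds, k):
    """Greedy agglomerative grouping of parameters by clipping bound.

    `bounds` maps a parameter name to its estimated (positive) gradient-norm bound.
    Returns a list of (parameter_names, group_bound) with at most k groups.
    """
    groups = [([name], c) for name, c in bounds.items()]
    while len(groups) > k:
        # Find the pair of groups whose bounds are most similar (ratio closest to 1).
        best = None
        for i in range(len(groups)):
            for j in range(i + 1, len(groups)):
                ci, cj = groups[i][1], groups[j][1]
                ratio = max(ci, cj) / min(ci, cj)
                if best is None or ratio < best[0]:
                    best = (ratio, i, j)
        _, i, j = best
        merged = (groups[i][0] + groups[j][0],
                  math.sqrt(groups[i][1] ** 2 + groups[j][1] ** 2))  # assumed merge rule
        groups = [g for idx, g in enumerate(groups) if idx not in (i, j)] + [merged]
    return groups

bounds = {"conv1": 0.10, "conv2": 0.12, "conv3": 1.0, "fc": 1.1}
clusters = cluster_clipping_bounds(bounds, k=2)
print(clusters)
```

Grouping parameters this way lets each group be clipped with a bound suited to its gradient scale instead of applying a single global bound.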

The accuracy and performance metrics of our models are listed in Tables X and XI. We compare the models in terms of both hardware-independent HOPs (homomorphic operations) and wall-clock runtime.

X-H Message Sizes

The data transfer between the client and server consists of the values for the encrypted input and the values for the encrypted output prediction. Note that both the image and the prediction are encrypted under multiple keys held by the client, each corresponding to a distinct plaintext modulus, which makes the message size proportional to the number of moduli used for evaluation. We used fifteen plaintext moduli in our experiments.

As a result, the message size for the encrypted input in our DFE-ResNet-152 method is 789.2 GB, corresponding to the encryption of the activation volume. The message size for the encrypted input in our CNN-8 baseline method is 1183.8 GB, corresponding to the encryption of the input image. The message size of the encrypted output is identical in each case: 7.9 MB.

Since the input is transformed to a representation with a smaller dimensionality in the transfer learning method, the cost of data transfer is reduced by 1.5x. While the message size is significant in both cases, we note that ciphertext batching techniques can amortize the cost of encrypted inference when a user wishes to request predictions on multiple images. In the case of diabetic retinopathy detection, this could correspond to predictions for both the left and right eye, or predictions for images of multiple patients of a healthcare provider.
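The reported sizes can be reproduced under a few assumptions that are not stated in this section: a 7 × 7 × 2048 activation volume entering the retrained block, a 224 × 224 × 3 input image for the baseline, one ciphertext per value and per plaintext modulus, and 65,536 64-bit integers per ciphertext. These values were chosen here because they match the quoted figures, so the sketch is a consistency check rather than a statement of the paper's parameters.

```python
BYTES_PER_INT = 8
NUM_MODULI = 15                 # plaintext moduli, as stated in the text
INTS_PER_CIPHERTEXT = 65_536    # assumption: reproduces the reported sizes

def message_gb(num_values: int) -> float:
    """Encrypted size in GB with one ciphertext per value and per plaintext modulus."""
    return num_values * INTS_PER_CIPHERTEXT * BYTES_PER_INT * NUM_MODULI / 1e9

dfe_input = message_gb(7 * 7 * 2048)     # assumed activation volume entering the top block
cnn8_input = message_gb(224 * 224 * 3)   # assumed RGB image at ImageNet resolution
output = message_gb(1)                   # single encrypted output value
ratio = cnn8_input / dfe_input
print(f"{dfe_input:.1f} GB, {cnn8_input:.1f} GB, {output * 1e3:.1f} MB, ratio {ratio:.2f}")
```

Under these assumptions the ratio between the two input sizes is exactly 1.5, matching the reduction in data transfer cost reported above.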

X-I Experimental Correctness

We validate on 100 images of our CIFAR-10 and Retina experiments to ascertain the correctness of our decrypted outputs. Once more, we find around 0.05% error due to the floating/fixed point conversion.

X-J Overall Comparison

In a direct comparison between the baseline CNN-8 model and our transfer learning DFE-RN-50 model, we observe an across-the-board improvement. DFE-RN-50 has higher accuracy, significantly reduces both the count of HOPs and measured runtime, and produces smaller message sizes than our baseline model. We demonstrate the effectiveness of the sparsity-based optimization techniques to reduce computation time (7.8x speedup). We also show how other privacy concepts like differential privacy can be used to further improve the performance of our feature extraction architecture as we can see with the improved accuracy of DP-DFE-RN-50. To the best of our knowledge, this is the first implementation of homomorphic encryption and neural networks on a real-world medical imaging dataset.

XI Discussion

Encrypted inference is not a panacea for private machine learning. It has some obvious constraints the paper touches on in several sections, including the computational cost and network depth limitations. Additionally, it does not cover the problem of private training and of defending against machine learning attacks. The encrypted inference paradigm is still vulnerable to black box attacks, as it still returns encrypted outputs that are otherwise unaffected. For example, membership inference [24] and model stealing [65] attacks can be performed with only access to the outputs of the model.

XII Conclusion

Personal privacy is increasingly under threat in the modern digital age, and machine learning models continue to fuel the appetite for more individual data and information. Homomorphic encryption holds great promise due to the security guarantees it can provide against both eavesdroppers and service hosts. Unlocking its potential will require reducing the high overhead of arithmetic operations prevalent in neural networks.

In this work, we introduced and evaluated techniques for accelerating CryptoNets [29]. The fundamental approach of our method is to leverage sparsity by using (i) efficient polynomial approximations for the activation functions and (ii) pruning and quantization tailored to the encryption scheme for significant performance gains. We show that our method, Faster CryptoNets, is faster than CryptoNets without much loss of test set accuracy. We also demonstrate how our technique can be deployed in a privately trained feature extraction setting, possibly inspiring future avenues of work where different privacy concepts can be combined to deliver an end-to-end privacy-safe training and inference pipeline. To the best of our knowledge, this is the first implementation of homomorphic encryption on a real-world medical imaging dataset.

Recent developments can produce even greater improvements. Structured sparsity [67], filter-level pruning methods [46], and efficient batching schemes and hardware acceleration techniques [49] could further accelerate evaluation of deeper networks. In particular, more optimal encoding schemes could help reduce the message sizes of the encrypted data and provide more efficient parameters for the encryption scheme. We hope this work will inspire future lines of research in efficient and privacy-safe machine learning.

References

  • [1] A. Hoover, V. Kouznetsova, and M. Goldbaum, “Locating blood vessels in retinal images by piece-wise threshold probing of a matched filter response,” 2000.
  • [2] M. Abadi, A. Chu, I. Goodfellow, H. B. McMahan, I. Mironov, K. Talwar, and L. Zhang, “Deep learning with differential privacy,” pp. 308–318, 2016. [Online]. Available: http://doi.acm.org/10.1145/2976749.2978318
  • [3] M. Abadi, Úlfar Erlingsson, I. Goodfellow, H. B. McMahan, N. Papernot, I. Mironov, K. Talwar, and L. Zhang, “On the protection of private information in machine learning systems: Two recent approaches,” in IEEE 30th Computer Security Foundations Symposium (CSF), 2017, pp. 1–6. [Online]. Available: https://arxiv.org/abs/1708.08022
  • [4] R. Agrawal and R. Srikant, “Privacy-preserving data mining,” in Sigmod Record.   ACM, 2000.
  • [5] S. Akleylek, N. Bindel, J. Buchmann, J. Krämer, and G. A. Marson, “An efficient lattice-based signature scheme with provably secure instantiation,” in International Conference on Cryptology in Africa.   Springer, 2016.
  • [6] M. Albrecht, M. Chase, H. Chen, J. Ding, Goldwasser et al., “Homomorphic encryption standard,” 2018.
  • [7] American Academy of Ophthalmology, “International clinical diabetic retinopathy disease severity scale detailed table.” 2002.
  • [8] H. Bae, J. Jang, D. Jung, H. Jang, H. Ha, and S. Yoon, “Security and Privacy Issues in Deep Learning,” ArXiv e-prints, Jul. 2018.
  • [9] J.-C. Bajard, J. Eynard, A. Hasan, and V. Zucca, “A full rns variant of fv like somewhat homomorphic encryption schemes,” in Selected Areas in Cryptography, 2016.
  • [10] Y. Bengio, “Deep learning of representations for unsupervised and transfer learning,” in Proceedings of ICML Workshop on Unsupervised and Transfer Learning, 2012, pp. 17–36.
  • [11] F. Bourse, M. Minelli, M. Minihold, and P. Paillier, “Fast homomorphic evaluation of deep discretized neural networks,” IACR Cryptology ePrint Archive, vol. 2017, p. 1114, 2017.
  • [12] Z. Brakerski and V. Vaikuntanathan, “Efficient fully homomorphic encryption from (standard) lwe,” Journal on Computing, 2014.
  • [13] J. Brickell and V. Shmatikov, “Privacy-preserving classifier learning,” in International Conference on Financial Cryptography and Data Security.   Springer, 2009.
  • [14] N. Brisebarre, J.-M. Muller, and A. Tisserand, “Computing machine-efficient polynomial approximations,” Transactions on Mathematical Software, 2006.
  • [15] H. Chabanne, A. de Wargny, J. Milgram, C. Morel, and E. Prouff, “Privacy-preserving classification on deep neural network.” IACR Cryptology ePrint Archive, 2017.
  • [16] K. Chaudhuri, C. Monteleoni, and A. D. Sarwate, “Differentially private empirical risk minimization,” JMLR, 2011.
  • [17] H. Chen, K. Han, Z. Huang, A. Jalali, and K. Laine, “Simple encrypted arithmetic library v2.3.0,” Microsoft Research TechReport, 2017.
  • [18] J. H. Cheon, J. Jeong, J. Lee, and K. Lee, “Privacy-preserving computations of predictive medical models with minimax approximation and non-adjacent form,” in International Conference on Financial Cryptography and Data Security.   Springer, 2017.
  • [19] J. Chung, C. Gulcehre, K. Cho, and Y. Bengio, “Empirical evaluation of gated recurrent neural networks on sequence modeling,” arXiv, 2014.
  • [20] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “Imagenet: A large-scale hierarchical image database,” in Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on.   IEEE, 2009, pp. 248–255.
  • [21] C. Dugas, Y. Bengio, F. Bélisle, C. Nadeau, and R. Garcia, “Incorporating second-order functional knowledge for better option pricing,” in NIPS, 2001.
  • [22] C. Dwork, “Differential privacy: A survey of results,” in International Conference on Theory and Applications of Models of Computation.   Springer, 2008.
  • [23] T. ElGamal, “A public key cryptosystem and a signature scheme based on discrete logarithms,” Transactions on Information Theory, 1985.
  • [24] M. Fredrikson, S. Jha, and T. Ristenpart, “Model inversion attacks that exploit confidence information and basic countermeasures,” in Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security, ser. CCS ’15.   New York, NY, USA: ACM, 2015, pp. 1322–1333. [Online]. Available: http://doi.acm.org/10.1145/2810103.2813677
  • [25] A. Gautier, Q. N. Nguyen, and M. Hein, “Globally optimal training of generalized polynomial neural networks with nonlinear spectral methods,” in NIPS, 2016.
  • [26] C. Gentry et al., “Fully homomorphic encryption using ideal lattices,” in STOC, 2009.
  • [27] Z. Ghodsi, T. Gu, and S. Garg, “Safetynets: Verifiable execution of deep neural networks on an untrusted cloud,” in NIPS, 2017.
  • [28] I. Giacomelli, S. Jha, M. Joye, C. D. Page, and K. Yoon, “Privacy-preserving ridge regression with only linearly-homomorphic encryption,” Cryptology ePrint Archive, 2017.
  • [29] R. Gilad-Bachrach, N. Dowlin, K. Laine, K. Lauter, M. Naehrig, and J. Wernsing, “Cryptonets: Applying neural networks to encrypted data with high throughput and accuracy,” in ICML, 2016.
  • [30] X. Glorot, A. Bordes, and Y. Bengio, “Deep sparse rectifier neural networks,” in AISTATS, 2011.
  • [31] H. Greenspan, B. van Ginneken, and R. M. Summers, “Guest editorial deep learning in medical imaging: Overview and future promise of an exciting new technique,” IEEE Transactions on Medical Imaging, vol. 35, no. 5, pp. 1153–1159, 2016.
  • [32] V. Gulshan, L. Peng et al., “Development and validation of a deep learning algorithm for detection of diabetic retinopathy in retinal fundus photographs,” JAMA, 2016.
  • [33] Y. Guo, A. Yao, and Y. Chen, “Dynamic network surgery for efficient dnns,” in NIPS, 2016.
  • [34] R. Hall, S. E. Fienberg, and Y. Nardi, “Secure multiple linear regression based on homomorphic encryption,” Journal of Official Statistics, 2011.
  • [35] S. Han, H. Mao, and W. J. Dally, “Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding,” ICLR, 2016.
  • [36] D. Harvey, “Faster arithmetic for number-theoretic transforms,” Journal of Symbolic Computation, 2014.
  • [37] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” CoRR, vol. abs/1512.03385, 2015. [Online]. Available: http://arxiv.org/abs/1512.03385
  • [38] E. Hesamifard, H. Takabi, and M. Ghasemi, “Cryptodl: Deep neural networks over encrypted data,” arXiv, 2017.
  • [39] G. E. Hinton and R. R. Salakhutdinov, “Reducing the dimensionality of data with neural networks,” Science, vol. 313, no. 5786, pp. 504–507, 2006.
  • [40] K. Hornik, “Approximation capabilities of multilayer feedforward networks,” Neural networks, 1991.
  • [41] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” in ICML, 2015.
  • [42] V. Kolesnikov and T. Schneider, “Improved garbled circuit: Free xor gates and applications,” in International Colloquium on Automata, Languages, and Programming.   Springer, 2008.
  • [43] A. Krizhevsky, V. Nair, and G. Hinton, “Cifar-10 (canadian institute for advanced research).” [Online]. Available: http://www.cs.toronto.edu/~kriz/cifar.html
  • [44] Y. LeCun, Y. Bengio et al., “Convolutional networks for images, speech, and time series,” The handbook of brain theory and neural networks, 1995.
  • [45] R. Livni et al., “On the computational efficiency of training neural networks,” in NIPS, 2014.
  • [46] J.-H. Luo, J. Wu, and W. Lin, “Thinet: A filter level pruning method for deep neural network compression,” arXiv, 2017.
  • [47] L. Ma and K. Khorasani, “Constructive feedforward neural networks using hermite polynomial activation functions,” Transactions on Neural Networks, 2005.
  • [48] F. D. McSherry, “Privacy integrated queries: An extensible platform for privacy-preserving data analysis,” SIGMOD, 2009.
  • [49] V. Migliore, C. Seguin, M. M. Real, V. Lapotre, A. Tisserand, C. Fontaine, G. Gogniat, and R. Tessier, “A high-speed accelerator for homomorphic encryption using the karatsuba algorithm,” ACM Trans. Embed. Comput. Syst., vol. 16, no. 5s, pp. 138:1–138:17, Sep. 2017. [Online]. Available: http://doi.acm.org/10.1145/3126558
  • [50] P. Mohassel and Y. Zhang, “Secureml: A system for scalable privacy-preserving machine learning,” in Symposium on Security and Privacy.   IEEE, 2017.
  • [51] M. Naehrig, K. Lauter, and V. Vaikuntanathan, “Can homomorphic encryption be practical?” in ACM Workshop on Cloud computing security.   ACM, 2011.
  • [52] N. Papernot, M. Abadi, U. Erlingsson, I. Goodfellow, and K. Talwar, “Semi-supervised knowledge transfer for deep learning from private training data,” ICLR, 2017.
  • [53] N. Papernot, S. Song, I. Mironov, A. Raghunathan, K. Talwar, and Ú. Erlingsson, “Scalable private learning with pate,” arXiv preprint arXiv:1802.08908, 2018.
  • [54] F. Piazza, A. Uncini, and M. Zenobi, “Artificial neural networks with adaptive polynomial activation function,” 1992.
  • [55] P. Ramachandran, B. Zoph, and Q. V. Le, “Swish: a self-gated activation function,” arXiv, 2017.
  • [56] R. L. Rivest, L. Adleman, and M. L. Dertouzos, “On data banks and privacy homomorphisms,” Foundations of secure computation, 1978.
  • [57] B. D. Rouhani, M. S. Riazi, and F. Koushanfar, “Deepsecure: Scalable provably-secure deep learning,” arXiv, 2017.
  • [58] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein et al., “Imagenet large scale visual recognition challenge,” International Journal of Computer Vision, 2015.
  • [59] A. Sanyal, M. J. Kusner, A. Gascón, and V. Kanade, “TAPAS: Tricks to Accelerate (encrypted) Prediction As a Service,” ArXiv e-prints, Jun. 2018.
  • [60] R. Shokri and V. Shmatikov, “Privacy-preserving deep learning,” in Proceedings of the 22nd ACM SIGSAC conference on computer and communications security.   ACM, 2015, pp. 1310–1321.
  • [61] S. Song, K. Chaudhuri, and A. D. Sarwate, “Stochastic gradient descent with differentially private updates,” in Global Conference on Signal and Information Processing (GlobalSIP), 2013 IEEE.   IEEE, 2013, pp. 245–248.
  • [62] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, “Rethinking the inception architecture for computer vision,” CoRR, vol. abs/1512.00567, 2015. [Online]. Available: http://arxiv.org/abs/1512.00567
  • [63] N. Tajbakhsh, J. Y. Shin, S. R. Gurudu, R. T. Hurst, C. B. Kendall, M. B. Gotway, and J. Liang, “Convolutional neural networks for medical image analysis: Full training or fine tuning?” IEEE transactions on medical imaging, vol. 35, no. 5, pp. 1299–1312, 2016.
  • [64] B. Toghi and D. Grover, “MNIST Dataset Classification Utilizing k-NN Classifier with Modified Sliding Window Metric,” ArXiv e-prints, Sep. 2018.
  • [65] F. Tramèr, F. Zhang, A. Juels, M. K. Reiter, and T. Ristenpart, “Stealing machine learning models via prediction apis,” CoRR, vol. abs/1609.02943, 2016. [Online]. Available: http://arxiv.org/abs/1609.02943
  • [66] M. Voets, K. Møllersen, and L. Ailo Bongo, “Replication study: Development and validation of deep learning algorithm for detection of diabetic retinopathy in retinal fundus photographs,” ArXiv e-prints, Mar. 2018.
  • [67] W. Wen et al., “Learning structured sparsity in deep neural networks,” in NIPS, 2016.
  • [68] P. Xie, M. Bilenko, T. Finley, R. Gilad-Bachrach, K. Lauter, and M. Naehrig, “Crypto-nets: Neural networks over encrypted data,” arXiv, 2014.
  • [69] A. C.-C. Yao, “How to generate and exchange secrets,” in Foundations of Computer Science.   IEEE, 1986.
  • [70] X. Zhang, S. Ji, and T. Wang, “Differentially private releasing via deep generative model,” CoRR, vol. abs/1801.01594, 2018. [Online]. Available: http://arxiv.org/abs/1801.01594
  • [71] A. Zhou, A. Yao, Y. Guo, L. Xu, and Y. Chen, “Incremental network quantization: Towards lossless cnns with low-precision weights,” ICLR, 2017.