I Introduction
As cloud-based machine learning services become more widespread, there is a strong need to ensure the confidentiality of sensitive healthcare records, financial data, and other information that enters third-party pipelines. Traditional machine learning algorithms require access to raw data, which opens up potential security and privacy risks. For some fields such as healthcare, regulations may preclude the use of external prediction services if the technology cannot provide the necessary privacy guarantees.
In this work, we address the task of encrypted inference for secure machine learning services. We make the assumption that the third-party provider already has a trained model, as is common in "machine learning as a service" paradigms. Using cryptographic techniques, an organization such as a research hospital or fraud detection company will be able to offer prediction services to users while ensuring security guarantees for all parties involved. We follow the procedure set by previous work [29, 68] and employ homomorphic encryption (HE) to convert a trained machine learning model into an HE-enabled model.
Homomorphic encryption [56] allows a machine learning model to perform calculations over encrypted data. By design, the output prediction is also encrypted, which prevents the input or output from leaking information to the model's host. As shown in Figure 1, the model does not decrypt the data, nor is the private key needed [12].
Several challenges prevent widespread adoption of encrypted machine learning. A major bottleneck is computational complexity. Inference on plain networks is performed on the order of milliseconds, while encrypted networks require minutes or hours per example [29, 38]. Also, the reduced arithmetic set of HE prevents the use of modern activation functions [15], necessitating the use of simpler, lower-performance functions.
I-A Contributions
We propose Faster CryptoNets, a method for encrypted inference on the order of seconds. This is a significant improvement over the existing state-of-the-art, which performs inference on the order of minutes. Our contributions accelerate the homomorphic evaluation of deep learning models on encrypted data using sparse representations throughout the neural network. Additionally, we are able to efficiently approximate modern activation functions. Finally, we show how this technique can be combined with private training techniques in a plausible real-world scenario.
By intelligently pruning the network parameters, we can avoid many multiplication operations, a major contributor to computational complexity. We can progressively quantize the remaining network parameters such that the plaintext encodings achieve maximum sparsity. Also, given that the activation function is the single most expensive operation of the network, we derive an optimal, quantized polynomial approximation to the activation function, also with maximally sparse encodings. We empirically show a significant improvement in the runtime of the network on MNIST. We perform additional experiments on larger datasets to demonstrate the viability and performance gain on practical tasks. We use a feature-extraction-based framework to reduce the number of layers requiring encrypted computation, while using differentially private training to achieve competitive accuracy on real-world datasets.
II Related Work
Privacy-preserving machine learning models attempt to address computation and statistical modeling of private data [4]. Privacy is preserved when two conditions are met: (i) the end-user learns nothing about the model and (ii) the model learns nothing from the data [13]. Differential privacy, multi-party computation (MPC), and homomorphic encryption are different methods to preserve privacy.
Differential privacy allows statistics to be computed over a dataset without revealing information about individual records [22, 16]. A common method is to apply noise to individual examples to obfuscate statistical differences that might be distinguishable [52]. However, differential privacy is better suited for the training phase. At test time, adding noise to a single example may change the prediction.
Secure multiparty computation enables multiple parties to jointly compute a function over their inputs while keeping their inputs private. This has been explored using Garbled Circuits [69] in the works of [57, 42] and [50]. These methods often involve a high communication complexity with significant bandwidth costs.
Fully homomorphic encryption (FHE) was proposed by [26] and allows anyone to compute over encrypted data without decrypting it [51]. A weaker version of FHE, termed leveled homomorphic encryption (LHE), permits a subset of arithmetic operations on a depth-bounded arithmetic circuit [12]. While HE has been explored for machine learning applications, many works focus on simpler models such as linear [34], logistic [18], and ridge regression [28]. CryptoNets [29] was one of the first works to implement HE in a neural network setting. More recently, [15] and [38] extended this to deeper network architectures and developed additional polynomial approximations to the activation function that leveraged batch normalization for stability.
Other works have explored the broader use of polynomial activation functions. [54] and [47] used a polynomial function in the non-encrypted domain to some success. The original theory dates back to [40], who argued that as long as the activation function is arbitrarily bounded and non-constant, the neural network is a universal approximator. Some prior work even suggests that neural networks equipped with polynomial functions have the same representational power as their non-polynomial counterparts [25, 45]. In §IV and §V, we explore these ideas in greater detail.
Recent works have proposed techniques that accelerate neural network inference on encrypted data. Sanyal et al. [59] use sparsification techniques via binarized neural networks, achieving a speedup of around 30x in wall-clock time on MNIST, similar to our technique. Bourse et al. [11] opt for an approach that leverages scale invariance to allow unrestricted depth of neural networks. The technique we propose is distinct from these approaches due to its use of the encoding scheme to accelerate multiplicative operations, in contrast to the previous approaches, which bypass expensive operations using the sign activation function. Our approach is advantageous in that it is more compatible with common neural network components; sign activation functions are known to cause difficulty with convergence, and the scale-invariant approach of [11] precludes the use of convolutional layers. We do not present detailed comparisons to these works in our analysis due to these fundamental architectural differences, and opt for a direct comparison to CryptoNets to clearly demonstrate in which layers, and with which operations, our speedups are derived.

III Threat Model
Machine Learning as a Service (MLaaS) [8] is a framework where cloud providers offer machine learning training or inference hosted on the cloud. In our scenario, we consider an MLaaS inference pipeline, where users send data to a remote server and receive predictions performed by machine learning models. The machine learning model is pretrained on a proprietary dataset.
A universal threat in multi-party situations is the inherent risk of data transmission, either by interception or side-channel attacks. This threat can be mitigated to a large extent by using strong cryptographic and signature protocols to protect the data in transmission. However, a concern that is much harder to alleviate involves the threat of the cloud host collecting and utilizing the transmitted data without authorization [8]. In a naive scheme, a user sends encrypted data to the cloud, but also has to provide a key so the server can decrypt the data and compute an output with a machine learning algorithm before sending the encrypted prediction back to the user. The cloud host must have access to the plain data, and it is hard to guarantee or prove to the user that the data is not kept on the server, where it can either be sold to third parties or be stolen by attackers who gain access to the data.
Homomorphic encryption provides a solution to both problems. By design, the transmitted data is protected using a strong encryption scheme. It also enables "oblivious inference", where a cloud host operates on data that it is oblivious to. If the service provider is only allowed to compute on the encrypted data to produce an encrypted output, without ever decrypting the data at any step, it will never have access to the plain data, guaranteeing data privacy from the cloud provider.
IV Preliminaries
A homomorphism is a structure-preserving transformation between two algebraic structures, which can be leveraged by cryptosystems to allow for arithmetic operations on encrypted data. Let G be a cyclic group of order q with generator g. Let h = g^s, for a randomly sampled secret s, be the public key. Consider the ElGamal encryption scheme [23], which uses a map E(m) = (g^r, m · h^r) for random r. The map preserves the multiplicative structure of the integers, such that E(m1) · E(m2) = E(m1 · m2), where · is the multiplication operation in G.
The leveled homomorphic encryption scheme that we present below has a more complex algebraic structure, and supports both additive and multiplicative homomorphisms, but this example can serve as a basis for understanding the role of homomorphic encryption in our network design.
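As a concrete illustration of this multiplicative homomorphism, the following toy sketch implements ElGamal over Z_p^*. The parameters are illustrative and far too small to be secure, and this is not the scheme used in the rest of the paper:

```python
# Toy ElGamal over Z_p^* illustrating the multiplicative homomorphism:
# E(m1) * E(m2) decrypts to m1 * m2 (mod p). Insecure toy parameters.
import random

p = 467                          # small prime modulus (illustrative only)
g = 2                            # group element used as generator here
s = random.randrange(1, p - 1)   # secret key
h = pow(g, s, p)                 # public key h = g^s

def encrypt(m):
    r = random.randrange(1, p - 1)
    return (pow(g, r, p), (m * pow(h, r, p)) % p)

def decrypt(ct):
    c1, c2 = ct
    # c2 * c1^{-s} = m * g^{sr} * g^{-sr} = m  (inverse via Fermat)
    return (c2 * pow(c1, p - 1 - s, p)) % p

def mult(ct_a, ct_b):
    # componentwise ciphertext product encrypts the plaintext product
    return ((ct_a[0] * ct_b[0]) % p, (ct_a[1] * ct_b[1]) % p)

m1, m2 = 7, 11
assert decrypt(mult(encrypt(m1), encrypt(m2))) == (m1 * m2) % p
```

Note that the product of two ciphertexts decrypts correctly without the host ever seeing m1 or m2, which is the property the FV-RNS scheme below generalizes to both addition and multiplication.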
IV-A Notation
Let R denote the polynomial ring Z[x]/(x^n + 1). We let a ← S denote uniformly random sampling of a from an arbitrary set S, and ⌊(t/q) · a⌉ denote a coefficient-wise division and rounding of the polynomial a with respect to integer moduli t and q. Let [a]_q denote the reduction of the coefficients of the polynomial a modulo q, and let Δ denote ⌊q/t⌋.
IV-B Encryption Scheme
Bajard et al. [9] proposed an encryption scheme, FV-RNS, which is a residue number system (RNS) variant of the FV encryption scheme. In FV-RNS, plaintexts are elements of the polynomial ring R_t = Z_t[x]/(x^n + 1), where t is the plaintext modulus and n is the maximum degree of the polynomial, commonly selected to be a power of 2. The plaintext elements are mapped to ciphertexts in R_q in the encryption scheme, with q as the ciphertext coefficient modulus. For a logarithm base w, let ℓ = ⌊log_w q⌋, so that ℓ + 1 is the number of terms in the base-w decomposition of polynomials in R_q that is used for relinearization.
Let χ denote the truncated discrete Gaussian distribution. The secret key is generated as s ← R_2, with coefficients in {0, 1}. The public key is generated by sampling a ← R_q and e ← χ and constructing pk = ([-(a · s + e)]_q, a). The evaluation keys are generated by sampling a_i ← R_q and e_i ← χ and constructing evk_i = ([-(a_i · s + e_i) + w^i · s^2]_q, a_i) for each i ∈ {0, ..., ℓ}. A plaintext m is encrypted by sampling u ← R_2 with coefficients in {0, 1} and e_1, e_2 ← χ, and letting ct = ([Δ · m + pk_0 · u + e_1]_q, [pk_1 · u + e_2]_q). A ciphertext ct = (c_0, c_1) is decrypted as m = [⌊(t/q) · [c_0 + c_1 · s]_q⌉]_t.
IV-C Arithmetic
The addition of two ciphertexts ct = (c_0, c_1) and ct' = (c_0', c_1') is ct_add = ([c_0 + c_0']_q, [c_1 + c_1']_q). The multiplication of two ciphertexts ct and ct' occurs by constructing

c_0'' = [⌊(t/q) · c_0 · c_0'⌉]_q,  c_1'' = [⌊(t/q) · (c_0 · c_1' + c_1 · c_0')⌉]_q,  c_2'' = [⌊(t/q) · c_1 · c_1'⌉]_q.

We express c_2'' in base w as c_2'' = Σ_{i=0}^{ℓ} c_2''^{(i)} · w^i. We then let c_0''' = c_0'' + Σ_{i=0}^{ℓ} evk_i[0] · c_2''^{(i)} and c_1''' = c_1'' + Σ_{i=0}^{ℓ} evk_i[1] · c_2''^{(i)}, which forms the product ciphertext ct_mult = ([c_0''']_q, [c_1''']_q).

The addition of ciphertext ct = (c_0, c_1) and plaintext m is the ciphertext ([c_0 + Δ · m]_q, c_1). The multiplication of ciphertext ct and plaintext m is the ciphertext ([c_0 · m]_q, [c_1 · m]_q).
The advantage of the residue number system variant is that the coefficient modulus can be decomposed into several small moduli to avoid multiple-precision operations on the polynomial coefficients in the homomorphic operations, which improves the efficiency of evaluation.
IV-D Integer Encoder
To encode real numbers involved in the computation, we choose a fixed precision for the values (15 bits) and scale each value by the corresponding power of 2 to get an integer for use with the encoder described below. After decryption, we can divide by the accumulated scaling factor to obtain a real value for the prediction. The encoder consists of a base-2 integer encoder [17]. For a given integer z, consider the binary expansion of |z|. The coefficients of the polynomial in the plaintext ring are the bits of this expansion if z ≥ 0, and their negations otherwise.
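A minimal sketch of this encoder (the helper names are ours; the actual implementation lives in the HE library [17]):

```python
# Base-2 integer encoder sketch: the binary expansion of an integer
# becomes the coefficients of a plaintext polynomial; decoding evaluates
# the polynomial at x = 2. Negative integers use negated coefficients.
def encode(z):
    sign = -1 if z < 0 else 1
    z = abs(z)
    coeffs = []
    while z:
        coeffs.append(sign * (z & 1))   # coefficient i multiplies x^i
        z >>= 1
    return coeffs

def decode(coeffs):
    return sum(c * (2 ** i) for i, c in enumerate(coeffs))

# Fixed-point handling (15-bit precision as in the text): scale to an
# integer before encoding, divide by the scale factor after decoding.
scale = 2 ** 15
x = 0.8125
assert decode(encode(round(x * scale))) / scale == 0.8125
```

Decoding divides by the accumulated scale factor, mirroring the post-decryption step described above.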
V Method
V-A Sparse Polynomial Multiplication
The convolutional and fully connected layers of a neural network require a substantial number of multiplications involving both the ciphertext inputs and the plaintext parameters of the model. Each operation involves computing the product of two polynomials with up to n nonzero coefficients. While a brute-force implementation would require O(n^2) time to complete, homomorphic encryption methods are able to accomplish this in O(n log n) when certain conditions are met. Assuming that the coefficient modulus q is chosen such that q - 1 is divisible by 2n, we can invoke the Number Theoretic Transform to achieve O(n log n) [36].
Our contributions leverage the following insight: a substantial improvement in efficiency occurs when the plaintext multiplier is ±2^k for some k. The polynomial that encodes this integer is ±x^k, a monomial multiplier. For such parameters, sparse polynomial multiplication [5] has been shown to use only n coefficient multiplications and modular reductions (see Algorithm 1).
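The following sketch illustrates why such plaintexts are cheap: in the ring Z_q[x]/(x^n + 1), multiplying by a monomial x^k is a negacyclic rotation that touches each coefficient once, and it agrees with the schoolbook ring product. Toy parameters; the function names are ours:

```python
# Multiplying by x^k in Z_q[x]/(x^n + 1): a negacyclic rotation with a
# sign flip on wraparound (since x^n = -1). O(n) versus the O(n^2)
# schoolbook product used here as a correctness reference.
def monomial_mult(a, k, q):
    n = len(a)
    out = [0] * n
    for i, c in enumerate(a):
        j = i + k
        if j < n:
            out[j] = c % q
        else:                         # wrap past degree n-1: x^n = -1
            out[j - n] = (-c) % q
    return out

def schoolbook_mult(a, b, q):
    n = len(a)
    out = [0] * n
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            k, s = (i + j, 1) if i + j < n else (i + j - n, -1)
            out[k] = (out[k] + s * ai * bj) % q
    return out

q, n = 97, 8
a = [3, 1, 4, 1, 5, 9, 2, 6]
two_pow = [0, 0, 1, 0, 0, 0, 0, 0]    # the plaintext 2^2 encodes as x^2
assert monomial_mult(a, 2, q) == schoolbook_mult(a, two_pow, q)
```

A dense plaintext forces the full product; a power-of-2 plaintext reduces it to the rotation above, which motivates the quantization scheme in the next subsection.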
V-B Network Pruning and Quantization
The parameters of a neural network can be iteratively removed and clustered without affecting accuracy. [35] developed a compression method that leverages these techniques. Since then, new pruning and quantization techniques have been proposed [46]. We leverage these techniques to reduce the number of weights that contribute to the multiplication count, and convert the weights to powers of 2, which have sparse polynomial representations that reduce the cost of each multiplication. Together, these lead to significant reductions in inference time.
We first train a pruned version of the network with Dynamic Network Surgery (DNS) [33] that incorporates connection splicing. The remaining network parameters are quantized to powers of 2 following the incremental network quantization (INQ) procedure proposed by [71]. The INQ method consists of an iterative quantization strategy to preserve the original inference accuracy.
For each layer l, the layer's parameters have a corresponding binary pruning mask. The elements of the binary pruning mask are updated during gradient descent according to a discriminative measure of parameter importance, typically incorporating a magnitude-based measure such as the absolute value of the weight. We define two bounds, n_1 and n_2, which restrict the quantized values for each layer:

where n_2 is used to restrict the set of powers for our desired bit-width. Note that the resulting set of signed powers of 2, together with zero, is the set of possible quantized values for the parameters of layer l in the network. We define a monotonically increasing weight partition schedule using the discriminative measure to progressively quantize the weights. For example, one can quantize 50% of the weights, then 75%, then 87.5%, then 100%, retraining the remaining non-quantized weights at each step of the quantization procedure.
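The partition schedule can be sketched as follows. This is an illustrative simplification of the DNS + INQ procedure, not the authors' implementation: the exponent bounds are assumptions, and the retraining step between partitions is omitted:

```python
# Progressive power-of-2 quantization sketch (50% -> 75% -> 87.5% -> 100%):
# at each step the largest-magnitude unquantized weights are snapped to
# the nearest value in {0} U {±2^k : n2 <= k <= n1}. In the real INQ
# procedure the remaining float weights are retrained between steps.
import numpy as np

def nearest_pow2(w, n1=0, n2=-4):
    cands = np.array([0.0] + [s * 2.0 ** k
                              for k in range(n2, n1 + 1) for s in (1, -1)])
    return cands[np.argmin(np.abs(cands - w))]

def inq_quantize(weights, schedule=(0.5, 0.75, 0.875, 1.0), n1=0, n2=-4):
    w = np.array(weights, dtype=float)
    order = np.argsort(-np.abs(w))            # most important weights first
    done = np.zeros(len(w), dtype=bool)
    for frac in schedule:
        k = int(round(frac * len(w)))
        for idx in order[:k]:
            if not done[idx]:
                w[idx] = nearest_pow2(w[idx], n1, n2)
                done[idx] = True
        # (retraining of the still-unquantized weights would happen here)
    return w

q = inq_quantize([0.9, -0.3, 0.07, 0.26])
assert all(v == 0 or abs(v) in {2.0 ** k for k in range(-4, 1)} for v in q)
```

Every surviving weight is a signed power of 2 (or zero), so each plaintext multiplication in §V-A reduces to a monomial multiplication.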
V-C Approximating the Activation Function
Using our pruning and quantization scheme from §V-B, our next contribution lies in finding the optimal polynomial approximation for any activation function, given the constraint that the coefficients must be powers of 2. The activation function of a neural network is critical for convergence [30] and has been thoroughly explored in the literature [55]. With the goal of encrypted network inference, we must find an approximation that balances approximation error with practical usability. Inspired by [14], we derive the best polynomial approximation under this constraint.
Polynomials. Let f denote the activation function. Our task is to approximate f with a polynomial p of degree d, subject to the constraint that each coefficient of p is a power of 2. Define P_d as the set of all polynomials of degree less than or equal to d such that all coefficients are base-2. Let p* be the minimax approximation to f on some interval [a, b]. Let p̂ be the same as p*, but with all coefficients rounded to the nearest signed power of 2. Note that p̂ ∈ P_d.
Maximum Error & Minimax. The maximum difference (i.e., error) between two functions f and g is ‖f − g‖∞ = max_{x ∈ [a, b]} |f(x) − g(x)|. This provides a strong bound on the optimal polynomial approximation error. We state the minimax problem as follows. For a given activation function f, we seek to find the best polynomial p̂ such that,
p̂ = argmin_p max_{x ∈ [a, b]} |f(x) − p(x)|,  (1)
subject to the constraint,
p ∈ P_d, i.e., every coefficient of p is zero or a signed power of 2.  (2)
Finite Number of Solutions. We can construct a bounded polyhedron, where each tuple represents a polynomial p, and where the i-th component represents the degree-i coefficient. [14] show that the number of polynomials satisfying Equation 2 is finite if the polynomials are contained in this polyhedron. They also proposed an efficient scanning method to find the optimal polynomial approximation. Equipped with our newfound approximation, we can evaluate its effectiveness as an activation function in both non-encrypted and encrypted domains.
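A brute-force version of such a scan can be sketched as follows. The interval, coefficient range, and grid density are our assumptions for illustration; the scanning method of [14] is more efficient than this exhaustive search:

```python
# Exhaustive scan over quadratics whose coefficients are zero or signed
# powers of 2, minimizing the maximum error against an activation on an
# interval. Interval [-4, 4] and exponent range are illustrative.
import numpy as np

def best_pow2_quadratic(f, lo=-4.0, hi=4.0, kmin=-6, kmax=1):
    xs = np.linspace(lo, hi, 2001)
    fx = f(xs)
    cands = [0.0] + [s * 2.0 ** k
                     for k in range(kmin, kmax + 1) for s in (1, -1)]
    best, best_err = None, np.inf
    for a in cands:                  # scan all coefficient tuples
        for b in cands:
            for c in cands:
                err = np.max(np.abs(a * xs ** 2 + b * xs + c - fx))
                if err < best_err:
                    best, best_err = (a, b, c), err
    return best, best_err

relu = lambda x: np.maximum(x, 0.0)
(a, b, c), err = best_pow2_quadratic(relu)
# the candidate (1/8, 1/2, 1/4) already achieves max error 0.25 on [-4, 4],
# so the scan can do no worse than that
assert err <= 0.25 + 1e-9
```

Every returned coefficient is zero or a signed power of 2 by construction, so the resulting activation evaluates with monomial plaintext multiplications only.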
VI Experiments
VI-A Wall-Clock Runtime
The runtime refers to the wall-clock time required to perform inference on an encrypted image. This is the default metric reported in previous work on encrypted inference. However, wall-clock time is an imperfect metric for measuring improvements in encrypted inference. It is hardware-dependent, varying greatly with the available memory and computational power of the device, and it is also possibly encryption-scheme-dependent, with even the same encryption algorithm being implemented differently across libraries. In the next sections, we introduce a hardware-independent metric to evaluate our methods.
VI-B Explanation of HOPs
We report the number of homomorphic operations (HOPs) of our inference network. This is in contrast to previous work [29, 15, 38], which measured either throughput or wall-clock time, both of which are highly dependent on hardware specifications and software parallelization, and are not entirely reliable measures. The HOPs metric is similar to the FLOPs (floating point operations) metric used in scientific computing.
A homomorphic operation is defined as addition or multiplication involving a ciphertext, a plaintext, or both. The four classes of HOPs are (i) plaintext-ciphertext addition, (ii) ciphertext-ciphertext addition, (iii) plaintext-ciphertext multiplication, and (iv) ciphertext-ciphertext multiplication. While the exact implementation of HOPs may vary, we believe HOPs are a hardware-independent metric for performance analysis that enables a better comparison of models for encrypted inference, demonstrating whether speedup occurs due to a decreased number of operations or due to algorithmic speedup.
It is important to note that the different HOPs classes vary in cost. In general, multiplicative operations are significantly more expensive than additive operations, with ciphertext-ciphertext multiplications being the most costly operations found in neural networks. Throughout our analysis, we break down our HOPs into separate operations, following the rule of thumb that the benefit of removing multiplicative HOPs outweighs the cost of adding additive HOPs.
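As a back-of-the-envelope illustration of the metric, consider counting HOPs for a fully connected layer. This is our own sketch under simplifying assumptions (one multiply per nonzero weight, one bias add per output); exact counts depend on layer structure and implementation:

```python
# Rough HOP count for a dense n_in x n_out layer operating on ciphertext
# inputs with plaintext weights: one PT-CT multiply per nonzero weight,
# CT-CT adds to accumulate each output, one PT-CT add per bias.
def fc_hops(n_in, n_out, sparsity=1.0):
    nonzero = round(n_in * n_out * sparsity)
    return {
        "pt_ct_mults": nonzero,
        "ct_ct_adds": max(nonzero - n_out, 0),  # (k-1) adds per output
        "pt_ct_adds": n_out,                    # bias additions
    }

dense = fc_hops(1250, 100)                      # FC1-sized layer, unpruned
pruned = fc_hops(1250, 100, sparsity=0.0568)    # at FC1's final sparsity
assert pruned["pt_ct_mults"] < dense["pt_ct_mults"]
```

The expensive multiplicative HOPs scale directly with the number of nonzero weights, which is why the pruning of §V-B translates into the per-layer reductions reported below.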
VI-C Datasets
We use the MNIST dataset of handwritten digits [44], which contains grayscale images of Arabic numerals 0 to 9 (i.e., a 10-class classification task) with a standard split of 50,000 training images and 10,000 test images. While MNIST is arguably a simple dataset, it has remained the standard benchmark for homomorphic inference tasks [29, 38].
VI-D Network Architecture
The network architecture used for MNIST inference is presented below. The architecture itself is a slight variant of the CryptoNets [29] architecture that incorporates batch normalization layers to support a greater variety of activation functions. The multiplicative depth is unchanged. As shown in Figure 2, our approximation error is minimized close to zero, and batch normalization encourages the pre-activation values to fit in this range. As confirmed in [15] and [38], reducing the variance in the input values to the activation layer decreases the approximation error of the network. Overall, our model is a convolutional neural network [44] consisting of convolutional layers, activation functions, scaled average pooling, batch normalization, and fully connected layers.

1. Convolutional Layer. The input image is 28 x 28. There are 20 kernels of size 5 x 5, with a stride of 2 and padding of 1.
2. Batch Normalization Layer. This layer applies the batch normalization weights and biases to each input value.
3. Activation Layer. This layer applies the approximate activation function to each input value.
4. Scaled Average Pool Layer. This layer has 3 x 3 windows, with a stride of 2, padding of 1, and output size of 5 x 13 x 13.
5. Convolutional Layer. This layer has 50 kernels of size 20 x 5 x 5, with a stride of 1, and zero padding.
6. Scaled Average Pool Layer. This layer has 3 x 3 windows and a stride of 2, padding of 1, and output size of 50 x 5 x 5.
7. FullyConnected Layer. This layer has parameters of size 1250 x 100 for matrix multiplication with respect to inputs.
8. Batch Normalization Layer. This layer applies the batch normalization weights and biases to each input value.
9. Activation Layer. This layer applies the approximate activation function to each input value.
10. FullyConnected Layer. This layer has parameters of size 100 x 10 for matrix multiplication with respect to inputs.
VI-E Encryption Scheme
The parameters for the FV-RNS encryption scheme are a polynomial coefficient count n and plaintext moduli t_1 = 1099511922689 and t_2 = 1099512004609. The values of q are selected for 128-bit security (log_2 q = 219). This choice of coefficient modulus meets the security standards established by the Homomorphic Encryption Standardization Workshop [6].
VI-F Hardware/Software Setup
The machine used for the MNIST experiments has an Intel Core i7-5930K CPU at 3.5 GHz with 48 GB RAM on Ubuntu 17.10. The HE library was SEAL v2.3.04 [17], modified by us to support our proposed method.
VI-G Optimization Hyperparameters
We provide the hyperparameter settings used to train our non-encrypted network. A batch size of 64 was used, and the model was trained for 30 epochs. The learning rate followed a step schedule with a step size of 10 epochs. The model was trained with stochastic gradient descent with a momentum of 0.9. For the square function, gradients were clipped at 0.25. He weight initialization was used for the convolutional layers.
VI-H Dynamic Network Surgery Hyperparameters
We report the hyperparameters of our dynamic network surgery operations. The sparsity denotes the final fraction of non-pruned connections over the total connections; the crate denotes the compression rate used to set the importance threshold below which a connection is removed.
We report metrics for each layer. The conv1 layer had a sparsity of 0.1440 and a crate of 1.5. The conv2 layer had a sparsity of 0.0701 and a crate of 1.65. The dense fc1 layer had a sparsity of 0.0568 and a crate of 1.65. The dense fc2 layer had a sparsity of 0.1480 and a crate of 1.5. All layers stopped at iteration 10,000.
VI-I Approximation Results
(Figure 2 caption: Each activation function is plotted with three approximations: the minimax estimate, the rounded minimax estimate, and our method, the quantized minimax approximation. Bottom: error of our method compared to the baseline. The blue shaded area corresponds to the post-batch-normalization region during the training procedure.)

Prior work suggests that neural networks equipped with polynomial functions have the same representational power as their non-polynomial counterparts [25, 45]. Faster CryptoNets uses quadratic activation functions that approximate modern activations with varying degrees of complexity and expressivity. Our proposed method allows us to construct an optimal, quantized polynomial approximation of any arbitrary function. In our experiments, we consider ReLU [30], Softplus [21], and Swish [55]. We model all activation functions with a 2nd-degree polynomial. While higher-degree polynomials can decrease the approximation error, they also require more HOPs.
We note that [27] showed that the gradient of the square function can be large; their solution was to apply gradient clipping to improve model convergence. While this is a viable solution for recurrent networks [19], clipping gradients in a shallow network such as ours may indicate model instability and may not work for deeper variants. To avoid this, we do not use the square activation function.

VI-J Polynomial Approximation Equations
We list the polynomial approximations to the Swish, Softplus, and ReLU activation functions.
Swish

Minimax:

Rounded Minimax:

Quantized:
Softplus

Minimax:

Rounded Minimax:

Quantized:
ReLU

Minimax:

Rounded Minimax:

Quantized:
VI-K Error Minimization
The purpose of the error minimization experiment is to determine which activation function produces the lowest approximation error under our quantization constraints. We evaluate the effectiveness of multiple approximation schemes including our method.
VI-L Activation Approximation Accuracy
We present Table I, which contains train and test accuracy for all of the activation functions over three trials. The activation functions we considered include ReLU, square, Swish, and Softplus, using the original function, the approximated function (-A), and the quantized approximated function (-AQ).
Trial 1  Trial 2  Trial 3  Mean  Stddev  
Activation  Train  Test  Train  Test  Train  Test  Train  Test  Train  Test 
Square  99.80  99.08  99.81  99.14  99.8  99.29  99.80  99.17  0.01  0.11 
ReLU  99.65  99.20  99.59  99.14  99.62  99.05  99.62  99.13  0.03  0.08 
ReLUapprox  99.57  99.07  99.60  99.14  99.58  99.07  99.58  99.09  0.02  0.04 
Softplus  99.42  99.17  99.37  99.06  99.41  99.05  99.4  99.09  0.03  0.07 
SoftplusA  99.34  99.05  99.39  98.98  99.38  98.98  99.37  99.00  0.03  0.04 
SoftplusAQ  99.17  98.92  99.13  98.92  99.17  98.87  99.16  98.9  0.16  0.03 
Swish  99.63  99.16  99.64  99.22  99.64  99.02  99.64  99.13  0.01  0.10 
SwishA  99.56  99.07  99.59  99.13  99.58  99.07  99.58  99.09  0.02  0.03 
SwishAQ  99.56  99.09  99.60  99.12  99.60  99.08  99.59  99.10  0.02  0.02 
Table I: Accuracy for each activation function, its approximation (-A), and its quantized approximation (-AQ, our method). For each activation function, three models were trained with different random seeds. The mean accuracy and standard deviation are shown.
Figure 2 shows our approximation methods applied to Swish, ReLU, and Softplus. The functions are plotted on the top row. Most approximations are able to fit the original function within the interval. The bottom row of Figure 2 shows the approximation error of the rounded and quantized approximations for different pre-activation values. Overall, Swish has lower error than ReLU and Softplus. If we can constrain the pre-activation values to fall within the interval, our model will have better approximations. Conveniently, batch normalization transforms the pre-activation values into a distribution with zero mean and unit variance [41], which reduces the overall error of the approximation [15]. The shaded area under the curve in Figure 2 shows the approximation error within the interval; Swish has lower error than both ReLU and Softplus.

In Figure 3, we investigate the correctness of our proposed activation approximation method by plotting the pre-activation and post-activation values of different layers for both the regular and approximated Swish functions. The post-activation graphs in Figure 3 for Swish show a consistent minimum value. We analytically compute the theoretical minimum value for Swish f(x) = x · σ(x) by taking the first-order derivative f'(x) = σ(x) + x · σ(x)(1 − σ(x)). Setting it to zero gives the equation 1 + x(1 − σ(x)) = 0, from which we can derive x* ≈ −1.278. Using x* to compute f(x*), we get an approximate minimum value of −0.278, which corroborates our empirical minimum values shown in Figure 3. We find that this minimum value remains consistent for the approximated Swish function as well, validating the correctness of our approximation method.
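The Swish minimum described above can be verified numerically. This is an illustrative check, not part of the inference pipeline:

```python
# Numerical check of the Swish minimum: f(x) = x * sigmoid(x) attains
# its minimum where 1 + x * (1 - sigmoid(x)) = 0, found here by bisection.
import math

sigmoid = lambda x: 1.0 / (1.0 + math.exp(-x))
swish = lambda x: x * sigmoid(x)
g = lambda x: 1.0 + x * (1.0 - sigmoid(x))   # first-order condition

lo, hi = -2.0, -1.0                          # g(lo) < 0 < g(hi)
for _ in range(60):
    mid = (lo + hi) / 2.0
    if g(lo) * g(mid) <= 0:
        hi = mid
    else:
        lo = mid

x_star = (lo + hi) / 2.0
assert abs(x_star - (-1.2785)) < 1e-3        # location of the minimum
assert abs(swish(x_star) - (-0.2784)) < 1e-3 # minimum value
```

The root x* ≈ −1.278 gives f(x*) ≈ −0.278, matching the empirical floor of the post-activation values.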
VI-M Detailed Breakdown of Homomorphic Operations
Table II: Detailed breakdown of HOPs per layer for CryptoNets.

Layer  HOPs  PT-CT Adds  CT-CT Adds  PT-CT Mults  CT-CT Mults
Conv1  42,757  845  20,956  20,956  — 
Act1  845  —  —  —  845 
Pool1  6,845  —  6,845  —  — 
Conv2  309,950  1,250  154,350  154,350  — 
Pool2  8,450  —  8,450  —  — 
FC1  241,192  100  120,546  120,546  — 
Act2  100  —  —  —  100 
FC2  1990  10  990  990  — 
Total  612,129  2,205  312,137  296,842  945 
Table III: Detailed breakdown of HOPs per layer for Faster CryptoNets.

Layer  HOPs  PT-CT Adds  CT-CT Adds  PT-CT Mults  CT-CT Mults
Conv1  8619  1,690  3,042  3,887  — 
Act1  5,070  845  1,690  1,690  845 
Pool1  6,845  —  6,845  —  — 
Conv2  22,950  1250  10,850  10,850  — 
Pool2  8,450  —  8,450  —  — 
FC1  14,354  100  7,077  7,177  — 
Act2  600  100  200  200  100 
FC2  306  10  148  148  — 
Total  67,194  3,995  38,302  23,952  945 
VI-N Comparison with Prior Work
Table IV: Comparison with prior work.

Criteria  Faster CryptoNets  CryptoNets  CryptoDL1  CryptoDL2 
PTCT Adds  3,995  2,205  30,750  161,546 
CTCT Adds  38,302  312,137  
PTCT Mults  23,952  296,842  
CTCT Mults  945  945  1,600  64,512 
Total HOPs  67,194  612,129  
Encrypt+Decrypt Time  6.7 sec  47.5 sec  16.7 sec  16.7 sec 
Inference Time  39.1 sec  249.6 sec  148.9 sec  320.0 sec 
Test Set Accuracy  98.71  98.95  98.52  99.52 
Message Size  411.1 MB  367.5 MB  336.7 MB  336.7 MB 
Encryption Scheme  FVRNS  YASHE  BGV  BGV 
The target use case of our work is inference on a single encrypted image (Figure 1). We believe this approach is more analogous to practical use cases, where the third-party host runs asynchronous inference for individual users. Additionally, [49] suggests that there are very significant drawbacks to batching, including having to select more numerous and restricted NTT points, forcing specific computations away from NTT, and adding large computational cost. Works focusing on accelerating neural networks neglect batching for similar reasons as ours ([59] does not use batching, and [11] uses batching to compress messages but not to improve throughput). Works that do batch inputs use schemes that are not very efficient in practice (discussed in [49]) and do not report the batching cost. A thorough performance analysis of batching binary vs. scalar messages across different libraries is beyond the scope of our paper but would be a great direction for future work. As such, we do not implement ciphertext batching techniques in this paper, although we find it worth noting that our technique does not preclude the use of batching. [49] introduces the Karatsuba algorithm, which supports batching with binary encoding, preserving the benefits of our method.
We refer to Table IV for accuracy and runtime results. The test set accuracy of our original model is 99.12%, and is slightly reduced to 98.71% after pruning and quantization. Evaluation of the network layers in Faster CryptoNets takes 39.1 seconds for one input, compared to 249.6 seconds for CryptoNets. We achieve a 6.4x improvement in wall-clock time while maintaining accuracy comparable to that of CryptoNets, which achieved 98.95% test accuracy. We also find that our method requires 9.1x fewer HOPs, a larger improvement than raw wall-clock time suggests. In Faster CryptoNets, encoding/encryption takes 6.63 seconds, while decryption of the final layer's output takes 0.02 seconds. CryptoNets takes 44.5 seconds for encoding/encryption and 3 seconds for decryption. Our method is 6.7x and 150x faster for these operations, respectively.
MNIST images are 28 x 28 pixels. Each ciphertext consists of 2 polynomials, resulting in 65,544 integers (64-bit). Therefore, our message consists of 784 x 65,544 x 8 bytes, or 411.1 MB. The output of the network consists of the 10 outputs of the final dense layer, which gives us a result consisting of 10 x 65,544 x 8 bytes, or 5.24 MB. In CryptoNets, the authors' encryption scheme results in each image consuming 367.5 MB in encrypted form. Our scheme results in message sizes comparable to previous work.
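The quoted sizes follow directly from these counts; a sanity-check sketch:

```python
# Message-size arithmetic: 784 pixel ciphertexts in, 10 logit
# ciphertexts out, each ciphertext holding 65,544 64-bit integers.
ints_per_ciphertext = 65_544
bytes_per_ciphertext = ints_per_ciphertext * 8   # 64-bit integers

input_mb = 784 * bytes_per_ciphertext / 1e6      # 28 x 28 = 784 pixels
output_mb = 10 * bytes_per_ciphertext / 1e6      # 10 dense-layer outputs

assert round(input_mb, 1) == 411.1
assert round(output_mb, 2) == 5.24
```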
VI-O Ablation Studies
Table V: Per-layer wall-clock time (seconds) and HOPs.

      Faster CryptoNets  CryptoNets  Relative  
Layer  Time  HOPs  Time  HOPs  Time  HOPs 
Conv1  3.9  8,619  30.0  42K  
Act1  23.4  5,070  81.0  845  
Mid  9.1  53K  127.0  566K  
Act2  2.7  600  10.0  100  
FC2  0.1  306  1.6  1,990  
Total  39.1  67K  249.6  612K 
Faster CryptoNets differs from the CryptoNets model in that we use the Swish activation instead of the square function. While both methods use a 2nd-degree polynomial of the form ax^2 + bx + c, our approximations use nonzero lower-order terms for increased expressivity, whereas the square function sets them to zero, resulting in fewer HOPs for the square function. This is shown in Table V in the rows Act1 and Act2. Despite our method requiring more HOPs for Act1 and Act2, we still achieve a faster inference time than CryptoNets. At a per-layer level, our method yields improvements of up to an order of magnitude in both wall-clock time and HOPs.
We compare the performance of different activation functions when approximated with our proposed polynomial approximation and quantization (AQ) method. For MNIST, Swish-AQ produces a test accuracy of 99.10%, while ReLU-AQ achieves an essentially equivalent test accuracy of 99.09%. This similarity between ReLU and Swish is corroborated by the closeness of the approximations in Figure 2. The polynomial coefficients we calculate for the ReLU and Swish approximations turn out to be the same, except for a constant factor.
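A degree-2 least-squares fit reproduces this behavior. The interval and sample count below are illustrative, not the exact fitting setup used in our derivation:

```python
import numpy as np

def swish(x):
    return x / (1.0 + np.exp(-x))  # x * sigmoid(x)

def relu(x):
    return np.maximum(x, 0.0)

# Least-squares degree-2 fit on a bounded symmetric interval (illustrative).
xs = np.linspace(-4.0, 4.0, 1000)
swish_coeffs = np.polyfit(xs, swish(xs), deg=2)  # [a, b, c] for a*x^2 + b*x + c
relu_coeffs = np.polyfit(xs, relu(xs), deg=2)

# On a symmetric grid both fits share the linear coefficient b = 1/2, and the
# quadratic coefficients come out close; the fits differ mainly in the
# constant term, echoing the observation above.
print(swish_coeffs)
print(relu_coeffs)
```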
We evaluate the inference quality and runtime of pruning and quantization separately. Pruning alone produces a test-set accuracy of 98.73% and an inference time of 104.7 seconds. Quantization alone produces 99.06% and 162.5 seconds. Combined, pruning and quantization produce 98.71% and 45.7 seconds. We record the accuracies during each INQ step for a non-pruned network in Table VI and for a DNS-pruned network in Table VII. Accuracy is largely preserved as the network is successively quantized, and in the quantization-only case it even improves slightly.
INQ Step  Partition  Quantized%  Accuracy 
1  0.7  30%  99.00 
2  0.4  60%  99.02 
3  0.2  80%  98.99 
4  0.0  100%  99.06 
INQ Step  Partition  Accuracy 
1  0.98  98.74 
2  0.96  98.78 
3  0.94  98.69 
4  0.92  98.68 
5  0.90  98.69 
6  0.88  98.79 
7  0.86  98.68 
8  0.00  98.71 
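A single INQ step of the kind tabulated above can be sketched as follows. This is a simplified illustration: the full procedure's partitioning strategy, power-of-two range limits, and the retraining of unfrozen weights between steps are omitted, and the helper name is our own.

```python
import numpy as np

def inq_quantize_step(weights, frozen_mask, fraction_to_quantize):
    """Quantize the largest-magnitude unfrozen weights to powers of two.

    Sketch of one incremental-network-quantization step; in the full
    procedure the remaining unfrozen weights would be retrained before
    the next step.
    """
    w = weights.copy()
    unfrozen = np.flatnonzero(~frozen_mask)
    # Freeze the largest-magnitude unfrozen weights first.
    order = unfrozen[np.argsort(-np.abs(w[unfrozen]))]
    to_freeze = order[: int(len(order) * fraction_to_quantize)]
    for i in to_freeze:
        if w[i] == 0.0:
            continue
        exponent = np.round(np.log2(np.abs(w[i])))
        w[i] = np.sign(w[i]) * 2.0 ** exponent  # nearest signed power of two
    frozen_mask[to_freeze] = True
    return w, frozen_mask

rng = np.random.default_rng(0)
w = rng.normal(scale=0.1, size=100)
mask = np.zeros(100, dtype=bool)
# Quantize 30% of the weights, as in the first INQ step of Table VI.
w, mask = inq_quantize_step(w, mask, 0.3)
print(mask.sum())  # 30 weights now frozen at powers of two
```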
VII Experimental Correctness
We verify that our parameters are selected so that our decrypted outputs are correct. We run encrypted inference on the 10,000-image MNIST test set and find no accuracy loss relative to our method's plaintext results (98.71%). We also find a precision error of around 0.05% when comparing the plaintext and decrypted outputs. Upon further examination, we found that this error is introduced during the floating-point to fixed-point conversion prior to the encoding scheme, and that it does not affect the accuracy of our model.
VIII Scaling Up
To evaluate how well our method works in real-world settings, we apply our techniques to larger datasets. First, we focus on CIFAR-10 as a larger practical image-classification task. Next, we consider a diabetic retinopathy dataset as a real-world medical-imaging use case where very deep neural networks would be used in practice. For both experiments, we upgrade our machines to n1-megamem-96 instances offered by the Google Cloud Platform, each of which has 96 Intel Skylake 2.0 GHz vCPUs and 1,433.6 GB of RAM.
VIII-A FV-RNS Parameters
We use a larger ring dimension with fifteen plaintext moduli: 40961, 65537, 114689, 147457, 188417, 270337, 286721, 319489, 417793, 557057, 638977, 737281, 778241, 786433, and 925697. The coefficient moduli are selected to provide 128-bit security. Furthermore, each coefficient modulus is decomposed into four 64-bit moduli for efficient use of the RNS variant of the FV encryption scheme.
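Evaluating under several plaintext moduli represents each value by its residues, and the client recovers the result via the Chinese Remainder Theorem. A minimal sketch of that reconstruction, using three of the moduli listed above on an ordinary integer (the actual scheme applies this coefficient-wise to polynomials):

```python
from math import prod

def crt_reconstruct(residues, moduli):
    """Recover x mod prod(moduli) from its residues x mod t_i (CRT)."""
    M = prod(moduli)
    x = 0
    for r, t in zip(residues, moduli):
        Mi = M // t
        # Modular inverse of Mi modulo t (moduli must be pairwise coprime).
        x += r * Mi * pow(Mi, -1, t)
    return x % M

# Toy example with three of the plaintext moduli above.
moduli = [40961, 65537, 114689]
secret = 123_456_789_012
residues = [secret % t for t in moduli]
assert crt_reconstruct(residues, moduli) == secret
```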
IX CIFAR-10
MNIST is a relatively easy dataset, with simple machine learning algorithms like linear regression or kNN producing high-accuracy results [64]. CIFAR-10 [43] is a more complicated task where CNNs perform notably better than other methods. We evaluate the CIFAR-10 performance of our method on the model used in CryptoDL [38], consisting of eight convolutional layers, which we denote as CNN8.
Activation  CIFAR-10 Train Acc.  CIFAR-10 Test Acc.
ReLU  —  86.76
Square  —  59.88
Softplus  —  —
Swish  —  —
ReLU-A  —  —
Softplus-A  —  —
Swish-A  —  —
ReLU-AQ  77.95  75.99
Softplus-AQ  —  —
Swish-AQ  78.20  75.66
Layer  PT-CT Adds  CT-CT Adds  PT-CT Mults  CT-CT Mults
Conv1  36,864  460,800  479,232  0 
Conv2  36,864  13,294,908  13,313,340  — 
Activ1  18,432  36,864  55,296  36,864 
Pool1  —  18,432  —  — 
Conv3  18,432  5,968,347  5,977,563  0 
Conv4  18,432  11,931,014  11,940,230  — 
Activ2  9,216  18,432  27,648  18,432
Pool2  0  9,216  —  —
Conv5  9216  4,713,389  4,717,997  0 
Conv6  9,216  9,421,992  9,426,600  0 
Activ3  4,608  9,216  13,824  9,216 
Pool3  —  4,608  —  — 
FC1  256  294,644  294,644  —
FC2  10  2,560  2,560  0
Total  161,546  46,184,422  46,248,934  64,512 
IX-A Activation Comparisons
We present Table VIII, which contains accuracy values for all of the activation functions we considered: ReLU, square, Swish, and softplus, each using the original function, the approximated function, and the quantized approximated function. Training this model with the square activation function results in significantly worse test accuracy (59.88%) than training with the ReLU activation function (86.76%), confirming the theoretical loss of accuracy from the instability of the square function in deeper neural networks. Furthermore, we find that ReLU-AQ and Swish-AQ offer comparable levels of performance (77.95% and 78.20% training accuracy; 75.99% and 75.66% test accuracy) while significantly improving on the accuracy achieved with the square activation function.
IX-B Pruning and Quantization
The pruning and quantization procedure results in a model with slightly improved accuracy (76.72%) that requires an order of magnitude fewer HOPs for inference than the baseline method. The inference time for the model is 22,372 seconds with our method.
IX-C Message Size
A CIFAR-10 input contains 32 x 32 x 3 = 3,072 values; at 524,352 bytes per ciphertext, the message size for the input image is 1,610,809,344 bytes, or 1,610.8 MB.
X Medical Imaging
Model  Layers Retrained  Test Accuracy
CNN8  All Layers  63.23
DFE-RN50  Top Block  69.89
DP-DFE-RN50  All Layers (with DP)  76.47
ResNet-50  All Layers  —
Model  Accuracy  HOPs  Runtime (s)
CNN8 (sparse)  63.23  1.33E8  3,325
DP-DFE-RN50  76.04  3.95E8  —
DP-DFE-RN50 (sparse)  —  —  —
Method  Acc.  HOPs  Inference Time (s)  Speedup
Original  70.47  —  12,493  –
Pruned  70.98  —  1,924  6.4x
Pruned/Quantized  70.55  —  1,590  7.8x
A significant limiting factor in levelled encryption schemes used with neural networks is multiplicative depth: only a set number of HE operations can be performed sequentially. Increasing the multiplicative depth by choosing larger parameters incurs prohibitive cost, limiting us to neural networks with three activation functions under our current settings. However, state-of-the-art real-world applications of deep learning, such as medical imaging, commonly use modern very deep neural networks. To mitigate this issue, we propose the use of models trained with transfer learning, where the computations involved in the pre-trained layers of the model are delegated to the client, and encryption is applied only for the evaluation of the fine-tuned layers on the server. Using this technique, which we call Delegated Feature Extraction (DFE), together with the Faster CryptoNets optimizations to speed up the computation, we achieve practical runtimes for large input sizes. An illustration of our technique is provided in Figure 4. We also show how this technique can be improved with private training, demonstrating a viable framework in which private, efficient, and powerful machine learning services can be provided.
X-A Data
We chose the diabetic retinopathy dataset introduced by [32] both for its clinical impact and for the privacy-sensitive nature of retinal data. The dataset consists of macula-centered retinal fundus images, primarily sourced from EyePACS, graded by 54 ophthalmologists or ophthalmologist trainees using the International Clinical Diabetic Retinopathy scale [7] into ‘none’, ‘mild’, ‘moderate’, ‘severe’, or ‘proliferative’ ratings for the severity of the condition.
We obtained a subset of 35,126 images of the dataset, with a label distribution of 25,810 ‘none’, 2,443 ‘mild’, 5,292 ‘moderate’, 873 ‘severe’, and 708 ‘proliferative’ diagnoses. Scans from both the left and right eye were sourced from each patient. To compare our results to the replication study performed by [66], we group the ‘none’ and ‘mild’ labels into a ‘0’ label and the ‘moderate’, ‘severe’, and ‘proliferative’ labels into a ‘1’ label, reframing the problem as a binary classification task. We randomly subsample the dataset to obtain an even split between the two labels, and, following the guidelines recommended by [32], we use an 80-20 split for training and test data.
Before using the retinal images with our network, we perform some preprocessing on the raw images. The scans are scaled to 224x224, the standard ImageNet input size, with cropping performed using edge detection to reframe the images. To normalize the colors and lighting, the local average color of each image is subtracted from that image, after which the local average is mapped to grayscale. Random rotation is applied to make the model invariant to left/right eye positioning and for general augmentation. Samples of the data we use are provided in Figure 5.
X-B Transfer Learning
Transfer learning is a useful way to learn an accurate model with limited data [10]; it requires retraining (fine-tuning) only a few of the layers of the network rather than training a full network from scratch [39]. The base layers are commonly trained on ImageNet [20, 58], and the final layers are retrained on a specific proprietary dataset. This practice is common in the healthcare setting, where large datasets are expensive to acquire and transfer learning can simplify and/or improve the training of the model, especially when using Machine Learning as a Service (MLaaS) offerings from cloud-computing providers to expedite development of the application [63, 31]. As we describe in more detail in our methods section, we leverage transfer learning such that only a small number of fine-tuned layers of the network require evaluation under the encryption scheme, while the generic feature extraction of the base network layers is delegated to the client.
X-C Network Architectures
We first implement a baseline model to train on the retinal dataset. Our baseline model (CNN8) resembles the CNN architecture presented in [38] for CIFAR-10 but is designed to support the larger input image size (scaled from 32x32 to 224x224 input). It uses a multiplicative depth budget identical to that of our transfer learning models in realizing the privacy guarantees. In particular, it contains eight convolutional layers and three approximate activation functions.
Modern deep neural networks like ResNet-152 [37] and Inception-v3 [62] can contain hundreds of layers. The multiplicative depth of the fully-trained ResNet-152 or the Inception-v3 in [32] is at least an order of magnitude greater and would incur a prohibitively large runtime in the encrypted setting. Additionally, there are challenges in achieving strong accuracy when training all layers of a deep neural network with approximate activation functions. Our proposed model (DFE-ResNet-152) requires retraining only the top block of the ResNet-152 network, which contains three convolutional layers and three activation functions. In this top block, we replace the ReLU activation functions with our approximate activation function, and we use the rest of the ImageNet-pretrained model as a delegated feature extractor on the client.
X-D Model Adaptations
We use the polynomial approximation of Swish [55] derived in the previous sections. This approximation is required only for the activation functions in the retrained block, which reduces difficulties with convergence during training.
Since the encryption scheme only supports addition and multiplication, some minor modifications are required to support average pooling. While other work has used scaled average pooling, in this work we encode the reciprocal of the size of the pooling window as a plaintext constant. Furthermore, to support the addition operation in the residual block, the scaling factor must be encoded as well, in order to scale the encrypted input for the addition operation.
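A plaintext sketch of this pooling trick: the division by the window size is replaced by one multiplication with a fixed-point encoding of the reciprocal. The scale factor below is illustrative, not the parameter used in our experiments:

```python
SCALE = 2 ** 10  # illustrative fixed-point scale, not our actual parameter

def encode(x):
    """Float -> fixed-point integer, as an HE plaintext would be encoded."""
    return int(round(x * SCALE))

def avg_pool_he_style(window_values):
    """Average pooling using only additions and one plaintext multiplication.

    `window_values` are fixed-point encodings of the pooled activations;
    the reciprocal of the window size is encoded as a plaintext constant.
    """
    total = sum(window_values)                # ciphertext-ciphertext adds
    inv_n = encode(1.0 / len(window_values))  # plaintext constant 1/n
    return total * inv_n                      # one plaintext-ciphertext mult

window = [encode(v) for v in [0.5, 0.25, 0.75, 0.5]]
pooled = avg_pool_he_style(window)
# Decoding divides by SCALE twice: once per fixed-point factor.
print(pooled / SCALE ** 2)  # 0.5, the true average of the window
```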
X-E Client-Server Interaction
The client uses a standard deep learning framework to evaluate the base network layers on a single RGB retinal fundus image. Each element of the activation volume of the final base layer is converted to a fixed-point value and encoded using the integer encoder described above. Each value is encrypted on the client and transmitted to the server. The server returns encrypted output values, which the client decrypts, converts to floating point, and passes through the sigmoid function to obtain the final predictions of the model. Given the support for deep learning operations on major mobile platforms, the client could even be the patient's own mobile device, allowing for direct service models in developing countries.
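The client side of this interaction can be sketched as follows. The base model, scale factor, and helper names here are illustrative placeholders; the encryption and decryption steps themselves would be handled by the HE library and are skipped in this sketch:

```python
import numpy as np

SCALE = 2 ** 10  # illustrative fixed-point scale

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def client_prepare(image, base_model):
    """Run the delegated (pre-trained) base layers locally, then encode.

    The returned integers are what the client would encrypt and send.
    """
    features = base_model(image)  # plaintext feature extraction on-device
    return np.round(features * SCALE).astype(np.int64)

def client_finish(decrypted_outputs, effective_scale):
    """Decode the decrypted fixed-point outputs and apply sigmoid locally."""
    return sigmoid(decrypted_outputs / effective_scale)

# Toy stand-in for the pre-trained base network (hypothetical).
base_model = lambda img: img.mean(axis=(0, 1))
image = np.random.default_rng(1).random((224, 224, 3))
encoded = client_prepare(image, base_model)
probs = client_finish(encoded.astype(float), SCALE)  # encryption skipped here
print(encoded.dtype, probs.shape)
```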
X-F Evaluation Metrics
The natural alternative to our proposed technique is to store the entire fully-trained model on the server and have clients transmit their encrypted images for diagnosis. We aim to demonstrate that our proposed technique of using fine-tuned layers of a much deeper model achieves higher accuracy, greater performance, and a smaller message size, without compromising any security or privacy guarantees. Our primary comparison is between a standard model fully trained on our dataset and a transfer learning model that uses pretrained ImageNet weights and is fine-tuned on our dataset.
We consider the multiplicative depth of the models, which corresponds to the length of the deepest path of ciphertext multiplications through the network. We keep the multiplicative depth fixed between the two methods to enable a fair comparison. We also analyze the count of homomorphic operations (HOPs), which serves as a hardware-independent and implementation-agnostic measure of the complexity of evaluation.
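HOPs can be counted directly from layer shapes and weight sparsity. A sketch for a dense layer, using our own illustrative accounting in the spirit of the PT-CT / CT-CT operation tables above (the pruning threshold is arbitrary):

```python
import numpy as np

def dense_layer_hops(weights):
    """Count homomorphic ops for a dense layer on one encrypted input vector.

    Each nonzero weight costs one plaintext-ciphertext multiplication;
    accumulating the products costs ciphertext-ciphertext additions.
    Zero (pruned) weights cost nothing -- the source of the HOP savings.
    """
    nonzero_per_output = (weights != 0).sum(axis=0)
    pt_ct_mults = int(nonzero_per_output.sum())
    ct_ct_adds = int(np.maximum(nonzero_per_output - 1, 0).sum())
    return pt_ct_mults + ct_ct_adds

rng = np.random.default_rng(0)
w = rng.normal(size=(256, 10))                 # dense layer: 256 inputs, 10 outputs
dense_hops = dense_layer_hops(w)
w_pruned = np.where(np.abs(w) < 0.8, 0.0, w)   # magnitude pruning (illustrative)
print(dense_hops, dense_layer_hops(w_pruned))
```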
X-G DP-SGD
One concern with the earlier approach is that users are limited to fine-tuning only a few layers of the network. However, training the entire model and releasing the upper portion of the network as a feature extractor could pose a privacy risk, as the feature extractor could leak sensitive information. We therefore explore the use of private training techniques to fine-tune the feature extractor, improving accuracy while preserving end-to-end privacy.
Differential privacy is a privacy construct which guarantees that an individual's presence in a dataset does not substantially change the overall statistics of the population [22]. Formally, a randomized algorithm M is (ε, δ)-differentially private if, for any two datasets d and d′ differing in a single record and any set of outputs S, Pr[M(d) ∈ S] ≤ e^ε · Pr[M(d′) ∈ S] + δ. Applying differential privacy to neural networks helps defend against membership inference and model inversion attacks [3]. This can be achieved either by applying noise to gradients while training a single model [2] [61] or by segregating data and adding noise in a collaborative learning setting [53] [60].
DP-SGD optimization was developed by [2] and involves clipping the gradients of the neural network and adding Gaussian noise during training with stochastic gradient descent. It also tracks the privacy loss through a privacy accountant [48], which terminates training early when the total privacy cost of accessing the training data exceeds a predetermined budget. Differential privacy is attained because clipping bounds the L2 norm of individual gradients, limiting the influence of each example on the learning updates. We outline the DP-SGD algorithm briefly in Algorithm 2.
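The clip-average-noise update described above can be sketched as follows. This is a minimal illustration of the mechanism in [2]; the privacy accountant and lot sampling are omitted, and the parameter values are arbitrary:

```python
import numpy as np

def dp_sgd_step(per_example_grads, clip_norm, noise_multiplier, lr, params,
                rng=np.random.default_rng(0)):
    """One DP-SGD update: clip each example's gradient, average, add noise."""
    clipped = []
    for g in per_example_grads:
        norm = np.linalg.norm(g)
        clipped.append(g * min(1.0, clip_norm / norm))  # bound L2 influence
    mean_grad = np.mean(clipped, axis=0)
    # Gaussian noise scaled to the clipping bound and lot size.
    noise = rng.normal(0.0,
                       noise_multiplier * clip_norm / len(per_example_grads),
                       size=mean_grad.shape)
    return params - lr * (mean_grad + noise)

# Toy usage: 8 per-example gradients for a 4-parameter model.
rng = np.random.default_rng(1)
grads = [rng.normal(size=4) for _ in range(8)]
params = np.zeros(4)
new_params = dp_sgd_step(grads, clip_norm=1.0, noise_multiplier=1.1,
                         lr=0.1, params=params)
print(new_params.shape)  # (4,)
```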
We use a modified DP-SGD algorithm [2] to fine-tune the entire network, using two techniques introduced by [70]: warm-starting, where a public dataset is used to initialize the weights of the model, and weight clustering, where the same dataset is used to estimate the gradient L2 norms of each parameter before a hierarchical clustering algorithm groups parameters with similar clipping bounds (Algorithm 3). The public dataset we use is a smaller retinal scan dataset from the STARE project [1]. We train the network under these differential privacy settings and track the resulting privacy budget. Finally, we retrain the layers to be encrypted as before, leaving us with our final model (DP-DFE-RN50).
X-H Message Sizes
The data transfer between the client and server consists of the values for the encrypted input and the values for the encrypted output prediction. Note that both the input and the prediction are encrypted under multiple keys held by the client, each corresponding to a distinct plaintext modulus, which makes the message size proportional to the number of moduli used for evaluation. We used fifteen plaintext moduli in our experiments.
As a result, the message size for the encrypted input in our DFE-ResNet-152 method is 789.2 GB, corresponding to the encryption of the activation volume. The message size for the encrypted input in our CNN8 baseline method is 1,183.8 GB, corresponding to the encryption of the input image. The message size of the encrypted output is identical in each case: 7.9 MB.
Since the input is transformed to a representation with a smaller dimensionality in the transfer learning method, the cost of data transfer is reduced by 1.5x. While the message size is significant in both cases, we note that ciphertext batching techniques can amortize the cost of encrypted inference when a user wishes to request predictions on multiple images. In the case of diabetic retinopathy detection, this could correspond to predictions for both the left and right eye, or predictions for images of multiple patients of a healthcare provider.
X-I Experimental Correctness
We validate 100 images from each of our CIFAR-10 and retina experiments to ascertain the correctness of our decrypted outputs. Once more, we find around 0.05% error due to the floating-point to fixed-point conversion.
X-J Overall Comparison
In a direct comparison between the baseline CNN8 model and our transfer learning DFE-RN50 model, we observe an across-the-board improvement. DFE-RN50 has higher accuracy, significantly reduces both the count of HOPs and the measured runtime, and produces smaller message sizes than our baseline model. We demonstrate the effectiveness of the sparsity-based optimization techniques in reducing computation time (7.8x speedup). We also show how other privacy concepts, such as differential privacy, can further improve our feature-extraction architecture, as seen in the improved accuracy of DP-DFE-RN50. To the best of our knowledge, this is the first implementation of homomorphic encryption with neural networks on a real-world medical imaging dataset.
XI Discussion
Encrypted inference is not a panacea for private machine learning. It has constraints that this paper touches on in several sections, including computational cost and network-depth limitations. Additionally, it does not address the problem of private training or of defending against machine learning attacks. The encrypted inference paradigm remains vulnerable to black-box attacks, since the model still returns (encrypted) outputs that are otherwise unaffected. For example, membership inference [24] and model stealing [65] attacks can be performed with access only to the outputs of the model.
XII Conclusion
Personal privacy is increasingly under threat in the modern digital age, and machine learning models continue to fuel the appetite for more individual data and information. Homomorphic encryption holds great promise due to the security guarantees it can provide against both eavesdroppers and service hosts. Unlocking its potential will require reducing the high overhead of the arithmetic operations prevalent in neural networks.
In this work, we introduced and evaluated techniques for accelerating CryptoNets [29]. The fundamental approach of our method is to leverage sparsity, using (i) efficient polynomial approximations of the activation functions and (ii) pruning and quantization tailored to the encryption scheme, for significant performance gains. We show that our method, Faster CryptoNets, is substantially faster than CryptoNets without much loss of test-set accuracy. We also demonstrate how our technique can be deployed in a privately trained feature-extraction setting, possibly inspiring future avenues of work in which different privacy concepts are combined to deliver an end-to-end privacy-safe training and inference pipeline. To the best of our knowledge, this is the first implementation of homomorphic encryption on a real-world medical imaging dataset.
Recent developments could produce even greater improvements. Structured sparsity [67], filter-level pruning methods [46], and efficient batching schemes and hardware-acceleration techniques [49] could further accelerate the evaluation of deeper networks. In particular, more optimal encoding schemes could help reduce the message sizes of the encrypted data and provide more efficient parameters for the encryption scheme. We hope this work will inspire future lines of research in efficient and privacy-safe machine learning.
References
 [1] V. K. A. Hoover and M. Goldbaum, “Locating blood vessels in retinal images by piecewise threshold probing of a matched filter response,” 2000.
 [2] M. Abadi, A. Chu, I. Goodfellow, H. B. McMahan, I. Mironov, K. Talwar, and L. Zhang, “Deep learning with differential privacy,” pp. 308–318, 2016. [Online]. Available: http://doi.acm.org/10.1145/2976749.2978318
 [3] M. Abadi, Úlfar Erlingsson, I. Goodfellow, H. B. McMahan, N. Papernot, I. Mironov, K. Talwar, and L. Zhang, “On the protection of private information in machine learning systems: Two recent approaches,” in IEEE 30th Computer Security Foundations Symposium (CSF), 2017, pp. 1–6. [Online]. Available: https://arxiv.org/abs/1708.08022
 [4] R. Agrawal and R. Srikant, “Privacy-preserving data mining,” in SIGMOD Record. ACM, 2000.
 [5] S. Akleylek, N. Bindel, J. Buchmann, J. Krämer, and G. A. Marson, “An efficient latticebased signature scheme with provably secure instantiation,” in International Conference on Cryptology in Africa. Springer, 2016.
 [6] M. Albrecht, M. Chase, H. Chen, J. Ding, Goldwasser et al., “Homomorphic encryption standard,” 2018.
 [7] American Academy of Ophthalmology, “International clinical diabetic retinopathy disease severity scale detailed table.” 2002.
 [8] H. Bae, J. Jang, D. Jung, H. Jang, H. Ha, and S. Yoon, “Security and Privacy Issues in Deep Learning,” ArXiv eprints, Jul. 2018.
 [9] J.C. Bajard, J. Eynard, A. Hasan, and V. Zucca, “A full rns variant of fv like somewhat homomorphic encryption schemes,” in Selected Areas in Cryptography, 2016.
 [10] Y. Bengio, “Deep learning of representations for unsupervised and transfer learning,” in Proceedings of ICML Workshop on Unsupervised and Transfer Learning, 2012, pp. 17–36.
 [11] F. Bourse, M. Minelli, M. Minihold, and P. Paillier, “Fast homomorphic evaluation of deep discretized neural networks,” IACR Cryptology ePrint Archive, vol. 2017, p. 1114, 2017.
 [12] Z. Brakerski and V. Vaikuntanathan, “Efficient fully homomorphic encryption from (standard) lwe,” Journal on Computing, 2014.

 [13] J. Brickell and V. Shmatikov, “Privacy-preserving classifier learning,” in International Conference on Financial Cryptography and Data Security. Springer, 2009.
 [14] N. Brisebarre, J.-M. Muller, and A. Tisserand, “Computing machine-efficient polynomial approximations,” Transactions on Mathematical Software, 2006.
 [15] H. Chabanne, A. de Wargny, J. Milgram, C. Morel, and E. Prouff, “Privacy-preserving classification on deep neural network,” IACR Cryptology ePrint Archive, 2017.
 [16] K. Chaudhuri, C. Monteleoni, and A. D. Sarwate, “Differentially private empirical risk minimization,” JMLR, 2011.
 [17] H. Chen, K. Han, Z. Huang, A. Jalali, and K. Laine, “Simple encrypted arithmetic library v2.3.0,” Microsoft Research TechReport, 2017.
 [18] J. H. Cheon, J. Jeong, J. Lee, and K. Lee, “Privacy-preserving computations of predictive medical models with minimax approximation and non-adjacent form,” in International Conference on Financial Cryptography and Data Security. Springer, 2017.

 [19] J. Chung, C. Gulcehre, K. Cho, and Y. Bengio, “Empirical evaluation of gated recurrent neural networks on sequence modeling,” arXiv, 2014.
 [20] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “ImageNet: A large-scale hierarchical image database,” in Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on. IEEE, 2009, pp. 248–255.
 [21] C. Dugas, Y. Bengio, F. Bélisle, C. Nadeau, and R. Garcia, “Incorporating secondorder functional knowledge for better option pricing,” in NIPS, 2001.
 [22] C. Dwork, “Differential privacy: A survey of results,” in International Conference on Theory and Applications of Models of Computation. Springer, 2008.
 [23] T. ElGamal, “A public key cryptosystem and a signature scheme based on discrete logarithms,” Transactions on Information Theory, 1985.
 [24] M. Fredrikson, S. Jha, and T. Ristenpart, “Model inversion attacks that exploit confidence information and basic countermeasures,” in Proceedings of the 22Nd ACM SIGSAC Conference on Computer and Communications Security, ser. CCS ’15. New York, NY, USA: ACM, 2015, pp. 1322–1333. [Online]. Available: http://doi.acm.org/10.1145/2810103.2813677
 [25] A. Gautier, Q. N. Nguyen, and M. Hein, “Globally optimal training of generalized polynomial neural networks with nonlinear spectral methods,” in NIPS, 2016.
 [26] C. Gentry et al., “Fully homomorphic encryption using ideal lattices.” in STOC, 2009.
 [27] Z. Ghodsi, T. Gu, and S. Garg, “Safetynets: Verifiable execution of deep neural networks on an untrusted cloud,” in NIPS, 2017.
 [28] I. Giacomelli, S. Jha, M. Joye, C. D. Page, and K. Yoon, “Privacy-preserving ridge regression with only linearly-homomorphic encryption,” Cryptology ePrint Archive, 2017.
 [29] R. GiladBachrach, N. Dowlin, K. Laine, K. Lauter, M. Naehrig, and J. Wernsing, “Cryptonets: Applying neural networks to encrypted data with high throughput and accuracy,” in ICML, 2016.
 [30] X. Glorot, A. Bordes, and Y. Bengio, “Deep sparse rectifier neural networks,” in AISTATS, 2011.
 [31] H. Greenspan, B. van Ginneken, and R. M. Summers, “Guest editorial deep learning in medical imaging: Overview and future promise of an exciting new technique,” IEEE Transactions on Medical Imaging, vol. 35, no. 5, pp. 1153–1159, 2016.
 [32] V. Gulshan, L. Peng et al., “Development and validation of a deep learning algorithm for detection of diabetic retinopathy in retinal fundus photographs,” JAMA, 2016.
 [33] Y. Guo, A. Yao, and Y. Chen, “Dynamic network surgery for efficient dnns,” in NIPS, 2016.
 [34] R. Hall, S. E. Fienberg, and Y. Nardi, “Secure multiple linear regression based on homomorphic encryption,” Journal of Official Statistics, 2011.
 [35] S. Han, H. Mao, and W. J. Dally, “Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding,” ICLR, 2016.
 [36] D. Harvey, “Faster arithmetic for numbertheoretic transforms,” Journal of Symbolic Computation, 2014.
 [37] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” CoRR, vol. abs/1512.03385, 2015. [Online]. Available: http://arxiv.org/abs/1512.03385
 [38] E. Hesamifard, H. Takabi, and M. Ghasemi, “Cryptodl: Deep neural networks over encrypted data,” arXiv, 2017.
 [39] G. E. Hinton and R. R. Salakhutdinov, “Reducing the dimensionality of data with neural networks,” science, vol. 313, no. 5786, pp. 504–507, 2006.
 [40] K. Hornik, “Approximation capabilities of multilayer feedforward networks,” Neural networks, 1991.
 [41] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” in ICML, 2015.
 [42] V. Kolesnikov and T. Schneider, “Improved garbled circuit: Free xor gates and applications,” in International Colloquium on Automata, Languages, and Programming. Springer, 2008.
 [43] A. Krizhevsky, V. Nair, and G. Hinton, “Cifar10 (canadian institute for advanced research).” [Online]. Available: http://www.cs.toronto.edu/~kriz/cifar.html
 [44] Y. LeCun, Y. Bengio et al., “Convolutional networks for images, speech, and time series,” The handbook of brain theory and neural networks, 1995.
 [45] R. Livni et al., “On the computational efficiency of training neural networks,” in NIPS, 2014.
 [46] J.H. Luo, J. Wu, and W. Lin, “Thinet: A filter level pruning method for deep neural network compression,” arXiv, 2017.
 [47] L. Ma and K. Khorasani, “Constructive feedforward neural networks using hermite polynomial activation functions,” Transactions on Neural Networks, 2005.
 [48] F. D. McSherry, “Privacy integrated queries: An extensible platform for privacypreserving data analysis,” SIGMOD, 2009.
 [49] V. Migliore, C. Seguin, M. M. Real, V. Lapotre, A. Tisserand, C. Fontaine, G. Gogniat, and R. Tessier, “A highspeed accelerator for homomorphic encryption using the karatsuba algorithm,” ACM Trans. Embed. Comput. Syst., vol. 16, no. 5s, pp. 138:1–138:17, Sep. 2017. [Online]. Available: http://doi.acm.org/10.1145/3126558
 [50] P. Mohassel and Y. Zhang, “SecureML: A system for scalable privacy-preserving machine learning,” in Symposium on Security and Privacy. IEEE, 2017.
 [51] M. Naehrig, K. Lauter, and V. Vaikuntanathan, “Can homomorphic encryption be practical?” in ACM Workshop on Cloud computing security. ACM, 2011.
 [52] N. Papernot, M. Abadi, U. Erlingsson, I. Goodfellow, and K. Talwar, “Semi-supervised knowledge transfer for deep learning from private training data,” ICLR, 2017.
 [53] N. Papernot, S. Song, I. Mironov, A. Raghunathan, K. Talwar, and Ú. Erlingsson, “Scalable private learning with pate,” arXiv preprint arXiv:1802.08908, 2018.
 [54] F. Piazza, A. Uncini, and M. Zenobi, “Artificial neural networks with adaptive polynomial activation function,” 1992.
 [55] P. Ramachandran, B. Zoph, and Q. V. Le, “Swish: a selfgated activation function,” arXiv, 2017.
 [56] R. L. Rivest, L. Adleman, and M. L. Dertouzos, “On data banks and privacy homomorphisms,” Foundations of secure computation, 1978.
 [57] B. D. Rouhani, M. S. Riazi, and F. Koushanfar, “DeepSecure: Scalable provably-secure deep learning,” arXiv, 2017.
 [58] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein et al., “Imagenet large scale visual recognition challenge,” International Journal of Computer Vision, 2015.
 [59] A. Sanyal, M. J. Kusner, A. Gascón, and V. Kanade, “TAPAS: Tricks to Accelerate (encrypted) Prediction As a Service,” ArXiv eprints, Jun. 2018.
 [60] R. Shokri and V. Shmatikov, “Privacy-preserving deep learning,” in Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security. ACM, 2015, pp. 1310–1321.
 [61] S. Song, K. Chaudhuri, and A. D. Sarwate, “Stochastic gradient descent with differentially private updates,” in Global Conference on Signal and Information Processing (GlobalSIP), 2013 IEEE. IEEE, 2013, pp. 245–248.
 [62] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, “Rethinking the inception architecture for computer vision,” CoRR, vol. abs/1512.00567, 2015. [Online]. Available: http://arxiv.org/abs/1512.00567
 [63] N. Tajbakhsh, J. Y. Shin, S. R. Gurudu, R. T. Hurst, C. B. Kendall, M. B. Gotway, and J. Liang, “Convolutional neural networks for medical image analysis: Full training or fine tuning?” IEEE transactions on medical imaging, vol. 35, no. 5, pp. 1299–1312, 2016.
 [64] B. Toghi and D. Grover, “MNIST Dataset Classification Utilizing kNN Classifier with Modified Sliding Window Metric,” ArXiv eprints, Sep. 2018.
 [65] F. Tramèr, F. Zhang, A. Juels, M. K. Reiter, and T. Ristenpart, “Stealing machine learning models via prediction apis,” CoRR, vol. abs/1609.02943, 2016. [Online]. Available: http://arxiv.org/abs/1609.02943
 [66] M. Voets, K. Møllersen, and L. Ailo Bongo, “Replication study: Development and validation of deep learning algorithm for detection of diabetic retinopathy in retinal fundus photographs,” ArXiv eprints, Mar. 2018.
 [67] W. Wen et al., “Learning structured sparsity in deep neural networks,” in NIPS, 2016.
 [68] P. Xie, M. Bilenko, T. Finley, R. GiladBachrach, K. Lauter, and M. Naehrig, “Cryptonets: Neural networks over encrypted data,” arXiv, 2014.
 [69] A. C.C. Yao, “How to generate and exchange secrets,” in Foundations of Computer Science. IEEE, 1986.
 [70] X. Zhang, S. Ji, and T. Wang, “Differentially private releasing via deep generative model,” CoRR, vol. abs/1801.01594, 2018. [Online]. Available: http://arxiv.org/abs/1801.01594
 [71] A. Zhou, A. Yao, Y. Guo, L. Xu, and Y. Chen, “Incremental network quantization: Towards lossless CNNs with low-precision weights,” ICLR, 2017.