1 Introduction
The next step in the machine learning revolution would be Deep Learning as a Service (DLaaS), which seeks to take advantage of the benefits that cloud computing brings. Cloud servers are excellent machine learning platforms, offering cheap data storage, near-zero deployment cost and high computational power. However, they are not all-powerful, and there are important questions that need to be resolved before DLaaS can become widespread. One of the main concerns is that cloud platforms do not guarantee data privacy. In the DLaaS setting, users upload their data to the cloud, run the model on it and get the results back from the cloud. At every step along the way, there are numerous opportunities for hackers and other malicious actors to compromise the data.
Privacy-preserving machine learning was considered previously by Graepel et al. [ICISC:GraLauNae12] and Aslett et al. [ARXIV:AslEspHol15]. Following them, Dowlin et al. [MSFT:DGL+16] proposed CryptoNets, the first neural network over encrypted data, providing a method to do the inference phase of privacy-preserving deep learning. Since then, others [CCS:LJLA17, C:BMMP18, USENIX:JuvVaiCha18, PoPETS:HTGW18, Jiang:2018:SOM:3243734.3243837] have applied a variety of cryptographic techniques, such as secure multi-party computation and oblivious transfers, to achieve similar goals. Just as AlexNet by Krizhevsky et al. [NIPS:KriSutHin12] showed how image classification is viable by running convolutional neural networks (CNN) on GPUs, we show that privacy-preserving deep learning is dramatically accelerated with GPUs and offers a way towards efficient DLaaS. We follow the framework put forward in CryptoNets
[MSFT:DGL+16] and apply our GPU-accelerated fully homomorphic encryption (FHE) techniques to realize efficient homomorphic convolutional neural networks (HCNNs).
Although the framework is available, there are still challenges to realizing performant HCNNs. FHE, first realized by Gentry [STOC:Gentry09] almost 10 years ago, allows arbitrary computation on encrypted data. Informally, it works as follows. Encryption masks the input data, called a plaintext, by a random error sampled from some distribution, resulting in a ciphertext that reveals nothing about what it encrypts. Decryption uses the secret key to filter out the noise and retrieve the plaintext as long as the noise is within some threshold. Note that during computation, the noise in ciphertexts grows, but in a controlled manner. Eventually, it reaches a level where no further computation can be done without causing decryption failure. Bootstrapping can be used to refresh a ciphertext with large noise into one with less noise that can be used for further computation. By repeating this indefinitely, any function can theoretically be computed.
However, this approach is still impractical and bootstrapping is not used in most cases. Instead, the class of functions that can be evaluated is restricted to depth-$L$ arithmetic circuits, yielding a levelled FHE scheme that avoids bootstrapping. For performance, $L$ should be minimized, which means that we have to carefully design HCNNs with this in mind. Furthermore, the model of computation in FHE, arithmetic circuits with addition (HAdd) and multiplication (HMult) gates, is not compatible with non-polynomial functions such as sigmoid, ReLU and max. This means that we should use polynomial approximations to the activation functions where possible and consider if pooling layers are useful in practice.
Besides that, we have to encode decimals in a form that is compatible with FHE plaintext data, which are usually integers. Decimals can have high precision, which means that they require integers of large bit-size to represent them in the commonly used scalar encoding. In this encoding, decimals are transformed into integers by multiplying them with some scaling factor and then operated on with HAdd and HMult normally. The main drawback of this encoding is that we cannot rescale encoded data mid-computation; therefore, successive homomorphic operations cause the data size to increase rapidly. Managing this scaling expansion is a necessary step towards scaling HCNNs to larger datasets and deeper neural networks.
Our Contributions.

We present the first GPU-accelerated Homomorphic Convolutional Neural Network (HCNN), which runs a pre-learned model on encrypted data from the MNIST dataset.

We provide a rich set of optimization techniques to enable easy design of HCNNs and reduce the overall computational overhead. These include low-precision training, optimized choice of HE scheme and parameters, and a GPU-accelerated implementation.

We reduced the HCNN for the MNIST dataset to only 5 layers deep for both training and inference, smaller than CryptoNets [MSFT:DGL+16], which used 9 layers during training.

We compute predictions for 10,000 images of $28 \times 28$ pixels in 14.105 seconds, improving on the current record (by CryptoNets) and with a higher security level (in bits).
Related Work. The research in the area of privacy-preserving deep learning can be roughly divided into two camps: those using only homomorphic encryption and those combining it with secure multi-party computation (MPC) techniques. Most closely related to our work are CryptoNets by Dowlin et al. [MSFT:DGL+16], FHE-DiNN by Bourse et al. [C:BMMP18] and E2DM by Jiang et al. [Jiang:2018:SOM:3243734.3243837], who focus on using only fully homomorphic encryption to address this problem. Dowlin et al. [MSFT:DGL+16] were the first to propose using FHE to achieve privacy-preserving deep learning, offering a framework to design neural networks that can be run on encrypted data. They proposed using polynomial approximations of the most widespread activation function and using pooling layers only during the training phase to reduce the circuit depth of their neural network. However, they used the YASHE scheme by Bos et al. [IMA:BLLN13], which is no longer secure due to attacks proposed by Albrecht et al. [C:AlbBaiDuc16]. Also, they require a large plaintext space of over 80 bits to contain their neural network’s output. This makes it very difficult to scale to deeper networks since intermediate layers in those networks will quickly reach several hundred bits with their settings.
Following them, Bourse et al. [C:BMMP18] proposed a new type of neural network called discretized neural networks (DiNN) for inference over encrypted data. Weights and inputs of traditional CNNs are discretized into elements in $\{-1, 1\}$ and the fast bootstrapping of the TFHE scheme proposed by Chillotti et al. [AC:CGGI16] was exploited to double as an activation function for neurons. Each neuron computes a weighted sum of its inputs and the activation function is the sign function, $\mathrm{sign}(z)$, which outputs the sign of the input $z$, i.e. $-1$ if $z < 0$ and $1$ otherwise. Although this method can be applied to arbitrarily deep networks, it suffers from lower accuracy on the MNIST dataset and lower amortized performance. Very recently, Jiang et al. [Jiang:2018:SOM:3243734.3243837] proposed a new method for matrix multiplication with HE and evaluated a neural network on the MNIST data set using this technique. They also considered packing an entire image into a single ciphertext, in contrast to the approach of Dowlin et al. [MSFT:DGL+16], who put only one pixel per ciphertext but evaluated large batches of images at a time. They achieved good latency but worse amortized performance.
Two of the main limitations of pure FHE-based approaches are the need to approximate non-polynomial activation functions and the high computation time. Addressing these problems, Liu et al. [CCS:LJLA17] proposed MiniONN, a paradigm shift in securely evaluating neural networks. They take commonly used protocols in deep learning and transform them into oblivious protocols. With MPC, they could evaluate neural networks without changing the training phase, preserving accuracy since there is no approximation needed for activation functions. However, MPC comes with its own set of drawbacks. In this setting, each computation requires communication between the data owner and model owner, thus resulting in high bandwidth usage. In a similar vein, Juvekar et al. [USENIX:JuvVaiCha18] designed GAZELLE. Instead of applying levelled FHE, they alternate between an additive homomorphic encryption scheme for convolution-type layers and garbled circuits for activation and pooling layers. This way, communication complexity is reduced compared to MiniONN but unfortunately is still significant.
Organization of the Paper. Section 2 introduces fully homomorphic encryption and neural networks, the main components of HCNNs. Following that, Section 3 discusses the challenges of adapting convolutional neural networks to the homomorphic domain. Next, we describe the components that were used in implementing HCNNs in Section 4. In Section 5, we report the results of experiments done using our implementation of HCNNs on the MNIST dataset. Lastly, we conclude with Section 6 and discuss some of the obstacles that will be faced when scaling HCNNs to larger datasets.
2 Preliminaries
In this section, we review a set of notions that are required to understand the paper. We start by introducing FHE and describing the BFV scheme, an instance of levelled FHE schemes. Next, we introduce neural networks and how to tweak them to become compatible with the FHE computation model.
2.1 Fully Homomorphic Encryption
First proposed by Rivest et al. [FOSC:RivAdlDer78], fully homomorphic encryption (FHE) was envisioned to enable arbitrary computation on encrypted data. FHE would support operations on ciphertexts that translate to functions on the encrypted messages within. It remained unrealized for more than 30 years, until Gentry [STOC:Gentry09] proposed the first construction. The blueprint of this construction remains the only method to design FHE schemes. The (modernized) blueprint is a simple two-step process. First, a somewhat homomorphic encryption scheme that can evaluate its decryption function is designed. Then, we perform bootstrapping, which decrypts a ciphertext using an encrypted copy of the secret key. Note that the decryption function here is evaluated homomorphically, i.e., on encrypted data, and the result of decryption is also encrypted.
As bootstrapping imposes high computation costs, we adopt a levelled FHE scheme instead, which can evaluate functions up to a pre-determined multiplicative depth without bootstrapping. We chose the Brakerski-Fan-Vercauteren (BFV) scheme [C:Brakerski12, EPRINT:FanVer12], whose security is based on the Ring Learning With Errors (RLWE) problem proposed by Lyubashevsky et al. [EC:LyuPeiReg10]. This problem is conjectured to be hard even with quantum computers, backed by reductions (in [EC:LyuPeiReg10] among others) to worst-case problems in ideal lattices.
The BFV scheme has five algorithms (KeyGen, Encrypt, Decrypt, HAdd, HMult). KeyGen is the algorithm that generates the keys used in an FHE scheme given the parameters chosen. Encrypt and Decrypt are the encryption and decryption algorithms respectively. The difference between FHE and standard public-key encryption schemes lies in the operations on ciphertexts, which we call HAdd and HMult. HAdd outputs a ciphertext that decrypts to the sum of the two input encrypted messages while HMult outputs one that decrypts to the product of the two encrypted inputs.
We informally describe the basic scheme below and refer to [EPRINT:FanVer12] for the complete details. Let $k, q, t > 1$ with $N = 2^k$, $t$ prime and $R = \mathbb{Z}[X]/\langle X^N + 1 \rangle$; we denote the ciphertext space as $R_q = R/qR$ and the message space as $R_t = R/tR$. We call ring elements “small” when their coefficients have small absolute value.

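KeyGen($\lambda$, $L$): Given security parameter $\lambda$ and level $L$ as inputs, choose $k, q$ so that security level $\lambda$ is achieved. Choose a random element $a \in R_q$, “small” noise $e \in R_q$ and secret key $s \in R_2$; the public key is defined to be $pk = (b, a)$ with $b = -(as + e) \bmod q$.

Encrypt($pk$, $m$): Given public key $pk$ and message $m \in R_t$ as input, the encryption of $m$ is defined as $c = (bu + e_1 + \lfloor q/t \rfloor m,\ au + e_2)$, for some “small” random $u$ and random noise $e_1, e_2$.

Decrypt($sk$, $c$): Given secret key $sk = s$ and ciphertext $c = (c_0, c_1) \in R_q^2$ as inputs, the decryption of $c$ is
$$m = \left\lceil \frac{t}{q} \left( c_0 + c_1 s \bmod q \right) \right\rfloor \bmod t.$$

HAdd($c_1$, $c_2$): Given two ciphertexts $c_1 = (c_{0,1}, c_{1,1})$ and $c_2 = (c_{0,2}, c_{1,2})$ as inputs, the operation is simply component-wise addition, i.e. the output ciphertext is $c' = (c_{0,1} + c_{0,2},\ c_{1,1} + c_{1,2})$.

HMult($c_1$, $c_2$): Given two ciphertexts $c_1 = (c_{0,1}, c_{1,1})$ and $c_2 = (c_{0,2}, c_{1,2})$ as inputs, proceed as follows:

(Tensor) compute $c^* = (c_{0,1} c_{0,2},\ c_{0,1} c_{1,2} + c_{1,1} c_{0,2},\ c_{1,1} c_{1,2})$;
(Scale and Relinearize) output
$$c' = \mathrm{Relinearize}\left( \left\lceil \frac{t}{q} c^* \right\rfloor \right) \bmod q. \quad (2)$$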
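The five algorithms can be sketched in a few lines of Python. This is only a toy illustration under stated assumptions: the parameters `N`, `Q` and `T` are hypothetical and far too small to be secure, the key and encryption forms follow the textbook description in [EPRINT:FanVer12], and relinearization is omitted, so the product ciphertext has three components and is decrypted with $1$, $s$ and $s^2$.

```python
import random

# Toy parameters chosen for illustration only; they offer no security.
N = 16            # ring dimension, R = Z[X]/(X^N + 1)
Q = 1 << 50       # ciphertext modulus q
T = 257           # plaintext modulus t
DELTA = Q // T    # scaling factor floor(q / t)

def polymul(a, b):
    """Product in Z[X]/(X^N + 1), computed over the integers."""
    res = [0] * N
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            if i + j < N:
                res[i + j] += ai * bj
            else:                       # X^N = -1: wrap around with a sign flip
                res[i + j - N] -= ai * bj
    return res

def polyadd(a, b):
    return [x + y for x, y in zip(a, b)]

def center(a, m):
    """Reduce coefficients modulo m into the centered range around zero."""
    return [((x + m // 2) % m) - m // 2 for x in a]

def small():
    """A "small" ring element with coefficients in {-1, 0, 1}."""
    return [random.randint(-1, 1) for _ in range(N)]

def scale_round(a):
    """Coefficient-wise round(t/q * x), using exact integer arithmetic."""
    return [(2 * T * x + Q) // (2 * Q) for x in a]

def keygen():
    s, e = small(), small()
    a = [random.randrange(Q) for _ in range(N)]
    b = center([-(x + y) for x, y in zip(polymul(a, s), e)], Q)  # b = -(a*s + e)
    return (b, a), s

def encrypt(pk, m):
    b, a = pk
    u, e1, e2 = small(), small(), small()
    c0 = polyadd(polyadd(polymul(b, u), e1), [DELTA * x for x in m])
    c1 = polyadd(polymul(a, u), e2)
    return [center(c0, Q), center(c1, Q)]

def decrypt(s, c):
    """Evaluate c_0 + c_1 s (+ c_2 s^2), then scale by t/q and round."""
    acc, s_pow = [0] * N, [1] + [0] * (N - 1)
    for ci in c:
        acc = polyadd(acc, polymul(ci, s_pow))
        s_pow = polymul(s_pow, s)
    return [x % T for x in scale_round(center(acc, Q))]

def hadd(c, d):
    return [center(polyadd(x, y), Q) for x, y in zip(c, d)]

def hmult(c, d):
    """Tensor and scale; relinearization is omitted, so the output has 3 parts."""
    tensor = [polymul(c[0], d[0]),
              polyadd(polymul(c[0], d[1]), polymul(c[1], d[0])),
              polymul(c[1], d[1])]
    return [center(scale_round(part), Q) for part in tensor]

random.seed(0)
pk, sk = keygen()
m1 = [random.randrange(T) for _ in range(N)]
m2 = [random.randrange(T) for _ in range(N)]
ct1, ct2 = encrypt(pk, m1), encrypt(pk, m2)
sum_ok = decrypt(sk, hadd(ct1, ct2)) == [(x + y) % T for x, y in zip(m1, m2)]
prod_ok = decrypt(sk, hmult(ct1, ct2)) == [x % T for x in polymul(m1, m2)]
```

The large gap between `Q` and `T` is what leaves room for the noise added by one multiplication; a real parameter choice would balance this gap against the security level.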
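Correctness of the Scheme. For the scheme to be correct, we require that Decrypt($sk$, $c$) $= m$ for any $c$ output from Encrypt($pk$, $m$), where ($pk$, $sk$) is a correctly generated key pair from KeyGen. We characterize when decryption will succeed in the following theorem.
Theorem 1.
Let $c = (c_0, c_1)$ be a ciphertext with $c_0 + c_1 s = \lfloor q/t \rfloor m + v \bmod q$. Then, Decrypt outputs the correct message $m$ if $\|v\|_\infty < q/2t$, where $\|v\|_\infty$ is the largest coefficient (in absolute value) of the polynomial $v$.
Proof.
Recall that the decryption procedure computes $\lceil \frac{t}{q} (c_0 + c_1 s \bmod q) \rfloor \bmod t$. Therefore, to have the output equal to $m$, we first require that the reduction modulo $q$ leaves $\lfloor q/t \rfloor m + v$ intact, which means that $c_0 + c_1 s \bmod q = \lfloor q/t \rfloor m + v$. Finally, we need the rounding operation to output $m$ after scaling by $t/q$, which requires that $\|v\|_\infty < q/2t$ since $\frac{t}{q} \|v\|_\infty$ must be less than $1/2$. ∎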
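To see why HAdd works, part of the decryption requires computing
$$(c_{0,1} + c_{0,2}) + (c_{1,1} + c_{1,2}) s = \lfloor q/t \rfloor (m_1 + m_2) + v_1 + v_2,$$
where $m_i$ and $v_i$ are the message and noise of the input ciphertext $c_i = (c_{0,i}, c_{1,i})$. This equation remains correct modulo $q$ as long as the errors are small, i.e. $\|v_1 + v_2\|_\infty < q/2t$. Therefore, scaling by $t/q$ and rounding will be correct, which means that we obtain the desired message.
For HMult, the procedure is more complicated but observe that
$$(c_{0,1} + c_{1,1} s)(c_{0,2} + c_{1,2} s) = c_{0,1} c_{0,2} + (c_{0,1} c_{1,2} + c_{1,1} c_{0,2}) s + c_{1,1} c_{1,2} s^2. \quad (3)$$
This means that we need $s$ as well as $s^2$ to recover the desired message from the three-component ciphertext $c^*$. However, with a process called relinearization (Relinearize), proposed by Brakerski and Vaikuntanathan [FOCS:BraVai11] and applicable to the BFV scheme, $c^*$ can be transformed to be decryptable under the original secret key $s$.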
Computation Model with Fully Homomorphic Encryption. The set of functions that can be evaluated with FHE is the set of arithmetic circuits over the plaintext ring $R_t$. However, this is not an easy plaintext space to work with; elements in $R_t$ are polynomials of degree up to several thousand. Addressing this issue, Smart and Vercauteren [DCC:SmaVer14] proposed a technique to support single instruction multiple data (SIMD) by decomposing $R_t$ into a product of smaller spaces with the Chinese Remainder Theorem over polynomial rings. For prime $t \equiv 1 \bmod 2N$, $X^N + 1 \equiv \prod_{i=1}^{N} (X - \alpha_i) \bmod t$ for some $\alpha_i \in \{1, 2, \ldots, t - 1\}$. This means that $R_t = \prod_{i=1}^{N} \mathbb{Z}_t[X]/\langle X - \alpha_i \rangle \cong \prod_{i=1}^{N} \mathbb{Z}_t$. Therefore, the computation model generally used with homomorphic encryption is arithmetic circuits with modulo $t$ gates.
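The slot decomposition can be checked numerically on a small instance. The sketch below assumes the toy values $N = 8$ and $t = 17$ (so that $t \equiv 1 \bmod 2N$); the polynomials `p` and `r` are arbitrary examples. It verifies that multiplying two polynomials in $\mathbb{Z}_t[X]/\langle X^N + 1 \rangle$ corresponds to multiplying their evaluations at the roots $\alpha_i$ slot by slot.

```python
N, T = 8, 17                      # toy values with T = 1 (mod 2N)

# Roots of X^N + 1 modulo T: the alpha_i with alpha_i^N = -1 (mod T).
roots = [a for a in range(1, T) if pow(a, N, T) == T - 1]

def evaluate(poly, x):
    """Evaluate a polynomial (lowest degree first) at x modulo T (Horner)."""
    acc = 0
    for coeff in reversed(poly):
        acc = (acc * x + coeff) % T
    return acc

def to_slots(poly):
    """Map a ring element to its N slot values, one per root."""
    return [evaluate(poly, a) for a in roots]

def ring_mul(a, b):
    """Multiply in Z_T[X]/(X^N + 1)."""
    res = [0] * N
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            k, sign = (i + j, 1) if i + j < N else (i + j - N, -1)
            res[k] = (res[k] + sign * ai * bj) % T
    return res

p = [3, 1, 4, 1, 5, 9, 2, 6]
r = [2, 7, 1, 8, 2, 8, 1, 8]
slots_of_product = to_slots(ring_mul(p, r))
product_of_slots = [(x * y) % T for x, y in zip(to_slots(p), to_slots(r))]
```

Both lists agree, which is why one HMult on packed ciphertexts acts as $N$ independent multiplications modulo $t$.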
For efficiency, the circuits evaluated using the HAdd and HMult algorithms should be levelled. This means that the gates of the circuits can be organized into layers, with inputs in the first layer and output at the last, and the outputs of one layer are inputs to gates in the next layer. In particular, the most important property of arithmetic circuits for HE is its depth. The depth of a circuit is the maximum number of multiplication gates along any path of the circuit from the input to output layers.
A levelled FHE scheme with input level $L$ can evaluate circuits of depth at most $L$, which affects the choice of the parameter $q$ due to noise in ciphertexts. In particular, the HMult operation on ciphertexts is the main limiting factor to homomorphic evaluations. From Equation (3), writing $c_{0,i} + c_{1,i} s = \lfloor q/t \rfloor m_i + v_i$ for the two inputs, we have
$$(c_{0,1} + c_{1,1} s)(c_{0,2} + c_{1,2} s) = \lfloor q/t \rfloor^2 m_1 m_2 + \lfloor q/t \rfloor (m_1 v_2 + m_2 v_1) + v_1 v_2.$$
Even after scaling by $t/q$, the overall noise (roughly $m_1 v_2 + m_2 v_1$) in the output is larger than that of the inputs, $v_1$ and $v_2$. Successive calls to HMult produce outputs whose noise steadily grows. Since decryption only succeeds if the error in the ciphertext is less than $q/2t$, the maximum depth of a circuit supported is determined by the ciphertext modulus $q$. To date, the only known method to sidestep this is with the bootstrapping technique proposed by Gentry [STOC:Gentry09].
2.2 Neural Networks
A neural network, by which we mean artificial feedforward neural networks, can be seen as a circuit made up of levels called layers. Each layer is made up of a set of nodes, with the first being the inputs to the network. Nodes in the layers beyond the first take the outputs from a subset of nodes in the previous layer and output the evaluation of some function over them. The values of the nodes in the last layer are the outputs of the neural network.
In the literature, many different layers are used but these can generally be grouped into three categories.

Activation Layers: Each node in this layer takes the output, $z$, of a single node of the previous layer and outputs $f(z)$ for some function $f$.

Convolution-Type Layers: Each node in this layer takes the outputs, $\mathbf{z}$, of some subset of nodes from the previous layer and outputs a weighted-sum $\langle \mathbf{w}, \mathbf{z} \rangle + b$ for some weight vector $\mathbf{w}$ and bias $b$.

Pooling Layers: Each node in this layer takes the outputs, $\mathbf{z}$, of some subset of nodes from the previous layer and outputs $f(\mathbf{z})$ for some function $f$.
The functions used in the activation layers are quite varied, including sigmoid ($f(z) = \frac{1}{1 + e^{-z}}$), softplus ($f(z) = \log(1 + e^{z})$) and ReLU, where $\mathrm{ReLU}(z) = \max(0, z)$.
Although pooling layers are commonly used in practice, some have questioned their utility. Springenberg et al. [ICLR:SDBR15] proposed to remove pooling layers completely from convolutional neural networks and Kamnitsas et al. [JMIA:KLN+17] showed that pooling was unnecessary for some cases of image analysis. To adapt neural network operations to encrypted data, we do not use pooling and focus on the following layers:

Convolution (weighted-sum) Layer: At each node, we take a subset of the outputs of the previous layer, also called a filter, and perform a weighted-sum on them to get its output.

Square Layer: Each node is linked to a single node $z$ of the previous layer; its output is the square of $z$’s output.

Fully Connected Layer: Similar to the convolution layer, each node outputs a weighted-sum, but over the entire previous layer rather than a subset of it.
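The three layer types above need nothing beyond additions and multiplications, which is what makes them compatible with HAdd and HMult. As a minimal sketch on plain integers, assuming a one-dimensional input for brevity (the function names and example values are hypothetical and not part of the HCNN implementation):

```python
def weighted_sum(inputs, weights, bias=0):
    """Weighted sum of the inputs, using only additions and multiplications."""
    total = bias
    for w, z in zip(weights, inputs):
        total += w * z
    return total

def conv1d(inputs, kernel, stride=1):
    """Each output node is a weighted sum over a window (filter) of the inputs."""
    width = len(kernel)
    return [weighted_sum(inputs[i:i + width], kernel)
            for i in range(0, len(inputs) - width + 1, stride)]

def square(inputs):
    """Square layer: each node squares the output of a single previous node."""
    return [z * z for z in inputs]

def fully_connected(inputs, weight_rows):
    """Each output node is a weighted sum over the entire previous layer."""
    return [weighted_sum(inputs, row) for row in weight_rows]

x = [1, 2, 3, 4, 5]
h = square(conv1d(x, [1, -1, 2], stride=2))      # windows [1,2,3] and [3,4,5]
y = fully_connected(h, [[1, 0], [1, 1]])
```

Replacing the integer `+` and `*` by homomorphic operations on ciphertexts gives the same circuit over encrypted data.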
3 Homomorphic Convolutional Neural Networks
Homomorphic encryption (HE) enables computation directly on encrypted data. This is ideal for handling the challenges that machine learning faces when it comes to questions of data privacy. We call convolutional neural networks (CNN) that operate over encrypted data homomorphic convolutional neural networks (HCNN). Although HE promises a lot, there are several obstacles, ranging from the choice of plaintext space to translating neural network operations, that prevent straightforward translation of standard techniques for traditional CNNs to HCNNs.
3.1 Plaintext Space
The first problem is the choice of plaintext space for HCNN computation. Weights and inputs of a neural network are usually decimals, which are represented in floating-point. Unfortunately, these cannot be directly encoded and processed in most HE libraries and thus require some adjustments. For simplicity and to allow inference on large datasets, we pack the same pixel of multiple images in a single ciphertext as shown in Figure 1. Note that we can classify the entire MNIST testing dataset at once as the number of slots is more than 10,000.
Encoding into the Plaintext Space. We adopt the scalar encoding, which approximates these decimals with integers. It is done by multiplying them with some scaling factor $\Delta$ and rounding the result to the nearest integer. Then, numbers encoded with the same scaling factor can be combined with one another using integer addition or multiplication. For simplicity, we normalize the inputs and weights of HCNNs to lie in $[0, 1]$; the scaling factor $\Delta$ (initially) determines the number of bits of precision of the approximation, as well as the upper bound on the approximation.
Although straightforward to use, there are some downsides to this encoding. The scaling factor cannot be adjusted mid-computation and mixing numbers with different scaling factors is not straightforward. For example, suppose we have two messages $\Delta_1 m_1$ and $\Delta_2 m_2$ with two different scaling factors, where $\Delta_1 \neq \Delta_2$:
$$\Delta_1 m_1 \cdot \Delta_2 m_2 = \Delta_1 \Delta_2 \, (m_1 m_2), \qquad \Delta_1 m_1 + \Delta_2 m_2 = \Delta_1 \left( m_1 + \tfrac{\Delta_2}{\Delta_1} m_2 \right).$$
Multiplication will just change the scaling factor of the result to $\Delta_1 \Delta_2$, but the result of adding two encoded numbers is not their standard sum. This means that as homomorphic operations are done on encoded data, the scaling factor in the outputs increases without a means to control it. Therefore, the plaintext modulus $t$ has to be large enough to accommodate the maximum number that is expected to result from homomorphic computations.
With the smallest scaling factor, $\Delta = 2$, $64$ multiplications will suffice to cause the result to potentially overflow the space of $64$-bit integers. Unfortunately, we use larger $\Delta$ in most cases, which means that the expected maximum will be much larger. Thus, we require a way to handle large plaintext moduli of possibly several hundred bits.
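The behaviour of the scalar encoding can be illustrated with a short sketch. The scaling factors $16$ and $4$ and the values $0.75$ and $0.5$ are arbitrary choices for illustration, not the ones used in our experiments.

```python
def encode(x, scale):
    """Scalar encoding: multiply by the scaling factor and round."""
    return round(x * scale)

def decode(value, scale):
    return value / scale

scale = 16
a, b = 0.75, 0.5
ea, eb = encode(a, scale), encode(b, scale)        # 12 and 8

# Same scaling factor: addition keeps the scale, multiplication squares it.
decoded_sum = decode(ea + eb, scale)               # 1.25
decoded_prod = decode(ea * eb, scale * scale)      # 0.375

# Different scaling factors: the plain integer sum is not the encoded sum.
ec = encode(b, 4)                                  # 2
mixed = ea + ec                                    # 14, not an encoding of 1.25
aligned = ea + ec * (scale // 4)                   # 20, after aligning the scales

# Repeated squaring doubles the bit length of the scaling factor each time.
scale_bits = []
s = scale
for _ in range(5):
    scale_bits.append(s.bit_length())
    s = s * s
```

After four squarings the scaling factor alone needs 65 bits, which is why the plaintext modulus has to grow with the depth of the network.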
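Plaintext Space CRT Decomposition. One way to achieve this is to use a composite plaintext modulus, $t = \prod_{i=0}^{r-1} t_i$ for some primes $t_0, \ldots, t_{r-1}$ such that $t$ is large enough. Recall that the Chinese Remainder Theorem (CRT) gives us an isomorphism between $\prod_{i=0}^{r-1} \mathbb{Z}_{t_i}$ and $\mathbb{Z}_t$:
$$\mathrm{CRT}: \mathbb{Z}_{t_0} \times \cdots \times \mathbb{Z}_{t_{r-1}} \to \mathbb{Z}_t, \qquad \mathbf{m} = (m_0, \ldots, m_{r-1}) \mapsto m,$$
where for any $i \in \{0, \ldots, r-1\}$, we have $m \equiv m_i \bmod t_i$.
For such moduli, we can decompose any integer $m < t$ into a length-$r$ vector $\mathbf{m} = \mathrm{CRT}^{-1}(m)$ with $m_i = m \bmod t_i$. Arithmetic modulo $t$ is replaced by component-wise addition and multiplication modulo the prime $t_i$ for the $i$-th entry. We can recover the output of any computation as long as it is less than $t$ because the CRT map returns a result modulo $t$.
As illustrated in Figure 2, for homomorphic operations modulo $t$, we separately encrypt each entry of $\mathbf{m}$ in $r$ HE instances with the appropriate $t_i$ and perform modulo $t_i$ operations. At the end of the homomorphic computation of function $f$, we decrypt the $r$ ciphertexts, one per HE instance, to obtain the vector $f(\mathbf{m})$. The actual output $f(m)$ is obtained by applying the CRT map to $f(\mathbf{m})$, i.e. $f(m) = \mathrm{CRT}(f(\mathbf{m}))$.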
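The CRT decomposition can be sketched on unencrypted integers as follows. The three small primes and the function `f` are hypothetical stand-ins; in the HCNN setting each channel would be a separate HE instance with its own plaintext modulus.

```python
from math import prod

def crt_decompose(m, moduli):
    """Split an integer into its residues, one per CRT channel."""
    return [m % t_i for t_i in moduli]

def crt_reconstruct(residues, moduli):
    """Recombine residues into the unique integer modulo the product."""
    t = prod(moduli)
    total = 0
    for m_i, t_i in zip(residues, moduli):
        cofactor = t // t_i
        total += m_i * cofactor * pow(cofactor, -1, t_i)   # modular inverse
    return total % t

def f(x, y):
    """An example arithmetic circuit using only additions and multiplications."""
    return x * x * y + 3 * y

moduli = [257, 263, 269]            # pairwise coprime toy primes
t = prod(moduli)

x, y = 120, 340
xs, ys = crt_decompose(x, moduli), crt_decompose(y, moduli)
# Each channel evaluates f independently modulo its own prime.
channels = [f(xi, yi) % t_i for xi, yi, t_i in zip(xs, ys, moduli)]
result = crt_reconstruct(channels, moduli)

# If the true output exceeds t, only its value modulo t is recovered.
big = [f(1234 % t_i, 567 % t_i) % t_i for t_i in moduli]
wrapped = crt_reconstruct(big, moduli)
```

Here `result` equals `f(120, 340)` because that value is below the product of the moduli, whereas `wrapped` is only `f(1234, 567)` reduced modulo that product.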
3.2 Neural Network Layers
Computation in HE schemes is generally limited to addition and multiplication operations over ciphertexts. As a result, it is easy to compute polynomial functions with HE schemes. As with all HE schemes, encryption injects a bit of noise into the data and each operation on ciphertexts increases the noise within it. As long as the noise does not exceed some threshold, decryption is possible. Otherwise, the decrypted results are essentially meaningless.
Approximating Non-Polynomial Activations. For CNNs, a major stumbling block for translation to the homomorphic domain is the activation functions. These are usually not polynomials, and therefore unsuitable for evaluation with HE schemes. The effectiveness of the ReLU function in convolutional neural networks means that it is almost indispensable. Therefore, it should be approximated by some polynomial function to try to retain as much accuracy as possible. The choice of approximating polynomial depends on the desired performance of the HCNN. For example, in this work, we applied the square function, $z \mapsto z^2$, which Dowlin et al. [MSFT:DGL+16] found to be sufficient for accurate results on the MNIST dataset with a five-layer network.
The choice of approximation polynomial determines the depth of the activation layers as well as their complexity (number of HMults); both grow with the degree $d$ of the polynomial. However, with the use of scalar encoding, there is another effect to consider. Namely, the scaling factor on the output will be dependent on the degree of the approximation, i.e. if the scaling factor of the inputs to the activation layer is $\Delta$, then the scaling factor of the outputs will be roughly $\Delta^d$, assuming that the approximation is a monic polynomial.
Handling Pooling Layers. Similar to activations, the usual functions used in pooling layers, maximum ($\max(\mathbf{z})$), $\ell_2$-norm and mean ($\mathrm{avg}(\mathbf{z}) = \frac{1}{n} \sum_{i=1}^{n} z_i$) for inputs $\mathbf{z} = (z_1, \ldots, z_n)$, are generally non-polynomial. Although $\mathrm{avg}$ is a linear function, division in HE schemes is more involved and requires different plaintext encoding methods (see Dowlin et al. [IEEE:DGL+17]). In CryptoNets [MSFT:DGL+16], a variant of the mean function, called scaled-mean ($\mathrm{scaled\text{-}avg}(\mathbf{z}) = \sum_{i=1}^{n} z_i$), is used, which introduces an additional factor $n$ over $\mathrm{avg}$ and does not impact its performance. Still, that is not the only choice that is available. Several works [ICLR:SDBR15, JMIA:KLN+17] have shown that pooling is not strictly necessary and good results can be obtained without it. For a simpler CNN, we chose to remove the pooling layers used in CryptoNets during training and apply the same network for both training and inference, with the latter over encrypted data.
Convolution-Type Layers. Lastly, we have the convolution-type layers. Since these are weighted sums, they are straightforward to compute over encrypted data; the weights can be multiplied with the encrypted inputs with HMult and the results summed with HAdd. Nevertheless, we still have to take care of the scaling factor of outputs from this layer. At first thought, we may take the output scaling factor as $\Delta_w \Delta_i$, the product of the scaling factors of the weights and the inputs, denoted by $\Delta_w$ and $\Delta_i$ respectively. However, numbers can also increase in bit-size from the additions done in weighted sums. Recall that when adding two $b$-bit numbers, the upper bound on the sum is $b + 1$ bits long. Therefore, the maximum number that can appear in the worst case in the convolutions is about $\log_2(\Delta_w \Delta_i) + \log_2 n$ bits long, where $n$ is the number of terms in the sum. In practice, this bound is usually not achieved since the summands are almost never all positive. With negative numbers in the mix, the actual contribution from the summation can be moderated by some constant $c$.
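The worst-case bit growth of a weighted sum can be checked with a short sketch. The sizes used here (25 terms, 8-bit inputs, 4-bit weights) are hypothetical values chosen for illustration only.

```python
import math
import random

def weighted_sum_bit_bound(num_terms, input_bits, weight_bits):
    """Upper bound on the bit length of a sum of num_terms products."""
    return input_bits + weight_bits + math.ceil(math.log2(num_terms))

num_terms, input_bits, weight_bits = 25, 8, 4
bound = weighted_sum_bit_bound(num_terms, input_bits, weight_bits)

# Worst case: every input and every weight is maximal and positive.
max_input, max_weight = (1 << input_bits) - 1, (1 << weight_bits) - 1
worst_case = num_terms * max_input * max_weight

# Typical case: signed weights partially cancel in the sum.
random.seed(1)
inputs = [random.randint(0, max_input) for _ in range(num_terms)]
weights = [random.randint(-max_weight, max_weight) for _ in range(num_terms)]
typical = abs(sum(w * z for w, z in zip(weights, inputs)))
```

The worst case meets the bound, while sums with mixed signs usually stay below it, which is the moderation effect described above.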
4 Implementation
The implementation comprises two parts: 1) training on unencrypted data, and 2) classifying encrypted data. Training on unencrypted data is performed using the 5-layer network whose details are shown in Table 1. We use the Tensorpack framework [wu2016tensorpack] to train the network and compute the model. This part is quite straightforward and can be simply verified by classifying the unencrypted test dataset. For neural network design, one of the major constraints posed by homomorphic encryption is the limited numerical precision of layer-wise weight variables. Training networks with lower-precision weights significantly limits the precision explosion in ciphertexts as network depth increases, and thus speeds up inference in the encrypted domain. To this end, we propose to train low-precision networks from scratch, without incurring any loss in accuracy compared to networks trained in floating-point precision. Following [zhou2016dorefa], for each convolutional layer, we quantize floating-point weight variables $w$ to $k$-bit numbers $w_q$ using the simple uniform scalar quantizer shown below:
$$w_q = \frac{1}{2^k - 1} \, \mathrm{round}\left( (2^k - 1) \, w \right).$$
As this quantizer is a non-differentiable function, we use the Straight-Through Estimator (STE) [BengioLC13] to enable backpropagation. We trained the 5-layer network on the MNIST training set with precision of weights at 2, 4, 8 and 32 bits, and evaluated on the MNIST test set with reported accuracy of 96%, 99%, 99% and 99% respectively. In view of this, we choose the 4-bit network for the following experiments. It is worth noting that CryptoNets [MSFT:DGL+16] requires 5 to 10 bits of precision on weights to hit 99% accuracy on the MNIST test set, while our approach further reduces it to 4 bits and still maintains the same accuracy.
The second part is more involved since it requires running the network (with the pre-learned model) on encrypted data. First, we need to fix HE parameters to accommodate both the network multiplicative depth and precision. We optimized the scaling factor in all aspects of the HCNN. Inputs were normalized to $[0, 1]$, scaled and then rounded to their nearest integers. With the low-precision network trained from scratch, we convert the weights of the convolution-type layers to short 4-bit integers, using a small scaling factor; no bias was used in the convolutions. Next, we implement the network using NTL [shoup2005ntl] (a multi-precision number theory C++ library). NTL is used to facilitate the treatment of the scaled inputs and accommodate precision expansion of the intermediate values during the computation. We found that the largest precision needed is less than $2^{43}$. This is low enough to fit in a single word on 64-bit platforms without overflow. By estimating the maximum precision required by the network, we can estimate the HE parameters required by HCNN.
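The uniform quantizer used for low-precision training can be sketched as below. This is a minimal forward-pass illustration that assumes the input has already been normalized to $[0, 1]$; the helper names are hypothetical, and the straight-through estimator is only described in a comment because it concerns the backward pass of the training framework.

```python
def quantize_to_int(r, k):
    """Map r in [0, 1] to one of the 2^k integer levels 0 .. 2^k - 1."""
    levels = (1 << k) - 1
    return round(levels * r)

def quantize(r, k):
    """Uniform k-bit quantizer: round((2^k - 1) * r) / (2^k - 1)."""
    levels = (1 << k) - 1
    return quantize_to_int(r, k) / levels

# During training, the straight-through estimator treats quantize() as the
# identity in the backward pass, so gradients flow through the rounding step.

four_bit_level = quantize_to_int(0.73, 4)          # one of 16 possible levels
two_bit_values = sorted({quantize(i / 100, 2) for i in range(101)})
```

With $k = 4$ every weight becomes one of 16 integer levels, which is what keeps the integers entering the encrypted computation short.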
Layer Type  Description  Layer Size 

Convolution  filters of size and stride without padding. 

Square  Outputs of the previous layer are squared.  
Convolution  filters of size and stride without padding.  
Square  Outputs of the previous layer are squared.  
Fully Connected  |  Weighted sum of the entire previous layer with 10 filters, each output corresponding to one of the 10 possible digits. 
The next step is to implement the network using an HE library. We implement HCNN using two HE libraries: SEAL and our GPU-accelerated BFV (AFV) [TCHES:BVMA18]. The purpose of implementing the network in SEAL is to facilitate a more unified comparison under the same system parameters. In addition, we would like to highlight a limitation in the Residue Number System (RNS) variant that is currently implemented in SEAL. Before delving into the details of our implementation, we introduce an approach that is commonly followed to choose the FHE parameters.
4.1 Choice of Parameters
Similar to other cryptographic schemes, one needs to select FHE parameters to ensure that known attacks are computationally infeasible. We denote the desired security parameter by $\lambda$, measured in bits. This means that an adversary needs to perform $2^\lambda$ elementary operations to break the scheme with probability one. A widely accepted estimate for $\lambda$ in the literature is 128 bits [smart2014algorithms], which is used here to generate the BFV parameters. We also show parameters for 80-bit security for comparison with previous works.
In this work, we used a levelled BFV scheme that can be configured to support a known multiplicative depth $L$. $L$ can be controlled by three parameters: $q$, $t$ and noise growth. The plaintext modulus $t$ is problem dependent whereas noise growth is scheme dependent. As mentioned in the previous section, we found that $t$ should be at least a 43-bit integer to accommodate the precision expansion in HCNN evaluation.
For our HCNN, five multiplication operations are required: 2 ciphertext-by-ciphertext (in the square layers) and 3 ciphertext-by-plaintext (in the convolution and fully connected layers) operations. It is known that the latter has less effect on noise growth. This means that $L$ need not be set to 5. We found that $L = 4$ is sufficient to run HCNN in AFV. However, SEAL required a higher depth ($L = 5$) to run our HCNN. The reason behind this is that SEAL implements the BEHZ [bajard2016full] RNS variant of the BFV scheme, which slightly increases the noise growth, whereas in AFV we implement the HPS [EPRINT:HalPolSho18] RNS variant, which has a lower effect on the noise growth. For a detailed comparison of these two RNS variants, we refer the reader to [EPRINT:BPAVR18].
Having $L$ and $t$ fixed, we can estimate $q$ using the noise growth bounds of the BFV scheme. Next, we estimate $N$ to ensure a certain security level. To calculate the security level, we used the LWE hardness estimator in [albrecht2015concrete] (commit 76d05ee).
The above discussion suggests that the design space of HCNN is not limited to a single design; it depends on the choice of the plaintext coefficient modulus $t$. We identify a set of possible designs that fit different requirements. The designs vary in the number of factors in $t$ (i.e., number of CRT channels) and the provided security level. Note that, in the 1-CRT channel design, we set $t$ as a 43-bit prime number, whereas in the 2-CRT channels design, we use two 22-bit prime numbers whose product is a 43-bit number. Table 2 shows the system parameters used for each design with the associated security level.
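Design  |  Parameter Set  |  $\log_2 q$  |  Depth $L$  |  $\lambda$ (bits)
1-CRT channel  |  1  |  330  |  4  |  82
1-CRT channel  |  2  |  360  |  5  |  76
1-CRT channel  |  3  |  330  |  4  |  175
1-CRT channel  |  4  |  360  |  5  |  159
2-CRT channels  |  5  |  240  |  4  |  252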
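The choice of two 22-bit primes whose product is a 43-bit number can be reproduced with a simple search. The sketch below is only illustrative: the helper names are hypothetical, the primes it finds are not claimed to be the ones used in our designs, and the optional `step` argument is an assumption that lets one additionally require $t_i \equiv 1 \bmod 2N$ for SIMD packing.

```python
def is_prime(n):
    """Trial division; adequate for numbers of about 22 bits."""
    if n < 2:
        return False
    if n % 2 == 0:
        return n == 2
    d = 3
    while d * d <= n:
        if n % d == 0:
            return False
        d += 2
    return True

def primes_of_bit_length(bits, count, step=1):
    """First `count` primes p of the given bit length with p = 1 (mod step)."""
    found = []
    p = 1 + step * ((1 << (bits - 1)) // step)
    while len(found) < count and p.bit_length() <= bits:
        if p.bit_length() == bits and is_prime(p):
            found.append(p)
        p += step
    return found

t0, t1 = primes_of_bit_length(22, 2)
t = t0 * t1                      # composite plaintext modulus for 2 CRT channels
```

Taking the primes just above $2^{21}$ keeps their product just above $2^{42}$, so the composite modulus has 43 bits while each channel only handles 22-bit plaintexts.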
4.2 HCNN Inference Library
As most deep learning frameworks do not use functions that fit the restrictions of HE schemes, we designed an inference library using standard C++ libraries that implements some of the CNN layers using only additions and multiplications. Support for arbitrary scaling factors per layer is included for flexibility and allows us to easily define neural network layers for HCNN inference. We give a brief summary of the scaling factor growth of the layers we used in Table 3.
Layer Type  Output Scaling Factor 

ConvolutionType ()  , for some . 
Square Activation ()  . 
and are the input and weight scaling factors respectively. 
In Section 2.2, we introduced several types of layers that are commonly used in designing neural networks, namely activation, convolutiontype and pooling. Now, we briefly describe how our library realizes these layers. For convolutiontype layers, they are typically expressed with matrix operations but only require scalar additions and multiplications. Our inference library implements them using the basic form, , for input and weights .
For the other two, activation and pooling, some modifications were needed for compatibility with HE schemes. In activation layers, the most commonly used functions are , sigmoid () and softplus (). These are non-polynomial functions and thus cannot be directly evaluated over HE-encrypted data. Our library uses integral polynomials to approximate these functions; in particular, for our HCNN for MNIST data, we used the square function, , as a low-complexity approximation of . Pooling layers, as mentioned in Section 3.2, are not straightforward to implement with what HE offers. For this work, we choose to avoid pooling layers entirely, in contrast to CryptoNets [MSFT:DGL+16], which uses them in the training phase.
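The effect of the square activation on the scaling factor can be sketched as follows. This is an illustration on plain integers under an assumed fixed-point encoding, with hypothetical names; it is not code from our library. Squaring a value that carries a scaling factor delta_in yields a value that carries delta_in squared, which is one reason the required plaintext precision grows with network depth.

```python
# Sketch: square activation on fixed-point integers.
# delta_in is an illustrative input scaling factor.
def square_activation(values_int):
    """Element-wise square: one multiplication per element."""
    return [v * v for v in values_int]


delta_in = 16
x = [0.5, -1.25, 2.0]
x_int = [round(v * delta_in) for v in x]  # fixed-point encoding

y_int = square_activation(x_int)
delta_out = delta_in ** 2                 # the scaling factor is squared too
y = [v / delta_out for v in y_int]        # decoding, for comparison only
```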
4.3 GPU-Accelerated Homomorphic Encryption
The HE engine includes an implementation of an RNS variant of the BFV scheme [EPRINT:HalPolSho18] that we implemented in previous works [TCHES:BVMA18, EPRINT:BPAVR18]. The BFV scheme is considered among the most promising HE schemes due to its simple structure and low-overhead primitives compared to other schemes. Moreover, it is a scale-invariant scheme in which the ciphertext coefficient modulus is fixed throughout the entire computation. This contrasts with scale-variant schemes, which keep a chain of moduli and switch between them during computation. We use our GPU-based BFV implementation as the underlying HE engine to perform the core HE primitives: key generation, encryption, decryption and homomorphic operations such as addition and multiplication.
Our HE engine (shown in Figure 3) comprises three main components:

Polynomial Arithmetic Unit (PAU): performs basic polynomial arithmetic such as addition and multiplication.

Residue Number System Unit (RNSU): provides additional RNS tools for efficient polynomial scaling required by BFV homomorphic multiplication and decryption.

Random Number Generator Unit (RNG): used to generate random polynomials required by BFV key generation and encryption.
In addition, the HE engine includes a set of Look-Up Tables (LUTs) that are used for fast modular arithmetic and number-theoretic transforms required by both PAU and RNSU. For further details on the GPU implementation of BFV, we refer the reader to the aforementioned works.
We note that further task parallelism can be extracted from HCNN by decomposing the computation into smaller independent parts that can run in parallel. For instance, in the 2-CRT design, each channel can be executed on a separate GPU. In this scenario, the computation is completely separable, requiring communication only at the beginning and end of the computation for CRT calculations. Nevertheless, our implementation executes the channels sequentially on a single GPU.
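To make the kind of arithmetic handled by the PAU concrete, the following is a minimal CPU sketch of polynomial addition and multiplication. It assumes the ring of polynomials modulo x^n + 1 with coefficients modulo q, the usual setting for BFV, and it deliberately swaps in schoolbook multiplication where a practical engine would use number-theoretic transforms. The function names and the toy parameters are hypothetical and do not reflect our GPU implementation.

```python
# Sketch: arithmetic in Z_q[x]/(x^n + 1) with schoolbook multiplication.
# A real engine would use NTTs; this only illustrates the ring structure.
def poly_add(a, b, q):
    """Coefficient-wise addition modulo q."""
    return [(x + y) % q for x, y in zip(a, b)]


def poly_mul_negacyclic(a, b, q):
    """Multiply two degree-(n-1) polynomials modulo x^n + 1 and q."""
    n = len(a)
    out = [0] * n
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            k = i + j
            if k < n:
                out[k] = (out[k] + ai * bj) % q
            else:
                # x^n = -1, so the wrapped-around term changes sign.
                out[k - n] = (out[k - n] - ai * bj) % q
    return out


q = 17                       # toy coefficient modulus
a = [1, 2, 3, 4]             # 1 + 2x + 3x^2 + 4x^3
b = [5, 6, 7, 8]             # 5 + 6x + 7x^2 + 8x^3
total = poly_add(a, b, q)
product = poly_mul_negacyclic(a, b, q)
```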
5 Experiments
In this section, we describe our experiments to evaluate the performance of HCNN using the aforementioned designs. We start by describing the hardware configuration. Next, we present the results together with a discussion of the performance.
5.1 Hardware Configuration
The experiments were performed on a server with an Intel Xeon Platinum 8170 CPU @ 2.10 GHz with 26 cores, GB RAM and an NVIDIA Tesla V100 GPU card with GB onboard memory.
5.2 Methodology
We run our HCNN under the aforementioned designs using both SEAL, version [SEAL] and our AFV library [TCHES:BVMA18] on CPU and GPU, respectively. Note that our HCNN implementations execute the 2-CRT channels sequentially on a single GPU card. Timing results can be halved if the two channels are run simultaneously on two GPUs. The same applies to SEAL.
Dataset. The MNIST dataset [MNIST] consists of 70,000 images of handwritten digits (60,000 in the training set and 10,000 in the testing set). Each image is a 28×28 array of values between 0 and 255, corresponding to the gray level of a pixel in the image.
5.3 Results
Table 4 shows the runtime of evaluating our HCNN using SEAL and AFV on CPU and GPU, respectively. We include the timing of all the aforementioned parameter sets. AFV outperforms SEAL significantly in every instance where SEAL completes. In particular, the speedup factors achieved are 61.68× (1-CRT at 76-bit security), 108.20× (1-CRT at 159-bit security) and 80.57× (2-CRT at 252-bit security). The results show that AFV is superior at handling large FHE parameters, where the maximum speedup is recorded. The amortized time represents the per-image inference time. Note that with parameter sets 3, 4 and 5 we can classify the entire testing dataset of MNIST in a single network evaluation.
Design          Parameter Set  SEAL (multi-core CPU)       AFV (GPU)                   Speedup
                               Total (s)   Amortized (ms)  Total (s)   Amortized (ms)
1-CRT channel   1              Failure     -               11.286      1.378           -
                2              739.908     90.321          11.996      1.464           61.68×
                3              Failure     -               14.105      1.411           -
                4              1563.852    156.385         14.454      1.445           108.20×
2-CRT channels  5              1860.922    186.092         23.098      2.310           80.57×
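The amortized times and speedup factors in Table 4 follow from simple arithmetic, sketched below for the parameter sets that classify the entire testing dataset in one evaluation. The function names are hypothetical; the batch size of 10,000 is the size of the MNIST testing set, and the timings are copied from the table.

```python
# Sketch: derive per-image time and speedup from total runtimes.
BATCH = 10_000  # images classified per evaluation for parameter sets 3 to 5


def amortized_ms(total_seconds, batch=BATCH):
    """Per-image inference time in milliseconds."""
    return total_seconds / batch * 1000.0


def speedup(cpu_seconds, gpu_seconds):
    """Ratio of CPU (SEAL) runtime to GPU (AFV) runtime."""
    return cpu_seconds / gpu_seconds


afv_set3 = amortized_ms(14.105)       # about 1.411 ms per image
seal_set5 = amortized_ms(1860.922)    # about 186.092 ms per image
gain_set4 = speedup(1563.852, 14.454) # about 108.20
gain_set5 = speedup(1860.922, 23.098) # about 80.57
```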
The results also show the importance of low-precision training, which reduced the precision required to represent the network output. This allows running a single instance of the network without plaintext decomposition (1-CRT channel). We remark that CryptoNets used higher-precision training and required a plaintext modulus of higher precision (). Therefore, they had to run the network twice using 2-CRT channels. Moreover, our low-precision training did not affect the accuracy of the inference, as we managed to achieve 99% accuracy.
We also note that our timing results shown here for SEAL are much higher than those reported in CryptoNets (570 seconds at 80-bit security). This can be attributed to the following reasons: 1) CryptoNets used the YASHE levelled FHE scheme, which is known to be less computationally intensive than BFV, the scheme currently implemented in SEAL [lepoint2014comparison]. It should be remarked that YASHE is no longer considered secure due to the subfield lattice attacks [C:AlbBaiDuc16], and 2) CryptoNets used much lower system parameters that guarantee only an 80-bit security level, whereas our implementation ensures a much higher security level (128-bit).
Lastly, we compare our best results with the currently available solutions in the literature. Table 5 shows the reported results of two previous works that utilized FHE to evaluate HCNNs. As we can see, our solution outperforms both solutions in total and amortized time. For instance, AFV is 50.51× and 2.53× faster than CryptoNets and E2DM, respectively, in total runtime. Note that E2DM classifies 64 images in a single evaluation. This means that to classify the entire testing dataset, one would need more than 1 hour.
Solution                                   Total (s)  Amortized (ms)  Security level (bits)
CryptoNets [MSFT:DGL+16]                   570        69.580          80
E2DM [Jiang:2018:SOM:3243734.3243837]      28.590     450.0           80
AFV                                        11.286     1.378           82
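The estimate that E2DM would need more than 1 hour for the entire testing dataset can be checked with a short calculation. The sketch below assumes the evaluations are run back to back with no overlap, takes the figures from Table 5, and uses a hypothetical function name.

```python
# Sketch: time to classify a full dataset when each evaluation
# handles a fixed number of images and evaluations run sequentially.
import math


def full_dataset_seconds(seconds_per_eval, images_per_eval, n_images=10_000):
    """Number of evaluations needed, times the runtime of one evaluation."""
    evaluations = math.ceil(n_images / images_per_eval)
    return evaluations * seconds_per_eval


e2dm_seconds = full_dataset_seconds(28.590, 64)  # 157 evaluations
e2dm_hours = e2dm_seconds / 3600.0
```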
6 Conclusions
In this work, we presented a fully FHE-based CNN that is able to homomorphically classify encrypted MNIST images with AFV. The main motivation of this work was to show that privacy-preserving deep learning with FHE is dramatically accelerated with GPUs and offers a way towards efficient DLaaS. Our implementation included a set of techniques such as low-precision training, a unified training and testing network, optimized FHE parameters and a very efficient GPU implementation to achieve high performance. We managed to evaluate our HCNN in the 1-CRT setting, in contrast to previous works that required at least 2-CRT. Our solution achieved a high security level ( bit) and high accuracy (99%). In terms of performance, our best results show that we could classify the entire testing dataset in 14.105 seconds, with a per-image amortized time of 1.411 milliseconds, 40.41× faster than prior art.
In its current implementation, our HCNN has adopted the simple encoding method of packing the same pixel of multiple images into one ciphertext, as described in Section 3.1. This packing scheme is ideal for applications that require the inference of large batches of images, which can be processed in parallel in a single HCNN evaluation. Other applications may have different requirements, such as classifying one or a small number of images. For this particular case, other packing methods that pack more pixels of the same image into the ciphertext can be used. As future work, we will investigate other packing methods that can fit a wide range of applications. Moreover, we will target more challenging problems with larger datasets and deeper networks.
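The packing method mentioned above can be sketched on plain integer lists, where each list stands in for the slots of one ciphertext. This is an illustration with hypothetical names and toy data, not our library's encoder. Each packed vector holds the same pixel position from every image in the batch, so a single slot-wise weighted sum produces one output per image.

```python
# Sketch: pack the same pixel of multiple images into one slot vector,
# then evaluate a weighted sum for the whole batch at once.
def pack_by_pixel(images):
    """Turn a list of flat images into one slot vector per pixel position."""
    n_pixels = len(images[0])
    return [[img[p] for img in images] for p in range(n_pixels)]


def batched_weighted_sum(packed, weights):
    """Slot-wise sum of weight * pixel-vector; one result per image."""
    acc = [0] * len(packed[0])
    for slots, w in zip(packed, weights):
        acc = [a + w * s for a, s in zip(acc, slots)]
    return acc


images = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]]  # 3 toy images
weights = [1, 0, -1, 2]

packed = pack_by_pixel(images)
outputs = batched_weighted_sum(packed, weights)
```

The number of slot-wise operations depends on the number of pixels and not on the batch size, which is why this layout favours large batches and is less suitable when only one or a few images need to be classified.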