CHET: Compiler and Runtime for Homomorphic Evaluation of Tensor Programs

10/01/2018 ∙ by Roshan Dathathri, et al.

Fully Homomorphic Encryption (FHE) refers to a set of encryption schemes that allow computations to be applied directly on encrypted data without requiring a secret key. This enables novel application scenarios where a client can safely offload storage and computation to a third-party cloud provider without having to trust the software and the hardware vendors with the decryption keys. Recent advances in both FHE schemes and implementations have moved such applications from theoretical possibilities into the realm of practicalities. This paper proposes a compact and well-reasoned interface called the Homomorphic Instruction Set Architecture (HISA) for developing FHE applications. Just as the hardware ISA interface enabled hardware advances to proceed independent of software advances in the compiler and language runtimes, HISA decouples compiler optimizations and runtimes for supporting FHE applications from advancements in the underlying FHE schemes. This paper demonstrates the capabilities of HISA by building an end-to-end software stack for evaluating neural network models on encrypted data. Our stack includes an end-to-end compiler, runtime, and a set of optimizations. Our approach shows generated code, on a set of popular neural network architectures, is faster than hand-optimized implementations.




1 Introduction

Many applications benefit from storing and computing on data in the cloud. Cloud providers centralize and manage compute and storage resources, which lets developers focus on the logic of their application and not on how to deploy and manage the infrastructure that surrounds it. A large class of applications, however, have strict data secrecy and privacy requirements ranging from government regulations to gaining competitive advantage from consumer protection. In these cases, data secrecy and privacy issues end up limiting otherwise convenient cloud adoption.

Fully Homomorphic Encryption (FHE) allows computations to be applied directly on encrypted data and thus enables a wide variety of cloud applications that may not be available today. A cloud-based FHE application encrypts its plaintext input and stores it in encrypted form in cloud storage. At a later time, the application requests the cloud to perform an FHE operation on the encrypted data and to store the (still encrypted) result. The application can download the encrypted result from cloud storage at any point and decrypt it to obtain the plaintext result.

FHE provides a simple trust model: the owner of the data does not need to trust the cloud software provider or the hardware vendor. Homomorphic encryption schemes rely on well-studied mathematical hardness assumptions, i.e., if there is a polynomial time algorithm for breaking the encryption then there is also a polynomial time algorithm for solving these hard mathematical problems. Other cryptographic solutions such as Secure Multi-Party Computation (MPC) [29, 30] require complicated trust models and typically have larger communication costs. Non-cryptographic technologies such as secure enclaves [24] (e.g., Intel SGX [18]) require one to trust a hardware vendor.

Of course, there is no free lunch: FHE is usually considered impractical due to its performance overhead. However, recent advances in FHE schemes and implementations dramatically improve performance, which makes many applications practical today. For example, initial FHE schemes [22] only operated on bits, requiring logic circuits to perform arithmetic. Modern FHE schemes [10, 20, 16] natively support integer and fixed-precision arithmetic, dramatically reducing the size of the circuits required. Similarly, optimized implementations of FHE schemes such as [27] have further reduced the cost of performing individual FHE operations.

Despite these improvements, FHE operations are still several orders of magnitude slower than their plaintext counterparts. Nevertheless, certain privacy-sensitive applications are willing to pay the cost. For instance, a bank based in Europe is unlikely to use secure enclaves for sensitive computation as that requires trusting Intel, a US company. For such applications, it is important to build FHE applications that are as efficient as feasible, despite the overhead relative to plaintext computation.

Today, building an efficient FHE application still requires manual and error-prone work. FHE encryption schemes usually require a careful setting of encryption parameters that trade off performance and security. Moreover, current FHE schemes introduce noise when operating on encrypted data. Managing the growth of this noise while satisfying the precision requirements of the application currently requires laborious effort from the developer. Also, popular FHE schemes support vectorized operations (called batching in the FHE literature) with very large vector sizes (of thousands). Application efficiency crucially relies on maximizing the parallelism available in the application to use the vector capabilities. Lastly, FHE schemes limit the operations allowed in an application (i.e., no branches or control flow) and thus require a developer to creatively structure their program to work around such limitations.

In many respects, programming FHE applications directly on FHE schemes today is akin to low-level assembly programming on modern architectures. Our central hypothesis behind this paper is that future FHE applications will benefit from a compiler and runtime that targets a compact and well-reasoned interface, which we call the Homomorphic Instruction Set Architecture (HISA). Just as the hardware ISA interface enabled hardware advances to proceed independent of software advances in the compiler and language runtimes, we posit that HISA isolates future advances in FHE schemes from the compiler optimizations and runtime improvements for supporting FHE applications.

To demonstrate and evaluate the utility behind such an interface, this paper focuses on the task of fully-homomorphic deep-neural network (DNN) inference. This application is particularly suited as the first domain of study. First, the DNN inference computation does not involve discrete control flow, which is a perfect match for FHE. While DNN inference involves non-polynomial activation functions, they can be effectively replaced with polynomial approximations [23, 13]. The ability of DNNs to tolerate errors also provides us the flexibility of quantizing and reducing the precision of the computation for performance.

This paper presents, to the best of our knowledge, the first compiler and runtime, called CHET, for developing FHE DNN inference. CHET guarantees sound semantics by selecting the right encryption parameters that maximize performance while guaranteeing security and application-intended precision of the computed results. CHET builds on the recent encryption scheme [16] that provides approximate arithmetic to emulate the fixed-point arithmetic required for DNN inference. Finally, CHET explores a large space of optimization choices, such as the multiple ways to pack data into ciphertexts as well as multiple ways to implement computational kernels for each layout, that are currently made manually by experts.

In addition to a compiler, CHET includes a runtime, akin to the linear algebra libraries used in unencrypted evaluation of neural networks. We have developed a set of layouts and a unified metadata representation for them. For these new layouts we have developed a set of computational kernels that implement the common operations found in convolutional neural networks (CNNs). All of these kernels were designed to use the SIMD-style capabilities of modern FHE schemes.

We evaluate CHET with a set of real-world CNN models and show how different optimization choices available in our compiler can significantly improve latencies. We further demonstrate how the optimized kernel implementations in CHET allow neural networks of unprecedented depth to be homomorphically evaluated.

In summary, our main contributions are as follows:

  • A homomorphic instruction set architecture (HISA) that cleanly abstracts and exposes features of FHE libraries.

  • A set of tiled layouts and a metadata format for packing tensors into ciphertexts.

  • Novel kernel implementations of homomorphic tensor operations.

  • A compiler and runtime for homomorphic evaluation of tensor programs, known as CHET.

  • An evaluation of CHET on a set of real-world CNN models, including the homomorphic evaluation of the deepest CNN to date.

2 Background

This Section provides a concise background on homomorphic encryption, including its limitations and capabilities, followed by a brief introduction to tensor programs.

2.1 Homomorphic Encryption

We will begin by describing the properties of leveled fully homomorphic encryption (FHE) that are essential for a programmer. We first introduce general concepts followed by a description of the HEAAN scheme [16], which is used in our evaluations.

Leveled FHE can evaluate an arithmetic circuit of a pre-determined multiplicative depth $L$. The reason for such a restriction is that typically in homomorphic encryption each ciphertext carries an inherent "noise" which increases when computations are performed. Once the noise reaches a certain bound, the ciphertext becomes corrupted and the message cannot be recovered anymore, even with the correct decryption key. In Leveled FHE, the parameters of the encryption scheme are chosen so that the noise remains below this threshold for circuits of depth up to the chosen value $L$. Unfortunately, due to the increase in the parameters of the scheme, the complexity of each homomorphic operation also grows with $L$.

Encryption Keys:

FHE schemes have two sets of keys: public keys and private keys. Private keys are used for decryption and public keys are used for encryption and in homomorphic operations. As long as a user does not reveal their private key, FHE schemes leak no information about the user's data. (Encryption schemes in general base their security on the hardness of some underlying problem, e.g., learning with errors [28].) Public keys, on the other hand, are meant to be shared for secure computation.

Supported Operations:

Existing FHE schemes on integers support only two basic algebraic operations: addition and multiplication. The operands can be any combination of ciphertexts and plaintexts. Techniques for fixed point operands are discussed later in this Section. All other operations need to be approximated in terms of additions and multiplications. For example, non-polynomial functions can be approximated by polynomials using Taylor expansion. Branching, on the other hand, is not possible in FHE, and the only possible solution is executing all cases and masking the result.

Noise and Multiplicative Depth:

We denote the amount of noise associated with ciphertext by . The amount of noise for a ciphertext depends on its operands and the operator and excessive noise completely destroys the encrypted message. Any freshly encrypted value has a small amount of noise. For any given ciphertexts, and and plaintext , the noise contribution is roughly as follows:

where is a constant number that depends on parameter selection of the encryption. The multiplicative noise increase between two ciphertexts greatly affects the design of a FHE computation. Multiplicative depth of a ciphertext is the maximum number of ciphertext multiplications that contributed to the computation of and they are all a part of a chain. For example, for ciphertext where , and are ciphertexts and is a plaintext, the multiplicative depth is . For a multiplicative depth of , the noise is . The maximum noise tolerance can be adjusted by the encryption parameters but that negatively impacts computational complexity of operations. Therefore, it is crucial to keep the multiplicative depth minimal to avoid the costs.
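The notion of multiplicative depth can be illustrated with a minimal Python sketch. The expression classes (`Cipher`, `Plain`, `Op`) and the function `mult_depth` are hypothetical names introduced only for this illustration, and the sketch assumes that only ciphertext-ciphertext multiplications increase the depth, while additions and multiplications by plaintexts do not.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Cipher:
    name: str  # an encrypted input


@dataclass(frozen=True)
class Plain:
    name: str  # an unencrypted input


@dataclass(frozen=True)
class Op:
    kind: str  # "add" or "mul"
    left: "Expr"
    right: "Expr"


Expr = Union[Cipher, Plain, Op]


def is_encrypted(e: Expr) -> bool:
    """An expression is encrypted if any of its inputs is a ciphertext."""
    if isinstance(e, Cipher):
        return True
    if isinstance(e, Plain):
        return False
    return is_encrypted(e.left) or is_encrypted(e.right)


def mult_depth(e: Expr) -> int:
    """Longest chain of ciphertext-ciphertext multiplications below e.

    Assumption for this sketch: plaintext multiplications and all
    additions leave the depth unchanged.
    """
    if not isinstance(e, Op):
        return 0
    depth = max(mult_depth(e.left), mult_depth(e.right))
    if e.kind == "mul" and is_encrypted(e.left) and is_encrypted(e.right):
        return depth + 1
    return depth


# (a * b) * (c * p): one ciphertext product on the left, a plaintext
# product on the right, then one more ciphertext product at the root.
example = Op("mul",
             Op("mul", Cipher("a"), Cipher("b")),
             Op("mul", Cipher("c"), Plain("p")))
```

Under these assumptions, `mult_depth(example)` evaluates to 2, since the product with the plaintext `p` does not extend the chain.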

Floating Point:

Due to the nature of the homomorphic encryption schemes, it is not easy to support floating point operations. Instead, we can use fixed point arithmetic by combining integers with a scaling factor. To avoid accumulating scaling factors, fixed point arithmetic requires scaling the result down after each multiplication. However, the BFV/BGV schemes do not support division, and hence applications using these schemes will see an accumulation of scaling factors. This requires increasing space to store the intermediate results, and the required encryption parameters become exponential in the multiplicative depth of the circuit, which makes deep circuit evaluations infeasible. Section 2.2 gives background on an approximate encryption scheme that avoids this problem.
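The accumulation of scaling factors can be seen in a small plaintext-only Python sketch. The helper names (`encode`, `mul_no_rescale`, `rescale`, `scale_bits_after_squarings`) and the 10-bit scale are assumptions made for illustration; no encryption is involved, only the integer bookkeeping that a fixed-point encoding implies.

```python
SCALE_BITS = 10          # illustrative choice of fixed-point precision
SCALE = 1 << SCALE_BITS  # scaling factor 2**10


def encode(x: float) -> int:
    """Represent x as an integer with an implicit scaling factor SCALE."""
    return round(x * SCALE)


def mul_no_rescale(a: int, a_scale: int, b: int, b_scale: int):
    """Multiply two fixed-point values; the scaling factors multiply too."""
    return a * b, a_scale * b_scale


def rescale(v: int, v_scale: int, divisor: int):
    """Divide value and scale by divisor (rounding to nearest)."""
    return (v + divisor // 2) // divisor, v_scale // divisor


def scale_bits_after_squarings(depth: int, rescale_each: bool) -> int:
    """Bits of scaling factor after squaring a value `depth` times."""
    value, scale = encode(1.0), SCALE
    for _ in range(depth):
        value, scale = mul_no_rescale(value, scale, value, scale)
        if rescale_each:
            value, scale = rescale(value, scale, SCALE)
    return scale.bit_length() - 1


product, product_scale = mul_no_rescale(encode(1.5), SCALE, encode(2.0), SCALE)
rescaled, rescaled_scale = rescale(product, product_scale, SCALE)
```

Without rescaling, three successive squarings leave a scaling factor of 80 bits (the bit count doubles at every level), whereas rescaling after each multiplication keeps it at 10 bits. This is the difference between schemes without a division operation and a scheme that can scale down.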

FHE Vectorization:

FHE multiplication and addition are extremely costly and usually not practical if one pair of operands is considered at a time. Fortunately, in the BFV [20], BGV [10] and HEAAN schemes, multiple plaintext values can be packed into a single ciphertext to benefit from Single Instruction Multiple Data (SIMD) parallelism [21]. Given two ciphertexts that encode two plaintext vectors, their homomorphic product or sum corresponds to the elementwise multiplication or addition of the underlying vectors. Similar to shuffling instructions in the SIMD units of modern processors, the elements of a vector can be rotated in a packed ciphertext. However, random access is not directly supported. Instead, random access can be implemented by multiplying with a plaintext mask, followed by a rotation. This unfortunately introduces additional multiplicative depth and should thus be avoided if possible.
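The mask-and-rotate pattern for random access can be sketched on plain Python lists standing in for packed ciphertexts. The function names are hypothetical and nothing is encrypted here; the point is only which slot-wise operations are needed.

```python
from typing import List


def rot_left(v: List[int], k: int) -> List[int]:
    """Cyclically rotate the slots of v to the left by k positions."""
    k %= len(v)
    return v[k:] + v[:k]


def mul(v: List[int], w: List[int]) -> List[int]:
    """Slot-wise (SIMD) multiplication."""
    return [a * b for a, b in zip(v, w)]


def add(v: List[int], w: List[int]) -> List[int]:
    """Slot-wise (SIMD) addition."""
    return [a + b for a, b in zip(v, w)]


def extract_to_slot0(v: List[int], i: int) -> List[int]:
    """Move element i into slot 0 and zero out every other slot.

    The multiplication by the 0/1 mask is a plaintext multiplication,
    and the rotation then brings the selected slot to the front.
    """
    mask = [1 if j == i else 0 for j in range(len(v))]
    return rot_left(mul(v, mask), i)
```

For example, `extract_to_slot0([10, 20, 30, 40], 2)` multiplies by the mask `[0, 0, 1, 0]` and rotates left by two slots, giving `[30, 0, 0, 0]`.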


True Fully Homomorphic Encryption starts with a Leveled FHE scheme and introduces an additional noise management technique traditionally referred to as bootstrapping. Bootstrapping reduces noise, and when performed frequently enough it allows, in principle, an unlimited number of operations to be performed. A recent work by Cheon et al. [15] describes a bootstrapping algorithm for the HEAAN [16] scheme, and the best results take about 1 second to bootstrap a single number. Therefore, we leave using bootstrapping for future work once it is more practical.

2.2 HEAAN

There are a few state-of-the-art leveled FHE schemes, for example the Brakerski-Gentry-Vaikuntanathan scheme [10] and the Fan-Vercauteren variant of the Brakerski scheme [20], which we will refer to as BGV and BFV. These two schemes are similar in many aspects. In particular, both schemes support a vector plaintext type over some finite field, and vectorized additions/multiplications. More recently, Cheon et al. proposed the HEAAN scheme [16], which supports better approximate computation over encrypted data.

The HEAAN scheme uses a novel idea to mix the encryption noise with the message. Then, one can use the modulus switching techniques for the BGV scheme to achieve the functionality of "scaling down", i.e., given an encryption of a message and a public integer, output an encryption of the message divided by that integer, up to an error of fixed size. This gives the ability to control the scaling factor and hence the precision throughout the homomorphic circuit evaluation, and allows encryption parameters that are linear in multiplicative depth. However, the HEAAN scheme only supports approximate computations: first, even a freshly encrypted ciphertext decrypts to a perturbed message; also, homomorphic operations insert noise into the result. Therefore, one needs to carefully control this noise in order to maintain a desired precision of the computation output.

In the HEAAN scheme, two operations require additional public key materials. After each multiplication between ciphertexts, the output ciphertext is actually larger than the two input ciphertexts, and it’s common to run an operation called re-linearization to reduce the size of output. Both re-linearization and rotation require certain evaluation keys. These keys can be generated from the secret key, and each rotation key can perform rotation by a fixed amount. In the public implementation of HEAAN, by default only power-of-2 rotation keys are generated, and general rotations are written as a composition of power-of-2 rotations. This presents an interesting problem of choosing the best rotation keys for a given program, which balances evaluation time with storage/communication costs of additional rotation keys.
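The trade-off between the number of rotation keys and the number of rotations can be sketched as follows. The functions `decompose_rotation` and `rotation_plan` are hypothetical helpers for illustration, not part of the HEAAN library; the sketch assumes that the slot count is a power of two and that a key exists for every power-of-2 rotation amount below it.

```python
from typing import List, Set


def decompose_rotation(k: int, slots: int) -> List[int]:
    """Write a left rotation by k as a sequence of power-of-2 rotations.

    Each set bit in the binary representation of k (mod slots)
    contributes one rotation that needs one power-of-2 key.
    """
    k %= slots
    steps = []
    bit = 1
    while k:
        if k & 1:
            steps.append(bit)
        k >>= 1
        bit <<= 1
    return steps


def rotation_plan(k: int, slots: int, extra_keys: Set[int]) -> List[int]:
    """Use a dedicated key for k if one was generated, else compose."""
    k %= slots
    if k in extra_keys:
        return [k]
    return decompose_rotation(k, slots)
```

With only the default keys, a rotation by 7 in a vector of 8 slots costs three rotations (`[1, 2, 4]`); generating one extra key for the amount 7 reduces this to a single rotation at the cost of storing and communicating that key.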

To instantiate the HEAAN scheme, we need to select two integer parameters: $N$ (the degree of the polynomial modulus, a power of 2) and $Q$ (the ciphertext modulus). The security of the scheme is based on the ring learning-with-errors problem, and for a fixed $Q$, the security level increases with $N$. One additional complication is that the noise generated by each homomorphic operation increases with $N$, which could increase the required precision and thus $Q$, possibly requiring a larger $N$ than the initial guess.

The number of elements that can be packed into a single ciphertext scales linearly with the parameter $N$. In particular, HEAAN supports encrypting $N/2$ complex numbers in a single ciphertext. We note that the vector encryption in HEAAN comes with an additional encoding error, i.e., even a fresh encoding of a vector will lose some precision of the input due to a rounding process. Hence we need to scale up the input numbers to make sure they are sufficiently large compared to this error.

2.3 Tensor Programs

A tensor is a multidimensional array with regular dimensions. A tensor has a data type, which defines the representation of each element, and a shape, which is a list of the tensor's dimensions. For example, a single 32 by 32 image with 3 channels of color values between 0 and 255 could be represented by a tensor whose data type is an 8-bit unsigned integer and whose shape lists the dimensions 32, 32, and 3.

In machine learning, neural networks are commonly expressed as programs operating on tensors. Some common tensor operations in neural networks include:


  • Convolution calculates a cross correlation between an input and a filter tensor.

  • Matrix multiplication is used to represent neurons that are connected to all inputs, which are found in dense layers typically at the end of a convolutional neural network.

  • Pooling combines adjacent elements using a reduction operation such as maximum or average.

  • Elementwise operations such as ReLUs and batch normalization.

  • Reshaping reinterprets the shape of a tensor, for example, to flatten preceding a matrix multiplication.

Tensor programs are used as models for large datasets in regression and classification tasks. To define a space of models some constant tensors in the program are designated as trainable weights. Since operations are differentiable (almost everywhere), backpropagation can be used to update weights. This is used as a step in an optimization algorithm, e.g., stochastic gradient descent.

Tensor programs have attractive properties for execution on homomorphic encryption: typically no branching, very regular data access patterns making vectorization possible. We consider the tensor program as a circuit of tensor operations and this circuit is a Directed Acyclic Graph (DAG).

3 Software Stack for Homomorphic Evaluation of Tensor Programs

Figure 1: Overview of the CHET system at compile-time.
Figure 2: Overview of the CHET system at runtime.

There are many open problems in building a full software stack for FHE applications, and the systems, OS, and PL communities can help bridge the gap between an application programmer and the underlying crypto library that implements secure primitives. This section presents an overview of an end-to-end software stack for evaluating tensor programs on homomorphically encrypted data. The target architecture for this stack is a Fully Homomorphic Encryption (FHE) scheme, and our tensor compiler, CHET, interacts with this scheme using the Homomorphic Instruction Set Architecture (HISA) defined in Section 4. At the top of the stack rests the user-provided tensor program or circuit. CHET consists of a compiler and a runtime that bridge the gap between the two. We describe the runtime and the compiler in detail in Sections 5 and 6, respectively.

Figure 1 shows the overview of the CHET compiler. In addition to the tensor circuit, CHET requires the schema of the input and weights to the circuit. The schema specifies the dimensions of the tensors as well as the floating-point precision required of the values in those tensors. CHET also requires the desired floating-point precision of the output of the circuit. Using these constraints, CHET generates an equivalent, optimized homomorphic tensor circuit as well as an encryptor and decryptor. Both of these executables encode the choices made by the compiler to make the homomorphic computation efficient.

To evaluate the tensor circuit on an image, the client first generates a private key and encrypts the image using the encryptor (which can also generate private keys) provided by the compiler, as shown in Figure 2. The encrypted image is then sent to the server along with unencrypted weights and public keys required for evaluating homomorphic operations (i.e., multiplication and rotation). The server executes the optimized homomorphic tensor circuit generated by the CHET compiler. The homomorphic tensor operations in the circuit are executed using the CHET runtime, which uses an underlying FHE scheme to execute homomorphic computations on encrypted data. The circuit produces an encrypted prediction, which it then sends to the client. The client decrypts the encrypted prediction with its private keys using the compiler generated decryptor. In this way, the client runs tensor programs like neural networks on the server without the server being privy to the data, the output (prediction), or any intermediate state.

While Figure 2 presents the flow in CHET for homomorphically evaluating a tensor program on a single image, CHET supports evaluating the program on multiple images simultaneously, which is known as batching in image inference. Batching increases the throughput of image inference. In contrast, CHET's focus is decreasing image inference latency. In the rest of this paper, we consider a batch size of 1, although CHET trivially supports larger batch sizes.

The FHE scheme exposes several policies or heuristics such as the encryption parameters. Similarly, the CHET runtime also exposes several policies that could be specific to or independent of the FHE scheme. This is analogous to Intel MKL libraries having different implementations of the same operation, where the most performant implementation depends on the size of the input or the target architecture. In the case of homomorphic computation, these policies not only affect performance but can also affect security and accuracy. A key design principle of CHET is the separation of concerns between the policies of choosing the secure, accurate, and most efficient homomorphic operation and the mechanisms of executing those policies. Some of these policies are independent of the input or weights schema, so they are entrusted to the CHET runtime. On the other hand, most policies may require either the input or weights schema or global analysis of the program. In such cases, the CHET compiler chooses the appropriate policy and the CHET runtime implements the mechanisms for that policy. We describe the policies explored and mechanisms exposed by the CHET runtime (Section 5) and then describe how the CHET compiler (Section 6) chooses the appropriate policy.

4 Homomorphic Instruction Set Architecture

This section introduces a design for a Homomorphic Instruction Set Architecture (HISA), that provides a useful interface to fully homomorphic encryption libraries. We have designed the HISA with the following objectives in mind:

  • Abstract details of HE schemes, such as use of evaluation keys and management of moduli.

  • Provide a common interface to shared functionality in FHE libraries, while exposing unique features.

  • Not to include the complexity of plaintext evaluation. Instead the HISA is intended to be embedded into a host language.

The HISA is split into multiple profiles and all libraries are expected to implement at least the Encryption profile. Each FHE library implementing the HISA provides two types, pt for plaintexts and ct for ciphertexts, and additional ones depending on the profiles implemented by the library.

Semantics | Profile
Encrypt a plaintext into a ciphertext. | Encryption
Decrypt a ciphertext into a plaintext. | Encryption
Make a copy of a ciphertext. | Encryption
Free any resources associated with a handle. | Encryption
Encode a vector of integers into a plaintext. | Integers
Decode a plaintext into a vector of integers. | Integers
Rotate a ciphertext left by a given number of slots. | Integers
Rotate a ciphertext right by a given number of slots. | Integers
Add a ciphertext, plaintext, or scalar to a ciphertext. | Integers
Subtract a ciphertext, plaintext, or scalar from a ciphertext. | Integers
Multiply a ciphertext with a ciphertext, plaintext, or scalar. | Integers
Divide a ciphertext by a scalar. Undefined unless the scalar is a valid divisor. | Division
Give the largest valid divisor that does not exceed a given bound. | Division
Ciphertext-ciphertext multiplication without re-linearization. | Relin
No-op semantically; the FHE library performs re-linearization. | Relin
No-op semantically; the FHE library performs bootstrapping. | Bootstrap
Figure 3: Instructions of the HISA

Figure 3 presents instructions available in the HISA. The Encryption profile provides core functionality for encryption, decryption, and memory management. Usage of encryption and evaluation keys is left as a responsibility of FHE libraries and is not exposed in the HISA.

The Integers profile provides operations for encoding and computing on integers. The profile covers both FHE schemes with batched encodings and ones without by having a configurable number of slots, which is provided as an additional parameter during initialization of the FHE library.
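To make the shape of such an interface concrete, the following Python sketch models the Encryption and Integers profiles with a backend that performs no actual encryption. The class `MockHisa`, the types `Pt` and `Ct`, and all method names are hypothetical choices for this illustration and are not the instruction names or signatures of the HISA itself.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple, Union


@dataclass(frozen=True)
class Pt:
    values: Tuple[int, ...]  # encoded plaintext slots


@dataclass(frozen=True)
class Ct:
    values: Tuple[int, ...]  # "ciphertext" slots (not actually encrypted)


Operand = Union[Ct, Pt, int]


class MockHisa:
    """Unencrypted stand-in: same operations, no security."""

    def __init__(self, slots: int):
        self.slots = slots  # number of slots, fixed at initialization

    # Encryption profile
    def encrypt(self, p: Pt) -> Ct:
        return Ct(p.values)

    def decrypt(self, c: Ct) -> Pt:
        return Pt(c.values)

    # Integers profile
    def encode(self, m: Sequence[int]) -> Pt:
        padded = list(m) + [0] * (self.slots - len(m))
        return Pt(tuple(padded))

    def decode(self, p: Pt) -> List[int]:
        return list(p.values)

    def rot_left(self, c: Ct, k: int) -> Ct:
        k %= self.slots
        return Ct(c.values[k:] + c.values[:k])

    def _lift(self, x: Operand) -> Tuple[int, ...]:
        # A scalar is broadcast to every slot.
        return x.values if isinstance(x, (Ct, Pt)) else (x,) * self.slots

    def add(self, c: Ct, x: Operand) -> Ct:
        return Ct(tuple(a + b for a, b in zip(c.values, self._lift(x))))

    def mul(self, c: Ct, x: Operand) -> Ct:
        return Ct(tuple(a * b for a, b in zip(c.values, self._lift(x))))


hisa = MockHisa(slots=4)
c = hisa.encrypt(hisa.encode([1, 2, 3]))
result = hisa.decode(hisa.decrypt(hisa.mul(hisa.add(c, 1), c)))
```

Here `result` holds the slot-wise values of (x + 1) * x, namely `[2, 6, 12, 0]`. A backend of this kind is useful mainly for testing a compiler or runtime against the interface before switching to a real FHE library.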

The Division profile exposes extra functionality provided by the HEAAN family of encryption schemes. For libraries implementing the base HEAAN scheme [16], the maximum-divisor instruction returns the largest power of two that does not exceed the requested bound and is compatible with the modulus of the ciphertext. For a variation of HEAAN based on residue number systems (RNS) [7], on the other hand, it returns the largest coprime modulus of the ciphertext that is less than the requested bound, or 1 if there is none. This way the Division profile cleanly exposes the division functionality of both variants of the HEAAN encryption scheme.
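The two behaviours of the divisor query can be sketched in Python. Both function names are hypothetical, and the exact constraints are assumptions: for the base scheme the sketch takes "largest power of two no greater than the bound and no greater than the ciphertext modulus", and for the RNS variant it takes the ciphertext's remaining coprime moduli as an explicit list.

```python
from typing import Sequence


def max_divisor_pow2(bound: int, modulus: int) -> int:
    """Base-HEAAN-style query (assumed form): largest power of two
    that is at most `bound` and at most the ciphertext modulus."""
    limit = min(bound, modulus)
    if limit < 1:
        return 1
    return 1 << (limit.bit_length() - 1)


def max_divisor_rns(bound: int, coprime_moduli: Sequence[int]) -> int:
    """RNS-style query: largest remaining coprime modulus strictly
    less than `bound`, or 1 if there is none."""
    candidates = [q for q in coprime_moduli if q < bound]
    return max(candidates, default=1)
```

For instance, a caller that would like to divide by roughly 1000 obtains 512 from `max_divisor_pow2(1000, 2**40)`, and 769 from `max_divisor_rns(1000, [257, 769, 12289])`. In both cases the program asks for an approximate divisor and the library answers with one it can actually apply.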

The Relin profile provides the capability to separate multiplication from re-linearization. While re-linearization is semantically a no-op, it is useful to expose relinearize calls to the compiler, as their proper placement is a highly non-trivial (in fact NP-complete) problem [14].

Finally, the Bootstrap profile exposes bootstrapping in FHE libraries that support it. While bootstrapping is semantically a no-op, it must be exposed in the HISA because selection of encryption parameters depends on when bootstrapping is performed. Furthermore, the optimal placement of bootstrapping operations [8] depends on the program being evaluated: "wide" programs with a lot of concurrency should bootstrap at lower depths than "narrow" programs that are mostly sequential.

The HISA may be implemented either with precise or approximate semantics. With approximate semantics all operations that return a pt or ct may introduce an error term. It is expected that this error is from some random distribution which may be bounded given a confidence level. However, the HISA does not offer instructions for estimating the error, as we expect these estimates to be required during selection of fixed point precisions and encryption parameters. Instead we recommend that approximate FHE libraries provide implementations of the HISA with no actual encryption and either: a way to query safe estimates of errors accumulated in each pt and ct, or sampling of error from the same distribution as encrypted evaluation would produce. Both approaches may be used by a compiler for selecting fixed point precisions. The sampling approach is more flexible for applications where it is hard to quantify acceptable bounds for error, such as neural network classification, where only maintaining the order of results matters.

Initialization of FHE libraries is a concern that falls outside of the HISA, and is specific to the encryption and encoding scheme used. This step will include at least generating or importing encryption and evaluation keys. FHE libraries implementing leveled encryption schemes also require selecting encryption parameters that are large enough to allow the computation to succeed. Such FHE libraries should provide analysis passes as additional implementations of the HISA to help with parameter selection. We have implemented such a pass for the HEAAN library [16], which we present in Section 6.

HISA is not an IR:

The HISA does not, and is not meant to, provide a unified interface to FHE libraries. To draw a parallel to processors implementing the x86 instruction set, if a program includes instructions from Advanced Vector Extensions (AVX) then that program will not run on any processor bought before 2011. Conversely, running a program that does not use AVX on a processor that supports it may leave a lot of potential performance unused. Similarly, a program using the Division HISA profile will only run on FHE libraries implementing a HEAAN-style encryption scheme [16] and, conversely, a program using HEAAN for fixed point arithmetic without ever calling divScalar would likely achieve much better performance (through smaller encryption parameters) if it did. The role of a unified interface should instead be played by an intermediate representation (IR). For the HISA, this IR would include fixed point datatypes and specifications of required precisions. This kind of IR could then be lowered to target FHE libraries that support different subsets of profiles. This is similar to how, for example, the LLVM intermediate representation can be lowered both to processors that support AVX and to ones that do not.

5 Runtime for Executing Homomorphic Tensor Operations

Intel MKL libraries provide efficient implementations of BLAS operations. In the same way, we design the CHET runtime to provide efficient implementations for homomorphic tensor primitives. While the interfaces for BLAS and tensor operations are well-defined, the interface for a tensor operation on encrypted data is not apparent because the encrypted data is an encryption of a vector, whereas the tensor operation is on higher-dimensional tensors. The types need to be reconciled to define a clean interface. In the rest of this section, we consider a 4-dimensional tensor (batch, channel, height, and width dimensions) to illustrate this, but the same concepts apply to other higher-dimensional tensors. We first describe our cipher tensor datatype. Each tensor operation on unencrypted tensors corresponds to an equivalent homomorphic tensor operation on cipher tensor(s) and plain tensor(s), if any. We then briefly describe implementations of some homomorphic tensor operations using this clean interface and datatype.

5.1 Cipher Tensor Datatype

A naive way to encode a 4-d tensor as a vector is to lay out all elements in the tensor contiguously in the vector and encrypt it. This approach may not be feasible for a few reasons: (i) the vector size is determined by the encryption parameters (like N in HEAAN) and the 4-d tensor might not fit in the vector, or (ii) tensor operations like convolution and matrix multiplication on this input vector (or to produce such an output vector) become complicated and very inefficient because only point-wise operations or rotations are supported by the HISA. Another option is to encrypt each inner dimension (width) element in a separate cipher. This creates a vector of vectors, where the inner vector is a cipher and the outer vector contiguously stores pointers to these ciphers. The number of homomorphic operations would increase significantly because a cipher operation may be required for each element, thereby not utilizing the vector width and making it very inefficient.

The problem of encoding a 4-d tensor as a vector of vectors is similar to tiling or blocking the data but with constraints that differ from those of locality. For example, a 4-d tensor can be blocked as a 2-d tensor of 2-d tensors. This corresponds to the 4-d tensor being laid out as a vector of vectors, where each inner vector has the inner two dimensions (height and width) encrypted. Issues similar to those stated earlier may arise if a particular data layout is fixed. Some tensor operations might be more efficient in some layouts than in others. More importantly, the most efficient layout may depend on the schema of the tensor or on future tensor operations, so determining it is best left to the compiler. Thus, we want a uniform cipher tensor datatype that is parametric to different data layouts, so that the CHET compiler can choose the appropriate data layout.

Some tensor operations enforce more constraints on the cipher tensor datatype, primarily for performance reasons. For example, to perform convolution with same padding, the input 4-d tensor is expected to be padded. Such a padding on unencrypted data is trivial, but adding such padding to an encrypted cipher on-the-fly involves several rotation and point-wise multiplication operations. Similarly, a reshape of the tensor is trivial on unencrypted data but very inefficient on encrypted data. All these constraints arise because using only point-wise operations or rotations on the vector may be highly inefficient to perform certain tensor operations. Therefore, the cipher tensor datatype needs to allow padding or logical re-shaping without changing the elements of the tensor.

To satisfy these requirements, we define a cipher tensor datatype, which we term CipherTensor, as a vector of ciphertexts (vectors) with associated metadata that captures the way to interpret the vector of ciphertexts as the corresponding unencrypted tensor. The metadata includes: (i) physical dimensions of the (outer) vector and those of the (inner) ciphertext, (ii) logical dimensions of the equivalent unencrypted tensor, and (iii) physical strides for each dimension of the (inner) ciphertext.

The metadata is stored as plain integers and can be modified easily. Nevertheless, the metadata does not leak any information about the data because it is agnostic to the values of the tensor and is solely reliant on the schema or dimensions of the tensor. The metadata can be used to satisfy all the required constraints of the datatype:

  • The metadata are parameters that the compiler can choose to instantiate and use a specific data layout. For example, blocking or tiling the inner dimension (height and width) of a 4-d tensor corresponds to choosing 2 dimensions for the (outer) vector and 2 dimensions for the (inner) ciphertext, where the logical dimensions match the physical dimensions. We term this as HW-tiling. Another option is to block the channel dimension too, such that multiple, but not all, channels may be in a cipher. This corresponds to choosing 2 dimensions for the (outer) vector and 3 dimensions for the (inner) ciphertext with only 4 logical dimensions. We term this as CHW-tiling.

  • The physical strides of each dimension can be specified to include sufficient padding in-between elements of that dimension. For example, for an image of height (row) and width (column) of 28, a stride of 1 for the width dimension and a stride of 30 for the height dimension allows a padding of 2 (zero or invalid) elements between the rows.

  • Reshaping the cipher tensor only involves updating the metadata to change the logical dimensions and does not perform any homomorphic operations.
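The metadata described above can be sketched as a small record of plain integers. This is only an illustration under assumptions: the class name CipherTensorMeta, its field names, and the slot and reshape helpers are hypothetical and are not part of the CHET runtime. The example instantiates the HW-tiling of a 28x28 image with a height stride of 30, as in the padding bullet above.

```python
from dataclasses import dataclass, replace
from math import prod
from typing import Tuple


@dataclass(frozen=True)
class CipherTensorMeta:
    """Plain-integer metadata; it holds no tensor values (hypothetical names)."""
    outer_dims: Tuple[int, ...]    # physical dims of the outer vector, e.g. (batch, channel)
    inner_dims: Tuple[int, ...]    # physical dims inside one ciphertext, e.g. (height, width)
    logical_dims: Tuple[int, ...]  # dims of the equivalent unencrypted tensor
    strides: Tuple[int, ...]       # physical stride for each inner dimension

    def slot(self, *idx: int) -> int:
        """Slot inside a ciphertext that holds the element at an inner index."""
        return sum(i * s for i, s in zip(idx, self.strides))

    def reshape(self, new_logical: Tuple[int, ...]) -> "CipherTensorMeta":
        """Logical reshape: only the metadata changes, no homomorphic operation."""
        if prod(new_logical) != prod(self.logical_dims):
            raise ValueError("reshape must preserve the number of elements")
        return replace(self, logical_dims=tuple(new_logical))


# HW-tiling of a 1x1x28x28 tensor with 2 padding slots between rows.
meta = CipherTensorMeta(outer_dims=(1, 1), inner_dims=(28, 28),
                        logical_dims=(1, 1, 28, 28), strides=(30, 1))
flat = meta.reshape((1, 784))  # physical layout and strides are untouched
```

Because the record contains only dimensions and strides, it depends on the schema of the tensor and not on its values.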

Input : CipherTensor input
Input : PlainTensor filter
Output : CipherTensor output
foreach [b, oc] in output.getOuterDims() do
    output.ciphers[b, oc] = zeroCipher
    foreach ic in input.getChannelDims() do
        hStride, wStride = input.getCipherStrides()
        foreach [fh, fw] in filter.getHeightWidth() do
            weight = filter[fh, fw, ic, oc]
            weightFP = FixedPrecision(weight, plainLogP)
            rotate = fh * hStride + fw * wStride
            temp = input.ciphers[b, ic]
            FHE.rotLeftAssign(temp, rotate)
            FHE.mulScalarAssign(temp, weightFP)
            FHE.addAssign(output.ciphers[b, oc], temp)
        end foreach
    end foreach
    FHE.divScalarAssign(output.ciphers[b, oc], plainLogP)
end foreach
Algorithm 1 HW-Tiled Homomorphic Convolution 2-d of a 4-d tensor (with valid padding)

5.2 Homomorphic Tensor Operations:

The interface for a homomorphic tensor operation is mostly similar to that of its unencrypted tensor operation counterpart, except that the types are replaced with cipher tensors or plain tensors, which we call CipherTensor and PlainTensor, respectively. The homomorphic tensor operation also exposes the output CipherTensor in the interface as a pass-by-reference parameter. This enables the compiler to specify the data layout of both the input and output CipherTensors. In addition, the interface exposes parameters to specify the scaling factors to use for the input CipherTensor(s) and PlainTensor(s), if any. When a compiler specifies all these parameters, it chooses an implementation specialized for this specification. There might be multiple implementations for the same specification that are tuned for specific FHE libraries (for those that have divScalar and for those that do not). In such cases, the compiler also chooses the implementation to use. Nevertheless, there are several algorithmic choices for a given specification that have a significant impact on the performance of the operation. These could be quite involved (similar to MKL implementations). We explore some of these briefly next.

HW-tiled Homomorphic Convolution 2-d with VALID padding:

Consider a HW-tiled CipherTensor that represents a 4-d tensor. Algorithm 1 shows pseudocode for a 2-d convolution with valid padding of this CipherTensor into another HW-tiled CipherTensor. A ciphertext that encrypts zeros (of the size of the ciphertext) is assumed to be available. In convolution, the first element in a HW cipher needs to be added to some of its neighboring elements, and before the addition, each of these elements needs to be multiplied by a different filter weight. To do so, we can left-rotate each of these neighboring elements to the same position as the first element. Rotating the ciphertext vector in such a way moves the position appropriately for all HW elements. For each rotated vector, all HW elements need to be multiplied by the same weight, so mulScalar on the ciphertext is sufficient. Before multiplication, the filter weight needs to be scaled using the compiler-provided scaling factor (plainLogP). These ciphertext vectors can then be added across the different input channels to produce a ciphertext for an output channel. The rotations of the input ciphertexts are invariant to the output batch and channel, so they can be hoisted out of the loops in the implementation to reduce the number of rotations (omitted in the pseudocode for the sake of exposition).
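The rotate, scale, and accumulate structure of Algorithm 1 can be checked on unencrypted data. The following is a minimal sketch, assuming that plain Python lists stand in for ciphertexts, that a single batch element is processed, and that the fixed-precision scaling (FixedPrecision and divScalar) is left out. The function names rot_left and conv2d_hw_tiled are hypothetical.

```python
def rot_left(vec, k):
    """Cyclic left rotation of a slot vector (stand-in for FHE.rotLeft)."""
    k %= len(vec)
    return vec[k:] + vec[:k]


def conv2d_hw_tiled(ciphers, filt, h_stride, w_stride):
    """Valid-padding 2-d convolution on HW-tiled data.

    ciphers[ic] is the flat slot vector of input channel ic, and
    filt[fh][fw][ic][oc] is a plain weight. Returns one slot vector
    per output channel.
    """
    n_ic = len(ciphers)
    n_oc = len(filt[0][0][0])
    slots = len(ciphers[0])
    out = []
    for oc in range(n_oc):
        acc = [0.0] * slots  # plays the role of zeroCipher
        for ic in range(n_ic):
            for fh, row in enumerate(filt):
                for fw, per_ic in enumerate(row):
                    weight = per_ic[ic][oc]
                    # Bring the (fh, fw) neighbor of every element to its slot.
                    temp = rot_left(ciphers[ic], fh * h_stride + fw * w_stride)
                    # One scalar multiply and one add per filter position.
                    acc = [a + weight * t for a, t in zip(acc, temp)]
        out.append(acc)
    return out


# 3x3 image stored row-major in 9 slots, convolved with a 2x2 filter of ones.
image = [float(v) for v in range(1, 10)]
ones = [[[[1.0]], [[1.0]]], [[[1.0]], [[1.0]]]]
result = conv2d_hw_tiled([image], ones, h_stride=3, w_stride=1)
```

Only the slots of the 2x2 valid output region (slots 0, 1, 3, and 4 here) hold meaningful sums. The remaining slots contain wrapped-around values, which is why the metadata has to record which elements are valid.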

HW-tiled Homomorphic Convolution 2-d with SAME padding:

If a similar 2-d convolution has to be implemented, but with same padding, then each HW ciphertext is expected to be padded by the compiler so that the amount to rotate (left or right) varies but the number of rotations remains the same. However, after one convolution, there are invalid elements where the padded zeros existed earlier because those are added to the neighboring elements as well. The runtime keeps track of this using additional metadata on the CipherTensor. The next time a convolution (or some other operation) is called on the CipherTensor, if the CipherTensor contains invalid elements where zeros are expected to be present, then the implementation can mask out all invalid elements with one mulPlain operation (the plaintext vector contains 1 where valid elements exist and 0 otherwise). This not only increases the time for the convolution operation, but it also increases the modulus Q required because divScalar may need to be called after such a masking operation. For security reasons, a larger Q can increase the N that needs to be used during encryption, thereby increasing the cost of all homomorphic operations.

CHW-tiled Homomorphic Convolution 2-d:

Let us now consider a CHW-tiled CipherTensor that represents a 4-d tensor. To perform a 2-d convolution on this, even with VALID padding, mulPlain is required because different weights need to be multiplied with different channels that are all in the same cipher. In some FHE libraries like HEAAN, the asymptotic complexity of mulPlain is much higher than that of mulScalar, thereby increasing the cost of the convolution operation. Moreover, after multiplication, the multiple input channels in the ciphertext need to be summed up into a single output channel. Such a reduction can be done by rotating every channel to the position of the first one in the ciphertext and adding them up one by one. If $C$ is the number of channels in the cipher, this involves $C - 1$ rotations. However, such a reduction can be done more efficiently by exploiting the fact that the stride between the input channels is the same. This requires at most $\lceil \log_2 C \rceil$ rotations with additions in-between rotations, similar to vectorized reduction on unencrypted data. To produce an output ciphertext with multiple channels $C$, the input ciphertext needs to be replicated $C - 1$ times. Instead of $C - 1$ rotations serially, this can also be done in $\lceil \log_2 C \rceil$ rotations, by adding the ciphertexts in-between.
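The logarithmic channel reduction can be illustrated on an unencrypted slot vector. This is a sketch under assumptions: the number of channels is a power of two, the channels are laid out contiguously with a uniform stride, and the helper names rot_left and reduce_channels are hypothetical.

```python
def rot_left(vec, k):
    """Cyclic left rotation of a slot vector (stand-in for FHE.rotLeft)."""
    k %= len(vec)
    return vec[k:] + vec[:k]


def reduce_channels(cipher, num_channels, channel_stride):
    """Sum all channels into the slots of the first channel.

    Uses log2(num_channels) rotations, each followed by one addition,
    instead of num_channels - 1 rotations. Assumes a power-of-two count.
    """
    if num_channels < 1 or num_channels & (num_channels - 1):
        raise ValueError("this sketch assumes a power-of-two channel count")
    acc = list(cipher)
    rotations = 0
    step = num_channels // 2
    while step >= 1:
        # Fold the upper half of the remaining channels onto the lower half.
        rotated = rot_left(acc, step * channel_stride)
        acc = [a + r for a, r in zip(acc, rotated)]
        rotations += 1
        step //= 2
    return acc, rotations


# Four channels of two elements each, packed into one 8-slot vector.
packed = [1, 2, 10, 20, 100, 200, 1000, 2000]
summed, used = reduce_channels(packed, num_channels=4, channel_stride=2)
```

After the reduction, every channel position holds the full sum, so the same rotate-and-add pattern also replicates a value across channel positions.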

Homomorphic matmul:

Different FHE operations have different latencies. Although rotLeft and mulPlain have similar algorithmic complexity in HEAAN, the constants may vary, and we observe that mulPlain is more expensive than rotLeft. Due to this, it may be worth trading multiplications for rotations. This trade-off is most evident in the homomorphic matmul operation. The number of mulPlain operations required decreases in proportion to the number of replicas of the data that we can add in the same cipher. Adding replicas increases the number of rotations but decreases the number of multiplications. This yields much more benefit because replicas can be added in a logarithmic number of rotations, whereas the number of multiplications saved is linear.

6 Compiler For Transforming Homomorphic Tensor Operations

Figure 4: Overview of an analysis and transformation pass in the CHET compiler.

In this section, we describe the CHET compiler, which is a critical component of the end-to-end software stack to homomorphically evaluate tensor programs. The compiler is responsible for generating an optimized homomorphic tensor circuit that not only produces accurate results but also guarantees security of the data being computed. We describe the analysis and transformation framework used by the CHET compiler and then use this framework to describe a few transformations that are required for accuracy and security, and a few optimizations that improve performance significantly. We present the analysis and transformations specifically for HEAAN, but they can be extended to other FHE libraries trivially.

6.1 Analysis and transformation framework

In contrast to most traditional optimizing compilers, the input tensor circuit has two key properties that the CHET compiler can exploit: the data flow graph is a Directed Acyclic Graph (DAG) and the tensor dimensions are known at compile-time from the schema provided by the user (similar to High Performance Fortran compilers). The CHET compiler must also analyze multiple data flow graphs corresponding to different implementations. Fortunately, the design space is not that large and we can explore it exhaustively, one at a time.

The CHET compiler only needs to determine the policy to execute homomorphic computation while the CHET runtime handles the mechanisms of efficient execution. This separation of concerns simplifies the code generation tasks of the compiler but it could complicate the analysis required. The compiler generates high-level homomorphic tensor operations but needs to analyse the low-level HISA instructions executed. To resolve this, we exploit the CHET runtime directly to perform the analysis.

Figure 4 shows the flow of a single analysis and transformation pass in the compiler. The transformer is a simple tool that specifies the parameters to use in the homomorphic tensor circuit (the input tensor circuit is directly mapped to an equivalent homomorphic tensor circuit with parameters left unspecified). This transformed homomorphic tensor circuit can be symbolically executed using the CHET runtime. This would execute the same HISA instructions that would be executed at runtime because the tensor dimensions are known. Instead of executing these HISA instructions using the FHE scheme, the HISA instructions invoke the symbolic analyser. This tracks the data that flows through the circuit. Repeated iteration to a fix-point is not needed since the circuit is a DAG. The DAG is unrolled on-the-fly to dynamically track the data flow, without having to construct an explicit DAG. Different analysers can track different information flows. The analyser returns its results as the output of the symbolic execution. The transformer then uses these results to instantiate a new, transformed specification of the homomorphic tensor circuit.

6.2 Parameter Selection

To choose encryption parameters such that the output is accurate, the compiler needs to analyse the depth of the circuit. In HEAAN, divScalar consumes modulus Q from the ciphertext it is called on. The output should have sufficient modulus Q left to capture accurate results. We implement a symbolic analyser with a dummy ciphertext datatype that increments the modulus Q of the input ciphertext and copies it into the output ciphertext whenever divScalar is called. All other HISA instructions only copy the modulus from input to output. The modulus Q in the dummy output ciphertext is the depth of the circuit. The input ciphertext should be encrypted with a modulus Q that is at least the sum of the depth of the circuit and the desired output precision. This ensures accurate results. For the input modulus Q, a large enough N must be chosen to guarantee security of the data. This is a deterministic map from Q to N. In some cases, Q might require a prohibitively large N, in which case the compiler needs to introduce bootstrapping in-between. We do not explore this in this paper.
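A depth analyser of this kind can be sketched with a dummy ciphertext that carries a single counter. The names DummyCipher, DepthAnalyser, and required_log_q are hypothetical. Taking the maximum of the two operands in add is an assumption of this sketch, since the text only specifies the single-input case. The map from Q to N is omitted because it depends on the security standard in use.

```python
from dataclasses import dataclass


@dataclass
class DummyCipher:
    consumed: int = 0  # modulus bits consumed on the path to this value


class DepthAnalyser:
    """Tracks consumed modulus; only div_scalar changes the counter."""

    def div_scalar(self, c, log_p):
        return DummyCipher(c.consumed + log_p)

    def mul_scalar(self, c, s):
        return DummyCipher(c.consumed)

    def rot_left(self, c, k):
        return DummyCipher(c.consumed)

    def add(self, a, b):
        # Assumption: a two-input instruction inherits the deeper operand.
        return DummyCipher(max(a.consumed, b.consumed))


def required_log_q(circuit, output_precision):
    """Depth of the circuit plus the precision wanted in the output."""
    out = circuit(DepthAnalyser(), DummyCipher())
    return out.consumed + output_precision


def two_layer_circuit(fhe, x, log_p=30):
    """Toy circuit: two scale-then-rescale layers."""
    for _ in range(2):
        x = fhe.mul_scalar(x, 3)
        x = fhe.div_scalar(x, log_p)
    return x


log_q = required_log_q(two_layer_circuit, output_precision=30)
```

In this toy circuit each layer consumes 30 bits, so the input needs 60 bits for depth plus 30 bits for the output.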

6.3 Padding Selection

This analysis does not need to track the data flow in the HISA instructions. It requires analysing the metadata in the CipherTensor that flows through the homomorphic tensor circuit. This is quite straightforward. If tensor operations like convolution need padding, then the previous operations until the input should maintain that padding. Some tensor operations may change strides, in which case the padding required scales by that factor. The input padding (in each ciphertext dimension) selected is the maximum padding required for any homomorphic tensor operation in the circuit.
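One way to read this rule is as a single forward walk over the operations for one ciphertext dimension. The sketch below is an interpretation under assumptions: each operation is summarized by a hypothetical pair (padding it needs, factor by which it multiplies the stride), and the function name select_input_padding is not from CHET.

```python
def select_input_padding(ops):
    """Padding to allocate at the input for one ciphertext dimension.

    ops is a list of (required_padding, stride_factor) pairs in circuit
    order. Padding needed by a later operation is scaled by the product
    of the stride factors of all operations that precede it.
    """
    scale = 1
    needed = 0
    for required_padding, stride_factor in ops:
        needed = max(needed, required_padding * scale)
        scale *= stride_factor
    return needed


# Convolution needing 2, a stride-2 pooling, then a convolution needing 1.
small = select_input_padding([(2, 1), (0, 2), (1, 1)])
# Two stride-2 poolings before a convolution needing 1 element of padding.
deep = select_input_padding([(1, 1), (0, 2), (0, 2), (1, 1)])
```

The same walk would be repeated for each ciphertext dimension, for example once for height and once for width.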

6.4 Optimization: Rotation Keys Selection

By default, HEAAN inserts public evaluation keys for power-of-2 left and right rotations, and all rotations are performed using a combination of power-of-2 rotations. The ciphertext size is $N/2$, so HEAAN stores $2\log N - 2$ rotation keys by default. The rotation keys consume significant memory, so this is a trade-off between space and performance. By storing only these keys, any rotation can be performed. However, this is too conservative. In a given homomorphic tensor circuit, the number of distinct slots to rotate would not be in the order of $N$. We use the analyser to track the distinct slots to rotate used in the homomorphic tensor circuit. All rotate HISA instructions store the constant slots to rotate in the analyser (right rotations are converted to left rotations) and other HISA instructions are ignored. The analyser returns the distinct slots to rotate that were used. The compiler generates the encryptor such that it would generate evaluation keys for these rotations using the private key on the client (Figure 2). We do not need power-of-2 rotation keys, and we observe that the number of rotation keys chosen by the compiler is a constant factor of $\log N$.
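A rotation-tracking analyser can be sketched as follows. The class name RotationAnalyser and its method names are hypothetical, and the conversion of a right rotation by k into a left rotation by (number of slots minus k) assumes cyclic rotation over all slots.

```python
class RotationAnalyser:
    """Collects the distinct left-rotation amounts used by a circuit."""

    def __init__(self, num_slots):
        self.num_slots = num_slots
        self.amounts = set()

    def rot_left(self, c, k):
        k %= self.num_slots
        if k:  # a rotation by zero slots needs no key
            self.amounts.add(k)
        return c

    def rot_right(self, c, k):
        # A right rotation by k equals a left rotation by num_slots - k.
        return self.rot_left(c, -k)

    def mul_scalar(self, c, s):
        return c  # ignored by this analysis

    def add(self, a, b):
        return a  # ignored by this analysis

    def keys_needed(self):
        return sorted(self.amounts)


tracker = RotationAnalyser(num_slots=8)
token = None  # symbolic placeholder for a ciphertext
token = tracker.rot_left(token, 1)
token = tracker.rot_left(token, 3)
token = tracker.rot_right(token, 5)  # same as a left rotation by 3
token = tracker.rot_left(token, 8)   # full cycle, no key needed
token = tracker.rot_left(token, 1)   # repeated amount
```

The resulting list is what an encryptor would need in order to generate exactly one evaluation key per distinct rotation.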

6.5 Optimization: Data Layout Selection

Different homomorphic operations have different costs, even within the same FHE scheme. The cost of homomorphic operations could also vary a lot based on small variations of the scheme. For example, by using a full-RNS variant of the HEAAN scheme and storing the ciphertexts and plaintexts in the number-theoretic-transform (NTT) domain, we can decrease the complexity of mulPlain from $O(N \log N)$ to $O(N)$. The compiler can encode the cost of each operation either from asymptotic complexity or from microbenchmarking each operation. These costs can then be used to decide which implementation or data layout to choose for a given homomorphic tensor circuit. The analyser in this case counts the number of occurrences of each operation and then uses the asymptotic complexity to determine the total cost of the circuit. The transformer creates a homomorphic tensor circuit corresponding to a data layout and gets its cost using the analyser. It repeats this for different possible data layout options and then chooses the one with the lowest cost.
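The selection loop can be sketched as a weighted sum over operation counts followed by a minimum. The operation counts and per-operation costs below are invented for illustration only and are not measurements. The function names circuit_cost and choose_layout are hypothetical.

```python
from collections import Counter


def circuit_cost(op_counts, op_cost):
    """Total cost: occurrences of each operation times its unit cost."""
    return sum(count * op_cost[op] for op, count in op_counts.items())


def choose_layout(candidates, op_cost):
    """Return the layout whose analysed circuit has the lowest cost.

    candidates maps a layout name to the operation counts that the
    analyser reported for the circuit instantiated with that layout.
    """
    return min(candidates, key=lambda name: circuit_cost(candidates[name], op_cost))


# Illustrative unit costs in arbitrary units (assumed, not measured).
unit_cost = {"rotLeft": 5, "mulScalar": 1, "mulPlain": 8, "add": 1}

# Illustrative operation counts for two layouts of the same circuit.
layouts = {
    "HW": Counter(rotLeft=100, mulScalar=100, add=100),
    "CHW": Counter(rotLeft=40, mulPlain=30, add=40),
}
best = choose_layout(layouts, unit_cost)
```

With a different cost table, for example one obtained by microbenchmarking another FHE library, the same counts could lead to a different choice.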

7 Evaluation

Our evaluation targets a set of neural network architectures for image classification tasks described in Figure 5.

| Network | Conv layers | FC layers | Act layers | # FP operations |
|---|---|---|---|---|
| LeNet-5-small | 2 | 2 | 4 | 159960 |
| LeNet-5-medium | 2 | 2 | 4 | 5791168 |
| LeNet-5-large | 2 | 2 | 4 | 21385674 |
| Industrial | 5 | 2 | 6 | - |
| SqueezeNet-CIFAR | 10 | 0 | 9 | 37759754 |

Figure 5: DNNs used in our evaluation.

LeNet-5 is a series of networks for the MNIST [3] dataset. We use three versions with different numbers of neurons: LeNet-5-small, LeNet-5-medium, and LeNet-5-large. The largest one matches the one used in TensorFlow's tutorials [5]. These networks have two convolutional layers, each followed by ReLU activation and max pooling, and two fully connected layers with a ReLU in between.


SqueezeNet-CIFAR is a network for the CIFAR-10 dataset [1] that follows the SqueezeNet [25] architecture. This version has 4 Fire-modules [4] for a total of 10 convolutional layers.


Industrial is a pretrained HE-compatible network from an industry partner for a privacy-sensitive image classification task. We are unable to reveal the details of the network other than the fact that it has 5 convolutional layers and 2 fully connected layers.

All networks other than Industrial use ReLUs and max-pooling, which are not compatible with homomorphic evaluation. For these networks, we modified the activation functions to a second-degree polynomial [23, 13]. The key difference with prior work is that our activation functions are $f(x) = ax^2 + bx$ with learnable parameters $a$ and $b$. During the training phase, the DNN adjusts these parameters automatically to appropriately approximate the ReLU function. To avoid exploding the gradients during training (which usually happens during the initial parts of training), we initialized $a$ to zero and clipped the gradients when large. We also replaced max-pooling with average-pooling.

We trained LeNet-5-large to an accuracy of 0.993, which matches that of the network in the TensorFlow’s tutorial [5]. For SqueezeNet-CIFAR, we additionally introduce L2-regularization with a scale of to improve performance. Our resulting accuracy is 0.815 which is close to the accuracy of 0.84 of the non-HE compatible model. To the best of our knowledge SqueezeNet-CIFAR is the deepest neural network that has been homomorphically evaluated.

All experiments were run on a dual socket Intel Xeon E5-2667v3@3.2GHz with 224 GiB of memory. Hyperthreading was off for a total of 16 hardware threads. All runtimes are reported as averages over 20 different images. We present the average latency of image inference with a batch size of 1.

| Model | CHET | Hand-written |
|---|---|---|
| LeNet-5-small | 8 | 14 |
| LeNet-5-medium | 51 | 140 |
| LeNet-5-large | 265 | - |
| Industrial | 312 | 2413 |
| SqueezeNet-CIFAR10 | 1342 | - |

Figure 6: Average latency (in seconds) of CHET and hand-written versions.
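
| Model | $\log N$ | $\log Q$ | $P_c$ | $P_w$ |
|---|---|---|---|---|
| LeNet-5-small | 14 | 240 | 30 | 16 |
| LeNet-5-medium | 14 | 240 | 30 | 16 |
| LeNet-5-large | 15 | 400 | 40 | 20 |
| Industrial | 16 | 705 | 35 | 25 |
| SqueezeNet-CIFAR10 | 16 | 940 | 30 | 20 |

Figure 7: Encryption parameters selected by CHET and the user-provided precisions for each model.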

| Model | HW | CHW | HW-conv, CHW-rest | CHW-fc, HW-before |
|---|---|---|---|---|
| LeNet-5-small | 8 | 12 | 8 | 8 |
| LeNet-5-medium | 82 | 91 | 52 | 51 |
| LeNet-5-large | 325 | 423 | 270 | 265 |
| Industrial | 330 | 312 | 379 | 381 |
| SqueezeNet-CIFAR | 1342 | 1620 | 1550 | 1342 |

Figure 8: Average latency (in seconds) with different layouts.

| Model | Unoptimized | Optimized |
|---|---|---|
| LeNet-5-small | 14 | 8 |
| LeNet-5-medium | 73 | 51 |
| LeNet-5-large | 426 | 265 |
| Industrial | 645 | 312 |
| SqueezeNet-CIFAR | 2648 | 1342 |

Figure 9: Average latency (in seconds) with and without rotation keys optimization.

Comparison with hand-written:

Figure 6 compares hand-written implementations and CHET with all optimizations. CHET clearly outperforms hand-written implementations. The hand-written implementations lack some of the optimizations in the CHET compiler and runtime. Moreover, it is difficult to scale these hand-written implementations to large networks like LeNet-5-large and SqueezeNet-CIFAR10, so we do not have hand-written implementations for these to compare against.

Parameter Selection:

In Figure 7, the last columns show the precision (the number of decimal digits) required for the image or ciphertext ($P_c$) and the weights or the plaintext ($P_w$), which are provided by the user. The precision provided is used by the CHET compiler to select the encryption parameters $N$ and $Q$. The values of these parameters grow with the depth of the circuit, as shown in the figure. With these parameters, CHET-generated homomorphic tensor circuits achieve the same accuracy as the unencrypted circuits. In addition, the difference between the output values of these circuits is within the desired precision of the output.

Data Layout Selection:

We evaluate four different data tiling layout choices for the three largest networks: (i) HW: each ciphertext has all height and width elements of a single channel, (ii) CHW: each ciphertext has multiple channels (all height and width elements of each), (iii) HW-conv and CHW-rest: same as CHW, but move to HW before each convolution and back to CHW after each convolution, and (iv) CHW-fc and HW-before: same as HW, but switch to CHW during the first fully connected layer and CHW thereafter. Figure 8 presents the average latency of each network for each layout. We can see that the layout providing the lowest latency differs from network to network. It is very difficult for the user to determine which data layout is best and, more importantly, it is difficult to implement each network manually using a different data layout. This highlights how the compiler should search the space of possible layouts and kernel implementations for each program separately, while entrusting the runtime to implement it efficiently. In this case, the compiler chooses the best performing data layout for each network based on the cost model of HEAAN.

Rotation Keys Selection:

We evaluate the efficacy of our rotation keys optimization on the three largest networks. Figure 9 presents the average latencies with the optimization on or off. The optimization provides significantly improved performance for all networks and should always be used (barring very memory-constrained environments). Having this optimization implemented as an automatic compiler pass removes the burden of adding proper rotation keys in each program separately.

8 Related Work

FHE is currently an active area of research. See Acar et al. [6] for more details. Many have observed the need and suitability of FHE for machine learning tasks. We survey the most closely related work below.

Cryptonets [23] was the first tool to demonstrate a fully-homomorphic inference of a DNN by replacing the (FHE incompatible) ReLU activations with a quadratic function. They demonstrated an accuracy of 98.95% for MNIST with a network with two hidden layers, with a latency of 250 seconds per image. Chabanne et al. [13] improve upon Cryptonets by using batch normalization [26] to bound the inputs to the activations into a small interval where the quadratic approximation is valid. They report an improvement in accuracy for MNIST to 99.30% with a network with 6 activation layers. Our method of retraining the quadratic activation function is inspired by these works. Our emphasis in this paper is primarily on automating the manual and error-prone hand tuning required to ensure that the networks are secure, correct, and efficient.

Bourse et al. [9] use the TFHE library [17] for DNN inference. TFHE operates on bits and is thus slow for multi-bit integer arithmetic. To overcome this difficulty, Bourse et al. instead use a discrete binary neural network (where activation functions output either -1 or 1) with a hidden layer using the sign function as the activation. Our goal in this paper is to build a framework for compiling larger general-purpose DNNs.

Similarly, Cingulata [12, 2] is a compiler for converting C++ programs into a Boolean circuit, which is then evaluated using a backend FHE library. Despite various optimizations [11], this approach is unlikely to scale for large DNNs.

DeepSecure [30] and SecureML [29] use secure multi-party computation techniques to respectively perform DNN inference and training. Such techniques employ communication between multiple entities that are assumed to not collude and thus do not have the simple trust model of FHE. On the other hand, such approaches are more flexible in the operations they can perform and are computationally cheaper.

Finally, prior work [31, 19] has used partially homomorphic encryption schemes (which support addition or multiplication of encrypted data but not both) to determine the encryption schemes to use for different data items so as to execute a given program. While they are able to use computationally efficient schemes, such techniques are not applicable for evaluating DNNs that require both multiplication and addition to be done on the same input.

9 Conclusion

Good abstractions separate concerns to either side of that abstraction. For example, x86 lets hardware manufacturers innovate independently of the software developers and compiler writers that target that abstraction. Likewise, cuDNN lets machine learning experts target various hardware implementations and reap the benefits of new GPU hardware without changing their software.

This paper introduces such an abstraction for FHE applications that cleanly abstracts and exposes features of FHE implementations. We demonstrate a compiler and runtime that target FHE tensor programs, evaluate that compiler and runtime on real-world CNN models, and demonstrate that our compiler is able to significantly optimize the performance of FHE tensor programs. Because of these optimizations, this paper demonstrates the deepest FHE-based CNNs to date.

This paper demonstrates the fruitful application of systems and compiler research for FHE applications. We posit there are many open problems that such areas can help solve (e.g., compiler support that turns expressions into FHE-compatible ones, or more optimization passes specific to encryption parameters). We hope these opportunities excite other researchers as much as they do the authors.