Does Fully Homomorphic Encryption Need Compute Acceleration?

12/13/2021
by   Leo de Castro, et al.
Boston University
MIT
Analog Devices

Fully Homomorphic Encryption (FHE) allows arbitrarily complex computations on encrypted data without ever needing to decrypt it, thus enabling us to maintain data privacy on third-party systems. Unfortunately, sustaining deep computations with FHE requires a periodic noise reduction step known as bootstrapping. The cost of the bootstrapping operation is one of the primary barriers to the widespread adoption of FHE. In this paper, we present an in-depth architectural analysis of the bootstrapping step in FHE. First, we observe that secure implementations of bootstrapping exhibit a low arithmetic intensity (<1 Op/byte), require large caches (>100 MB), and are heavily bound by the main memory bandwidth. Consequently, we demonstrate that existing workloads observe marginal performance gains from the design of bespoke high-throughput arithmetic units tailored to FHE. Second, we propose several cache-friendly algorithmic optimizations that improve the throughput in FHE bootstrapping by enabling up to 3.2x higher arithmetic intensity and 4.6x lower memory bandwidth. Our optimizations apply to a wide range of structurally similar computations such as private evaluation and training of machine learning models. Finally, we incorporate these optimizations into an architectural tool which, given a cache size, memory subsystem, the number of functional units and a desired security level, selects optimal cryptosystem parameters to maximize the bootstrapping throughput. Our optimized bootstrapping implementation represents a best-case scenario for compute acceleration of FHE. We show that despite these optimizations, bootstrapping continues to be bottlenecked by main memory bandwidth. We propose new research directions to address the underlying memory bottleneck. In summary, our answer to the titular question is: yes, but only after addressing the memory bottleneck!


I Introduction

The rapid development of cloud-based systems has enabled reliable and affordable access to shared computing resources at scale. However, this shared access raises substantial privacy and security challenges. Therefore, new techniques are required to guarantee the confidentiality of sensitive user data when it is sent to the cloud for processing. Fully Homomorphic Encryption (FHE) [RAD, Gentry09] enables cloud operators to perform complex computations on encrypted user data without ever needing to decrypt it. The result of such FHE-based computation is in an encrypted form and can only be decrypted by the data owner. An illustrative use case of how a data owner can outsource computation on private data to an untrusted third-party cloud platform is shown in Figure 1.

Fig. 1: Third-party cloud platform with outsourced FHE-based computing.

While FHE-based privacy-preserving computing is promising, performing large encrypted computations with FHE still remains several orders of magnitude slower than operating on unencrypted data, which makes broad adoption impractical. This slowdown is an inherent feature of all existing lattice-based FHE schemes. All of these schemes produce ciphertexts containing a noise term, which is necessary for security. Each subsequent homomorphic operation performed on the ciphertext increases its noise, until it grows beyond a critical level after which recovery of the computation output is impossible. Sustained FHE computation thus requires a periodic de-noising procedure, called bootstrapping, to keep the noise below a correctness threshold. Unfortunately, this bootstrapping step is expensive in terms of both compute and memory requirements and is often more expensive than primitive operations like addition and multiplication on encrypted data.

Real-world applications commonly attempt to amortize this bootstrapping cost across multiple homomorphic operations. Even when considering these application-specific optimizations, bootstrapping consumes a large fraction of the total compute and memory budget for end-to-end operations like machine learning training [GPUBoot21]. To make FHE-based computing practical, we need to consider a multi-layer approach to accelerate both the bootstrapping step as well as its primitive building blocks using a combination of algorithmic and hardware techniques.

In this work, we first perform a thorough compute and memory analysis of both simple and complex FHE primitives, including the bootstrapping step, with the intent to determine the limits and potential opportunities for accelerating FHE. Our analysis reveals that all FHE operations exhibit low arithmetic intensity (less than 1 Op/byte) and require working-set sizes of hundreds of MB for practical and secure parameters. In fact, we observe that most existing performance optimization techniques for FHE often increase memory bandwidth requirements. These include both linear and non-linear operation optimizations proposed by Han and Ki [HK19], Han, Hhan and Cheon [CHH18], and Bossuat, Mouchet, Troncoso-Pastoriza and Hubaux [BMTH20]. A recent bootstrapping implementation on GPUs by Jung, Kim, Ahn, Cheon and Lee [GPUBoot21] is the first work to perform memory-centric optimizations for the linear operations in bootstrapping. Even after these optimizations, their implementation continues to be bound by main memory bandwidth and exhibits an arithmetic intensity below 1 Op/byte. On the other side of the design spectrum, recent work by Samardzic et al. [F1Paper21] presents an architecture for a high-throughput hardware accelerator for FHE. This work primarily focuses on smaller parameter sets where full ciphertexts fit in on-chip cache memory, allowing them to bypass the memory bandwidth limitation. However, many natural applications such as SIMD bootstrapping, deep neural network inference (with complex activation functions) and machine learning training require larger parameter sets that are not addressed in [F1Paper21].

In this work, we focus on presenting our three key contributions, i.e., application benchmarking, new techniques to improve memory performance, and evaluation of these techniques on end-to-end applications. More specifically:

  • We present detailed benchmarking of the compute and memory requirements of various FHE computations ranging from primitive operations to end-to-end applications such as machine-learning training. We show that all these benchmarks exhibit low arithmetic intensity and require large working sets in on-chip memory. We observe that these working sets do not fit in the last-level caches of today's reticle-limited chips, leading to bootstrapping and other applications being bottlenecked by memory accesses.

  • We next present techniques to improve main memory bandwidth utilization by effectively managing the moderate last-level cache provided by currently available commercial hardware. For cache-pressured hardware (with a limited last-level cache), we propose a domain-specific physical address mapping to enhance DRAM utilization. We then present hardware-independent algorithmic optimizations that reduce the memory and compute requirements of FHE operations.

  • We finally propose an optimized, memory-aware cryptosystem parameter set that maximizes the throughput of FHE bootstrapping and logistic regression training by enabling up to 3.2x higher arithmetic intensity and 4.6x lower memory bandwidth.

The techniques that we propose often compose with prior art and can be used as drop-ins to provide performance improvements in existing implementations without the need for new hardware. Our proposed bootstrapping parameter set represents an upper limit on the performance of FHE operations that can be attained through pure compute acceleration when paired with existing state-of-the-art memory subsystems. Even with this optimal parameter set, we observe that the bootstrapping step is still primarily memory bound. Thus:

Our key conceptual take-away is that to accelerate FHE, we need novel techniques to address the underlying memory bandwidth issues. Compute acceleration alone is unlikely to make a dent.

Towards the goal of addressing memory bandwidth issues, we propose novel near-term algorithmic and architectural research directions.

II Fully Homomorphic Encryption: The API

To set the stage, in this section we present the operations implemented by the Cheon-Kim-Kim-Song (CKKS) [CKKS17] FHE scheme. We organize these operations in the form of an API that can be used by any application developer to design privacy-preserving applications. Specifying the CKKS scheme requires several parameters, and we summarize our notation for these parameters in Table I. Though we focus on the CKKS scheme, the API is generic and can be used for the BGV [BGV12] and B/FV [Brak12, FV12] schemes as well (one exception is the conjugation operation, which the BGV and B/FV schemes do not support, since they do not encrypt complex numbers).

Parameter | Description
N | Number of coefficients in a polynomial in the ciphertext ring.
n | n = N/2, the number of plaintext elements in a single ciphertext.
Q | Full modulus of a ciphertext coefficient.
q_i | Machine-word-sized prime modulus; a limb of Q.
Δ | Scaling factor of a CKKS plaintext.
P | Product of the additional limbs added for the raised modulus.
L | Maximum number of limbs in a ciphertext.
ℓ | Current number of limbs in a ciphertext.
dnum | Number of digits in the switching key.
α | α = L/dnum. Number of limbs that comprise a single digit in the key-switching decomposition. This value is fixed throughout the computation.
β | β = ⌈ℓ/α⌉. An ℓ-limb polynomial is split into this number of digits during base decomposition.
TABLE I: CKKS FHE Parameters and their description.

Operation Name | Implementation | Description
PtAdd | - | Adds a plaintext vector to an encrypted vector.
Add | - | Adds two encrypted vectors.
PtMult | Algorithm 1 | Multiplies a plaintext vector and an encrypted vector.
Mult | Algorithm 2 | Multiplies two encrypted vectors.
Rotate | Algorithm 3 | Rotates a vector by k positions; see Section II-A for an illustration.
Conjugate | Algorithm 3 | Outputs an encryption of the complex conjugate of the encrypted input vector.
TABLE II: CKKS Fully Homomorphic Encryption API.
  • Through a clever encoding [CKKS17], the Conjugate implementation is identical to the Rotate implementation.

II-A Homomorphic Encryption API

The basic plaintext data type in CKKS is a vector of length n where each entry is chosen from ℂ, the field of complex numbers. All arithmetic operations on plaintexts are component-wise: for two plaintext vectors u and v, the entries of u + v (resp. u ⊙ v) are the component-wise sums (resp. products) of the entries of u with the corresponding entries of v. We denote the encryption of a length-n vector v by Enc(v).

Table II gives a complete description of the API with the exception of the rotation operation, which we describe here. The Rotate operation takes in an encryption of a vector of length n and an integer k, and outputs an encryption of a rotation of the vector by k positions. As an example, when k = 1, the rotation maps an encryption of (v_1, v_2, ..., v_n) to an encryption of (v_2, ..., v_n, v_1). The Rotate operation is necessary for computations that operate on data residing in different slots of the encrypted vectors.
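As a plaintext illustration of these slot semantics (a minimal Python sketch, not tied to any FHE library), a rotation by k simply cycles the vector entries:

```python
# Plaintext illustration of Rotate's slot semantics (no encryption involved).
def rotate(slots, k):
    """Rotate a list of plaintext slots by k positions."""
    k %= len(slots)
    return slots[k:] + slots[:k]

assert rotate([10, 20, 30, 40], 1) == [20, 30, 40, 10]
```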

II-B Modular Arithmetic and the Residue Number System

Scalar Modular Arithmetic

Nearly all FHE operations reduce to scalar modular additions and scalar modular multiplications. Current CPU/GPU architectures do not implement modular arithmetic directly but emulate it via multiple arithmetic instructions, which significantly increases the amount of compute required for these operations. Therefore, optimizing modular arithmetic is critical to optimizing FHE computation.

To perform modular addition over operands that are already reduced, we use the standard approach of a conditional subtraction if the addition overflows the modulus. For generic modular multiplications, we use the Barrett reduction technique [Barrett]. When computing the sum of many scalars, we avoid performing a modular reduction until the end of the summation, as long as the unreduced sum fits in a machine word. As an optimization, we use Shoup's technique [shoup2001ntl] for constant multiplication: when computing a·b mod q where the multiplicand b and the modulus q are known in advance, we can precompute a value b' = ⌊b·2^w / q⌋ (for a w-bit machine word) such that computing a·b mod q with the help of b' is much faster than a direct reduction.
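As a concrete, hedged illustration, the following Python sketch mirrors these scalar tricks with arbitrary-precision integers standing in for 64-bit words; the modulus is an arbitrary illustrative value, not one of the paper's parameters.

```python
# Sketch of the scalar modular-arithmetic optimizations described above.
WORD = 64  # assumed machine word size in bits

def mod_add(a, b, q):
    """Modular addition of already-reduced operands via conditional subtraction."""
    s = a + b
    return s - q if s >= q else s

def shoup_precompute(b, q):
    """Precompute b' = floor(b * 2^WORD / q) for a constant multiplicand b < q."""
    return (b << WORD) // q

def shoup_mul(a, b, b_shoup, q):
    """Compute a*b mod q using the precomputed b'; no full-width division needed."""
    quotient_estimate = (a * b_shoup) >> WORD  # approximates floor(a*b / q)
    r = a * b - quotient_estimate * q          # result lies in [0, 2q)
    return r - q if r >= q else r

q = (1 << 59) - 55                     # illustrative word-sized odd modulus
b = 123456789
b_shoup = shoup_precompute(b, q)
assert shoup_mul(987654321, b, b_shoup, q) == (987654321 * b) % q
assert mod_add(q - 1, 1, q) == 0
```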

Residue Number System (RNS)

Often the scalars in homomorphic encryption schemes are very large, on the order of thousands of bits. To compute on such large numbers, we use the residue number system (also called the Chinese remainder representation): we represent numbers modulo Q = q_1 · q_2 ⋯ q_ℓ, where each q_i is a prime number that fits in a standard machine word, as numbers modulo each of the q_i. We call the set {q_1, ..., q_ℓ} an RNS basis. We refer to each q_i as a limb of Q.

This allows us to operate over values in ℤ_Q without any native support for multi-precision arithmetic. Instead, we can represent x ∈ ℤ_Q as a length-ℓ vector of scalars (x_1, ..., x_ℓ), where x_i = x mod q_i. We refer to each x_i as a limb of x. To add two values x, y ∈ ℤ_Q, we have (x + y) mod q_i = (x_i + y_i) mod q_i for each i. Similarly, we have (x · y) mod q_i = (x_i · y_i) mod q_i. This allows us to compute addition and multiplication over ℤ_Q while only operating over standard machine words. The size of this representation of an element of ℤ_Q is ℓ machine words.
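The following Python sketch (illustrative only; the moduli are small Mersenne primes rather than the word-sized NTT primes an actual implementation would use) shows this limb-wise arithmetic:

```python
# RNS sketch: arithmetic on a large value via its limbs modulo coprime moduli.
from math import prod

basis = [2**13 - 1, 2**31 - 1, 2**61 - 1]   # small Mersenne primes as a toy RNS basis
Q = prod(basis)

def to_rns(x):
    """The limbs of x: its residues modulo each basis element."""
    return [x % qi for qi in basis]

def rns_add(xs, ys):
    return [(xi + yi) % qi for xi, yi, qi in zip(xs, ys, basis)]

def rns_mul(xs, ys):
    return [(xi * yi) % qi for xi, yi, qi in zip(xs, ys, basis)]

x, y = 2**100 + 7, 2**90 + 11
assert rns_add(to_rns(x), to_rns(y)) == to_rns((x + y) % Q)
assert rns_mul(to_rns(x), to_rns(y)) == to_rns((x * y) % Q)
```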

II-C CKKS Ciphertext Structure

In this section, we give the general structure of a ciphertext in the CKKS [CKKS17] homomorphic encryption scheme. A ciphertext is a pair of polynomials, each with N coefficients. The coefficients of these polynomials are elements of ℤ_Q, where Q has ℓ limbs. Thus, in total, the size of a ciphertext is 2·N·ℓ machine words.

In CKKS, we are able to encrypt non-integer values, including complex numbers. The ciphertexts are "packed," which means they encrypt vectors in ℂ^n, where n = N/2, in a single ciphertext. For a vector v ∈ ℂ^n, we denote its encryption as ct_v = (a_v, b_v), where a_v and b_v are the two polynomials that comprise the ciphertext. We omit the subscript v when there is no cause for confusion.

An example set of ciphertext parameters that achieves a 128-bit security level is given in Table XI. With an 8-byte machine word, this gives a total ciphertext size of tens of MB. Note that in today's reticle-limited systems, the largest last-level cache size is about 40 MB [nvidiaA100]. Consequently, we will not be able to fit even a single ciphertext in the last-level cache, which indicates the need for multiple expensive DRAM accesses when operating on ciphertexts.

Polynomial Representation

In order to enable fast polynomial multiplication, we represent all polynomials by default as a series of evaluations at fixed roots of unity. This allows polynomial multiplication to occur in O(N) time. We refer to this representation as the evaluation representation. Certain subroutines, defined in Section II-D, operate over the polynomial's coefficient representation, which is simply the vector of its coefficients. Addition of two polynomials and multiplication of a polynomial by a scalar take O(N) time in both the coefficient and the evaluation representation. Moving between representations requires a number-theoretic transform (NTT) or inverse NTT, which is the finite-field version of the fast Fourier transform (FFT) and takes O(N log N) time and O(N) space for a degree-N polynomial.
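The benefit of the evaluation representation can be seen in a toy Python sketch; a small prime and an O(N^2) evaluation loop stand in for a real NTT, which would perform the conversion in O(N log N):

```python
# Toy demonstration: multiplication in the evaluation representation is pointwise.
p = 97                                # toy NTT-friendly prime (97 = 12*8 + 1)
N = 8
root = pow(5, (p - 1) // N, p)        # 5 generates Z_97^*, so this is a primitive N-th root of unity
points = [pow(root, i, p) for i in range(N)]

def to_evaluation(coeffs):
    """Coefficient -> evaluation representation; a real NTT does this in O(N log N)."""
    return [sum(c * pow(x, j, p) for j, c in enumerate(coeffs)) % p for x in points]

a = [3, 1, 4, 1, 0, 0, 0, 0]          # degrees kept below N/2 so no wrap-around occurs
b = [2, 7, 1, 8, 0, 0, 0, 0]
product_eval = [(u * v) % p for u, v in zip(to_evaluation(a), to_evaluation(b))]  # O(N) multiply
```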

Encoding Plaintexts

CKKS supports non-integer messages, so all encoded messages must include a scaling factor Δ. The scaling factor is usually the size of one of the limbs of the ciphertext, which is slightly less than a machine word. When messages are multiplied together, this scaling factor grows as well. The scaling factor must be shrunk back down in order to avoid overflowing the ciphertext coefficient modulus. We discuss how this procedure works in Section II-D.

II-D Implementing the API

To implement the homomorphic API described in Table II, we need some “helper” subroutines. We first describe these subroutines and then provide the implementations of the homomorphic API using the subroutines.

Handling a Growing Scaling Factor

As mentioned in Section II-C, all encoded messages in CKKS must have a scaling factor Δ. In both the PtMult and Mult implementations, the multiplication of the encoded messages results in the product having a scaling factor of Δ². Before these operations can complete, we must shrink the scaling factor back down to Δ (or at least a value very close to Δ). If this step is neglected, the scaling factor will eventually grow to overflow the ciphertext modulus, resulting in decryption failure.

To shrink the scaling factor, we divide the ciphertext by Δ (or a value close to Δ) and round the result to the nearest integer. This operation, called Rescale, keeps the scaling factor of the ciphertext roughly the same throughout the computation. (A better name for this operation would be "divide and mod-down," because it reduces the scaling factor as well as the ciphertext modulus; in this paper, we stick to the standard terminology for consistency with the literature. For a more formal description, we refer the reader to [FullRNSHEAAN].) We sometimes refer to a Rescale instruction that occurs at the end of an operation as part of that operation.
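A plain fixed-point analogy (no encryption involved; the scaling factor below is an illustrative value) captures why the Rescale is needed after every multiplication:

```python
# Fixed-point view of the CKKS scaling factor and the Rescale step.
delta = 2**40                                # illustrative scaling factor
x, y = 3.14159, 2.71828
ex, ey = round(x * delta), round(y * delta)  # encoded messages carry one factor of delta each
product = ex * ey                            # the product now carries a factor of delta**2
rescaled = round(product / delta)            # Rescale: divide and round, back to a single delta
assert abs(rescaled / delta - x * y) < 1e-9
```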

Handling a Changing Decryption Key

In both the Mult and Rotate implementations, there is an intermediate ciphertext with a decryption key that differs from the decryption key of the input ciphertexts. In order to change this new decryption key back to the original decryption key, we perform a KeySwitch operation. This operation takes in a switching key and a ciphertext that is decryptable under a secret key s'. The output of the operation is a ciphertext that encrypts the same message but is decryptable under a different key s.

Key Switching [Bv11]

Since the KeySwitch operation differs between Mult and Rotate, we do not define it separately. Instead, we go a level deeper and define the subroutines necessary to implement KeySwitch for each of these operations. In addition to the ModDown operation, we use a basis-extension (ModUp) operation, which allows us to add primes to our RNS basis. We follow the structure of the switching key in the work of Han and Ki [HK19], where the switching key, parameterized by the decomposition number dnum, is a 2 x dnum matrix of polynomials:

ksk = [ (ksk_{0,1}, ksk_{0,2}, ..., ksk_{0,dnum}), (ksk_{1,1}, ksk_{1,2}, ..., ksk_{1,dnum}) ]    (1)

The KeySwitch operation requires that a polynomial be split into "digits," which are then multiplied with the switching key. We define a Decompose function that splits a polynomial into digits as well as a KeyMult operation that multiplies the digits by the switching key.

Before proceeding further, we refer the reader to Table III, where all the subroutines described above are defined in more detail. The implementations of the API functions are given in Algorithms 1, 2 and 3. We also give a batched rotation algorithm in Algorithm 4, which computes many rotations of the same ciphertext faster than applying Rotate independently several times.

Algorithm 1: PtMult, a plaintext-ciphertext multiplication ending in a Rescale.
Algorithm 2: Mult, a ciphertext-ciphertext multiplication including a key switch and ending in a Rescale; it contains the two ModDown operations that Section V-A later merges.
Algorithm 3: Rotate, a homomorphic rotation implemented as an automorphism followed by a key switch.
Algorithm 4: Batched rotation, which computes many rotations of the same ciphertext while sharing the digit decomposition across all of them.
Sub-routine | Description
ModDown | Takes a polynomial in the coefficient representation, where each coefficient is represented in an extended RNS basis; let P be the product of the last limbs of that basis. The output is a polynomial where each coefficient equals the corresponding input coefficient divided by P, plus some small rounding error.
Conv (basis change) | Takes a polynomial where each coefficient is in one RNS basis and outputs the representation of the polynomial where each coefficient is in another basis. The target basis could be a subset or superset of the source basis, or they could be unrelated. Note that this operation must also be performed in the coefficient representation.
Decompose | Takes in a polynomial and the parameter dnum and splits the polynomial into digits. If the polynomial has ℓ limbs, each digit has roughly α limbs.
KeyMult | Takes in a key-switching key with the structure of Eq. (1) and a vector of polynomials of length dnum. Let ksk_0 be the first row of the key and ksk_1 be the second row. The output of this operation is two polynomials, the inner products of the input vector with ksk_0 and with ksk_1.
Automorph | Takes a vector with n elements and an integer k and outputs a permutation of the elements. This permutation is an automorphism which is not simply a rotation; intuitively, applying the permutation to an encoded message results in the decoded value being permuted by the natural rotation by k.
TABLE III: CKKS Subroutines: These subroutines enable the implementation of the CKKS API defined in Table II.

Key Takeaway: The Shrinking Ciphertext Modulus. A main observation coming out of our description of the homomorphic API is that the ciphertext modulus shrinks with each PtMult (Algorithm 1) and Mult (Algorithm 2) operation. This occurs in the Rescale operations at the end of these functions. If a ciphertext begins with L limbs, we can only compute a circuit of multiplicative depth roughly L, since the ciphertext modulus shrinks by a number of limbs equal to the multiplicative depth of the circuit being homomorphically evaluated. This foreshadows the next section, where we present an operation called bootstrapping [Gentry09] that increases the ciphertext modulus.

II-E Concrete Costs

We present the hardware cost associated with various functions and subroutines in the FHE API in Table IV and Table V, and discuss the content of the tables briefly. To generate these performance numbers, we implement an architectural modeling tool that can perform an in-depth analysis given the number of functional units, the cache size, and the memory subsystem parameters. In addition, our tool allows us to tune nearly all parameters of the algorithm, including N, dnum, and the maximum ciphertext modulus, for a given security level.

Key Takeaway: Low Arithmetic Intensity. The key takeaway from the tables, in particular Table V, is that the arithmetic intensity, defined as the number of operations per byte transferred from DRAM, of all of the functions in the CKKS API is less than 1 Op/byte. This means that when the ciphertexts do not fit in on-chip memory, any natural application (e.g., logistic regression training, neural network evaluation, bootstrapping, etc.) built using these functions will have performance bounded by the memory bandwidth and not the computation speed.

Since our ciphertexts will remain too large to fit in the on-chip cache, much of this work focuses on improving the arithmetic intensity of CKKS bootstrapping. This translates to progressing further in the bootstrapping algorithm per memory transfer, which, in turn, translates to a faster bootstrapping implementation.
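A back-of-the-envelope roofline model makes the consequence of sub-1 Op/byte intensity concrete; the compute and bandwidth figures below are assumed, round numbers for a hypothetical accelerator, not measurements from this paper:

```python
# Roofline sketch: attainable throughput = min(peak compute, intensity * bandwidth).
PEAK_COMPUTE = 20e12       # assumed accelerator compute, 20 TOp/s
DRAM_BANDWIDTH = 1.6e12    # assumed HBM2-class bandwidth, 1.6 TB/s

def attainable_tops(arithmetic_intensity):
    return min(PEAK_COMPUTE, arithmetic_intensity * DRAM_BANDWIDTH) / 1e12

for ai in (0.5, 1.0, 3.2, 12.5):
    print(f"AI = {ai:4.1f} Op/byte -> {attainable_tops(ai):5.2f} TOp/s attainable")
# Below 12.5 Op/byte this hypothetical chip is bandwidth bound, so FHE kernels
# at <1 Op/byte leave most of the compute idle.
```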

TABLE IV: Hardware Cost of Auxiliary Subroutines. Columns: Sub-routine Name; Total Operations (in GOP); Total Mults (in GOP); Total DRAM Transfers (in GB); DRAM Limb Reads (in GB); DRAM Limb Writes (in GB); DRAM Key Reads (in GB); Arithmetic Intensity (in Op/byte). These benchmarks were taken for the baseline parameter set given in Table XI. The Total Operations column counts the number of modular additions and multiplications in the operations (for one of the subroutines this count is zero). GOP stands for Giga operations. The Total DRAM Transfers is the sum of DRAM Limb Reads, DRAM Limb Writes, and DRAM Key Reads, the last of which counts the reads specifically for the switching keys. One subroutine has no limb writes because its limbs are immediately used in the next operation; that write is counted in the subsequent operation, when the limbs are written out to be read back in slot-wise format, as discussed in Section IV-A. The Arithmetic Intensity column gives the number of operations per byte transferred from DRAM.
TABLE V: Hardware Cost of FHE APIs. Columns: Operation Name; Total Operations (in GOP); Total Mults (in GOP); Total DRAM Transfers (in GB); DRAM Limb Reads (in GB); DRAM Limb Writes (in GB); DRAM Key Reads (in GB); Arithmetic Intensity (in Op/byte). These benchmarks were taken for the baseline parameter set given in Table XI; the rotation benchmark computes a fixed number of rotations on the same ciphertext. See the caption of Table IV for a description of the columns.

III Fully Homomorphic Encryption: Applications

In this section, we describe how the FHE API from Section II can be leveraged to develop applications. As discussed in Section II-D, a CKKS ciphertext can only support computation up to a fixed multiplicative depth due to the shrinking ciphertext modulus. Once this depth is reached, a bootstrapping operation must be performed to grow the ciphertext modulus, which allows for computation to continue.

Many applications of interest have a deep circuit that requires bootstrapping multiple times: in general, machine learning training algorithms are good examples where deeper circuits for the training computation often lead to greater accuracy of the resulting model. In this section, we use logistic regression training over encrypted data as a running example to explain the process of FHE-based machine learning training. Logistic regression training contains both linear (e.g. inner-products) and non-linear (e.g. sigmoid) operations. The CKKS scheme naturally supports linear operations, while for non-linear operations we need to use a polynomial approximation (as in [Kim2018LogisticRM, HELogReg]). The greater the degree of the polynomial, the greater the accuracy of the approximation, which further drives an increase in the circuit depth, in turn requiring bootstrapping.

For our running example, we use the logistic regression training application given in Han, Song, Cheon and Park [HELogReg] and depicted in Fig. 2. The training process is iterative: it repeatedly computes an inner product followed by a sigmoid function on a training data set and the model weights. The logistic regression update equation is as follows:

w^(t+1) = w^(t) + (α_t / n) · Σ_{i=1}^{n} σ(-z_i^T · w^(t)) · z_i    (2)

The vector w^(t) is the weight vector, the values α_t and n are scalars (the learning rate and the number of training samples), and z_i represents the i-th vector of the training data set. The function σ is the sigmoid function.

To implement this iterative update, we split it into two phases: a linear phase that contains the inner products (in the real implementation of Equation 2, these inner products are batched into a matrix-vector product; we use the same algorithm as [HELogReg]) and a non-linear phase that contains the sigmoid function. We implement these phases separately with the common building blocks shown in Table VI. The linear phase can be implemented with an InnerProduct routine that computes the inner product of two encrypted vectors. The non-linear phase is approximated with a polynomial, and the homomorphic evaluation of this polynomial can be implemented with PolyEval. The scalar products and summation can be implemented with the plaintext-multiplication and addition functions from Table II. After some number of iterations, the encrypted weights are passed through the Bootstrap routine. The exact placement of the Bootstrap operation in a circuit is application-dependent. In our running example, bootstrapping needs to be done every three iterations (see Figure 2).
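For reference, a plaintext NumPy sketch of the two-phase update is shown below; the degree-3 sigmoid approximation coefficients are assumed (a commonly used least-squares fit on a bounded input range), and none of this reflects the encrypted implementation of [HELogReg]:

```python
# Plaintext sketch of the iterative update behind Eq. (2): a linear phase
# (batched inner products) and a non-linear phase (polynomial sigmoid).
import numpy as np

def sigmoid_poly(x):
    # assumed degree-3 approximation of the sigmoid, valid on a bounded input range
    return 0.5 + 0.15012 * x - 0.001593 * x**3

def lr_update(w, Z, alpha):
    """One update step; rows of Z are the label-scaled training vectors z_i."""
    margins = Z @ w                          # linear phase
    grad = Z.T @ sigmoid_poly(-margins)      # non-linear phase
    return w + (alpha / Z.shape[0]) * grad

rng = np.random.default_rng(0)
Z = rng.standard_normal((32, 8))             # toy data: 32 samples, 8 features
w = np.zeros(8)
for _ in range(3):                            # under FHE, a Bootstrap would follow these iterations
    w = lr_update(w, Z, alpha=1.0)
```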

Fig. 2: Logistic regression training on encrypted data.
Name | Description
InnerProduct | Computes the inner product of two encrypted vectors, using the specific encrypted inner-product algorithm from Han et al. [HELogReg].
PolyEval | Takes an encrypted vector and a (univariate) polynomial p as input. The result is an encryption of the evaluation of p, where each entry of the output is the evaluation of p on the corresponding entry of the input vector.
PtMatVecMult | Takes a plaintext matrix M and multiplies it by an encrypted vector v. The result is an encryption of the vector M·v. This is a major subroutine in Bootstrap.
Bootstrap | Takes in an encryption of a vector v and outputs an encryption of the same vector v. This operation is necessary to be able to compute indefinitely on encrypted data. Far from being a null operation, it is nearly always the bottleneck operation when computing over encrypted data.
TABLE VI: Homomorphic Encryption Application Building Blocks: These building blocks are implemented using the API from Table II.

III-A Bootstrapping

As discussed in Section II-D, the ciphertext modulus of CKKS shrinks with each multiplication. In order to compute indefinitely on a CKKS ciphertext, we must grow the ciphertext modulus without also growing the noise. This is not as simple as performing a ModUp operation. The CKKS bootstrapping procedure [CKKS20] begins with this modulus-raising step, which turns the underlying plaintext into m + q·I, where m is the original plaintext, q is the modulus of the input ciphertext, and I is some polynomial with small integer coefficients. The primary goal of the bootstrapping operation is to homomorphically evaluate the modular reduction operation modulo q on this plaintext, returning the plaintext back to m.

The CKKS bootstrapping algorithm follows a general structure that has remained relatively static in the literature [CCS18, CHH18, HK19, BMTH20, CKKS20] over the past few years. This structure has three main components: a linear operation, an approximation of the modular reduction function, and another linear operation. The linear operations in bootstrapping require homomorphically evaluating the DFT on the encrypted data so that we perform modulus reduction on the coefficient representation of the plaintext rather than on the evaluation (or slot) representation. The first of these DFT operations is called CoeffToSlot and the second is called SlotToCoeff. In between these two DFT operations is an approximation of the modular reduction function that consists of a polynomial evaluation followed by an exponentiation. For further details on the polynomial evaluation and the exponentiation, we refer the reader to [HK19, BMTH20].

To homomorphically evaluate the DFT, we use the observation that the DFT matrix can be factored into sub-matrices of smaller dimension. This turns the homomorphic DFT into a series of PtMatVecMult operations. However, there is a trade-off between the number of PtMatVecMult operations that must be computed and the size of the matrices in each instance. Each PtMatVecMult has a multiplicative depth of 1. The total dimension of the DFT is n for our parameters. Options to evaluate this DFT include evaluating a single PtMatVecMult with an n x n input, which would require a very large number of rotations, or evaluating log(n) instances in sequence with only two rotations per instance. The former corresponds to treating the DFT as a matrix-vector multiplication without using the structure of the DFT matrix, while the latter corresponds to running the FFT algorithm for the DFT.

We can interpolate between these two extremes to find the optimal depth vs. computation trade-off. Each sub-matrix in the factorization of the DFT matrix has a radix corresponding to the number of its non-zero diagonals. The smaller the radix, the fewer the rotations that must be computed during that PtMatVecMult instance. The rule is that the product of the radices of the iterations (in the DFT algorithm) must equal n. For example, for our parameters this gives options such as three iterations with larger radices or five iterations with smaller radices whose product is n. We call the number of iterations the DFT iteration count. The homomorphic inverse DFT is computed in an analogous way.

Our approximation of the modular reduction function follows the literature: we represent the modular reduction function modulo q with a sine function of period q, and then approximate this sine function with a polynomial. We use the Chebyshev polynomial construction from Han and Ki [HK19]; the degree of this polynomial is a bootstrapping parameter. We give high-level pseudocode for the bootstrapping algorithm in Algorithm 5.

Algorithm 5: Bootstrap at a high level: raise the ciphertext modulus, run the CoeffToSlot phase as a loop of homomorphic DFT iterations, evaluate the polynomial approximation of modular reduction, and run the SlotToCoeff phase as a loop of inverse homomorphic DFT iterations.
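A quick numeric check (illustrative, not part of the paper's evaluation) shows why a scaled sine can stand in for reduction modulo q when the message is small relative to q:

```python
# The plaintext after the modulus raise is t = m + q*I with |m| << q; the scaled
# sine recovers m because sin is q-periodic in t and nearly linear near multiples of q.
import math

q = 2**40
for I in (0, 3, -7):
    for m in (12345, -98765):
        t = m + q * I
        approx = (q / (2 * math.pi)) * math.sin(2 * math.pi * t / q)
        assert abs(approx - m) < 1e-2   # error is cubic in m/q, negligible here
```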

III-B Concrete Costs

We give the concrete costs of the logistic regression and Bootstrap subroutines in Table VII and Table VIII, respectively. As the tables show, the arithmetic intensity of the sub-routines is less than 1 Op/byte. As discussed in Section II-E, since our ciphertexts do not fit in cache, this means that the performance of all sub-routines is bounded by the main memory bandwidth. In Table VII, we give benchmarks for the logistic regression implementation based on our architecture modeling discussed in Section II-E. The parameters we use are from the work of Jung et al. [GPUBoot21], and were chosen to optimize their secure logistic regression application that leverages a GPU implementation of CKKS bootstrapping. We refer to the original work of Han et al. [HELogReg] for the full algorithm benchmarked in Table VII. We note that the benchmarked logistic regression iteration is the most expensive of the three iterations that follow a Bootstrap, since the ciphertexts in this iteration are the largest. As the ciphertext shrinks due to the reduced ciphertext modulus, the computation becomes cheaper. However, the arithmetic intensity remains essentially the same, and the performance of each phase of the algorithm is bottlenecked by the memory bandwidth. Overall, roughly half of the total runtime is spent in bootstrapping.

Key Takeaway: Bootstrapping is often the bottleneck operation in HE applications, especially applications that implement a deep circuit. For example, even when using a heavily optimized GPU implementation of bootstrapping, nearly half of the time in HE logistic regression training is spent on bootstrapping [GPUBoot21] (Table VII). This motivates the need to optimize the Bootstrap operation to efficiently support deep circuits. Furthermore, the building blocks of bootstrapping are the same as those of many other HE applications; there are essentially no subroutines that are unique to bootstrapping. Many of the optimizations we give in Section IV and Section V therefore apply more generally to HE applications.

TABLE VII: Hardware Cost of FHE Applications. Columns: Sub-routine Name; Total Operations (in GOP); Total Mults (in GOP); Total DRAM Transfers (in GB); DRAM Limb Reads (in GB); DRAM Limb Writes (in GB); DRAM Key Reads (in GB); Arithmetic Intensity (in Op/byte); the rows include the main sub-routines and a final "Full LR Iteration" row. These benchmarks were taken for the baseline parameter set given in Table XI; see the caption of Table IV for a description of the columns. The sub-routine benchmarks are for the first iteration after a Bootstrap, and the "Full LR Iteration" row is the first iteration of the training algorithm after a Bootstrap.
TABLE VIII: Hardware Cost of Bootstrapping. Columns: Sub-routine Name; Total Operations (in GOP); Total Mults (in GOP); Total DRAM Transfers (in GB); DRAM Limb Reads (in GB); DRAM Limb Writes (in GB); DRAM Key Reads (in GB); Arithmetic Intensity (in Op/byte). These benchmarks were taken for the baseline parameter set given in Table XI; see the caption of Table IV for a description of the columns. These benchmarks represent the performance of the main sub-routines of bootstrapping.

IV CKKS Bootstrapping: Caching Optimizations

In this section and Section V, we present our optimizations to the CKKS bootstrapping algorithm. These optimizations fall into two categories: those that rely on hardware assumptions and those that do not. Our first class of optimizations assumes a lower bound on the available cache capacity relative to the size of the ciphertext limbs, while the second class is more general: it reduces the total operation count of CKKS bootstrapping as well as the total number of DRAM reads, regardless of the hardware architecture.

This section focuses on the first set of optimizations. These caching optimizations do not affect the operation count of Bootstrap; instead, they reduce DRAM reads and writes to lower the overall memory bandwidth requirement. Our optimizations demonstrate how best to utilize caches of various sizes relative to the size of the ciphertext limbs. We quantify the improvements of these optimizations in Section IV-E, where we give benchmarks for progressively larger cache sizes. Our baseline benchmark is the parameter set from the GPU bootstrapping implementation of Jung et al. [GPUBoot21]. The parameters are given in Table XI.

IV-A Caching Limbs

This is the first in a series of optimizations that details how best to utilize a cache for various cache sizes relative to the ciphertext limbs. We begin by discussing how to utilize a cache that can store a constant number of limbs. Intuitively, this optimization computes as much as possible on a single limb before writing it back to the main memory. This often involves performing the operations of several higher-level functions on a single limb before beginning the same sequence of operations on the next limb. This technique was referred to by Jung et al. [GPUBoot21] as a “fusing” of operations, and we include all fusing operations listed in their work in our bootstrapping algorithm. In addition, we provide a novel data mapping technique to handle caching data with different data access patterns.

Data Access Patterns

Having a small cache (a few MB) in an FHE compute system comes with a caveat that must be carefully addressed. Some operations in CKKS, such as the NTT and the automorphism, operate on data within the slots of the same limb, independent of the other limbs in the ciphertext. On the other hand, the RNS basis change operations in ModUp and ModDown require interaction between a certain number of slots across various limbs. This requires having a few slots from multiple limbs in on-chip memory to reduce the number of accesses to main memory for a single operation. To account for this, we define two different types of data access patterns. For the functions where limbs can be operated upon independently, we define the data access pattern as limb-wise, and for the functions where slots can be operated upon independently, we define the data access pattern as slot-wise. A summary is given in Table IX. We also illustrate this with high-level pseudocode for ModDown in Algorithm 6. From this algorithm, it is evident that the operation includes both limb-wise and slot-wise steps, requiring a memory mapping that is efficient for both access patterns. A naive memory mapping would result in low throughput for at least one of these access patterns. Therefore, we describe a novel memory mapping approach to handle these two access patterns.

Algorithm 6: High-level structure of ModDown: an inverse NTT on each input limb (limb-wise), a basis conversion across limbs (slot-wise), and an NTT on each resulting limb (limb-wise).
Operation | Interaction | Independent | Access pattern
NTT, Automorph | Intra-limb | Inter-limb | limb-wise
Conv (basis change) | Inter-limb | Intra-limb | slot-wise
TABLE IX: Data dependencies and access patterns in different functions.
The Conv function is used in both ModUp and ModDown.

Physical Address Mapping

When we re-purpose the last-level cache to support both limb-wise and slot-wise access patterns, we observe that the physical address mapping of the data in main memory has a substantial impact on the time it takes to transfer data from main memory. Figure 3 (a) shows a natural physical address mapping for a ciphertext, which we call the baseline address mapping. Through simulations in DRAMSim3 [LYRSJ20], we notice that with this baseline address mapping, limb-wise accesses read a fixed number of limbs worth of data in a reasonable time, whereas the slot-wise access pattern requires substantially longer to transfer the same amount of data, far more than the time implied by the peak theoretical DDR4 bandwidth.

There are two reasons for this performance hit when doing slot-wise accesses. For our parameters, the ciphertext is tens of MB in size, and its limbs can be stored sequentially within a memory bank in one of the bank groups in main memory. Each limb of the ciphertext spans multiple rows of the memory bank. Typically, each bank in main memory has a currently activated row whose contents are copied into a row buffer (acting as a cache) that can be accessed quickly. However, with the slot-wise access pattern, every access tries to read a different row, which takes longer because each row must be activated first. Moreover, with slot-wise accesses, we are unable to exploit the fact that accesses to different bank groups require less time delay between them than accesses to banks within the same bank group. Instead, we keep accessing data from the same memory bank within the same bank group.
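A toy model (deliberately much simpler than the DRAMSim3 simulations above, with assumed row and word sizes) illustrates the row-activation problem: with a limb-major layout, a slot-wise traversal touches a new DRAM row on almost every access.

```python
# Count DRAM row switches for limb-wise vs. slot-wise traversal of a limb-major layout.
NUM_LIMBS, NUM_SLOTS = 8, 4096        # toy ciphertext dimensions
ROW_BYTES, WORD_BYTES = 8192, 8       # assumed 8 KiB DRAM row and 8-byte words

def row_of(limb, slot):
    address = (limb * NUM_SLOTS + slot) * WORD_BYTES   # limbs stored one after another
    return address // ROW_BYTES

def row_switches(order):
    switches, current = 0, None
    for limb, slot in order:
        r = row_of(limb, slot)
        if r != current:
            switches, current = switches + 1, r
    return switches

limb_wise = [(l, s) for l in range(NUM_LIMBS) for s in range(NUM_SLOTS)]
slot_wise = [(l, s) for s in range(NUM_SLOTS) for l in range(NUM_LIMBS)]
print(row_switches(limb_wise), row_switches(slot_wise))   # 32 vs. 32768 for these sizes
```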

Fig. 3: DDR4 physical address mapping. The baseline address mapping indexes all the slots using the lower-order address bits and all limbs using the immediately higher bits. In the optimized physical address mapping, slots are indexed using bits from the column index and bits from the row index, together accounting for all slots, while the limbs are indexed using the bits that select the bank group, bank, and rank, plus the remaining bits of the row index.

We propose an optimized physical address mapping as shown in Figure 3 (b). As shown in Table X, with this proposed address mapping, the limb-wise access requires less data transfer time than the baseline limb-wise access. Compared to the baseline slot-wise access pattern, our optimized mapping improves the slot-wise data transfer time significantly. Overall, the total data transfer time for the baseline address mapping is substantially higher than for our optimized mapping. Our optimized physical address mapping ensures that when performing limb-wise and slot-wise reads/writes, we exploit bank-level parallelism, and we reduce bank thrashing by not changing a bank's currently activated row frequently. Note that for a different DRAM type such as HBM2 or GDDR5/6, a similar physical address mapping can be derived to optimize main memory bandwidth utilization.

TABLE X: DRAM transfer times with the baseline and optimized address mappings for the limb-wise and slot-wise access patterns (columns: limb-wise access, slot-wise access, Total Time; rows: Baseline, Optimized). Transfer times are computed for reading a fixed number of limbs worth of data for our baseline parameter set.

IV-B β-Limb Caching

The next optimization considers a cache that can hold β limbs. Recall that β is the number of digits generated when a polynomial is decomposed for key switching. We refer to Han and Ki [HK19] for more details. For our parameters, this amounts to only a few MB of cache: we need space for β limbs at all times plus a few limbs worth of space to store intermediate results and other required constants. With this optimization, we can greatly reduce the number of accesses to main memory during key switching.

Consider the batched-rotation function in Algorithm 4. There are β digits produced as the output of the Decompose operations. Naively, for each rotation we would read the limbs of each of these digits, rotate them, and then compute the inner product with the key-switching key. Since we now have cache space for β digits' limbs, we can instead pull in a single limb from each of the outputs of Decompose, then compute the rotation and the inner product with the corresponding switching-key limbs all at once. This allows us to read in the outputs of the Decompose function only once, regardless of the number of rotations computed.

IV-C α-Limb Caching

For this optimization, we assume that we have a relatively large LLC that can hold α limbs. Recall that α is the number of limbs in a single digit output by the Decompose function for key switching. We refer to Han and Ki [HK19] for more details. In practice, this optimization requires only slightly more than α limbs of cache for our parameter values.

Under this assumption, we observe a dramatic decrease in the number of accesses to main memory. This is because all of the slot-wise basis conversion operations in ModDown (the middle, slot-wise step of Algorithm 6) and ModUp operate over these limbs. If we can fit these limbs in cache, then we can generate each new limb in its entirety within the cache. With each new limb in cache, we can perform the NTT on the limb, which completes the basis change operation, and write the limb out to memory. This lets us generate all new limbs in the evaluation format without having to write them out in slot-wise format and then read them back in limb-wise format.

Accumulator Caching

We briefly mention an optimization that is easily enabled by a large cache but is also available with smaller caches. This optimization improves the memory bandwidth of the baby-step giant-step polynomial evaluation from Han and Ki [HK19]. A straightforward optimization is to cache the baby-step (leaf) polynomials and reuse them to compute all of the giant-step limbs. However, if there is not enough space for the baby-step polynomials, we can still save DRAM reads by caching the partial sums (accumulators) of the giant-step limbs. When we read in a baby-step limb, we add its contribution to all cached accumulators.
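The structure being exploited is easiest to see on plaintext scalars; the sketch below (not the homomorphic algorithm of [HK19]) evaluates a polynomial with the baby-step giant-step split and keeps one accumulator per giant step, which is exactly the state the optimization proposes to keep cached:

```python
# Baby-step giant-step polynomial evaluation with per-giant-step accumulators.
def bsgs_eval(coeffs, x, baby):
    giant = (len(coeffs) + baby - 1) // baby
    baby_powers = [x**i for i in range(baby)]        # the reusable "baby-step" values
    accumulators = [0] * giant                        # partial sums, one per giant step
    for i, power in enumerate(baby_powers):           # stream the baby-step values one at a time
        for j in range(giant):                        # ...updating every cached accumulator
            if j * baby + i < len(coeffs):
                accumulators[j] += coeffs[j * baby + i] * power
    return sum(acc * x**(baby * j) for j, acc in enumerate(accumulators))

coeffs = [1, 2, 3, 4, 5, 6, 7]                        # p(x) = 1 + 2x + ... + 7x^6
assert bsgs_eval(coeffs, 3, baby=3) == sum(c * 3**i for i, c in enumerate(coeffs))
```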

IV-D Re-Ordering Limb Computations

For the ModDown operation, the limbs that are being reduced need additional operations to be performed on them. The ModDown operations in key switching and bootstrapping drop limbs. In this re-ordering optimization, we propose computing these limbs first so that the additional operations can be performed immediately. This optimization is especially potent when these limbs can be cached, since then there is no need to write them out as they are being computed. Once we have these limbs, we can begin the ModDown operation by computing the output of the basis conversion. Then, each subsequent limb that is computed can be immediately combined with the basis conversion output, saving DRAM transfers.

IV-E Key Takeaway

The benefits of the optimizations in this section are presented in Figure 4. As the figure shows, growing the cache size reduces the DRAM transfers of the bootstrapping algorithm by employing the optimizations described in this section. Note that the number of compute operations in the bootstrapping algorithm remains fixed for all these benchmarks.

Fig. 4: DRAM transfers with various memory optimizations. As the cache size grows from left to right, more optimizations become available. The impact is assessed cumulatively, i.e., each successive optimization builds on top of the earlier ones. The order of the optimizations corresponds to the order of the subsections in Section IV.

V CKKS Bootstrapping: Algorithmic Optimizations

In this section, we present our algorithmic optimizations to the CKKS bootstrapping algorithm. These optimizations represent strict improvements to the CKKS bootstrapping algorithm and they do not depend on the cache size. However, as an added benefit of reducing the compute operation count, they also reduce the memory bandwidth, as displayed in Figure 5.

Our baseline for demonstrating the improvements of these optimizations is the memory-optimized algorithm from Section IV. Therefore, the left-most baseline bar in Figure 5 contains all of the memory optimizations described in Section IV. For the algorithm that includes all of our optimizations, we performed a parameter search to optimize the bootstrapping throughput for a 128-bit security level. We discuss our parameter search method further in Section VI. These parameters are given in Table XI, and all benchmarks in Figure 5 were taken using these same parameters.

V-A Merging the Two ModDown Operations in Mult

This optimization merges the two ModDown operations in Algorithm 2. To merge these operations, we must lift the addition step that sits between them above the first ModDown. We achieve this by adapting the double-hoisting method from Bossuat et al. [BMTH20], multiplying the two polynomials by P to efficiently lift them to the raised modulus P·Q. By applying this lifting function, we can move the addition above the first ModDown, making the two ModDown operations adjacent, which allows them to be combined. This new algorithm is given in Algorithm 7.

Faster Encrypted Inner Product

As a direct result of this optimization, we obtain a faster encrypted inner product. Consider the InnerProduct operation that computes the sum of the products of corresponding entries of two vectors of ciphertexts. Using the merged-ModDown Mult operation, we need to compute only one ModDown over the entire sum, because we can merge the intervening additions to sum all of the polynomials before any ModDown is computed.

Algorithm 7: Mult with the two ModDown operations merged into one; it differs from Algorithm 2 only in the merged steps.

V-B Hoisting the ModDown in Batched Rotations

In Section II-D, we discussed how multiple rotations of the same ciphertext can be computed more efficiently than by simply applying the Rotate function several times. The batched rotation described in Algorithm 4 achieves this improved performance by identifying an expensive subroutine common to all of the rotations, the digit decomposition of the input ciphertext, and computing it only once.

Bossuat et al. [BMTH20] present an optimization that also hoists the second slot-wise operation in this function: the ModDown routine. Their technique is similar to the lifting used in Section V-A, in that the message polynomial is lifted to the raised modulus via an inexpensive multiply-by-P procedure. They call this optimization "double-hoisting." Our hoisting optimization is used in the context of a baby-step giant-step (BSGS) algorithm that implements the homomorphic matrix-vector products. The trade-off in this algorithm is that a larger baby-step and a smaller giant-step mean more DRAM reads for the switching keys, while a smaller baby-step and a larger giant-step mean more DRAM reads for the ciphertexts, since the baby-step ciphertexts must be read in for each giant step.

In Section V-C, we give a simple optimization that compresses the size of the switching keys by a factor of two. Using our architecture modeling tool, we determine that this optimization shifts the balance between the baby-step size and the giant-step size so significantly that the optimal number of giant steps is one. This essentially collapses the baby-step giant-step structure into a single step that computes all iterations at once. Therefore, by removing the giant steps in the BSGS algorithm, the computation collapses into a single instance of the batched rotation that includes the double-hoisting optimization, which allows the ModDown to be absorbed into the inner loop. This algorithm is given in Algorithm 8.

Algorithm 8: Double-hoisted batched rotation with the giant steps removed: the digit decomposition is computed once, every rotation's contribution is accumulated inside a single loop, and a single ModDown is applied at the end; it differs from Algorithm 4 only in the hoisted steps.

Removing Giant-Steps Beyond Bootstrapping

This optimization is not a bootstrapping-only optimization. The hoisting optimizations described here for bootstrapping are more broadly applicable to the homomorphic matrix-vector product computation. When multiple such operations need to be performed in parallel, this hoisting optimization can be amortized across the parallel computations, which results in a noticeable improvement in the logistic regression training iterations of our running example.

V-C Compressing the Key with a PRNG

This optimization is not our own; rather, it is a folklore technique often used to reduce communication when sending ciphertexts or keys over a network (e.g. it is used in Kyber, a leading candidate public-key encryption scheme in the ongoing NIST post-quantum cryptography standardization [BDKLLSSSS18]). However, to our knowledge, we are the first to use this optimization to reduce the memory bandwidth for hardware acceleration of homomorphic encryption as well as the first to analyze this optimization alongside the other optimizations listed in this section. As discussed in Section V-B, this optimization has subtle yet highly impactful effects on the other optimizations that we list, drastically changing the optimal parameters for CKKS bootstrapping.

This optimization is a natural result of the observation that half of the switching key consists of uniformly random polynomials. By replacing these truly random polynomials with pseudorandom polynomials generated by a PRNG, we can avoid shipping the large random polynomials to and from DRAM, instead storing and transferring only the short PRNG seed.
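A minimal sketch of the idea (assuming SHAKE-128 as the PRNG and an arbitrary illustrative modulus; sampling bias and the actual key format are ignored) is shown below: both sides expand the same short seed into the same pseudorandom polynomial, so only the seed ever needs to be stored or moved.

```python
# Expand a short seed into a pseudorandom polynomial instead of storing the polynomial.
import hashlib

def expand_seed(seed: bytes, n_coeffs: int, q: int):
    coeffs, counter = [], 0
    while len(coeffs) < n_coeffs:
        block = hashlib.shake_128(seed + counter.to_bytes(4, "little")).digest(8 * 64)
        coeffs += [int.from_bytes(block[i:i + 8], "little") % q
                   for i in range(0, len(block), 8)]
        counter += 1
    return coeffs[:n_coeffs]

seed = b"\x01" * 16                          # 16 bytes shipped instead of n_coeffs * 8 bytes
q = (1 << 59) - 55                           # illustrative modulus
assert expand_seed(seed, 4096, q) == expand_seed(seed, 4096, q)
```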

Fig. 5: This figure displays the algorithmic optimizations described in Section V. The impact is assessed cumulatively, i.e., each successive optimization builds on top of the earlier ones. The baseline benchmark begins with all of the memory optimizations from Section IV. All benchmarks are taken with the best-case parameters from Table XI. GOP on the y-axis stands for Giga operations.

V-D Key Takeaways

Figure 5 shows how the various optimizations impact the operation count and the DRAM transfers of CKKS bootstrapping. Moving from left to right on the plot, the arithmetic intensity improves as each successive optimization is applied; enabling all of our optimizations results in a cumulative improvement of up to 3.2x in arithmetic intensity. We now contextualize this compute and bandwidth optimization in the context of current computing platforms.

Datacenter CPUs

Consider an example of a top-of-the-line datacenter CPU such as the AMD EPYC 7763. This CPU supports a maximum of 128 parallel threads across 64 SMT cores running at a base clock frequency of 2.45 GHz. This configuration supports a peak theoretical integer throughput of a few TOp/s (each operation here is a 64-bit integer fused multiply-add in AVX256 mode). Each socket consists of 8 compute dies (CCDs) with a local 32 MiB L3 cache per die; the total L3 cache per socket comes out to 256 MiB. Additionally, the socket offers an 8-channel DDR4-3200 memory subsystem with an aggregate bandwidth of roughly 205 GB/s.

At first glance the total L3 capacity appears to be more than sufficient for storing multiple ciphertexts in cache. However, the die-to-die bandwidth is limited by the underlying interconnect (Infinity Fabric) for both reads and writes, and there are similar bandwidth limits at the L1-L2 and L2-L3 interfaces on each die. Thus, it is necessary to consider the compute available on each die in the context of the bandwidth available to that die.

Each CCD pairs hundreds of GOp/s of compute with tens of GiB/s of memory bandwidth, which gives a theoretical INT64 FMA arithmetic intensity of only a few Op/byte. On current hardware, 64-bit modular operations need to be emulated using multiple arithmetic operations, as seen in Section II-B. Compensating for this, we observe that the final arithmetic intensity of our bootstrapping procedure is similar to what can be supported by state-of-the-art CPUs. Note that the addition of a modular-arithmetic vector extension to existing vector engines would already result in the overall application being memory bottlenecked.

Datacenter GPUs

For the GPU analysis we consider the NVIDIA A100 datacenter GPU. This GPU offers tens of TOp/s of peak 32-bit integer FMA performance at its boost clock. It has an on-chip 40 MB L2 last-level cache and uses an HBM2 DRAM interface supporting about 1.6 TB/s of bandwidth. Note again that a single die cannot fit a complete ciphertext in on-chip memory. Applications with an INT32 FMA arithmetic intensity lower than roughly ten Op/byte will tend to be memory bottlenecked. In addition, 64-bit integer arithmetic is not natively supported on a datacenter GPU and must be emulated, which incurs a significant overhead of many instructions per 64-bit integer multiply. As such, for GPU implementations it is advisable to use an RNS representation with 32-bit limbs to avoid this overhead. The addition of native modular multiplication to future GPUs would further worsen the memory bottleneck.

While the above estimates are simplistic and do not take into account the intricacies of instruction scheduling, the underlying point remains that raw access to compute power is not what bottlenecks existing FHE implementations. Building new hardware that merely adds an order of magnitude to the compute capability is unlikely to give an order-of-magnitude performance improvement without addressing the memory side of the story.

VI Evaluation

In this section, we compare our bootstrapping algorithm to prior art to demonstrate the improved throughput achieved by our optimizations. In addition, we show how our improved CKKS subroutines directly result in more efficient HE applications.

VI-A Maximizing Bootstrapping Throughput

Bootstrapping Throughput

Our metric for evaluating bootstrapping performance is based on the bootstrapping throughput metric of Han and Ki [HK19]. This metric captures the effectiveness of a bootstrapping routine: it improves with the number of plaintext slots the algorithm bootstraps, with the number of limbs in the resulting ciphertext (which translates to the number of compute levels supported by the ciphertext), and with the bit-precision of the plaintext data. These factors are then divided by the runtime of the bootstrapping procedure. Writing n_slots for the number of slots, L for the number of remaining limbs, p for the plaintext precision in bits, and T_boot for the bootstrapping runtime, this gives the throughput metric in Equation 3:

    Throughput = (n_slots × L × p) / T_boot        (3)
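As a simple illustration, the metric can be computed as in the following sketch; the symbol names follow the description above, and the example values are placeholders rather than measurements from this work.

# Bootstrapping throughput per Equation 3: slots times remaining limbs times
# plaintext precision, divided by the bootstrapping runtime.
# Example values below are illustrative placeholders.

def bootstrapping_throughput(n_slots: int, limbs_remaining: int,
                             precision_bits: int, runtime_s: float) -> float:
    """Useful slot-limb-bits produced per second of bootstrapping."""
    return (n_slots * limbs_remaining * precision_bits) / runtime_s

# Example: 2**15 slots, 10 remaining limbs, 32-bit precision, 100 ms runtime.
print(bootstrapping_throughput(2**15, 10, 32, 0.1))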

Optimal Bootstrapping Parameters

Given the throughput metric from Equation 3, we can select parameters to optimize it. We employ our architectural modeling tool to explore the parameter space of bootstrapping to maximize the throughput. As DRAM transfer times dominate in bootstrapping, our architectural model accounts for DRAM transfer time in the total runtime analysis, resulting in parameters that minimize DRAM transfers. The throughput-maximizing parameters for our fully-optimized bootstrapping algorithm (with all optimizations from Section IV and Section V) are given in Table XI.
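The following sketch illustrates, under simplifying assumptions, the kind of search the tool performs: it enumerates candidate parameter sets, estimates a DRAM-transfer-dominated runtime, and keeps the configuration that maximizes Equation 3. The candidate grid and the traffic model below are illustrative stand-ins, not the actual cost model of our tool.

# Sketch of a throughput-maximizing parameter search: enumerate candidate
# bootstrapping parameters, estimate a DRAM-transfer-dominated runtime, and
# keep the configuration that maximizes the metric of Equation 3.
from itertools import product

def estimated_runtime_s(dram_bytes: float, bandwidth_bytes_per_s: float) -> float:
    # Runtime proxy: total DRAM traffic divided by sustained bandwidth.
    return dram_bytes / bandwidth_bytes_per_s

def best_parameters(candidates, traffic_model, bandwidth_bytes_per_s):
    best, best_throughput = None, 0.0
    for n_slots, limbs, precision in candidates:
        runtime = estimated_runtime_s(traffic_model(n_slots, limbs, precision),
                                      bandwidth_bytes_per_s)
        throughput = (n_slots * limbs * precision) / runtime
        if throughput > best_throughput:
            best, best_throughput = (n_slots, limbs, precision), throughput
    return best, best_throughput

# Toy traffic model (placeholder): DRAM traffic grows with slots and limbs.
toy_traffic = lambda n_slots, limbs, precision: 1e3 * n_slots * limbs
candidates = product([2**14, 2**15], [8, 10, 12], [24, 32])
print(best_parameters(candidates, toy_traffic, 900e9))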

TABLE XI: Bootstrapping Parameters (Baseline vs. Best-case)
The parameter  denotes the number of limbs in the ciphertext after the initial  procedure in . The parameter  is the number of iterations in the  and  phases in . The radix values for these iterations are all balanced, with any values that need to be larger placed at the end. The  value is the bit-security level.
  • The baseline set is based on [GPUBoot21] and its original security target. Updated cryptanalysis in [BMTH20] reduces the security level for sparse keys; the parameters in this work include these updated recommendations.

VI-B Bootstrapping Performance Comparisons

We now compare the throughput of our most optimized bootstrapping algorithm to prior art. To do so, we re-implemented each prior algorithm in our architecture model and ran it with the parameters given in the corresponding work. This allowed us to measure the total operation count as well as the DRAM transfer time for each of these algorithms.

From this analysis, as well as our discussion in Section V-D, we know that all of these bootstrapping algorithms are bottlenecked by memory bandwidth. Therefore, we use the DRAM traffic of each algorithm as a proxy for its overall runtime. The traffic volume is converted to a DRAM transfer time using the memory bandwidth of the NVIDIA Tesla V100 [NVIDIAV100], which is  GB/s.
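As an illustration of this conversion, the sketch below turns a DRAM traffic volume into a transfer-time proxy. The 900 GB/s figure is the V100's nominal peak HBM2 bandwidth, and the example traffic value is a placeholder rather than a number from Table XII.

# Converting an algorithm's DRAM traffic into a transfer-time proxy for its
# runtime. 900 GB/s is the V100's nominal peak HBM2 bandwidth; the traffic
# value in the example is a placeholder.

V100_BANDWIDTH_GBS = 900.0

def dram_transfer_time_us(traffic_gb: float,
                          bandwidth_gbs: float = V100_BANDWIDTH_GBS) -> float:
    """DRAM transfer time in microseconds for a given traffic volume in GB."""
    return traffic_gb / bandwidth_gbs * 1e6

print(dram_transfer_time_us(2.5))  # placeholder traffic of 2.5 GB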

The results of this analysis are presented in Table XII. We now discuss each comparison in more detail. The parameter set selected from Jung et al. [GPUBoot21] is the same parameter set used as the baseline comparison in Section IV, namely the one they give for their logistic regression implementation.

We selected the parameter set from Bossuat et al. [BMTH20] that maximized their throughput. Note that this parameter set maximized the throughput when the runtime was measured on a CPU. For our architecture model, we are considering the case where computation has been accelerated to the point where runtime is completely dominated by memory transfers.

The throughput for Samardzic et al. [F1Paper21] was computed slightly differently, since this work reports the DRAM traffic of their algorithm directly. However, this work only gives benchmarks for unpacked CKKS bootstrapping (i.e., there is no slot packing and the ciphertext holds only one element). Rather than re-implementing their algorithm, we use the memory traffic they report for their unpacked CKKS bootstrapping, which is  MB. To compute the runtime, we also use the peak DRAM bandwidth provided by the authors for their architecture, which is  TB/s. Using these two numbers, we found their bootstrapping runtime to be  milliseconds, leading to the throughput reported in Table XII.

TABLE XII: Bootstrapping comparison

Work                         | DRAM Transfers (in GB) | Throughput
Jung et al. [GPUBoot21]      |                        |
Bossuat et al. [BMTH20]      |                        |
Samardzic et al. [F1Paper21] |                        |
Our Best Throughput          |                        |

This table measures the bootstrapping throughput. The Throughput column is computed using Equation 3 with the DRAM transfer time as a proxy for the runtime. The DRAM transfer time is measured in microseconds.

VI-C Application Comparison

A faster bootstrapping algorithm directly results in faster HE applications. Continuing with our running example of logistic regression training, we give benchmarks of the logistic regression algorithm from Section III using our optimized bootstrapping routine and parameters. These benchmarks are given in Table XIII.

TABLE XIII: Performance of Logistic Regression Training Example

Sub-routine Name  | Total Operations (in GOp) | Total DRAM Transfers (in GB) | Arithmetic Intensity (in Op/byte)
Full LR Iteration |                           |                              |

This table displays benchmarks of the logistic regression bootstrapping application using our optimized bootstrapping parameters. In parentheses next to each benchmark, we give the improvement over Table VII.

VI-D Key Takeaways

In this section, we demonstrated that our optimizations, which mostly focus on improving the arithmetic intensity of bootstrapping and other CKKS building blocks, make much better use of the available memory bandwidth, and hence achieve higher throughput, than prior art that focused primarily on optimizing compute throughput. This shows that focusing on compute throughput overlooks a crucial bottleneck in CKKS applications: the memory bandwidth. To improve the overall performance of many important CKKS applications, such as bootstrapping and encrypted logistic regression training, the memory bandwidth must be addressed directly.

VII Discussion

Despite our algorithmic and cache optimizations to CKKS FHE bootstrapping (see Sections IV and V), our analysis reveals that FHE bootstrapping continues to have low arithmetic intensity and remains heavily bound by main memory bandwidth. This issue is not specific to CKKS bootstrapping. For one, bootstrapping algorithms for other FHE schemes such as BGV [BGV12] and B/FV [Brak12, FV12] have the same high-level structure and suffer from the same problem, albeit with different quantitative thresholds. Additionally, as discussed in Section III-B, many natural applications (e.g., logistic regression and secure neural network evaluation) have the same high-level structure as bootstrapping, namely global linear operations followed by local non-linear operations, and consequently they suffer from the main memory bottleneck as well.

Below, we discuss potential research avenues to solve this issue that is so central to the practicality of FHE.

Future Improvements to Bootstrapping: At a high level, our optimizations can be viewed as reducing the “thrashing” caused by the various low-level operations in the bootstrapping algorithm (as well as in other natural FHE applications such as encrypted training of machine learning models). While future improvements may further reduce thrashing in the baseline algorithms, the size of the ciphertexts and of the switching keys suggests that the overall arithmetic intensity is unlikely to improve drastically without a dramatic overhaul of FHE schemes.

In one extreme, we could be in the best-case-scenario for FHE bootstrapping. In this world that we call “FHE-mania”, all of bootstrapping can be done in cache without any DRAM reads or writes beyond the initial input and the final output. This world would call for true hardware acceleration of bootstrapping and would make our DRAM optimizations useless. On the other hand, we could be living in a world where the best possible bootstrapping algorithms remain bounded by the memory bandwidth. In this world that we call “thrashy-land”, our optimizations remain crucial to achieving the highest throughput for bootstrapping. While it may be possible to optimize our way out of thrashy-land, as long as the RNS representation remains the dominant format of FHE data, our -limb and -limb caching optimizations will remain relevant.

A realistic possibility is a world somewhere between FHE-mania and thrashy-land. For example, bootstrapping in GSW-like FHE schemes [GSW13, DM15] incurs slower noise growth and consequently smaller parameters; however, these schemes do not support packed bootstrapping as in the BGV, B/FV and CKKS FHE schemes, a feature that is fundamentally important for efficiency. Can we achieve the best of both worlds? We believe there is exciting research to be done here (see [MS18] for a preliminary attempt); our analysis provides a compelling reason to pursue this line of research.

Increase Main Memory Bandwidth: There are two approaches to increasing the main memory bandwidth. First, we can use multiple DDRx channels, effectively using parallelism to increase the main memory bandwidth. We could also use alternate main memory technologies like HBM2/HBM2e [Jun2017Hbm] that provide several times higher bandwidth than DDRx technology. The second approach involves improving the physical interconnect between the compute cores and the memory by using silicon-photonic link technologies [Sun2015Nature]. Judicious use of silicon-photonic technology can help improve the main memory bandwidth, and has the additional benefit of reducing the energy consumption for memory accesses.

Improve Main Memory Bandwidth Utilization: Here, there are two complementary approaches. The first is to attempt a cleverer mapping of the data to physical memory to take advantage of spatial locality in cache lines such that we reduce the number of memory accesses required per compute operation. To complement this, we can improve FHE-based computing algorithms such that we perform more operations per byte of data that is fetched from main memory, i.e., improve temporal locality. The second approach is algorithmic: namely, improve FHE bootstrapping algorithms (as discussed above) so that we reduce the size of the key-switching parameter, the main culprit for low arithmetic intensity, or eliminate it altogether. These two complementary approaches may result in an increase in the arithmetic intensity, effectively reducing the time required for bootstrapping and FHE as a whole.

Use In-Memory/Near-Memory Computing: Two potential architecture-level approaches are to perform the operations in FHE APIs within main memory itself, i.e., in-memory computing, or on a custom die placed very close to main memory, i.e., near-memory computing. In the in-memory computing approach, we can eliminate a large number of expensive main memory accesses by performing matrix-vector multiplication operations in the main memory itself [Chi2016Isca]. In contrast, in the case of near-memory computing, we perform all the FHE compute operations in a custom accelerator placed close to the main memory. Here, we cannot eliminate the memory accesses, but the cost of each access is lower than that of a traditional main memory access.

Use Wafer-Scale Systems: A radical technology-level solution is to design large-scale distributed accelerators, such as Cerebras-style wafer-scale accelerators [Cerebras] that provide  GB of high-performance on-wafer memory. Tesla’s Dojo accelerator [Tesla] also fits in this category: a large wafer is diced into  chip nodes, which provide high bandwidth and compute performance. Effectively, we can place large SRAM arrays, i.e., large caches, on the same wafer as the compute blocks, thus limiting all communication to on-wafer communication and avoiding expensive main memory accesses after the initial loads.

VIII Related Work

Algorithmic optimizations for CPUs: The key bottleneck in the FHE bootstrapping process is the large homomorphic matrix-vector multiplication required to convert ciphertexts from coefficient to evaluation representation and back. This requires many key-switching operations, which in turn require fetching a large number of switching keys from DRAM, adding both to the computational cost and to the data-access latency. Initial software implementations of bootstrapping (for example, the HEAAN library [CKKS17]) reduced the number of rotations required in this linear transformation step by using the baby-step giant-step (BSGS) algorithm, introduced in this setting by Halevi and Shoup [HS18]. Using this algorithm, one can substantially reduce the number of rotations while keeping the number of scalar multiplications unchanged. The HEAAN library also reduces the cost of approximating the modular reduction step by evaluating the sine function using a Taylor approximation. With these techniques, the HEAAN library takes about eight minutes to bootstrap the slots of a ciphertext of degree  on a CPU.
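For intuition, the following sketch shows the BSGS bookkeeping for a homomorphic matrix-vector product encoded by its diagonals, under the standard decomposition of rotation indices into baby and giant steps. The variable names and the example size are illustrative and do not correspond to HEAAN’s implementation.

# Sketch of baby-step giant-step (BSGS) bookkeeping for a diagonal-encoded
# homomorphic matrix-vector product: each rotation index i is split as
# i = n1*j + b, so only the baby rotations (b) and giant rotations (n1*j) are
# ever performed, roughly 2*sqrt(n) rotations in total, while every diagonal
# still contributes one plaintext (scalar) multiplication.
from math import isqrt

def bsgs_rotation_plan(num_diagonals: int):
    n1 = isqrt(num_diagonals) or 1                 # baby-step count
    n2 = -(-num_diagonals // n1)                   # giant-step count (ceil division)
    baby_rotations = list(range(n1))               # rotations by 0 .. n1-1
    giant_rotations = [n1 * j for j in range(n2)]  # rotations by multiples of n1
    scalar_mults = num_diagonals                   # one per diagonal, unchanged
    return baby_rotations, giant_rotations, scalar_mults

babies, giants, muls = bsgs_rotation_plan(128)
print(len(babies) + len(giants), "rotations instead of 128; scalar mults:", muls)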

Chen, Chillotti and Song [CCS18] proposed a level-collapsing technique along with BSGS for the linear transformation step to further reduce the number of rotations. They also replaced the Taylor approximation with a more accurate Chebyshev approximation of a scaled sine function. For the same parameter set as the HEAAN library, they observe a  speedup. More recently, Han and Ki [HK19] proposed a hybrid key-switching approach to efficiently manage the noise added by the key-switching operation. They evaluate a scaled, shifted cosine function instead of the scaled sine function for modular reduction, reducing the number of non-scalar multiplications by half. Their optimizations led to an additional  speedup. Bossuat et al. [BMTH20] further lower the operational complexity of the linear transformations by optimizing rotations through a double-hoisted hybrid key-switching approach. Double-hoisting the key-switch operation significantly reduces the number of basis-conversion operations, which are expensive in terms of main memory accesses. They also carefully manage the scale factors of the non-linear transformations for error-less polynomial evaluation. Their implementation in the Lattigo library [lattigo] shows a further speedup of  on a CPU.

Algorithmic optimizations for GPUs: All of the above-mentioned optimizations focus heavily on lowering the operational complexity of bootstrapping, which also led to a minor reduction in main memory accesses. Recently, Jung et al. [GPUBoot21] presented the first GPU implementation of CKKS bootstrapping. Their analysis, even though limited to GPUs, rightly points out the main-memory-bound nature of the bootstrapping operation. Thus, their optimizations, such as inter- and intra-kernel fusion, are all focused on improving memory bandwidth utilization rather than accelerating the compute itself. Their bootstrapping implementation is the fastest to date, requiring only  ms (total time) to bootstrap all the slots of a ciphertext of degree . As discussed in Sections IV and V, our techniques are composable with all of these prior works and consequently result in higher arithmetic intensity and a reduction in main memory accesses.

Hardware Accelerators for HE: Samardzic et al. [