
BTS: An Accelerator for Bootstrappable Fully Homomorphic Encryption

by Sangpyo Kim, et al.
Seoul National University

Homomorphic encryption (HE) enables the secure offloading of computations to the cloud by providing computation on encrypted data (ciphertexts). HE is based on noisy encryption schemes in which noise accumulates as more computations are applied to the data. The limited number of operations applicable to the data prevents practical applications from exploiting HE. Bootstrapping enables an unlimited number of operations, or fully HE (FHE), by refreshing the ciphertext. Unfortunately, bootstrapping requires a significant amount of additional computation and memory bandwidth as well. Prior works have proposed hardware accelerators for computation primitives of FHE. However, to the best of our knowledge, ours is the first to propose a hardware FHE accelerator that supports bootstrapping as a first-class citizen. In particular, we propose BTS, a Bootstrappable, Technology-driven, Secure accelerator architecture for FHE. We identify the challenges of supporting bootstrapping in the accelerator and analyze the off-chip memory bandwidth and computation required. In particular, given the limitations of modern memory technology, we identify the HE parameter sets that are efficient for FHE acceleration. Based on the insights gained from our analysis, we propose BTS, which effectively exploits the parallelism innate in HE operations by arranging a massive number of processing elements in a grid. We present the design and microarchitecture of BTS, including a network-on-chip design that exploits a deterministic communication pattern. BTS shows 5,556x and 1,306x improved execution time on ResNet-20 and logistic regression over a CPU, with a chip area of 373.6mm^2 and up to 163.2W of power.



1. Introduction

Homomorphic encryption (HE) allows computations on encrypted data, or ciphertexts (cts). In the machine-learning-as-a-service (MLaaS) era, HE is highlighted as an enabler for privacy-preserving cloud computing, as it allows safe offloading of private data. Because HE schemes are based on the learning-with-errors (LWE) problem (Regev, 2009), they are noisy in nature. Noise accumulates as we apply a sequence of computations on cts. This limits the number of computations that can be performed and hinders the applicability of HE for practical purposes, such as in deep-learning models with high accuracy (Lee et al., 2022). To overcome this limitation, fully HE (FHE) (Gentry, 2009) was proposed, featuring an operation (op) called bootstrapping, which "refreshes" a ct and hence permits an unlimited number of computations on it. Among the multiple HE schemes that support FHE, CKKS (Cheon et al., 2017) is one of the prime candidates as it supports fixed-point real-number arithmetic.

One of the main barriers to adopting HE has been its high computational and memory overhead. New schemes (Brakerski et al., 2014; Fan and Vercauteren, 2012; Brakerski and Vaikuntanathan, 2014; Chillotti et al., 2020; Cheon et al., 2017) and algorithmic optimizations (Han and Ki, 2020; Bossuat et al., 2021; Al Badawi et al., 2019) (using the residue number system (Cheon et al., 2018a; Bajard et al., 2016)) have reduced this overhead, resulting in at least a 1,000,000x speedup (Bossuat et al., 2021) compared to the first HE implementation (Gentry and Halevi, 2011). However, even with such efforts, HE ops remain tens of thousands of times slower than their unencrypted counterparts (Jung et al., 2021b). To tackle this, prior works have sought hardware solutions to accelerate HE ops, including CPU extensions (Jung et al., 2021b; Boemer et al., 2021), GPUs (Jung et al., 2021a; Al Badawi et al., 2020, 2019, 2018), FPGAs (Riazi et al., 2020; Roy et al., 2019; Kim et al., 2020b, 2019), and ASICs (Samardzic et al., 2021).

Work | Platform | Slots per bootstrap | Parallelism | FHE mult thruput (slots/s)
Lattigo (EPFL-LDS, 2021) | CPU | 32,768 | - | 6-10K
100x (Jung et al., 2021a) | GPU | 65,536 | SIMT | 0.1-1M
(Roy et al., 2019) | FPGA | - | rPLP | -
HEAX (Riazi et al., 2020) | FPGA | - | rPLP | -
F1 (Samardzic et al., 2021) | ASIC | 1 | rPLP | 4K
BTS | ASIC | 65,536 | CLP | 20M

  • Slots: data elements that can be packed in a ct for SIMD execution.

  • Residue-polynomial-level parallelism (rPLP) and coefficient-level parallelism (CLP) can be exploited in parallelizing HE ops (Section 4.3).

  • F1 only supports single-slot bootstrapping, which has low throughput.

Table 1. Comparing prior HE acceleration works with BTS

However, prior acceleration works mostly targeted small problem sizes, with a small target N (the degree of a ct polynomial), and they lack bootstrapping support. Bootstrapping, which is necessary to reduce the impact of noise, occurs frequently in most FHE applications and represents the highest expense. For example, bootstrapping occurs more than 1,000 times for a single ResNet-20 inference (Lee et al., 2022), and each instance of bootstrapping can take dozens of seconds on the state-of-the-art CPU implementation (EPFL-LDS, 2021) and hundreds of milliseconds on a GPU (Jung et al., 2021a). Most prior custom hardware acceleration works (Roy et al., 2019; Riazi et al., 2020) do not support bootstrapping at all, while F1 (Samardzic et al., 2021) demonstrated a bootstrapping time for CKKS but with limited throughput (Table 1).

We propose BTS, a bootstrapping-oriented FHE accelerator that is Bootstrappable, Technology-driven, and Secure. First, we identify the limitations that are imposed by contemporary fabrication technology when designing an HE accelerator, analyzing the implications of various conflicting requirements for the performance and security of FHE under such a constrained design space. This allows us to pinpoint appropriate optimization targets and requirements when designing the FHE accelerator. Second, we build a balanced architecture on top of those observations; we analyze the characteristics of HE functions to determine the appropriate number of processing elements (PEs) and proper data mapping that balances computation and data movement when using our FHE-optimized parameters. We also choose to exploit coefficient-level parallelism (CLP), instead of residue-polynomial-level parallelism (rPLP), to evade the load imbalance issue. Finally, we devise a novel PE microarchitecture that efficiently handles HE functions including base conversion, and a time-multiplexed NoC structure that manages both number theoretic transform and automorphism functions.

Through these detailed studies, BTS achieves a 5,714x speedup in multiplicative throughput over F1, the state-of-the-art ASIC implementation, when bootstrapping is properly considered. Also, BTS significantly reduces the training time of logistic regression (Han et al., 2019) compared to the CPU (by 1,306x) and GPU (by 27x) implementations, and can execute a ResNet-20 inference 5,556x faster than the prior CPU implementation (Lee et al., 2022).

In this paper, we make the following key contributions:

  • We provide a detailed analysis of the interplay of HE parameters impacting the performance of FHE accelerators.

  • We propose BTS, a novel accelerator architecture equipped with massively parallel compute units and NoCs tailored to the mathematical traits of FHE ops.

  • BTS is the first accelerator targeting practical bootstrapping, enabling unbounded multiplicative depth, which is essential for complex workloads.

2. Background

We provide a brief overview of HE and CKKS (Cheon et al., 2017) in particular. Table 2 summarizes the key parameters and notations we use in this paper.

Symbol | Definition
Q | (Prime) moduli product
q_i | (Prime) moduli
Q_j | Modulus factors
P | Special (prime) moduli product
p_i | Special (prime) moduli
evk_mult | Evaluation key (evk) for HMult
evk_rot^(r) | evk for HRot with a rotation amount of r
N | The degree of a polynomial
L | Maximum (multiplicative) level
ℓ | Current (multiplicative) level of a ciphertext
L_boot | Levels consumed at bootstrapping
k | The number of special prime moduli
dnum | Decomposition number
λ | Security parameter of a given CKKS instance

Table 2. List of symbols used to describe CKKS (Cheon et al., 2017).

2.1. Homomorphic Encryption (HE)

HE enables direct computation on encrypted data, referred to as ciphertexts (cts), without decryption. There are two types of HE. Leveled HE (LHE) supports a limited number of operations (ops) on a ct due to the noise that accumulates with the ops. In contrast, fully HE (FHE) allows an unlimited number of ops on cts through bootstrapping (Gentry, 2009), which "refreshes" a ct and lowers the impact of noise. LHE has limited applicability¹; in the field of privacy-preserving deep-learning inference, for instance, simple/shallow networks such as LoLa (Brutzkus et al., 2019) can be implemented with LHE, but only with limited accuracy (74.1%). More accurate models such as ResNet-20 (Lee et al., 2022) (92.43%) demand many more ops applied to cts and thus an FHE implementation.

¹The hybrid use of LHE with multi-party computation (Damgård et al., 2012) allows for a broader range of applications. However, such an approach has a different bottleneck of communication cost and intense client-side computation.

While other FHE schemes support integer (Brakerski et al., 2014; Brakerski and Vaikuntanathan, 2014; Fan and Vercauteren, 2012) or boolean (Chillotti et al., 2020) data types, CKKS (Cheon et al., 2017) supports fixed-point complex (real) numbers. As many real-world applications such as MLaaS (Machine Learning as a Service) require arithmetic on real numbers, CKKS has become one of the most prominent FHE schemes. In this paper, we focus on accelerating CKKS ops; however, our proposed architecture is applicable to other popular FHE schemes (e.g., BGV (Brakerski et al., 2014) and BFV (Brakerski and Vaikuntanathan, 2014; Fan and Vercauteren, 2012; Bajard et al., 2016)) that share similar core ops.

2.2. CKKS: an emerging HE scheme

CKKS first encodes a message, which is a vector of complex numbers, into a plaintext m, a polynomial in a cyclotomic polynomial ring R_Q = Z_Q[X]/(X^N + 1). The coefficients are integers modulo Q and the number of coefficients (or degree) is N, where N is a power-of-two integer (e.g., 2^16 or 2^17 for bootstrappable instances). For a given N, a message with up to N/2 complex numbers can be packed into a single plaintext in CKKS. Each element within a packed message is referred to as a slot. After encoding (or packing), element-wise multiplication (mult) and addition between two messages can be done through polynomial operations between plaintexts. CKKS then encrypts a plaintext m into a ct = (b, a) based on the following equation,

b = -a·s + m + e,

where s is a secret key, a is a random polynomial, and e is a small Gaussian error polynomial required for the LWE security guarantee (Albrecht et al., 2019). CKKS decrypts a ct by computing b + a·s, which approximates m with small errors.

HE is mainly bottlenecked by the high computational complexity of polynomial ops. As each coefficient of a polynomial is a large integer (up to thousands of bits) and the degree N is high (even surpassing 100,000), an op between two polynomials has high compute and data-transfer costs. To reduce the computational complexity, HE schemes using the residue number system (RNS) (Bajard et al., 2016; Cheon et al., 2018a) have been proposed. For example, Full-RNS CKKS (Cheon et al., 2018a) sets Q as the product of word-sized (prime) moduli q_i, where Q = q_0 · q_1 · ... · q_L for a given integer L. Using the Chinese remainder theorem (Eq. 1), we represent a polynomial in R_Q with (L+1) residue polynomials in R_{q_i}, whose coefficients are residues obtained by performing modulo q_i (represented as [·]_{q_i}) on the large coefficients:

a ↦ ([a]_{q_0}, [a]_{q_1}, ..., [a]_{q_L})   (1)

Then, we can convert an op involving two polynomials into ops between the residue polynomials with word-sized coefficients (up to 64 bits) corresponding to the same q_i, avoiding costly big-integer arithmetic with carry propagation. Full-RNS CKKS provides an 8x speedup over plain CKKS (Cheon et al., 2018a); thus, we adopt Full-RNS CKKS as our CKKS implementation, representing a polynomial in R_Q as an (L+1) x N matrix of residues, and a ct as a pair of such matrices.
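The RNS idea above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the moduli below are toy primes we chose, and a big integer stands in for one polynomial coefficient. Multiplication is performed independently per word-sized modulus (no carry propagation across moduli), and the full value is recovered with the Chinese remainder theorem.

```python
from math import prod

QS = [97, 113, 193]      # toy word-sized prime moduli (illustrative values)
Q = prod(QS)             # the large modulus Q = q_0 * q_1 * q_2

def to_rns(x):
    # decompose a big coefficient into its residues [x]_{q_i}
    return [x % q for q in QS]

def rns_mul(xr, yr):
    # one independent word-sized mult per modulus; no carries between moduli
    return [(a * b) % q for a, b, q in zip(xr, yr, QS)]

def from_rns(res):
    # CRT reconstruction back to Z_Q
    x = 0
    for r, q in zip(res, QS):
        qhat = Q // q
        x += r * qhat * pow(qhat, -1, q)   # pow(..., -1, q): modular inverse (Python 3.8+)
    return x % Q

x, y = 123456, 654321
assert from_rns(rns_mul(to_rns(x), to_rns(y))) == (x * y) % Q
```

In Full-RNS CKKS the same principle applies to every coefficient of every residue polynomial, which is why a polynomial in R_Q becomes an (L+1) x N matrix of word-sized residues.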

2.3. Primitive operations (ops) of CKKS

Primitive HE ops of CKKS are introduced here; they can be combined to create more complex HE ops such as linear transformation and convolution. Given two ciphertexts ct_0 = (b_0, a_0) and ct_1 = (b_1, a_1), where b_i = -a_i·s + m_i + e_i, HE ops can be summarized as follows:

HAdd performs an element-wise addition of ct_0 and ct_1:

ct_add = (b_0 + b_1, a_0 + a_1)   (2)
HMult consists of a tensor product and key-switching. The tensor product first creates

(d_0, d_1, d_2) = (b_0·b_1, a_0·b_1 + a_1·b_0, a_0·a_1)   (3)

By computing d_0 + d_1·s + d_2·s², we recover m_0·m_1, albeit with error terms. Key-switching recombines the tensor product result to be decryptable with s using a public key, called an evaluation key (evk). An evk is a ct in R_{PQ} with a larger modulus PQ, where P = p_0 · p_1 · ... · p_{k-1} for given special (prime) moduli p_i. We express an evk as a pair of (L+1+k) x N matrices. HMult is then computed using Eq. 4, which involves key-switching with an evk for mult, evk_mult:

ct_mult = (d_0, d_1) + ⌊P^{-1} · d_2 · evk_mult⌉ mod Q   (4)
HRot circularly shifts a message vector by r slots. When a ct encrypts a message vector m = (m_0, ..., m_{N/2-1}), after applying HRot by a rotation amount r, the rotated ciphertext encrypts m' = (m_r, ..., m_{N/2-1}, m_0, ..., m_{r-1}). HRot consists of an automorphism and key-switching. (b(X), a(X)) is mapped to (b(X^{σ_r}), a(X^{σ_r})) after an automorphism. This moves the coefficients of a polynomial through the mapping i → i·σ_r mod 2N, where i is the index of the coefficient and σ_r is:

σ_r = 5^r mod 2N   (5)

Similar to HMult, key-switching brings back the ct, which was only decryptable with s(X^{σ_r}) after automorphism, to be decryptable with s. An HRot with a different rotation amount r each requires a separate evk, evk_rot^(r). HRot is computed as follows:

ct_rot = (b(X^{σ_r}), 0) + ⌊P^{-1} · a(X^{σ_r}) · evk_rot^(r)⌉ mod Q   (6)


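The coefficient-index mapping of the automorphism can be sketched as follows. This is a toy illustration with parameters we chose (N = 8, modulus 97), not the paper's implementation; the sign flip comes from the ring relation X^N = -1, so a monomial mapped past degree N wraps around with a negated coefficient.

```python
N = 8            # toy degree; real bootstrappable CKKS uses N up to 2**17
Q = 97           # toy modulus (illustrative value)

def automorphism(a, r):
    # maps a(X) -> a(X^(5^r)) in Z_Q[X]/(X^N + 1)
    g = pow(5, r, 2 * N)
    out = [0] * N
    for i, coeff in enumerate(a):
        j = (i * g) % (2 * N)
        if j < N:
            out[j] = coeff
        else:
            out[j - N] = (-coeff) % Q   # X^N = -1 flips the sign on wrap-around
    return out
```

Since 5 has multiplicative order N/2 modulo 2N, composing automorphisms whose rotation amounts sum to N/2 returns the original polynomial, which is a quick sanity check on the index arithmetic.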
HE applications require other HE ops, such as an addition or mult of a ct with a scalar (CAdd, CMult) or a polynomial (PAdd, PMult) of unencrypted, constant values. Additions are performed by adding the scalar or polynomial to b, and mults are performed by multiplying each of b and a by the scalar or polynomial.

2.4. Multiplicative level and HE bootstrapping

Multiplicative level: The error included in a ct is amplified during HE ops; in particular, HMult multiplies the error with other terms (e.g., m and s) and can result in an explosion of the error if not treated properly. CKKS performs HRescale to mitigate this explosion and keep the error tolerable by dividing the ct by the last prime modulus q_ℓ (Cheon et al., 2018a). After HRescale, the residue polynomial corresponding to q_ℓ is discarded, and the ct is reduced in size. The ct continues losing residue polynomials with each HRescale while executing an HE application, until only one residue polynomial is left and no additional HMult can be performed on the ct. L, the maximum multiplicative level, determines the maximum number of HMult ops that can be performed without bootstrapping, and the current (multiplicative) level ℓ denotes the number of remaining HMult ops that can be performed on the ct. Thus, a ct with a level ℓ is represented as a pair of (ℓ+1) x N matrices.

Bootstrapping: FHE features a bootstrapping op that restores the multiplicative level (ℓ) of a ct to enable more ops. Bootstrapping must be performed frequently for the practical usage of HE with a complex sequence of HE ops. Bootstrapping mainly consists of homomorphic linear transforms and approximate sine evaluation (Cheon et al., 2018b), which can be broken down into hundreds of primitive HE ops. HMult and HRot ops account for more than 77% of the bootstrapping time (EPFL-LDS, 2021). As bootstrapping itself consumes L_boot levels, L should be larger than L_boot. A larger L is beneficial as it requires less frequent bootstrapping. L_boot ranges from 10 to 20 depending on the bootstrapping algorithm; a larger L_boot allows the use of more precise and faster bootstrapping algorithms (Chen et al., 2019; Bossuat et al., 2021; Lee et al., 2021a; Han and Ki, 2020). The bootstrapping algorithm we use in this paper is based on (Han and Ki, 2020) with updates to meet the latest security and precision requirements (Bossuat et al., 2021; Lee et al., 2020; Cheon et al., 2019), and its L_boot is 19. Readers are encouraged to refer to the papers for a more detailed explanation of the algorithm. Another CKKS-specific constraint is that the moduli q_i's and the special moduli p_i's must be large enough to tolerate the error accumulated during bootstrapping (Cheon et al., 2022; EPFL-LDS, 2021).

2.5. Modern algorithmic optimizations in CKKS and amortized mult time per slot

Security level (λ): The level of security of an HE scheme is represented by λ, a parameter measured by the logarithmic time complexity for an attack (Cheon et al., 2019) to deduce the secret key. A sufficiently high λ is required for safety; we target λ of 128 bits, adhering to the standard (Albrecht et al., 2019) established by recent HE studies (Bossuat et al., 2021; Lee et al., 2021a, 2020) and libraries (EPFL-LDS, 2021; PALISADE Project, 2021). λ is a strictly increasing function of N/log PQ (Curtis and Player, 2019).

Dnum: Key-switching is an expensive function, accounting for most of the time in HRot and HMult (Jung et al., 2021a). We adopt a state-of-the-art generalized key-switching technique (Han and Ki, 2020), which balances λ, the computational cost, and L. (Han and Ki, 2020) factorizes the moduli product Q into dnum modulus factors Q_j (see Eq. 7) for a given integer dnum (decomposition number). It decomposes a ct into dnum slices, each consisting of the residue polynomials corresponding to the prime moduli (q_i's) that together compose the modulus factor Q_j. We perform key-switching on each slice in R_{PQ} and later accumulate the results. The special moduli product P only needs to satisfy P > Q_j for each j, allowing us to choose a smaller P, leading to a higher λ. i) Therefore, a larger dnum means a greater level L with fixed values of N and λ because we can increase log Q.

Q = Q_0 · Q_1 · ... · Q_{dnum-1}   (7)

A major challenge of generalized key-switching is that different evks (evk_j) must be prepared for each factor Q_j, where each evk_j is a pair of (L+1+k) x N matrices and k is set to ⌈(L+1)/dnum⌉. ii) Thus, the aggregate evk size increases linearly with dnum. iii) The overall computational complexity of a single HE op also increases with dnum. Therefore, choosing an appropriate dnum crucially affects the performance.

Amortized mult time per slot (Tmult,a/slot): Changing the HE parameter set has mixed effects on the performance of HE ops. Decreasing N reduces the computational complexity and memory usage. However, we must then lower L and log Q to sustain security, which requires more frequent bootstrapping. Also, because a ct of degree N can encode only up to N/2 message slots by packing, the throughput degrades.

Jung et al. (Jung et al., 2021a) introduced a metric called amortized mult time per slot (Tmult,a/slot), which is calculated as follows:

Tmult,a/slot = (Tboot + Σ_{ℓ=Lboot+1}^{L} Tmult(ℓ)) / ((L − Lboot) · N/2)   (8)

where Tboot is the bootstrapping time and Tmult(ℓ) is the time required to perform HMult at a level ℓ. This metric first calculates the average cost of a mult including the overhead of bootstrapping, and then divides it by the number of slots in a ct (N/2). Thus, Tmult,a/slot effectively captures the reciprocal throughput of a CKKS instance (a CKKS scheme with a certain parameter set).
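The metric can be computed directly. The sketch below uses hypothetical timings we made up for illustration (they are not measured BTS or CPU numbers), and it simplifies Tmult(ℓ) to a level-independent constant:

```python
# Hypothetical timings, for illustration only:
T_BOOT = 100e-3          # bootstrapping time in seconds (assumed)
T_MULT = 1e-3            # HMult time in seconds (simplified to be level-independent)
N = 1 << 17              # polynomial degree
L, L_BOOT = 27, 19       # max level and levels consumed by bootstrapping

def tmult_a_per_slot(t_boot, t_mult, n, l, l_boot):
    usable = l - l_boot                       # HMults possible between bootstrappings
    per_mult = (t_boot + usable * t_mult) / usable   # avg mult cost incl. bootstrapping
    return per_mult / (n // 2)                # amortize over all N/2 slots

t = tmult_a_per_slot(T_BOOT, T_MULT, N, L, L_BOOT)
print(f"{t * 1e9:.1f} ns/slot")
```

The structure makes the trade-offs visible: increasing L - L_boot spreads the bootstrapping cost over more mults, while increasing N spreads each mult over more slots.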

3. Technology-driven Parameter Selection of Bootstrappable Accelerators

3.1. Technology trends regarding memory hierarchy

Domain-specific architectures (e.g., deep-learning (Jouppi et al., 2021; Medina and Dagan, 2020; Knowles, 2021) and multimedia (Ranganathan et al., 2021) accelerators) are often based on custom logic and an optimized dataflow to provide high computation capabilities. In addition, the memory capacity/bandwidth requirements of the applications are exploited in the design of the memory hierarchy. Recently, on-chip SRAM capacities have scaled significantly (Auth et al., 2017), such that hundreds of MBs of on-chip SRAM are feasible, providing tens of TB/s of SRAM bandwidth (Jouppi et al., 2021; Prabhakar and Jairath, 2021; Knowles, 2021). While main-memory bandwidth has also increased, its aggregate throughput is still more than an order of magnitude lower than the on-chip SRAM bandwidth (O'Connor et al., 2017), achieving a few TB/s even with high-bandwidth memory (HBM).

Similar to other domain-specific architectures (Chen et al., 2016; Jouppi et al., 2021), HE applications follow deterministic computational flows, and the locality of the input and output cts of HE ops can be maximized through software scheduling (Dathathri et al., 2020). Thus, cts can be reused by exploiting the large amount of on-chip SRAM enabled by technology scaling. However, even with the increasing on-chip SRAM capacity, we observe that the on-chip SRAM is still insufficient to store evks, rendering the off-chip memory bandwidth a crucial bottleneck for a modern CKKS scheme that supports bootstrapping. In the following sections, we identify the importance of bootstrapping to the overall performance and analyze how different CKKS parameters impact the amount of data movement during bootstrapping and its final throughput.

3.2. Interplay between key CKKS parameters

(a) Maximum level
(b) evk size
Figure 1. (a) Maximum level L and (b) single evk size vs. normalized dnum for four different N (polynomial degree) values and a fixed 128b security target. A normalized dnum of 0 means dnum = 1 and a normalized dnum of 1 means dnum = max (i.e., k = 1). Interpolated results are used for points with non-integer dnum values. The dotted line in (a) represents the minimum required level of 11 for bootstrapping.

Selecting one parameter of a CKKS instance has a multifaceted effect on the other parameters. First, λ is lowered when log PQ is higher, and raised when N is higher. Considering that a bootstrappable CKKS instance requires a high L (L > L_boot), and that the sizes of the prime moduli q_i and p_i are set around 2^40 to 2^60 with a 64-bit machine word size, log PQ exceeds 500. To support 128b security when log PQ exceeds 500, N must be larger than 2^16 (Lee et al., 2020).

Second, when log PQ is set from fixed values of N and λ, a larger dnum leads to a higher L at the cost of a larger evk size. Considering that log P equals max_j(log Q_j), the ratio of log Q to log PQ is close to dnum/(dnum+1). Therefore, when log PQ is fixed, a larger dnum means a larger log Q and finally a larger L. However, the evk size also increases linearly with dnum (see Fig. 1). Because the gain in L achieved by increasing dnum saturates quickly, choosing a proper dnum is important.

Figure 2. Tmult,a/slot and the minimum bound of an HE accelerator simulated for different CKKS instances. Results are measured for all possible integer dnum values, including 1 and the max, for each (N, L) pair. The points highlighted in red represent (N, L, dnum) = (2^17, 27, 1), (2^17, 39, 2), and (2^17, 44, 3).

3.3. Realistic minimum bound of HE accelerator execution time

Tmult,a/slot is mainly determined by the bootstrapping time, as bootstrapping takes more than 60x longer than a single HMult on conventional systems (EPFL-LDS, 2021; Jung et al., 2021a). Unlike simple LHE tasks such as LoLa (Brutzkus et al., 2019), which only require a handful of evks, bootstrapping typically requires more than 40 evks, mostly for the long sequence of HRots applied with different rotation amounts r during the linear transformation steps of bootstrapping (Bossuat et al., 2021). These evks can amount to GBs of storage and exhibit poor locality.

The bootstrapping time is mostly spent on HMult and HRot. (Jung et al., 2021a) found that HMult and HRot are memory-bound, highly dependent on the on-chip storage capacity. Given today’s technology with low logic costs and high-density on-chip SRAMs, the performance of HMult and HRot can be improved significantly with an HE accelerator.

Despite such an increase in on-chip storage, evks, each possibly taking up several hundred MBs (see Fig. 1), cannot easily be stored on-chip. Because on-chip storage cannot hold all evks, they must be stored off-chip and loaded in a streaming fashion upon every HMult/HRot. Therefore, even if all temporal data and cts with high locality are assumed to be stored on-chip with massive on-chip storage, the load time of an evk sets the minimum execution time of HMult/HRot given the limited off-chip bandwidth.

(a) Computational flow
(b) Relative complexity
Figure 3. (a) Computational flow of the key-switching inside HMult and (b) computational complexity breakdown of HMult for cts at the maximum level on CKKS instances with the same N and log PQ values but different dnum values. The computational complexity is analyzed based on (Jung et al., 2021a).

3.4. Desirable target CKKS parameters for HE accelerators

To understand the impact of CKKS parameters, we simulate Tmult,a/slot at multiple points while sweeping the N, L, and dnum values. With 1TB/s of memory bandwidth (half of NVIDIA A100 (Choquette et al., 2021) and identical to F1 (Samardzic et al., 2021)), a bootstrapping algorithm that consumes 19 levels, and the simulation methodology in Section 6.2, we add two simplifying assumptions based on Section 3.3: 1) the computation time of HE ops can be fully hidden by the memory latency of loading evks, and 2) all cts of HE ops are stored in on-chip SRAM and reused. Fig. 2 reports the results. The x-axis shows λ determined by (N, log PQ) (Curtis and Player, 2019), as calculated using an estimation tool (Son, 2021). The y-axis shows Tmult,a/slot for different Ns, Ls, and dnums.

We make two key observations. First, when the other values are fixed, Tmult,a/slot decreases as N increases, even with the higher memory pressure from the larger cts and evks, because the available level (L − L_boot) increases. However, this effect saturates beyond N = 2^16. Around our target security level of 128b in Fig. 2, the gain from N = 2^15 to 2^16 is 3.8x (111.4ns to 29.1ns), whereas that from 2^16 to 2^17 is 1.3x. Second, while a higher dnum can help smaller Ns reach our target 128b security level, it comes at the cost of a superlinear increase in Tmult,a/slot due to the increasing evk size, with the additional gain in L saturating.

These key observations suggest that a bootstrappable HE accelerator should target CKKS instances with high polynomial degrees (N) and low dnum values. BTS targets the CKKS instances with N = 2^17 highlighted in Fig. 2. With these, the simulated HE accelerator achieves Tmult,a/slot of 27.7ns, 19.9ns, and 22.1ns with corresponding (L, dnum) pairs of (27, 1), (39, 2), and (44, 3), respectively. Although BTS can support all CKKS instances shown in Fig. 2, it is not optimized for the others, as they either exhibit worse Tmult,a/slot or require significantly more on-chip resources for only a marginal performance gain.

In this paper, we use the CKKS instance with N = 2^17, L = 27, and dnum = 1 as a running example. When using the 64-bit machine word size, a ct at the maximum level has a size of 56MB, and an evk has a size of 112MB.
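The ct and evk sizes of the running example follow directly from the matrix dimensions given in Section 2: a level-L ct is a pair of (L+1) x N residue matrices, and (for dnum = 1) an evk is a pair of (L+1+k) x N matrices with k = L+1. A quick sketch of the arithmetic:

```python
N = 1 << 17                 # polynomial degree
L = 27                      # maximum multiplicative level
DNUM = 1
K = -(-(L + 1) // DNUM)     # k = ceil((L+1)/dnum) special moduli
WORD = 8                    # 64-bit machine word, in bytes

def ct_bytes(level):
    # a ciphertext is a pair of (level+1) x N residue matrices
    return 2 * (level + 1) * N * WORD

def evk_bytes():
    # an evk is dnum pairs of (L+1+k) x N residue matrices
    return DNUM * 2 * (L + 1 + K) * N * WORD

print(ct_bytes(L) >> 20, "MB")   # prints 56
print(evk_bytes() >> 20, "MB")   # prints 112
```

The 2x gap between an evk and a max-level ct (the evk carries the k extra special-moduli rows) is what makes evk streaming, rather than ct movement, the dominant off-chip traffic.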

4. Architecting BTS

We explore the organization of BTS, our HE accelerator architecture. We address the limitations of prior works, F1 (Samardzic et al., 2021) in particular, and suggest a suitable architecture for bootstrappable CKKS instances. Section 3.4 derived the optimality of such CKKS instances under the assumption that an HE accelerator can hide all the computation time within the loading time of an evk. BTS exploits the massive parallelism innate in HE ops to indeed satisfy that requirement, with enough, but not an excess of, functional units (FUs).

4.1. Computational breakdown of HE ops

We first dissect key-switching, which appears in both HMult and HRot, the two dominant HE ops for bootstrapping and general HE workloads. Fig. 3(a) shows the computational flow of key-switching, and Fig. 3(b) shows the corresponding computational complexity breakdown. We focus on three functions, NTT, iNTT, and BConv, which take up most of the computation.

Number Theoretic Transform (NTT): A polynomial mult between polynomials in R_Q translates to a negacyclic convolution of their coefficients. NTT is a variant of the Discrete Fourier Transform (DFT) defined over Z_q. Similar to DFT, NTT transforms the convolution between two sets of coefficients into an element-wise mult, while inverse NTT (iNTT) is applied to obtain the final result as shown below (⊙ meaning element-wise mult):

a ∗ b = iNTT(NTT(a) ⊙ NTT(b))

By applying the well-known Fast Fourier Transform (FFT) algorithms (Cooley and Tukey, 1965), the computational complexity of (i)NTT is reduced from O(N²) to O(N log N). This strategy divides the computation into log N stages, where data elements are paired into N/2 pairs in a strided manner and butterfly operations are applied to each pair per stage. The stride value changes every stage. Butterfly operations in (i)NTT are as follows:

ButterflyNTT(x, y) = (x + W·y, x − W·y)
ButterflyiNTT(x, y) = (x + y, (x − y)·W)

where W (a twiddle factor) is an odd power (up to 2N − 1) of the primitive 2N-th root of unity. In total, N twiddle factors are needed per prime modulus. NTT can be applied concurrently to each residue polynomial (in R_{q_i}) of a ct.
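The convolution identity above can be demonstrated end to end on toy parameters. The sketch below is ours, not BTS code: it uses direct O(N²) evaluation instead of the butterfly dataflow (to keep it short), toy values Q = 257, N = 8, and ψ = 249 (a primitive 2N-th root of unity mod Q), and the standard pre-scaling by powers of ψ that turns a cyclic convolution into the negacyclic one needed for R_Q.

```python
Q = 257          # toy NTT-friendly prime: Q ≡ 1 (mod 2N)
N = 8            # toy degree
PSI = 249        # primitive 2N-th root of unity mod Q (so PSI**N ≡ -1)

def ntt(a):      # cyclic NTT, direct O(N^2) evaluation at powers of PSI^2
    w = pow(PSI, 2, Q)
    return [sum(a[j] * pow(w, i * j, Q) for j in range(N)) % Q for i in range(N)]

def intt(A):     # inverse cyclic NTT (pow(x, -1, Q) needs Python 3.8+)
    w = pow(PSI, -2, Q)
    n_inv = pow(N, -1, Q)
    return [n_inv * sum(A[j] * pow(w, i * j, Q) for j in range(N)) % Q for i in range(N)]

def negacyclic_mul(a, b):
    # pre-scaling by PSI^i turns cyclic convolution into negacyclic (mod X^N + 1)
    sa = [a[i] * pow(PSI, i, Q) % Q for i in range(N)]
    sb = [b[i] * pow(PSI, i, Q) % Q for i in range(N)]
    C = [x * y % Q for x, y in zip(ntt(sa), ntt(sb))]
    c = intt(C)
    return [c[i] * pow(PSI, -i, Q) % Q for i in range(N)]
```

For instance, multiplying by the polynomial 1 returns the other operand unchanged, and X · X^7 yields −1, reflecting the wrap-around rule X^N = −1 of the negacyclic ring.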

Base Conversion (BConv): BConv (Bajard et al., 2016) converts a set of residue polynomials into another set whose prime moduli differ from the former. A ct at level ℓ has two polynomials, each consisting of (ℓ+1) residue polynomials corresponding to the prime moduli q_0, ..., q_ℓ. We denote this modulus set as C_ℓ = {q_0, ..., q_ℓ}, called the polynomial's base, or base in short.

BConv is required in key-switching to match the base of a ct with that of an evk on base C_L ∪ B, where B = {p_0, ..., p_{k−1}}. BConv from C_ℓ to B is performed as expressed in Eq. 9, where q̂_i = (q_0 · ... · q_ℓ)/q_i. Likewise, BConv from B back to C_ℓ is performed after multiplying by the evk.

[a]_{p_j} = Σ_{i=0}^{ℓ} ([a_i · q̂_i^{−1}]_{q_i} · q̂_i) mod p_j,  for j = 0, ..., k−1   (9)

Because BConv cannot be performed on polynomials after NTT (i.e., in the NTT domain), iNTT is performed to bring the polynomials back to the RNS domain. BTS keeps polynomials in the NTT domain by default and brings them back to the RNS domain only for BConv. Thus, a sequence of iNTT → BConv → NTT is a common pattern in CKKS.
4.2. Limitations in prior works and the balanced design of BTS

Prior HE acceleration studies (Samardzic et al., 2021; Riazi et al., 2020; Roy et al., 2019; Reagen et al., 2021) identified (i)NTT as the paramount acceleration target and placed multiple NTT units (NTTUs) that can perform both ButterflyNTT and ButterflyiNTT. F1 (Samardzic et al., 2021) in particular populated numerous NTTUs with a "the more the better" approach, provisioning 14,336 NTTUs even for a small HE parameter set with N = 2^14. Such an approach was viable because, under the small parameter sets, all cts, evks, and temporal data could reside on-chip, especially with proper compiler support.

However, we observe that such massive use of NTTUs is wasteful in bootstrappable CKKS instances, where the off-chip memory bandwidth becomes the main determinant of the overall performance. The FHE-optimized parameters cause a quadratic increase in the ct, evk, and temporal data sizes (e.g., 64x when moving from N = 2^14 to N = 2^17). This makes it impossible for these components to reside on-chip, especially considering that most prior custom hardware works only take the maximum case into account.

We instead analyze how many fully-pipelined NTTUs an HE accelerator requires to finish HMult or HRot within the evk loading time for our target CKKS instances. We define the minimum required number of NTTUs (NTTU_min) as the number of butterfly ops per HMult divided by the number of NTTU cycles available during one evk load. When we assume a nominal operating frequency of 1.2GHz for NTTUs considering prior works (Choquette et al., 2021; Knowles, 2021; Jouppi et al., 2021) in 7nm process nodes, and HBM with an aggregate bandwidth of 1TB/s, NTTU_min is defined as shown below:

NTTU_min = (#butterfly ops per HMult) / ((evk size / 1TB/s) × 1.2GHz)   (10)

The value of NTTU_min is maximized when dnum is 1. For (N, L, dnum) = (2^17, 27, 1), the value is 1,328. We utilize 2,048 NTTUs in BTS to provide some margin for other operations.
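The sizing arithmetic can be reproduced as follows. The count of roughly 168 residue-polynomial (i)NTTs per HMult at dnum = 1 is our own assumption chosen for illustration (the paper does not state this breakdown here), but with it the calculation lands on the quoted figure of 1,328 NTTUs:

```python
import math

N = 1 << 17
L = 27
K = 28                     # special moduli, dnum = 1
FREQ = 1.2e9               # assumed NTTU operating frequency, Hz
BW = 1e12                  # assumed HBM aggregate bandwidth, bytes/s

evk_bytes = 2 * (L + 1 + K) * N * 8          # 112MB evk streamed per HMult/HRot
load_cycles = evk_bytes / BW * FREQ          # NTTU cycles available while the evk loads

# assumed: ~168 residue-polynomial (i)NTTs per HMult, each (N/2)*log2(N) butterflies
butterflies = 168 * (N // 2) * int(math.log2(N))
ntt_u_min = butterflies / load_cycles
print(int(ntt_u_min))                        # prints 1328
```

Rounding up to 2,048 NTTUs (a power of two convenient for the grid layout) then leaves headroom for the non-NTT work mentioned above.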

In addition to (i)NTT, the importance of BConv grows as small dnum values are used. As a result, the relative computational complexity of BConv in key-switching grows from 12% to 34% as dnum decreases (see Fig. 3(b)). Prior works mainly targeted parameter sets where BConv is a minor cost, focusing on the acceleration of (i)NTT. We propose a novel BConv unit (BConvU) to handle the increased significance of BConv, whose details are described later in Section 5.2.
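BConv's two-part structure, an element-wise multiply followed by a coefficient-wise accumulate (detailed in Section 5.2), can be sketched with a minimal fast-base-conversion routine; the primes, values, and the `bconv` helper here are illustrative, not BTS's implementation.

```python
from math import prod

# Minimal fast base conversion (BConv) sketch: residues of x in RNS base
# {q_i} are converted to residues in a new base {p_j}. The result may carry
# a small multiple of Q = prod(q_i), which CKKS tolerates as noise.
def bconv(x_res, qs, ps):
    Q = prod(qs)
    hats = [Q // q for q in qs]                              # Q_hat_i = Q / q_i
    inv_hats = [pow(h, -1, q) for h, q in zip(hats, qs)]
    # First part: per-modulus multiply by [Q_hat_i^{-1}]_{q_i} (element-wise).
    ys = [(x * ih) % q for x, ih, q in zip(x_res, inv_hats, qs)]
    # Second part: multiply by Q_hat_i mod p_j and accumulate (coefficient-wise).
    return [sum(y * (h % p) for y, h in zip(ys, hats)) % p for p in ps]

qs, ps = [97, 193, 257], [769, 1153]
x = 123456
out = bconv([x % q for q in qs], qs, ps)
# out[j] equals (x + e*Q) mod p_j for some small e in [0, len(qs)).
```

The second part is the coefficient-wise accumulation that forces BTS's MMAU to gather residues of the same coefficient across all residue polynomials.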

4.3. BTS organization exploiting data parallelism

Figure 4. Data access patterns in HE functions.

We can categorize primary HE functions into three groups according to their data access patterns (see Fig. 4). Residue-polynomial-wise functions, the (i)NTT and automorphism functions, involve all residues in a residue polynomial to produce an output. Coefficient-wise functions (e.g., BConv) involve all residues of a single coefficient to produce an output residue. Element-wise functions such as CMult and PMult only involve residues on the same position over multiple polynomials.

We can exploit two types of data parallelism, residue-polynomial-level parallelism (rPLP) and coefficient-level parallelism (CLP), when parallelizing an HE op with multiple processing elements (PEs). rPLP is exploited by distributing residue polynomials, and CLP by distributing coefficients, across many PEs. Prior works including F1 mostly exploited rPLP, as prime-wise modularization is readily achievable.

When the data access pattern and the type of parallelism being exploited are not aligned, data exchanges between PEs occur, resulting in global wire communication, which has scaled poorly over technology generations (Ho et al., 2001). For the sequence of iNTT → BConv → NTT in key-switching, CLP will incur data exchanges for (i)NTT, and rPLP will incur data exchanges for BConv. The total size of the transferred data is identical in both cases. Thus, there is no clear winner between the two types of parallelism in terms of data exchanges. However, exploiting rPLP is limited in terms of the degree of parallelism due to the fluctuating multiplicative level as an FHE application executes. This also complicates a fair distribution of jobs among PEs.

Instead, we use CLP in BTS. As the degree N is fixed throughout the running of an HE application, we adopt a fixed data-distribution methodology, where the residues of a polynomial with the same coefficient index are allocated to the same PE. Then, coefficient-wise and element-wise functions are parallelized without inter-PE data exchanges; only (i)NTT and the automorphism incur inter-PE data exchanges, with the communication pattern predetermined by the fixed data distribution.

We place 2,048 PEs (Eq. 10) in BTS. Each PE has an NTTU, a BConvU, a modular adder (ModAdd), and a modular multiplier (ModMult) for element-wise functions, as well as an SRAM scratchpad. The N = 2^17 residues of a residue polynomial are evenly distributed to the PEs, such that one PE handles 64 residues. Then, six out of the 17 (i)NTT stages can be computed solely inside a PE. We adopt 3D-NTT to minimize the data exchanges between the PEs. A residue polynomial is regarded as a 3D data structure of size 64×32×64. Then, each PE performs a sequence of 64-, 32-, and 64-point (i)NTTs, interleaved with just two rounds of inter-PE data exchange. Splitting (i)NTT in a more fine-grained manner requires more data-exchange rounds and is thus less energy-efficient. The automorphism function exhibits a different communication pattern from (i)NTT, involving complex data remapping (Eq. 5). Nevertheless, the data-distribution methodology and NoC structure of BTS efficiently handle the data exchanges for both (i)NTT and the automorphism (see Section 5).
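The fixed coefficient-to-PE distribution can be illustrated as follows; the exact bit-field ordering and the `pe_of` helper are our assumptions for illustration (BTS only requires that the mapping be fixed). Index i is viewed as a position (x, y, z) in a 64 × 32 × 64 cube, and the residues at (x, y, ·) land on PE (x, y):

```python
from collections import Counter

# Illustrative coefficient-to-PE mapping for N = 2^17 (bit-field ordering is
# an assumption): x = low 6 bits, y = next 5 bits, z = high 6 bits, PE = (x, y).
N, X, Y, Z = 1 << 17, 64, 32, 64

def pe_of(i):
    x = i % X                 # column index in the PE grid
    y = (i // X) % Y          # row index in the PE grid
    return (x, y)             # z = i // (X * Y) stays inside the PE

counts = Counter(pe_of(i) for i in range(N))
assert len(counts) == X * Y           # all 2,048 PEs are used
assert set(counts.values()) == {Z}    # each PE holds exactly 64 residues
```

With a fixed mapping like this, element-wise and coefficient-wise functions touch only PE-local data; only (i)NTT and the automorphism, which regroup indices, need the NoC.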

Figure 5. The overview of BTS: Each PE in a grid is denoted as (column index, row index). PEs interconnect through the PE-PE NoC composed of xbarv and xbarh. BrU is the broadcast unit. BrU and the main memory communicate with PEs through separate NoCs. A PE consists of a scratchpad, an NTTU to undertake NTT/iNTT, a BConvU for BConv, a modular multiplier (ModMult), and a modular adder (ModAdd). BConvU consists of a ModMult and MMAU.

5. BTS Microarchitecture

We devise a massively parallel architecture that distributes PEs in a grid. A PE consists of functional units (FUs) and an SRAM scratchpad. An NTTU in each PE handles a portion of the residues in a residue polynomial during (i)NTT. By exploiting CLP, the coefficient-wise or element-wise functions can be computed in a PE without any inter-PE data exchange.

Fig. 5 presents a high-level overview of BTS. We arrange 2,048 PEs in a grid with a vertical height of 32 and a horizontal width of 64. The PEs are interconnected via dimension-wise crossbars in the form of 32×32 vertical crossbars (xbarv) and 64×64 horizontal crossbars (xbarh). We populate a central constant memory storing precomputed values, including the twiddle factors for (i)NTT and the base-conversion constants for BConv. A broadcast unit (BrU) delivers the precomputed values to the PEs at the required moments. Memory controllers are located at the top and bottom sides, each connected to an HBM stack. BTS receives instructions and necessary data from the host via the PCIe interface. The word size in BTS is 64 bits. Modular reduction units use Barrett reduction (Barrett, 1986) to bring the 128-bit multiplied results back to the word size.
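Barrett reduction can be sketched in a few lines (a functional model with an illustrative 61-bit Mersenne prime and our own helper names, not the hardware datapath): one precomputed reciprocal turns the division into a multiply, a shift, and at most two conditional subtractions.

```python
# Barrett reduction sketch: reduce a product x < p^2 modulo a 64-bit-class
# prime p without hardware division.
K = 128

def barrett_precompute(p):
    return (1 << K) // p            # m = floor(2^128 / p)

def barrett_reduce(x, p, m):
    q = (x * m) >> K                # estimate of floor(x / p), low by at most 2
    r = x - q * p
    if r >= p:
        r -= p
    if r >= p:
        r -= p
    return r

p = (1 << 61) - 1                   # illustrative 61-bit Mersenne prime
m = barrett_precompute(p)
a, b = 2**60 + 12345, 2**59 + 678
assert barrett_reduce(a * b, p, m) == (a * b) % p
```

Because the quotient estimate errs by at most two, the correction step is branch-light and maps well onto a fixed-latency hardware unit.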

5.1. Datapath for (i)NTT

BTS maps the coefficients of a polynomial to the PEs in a manner suited to 3D-NTT. We view the N residues in a residue polynomial as a cube. A residue at coefficient index i (the coefficient of X^i) occupies a position (x, y, z) in this cube, and we allocate the residues at position (x, y, z) to the PE at coordinate (x, y) in the PE grid. 3D-NTT is broken down into five steps in BTS. First, we conduct i) NTTz inside a single PE, which corresponds to the NTT along the z-axis of the cube. Next, ii) data exchanges between vertically aligned PEs are executed, corresponding to a yz-plane-parallel transposition of residues in the cube. iii) NTTy along the z-axis follows. iv) Data exchanges between horizontally aligned PEs are executed, corresponding to an xz-plane-parallel transposition of residues in the cube. Finally, v) NTTx along the z-axis is carried out. iNTT is performed by the reverse process of NTT.

An NTTU supports both NTT and iNTT by using logic circuits similar to those in (Xin et al., 2021, 2020; Xing and Li, 2021; Zhang et al., 2021). We employ separate register files (RFNTTs) to reuse data between (i)NTT stages. An NTTU decomposes NTTx, NTTy, and NTTz into radix-2 NTTs. It is fully pipelined and performs one butterfly op per clock cycle. Each cycle, an input pair is fed into the NTTU and an output pair is stored from it, served by two pairs of RFNTTs.

We hide the time for the vertical and horizontal data exchanges of 3D-NTT (steps ii) and iv)) through coarse-grained, epoch-based pipelining. As steps i), iii), and v) are executed on the same NTTU, we set the length of an epoch to the time required to perform these three steps. Within the k-th epoch, we time-multiplex i) of the k-th, iii) of the (k−2)-th, and v) of the (k−4)-th residue polynomials, while concurrently exchanging ii) of the (k−1)-th and iv) of the (k−3)-th residue polynomials. Concurrent data exchanges are enabled by separate vertical (ii)) and horizontal (iv)) NoCs. Thus, the (i)NTT of a single residue polynomial finishes every epoch.

A single (i)NTT on a residue polynomial requires N distinct twiddle factors. Because each prime modulus needs different twiddle factors, the twiddle factors for (i)NTT on a ciphertext reach dozens of MBs for our target CKKS instances. We reduce the storage for the twiddle factors by decomposing them via on-the-fly twiddling (OT) (Kim et al., 2020a). OT replaces the N-entry precomputed twiddle-factor table with two far smaller tables: a higher-digit table holding the twiddle factors whose exponents are multiples of a fixed power of two, and a lower-digit table holding those with exponents below that power of two. We can compose any twiddle factor by multiplying one entry from each table whose exponents sum to the target exponent, substantially reducing the memory usage. BTS stores the lower-digit tables of the prime moduli in PEs (each PE holding different entries) while storing the higher-digit tables in the BrU (all PEs sharing the entries). The BrU broadcasts the higher-digit table for a prime modulus to the PEs every (i)NTT epoch.
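The OT decomposition can be sketched as follows (toy sizes N = 16, m = 2, p = 257 and the `twiddle` helper are ours for illustration; the real tables cover N = 2^17 with 64-bit moduli):

```python
# On-the-fly twiddling (OT) sketch: replace an N-entry twiddle table with a
# higher-digit table of size N // 2**m and a lower-digit table of size 2**m,
# composing W^k = W^(hi * 2**m) * W^lo for k = hi * 2**m + lo.
p = 257                      # toy prime with N | p - 1
N, m = 16, 2                 # N = 16 twiddles, split into 4 + 4 table entries
W = pow(3, (p - 1) // N, p)  # a primitive N-th root of unity mod p

hi_table = [pow(W, i << m, p) for i in range(N >> m)]   # shared, broadcast by BrU
lo_table = [pow(W, j, p) for j in range(1 << m)]        # stored per PE

def twiddle(k):
    return hi_table[k >> m] * lo_table[k & ((1 << m) - 1)] % p

assert all(twiddle(k) == pow(W, k, p) for k in range(N))
```

Here 16 table entries shrink to 4 + 4; at N = 2^17 the same split turns a megabyte-scale table per modulus into two tables of a few KB each, at the cost of one extra modular multiplication per twiddle factor.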

5.2. Base Conversion Unit (BConvU)

BConv consists of two parts. The first part multiplies each residue polynomial by a constant depending on its own modulus, and the second part multiplies the results by constants of the target moduli and accumulates them. It is the second part that exhibits the coefficient-wise access pattern, because it accumulates residues at the same coefficient index across all residue polynomials.

A BConv unit (BConvU), with a modular multiplier (ModMult) for the first part and a modular multiply-accumulate unit (MMAU) for the second part, is placed in each PE. BConv strongly depends on the preceding iNTT (see Fig. 3). Because iNTT is a residue-polynomial-wise function whereas the second part of BConv is a coefficient-wise function, the MMAU must wait until iNTT has finished on all residue polynomials. We mitigate this by partially overlapping iNTT and BConv, modifying the right-hand side of Eq. 9 to accumulate over chunks of residue polynomials (Eq. 11). This modification enables the second part to start once the preceding iNTT and the first part of BConv have finished on a chunk of sub residue polynomials and the results are stored in RFMMAU. The MMAU computes the corresponding partial sum (the inner sum of Eq. 11) and accumulates it with the previous results (the outer sum), which are loaded from and stored to a scratchpad, inducing a read and a write every cycle. Temporal registers and a FIFO minimize the bandwidth pressure on RFMMAU and transpose the data into the correct orientation to feed sub lanes into the MMAU. The precomputed BConv tables are loaded into dedicated RFs from the BrU when needed.
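The restructuring can be illustrated with plain modular arithmetic (the values, modulus, and n_sub below are stand-ins of our choosing): splitting the accumulation into chunks of n_sub changes nothing numerically, which is exactly what lets the MMAU start on early chunks while iNTT is still producing later ones.

```python
# Partial-sum restructuring sketch for Eq. 11: the accumulation over all L
# residue polynomials is split into chunks of n_sub, so work can begin as
# soon as the first n_sub polynomials leave iNTT.
p = 1_000_003
L, n_sub = 12, 3
terms = [(i * 17 + 5) % p for i in range(L)]   # stand-ins for y_i * q_hat_i mod p

full = sum(terms) % p                          # the monolithic accumulation

acc = 0
for chunk in range(0, L, n_sub):
    partial = sum(terms[chunk:chunk + n_sub])  # inner sum over one n_sub chunk
    acc = (acc + partial) % p                  # outer sum: accumulate in scratchpad

assert acc == full
```

The outer-sum accumulator is what lives in the scratchpad, which is why BConv's scratchpad bandwidth demand rises with the number of chunks.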

We also leverage the MMAU for other operations. The subtraction, scaling, and addition at the end of key-switching (Fig. 3) can be expressed in multiply-accumulate form; thus, we fuse these three operations to be computed on the MMAU. We refer to this fusion as subtraction-scaling-addition (SSA).

5.3. Scratchpad

The per-PE scratchpad has three purposes. First, it stores the temporary data generated during the course of the HE ops. The size of the temporary data during key-switching can be large (e.g., a single (i)NTT or BConv can produce 28MB of intermediate residue polynomials for our target CKKS instances). If such data did not reside on-chip, the additional off-chip accesses would cause severe performance degradation.

Second, the scratchpad also stores the prefetched evk. To hide its load latency, an evk must be prefetched beforehand. As an evk is not consumed immediately after being loaded on-chip, it takes up a portion of the scratchpad.

Third, the scratchpad functions as a cache for cts, controlled explicitly by software (SW caching). cts often show high temporal locality during a sequence of HE ops. For instance, during bootstrapping, a ct is commonly subjected to multiple HRots. Moreover, as HE ops form a deterministic computational flow and the granularity of cache management is as large as a ct, SW control is manageable.

The scratchpad bandwidth demand of the BConvU is high (as later detailed in Fig. 8) due to the accesses involved in updating the partial sums. Because the partial sums are loaded and stored once per chunk of sub residue polynomials in Eq. 11, the bandwidth pressure can be relieved by increasing the chunk size. However, this would also require an increase in the number of lanes in the MMAU (and hence the size of its RF), resulting in a trade-off.

5.4. Network-on-Chip (NoC) design

BTS has three types of on-chip communication: 1) off-chip memory traffic to the PEs (PE-Mem NoC), 2) the distribution of precomputed constants to PEs (BrU NoC), and 3) inter-PE data exchanges for (i)NTT and the automorphism (PE-PE NoC). BTS has a large number of nodes (over 2k endpoints) and requires a high bandwidth. Given the unique communication characteristics of each type of on-chip communication, BTS provides three separate NoCs instead of sharing a single NoC to enable deterministic communication while minimizing the NoC overhead.

PE-Mem NoC: Because data is distributed evenly across the PEs, the off-chip memory (i.e., HBM2e (JEDEC, 2021)) is placed on the top and bottom, and each HBM stack only needs to communicate with the half of the PEs placed nearby. The PE-grid placement is exploited by separating the PEs into 32 regions and connecting each HBM pseudo-channel to a single PE region. An HBM2e stack supports 16 pseudo-channels (Micron Technology, Inc., 2020); thus, the upper and lower halves of the PEs each comprise 16 regions, with each region consisting of 64 PEs.

BrU NoC: BrU data is globally shared and broadcast to all PEs. Given the large number of PEs, the BrU is organized hierarchically with 128 local BrUs. Each local BrU provides the higher-digit tables of twiddle factors and the BConv tables to 16 PEs. The global BrU is loaded with all precomputed values before an HE application starts and sends data to the local BrUs, which serve as temporary storage/repeaters.

PE-PE NoC: The PE-PE NoC must support the highest bandwidth due to the data exchanges between the PEs. The communication pattern is symmetric (i.e., each PE sends and receives the same amount of data), and no single PE is oversubscribed. In addition, because the traffic pattern is known in advance (e.g., all-to-all or a fixed permutation), the NoC can be greatly simplified. BTS implements a logical 2D flattened butterfly (Kim et al., 2007; Ahn et al., 2009), given that communication is limited to other PEs within each row and each column. However, instead of placing a router at each PE, a single “router” xbarh (respectively, xbarv) is shared by all PEs within each row (column); it is placed in the center of each row (column) and used for the horizontal (vertical) data-exchange steps of (i)NTT (steps ii), iv)). Each xbarh (xbarv) does not require dynamic allocation because the traffic pattern is known ahead of time and can be scheduled through pre-determined arbitration.

5.5. Automorphism

We identify that BTS can handle the automorphism for HRots efficiently. Under BTS' PE-coefficient mapping scheme, all residues mapped to a single PE always move to another single destination PE; i.e., the inter-PE communication of the automorphism exhibits a permutation pattern. A PE at grid coordinate (x, y) holds the residues at positions (x, y, z) for all z, corresponding to coefficient indices that differ only in the higher bit-field (Section 5.1). Therefore, the automorphism destination indices of these coefficients (Eq. 5) also differ only in the higher bit-field, so the residues are all mapped to the same destination PE, which is determined by the lower bit-field.

We can decompose such a permutation pattern into three steps matching the PE-PE NoC structure of BTS: intra-PE permutation (z-axis), vertical permutation (y-axis), and horizontal permutation (x-axis). Each step gradually updates the source indices to the destination indices from the higher to the lower bit-fields. The intra-PE permutation does not use the NoC, and the vertical/horizontal permutations are handled by xbarv/xbarh. The PE-PE NoC can support an arbitrary HRot with any rotation amount without data contention, a property similar to that of 3D-NTT.
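The per-coefficient index permutation underlying the automorphism can be modeled in a few lines (a functional sketch on a toy ring with n = 8, p = 97; the `automorphism` helper is ours). Signs arise because X^n = −1 in the ring, and an odd exponent g guarantees a permutation of indices:

```python
# Automorphism sigma_g on Z_p[X]/(X^n + 1): X^i -> X^(g*i mod 2n), folded
# back with a sign flip whenever the exponent lands in [n, 2n).
def automorphism(coeffs, g, p):
    n = len(coeffs)
    out = [0] * n
    for i, c in enumerate(coeffs):
        j = (g * i) % (2 * n)
        if j < n:
            out[j] = (out[j] + c) % p
        else:
            out[j - n] = (out[j - n] - c) % p   # X^n = -1 wraparound
    return out

p, n = 97, 8
a = [1, 2, 3, 4, 5, 6, 7, 8]
g = 5                                   # HRot amounts use powers of 5 mod 2n
g_inv = pow(g, -1, 2 * n)               # exponent of the inverse automorphism
assert automorphism(automorphism(a, g, p), g_inv, p) == a
```

Because g is odd, i → g·i mod 2n is invertible, so every source index has exactly one destination: the permutation property that makes the three-step NoC decomposition contention-free.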

6. Evaluation

6.1. Hardware modeling of BTS

Component Area (µm²) Power (mW) Freq (GHz)
Scratchpad SRAM 114,724 9.86 1.2
RFs 12,479 2.29 Various
NTTU 9,501 12.17 1.2
ModMult (BConvU) 4,070 0.56 0.3
MMAU (BConvU) 9,511 8.42 1.2
Exchange unit 421 1.03 1.2
ModMult 3,833 1.35 0.6
ModAdd 325 0.08 0.6
1 PE 154,863 35.75 -
Component Area (mm²) Power (W) Freq (GHz)
2048 PEs 317.2 73.21 -
Inter-PE NoC 3.06 45.93 1.2
Global BrU + NoC 0.42 0.10 0.6
128 local BrUs 3.69 0.04 0.6
HBM2e NoC 0.10 6.81 1.2
2 HBM2e stacks 29.6 (Jouppi et al., 2021) 31.76 (O’Connor et al., 2017) -
PCIe5x16 interface 19.6 (Jouppi et al., 2021) 5.37 (Bichan et al., 2020) -
Total 373.6 163.2 -
Table 3. The area and the peak power of components in BTS.

We used the ASAP7 (Clark et al., 2016, 2017) design library to synthesize the logic units and datapath components in a 7nm technology node. We simulated the RFs and scratchpads using FinCACTI (Shafaei et al., 2014) due to the absence of a public 7nm memory compiler. We updated the analytic models and technology constants of FinCACTI to match ASAP7 and the IRDS roadmap (IEEE, 2018). We validated the RTL synthesis and SRAM simulation results against published information (Chang et al., 2017; Song et al., 2018; Auth et al., 2017; Wu et al., 2016; Narasimha et al., 2017; Jouppi et al., 2021; Jeong et al., 2018).

BTS uses single-ported, 128-bit-wide, 1.2GHz SRAMs for the scratchpads, providing a total capacity of 512MB and a bandwidth of 38.4TB/s chip-wide. RFs are implemented in single-ported SRAMs with variable sizes, port widths, and operating frequencies following the requirements of the FUs; 22MB of RFs is used chip-wide, providing 292TB/s. Crossbars in the PE-PE NoC have 12-bit-wide ports and run at 1.2GHz, providing a bisection bandwidth of 3.6TB/s. The NoC wires are routed over other components (Passas et al., 2012). We analyzed the cost of wires and crossbars using FinCACTI and prior works (Banerjee and Mehrotra, 2002; IEEE, 2018; Moon et al., 2008; Passas et al., 2012). Two HBM2e stacks are used (JEDEC, 2021), with a modest 11% speedup assumed considering the latest technology (JEDEC, 2022). The peak power and area estimation results are shown in Table 3. BTS is 373.6mm² in size and consumes up to 163.2W of power.

6.2. Experimental setup

We developed a cycle-level simulator to model the compute capability, latency, and bandwidth of the FUs and the memory components composing BTS. When an HE op is called, the simulator converts the op into a computational graph of primary HE functions. Based on the derived computation and data dependencies, the simulator schedules functions and data loads at epoch granularity while minimizing the temporary-data hold time. Utilization rates are also collected and combined with the power model to calculate the energy. The scratchpad space is prioritized in the order of the temporary data, the prefetched evks, and finally, ct caching with an LRU policy.

We measured Tmult,a/slot as a microbenchmark and evaluated the most complex applications currently available on CKKS: logistic regression (HELR (Han et al., 2019)), CNN inference (ResNet-20 (Lee et al., 2022)), and sorting (Hong et al., 2021). HELR trains a binary classification model on MNIST (Deng, 2012) for 30 iterations, each with a batch containing 1,024 14×14-pixel images. ResNet-20 performs homomorphic convolution, linear transform, and ReLU, achieving 92.43% accuracy on CIFAR-10 classification (Krizhevsky and Hinton, 2009). We used the channel-packing method proposed in (Juvekar et al., 2018) to pack all of the feature-map channels into a single ct to further improve performance. Sorting uses a 2-way sorting network. Because non-linear functions such as ReLU and comparisons are approximated by high-degree polynomial functions in CKKS, they consume many levels and induce dozens and hundreds of bootstrapping operations for ResNet-20 and sorting, respectively.

CKKS instance Temp data
INS-1 27 1 3090 133.4 183MB
INS-2 39 2 3210 128.7 304MB
INS-3 44 3 3160 130.8 365MB
Table 4. The CKKS instances used for evaluation.

We compared BTS with the state-of-the-art implementations on a CPU (Lattigo (EPFL-LDS, 2021)), a GPU (100x (Jung et al., 2021a)), and an ASIC (F1 (Samardzic et al., 2021)) for Tmult,a/slot and HELR. We ran Lattigo on a system with an Intel Skylake CPU (Xeon Platinum 8160) and 256GB of DDR4-2666 memory. We used the 128b-secure CKKS instance preset of Lattigo and newly implemented HELR on Lattigo. For 100x and F1, the execution times reported in each paper were used. 100x (Jung et al., 2021a) used an NVIDIA V100 (NVIDIA Corporation, 2017) for its evaluation. We also compared BTS with F1+, whose execution times are optimistically scaled from F1 to have the same area as BTS at 7nm (Narasimha et al., 2017). For the other applications, we compared BTS with the multi-threaded CPU performance reported in each paper due to the absence of publicly available implementations. We used the CKKS instances shown in Table 4 to evaluate BTS. They all have the same degree N = 2^17 and satisfy 128b security but use different values of L and dnum. As L and dnum increase, the temporary data grows, requiring more scratchpad space.

6.3. Performance and efficiency of BTS

Figure 6. Comparison of the Tmult,a/slot between BTS and other prior works of Lattigo (EPFL-LDS, 2021), 100x (Jung et al., 2021a), and F1 (Samardzic et al., 2021). F1+ is a scaled-up version of F1. INS-x denotes the CKKS instances used for BTS, specified in Table 4.

Amortized mult time per slot: BTS outperforms the state-of-the-art CPU/GPU/ASIC implementations by tens to thousands of times in terms of the throughput of HMult. Fig. 6 shows the Tmult,a/slot values of Lattigo, 100x, F1, F1+, and BTS. The best Tmult,a/slot is achieved with INS-2 at 45.5ns, 2,237× better than Lattigo. F1 is even 2.5× slower than Lattigo; this occurs because F1 only supports single-slot bootstrapping.² We call a ct sparsely-packed if its corresponding message occupies far fewer slots than the maximum number available (N/2). Bootstrapping a sparsely-packed ct reduces the computational complexity and consumes fewer levels (Chen et al., 2019); in the extreme single-slot case, this effect is maximized. F1 only supports single-slot bootstrapping due to the lack of multiplicative levels, as it targets small parameter sets. F1+ fares better but still shows 824× lower performance than BTS. The Tmult,a/slot of 100x is 743ns, the best among prior works. However, this is for a 97b-secure parameter set; when using a 173b-secure CKKS instance, 100x reported a Tmult,a/slot of 8µs.

Figure 7. (a) Comparison of the minimum bound of Tmult,a/slot (Section 3) and the actual Tmult,a/slot using scratchpads of 512MB and 2GB for INS-x, and (b) the portion of the bootstrapping time for each application on INS-1.

The Tmult,a/slot of INS-x is higher than the minimum bound shown in Fig. 2 because cts are not always on the scratchpad, whose capacity is limited. Fig. 7(a) shows the minimum and actual Tmult,a/slot using 512MB and 2GB of scratchpad for INS-x. INS-2 always performs the best. INS-1 performs better than INS-3 with a 512MB scratchpad because the former requires less temporary data, leading to a higher hit rate for cts. With a sufficient (albeit impractical) scratchpad capacity of 2GB, cts mostly hit, reaching performance close to the minimum bound.

Logistic regression: Table 5 reports the average training time per iteration of HELR. Due to the limited parameter set F1 supports, F1 only reported the HELR training time for a single iteration with 256 images, which does not require bootstrapping but is not enough for training. We estimated F1's end-to-end HELR performance by assuming that the 1,024 images in a batch are trained over four iterations, with single-slot bootstrapping applied and the cost of packing/unpacking cts for bootstrapping ignored (favoring F1). The execution time with INS-2 reaches 28.4ms, which is 1,306×, 27×, and 5.2× better than Lattigo, 100x, and F1+, respectively.

Lattigo 100x F1 F1+ INS-1 INS-2 INS-3
Time (ms) 37,050 775 1,024 148 39.9 28.4 43.5
Speedup 1× 48× 36× 250× 929× 1,306× 852×
Table 5. Comparison of performance between BTS and other prior works (EPFL-LDS, 2021; Jung et al., 2021a; Samardzic et al., 2021) for logistic regression training (Han et al., 2019).
CPU INS-1 INS-2 INS-3
ResNet-20 execution time (s) 10,602 1.91 2.02 3.09
Speedup (vs. (Lee et al., 2022)) 1× 5,556× 5,240× 3,427×
# of bootstrapping - 53 22 19
Sorting execution time (s) 23,066 15.6 18.8 25.2
Speedup (vs. (Hong et al., 2021)) 1× 1,482× 1,226× 915×
# of bootstrapping - 521 306 229
Table 6. Evaluating BTS for ResNet-20 inference (Lee et al., 2022) and sorting (Hong et al., 2021).

ResNet-20 and sorting: BTS performs up to 5,556× and 1,482× faster than the prior works (Lee et al., 2022) and (Hong et al., 2021) (see Table 6). For ResNet-20, INS-1 without channel packing shows a 311× speedup. By adopting the channel-packing method (Juvekar et al., 2018), which exploits the abundant slots of our target CKKS instances, we reduced the working set and improved the throughput, resulting in an additional 17.8× performance gain and achieving a 1.91s ResNet-20 inference latency on an encrypted image.

Although BTS provides a speedup of more than three orders of magnitude for the most complex applications, these applications still do not fully utilize all available slots due to their small problem sizes. We anticipate the relative speedup of BTS to improve even further when real-world applications are implemented with FHE. For instance, an ImageNet (Deng et al., 2009) image has more data than a single ct can hold, requiring multiple fully-packed cts to encrypt.

Figure 8. Timeline, on-chip scratchpad usage change, and scratchpad bandwidth utilization change when BTS performs HMult with INS-1.
Figure 9. The performance and speedup of Tmult,a/slot of BTS when applying various components incrementally. Small BTS is BTS with just enough scratchpad to hold the temporal data of the HE op with no overlapping between BConv and iNTT. The CKKS instance is specified in parentheses.

Parameter selection in retrospect: In Section 3, we estimated the Tmult,a/slot of the CKKS instances assuming an always-hit scratchpad and used it as a proxy for the performance of FHE applications with frequent bootstrapping. While the Tmult,a/slot results from the simulator do not directly match the estimation, the 2GB scratchpad case (Fig. 7(a)) does concur. This is because the temporary data of INS-3 constitutes the largest working set (Table 4), and its hit rate is the most affected by the scratchpad capacity.

However, Tmult,a/slot does not always translate into application performance, for the following reasons. First, when the portion of bootstrapping is relatively small, as in ResNet-20 (Fig. 7(b)), the complexity of individual HE ops becomes more influential, and a smaller L is better (INS-1 in Table 6). Second, the better Tmult,a/slot afforded by the deeper levels of higher-L instances does not translate into better performance when there exists a level imbalance between cts. Such an imbalance nullifies the benefit of more available levels (see Table 6 with INS-1 and INS-2).

PE resource utilization over time: The resources populated in the PEs are highly utilized while processing HE ops. Fig. 8 presents a detailed timeline of HMult on INS-1 when the cts are on the scratchpad. HBM achieves 98% of its peak bandwidth. NTTUs are busy processing the (i)NTTs of three intermediate polynomials (d2, ax, and bx) 76% of the time. BConv is partially pipelined with iNTT and has a strong dependency on the subsequent NTT; thus, it occupies the BConvU 33% of the time. The scratchpad bandwidth requirement of BConv is high because it must load the partial sums of Eq. 11 within a limited number of epochs. The BConvU runs SSA while not occupied by BConv.

The bandwidth and capacity utilization of the scratchpad fluctuate over time while remaining properly provisioned to meet the requirements. The average bandwidth usage was 58.6%, peaking at 90% when processing a BConv. The required capacity also peaked at 183MB.

Figure 10. The bootstrapping time and Energy-Delay Area Product (EDAP) of BTS-1 at various scratchpad SRAM sizes.

Ablation study: To evaluate the impact of various attributes of BTS on its performance, we first evaluated a small baseline BTS (∼230mm²) with just enough scratchpad to hold the temporary data, using Lattigo's CKKS instance and no overlapping between BConv and iNTT. The result is a 379× faster Tmult,a/slot compared to Lattigo. We then incrementally changed the CKKS instance to INS-1 and increased the scratchpad size to 512MB, resulting in 1.50× and 3.18× speedups, respectively (see Fig. 9). Finally, additionally overlapping BConv and iNTT yields a 1.13× speedup, reaching a total 2,044× speedup over Lattigo.

We also evaluated BTS with an HBM bandwidth of 2TB/s. We reduced the scratchpad size to make room for the added HBM2e PHYs so that BTS retains the same total area. The result shows only a 1.26× speedup, as a larger fraction of time becomes bound to computation despite the load time being halved.

Slowdown of FHE: FHE applications on BTS are still slower than their unencrypted counterparts. HELR is 141× slower and ResNet-20 inference is 440× slower than when they are run on a CPU system without FHE. The evaluation of non-polynomial functions such as ReLU, which are costly to evaluate with FHE (Lee et al., 2021b), results in the greater slowdown of ResNet-20. Thus, it is crucial to optimize applications to make them more FHE-friendly.

Impact of the scratchpad size on the performance and EDAP: The performance and energy efficiency of BTS improve as we deploy a larger scratchpad, but saturate once the scratchpad holds most of the working sets of the HE ops. Fig. 10 shows the execution-time breakdown and energy-delay-area product (EDAP (Thoziyoor et al., 2008)) for the bootstrapping of INS-1 with various scratchpad sizes. We increased the scratchpad size from 192MB (close to the temporary data required for HMult) in steps of 64MB, up to 1GB.

With a 192MB scratchpad, BTS frequently loads cts from off-chip memory due to capacity misses. At this point, HMult/HRot, which used to be dominant (77% of the bootstrapping time for Lattigo) due to its high computational complexity, only requires 24% of the execution time; the rest is attributed to PMult, HAdd, HRescale, and CMult/CAdd. While BTS greatly reduces the computation time of HMult/HRot with its abundant PEs, the load time, which any HE op incurs when SW-cache misses occur, is now dominant.

As the scratchpad size increases, the portion of bootstrapping time spent on HMult/HRot increases. This occurs because the SW-cache hit rate of cts gradually increases for every HE op; with a 512MB scratchpad, the hit rates are 65.6%, 98.8%, 93.7%, 98.6%, 97.5%, and 47.8% for HMult, HRot, PMult, HAdd, HRescale, and CMult/CAdd, respectively. The execution time of HMult/HRot is lower-bounded by the evk load time, even on SW-cache hits. However, the other HE ops, which do not require evks, can take significantly less time when the necessary cts are located on the scratchpad, owing to the high ratio of on-chip to off-chip bandwidth.

7. Related Work

CPU acceleration: (CryptoLab Inc., 2018) parallelized HE ops via multi-threading. (Jung et al., 2021b; Boemer et al., 2021) leveraged short-SIMD support. (EPFL-LDS, 2021) exploited the algorithmic analysis of (Bossuat et al., 2021) for an efficient bootstrapping implementation. Nevertheless, other platforms outperform CPU implementations.

GPU acceleration: GPUs are a good fit for accelerating HE ops as they are equipped with massive numbers of integer units and abundant memory bandwidth. However, the majority of prior works did not handle bootstrapping (Al Badawi et al., 2019, 2018, 2020; Jung et al., 2021b). (Jung et al., 2021a) was the first work to support CKKS bootstrapping on a GPU. By fusing GPU kernels, (Jung et al., 2021a) reduced off-chip accesses and achieved 242× faster bootstrapping than a CPU. However, the lack of on-chip storage forces some kernels to remain unfused (Kim et al., 2020a). BTS holds all temporary data on-chip, minimizing off-chip accesses.

FPGA/ASIC acceleration: A different set of works accelerates HE using FPGAs or ASICs, but most did not consider bootstrapping (Riazi et al., 2020; Roy et al., 2019; Kim et al., 2020b, 2019; Reagen et al., 2021). HEAX (Riazi et al., 2020) dedicated hardware to CKKS mult on an FPGA, reaching a 200× performance gain over a CPU implementation. However, its design is fixed to a limited set of parameters and does not consider bootstrapping. Cheetah (Reagen et al., 2021) introduced algorithmic optimizations for an HE-based DNN and proposed an accelerator design suitable for them. Instead of bootstrapping, Cheetah uses multi-party computation (MPC) to mitigate errors during HE operation: it sends a ciphertext with error back to the client, who recrypts it as a fresh ciphertext. In MPC, the network latency of the frequent communication with the client limits the performance (van der Hagen and Lucia, 2021), introducing a different challenge compared to FHE. The accelerator design of Cheetah targets small ciphertexts for MPC, which is not suitable for FHE (Samardzic et al., 2021). F1 (Samardzic et al., 2021) is the first ASIC design that partially supports bootstrapping. It is a programmable accelerator supporting multiple FHE schemes, including CKKS and BGV. F1 achieves impressive performance on various LHE applications as it provides tailored high-throughput computation units and stores evks on-chip, minimizing the number of off-chip accesses. However, F1 targets parameter sets with a low degree N, thus supporting only non-packed (single-slot) bootstrapping, whose throughput is greatly degraded compared to BTS. F1 is 151.4mm² in size at a 12/14nm technology node and shows a TDP of 180.4W excluding HBM power.

8. Conclusion

We have proposed an accelerator architecture for fully homomorphic encryption (FHE), primarily optimized for the throughput of bootstrapping encrypted data. By analyzing the impact of key parameter choices on the bootstrapping performance of CKKS, an emerging HE scheme, we derived design principles for bootstrappable HE accelerators and proposed BTS, which distributes massively parallel processing elements connected through a network-on-chip tailored to the unique traffic patterns of the number theoretic transform and automorphism, the critical functions of HE operations. We designed BTS to balance off-chip memory accesses, on-chip data reuse, and the computation required for bootstrapping. With BTS, we obtained a 2,237× speedup in HE multiplication throughput and a 5,556× speedup in CNN inference compared to state-of-the-art CPU implementations.

This work was supported in part by Institute of Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. 2020-0-00840, 40%) and National Research Foundation of Korea (NRF) grant funded by the Korean government (MSIT) (No. 2020R1A2C2010601, 60%). The EDA tool was supported by the IC Design Education Center (IDEC), Korea. Sangpyo Kim is with the Department of Intelligence and Information, Seoul National University. Jung Ho Ahn, the corresponding author, is with the Department of Intelligence and Information, the Institute of Computer Technology, and the Research Institute for Convergence Science, Seoul National University, Seoul, South Korea.


  • Ahn et al. (2009) Jung Ho Ahn, Nathan L. Binkert, Al Davis, Moray McLaren, and Robert S. Schreiber. 2009. HyperX: Topology, Routing, and Packaging of Efficient Large-Scale Networks. In SC.
  • Al Badawi et al. (2020) Ahmad Al Badawi, Louie Hoang, Chan Fook Mun, Kim Laine, and Khin Mi Mi Aung. 2020. Privft: Private and Fast Text Classification with Homomorphic Encryption. IEEE Access 8 (2020), 226544–226556.
  • Al Badawi et al. (2019) Ahmad Al Badawi, Yuriy Polyakov, Khin Mi Mi Aung, Bharadwaj Veeravalli, and Kurt Rohloff. 2019. Implementation and Performance Evaluation of RNS Variants of the BFV Homomorphic Encryption Scheme. IEEE Transactions on Emerging Topics in Computing 9, 2 (2019), 941–956.
  • Al Badawi et al. (2018) Ahmad Al Badawi, Bharadwaj Veeravalli, Chan Fook Mun, and Khin Mi Mi Aung. 2018. High-Performance FV Somewhat Homomorphic Encryption on GPUs: An Implementation Using CUDA. IACR Transactions on Cryptographic Hardware and Embedded Systems 2018, 2 (2018), 143–163.
  • Albrecht et al. (2019) Martin R. Albrecht, Melissa Chase, Hao Chen, Jintai Ding, Shafi Goldwasser, Sergey Gorbunov, Shai Halevi, Jeffrey Hoffstein, Kim Laine, Kristin E. Lauter, Satya Lokam, Daniele Micciancio, Dustin Moody, Travis Morrison, Amit Sahai, and Vinod Vaikuntanathan. 2019. Homomorphic Encryption Standard. IACR Cryptology ePrint Archive 939 (2019).
  • Auth et al. (2017) Chris Auth, A. Aliyarukunju, M. Asoro, D. Bergstrom, V. Bhagwat, J. Birdsall, N. Bisnik, M. Buehler, V. Chikarmane, G. Ding, Q. Fu, H. Gomez, W. Han, D. Hanken, M. Haran, M. Hattendorf, R. Heussner, H. Hiramatsu, B. Ho, S. Jaloviar, I. Jin, S. Joshi, S. Kirby, S. Kosaraju, H. Kothari, G. Leatherman, K. Lee, J. Leib, A. Madahavan, K. Marla, H. Meyer, T. Mule, C. Parker, S. Parthasarathy, C. Pelto, L. Pipes, I. Post, M. Prince, A. Rahman, S. Rajamani, A. Saha, J. Dacuna Santos, M. Sharma, V. Sharma, J. Shin, P. Sinha, P. Smith, M. Sprinkle, A. St. Amour, C. Staus, R. Suri, D. Towner, A. Tripathi, A. Tura, C. Ward, and A. Yeoh. 2017. A 10nm High Performance and Low-Power CMOS Technology Featuring 3rd Generation FinFET Transistors, Self-Aligned Quad Patterning, Contact over Active Gate and Cobalt Local Interconnects. In IEEE International Electron Devices Meeting.
  • Bajard et al. (2016) Jean-Claude Bajard, Julien Eynard, M. Anwar Hasan, and Vincent Zucca. 2016. A Full RNS Variant of FV Like Somewhat Homomorphic Encryption Schemes. In Selected Areas in Cryptography.
  • Banerjee and Mehrotra (2002) Kaustav Banerjee and Amit Mehrotra. 2002. A Power-Optimal Repeater Insertion Methodology for Global Interconnects in Nanometer Designs. IEEE Transactions on Electron Devices 49, 11 (2002), 2001–2007.
  • Barrett (1986) Paul Barrett. 1986. Implementing the Rivest Shamir and Adleman public key encryption algorithm on a standard digital signal processor. In Annual International Conference on the Theory and Application of Cryptographic Techniques.
  • Bichan et al. (2020) Mike Bichan, Clifford Ting, Bahram Zand, Jing Wang, Ruslana Shulyzki, James Guthrie, Katya Tyshchenko, Junhong Zhao, Alireza Parsafar, Eric Liu, Aynaz Vatankhahghadim, Shaham Sharifian, Aleksey Tyshchenko, Michael De Vita, Syed Rubab, Sitaraman Iyer, Fulvio Spagna, and Noam Dolev. 2020. A 32Gb/s NRZ 37dB SerDes in 10nm CMOS to Support PCI Express Gen 5 Protocol. In IEEE Custom Integrated Circuits Conference.
  • Boemer et al. (2021) Fabian Boemer, Sejun Kim, Gelila Seifu, Fillipe D. M. de Souza, and Vinodh Gopal. 2021. Intel HEXL: Accelerating Homomorphic Encryption with Intel AVX512-IFMA52. In Workshop on Encrypted Computing & Applied Homomorphic Cryptography.
  • Bossuat et al. (2021) Jean-Philippe Bossuat, Christian Mouchet, Juan Ramón Troncoso-Pastoriza, and Jean-Pierre Hubaux. 2021. Efficient Bootstrapping for Approximate Homomorphic Encryption with Non-sparse Keys. In Annual International Conference on the Theory and Applications of Cryptographic Techniques.
  • Brakerski et al. (2014) Zvika Brakerski, Craig Gentry, and Vinod Vaikuntanathan. 2014. (Leveled) Fully Homomorphic Encryption without Bootstrapping. ACM Transactions on Computing Theory 6, 3 (2014).
  • Brakerski and Vaikuntanathan (2014) Zvika Brakerski and Vinod Vaikuntanathan. 2014. Efficient Fully Homomorphic Encryption from (Standard) LWE. SIAM J. Comput. 43, 2 (2014), 831–871.
  • Brutzkus et al. (2019) Alon Brutzkus, Ran Gilad-Bachrach, and Oren Elisha. 2019. Low Latency Privacy Preserving Inference. In International Conference on Machine Learning, Vol. 97. 812–821.
  • Chang et al. (2017) Jonathan Chang, Yen-Huei Chen, Wei-Min Chan, Sahil Preet Singh, Hank Cheng, Hidehiro Fujiwara, Jih-Yu Lin, Kao-Cheng Lin, John Hung, Robin Lee, Hung-Jen Liao, Jhon-Jhy Liaw, Quincy Li, Chih-Yung Lin, Mu-Chi Chiang, and Shien-Yang Wu. 2017. 12.1 A 7nm 256Mb SRAM in High-K Metal-Gate FinFET Technology with Write-Assist Circuitry for Low-VMIN Applications. In IEEE International Solid-State Circuits Conference.
  • Chen et al. (2019) Hao Chen, Ilaria Chillotti, and Yongsoo Song. 2019. Improved Bootstrapping for Approximate Homomorphic Encryption. In Annual International Conference on the Theory and Applications of Cryptographic Techniques.
  • Chen et al. (2016) Yu-Hsin Chen, Joel Emer, and Vivienne Sze. 2016. Eyeriss: A Spatial Architecture for Energy-Efficient Dataflow for Convolutional Neural Networks. In ISCA.
  • Cheon et al. (2018a) Jung Hee Cheon, Kyoohyung Han, Andrey Kim, Miran Kim, and Yongsoo Song. 2018a. A Full RNS Variant of Approximate Homomorphic Encryption. In Selected Areas in Cryptography.
  • Cheon et al. (2018b) Jung Hee Cheon, Kyoohyung Han, Andrey Kim, Miran Kim, and Yongsoo Song. 2018b. Bootstrapping for Approximate Homomorphic Encryption. In Annual International Conference on the Theory and Applications of Cryptographic Techniques.
  • Cheon et al. (2019) Jung Hee Cheon, Minki Hhan, Seungwan Hong, and Yongha Son. 2019. A Hybrid of Dual and Meet-in-the-Middle Attack on Sparse and Ternary Secret LWE. IEEE Access 7 (2019), 89497–89506.
  • Cheon et al. (2017) Jung Hee Cheon, Andrey Kim, Miran Kim, and Yong Soo Song. 2017. Homomorphic Encryption for Arithmetic of Approximate Numbers. In International Conference on the Theory and Applications of Cryptology and Information Security.
  • Cheon et al. (2022) Jung Hee Cheon, Yongha Son, and Donggeon Yhee. 2022. Practical FHE Parameters against Lattice Attacks. Journal of the Korean Mathematical Society 59, 1 (2022), 35–51.
  • Chillotti et al. (2020) Ilaria Chillotti, Nicolas Gama, Mariya Georgieva, and Malika Izabachène. 2020. TFHE: Fast Fully Homomorphic Encryption Over the Torus. Journal of Cryptology 33, 1 (2020), 34–91.
  • Choquette et al. (2021) Jack Choquette, Wishwesh Gandhi, Olivier Giroux, Nick Stam, and Ronny Krashinsky. 2021. NVIDIA A100 Tensor Core GPU: Performance and Innovation. IEEE Micro 41, 2 (2021), 29–35.
  • Clark et al. (2017) Lawrence T Clark, Vinay Vashishtha, David M Harris, Samuel Dietrich, and Zunyan Wang. 2017. Design Flows and Collateral for the ASAP7 7nm FinFET Predictive Process Design Kit. In IEEE International Conference on Microelectronic Systems Education.
  • Clark et al. (2016) Lawrence T Clark, Vinay Vashishtha, Lucian Shifren, Aditya Gujja, Saurabh Sinha, Brian Cline, Chandarasekaran Ramamurthy, and Greg Yeric. 2016. ASAP7: A 7-nm FinFET Predictive Process Design Kit. Microelectronics Journal 53 (2016), 105–115.
  • Cooley and Tukey (1965) James W. Cooley and John W. Tukey. 1965. An Algorithm for the Machine Calculation of Complex Fourier Series. Math. Comp. 19, 90 (1965), 297–301.
  • CryptoLab Inc. (2018) CryptoLab Inc. 2018. HEAAN v2.1.
  • Curtis and Player (2019) Benjamin R. Curtis and Rachel Player. 2019. On the Feasibility and Impact of Standardising Sparse-secret LWE Parameter Sets for Homomorphic Encryption. In ACM Workshop on Encrypted Computing & Applied Homomorphic Cryptography.
  • Damgård et al. (2012) Ivan Damgård, Valerio Pastro, Nigel P. Smart, and Sarah Zakarias. 2012. Multiparty Computation from Somewhat Homomorphic Encryption. In Annual International Cryptology Conference.
  • Dathathri et al. (2020) Roshan Dathathri, Blagovesta Kostova, Olli Saarikivi, Wei Dai, Kim Laine, and Madan Musuvathi. 2020. EVA: An Encrypted Vector Arithmetic Language and Compiler for Efficient Homomorphic Computation. In ACM SIGPLAN International Conference on Programming Language Design and Implementation.
  • Deng et al. (2009) Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. 2009. ImageNet: A Large-Scale Hierarchical Image Database. In IEEE Conference on Computer Vision and Pattern Recognition.
  • Deng (2012) Li Deng. 2012. The MNIST Database of Handwritten Digit Images for Machine Learning Research. IEEE Signal Processing Magazine 29, 6 (2012), 141–142.
  • EPFL-LDS (2021) EPFL-LDS. 2021. Lattigo v2.3.0.
  • Fan and Vercauteren (2012) Junfeng Fan and Frederik Vercauteren. 2012. Somewhat Practical Fully Homomorphic Encryption. IACR Cryptology ePrint Archive 144 (2012).
  • Gentry (2009) Craig Gentry. 2009. Fully Homomorphic Encryption Using Ideal Lattices. In ACM Symposium on Theory of Computing.
  • Gentry and Halevi (2011) Craig Gentry and Shai Halevi. 2011. Implementing Gentry’s Fully-Homomorphic Encryption Scheme. In Annual International Conference on the Theory and Applications of Cryptographic Techniques.
  • Han et al. (2019) Kyoohyung Han, Seungwan Hong, Jung Hee Cheon, and Daejun Park. 2019. Logistic Regression on Homomorphic Encrypted Data at Scale. In AAAI Conference on Artificial Intelligence.
  • Han and Ki (2020) Kyoohyung Han and Dohyeong Ki. 2020. Better Bootstrapping for Approximate Homomorphic Encryption. In Cryptographers’ Track at the RSA Conference.
  • Ho et al. (2001) Ron Ho, Kenneth Mai, and Mark Horowitz. 2001. The Future of Wires. Proc. IEEE 89, 4 (2001), 490–504.
  • Hong et al. (2021) Seungwan Hong, Seunghong Kim, Jiheon Choi, Younho Lee, and Jung Hee Cheon. 2021. Efficient Sorting of Homomorphic Encrypted Data With k-Way Sorting Network. IEEE Transactions on Information Forensics and Security 16 (2021), 4389–4404.
  • IEEE (2018) IEEE. 2018. International Roadmap for Devices and Systems: 2018. Technical Report.
  • JEDEC (2021) JEDEC. 2021. High Bandwidth Memory (HBM) DRAM. Technical Report JESD235D.
  • JEDEC (2022) JEDEC. 2022. High Bandwidth Memory DRAM (HBM3). Technical Report JESD238.
  • Jeong et al. (2018) W.C. Jeong, S. Maeda, H.J. Lee, K.W. Lee, T.J. Lee, D.W. Park, B.S. Kim, J.H. Do, T. Fukai, D.J. Kwon, K.J. Nam, W.J. Rim, M.S. Jang, H.T. Kim, Y.W. Lee, J.S. Park, E.C. Lee, D.W. Ha, C.H. Park, H.J. Cho, S.M. Jung, and H.K. Kang. 2018. True 7nm Platform Technology featuring Smallest FinFET and Smallest SRAM cell by EUV, Special Constructs and 3rd Generation Single Diffusion Break. In IEEE Symposium on VLSI Technology.
  • Jouppi et al. (2021) Norman P. Jouppi, Doe Hyun Yoon, Matthew Ashcraft, Mark Gottscho, Thomas B. Jablin, George Kurian, James Laudon, Sheng Li, Peter C. Ma, Xiaoyu Ma, Thomas Norrie, Nishant Patil, Sushma Prasad, Cliff Young, Zongwei Zhou, and David A. Patterson. 2021. Ten Lessons From Three Generations Shaped Google’s TPUv4i: Industrial Product. In ISCA.
  • Jung et al. (2021a) Wonkyung Jung, Sangpyo Kim, Jung Ho Ahn, Jung Hee Cheon, and Younho Lee. 2021a. Over 100x Faster Bootstrapping in Fully Homomorphic Encryption through Memory-centric Optimization with GPUs. IACR Transactions on Cryptographic Hardware and Embedded Systems 2021, 4 (2021), 114–148.
  • Jung et al. (2021b) Wonkyung Jung, Eojin Lee, Sangpyo Kim, Jongmin Kim, Namhoon Kim, Keewoo Lee, Chohong Min, Jung Hee Cheon, and Jung Ho Ahn. 2021b. Accelerating Fully Homomorphic Encryption Through Architecture-Centric Analysis and Optimization. IEEE Access 9 (2021), 98772–98789.
  • Juvekar et al. (2018) Chiraag Juvekar, Vinod Vaikuntanathan, and Anantha Chandrakasan. 2018. GAZELLE: A Low Latency Framework for Secure Neural Network Inference. In USENIX Security Symposium.
  • Kim et al. (2007) John Kim, James Balfour, and William Dally. 2007. Flattened Butterfly Topology for On-Chip Networks. In MICRO. 172–182.
  • Kim et al. (2020a) Sangpyo Kim, Wonkyung Jung, Jaiyoung Park, and Jung Ho Ahn. 2020a. Accelerating Number Theoretic Transformations for Bootstrappable Homomorphic Encryption on GPUs. In IEEE International Symposium on Workload Characterization.
  • Kim et al. (2019) Sunwoong Kim, Keewoo Lee, Wonhee Cho, Jung Hee Cheon, and Rob A. Rutenbar. 2019. FPGA-based Accelerators of Fully Pipelined Modular Multipliers for Homomorphic Encryption. In International Conference on ReConFigurable Computing and FPGAs.
  • Kim et al. (2020b) Sunwoong Kim, Keewoo Lee, Wonhee Cho, Yujin Nam, Jung Hee Cheon, and Rob A. Rutenbar. 2020b. Hardware Architecture of a Number Theoretic Transform for a Bootstrappable RNS-based Homomorphic Encryption Scheme. In IEEE International Symposium on Field-Programmable Custom Computing Machines.
  • Knowles (2021) Simon Knowles. 2021. Graphcore. In IEEE Hot Chips 33 Symposium.
  • Krizhevsky and Hinton (2009) Alex Krizhevsky and Geoffrey Hinton. 2009. Learning Multiple Layers of Features from Tiny Images. Technical Report. University of Toronto.
  • Lee et al. (2021b) Junghyun Lee, Eunsang Lee, Joon-Woo Lee, Yongjune Kim, Young-Sik Kim, and Jong-Seon No. 2021b. Precise Approximation of Convolutional Neural Networks for Homomorphically Encrypted Data. arXiv preprint arXiv:2105.10879 (2021).
  • Lee et al. (2021a) Joon-Woo Lee, Eunsang Lee, Yongwoo Lee, Young-Sik Kim, and Jong-Seon No. 2021a. High-Precision Bootstrapping of RNS-CKKS Homomorphic Encryption Using Optimal Minimax Polynomial Approximation and Inverse Sine Function. In Annual International Conference on the Theory and Applications of Cryptographic Techniques.
  • Lee et al. (2022) Joon-Woo Lee, Hyungchul Kang, Yongwoo Lee, Woosuk Choi, Jieun Eom, Maxim Deryabin, Eunsang Lee, Junghyun Lee, Donghoon Yoo, Young-Sik Kim, and Jong-Seon No. 2022. Privacy-Preserving Machine Learning With Fully Homomorphic Encryption for Deep Neural Network. IEEE Access 10 (2022), 30039–30054.
  • Lee et al. (2020) Yongwoo Lee, Joonwoo Lee, Young-Sik Kim, HyungChul Kang, and Jong-Seon No. 2020. High-Precision and Low-Complexity Approximate Homomorphic Encryption by Error Variance Minimization. IACR Cryptology ePrint Archive 1549 (2020).
  • Medina and Dagan (2020) Eitan Medina and Eran Dagan. 2020. Habana Labs Purpose-Built AI Inference and Training Processor Architectures: Scaling AI Training Systems Using Standard Ethernet With Gaudi Processor. IEEE Micro 40, 2 (2020), 17–24.
  • Micron Technology, Inc. (2020) Micron Technology, Inc. 2020. 8GB/16GB HBM2E with ECC. Technical Report CCM005-1412786195-10301 - Rev. D 08/2020 EN.
  • Moon et al. (2008) Peter Moon, Vinay Chikarmane, Kevin Fischer, Rohit Grover, Tarek A Ibrahim, Doug Ingerly, Kevin J Lee, Chris Litteken, Tony Mule, and Sarah Williams. 2008. Process and Electrical Results for the On-die Interconnect Stack for Intel’s 45nm Process Generation. Intel Technology Journal 12, 2 (2008).
  • Narasimha et al. (2017) S. Narasimha, B. Jagannathan, A. Ogino, D. Jaeger, B. Greene, C. Sheraw, K. Zhao, B. Haran, U. Kwon, A. K. M. Mahalingam, B. Kannan, B. Morganfeld, J. Dechene, C. Radens, A. Tessier, A. Hassan, H. Narisetty, I. Ahsan, M. Aminpur, C. An, M. Aquilino, A. Arya, R. Augur, N. Baliga, R. Bhelkar, G. Biery, A. Blauberg, N. Borjemscaia, A. Bryant, L. Cao, V. Chauhan, M. Chen, L. Cheng, J. Choo, C. Christiansen, T. Chu, B. Cohen, R. Coleman, D. Conklin, S. Crown, A. da Silva, D. Dechene, G. Derderian, S. Deshpande, G. Dilliway, K. Donegan, M. Eller, Y. Fan, Q. Fang, A. Gassaria, R. Gauthier, S. Ghosh, G. Gifford, T. Gordon, M. Gribelyuk, G. Han, J.H. Han, K. Han, M. Hasan, J. Higman, J. Holt, L. Hu, L. Huang, C. Huang, T. Hung, Y. Jin, J. Johnson, S. Johnson, V. Joshi, M. Joshi, P. Justison, S. Kalaga, T. Kim, W. Kim, R. Krishnan, B. Krishnan, K. Anil, M. Kumar, J. Lee, R. Lee, J. Lemon, S.L. Liew, P. Lindo, M. Lingalugari, M. Lipinski, P. Liu, J. Liu, S. Lucarini, W. Ma, E. Maciejewski, S. Madisetti, A. Malinowski, J. Mehta, C. Meng, S. Mitra, C. Montgomery, H. Nayfeh, T. Nigam, G. Northrop, K. Onishi, C. Ordonio, M. Ozbek, R. Pal, S. Parihar, O. Patterson, E. Ramanathan, I. Ramirez, R. Ranjan, J. Sarad, V. Sardesai, S. Saudari, C. Schiller, B. Senapati, C. Serrau, N. Shah, T. Shen, H. Sheng, J. Shepard, Y. Shi, M.C. Silvestre, D. Singh, Z. Song, J. Sporre, P. Srinivasan, Z. Sun, A. Sutton, R. Sweeney, K. Tabakman, M. Tan, X. Wang, E. Woodard, G. Xu, D. Xu, T. Xuan, Y. Yan, J. Yang, K.B. Yeap, M. Yu, A. Zainuddin, J. Zeng, K. Zhang, M. Zhao, Y. Zhong, R. Carter, C.H. Lin, S. Grunow, C. Child, M. Lagus, R. Fox, E. Kaste, G. Gomba, S. Samavedam, P. Agnello, and D. K. Sohn. 2017. A 7nm CMOS Technology Platform for Mobile and High Performance Compute Application. In IEEE International Electron Devices Meeting.
  • NVIDIA Corporation (2017) NVIDIA Corporation. 2017. NVIDIA Tesla V100 GPU Architecture. Technical Report WP-08608-001_v1.1.
  • O’Connor et al. (2017) Mike O’Connor, Niladrish Chatterjee, Donghyuk Lee, John Wilson, Aditya Agrawal, Stephen W Keckler, and William J Dally. 2017. Fine-Grained DRAM: Energy-Efficient DRAM for Extreme Bandwidth Systems. In MICRO.
  • PALISADE Project (2021) PALISADE Project. 2021. PALISADE Lattice Cryptography Library (release 1.11.5).
  • Passas et al. (2012) Giorgos Passas, Manolis Katevenis, and Dionisios Pnevmatikatos. 2012. Crossbar NoCs are Scalable Beyond 100 Nodes. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 31, 4 (2012), 573–585.
  • Prabhakar and Jairath (2021) Raghu Prabhakar and Sumti Jairath. 2021. SambaNova SN10 RDU: Accelerating Software 2.0 with Dataflow. In IEEE Hot Chips 33 Symposium.
  • Ranganathan et al. (2021) Parthasarathy Ranganathan, Daniel Stodolsky, Jeff Calow, Jeremy Dorfman, Marisabel Guevara, Clinton Wills Smullen IV, Aki Kuusela, Raghu Balasubramanian, Sandeep Bhatia, Prakash Chauhan, Anna Cheung, In Suk Chong, Niranjani Dasharathi, Jia Feng, Brian Fosco, Samuel Foss, Ben Gelb, Sara J. Gwin, Yoshiaki Hase, Da-ke He, C. Richard Ho, Roy W. Huffman Jr., Elisha Indupalli, Indira Jayaram, Poonacha Kongetira, Cho Mon Kyaw, Aaron Laursen, Yuan Li, Fong Lou, Kyle A. Lucke, JP Maaninen, Ramon Macias, Maire Mahony, David Alexander Munday, Srikanth Muroor, Narayana Penukonda, Eric Perkins-Argueta, Devin Persaud, Alex Ramirez, Ville-Mikko Rautio, Yolanda Ripley, Amir Salek, Sathish Sekar, Sergey N. Sokolov, Rob Springer, Don Stark, Mercedes Tan, Mark S. Wachsler, Andrew C. Walton, David A. Wickeraad, Alvin Wijaya, and Hon Kwan Wu. 2021. Warehouse-Scale Video Acceleration: Co-Design and Deployment in the Wild. In ACM International Conference on Architectural Support for Programming Languages and Operating Systems.
  • Reagen et al. (2021) Brandon Reagen, Woo-Seok Choi, Yeongil Ko, Vincent T. Lee, Hsien-Hsin S. Lee, Gu-Yeon Wei, and David Brooks. 2021. Cheetah: Optimizing and Accelerating Homomorphic Encryption for Private Inference. In HPCA.
  • Regev (2009) Oded Regev. 2009. On Lattices, Learning with Errors, Random Linear Codes, and Cryptography. J. ACM 56, 6 (2009), 40 pages.
  • Riazi et al. (2020) M. Sadegh Riazi, Kim Laine, Blake Pelton, and Wei Dai. 2020. HEAX: An Architecture for Computing on Encrypted Data. In ASPLOS.
  • Roy et al. (2019) Sujoy Sinha Roy, Furkan Turan, Kimmo Järvinen, Frederik Vercauteren, and Ingrid Verbauwhede. 2019. FPGA-Based High-Performance Parallel Architecture for Homomorphic Computing on Encrypted Data. In HPCA.
  • Samardzic et al. (2021) Nikola Samardzic, Axel Feldmann, Aleksandar Krastev, Srinivas Devadas, Ronald Dreslinski, Christopher Peikert, and Daniel Sanchez. 2021. F1: A Fast and Programmable Accelerator for Fully Homomorphic Encryption. In MICRO.
  • Shafaei et al. (2014) Alireza Shafaei, Yanzhi Wang, Xue Lin, and Massoud Pedram. 2014. FinCACTI: Architectural Analysis and Modeling of Caches with Deeply-Scaled FinFET Devices. In IEEE Computer Society Annual Symposium on VLSI.
  • Son (2021) Yongha Son. 2021. SparseLWE-estimator.
  • Song et al. (2018) Taejoong Song, Jonghoon Jung, Woojin Rim, Hoonki Kim, Yongho Kim, Changnam Park, Jeongho Do, Sunghyun Park, Sungwee Cho, Hyuntaek Jung, Bongjae Kwon, Hyun-Su Choi, Jaeseung Choi, and Jong Shik Yoon. 2018. A 7nm FinFET SRAM Using EUV Lithography with Dual Write-Driver-Assist Circuitry for Low-Voltage Applications. In IEEE International Solid-State Circuits Conference.
  • Thoziyoor et al. (2008) Shyamkumar Thoziyoor, Jung Ho Ahn, Matteo Monchiero, Jay B. Brockman, and Norman P. Jouppi. 2008. A Comprehensive Memory Modeling Tool and Its Application to the Design and Analysis of Future Memory Hierarchies. In ISCA.
  • van der Hagen and Lucia (2021) McKenzie van der Hagen and Brandon Lucia. 2021. Practical Encrypted Computing for IoT Clients. arXiv preprint arXiv:2103.06743 (2021).
  • Wu et al. (2016) Shien-Yang Wu, C.Y. Lin, M.C. Chiang, J.J. Liaw, J.Y. Cheng, S.H. Yang, C.H. Tsai, P.N. Chen, T. Miyashita, C.H. Chang, V.S. Chang, K.H. Pan, J.H. Chen, Y.S. Mor, K.T. Lai, C.S. Liang, H.F. Chen, S.Y. Chang, C.J. Lin, C.H. Hsieh, R.F. Tsui, C.H. Yao, C.C. Chen, R. Chen, C.H. Lee, H.J. Lin, C.W. Chang, K.W. Chen, M.H. Tsai, K.S. Chen, Y. Ku, and S.M. Jang. 2016. A 7nm CMOS Platform Technology Featuring 4th Generation FinFET Transistors with a 0.027um2 High Density 6-T SRAM cell for Mobile SoC Applications. In IEEE International Electron Devices Meeting.
  • Xin et al. (2020) Guozhu Xin, Jun Han, Tianyu Yin, Yuchao Zhou, Jianwei Yang, Xu Cheng, and Xiaoyang Zeng. 2020. VPQC: A Domain-Specific Vector Processor for Post-Quantum Cryptography Based on RISC-V Architecture. IEEE Transactions on Circuits and Systems I: Regular Papers 67, 8 (2020), 2672–2684.
  • Xin et al. (2021) Guozhu Xin, Yifan Zhao, and Jun Han. 2021. A Multi-Layer Parallel Hardware Architecture for Homomorphic Computation in Machine Learning. In IEEE International Symposium on Circuits and Systems.
  • Xing and Li (2021) Yufei Xing and Shuguo Li. 2021. A Compact Hardware Implementation of CCA-secure Key Exchange Mechanism CRYSTALS-KYBER on FPGA. IACR Transactions on Cryptographic Hardware and Embedded Systems 2021, 2 (2021), 328–356.
  • Zhang et al. (2021) Ye Zhang, Shuo Wang, Xian Zhang, Jiangbin Dong, Xingzhong Mao, Fan Long, Cong Wang, Dong Zhou, Mingyu Gao, and Guangyu Sun. 2021. PipeZK: Accelerating Zero-Knowledge Proof with a Pipelined Architecture. In ISCA.