1. Introduction
Homomorphic encryption (HE) allows computations on encrypted data, or ciphertexts (cts). In the machine-learning-as-a-service (MLaaS) era, HE is highlighted as an enabler for privacy-preserving cloud computing, as it allows safe offloading of private data. Because HE schemes are based on the learning-with-errors (LWE) problem (Regev, 2009), they are noisy in nature. Noise accumulates as we apply a sequence of computations on cts. This limits the number of computations that can be performed and hinders the applicability of HE for practical purposes, such as in deep-learning models with high accuracy (Lee et al., 2022). To overcome this limitation, fully HE (FHE) (Gentry, 2009) was proposed, featuring an operation (op) called bootstrapping that “refreshes” a ct and hence permits an unlimited number of computations on it. Among the multiple HE schemes that support FHE, CKKS (Cheon et al., 2017) is one of the prime candidates as it supports fixed-point real-number arithmetic.

One of the main barriers to adopting HE has been its high computational and memory overhead. New schemes (Brakerski et al., 2014; Fan and Vercauteren, 2012; Brakerski and Vaikuntanathan, 2014; Chillotti et al., 2020; Cheon et al., 2017) and algorithmic optimizations (Han and Ki, 2020; Bossuat et al., 2021; Al Badawi et al., 2019), including the use of the residue number system (Cheon et al., 2018a; Bajard et al., 2016), have reduced this overhead and resulted in at least a 1,000,000× speedup (Bossuat et al., 2021) compared to the first HE implementation (Gentry and Halevi, 2011). However, even with such efforts, HE ops still experience slowdowns of tens of thousands of times compared to unencrypted ops (Jung et al., 2021b). To tackle this, prior works have sought hardware solutions to accelerate HE ops, including CPU extensions (Jung et al., 2021b; Boemer et al., 2021), GPUs (Jung et al., 2021a; Al Badawi et al., 2020, 2019, 2018), FPGAs (Riazi et al., 2020; Roy et al., 2019; Kim et al., 2020b, 2019), and ASICs (Samardzic et al., 2021).

Table 1. Comparison of HE acceleration works.

| Implementation | Target | Refreshed slots† per bootstrap | Parallelism | FHE mult thruput |
|---|---|---|---|---|
| Lattigo (EPFL-LDS, 2021) | CPU | 32,768 | – | 610K |
| 100x (Jung et al., 2021a) | GPU | 65,536 | SIMT | 0.11M |
| (Roy et al., 2019) | FPGA | – | rPLP | – |
| HEAX (Riazi et al., 2020) | FPGA | – | rPLP | – |
| F1 (Samardzic et al., 2021) | ASIC | 1‡ | rPLP | 4K |
| BTS | ASIC | 65,536 | CLP | 20M |

† Data elements that can be packed in a ct for SIMD execution.
Residue-polynomial-level parallelism (rPLP) and coefficient-level parallelism (CLP) can be exploited in parallelizing HE ops (Section 4.3).
‡ F1 only supports single-slot bootstrapping, which has low throughput.
However, prior acceleration works mostly targeted small problem sizes, with a small target N (the length of a ct), and they lack bootstrapping support. Bootstrapping, which is necessary to reduce the impact of noise, occurs frequently in most FHE applications and represents the highest expense. For example, bootstrapping occurs more than 1,000 times for a single ResNet-20 inference (Lee et al., 2022), and each instance of bootstrapping can take dozens of seconds on the state-of-the-art CPU implementation (EPFL-LDS, 2021) and hundreds of milliseconds on a GPU (Jung et al., 2021a). Most prior custom-hardware acceleration works (Roy et al., 2019; Riazi et al., 2020) do not support bootstrapping at all, while F1 (Samardzic et al., 2021) demonstrated a bootstrapping time for CKKS but with limited throughput (Table 1).
We propose BTS, a bootstrapping-oriented FHE accelerator that is Bootstrappable, Technology-driven, and Secure. First, we identify the limitations imposed by contemporary fabrication technology when designing an HE accelerator, analyzing the implications of various conflicting requirements for the performance and security of FHE under such a constrained design space. This allows us to pinpoint appropriate optimization targets and requirements when designing the FHE accelerator. Second, we build a balanced architecture on top of those observations; we analyze the characteristics of HE functions to determine the appropriate number of processing elements (PEs) and a proper data mapping that balances computation and data movement when using our FHE-optimized parameters. We also choose to exploit coefficient-level parallelism (CLP), instead of residue-polynomial-level parallelism (rPLP), to evade load-imbalance issues. Finally, we devise a novel PE microarchitecture that efficiently handles HE functions including base conversion, and a time-multiplexed NoC structure that manages both number-theoretic-transform and automorphism functions.
Through these detailed studies, BTS achieves a 5,714× speedup in multiplicative throughput over F1, the state-of-the-art ASIC implementation, when bootstrapping is properly considered. Also, BTS significantly reduces the training time of logistic regression (Han et al., 2019) compared to CPU (by 1,306×) and GPU (by 27×) implementations, and can execute a ResNet-20 inference 5,556× faster than the prior CPU implementation (Lee et al., 2022).
In this paper, we make the following key contributions:


We provide a detailed analysis of the interplay of HE parameters impacting the performance of FHE accelerators.

We propose BTS, a novel accelerator architecture equipped with massively parallel compute units and NoCs tailored to the mathematical traits of FHE ops.

BTS is the first accelerator targeting practical bootstrapping, enabling unbounded multiplicative depth, which is essential for complex workloads.
2. Background
We provide a brief overview of HE and CKKS (Cheon et al., 2017) in particular. Table 2 summarizes the key parameters and notations we use in this paper.
Table 2. Key parameters and notations.

| Symbol | Definition |
|---|---|
| Q | (Prime) moduli product |
| q_i | (Prime) moduli |
| Q_j | Modulus factors |
| P | Special (prime) moduli product |
| p_i | Special (prime) moduli |
| evk_mult | Evaluation key (evk) for HMult |
| evk_rot^(r) | evk for HRot with a rotation amount of r |
| N | The degree of a polynomial |
| L | Maximum (multiplicative) level |
| ℓ | Current (multiplicative) level of a ciphertext |
| L_boot | Levels consumed at bootstrapping |
| k | The number of special prime moduli |
| dnum | Decomposition number |
| λ | Security parameter of a given CKKS instance |
2.1. Homomorphic Encryption (HE)
HE enables direct computation on encrypted data, referred to as ciphertext (ct), without decryption. There are two types of HE. Leveled HE (LHE) supports a limited number of operations (ops) on a ct due to the noise that accumulates after the ops. In contrast, Fully HE (FHE) allows an unlimited number of ops on cts through bootstrapping (Gentry, 2009), which “refreshes” a ct and lowers the impact of noise. LHE has limited applicability¹; in the field of privacy-preserving deep-learning inference, for instance, simple/shallow networks such as LoLa (Brutzkus et al., 2019) can be implemented with LHE, but only with limited accuracy (74.1%). More accurate models such as ResNet-20 (Lee et al., 2022) (92.43%) demand many more ops applied to cts and thus an FHE implementation.

¹ The hybrid use of LHE with multi-party computation (Damgård et al., 2012) allows for a broader range of applications. However, such an approach has a different bottleneck of communication cost and intense client-side computations.
While other FHE schemes support integer (Brakerski et al., 2014; Brakerski and Vaikuntanathan, 2014; Fan and Vercauteren, 2012) or boolean (Chillotti et al., 2020) data types, CKKS (Cheon et al., 2017) supports fixedpoint complex (real) numbers. As many realworld applications such as MLaaS (Machine Learning as a Service) require arithmetic on real numbers, CKKS has become one of the most prominent FHE schemes. In this paper, we focus on accelerating CKKS ops; however, our proposed architecture is applicable to other popular FHE schemes (e.g., BGV (Brakerski et al., 2014) and BFV (Brakerski and Vaikuntanathan, 2014; Fan and Vercauteren, 2012; Bajard et al., 2016)) that share similar core ops.
2.2. CKKS: an emerging HE scheme
CKKS first encodes a message, a vector of complex numbers, into a plaintext m(X), which is a polynomial in a cyclotomic polynomial ring R_Q = Z_Q[X]/(X^N + 1). The coefficients are integers modulo Q, and the number of coefficients (or degree) is N, where N is a power-of-two integer. For a given N, a message with up to N/2 complex numbers can be packed into a single plaintext in CKKS. Each element within a packed message is referred to as a slot. After encoding (or packing), element-wise multiplication (mult) and addition between two messages can be done through polynomial operations between plaintexts. CKKS then encrypts a plaintext m into a ct based on the following equation:

ct = (b, a) ∈ R_Q², where b = −a·s + m + e,

where s is a secret key, a is a random polynomial, and e is a small Gaussian error polynomial required for the LWE security guarantee (Albrecht et al., 2019). CKKS decrypts a ct by computing b + a·s (mod Q), which approximates m with small errors.
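As a toy illustration of the encryption relation above, the following Python sketch builds b = −a·s + m + e in a negacyclic ring and checks that decryption recovers the message up to the small error. All parameters here (N = 8, Q = 2^30, ternary secret, ±1 errors) are illustrative only: they are far too small to be secure and omit CKKS encoding and scaling.

```python
import random

N, Q = 8, 2**30   # toy parameters; real CKKS uses N >= 2**16 (INSECURE here)

def polymul(a, b):
    """Negacyclic convolution: multiplication mod (X^N + 1, Q)."""
    res = [0] * N
    for i in range(N):
        for j in range(N):
            k = i + j
            if k < N:
                res[k] = (res[k] + a[i] * b[j]) % Q
            else:
                res[k - N] = (res[k - N] - a[i] * b[j]) % Q  # X^N = -1 wrap
    return res

def polyadd(a, b):
    return [(x + y) % Q for x, y in zip(a, b)]

def centered(x):
    """Map a residue in [0, Q) to the centered range (-Q/2, Q/2]."""
    return x - Q if x > Q // 2 else x

s = [random.choice((-1, 0, 1)) % Q for _ in range(N)]   # ternary secret key
m = [random.randrange(1000) for _ in range(N)]          # small "plaintext"
a = [random.randrange(Q) for _ in range(N)]             # uniform random polynomial
e = [random.choice((-1, 0, 1)) % Q for _ in range(N)]   # small error

# b = -a*s + m + e, so ct = (b, a)
b = polyadd([(-x) % Q for x in polymul(a, s)], polyadd(m, e))
ct = (b, a)

# decryption: b + a*s = m + e (mod Q)
dec = polyadd(ct[0], polymul(ct[1], s))
recovered = [centered(x) for x in dec]
```

Each recovered coefficient differs from the message coefficient by at most the error magnitude, mirroring the "approximates m with small errors" property.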
HE is mainly bottlenecked by the high computational complexity of polynomial ops. As each coefficient of a polynomial is a large integer (up to 1,000s of bits) and the degree is high (even surpassing 100,000), an op between two polynomials has high compute and data-transfer costs. To reduce the computational complexity, HE schemes using the residue number system (RNS) (Bajard et al., 2016; Cheon et al., 2018a) have been proposed. For example, Full-RNS CKKS (Cheon et al., 2018a) sets Q as the product of word-sized (prime) moduli, Q = ∏_{i=0}^{L} q_i, for a given integer L. Using the Chinese remainder theorem (Eq. 1), we represent a polynomial in R_Q with (L+1) residue polynomials in R_{q_i}, whose coefficients are residues obtained by performing modulo q_i (represented as [·]_{q_i}) on the large coefficients:

(1)  R_Q ≅ R_{q_0} × R_{q_1} × ⋯ × R_{q_L},  a ↦ ([a]_{q_0}, [a]_{q_1}, …, [a]_{q_L})

Then, we can convert an op involving two polynomials into ops between the residue polynomials with word-sized coefficients (≤ 64 bits) corresponding to the same q_i, avoiding costly big-integer arithmetic with carry propagation. Full-RNS CKKS provides an 8× speedup over plain CKKS (Cheon et al., 2018a); thus, we adopt Full-RNS CKKS as our CKKS implementation, representing a polynomial in R_Q as an (L+1)×N matrix of residues, and a ct as a pair of such matrices.
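The RNS idea can be illustrated in a few lines of Python. The moduli below are toy values, not the word-sized primes an actual CKKS instance would use; a big multiplication is carried out independently per small modulus and recovered by CRT.

```python
from math import prod

# Word-sized coprime moduli (toy primes); Q is their product
moduli = [1009, 1013, 1019]
Q = prod(moduli)

def to_rns(x):
    """Represent x in [0, Q) by its residues modulo each q_i."""
    return [x % q for q in moduli]

def from_rns(residues):
    """CRT reconstruction: recover x in [0, Q) from its residues."""
    x = 0
    for r, q in zip(residues, moduli):
        q_hat = Q // q                       # product of the other moduli
        x += r * q_hat * pow(q_hat, -1, q)   # CRT basis element
    return x % Q

a, b = 123456, 654321
# Multiplication is done independently per small modulus -- no big-int carries
prod_rns = [(ra * rb) % q for ra, rb, q in zip(to_rns(a), to_rns(b), moduli)]
```

The same residue-wise pattern extends to polynomials: each residue polynomial is processed independently under its own modulus.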
2.3. Primitive operations (ops) of CKKS
Primitive HE ops of CKKS are introduced here; they can be combined to create more complex HE ops such as linear transformation and convolution. Given two ciphertexts

ct_0 = (b_0, a_0) and ct_1 = (b_1, a_1),

where b_i = −a_i·s + m_i + e_i, HE ops can be summarized as follows. HAdd performs an element-wise addition of ct_0 and ct_1:

(2)  HAdd(ct_0, ct_1) = (b_0 + b_1, a_0 + a_1)

HMult first computes the tensor product of the two cts:

(3)  (d_0, d_1, d_2) = (b_0·b_1, a_0·b_1 + a_1·b_0, a_0·a_1)

By computing d_0 + d_1·s + d_2·s², we recover m_0·m_1, albeit with error terms. Key-switching recombines the tensor product result to be decryptable with s using a public key, called an evaluation key (evk). An evk is a ct in R_{PQ} with a larger modulus PQ, where P = ∏_{i=0}^{k−1} p_i for given special (prime) moduli p_i. We express an evk as a pair of (k+L+1)×N matrices. HMult is then computed using Eq. 4, which involves key-switching with an evk for mult, evk_mult:

(4)  HMult(ct_0, ct_1) = (d_0, d_1) + ⌊P^{−1} · d_2 · evk_mult⌉ (mod Q)
HRot circularly shifts a message vector by r slots. When a ct encrypts a message vector m = (m_0, m_1, …, m_{N/2−1}), after applying HRot with a rotation amount r, the rotated ciphertext encrypts m' = (m_r, m_{r+1}, …, m_{N/2−1}, m_0, …, m_{r−1}). HRot consists of an automorphism and key-switching. (b(X), a(X)) is mapped to (b(X^{5^r}), a(X^{5^r})) after an automorphism. This moves the coefficients of a polynomial through the mapping i → δ_r(i), where i is the index of the coefficient and δ_r is:

(5)  δ_r(i) = i · 5^r mod 2N

(an index that lands in [N, 2N) wraps to δ_r(i) − N with its sign flipped, as X^N = −1).
Similar to HMult, key-switching brings back (b(X^{5^r}), a(X^{5^r})), which is only decryptable with s(X^{5^r}) after the automorphism, to be decryptable with s. An HRot with a different rotation amount r each requires a separate evk, evk_rot^(r). HRot is computed as follows, with σ_r denoting the automorphism a(X) → a(X^{5^r}):

(6)  HRot(ct; r) = (σ_r(b), 0) + ⌊P^{−1} · σ_r(a) · evk_rot^(r)⌉ (mod Q)
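The automorphism's coefficient permutation can be sketched as follows. The sign-flip handling reflects X^N = −1 in the negacyclic ring; the toy N is for illustration only, and the function name is ours.

```python
N = 16          # toy degree; the map works the same for any power-of-two N
r = 1           # rotation amount

def automorphism(coeffs, r):
    """Apply X -> X^(5^r) on a polynomial mod (X^N + 1).

    Coefficient i moves to index (i * 5^r) mod 2N; indices landing in
    [N, 2N) wrap around with a sign flip because X^N = -1.
    """
    out = [0] * N
    g = pow(5, r, 2 * N)
    for i, c in enumerate(coeffs):
        j = (i * g) % (2 * N)
        if j < N:
            out[j] = c
        else:
            out[j - N] = -c
    return out
```

Because 5^r is odd (hence invertible mod 2N), the map is a permutation of coefficient indices up to sign, which is what makes a fixed-wiring hardware implementation possible.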
HE applications require other HE ops, such as an addition or mult of a ct with a scalar (CAdd, CMult) or a polynomial (PAdd, PMult) of unencrypted, constant values. Additions are performed by adding the scalar or polynomial to b, and mults are performed by multiplying both b and a by the scalar or polynomial.
2.4. Multiplicative level and HE bootstrapping
Multiplicative level: The error included in a ct is amplified during HE ops; in particular, HMult multiplies the error with other terms (e.g., m and e) and can result in an explosion of the error if not treated properly. CKKS performs HRescale to mitigate this explosion and keep the error tolerable by dividing the ct by the last prime modulus q_ℓ (Cheon et al., 2018a). After HRescale, the residue polynomial corresponding to q_ℓ is discarded, and the ct is reduced in size. The ct continues losing residues with each HRescale while executing an HE application, until only one residue polynomial is left and no additional HMult can be performed on the ct. L, the maximum multiplicative level, determines the maximum number of HMult ops that can be performed without bootstrapping, and the current (multiplicative) level ℓ denotes the number of remaining HMult ops that can be performed on the ct. Thus, a ct at level ℓ is represented as a pair of (ℓ+1)×N matrices.
Bootstrapping: FHE features a bootstrapping op that restores the multiplicative level of a ct to enable more ops. Bootstrapping must be performed routinely for the practical usage of HE with a complex sequence of HE ops. Bootstrapping mainly consists of homomorphic linear transforms and approximate sine evaluation (Cheon et al., 2018b), which can be broken down into hundreds of primitive HE ops. HMult and HRot ops account for more than 77% of the bootstrapping time (EPFL-LDS, 2021). As bootstrapping itself consumes L_boot levels, L should be larger than L_boot. A larger L is beneficial as it requires less frequent bootstrapping. L_boot ranges from 10 to 20 depending on the bootstrapping algorithm; a larger L_boot allows the use of more precise and faster bootstrapping algorithms (Chen et al., 2019; Bossuat et al., 2021; Lee et al., 2021a; Han and Ki, 2020). The bootstrapping algorithm we use in this paper is based on (Han and Ki, 2020), with updates to meet the latest security and precision requirements (Bossuat et al., 2021; Lee et al., 2020; Cheon et al., 2019); its L_boot value is 19. Readers are encouraged to refer to the papers for a more detailed explanation of the algorithm. Another CKKS-specific constraint is that the moduli q_i's and the special moduli p_i's must be large enough to tolerate the error accumulated during bootstrapping (Cheon et al., 2022; EPFL-LDS, 2021).
2.5. Modern algorithmic optimizations in CKKS and amortized mult time per slot
Security level (λ): The level of security of an HE scheme is represented by λ, a parameter measured by the logarithmic time complexity for an attack (Cheon et al., 2019) to deduce the secret key. A sufficiently high λ is required for safety; we target a λ of 128 bits, adhering to the standard (Albrecht et al., 2019) established by recent HE studies (Bossuat et al., 2021; Lee et al., 2021a, 2020) and libraries (EPFL-LDS, 2021; PALISADE Project, 2021). λ is a strictly increasing function of N/log(PQ) (Curtis and Player, 2019).
Dnum: Key-switching is an expensive function, accounting for most of the time in HRot and HMult (Jung et al., 2021a). We adopt a state-of-the-art generalized key-switching technique (Han and Ki, 2020), which balances λ, the computational cost, and L. (Han and Ki, 2020) factorizes the moduli product Q into dnum modulus factors Q_j (see Eq. 7) for a given integer dnum (decomposition number). It decomposes a ct into dnum slices, each consisting of residue polynomials corresponding to the prime moduli (q_i's) that together compose the modulus factor Q_j. We perform key-switching on each slice and later accumulate the results. The special moduli product P then only has to satisfy P > Q_j for each j, allowing us to choose a smaller P, leading to a higher λ. i) Therefore, a larger dnum means a greater level L with fixed values of N and λ because we can increase Q.
(7)  Q = ∏_{j=0}^{dnum−1} Q_j,  Q_j = ∏_{i=j·α}^{(j+1)·α−1} q_i,  α = (L+1)/dnum
A major challenge of generalized key-switching is that different evks (evk_j) must be prepared for each factor Q_j, where each evk_j is a pair of (k+L+1)×N matrices and k is set to ⌈(L+1)/dnum⌉. ii) Thus, the aggregate evk size becomes 2·dnum·(k+L+1)·N words, increasing nearly linearly with dnum. iii) The overall computational complexity of a single HE op also increases with dnum. Therefore, choosing an appropriate dnum crucially affects the performance.
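The growth of the aggregate evk size with dnum can be checked with a short calculation, assuming k = (L+1)/dnum as above. The (N, L) values match the paper's running example; dnum must divide L+1 in this sketch.

```python
N = 2**17          # polynomial degree
L = 27             # maximum multiplicative level
WORD = 8           # 64-bit machine words

def evk_bytes(dnum):
    """Aggregate evk size: dnum keys, each a pair of (k+L+1) x N residue
    matrices of 64-bit words, with k = (L+1)/dnum special moduli."""
    k = (L + 1) // dnum            # dnum must divide L+1 here
    return dnum * 2 * (k + L + 1) * N * WORD

for d in (1, 2, 4):
    print(d, evk_bytes(d) / 2**20, "MiB")
```

Doubling dnum shrinks k but multiplies the number of keys, so the total grows with dnum (112, 168, and 280 MiB for dnum = 1, 2, and 4 under these parameters).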
Amortized mult time per slot (T_{mult,a/slot}): Changing the HE parameter set has mixed effects on the performance of HE ops. Decreasing N reduces the computational complexity and memory usage. However, we must then also lower Q (and hence L) to sustain security, which requires more frequent bootstrapping. Also, because a ct of degree N can encode only up to N/2 message slots by packing, the throughput degrades.
Jung et al. (Jung et al., 2021a) introduced a metric called the amortized mult time per slot (T_{mult,a/slot}), which is calculated as follows:

(8)  T_{mult,a/slot} = (T_{boot} + Σ_{ℓ=L_boot+1}^{L} T_{mult}(ℓ)) / ((L − L_boot) · N/2)

where T_{boot} is the bootstrapping time and T_{mult}(ℓ) is the time required to perform HMult at level ℓ. This metric first calculates the average cost of a mult including the overhead of bootstrapping, and then divides it by the number of slots in a ct (N/2). Thus, T_{mult,a/slot} effectively captures the reciprocal throughput of a CKKS instance (a CKKS scheme with a certain parameter set).
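The metric translates directly into code. The timing numbers below are hypothetical placeholders, not measurements.

```python
def t_mult_a_slot(t_boot, t_mult_per_level, L, L_boot, N):
    """Amortized mult time per slot: average HMult cost including the
    bootstrapping overhead, divided by the N/2 slots of a ciphertext.
    Assumes L - L_boot usable levels between bootstrappings."""
    usable = L - L_boot
    avg_mult = (t_boot + sum(t_mult_per_level)) / usable
    return avg_mult / (N // 2)

# Hypothetical timings: 100 ms bootstrap, 1 ms per HMult, 8 usable levels
t = t_mult_a_slot(0.100, [0.001] * 8, L=27, L_boot=19, N=2**17)
```

Even a 100 ms bootstrap amortizes to roughly 200 ns per slot here, which is why the metric rewards large N and infrequent bootstrapping.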
3. Technologydriven Parameter Selection of Bootstrappable Accelerators
3.1. Technology trends regarding memory hierarchy
Domain-specific architectures (e.g., deep-learning (Jouppi et al., 2021; Medina and Dagan, 2020; Knowles, 2021) and multimedia (Ranganathan et al., 2021) accelerators) are often based on custom logic and an optimized dataflow to provide high computation capabilities. In addition, the memory capacity/bandwidth requirements of the applications are exploited in the design of the memory hierarchy. Recently, on-chip SRAM capacities have scaled significantly (Auth et al., 2017), such that hundreds of MBs of on-chip SRAM are feasible, providing tens of TB/s of SRAM bandwidth (Jouppi et al., 2021; Prabhakar and Jairath, 2021; Knowles, 2021). While the bandwidth of main memory has also increased, its aggregate throughput is still more than an order of magnitude lower than on-chip SRAM bandwidth (O'Connor et al., 2017), achieving a few TB/s even with high-bandwidth memory (HBM).

Similar to other domain-specific architectures (Chen et al., 2016; Jouppi et al., 2021), HE applications follow deterministic computational flows, and the locality of the input and output cts of HE ops can be maximized through software scheduling (Dathathri et al., 2020). Thus, cts can be reused by exploiting the large amount of on-chip SRAM enabled by technology scaling. However, even with the increasing on-chip SRAM capacity, we observe that it is still insufficient to store evks, rendering the off-chip memory bandwidth a crucial bottleneck for the modern CKKS scheme that supports bootstrapping. In the following sections, we identify the importance of bootstrapping to the overall performance and analyze how different CKKS parameters impact the amount of data movement during bootstrapping and the final throughput.
3.2. Interplay between key CKKS parameters
(Figure caption fragment: interpolated results are used for points with non-integer values; the dotted line in (a) represents the minimum required level of 11 for bootstrapping.)

Selecting one parameter of a CKKS instance has a multifaceted effect on the other parameters. First, λ is lowered when Q is larger, and raised when N is larger. Considering that a bootstrappable CKKS instance requires a high L (> L_boot), and that the prime moduli sizes are set near the 64-bit machine word size, log(PQ) exceeds 500. To support 128-bit security when log(PQ) exceeds 500, N must be larger than 2^16 (Lee et al., 2020).
Second, when log(PQ) is set from fixed values of N and λ, a larger dnum leads to a higher L at the cost of a larger aggregate evk size. Considering that log P equals max_j(log Q_j), the ratio of log P to log Q is close to 1/dnum. Therefore, when log(PQ) is fixed, a larger dnum means a larger log Q and finally a larger L. However, the evk size also increases linearly with dnum (see Fig. 1). Because the gain in L achieved by increasing dnum saturates quickly, choosing a proper dnum is important.
3.3. Realistic minimum bound of HE accelerator execution time
T_{mult,a/slot} is mainly determined by the bootstrapping time, as bootstrapping is more than 60× longer than a single HMult on conventional systems (EPFL-LDS, 2021; Jung et al., 2021a). Unlike simple LHE tasks such as LoLa (Brutzkus et al., 2019), which only require a handful of evks, bootstrapping typically requires more than 40 evks, mostly for the long sequence of HRots applied with different rotation amounts during the linear-transformation steps of bootstrapping (Bossuat et al., 2021). These evks can amount to GBs of storage and exhibit poor locality.
The bootstrapping time is mostly spent on HMult and HRot. (Jung et al., 2021a) found that HMult and HRot are memory-bound and highly dependent on the on-chip storage capacity. Given today's technology with low logic costs and high-density on-chip SRAMs, the performance of HMult and HRot can be improved significantly with an HE accelerator.
Despite such an increase in on-chip storage, evks, each possibly taking up several hundred MBs (see Fig. 1), cannot easily be stored on-chip. Because on-chip storage cannot hold all evks, they must be stored off-chip and loaded in a streaming fashion upon every HMult/HRot. Therefore, even if all temporal data and cts with high locality are assumed to be stored on-chip with massive on-chip storage, the load time of an evk becomes the minimum execution time of HMult/HRot, considering the limited off-chip bandwidth.
3.4. Desirable target CKKS parameters for HE accelerators
To understand the impact of CKKS parameters, we simulate T_{mult,a/slot} at multiple points while sweeping the N, dnum, and moduli-size values. With 1TB/s of memory bandwidth (half of NVIDIA A100 (Choquette et al., 2021) and identical to F1 (Samardzic et al., 2021)), a bootstrapping algorithm that consumes 19 levels, and the simulation methodology in Section 6.2, we add two simplifying assumptions based on Section 3.3: 1) the computation time of HE ops can be fully hidden by the memory latency of evks, and 2) all cts of HE ops are stored in on-chip SRAM and reused. Fig. 2 reports the results. The x-axis shows λ determined by N and log(PQ) (Curtis and Player, 2019), as calculated using an estimation tool (Son, 2021). The y-axis shows T_{mult,a/slot} for the different parameter combinations.

We make two key observations. First, when the other values are fixed, T_{mult,a/slot} decreases as N increases, even with the higher memory pressure from the larger cts and evks, because the available level (L − L_boot) increases. However, this effect saturates for the largest N values: around our target security level of 128 bits in Fig. 2, one step up in N gains 3.8× (111.4ns to 29.1ns), whereas the next step gains only 1.3×. Second, while a higher dnum can help smaller N values reach our target 128-bit security level, it comes at the cost of a superlinear increase in T_{mult,a/slot}, due to the increasing evk size and the additional gain in L being saturated.
These key observations suggest that a bootstrappable HE accelerator should target CKKS instances with high polynomial degrees and low dnum values. Our BTS targets the CKKS instances with N = 2^17 highlighted in Fig. 2. With these, the simulated HE accelerator achieves a T_{mult,a/slot} of 27.7ns, 19.9ns, and 22.1ns with corresponding (L, dnum) pairs of (27, 1), (39, 2), and (44, 3), respectively. Although BTS can support all CKKS instances shown in Fig. 2, it is not optimized for the other instances, as they either exhibit worse T_{mult,a/slot} or require significantly more on-chip resources for only a marginal performance gain.
In this paper, we use the CKKS instance with N = 2^17, L = 27, and dnum = 1 as a running example. With the 64-bit machine word size, a ct at the maximum level has a size of 56MB, and an evk has a size of 112MB.
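These sizes follow from simple arithmetic on the representations described in Section 2: a ct is a pair of (L+1)×N matrices of 64-bit words, and an evk is a pair of (k+L+1)×N matrices over PQ, with k = L+1 at dnum = 1.

```python
N, L, dnum = 2**17, 27, 1
WORD = 8                      # 64-bit machine words
k = (L + 1) // dnum           # number of special prime moduli (k = 28 here)

ct_bytes  = 2 * (L + 1) * N * WORD        # pair of (L+1) x N residue matrices
evk_bytes = 2 * (k + L + 1) * N * WORD    # pair of (k+L+1) x N matrices over PQ

print(ct_bytes / 2**20, evk_bytes / 2**20)   # sizes in MiB
```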
4. Architecting BTS
We explore the organization of BTS, our HE accelerator architecture. We address the limitations of prior works, F1 (Samardzic et al., 2021) in particular, and suggest a suitable architecture for bootstrappable CKKS instances. Section 3.4 derived the optimality of such CKKS instances assuming that an HE accelerator can hide all the computation time within the loading time of an evk. BTS exploits the massive parallelism innate in HE ops to satisfy that optimality requirement with enough, but not an excess of, functional units (FUs).
4.1. Computational breakdown of HE ops
We first dissect keyswitching, which appears in both HMult and HRot, the two dominant HE ops for bootstrapping and general HE workloads. Fig. 3(a) shows the computational flow of keyswitching, and Fig. 3(b) shows the corresponding computational complexity breakdown. We focus on three functions, NTT, iNTT, and BConv, which take up most of the computation.
Number Theoretic Transform (NTT): A polynomial mult between polynomials in R_Q translates to a negacyclic convolution of their coefficients. NTT is a variant of the Discrete Fourier Transform (DFT) over Z_{q_i}. Similar to DFT, NTT transforms the convolution between two sets of coefficients into an element-wise mult, while inverse NTT (iNTT) is applied to obtain the final result, as shown below (⊙ meaning element-wise mult):

a · b = iNTT(NTT(a) ⊙ NTT(b))

By applying the well-known Fast Fourier Transform (FFT) algorithms (Cooley and Tukey, 1965), the computational complexity of (i)NTT is reduced from O(N²) to O(N·logN). This strategy divides the computation into logN stages, where the N data elements are paired into N/2 pairs in a strided manner and butterfly operations are applied to each pair per stage. The stride value changes every stage. The butterfly operations in (i)NTT are as follows:

Butterfly_NTT(a, b) = (a + W·b, a − W·b)
Butterfly_iNTT(a, b) = (a + b, (a − b)·W)

where W (a twiddle factor) is an odd power (up to the (2N−1)-th) of the primitive 2N-th root of unity modulo the prime. In total, N twiddle factors are needed per prime modulus. NTT can be applied concurrently to each residue polynomial (in R_{q_i}) in a ct.

Base Conversion (BConv): BConv (Bajard et al., 2016) converts a set of residue polynomials to another set whose prime moduli are different from the former. A ct at level ℓ has two polynomials, each consisting of (ℓ+1) residue polynomials corresponding to the prime moduli q_0, …, q_ℓ. We denote this modulus set as C = {q_0, …, q_ℓ}, called the polynomial's base, or base in short.
BConv is required in key-switching to match the base of a ct with that of an evk on base C ∪ B, where B = {p_0, …, p_{k−1}}. BConv from C to B is performed on the residues [a]_C, as expressed in Eq. 9, where q̂_i = Q/q_i for 0 ≤ i ≤ ℓ. Likewise, BConv from B to C is performed after multiplying by P^{−1}.

(9)  BConv_{C→B}([a]_C) = ( Σ_{i=0}^{ℓ} [a_i · q̂_i^{−1}]_{q_i} · q̂_i mod p_j )_{0 ≤ j < k}
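A minimal Python sketch of the fast base conversion of Eq. 9 follows (toy moduli). Note that the result may represent a + e·Q for a small integer e rather than a exactly, a known property of fast BConv; in CKKS this slack is absorbed by the error.

```python
from math import prod

C = [1009, 1013, 1019]        # current base, moduli q_i
B = [1021, 1031]              # target base, special moduli p_j
Q = prod(C)

def bconv(residues):
    """Fast base conversion: residues of a w.r.t. C -> residues w.r.t. B.

    Computes sum_i [a_i * q_hat_i^{-1}]_{q_i} * q_hat_i mod p_j, which equals
    (a + e*Q) mod p_j for some 0 <= e < len(C)."""
    terms = []
    for r, q in zip(residues, C):
        q_hat = Q // q
        terms.append((r * pow(q_hat, -1, q)) % q)   # [a_i * q_hat_i^{-1}]_{q_i}
    out = []
    for p in B:
        acc = 0
        for t, q in zip(terms, C):
            acc += t * ((Q // q) % p)
        out.append(acc % p)
    return out

a = 123456789 % Q
conv = bconv([a % q for q in C])
```

Crucially, the conversion only needs the residues themselves, never the full big integer, so it parallelizes coefficient-wise.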
Because BConv cannot be performed on polynomials after NTT (i.e., while they are in the NTT domain), iNTT is performed to bring the polynomials back to the RNS domain. BTS keeps polynomials in the NTT domain by default and brings them back to the RNS domain only for BConv. Thus, a sequence of iNTT → BConv → NTT is a common pattern in CKKS.
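To make the NTT-domain arithmetic concrete, here is a minimal negacyclic NTT over a toy ring (N = 8, q = 17). For brevity it uses the direct O(N²) evaluation at the odd powers of the 2N-th root of unity rather than the O(N·logN) butterfly network described above; the transform it computes is the same.

```python
N, q = 8, 17      # toy parameters: q prime with 2N | q - 1
psi = 3           # primitive 2N-th root of unity mod 17 (3^8 = -1, 3^16 = 1)

def ntt(a):
    """Negacyclic NTT: evaluate a(X) at the odd powers psi^(2k+1)."""
    return [sum(c * pow(psi, (2 * k + 1) * i, q) for i, c in enumerate(a)) % q
            for k in range(N)]

def intt(A):
    """Inverse transform: interpolate back from the odd-power evaluations."""
    n_inv = pow(N, -1, q)
    return [(n_inv * sum(c * pow(psi, -(2 * k + 1) * i, q)
                         for k, c in enumerate(A))) % q
            for i in range(N)]

def negacyclic_mul(a, b):
    """Coefficient mult mod (X^N + 1, q) via NTT -> pointwise mult -> iNTT."""
    return intt([x * y % q for x, y in zip(ntt(a), ntt(b))])
```

The odd powers ψ^{2k+1} are exactly the roots of X^N + 1, which is why pointwise multiplication of the transforms realizes the negacyclic convolution.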
4.2. Limitations in prior works and the balanced design of BTS
Prior HE acceleration studies (Samardzic et al., 2021; Riazi et al., 2020; Roy et al., 2019; Reagen et al., 2021) identified (i)NTT as the paramount acceleration target and placed multiple NTT units (NTTUs) that can perform both Butterfly_NTT and Butterfly_iNTT. F1 (Samardzic et al., 2021) in particular populated numerous NTTUs with a “the more the better” approach, provisioning 14,336 NTTUs even for small HE parameter sets. Such an approach was viable because, under the small parameter sets, all cts, evks, and temporal data could reside on-chip, especially with proper compiler support.
However, we observe that such massive use of NTTUs is wasteful for bootstrappable CKKS instances, where the off-chip memory bandwidth becomes the main determinant of the overall performance. The FHE-optimized parameters cause a quadratic increase in the sizes of cts, evks, and temporal data (e.g., 64× for our bootstrappable target relative to a small LHE parameter set), because both N and the number of residue polynomials grow. This makes it impossible for these components to be located on-chip, especially considering that most prior custom-hardware works only take into account the maximum-level case.
We instead analyze how many fully-pipelined NTTUs an HE accelerator requires to finish an HMult or HRot within the evk loading time under our target CKKS instances. We define the minimum required number of NTTUs (min#NTTU) as the total butterfly-op count of an HMult divided by the number of butterfly ops a single NTTU can perform during the evk load time. Assuming a nominal operating frequency of 1.2GHz for NTTUs, considering prior works (Choquette et al., 2021; Knowles, 2021; Jouppi et al., 2021) in 7nm process nodes, and HBM with an aggregate bandwidth of 1TB/s, min#NTTU is defined as shown below:

(10)  min#NTTU = (#butterfly ops per HMult) / (freq_NTTU × (evk size / mem BW))

The value of min#NTTU is maximized when dnum is 1, where the value is 1,328. We utilize 2,048 NTTUs in BTS to provide some margin for other operations.
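The ingredients of Eq. 10 are easy to reproduce. The evk size and bandwidth below follow the paper's running example; the total butterfly count per HMult is left as an input to the sketch, since it depends on the full key-switching dataflow.

```python
EVK_BYTES = 112 * 2**20   # evk size for the dnum = 1 running example (112 MiB)
BW = 1e12                 # 1 TB/s of aggregate HBM bandwidth
FREQ = 1.2e9              # nominal NTTU clock frequency

t_load = EVK_BYTES / BW   # evk streaming time from off-chip memory (~117 us)
budget = t_load * FREQ    # butterfly ops one NTTU can finish in that time

def min_nttu(total_butterflies):
    """Eq. 10: NTTUs needed to hide all (i)NTT work behind the evk load."""
    return total_butterflies / budget
```

At roughly 117 µs of evk load time and 1.2 GHz, one NTTU covers about 140K butterflies per load, so hiding the full HMult workload behind the load requires on the order of a thousand units.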
In addition to (i)NTT, the importance of BConv grows as smaller dnum values are used. As a result, the relative computational complexity of BConv, which is 12% for larger dnum values, increases to 34% at dnum = 1 (see Fig. 3(b)). Prior works mainly targeted large-dnum instances, focusing on the acceleration of (i)NTT. We propose a novel BConv unit (BConvU) to handle the increased significance of BConv, whose details are described later in Section 5.2.
4.3. BTS organization exploiting data parallelism
We can categorize the primary HE functions into three groups according to their data access patterns (see Fig. 4). Residue-polynomial-wise functions, the (i)NTT and automorphism functions, involve all residues in a residue polynomial to produce an output. Coefficient-wise functions (e.g., BConv) involve all residues of a single coefficient to produce an output residue. Element-wise functions such as CMult and PMult only involve residues at the same position over multiple polynomials.
We can exploit two types of data parallelism, residue-polynomial-level parallelism (rPLP) and coefficient-level parallelism (CLP), when parallelizing an HE op with multiple processing elements (PEs). rPLP is exploited by distributing residue polynomials, and CLP by distributing coefficients, to many PEs. Prior works including F1 mostly exploited rPLP, as prime-wise modularization is readily possible.

When the data access pattern and the type of parallelism being exploited are not aligned, data exchanges between PEs occur, resulting in global-wire communication, which has scaled poorly over technology generations (Ho et al., 2001). For the sequence of iNTT → BConv → NTT in key-switching, CLP incurs data exchanges for (i)NTT and rPLP incurs data exchanges for BConv. The total size of the transferred data is identical in both cases. Thus, there is no clear winner between the two types of parallelism in terms of data exchanges. However, exploiting rPLP is limited in the degree of parallelism due to the fluctuating multiplicative level as an FHE application is executed; this also complicates a fair distribution of jobs among PEs.
Instead, we use CLP in BTS. As N is fixed throughout the running of an HE application, we decide on a fixed data-distribution methodology, where the residues of a polynomial with the same coefficient index are allocated to the same PE. Then, coefficient-wise and element-wise functions are parallelized without inter-PE data exchanges; only (i)NTT and the automorphism incur inter-PE data exchanges, with the communication pattern predetermined by the fixed data distribution.
We place 2,048 PEs (Eq. 10) in BTS. Each PE has an NTTU, a BConvU, a modular adder (ModAdd), and a modular multiplier (ModMult) for element-wise functions, as well as an SRAM scratchpad. The N residues of a residue polynomial are evenly distributed to the PEs, such that one PE handles N/2,048 = 64 residues. Then, six out of the 17 (i)NTT stages can be computed solely inside a PE. We adopt 3D-NTT to minimize the data exchanges between the PEs. A residue polynomial is regarded as a 3D data structure of size 64×32×64. Each PE then performs a sequence of 64-, 32-, and 64-point (i)NTTs, interleaved with just two rounds of inter-PE data exchange. Splitting (i)NTT in a more fine-grained manner requires more data-exchange rounds and is thus less energy-efficient. The automorphism function exhibits a different communication pattern from (i)NTT, involving complex data remapping (Eq. 5). Nevertheless, the data-distribution methodology and NoC structure of BTS efficiently handle data exchanges for both (i)NTT and the automorphism (see Section 5).
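One possible digit mapping of coefficient indices onto the 32×64 PE grid can be sketched as below. The exact digit order is our assumption for illustration, but any such digit split distributes the N = 2^17 residues evenly, 64 per PE, which is the property the fixed CLP distribution relies on.

```python
from collections import Counter

NX, NY, NZ = 64, 32, 64          # cube dimensions; NX * NY * NZ = N = 2**17

def pe_of(i):
    """Assumed digit mapping of coefficient index i to a PE coordinate.

    i is split into (x, y, z) digits; the z digit stays inside the PE, so
    the PE at grid position (y, x) holds all NZ residues sharing (x, y)."""
    y = (i // NZ) % NY
    x = i // (NZ * NY)
    return (y, x)

# Count how many residues each PE receives under this mapping
load = Counter(pe_of(i) for i in range(NX * NY * NZ))
```

Because the mapping is fixed for the lifetime of the application, the inter-PE exchange pattern of (i)NTT is fully predetermined.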
5. BTS Microarchitecture
We devise a massively parallel architecture that distributes PEs in a grid. A PE consists of functional units (FUs) and an SRAM scratchpad. An NTTU in each PE handles a portion of the residues in a residue polynomial during (i)NTT. By exploiting CLP, the coefficient-wise and element-wise functions can be computed in a PE without any inter-PE data exchange.
Fig. 5 presents a high-level overview of BTS. We arrange the 2,048 (= 32×64) PEs in a grid with a vertical height of 32 and a horizontal width of 64. The PEs are interconnected via dimension-wise crossbars in the form of 32×32 vertical crossbars (xbar_v) and 64×64 horizontal crossbars (xbar_h). We populate a central constant memory storing precomputed values, including the twiddle factors for (i)NTT and the moduli-dependent constants for BConv. A broadcast unit (BrU) delivers the precomputed values to the PEs at the required moments. Memory controllers are located at the top and bottom sides, each connected to an HBM stack. BTS receives instructions and necessary data from the host via the PCIe interface. The word size in BTS is 64 bits. Modular-reduction units use Barrett reduction (Barrett, 1986) to bring 128-bit multiplied results back to the word size.

5.1. Datapath for (i)NTT
BTS maps the coefficients of a polynomial to the PEs in a manner suited to 3D-NTT. We view the residues in a residue polynomial as a cube. Then, in the RNS domain, a residue at coefficient index i (the coefficient of X^i) is at position (x, y, z) in this cube, where i = 2,048z + 64y + x (0 ≤ x < 64, 0 ≤ y < 32, 0 ≤ z < 64). We allocate the residues at positions (x, y, ·) of such a cube to the PE at coordinate (x, y) in the PE grid. 3D-NTT is broken down into five steps in BTS. First, we conduct i) NTT_{z} inside a single PE, which corresponds to the NTT along the z-axis of the cube. Next, ii) data exchanges between vertically aligned PEs are executed, corresponding to yz-plane-parallel transpositions of residues in the cube. iii) NTT_{y} along the z-axis follows. iv) Data exchanges between horizontally aligned PEs are executed, corresponding to xz-plane-parallel transpositions of residues in the cube. Finally, v) NTT_{x} along the z-axis is carried out. iNTT is performed by the reverse process of NTT.
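The cube mapping can be sketched as follows; the exact bit-field layout (i = 2,048z + 64y + x, with the lower bits selecting the PE and the upper bits the intra-PE slot) is our reading of the mapping rather than a verbatim reproduction:

```python
# Sketch of the coefficient-to-PE mapping: the lower bits (x, y) of a
# coefficient index select the PE, the upper bits (z) the intra-PE slot.
from collections import Counter

N = 2 ** 17
GRID_H, GRID_V, PER_PE = 64, 32, 64

def cube_position(i):
    """Coefficient index -> (x, y, z) position in the 64x32x64 cube."""
    x = i % GRID_H
    y = (i // GRID_H) % GRID_V
    z = i // (GRID_H * GRID_V)
    return x, y, z

def pe_of(i):
    """PE grid coordinate (x, y) that stores coefficient i."""
    x, y, _ = cube_position(i)
    return x, y

# Every PE ends up with exactly N / 2,048 = 64 coefficients.
load = Counter(pe_of(i) for i in range(N))
assert set(load.values()) == {PER_PE}
```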
An NTTU supports both NTT and iNTT by using logic circuits similar to those of (Xin et al., 2021, 2020; Xing and Li, 2021; Zhang et al., 2021). We employ separate register files (RF_{NTT}s) to reuse data between (i)NTT stages. An NTTU decomposes NTT_{x}, NTT_{y}, and NTT_{z} into radix-2 NTTs. It is fully pipelined and performs one butterfly op per clock. An input pair is fed into, and an output pair is stored from, the NTTU each cycle, supported by two pairs of RF_{NTT}s.
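Each per-cycle butterfly relies on the word-size Barrett reduction mentioned earlier. A minimal sketch with an illustrative 61-bit modulus (the actual BTS prime moduli are not specified here):

```python
# Barrett reduction sketch: fold a product of two residues back below a
# word-size modulus q using one precomputed constant and shifts, avoiding
# division on the critical path. The 61-bit Mersenne prime is illustrative.
q = (1 << 61) - 1                  # example word-size modulus
K = 2 * q.bit_length()             # intermediate products fit in 2k bits
M = (1 << K) // q                  # precomputed Barrett constant floor(2^K / q)

def barrett_reduce(x):
    """Reduce 0 <= x < q*q to x mod q with two multiplications."""
    est = (x * M) >> K             # estimate of floor(x / q), off by at most 1
    r = x - est * q                # remainder candidate, within [0, 2q)
    return r - q if r >= q else r  # single conditional correction

def butterfly(a, b, w):
    """One radix-2 NTT butterfly on residues a, b with twiddle factor w."""
    t = barrett_reduce(b * w)
    return (a + t) % q, (a - t) % q

assert all(barrett_reduce(v) == v % q
           for v in [0, q - 1, q, q * q - 1, 123456789 ** 2])
```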
We hide the time for the vertical and horizontal data exchanges of 3D-NTT (steps ii) and iv)) through coarse-grained, epoch-based pipelining. As steps i), iii), and v) are executed on the same NTTU, we set the length of an epoch to the time required to perform these three steps. Within the j-th epoch, we time-multiplex step i) of the j-th, iii) of the (j−2)-th, and v) of the (j−4)-th residue polynomials, while concurrently exchanging ii) of the (j−1)-th and iv) of the (j−3)-th residue polynomials. Concurrent data exchanges are enabled by separate vertical (ii)) and horizontal (iv)) NoCs. Thus, the (i)NTT of a single residue polynomial finishes every epoch.

A single (i)NTT on a residue polynomial requires N different twiddle factors. Because each prime modulus needs different twiddle factors, the twiddle factors for (i)NTT on a ciphertext reach dozens of MBs for our target CKKS instances. We reduce the storage for the twiddle factors by decomposing them by means of on-the-fly twiddling (OT) (Kim et al., 2020a). OT replaces the N-sized precomputed twiddle-factor table with two tables: a higher-digit table holding W^(q·2^m) for 0 ≤ q < N/2^m, and a lower-digit table holding W^r for 0 ≤ r < 2^m. We can compose any twiddle factor W^i by multiplying the two twiddle factors W^(q·2^m) and W^r that satisfy i = q·2^m + r. OT thus reduces the twiddle-factor storage from N entries to N/2^m + 2^m entries. BTS stores the lower-digit tables of the prime moduli in the PEs (each PE holding different entries) while storing the higher-digit tables in the BrU (all PEs sharing the entries). The BrU broadcasts a higher-digit table for a prime modulus to the PEs for every (i)NTT epoch.
5.2. Base Conversion Unit (BConvU)
BConv consists of two parts. The first part multiplies the residue polynomials by [q̂_i^{−1}]_{q_i}, and the second part multiplies the results by [q̂_i]_{p_j} and accumulates them. It is the second part that exhibits the coefficient-wise access pattern, because it accumulates residues at the same coefficient index across all residue polynomials.
A BConv unit (BConvU), with a modular multiplier (ModMult) for the first part and a modular multiply-accumulate unit (MMAU) for the second part, is placed in each PE. BConv strongly depends on the preceding iNTT (see Fig. 3). Because iNTT is a residue-polynomial-wise function, whereas the second part of BConv is a coefficient-wise function, the MMAU must wait until iNTT is finished on all residue polynomials. We mitigate this by partially overlapping iNTT and BConv. We modify the right-hand side of Eq. 9 as follows:
(11)  [ Σ_{i=0}^{ℓ−1} [a_i · q̂_i^{−1}]_{q_i} · q̂_i ]_{p_j} = [ Σ_{k=0}^{ℓ/dnum_sub−1} ( Σ_{i=k·dnum_sub}^{(k+1)·dnum_sub−1} [a_i · q̂_i^{−1}]_{q_i} · [q̂_i]_{p_j} ) ]_{p_j}
This modification enables the second part to start once the preceding iNTT and the first part of BConv are finished on dnum_sub residue polynomials and the results are stored in RF_{MMAU}. The MMAU computes the corresponding partial sum (the inner sum of Eq. 11) and accumulates this result with the previous results (the outer sum), which are loaded from and stored to the scratchpad, inducing a read and a write every cycle. Temporary registers and a FIFO minimize the bandwidth pressure on RF_{MMAU} and transpose the data into the correct orientation to feed dnum_sub lanes into the MMAU. The precomputed values of [q̂_i^{−1}]_{q_i} and [q̂_i]_{p_j} (BConv tables) are loaded into dedicated RFs from the BrU when needed.
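The sum-splitting behind Eq. 11 can be sketched as follows; the moduli and dnum_sub value below are illustrative, not the BTS parameters:

```python
# Sketch of the partial-sum splitting behind Eq. 11: accumulating the BConv
# sum in chunks of dnum_sub residue polynomials yields the same result as the
# full sum, which is what lets the MMAU start before every iNTT has finished.
from math import prod

q = [97, 113, 193, 241, 257, 337, 353, 401]   # source RNS base {q_i} (toy)
p_j = 769                                     # one target modulus (toy)
Q = prod(q)
DNUM_SUB = 2                                  # chunk size fed to the MMAU

def bconv_terms(x):
    """Per-modulus BConv terms [x * qhat_i^{-1}]_{q_i} * [qhat_i]_{p_j}."""
    terms = []
    for qi in q:
        qhat = Q // qi
        a_i = (x % qi) * pow(qhat, -1, qi) % qi   # first part (ModMult)
        terms.append(a_i * (qhat % p_j) % p_j)    # second part's multiplicand
    return terms

x = 123456789 % Q
terms = bconv_terms(x)
full = sum(terms) % p_j
chunked = 0
for k in range(0, len(q), DNUM_SUB):              # outer sum of Eq. 11
    chunked = (chunked + sum(terms[k:k + DNUM_SUB])) % p_j
assert chunked == full
```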
We also leverage the MMAU for other operations. The subtraction, scaling, and addition at the end of key-switching (Fig. 3) can be expressed in the fused form (a − b) · s + c; thus, we fuse these three operations to be computed on the MMAU. We refer to this fusion as subtraction-scaling-addition (SSA).
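A sketch of the SSA fusion; the exact fused expression (a − b)·s + c is our reading of the operation sequence, with toy operands:

```python
# Subtraction-scaling-addition (SSA) fusion sketch: one multiply-accumulate
# pass per element replaces three separate element-wise modular ops.
q = 257
s = 91                                     # scaling constant (illustrative)

def ssa(a_vec, b_vec, c_vec):
    """Fused (a - b) * s + c mod q, one MMAU-style pass per element."""
    return [((a - b) * s + c) % q for a, b, c in zip(a_vec, b_vec, c_vec)]

a = [10, 200, 33]
b = [3, 250, 33]
c = [1, 2, 3]
fused = ssa(a, b, c)

# Same result as running the three ops separately:
sub = [(x - y) % q for x, y in zip(a, b)]
scl = [x * s % q for x in sub]
add = [(x + y) % q for x, y in zip(scl, c)]
assert fused == add
```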
5.3. Scratchpad
The per-PE scratchpad serves three purposes. First, it stores the temporary data generated during the course of the HE ops. The temporary data during key-switching can be large (e.g., a single (i)NTT or BConv can produce 28MB for our target CKKS instances). If such data did not reside on-chip, the additional off-chip accesses would cause severe performance degradation.
Second, the scratchpad stores the prefetched evaluation keys (evks). To hide the evk load time, an evk must be prefetched beforehand. As an evk is not consumed immediately after being loaded on-chip, it takes up a portion of the scratchpad.
Third, the scratchpad functions as a cache for cts, controlled explicitly by software (SW caching). cts often show high temporal locality during a sequence of HE ops. For instance, during bootstrapping, a ct is commonly subjected to multiple HRots. Moreover, as HE ops form a deterministic computational flow and the granularity of cache management is as large as a ct, SW control is manageable.
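The SW-caching role can be sketched as a minimal, explicitly managed cache at ct granularity; the capacity and trace below are illustrative (the LRU policy matches the one used by our simulator in Section 6.2):

```python
# Minimal sketch of software-managed ciphertext caching with an LRU policy.
from collections import OrderedDict

class CtCache:
    def __init__(self, capacity_cts):
        self.cap = capacity_cts
        self.slots = OrderedDict()          # ct id -> resident data

    def access(self, ct_id):
        """Return True on a scratchpad hit; a miss models an off-chip load."""
        if ct_id in self.slots:
            self.slots.move_to_end(ct_id)   # refresh LRU position
            return True
        if len(self.slots) >= self.cap:
            self.slots.popitem(last=False)  # evict the least-recently-used ct
        self.slots[ct_id] = object()        # stands in for the loaded ct
        return False

cache = CtCache(capacity_cts=2)
trace = ["ct0", "ct1", "ct0", "ct2", "ct1"]  # ct0 reused (e.g., repeated HRots)
hits = [cache.access(c) for c in trace]
assert hits == [False, False, True, False, False]
```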
The scratchpad bandwidth demand of the BConvU is high (as later detailed in Fig. 8) due to the accesses involved in updating the partial sums. Considering that the partial sum in Eq. 11 is loaded once per dnum_sub residue polynomials, the bandwidth pressure can be relieved by increasing dnum_sub. However, this would also require an increase in the number of lanes in the MMAU (and hence the size of RF_{MMAU}), resulting in a trade-off.
5.4. NetworkonChip (NoC) design
BTS has three types of on-chip communication: 1) off-chip memory traffic to the PEs (PE-Mem NoC), 2) the distribution of precomputed constants to the PEs (BrU NoC), and 3) inter-PE data exchanges for (i)NTT and the automorphism (PE-PE NoC). BTS has a large number of nodes (over 2K endpoints) and requires high bandwidth. Given the unique communication characteristics of each type, BTS provides three separate NoCs instead of sharing a single NoC, enabling deterministic communication while minimizing the NoC overhead.
PE-Mem NoC: Because data is distributed evenly across the PEs, the off-chip memory (i.e., HBM2e (JEDEC, 2021)) is placed at the top and bottom, and each HBM stack only needs to communicate with the half of the PEs placed nearby. The PE grid placement is exploited by separating the PEs into 32 regions and connecting each HBM pseudo-channel to only a single PE region. An HBM2e stack supports 16 pseudo-channels (Micron Technology, Inc., 2020); thus, the upper and lower halves of the PEs each comprise 16 regions, with each region consisting of 64 PEs.
BrU NoC: BrU data is globally shared and broadcast to all PEs. Given the large number of PEs, the BrU is organized hierarchically with 128 local BrUs. Each local BrU provides the higher-digit tables of twiddle factors and the BConv tables to 16 PEs. The global BrU is loaded with all precomputed values before an HE application starts and sends data to the local BrUs, which serve as temporary storage/repeaters.
PE-PE NoC: The PE-PE NoC requires the highest bandwidth due to the data exchanges necessary between the PEs. The communication pattern is symmetric (i.e., each PE sends and receives the same amount of data), and no single PE is oversubscribed. In addition, because the traffic pattern is known (e.g., all-to-all or a fixed, permutation traffic), the NoC can be greatly simplified. BTS implements a logical 2D flattened butterfly (Kim et al., 2007; Ahn et al., 2009), given that communication is limited to other PEs within each row and within each column. However, instead of having a router at each PE, a single "router" xbar_{h} (respectively, xbar_{v}) is shared by all PEs within each row (column); it is placed in the center of each row (column) and used for the horizontal (vertical) data exchange steps of (i)NTT (steps ii), iv)). The crossbars do not require any allocation logic because the traffic pattern is known ahead of time and can be scheduled through predetermined arbitration.
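Predetermined arbitration can be sketched as follows; the round-robin offset schedule is one illustrative contention-free choice for a known all-to-all pattern, not necessarily the exact schedule BTS uses:

```python
# Contention-free arbitration sketch for a shared row crossbar: for a known
# all-to-all pattern among P PEs, the schedule "slot t: PE i -> PE (i+t) mod P"
# is a permutation in every slot, so no crossbar port is ever oversubscribed.
P = 8                                    # PEs sharing one crossbar (64 in BTS)

for t in range(P):                       # one time slot per destination offset
    dests = [(i + t) % P for i in range(P)]
    # Each slot is a permutation: every output port is used exactly once.
    assert sorted(dests) == list(range(P))

# Across all P slots, every (src, dst) pair is served exactly once.
served = {(i, (i + t) % P) for t in range(P) for i in range(P)}
assert len(served) == P * P
```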
5.5. Automorphism
We identify that BTS can handle the automorphism for HRots efficiently. Under BTS' PE-coefficient mapping scheme, all residues mapped to a single PE always move to another single destination PE; i.e., the inter-PE communication of the automorphism exhibits a permutation pattern. A PE at grid coordinate (x, y) holds the residues at positions (x, y, z) for all z, corresponding to coefficient indices i = 2,048z + 64y + x (Section 5.1). These indices in binary format differ only in the higher bit-field (the z bits), meaning that the automorphism destination indices (the σ(i)'s in Eq. 5) also differ only in the higher bit-field; the residues are therefore mapped to the same destination PE, determined by the lower bit-field (the x and y bits).
We can decompose such a permutation pattern into three steps to fit the PE-PE NoC structure of BTS: intra-PE permutation (z-axis), vertical permutation (y-axis), and horizontal permutation (x-axis). Each step gradually updates the source indices to the destination indices from the higher to the lower bit-fields. The intra-PE permutation does not use the NoC, while the vertical/horizontal permutations are handled by xbar_{v}/xbar_{h}. The PE-PE NoC can support an arbitrary HRot with any rotation amount without data contention, a property similar to that of 3D-NTT.
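Why each PE's residues stay together under the automorphism can be checked with a scaled-down sketch (toy sizes; the index map below ignores the sign corrections of the negacyclic ring):

```python
# Sketch of the permutation property: for an odd Galois multiplier g, the low
# bits of i*g mod N depend only on the low bits of i, so indices sharing a
# lower bit-field (i.e., co-located in one PE) land in one destination PE.
N = 1 << 12                 # toy ring degree (2^17 in BTS)
LOW_BITS = 6                # bits selecting the PE (5 + 6 grid bits in BTS)
MASK = (1 << LOW_BITS) - 1

for g in (5, 25, 3 ** 7):   # odd automorphism multipliers (illustrative)
    g %= 2 * N
    for low in (0, 1, 37):  # a few PE-selecting lower bit-fields
        idxs = [hi << LOW_BITS | low for hi in range(N >> LOW_BITS)]
        dest_pes = {(i * g % N) & MASK for i in idxs}
        assert len(dest_pes) == 1   # all residues move to one destination PE
```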
6. Evaluation
6.1. Hardware modeling of BTS
Table 3. Area, peak power, and frequency of the BTS components.

| Component | Area (µm²) | Power (mW) | Freq (GHz) |
|---|---|---|---|
| Scratchpad SRAM | 114,724 | 9.86 | 1.2 |
| RFs | 12,479 | 2.29 | Various |
| NTTU | 9,501 | 12.17 | 1.2 |
| ModMult (BConvU) | 4,070 | 0.56 | 0.3 |
| MMAU (BConvU) | 9,511 | 8.42 | 1.2 |
| Exchange unit | 421 | 1.03 | 1.2 |
| ModMult | 3,833 | 1.35 | 0.6 |
| ModAdd | 325 | 0.08 | 0.6 |
| 1 PE | 154,863 | 35.75 | |

| Component | Area (mm²) | Power (W) | Freq (GHz) |
|---|---|---|---|
| 2,048 PEs | 317.2 | 73.21 | |
| Inter-PE NoC | 3.06 | 45.93 | 1.2 |
| Global BrU + NoC | 0.42 | 0.10 | 0.6 |
| 128 local BrUs | 3.69 | 0.04 | 0.6 |
| HBM2e NoC | 0.10 | 6.81 | 1.2 |
| 2 HBM2e stacks | 29.6 (Jouppi et al., 2021) | 31.76 (O'Connor et al., 2017) | |
| PCIe5 x16 interface | 19.6 (Jouppi et al., 2021) | 5.37 (Bichan et al., 2020) | |
| Total | 373.6 | 163.2 | |
We used the ASAP7 (Clark et al., 2016, 2017) design library to synthesize the logic units and datapath components in a 7nm technology node. We simulated the RFs and scratchpads using FinCACTI (Shafaei et al., 2014) due to the absence of a public 7nm memory compiler. We updated the analytic models and technology constants of FinCACTI to match ASAP7 and the IRDS roadmap (IEEE, 2018). We validated the RTL synthesis and SRAM simulation results against published information (Chang et al., 2017; Song et al., 2018; Auth et al., 2017; Wu et al., 2016; Narasimha et al., 2017; Jouppi et al., 2021; Jeong et al., 2018).
BTS uses single-ported, 128-bit-wide, 1.2GHz SRAMs for the scratchpads, providing a total capacity of 512MB and a bandwidth of 38.4TB/s chip-wide. RFs are implemented in single-ported SRAMs with variable sizes, port widths, and operating frequencies following the requirements of the FUs. 22MB of RFs are used chip-wide, providing 292TB/s. Crossbars in the PE-PE NoC have 12-bit-wide ports and run at 1.2GHz, providing a bisection bandwidth of 3.6TB/s. The NoC wires are routed over other components (Passas et al., 2012). We analyzed the cost of wires and crossbars using FinCACTI and prior works (Banerjee and Mehrotra, 2002; IEEE, 2018; Moon et al., 2008; Passas et al., 2012). Two HBM2e stacks are used (JEDEC, 2021), but with a modest 11% speedup assumed, considering the latest technology (JEDEC, 2022). The peak power and area estimation results are shown in Table 3. BTS is 373.6mm² in size and consumes up to 163.2W of power.
6.2. Experimental setup
We developed a cycle-level simulator to model the compute capability, latency, and bandwidth of the FUs and the memory components composing BTS. When an HE op is called, the simulator converts the op into a computational graph of primary HE functions. Based on the derived computation and data dependencies, the simulator schedules functions and data loads at epoch granularity while minimizing the temporary data hold time. Utilization rates are also collected and combined with the power model to calculate the energy. The scratchpad space is prioritized in the order of temporary data, prefetched evks, and finally, ct caching with an LRU policy.
We measured T_{mult,a/slot} as a microbenchmark and evaluated the most complex applications currently available on CKKS: logistic regression (HELR (Han et al., 2019)), CNN inference (ResNet-20 (Lee et al., 2022)), and sorting (Hong et al., 2021). HELR trains a binary classification model on MNIST (Deng, 2012) for 30 iterations, each with a batch containing 1,024 14×14-pixel images. ResNet-20 performs homomorphic convolution, linear transform, and ReLU. It achieves 92.43% accuracy on CIFAR-10 classification
(Krizhevsky and Hinton, 2009). We used the channel-packing method proposed in (Juvekar et al., 2018) to pack all of the feature-map channels into a single ct to improve the performance further. Sorting uses a 2-way sorting network to sort the input data. Because nonlinear functions such as ReLU and comparison are approximated by high-degree polynomial functions in CKKS, they consume many levels and induce dozens and hundreds of bootstrapping operations for ResNet-20 and sorting, respectively.

Table 4. CKKS instances used for evaluating BTS.

| CKKS instance | L | dnum | log PQ | Min. T_{mult,a/slot} (ns) | Temp data |
|---|---|---|---|---|---|
| INS1 | 27 | 1 | 3090 | 133.4 | 183MB |
| INS2 | 39 | 2 | 3210 | 128.7 | 304MB |
| INS3 | 44 | 3 | 3160 | 130.8 | 365MB |
We compared BTS with the state-of-the-art implementations on a CPU (Lattigo (EPFL-LDS, 2021)), a GPU (100x (Jung et al., 2021a)), and an ASIC (F1 (Samardzic et al., 2021)) for T_{mult,a/slot} and HELR. We ran Lattigo on a system with an Intel Skylake CPU (Xeon Platinum 8160) and 256GB of DDR4-2666 memory. We used the 128b-secure CKKS instance preset of Lattigo and newly implemented HELR on Lattigo. For 100x and F1, the execution times reported in each paper were used. 100x (Jung et al., 2021a) used an NVIDIA V100 (NVIDIA Corporation, 2017) for its evaluation. We also compared BTS with F1+, whose execution times are optimistically scaled from F1 to have the same area as BTS at 7nm (Narasimha et al., 2017). For the other applications, we compared BTS with the multi-threaded CPU performance reported in each paper due to the absence of publicly available implementations. We used the CKKS instances shown in Table 4 to evaluate BTS. They all have the same degree N and satisfy 128b security but use different values of L and dnum. As L and dnum increase, the temporary data increases, requiring more scratchpad space.
6.3. Performance and efficiency of BTS
Amortized mult time per slot: BTS outperforms the state-of-the-art CPU/GPU/ASIC implementations by tens to thousands of times in terms of the throughput of HMult. Fig. 6 shows the T_{mult,a/slot} values of Lattigo, 100x, F1, F1+, and BTS. The best T_{mult,a/slot} is achieved with INS2 at 45.5ns, 2,237× better than Lattigo. F1 is even 2.5× slower than Lattigo; this occurs because F1 only supports single-slot bootstrapping. (We call a ct sparsely packed if its corresponding message occupies far fewer slots than the maximum number available. Bootstrapping a sparsely-packed ct reduces the computational complexity and consumes fewer levels (Chen et al., 2019); in the extreme single-slot case, this effect is maximized.) F1 supports only single-slot bootstrapping due to its lack of multiplicative levels, as it targets small parameter sets. F1+ is better but still shows 8–24× lower performance than BTS. The T_{mult,a/slot} of 100x is 743ns, the best among prior works. However, this is for a 97b-secure parameter set; when using a 173b-secure CKKS instance, 100x reported an 8µs T_{mult,a/slot}.
The measured performance of each INSx is lower than the minimum-bound performance shown in Fig. 2 because evks are not always resident on the capacity-limited scratchpad. Fig. 7(a) shows the minimum and actual T_{mult,a/slot} using 512MB and 2GB of scratchpad for each INSx. INS2 always performs the best. INS1 performs better than INS3 with a 512MB scratchpad because the former requires less temporary data, leading to a higher hit rate for evks. With a sufficient (albeit impractical) scratchpad capacity of 2GB, evks mostly hit, and the performance approaches the minimum bound.
Logistic regression: Table 5 reports the average training time per iteration of HELR. Due to the limited parameter set F1 supports, F1 only reported the HELR training time for a single iteration with 256 images, which does not require bootstrapping but is not enough for training. We estimated F1's end-to-end HELR performance by assuming that the 1,024 images in a batch are trained over four iterations with single-slot bootstrapping applied, ignoring the cost of packing/unpacking cts for bootstrapping (favoring F1). The execution time with INS2 reaches 28.4ms, which is 1,306×, 27×, and 5.2× better than Lattigo, 100x, and F1+, respectively.
Table 5. Average training time per iteration of HELR.

| | Lattigo | 100x | F1 | F1+ | INS1 | INS2 | INS3 |
|---|---|---|---|---|---|---|---|
| Time (ms) | 37,050 | 775 | 1,024 | 148 | 39.9 | 28.4 | 43.5 |
| Speedup | 1× | 48× | 36× | 250× | 929× | 1,306× | 852× |
Table 6. Execution time of ResNet-20 inference and sorting.

| | CPU | INS1 | INS2 | INS3 |
|---|---|---|---|---|
| ResNet-20 execution time (s) | 10,602 | 1.91 | 2.02 | 3.09 |
| Speedup (vs. (Lee et al., 2022)) | 1× | 5,556× | 5,240× | 3,427× |
| # of bootstrappings | – | 53 | 22 | 19 |
| Sorting execution time (s) | 23,066 | 15.6 | 18.8 | 25.2 |
| Speedup (vs. (Hong et al., 2021)) | 1× | 1,482× | 1,226× | 915× |
| # of bootstrappings | – | 521 | 306 | 229 |
ResNet-20 and sorting: BTS performs up to 5,556× and 1,482× faster than the prior works (Lee et al., 2022) and (Hong et al., 2021), respectively (see Table 6). For ResNet-20, INS1 without channel packing shows a 311× speedup. By adopting the channel-packing method (Juvekar et al., 2018), which exploits the abundant slots of our target CKKS instances, we reduced the working set and improved the throughput, resulting in an additional 17.8× performance gain and achieving a 1.91s ResNet-20 inference latency on an encrypted image.
Although BTS provides a speedup of more than three orders of magnitude for the most complex applications available, these applications still do not fully utilize all of the slots due to their small problem sizes. We anticipate that the relative speedup of BTS will improve even further when real-world applications are implemented with FHE. For instance, an ImageNet (Deng et al., 2009) image has more data points than the available slots, requiring multiple fully-packed cts to encrypt.

Parameter selection in retrospect: In Section 3, we estimated the T_{mult,a/slot} of CKKS instances assuming an always-hit scratchpad and used it as a proxy for the performance of FHE applications with frequent bootstrapping. While the T_{mult,a/slot} results from the simulator do not directly match the estimation, the 2GB scratchpad case (Fig. 7(a)) does concur. This is because the temporary data of INS3 constitutes the largest set (Table 4), and the corresponding hit rate is affected by the scratchpad capacity.
However, T_{mult,a/slot} does not always translate to application performance, for the following reasons. First, when the portion of bootstrapping is relatively small, as in ResNet-20 (Fig. 7(b)), the complexity of the HE ops becomes more influential, and a smaller parameter set is better (INS1 in Table 6). Second, the better T_{mult,a/slot} enabled by the deeper levels of higher dnums does not translate to better performance when there exists a level imbalance between cts. Such an imbalance nullifies the benefit of more available levels (compare INS1 and INS2 in Table 6).
PE resource utilization over time: The resources populated in the PEs are highly utilized while processing HE ops. Fig. 8 presents a detailed timeline of HMult on INS1 when the evks are on the scratchpad. HBM achieves 98% of its peak bandwidth. NTTUs are busy processing the (i)NTT of three intermediate polynomials (d2, ax, and bx) 76% of the time. BConv is partially pipelined with iNTT and has a strong dependency on the subsequent NTT; thus, it occupies the BConvU 33% of the time. The scratchpad bandwidth requirement of BConv is high because it must load the partial sums of Eq. 11 within the epochs. The BConvU runs SSA while not occupied by BConv.
The bandwidth and capacity utilization of the scratchpad fluctuate over time while remaining properly provisioned to meet the requirements. The average bandwidth usage was 58.6%, peaking at 90% when processing a BConv. The required capacity also peaked at 183MB, during BConv.ax.
Ablation study: To evaluate the impact of the various attributes of BTS on its performance, we first evaluated a small baseline BTS (~230mm²) with just enough scratchpad to hold the temporary data of Lattigo's CKKS instance and without overlapping BConv and iNTT. The result is a 379× faster T_{mult,a/slot} compared to Lattigo. We incrementally changed the CKKS instance to INS1 and then increased the scratchpad size to 512MB. These changes resulted in 1.50× and 3.18× speedups, respectively (see Fig. 9). Finally, additionally overlapping BConv and iNTT results in a 1.13× speedup, reaching a total of 2,044× speedup over Lattigo.
We also evaluated BTS with an HBM bandwidth of 2TB/s. We reduced the scratchpad size to make room for the added HBM2e PHYs so that BTS retains the same total area. The result shows only a 1.26× speedup, as a larger fraction of time becomes bound to computation despite the load time being halved.
Slowdown of FHE: FHE applications on BTS are still slower than their unencrypted counterparts. HELR is 141× slower and ResNet-20 inference is 440× slower than when they are run on a CPU system without FHE. The evaluation of non-polynomial functions such as ReLU, which are costly on FHE (Lee et al., 2021b), results in the greater slowdown of ResNet-20. Thus, it is crucial to optimize applications to make them more FHE-friendly.
Impact of the scratchpad size on performance and EDAP: The performance and energy efficiency of BTS improve as we deploy a larger scratchpad, but saturate once the scratchpad holds most of the working sets of the HE ops. Fig. 10 shows the execution-time breakdown and the energy-delay-area product (EDAP (Thoziyoor et al., 2008)) for the bootstrapping of INS1 with various scratchpad sizes. We increased the scratchpad size from 192MB (close to the temporary data size for HMult) in 64MB steps, up to 1GB.
With a 192MB scratchpad, BTS frequently loads cts from off-chip memory due to capacity misses. At this point, HMult/HRot, which used to be dominant (77% of the bootstrapping time for Lattigo) due to its high computational complexity, requires only 24% of the execution time; the rest is attributed to PMult, HAdd, HRescale, and CMult/CAdd. While BTS greatly reduces the computation time of HMult/HRot with its abundant PEs, the load time, which any HE op incurs when SW cache misses occur, is now dominant.
As the scratchpad size increases, the portion of bootstrapping time spent on HMult/HRot increases. This occurs because the SW cache hit rate of cts for every HE op gradually increases; with a 512MB scratchpad, the hit rates are 65.6%, 98.8%, 93.7%, 98.6%, 97.5%, and 47.8% for HMult, HRot, PMult, HAdd, HRescale, and CMult/CAdd, respectively. The execution time of HMult/HRot is lower-bounded by the evk load time, even on SW cache hits. However, the other HE ops, which do not require an evk, can take significantly less time when the necessary cts are located on the scratchpad, owing to the high ratio of on-chip to off-chip bandwidth.
7. Related Work
CPU acceleration: (CryptoLab Inc., 2018) parallelized HE ops via multithreading. (Jung et al., 2021b; Boemer et al., 2021) leveraged short-SIMD support. (EPFL-LDS, 2021) exploited the algorithmic analysis from (Bossuat et al., 2021) for an efficient bootstrapping implementation. Still, other platforms outperform CPU implementations.
GPU acceleration: GPUs are a good fit for accelerating HE ops as they are equipped with a massive number of integer units and abundant memory bandwidth. However, a majority of prior works did not handle bootstrapping (Al Badawi et al., 2019, 2018, 2020; Jung et al., 2021b). (Jung et al., 2021a) was the first work to support CKKS bootstrapping on a GPU. By fusing GPU kernels, (Jung et al., 2021a) reduced off-chip accesses and achieved 242× faster bootstrapping than a CPU. However, the lack of on-chip storage forces some kernels to remain unfused (Kim et al., 2020a). BTS holds all temporary data on-chip, minimizing off-chip accesses.
FPGA/ASIC acceleration: A different set of works accelerate HE using FPGAs or ASICs, but most did not consider bootstrapping (Riazi et al., 2020; Roy et al., 2019; Kim et al., 2020b, 2019; Reagen et al., 2021). HEAX (Riazi et al., 2020) dedicated FPGA hardware to CKKS mult, reaching a 200× performance gain over a CPU implementation. However, its design is fixed to a limited set of parameters and does not consider bootstrapping. Cheetah (Reagen et al., 2021) introduced algorithmic optimizations for HE-based DNNs and proposed an accelerator design suitable for them. Instead of bootstrapping, Cheetah uses multi-party computation (MPC) to mitigate errors during HE operations: it sends a ciphertext with error back to the client, and the client re-encrypts it as a fresh ciphertext. In MPC, the network latency of the frequent communication with the client limits performance (van der Hagen and Lucia, 2021), introducing a different challenge compared to FHE. The accelerator design of Cheetah targets the small ciphertexts of MPC, which are not suitable for FHE (Samardzic et al., 2021). F1 (Samardzic et al., 2021) is the first ASIC design that partially supports bootstrapping. It is a programmable accelerator supporting multiple FHE schemes, including CKKS and BGV. F1 achieves impressive performance on various LHE applications as it provides tailored high-throughput computation units and stores evks on-chip, minimizing the number of off-chip accesses. However, F1 targets parameter sets with a low degree N, thus supporting only non-packed (single-slot) bootstrapping, whose throughput is greatly degraded compared to BTS. F1 is 151.4mm² in size at a 12/14nm technology node and shows a TDP of 180.4W, excluding HBM power.
8. Conclusion
We have proposed an accelerator architecture for fully homomorphic encryption (FHE), primarily optimized for the throughput of bootstrapping encrypted data. By analyzing the impact of key parameter choices on the bootstrapping performance of CKKS, an emerging HE scheme, we devised the design principles of bootstrappable HE accelerators and proposed BTS, which distributes massively parallel processing elements connected through a network-on-chip design tailored to the unique traffic patterns of the number theoretic transform and the automorphism, the critical functions of HE ops. We designed BTS to balance off-chip memory accesses, on-chip data reuse, and the computations required for bootstrapping. With BTS, we obtained speedups of 2,237× in HE multiplication throughput and 5,556× in CNN inference compared to the state-of-the-art CPU implementations.
Acknowledgements.
This work was supported in part by the Institute of Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. 2020000840, 40%) and the National Research Foundation of Korea (NRF) grant funded by the Korean government (MSIT) (No. 2020R1A2C2010601, 60%). The EDA tool was supported by the IC Design Education Center (IDEC), Korea. Sangpyo Kim is with the Department of Intelligence and Information, Seoul National University. Jung Ho Ahn, the corresponding author, is with the Department of Intelligence and Information, the Institute of Computer Technology, and the Research Institute for Convergence Science, Seoul National University, Seoul, South Korea.

References
 Ahn et al. (2009) Jung Ho Ahn, Nathan L. Binkert, Al Davis, Moray McLaren, and Robert S. Schreiber. 2009. HyperX: Topology, Routing, and Packaging of Efficient LargeScale Networks. In SC. https://doi.org/10.1145/1654059.1654101
 Al Badawi et al. (2020) Ahmad Al Badawi, Louie Hoang, Chan Fook Mun, Kim Laine, and Khin Mi Mi Aung. 2020. Privft: Private and Fast Text Classification with Homomorphic Encryption. IEEE Access 8 (2020), 226544–226556. https://doi.org/10.1109/ACCESS.2020.3045465
 Al Badawi et al. (2019) Ahmad Al Badawi, Yuriy Polyakov, Khin Mi Mi Aung, Bharadwaj Veeravalli, and Kurt Rohloff. 2019. Implementation and Performance Evaluation of RNS Variants of the BFV Homomorphic Encryption Scheme. IEEE Transactions on Emerging Topics in Computing 9, 2 (2019), 941–956. https://doi.org/10.1109/TETC.2019.2902799
 Al Badawi et al. (2018) Ahmad Al Badawi, Bharadwaj Veeravalli, Chan Fook Mun, and Khin Mi Mi Aung. 2018. HighPerformance FV Somewhat Homomorphic Encryption on GPUs: An Implementation Using CUDA. IACR Transactions on Cryptographic Hardware and Embedded Systems 2018, 2 (2018), 143–163. https://doi.org/10.13154/tches.v2018.i2.7095
 Albrecht et al. (2019) Martin R. Albrecht, Melissa Chase, Hao Chen, Jintai Ding, Shafi Goldwasser, Sergey Gorbunov, Shai Halevi, Jeffrey Hoffstein, Kim Laine, Kristin E. Lauter, Satya Lokam, Daniele Micciancio, Dustin Moody, Travis Morrison, Amit Sahai, and Vinod Vaikuntanathan. 2019. Homomorphic Encryption Standard. IACR Cryptology ePrint Archive 939 (2019).
 Auth et al. (2017) Chris Auth, A. Aliyarukunju, M. Asoro, D. Bergstrom, V. Bhagwat, J. Birdsall, N. Bisnik, M. Buehler, V. Chikarmane, G. Ding, Q. Fu, H. Gomez, W. Han, D. Hanken, M. Haran, M. Hattendorf, R. Heussner, H. Hiramatsu, B. Ho, S. Jaloviar, I. Jin, S. Joshi, S. Kirby, S. Kosaraju, H. Kothari, G. Leatherman, K. Lee, J. Leib, A. Madahavan, K. Marla, H. Meyer, T. Mule, C. Parker, S. Parthasarathy, C. Pelto, L. Pipes, I. Post, M. Prince, A. Rahman, S. Rajamani, A. Saha, J. Dacuna Santos, M. Sharma, V. Sharma, J. Shin, P. Sinha, P. Smith, M. Sprinkle, A. St. Amour, C. Staus, R. Suri, D. Towner, A. Tripathi, A. Tura, C. Ward, and A. Yeoh. 2017. A 10nm High Performance and LowPower CMOS Technology Featuring 3rd Generation FinFET Transistors, SelfAligned Quad Patterning, Contact over Active Gate and Cobalt Local Interconnects. In IEEE International Electron Devices Meeting. https://doi.org/10.1109/IEDM.2017.8268472
 Bajard et al. (2016) JeanClaude Bajard, Julien Eynard, M. Anwar Hasan, and Vincent Zucca. 2016. A Full RNS Variant of FV Like Somewhat Homomorphic Encryption Schemes. In Selected Areas in Cryptography. https://doi.org/10.1007/9783319694535_23
 Banerjee and Mehrotra (2002) Kaustav Banerjee and Amit Mehrotra. 2002. A PowerOptimal Repeater Insertion Methodology for Global Interconnects in Nanometer Designs. IEEE Transactions on Electron Devices 49, 11 (2002), 2001–2007. https://doi.org/10.1109/TED.2002.804706
 Barrett (1986) Paul Barrett. 1986. Implementing the Rivest Shamir and Adleman public key encryption algorithm on a standard digital signal processor. In Annual International Conference on the Theory and Application of Cryptographic Techniques. https://doi.org/10.5555/36664.36688
 Bichan et al. (2020) Mike Bichan, Clifford Ting, Bahram Zand, Jing Wang, Ruslana Shulyzki, James Guthrie, Katya Tyshchenko, Junhong Zhao, Alireza Parsafar, Eric Liu, Aynaz Vatankhahghadim, Shaham Sharifian, Aleksey Tyshchenko, Michael De Vita, Syed Rubab, Sitaraman Iyer, Fulvio Spagna, and Noam Dolev. 2020. A 32Gb/s NRZ 37dB SerDes in 10nm CMOS to Support PCI Express Gen 5 Protocol. In IEEE Custom Integrated Circuits Conference. https://doi.org/10.1109/CICC48029.2020.9075947
 Boemer et al. (2021) Fabian Boemer, Sejun Kim, Gelila Seifu, Fillipe D. M. de Souza, and Vinodh Gopal. 2021. Intel HEXL: Accelerating Homomorphic Encryption with Intel AVX512-IFMA52. In Workshop on Encrypted Computing & Applied Homomorphic Cryptography. https://doi.org/10.1145/3474366.3486926
 Bossuat et al. (2021) Jean-Philippe Bossuat, Christian Mouchet, Juan Ramón Troncoso-Pastoriza, and Jean-Pierre Hubaux. 2021. Efficient Bootstrapping for Approximate Homomorphic Encryption with Non-sparse Keys. In Annual International Conference on the Theory and Applications of Cryptographic Techniques. https://doi.org/10.1007/978-3-030-77870-5_21
 Brakerski et al. (2014) Zvika Brakerski, Craig Gentry, and Vinod Vaikuntanathan. 2014. (Leveled) Fully Homomorphic Encryption without Bootstrapping. ACM Transactions on Computing Theory 6, 3 (2014). https://doi.org/10.1145/2633600
 Brakerski and Vaikuntanathan (2014) Zvika Brakerski and Vinod Vaikuntanathan. 2014. Efficient Fully Homomorphic Encryption from (Standard) LWE. SIAM J. Comput. 43, 2 (2014), 831–871. https://doi.org/10.1137/120868669
 Brutzkus et al. (2019) Alon Brutzkus, Ran Gilad-Bachrach, and Oren Elisha. 2019. Low Latency Privacy Preserving Inference. In International Conference on Machine Learning, Vol. 97. 812–821.
 Chang et al. (2017) Jonathan Chang, Yen-Huei Chen, Wei-Min Chan, Sahil Preet Singh, Hank Cheng, Hidehiro Fujiwara, Jih-Yu Lin, Kao-Cheng Lin, John Hung, Robin Lee, Hung-Jen Liao, Jhon-Jhy Liaw, Quincy Li, Chih-Yung Lin, Mu-Chi Chiang, and Shien-Yang Wu. 2017. 12.1 A 7nm 256Mb SRAM in High-K Metal-Gate FinFET Technology with Write-Assist Circuitry for Low-VMIN Applications. In IEEE International Solid-State Circuits Conference. https://doi.org/10.1109/ISSCC.2017.7870333
 Chen et al. (2019) Hao Chen, Ilaria Chillotti, and Yongsoo Song. 2019. Improved Bootstrapping for Approximate Homomorphic Encryption. In Annual International Conference on the Theory and Applications of Cryptographic Techniques. https://doi.org/10.1007/978-3-030-17656-3_2
 Chen et al. (2016) Yu-Hsin Chen, Joel Emer, and Vivienne Sze. 2016. Eyeriss: A Spatial Architecture for Energy-Efficient Dataflow for Convolutional Neural Networks. In ISCA. https://doi.org/10.1109/ISCA.2016.40
 Cheon et al. (2018a) Jung Hee Cheon, Kyoohyung Han, Andrey Kim, Miran Kim, and Yongsoo Song. 2018a. A Full RNS Variant of Approximate Homomorphic Encryption. In Selected Areas in Cryptography. https://doi.org/10.1007/978-3-030-10970-7_16
 Cheon et al. (2018b) Jung Hee Cheon, Kyoohyung Han, Andrey Kim, Miran Kim, and Yongsoo Song. 2018b. Bootstrapping for Approximate Homomorphic Encryption. In Annual International Conference on the Theory and Applications of Cryptographic Techniques. https://doi.org/10.1007/978-3-319-78381-9_14
 Cheon et al. (2019) Jung Hee Cheon, Minki Hhan, Seungwan Hong, and Yongha Son. 2019. A Hybrid of Dual and Meet-in-the-Middle Attack on Sparse and Ternary Secret LWE. IEEE Access 7 (2019), 89497–89506. https://doi.org/10.1109/ACCESS.2019.2925425
 Cheon et al. (2017) Jung Hee Cheon, Andrey Kim, Miran Kim, and Yong Soo Song. 2017. Homomorphic Encryption for Arithmetic of Approximate Numbers. In International Conference on the Theory and Applications of Cryptology and Information Security. https://doi.org/10.1007/978-3-319-70694-8_15
 Cheon et al. (2022) Jung Hee Cheon, Yongha Son, and Donggeon Yhee. 2022. Practical FHE Parameters against Lattice Attacks. Journal of the Korean Mathematical Society 59, 1 (2022), 35–51. https://doi.org/10.4134/JKMS.j200650
 Chillotti et al. (2020) Ilaria Chillotti, Nicolas Gama, Mariya Georgieva, and Malika Izabachène. 2020. TFHE: Fast Fully Homomorphic Encryption Over the Torus. Journal of Cryptology 33, 1 (2020), 34–91. https://doi.org/10.1007/s00145-019-09319-x
 Choquette et al. (2021) Jack Choquette, Wishwesh Gandhi, Olivier Giroux, Nick Stam, and Ronny Krashinsky. 2021. NVIDIA A100 Tensor Core GPU: Performance and Innovation. IEEE Micro 41, 2 (2021), 29–35. https://doi.org/10.1109/MM.2021.3061394
 Clark et al. (2017) Lawrence T Clark, Vinay Vashishtha, David M Harris, Samuel Dietrich, and Zunyan Wang. 2017. Design Flows and Collateral for the ASAP7 7nm FinFET Predictive Process Design Kit. In IEEE International Conference on Microelectronic Systems Education. https://doi.org/10.1109/MSE.2017.7945071
 Clark et al. (2016) Lawrence T Clark, Vinay Vashishtha, Lucian Shifren, Aditya Gujja, Saurabh Sinha, Brian Cline, Chandarasekaran Ramamurthy, and Greg Yeric. 2016. ASAP7: A 7nm FinFET Predictive Process Design Kit. Microelectronics Journal 53 (2016), 105–115. https://doi.org/10.1016/j.mejo.2016.04.006
 Cooley and Tukey (1965) James W. Cooley and John W. Tukey. 1965. An Algorithm for the Machine Calculation of Complex Fourier Series. Math. Comp. 19, 90 (1965), 297–301. https://doi.org/10.1090/s0025-5718-1965-0178586-1
 CryptoLab Inc. (2018) CryptoLab Inc. 2018. HEAAN v2.1. https://github.com/snucrypto/HEAAN
 Curtis and Player (2019) Benjamin R. Curtis and Rachel Player. 2019. On the Feasibility and Impact of Standardising Sparse-secret LWE Parameter Sets for Homomorphic Encryption. In ACM Workshop on Encrypted Computing & Applied Homomorphic Cryptography. https://doi.org/10.1145/3338469.3358940
 Damgård et al. (2012) Ivan Damgård, Valerio Pastro, Nigel P. Smart, and Sarah Zakarias. 2012. Multiparty Computation from Somewhat Homomorphic Encryption. In Annual International Cryptology Conference. https://doi.org/10.1007/978-3-642-32009-5_38
 Dathathri et al. (2020) Roshan Dathathri, Blagovesta Kostova, Olli Saarikivi, Wei Dai, Kim Laine, and Madan Musuvathi. 2020. EVA: An Encrypted Vector Arithmetic Language and Compiler for Efficient Homomorphic Computation. In ACM SIGPLAN International Conference on Programming Language Design and Implementation. https://doi.org/10.1145/3385412.3386023
 Deng et al. (2009) Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. 2009. ImageNet: A Large-Scale Hierarchical Image Database. In IEEE Conference on Computer Vision and Pattern Recognition. https://doi.org/10.1109/CVPR.2009.5206848
 Deng (2012) Li Deng. 2012. The MNIST Database of Handwritten Digit Images for Machine Learning Research. IEEE Signal Processing Magazine 29, 6 (2012), 141–142. https://doi.org/10.1109/MSP.2012.2211477
 EPFL-LDS (2021) EPFL-LDS. 2021. Lattigo v2.3.0. https://github.com/ldsec/lattigo
 Fan and Vercauteren (2012) Junfeng Fan and Frederik Vercauteren. 2012. Somewhat Practical Fully Homomorphic Encryption. IACR Cryptology ePrint Archive 144 (2012).
 Gentry (2009) Craig Gentry. 2009. Fully Homomorphic Encryption Using Ideal Lattices. In ACM Symposium on Theory of Computing. https://doi.org/10.1145/1536414.1536440
 Gentry and Halevi (2011) Craig Gentry and Shai Halevi. 2011. Implementing Gentry’s Fully-Homomorphic Encryption Scheme. In Annual International Conference on the Theory and Applications of Cryptographic Techniques. https://doi.org/10.1007/978-3-642-20465-4_9
 Han et al. (2019) Kyoohyung Han, Seungwan Hong, Jung Hee Cheon, and Daejun Park. 2019. Logistic Regression on Homomorphic Encrypted Data at Scale. In AAAI Conference on Artificial Intelligence. https://doi.org/10.1609/aaai.v33i01.33019466
 Han and Ki (2020) Kyoohyung Han and Dohyeong Ki. 2020. Better Bootstrapping for Approximate Homomorphic Encryption. In Cryptographers’ Track at the RSA Conference. https://doi.org/10.1007/978-3-030-40186-3_16
 Ho et al. (2001) Ron Ho, Kenneth Mai, and Mark Horowitz. 2001. The Future of Wires. Proc. IEEE 89, 4 (2001), 490–504. https://doi.org/10.1109/5.920580
 Hong et al. (2021) Seungwan Hong, Seunghong Kim, Jiheon Choi, Younho Lee, and Jung Hee Cheon. 2021. Efficient Sorting of Homomorphic Encrypted Data With k-Way Sorting Network. IEEE Transactions on Information Forensics and Security 16 (2021), 4389–4404. https://doi.org/10.1109/TIFS.2021.3106167
 IEEE (2018) IEEE. 2018. International Roadmap for Devices and Systems: 2018. Technical Report. https://irds.ieee.org/editions/2018/
 JEDEC (2021) JEDEC. 2021. High Bandwidth Memory (HBM) DRAM. Technical Report JESD235D.
 JEDEC (2022) JEDEC. 2022. High Bandwidth Memory DRAM (HBM3). Technical Report JESD238.
 Jeong et al. (2018) W.C. Jeong, S. Maeda, H.J. Lee, K.W. Lee, T.J. Lee, D.W. Park, B.S. Kim, J.H. Do, T. Fukai, D.J. Kwon, K.J. Nam, W.J. Rim, M.S. Jang, H.T. Kim, Y.W. Lee, J.S. Park, E.C. Lee, D.W. Ha, C.H. Park, H.J. Cho, S.M. Jung, and H.K. Kang. 2018. True 7nm Platform Technology featuring Smallest FinFET and Smallest SRAM cell by EUV, Special Constructs and 3rd Generation Single Diffusion Break. In IEEE Symposium on VLSI Technology. https://doi.org/10.1109/VLSIT.2018.8510682
 Jouppi et al. (2021) Norman P. Jouppi, Doe Hyun Yoon, Matthew Ashcraft, Mark Gottscho, Thomas B. Jablin, George Kurian, James Laudon, Sheng Li, Peter C. Ma, Xiaoyu Ma, Thomas Norrie, Nishant Patil, Sushma Prasad, Cliff Young, Zongwei Zhou, and David A. Patterson. 2021. Ten Lessons From Three Generations Shaped Google’s TPUv4i: Industrial Product. In ISCA. https://doi.org/10.1109/ISCA52012.2021.00010
 Jung et al. (2021a) Wonkyung Jung, Sangpyo Kim, Jung Ho Ahn, Jung Hee Cheon, and Younho Lee. 2021a. Over 100x Faster Bootstrapping in Fully Homomorphic Encryption through Memory-centric Optimization with GPUs. IACR Transactions on Cryptographic Hardware and Embedded Systems 2021, 4 (2021), 114–148. https://doi.org/10.46586/tches.v2021.i4.114-148
 Jung et al. (2021b) Wonkyung Jung, Eojin Lee, Sangpyo Kim, Jongmin Kim, Namhoon Kim, Keewoo Lee, Chohong Min, Jung Hee Cheon, and Jung Ho Ahn. 2021b. Accelerating Fully Homomorphic Encryption Through Architecture-Centric Analysis and Optimization. IEEE Access 9 (2021), 98772–98789. https://doi.org/10.1109/ACCESS.2021.3096189
 Juvekar et al. (2018) Chiraag Juvekar, Vinod Vaikuntanathan, and Anantha Chandrakasan. 2018. GAZELLE: A Low Latency Framework for Secure Neural Network Inference. In USENIX Security Symposium.
 Kim et al. (2007) John Kim, James Balfour, and William Dally. 2007. Flattened Butterfly Topology for On-Chip Networks. In MICRO. 172–182. https://doi.org/10.1109/MICRO.2007.29
 Kim et al. (2020a) Sangpyo Kim, Wonkyung Jung, Jaiyoung Park, and Jung Ho Ahn. 2020a. Accelerating Number Theoretic Transformations for Bootstrappable Homomorphic Encryption on GPUs. In IEEE International Symposium on Workload Characterization. https://doi.org/10.1109/IISWC50251.2020.00033
 Kim et al. (2019) Sunwoong Kim, Keewoo Lee, Wonhee Cho, Jung Hee Cheon, and Rob A. Rutenbar. 2019. FPGA-based Accelerators of Fully Pipelined Modular Multipliers for Homomorphic Encryption. In International Conference on ReConFigurable Computing and FPGAs. https://doi.org/10.1109/ReConFig48160.2019.8994793
 Kim et al. (2020b) Sunwoong Kim, Keewoo Lee, Wonhee Cho, Yujin Nam, Jung Hee Cheon, and Rob A. Rutenbar. 2020b. Hardware Architecture of a Number Theoretic Transform for a Bootstrappable RNS-based Homomorphic Encryption Scheme. In IEEE International Symposium on Field-Programmable Custom Computing Machines. https://doi.org/10.1109/FCCM48280.2020.00017
 Knowles (2021) Simon Knowles. 2021. Graphcore. In IEEE Hot Chips 33 Symposium. https://doi.org/10.1109/HCS52781.2021.9567075
 Krizhevsky and Hinton (2009) Alex Krizhevsky and Geoffrey Hinton. 2009. Learning Multiple Layers of Features from Tiny Images. Technical Report. University of Toronto.
 Lee et al. (2021b) Junghyun Lee, Eunsang Lee, Joon-Woo Lee, Yongjune Kim, Young-Sik Kim, and Jong-Seon No. 2021b. Precise Approximation of Convolutional Neural Networks for Homomorphically Encrypted Data. arXiv preprint arXiv:2105.10879 (2021).
 Lee et al. (2021a) Joon-Woo Lee, Eunsang Lee, Yongwoo Lee, Young-Sik Kim, and Jong-Seon No. 2021a. High-Precision Bootstrapping of RNS-CKKS Homomorphic Encryption Using Optimal Minimax Polynomial Approximation and Inverse Sine Function. In Annual International Conference on the Theory and Applications of Cryptographic Techniques. https://doi.org/10.1007/978-3-030-77870-5_22
 Lee et al. (2022) Joon-Woo Lee, Hyungchul Kang, Yongwoo Lee, Woosuk Choi, Jieun Eom, Maxim Deryabin, Eunsang Lee, Junghyun Lee, Donghoon Yoo, Young-Sik Kim, and Jong-Seon No. 2022. Privacy-Preserving Machine Learning With Fully Homomorphic Encryption for Deep Neural Network. IEEE Access 10 (2022), 30039–30054. https://doi.org/10.1109/ACCESS.2022.3159694
 Lee et al. (2020) Yongwoo Lee, Joon-Woo Lee, Young-Sik Kim, Hyung-Chul Kang, and Jong-Seon No. 2020. High-Precision and Low-Complexity Approximate Homomorphic Encryption by Error Variance Minimization. IACR Cryptology ePrint Archive 1549 (2020).
 Medina and Dagan (2020) Eitan Medina and Eran Dagan. 2020. Habana Labs Purpose-Built AI Inference and Training Processor Architectures: Scaling AI Training Systems Using Standard Ethernet With Gaudi Processor. IEEE Micro 40, 2 (2020), 17–24. https://doi.org/10.1109/MM.2020.2975185
 Micron Technology, Inc. (2020) Micron Technology, Inc. 2020. 8GB/16GB HBM2E with ECC. Technical Report CCM005141278619510301 - Rev. D 08/2020 EN. https://media-www.micron.com/-/media/client/global/documents/products/data-sheet/dram/hbm2e/8gb_and_16gb_hbm2e_dram.pdf?rev=dbfcf653271041a497e5f1bef1a169ca
 Moon et al. (2008) Peter Moon, Vinay Chikarmane, Kevin Fischer, Rohit Grover, Tarek A Ibrahim, Doug Ingerly, Kevin J Lee, Chris Litteken, Tony Mule, and Sarah Williams. 2008. Process and Electrical Results for the On-die Interconnect Stack for Intel’s 45nm Process Generation. Intel Technology Journal 12, 2 (2008).
 Narasimha et al. (2017) S. Narasimha, B. Jagannathan, A. Ogino, D. Jaeger, B. Greene, C. Sheraw, K. Zhao, B. Haran, U. Kwon, A. K. M. Mahalingam, B. Kannan, B. Morganfeld, J. Dechene, C. Radens, A. Tessier, A. Hassan, H. Narisetty, I. Ahsan, M. Aminpur, C. An, M. Aquilino, A. Arya, R. Augur, N. Baliga, R. Bhelkar, G. Biery, A. Blauberg, N. Borjemscaia, A. Bryant, L. Cao, V. Chauhan, M. Chen, L. Cheng, J. Choo, C. Christiansen, T. Chu, B. Cohen, R. Coleman, D. Conklin, S. Crown, A. da Silva, D. Dechene, G. Derderian, S. Deshpande, G. Dilliway, K. Donegan, M. Eller, Y. Fan, Q. Fang, A. Gassaria, R. Gauthier, S. Ghosh, G. Gifford, T. Gordon, M. Gribelyuk, G. Han, J.H. Han, K. Han, M. Hasan, J. Higman, J. Holt, L. Hu, L. Huang, C. Huang, T. Hung, Y. Jin, J. Johnson, S. Johnson, V. Joshi, M. Joshi, P. Justison, S. Kalaga, T. Kim, W. Kim, R. Krishnan, B. Krishnan, K. Anil, M. Kumar, J. Lee, R. Lee, J. Lemon, S.L. Liew, P. Lindo, M. Lingalugari, M. Lipinski, P. Liu, J. Liu, S. Lucarini, W. Ma, E. Maciejewski, S. Madisetti, A. Malinowski, J. Mehta, C. Meng, S. Mitra, C. Montgomery, H. Nayfeh, T. Nigam, G. Northrop, K. Onishi, C. Ordonio, M. Ozbek, R. Pal, S. Parihar, O. Patterson, E. Ramanathan, I. Ramirez, R. Ranjan, J. Sarad, V. Sardesai, S. Saudari, C. Schiller, B. Senapati, C. Serrau, N. Shah, T. Shen, H. Sheng, J. Shepard, Y. Shi, M.C. Silvestre, D. Singh, Z. Song, J. Sporre, P. Srinivasan, Z. Sun, A. Sutton, R. Sweeney, K. Tabakman, M. Tan, X. Wang, E. Woodard, G. Xu, D. Xu, T. Xuan, Y. Yan, J. Yang, K.B. Yeap, M. Yu, A. Zainuddin, J. Zeng, K. Zhang, M. Zhao, Y. Zhong, R. Carter, C.H. Lin, S. Grunow, C. Child, M. Lagus, R. Fox, E. Kaste, G. Gomba, S. Samavedam, P. Agnello, and D. K. Sohn. 2017. A 7nm CMOS Technology Platform for Mobile and High Performance Compute Application. In IEEE International Electron Devices Meeting. https://doi.org/10.1109/IEDM.2017.8268476
 NVIDIA Corporation (2017) NVIDIA Corporation. 2017. NVIDIA Tesla V100 GPU Architecture. Technical Report WP-08608-001_v1.1. https://images.nvidia.com/content/volta-architecture/pdf/volta-architecture-whitepaper.pdf
 O’Connor et al. (2017) Mike O’Connor, Niladrish Chatterjee, Donghyuk Lee, John Wilson, Aditya Agrawal, Stephen W Keckler, and William J Dally. 2017. Fine-Grained DRAM: Energy-Efficient DRAM for Extreme Bandwidth Systems. In MICRO. https://doi.org/10.1145/3123939.3124545
 PALISADE Project (2021) PALISADE Project. 2021. PALISADE Lattice Cryptography Library (release 1.11.5). https://palisade-crypto.org/
 Passas et al. (2012) Giorgos Passas, Manolis Katevenis, and Dionisios Pnevmatikatos. 2012. Crossbar NoCs are Scalable Beyond 100 Nodes. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 31, 4 (2012), 573–585. https://doi.org/10.1109/TCAD.2011.2176730
 Prabhakar and Jairath (2021) Raghu Prabhakar and Sumti Jairath. 2021. SambaNova SN10 RDU: Accelerating Software 2.0 with Dataflow. In IEEE Hot Chips 33 Symposium. https://doi.org/10.1109/HCS52781.2021.9567250
 Ranganathan et al. (2021) Parthasarathy Ranganathan, Daniel Stodolsky, Jeff Calow, Jeremy Dorfman, Marisabel Guevara, Clinton Wills Smullen IV, Aki Kuusela, Raghu Balasubramanian, Sandeep Bhatia, Prakash Chauhan, Anna Cheung, In Suk Chong, Niranjani Dasharathi, Jia Feng, Brian Fosco, Samuel Foss, Ben Gelb, Sara J. Gwin, Yoshiaki Hase, Dake He, C. Richard Ho, Roy W. Huffman Jr., Elisha Indupalli, Indira Jayaram, Poonacha Kongetira, Cho Mon Kyaw, Aaron Laursen, Yuan Li, Fong Lou, Kyle A. Lucke, JP Maaninen, Ramon Macias, Maire Mahony, David Alexander Munday, Srikanth Muroor, Narayana Penukonda, Eric Perkins-Argueta, Devin Persaud, Alex Ramirez, Ville-Mikko Rautio, Yolanda Ripley, Amir Salek, Sathish Sekar, Sergey N. Sokolov, Rob Springer, Don Stark, Mercedes Tan, Mark S. Wachsler, Andrew C. Walton, David A. Wickeraad, Alvin Wijaya, and Hon Kwan Wu. 2021. Warehouse-Scale Video Acceleration: Co-Design and Deployment in the Wild. In ACM International Conference on Architectural Support for Programming Languages and Operating Systems. https://doi.org/10.1145/3445814.3446723
 Reagen et al. (2021) Brandon Reagen, Woo-Seok Choi, Yeongil Ko, Vincent T. Lee, Hsien-Hsin S. Lee, Gu-Yeon Wei, and David Brooks. 2021. Cheetah: Optimizing and Accelerating Homomorphic Encryption for Private Inference. In HPCA. https://doi.org/10.1109/HPCA51647.2021.00013
 Regev (2009) Oded Regev. 2009. On Lattices, Learning with Errors, Random Linear Codes, and Cryptography. J. ACM 56, 6 (2009), 40 pages. https://doi.org/10.1145/1568318.1568324
 Riazi et al. (2020) M. Sadegh Riazi, Kim Laine, Blake Pelton, and Wei Dai. 2020. HEAX: An Architecture for Computing on Encrypted Data. In ASPLOS. https://doi.org/10.1145/3373376.3378523
 Roy et al. (2019) Sujoy Sinha Roy, Furkan Turan, Kimmo Järvinen, Frederik Vercauteren, and Ingrid Verbauwhede. 2019. FPGA-Based High-Performance Parallel Architecture for Homomorphic Computing on Encrypted Data. In HPCA. https://doi.org/10.1109/HPCA.2019.00052
 Samardzic et al. (2021) Nikola Samardzic, Axel Feldmann, Aleksandar Krastev, Srinivas Devadas, Ronald Dreslinski, Christopher Peikert, and Daniel Sanchez. 2021. F1: A Fast and Programmable Accelerator for Fully Homomorphic Encryption. In MICRO. https://doi.org/10.1145/3466752.3480070
 Shafaei et al. (2014) Alireza Shafaei, Yanzhi Wang, Xue Lin, and Massoud Pedram. 2014. FinCACTI: Architectural Analysis and Modeling of Caches with Deeply-Scaled FinFET Devices. In IEEE Computer Society Annual Symposium on VLSI. https://doi.org/10.1109/ISVLSI.2014.94
 Son (2021) Yongha Son. 2021. SparseLWEestimator. https://github.com/Yongyongha/SparseLWEestimator
 Song et al. (2018) Taejoong Song, Jonghoon Jung, Woojin Rim, Hoonki Kim, Yongho Kim, Changnam Park, Jeongho Do, Sunghyun Park, Sungwee Cho, Hyuntaek Jung, Bongjae Kwon, Hyun-Su Choi, Jaeseung Choi, and Jong Shik Yoon. 2018. A 7nm FinFET SRAM Using EUV Lithography with Dual Write-Driver-Assist Circuitry for Low-Voltage Applications. In IEEE International Solid-State Circuits Conference. https://doi.org/10.1109/ISSCC.2018.8310252
 Thoziyoor et al. (2008) Shyamkumar Thoziyoor, Jung Ho Ahn, Matteo Monchiero, Jay B. Brockman, and Norman P. Jouppi. 2008. A Comprehensive Memory Modeling Tool and Its Application to the Design and Analysis of Future Memory Hierarchies. In ISCA. https://doi.org/10.1145/1394608.1382127
 van der Hagen and Lucia (2021) McKenzie van der Hagen and Brandon Lucia. 2021. Practical Encrypted Computing for IoT Clients. arXiv preprint arXiv:2103.06743 (2021).
 Wu et al. (2016) Shien-Yang Wu, C.Y. Lin, M.C. Chiang, J.J. Liaw, J.Y. Cheng, S.H. Yang, C.H. Tsai, P.N. Chen, T. Miyashita, C.H. Chang, V.S. Chang, K.H. Pan, J.H. Chen, Y.S. Mor, K.T. Lai, C.S. Liang, H.F. Chen, S.Y. Chang, C.J. Lin, C.H. Hsieh, R.F. Tsui, C.H. Yao, C.C. Chen, R. Chen, C.H. Lee, H.J. Lin, C.W. Chang, K.W. Chen, M.H. Tsai, K.S. Chen, Y. Ku, and S.M. Jang. 2016. A 7nm CMOS Platform Technology Featuring 4th Generation FinFET Transistors with a 0.027um2 High Density 6T SRAM cell for Mobile SoC Applications. In IEEE International Electron Devices Meeting. https://doi.org/10.1109/IEDM.2016.7838333
 Xin et al. (2020) Guozhu Xin, Jun Han, Tianyu Yin, Yuchao Zhou, Jianwei Yang, Xu Cheng, and Xiaoyang Zeng. 2020. VPQC: A Domain-Specific Vector Processor for Post-Quantum Cryptography Based on RISC-V Architecture. IEEE Transactions on Circuits and Systems I: Regular Papers 67, 8 (2020), 2672–2684. https://doi.org/10.1109/TCSI.2020.2983185
 Xin et al. (2021) Guozhu Xin, Yifan Zhao, and Jun Han. 2021. A Multi-Layer Parallel Hardware Architecture for Homomorphic Computation in Machine Learning. In IEEE International Symposium on Circuits and Systems. https://doi.org/10.1109/ISCAS51556.2021.9401623
 Xing and Li (2021) Yufei Xing and Shuguo Li. 2021. A Compact Hardware Implementation of CCA-secure Key Exchange Mechanism CRYSTALS-KYBER on FPGA. IACR Transactions on Cryptographic Hardware and Embedded Systems 2021, 2 (2021), 328–356. https://doi.org/10.46586/tches.v2021.i2.328-356
 Zhang et al. (2021) Ye Zhang, Shuo Wang, Xian Zhang, Jiangbin Dong, Xingzhong Mao, Fan Long, Cong Wang, Dong Zhou, Mingyu Gao, and Guangyu Sun. 2021. PipeZK: Accelerating Zero-Knowledge Proof with a Pipelined Architecture. In ISCA. https://doi.org/10.1109/ISCA52012.2021.00040