## I Introduction

As cloud computing becomes an increasingly dominant way of providing computing resources, numerous computations are performed on datacenter servers rather than on personal devices [comm-2008-cloud, comm-2010-view]. It enables the client without expensive hardware to receive services that require complex computations. However, security and privacy issues are also emerging with the growth of cloud computing [aina-2010-cloud, nca-2011-cloud]. When a client sends private data to a server, security issues in data transfers can be resolved by sending the data after encryption. However, the data encoded by a conventional encryption method must be decrypted to perform the computation in the server. Therefore, a user have no choice but to use the cloud service with a risk of security or privacy attack (e.g., abusing) that occurs during the computation of unencrypted data.

Homomorphic Encryption (HE) [fsc-1978-he], an encryption scheme that enables computation between encrypted data, draws significant attention as a solution to this privacy problem. By adopting HE, service providers no longer need to decrypt the clients’ private data for computation. The concept of HE was first suggested in 1978 [fsc-1978-he]. However, the early proposals of HE were either unsafe [fsc-1978-he] or support only one type of HE operation, namely HE addition (HE Add) or HE multiplication (HE Mul) (e.g., ElGamal [ieee-1985-elgamal] and Paillier [ictact-1999-pascal]). In this aspect, it was difficult to put HE into serious applications for a while. However, fully HE (FHE) [stanford-2009-gentry] proposed in 2009 made a major breakthrough by supporting both HE Add and HE Mul. Moreover, FHE supports bootstrapping, a method of initializing noise in encrypted data, enabling an unbounded number of HE Add and Mul without decryption.

Among numerous FHE schemes to date [acm-toct-2014-bgv, acm-stc-2012-fly, iacr-2012-bfv, ictacis-2017-heaan, aictact-2015-fhew, jc-2018-tfhe], HE for Arithmetic of Approximate Numbers (HEAAN [ictacis-2017-heaan]), also known as CKKS (Cheon-Kim-Kim-Song) is rapidly gaining popularity [idash-2018] as it supports the approximate computation of real numbers. HEAAN enables HE Add and Mul of approximate data, where the result is almost the same as that of the original operation with a tiny error. Using HEAAN, computations can be performed without data decryption in the datacenter. However, the execution time for computation on encrypted data (ciphertext) increases by from 100s to 10,000s of times compared to that on native, unencrypted messages. Therefore, it is highly desired to reduce the computation time of HE operations to use HE practically.

There have been a large body of research on accelerating HE operations using FPGAs or GPUs. However, FPGA-based acceleration studies targeted the HE schemes (e.g., BGV [acm-toct-2014-bgv], LTV [acm-stc-2012-fly], and BFV [iacr-2012-bfv]) that only support computations of integer numbers [iwches-2015-ltv, ieee-2016-customaccelerator, ieee-2016-coprocessor, ieee-2018-hepcloud, ieee-2019-fpga], or operate with only limited parameter sizes [arxiv-2019-heax]

. They all target performing a small number of HE Mul without bootstrapping, inhibiting their applicability to a wide range of applications requiring hundreds to thousands of multiplication be performed (e.g., deep learning). GPU implementation studies

[iccis-2015-cuhe, gpgpu-2016-yashe, iacr-2018-fvgpu, ieee-2019-rnsvariants] do not take advantage of the algorithm’s internal parallelism sufficiently, operate on only small or limited parameters, or do not consider the cost of modulo operations.In this paper, we demystify HEAAN, a representative FHE scheme, by describing, analyzing, and optimizing it in a manner friendly to computer architects. We first explain the pertinent details of the encryption, decryption, and computation aspects of HEAAN, and identify that the following four functions take more than 95% of HE Mul, the most computationally expensive operation of HE: CRT (Chinese Remainder Theorem), NTT (Number Theoretic Transform), iNTT (inverse NTT), and iCRT (inverse CRT). we conduct an in-depth and disciplined analysis of the aforementioned primary functions to understand their computational complexity and access patterns on input, output, and precomputed data, which are critical for operation (e.g., modular multiplication) strength reduction, across a range of key HE parameters.

The parallelism exposed through the analysis is exploited to accelerate HE Mul on CPU and GPU, the most popular computing platforms, which are already equipped with hundreds to thousands of ALUs. In CPU, we utilize multiple cores (inter-core parallelism) and AVX-512 instructions supported by the latest Intel architectures (intra-core parallelism). In GPU, we utilize massive thread-level parallelism expressible through the CUDA programming model. We further improve performance by proposing a series of architecture-centric optimizations, such as matrix transposition to better exploit memory access locality, loop reordering to expose more parallelism, and taking a synergy between precomputation and delayed modulo operations; and we estimate how much more performance gains are attainable through natively support currently emulated instructions. We achieve 42.9 and 134.1 speedup of HE Mul on CPU and GPU, respectively, compared to the single-thread reference HEAAN [repo-2019-heaan], setting a new baseline for the future HE acceleration research.

## Ii Background: Computational Challenges of Homomorphic Encryption (HE)

HE can be categorized into two groups, somewhat HE (SHE) and fully HE (FHE), by whether there is a limitation on the number of arithmetic operations applicable to the ciphertext. In a HE scheme, noise is accumulated during each operation; this makes the ciphertext of a SHE scheme indecipherable after performing a certain number of operations. On the contrary, FHE schemes support a bootstrapping algorithm [aictact-2018-bootstrapping]

, which refreshes the accumulated noise. Therefore, although there is an upper bound in the number of arithmetic operations that can be consecutively applied to a ciphertext, by periodically bootstrapping it, we can continue manipulating the ciphertext with no need for decrypting it. This property makes FHE well-tailored to meet the demands of a wide range of general applications (e.g., training/inference of deep neural networks

[icml-2016-cryptonet, arxiv-2018-fastercryptonet, arxiv-2018-hcnn, wahc-2019-ngraph]), which require a massive number of operations applied to encrypted data.Representative FHE schemes include Brakerski-Gentry-Vaikuntanathan (BGV) [acm-toct-2014-bgv], Lopez-Alt, Tromer, and Vaikuntanathan (LTV) [acm-stc-2012-fly], Brakerski/Fan-Vercauteren (BFV) [iacr-2012-bfv], fast FHE over the torus (TFHE) [jc-2018-tfhe], and Cheon-Kim-Kim-Song (CKKS) [ictacis-2017-heaan]. Among these, only CKKS supports approximate computation on real numbers, and hence it is a top candidate for many real-world applications requiring a bunch of operations on data that can tolerate tiny errors due to approximate computation. CKKS is rapidly gaining popularity in a wide range of applications exploiting HE, such as machine learning [bmcmg-2018-logistic]. For example, the winner and the most runner-ups of a recent HE challenge about secure genome analysis competition (iDASH 2018 [idash-2018] and 2019 [idash-2019]) used CKKS or its hybrid versions. Therefore, we investigate HEAAN (HE for Arithmetic of Approximate Numbers) scheme [repo-2019-heaan], which is developed by the authors of CKKS.

HEAAN is able to perform arbitrary computation types by combining HE multiplication
(simply *HE Mul*) and HE addition (simply *HE Add*), which are
multiplication and addition on ciphertexts.
However, the execution time of HE operations increases significantly compared to the
corresponding ones on the original unencrypted messages.
Table I compares the execution time for addition/multiplication of
original messages and ciphertexts using a single core from the system specified in
Section VI.
We measure the average execution time of addition/multiplication of a complex
number in a message consisting of 32,768 complex numbers.
HE Mul (Add) is 36,112 (168) slower than native message
operation.
When the message consists of fewer numbers, the slowdown is even higher.
Considering that most approximate number operations consist of multiplication and addition,
the long execution time of HE operations is an obstacle to the practical use of HE.
Therefore,
it is essential to accelerate HE operations,
especially HE Mul as it is 448 slower than HE Add.

Operation type | Message | Ciphertext | Slowdown |

Addition | 2.1 ns | 348.2 ns | 168.2 |

Multiplication | 4.3 ns | 155883.8 ns | 36112.7 |

## Iii A Brief Introduction to HEAAN

Prior to accelerating the most compute-intensive HEAAN operation, HE Mul, we introduce the pertinent details of HEAAN, focusing on how to convert an input message to a ciphertext through encoding/encryption steps and how to perform arithmetic operations on ciphertext in HEAAN.

### Iii-a HEAAN encryption

HEAAN converts an input message to a ciphertext through encoding and encryption steps. An input message consists of complex numbers, each composed of a double-type real and imaginary number. The encoding step first converts an input message to a plaintext (t), a polynomial of at most degree (-1) with integer coefficients. t is placed in a cyclotomic polynomial ring () space with the magnitude of each coefficient bounded by ciphertext modulus, integer (). Therefore, each coefficient () is the residue number by , where is a BigInt (big integer) much larger than . The encoding step converts the floating-point numbers of an message to integer numbers after multiplying with a scaling factor () and then rounding down the remaining fraction numbers. Each coefficient is a log-bit BigInt, and .

Then, the encryption step converts a plaintext to a ciphertext consisting of a pair of polynomials c.ax and c.bx using a public key pair ( and ) as follows:

The public key pair is generated from a secret key (sk).

In the equations above, u is a polynomial of at most degree (-1), where each coefficient is either -1, 0, or 1 following a distribution described in [ictacis-2017-heaan]. and are also polynomials of at most degree (

-1) polynomials with random error values to ensure security, which follow a Gaussian distribution with a small standard deviation value (e.g.,

= in [ictacis-2017-heaan]).To extract the original message from a ciphertext, we first convert the ciphertext to the plaintext exploiting the following relationship between c.ax and c.bx:

Then, the plaintext t can be returned to the original message through decoding; in this case, the inverse of scaling factor () is multiplied to get the approximate values.

HEAAN limits the maximum size of the ciphertext modulus to a constant value . HEAAN chooses for , where is multiplicative depth, the number of consecutive HE Mul operations applicable to a ciphertext before it loses encrypted data, and is rescaling factor. The size of the message contained in a ciphertext increases exponentially as the ciphertext is multiplied repeatedly. To prevent the explosion of message size, HEAAN performs rescaling after each HE Mul by dividing the coefficients of the output ciphertext by . Then the size of , the ciphertext modulus, is adjusted to where log = log - log. Therefore, log of a ciphertext that is just encrypted starts at log, decreases by log every time an HE Mul is applied, and becomes 0 after experiencing HE Mul operations, losing data. When is fixed, more HE Mul can be applied to a ciphertext with a larger value.

log | Multiplicative depth () | required |

300 | 10 | |

600 | 20 | |

1,200 | 40 | |

2,400 | 80 |

To apply more HE Mul operations to a ciphertext, FHE re-initialize
the ciphertext through bootstrapping.
However, bootstrapping is a costly operation (reported to take in the order of
minutes [iacr-2018-faster]) because it consists of dozens of HE
Mul^{1}^{1}1This again reinforces the importance of accelerating HE
Mul. and shift operations.
To reduce the overhead of bootstrapping for practical use of HE, we should
use large values to increase .
Also, should increase as increases to guarantee a certain
level of security in HE (see Table II).
As larger and values require more computation and data storage
costs per HE Mul, it is not efficient to use too large ;
we discuss a further detail of this trade off in Section VIII.
In this paper, we use (, , , ) of (, 40, , ),
respectively, which are the default values the official HEAAN
repository [repo-2019-heaan] uses.

### Iii-B HEAAN computation

Arithmetic operations in HEAAN, HE Add and Mul, are through computation between the polynomials of operand ciphertexts. Here we assume that the two operand ciphertexts in HE operation have the same ciphertext modulus value .

HE Add computes an output ciphertext (c3) from two input ciphertexts (c1 and c2) through the following operations:

HE Add is relatively simple because it performs the element-wise addition of BigInt coefficients and then the modulo of for each output coefficient; means modulo . Here the modulo operation is implemented simply by subtracting when each output coefficient is larger than because the result of each addition is smaller than .

HE Mul is much more costly than HE Add because the former
requires multiplying BigInt coefficients by times.
In this paper, we assume that is the size of an integer data type
that a computer natively supports with high performance (e.g.,
= for 64-bit CPUs), which is often called a word size.
A log-bit BigInt is represented as
(=loglog) words (see Figure 1).
Then, one BigInt multiplication (simply mul) consists of log-bit
word mul and word addition (simply add) operations.
For example, because is 19 when using the representative
parameters (=, log=1,200, log=64) summarized in Table III,
361 64-bit mul and 721 64-bit add operations are required
besides carry propagation per BigInt mul.
As described above, polynomial mul requires (=4.3 billion)
BigInt mul, which requires at least 4.6 Tera 64-bit operations.
To reduce the complexity of this polynomial mul, HEAAN and other
HE schemes use *Chinese Remainder Theorem* (CRT [world-1996-crt])
and *Number Theoretic Transform* (NTT [mc-1965-ntt]).

Symbol | Description | Representative values |

Scaling factor is multiplied to the floating-point number of message to convert to integer number. | ||

Rescaling factor linearly reduces the size of messages that grow exponentially during computation. | ||

Multiplicative depth is the maximum number of possible mul of a ciphertext without bootstrapping. | 40 | |

The maximum ciphertext modulus is equal to the initial ciphertext modulus after encryption. | 1,200 | |

Ciphertext modulus starts from and divided by a rescaling factor at each mul. | 1,200, 1,170,, 0 | |

The number of coefficients of ciphertext polynomial. The degree of ciphertext polynomial is at most -1. | 65,536 | |

The number of messages. messages are encrypted in one ciphertext. | 32, 64,, 32,768 | |

Word size. It is machine-dependent ( for CPU, for GPU). | or | |

The number of limbs of . To represent -bit integer, limbs are required. | 1,200/64 19 | |

The number of prime numbers. We use prime numbers to represent big integer in RNS domain. | 2,400/58 42 | |

Product of prime numbers that are used to represent big integer. | ||

The number of limbs of . To represent -bit integer, limbs are used. |

CRT for reducing the complexity of BigInt mul: CRT states that for integers which are coprime with each other, the residue set = of any integer is unique. HEAAN exploits CRT by defining a set of integers , where each modulus is a prime number smaller than and = . Then, a log-bit BigInt number , which is the coefficient of the ciphertext polynomial, can be represented in the residue number system (RNS) by the set of remainders where =.

A key property of RNS is that for adding, subtracting, and multiplying numbers
represented in RNS, it is sufficient to perform the same modular operation on
each pair of residues (called a *congruence relation*).
That is, for a pair of log-bit BigInt numbers (, ) and their
corresponding RNS representations
(, ), the product of
and is represented by such that
=.
This relationship holds because we set not smaller than and the
product of two log-bit BigInt numbers are smaller than .

Therefore, a log-bit BigInt number is converted into log-bit
data (see Figure 1) in an RNS domain, where we call the
conversion *a CRT function* or simply CRT, and a BigInt
mul is changed into log-bit modular mul;
and hence, the time complexity per BigInt mul is changed from
to .
In general, (see Table III), so the
number of operations required for BigInt mul can be greatly
reduced by CRT.
However, multiplying two BigInt polynomials still has a
complexity of because polynomial mul requires
coefficient mul operations.

NTT for reducing the complexity of polynomial mul:

NTT is a discrete Fourier transform over a finite field (integer). It is well known that fast polynomial mul can be implemented by using Fast Fourier Transform (FFT)

[iants-2008-fftpoly]. Therefore, we can translate polynomial mul with complexity into element-wise mul with complexity by using fast NTT, a variant of FFT limited to integer values. Although fast NTT (or simply NTT) requires transformation cost with ) complexity, it is beneficial to use NTT when is large enough.The overall flow of HE Mul in HEAAN, including CRT and NTT, is depicted in Figure 2. It consists of 5 polynomial mul, where each polynomial mul performs (1) CRT, (2) NTT, and (3) element-wise modular mul, followed by (4) inverse fast NTT (iNTT) and (5) RNS-to-BigInt conversion (iCRT) to return to the polynomial with BigInt coefficients. We divide the whole process into region 1 and region 2; the former multiplies and adds input ciphertexts whereas the latter transforms , which is a region 1’s byproduct that is decrypted with , into a form that is decrypted with sk (called key-switching). The evaluation key (evk) used in region 2 consists of two polynomials (evk.ax and evk.bx) where each coefficient is log-bit (= 2log-bit) long, encrypting the square of sk multiplied by a constant Q. The key-switching procedure with evk is mandatory to decrypt correctly, as it removes the need for in decryption stage, which, without the key-switching, would be required after sequential multiplications.

In region 1, is configured to deal with log-bit BigInt, the intermediate result of polynomial mul between two input ciphertexts whose coefficient size is log-bit. By contrast, in region 2, is set larger to represent (log+ log)-bit BigInt because the coefficient size of evk is log-bit long. The shift operations in region 2 reduce the amount of error that was accumulated during mul. The following additions and subtractions between the results of polynomial mul produce the result of HE Mul (c3.ax and c3.bx); we can get an approximate value of mul between two original messages by decrypting the result using sk.

Figure 3 shows the execution time breakdown of HE
Mul using a single-threaded reference HEAAN in the system and configuration
described in Section VI.
CRT, NTT, iNTT, and iCRT account for
95.8% of the total execution time.
The remaining operations, such as element-wise modular mul,
account for only 4.2% of execution time.
The total execution time is 5,108 ms, which is about 36,000 slower than
the original message mul, as discussed in Section II.
*Therefore, accelerating HE Mul is essential for practical use of
HE, and it is necessary to accelerate CRT, NTT,
iNTT, and iCRT.*

## Iv An In-depth Analysis of Major Functions in HEAAN Multiplication

To accelerate the primary functions (CRT, NTT, iNTT, and iCRT) in HE Mul, we first conduct an in-depth analysis of how each function works. In the following descriptions, represents an by matrix used as input/output of a function, while represents a precomputed table of by matrix.

CRT (Algo. 1) takes representing -bit BigInt numbers and produces , the result of modulo operation on each BigInt with different primes . The operation consists of two stages: (1) computing matrix-matrix mul of with and (2) applying modulo operations to each output element.

We first explain how to perform a modulo operation on a BigInt with a modulus smaller than . A BigInt is expressed by -bit words, i.e. where . Then the modulo operation on the BigInt is as follows:

Here because and are independent of the input, HEAAN precomputes , for all and . Therefore, is performed by multiplying and .

We can exploit Shoup’s modular mul (Shoup’s ModMul [ieee-2019-fpga]) for the modulo operation in line 6 of Algo. 1. Shoup’s ModMul (Algo. 2) computes with 3 muls and a single correction step if a value () is known in advance. It replaces a costly division operation with relatively cheaper mul, comparison, and subtraction operations. We apply the algorithm for the modular mul on spanning up to 3 limbs (), using precomputed values on . The operations of CRT can be performed in parallel for each coefficient (total ) and each prime number (total ).

NTT implements Cooley-Tukey algorithm [mc-1965-ntt],
which recursively divides an -point FFT to -point FFTs
and combines their results (called *radix-k* FFT).
An exemplar radix-2 NTT in Algo. 3
takes a matrix as an input
and performs a butterfly algorithm butt.
It uses a precomputed table () of powers of the 2-th root of
unity for all prime numbers.
For each prime, butt (Algo. 4) is called
log times.
As butt requires modular mul, it also
uses Shoup’s ModMul as was done in CRT.

iNTT is slightly different from NTT. It has a different loop order, calls inverse butterfly (ibutt) instead of butt, deals with a different precomputed table (consisting of the inverse of powers of the primitive root of unity ()), and finally divides each element by . However, except for the last element-wise division by , iNTT is symmetric to NTT in terms of the number and the kind of operations. Both NTT and iNTT are completely parallelizable for each prime number.

CRT | NTT | iNTT | iCRT | |

Multiplication | - | - | ||

Modular mul | + | |||

ADC (add with carry) | - | - | ||

Add, Sub | - | - | ||

Computation complexity |

CRT | NTT & iNTT | iCRT | |

Input data | |||

Precomputed | |||

data |

iCRT converts the matrix , where each element is a remainder smaller than , back to -bit BigInts (see Algo. 5). It starts with (1) the Hadamard product between an input matrix and a precomputed table whose elements are modular inverses of for all , followed by an element-wise modular mul with each . Shoup’s ModMul can also be applied here for efficient modular mul. (2) Then, each output element of (1), a scalar value, is multiplied by a BigInt according to its and accumulated to a temporary BigInt . Here each is precomputed and stored in table , where . (3) Finally, a reduction of modulo and ( in Region 2) is performed.

## V Architecture-aware Optimizations to Maximize HE Mul performance on CPUs and GPUs

Previous HE studies [arxiv-2019-heax, ieee-2019-fpga] pursued proposing new hardware architectures (e.g., through FPGA implementation) for performance improvement. By contrast, we first improve the performance of HE by utilizing the most popular computation platforms, CPUs and GPUs, which are already equipped with hundreds to thousands of ALUs.

All the major functions (CRT, iCRT, NTT, and iNTT) of HE Mul of HEAAN have massive parallelism that can be exploited by CPUs and GPUs. All the residual numbers () can be computed in parallel on CRT. NTT and iNTT perform independent transformations and leverage the algorithmic optimization of FFT, where groups can be computed in parallel at each individual stage during FFT. Henceforth, we identify key challenges and solutions we devise in accelerating HE Mul on CPUs and GPUs.

### V-a Loop reordering to expose massive parallelism in iCRT

iCRT recombines the residual numbers into the integers of size log for each coefficient of the resulting ciphertext and hence it might be regarded that the degree of parallelism is smaller than CRT ( vs. ). However, we discover that the limited -degree parallelism can be expanded to -degree parallelism by reordering two loops in iCRT (line 6 and 7 in Algo. 5); the modified algorithm is shown in Algo. 6. After reordering, the sequence of original mul between a scalar and a BigInt now becomes a matrix-matrix mul between a matrix and a matrix (line 9 in Algo. 6). Then, iCRT should be modified such that the partial sum in the inner-most loop is accumulated into (double or triple words), rather than (BigInt), which is aggregated to at the end of the loop.

The range of modulus and determine whether should be a double word or a triple word; with our representative parameters where , using a double word is sufficient. With our loop reordering, the resulting matrix-matrix mul exposes a massive parallelism of degree in iCRT, providing abundant parallelization opportunities to contemporary hardware platforms.

### V-B Accelerating HE Mul on CPUs

The strategies a modern CPU takes to exploit parallelism from an application are twofold: populating (1) multiple cores and (2) ALUs supporting short-SIMD instructions within each core. For example, the Skylake-based Intel Xeon CPU we use has 24 cores per socket and each core support AVX-512 instructions, each executing eight 64-bit integer operations [micro-2017-skylake], which could lead to over 100 performance improvement compared to the baseline implementation not exploiting these parallelism. To achieve a near-maximum performance CPU provides, we exploit intra-core parallelism via utilizing AVX-512 instructions and inter-core parallelism via multi-threading. We carefully distribute operations to multiple threads and AVX-512 SIMD lanes per thread to minimize performance degradation by an inferior cache performance caused by poor data access patterns; in this, whether an input matrix follows a column- or a row-major order can vastly affect performance.

During CRT, a CPU thread takes responsibility of a portion of the coefficients (line 1 in Algo. 1), whereas each lane of an AVX-512 port performs operations on different prime numbers (line 2 in Algo. 1). During NTT and iNTT, a thread does its job on a portion of the prime numbers, (line 1 in Algo. 3), whereas each lane of AVX-512 computes a part of the coefficients (line 8 in Algo. 3). In case of iCRT, we take different approaches for the two iteration phases. During the first phase (line 1 to 3 in Algo. 5), each thread and a AVX-512 lane performs computation on a part of the coefficients. During the second iteration phase (line 4 to 11 in Algo. 6) each thread also computes on a part of the coefficients (line 4), but each lane of an AVX-512 port is on different , the positional index on the limbs of (line 6).

Although the reference HEAAN library [repo-2019-heaan] supports multi-threading and exploits the same parallelism type on NTT, iNTT, and iCRT with our work, (in terms of distributing works to threads; the reference HEAAN does not implements SIMD instructions) CRT takes a different strategy.

Emulating arithmetic operations: However, AVX-512 does not support parallel 64-bit mul and 64-bit ADC (addition with carry) yet. Therefore, 64-bit mul is emulated with four parallel 32-bit mul, five 64-bit add, and five 64-bit shift instructions. Also, one 64-bit compare and one additional 64-bit add are required to handle carry per addition. This emulation narrows the performance gap between the reference HEAAN and the AVX-512 implementation. To further improve performance under this constraint, we modify Shoup’s ModMul as follows.

The original Shoup’s ModMul algorithm requires three operations: one 64-bit mulhi operation to compute , and two 64-bit mullo operations to compute and , where is an estimation of a quotient, and 64-bit mulhi (mullo) returns the upper (lower) 64-bit of a mul result which is 128-bit long. A single 64-bit mulhi operation can be emulated with four 32-bit mul (, , , ), four 64-bit add, and five 64-bit shift operations. In this case, the estimated remainder can be either or , which lies in a range of .

However, one of the 32-bit mul () for emulating 64-bit mulhi is used only for computing a carry from low 64-bit of mul. We can remove this mul if the carry is ignored, and produce an approximated 64-bit mulhi () with only three mul instead of four. By applying this optimization, the estimated remainder lies in a range of . As the upper bound of a remainder grows to , one more correction step (conditional subtraction) is needed, but the number of more expensive operations are reduced: one 32-bit mul, two 64-bit add, and one 64-bit shift instruction.

Matrix transposition in iCRT:

SIMD instructions might lead to a poor cache utilization if they access a matrix storing elements in a row-major order by column direction (or vice versa), as this demands multiple cache lines at once. The resulting performance degradation is more prominent when the access stride is too large for a hardware prefetcher to be effective.

iCRT experiences this issue because the matrix-matrix mul (line 9 in Algo. 6) accesses the matrix storing elements in a row-major order by column direction. We implicitly transfer the matrix using the*scatter*instructions in AVX-512 to address this issue.

Reducing the size of to : We can remove the overhead of emulating 64-bit operations by using of instead of as AVX-512 naturally supports 32-bit mul and ADC. Using the smaller beta changes the number of instructions for mul and ADC to one, and reduces the number of instructions for modular mul to below the half. However, it also has shortcomings that should roughly be doubled to express numbers whose size is up to . also needs to be doubled because the upper bound of each prime number is . Larger and imply bigger precomputed tables and more iterations for CRT, NTT, iNTT, and iCRT. It also increases the number of operations other than mul, modular mul, and ADC. A preliminary implementation shows no improvement in performance by using the small for CPUs.

### V-C Accelerating HE Mul on GPUs

Modern GPUs such as Volta [nvidia-2017-volta] and Turing [nvidia-2018-turing] have as many integer (INT32) processing units as single-precision floating-point (FP32) processing units, resulting in the multiply-accumulate throughput for INT32 numbers the same as that for FP32 numbers. Such massive throughput of integer operations makes GPU an attractive candidate for accelerating HE operations. We identify the following key points in fully exploiting the performance potential of GPUs when accelerating HE operations.

Parallelization strategies:
CUDA programming model [nvidia-2019-cudac] for GPU has the
following hierarchical structure of threads: multiple GPU threads are
grouped to form a *thread block* and multiple thread blocks
comprise a *grid*.
A thread block is allocated to one of Streaming Multiprocessors (SMs).
The threads in a thread block share the resources (e.g., shared memory) of
the SM.
Each thread block in a grid is allocated to each SM in a round-robin fashion,
and the number of thread blocks (*grid dimension*) and the number of
threads in a block (*block dimension*) are configured
at each GPU kernel launch.

The basic parallelization strategy is to assign each independently computable output element in a function to a GPU thread. We launch threads for CRT so that each thread computes one output element, which is a residue. NTT and iNTT launch threads each, where one thread performs one butterfly operation (Algo. 4) for each butterfly step using a simple radix-2 iterative NTT algorithm [jsc-2014-harvey] that is also used for CPU implementation.

In case of iCRT, a naïve parallelism strategy uses -degree parallelism where one thread handles one output BigInt type coefficient. Prior studies [iccis-2015-cuhe, iacr-2018-fvgpu] took the same strategy. However, by changing the loop order as described above, we transform its core operation into a matrix-matrix mul operation, thereby taking advantage of -degree parallelism to maximize TLP. We can also exploit various optimization strategies developed for matrix-matrix mul in GPU (e.g., tiling principles in NVIDIA’s cutlass library [nvidia-2013-cutlass]).

64-bit emulation vs. 32-bit word:
As opposed to most CPUs that natively support 64-bit words, modern GPUs natively
support 32-bit words and emulate 64-bit integer operations.
To avoid the overhead of 64-bit emulation (whose throughput is more than an
order of magnitude lower than that of the 32-bit counterpart), prior studies
accelerating HE on GPU [iacr-2018-fvgpu, iccis-2015-cuhe] use 32-bit words
() and the prime numbers smaller than .
We also use 32-bit words and operations.
Another advantage of using 32-bit words on GPU is that the operations with carry-in and
carry-out can be used without emulation;
modern NVIDIA GPUs support carry operations (e.g., addc, subc, and madc
in assembly-like virtual ISA, PTX [nvidia-2019-ptx]
where the operations are called *extended-precision integer arithmetic instructions*).
The throughput of these instructions is the same as
the instructions without carry in recent GPU
architectures [nvidia-2017-volta, nvidia-2018-turing],
enabling efficient computation on large integers without emulation.

Different strategies for BigInt modulo in CRT:
A naïve BigInt modulo is done by repetitive log-bit shift, add, and modulo operation;
for example, cuHE [iccis-2015-cuhe], takes this approach.
By contrast, HEAAN accumulates the result of modulo on each
using a precomputed table in CRT.
In this case, the BigInt variable (line 3 in Algo. 1)
can span two or three words depending on and the size of each prime number.
In the CPU implementation, is guaranteed to span two words when using the
representative parameters specified in Table VI,
as an overflow does not happen for with prime numbers smaller
than .
To guarantee to be two-word long, we use as
a lower bound of a prime for AVX-512 implementation instead of
which is the default value of the reference HEAAN.
However in GPU with a 32-bit , with primes smaller than
,^{2}^{2}2We used as a lower bound of a prime when
. If the lower bound is excessively reduced (e.g., using
) to avoid any overflow, the overhead from a growing becomes
too high, negating the advantage of removing overflows.
only up to 4 () accumulation is allowed to guarantee that the
overflow does not happen, which is nearly impossible as is 90 or higher
when using the representative parameters.

To prevent the overflow, one might (1) use three-word with an additional ADC operation added in the inner-most loop to avoid expensive modulo operations, or (2) do modulo operations intermittently in the inner-most loop (e.g., for every 4 accumulation in our case), to ensure that spans only two words. We compare these strategies in Section VII.

Per-thread storage for accumulation in iCRT:
The baseline implementation of iCRT with N-degree parallelism
allocates a BigInt (line 8 in Algo. 5) as a long array, in a per-GPU-thread manner.
If is not carefully allocated to fast storage,
frequent cache thrashing might occur, leading to a significant performance degradation.
The latest NVIDIA GPUs [nvidia-2017-volta, nvidia-2018-turing] have
a variety of storage types including register, L1 cache, L2 cache,
device memory, and read-only constant memory.
In the algorithm of original iCRT (Algo. 5), is
stored in register memory,
so that it can be loaded quickly in a single cycle.
On the other hand, because is declared as a thread-local array
and also it is dynamically indexed in the algorithm (e.g., used as
where is a variable),
it is not stored in register, which is the fastest storage on GPU.
Instead, CUDA compiler stores it in global memory and caches into L1 and L2 (in CUDA programming
model this is called *local memory*).

However, heavily using local memory can lead to cache thrashing when the grid dimension and block dimension increase, leading to a number of threads competing for cache, degrading overall performance. In order to mitigate the cache miss penalty, we suggest two different optimizations when using parallelism in iCRT: (1) using fewer threads by simply reducing block dimension and grid dimension, using the grid-stride loop method [nvidia-2013-gridstride], or (2) pinning each array in L1 cache through allocating the array in shared memory; this is possible as the shared memory shares capacity with the L1 unified cache. (2) is similar to the implementation of CRT kernel in cuHE [iccis-2015-cuhe], which uses the shared memory for storing thread-local arrays. We compare the methods on cuHE’s iCRT kernel [iccis-2015-cuhe] which implements Algo. 5 with parallelism, along with loop reordering with parallelism which is explained in Section V-A.

High-radix NTT (iNTT): Radix-2 NTT is memory-bound; GPU reads and writes a large input (dozens of megabytes with typical and values specified in Table III, exceeding the size of L1 and L2 caches of GPU) by times. At each butt function in Algo. 3, a GPU thread reads two values of from the device memory and writes two output values back to the device memory.

Using high radix NTT (radix- with ) can mitigate the memory bandwidth bottleneck because each GPU thread reads and writes values within in the butt operation performing -point NTT. It changes the number of transferring from above to , reducing the number of main memory accesses needed for NTT. However, increasing the radix is not always beneficial because, as each thread takes more than two inputs, register pressure on each GPU thread increases. Using registers more than the register file size of a streaming multiprocessor causes register spilling to local memory, leading to performance degradation with additional data loads from the main memory. Given the constraint, we use an appropriate size of radix (radix-32) to alleviate the high pressure of main memory bandwidth and conduct a sensitivity study in Section VII.

Parameters | CPU (AVX-512) | GPU |

& | ||

& | & 19 | & 38 |

Prime number size | ||

(Region 1) | 42 (43) | 90 |

(Region 2) | 63 (64) | 134 |

## Vi Experimental Setup

We compared the performance of the reference HEAAN [repo-2019-heaan], our AVX-512 implementation with multi-threading, and GPU implementation. We used Intel Xeon CPU (Skylake-based Xeon Platinum 8160 operating at 2.1 GHz) and NVIDIA GPU (Turing Titan RTX operating at 1.35 GHz). The CPU system consists of 24 cores per socket and each core has two AVX-512 FMA units, achieving a peak 64-bit integer performance of 1.61 TOPS per socket. Each socket has six memory channels, each equipped with DDR4-2666 DRAM modules. We did not use HyperThreading; the number of cores utilized was the same as the number of CPU threads populated. The GPU system consists of 72 streaming multiprocessors (SMs), each with 64 CUDA cores, performing up to 4,608 32-bit integer operations per cycle. The size of each L1 unified cache and L2 shared cache in the GPU is 128 KB (per SM) and 6 MB, respectively. Even if the CPU system has two CPU sockets, we only used one socket to compare it with the GPU system with a single discrete GPU.

We tabulate the key parameters for HE Mul of HEAAN on CPU and GPU in Table VI. We measured the execution time of HE Mul, excluding time for memory operations, such as malloc, free, and data transfers from host to the device for GPU. We conducted each experiment 32 times and reported the average.

Execution time (ms) and [Relative speedup] | |||||||

Function | Ref-1 (baseline) | Ref-24 | AVX-MT-24 | GPU-CLH | |||

CRT | [15.8] | [36.6] | [156.1] | ||||

NTT | [21.0] | [177.1] | [416.5] | ||||

iNTT | [20.7] | [57.3] | [124.4] | ||||

iCRT | [16.3] | [46.8] | [109.6] | ||||

Extra | [3.0] | [5.8] | [34.8] | ||||

Total | [14.8] | [42.9] | [134.1] |

## Vii Evaluation

We evaluated the effectiveness of the proposed optimizations in accelerating HEAAN mul by comparing against the performance of the reference HEAAN (Ref). For CPU, our optimizations exploit both intra-core and inter-core parallelism. We compared the basic implementation utilizing AVX-512 instructions (AVX), the one with the modified Shoup’s ModMul on top of AVX (AVX-M), and the one transposing the matrix on top of AVX-M (AVX-MT). In the basic GPU implementation (GPU), we adopted radix-2 iterative NTT for NTT and iNTT. We modified the CRT kernel of cuHE [iccis-2015-cuhe], which only exploits -degree parallelism, to exploit -degree parallelism. Also, we used the iCRT kernel of cuHE. We compared GPU with the followings: the implementation optimizing CRT by using ADC instead of intermittently conducting modulo operations (GPU-C), the one adjusting the number of launching threads on top of GPU-C (GPU-CT), the one using shared memory to pin the arrays of each thread to L1 unified cache on top of GPU-C (GPU-CP), the one applying loop reordering (Algo. 6) to translate a majority of iCRT computation into matrix-matrix mul to use -degree parallelism on top of GPU-C (GPU-CL), and the one implemented with high-radix NTT and iNTT to reduce main memory accesses and utilize GPU’s computing power more efficiently on top of GPU-CL (GPU-CLH).

We made the following key observations.
*First, exploiting the massive parallelism supported by modern CPUs and GPUs gives
even more than 100 performance improvement in HE Mul.*
Table VII shows the execution time and the relative speedup
of the CPU and GPU implementations after applying a series of architecture-aware
optimizations.
AVX-MT and GPU-CLH, the implementations giving the best
performance for CPU and GPU, achieve 42.9 and 134.1
speedup, respectively, compared to the single-thread Ref.
GPU-CLH performs 4.3 and 2.3 better than AVX-MT
on CRT and iCRT thanks to more ALUs populated on GPU.
Also, by reducing the main memory accesses through increasing the radix,
GPU-CLH achieves 2.35 and 2.17 performance improvement
in NTT and iNTT, respectively, compared to AVX-MT’s implementation.

*Second, our CPU implementations are highly scalable across both intra-core
and inter-core dimensions.*
AVX is effective regardless of the number of CPU threads populated,
providing 2.0 and 2.7 performance gain over Ref
when a single and 24 threads are utilized, respectively (see Figure 4(a) and (b)).
Among the primary functions, NTT is the best in scalability, leading
to 7.7 speedup for AVX over Ref when 24 threads
are populated.
Overall, AVX experiences 19.4 speedup when the number of
populated threads increases from 1 to 24, exhibiting a better scalability
than Ref (14.8) because of the following reasons.
AVX and Ref exploit parallelism in different ways for
CRT as described in Section V.
In Ref, each thread operates on different prime numbers
(-degree parallelism) where is not large (e.g., 42 or 63), and
hence Ref is more susceptible to a load imbalance across threads.
By contrast, each thread operates on different coefficients (-degree
parallelism) in AVX, exhibiting better scalability.
For iCRT, data accesses for the matrix occur in column direction
during matrix-matrix mul, causing its performance memory-bound because
hardware prefetching becomes ineffective.
However, with 24 threads being utilized, hardware prefetching hits more
frequently because a thread might access the data in the adjacent columns
that are prefetched by other threads, leading to better performance.

The additional optimizations applied to the AVX-512 implementation are effective as well. Figure 4(c) shows the impact of these optimizations on each major function when 24 threads are used. In AVX-M, both NTT and iNTT are 10% faster than AVX because these functions compute modular mul frequently. iCRT experiences a 18% speedup in AVX-MT compared to AVX-M because the matrix transposition alleviates the memory-bound issue.

*Third, the performance of GPU reaches close to full potential through
our microarchitecture-aware optimizations.*
Figure 5(a) shows the execution time of HE Mul on various GPU
implementations compared to that of the reference HEAAN running on CPU
with 24 threads (Ref-24).
Even the baseline GPU implementation (GPU) outperforms Ref-24
by 1.52.
Most of the speedup comes from accelerating NTT and iNTT.
iCRT in GPU, whose implementation we adopt
from cuHE [iccis-2015-cuhe], performs poorly; it is even 1.43
slower than that in Ref-24 and takes 81.2% of total HE Mul
execution time.

To reduce the execution time of iCRT, we devised the following optimizations and compared their performance in Figure 5(c). By adjusting the number of launching threads, we reduced the degree of performance impact due to cache thrashing, achieving a speedup of 4.22 (GPU-CT) compared to that of GPU. Pinning thread-local arrays to L1 cache (GPU-CP) performs better than GPU-CT, reaching 6.79 of speedup. Finally, GPU-CL was the best among all the iCRT optimizations (9.58 of speedup), by effectively exploiting -degree parallelism through the loop reordering explained in Section V.

Time for CRT (ms) | Speedup | |

GPU | 1.00 | |

GPU-Mod1 | 0.89 | |

GPU-Mod2 | 1.43 | |

GPU-Mod4 | 1.98 | |

GPU-C | 3.64 |

For CRT, performing fewer modulo operations led to a better performance. Table VIII compared the execution time of CRT when using different strategies to convert a BigInt number to an array of residue numbers in an RNS domain. GPU does not precompute and performs a modulo operation on every limb (one limb for ) of the BigInt. GPU-Modx means applying modulo operation every iteration in the inner-most loop of line 5 in Algo. 1. Even if using the precomputed table, conducting a modulo operation per iteration (GPU-Mod1) performs even worse than GPU by 1.1. This is because both implementations compute the same number of modulo operations whereas GPU-Mod1 imposes more pressure on local memory utilization. Fewer modulo operations led to better performance; GPU-Mod4 performs 1.98 better than GPU. By letting the partial sum () span three words instead of two words utilizing ADC in every iteration, GPU-C performs best with the speedup of 3.64.

NTT | iNTT | |||

Time (ms) | Speedup | Time (ms) | Speedup | |

Radix-2 | 8.73 | 1.00 | 9.76 | 1.00 |

Radix-4 | 4.74 | 1.84 | 5.47 | 1.78 |

Radix-16 | 3.65 | 2.39 | 4.88 | 2.00 |

Radix-32 | 3.72 | 2.35 | 4.67 | 2.09 |

Figure 5(d) shows the performance improvement of NTT and iNTT by increasing their radix values. Because NTT and iNTT algorithms are mostly symmetrical, they experience a similar degree of performance improvement with 2.35 (NTT) and 2.09 (iNTT). iNTT improves slightly less because iNTT includes additional computations such as element-wise division by , which does not gain speedup from using higher radix. Table IX shows the execution time of NTT and iNTT when varying the radix values. As the radix grows from 2 to 16, the performance of NTT and iNTT increases, taking advantage of reducing the memory accesses and hence alleviating the memory bandwidth bottleneck. However, the speedup saturates when the radix rises from 16 to 32 because of the register spilling to local memory; when radix increases beyond 32, the performance deteriorates due to the higher register pressure.

## Viii Discussion

CRT | NTT | iNTT | iCRT | |

by emulation | 155M | 48M | 47M | 319M |

by single native instr. | 27M | 17M | 17M | 51M |

The impact of supporting the emulated operations natively on AVX-512: As described in Section V, the cost of emulating 64-bit mul, modular mul, and ADC operations is significant in AVX-512 implementations. If some future CPUs support these instructions natively, the execution time of HE Mul could be reduced substantially. Table X summarizes the number of AVX-512 instructions required to perform each major function by comparing the cases where each of 64-bit mul, modular mul, and ADC is supported by either emulation and by a single native instruction. CRT and iCRT require just 17.3% and 15.8% of AVX-512 instructions if a future CPU supports these instructions natively, not through emulation. NTT and iNTT require one third of instructions by the instruction extension. Because HE Mul is mostly computation-bound, the significant reduction in the number of instructions would lead to a similar degree of performance improvement.

Impact of on the characteristics of HE Mul: determines multiplicative depth ; a larger depth requires a bigger . However, must increase proportionally to log to ensure a certain level of security (see Table II). Also, , , and increase in proportion to log. Based on these relationships, the computational complexity of Table IV can be expressed in terms of log. The complexity of CRT and iCRT is whereas that of NTT and iNTT is . Figure 6 shows the estimated number of operations for HE Mul according to log. When is small (e.g., log=150), CRT, NTT, iNTT, and iCRT require a similar number of operations, but as increases, CRT and iCRT become more dominant. Overall, the total number of operations for HE Mul is proportional to . When an application requires a large number (e.g., billions) of HE Mul, using large amortizes the cost of the expensive bootstrapping operation. However, using too large is costly because the maximum number of messages () that can be multiplied together by a HE Mul is , where the complexity of a HE Mul is super-linear, . The value we mainly target is 1,200, which is large enough to amortize the cost of bootstrapping; other HE accelerators [arxiv-2019-heax, ieee-2019-fpga] focused on much smaller values. For example, [ieee-2019-fpga] used the value of 180 without considering bootstrapping.

Impact of on the characteristics of HE Mul: As described in Section III, rescaling, which decreases log by log, is performed after each HE Mul to prevent the amount of message information in the ciphertext from increasing exponentially. As HE Mul is repeated, decreases, so does and . In region 1 of HE Mul, decreases linearly with log because two log-bit BitInt numbers are multiplied per ciphertext coefficient. By contrast, in region 2, should be set to represent (log + log)-bit BigInt (to multiply over the evaluation key polynomial). has the same trend as . Figure 7 shows the computation amount of HE Mul according to the log in the AVX-512 configuration of Section VI calculated based on Table IV. As is proportional to (log + log) in region 2, the number of operations for HE Mul when log becomes 30 (the smallest number where no more HE Mul is applicable) is still 24% of that when log is 1,200. Also, iCRT is dominant regardless the size of .

CKKS type | Full-RNS | Non-RNS |

[icfcds-2017-seal, repo-2020-palisade, repo-2020-lattigo] | [ictacis-2017-heaan, repo-2020-helib] | |

Parameter ( and ) selection | Rigid | Flexible |

Multiplicative depth | Lower | Deeper |

HE operation speed | Faster | Slower |

Main function of HE Mul | NTT, iNTT, | CRT, iCRT, |

mod up/down | NTT, iNTT |

The applicability of optimization techniques to other libraries supporting CKKS: In addition to HEAAN, there are several other libraries [repo-2020-helib, icfcds-2017-seal, repo-2020-palisade, repo-2020-lattigo] that support CKKS. The libraries fall into one of the two groups: libraries supporting full-RNS and non-RNS CKKS, as shown in Table XI. Non-RNS types manage each coefficient of polynomials of ciphertext in a big integer domain. By contrast, full-RNS types let the coefficient stay in an RNS domain during the whole procedure of homomorphic operations, resulting in faster operations. However, using full-RNS is less flexible because there exist rigid limitations while choosing the rescaling factor and the modulus . Each prime composing an RNS representation of in full-RNS CKKS should be set to be close to the rescaling factor [icsac-2018-rnsheaan]. In non-RNS CKKS, by contrast, one can freely choose a rescaling factor independent of without considering the approximation error existing in full-RNS CKKS. Moreover, due to these parameter limitations, the multiplicative depth of full-RNS variants is lower than an non-RNS scheme for a given security bit and error bound.

For the libraries supporting CKKS other than HEAAN, our optimizations described in Section V are partially applicable. Other non-RNS CKKS libraries can apply our optimization techniques because they use the same primary functions as HEAAN. Also, although full-RNS variants do not require CRT and iCRT, the optimization techniques for NTT and iNTT can be applied. Also, we can partially apply our techniques in CRT to the functions that change the number of primes in the RNS domain (mod up/down in Table XI). mod up increases the number of primes of an RNS representation of a given big integer, whereas mod down decreases it with an additional division operation [icsac-2018-rnsheaan, icsac-2016-rnsbfv]. Both functions can take advantage of our optimizations as their core functions are similar to applying CRT right after iCRT.

## Ix Related Work

FPGA-based HE accelerators: There have been numerous studies [iwches-2015-ltv, ieee-2016-customaccelerator, ieee-2016-coprocessor, ieee-2018-hepcloud, ieee-2019-fpga, arxiv-2019-heax] to accelerate HE operations using FPGA. [iwches-2015-ltv, ieee-2016-customaccelerator, ieee-2016-coprocessor] accelerate LTV-based FHE schemes whereas [ieee-2018-hepcloud, ieee-2019-fpga] acclerate FV-based FHE schemes. However, LTV and FV schemes have a limitation in practical use because they cannot perform approximate computations. HEAX [arxiv-2019-heax] uses FPGA to accelerate Microsoft SEAL, which supports a full-RNS variant of the CKKS scheme; however, HEAX considers only small parameter sizes ( and ), and the full-RNS variant it targets is not as versatile as the original HEAAN we accelerate in this paper due to the limitations in choosing rescaling factors and prime numbers.

GPU libraries for HE: [iccis-2015-cuhe, gpgpu-2016-yashe, iacr-2018-fvgpu, ieee-2019-rnsvariants] propose to accelerate the HE operations using GPUs. However, they either do not take advantage of the algorithm’s internal parallelism sufficiently, operate on only small or limited parameters, or do not consider the cost of modulo operations in GPUs. Moreover, all the aforementioned studies did not conduct a rigorous analysis of computational complexity and data access patterns of HE Mul, making it hard to assess the effectiveness of the proposed accelerators compared to the CPU or GPU implementations applying architecture-aware optimizations.

## X Conclusion

We have demystified the key operations of HEAAN, a representative and popular FHE scheme, from a computer architect’s perspective. After identifying that multiplying the encrypted data (ciphertext) is the most computationally demanding, we accelerated the major functions of HE Mul (CRT, NTT, iNTT, and iCRT) on CPU and GPU. To accelerate the major functions on CPU, we populate multiple cores by using multi-threading (inter-core parallelism) and AVX-512 instructions (intra-core parallelism). We accelerate HE Mul on GPU by effectively exploiting massive thread-level parallelism. Moreover, based on the in-depth analysis of the major functions for HE Mul, we introduced a series of architecture-aware optimization techniques such as loop reordering and matrix transposition for iCRT and taking a synergy between precomputation and delayed modulo operations for CRT. Our accelerated HEAAN on CPU and GPU outperforms the reference single-thread HEAAN on CPU by 42.9 and 134.1, respectively, in HE Mul, setting a new baseline for HE acceleration studies targeting practical usage.