1 Introduction
The past decades have witnessed the increasingly wide deployment of data centers. With their excellent storage capacity and computational performance, they have assumed a fundamental role in almost every aspect of human life. Nowadays, it is common to outsource data or computation-intensive workloads to large data centers. However, for a server to work on data from a client, the encrypted client data must first be decrypted, which reveals the private data to a potentially untrusted cloud server. Thus, as a growing number of services move online, especially after the COVID-19 pandemic, privacy-preserving computation, which allows secure operation on encrypted client data, is expected to play a pivotal role in this industry in the future. In this work, two privacy-preserving schemes and their hardware acceleration are discussed.
1.1 Fully Homomorphic Encryption
Fully Homomorphic Encryption (FHE) [FHE1, FHE2, FHE3, FHE4, FHE5, FHE6, FHE7, FHE8, FHE9, FHE10, FHE11], which permits secure computation on encrypted data without decrypting it, has been studied extensively in recent years to enable privacy-preserving computation. To be precise, for a given function $f$, a homomorphic encryption scheme satisfies $f(\mathrm{Enc}(m)) = \mathrm{Enc}(f(m))$. If a scheme is homomorphic with respect to any function, it is characterized as fully homomorphic encryption.
The very first FHE scheme was not devised until 2009, when Gentry proposed a general FHE framework [FHE1, FHE2]. It has been proven that, in the Boolean circuit model of computation, if an encryption scheme is homomorphic with respect to its own decryption function followed by a universal logic gate, then it is homomorphic with respect to any function. The operation that fulfills this property by refreshing the noise level of the ciphertext after each operation, thus transforming a Leveled HE (LHE) scheme into an FHE scheme, is called bootstrapping or recryption. Based on this idea, Gentry also presented a concrete construction that takes around 30 minutes per bootstrapping operation [FHE3].
Following Gentry’s blueprint, various schemes have been proposed for better efficiency. Among them, the most well-known are BGV [FHE4], BFV [FHE5, FHE6], and CKKS [FHE7]. These second-generation schemes differ from Gentry’s approach in relying on the Ring Learning with Errors (RLWE) problem for its better-studied hardness analysis and for the efficiency obtained via SIMD-style computation [FHE8]. These schemes focus on evaluating polynomials homomorphically. Several open-source implementations [PALISADE, HElib, SEAL] can potentially reduce the recryption time to minutes, depending on the security parameters.
After the proposal of GSW [FHE9] in 2013, FHEW [FHE10] and TFHE [FHE11] were published as third-generation approaches. The third-generation schemes focus on efficient homomorphic Boolean logic evaluation. Although performance-wise the third generation may not be superior to earlier schemes (since the amortized cost of the SIMD-style second generation is estimated to be within the same order of magnitude as that of the third generation), the third generation is well accepted for its simplicity and flexibility in terms of both concept and implementation. Typically, to achieve a 128-bit security level, the degree of the polynomial is less than 2048, which is much smaller than the ~10,000 in the second generation. Further, the ciphertext modulus is less than 64 bits, compared to ~200 bits in the second generation. The reported recryption time of the third generation is generally around 0.1 s to 1 s [FHEWlike].
Although several improvements have been presented to increase the efficiency of FHE, secure computation on encrypted data is still many orders of magnitude slower than direct computation on plaintext. The required compute power is prohibitive for current computing systems, which hinders the practical application of FHE. For example, bootstrapping one homomorphic NAND gate takes a fraction of a second for third-generation schemes [FHE10, FHE11], compared to picoseconds for a CMOS NAND gate. This still renders FHE largely impractical. Thus, various hardware solutions have been proposed in recent years [FHEHW1, FHEHW2, FHEHW3, FHEHW4, FHEHW5, FHEHW6, FHEHW7, FHEHW8, FHEHW9, FHEHW10, FHEHW11]. References [FHEHW1, FHEHW3] focused on encryption/decryption of RLWE in a post-quantum scenario, which is less computationally demanding due to the smaller size of the polynomial. Reference [FHEHW2] presented a crypto-engine for the encryption/decryption of RLWE for homomorphic encryption, which is less demanding than homomorphic evaluation. The authors in [FHEHW10] explored acceleration for large-number multiplication, while [FHEHW8, FHEHW9] discussed approaches to accelerate long polynomial multiplications in homomorphic encryption. Other works [FHEHW5, FHEHW6, FHEHW7, FHEHW11] implemented accelerators for LHE schemes based on the BFV scheme, with limited computation depth and security levels. Finally, an architecture for the CKKS scheme was proposed in [FHEHW4]. To date, there has been no hardware acceleration published for third-generation FHE schemes.
1.2 Private Set Intersection
Private Set Intersection (PSI) is another security primitive that preserves the privacy of an operation. It allows two parties (Sender and Receiver) to compute the intersection of their private sets without leaking any information other than the intersection itself. Its applications include private human genome testing, contact list discovery in social media apps, and conversion-rate measurement for online advertising. Recently, Microsoft introduced Password Monitor in the latest release of the Edge web browser, which compares a user’s private passwords saved to Edge with a known database of leaked passwords to determine whether any of the user’s passwords have leaked. With the underlying PSI protocol, the server that facilitates the comparison learns nothing about the user’s passwords.
The PSI problem has been explored extensively in search of efficient protocols [PSI1, PSI2, PSI3, PSI4, PSI5]. However, in an unbalanced scenario where one set is significantly smaller than the other, the cost of these protocols scales linearly with the size of the larger set. In recent years, unbalanced PSI protocols based on second-generation FHE [PSI6, PSI7] were proposed that provide a significant reduction in communication overhead compared to previous approaches while maintaining comparable performance. However, they still suffer from the large encryption parameters of second-generation FHE. Because third-generation FHE can perform Boolean logic more efficiently, it is a natural candidate for performing the comparison in PSI. However, this has not been explored before.
1.3 Our Contributions
This paper presents the following contributions:

We present the first accelerator architecture for third-generation FHE, targeting the operation (defined in the following section), which is a fundamental function of both second-generation and third-generation FHE. By exploiting the asymmetric nature of the operation, the architecture is capable of maintaining high throughput with less resource usage while addressing different parameter sets. An extensive analysis of the architecture is included.

We propose a novel unbalanced PSI protocol that is based on third-generation FHE and is demonstrated with the proposed hardware. The proposed PSI protocol makes the computation cost independent of the Sender’s set size. The core block of the PSI that facilitates the cross comparison in [PSI6] is replaced with a homomorphic lookup table (LUT) implemented with third-generation FHE. Unlike the multiplication used in [PSI6], which returns a nonzero value when the cross comparison misses and potentially leaks the content of the Sender’s set, the LUT only returns one bit indicating whether an element is inside the Sender’s set, thus avoiding sending any excess information about the Sender’s set. Therefore, the noise flooding process adopted in [PSI6] is not necessary. We introduce several additional algorithm-architecture co-optimizations to reduce the computation and communication costs, making the proposed PSI protocol practical.

A prototype of the proposed architecture is implemented on the AWS cloud FPGA service. We develop all necessary high-level functions in C++ and benchmark the implemented architecture with different parameter sets. We make the SystemVerilog HDL code of the proposed accelerator and the supporting software code publicly available at [MYREPO].

We quantify and analyze the performance of the proposed hardware accelerator and PSI protocol. The measurements show over performance improvement compared to a software implementation for various subroutines of the third-generation FHE and the proposed PSI.
2 Preliminaries
2.1 Notation
Throughout the paper, boldface lowercase letters are used to denote vectors or polynomials depending on the context, and boldface uppercase letters are used for matrices. The set of integers is denoted by $\mathbb{Z}$, and the quotient ring of integers modulo $q$ is denoted by $\mathbb{Z}_q$. The polynomial ring is denoted by $R = \mathbb{Z}[X]/(X^N + 1)$, where $N$ is a power of two, and $R_q = R/qR$ represents the residue ring of $R$ modulo an integer $q$. “” denotes the scalar multiplication with either another scalar or a vector/polynomial. “” denotes the vector inner product or polynomial product depending on the context, while “” denotes the outer product or element-wise product of a polynomial. Lastly, “” represents the product of an RLWE ciphertext and an RGSW ciphertext, which will be detailed in the next section.
2.2 Lattice-based Cryptography: LWE, RLWE and RGSW
Almost all FHE schemes published so far are built upon the LWE and/or RLWE problem, which can be reduced to lattice problems that are believed to be intractable in polynomial time even for quantum computers [LATTICE, IDEALATTICE].
2.2.1 Learning with Errors Encryption
In practice, given a plaintext modulus and a ciphertext modulus , an LWE encryption of a plaintext with secret vector is defined as:
(1) 
with the vector , of dimension , sampled uniformly from the integer vector space and error sampled from an error distribution [LATTICE]. As long as , the plaintext can be successfully recovered by , which rounds off the noise.
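To make the LWE encryption and its rounding-based decryption concrete, the following is a minimal sketch under one common convention, in which the second ciphertext component is the inner product of the random vector and the secret, plus the error, plus the message scaled by q/t. The function names, the binary secret, the bounded uniform error (instead of a proper error distribution), and the toy parameters are all assumptions made for illustration and offer no security.

```python
import random

def lwe_keygen(n, rng):
    # Binary secret vector (an illustrative choice, not taken from the paper).
    return [rng.randrange(2) for _ in range(n)]

def lwe_encrypt(m, s, q, t, rng, noise_bound=4):
    # b = <a, s> + e + (q/t) * m  (mod q); assumes t divides q.
    a = [rng.randrange(q) for _ in s]
    e = rng.randint(-noise_bound, noise_bound)
    b = (sum(ai * si for ai, si in zip(a, s)) + e + (q // t) * m) % q
    return a, b

def lwe_decrypt(ct, s, q, t):
    a, b = ct
    phase = (b - sum(ai * si for ai, si in zip(a, s))) % q
    # Round (t/q) * phase to the nearest integer, then reduce mod t.
    return ((phase * t + q // 2) // q) % t

def lwe_add(ct1, ct2, q):
    # Component-wise addition adds the messages (and the noise terms).
    (a1, b1), (a2, b2) = ct1, ct2
    return [(x + y) % q for x, y in zip(a1, a2)], (b1 + b2) % q

rng = random.Random(0)
q, t, n = 1024, 4, 16
s = lwe_keygen(n, rng)
c1, c2 = lwe_encrypt(1, s, q, t, rng), lwe_encrypt(2, s, q, t, rng)
print(lwe_decrypt(lwe_add(c1, c2, q), s, q, t))
```

With these toy parameters the accumulated error stays far below q/(2t), so the rounding step removes it and the sum of the two messages is recovered.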
2.2.2 Ring Learning with Errors Encryption
However, additive homomorphism alone is not enough to construct the bootstrap function. RLWE, which can additionally support homomorphic multiplication, is therefore also incorporated in third-generation FHE. Similar to the definition of LWE, given a plaintext modulus and a ciphertext modulus , an RLWE encryption of a plaintext polynomial with secret polynomial is defined as follows:
(2) 
with the polynomial sampled uniformly from the ring , and , a noise polynomial, sampled from an error distribution [IDEALATTICE]. As long as , the plaintext can be successfully recovered by , which rounds off the noise. In some contexts, the scale is omitted for clarity.
Since RLWE is a special form of LWE, the coefficients of the polynomial of an RLWE ciphertext can be converted into multiple separate LWE ciphertexts under the same secret key with some transformation of polynomial , which is detailed in Appendix A.
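The conversion from an RLWE ciphertext to per-coefficient LWE ciphertexts can be sketched as follows. Appendix A is not reproduced here, so this is the standard sample-extraction rearrangement under an assumed convention (second component equal to the product of the random polynomial and the secret, plus error, plus the scaled message, in the ring modulo X^N + 1). All names and toy parameters are hypothetical.

```python
import random

def negacyclic_mul(a, b, q):
    # Schoolbook product in Z_q[X]/(X^N + 1): wrapped terms change sign.
    n = len(a)
    out = [0] * n
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            k = i + j
            if k < n:
                out[k] = (out[k] + ai * bj) % q
            else:
                out[k - n] = (out[k - n] - ai * bj) % q
    return out

def rlwe_encrypt(m, s, q, t, rng, noise_bound=2):
    n = len(s)
    a = [rng.randrange(q) for _ in range(n)]
    e = [rng.randint(-noise_bound, noise_bound) for _ in range(n)]
    a_s = negacyclic_mul(a, s, q)
    b = [(a_s[i] + e[i] + (q // t) * m[i]) % q for i in range(n)]
    return a, b

def extract_lwe(ct, i, q):
    # Rearrange a(X) so that <a_lwe, s> equals coefficient i of a(X) * s(X).
    a, b = ct
    n = len(a)
    a_lwe = [a[i - j] if j <= i else (-a[n + i - j]) % q for j in range(n)]
    return a_lwe, b[i]

def lwe_decrypt(ct, s, q, t):
    a, b = ct
    phase = (b - sum(x * y for x, y in zip(a, s))) % q
    return ((phase * t + q // 2) // q) % t

rng = random.Random(0)
n, q, t = 8, 1024, 4
s = [rng.randrange(2) for _ in range(n)]
m = [rng.randrange(t) for _ in range(n)]
ct = rlwe_encrypt(m, s, q, t, rng)
print([lwe_decrypt(extract_lwe(ct, i, q), s, q, t) for i in range(n)] == m)
```

Each extracted LWE ciphertext uses the coefficient vector of the secret polynomial as its key, which matches the statement that all extracted ciphertexts share the same secret key.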
2.2.3 NTT and INTT
The polynomial multiplication of RLWE can be efficiently computed with the Number Theoretic Transform (NTT). NTT is an adaptation of the well-known Fast Fourier Transform (FFT) algorithm to finite fields, which reduces the complexity of polynomial multiplication from $O(N^2)$ to $O(N \log N)$. However, to perform polynomial multiplication modulo $X^N + 1$, negacyclic (anticyclic) convolution is adopted [NTTTRICK]. The optimized NTT/INTT algorithms summarized in [NTTTRICK] are adopted and implemented in hardware in this work. The algorithms are given in Appendix E and Appendix D, respectively.
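As an illustration of negacyclic multiplication via NTT, the sketch below uses a textbook recursive transform with an explicit pre- and post-twist by powers of a primitive 2N-th root of unity. It is not the merged-twiddle algorithm of [NTTTRICK] or the appendices; the toy parameters N = 8, q = 17, and psi = 3 are assumptions chosen so that psi^N = -1 mod q.

```python
def ntt(a, omega, q):
    # Recursive radix-2 transform; omega is a primitive len(a)-th root of unity mod q.
    n = len(a)
    if n == 1:
        return a[:]
    even = ntt(a[0::2], omega * omega % q, q)
    odd = ntt(a[1::2], omega * omega % q, q)
    out = [0] * n
    w = 1
    for k in range(n // 2):
        t = w * odd[k] % q
        out[k] = (even[k] + t) % q
        out[k + n // 2] = (even[k] - t) % q
        w = w * omega % q
    return out

def negacyclic_mul_ntt(a, b, psi, q):
    # psi is a primitive 2N-th root of unity mod q (so psi^N = -1).
    n = len(a)
    omega = psi * psi % q
    a_t = [x * pow(psi, i, q) % q for i, x in enumerate(a)]  # pre-twist
    b_t = [x * pow(psi, i, q) % q for i, x in enumerate(b)]
    c_hat = [x * y % q for x, y in zip(ntt(a_t, omega, q), ntt(b_t, omega, q))]
    c_t = ntt(c_hat, pow(omega, -1, q), q)  # inverse transform, unscaled
    n_inv, psi_inv = pow(n, -1, q), pow(psi, -1, q)
    return [x * n_inv % q * pow(psi_inv, i, q) % q for i, x in enumerate(c_t)]

def negacyclic_mul_schoolbook(a, b, q):
    # O(N^2) reference used to check the transform-based result.
    n = len(a)
    out = [0] * n
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            k, sign = (i + j, 1) if i + j < n else (i + j - n, -1)
            out[k] = (out[k] + sign * ai * bj) % q
    return out

a = [1, 2, 3, 4, 5, 6, 7, 8]
b = [8, 7, 6, 5, 4, 3, 2, 1]
print(negacyclic_mul_ntt(a, b, 3, 17) == negacyclic_mul_schoolbook(a, b, 17))
```

The twist turns the cyclic convolution computed by the plain transform into the negacyclic one required for reduction modulo X^N + 1.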
2.2.4 Ring GSW Encryption
Lastly, Ring-GSW (RGSW) encryption is widely adopted in third-generation FHE to facilitate homomorphic polynomial multiplication [FHE10, FHE11]. It is defined as a matrix of RLWEs (in some literature, the two columns are concatenated as a one-dimensional vector):
(3) 
with defined as a vector of RLWEs:
(4) 
where is a predefined decomposition base and denotes the length of vector .
The multiplication of an and an is defined in Equation 5, with the two polynomials of the RLWE being decomposed by the base into two vectors of polynomials, and , that satisfy and . Further, the product of a polynomial and an RLWE ciphertext is defined as . The operator is used extensively in the bootstrap process, and is the main focus of our hardware implementation.
(5) 
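The decomposition step used by the RLWE-RGSW product can be sketched as below, assuming an unsigned digit decomposition with a power-of-two base so that each digit is obtained by a shift and a mask. The names `log_base` and `length` are hypothetical stand-ins for the decomposition base and vector length of the text.

```python
def decompose(poly, log_base, length):
    # Split every coefficient into `length` digits of `log_base` bits each.
    mask = (1 << log_base) - 1
    return [[(c >> (i * log_base)) & mask for c in poly] for i in range(length)]

def recompose(digits, log_base, q):
    # Weighted sum of the digit polynomials with powers of the base.
    n = len(digits[0])
    return [sum(d[k] << (i * log_base) for i, d in enumerate(digits)) % q
            for k in range(n)]

q, log_base, length = 1 << 12, 4, 3
poly = [0, 1, 4095, 2748]
digits = decompose(poly, log_base, length)
print(recompose(digits, log_base, q) == poly)
```

Because every digit is smaller than the base, multiplying the digit polynomials with the rows of an RGSW ciphertext keeps the noise growth bounded by the base rather than by the full modulus.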
2.3 Bootstrap in Third-Generation FHE
As shown in Section 2.2, LWE is additively homomorphic, meaning that , so it is used to homomorphically evaluate Boolean logic in third-generation FHE. Take for example, with being either or . The result of the NAND can be extracted from the sum . If the sum is or , then . Otherwise, if the sum is , then . Thus, the NAND is encoded in the MSB of the sum. However, further addition cannot be applied to the resulting LWE ciphertext due to both the mismatch of the data format (LSB vs. MSB) and the increased noise, which can potentially contaminate the message. Therefore, the bootstrap process introduced in Section 1.1 is required to reset the data format and noise level.
The bootstrap process of FHEW [FHE10] is implemented in this work because its integer arithmetic is better suited to hardware acceleration. It is composed of three steps: homomorphic accumulation, RLWE-to-LWE key switch, and LWE modulus switch. The homomorphic accumulation takes of the processing time [FHEWlike]; therefore, this subroutine is deployed on the hardware, while the others are done in software and are not discussed here. The reader is referred to [FHEWlike] for further details.
Figure 1 illustrates the data flow of homomorphic accumulation, with the operation highlighted in the dotted red box. First, the bootstrap key (BT key, an array of RGSW ciphertexts) is generated by the local user and transferred to the server. This is a one-time process. For the server to bootstrap one LWE ciphertext, a homomorphic accumulator is initialized based on of the LWE, in the INTT domain. Then, the accumulator is multiplied with an element of the BT key by the operation. The element of the BT key is indexed by of the LWE and . The product is accumulated and looped back for the next multiplication. After the loop finishes, the output is passed through the RLWE-to-LWE key switch function and the LWE modulus switch function (not shown in Figure 1) to complete the whole bootstrap process.
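The arithmetic behind reading a gate output from the sum of two encoded bits can be sketched at the plaintext level, without any encryption. The sketch assumes a plaintext modulus of 4, messages scaled by q/4, and small additive noise terms; it reads the AND of the inputs from the MSB of the two-bit sum and complements it to obtain the NAND, which is one possible encoding and not necessarily the exact one used in FHEW.

```python
def encode(m, q, t=4):
    # Scale a message in [0, t) into the ciphertext modulus.
    return (q // t) * m % q

def decode(v, q, t=4):
    # Round (t/q) * v to the nearest integer, then reduce mod t.
    return ((v * t + q // 2) // q) % t

def nand_from_noisy_sum(m1, m2, e1, e2, q):
    total = (encode(m1, q) + encode(m2, q) + e1 + e2) % q
    s = decode(total, q)   # m1 + m2, a value in {0, 1, 2}
    and_bit = s >> 1       # MSB of the 2-bit sum is set only when both inputs are 1
    return 1 - and_bit

q = 512
for m1 in (0, 1):
    for m2 in (0, 1):
        print(m1, m2, nand_from_noisy_sum(m1, m2, 5, -7, q))
```

The gate result sits in a different bit position than the inputs and carries the sum of both noise terms, which is the format and noise mismatch that bootstrapping resets.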
2.4 Augmented Subroutines
To build the proposed PSI protocol, we adopt some additional features from another third-generation scheme, TFHE [FHE11].
2.4.1
The operation between an RLWE and an RGSW, defined in Section 2.2.4, can be used to construct a homomorphic MUX gate [FHE11]. Let be the selection signal of the MUX gate with equal to either or , and let and be two input RLWE ciphertexts for the MUX. The CMUX function is defined in Equation 6, which outputs the RLWE ciphertext selected by the encrypted selection signal.
(6) 
2.4.2
Following the definition of the , is formulated to homomorphically rotate an encrypted polynomial by multiplying the polynomial with a power of . A simplified version is shown in Equation 7. Let be the selection signal of the CMUX gate, and let be the input RLWE ciphertext. Parameter denotes the number of steps for the rotation. Thus, the output RLWE encrypts a plaintext that is either rotated or not based on the selection. A comprehensive definition can be found in [FHE11].
(7) 
2.4.3 Homomorphic LUT and Plaintext Packing
Intuitively, the CMUX gate can be concatenated into a CMUX tree to evaluate an arbitrary binary function homomorphically as shown in Figure 2 (a) [FHE11]. The function is precomputed and encrypted into an LUT of RLWE ciphertexts, and after traversing the CMUX tree indexed by RGSW encryptions of the binary representation of an input , an RLWE ciphertext that encrypts the corresponding is output.
However, the size of the LUT is large if each RLWE ciphertext only encrypts one function value, resulting in RLWE ciphertexts. Also, the number of CMUXs is . This exponential size can be reduced by a factor of the length of the polynomial in an RLWE ciphertext if several function values are packed into an RLWE ciphertext. For example, if each coefficient of a plaintext polynomial is taken as a plaintext slot, then a contiguous block of function values can be packed into one polynomial, such as , where is the length of the polynomial. Thus, an RLWE ciphertext can encrypt at most function values, which reduces the size of the LUT and the number of CMUXs by a factor of . Figure 2 (b) details this packing scheme using an example in which each RLWE encrypts two function values, reducing the number of CMUXs by a factor of . In the example, the MSBs of the input are first used to find the desired RLWE ciphertext, and then the target slot is rotated to the position dictated by the LSB of with . Lastly, the desired slot is extracted from the RLWE into an LWE ciphertext as described in Section 2.2.2.
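The packed lookup can be modeled on plaintexts to check the indexing logic: selection bits are plain integers instead of RGSW ciphertexts and table entries are plain coefficient lists instead of RLWE ciphertexts. The function names and the choice of rotating the target slot to index 0 are assumptions for illustration.

```python
def cmux(sel, d0, d1):
    # Plaintext view of CMUX: d0 + sel * (d1 - d0), coefficient by coefficient.
    return [x0 + sel * (x1 - x0) for x0, x1 in zip(d0, d1)]

def rotate(poly, k):
    # Multiply by X^(-k) in Z[X]/(X^N + 1): coefficient k moves to index 0.
    n = len(poly)
    out = [0] * n
    for i, c in enumerate(poly):
        j = i - k
        if j >= 0:
            out[j] = c
        else:
            out[j + n] = -c  # wrapped terms pick up a sign
    return out

def packed_lookup(values, x, width, n):
    # values: 2**width function values; n slots (a power of two) per polynomial.
    slot_bits = n.bit_length() - 1
    bits = [(x >> i) & 1 for i in range(width)]  # LSB first
    level = [values[i:i + n] for i in range(0, len(values), n)]
    for b in bits[slot_bits:]:                   # CMUX tree over the MSBs
        level = [cmux(b, level[2 * j], level[2 * j + 1])
                 for j in range(len(level) // 2)]
    acc = level[0]
    for i, b in enumerate(bits[:slot_bits]):     # conditional rotations on the LSBs
        acc = cmux(b, acc, rotate(acc, 1 << i))
    return acc[0]

values = [(x * x + 1) % 5 for x in range(32)]
print(all(packed_lookup(values, x, 5, 4) == values[x] for x in range(32)))
```

With n slots per polynomial, the tree starts from 2^width / n polynomials instead of 2^width, and only log2(n) extra conditional rotations are needed.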
2.4.4 RLWE Key Switch
The last included subroutine is the RLWE key switch that converts an RLWE ciphertext encrypted under a secret key into another RLWE ciphertext encrypted by a different secret key . Given a decomposition base , an RLWE key-switch key (KS key) is created by encrypting the secret key into a vector of RLWE ciphertexts, as shown in Equation 8, with denoting the length of the vector.
(8) 
For an RLWE ciphertext encrypted with key , to switch to key , the new ciphertext is calculated by Equation 9, which is essentially an inner product of the decomposed and the key-switch key, with . The multiplication of a polynomial and an RLWE ciphertext is defined in Section 2.2.4. A formal definition of the process can be found in [FHE4, FHE5, FHE6]. Note that this operation resembles the operation.
(9) 
3 Unbalanced PSI with (Leveled) Augmented FHEW
3.1 High Level Construction
For two parties, the Receiver and the Sender, to find the intersection of their private sets and (w.l.o.g., assuming each contains some 32-bit integers), as shown in Figure 3 (a), each element of the Receiver’s set is compared with the elements of the Sender’s set. In the case of a match, the element is added to the intersection.
However, in an unencrypted scenario, one of the parties needs to reveal all its content to the other party, which is undesirable. So, in [PSI6], the comparison is fulfilled by a homomorphic product of the difference between elements in the two sets. For each RLWE encrypted , the Sender evaluates homomorphically the product of the difference , as shown in Equation 10. After the Receiver decrypts the result, the product evaluates to if finds a match in the Sender’s set .
(10) 
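The comparison of [PSI6] can be checked on plaintexts with a few lines of code: the product of differences vanishes exactly when the Receiver's element matches some Sender element. The sketch assumes a prime plaintext modulus larger than every element; the modulus value and the function name are hypothetical.

```python
def match_product(x, sender_set, t):
    # Product of (x - y) over all Sender elements y, reduced mod a prime t.
    r = 1
    for y in sender_set:
        r = r * (x - y) % t
    return r

t = 65537                       # prime modulus (an illustrative choice)
sender = [3, 17, 40000]
print(match_product(17, sender, t))        # 17 is in the Sender's set
print(match_product(5, sender, t) != 0)    # 5 is not
```

On a miss, the nonzero result is a function of the Sender's elements, which is the leakage that noise flooding is meant to hide in [PSI6].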
In this work, the comparison is facilitated with the homomorphic LUT described in Section 2.4. As shown in Figure 3 (b), on the Sender’s side, an LUT is precomputed based on the content of the Sender’s set , with ; otherwise, the entry is set to . On the Receiver’s side, each element is decomposed into its binary representation, encrypted as a vector of RGSW ciphertexts, and sent to the Sender. Then, the RGSW-encrypted is passed into the CMUX tree to index the LUT on the Sender’s side, and the result is sent back to the Receiver. After decryption, indicates that is in the intersection; otherwise, it is not. Since the proposed protocol follows the high-level construction of [PSI6], the attack model, security implications, and proof in [PSI6] also apply to this work, with the exception that the noise flooding process is unnecessary because the LUT only returns whether is in set and therefore no excess information about set is leaked.
Let denote the bit width of the elements inside both sets. The communication complexity is linearly dependent on and the size of the Receiver’s set, resulting in . The computation cost is , which is independent of the size of the Sender’s set. However, this naïve construction is very inefficient in both computation and communication. For example, if , 32 RGSW ciphertexts have to be transferred for each element in the Receiver’s set, resulting in a low ciphertext utilization. Additionally, CMUXs are evaluated for each element in the Receiver’s set. Several optimizations can be adopted to mitigate these problems and make the protocol practical.
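A plaintext model of the LUT-based comparison is sketched below. It assumes that a table entry is 1 when its index is in the Sender's set and 0 otherwise, and it counts the two-way selections that would each become one CMUX in the encrypted protocol. The names and the small bit width are hypothetical.

```python
def build_lut(sender_set, width):
    # One entry per possible element value: 1 marks membership (assumed encoding).
    lut = [0] * (1 << width)
    for y in sender_set:
        lut[y] = 1
    return lut

def lookup(lut, x, width):
    # Traverse the selection tree with the bits of x, LSB first.
    level, cmux_count = lut, 0
    for i in range(width):
        bit = (x >> i) & 1
        level = [level[2 * j + bit] for j in range(len(level) // 2)]
        cmux_count += len(level)  # one selection per surviving entry
    return level[0], cmux_count

def intersect(receiver_set, sender_set, width):
    lut = build_lut(sender_set, width)
    return {x for x in receiver_set if lookup(lut, x, width)[0] == 1}

print(sorted(intersect({1, 5, 9}, {5, 9, 12}, 4)))
print(lookup(build_lut({5, 9, 12}, 4), 5, 4)[1])
```

The selection count depends only on the bit width, not on how many elements the Sender holds, and the Receiver learns a single bit per query.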
3.2 RLWE Substitution and RLWE Expansion
Before tackling the problems, two additional subroutines need to be discussed. The first is RLWE substitution, which transforms an RLWE ciphertext into for an odd integer . An RLWE key-switch key from to is precomputed based on the substituted secret key . In the process, an RLWE ciphertext is first substituted to get , and then key-switched to encrypted with the original secret key. A formal definition can be found in [ORAM2].
The RLWE substitution is used extensively in the RLWE expansion operation [ORAM2], which expands an RLWE ciphertext from into a vector . An example of how RLWE substitution fulfills the expansion is detailed in Appendix B.
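The polynomial part of the substitution can be sketched on plain coefficient lists. The code maps a(X) to a(X^k) in the ring modulo X^N + 1 for an odd k, and then shows one identity commonly used as a building block for expansion: adding a polynomial to its substitution with k = N + 1 doubles the even-indexed coefficients and cancels the odd ones. Appendix B is not reproduced here, so the link to the exact expansion procedure is an assumption.

```python
def substitute(poly, k):
    # Map a(X) -> a(X^k) in Z[X]/(X^N + 1) for odd k, using X^N = -1.
    n = len(poly)
    out = [0] * n
    for i, c in enumerate(poly):
        e = (i * k) % (2 * n)
        if e < n:
            out[e] += c
        else:
            out[e - n] -= c
    return out

def negacyclic_mul(a, b):
    # Integer schoolbook product in Z[X]/(X^N + 1), used to check the automorphism.
    n = len(a)
    out = [0] * n
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            if i + j < n:
                out[i + j] += ai * bj
            else:
                out[i + j - n] -= ai * bj
    return out

a = [1, 2, 3, 4, 5, 6, 7, 8]
n = len(a)
even_only = [x + y for x, y in zip(a, substitute(a, n + 1))]
print(even_only)
```

Because k is odd, the map is a ring automorphism, so it commutes with polynomial multiplication; this is why a substituted ciphertext remains a valid encryption under the substituted secret key.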
The data flow of RLWE substitution is shown in Figure 4, with the key switch highlighted in the dotted red box. An RLWE ciphertext in the NTT domain is first transformed into INTT form and substituted. Then it is decomposed with base and key-switched to the original secret key to get a substituted RLWE ciphertext. After that, the output ciphertext is post-processed for RLWE expansion. Based on our experiment, of the processing time of RLWE expansion is dedicated to the substitution and key switch functions, so these two functions are offloaded to an FPGA. Note that the key switch data flow is very similar to the bootstrap data flow. Therefore, the proposed architecture merges both data flows, which will be detailed in Section 4.
3.3 Optimizations for The PSI Protocol
For the proposed PSI, both the computation and communication costs depend directly on the bit width of the elements in the set. Hence, the first optimization is to reduce the bit width of the elements with permutation-based hashing [permhash]. In permutation-based hashing, to insert a 32-bit element from the Receiver’s set into bins, it is divided into , with consisting of bits. The position of the element is calculated by Equation 11, where is a hash function. Therefore, the position of an element also stores some information about the element. Instead of inserting into the hash table, only is inserted, which reduces the bit width to . The correctness of the comparison in the homomorphic LUT holds with permutation-based hashing, which is detailed in Appendix C. With permutation-based hashing, the number of transferred RGSWs is reduced by and the number of CMUXs is reduced by a factor of .
(11) 
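Permutation-based hashing can be sketched as follows, assuming the common construction in which the bin index is the XOR of one part of the element with a hash of the other part, and only the hashed part is stored. Which part the paper calls left or right, the use of SHA-256 as the hash function, and the bin count are assumptions for illustration.

```python
import hashlib

def h(x_r, bin_bits):
    # Hash the stored part down to a bin index (SHA-256 is an illustrative choice).
    digest = hashlib.sha256(x_r.to_bytes(4, "big")).digest()
    return int.from_bytes(digest[:4], "big") & ((1 << bin_bits) - 1)

def insert(x, bin_bits, total_bits=32):
    # Split x into a bin-index part and a stored part, then permute the index.
    r_bits = total_bits - bin_bits
    x_l = x >> r_bits
    x_r = x & ((1 << r_bits) - 1)
    return x_l ^ h(x_r, bin_bits), x_r

def recover(pos, x_r, bin_bits, total_bits=32):
    # The bin position plus the stored part determine the original element.
    x_l = pos ^ h(x_r, bin_bits)
    return (x_l << (total_bits - bin_bits)) | x_r

pos, stored = insert(0xDEADBEEF, 10)
print(stored.bit_length() <= 22, recover(pos, stored, 10) == 0xDEADBEEF)
```

Since the pair of bin position and stored part identifies the element uniquely, two elements that land in the same bin are equal exactly when their stored parts are equal, so comparing the shorter values suffices.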
The second optimization achieves further computational reduction by exploiting the vertical packing described in Section 2.4. With vertical packing, at most LUT elements can be packed into one RLWE ciphertext, which shrinks the number of CMUXs by roughly a factor of . For example, after the permutation hashing with , the bit width of the elements in each bin is , reducing the size of the CMUX tree to . At , the vertical packing reduces it to . Compared to the original size, , a reduction by times is achieved in total.
The last optimization aims at decreasing the communication payload. Instead of transferring an RGSW ciphertext, containing RLWE ciphertexts, for each bit of an element in the Receiver’s set, it is observed that the first column of an RGSW ciphertext, as shown in Equation 3 and Equation 4, can be calculated from the second column, which is detailed in Equation 12. Thus, only the second column needs to be transferred together with a shared RGSW encryption of the secret key [ORAM2]. The transaction size is therefore reduced by a factor of .
(12) 
However, the ciphertext utilization is still very low because for each element in the Receiver’s set, RLWEs are transferred. So, elements, for example, , from the Receiver’s set are packed into a 2D array of RLWE ciphertexts for better utilization. Each element of the array is formed as and is indexed by , (assuming after applying permutation-based hashing, the bit width is ). Upon receiving the array, the Sender unpacks it, with the RLWE expansion described in Section 3.2, into arrays of RLWEs for each element , of the form . Finally, the RLWEs are converted into RGSWs with Equation 12 and passed into the LUT to complete the PSI. Note that we set the of the RGSW to be equal to the of the key-switch key used in the RLWE expansion.
Together, compared to transferring complete RGSWs, the communication overhead is reduced from RGSWs ( RLWEs) per element to RLWEs per elements, amounting to a reduction if , , at the cost of increased computation on the Sender’s side to unpack and reconstruct the RGSWs.
Figure 5 shows the data flow of the proposed homomorphic LUT-based PSI. It assumes that after the permutation-based hashing, the data bit width is bits and the polynomial length is . The Receiver packs all the necessary bits into an array of RLWE ciphertexts and sends it to the Sender. The Sender then unpacks the array of RLWEs and converts it into an array of RGSWs in which each column encrypts the binary representation of an element in the Receiver’s set. Then, each column of RGSWs is passed through the LUT, and an RLWE that encrypts the lookup result at index 0 is generated. Finally, the LWEs that encrypt the lookup results are extracted from the RLWEs, as described in Section 2.2, and sent back to the Receiver. The hashing process is not shown in the figure. Other optimizations utilized in [PSI6] can also be applied to our protocol, such as pre-hashing both parties’ sets into smaller sets to reduce the set sizes and using modulus switching to reduce the reply ciphertext size.
In summary, after applying the above optimizations, the communication overhead of the scheme is , assuming, on the Receiver’s side, bins after hashing and at most one element in each bin, with dummy elements filling up the empty bins. The computation cost is .
4 Architecture of The Proposed Accelerator
4.1 Overall Architecture
Figure 6 (a) shows the overall architecture of the proposed accelerator with a zoom-in view of the compute pipeline in Figure 6 (b). Implemented on an AWS F1 instance, the accelerator is controlled and monitored by the host software running on an x86 processor through various AXI interfaces. The configuration parameters and instructions are programmed through the AXI-Lite interface, and the FIFO states are also read from it. The DMA module communicates with the FPGA through the AXI bus to program the FPGA DDR and read/write the RLWE FIFOs. The RLWEs streamed in and out of the FPGA are in the NTT domain. Further, the modular multiplication in the accelerator is implemented with the standard Barrett reduction [Barrettreduction].
The architecture works in a pipelined fashion, with the necessary inter-stage double buffering. Upon an input instruction, the key load module reads the corresponding key from the pre-programmed FPGA DRAM into its own key load FIFO. In parallel, the INTT/NTT modules inside the compute pipeline process the input RLWEs, hiding the DDR access delay of the key load module since the keys are only needed at the poly MAC stage, which implements the polynomial and RLWE vector inner product introduced in Section 2.2.4 and Section 3.2. Once the computation finishes, the output RLWEs are written back to the RLWE FIFO dictated by the mode of the accelerator, which will be detailed later, and then streamed out to the host.
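For reference, a software model of Barrett modular multiplication is sketched below. It uses the textbook single-estimate variant with a precomputed constant; the bit widths, the number of correction steps, and the function names are assumptions and do not describe the SystemVerilog implementation of the accelerator.

```python
def barrett_setup(q):
    # Precompute k = bit length of q and mu = floor(2^(2k) / q).
    k = q.bit_length()
    mu = (1 << (2 * k)) // q
    return k, mu

def barrett_mulmod(a, b, q, k, mu):
    # Requires 0 <= a, b < q so that the product is below 2^(2k).
    x = a * b
    est = (x * mu) >> (2 * k)  # estimate of floor(x / q); never too large
    r = x - est * q
    while r >= q:              # conditional subtraction(s) fix the small underestimate
        r -= q
    return r

q = 12289
k, mu = barrett_setup(q)
print(barrett_mulmod(12288, 12288, q, k, mu) == (12288 * 12288) % q)
```

The division by q is replaced with a multiplication by the precomputed constant and a shift, which maps well to fixed-width multipliers in hardware.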
As mentioned in Section 3.2, the accelerator merges the two data flows, the RLWE substitution and the bootstrap process. Note that the data flow of evaluating the homomorphic LUT introduced in Section 2.4.3 is mostly the same as the bootstrap flow since they both incorporate the operation, so they will not be differentiated in the remaining text. There are three primary differences between the two data flows. The first is the RLWE key switch versus the operation as highlighted in Figure 4 and Figure 1, respectively. Second, in RLWE substitution, after INTT, the subroutine that transforms the into , as stated in Section 3.2, is needed; this subroutine is unnecessary in the bootstrap process. Lastly, an RLWE ciphertext, streamed into the in/out FIFO, only passes through the compute pipeline once for RLWE substitution and is then streamed out from the output FIFO after the computation. In contrast, in the bootstrap process, after initialization, the same RLWE (homomorphic accumulator) must be looped times through the compute pipeline before being streamed out, meaning that the output RLWE from the compute pipeline should go to the same FIFO as the input RLWE.
The first two differences regarding the computation are automatically taken care of by the different instructions passed into the compute pipeline. For the third one, a mode configuration is added to the FIFOs to differentiate the situations, as shown by the dotted lines in Figure 6 (a). In mode, the in/out FIFO acts only as an input FIFO that receives the input RLWEs, whereas the output FIFO holds the processed RLWEs. In the mode, by contrast, the output FIFO is turned off and the in/out FIFO holds the intermediate RLWEs. The compute pipeline continuously reads and writes the in/out FIFO until the loop finishes. Then the RLWEs in the FIFO are streamed out to the host.
4.2 INTT Module
Figure 7 (a) details the structure of the INTT module, which follows the algorithm in Appendix D, except for the first outer loop where the input RLWE is read from the global in/out RLWE FIFO, and the intermediate result is written to its own two polynomial buffers since each RLWE contains two polynomials. Starting from the second outer loop, the input is read from the polynomial buffers and written back after being processed by the butterflies. Each INTT module stores its own copy of the twiddle factors (TFs) in its local memory.
Two parallel butterfly units are included in each INTT module to achieve better performance while maintaining reasonable FPGA resource usage. Thus, to feed enough data, each address of the polynomial buffer contains two consecutive coefficients of a polynomial. The BRAMs of the FPGA that are used to build the polynomial buffers inherently provide two read/write ports, fitting the butterfly data access pattern and allowing two different addresses to be read or written at the same time. However, the read and write can only be done in separate clock cycles, resulting in 50% butterfly utilization and halving the throughput. Therefore, we time-interleave the two polynomial buffers, as shown in Figure 7 (b), to achieve full utilization of the butterflies.
Due to the variation of the data access pattern of the butterfly units in each outer loop of the INTT algorithm, there is a mismatch between the data access pattern and the data storage pattern, resulting in two different data flows from the buffers to the butterflies. As shown in Figure 7 (c), in pattern 1, the data passed into a butterfly are from different addresses, while in pattern 2, they are from the same address. Therefore, the necessary data MUXs are appended to the butterfly units to reorder the input/output data as needed. All the necessary loop counters and step counters are implemented inside the control block, together with the control of the MUXs.
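The two access patterns can be reproduced with a small address model. The sketch assumes that coefficient i is stored at address i // 2 and that a radix-2 stage pairs coefficients a fixed span apart; which outer loop has which span depends on the algorithm in Appendix D, which is not shown here, so the stage ordering is not modeled.

```python
def stage_pairs(n, span):
    # Coefficient index pairs (j, j + span) processed by one radix-2 stage.
    pairs = []
    for start in range(0, n, 2 * span):
        for j in range(start, start + span):
            pairs.append((j, j + span))
    return pairs

def classify(n, span):
    # Two consecutive coefficients share one buffer address (index // 2).
    same = [a // 2 == b // 2 for a, b in stage_pairs(n, span)]
    if all(same):
        return "same-address"       # pattern 2 in the text
    if not any(same):
        return "different-address"  # pattern 1 in the text
    return "mixed"

for span in (8, 4, 2, 1):
    print(span, classify(16, span))
```

Only the stage with a span of one draws both butterfly operands from a single address; every other stage reads two distinct addresses, which is why input and output MUXs are needed to serve both cases with the same butterflies.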
Besides the INTT functionality, the INTT module also incorporates an init block for the homomorphic accumulator initialization function mentioned in Section 2.3.
4.3 Pipelined NTT Module
The NTT algorithm (Appendix E) is very similar to the INTT algorithm, except for the last scaling loop [NTTTRICK]. However, a different construction from the INTT module is adopted for the NTT module. The structure of the module is shown in Figure 8, and a discussion of this construction is included in a later section.
To achieve higher throughput for the NTT module, the outer loop of the NTT algorithm is unrolled into pipeline stages, with each stage only processing one fixed data access pattern, greatly reducing the control complexity of each stage. Compared to the structure of the INTT module, this implementation offers the same processing latency for an input polynomial but times higher throughput.
Each stage reads its input from the polynomial buffer of the previous stage and processes it with a data access pattern that is fixed for that stage at design time. Hence, no on-the-fly control or MUXs are needed for the data flow, which not only reduces resource usage but also relaxes the timing constraints. Note that no stage reads from and writes to the same buffer memory; therefore, the time-interleaving trick used in the INTT module is unnecessary.
Although the internal structures of the stages are mostly the same, except for the loop counter and step counter inside the control block, extra care should be taken in actual implementation. First, to adapt to different polynomial lengths, MUXs are needed to skip the leading stages for short polynomials (Figure 8). Second, the leading stages also incorporate the decomposition functionality as stated in Section 2.2.4 and Section 2.4.4, which is just a bitwise AND with a binary decomposition basis and is not detailed in the figure.
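As a minimal sketch of what a decomposition by bitwise AND can look like, the following Python splits an integer coefficient into digits of a power-of-two base. The names `decompose` and `recompose` and the parameters `log_base` and `num_digits` are hypothetical, and the shift-then-mask form is an assumption; in hardware the shift would typically be fixed wiring, leaving only the AND.

```python
def decompose(x, log_base, num_digits):
    """Split x into num_digits digits of log_base bits each, least significant first."""
    mask = (1 << log_base) - 1  # binary mask selecting one digit
    return [(x >> (i * log_base)) & mask for i in range(num_digits)]


def recompose(digits, log_base):
    """Inverse of decompose: weight digit i by base**i and sum."""
    return sum(d << (i * log_base) for i, d in enumerate(digits))


digits = decompose(0b110110, log_base=2, num_digits=3)
print(digits, recompose(digits, log_base=2))
```

Each digit is strictly smaller than the base, which is the property the decomposition relies on to keep noise growth small in the subsequent multiplication.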
4.4 Compute Pipeline Analysis: Asymmetric INTT and NTT
In this section, we refer to the overall latency of the INTT/NTT algorithms as one NTT latency (ONL) and the latency of one outer loop of the algorithm as one stage latency (OSL). Therefore, . Our compute pipeline architecture, which combines the pipelined NTT module (Figure 8) with non-pipelined INTT modules, is referred to as the asymmetric structure because of the throughput difference between the two styles. The conventional implementation, which uses similar structures and latencies for both the NTT and INTT modules (Figure 7), is defined as the symmetric structure.
The design of our compute pipeline concentrates on balancing high throughput with optimized resource usage and parameter flexibility. So, the main compute pipeline is built around an asymmetric structure, as shown in Figure 6 (b). A comparison of the symmetric and asymmetric structures is given in Figure 9, with the poly subs block omitted as it is not a throughput bottleneck. The dataflows of the (Figure 1) and RLWE substitution (Figure 4) can be mapped to both architectures with the same throughput. However, the asymmetric structure consumes fewer resources than its symmetric counterpart.
In the symmetric pipeline (Figure 9 (a)), to have balanced throughput, one INTT module is accompanied by many NTT modules since each input polynomial is decomposed into polynomials after the INTT operation. The throughput of both modules is one polynomial per ONL due to the non-pipelined construction. The NTTs are also followed by many polynomial/RLWE multiplication blocks to facilitate the inner product of the two dataflows. Although the trailing stages can operate with higher throughput, the overall throughput of the whole pipeline is capped by the first two stages, resulting in a throughput of one polynomial per ONL. Higher throughput can be achieved by operating multiple pipeline instances in parallel.
Most prior art implements an architecture similar to the symmetric structure, with the INTT and NTT modules designed separately and without considering the data flow connecting them. We go one step further and use the asymmetric structure to cope with the different throughput requirements of the INTT and NTT modules, as shown in Figure 9 (b). Since the throughput of the pipelined NTT is OSL, the overall throughput of the whole pipeline is one input polynomial per because of the polynomial decomposition, with one caveat: to balance the throughput between the INTT and NTT, INTT modules should operate in parallel. Note that in the asymmetric scenario, the trailing stages are also changed to the pipelined form ( vs. poly mult RLWE and an accumulation vs. a wide addition). In practice, is always greater than ; therefore, the asymmetric pipeline enables higher throughput than a single instance of the symmetric pipeline.
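The throughput argument can be made concrete with a toy timing model in Python. Everything in it is an assumption made for illustration: that a length-n transform has log2(n) outer loops so that ONL equals log2(n) times OSL, that each input polynomial is decomposed into `num_digits` polynomials, and that a non-pipelined module accepts one polynomial per ONL while the pipelined NTT accepts one per OSL. The function and parameter names are hypothetical.

```python
import math


def pipeline_intervals(n, num_digits, osl=1.0):
    """Return (symmetric interval, asymmetric interval, INTT modules needed).

    Intervals are the time between consecutive input polynomials, in units of osl.
    """
    stages = n.bit_length() - 1          # log2(n) for a power-of-two n (assumed)
    onl = stages * osl                   # one NTT latency (assumed relation)
    symmetric = onl                      # capped by the non-pipelined INTT/NTT
    asymmetric = num_digits * osl        # pipelined NTT absorbs num_digits polys
    # Non-pipelined INTTs each deliver one polynomial per ONL, so several must
    # run in parallel to keep the pipelined NTT busy.
    intt_modules = math.ceil(onl / asymmetric)
    return symmetric, asymmetric, intt_modules


print(pipeline_intervals(1024, num_digits=3))
```

With the illustrative values n = 1024 and three decomposed polynomials, the model gives one input every 10 OSL for a single symmetric instance versus one every 3 OSL for the asymmetric pipeline, provided four INTT modules run in parallel. These figures are not measurements from the paper.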
Though the symmetric structure can achieve the same throughput as the asymmetric one, with many instances operating in parallel, as seen in Figure 9 (a), the asymmetric pipeline uses fewer FPGA resources. The reduced resources stem from three sources. First, it is clear that in both cases, the number of INTT modules is the same, amounting to . The number of NTT modules seems to be the same as well since there are NTT modules for the symmetric pipeline and the asymmetric one also incorporates NTT stages. However, in the symmetric case, the NTT module has a similar structure to the INTT module shown in Figure 7 (a), which is much more complex than the NTT stage used in the pipelined NTT. Synthesis shows that with pipelined NTT, less LUT usage is achieved.
Furthermore, the pipelined NTT module has not only smaller control logic but also lower memory requirements. Part of the savings comes from less TF memory in the pipelined NTT. Each of the NTT modules used in the symmetric pipeline stores a complete copy of the polynomial of the TF in its own local memory, similar to what is shown in Figure 7 (a), so that they can operate independently. Therefore, in total, copies of the TF are stored. In contrast, in the asymmetric version, there is only one complete copy of the TF. Because each stage of the pipelined NTT is only responsible for one outer loop of the NTT algorithm, it only needs to store the portion of the TF that is used in that outer loop. For example, in the first stage of the pipelined NTT, instead of a complete polynomial of TF with coefficients, only one TF needs to be stored. Thus, overall, a times reduction of the TF memory usage is achieved with pipelined NTT, equivalent to over reduction in practice. It is possible to reduce the memory usage in the symmetric version by sharing one TF memory within one pipeline and forcing all the NTT modules to act at the same pace, but that implies stricter timing requirements since the capacitive load of the memory output is times higher, degrading performance. Forcing all NTTs to synchronize also reduces the flexibility of the architecture.
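The twiddle-factor (TF) storage comparison can be sketched numerically. The model below assumes that outer loop s of a radix-2 NTT uses 2^s distinct twiddle factors (one in the first stage, as stated above) and that each non-pipelined NTT module keeps a full table of n entries; the function names and the module count are hypothetical.

```python
def twiddles_per_stage(n):
    """Number of distinct twiddle factors assumed for each outer loop."""
    stages = n.bit_length() - 1  # log2(n) for a power-of-two n
    return [1 << s for s in range(stages)]


def tf_memory_words(n, num_ntt_modules):
    """Return (symmetric, asymmetric) TF storage in coefficient words."""
    symmetric = num_ntt_modules * n          # one full table per NTT module
    asymmetric = sum(twiddles_per_stage(n))  # each stage stores only its own part
    return symmetric, asymmetric


print(twiddles_per_stage(8))
print(tf_memory_words(1024, num_ntt_modules=3))
```

Under these assumptions the per-stage tables add up to n - 1 words, that is, roughly one complete copy, so the saving is approximately the number of NTT modules in the symmetric pipeline.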
The memory size of the pipelined NTT module is also reduced due to fewer polynomial buffers. In the non-pipelined NTT module, similar to the INTT module in Section 4.2, two polynomial buffers are instantiated for time-interleaved buffer access to maintain butterfly utilization. In contrast, each stage of the pipelined NTT module reads and writes different buffers; therefore, time-interleaving is unnecessary. So, the pipelined NTT poses a saving on the polynomial buffer compared to the non-pipelined version.
Lastly, the trailing stages of the asymmetric pipeline are also less complex than those of its symmetric counterpart. As shown in Figure 9, since the pipelined NTT outputs one polynomial at a time, only one poly mult RLWE module is needed in the asymmetric structure, compared to parallel mult modules in the symmetric one. In practice, it reduces the number of mult modules by times, with . Although the number of poly mult RLWE modules can be reduced in the symmetric pipeline by reusing one mult module across different NTTs in a time-interleaved manner, since its throughput is higher than that of the INTT/NTT, a very wide MUX, to one, must be inserted between the stages, which would greatly impact timing and performance and introduce more control complexity. In the asymmetric structure, there is a similar MUX between the INTT and NTT modules; however, it is only to one, which is much smaller. In addition, the wide RLWE addition in the symmetric pipeline is also replaced with an RLWE accumulation with ordinary word-size modulo addition.
Besides the resource savings, the asymmetric construction also automatically adjusts to different parameter settings. In the symmetric pipeline, the number of NTTs should be set as the largest possible number of dc of the application at design time. If at design time, the parallelism is 3 for NTT, when at run time, the utilization of the NTTs is only . Extra effort can be applied to remap the connection between INTT and NTT to reach utilization, but that comes with more control overhead, negatively impacting performance. However, with the pipelined NTT module, as long as the INTT continuously feeds input to it, utilization is always maintained with no extra control overhead involved since the design space of the pipelined NTT itself is independent of the parameter . In fact, even when the of run time is higher than the designated of design time, the pipelined NTT requires no extra control to handle it. However, it should be noted that in the above case, the INTT of the asymmetric pipeline can be underutilized. But since the number of INTT blocks is less than the number of NTT stages in general, it is not optimized in this work.
5 Measurement
5.1 Experiment Setup
The proposed architecture is implemented at a 125 MHz system frequency on an AWS F1 instance. The implementation supports up to bit input data word size. But to reduce the complexity of the modulo multiplication block in the butterfly, only a subset of the bit widths is implemented, as detailed in the following sections. Two polynomial lengths, and , which are typical for third-generation FHE and fit our experiment for the PSI protocol, are supported natively. The polynomial buffers in the FIFOs, implemented with BRAM, are configured to the size of the longer length, . In addition, since butterfly units operate at the same time in the INTT and NTT, each buffer line contains two consecutive polynomial coefficients. Thus, the size of each polynomial buffer is predefined as bit. However, in this prototype, no optimization of the BRAM utilization is devised, so when the input polynomial length is , only the first half of the buffer is used. Following the analysis of Section 4.4, the number of INTT modules is set to the largest possible value to keep a balanced throughput, which is in our implementation.
Parameter Set  
MEDIUM  256  512  1024  27  25  23  
STD128_AP  512  512  1024  27  25  23  
STD192  512  512  2048  37  25  23  
STD256  1024  1024  2048  29  25  32  
STD192Q  1024  1024  2048  35  25  32  
STD256Q  1024  1024  2048  27  25  32 
Parameter Set  Amortized Processing Time ()  Amortized Stream Out Time ()  Software [FHEWlike] ()  Improvement
MEDIUM  6615  49  141100  21.1 
STD128_AP  13238  48  283800  21.3 
STD192  26253  54  578400  22.0 
STD256  52523  54  1180800  22.4 
STD192Q  52524  59  1270500  24.2 
STD256Q  70031  58  1571500  22.4 
5.2 Measurement of Bootstrap of The Third Generation FHE
The parameter sets used to benchmark our implementation of the third-generation FHE are listed in Table 1 and adopted from [FHEWlike]. Since our work only implements the homomorphic accumulation of the bootstrap process (including evaluation, accumulation, and key switch) on the hardware, we only report the measurement of this operation to emphasize our advancement. It is composed of two parts: the processing time and the time of streaming out the result to the host. Table 2 summarizes the measurement of the homomorphic accumulation function. Due to the pipelined nature of the proposed accelerator, the maximum parallelism is 12 accumulations. So, the reported time is amortized over 12 inputs. Because the homomorphic accumulation function is independent of the input binary gate, we do not differentiate between gates during the measurement. The reported time is averaged over all measured input gates.
The software implementation [FHEWlike] of the FHEW scheme from the PALISADE library [PALISADE] operates on the same host machine and is used for comparison. Table 2 gives the comparison result of the proposed accelerator over software implementation. As stated above, only the homomorphic accumulation part is compared. On average, a speedup for the homomorphic accumulation function is achieved compared with the software implementation.
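The Improvement column of Table 2 appears consistent with dividing the software time by the sum of the amortized processing and stream-out times. That formula is inferred from the tabulated numbers rather than stated in the text, so the following Python check should be read as a sketch under that assumption; the dictionary and function names are hypothetical.

```python
# (processing, stream out, software, reported improvement), copied from Table 2.
TABLE2 = {
    "MEDIUM": (6615, 49, 141100, 21.1),
    "STD128_AP": (13238, 48, 283800, 21.3),
    "STD192": (26253, 54, 578400, 22.0),
    "STD256": (52523, 54, 1180800, 22.4),
    "STD192Q": (52524, 59, 1270500, 24.2),
    "STD256Q": (70031, 58, 1571500, 22.4),
}


def improvement(processing, stream_out, software):
    """Assumed definition: software time over total accelerator time."""
    return software / (processing + stream_out)


for name, (proc, out, sw, reported) in TABLE2.items():
    print(f"{name}: computed {improvement(proc, out, sw):.2f}, reported {reported}")
```

Every computed ratio lands within 0.1 of the reported value, which supports this reading of the column.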
5.3 Measurement of The Proposed PSI
In our implementation of the proposed PSI protocol, we set the encryption-related parameters to be , , with , which achieves around a 128-bit security level according to the LWE estimator [LWEestimator]. The of the RGSW and of the RLWE key-switch key are both set to . Since the proposed PSI is not directly available in open-source libraries, we developed the necessary components of the scheme ourselves for baseline comparison.
The average processing time of the two basic operations of the proposed PSI, RLWE substitution and , as deployed on the hardware is shown in Table 3. A comparison to our own software implementation is also included in the table. The raw measurement shows a speedup factor of over 140, which is much higher than the improvement of the bootstrap process. This discrepancy is discussed in Section 5.4.1, where the last column, 'Scaled Improvement,' is also explained.
Operation  Proposed Accelerator ()  Software ()  Improvement  Scaled Improvement
RLWE Substitution  105  17616  167.8  27.9 
105  14739  140.4  23.4 
Communication Size  
Parameters  Sender’s Processing  (MB)  
b  k  Time ()  b  k 
32  14  1642  27.0  256
32  12  814  7.5  64
32  10  585  2.1  16
30  14  1148  24.0  256
30  12  410  6.8  64
30  10  203  1.9  16
28  14  935  21.0  256
28  12  287  6.0  64
28  10  102  1.7  16
Based on the time consumption of the basic operations from Table 3, the processing times on the Sender’s side with the proposed accelerator and the communication size of the proposed PSI are listed in Table 4. Since the complexity of our scheme depends directly only on the bit width and the hash table size , assuming only one element in each bin on the Receiver’s side, we only list these two factors as design parameters in the table, with the security parameters set as above. The Receiver-to-Sender communication size does not include the key-switch keys and the RGSW-encrypted , which are of size 2.1 MB and 384 KB, respectively. Note that a modulus switch process can be applied to the returning LWE ciphertexts from the Sender to the Receiver, which can further reduce the message size by ~ [PSI6]. Figure 10 shows a time breakdown of the proposed PSI operating with the proposed accelerator. Four parts are included: (a) RLWE substitution; (b); (c) RGSW transfer, which transfers the reconstructed RGSWs to the FPGA DDR; and (d) software post-processing. The first three are attributed to hardware. The measured time consumption of each part is also included in the diagram. The software post-processing times are raw measurement data and not scaled, which is discussed in the next section.
5.4 Analysis of The Measurement Results
5.4.1 Software Inefficiency Encountered during Measurement
As mentioned in the above section, compared to the improvement of the bootstrap process listed in Table 2, we see a higher speedup in the basic PSI operations, as shown in Table 3. The discrepancy mainly results from the different software implementations that are used in the comparison. Since the proposed PSI and its basic operations are not directly available in open-source libraries, we developed the software implementation ourselves from scratch, both for verifying the hardware design and for baseline comparison. We also built our own software for the bootstrap process for the purpose of hardware verification and comparison.
However, due to our relatively limited effort, our own software code may not perform as efficiently as the highly optimized open-source libraries. In order to estimate the potential software performance discrepancy, a comparison between our own software and an open-source library [PALISADE] is conducted on the same host machine using commonly available operations such as NTT/INTT, polynomial operations, and the bootstrap process. Based on the comparison, our own software code is around 6× slower than the open-source library. Hence, the measured improvement in the third column of Table 3 is divided by 6 to factor in potential software optimization and obtain a more realistic speedup number for the basic operations of the proposed PSI. This scaled number is shown in the last column of Table 3.
The inefficiency in our software code includes unoptimized post-processing, which takes about of the total processing time of the proposed PSI operating on the accelerator (Figure 10). Thus, by factoring out this inefficiency, the total time consumption of the proposed PSI could be reduced by around (which is not accounted for in the reported performance in Table 4).
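The scaling applied to Table 3 can be reproduced with a few lines of Python. The sketch assumes that the raw improvement is the software time divided by the accelerator time and that the scaled value is the raw value divided by 6; the second operation is labeled generically because its name is not legible in the table, and all identifiers are hypothetical.

```python
SOFTWARE_SLOWDOWN = 6  # factor by which the in-house software trails the library

# (accelerator time, software time), copied from Table 3.
TABLE3 = {
    "RLWE substitution": (105, 17616),
    "second operation": (105, 14739),
}


def improvements(accelerator, software, slowdown=SOFTWARE_SLOWDOWN):
    """Return (raw improvement, improvement scaled for software inefficiency)."""
    raw = software / accelerator
    return raw, raw / slowdown


for name, (acc, sw) in TABLE3.items():
    raw, scaled = improvements(acc, sw)
    print(f"{name}: raw {raw:.1f}, scaled {scaled:.1f}")
```

The computed values agree with the Improvement and Scaled Improvement columns to within rounding.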
Parameters  Sender’s Processing (s)
b  k  Measured  Attainable Bound
32  14  1642  273
32  12  814  135
32  10  585  97
30  14  1148  191
30  12  410  68
30  10  203  33
28  14  935  155
28  12  287  47
28  10  102  17
5.4.2 I/O Bandwidth Bottleneck of the Implemented Accelerator
During the measurement, we find that the latency of processing just one input on the proposed acceleration hardware is ~350 for RLWE substitution and 309 for , which includes 120 streaming in and out. Due to the pipelined nature of the proposed accelerator, a maximum parallelism of 13 can be achieved in the RLWE mode. Therefore, ideally, the average time consumption of processing one input on the hardware should be ~17 , which is faster compared to the numbers listed in the first column of Table 3. This shows that, in the RLWE mode, the accelerator is bottlenecked by the I/O bandwidth. In the case that an optimized I/O is achieved, better performance can be extracted from the proposed accelerator.
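One reading of the ideal figure, which is an assumption of this sketch and not spelled out in the text, is that the streaming time is removed from the single-input latency and the remainder is divided by the pipeline parallelism. The function name is hypothetical, and the numbers are the RLWE-substitution values quoted above in their original (unstated) time unit.

```python
def ideal_amortized_time(single_input_latency, streaming_time, parallelism):
    """Compute-only latency shared across the inputs in flight (assumed model)."""
    return (single_input_latency - streaming_time) / parallelism


ideal = ideal_amortized_time(350, 120, 13)
measured = 105  # amortized accelerator time for RLWE substitution in Table 3
print(f"ideal {ideal:.1f}, measured {measured}, gap {measured / ideal:.1f}x")
```

Under this model the ideal amortized time is a little under 18, in line with the approximate value quoted in the text, and the gap to the measured 105 is what the I/O bottleneck would account for.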
Table 5 summarizes the (estimated) attainable bound of processing time of the proposed PSI, which both factors out software inefficiency and operates on an optimized I/O.
6 Conclusion
In conclusion, the first hardware acceleration architecture for third-generation FHE is proposed in this paper. Featuring an asymmetric INTT/NTT configuration, the proposed compute pipeline reduces resource usage while maintaining high throughput. An extensive analysis of the architecture is presented. An unbalanced PSI protocol based on third-generation FHE is also proposed to better demonstrate the architecture. Supplemented by several optimizations for reducing the communication and computation costs, the proposed PSI achieves a computation cost independent of the Sender’s set size. Implemented on an AWS cloud FPGA, the proposed accelerator achieves over 21× performance improvement compared with a software implementation on various subroutines of the FHE and the proposed PSI at 125 MHz.
Acknowledgment
Covered for blind review.
Appendix A RLWE to LWE Conversion
Since RLWE is a special form of LWE, the coefficients of the polynomial of an RLWE ciphertext can be converted into multiple separate LWE ciphertexts under the same secret key with some transformation of polynomial . For example, in an RLWE ciphertext , the coefficient of at index :
(13) 
can be viewed as an LWE ciphertext encrypted by secret key , where .
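The conversion can be checked numerically with a short Python sketch. It assumes the ring Z_q[X]/(X^N + 1), the convention b = a·s + m, and omits the noise term so that the check is exact; these conventions and all names are assumptions of the sketch and may differ in sign or ordering from the notation used in the main text.

```python
import random


def negacyclic_mul(a, s, q):
    """Multiply two polynomials modulo X^n + 1 and q."""
    n = len(a)
    out = [0] * n
    for i in range(n):
        for j in range(n):
            k = i + j
            if k < n:
                out[k] = (out[k] + a[i] * s[j]) % q
            else:  # X^n = -1 wraps the term around with a sign flip
                out[k - n] = (out[k - n] - a[i] * s[j]) % q
    return out


def extract_lwe(a, b, index, q):
    """Rearrange a so that <vec, s> equals coefficient `index` of a*s."""
    n = len(a)
    vec = [a[index - j] % q if j <= index else (-a[n + index - j]) % q
           for j in range(n)]
    return vec, b[index]


def demo(n=8, q=257, seed=0):
    rng = random.Random(seed)
    a = [rng.randrange(q) for _ in range(n)]
    s = [rng.randrange(2) for _ in range(n)]   # binary secret, as a vector
    m = [rng.randrange(q) for _ in range(n)]
    b = [(x + y) % q for x, y in zip(negacyclic_mul(a, s, q), m)]
    recovered = []
    for i in range(n):
        vec, b_i = extract_lwe(a, b, i, q)
        recovered.append((b_i - sum(v * sk for v, sk in zip(vec, s))) % q)
    return m, recovered


message, recovered = demo()
print(message == recovered)
```

Each extracted pair decrypts, under the coefficient vector of the same secret, to the message coefficient at the chosen index.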
Appendix B Example of RLWE Expansion with RLWE Substitution
To demonstrate how RLWE substitution fulfills the expansion, is shown as an example. For , . Thus, the addition of the substituted ciphertext to the original one extracts the even index coefficients of the message , and the subtraction extracts the odd index coefficients, as shown in Equation 14.
(14) 
Therefore, by recursively substituting with , for , each coefficient of the message is extracted into a separate RLWE ciphertext . The scale can be offset by prescaling the message with the multiplicative inverse of the in .
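The even/odd split can be demonstrated on plaintext polynomials. The sketch below works in Z_q[X]/(X^N + 1), applies the substitution X → X^(N+1), and adds or subtracts the result; it operates on unencrypted coefficients only, and the function names are hypothetical.

```python
def substitute(poly, k, q):
    """Map X -> X^k in Z_q[X]/(X^n + 1)."""
    n = len(poly)
    out = [0] * n
    for i, c in enumerate(poly):
        e = (i * k) % (2 * n)  # X has order 2n because X^n = -1
        if e < n:
            out[e] = (out[e] + c) % q
        else:
            out[e - n] = (out[e - n] - c) % q
    return out


def split_even_odd(poly, q):
    """Return (poly + sub, poly - sub) for the substitution X -> X^(n+1)."""
    n = len(poly)
    sub = substitute(poly, n + 1, q)
    even = [(x + y) % q for x, y in zip(poly, sub)]
    odd = [(x - y) % q for x, y in zip(poly, sub)]
    return even, odd


even, odd = split_even_odd([1, 2, 3, 4], q=97)
print(even, odd)
```

For the input [1, 2, 3, 4] the sum keeps only the even-index coefficients and the difference only the odd-index ones, each doubled. This factor of two per step is the scale that the pre-scaling mentioned above is meant to cancel.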
Appendix C LUT Comparison with Permutation Based Hashing
The comparison in the homomorphic LUT still holds with permutation-based hashing. Assuming that from the Receiver and from the Sender are in the same bin after hashing and , from Equation 11, it is apparent that
(15) 
Therefore, , resulting in . Thus, the correctness of the LUT-based PSI holds with permutation-based hashing.