1. Introduction
Currently deployed public-key cryptographic schemes, i.e., Rivest-Shamir-Adleman (RSA) and Elliptic-curve Cryptography (ECC), build their security on the hardness of mathematical problems such as prime factorization and discrete logarithms. While these schemes have been standardized and, to a large extent, remain useful, recent advances in quantum computing now threaten to break them (Shor, 1997). Therefore, researchers are focusing on designing and investigating quantum-resistant public-key algorithms and protocols to keep future communications secure.
Recently, the National Institute of Standards and Technology (NIST) started a competition for the standardization of post-quantum cryptographic (PQC) public-key protocols (NIST, Created January 3, 2017, Updated June 24, 2020), i.e., protocols that would not be vulnerable to quantum computers. As the competition approaches its end, the majority of the remaining candidates are based on computationally infeasible lattice problems. One such candidate is a key encapsulation mechanism (KEM) named SABER (D’Anvers et al., 2021), which is the central piece of this study.
Throughout the standardization/competition process, NIST has considered the security strength of PQC KEM protocols. C/C++ reference implementations of the finalist protocols are available from (NIST, Created January 3, 2017, Updated June 24, 2020). Naturally, as with ECC and RSA, accelerators for PQC candidates are of interest because dedicated hardware can achieve significant speedups. Examples of hardware accelerators for NIST PQC protocols are presented in (Roy and Basso, 2020; Dang et al., 2019; Maria Bermudo Mera et al., 2020; Banerjee et al., 2019; Zhu et al., 2021; Beirendonck et al., 2021; Abdulgadir et al., 2021), where both field-programmable gate array (FPGA) and application-specific integrated circuit (ASIC) platforms are targeted.
Comparatively, state-of-the-art hardware implementations of SABER (Roy and Basso, 2020; Zhu et al., 2021) provide significant performance improvements in terms of computation time for the key generation (KeyGen), encapsulation (Encaps) and decapsulation (Decaps) operations. The required computation time for these operations can be further reduced by employing different architectural and circuit-level solutions. Consequently, this work presents a design space exploration for the NIST PQC finalist SABER with a focus on improving performance.
The design space exploration in this work determines the adaptation of various architectural elements (i.e., distinct memory configurations, pipelining, and logic sharing) with an emphasis on optimizing the design for a specific 65nm ASIC technology. To initiate our design space exploration, we have selected an open-source implementation of SABER (Footnote 1: the utilized SABER core is modelled as an instruction-set coprocessor architecture. The code, available at https://github.com/sujoyetc/SABER_HW, is written in Verilog at the Register Transfer Level (RTL)). The existing code targets an FPGA platform, whereas our work targets an ASIC platform. Converting the code to ASIC is one of the contributions of our work, as well as the following:
Exploration of different types, numbers, and sizes of compiled memories in a ‘smart synthesis’ fashion.

Promoting logic sharing between SABER building blocks that require similar functionality.

Pipelining of selected portions of the design, thus trading off throughput for latency.

Design of a tapeout-ready SABER core in a commercial 65nm CMOS technology, for which we provide a layout and power, area, and timing characteristics.

Source code for our many architectures (Footnote 2: available from (Imran and Pagliarini, 2021)).
The remainder of this paper is organized as follows: Section 2 provides the required mathematical background and discusses the baseline architecture for the SABER PQC KEM protocol. Our design space exploration is given in Section 3. Implementation results and a comparison to the state of the art are provided in Section 4. Finally, Section 5 concludes the paper.
2. Preliminaries
This section presents the required mathematical background and a description of the chosen baseline architecture for SABER.
Symbols (or notations). The moduli $p$ and $q$ are powers of 2. The set of integers is denoted by $\mathbb{Z}$. The rings of integers modulo $p$ and $q$ are then $\mathbb{Z}_p$ and $\mathbb{Z}_q$, respectively. For an integer $N$, the rings of polynomials are denoted by $R_p = \mathbb{Z}_p[x]/(x^N+1)$ and $R_q = \mathbb{Z}_q[x]/(x^N+1)$, where $N$ is a fixed power of 2. Vectors are shown in bold lower-case font (e.g., a).

Security strength. The security strength relies on the hardness of the module Learning With Rounding (Mod-LWR) problem. A Mod-LWR sample is defined as follows:

$$\left(\mathbf{a},\; b = \left\lfloor \frac{p}{q}\left(\mathbf{a}^{T}\mathbf{s}\right) \right\rceil\right) \in R_q^{l \times 1} \times R_p \qquad (1)$$

In Eq. 1, a is a vector of randomly generated polynomials in $R_q$, s is a secret vector of polynomials in $R_q$ whose coefficients are sampled from a binomial distribution, and the modulus $p < q$. Distinguishing Mod-LWR samples from uniformly random samples in $R_q^{l \times 1} \times R_p$ constitutes the Mod-LWR problem. This problem is presumed to be computationally infeasible on both classical and quantum computers. Consequently, SABER is a good candidate for developing quantum-resistant cryptosystems.

PKE and KEM operations. SABER is a Chosen Ciphertext Attack (IND-CCA) secure KEM and a Chosen Plaintext Attack (IND-CPA) secure public-key encryption (PKE) scheme. The PKE operations are the generation of pairs of public and private keys (PKE.KeyGen), encryption (PKE.Enc) and decryption (PKE.Dec). Similarly, the corresponding KEM operations are key generation (KEM.KeyGen), encapsulation (KEM.Encaps) and decapsulation (KEM.Decaps). These operations are described as follows:
Key Generation. PKE.KeyGen starts by randomly generating a seed that defines an $l \times l$ matrix A containing $l^2$ polynomials in $R_q$. A function gen (see Algorithm 1 of (Roy and Basso, 2020)) is used to generate the matrix from the seed based on SHAKE-128. A secret vector s of polynomials is also generated. These polynomials are sampled from a centered binomial distribution. The generated public key contains the matrix seed and the rounded product $\mathbf{A}^{T}\mathbf{s}$, while the secret key contains the secret vector s. KEM.KeyGen does not differ from PKE.KeyGen, except that it appends to the secret key a hash of the public key and a randomly generated string $z$.
Encryption and Encapsulation. The PKE.Enc operation consists of generating a new secret $\mathbf{s}'$ and adding the message $m$ to the inner product between the public key $\mathbf{b}$ and the new secret $\mathbf{s}'$. This forms the first part of the ciphertext, while the second part contains the rounded product $\mathbf{A}\mathbf{s}'$. The KEM.Encaps operation starts by randomly generating a message $m$ and deriving, from that message and the public key, the randomness used for encryption. The ciphertext contains the encrypted message and a value obtained from the message and the public key.
Decryption and Decapsulation. PKE.Dec requires the secret key s to extract the original message $m$ from the inner product between the public and secret keys. It is the reverse of PKE.Enc. KEM.Decaps re-encrypts the obtained message with the randomness associated with it and checks whether the resulting ciphertext corresponds to the one received.
Set of parameters. For security levels equivalent to AES-128, AES-192, and AES-256, SABER provides three variants termed LightSABER, SABER, and FireSABER, respectively. All three variants use polynomial degree $N = 256$ and moduli $q = 2^{13}$ and $p = 2^{10}$. They differ only in the module dimension ($l$), the binomial distribution parameter ($\mu$), and the message space. For more details about security parameters and the PKE and KEM operations, we refer readers to Algorithms 1–6 of (Roy and Basso, 2020).
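The Mod-LWR sample described above can be illustrated with a short Python sketch. It is not taken from the SABER reference code: the function names are hypothetical, the values N = 256, q = 2^13, p = 2^10, l = 3 and mu = 8 are assumed from the SABER specification, and Python's `random` module stands in for a cryptographic randomness source.

```python
import random

N = 256            # polynomial degree (assumed SABER parameter)
EQ, EP = 13, 10    # q = 2**13 and p = 2**10 (assumed SABER moduli)
Q, P = 1 << EQ, 1 << EP
L = 3              # module dimension (assumption, middle SABER variant)
MU = 8             # binomial distribution parameter (assumption)


def poly_mul(a, b, mod):
    """Schoolbook product in Z_mod[x]/(x^N + 1), i.e. a negacyclic convolution."""
    res = [0] * N
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            k = i + j
            if k < N:
                res[k] = (res[k] + ai * bj) % mod
            else:
                # x^N = -1, so terms that wrap around are subtracted
                res[k - N] = (res[k - N] - ai * bj) % mod
    return res


def sample_secret(rng):
    """Centered binomial coefficients: HW(x) - HW(y) with MU/2 random bits each."""
    half = MU // 2
    return [
        bin(rng.getrandbits(half)).count("1") - bin(rng.getrandbits(half)).count("1")
        for _ in range(N)
    ]


def mod_lwr_sample(rng):
    """Return (a, s, b) with b = round((p/q) * a^T s), computed by add-and-shift."""
    a = [[rng.randrange(Q) for _ in range(N)] for _ in range(L)]
    s = [sample_secret(rng) for _ in range(L)]
    acc = [0] * N
    for ai, si in zip(a, s):
        prod = poly_mul(ai, [c % Q for c in si], Q)
        acc = [(x + y) % Q for x, y in zip(acc, prod)]
    h = 1 << (EQ - EP - 1)  # rounding constant: add one half before dropping bits
    b = [((c + h) % Q) >> (EQ - EP) for c in acc]
    return a, s, b
```

Because both moduli are powers of 2, the scaling by p/q reduces to a right shift by 3 bits, which is one reason the scheme maps well to hardware.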
2.1. Baseline architectures
2.1.1. FPGA Coprocessor architecture of (Roy and Basso, 2020)
As introduced in Section 1, we have used an open-source crypto core for which the target platform is FPGA. The coprocessor consists of: (i) a data memory (BRAM with a size of 1024×64); (ii) a program memory; (iii) a dedicated finite state machine (FSM) based controller for orchestrating the SABER operations; and (iv) individual SABER building blocks. The building blocks include: (i) polynomial Vector-Vector multiplier wrapper; (ii) variants of secure hashing algorithms, i.e., SHA3-256, SHA3-512, and SHAKE-128; (iii) a binomial sampler; (iv) AddPack; (v) AddRound; (vi) Verify; (vii) Constant-time Move (CMOV); (viii) Unpack; (ix) CopyWords; and (x) BS2POLVEC_p.
A BRAM-implemented memory is used to keep the initial, intermediate, and final results of the required cryptographic operations. A program memory gives the coprocessor its flexibility: its instruction set architecture (ISA) comprises the instructions required by (the variants of) SABER. For polynomial multiplication, inside the Vector-Vector multiplier, a centralized schoolbook multiplier architecture is utilized (described in (Basso and Roy, 2020)). A sampler is required to compute a sample from a pseudo-random input string for all KeyGen, Encaps, and Decaps operations. The Verify block is responsible for comparing two byte strings of the same length. Based on the output of the Verify unit, CMOV copies either the decrypted session key or a pseudo-random string to a specified memory location. The AddPack block computes coefficient-wise addition with a constant followed by the generated message. Moreover, it packs the resultant bits into a byte string. Similarly, the AddRound block performs coefficient-wise addition of a constant followed by coefficient-wise rounding. The Unpack unit converts a byte string into a bit string. The BS2POLVEC_p block converts a byte string into a polynomial vector. A dedicated FSM is responsible for interpreting incoming instructions from the program memory and for communicating with and activating the individual building blocks.
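The interplay of the Verify and CMOV blocks can be sketched in Python as follows. This is an illustrative behavioural model only: the function names and the convention that 0 means "equal" are assumptions, and Python gives no real constant-time guarantees, unlike the hardware blocks.

```python
import hmac


def verify(received: bytes, recomputed: bytes) -> int:
    """Compare two byte strings of the same length; 0 means equal, 1 means different."""
    return 0 if hmac.compare_digest(received, recomputed) else 1


def cmov(session_key: bytes, fallback: bytes, fail: int) -> bytes:
    """Select session_key when fail == 0 and the pseudo-random fallback when fail == 1.

    The selection uses a byte mask rather than a branch, mirroring the idea of a
    constant-time move.
    """
    mask = -fail & 0xFF  # 0x00 when fail == 0, 0xFF when fail == 1
    return bytes(k ^ (mask & (k ^ f)) for k, f in zip(session_key, fallback))
```

In a decapsulation flow, `verify` would compare the received ciphertext with the re-encrypted one, and its result would drive `cmov`.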
2.1.2. Our baseline architecture
To achieve our design premise, i.e., high performance, we have constructed a baseline ASIC architecture for evaluation on a commercial 65nm technology. The first key difference with respect to (Roy and Basso, 2020) is the replacement of the BRAM with an SRAM. The SRAM is generated using a commercial memory compiler provided by a partner foundry. Initially, for the baseline architecture, the memory size is kept identical (1024×64). We will later show many variants where the number of memory instances and their sizes are optimized with the aim of improving the clock frequency.
It is important to note that our baseline architecture remains a coprocessor architecture and that the same ISA is utilized. We assume the program memory resides outside of the SABER accelerator core. The same building blocks utilized in (Roy and Basso, 2020) are kept in our work, but most of them are modified during our optimizations, which we detail in the next section.
3. Design Space Exploration Process
To differentiate our generated architectures from one another, we have adopted a different name for each design, as shown in Fig. 1. To provide a simple terminology for the studied architectures, we use the prefixes DP and SP, meaning that the architecture employs either a dual-port or a single-port memory. Similarly, the PIP prefix implies that the architecture in question is pipelined. Based on this terminology, the following architectures are considered:

Baseline: DP_1(1024×64)

Optimized: DP_2(1024×32), DP_4(1024×16), DP_8(512×16), PIP_DP_4(1024×16), and PIP_SP_4(256×64)
In total, we present five optimized designs originating from our baseline architecture. The memory is structured as i(m×n), where i is the number of instances, m is the number of memory addresses, and n is the data width of each address.
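As a quick check of this naming scheme, the Python sketch below parses each design name and computes the total memory capacity. The parser and its function names are hypothetical helpers, and the design names are written with an ASCII `x` in place of the multiplication sign.

```python
import re

# Design names as listed in the results table of this paper (ASCII "x" for "×").
CONFIGS = [
    "DP_1(1024x64)",
    "DP_2(1024x32)",
    "DP_4(1024x16)",
    "DP_8(512x16)",
    "PIP_DP_4(1024x16)",
    "PIP_SP_4(256x64)",
]

_PATTERN = re.compile(r"(PIP_)?(DP|SP)_(\d+)\((\d+)x(\d+)\)")


def parse_config(name):
    """Split a design name into (pipelined, port type, instances, addresses, width)."""
    match = _PATTERN.fullmatch(name)
    if match is None:
        raise ValueError(f"unrecognized design name: {name}")
    pipelined, port, instances, addresses, width = match.groups()
    return bool(pipelined), port, int(instances), int(addresses), int(width)


def capacity_bits(name):
    """Total capacity = instances * addresses * data width."""
    _, _, instances, addresses, width = parse_config(name)
    return instances * addresses * width
```

Under this reading, every configuration provides the same 65,536 bits (8 KiB) of storage; only the partitioning of that storage changes.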
In addition to the FSM controller and building blocks shown in Fig. 1, our design space exploration led to the creation of new units: (i) memory manager; (ii) pipeline register; and (iii) shared shift buffer. All these units are common to all of our studied architectures, except for the pipeline register that is employed only in our pipeline architectures, i.e., PIP_DP and PIP_SP. Furthermore, we have done modifications to many building blocks to synchronize their inputs/outputs with the memory timing requirements. The modified blocks are shown with dashed lines in Fig. 1.
3.1. Memory manager
A smart memory synthesis (Sumbul et al., 2015) approach is investigated and implemented in our Memory Manager unit. The central concept of smart synthesis is the observation that having smaller, distributed memories can be advantageous in an ASIC design. Smaller memories require simpler address decoder units, which are faster. This, combined with the fact that part of the address decoding is now described as logic and can be co-optimized with the remainder of the design, leads to performance improvements with a sometimes marginal increase in area. In this work, we explore a smart memory synthesis strategy within the limitations of a commercial memory compiler.
For KEM operations, when the security is equivalent to AES-192, SABER requires 992, 1344, and 1088 bytes for a single public key, secret key, and ciphertext, respectively (D’Anvers et al., 2021). Therefore, a relatively large memory (1024×64) is employed in (Roy and Basso, 2020). We have used the same memory size in our baseline architecture. To initiate our design space exploration, we have divided the data width (64 bits) of the employed memory into smaller chunks (32 and 16 bits) and increased the number of memory instances accordingly. With this division, the memory structure becomes DP_2(1024×32) and DP_4(1024×16). This design choice results in an increase in clock frequency at the expense of area and power. Thereafter, starting from the DP_4(1024×16) memory structure, we have constructed another architecture in which the number of memory addresses per instance is reduced from 1024 to 512. In this case, the memory structure becomes DP_8(512×16). This design choice, however, results in an increase in area and power with only a marginal gain in clock frequency. Therefore, at this point, we deem that further dividing the memories is no longer of interest.
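The division of one logical 1024×64 memory into several smaller instances can be modelled with the Python sketch below. The address mapping is an assumption made for illustration (a 64-bit word is sliced across parallel instances, and the upper address bits select a group of instances when the depth is reduced); the text does not state how the Memory Manager actually maps addresses.

```python
class BankedMemory:
    """Behavioural model of a logical 64-bit-wide memory built from smaller instances."""

    def __init__(self, instances, addresses, width, word_bits=64):
        self.width = width
        self.addresses = addresses
        self.lanes = word_bits // width        # instances read in parallel per word
        self.groups = instances // self.lanes  # groups selected by upper address bits
        self.banks = [[0] * addresses for _ in range(instances)]

    def _locate(self, addr):
        group, row = divmod(addr, self.addresses)
        if not 0 <= group < self.groups:
            raise IndexError(f"address {addr} out of range")
        return group, row

    def write(self, addr, value):
        group, row = self._locate(addr)
        mask = (1 << self.width) - 1
        for lane in range(self.lanes):
            # each lane stores one width-bit slice of the word, lowest slice first
            self.banks[group * self.lanes + lane][row] = (value >> (lane * self.width)) & mask

    def read(self, addr):
        group, row = self._locate(addr)
        value = 0
        for lane in range(self.lanes):
            value |= self.banks[group * self.lanes + lane][row] << (lane * self.width)
        return value
```

For example, `BankedMemory(8, 512, 16)` models a structure of eight 512×16 instances: four instances hold the slices of each word, and the most significant address bit chooses between the two groups.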
In our first pipelined architecture, i.e., PIP_DP, we have used the same 4(1024×16) memory structure as employed in DP_4(1024×16). Our second pipelined architecture, however, utilizes compiled RegFiles (Footnote 3: RegFiles are not flip-flops. This is a vendor-specific term for a compiled 6T SRAM memory that is advantageous when bit density can be traded off for performance. It is also termed a “high-speed” variant of SRAM by its vendor). One of the limitations of the RegFile is that the IP available to us is single-port, meaning that the design has to be modified such that all building blocks that benefit from concurrent read and write operations now execute them sequentially, one after the other. The consequence is that the overall number of clock cycles for a given cryptographic operation increases. Later, we will show that this increase is acceptable since the improved clock frequency still reduces the overall latency for all SABER operations. The memory structure of the PIP_SP architecture is 4(256×64).
Table 1. Synthesis results for the baseline and optimized architectures (power values in mW).

| Design | Area (mm²) | Gates | Clk. P. (ns) | Freq. (MHz) | Crypto core Lkg | Crypto core Dyn | Comb. logic Lkg | Comb. logic Dyn | Memory Lkg | Memory Dyn |
|---|---|---|---|---|---|---|---|---|---|---|
| DP_1(1024×64) | 0.299 | 43336 | 2.000 | 500 | 0.090 | 86.844 | 0.059 | 16.235 (19%) | 0.003 | 38.001 (44%) |
| DP_2(1024×32) | 0.308 | 45319 | 1.718 | 582 | 0.091 | 104.835 | 0.059 | 18.499 (18%) | 0.004 | 48.322 (46%) |
| DP_4(1024×16) | 0.340 | 39981 | 1.638 | 610 | 0.082 | 135.342 | 0.051 | 18.762 (14%) | 0.006 | 81.368 (60%) |
| DP_8(512×16) | 0.478 | 45979 | 1.624 | 615 | 0.099 | 220.410 | 0.062 | 21.691 (10%) | 0.010 | 157.490 (71%) |
| PIP_DP_4(1024×16) | 0.365 | 46217 | 1.508 | 663 | 0.097 | 233.361 | 0.063 | 20.890 (10%) | 0.006 | 168.476 (72%) |
| PIP_SP_4(256×64) | 0.314 | 64230 | 0.998 | 1002 | 0.111 | 142.413 | 0.074 | 32.925 (23%) | 0.006 | 39.060 (27%) |

Clk. P.: clock period; Lkg: leakage power; Dyn: dynamic power.
3.2. Pipelining
Initially, with the goal of improving clock frequency, we have employed different memory configurations until the improvements in clock frequency were exhausted. However, as the memory configurations change, the critical path of the design changes as well. In order to shorten the critical path and to further optimize the clock frequency, we have to explore other circuit level solutions, such as selective pipelining.
Based on the evaluation of the critical path of several architectures (details are given in section 4.1), it becomes evident that the memory is the performance bottleneck of the design. For this reason, we have placed pipeline registers at the memory output. This guarantees that the critical path is proportional to the memory access time (as opposed to being proportional to the memory and to the logic that follows it). Therefore, in our PIP_DP and PIP_SP architectures, the input to the pipeline register is from the memory while the output is connected to the binomial sampler (not shown in Fig. 1).
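The effect of the pipeline register on the clock period can be summarized with a first-order timing model. The symbols below are illustrative assumptions and are not taken from the synthesis reports: $t_{\mathrm{mem}}$ is the memory access time, $t_{\mathrm{logic}}$ the delay of the logic that follows the memory, and $t_{\mathrm{cq}}$ and $t_{\mathrm{setup}}$ the clock-to-output and setup times of a register.

```latex
% Without a register at the memory output, memory and logic delays add up:
T_{\mathrm{clk}}^{\mathrm{unpipelined}} \ge t_{\mathrm{mem}} + t_{\mathrm{logic}} + t_{\mathrm{setup}}
% With the register, the two stages are timed separately:
T_{\mathrm{clk}}^{\mathrm{pipelined}} \ge \max\left(t_{\mathrm{mem}} + t_{\mathrm{setup}},\; t_{\mathrm{cq}} + t_{\mathrm{logic}} + t_{\mathrm{setup}}\right)
```

Under this model, the pipelined clock period is bounded by the memory access time alone whenever the memory stage is the slower of the two, at the cost of one extra clock cycle per pipelined read.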
3.3. Shared shift buffer
For several building blocks of SABER, i.e., AddRound, AddPack, BS2POLVEC_p, and Multiplier, a shift register is required to read from many memory addresses and accumulate (hundreds of) bits into local registers. For example, a 320-bit long register is required in AddPack and BS2POLVEC_p while a 64-bit and a 676-bit register are required in AddPack and Multiplier, respectively. It is important to mention that all the SABER building blocks produce outputs serially, so the shift buffer can be shared as there are no concerns with concurrent access. Therefore, we have employed a single 676-bit register that is shared by AddRound, AddPack, BS2POLVEC_p, and Multiplier. The use of a shared shift buffer results in a 10.3% decrease in the total area with no impact on performance. All results given in the next section consider the use of this shared buffer by all architectures.
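The sharing argument can be illustrated with a small behavioural model in Python. The class, its method names, the ownership check, and the shift direction are assumptions made for this sketch; only the 676-bit width and the serial, non-concurrent use come from the text.

```python
class SharedShiftBuffer:
    """One shift register reused by several building blocks, one block at a time."""

    WIDTH = 676  # widest requirement among the sharing blocks, per the text

    def __init__(self):
        self.value = 0
        self.owner = None

    def acquire(self, block):
        """Reserve the buffer; blocks run serially, so overlapping use is an error."""
        if self.owner is not None:
            raise RuntimeError(f"buffer busy: held by {self.owner}")
        self.owner = block
        self.value = 0

    def shift_in(self, word, bits=64):
        """Shift the register left by `bits` and append `word` in the low bits."""
        word &= (1 << bits) - 1
        self.value = ((self.value << bits) | word) & ((1 << self.WIDTH) - 1)

    def release(self):
        self.owner = None
```

A block would call `acquire`, accumulate memory words with `shift_in`, consume `value`, and then `release` the buffer for the next block.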
4. Results and Comparisons
The synthesis results on a 65nm commercial technology for our baseline and optimized architectures are presented in Table 1. These results are obtained after logic synthesis in Cadence Genus. The initial power estimates are obtained by assuming constant switching probabilities (i.e., while considering a synthetic workload).
As shown in Table 1, the concurrent use of compiled memories in a ‘smart synthesis’ fashion, logic sharing between several SABER building blocks, and pipelining allows us to raise the clock frequency from 500 MHz to 1002 MHz, albeit with overheads in area (column two) and power (columns six to eleven). With several optimizations from the baseline (DP) to the PIP_DP architectures, we have shown that memory is the actual bottleneck in our implementation. For example, for the baseline architecture, the memory consumes 44% of the total dynamic power while the combinational logic consumes 19%. Moreover, an increase in the number of memory instances results in an increase in power (72% of the total dynamic power; see the last column of Table 1 for our PIP_DP_4(1024×16) architecture). Therefore, one approach to overcome this bottleneck is the use of faster memory instances, as employed in our PIP_SP_4(256×64) architecture, where combinational logic is responsible for 23% of the dynamic power while memory is responsible for 27%.
One interesting aspect of the PIP_SP_4 architecture is that the higher clock frequency changes the behavior of the synthesis tool considerably. We have verified that the tool then prefers to map the logic to (numerous) simpler gates instead of complex gates. Our analysis of the synthesis log also shows that partitioning decisions made by the tool were more frequent. The end result is that the PIP_SP_4 architecture has 18k more logic gates than its counterpart PIP_DP_4. We have also verified an increase in the number of buffers and inverters. Even for a simple gate like NAND2, we see 1626 instances in PIP_DP_4 while PIP_SP_4 has 3450 instances. It is important to highlight that the number of flip-flops does not change since the PIP_SP_4 design is identical to PIP_DP_4.
We have calculated clock cycles (CCs) from end to end for each operation (KEM.KeyGen, KEM.Encaps, and KEM.Decaps). The time required to perform one cryptographic computation determines the latency (in µs) and is calculated using Eq. 2. The CC information for each SABER building block is given in Table 2. The total CCs and latency to compute KEM.KeyGen, KEM.Encaps and KEM.Decaps for our baseline and optimized architectures are shown in Table 3.

$$\text{Latency}\ (\mu s) = \frac{\text{CCs}}{\text{Clock frequency (MHz)}} \qquad (2)$$
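Table 2. Clock cycles of the SABER building blocks.

| Building block | Clock cycles (Roy and Basso, 2020) | Clock cycles (this work) | Reason |
|---|---|---|---|
| Binomial Sampler | 145 | 246 | Pipelining |
| Multiplier | 894 | 970 | Memory sync. |
| Unpack | 167 | 295 | Memory sync. |
| CopyWords | 60 | 211 | Single-port RegFile |
| Others | – | – | No change |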
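Table 3. Total clock cycles and latency for the KEM operations.

| Design | KeyGen (CCs) | Encaps (CCs) | Decaps (CCs) | KeyGen latency (µs) | Encaps latency (µs) | Decaps latency (µs) |
|---|---|---|---|---|---|---|
| DP_1 | 5644 | 6990 | 8664 | 11.2 | 13.9 | 17.3 |
| DP_2 | 5644 | 6990 | 8664 | 9.6 | 12.0 | 14.8 |
| DP_4 | 5644 | 6990 | 8664 | 9.2 | 11.4 | 14.2 |
| DP_8 | 5644 | 6990 | 8664 | 9.1 | 11.3 | 14.0 |
| PIP_DP_4 | 5741 | 7087 | 8761 | 8.6 | 10.6 | 13.12 |
| PIP_SP_4 | 7154 | 7136 | 9359 | 7.1 | 7.1 | 9.3 |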
Table 2 reveals that the simultaneous use of multiple optimization approaches results in additional CCs when compared to the baseline design. For example, our PIP_SP_4(256×64) architecture requires 101, 76, 128, and 151 additional CCs for the Binomial Sampler, Vector-Vector Polynomial Multiplier, Unpack, and CopyWords building blocks, respectively. For the other building blocks, the CC count remains identical to the original design (meaning no changes when compared to (Roy and Basso, 2020)). Similarly, Table 3 shows that the combined increase in CCs and clock frequency (values given in column five of Table 1) results in a net decrease in computation time.
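The trade-off between extra clock cycles and a shorter clock period can be reproduced with a few lines of Python. The sketch multiplies the clock cycle counts by the clock period, which is equivalent to dividing them by the clock frequency; the dictionary and function names are illustrative, and the values are copied from the synthesis and clock cycle tables of this paper.

```python
# Clock periods (ns) and total clock cycles (KeyGen, Encaps, Decaps) per design.
CLOCK_PERIOD_NS = {"DP_1": 2.000, "PIP_SP_4": 0.998}
CYCLES = {
    "DP_1": (5644, 6990, 8664),
    "PIP_SP_4": (7154, 7136, 9359),
}


def latency_us(cycles, clock_period_ns):
    """Latency in microseconds: clock cycles times the clock period."""
    return cycles * clock_period_ns / 1000.0


def design_latencies(design):
    """Latencies (KeyGen, Encaps, Decaps) of one design, in microseconds."""
    period = CLOCK_PERIOD_NS[design]
    return tuple(latency_us(cc, period) for cc in CYCLES[design])
```

For the baseline, `design_latencies("DP_1")` gives about 11.29, 13.98 and 17.33 µs; for PIP_SP_4 the higher cycle counts are more than offset by the shorter clock period, giving about 7.14, 7.12 and 9.34 µs.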
4.1. Critical path analysis
The critical paths of our baseline and optimized architectures are shown in Fig. 2. Our analysis reveals that memories with longer access times result in longer critical paths for most architectures (i.e., the memory presents itself as the bottleneck), while the use of faster RegFiles results in a shorter critical path. In other words, as shown in Fig. 2, the critical paths of our baseline architectures depend on the memory and, to a lesser degree, on some amount of combinational logic. However, this is not the case for our optimized PIP_SP architecture, where the critical path is mostly combinational logic (and the setup time of the destination flip-flop). This result implies that our optimized architecture is saturating the memory bandwidth thanks to our optimization strategies at the architecture and circuit levels.
4.2. Physical layout for PIP_SP
The layout of the CCA-secure KEM SABER accelerator, as shown in Fig. 3, is obtained from Cadence Innovus. The accelerator circuit was implemented with a nominal voltage of 1.2V in a 65nm CMOS technology. The design is placed and clock tree synthesis (CTS) is performed. The circuit is fully routed and passes design rule checking (DRC) with no violations. Metals M1 through M7 are used for signal routing, while the power is distributed in M8/M9. This is a typical metal stack for the considered 65nm process. The circuit is tapeout-ready with a core utilization of 88.66%.
The results achieved after physical synthesis for different corners are given in Table 4. These results were obtained with the aid of value change dump (VCD) files, i.e., files that capture the activity of the design based on representative simulation loads. Thus, the power values reported here are more realistic. Three different corners were used for characterization: slow-slow (SS), typical-typical (TT), and fast-fast (FF). These corners correspond to operating conditions with different voltages and temperatures. The results reveal, as expected, that FF consumes more power than TT. Similarly, TT consumes more power than SS.
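Table 4. Power values (in mW) after physical synthesis for the SS, TT, and FF corners.

| Operation | SS | TT | FF |
|---|---|---|---|
| KEM.KeyGen | 146.7 | 184.3 | 244.8 |
| KEM.Encaps | 148.9 | 187.0 | 248.3 |
| KEM.Decaps | 148.4 | 186.4 | 247.5 |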
The comparison to existing SABER implementations is given in Table 5. Column one provides the reference implementation while the targeted platform is given in column two. The latency in µs for KEM.KeyGen, KEM.Encaps and KEM.Decaps is given in column three. Column four provides the clock frequency (MHz). Finally, the last column provides the area for FPGA (in terms of look-up tables and flip-flops) and ASIC (in mm²) platforms. We have placed a ‘–’ where the required information is not available.
Table 5. Comparison to existing SABER implementations.

| Ref. # | FPGA/ASIC | Latency (µs), KeyGen/Encaps/Decaps | Freq. (MHz) | Area, LUT/FF (or) mm² |
|---|---|---|---|---|
| (Abdulgadir et al., 2021) | Artix7 | –/467.1/527.6 | 100 | 6713/7363 |
| (Dang et al., 2019) | Ultrascale+ | –/60/65 | 322 | –/– |
| (Maria Bermudo Mera et al., 2020) | Artix7 | 3.2K/4.1K/3.8K | 125 | 7.4K/7.3K |
| (Roy and Basso, 2020) | Ultrascale+ | 21.8/26.5/32.1 | 250 | 23.6K/9.8K |
| (Zhu et al., 2021) | 40nm | 2.66/3.64/4.25 | 400 | 0.38 |
| PIP_SP | 65nm | 7.1/7.1/9.3 | 1000 | 0.314 |
Comparison to FPGA implementations (Dang et al., 2019; Maria Bermudo Mera et al., 2020; Roy and Basso, 2020; Abdulgadir et al., 2021). In terms of computation time (shown in Table 5), the most efficient implementation of SABER on FPGA is described in (Roy and Basso, 2020). It takes 5453, 6618 and 8034 CCs for the computation of one KEM.KeyGen, KEM.Encaps and KEM.Decaps, which are 24%, 8% and 15% lower than for our PIP_SP architecture. Nevertheless, our PIP_SP architecture achieves 3.07, 3.73 and 3.45 times lower latency thanks to its higher clock frequency. For the same operations, the proposed PIP_SP architecture achieves 450.7, 577.4 and 408.6 times lower latency than (Maria Bermudo Mera et al., 2020). Additionally, our PIP_SP architecture achieves 8 and 4 times higher clock frequency than (Maria Bermudo Mera et al., 2020) and (Roy and Basso, 2020), respectively.
On a Xilinx Zynq Ultrascale+ MPSoC, a software/hardware co-design processor architecture is presented in (Dang et al., 2019). For KEM.Encaps and KEM.Decaps, our PIP_SP architecture is 8.45 and 6.98 times faster (in terms of latency). Compared to the lightweight implementation of SABER described in (Abdulgadir et al., 2021), our PIP_SP architecture requires 65.78 and 56.73 times lower latency for KEM.Encaps and KEM.Decaps, respectively. Moreover, our PIP_SP architecture achieves 10 and 3.10 times higher clock frequency than (Abdulgadir et al., 2021) and (Dang et al., 2019), respectively. Note that an area comparison to (Abdulgadir et al., 2021; Roy and Basso, 2020; Maria Bermudo Mera et al., 2020; Dang et al., 2019) is not possible due to the distinct implementation platforms (we provide synthesis results on ASIC, while these works utilize FPGAs).
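The latency ratios quoted in this comparison follow directly from the latency values listed in the comparison table. A minimal Python sketch (the tuple and function names are illustrative) shows the arithmetic for two of the referenced FPGA designs:

```python
# Latencies in microseconds, copied from the comparison table of this paper.
PIP_SP = (7.1, 7.1, 9.3)            # KeyGen, Encaps, Decaps
ROY_BASSO_2020 = (21.8, 26.5, 32.1)  # KeyGen, Encaps, Decaps
DANG_2019 = (60.0, 65.0)             # Encaps, Decaps only


def latency_ratios(reference, ours):
    """Element-wise ratio reference/ours, i.e. how many times lower our latency is."""
    return tuple(ref / our for ref, our in zip(reference, ours))


roy_ratios = latency_ratios(ROY_BASSO_2020, PIP_SP)
dang_ratios = latency_ratios(DANG_2019, PIP_SP[1:])
```

Here `roy_ratios` evaluates to roughly (3.07, 3.73, 3.45) and `dang_ratios` to roughly (8.45, 6.99); the small difference in the last digit with respect to the text comes from rounding versus truncation.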
Comparison to ASIC accelerator (Zhu et al., 2021). As shown in Table 5, our optimized PIP_SP architecture has a higher latency. On the other hand, we utilize 1.21 times fewer hardware resources on a 65nm technology while the referenced work utilizes 40nm. It is therefore likely that our design would be a fraction of the size in the same technology. Moreover, we achieve a 2.5 times higher clock frequency. For the multiplication of two 256-degree polynomials in SABER, we have employed the centralized schoolbook multiplier architecture of (Basso and Roy, 2020). It takes 256 CCs to compute one polynomial multiplication. On the other hand, in (Zhu et al., 2021), the use of an 8-level Karatsuba multiplier for the same polynomial length requires 81 CCs instead of 256.
Furthermore, a high-speed Keccak module containing two parallel sponge functions (Keccak-f) is used in (Zhu et al., 2021). It performs two Keccak-f[1600] computations in each clock cycle, and each round of Keccak is performed every 12 CCs. In our architectures, a single sponge function is incorporated in a serial fashion, which results in 28 CCs to generate 1,344 bits of a pseudo-random string. In addition to the aforesaid differences in performance, our implementation follows a coprocessor architecture while a fully parallelized architecture is described in (Zhu et al., 2021). Consequently, the lower clock cycle count in (Zhu et al., 2021) ultimately translates into a lower computation time.
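The figure of 1,344 bits corresponds to the rate of SHAKE-128, i.e., the number of output bits available per Keccak-f[1600] permutation (1600-bit state minus a 256-bit capacity). A software view of squeezing one such block can be sketched with Python's standard `hashlib`; the function name and the example seed are illustrative, and the sketch says nothing about the hardware cycle counts.

```python
import hashlib

SHAKE128_RATE_BITS = 1344  # 1600-bit state minus the 256-bit capacity of SHAKE-128


def squeeze_block(seed: bytes) -> bytes:
    """Return the first rate-sized output block (168 bytes) of SHAKE-128 for a seed."""
    return hashlib.shake_128(seed).digest(SHAKE128_RATE_BITS // 8)


block = squeeze_block(b"example seed")
```

Requesting more than 168 bytes forces additional permutations, which is why the cost of generating pseudo-random data scales with the number of 1,344-bit blocks.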
5. Conclusions
This work has presented a design space exploration of SABER with a focus on high performance. Our design space exploration achieves a higher clock frequency through the concurrent use of compiled memories in a ‘smart synthesis’ fashion, logic sharing between SABER building blocks, and pipelining. Moreover, we have shown that, when optimizing clock frequency at the cost of area and power overheads, a single instance of a large memory may not be optimal, and that numerous smaller memories can be more convenient.
Finally, we highlight that our design is already tapeout-ready and will be sent for fabrication in early September (the packaged parts are expected to be delivered by December). This will allow us to extend this work with physical measurements after IC fabrication.
6. Acknowledgments
This work was partially supported by the EC through the European Social Fund in the context of the project “ICT programme”. It was also partially supported by European Union’s Horizon 2020 research and innovation programme under grant agreement No 952252 (SAFEST) and by the Estonian Research Council grant MOBERC35.
References
Abdulgadir et al. (2021). A Lightweight Implementation of Saber Resistant Against Side-Channel Attacks. In Third PQC Standardization Conference.
Banerjee et al. (2019). Sapphire: a configurable crypto-processor for post-quantum lattice-based protocols. IACR Transactions on Cryptographic Hardware and Embedded Systems 2019 (4), pp. 17–61.
Basso and Roy (2020). Optimized polynomial multiplier architectures for post-quantum KEM Saber. Cryptology ePrint Archive, Report 2020/1482, https://eprint.iacr.org/2020/1482.
Beirendonck et al. (2021). A side-channel-resistant implementation of SABER. J. Emerg. Technol. Comput. Syst. 17 (2). ISSN 1550-4832.
D’Anvers et al. (2021). SABER: Mod-LWR based KEM.
Dang et al. (2019). Implementing and benchmarking three lattice-based post-quantum cryptography algorithms using software/hardware codesign. In 2019 International Conference on Field-Programmable Technology (ICFPT), pp. 206–214.
Imran and Pagliarini (2021). saberchip. https://github.com/CentreforHardwareSecurity/saberchip
Maria Bermudo Mera et al. (2020). Compact domain-specific co-processor for accelerating module lattice-based KEM. In 2020 57th ACM/IEEE Design Automation Conference (DAC), pp. 1–6.
NIST (created January 3, 2017, updated June 24, 2020). Post-quantum cryptography.
Roy and Basso (2020). High-speed instruction-set coprocessor for lattice-based key encapsulation mechanism: Saber in hardware. IACR Transactions on Cryptographic Hardware and Embedded Systems 2020 (4), pp. 443–466.
Shor (1997). Polynomial-time algorithms for prime factorization and discrete logarithms on a quantum computer. SIAM Journal on Computing 26 (5). ISSN 0097-5397.
Sumbul et al. (2015). A synthesis methodology for application-specific logic-in-memory designs. In 2015 52nd ACM/EDAC/IEEE Design Automation Conference (DAC), pp. 1–6.
Zhu et al. (2021). LWRpro: an energy-efficient configurable crypto-processor for Module-LWR. IEEE Transactions on Circuits and Systems I: Regular Papers 68 (3), pp. 1146–1159.