In 2013, NIST started the Lightweight Cryptography (LWC) project, with the end goal of creating a portfolio of lightweight algorithms for authenticated encryption with associated data (AEAD), and optionally hashing, in constrained environments. For hardware-oriented lightweight algorithms, hardware implementation results are an important criterion for assessment and comparison. In the first round of the LWC evaluation, more than half of the candidates reported hardware implementation results or estimates, ranging from complete implementation and analysis to partial implementation results and theoretical estimates based on gate count. The reports differ in the amount of analysis (from area reported only for the cryptographic primitive used to a thorough area breakdown of all components), in design decisions (such as serial versus unrolled implementations), and in the ASIC and/or FPGA implementation technologies used. Furthermore, some authors report results without an interface, some with an interface, and in some cases the CAESAR Hardware Applications Programming Interface (API) for Authenticated Ciphers was used.
This paper explores different hardware design options for two of the LWC candidates, ACE and WAGE. The original and parallel implementations were synthesized using four different ASIC libraries, covering 65 nm, 90 nm and 130 nm technologies. ACE implementations range from a throughput of 0.5 bits per clock cycle (bpc) and an area of 4210 GE (averaged across the four ASIC libraries) up to 4 bpc and 7260 GE. WAGE results range from 0.57 bpc with 2920 GE to 4.57 bpc with 11080 GE.
The paper is organized as follows: Section 2 briefly introduces ACE and WAGE. Section 3 lists the design principles, presents the interface with the environment, and describes the implementations of both ciphers. Section 4 describes the parallel implementations of ACE and WAGE. Implementation technologies and results are summarized in Section 5.
2 Specifications of ACE and WAGE
Both the ACE and WAGE permutations operate in a unified duplex sponge mode. The 320 bit ACE permutation offers both AEAD and hashing functionalities, and the 259 bit WAGE permutation supports AEAD functionality. Because of the similarities between ACE and WAGE, this section begins with a short description of the ACE and WAGE permutations, followed by a discussion of the unified duplex sponge mode for both schemes, highlighting some differences.
ACE has a 320 bit internal state, divided into five 64 bit registers, denoted A, B, C, D, and E. The 320 bit permutation uses the unkeyed reduced-round Simeck block cipher with a block size of 64 and 8 rounds, denoted Simeck box SB-64, as the nonlinear operation. SB-64 is a lightweight permutation, consisting of left cyclic shifts, and gates, and xor gates. Each round is parameterized by a single bit of an 8 bit round constant. The algorithmic description of SB-64 is shown at the end of Algorithm 1, with the 64 bit input and output each split into two halves. To construct the step function, SB-64 is applied to the three registers A, C and E, each using its own round constant. The three 8 bit round constants are generated by an LFSR run in a 3-way parallel configuration to produce one bit of each constant per clock cycle. At each step, the outputs of SB-64 are added to registers B, D and E, which are further parameterized by step constants. The computation of step constants does not need any extra circuitry, but rather uses the same LFSR as the round constants: the three feedback values together with all 7 state bits yield 10 consecutive sequence elements, which are then split into three 8 bit step constants. The step constants are used once every 8 clock cycles, i.e. once per step. The step function is then concluded by a permutation of all five registers. For the properties of SB-64, the choice of the final permutation, and the number of rounds and steps, refer to the ACE specification [4].
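To make the structure of the Simeck box concrete, the following is a minimal Python sketch of an 8-round, unkeyed, Simeck-style permutation on a 64 bit word. It is an illustration under stated assumptions, not the reference implementation: the rotation amounts (5 and 1) are those of the Simeck round function, and the way the single constant bit is combined with the data (a word of 31 ones followed by the constant bit) is an assumption that should be checked against Algorithm 1 and [4]. The function names `sb64` and `sb64_inv` are hypothetical.

```python
MASK32 = 0xFFFFFFFF

def rotl32(x, r):
    """Left cyclic shift of a 32 bit word by r positions."""
    return ((x << r) | (x >> (32 - r))) & MASK32

def simeck_f(x):
    """Simeck-style round function: only cyclic shifts, and, and xor."""
    return (rotl32(x, 5) & x) ^ rotl32(x, 1)

def sb64(x, rc):
    """Apply 8 unkeyed rounds to the 64 bit input x.

    rc is an 8 bit round constant; round j consumes bit j (assumed order).
    """
    x1, x0 = (x >> 32) & MASK32, x & MASK32
    for j in range(8):
        q = (rc >> j) & 1
        const = 0xFFFFFFFE | q  # assumption: 31 ones followed by the constant bit
        x1, x0 = simeck_f(x1) ^ x0 ^ const, x1
    return (x1 << 32) | x0

def sb64_inv(y, rc):
    """Inverse of sb64, showing that the Feistel-like structure is a permutation."""
    x1, x0 = (y >> 32) & MASK32, y & MASK32
    for j in reversed(range(8)):
        q = (rc >> j) & 1
        const = 0xFFFFFFFE | q
        prev_x1 = x0
        prev_x0 = x1 ^ simeck_f(prev_x1) ^ const
        x1, x0 = prev_x1, prev_x0
    return (x1 << 32) | x0
```

Because each round only swaps the halves and xors a function of one half into the other, the box is invertible for any constant, which is what allows it to be used as a building block of a permutation.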
WAGE is a hardware oriented scheme, built on top of the initialization phase of the well-studied, LFSR based, Welch-Gong (WG) stream cipher [11, 12]. The WAGE permutation is iterative and has a round function derived from the LFSR, the decimated Welch-Gong permutation WGP, and the small S-boxes SB. Details, such as the differential uniformity and nonlinearity of the WGP and SB and the selection of the LFSR polynomial, can be found in [5]. The parameter selection for WAGE was aimed at balancing security and hardware implementation area, using hardware implementation results for many design decisions, e.g., field size, representation of field elements, and LFSR polynomial. Both the LFSR and the WGP are defined over GF(2^7), and the S-box is a 7 bit permutation. The field is defined with a primitive polynomial, and the field elements are represented using the polynomial basis generated by a root of that polynomial (Table 1). The LFSR is defined by a feedback polynomial (Table 1) that is primitive over GF(2^7). The 37 stages of the LFSR also constitute the internal state of WAGE; a subscript on a stage marks the iteration of the permutation. The decimated WG permutation and its decimation exponent are defined in Table 1. The 7 bit SB uses a nonlinear transformation Q and a permutation P, which together yield a one-round function. The SB itself iterates this round function 5 times, applies one further transformation once, and then complements the 0th and 2nd bits (Table 1).
As mentioned before, the WAGE permutation is iterative, and repeats its round function WAGE-StateUpdate 111 times, as shown in Algorithm 2. In each round, 6 stages of the LFSR are updated nonlinearly, while all the remaining stages are just shifted. A pair of round constants is xored with a pair of stages. The round constants are produced by an LFSR of length 7, implemented in a 2-way parallel configuration; see [5] for details.
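The idea of running a short constant-generating LFSR in a multi-way parallel configuration can be sketched in a few lines of Python. This is only an illustration: the feedback polynomial is not reproduced in this text, so the recurrence a[n+7] = a[n+1] xor a[n] (i.e. x^7 + x + 1) is an assumption, and the function names are hypothetical. The point of the sketch is structural: a p-way parallel step computes p feedback bits in one clock cycle and is equivalent to p serial steps.

```python
def lfsr_step_serial(state):
    """One serial step of a 7 stage binary LFSR (assumed recurrence)."""
    feedback = state[0] ^ state[1]
    return state[1:] + [feedback]

def lfsr_step_parallel(state, p):
    """One clock cycle of a p-way parallel LFSR.

    Returns the new 7 bit state and the 7 + p consecutive sequence bits
    (old state followed by the p feedback bits) visible in that cycle.
    """
    seq = list(state)
    for k in range(p):
        seq.append(seq[k] ^ seq[k + 1])  # k-th replicated feedback
    return seq[p:], seq

# Example: a 3-way step exposes 7 state bits plus 3 feedback bits.
state = [1, 0, 0, 0, 0, 0, 0]
new_state, visible = lfsr_step_parallel(state, 3)
```

With p = 3 the list `visible` has 10 elements, which matches the 10 consecutive sequence elements mentioned for the ACE step constants; with p = 2 the same structure corresponds to the 2-way configuration used for the WAGE round constants.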
2.3 The Unified Duplex Sponge Mode
ACE-AE-128 and WAGE-AE-128 use the unified duplex sponge mode from sLiSCP (Figure 1). The phases for encryption and decryption are: initialization, processing of associated data, encryption (Figure 1(a)) or decryption (Figure 1(b)), and finalization. Figure 1 also shows the domain separators for each phase. The internal state is divided into a capacity part (256 bits for ACE-AE-128 and 195 bits for WAGE-AE-128) and a 64 bit rate part, which for:
ACE-AE-128 consists of four bytes of register A and four bytes of register C
WAGE-AE-128 consists of the 0-th bit of one stage and all 7 bits of nine further stages
The exact byte and stage positions are given in [4, 5].
The input data (associated data, message or ciphertext) is absorbed (or replaced) into the rate part of the internal state. If the input data length is not a multiple of 64 bits, padding with (10*) is needed. In Figure 1, the block counts refer to the number of 64 bit blocks of associated data and of message or ciphertext after padding. Refer to [4, 5] for further padding rules. No padding is needed during initialization and finalization because both schemes use a 128 bit key. With the exception of tag extraction, both schemes generate an output only during the encryption and decryption phases: the 64 bit output block is obtained by the xor of the current input and the rate. Figure 1 also shows the loading function load-AE and the tag-extraction function, which are straightforward for ACE. The load-AE function loads the 128 bit key and nonce: the key is loaded into registers A and C, the nonce into B and E, and register D is loaded with zeros. The tag-extraction function extracts the 128 bit tag from registers A and C. Special care was taken in the specification of loading and tag extraction for WAGE to take advantage of the shifting nature of the LFSR, which will be discussed in more detail in Section 3.3.
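The padding rule described above can be illustrated with a short Python sketch. It assumes byte-aligned input and places the padding bit 1 in the most significant bit of the next byte (0x80); both are assumptions made for illustration, and the authoritative rules are those in [4, 5]. In the designs discussed here padding is performed by the environment, not by the cipher hardware. The function name `pad_blocks` is hypothetical.

```python
def pad_blocks(data: bytes, block_bytes: int = 8):
    """Split data into 64 bit blocks, applying 10* padding only when needed."""
    remainder = len(data) % block_bytes
    if remainder != 0:
        # Append a single 1 bit (0x80) followed by zeros up to the block boundary.
        data = data + b"\x80" + b"\x00" * (block_bytes - 1 - remainder)
    return [data[i:i + block_bytes] for i in range(0, len(data), block_bytes)]

blocks = pad_blocks(b"abc")  # a single padded 64 bit block
```

An input whose length is already a multiple of 64 bits is returned unchanged, consistent with the statement that padding is needed only otherwise.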
The hash functionality is shown in Figure 2, with only two phases, namely absorbing and squeezing. The only input is now the message. Since the hash has a fixed length of 256 bits, the length of the squeezing phase is fixed. ACE-H-256 is unkeyed, and the state is loaded with a fixed initialization vector IV. More specifically, the loading function sets three bytes of register B to 0x80, 0x40, and 0x40 respectively (byte positions as given in [4]), and sets all other state bits to zero.
3 Hardware Implementations
3.1 Hardware Design Principles and Interface with the Environment
The design principles and assumptions followed by the hardware implementations:
Multi-functionality module. The system should include all supported operations in a single module (Figure 4), because lightweight applications cannot afford the extra area for separate modules.
Single input/output ports. In small devices, ports can be expensive. To ensure that the ACE and WAGE results are not biased in favour of the system at the expense of the environment, the ciphers have one input and one output port (Table 3.1). That being said, the authors agree with the use of separate public and private data ports in the proposed lightweight cryptography hardware API and will update the implementations accordingly.
Valid-bit protocol and stalling capability. The environment may take an arbitrarily long time to produce any piece of data. For example, a small microprocessor could require multiple clock cycles to read data from memory and write it to the system’s input port. The receiving entity must capture the data in a single clock cycle (Figure 4). In reality, the environment can stall as well. In the future, the ACE and WAGE implementations will be updated to match the proposed lightweight cryptographic hardware API’s use of a valid/ready protocol for both input and output ports.
Use a “pure register-transfer-level” implementation style. In particular, use only registers, not latches; multiplexers, not tri-state buffers; synchronous, not asynchronous reset; no scan-cell flip-flops; clock-gating is used for power and area optimization.
Since both ACE and WAGE use the unified duplex sponge mode, they share a common interface with the environment (Table 3.1). The environment separates the associated data and the message/ciphertext, and performs padding if necessary. The domain separators shown in Figure 1 are provided by the environment and serve as an indication of the phase change for AEAD functionality. For ACE-H-256, the phase change is indicated by a change of the i_mode(0) signal, as shown in Table 3. The hardware is unaware of the lengths of the individual phases, hence no internal counters for the number of processed blocks are needed.
|reset||resets the state machine|
|i_mode||mode of operation|
|i_padding||the last block is padded|
|i_valid||valid data on i_data|
|o_ready||hardware is ready|
|o_valid||valid data on o_data|
The top-level module, shown in Figure 4, is also very similar for both ACE and WAGE. It depicts the interface signals from Table 3.1, with only slight differences in bitwidths. Figure 4 shows the timing diagram for ACE during the encryption phase of two consecutive message blocks, which clearly shows the valid-bit protocol. The first five lines show the top-level interface signals and line six shows the value of the permutation counter pcount, which is a part of the finite state machine (FSM) and keeps track of the clock cycles needed for one permutation. After completing the previous permutation, the top-level module asserts o_ready to signal to the environment that an ACE permutation has just finished and new data can be accepted. The environment replies with a new message block accompanied by an i_valid signal. The hardware immediately encrypts the block, returns the ciphertext block and asserts o_valid. This clock cycle is also the first round of a new permutation, so o_ready is deasserted, indicating that the hardware is busy. Figure 4 shows the hardware remaining busy (o_ready = 0) for the duration of one permutation. When pcount wraps around from 127 to 0, the hardware is again idle and ready to receive new input, in this case the second message block. A few more details about the use of pcount follow in Subsection 3.2. The interaction between the top-level module and the environment during the encryption phase of WAGE is very similar, with 111 clock cycles for the completion of one permutation. More significant differences in the interaction with the environment arise during loading, tag extraction and, of course, ACE-H-256.
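The handshake described above can be mimicked with a small cycle-level Python model. This is a behavioural sketch under simplifying assumptions, not the authors' RTL: the class name `ValidBitModel` and its method are hypothetical, o_ready is modelled as high only in idle cycles in which no data is accepted, and the data path itself is omitted. Only the control behaviour (accept on i_valid, respond with o_valid in the same cycle, stay busy until pcount wraps) follows the text.

```python
class ValidBitModel:
    """Cycle-level model of the o_ready / i_valid / o_valid handshake."""

    def __init__(self, rounds=128):
        self.rounds = rounds  # clock cycles per permutation (128 for ACE)
        self.pcount = 0
        self.busy = False

    def clock(self, i_valid):
        """Advance one clock cycle; return (o_ready, o_valid, pcount) for it."""
        pcount_now = self.pcount
        o_ready = not self.busy
        o_valid = False
        if not self.busy:
            if i_valid:
                # Data is captured and answered in the same cycle, which is
                # also the first round of the next permutation.
                o_valid = True
                o_ready = False
                self.pcount = 1
                self.busy = True
        else:
            self.pcount = (self.pcount + 1) % self.rounds
            if self.pcount == 0:
                self.busy = False  # counter wrapped: idle again
        return o_ready, o_valid, pcount_now

# Environment that always has data available.
model = ValidBitModel(rounds=128)
accept_cycles = [t for t in range(256) if model.clock(i_valid=True)[1]]
```

With an environment that never stalls, a new block is accepted every 128 clock cycles in this model; constructing the model with `rounds=111` gives the corresponding spacing for the 111-cycle case mentioned for WAGE.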
3.2 ACE Datapath
Figure 5(a) shows the ACE datapath. The top and bottom of the figure depict the five 64 bit registers A, B, C, D and E, followed by the hardware components required for normal operation during permutation, absorbing, and replacing, which requires input multiplexers controlled by the mode and the counter pcount. Similarly, output multiplexers are needed to accommodate encryption/decryption and tag generation for ACE-AE-128 and squeezing for ACE-H-256. Furthermore, the output is forced to 0 during normal operation. The registers A, C and E are split in half to accommodate inputs and outputs. The rest of Figure 5(a) shows one step of the permutation (Algorithm 1). The rounds and steps always use the same hardware, but in different clock cycles, which forces the use of multiplexers inside the permutation. The last row of multiplexers accommodates loading.
3.3 WAGE Datapath
Because of the shifting nature of the LFSR, which in turn affects loading, absorbing and squeezing, the WAGE datapath is slightly more complicated than the ACE datapath and hence is explained in two levels:
wage_lfsr treated as a black box in Figure 6 (no parallelization)
wage_lfsr: The LFSR has 37 stages with 7 bits per stage, a feedback with 10 taps and a module for the constant multiplication in the feedback (Table 1). The internal state of wage_lfsr is also the internal state of WAGE.
WGP module implementing WGP: For smaller fields like GF(2^7), the WGP area, when implemented as a constant array in VHDL/Verilog, i.e., as a look-up table, is smaller than when implemented using components such as multiplication and exponentiation to powers of two [13, 14]. However, the WGP is not stored in hardware as a memory array, but rather as a net of and, or, xor and not gates, derived and optimized by the synthesis tools.
SB module: The SB is implemented in unrolled fashion, i.e. as purely combinational logic, composed of 5 copies of the one-round function, followed by one further transformation and the final two not gates (Table 1).
lfsr_c: The lfsr_c for generating the round constants was implemented in a 2-way parallel fashion. It has only seven 1 bit stages and two xor gates for the two feedback computations.
Extra hardware for the wage_lfsr in sponge mode. Figure 7 shows details for a subset of the stages. The grey line represents the path for normal operation during the permutation. The additional hardware for the entire wage_lfsr is listed below, with examples in brackets referring to Figure 7.
The 64 bit i_data is padded with zeros to 70 bits, then fragmented into ten 7 bit wage_lfsr inputs, one for each rate stage. For each data input there is a corresponding 7 bit data output (examples of both appear in Figure 7).
10 xor gates must be added to the rate stages to accommodate absorbing, encryption and decryption (see the xors at the rate stages in Figure 7).
10 multiplexers to switch between absorbing and normal operation (Amux1, Amux0).
An xor and a multiplexer are needed to add the domain separator i_dom_sep (Amux).
To replace the contents of the rate stages, 10 multiplexers are added (Rmux1).
Instead of additional multiplexers for loading, the existing Rmux multiplexers are now controlled by replace or load and labelled RLmux (see RLmux0). Since all non-input stages must keep their previous values, an enable signal lfsr_en is needed.
Three 7 bit and gates to turn off the data inputs (see the and gates in Figure 7).
Four multiplexers are needed to turn off the SB during loading and tag extraction (SBmux).
The total hardware cost to support the sponge mode is: 24 7 bit multiplexers and one 2 bit multiplexer, 10 7 bit xor gates and one 2 bit xor gate, and three 7 bit and gates.
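The totals quoted above can be cross-checked against the itemized list with a few lines of Python. The grouping of items into dictionary keys is an interpretation of the list (in particular, that the domain-separator xor and multiplexer are the 2 bit ones); only the counts themselves come from the text.

```python
# 7 bit multiplexers, grouped by purpose as itemized in the list.
mux_7bit = {
    "absorb vs. normal operation (Amux)": 10,
    "replace or load (Rmux/RLmux)": 10,
    "SB bypass during load and tag extraction (SBmux)": 4,
}
xor_7bit = 10  # absorbing, encryption and decryption at the rate stages
and_7bit = 3   # gates that turn off the data inputs

total_mux_7bit = sum(mux_7bit.values())

# Total number of gated bits, including the 2 bit domain-separator logic.
mux_bits = 7 * total_mux_7bit + 2
xor_bits = 7 * xor_7bit + 2
and_bits = 7 * and_7bit
```

The sum of the three multiplexer groups reproduces the 24 7 bit multiplexers stated in the summary.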
As mentioned in Section 2, special care was given to the design of loading and tag extraction. The existing data inputs are reused for loading, and the data outputs for tag extraction. The wage_lfsr is divided into five loading regions, each fed by one of the data inputs. For example, one of the regions in Figure 7 is loaded through its data input; however, the data bypasses the absorbing path and is fed directly into the LFSR, i.e. the RLmux0 disconnects the Amux0 output. The remaining stages in this region are loaded by shifting, which requires the SBmux. Note that there is no need to disconnect the two WGP, because they are automatically disabled by loading through the data inputs located at the corresponding stages. The loading process is illustrated in Table 4, whose entries denote 7 bit blocks of the 128 bit key. Table 4 shows the key shifting through the LFSR stages in 9 clock cycles. The stages are shown in the second row of Table 4, and the values “-” in the table denote the old, unknown values that are overwritten by the new key. The state of the stages after loading is finished is shown in the last row. The tag is extracted in a similar fashion to loading, but from the data output at the end of a particular loading region, e.g., a region loaded through one data input is extracted through the data output at its other end. The longest tag extraction region is of length 9, which is the same as the longest loading region.
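Loading a region by shifting can be pictured with a toy Python model. It is a sketch only: the region is modelled as a plain list, the shift direction and the position of the input are assumptions, and the block labels are placeholders rather than the actual key-to-stage assignment of Table 4. The function name `load_region_by_shifting` is hypothetical.

```python
def load_region_by_shifting(region_len, blocks):
    """Shift 7 bit blocks into a loading region, one block per clock cycle.

    None stands for an old, unknown stage value ("-" in Table 4).
    """
    region = [None] * region_len
    for block in blocks:
        # Every stage takes the value of its neighbour; the new block
        # enters at the input end of the region.
        region = region[1:] + [block]
    return region

# Longest region: 9 stages, fully overwritten after 9 clock cycles.
key_blocks = [f"k{i}" for i in range(9)]
loaded = load_region_by_shifting(9, key_blocks)
```

After as many clock cycles as the region is long, no old value remains, which is why the longest region (length 9) determines the number of loading cycles.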
3.4 Hardware-Oriented Design Decisions
The design process for ACE and WAGE tightly integrated cryptanalysis and hardware optimizations. A few key hardware-oriented decisions are highlighted here; more can be found in the design rationale chapters of [4, 5].
Functionally, it is equivalent for the boundary between phases to occur either before or after the permutation. For ACE and WAGE, the boundary was placed after the permutation updates the state register. This means that the two-bit domain separator is sufficient to determine the value of many of the multiplexer select lines and other control signals. All phases that have a domain separator of "00" have the same multiplexer select values. The same also holds true for "01". Unfortunately, this cannot be achieved for "10", because encryption and decryption require different control signal values, but the same domain separator. Using the domain separator to signal the transition between phases for encryption and decryption also simplifies the control circuit. For hashing, the change in phase is indicated by the i_mode signal.
In applications where the delay through combinational circuitry is not a concern, such as lightweight cryptography, where clock speed is limited by power consumption rather than by combinational delay, it is beneficial to lump as much combinational circuitry as possible into a single clock cycle. This provides more optimization opportunities for the synthesis tools than if the circuitry were separated by registers. For this reason, the datapath was designed so that the input and output multiplexers, one round of the permutation, and the state loading multiplexers together form a purely combinational circuit, followed by the state register.
4 Parallel Implementations
4.1 Parallelization in General
Both ciphers can be parallelized (unrolled) to execute multiple rounds per clock cycle, at the cost of increased area. In the top-level schematic in Figure 4, the dashed stacked boxes indicate parallelization. The FSM is parameterized by the degree of parallelization p and is used for both the un-parallelized (p = 1) and the parallelized (p > 1) implementations. Other components are replicated to show p copies in Figure 4. Such a representation is symbolic; parallelization is applied only to the permutation, not the entire datapath. The interface with the environment remains the same.
The un-parallelized ACE permutation performs a single round per clock cycle, which implies 8 clock cycles per step. Parallel, i.e. unrolled, versions perform p rounds per clock cycle, and were implemented for the divisors of 8, i.e. p = 2, 4, 8. The permutation could be parallelized further, e.g. two or more steps in a single clock cycle. Figure 5(b) shows the example for registers A and B, with p copies of the round function connected in series. Each copy has its own round constant bit. The round vs. step multiplexers are still needed, and can be removed only for values of p that are multiples of 8. Also note the step constant shown in the figure. For p = 4, a step is concluded in 2 clock cycles. However, this requires a modification to the lfsr_c, which must now generate 3p round constant bits per clock cycle. The last cycle within a step requires 7 additional bits, which together with the last three round constant bits yield the 10 bits for the step constant generation. In the case p = 4, the lfsr_c must generate 12 constant bits in the first clock cycle and 19 constant bits in the second clock cycle of the step, which are then used for the round and step constants. For the extra constant bits, the lfsr_c feedback was replicated, i.e. further feedbacks were added to the original 3.
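The constant-bit requirements per clock cycle follow a simple pattern that can be tabulated in Python. The sketch below assumes three round constant bits per round (one for each of the registers A, C and E) and 7 extra bits in the last cycle of each 8-round step, as described above; the values for p = 1 and p = 4 agree with the text, while the values for other p are extrapolations of the same rule. The function name `ace_constant_bits` is hypothetical.

```python
ROUNDS_PER_STEP = 8

def ace_constant_bits(p):
    """Constant bits the lfsr_c must supply in each clock cycle of one step."""
    if ROUNDS_PER_STEP % p != 0:
        raise ValueError("p must divide the number of rounds per step")
    cycles_per_step = ROUNDS_PER_STEP // p
    bits = [3 * p] * cycles_per_step  # round constant bits for p rounds
    bits[-1] += 7                     # extra bits for the step constants
    return bits

for p in (1, 2, 4):
    print(p, ace_constant_bits(p))
```

For p = 1 the last cycle needs 3 + 7 = 10 bits, matching the 10 consecutive sequence elements used for the step constants, and for p = 4 the two cycles need 12 and 19 bits.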
WAGE performs one clock cycle for the interaction with the environment, i.e. absorbing or replacing the input data into the state, followed by 111 clock cycles of the permutation. Because 111 is divisible only by 3 and 37, the opportunities to parallelize WAGE appear rather limited. However, by treating the absorption or replacement of the input data into the internal state as an additional clock cycle in the permutation, we increase the length of the permutation to 112 clock cycles. Because 112 has many divisors, this allows several more degrees of parallelism. The cost is a less than 1% decrease in performance for the additional clock cycle and some additional multiplexers, because the clock cycle that loads data behaves differently from the normal clock cycles.
Figure 8 shows the 3-way parallel wage_lfsr including all nonlinear components and their copies. Multiplexers are not replicated, and hence are not shown. For the WGP and SB components in Figure 8, the superscript distinguishes the original from the two copies. The computation of the three feedbacks is not shown in the figure. Similar to ACE, the generation of round constants must be parallelized as well. For readability, the two WGPs and the SBs are labelled individually in Figure 8, with the SB labels running in decreasing order.
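The relationship between permutation length, degree of parallelization and throughput is plain arithmetic, shown here as a small Python sketch. It assumes a 64 bit rate, 128 clock cycles per ACE permutation and 112 clock cycles per WAGE permutation (111 rounds plus the absorbing cycle), as stated in the text; the function name `throughput_bpc` is hypothetical.

```python
RATE_BITS = 64

def throughput_bpc(perm_cycles, p):
    """Throughput in bits per clock cycle for a p-way unrolled permutation."""
    if perm_cycles % p != 0:
        raise ValueError("p must divide the permutation length in cycles")
    return RATE_BITS / (perm_cycles // p)

ace = {p: throughput_bpc(128, p) for p in (1, 2, 4, 8)}
wage = {p: round(throughput_bpc(112, p), 2) for p in (1, 2, 4, 8)}

# Relative cost of counting the absorbing cycle as part of the permutation.
extra_cycle_overhead = 1 / 112
```

These values reproduce the range quoted in the introduction: 0.5 to 4 bpc for ACE and 0.57 to 4.57 bpc for WAGE, and the overhead of the extra cycle is about 0.9%, i.e. below 1%.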
5 Implementation Technologies and ASIC Implementation Results
Logic synthesis was performed with Synopsys Design Compiler version P-2019.03 using the compile_ultra command and clock gating. Physical synthesis (place and route) and power analysis were done with Cadence Encounter v14.13 using a density of 95%. Simulations were done in Mentor Graphics ModelSim SE v10.5c. The ASIC cell libraries used were ST Microelectronics 65 nm CORE65LPLVT at 1.25V, TSMC 65 nm tpfn65gpgv2od3 200c and tcbn65gplus 200a at 1.0V, ST Microelectronics 90 nm CORE90GPLVT and CORX90GPLVT at 1.0V, and IBM 130 nm CMRF8SF LPVT with SAGE-X v2.0 standard cells at 1.2V. Some past works have used scan-cell flip-flops to reduce area, because these cells include a 2:1 multiplexer in the flip-flop, which incurs less area than using a separate multiplexer. Scan-cell flip-flops were not used here because their use as part of the design would prevent their insertion for fault detection and hence prevent the circuit from being tested for manufacturing faults. Furthermore, chip enable signals were removed from all datapath registers, which are controlled by clock gating instead. This allows a further reduction of the implementation area.
Figure 9: Area vs. throughput. Throughput is measured in bits per clock cycle (bpc) and plotted on a log scale axis; the area axis is scaled as log(Area).
|ST Micro 65 nm||TSMC 65 nm||ST Micro 90 nm||IBM 130 nm|
Note: Energy results were obtained with timing simulation at 10 MHz.
Figure 9 shows area vs. throughput for both ACE and WAGE with different degrees of parallelization p, denoted by W-p and A-p. The throughput axis is scaled as log(Tput) and the area axis is scaled as log(Area). The grey contour lines denote the relative optimality of the circuits using Tput/area. Throughput is increased by increasing the degree of parallelization (unrolling), which reduces the number of clock cycles per permutation. For p = 1, the area of WAGE (W-1) is less than that of ACE (A-1), because WAGE has 259 register bits, compared to 320 for ACE. As parallelization is increased, WAGE’s area grows faster than ACE’s, because of the larger size of WAGE’s permutation. Optimality for WAGE reaches a maximum at an intermediate degree of parallelization, whereas for ACE optimality continues to increase with further parallelization. As can be seen from the relatively constant size of the shaded rectangles enclosing the data points, the relative area increase with parallelization is largely independent of implementation technology.
Table 5 represents the same data points as Figure 9 with the addition of maximum frequency (f, MHz) and energy per bit (E, nJ). Energy is measured as the average value while performing all cryptographic operations over 8192 bits of data at 10 MHz. For ACE, energy per bit decreases consistently as the throughput increases, despite higher circuit area and, therefore, power consumption. However, this is not the case for WAGE. This phenomenon can be explained by the higher relative area increase for WAGE, which comes from the higher complexity of WGP with respect to SB-64. Connecting more WGPs in a combinational chain results in an exponential increase in the number of glitches, which drastically increases power consumption.
Table 6 summarizes the area on ST Micro 65 nm of the LWC submissions that included synthesizable VHDL or Verilog code. It reports the area results obtained using the ST Micro 65 nm process and tool flow from this paper alongside the results reported in the submissions. The various ciphers use different protocols and interfaces, sometimes provide different functionality (e.g., with or without hashing), and use different key sizes. As such, this analysis is very imprecise, but gives a rough comparison to the ACE and WAGE results. As the LWC competition progresses and the hardware API matures, more precise comparisons will become possible. This preliminary analysis indicates that ACE and WAGE are among the smaller cipher candidates.
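The optimality metric Tput/area used for the contour lines in Figure 9 can be evaluated directly for the four end points quoted in the introduction (areas averaged across the four ASIC libraries). The Python sketch below uses only those four published data points; the intermediate degrees of parallelization are not reproduced here, so it says nothing about where the optimum lies in between. The labels in the dictionary are descriptive names chosen for this example.

```python
# (throughput in bpc, area in GE), averaged over the four ASIC libraries.
end_points = {
    "ACE, lowest throughput": (0.5, 4210),
    "ACE, highest throughput": (4.0, 7260),
    "WAGE, lowest throughput": (0.57, 2920),
    "WAGE, highest throughput": (4.57, 11080),
}

# Optimality as used for the contour lines: throughput divided by area.
optimality = {name: tput / area for name, (tput, area) in end_points.items()}

for name, value in optimality.items():
    print(f"{name}: {value * 1e4:.2f} x 10^-4 bpc/GE")
```

At the low-throughput end the smaller WAGE design has the better ratio, while at the high-throughput end ACE does, consistent with WAGE’s area growing faster under parallelization.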
|This work||Reported in submission documents |
|Cipher||Module||Area (kGE)||Area (kGE)||ASIC technology used|
||9.9||4.2||theoretical estimate for 5 lanes|
The goal of the ACE and WAGE design process was to build on the well studied Simeck S-Box and Welch-Gong permutation. The overall algorithms were designed to lend themselves to efficient implementations in hardware and to scale well with increased parallelism. ACE has a larger internal state: 320 bits, vs 259 for WAGE, but the ACE permutation is smaller than that of WAGE. This means the non-parallel version of WAGE is smaller than that of ACE, but as parallelism increases, WAGE eventually becomes larger than ACE. At 1 and 2 bits-per-cycle, the designs are relatively similar in area. A number of the NIST LWC candidate ciphers provided synthesizable source code. A preliminary comparison with these ciphers on ST Micro 65 nm indicates that ACE and WAGE are likely to be among the smaller candidates.
Acknowledgements
This work benefited from the collaborative environment of the Communications Security (ComSec) Lab at the University of Waterloo, and in particular discussions with Kalikinkar Mandal, Raghvendra Rohit, and Guang Gong.
- [1] NIST Lightweight Cryptography, https://csrc.nist.gov/Projects/Lightweight-Cryptography
- [2] Submission Requirements and Evaluation Criteria for the Lightweight Cryptography Standardization Process, https://csrc.nist.gov/CSRC/media/Projects/Lightweight-Cryptography/documents/final-lwc-submission-requirements-august2018.pdf
- [3] NIST Lightweight Cryptography round 1 candidates, https://csrc.nist.gov/Projects/Lightweight-Cryptography/Round-1-Candidates
- [4] M.D. Aagaard, R. AlTawy, G. Gong, K. Mandal, R. Rohit, “ACE: An Authenticated Encryption and Hash Algorithm - Submission to the NIST LWC Competition”, March 2019, https://csrc.nist.gov/CSRC/media/Projects/Lightweight-Cryptography/documents/round-1/spec-doc/ace-spec.pdf
- [5] M.D. Aagaard, R. AlTawy, G. Gong, K. Mandal, R. Rohit, “WAGE: An Authenticated Cipher - Submission to the NIST LWC Competition”, March 2019, https://csrc.nist.gov/CSRC/media/Projects/Lightweight-Cryptography/documents/round-1/spec-doc/wage-spec.pdf
- [6] B. Rezvani, W. Diehl, “Hardware Implementations of NIST Lightweight Cryptographic Candidates: A First Look”, Cryptology ePrint Archive, Report 2019/824, 2019.
- [7] E. Homsirikamol, W. Diehl, A. Ferozpuri, F. Farahmand, P. Yalla, J.P. Kaps, K. Gaj, “CAESAR Hardware API”, Cryptology ePrint Archive, Report 2015/669, 2016.
- [8] J.P. Kaps, W. Diehl, M. Tempelmeier, E. Homsirikamol, K. Gaj, “Hardware API for Lightweight Cryptography”, 2019.
- [9] R. AlTawy, R. Rohit, M. He, K. Mandal, G. Yang, and G. Gong. sLiSCP: Simeck-based Permutations for Lightweight Sponge Cryptographic Primitives. In SAC (2017), C. Adams and J. Camenisch, Eds., Springer, pp. 129-150.
- [10] G. Yang, B. Zhu, V. Suder, M.D. Aagaard, and G. Gong. The Simeck family of lightweight block ciphers. In CHES (2015), T. Güneysu and H. Handschuh, Eds., Springer, pp. 307-329.
- [11] Y. Nawaz and G. Gong. The WG stream cipher. ECRYPT Stream Cipher Project Report 2005 33 (2005).
- [12] Y. Nawaz and G. Gong. WG: A family of stream ciphers with designed randomness properties. Inf. Sci. 178, 7 (Apr. 2008), 1903-1916.
- [13] M.D. Aagaard, G. Gong, and R.K. Mota. Hardware implementations of the WG-5 cipher for passive RFID tags. In Hardware-Oriented Security and Trust (HOST), 2013, IEEE, pp. 29-34.
- [14] Y. Luo, Q. Chai, G. Gong, and X. Lai. A lightweight stream cipher WG-7 for RFID encryption and authentication. In 2010 IEEE Global Telecommunications Conference GLOBECOM 2010 (Dec 2010), pp. 1-6.