The Advanced Encryption Standard (AES) is a 128-bit block cipher with a 128-, 192-, or 256-bit key, defined in the FIPS 197 standard [1]. AES is a mandatory building block of the TLS 1.3 security protocol [2] and is widely used for storage encryption, shared-secret authentication, cryptographic random number generation, and many other applications.
The SM4 block cipher [3] fulfills a similar role to AES in the Chinese market and is the main block cipher recommended for use in China. SM4 also has a 128-bit block size, but only one key size, 128 bits. Even though its high-level structure differs completely from AES, the two share significant similarities in their sole nonlinear component, which is a single 8 × 8-bit “S-Box” substitution table in both cases.
Cache timing attacks on AES became well known in the mid-2000s when it was demonstrated that common table-based implementations can be exploited even remotely [4, 5]; very similar issues also affect SM4. In the presence of a cache, the only way to make the execution time of these ciphers fully independent of secret data is to eliminate the table lookup, either by implementing it as bitsliced Boolean logic or by providing a specific ISA extension for the S-Box lookup.
Consumer CPUs have had instructions to support AES for almost a decade via the Intel AES-NI in x86 [6] and ARMv8-A cryptographic extensions [7]; these are almost universally available in PCs and higher-end mobile devices such as phones. ARM also supports SM4 via the ARMv8.2-SM extension. The AES instructions have been shown to make AES much less of a throughput bottleneck for high-speed TLS communication (servers) and storage encryption (mobile devices), thereby also extending battery life in the latter. Both Intel and ARM cryptographic ISAs require 128-bit (SIMD) registers, and are not available on lower-end CPUs.
In this work, we show that it is possible to create a simple AES and SM4 ISA extension that offers a significant performance improvement and timing side-channel resistance with a minimally increased hardware footprint. It is especially suitable for lightweight RV32 targets.
II A Lightweight AES and SM4 ISA Extension
The ISA extension operates on the main register file only, using two source registers, one destination register, and a 5-bit field fn[4:0] which can be seen either as an “immediate constant” or just code points in instruction encoding. In either case, the interface to the (reference) combinatorial logic is:
See Section IV-B for encoding details of ENC1S as an RV32 R-type custom instruction for testing purposes. For RV64 the words are simply truncated or zero-extended.
For emulation, the instructions are encapsulated in C as:
The five bits of fn cover encryption, decryption, and key schedule for both algorithms. Bits fn[1:0] first select a single byte from rs2. Two bits fn[4:3] indicate which 8-bit S-Box is used (AES, AES⁻¹, or SM4), and fn[4:2] specifies an 8-to-32-bit linear expansion transformation (each of the three S-Boxes has two alternative linear transforms, indicated by fn[2]). The expanded 32-bit value is then rotated by 0–3 byte positions based on fn[1:0]. The result is finally XORed with rs1 and written to rd.
Table I contains the identifiers (pseudo instructions) that we currently use for bits fn[4:2]. Usually we may arrange computation so that rd = rs1 without increasing instruction count, making a two-operand “compressed” encoding possible.
|Identifier||fn[4:2]||Description or Use|
| || ||AES Encrypt round.|
|AES_FN_FWD|| ||AES Final / Key sched.|
|AES_FN_DEC|| ||AES Decrypt round.|
|AES_FN_REV|| ||AES Decrypt final.|
|SM4_FN_ENC|| ||SM4 Encrypt and Decrypt.|
|SM4_FN_KEY|| ||SM4 Key Schedule.|
| || ||( points used.)|
For AES the instruction selects a byte from rs2, performs a single S-box lookup (SubBytes or its inverse), evaluates a part of the MDS matrix (MixColumns) if that linear expansion step is selected, rotates the result by a multiple of 8 bits (ShiftRows), and XORs the result with rs1.
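The data path described above can be emulated in a few lines of C. The following is a minimal sketch covering only the forward AES S-Box, not the reference implementation: the function name `enc1s_aes_fwd` and its `bs` and `mix` parameters are hypothetical stand-ins for fn[1:0] and for the choice of linear expansion, and the S-Box is computed from its algebraic definition in FIPS 197 (inversion in GF(2^8) followed by the affine map). The byte order of the expansion assumes a little-endian state word.

```c
#include <stdint.h>

/* Multiply by x in GF(2^8), AES polynomial x^8 + x^4 + x^3 + x + 1. */
static uint8_t xtime(uint8_t a)
{
    return (uint8_t)((a << 1) ^ ((a & 0x80) ? 0x1B : 0x00));
}

/* Schoolbook multiplication in GF(2^8). */
static uint8_t gf_mul(uint8_t a, uint8_t b)
{
    uint8_t r = 0;
    for (int i = 0; i < 8; i++) {
        if (b & 1)
            r ^= a;
        a = xtime(a);
        b >>= 1;
    }
    return r;
}

/* AES S-Box: inversion (computed as x^254) followed by the affine map. */
static uint8_t aes_sbox(uint8_t x)
{
    uint8_t y = 1;
    for (int i = 0; i < 254; i++)
        y = gf_mul(y, x);
    uint8_t s = y;
    for (int i = 1; i <= 4; i++)            /* XOR of four byte rotations */
        s ^= (uint8_t)((y << i) | (y >> (8 - i)));
    return (uint8_t)(s ^ 0x63);
}

/* Hypothetical emulation of the forward-AES part of the ENC1S data path.
 *   bs  : byte select and rotation amount (the role of fn[1:0])
 *   mix : nonzero selects the MixColumns expansion (encrypt round),
 *         zero selects plain zero-extension (final round, key schedule) */
static uint32_t enc1s_aes_fwd(uint32_t rs1, uint32_t rs2, int bs, int mix)
{
    bs &= 3;
    uint32_t s = aes_sbox((uint8_t)(rs2 >> (8 * bs)));  /* select, substitute */
    uint32_t w = s;
    if (mix) {                                          /* linear expansion */
        uint32_t s2 = xtime((uint8_t)s);
        w = ((s ^ s2) << 24) | (s << 16) | (s << 8) | s2;
    }
    if (bs != 0)                                        /* rotate by bs bytes */
        w = (w << (8 * bs)) | (w >> (32 - 8 * bs));
    return rs1 ^ w;                                     /* XOR with rs1 */
}
```

With these assumptions, `enc1s_aes_fwd(0, 0, 0, 1)` evaluates to 0xA56363C6, the little-endian counterpart of the first entry of a conventional “T-Table”. The data-dependent branches make this sketch suitable for emulation only; it is not constant-time.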
There is no need for separate instructions for individual steps of AES as small parts of each of them have been incorporated into a single instruction. We’ve found that each one of these substeps requires surprisingly little additional logic.
For SM4 the instruction has the same data path with byte selection, S-Box lookup, and two different linear operations, depending on whether encryption/decryption or key scheduling task is being performed.
Both the AES [1] and SM4 [3] specifications are written using big-endian notation, while RISC-V primarily uses the little-endian convention [8]. To avoid endianness conversion, the linear expansion step outputs have a flipped byte order. This is less noticeable with AES, but the 32-bit word rotations of SM4 become less intuitive to describe (while the wiring is equivalent).
We refer to the concise reference implementation discussed in Section IV for details about specific logic operations required to implement the ISA extension, and for standards-derived unit tests.
III Using the AES and SM4 Instructions
AES and SM4 were originally designed primarily for 32-bit software implementation. An ENC1S AES adopts this “intended” 32-bit implementation structure but removes table lookups and rolls several individual steps into the same instruction. Both AES and SM4 implementations are also realizable with the reduced “E” register file without major changes.
III-A AES Computation and Key Schedule
The structure of an AES implementation is similar to a “T-Table” implementation, with sixteen invocations of the AES encrypt-round instruction per round and not much else (apart from fetching the round subkeys). In practice, two sets of four registers are used to store the state, with one set being used to rewrite the other, depending on whether an odd or even-numbered round is being processed. AES has 10, 12, or 14 rounds, depending on the key size, which can be 128, 192, or 256 bits, respectively. The final round requires sixteen invocations of AES_FN_FWD. The same instructions are also used in the key schedule, which expands the secret key to subkey words.
The inverse AES operation is structured similarly, with 16 AES_FN_DEC per main body round and 16 AES_FN_REV for the final round. These instructions are also used for reversing the key schedule.
Four precomputed subkey words must be fetched in each round, requiring four loads (lw instructions) in addition to their address increment (typically every other round). There is no need for separate AddRoundKey XORs as the subkeys simply initialize either one of the four-register sets used to store the state.
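The round structure described above can be sketched in C as follows. This is an illustration under stated assumptions rather than the reference code: `aes_enc1s_round` is a hypothetical emulation of the encrypt-round instruction (S-Box computed algebraically, little-endian state words), and `aes_round_sketch` writes the second register set `t` from the first set `x`, with the subkey words providing the initial values so that no separate AddRoundKey XOR is needed. On hardware the two loops would be unrolled into sixteen instructions.

```c
#include <stdint.h>

/* Multiply by x in GF(2^8), AES polynomial x^8 + x^4 + x^3 + x + 1. */
static uint8_t xtime(uint8_t a)
{
    return (uint8_t)((a << 1) ^ ((a & 0x80) ? 0x1B : 0x00));
}

/* AES S-Box from first principles: x^254 (inversion), then affine map. */
static uint8_t aes_sbox(uint8_t x)
{
    uint8_t y = 1;
    for (int n = 0; n < 254; n++) {         /* y = y * x in GF(2^8) */
        uint8_t a = y, b = x, r = 0;
        for (int i = 0; i < 8; i++) {
            if (b & 1)
                r ^= a;
            a = xtime(a);
            b >>= 1;
        }
        y = r;
    }
    uint8_t s = y;
    for (int i = 1; i <= 4; i++)
        s ^= (uint8_t)((y << i) | (y >> (8 - i)));
    return (uint8_t)(s ^ 0x63);
}

/* Hypothetical emulation of one encrypt-round invocation:
 * byte select, S-Box, MixColumns expansion, rotation, XOR with rs1. */
static uint32_t aes_enc1s_round(uint32_t rs1, uint32_t rs2, int bs)
{
    uint32_t s = aes_sbox((uint8_t)(rs2 >> (8 * bs)));
    uint32_t s2 = xtime((uint8_t)s);
    uint32_t w = ((s ^ s2) << 24) | (s << 16) | (s << 8) | s2;
    if (bs != 0)
        w = (w << (8 * bs)) | (w >> (32 - 8 * bs));
    return rs1 ^ w;
}

/* One main-body AES round with sixteen invocations.
 * x  : current state, four little-endian column words
 * rk : four subkey words for this round
 * t  : receives the next state */
static void aes_round_sketch(uint32_t t[4], const uint32_t x[4],
                             const uint32_t rk[4])
{
    for (int j = 0; j < 4; j++) {
        uint32_t r = rk[j];                 /* subkey initializes the target */
        for (int i = 0; i < 4; i++)         /* byte i comes from column j+i */
            r = aes_enc1s_round(r, x[(j + i) & 3], i);
        t[j] = r;
    }
}
```

The choice of source column `(j + i) & 3` implements ShiftRows, while the rotation inside each invocation places the expanded value in the correct MixColumns position.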
It is also possible to compute the round keys “on the fly” without committing them to RAM. This may be helpful in certain types of security applications. The overhead is roughly 30%. However, if the load operation is much slower than register-to-register arithmetic, the overhead of on-the-fly subkey computation can become negligible. On-the-fly keying is more challenging in reverse.
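As an illustration of on-the-fly keying, one forward step of the AES-128 key schedule can be sketched as below. The names `aes_enc1s_fwd` and `aes128_next_rk` are hypothetical, the S-Box is computed algebraically for emulation, and subkey words are assumed to be held in little-endian order; the sketch makes no claim about the instruction scheduling used in the reference assembler listings.

```c
#include <stdint.h>

/* Multiply by x in GF(2^8), AES polynomial x^8 + x^4 + x^3 + x + 1. */
static uint8_t xtime(uint8_t a)
{
    return (uint8_t)((a << 1) ^ ((a & 0x80) ? 0x1B : 0x00));
}

/* AES S-Box from first principles: x^254 (inversion), then affine map. */
static uint8_t aes_sbox(uint8_t x)
{
    uint8_t y = 1;
    for (int n = 0; n < 254; n++) {         /* y = y * x in GF(2^8) */
        uint8_t a = y, b = x, r = 0;
        for (int i = 0; i < 8; i++) {
            if (b & 1)
                r ^= a;
            a = xtime(a);
            b >>= 1;
        }
        y = r;
    }
    uint8_t s = y;
    for (int i = 1; i <= 4; i++)
        s ^= (uint8_t)((y << i) | (y >> (8 - i)));
    return (uint8_t)(s ^ 0x63);
}

/* Hypothetical emulation of the S-Box-only variant (no MixColumns):
 * the substituted byte is placed back at byte position bs. */
static uint32_t aes_enc1s_fwd(uint32_t rs1, uint32_t rs2, int bs)
{
    uint32_t s = aes_sbox((uint8_t)(rs2 >> (8 * bs)));
    return rs1 ^ (s << (8 * bs));
}

/* Replace rk[0..3] with the next AES-128 round key, given the round
 * constant rcon (0x01, 0x02, 0x04, ... for successive rounds). */
static void aes128_next_rk(uint32_t rk[4], uint8_t rcon)
{
    uint32_t tr = (rk[3] >> 8) | (rk[3] << 24);   /* RotWord, little-endian */
    uint32_t t0 = rk[0] ^ rcon;
    for (int i = 0; i < 4; i++)                   /* SubWord, XORed into t0 */
        t0 = aes_enc1s_fwd(t0, tr, i);
    rk[0] = t0;
    rk[1] ^= rk[0];
    rk[2] ^= rk[1];
    rk[3] ^= rk[2];
}
```

Calling `aes128_next_rk` once per round keeps only four subkey words live in registers, at the price of the extra invocations and XORs per round.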
III-B SM4 Computation and Key Schedule
SM4 has only one key size, 128 bits. The algorithm has 32 steps, each using a single 32-bit subkey word. The steps are typically organized into 8 full rounds of 4 steps each. Due to its Feistel-like structure, SM4 does not require an inverse S-Box for decryption, unlike AES, which is a substitution-permutation network (SPN). The inverse SM4 cipher is equivalent to the forward cipher, but with reversed subkey order.
Each step uses all four state words and one subkey word as inputs, replacing a single state word. Since input mixing is built from XORs, some of the temporary XOR values are unchanged and can be shared between steps. Each round requires ten XORs in addition to sixteen SM4_FN_ENC invocations, bringing the total number of arithmetic instructions to 26 per round – or 6.5 per step. Therefore SM4 performance is slightly lower than that of AES-128, despite having fewer full rounds.
The key schedule similarly requires 16 invocations of SM4_FN_KEY and 10 XORs to produce a block of four subkey words. The key schedule uses 32 “CK” round constants which can be either fetched from a table or computed with 8-bit addition operations on the fly.
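The on-the-fly computation of the CK constants can be sketched as follows. The helper name `sm4_ck` is hypothetical, and the result is given in the big-endian word notation of the SM4 specification, in which byte j of constant i equals (4i + j) · 7 mod 256; an implementation using the flipped byte order mentioned earlier would assemble the bytes in the opposite order.

```c
#include <stdint.h>

/* Compute the SM4 key schedule constant CK[i], 0 <= i < 32, using only
 * 8-bit additions. Byte j of CK[i] is (4*i + j) * 7 mod 256, with byte 0
 * in the most significant position (specification notation). */
static uint32_t sm4_ck(int i)
{
    uint32_t ck = 0;
    uint8_t b = (uint8_t)(28 * i);          /* (4*i) * 7 mod 256 */
    for (int j = 0; j < 4; j++) {
        ck = (ck << 8) | b;
        b = (uint8_t)(b + 7);               /* next byte: add 7 mod 256 */
    }
    return ck;
}
```

For instance, `sm4_ck(0)` gives 0x00070E15 and `sm4_ck(31)` gives 0x646B7279, matching the first and last constants listed in the standard.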
For SM4 each block of four consecutive invocations of SM4_FN_ENC or SM4_FN_KEY shares the same source and destination registers, differing only in fn[1:0], which steps through 0, 1, 2, 3. We denote such a four-ENC1S block as pseudo instruction ENC4S. One can reduce the per-round instruction count of SM4 from 26 (+4 lw) to 14 (+4 lw) by implementing ENC4S as a “real” instruction that is almost four times larger than ENC1S in hardware.
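The SM4 round structure and the ENC4S grouping can be illustrated with the following sketch. All function names are hypothetical, the words use the big-endian notation of the SM4 specification rather than the flipped byte order of the proposed instruction, and the S-Box is passed in as a caller-supplied table (this sketch does not reproduce the SM4 table, and a table lookup is of course what the hardware instruction avoids). `sm4_step_plain` is the textbook step, included only for comparison.

```c
#include <stdint.h>

static uint32_t rotl32(uint32_t x, unsigned n)
{
    n &= 31;
    return n ? (x << n) | (x >> (32 - n)) : x;
}

/* SM4 linear transform L used for encryption and decryption. */
static uint32_t sm4_l(uint32_t b)
{
    return b ^ rotl32(b, 2) ^ rotl32(b, 10) ^ rotl32(b, 18) ^ rotl32(b, 24);
}

/* One ENC1S-style invocation: select byte bs of rs2, substitute,
 * expand with L, rotate by bs bytes, XOR with rs1. */
static uint32_t sm4_enc1s(uint32_t rs1, uint32_t rs2, int bs,
                          const uint8_t sbox[256])
{
    uint32_t s = sbox[(rs2 >> (8 * bs)) & 0xFF];
    return rs1 ^ rotl32(sm4_l(s), 8 * (unsigned)bs);
}

/* ENC4S pseudo instruction: four invocations with the same registers,
 * the byte index stepping through 0, 1, 2, 3. */
static uint32_t sm4_enc4s(uint32_t rs1, uint32_t rs2, const uint8_t sbox[256])
{
    for (int bs = 0; bs < 4; bs++)
        rs1 = sm4_enc1s(rs1, rs2, bs, sbox);
    return rs1;
}

/* One round (four steps) with ten explicit XORs; u is shared by two steps. */
static void sm4_round_sketch(uint32_t x[4], const uint32_t rk[4],
                             const uint8_t sbox[256])
{
    uint32_t t, u;
    u = x[2] ^ x[3];
    t = x[1] ^ u ^ rk[0];
    x[0] = sm4_enc4s(x[0], t, sbox);
    t = x[0] ^ u ^ rk[1];
    x[1] = sm4_enc4s(x[1], t, sbox);
    u = x[0] ^ x[1];
    t = x[3] ^ u ^ rk[2];
    x[2] = sm4_enc4s(x[2], t, sbox);
    t = x[2] ^ u ^ rk[3];
    x[3] = sm4_enc4s(x[3], t, sbox);
}

/* Textbook step: returns x0 ^ L(tau(x1 ^ x2 ^ x3 ^ rk)). */
static uint32_t sm4_step_plain(uint32_t x0, uint32_t x1, uint32_t x2,
                               uint32_t x3, uint32_t rk,
                               const uint8_t sbox[256])
{
    uint32_t a = x1 ^ x2 ^ x3 ^ rk;
    uint32_t b = ((uint32_t)sbox[a >> 24] << 24)
               | ((uint32_t)sbox[(a >> 16) & 0xFF] << 16)
               | ((uint32_t)sbox[(a >> 8) & 0xFF] << 8)
               | (uint32_t)sbox[a & 0xFF];
    return x0 ^ sm4_l(b);
}
```

Because L is built from rotations and XORs, applying it byte by byte and rotating each partial result gives the same value as applying it to the whole substituted word, which is why four single-byte invocations can replace one full T transform.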
Note that without additional instructions an AES implementation does not benefit from ENC4S in encryption or decryption, only in key schedule.
IV Reference Implementation
An open-source reference implementation is available (AES/SM4 ISA Extension: https://github.com/mjosaarinen/lwaes_isa). The distribution contains HDL combinatorial logic for the ENC1S instruction (including the S-Boxes) and provisional assembler listings for full AES-128/192/256 and SM4-128.
The package also has C-language emulator code for the instruction logic, “runnable pseudocode” implementations of algorithms, and a set of standards-derived unit tests. This research distribution is primarily intended for obtaining data such as instruction counts and intermediate values but can be readily integrated into many RISC-V cores and emulators.
IV-A About the AES, SM4 S-Boxes
AES and SM4 can share data paths so it makes sense to explore their additional structural similarities and differences. Both SM4 and AES S-Boxes are constructed from finite field inversion in GF(2^8) together with linear (affine) transformations on input and/or output. The inversion makes them “Nyberg S-Boxes” [9] with desirable properties against differential and linear cryptanalysis, while the linear mixing steps are intended to break the bytewise algebraic structure.
Since inversion in GF(2^8) is an involution (self-inverse) and affine isomorphic regardless of polynomial basis, AES, AES⁻¹, and SM4 S-Boxes really differ only in their inner and outer linear layers.
Boyar and Peralta [10] show how to build low-depth circuits for AES that are composed of linear top and bottom layers and a shared nonlinear middle stage. Here XOR and XNOR gates are “linear” and the shared nonlinear layer consists of XOR and AND gates only. For this project we created additional top and bottom layers for SM4 that use the same middle layer as AES and AES⁻¹.
Each S-Box expands an 8-bit input to 21 bits in a linear inner (“top”) layer, uses the shared nonlinear 21-to-18 bit mapping as a middle layer, and again compresses 18 bits to 8 bits in the outer (“bottom”) layer. Table II gives the individual gate counts to each layer; summing up top, middle, and bottom gives the total S-Box gate count ( 128).
Despite such a strict structure and limited choice of gates (that is suboptimal for silicon but very natural to mathematics), these are some of the smallest circuits for AES known. Note that it is possible to implement AES with fewer gates (113 total), but this results in 50% higher circuit depth [11].
IV-B Experimental Instruction Encoding and Synthesis
For prototyping we interfaced the ENC1S logic using the custom-0 opcode and R-type instruction encoding, with fn carried in the lower 5 bits of the funct7 field.
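A prototype encoding of this kind could be assembled into a 32-bit instruction word as sketched below. The helper name `enc1s_encode` is hypothetical, and two details are assumptions not stated in the text: the funct3 field and the upper two bits of funct7 are taken to be zero. The field positions and the custom-0 major opcode (0001011) follow the standard RISC-V R-type layout.

```c
#include <stdint.h>

#define OPCODE_CUSTOM0 0x0Bu    /* RISC-V custom-0 major opcode, 0001011 */

/* Build an R-type instruction word for a prototype ENC1S encoding.
 * rd, rs1, rs2 : register numbers 0..31
 * fn           : 5-bit function field, placed in funct7[4:0] */
static uint32_t enc1s_encode(unsigned rd, unsigned rs1, unsigned rs2,
                             unsigned fn)
{
    uint32_t funct7 = fn & 0x1Fu;   /* assumption: funct7[6:5] are zero */
    uint32_t funct3 = 0;            /* assumption: funct3 is zero */
    return (funct7 << 25)
         | ((uint32_t)(rs2 & 31u) << 20)
         | ((uint32_t)(rs1 & 31u) << 15)
         | (funct3 << 12)
         | ((uint32_t)(rd & 31u) << 7)
         | OPCODE_CUSTOM0;
}
```

Under these assumptions, an invocation with rd = rs1 = x10, rs2 = x11, and fn = 0 encodes as 0x00B5050B, which can be emitted from assembler source as a raw 32-bit word.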
The implementation has been tested with PQShield’s “Pluto” RISC-V core. We synthesized the same core on a low-end Xilinx Artix-7 FPGA target (XC7A35TICSG324-1L) with and without the ENC1S (AES, SM4) instruction extension and related execution pipeline interface.
For comparison, we also measured the size of a memory-mapped AES module “EXTAES”. This module implements AES encryption only, not inverse AES or SM4. Table III summarizes the relative area of ENC1S and EXTAES. Note that the SoC used in this synthesis has some additional logic that is not relevant to the current discussion. We estimate that the full (AES, AES⁻¹, SM4) instruction proposal increases the amount of core logic (LUTs) by about 10% over a typical baseline RV32I core, but much less for more complex cores.
Implementors can experiment to see whether it is beneficial to multiplex the S-Box linear layers with the shared middle layer. The required mux logic seems large and increases the circuit depth, so our current reference implementation does not use it.
We observe that the EXTAES module requires a large amount of additional slice registers. Such a memory-mapped state is more difficult to manage and share among processes than the ENC1S state which is always contained in the register file. While the EXTAES module has 16 parallel S-Boxes and executes the core AES itself in about a dozen cycles, loading and storing of blocks causes significant additional latency.
|Resource||Base||ENC1S ()||EXTAES ()|
V Performance and Security Analysis
The hand-optimized AES implementation (Ko Stoffelen: “RISC-V Crypto”, https://github.com/Ko-/riscvcrypto) referenced in [12] requires 80 core arithmetic instructions per round. The same task can be accomplished with 16 ENC1S instructions. Furthermore, 16 of those 80 are memory loads, which typically require more cycles than a simple arithmetic instruction (or ENC1S). Each AES round additionally requires a few operations for loading subkeys and managing instruction flow.
Overall, based on RV32 and RV64 instruction counts we estimate that the performance of an ENC1S AES can be expected to be more than 500% better than the fastest AES implementations that use the baseline ISA only. Much of the precise performance gain over a table-based implementation depends on the latency of memory load operations.
ENC1S-based AES and SM4 implementations are inherently constant-time and resistant to timing attacks. Stoffelen  also presents a constant-time, bitsliced AES implementation for RISC-V which requires times more cycles than the optimized table-based implementation. So ENC1S speedup over a timing side-channel hardened base ISA implementation is expected to be roughly 15-fold.
We are not aware of any definitive assembler benchmarks for SM4 on RISC-V, but based on instruction count estimates the performance improvement can be expected to be roughly similar or more (over 500%). Without ENC1S, simple SM4 software implementations would benefit from rotation instructions, which have been proposed in the RISC-V bit manipulation extension but are not widely available.
We have only discussed timing side-channel attacks. Since these instructions interact with the main register file, any electromagnetic emission countermeasures would probably have to be extended to additional parts of the CPU core.
It may be possible to address electromagnetic emissions with completely different types of “masking” instructions. We note that the low multiplicative complexity of our S-Box logic helps when building side-channel resistance beyond timing attacks. Goudarzi and Rivain [13] found the Boyar-Peralta type S-Box to be ideal for masked implementations, a general countermeasure against emission side-channel attacks.
We propose a minimalistic RISC-V ISA extension for AES and SM4 block ciphers. The resulting speedup is 500% or more for both ciphers when compared to hand-optimized base ISA assembler implementations that use lookup tables.
In addition to saving energy and reducing latency in secure communications and storage encryption, the main security benefit of the instructions is their constant-time operation and resulting resistance against cache timing attacks. Such countermeasures are expensive in pure software implementations.
The instructions require logic only for a single S-Box, which is combined with additional linear layers for increased code density and performance. The hardware footprint of the instruction is very small as a result. If both AES and SM4 are implemented on the same target, they can share data paths, which further simplifies hardware. However, AES and SM4 are independent of each other and AES⁻¹ is also optional. It is not rare to implement and use the forward AES without inverse AES, as common CTR-based AES modes (such as GCM) do not require the inverse cipher for decryption [14].
This proposal is targeted towards (ultra) lightweight MCUs and SoCs. A different type of ISA extension may provide additional speedups on 64-bit and vectorized platforms, but at the cost of increased implementation area. Designers may still want to choose this minimal-footprint option if timing side-channel resistance is their primary concern.
- [1] NIST, “Advanced Encryption Standard (AES),” Federal Information Processing Standards Publication FIPS 197, November 2001.
- [2] E. Rescorla, “The Transport Layer Security (TLS) protocol version 1.3,” IETF RFC 8446, August 2018. [Online]. Available: https://www.rfc-editor.org/info/rfc8446
- [3] SAC, “GB/T 32907-2016: SM4 block cipher algorithm,” Cryptographic Standards Publication, original in Chinese. Also GM/T 0002-2012, August 2016. [Online]. Available: http://www.gmbz.org.cn/upload/2018-04-04/1522788048733065051.pdf
- [4] D. A. Osvik, A. Shamir, and E. Tromer, “Cache attacks and countermeasures: The case of AES,” in Topics in Cryptology - CT-RSA 2006, The Cryptographers’ Track at the RSA Conference 2006, San Jose, CA, USA, February 13-17, 2006, Proceedings, ser. Lecture Notes in Computer Science, D. Pointcheval, Ed., vol. 3860. Springer, 2006, pp. 1–20. [Online]. Available: https://eprint.iacr.org/2005/271
- [5] D. J. Bernstein, “Cache-timing attacks on AES,” Web-published Manuscript, April 2005. [Online]. Available: http://cr.yp.to/papers.html#cachetiming
- [6] S. Gueron, “Intel Advanced Encryption Standard (AES) new instructions set,” White Paper, May 2010, 323641-001.
- [7] ARM, “Arm A64 Instruction Set Architecture Armv8, for Armv8-A architecture profile,” 2019, ARM DDI 0596 (ID 120619). [Online]. Available: https://static.docs.arm.com/ddi0595/f/SysReg_xml_v86A-2019-12.pdf
- [8] A. Waterman and K. Asanović, Eds., The RISC-V Instruction Set Manual, Volume I: User-Level ISA, Document Version 20191213. RISC-V Foundation, December 2019. [Online]. Available: https://riscv.org/specifications/
- [9] K. Nyberg, “Differentially uniform mappings for cryptography,” in Advances in Cryptology - EUROCRYPT ’93, Workshop on the Theory and Application of Cryptographic Techniques, Lofthus, Norway, May 23-27, 1993, Proceedings, ser. Lecture Notes in Computer Science, T. Helleseth, Ed., vol. 765. Springer, 1993, pp. 55–64. [Online]. Available: https://doi.org/10.1007/3-540-48285-7_6
- [10] J. Boyar and R. Peralta, “A small depth-16 circuit for the AES S-box,” in Information Security and Privacy Research - 27th IFIP TC 11 Information Security and Privacy Conference, SEC 2012, Heraklion, Crete, Greece, June 4-6, 2012. Proceedings, ser. IFIP Advances in Information and Communication Technology, D. Gritzalis, S. Furnell, and M. Theoharidou, Eds., vol. 376. Springer, 2012, pp. 287–298. [Online]. Available: https://eprint.iacr.org/2011/332
- [11] ——, “A new combinational logic minimization technique with applications to cryptology,” in Experimental Algorithms, 9th International Symposium, SEA 2010, Ischia Island, Naples, Italy, May 20-22, 2010. Proceedings, ser. Lecture Notes in Computer Science, P. Festa, Ed., vol. 6049. Springer, 2010, pp. 178–189. [Online]. Available: https://doi.org/10.1007/978-3-642-13193-6_16
- [12] K. Stoffelen, “Efficient cryptography on the RISC-V architecture,” in Progress in Cryptology - LATINCRYPT 2019 - 6th International Conference on Cryptology and Information Security in Latin America, Santiago de Chile, Chile, October 2-4, 2019, Proceedings, ser. Lecture Notes in Computer Science, P. Schwabe and N. Thériault, Eds., vol. 11774. Springer, 2019, pp. 323–340. [Online]. Available: https://eprint.iacr.org/2019/794
- [13] D. Goudarzi and M. Rivain, “How fast can higher-order masking be in software?” in Advances in Cryptology - EUROCRYPT 2017 - 36th Annual International Conference on the Theory and Applications of Cryptographic Techniques, Paris, France, April 30 - May 4, 2017, Proceedings, Part I, ser. Lecture Notes in Computer Science, J. Coron and J. B. Nielsen, Eds., vol. 10210, 2017, pp. 567–597. [Online]. Available: https://eprint.iacr.org/2016/264
- [14] M. Dworkin, “Recommendation for block cipher modes of operation: Methods and techniques,” NIST Special Publication SP 800-38A, December 2001.