A Lightweight ISA Extension for AES and SM4

by Markku-Juhani O. Saarinen, et al.

We describe a lightweight RISC-V ISA extension for AES and SM4 block ciphers. Sixteen instructions (and a subkey load) are required to implement an AES round with the extension, instead of 80 without. An SM4 step (quarter-round) has 6.5 arithmetic instructions, a similar reduction. Perhaps even more importantly, the ISA extension helps to eliminate slow, secret-dependent table lookups and to protect against cache timing side-channel attacks. Having only one S-box, the extension has a minimal hardware size and is well suited for ultra-low power applications. AES and SM4 implementations using the ISA extension also have a much-reduced software footprint. The AES and SM4 instances can share the same data paths but are independent in the sense that a chip designer can implement SM4 without AES and vice versa. Full AES and SM4 assembler listings, HDL source code for the instruction's combinatorial logic, and C code for emulation are provided to the community under a permissive open source license. The implementation contains depth- and size-optimized joint AES and SM4 S-Box logic based on the Boyar-Peralta construction with a shared non-linear middle layer, demonstrating additional avenues for logic optimization. The instruction logic has been experimentally integrated into the single-cycle execution path of the “Pluto” RV32 core and has been tested on an FPGA.




I Introduction

The Advanced Encryption Standard (AES) is a 128-bit block cipher with a 128/192/256-bit key, defined in the FIPS 197 standard [1]. AES is a mandatory building block of the TLS 1.3 [2] security protocol and is widely used for storage encryption, shared-secret authentication, cryptographic random number generation, and in many other applications.

The SM4 block cipher [3] fulfills a similar role to AES in the Chinese market and is the main block cipher recommended for use in China. SM4 also has a 128-bit block size, but only one key size, 128 bits. Even though its high-level structure differs completely from AES, the two share significant similarities in their sole nonlinear component, which is a single 8 × 8-bit “S-Box” substitution table in both cases.

Cache timing attacks on AES became well known in the mid-2000s when it was demonstrated that common table-based implementations can be exploited even remotely [4, 5]; very similar issues also affect SM4. In the presence of a cache, the only way to make the execution time of these ciphers fully independent of secret data is to eliminate the table lookup, either by implementing it as bitsliced Boolean logic or by providing a specific ISA extension for the S-Box lookup.

Consumer CPUs have had instructions to support AES for almost a decade via the Intel AES-NI in x86 [6] and ARMv8-A cryptographic extensions [7]; these are almost universally available in PCs and higher-end mobile devices such as phones. ARM also supports SM4 via the ARMv8.2-SM extension. The AES instructions have been shown to make AES much less of a throughput bottleneck for high-speed TLS communication (servers) and storage encryption (mobile devices), thereby also extending battery life in the latter. Both Intel and ARM cryptographic ISAs require 128-bit (SIMD) registers, and are not available on lower-end CPUs.

In this work, we show that it is possible to create a simple AES and SM4 ISA extension that offers a significant performance improvement and timing side-channel resistance with a minimally increased hardware footprint. It is especially suitable for lightweight RV32 targets.

II A Lightweight AES and SM4 ISA Extension

The ISA extension operates on the main register file only, using two source registers, one destination register, and a 5-bit field fn[4:0] which can be seen either as an “immediate constant” or just code points in instruction encoding. In either case, the interface to the (reference) combinatorial logic is:

module enc1s(
  output [31:0] rd,   // to output register
  input  [31:0] rs1,  // input register 1
  input  [31:0] rs2,  // input register 2
  input  [4:0]  fn    // 5-bit func specifier
);

See Section IV-B for encoding details of ENC1S as an RV32 R-type custom instruction for testing purposes. For RV64 the words are simply truncated or zero-extended.

For emulation, the instructions are encapsulated in C as:

uint32_t enc1s(uint32_t rs1, uint32_t rs2,
  int fn);            // ENC1Sfn rd, rs1, rs2

The five bits of fn cover encryption, decryption, and key schedule for both algorithms. Bits fn[1:0] first select a single byte from rs2. Two bits fn[4:3] indicate which 8 → 8-bit S-Box is used (AES, AES⁻¹, or SM4), and additionally fn[4:2] specifies an 8 → 32-bit linear expansion transformation (each of the three S-Boxes has two alternative linear transforms, indicated by fn[2]). The expanded 32-bit value is then rotated by 0–3 byte positions based on fn[1:0]. The result is finally XORed with rs1 and written to rd.
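The data path described above can be illustrated with a minimal C sketch restricted to the two forward AES function codes. This is not the reference implementation: the function name `enc1s_aes_fwd` and the helpers are hypothetical, the S-Box is computed from its FIPS 197 definition rather than from gate-level logic, and the byte layout of the linear expansion (little-endian MixColumns column) is an assumption based on the flipped byte order mentioned in the text.

```c
#include <assert.h>
#include <stdint.h>

/* Double an element of GF(2^8) modulo the AES polynomial x^8+x^4+x^3+x+1. */
static uint8_t xtime(uint8_t a)
{
    return (uint8_t)((a << 1) ^ ((a & 0x80) ? 0x1B : 0x00));
}

/* Shift-and-add multiplication in GF(2^8). */
static uint8_t gf_mul(uint8_t a, uint8_t b)
{
    uint8_t r = 0;
    for (; b != 0; b >>= 1) {
        if (b & 1)
            r ^= a;
        a = xtime(a);
    }
    return r;
}

/* AES S-Box: field inversion (x^254) followed by the FIPS 197 affine map.
   Computed rather than tabulated only to keep the sketch short. */
static uint8_t aes_sbox(uint8_t x)
{
    uint8_t inv = 1, s;
    for (int i = 0; i < 254; i++)
        inv = gf_mul(inv, x);
    s = inv;
    for (int i = 1; i <= 4; i++)
        s ^= (uint8_t)((inv << i) | (inv >> (8 - i)));
    return (uint8_t)(s ^ 0x63);
}

/* Hypothetical model of ENC1S for fn[4:2] = 0 (AES_FN_ENC) and
   fn[4:2] = 1 (AES_FN_FWD). Other function codes are not modelled. */
uint32_t enc1s_aes_fwd(uint32_t rs1, uint32_t rs2, int fn)
{
    unsigned sh = 8u * ((unsigned)fn & 3u);        /* fn[1:0]: 0, 8, 16, 24 */
    uint32_t x = aes_sbox((uint8_t)(rs2 >> sh));   /* byte select + S-Box */

    if (((fn >> 2) & 7) == 0) {                    /* partial MixColumns */
        uint32_t x2 = xtime((uint8_t)x);
        x = ((x ^ x2) << 24) | (x << 16) | (x << 8) | x2;
    }
    if (sh != 0)                                   /* rotate left by sh bits */
        x = (x << sh) | (x >> (32 - sh));
    return x ^ rs1;                                /* accumulate into rs1 */
}
```

Under these assumptions, applying the AES_FN_ENC variant to a zero byte yields the byte-flipped form of the familiar T-table entry for S(0) = 0x63.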

Table I contains the identifiers (pseudo instructions) that we currently use for bits fn[4:2]. Usually we may arrange computation so that rd = rs1 without increasing instruction count, making a two-operand “compressed” encoding possible.

Identifier   fn[4:2]  Description or Use
AES_FN_ENC   3’b000   AES Encrypt round.
AES_FN_FWD   3’b001   AES Final / Key sched.
AES_FN_DEC   3’b010   AES Decrypt round.
AES_FN_REV   3’b011   AES Decrypt final.
SM4_FN_ENC   3’b100   SM4 Encrypt and Decrypt.
SM4_FN_KEY   3’b101   SM4 Key Schedule.
Unused       3’b11x   (4 × 6 = 24 points used.)
TABLE I: High-level identifiers (pseudo instructions) for fn[4:2].
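As an illustration of how the identifiers in Table I combine with the byte index, the following sketch builds the full 5-bit fn value. The enum values follow the fn[4:2] column of the table; the helper name `enc1s_fn` is hypothetical and is not part of the proposal.

```c
#include <assert.h>

/* Values of fn[4:2] as listed in Table I. */
enum enc1s_func {
    AES_FN_ENC = 0,   /* AES encrypt round */
    AES_FN_FWD = 1,   /* AES final round / key schedule */
    AES_FN_DEC = 2,   /* AES decrypt round */
    AES_FN_REV = 3,   /* AES decrypt final round */
    SM4_FN_ENC = 4,   /* SM4 encrypt and decrypt */
    SM4_FN_KEY = 5    /* SM4 key schedule */
};

/* Hypothetical helper: place the function code in fn[4:2] and the
   byte index (0..3) in fn[1:0]. */
int enc1s_fn(enum enc1s_func func, int byte_index)
{
    return (((int)func & 7) << 2) | (byte_index & 3);
}
```

With six function codes and four byte positions, this helper produces 24 distinct values, leaving the eight code points with fn[4:3] = 11 unused.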

For AES the instruction selects a byte from rs2, performs a single S-box lookup (SubBytes or its inverse), evaluates a part of the MDS matrix (MixColumns) if that linear expansion step is selected, rotates the result by a multiple of 8 bits (ShiftRows), and XORs the result with rs1 (AddRoundKey). There is no need for separate instructions for individual steps of AES as small parts of each of them have been incorporated into a single instruction. We have found that each one of these substeps requires surprisingly little additional logic.

For SM4 the instruction has the same data path with byte selection, S-Box lookup, and two different linear operations, depending on whether an encryption/decryption or a key scheduling task is being performed.

Both the AES [1] and SM4 [3] specifications are written using big-endian notation while RISC-V primarily uses the little-endian convention [8]. To avoid endianness conversion, the linear expansion step outputs have a flipped byte order. This is less noticeable with AES, but the 32-bit word rotations of SM4 become less intuitive to describe (while the wiring is equivalent).

We refer to the concise reference implementation discussed in Section IV for details about specific logic operations required to implement the ISA extension, and for standards-derived unit tests.

III Using the AES and SM4 Instructions

AES and SM4 were originally designed primarily for 32-bit software implementation. An ENC1S AES implementation adopts this “intended” 32-bit implementation structure but removes table lookups and rolls several individual steps into the same instruction. Both AES and SM4 implementations are also realizable with the reduced “E” register file without major changes.

III-A AES Computation and Key Schedule

The structure of an AES implementation is similar to a “T-Table” implementation, with sixteen invocations of AES_FN_ENC per round and not much else (apart from fetching the round subkeys). In practice, two sets of four registers are used to store the state, with one set being used to rewrite the other, depending on whether an odd or even-numbered round is being processed. AES has 10, 12, or 14 rounds, depending on the key size, which can be 128, 192, or 256 bits, respectively. The final round requires sixteen invocations of AES_FN_FWD. The same instructions are also used in the key schedule, which expands the secret key to subkey words.

The inverse AES operation is structured similarly, with 16 AES_FN_DEC per main body round and 16 AES_FN_REV for the final round. These instructions are also used for reversing the key schedule. Four precomputed subkey words must be fetched in each round, requiring four loads (lw instructions) in addition to their address increment (typically every other round). There is no need for separate AddRoundKey XORs as the subkeys simply initialize either one of the four-register sets used to store the state.
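The forward round structure described above (a subkey word initializing each output register, followed by four accumulating invocations per word) can be sketched in C as follows. This is an illustrative model, not the authors' assembler listing: `enc1s_aes_enc` and `aes_enc_round_sketch` are hypothetical names, state and subkey words are assumed to be little-endian columns, and the S-Box is computed from its FIPS 197 definition. In an assembler implementation the loops would be fully unrolled into 16 instructions.

```c
#include <assert.h>
#include <stdint.h>

/* Double an element of GF(2^8) modulo the AES polynomial. */
static uint8_t xtime(uint8_t a)
{
    return (uint8_t)((a << 1) ^ ((a & 0x80) ? 0x1B : 0x00));
}

static uint8_t gf_mul(uint8_t a, uint8_t b)
{
    uint8_t r = 0;
    for (; b != 0; b >>= 1) {
        if (b & 1)
            r ^= a;
        a = xtime(a);
    }
    return r;
}

/* AES S-Box from its definition: inversion (x^254), then the affine map. */
static uint8_t aes_sbox(uint8_t x)
{
    uint8_t inv = 1, s;
    for (int i = 0; i < 254; i++)
        inv = gf_mul(inv, x);
    s = inv;
    for (int i = 1; i <= 4; i++)
        s ^= (uint8_t)((inv << i) | (inv >> (8 - i)));
    return (uint8_t)(s ^ 0x63);
}

/* Hypothetical model of ENC1S with fn[4:2] = AES_FN_ENC; b is fn[1:0]. */
static uint32_t enc1s_aes_enc(uint32_t rs1, uint32_t rs2, unsigned b)
{
    unsigned sh = 8u * (b & 3u);
    uint32_t x = aes_sbox((uint8_t)(rs2 >> sh));
    uint32_t x2 = xtime((uint8_t)x);
    x = ((x ^ x2) << 24) | (x << 16) | (x << 8) | x2;
    if (sh != 0)
        x = (x << sh) | (x >> (32 - sh));
    return x ^ rs1;
}

/* One main-body AES encryption round: t = round(s) under subkey rk.
   The set t plays the role of the second four-register set. */
void aes_enc_round_sketch(uint32_t t[4], const uint32_t s[4],
                          const uint32_t rk[4])
{
    for (unsigned j = 0; j < 4; j++) {
        uint32_t acc = rk[j];               /* subkey initialises the word */
        for (unsigned i = 0; i < 4; i++)    /* 4 x 4 = 16 ENC1S calls */
            acc = enc1s_aes_enc(acc, s[(j + i) & 3u], i);
        t[j] = acc;
    }
}
```

The sketch can be checked against the first round of the FIPS 197 Appendix B example by loading the state and round key bytes as little-endian words.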

It is also possible to compute the round keys “on the fly” without committing them to RAM. This may be helpful in certain types of security applications. The overhead is roughly 30%. However, if the load operation is much slower than register-to-register arithmetic, the overhead of on-the-fly subkey computation can become negligible. On-the-fly keying is more challenging in reverse.

III-B SM4 Computation and Key Schedule

SM4 has only one key size, 128 bits. The algorithm has 32 steps, each using a single 32-bit subkey word. The steps are typically organized into 8 full rounds of 4 steps each. Due to its Feistel-like structure, SM4 does not require an inverse S-Box for decryption, unlike AES, which is a substitution-permutation network (SPN). The inverse SM4 cipher is equivalent to the forward cipher, but with reversed subkey order.

Each step uses all four state words and one subkey word as inputs, replacing a single state word. Since input mixing is built from XORs, some of the temporary XOR values are unchanged and can be shared between steps. Each round requires ten XORs in addition to sixteen SM4_FN_ENC invocations, bringing the total number of arithmetic instructions to 26 per round – or 6.5 per step. Therefore SM4 performance is slightly lower than that of AES-128, despite having fewer full rounds.
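The XOR sharing between steps can be made concrete with a small C sketch that compares a direct transcription of four SM4 steps (12 XORs) with a version using shared temporaries (10 XORs). Note the loud assumption: `enc4s_placeholder` stands in for four SM4_FN_ENC invocations but applies only the SM4 linear transform L and omits the S-Box, so neither function computes real SM4. All function names are hypothetical; only the XOR scheduling is the point of the example.

```c
#include <assert.h>
#include <stdint.h>

static uint32_t rotl32(uint32_t x, unsigned n)
{
    return (x << n) | (x >> (32 - n));
}

/* PLACEHOLDER for four SM4_FN_ENC invocations: rd = rs1 ^ L(rs2).
   The S-Box is omitted, so this is NOT the real SM4 round function. */
static uint32_t enc4s_placeholder(uint32_t rs1, uint32_t rs2)
{
    return rs1 ^ rs2 ^ rotl32(rs2, 2) ^ rotl32(rs2, 10)
               ^ rotl32(rs2, 18) ^ rotl32(rs2, 24);
}

/* Four steps written directly from the step definition: 12 XORs. */
void sm4_round_naive(uint32_t x[4], const uint32_t rk[4])
{
    x[0] = enc4s_placeholder(x[0], x[1] ^ x[2] ^ x[3] ^ rk[0]);
    x[1] = enc4s_placeholder(x[1], x[2] ^ x[3] ^ x[0] ^ rk[1]);
    x[2] = enc4s_placeholder(x[2], x[3] ^ x[0] ^ x[1] ^ rk[2]);
    x[3] = enc4s_placeholder(x[3], x[0] ^ x[1] ^ x[2] ^ rk[3]);
}

/* The same four steps with shared temporaries: 10 XORs. */
void sm4_round_shared(uint32_t x[4], const uint32_t rk[4])
{
    uint32_t u, t;

    u = x[2] ^ x[3];                    /* 1 XOR, reused by steps 1 and 2 */
    t = rk[0] ^ u ^ x[1];               /* 2 XORs */
    x[0] = enc4s_placeholder(x[0], t);
    t = rk[1] ^ u ^ x[0];               /* 2 XORs */
    x[1] = enc4s_placeholder(x[1], t);
    u = x[0] ^ x[1];                    /* 1 XOR, reused by steps 3 and 4 */
    t = rk[2] ^ u ^ x[3];               /* 2 XORs */
    x[2] = enc4s_placeholder(x[2], t);
    t = rk[3] ^ u ^ x[2];               /* 2 XORs */
    x[3] = enc4s_placeholder(x[3], t);
}
```

Both functions produce identical state words for any input, which is why the shared form can replace the direct one without changing the cipher.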

The key schedule similarly requires 16 invocations of SM4_FN_KEY and 10 XORs to produce a block of four subkey words. The key schedule uses 32 “CK” round constants which can be either fetched from a table or computed with 8-bit addition operations on the fly.
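As a sketch of the on-the-fly option, the SM4 “CK” constants can be generated with byte-wise additions: byte j of constant i is (4i + j) · 7 mod 256, so consecutive constants differ by 28 in every byte. The function names below are hypothetical, and the words are shown in the big-endian notation of the SM4 standard; an implementation following the flipped byte order of this extension may need the bytes reversed.

```c
#include <assert.h>
#include <stdint.h>

/* CK[i] from the closed form: byte j is (4*i + j) * 7 mod 256. */
uint32_t sm4_ck(unsigned i)
{
    uint32_t ck = 0;
    for (unsigned j = 0; j < 4; j++)
        ck = (ck << 8) | (uint8_t)((4u * i + j) * 7u);
    return ck;
}

/* CK[i+1] from CK[i]: add 28 to each byte independently (mod 256). */
uint32_t sm4_ck_next(uint32_t ck)
{
    uint32_t next = 0;
    for (int shift = 24; shift >= 0; shift -= 8) {
        uint8_t b = (uint8_t)((ck >> shift) + 28u);
        next |= (uint32_t)b << shift;
    }
    return next;
}
```

Starting from CK[0] = 0x00070e15 and applying `sm4_ck_next` repeatedly reproduces the same sequence as the closed form, which avoids a 128-byte constant table.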

For SM4 each block of four consecutive invocations of SM4_FN_ENC and SM4_FN_KEY shares the same source and destination registers, differing only in fn[1:0], which steps through {0, 1, 2, 3}. We denote such a four-ENC1S block as the pseudo instruction ENC4S. One can reduce the per-round instruction count of SM4 from 26 (+4 lw) to 14 (+4 lw) by implementing ENC4S as a “real” instruction that is almost four times larger than ENC1S in hardware.

Note that without additional instructions an AES implementation does not benefit from ENC4S in encryption or decryption, only in key schedule.

IV Reference Implementation

An open-source reference implementation is available (AES/SM4 ISA Extension: https://github.com/mjosaarinen/lwaes_isa). The distribution contains HDL combinatorial logic for the ENC1S instruction (including the S-Boxes) and provisional assembler listings for full AES-128/192/256 and SM4-128.

The package also has C-language emulator code for the instruction logic, “runnable pseudocode” implementations of algorithms, and a set of standards-derived unit tests. This research distribution is primarily intended for obtaining data such as instruction counts and intermediate values but can be readily integrated into many RISC-V cores and emulators.

IV-A About the AES, SM4 S-Boxes

AES and SM4 can share data paths so it makes sense to explore their additional structural similarities and differences. Both SM4 and AES S-Boxes are constructed from finite field inversion x⁻¹ in GF(2⁸) together with linear (affine) transformations on input and/or output. The inversion makes them “Nyberg S-Boxes” [9] with desirable properties against differential and linear cryptanalysis, while the linear mixing steps are intended to break the bytewise algebraic structure.

Since the inversion x⁻¹ is an involution (self-inverse) and affine isomorphic regardless of polynomial basis, the AES, AES⁻¹, and SM4 S-Boxes really differ only in their inner and outer linear layers.
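The shared-inversion structure can be illustrated for the two AES S-Boxes with a short C sketch. It uses the affine map of FIPS 197 on the output side of the forward S-Box and its inverse on the input side of the inverse S-Box, with the same field inversion in the middle. The function names are hypothetical and this table-free arithmetic is not the Boyar-Peralta gate-level circuit used in the reference implementation; SM4 is omitted because its affine layers and field polynomial differ.

```c
#include <assert.h>
#include <stdint.h>

static uint8_t rotl8(uint8_t v, unsigned n)
{
    return (uint8_t)((v << n) | (v >> (8 - n)));
}

/* Multiplication in GF(2^8) modulo the AES polynomial x^8+x^4+x^3+x+1. */
static uint8_t gf_mul(uint8_t a, uint8_t b)
{
    uint8_t r = 0;
    for (; b != 0; b >>= 1) {
        if (b & 1)
            r ^= a;
        a = (uint8_t)((a << 1) ^ ((a & 0x80) ? 0x1B : 0x00));
    }
    return r;
}

/* Shared nonlinear core: x^254, which equals x^-1 (and maps 0 to 0). */
uint8_t gf_inv(uint8_t x)
{
    uint8_t r = 1;
    for (int i = 0; i < 254; i++)
        r = gf_mul(r, x);
    return r;
}

/* AES S-Box: inversion first, affine layer on the output. */
uint8_t aes_sbox(uint8_t x)
{
    uint8_t v = gf_inv(x);
    return (uint8_t)(v ^ rotl8(v, 1) ^ rotl8(v, 2) ^ rotl8(v, 3)
                       ^ rotl8(v, 4) ^ 0x63);
}

/* AES inverse S-Box: inverse affine layer on the input, same inversion. */
uint8_t aes_inv_sbox(uint8_t y)
{
    uint8_t v = (uint8_t)(rotl8(y, 1) ^ rotl8(y, 3) ^ rotl8(y, 6) ^ 0x05);
    return gf_inv(v);
}
```

Because `gf_inv` is its own inverse, undoing the S-Box only requires undoing the affine layer, which is what makes a single shared nonlinear stage sufficient.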

Boyar and Peralta [10] show how to build low-depth circuits for AES that are composed of linear top and bottom layers and a shared nonlinear middle stage. Here XOR and XNOR gates are “linear” and the shared nonlinear layer consists of XOR and AND gates only. For this project we created additional top and bottom layers for SM4 that use the same middle layer as AES and AES⁻¹.


Component       In   Out  XOR  XNOR  AND  Total
Shared middle   21   18   30   -     34   64
AES top          8   21   26   -     -    26
AES bottom      18    8   34   4     -    38
AES⁻¹ top        8   21   16   10    -    26
AES⁻¹ bottom    18    8   37   -     -    37
SM4 top          8   21   18   9     -    27
SM4 bottom      18    8   33   5     -    38
TABLE II: Algebraic gate counts for Boyar-Peralta type low-depth S-Boxes that implement SM4 in addition to AES and AES⁻¹.

Each S-Box expands an 8-bit input to 21 bits in a linear inner (“top”) layer, uses the shared nonlinear 21-to-18 bit mapping as a middle layer, and again compresses 18 bits to 8 bits in the outer (“bottom”) layer. Table II gives the individual gate counts for each layer; summing up top, middle, and bottom gives the total S-Box gate count (approximately 128).
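The per-S-Box totals follow directly from the “Total” column of Table II. The small C sketch below performs the addition; the constants are copied from the table and the names are hypothetical.

```c
#include <assert.h>

/* Total gate counts per layer, copied from Table II. */
enum {
    GATES_MIDDLE      = 64,   /* shared nonlinear middle layer */
    GATES_AES_TOP     = 26,
    GATES_AES_BOTTOM  = 38,
    GATES_AESI_TOP    = 26,   /* AES inverse S-Box layers */
    GATES_AESI_BOTTOM = 37,
    GATES_SM4_TOP     = 27,
    GATES_SM4_BOTTOM  = 38
};

/* Gate count of one complete S-Box: top + shared middle + bottom. */
int sbox_gates(int top, int bottom)
{
    return top + GATES_MIDDLE + bottom;
}
```

With these figures the AES, inverse AES, and SM4 S-Boxes come to 128, 127, and 129 gates respectively when each is built separately with its own copy of the middle layer.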

Despite such a strict structure and limited choice of gates (that is suboptimal for silicon but very natural to mathematics), these are some of the smallest circuits for AES known. Note that it is possible to implement AES with fewer gates (113 total), but this results in 50% higher circuit depth [11].

IV-B Experimental Instruction Encoding and Synthesis

For prototyping we interfaced the ENC1S logic using the custom-0 opcode and R-type instruction encoding with fn[4:0] occupying lower 5 bits of the funct7 field:

[31:30]  [29:25]  [24:20]  [19:15]  [14:12]  [11:7]  [6:0]
00       fn       rs2      rs1      000      rd      0001011
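The field layout above can be turned into a small encoder for generating test instruction words. This is a sketch for experimentation only; the function name `enc1s_encode` is hypothetical, and register arguments are plain register numbers 0 to 31.

```c
#include <assert.h>
#include <stdint.h>

/* Assemble an ENC1S instruction word in the experimental custom-0 R-type
   encoding: [31:30]=00, [29:25]=fn, [24:20]=rs2, [19:15]=rs1,
   [14:12]=000, [11:7]=rd, [6:0]=0001011. */
uint32_t enc1s_encode(unsigned fn, unsigned rs2, unsigned rs1, unsigned rd)
{
    return ((uint32_t)(fn  & 31u) << 25) |
           ((uint32_t)(rs2 & 31u) << 20) |
           ((uint32_t)(rs1 & 31u) << 15) |
           ((uint32_t)(rd  & 31u) << 7)  |
           0x0Bu;                           /* custom-0 opcode 0001011 */
}
```

For example, fn = 1 with rs2 = x11, rs1 = x10, and rd = x10 encodes to 0x02B5050B, which can be emitted with a `.word` directive when the assembler does not know the instruction.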

The implementation has been tested with PQShield’s “Pluto” RISC-V core. We synthesized the same core on a low-end Xilinx Artix-7 FPGA target (XC7A35TICSG324-1L) with and without the ENC1S (AES, SM4) instruction extension and related execution pipeline interface.

For comparison, we also measured the size of a memory-mapped AES module “EXTAES”. This module implements AES encryption only, not inverse AES or SM4. Table III summarizes the relative area of ENC1S and EXTAES. Note that the SoC used in this synthesis has some additional logic that is not relevant to the current discussion. We estimate that the full (AES, AES⁻¹, SM4) instruction proposal increases the amount of core logic (LUTs) by about 10% over a typical baseline RV32I core, but much less for more complex cores.

Implementors can experiment to see whether it is beneficial to multiplex the S-Box linear layers with the shared middle layer. The required mux logic seems large and increases the circuit depth, so our current reference implementation does not use it.

We observe that the EXTAES module requires a large amount of additional slice registers. Such a memory-mapped state is more difficult to manage and share among processes than the ENC1S state which is always contained in the register file. While the EXTAES module has 16 parallel S-Boxes and executes the core AES itself in about a dozen cycles, loading and storing of blocks causes significant additional latency.

Resource     Base    ENC1S (Δ)       EXTAES (Δ)
Logic LUTs   7,767   8,202 (+435)    9,795 (+2,028)
Slice regs   3,319   3,342 (+23)     4,361 (+1,042)
SLICEL       1,571   1,864 (+293)    2,068 (+497)
SLICEM         734     737 (+3)        851 (+117)
TABLE III: RV32 SoC area with and without ENC1S (AES, AES⁻¹, SM4); “Pluto” core on an Artix-7 FPGA. EXTAES is a CPU-external memory-mapped AES-only module, presented for comparison.

V Performance and Security Analysis

The hand-optimized AES implementation referenced in [12] (Ko Stoffelen: “RISC-V Crypto”, https://github.com/Ko-/riscvcrypto) requires 80 core arithmetic instructions per round. The same task can be accomplished with 16 ENC1S instructions. Furthermore, 16 of those 80 are memory loads, which typically require more cycles than a simple arithmetic instruction (or ENC1S). Each AES round additionally requires a few operations for loading subkeys and managing instruction flow.

Overall, based on RV32 and RV64 instruction counts we estimate that the performance of an ENC1S AES can be expected to be more than 500% better than the fastest AES implementations that use the baseline ISA only. Much of the precise performance gain over a table-based implementation depends on the latency of memory load operations.

ENC1S-based AES and SM4 implementations are inherently constant-time and resistant to timing attacks. Stoffelen [12] also presents a constant-time, bitsliced AES implementation for RISC-V, which requires substantially more cycles than the optimized table-based implementation. The ENC1S speedup over a timing side-channel hardened base ISA implementation is therefore expected to be roughly 15-fold.

We are not aware of any definitive assembler benchmarks for SM4 on RISC-V, but based on instruction count estimates the performance improvement can be expected to be roughly similar or greater (over 500%). Without ENC1S, simple SM4 software implementations would benefit from the rotation instructions that have been proposed in the RISC-V bit manipulation extension, but these are not widely available.

We have only discussed timing side-channel attacks. Since these instructions interact with the main register file, any electromagnetic emission countermeasures would probably have to be extended to additional parts of the CPU core.

It may be possible to address electromagnetic emissions with completely different types of “masking” instructions. We note that the low multiplicative complexity of our S-Box logic helps when building side-channel resistance beyond timing attacks. Goudarzi and Rivain [13] found the Boyar-Peralta type S-Box to be ideal for masked implementations, a general countermeasure against emission side-channel attacks.

VI Conclusions

We propose a minimalistic RISC-V ISA extension for AES and SM4 block ciphers. The resulting speedup is 500% or more for both ciphers when compared to hand-optimized base ISA assembler implementations that use lookup tables.

In addition to saving energy and reducing latency in secure communications and storage encryption, the main security benefit of the instructions is their constant-time operation and resulting resistance against cache timing attacks. Such countermeasures are expensive in pure software implementations.

The instructions require logic only for a single S-Box, which is combined with additional linear layers for increased code density and performance. The hardware footprint of the instruction is very small as a result. If both AES and SM4 are implemented on the same target they can share data paths, which further simplifies hardware. However, AES and SM4 are independent of each other and AES⁻¹ is also optional. It is not rare to implement and use the forward AES without the inverse AES, as common CTR-based AES modes (such as GCM) do not require the inverse cipher for decryption [14].

This proposal is targeted towards (ultra) lightweight MCUs and SoCs. A different type of ISA extension may provide additional speedups on 64-bit and vectorized platforms, but at the cost of increased implementation area. Designers may still want to choose this minimal-footprint option if timing side-channel resistance is their primary concern.


  • [1] NIST, “Advanced Encryption Standard (AES),” Federal Information Processing Standards Publication FIPS 197, November 2001.
  • [2] E. Rescorla, “The Transport Layer Security (TLS) protocol version 1.3,” IETF RFC 8446, August 2018. [Online]. Available: https://www.rfc-editor.org/info/rfc8446
  • [3] SAC, “GB/T 32907-2016: SM4 block cipher algorithm,” Cryptographic Standards Publication, original in Chinese. Also GM/T 0002-2012, August 2016. [Online]. Available: http://www.gmbz.org.cn/upload/2018-04-04/1522788048733065051.pdf
  • [4] D. A. Osvik, A. Shamir, and E. Tromer, “Cache attacks and countermeasures: The case of AES,” in Topics in Cryptology - CT-RSA 2006, The Cryptographers’ Track at the RSA Conference 2006, San Jose, CA, USA, February 13-17, 2006, Proceedings, ser. Lecture Notes in Computer Science, D. Pointcheval, Ed., vol. 3860.   Springer, 2006, pp. 1–20. [Online]. Available: https://eprint.iacr.org/2005/271
  • [5] D. J. Bernstein, “Cache-timing attacks on AES,” Web-published Manuscript, April 2005. [Online]. Available: http://cr.yp.to/papers.html#cachetiming
  • [6] S. Gueron, “Intel Advanced Encryption Standard (AES) new instructions set,” White Paper, May 2010, 323641-001.
  • [7] ARM, “Arm A64 Instruction Set Architecture Armv8, for Armv8-A architecture profile,” 2019, ARM DDI 0596 (ID 120619). [Online]. Available: https://static.docs.arm.com/ddi0595/f/SysReg_xml_v86A-2019-12.pdf
  • [8] A. Waterman and K. Asanović, Eds., The RISC-V Instruction Set Manual, Volume I: User-Level ISA, Document Version 20191213.   RISC-V Foundation, December 2019. [Online]. Available: https://riscv.org/specifications/
  • [9] K. Nyberg, “Differentially uniform mappings for cryptography,” in Advances in Cryptology - EUROCRYPT ’93, Workshop on the Theory and Application of Cryptographic Techniques, Lofthus, Norway, May 23-27, 1993, Proceedings, ser. Lecture Notes in Computer Science, T. Helleseth, Ed., vol. 765.   Springer, 1993, pp. 55–64. [Online]. Available: https://doi.org/10.1007/3-540-48285-7_6
  • [10] J. Boyar and R. Peralta, “A small depth-16 circuit for the AES S-box,” in Information Security and Privacy Research - 27th IFIP TC 11 Information Security and Privacy Conference, SEC 2012, Heraklion, Crete, Greece, June 4-6, 2012. Proceedings, ser. IFIP Advances in Information and Communication Technology, D. Gritzalis, S. Furnell, and M. Theoharidou, Eds., vol. 376.   Springer, 2012, pp. 287–298. [Online]. Available: https://eprint.iacr.org/2011/332
  • [11] ——, “A new combinational logic minimization technique with applications to cryptology,” in Experimental Algorithms, 9th International Symposium, SEA 2010, Ischia Island, Naples, Italy, May 20-22, 2010. Proceedings, ser. Lecture Notes in Computer Science, P. Festa, Ed., vol. 6049.   Springer, 2010, pp. 178–189. [Online]. Available: https://doi.org/10.1007/978-3-642-13193-6_16
  • [12] K. Stoffelen, “Efficient cryptography on the RISC-V architecture,” in Progress in Cryptology - LATINCRYPT 2019 - 6th International Conference on Cryptology and Information Security in Latin America, Santiago de Chile, Chile, October 2-4, 2019, Proceedings, ser. Lecture Notes in Computer Science, P. Schwabe and N. Thériault, Eds., vol. 11774.   Springer, 2019, pp. 323–340. [Online]. Available: https://eprint.iacr.org/2019/794
  • [13] D. Goudarzi and M. Rivain, “How fast can higher-order masking be in software?” in Advances in Cryptology - EUROCRYPT 2017 - 36th Annual International Conference on the Theory and Applications of Cryptographic Techniques, Paris, France, April 30 - May 4, 2017, Proceedings, Part I, ser. Lecture Notes in Computer Science, J. Coron and J. B. Nielsen, Eds., vol. 10210, 2017, pp. 567–597. [Online]. Available: https://eprint.iacr.org/2016/264
  • [14] M. Dworkin, “Recommendation for block cipher modes of operation: Methods and techniques,” NIST Special Publication SP 800-38A, December 2001.