Dustin: A 16-Cores Parallel Ultra-Low-Power Cluster with 2b-to-32b Fully Flexible Bit-Precision and Vector Lockstep Execution Mode

by   Gianmarco Ottavi, et al.

Computationally intensive algorithms such as Deep Neural Networks (DNNs) are becoming killer applications for edge devices. Porting heavily data-parallel algorithms to resource-constrained and battery-powered devices poses several challenges related to memory footprint, computational throughput, and energy efficiency. Low-bitwidth and mixed-precision arithmetic have been proven to be valid strategies for tackling these problems. We present Dustin, a fully programmable compute cluster integrating 16 RISC-V cores capable of 2- to 32-bit arithmetic and all possible mixed-precision permutations. In addition to a conventional Multiple-Instruction Multiple-Data (MIMD) processing paradigm, Dustin introduces a Vector Lockstep Execution Mode (VLEM) to minimize power consumption in highly data-parallel kernels. In VLEM, a single leader core fetches instructions and broadcasts them to the 15 follower cores. Clock gating the Instruction Fetch (IF) stages and private caches of the follower cores leads to a 38% power reduction with minimal performance overhead (<3%). Dustin, implemented in 65 nm CMOS technology, achieves a peak performance of 58 GOPS and a peak efficiency of 1.15 TOPS/W.




1 Introduction

Modern Near-Sensor Analytics Applications (NSAA) increasingly need to run complex workloads such as Deep Neural Networks (DNN) on Internet of Things (IoT) end-nodes. These devices are severely constrained in terms of power envelope, memory, and cost (i.e., silicon area and technology). An emerging trend to tackle this problem is to employ, for each task of an application, the most compact data representation that is numerically usable, exploiting low-precision and mixed-precision arithmetic operations. In this way, each operand of an arithmetic operation is represented with the smallest possible number of bits, reducing the complexity of arithmetic units and the memory footprint of an application [van2020bayesian, cai2020rethinking].

This approach is well-established in both the floating-point and integer domains. For floating-point workloads, transprecision techniques have been demonstrated in domains such as traditional near-sensor data analytics [BEPPETRANS] and training of neural networks [GANPU]. In the integer domain, emerging fixed-point transprecision and mixed-precision techniques can be pushed even further, down to extreme low bitwidths, for applications based on linear algebra [STOJANOV] and inference of deep neural networks [lee2018unpu]. For example, executing a MobileNetV1 DNN with 4-bit precision reduces memory footprint by 7x with a 4.4% Top-1 accuracy drop over full-precision execution. Moreover, mixed-precision quantization further reduces memory footprint by 11% and, most importantly, narrows the classification accuracy gap versus the full-precision baseline (-2.9% Top-1) [rusci2019memory].

From an architectural viewpoint, extreme low-bitwidth mixed-precision arithmetic has been mainly applied in specialized accelerators [GANPU, lee2018unpu]. Exploiting this technique with dedicated hardware is very effective since the whole datapath is typically designed for a single or a subset of functions, safely tuning each operation to the desired precision. However, applying this principle to fully programmable architectures is challenging since the algorithms to be executed are not known a priori. Consequently, multiple formats must be supported, increasing the complexity and overheads of instruction fetch and decode stages.

Still, previous work on low-bitwidth computations on instruction processors demonstrated remarkable results, especially in DNN inference. Garofalo et al. [garofalo2020pulp] proposed a C library for DNN inference exploiting 8-bit Single Instruction Multiple Data (SIMD) instructions, as well as other Digital Signal Processing (DSP) Instruction Set Architecture (ISA) extensions. This solution outperforms a commercial library (CMSIS-NN [lai2018cmsis]) implementing the same functions on ARM Cortex M4 and M7, which only support 16-bit SIMD instructions, by 4.54× and 2.54×, respectively. More aggressive approaches have been presented in Garofalo et al. [garofalo2020xpulpnn], where SIMD support has been extended to 4-bit and 2-bit operations, leading to further performance and energy efficiency gains.

However, the extensions proposed in Garofalo et al. [garofalo2020xpulpnn] only tackle part of the challenge, since they lack support for mixed-precision operations. Mixed-precision execution requires data conversion and packing/unpacking operations, leading to significant overheads if not natively supported by the underlying hardware [bruschi2020enabling]. When applied to DNNs, exploiting mixed-precision computations on state-of-the-art processors dramatically reduces the memory footprint, enabling the execution of MobileNets on tiny end-nodes. However, it comes with a significant performance overhead over symmetric SIMD. Furthermore, supporting too many mixed-precision formats leads to a proliferation of instructions. For example, supporting SIMD instructions for efficient execution of DNNs leads to more than 300 instructions due to precision formats ranging from 16- to 2-bit and all possible permutations. This effect increases the complexity of the Instruction Fetch (IF) and Instruction Decode (ID) stages and possibly saturates the ISA encoding space.

To tackle the presented challenges, we present Dustin, a low-power IoT processor with a software-programmable accelerator composed of 16 RISC-V cores, optimized for energy-efficient and flexible execution of integer mixed-precision operations. The DSP cores support mixed-precision extensions through 2b-to-16b SIMD instructions, accelerating arithmetic operations as well as the complex packing/unpacking and conversion operations required by mixed-precision computations. The format of input operands is set through a dedicated control register to reduce the complexity of the IF and ID stages of the processors.

Furthermore, the cluster can be dynamically configured into a fine-grain Vector Lockstep Execution Mode (VLEM), turning off the IF stages and private instruction caches of all the cores except one. This technique boosts the energy efficiency of data-parallel sections of the code, reducing power consumption by 38% on average with no performance degradation on critical data-parallel kernels while still offering Multiple Instructions Multiple Data (MIMD) flexibility for general-purpose code.

Implemented in robust and cost-effective 65 nm CMOS technology, Dustin achieves 15 GOPS and 303 GOPS/W on 8-bit integer arithmetic. These results are comparable to SoA fully programmable systems implemented in much more scaled technology nodes (40 nm and 22 nm), with a further boost in performance (3.7×) and efficiency on low-bitwidth mixed-precision workloads, up to 58 GOPS and 1.15 TOPS/W, nearing the efficiency of dedicated accelerators.

2 Related Work

This section presents an overview of recent architectures for DNN inference, starting from dedicated digital and analog accelerators and moving towards software-programmable solutions more similar to the one proposed in this work. We can categorize them by performance, flexibility, and power envelope. The first category includes specialized accelerators that trade off flexibility to excel in performance and efficiency. A common feature of all the presented solutions is the exploitation of low bit-width (down to byte or even sub-byte) to improve the efficiency of DNN inference.

2.1 Dedicated Accelerators for DNNs

2.1.1 Digital Accelerators

Convolution in CNN layers can be re-mapped as matrix-matrix multiplication. To exploit this, digital accelerators for DNNs are typically organized as 2D systolic arrays that are highly efficient for this operation. Nevertheless, the fixed size of the array can lead to underutilization for problems that do not fit it perfectly. The literature contains many examples of DNN accelerators [reuther2019survey] that reach performance levels ranging from tens to hundreds of GOPS and efficiencies in the TOPS/W range. We detail a few notable approaches below.

Envision [moons201714] is a DNN accelerator with a reconfigurable computational engine supporting bit precisions from 1 to 16 bits. It employs a circuit-level voltage and frequency scaling technique to improve efficiency, reaching a peak of 76 GOPS and an average of 2 TOPS/W.

Thinker [yin20171] employs configurable 2D arrays that can be partitioned into sub-arrays to compute different types of layers. Each PE contains two 8-bit multipliers that can either be merged for one 16-bit operation or compute two 8-bit (or narrower) operations in one cycle. It peaks at 380 GOPS and 5 TOPS/W.

Loom [sharify2018loom] is a bit-serial DNN accelerator where performance and efficiency scale inversely with weight and activation precision. Loom can also reduce precision dynamically by inspecting the groups of 256 activations that it processes concurrently, further increasing the effectiveness of bit reduction on the overall efficiency.

EyerissV2 [chen2019eyeriss] is a DNN accelerator that connects multiple PE clusters with a flexible NoC. The network can be configured either to provide high bandwidth for layers with low data reuse or, when reuse is high, to exploit spatial data reuse (via multicast or broadcast) for high energy efficiency. It uses a fixed 8-bit weight/activation precision and reaches a peak throughput of 153.6 GOPS.

2.1.2 Analog In-Memory Accelerators

A recent trend that leverages low-bitwidth computation is analog in-memory computing (AIMC) [verma2019memory]. It overcomes the Von Neumann bottleneck by executing the matrix-vector multiplication directly in memory, reducing data movement and exploiting the high parallelism of dense 2D memory arrays. Given that AIMC can only compute matrix-vector operations, several heterogeneous architectures have been proposed that add digital electronics to perform the rest of the network (e.g., non-linear functions, residual layers, max-pooling, etc.). The attractiveness of these systems comes from the peak throughput and efficiency of DNN inference. Examples are the works of Khaddam-Aljameh et al. [hermesCor], claiming 10.5 TOPS/W; Zhou et al. [zhou2021analognets], with a peak efficiency of 112 TOPS/W; and Jia et al. [new_verma], with a peak efficiency of 30 TOPS/W.

Even though the peak performance and efficiency of AIMC macros are outstanding, several fundamental challenges must still be solved to achieve the claimed levels in end-to-end applications: i) the variability of analog computing can significantly impact the accuracy of the network; ii) specialized training is needed; iii) flexibility is poor, since AIMC is well matched only to a limited set of operations, mainly matrix-vector multiplication. Given these problems, most of these systems have so far been demonstrated on small networks trained on simple data sets such as CIFAR-10 or MNIST.


Specialized architectures, including both digital and analog accelerators, can deliver remarkable performance and energy efficiency but lack flexibility. Flexibility, however, is a fundamental aspect when trying to accelerate DNNs: neural networks are in continuous evolution, and the performance boost provided by an accelerator is only reachable for DNNs that fit the shape and size for which the accelerator was designed.

2.1.3 FPGA-Based Accelerators

DNN accelerators can also be deployed on FPGAs. This solution provides extra flexibility compared to ASICs, since FPGAs can be reconfigured through a Hardware Description Language (HDL), but it comes at the cost of an order of magnitude lower performance and efficiency. The power envelope rises to the Watt level for these devices, which can be problematic for battery-powered systems.

A new family of FPGAs announced by Lattice, namely Sense-AI [LatticeSENSEAI], provides comprehensive hardware and software solutions for always-on artificial intelligence within a power budget between 1 mW and 1 W. Despite that, these ultra-low-power FPGA families have limited LUT capabilities and are still too expensive for many applications where MCUs are traditionally chosen for their low cost.

The adoption of FPGAs remains an obstacle for the average IoT programmer, who demands the highest flexibility from microcontroller systems. This work focuses on more flexible and user-friendly solutions based on software-programmable instruction processors.

2.2 Software Programmable Solutions

2.2.1 Low-Power MCUs

Given that the rise of DNNs and quantization is relatively recent, "classical" commercial microcontroller (MCU) cores such as Cortex M4 and M7 struggle to compete with newer architectures. This is shown in [garofalo2020pulp], where a RISC-V core (CV32E40P, https://github.com/openhwgroup/cv32e40p) outperforms the ARM counterparts on CNN layers by 3.2× to 6× when using the de-facto standard quantization of 8-bit.

To address DNN computing at the edge, ARM presented the Cortex-M55 core based on the Armv8.1-M ISA. The core's general-purpose performance sits between that of an M4 and an M7, and it adds an M-Profile Vector Extension (MVE) called Helium that also supports 8-bit MAC instructions. The vector extension uses a 64-bit data interface, meaning it can execute 2×32-bit, 4×16-bit, or 8×8-bit fixed-point operations per cycle. This is a significant step forward compared to the M4 and M7, which do not support the vector extension. However, 8-bit operands for weights and activations could still result in a memory footprint too big for microcontroller-class devices.

Specialized accelerators can also be found in microcontroller-class devices, side by side with general-purpose cores. In [VEGA], a cluster of 9 RISC-V cores with a tightly-coupled CNN accelerator improves performance by 3.3× and doubles efficiency w.r.t. executing the same network on the software-programmable cores. ARM adopts the same approach, coupling the Cortex-M55 with the optional Ethos-U55, an accelerator designed to boost machine learning tasks; depending on the configuration, the system can execute 32 to 256 MAC/cycle. This solution can help mitigate the effect of under-utilization on ASIC acceleration, mixing high throughput (when possible) with high flexibility.

In [garofalo2020xpulpnn], a RISC-V ISA extension called Xpulpnn is proposed. It expands on the already available DSP instructions to support the sub-byte precision formats of 4- and 2-bit. It introduces MAC&LOAD instructions that execute the dot-product while simultaneously loading an operand for the next operation. Xpulpnn outperforms the commercially available M4 and M7 by 2.8× to 19.2× on quantized convolutional layers. When going mixed-precision, the efficiency boost of Xpulpnn w.r.t. ARM cores narrows significantly because of the massive software overhead necessary for packing and unpacking data. Dustin's cores have direct hardware support for mixed-precision operations, eliminating performance degradation compared to uniform precision. To the best of the authors' knowledge, no microcontroller-class device supports hardware mixed-precision instructions.

2.2.2 Status-Based Execution

Mixed-precision support (and sub-byte formats) leads to a significant increase in the number of encoded instructions, since traditional general-purpose CPUs directly encode the precision format in the instruction, leading to a cluttered ISA and increased complexity of the decode stage. The previously described DNN accelerators avoid this problem by encoding formats inside a control register, similar to the RISC-V Vector extension. When an operation between vectors occurs, a few configuration instructions are required to set precision and vector length [dabbelt2016vector].

In this work, we support a status-based execution model similar to the vector model: a single instruction is used for all precisions, and the precision is encoded in a Control Status Register (CSR) of the CPU. The status-based approach enables a "write-once, execute-everywhere" policy for software development. A kernel can refer to generic operands, and the involvement of a specific functional unit depends on the core status. In practice, the programmer can write a single kernel and then run it targeting different formats.

2.2.3 SIMD vs MIMD

General-purpose multi-core CPUs offer a great deal of flexibility but fall behind in efficiency when exploiting data-level parallelism compared to SIMD style architectures. General-purpose multi-core architectures require per-core hardware for fetching and decoding instructions creating significant hardware and power overhead that could be avoided if instructions were executed in lockstep.

GPUs can efficiently capitalize on data-level parallelism, given that every multi-threaded processor needs only one unit that fetches and dispatches instructions to multiple execution units. A problem that can arise when executing code on GPUs is branch divergence: when some threads take a branch while others do not, the two paths are executed sequentially, degrading performance. Specifically, on NVIDIA GPUs (pre-Volta), this was dealt with by thread masking, where a mask with one entry per thread indicates whether the thread executes the branch or not. From Volta on [voltaWhitePaper], Independent Thread Scheduling has been introduced, where each thread has its own program counter, making it possible to execute the threads concurrently but not simultaneously. This avoids deadlocks but does not improve performance when branch divergence happens.

In [dogan2013synchronizing], Dogan et al. propose a system of eight general-purpose cores that can synchronize and execute instructions in lockstep. On branch divergences, the cores get out of sync, execute the code simultaneously, and wait at a synchronization barrier placed at a convergence point before resuming lockstep execution. In conjunction with the lockstep, a broadcast mechanism serves multiple same-address memory requests as one, allowing significant power savings.

In our work, we follow a similar approach: the cluster can be configured dynamically to work in MIMD mode or in Vector Lockstep Execution Mode (VLEM). In VLEM, only one core fetches instructions (similar to a GPU) and dispatches them to the remaining 15 cores. This design allows clock-gating the private I$ and instruction fetch stages of the remaining cores, enabling substantial power savings (in addition to the broadcasting feature). On divergent branches, the cluster is configured back to MIMD mode, allowing simultaneous execution of each branch. This approach gives higher flexibility than GPUs and, at the same time, removes the power overheads inherent in general-purpose CPUs that execute SIMD code.

3 SoC Architecture

Fig. 1 shows the architecture of Dustin SoC. The main contribution of the work refers to the RISC-V cores compute cluster. The IPs surrounding such domain, i.e., the RISC-V core named Fabric Controller (FC), a standard set of peripherals, the 80 kB of memory storing the code executed by both the compute cluster and the FC, as well as the FLLs for clock generation, serve as a programmable testbench.

Fig. 1: Overview of the Dustin SoC Architecture.

The programmable cluster resides in a dedicated power and clock domain, communicating with the SoC subsystem through two AXI4 ports (one initiator and one target) connected to the SoC interconnect through dual-clock first-in-first-out buffers (FIFOs). The cluster is built around 16 32-bit RISC-V processors; the FC turns on and sets the frequency of these cores when it offloads computation to the programmable accelerator. The cores share data on a 128 kB multi-banked L1 memory called tightly coupled data memory (TCDM), composed of 32 4-kB SRAM banks, through a single-cycle latency logarithmic interconnect (LIC). The LIC implements a word-level interleaving scheme to distribute the requests evenly and adopts a round-robin arbitration algorithm, minimizing access contention on the memory banks. The TCDM can serve 32 memory requests in parallel, i.e., it has a banking factor of two, with a low contention rate (<10%) even on data-intensive kernels. The Lockstep unit and the Broadcast Unit are interposed between the cores and the LIC to enable a reconfigurable SIMD/MIMD execution model of the cluster. This key innovation is discussed in more detail in Section 3.2.
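The word-level interleaving described above can be illustrated with a small sketch. It assumes 32-bit words and a plain modulo mapping from word address to bank; the function names are hypothetical and the actual LIC address decoding may differ.

```python
WORD_BYTES = 4          # assumption: interleaving granularity is one 32-bit word
NUM_BANKS = 32          # 32 SRAM banks, as stated in the text
BANK_BYTES = 4 * 1024   # 4 kB per bank, 128 kB in total


def tcdm_bank(addr: int) -> int:
    """Bank index for a byte offset into the TCDM (assumed modulo mapping)."""
    return (addr // WORD_BYTES) % NUM_BANKS


def tcdm_row(addr: int) -> int:
    """Word index inside the selected bank."""
    return (addr // WORD_BYTES) // NUM_BANKS


def conflicts(addrs) -> int:
    """Requests that have to wait because another request targets the same bank."""
    banks = [tcdm_bank(a) for a in addrs]
    return len(banks) - len(set(banks))


# 16 cores reading consecutive words: each request lands in a different bank.
unit_stride = [WORD_BYTES * core for core in range(16)]
# 16 cores reading with a stride of 32 words: all requests land in bank 0.
bank_stride = [WORD_BYTES * NUM_BANKS * core for core in range(16)]
```

Under this mapping, consecutive words spread over all banks, which is why unit-stride accesses from the 16 cores do not collide, while a stride equal to the number of banks is the worst case.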

The cluster can access peripherals such as the timer, event unit, DMA, and the AXI4 bus via a dedicated peripheral interconnect. The DMA manages data transfers between the L2 and L1 memory, featuring 2-D data transfers and up to 16 outstanding transactions to hide the latency between the two levels of the memory hierarchy. The cores share a 2-level latch-based instruction cache. The latch-based design saves up to 4× on instruction memory [meinerzhagen2010towards], while the two levels of cache help with long critical paths. The first level (512 B) is private; the second level (L1.5) is a 4 kB 8-bank shared cache connected to the L1s through a low-latency interconnect similar to the LIC. The L1.5 refills from a larger 80 kB L2 memory hosting resident code. For efficient parallel computing, the cluster supports parallel thread dispatching and synchronization via a dedicated hardware block, the event unit [glaser2020energy]. The cores can wait on events by performing loads on aliased, low-latency memory-mapped registers of the event unit. In addition, the event unit controls the clock-gating of all the cluster cores, meaning that a core waiting for an event can be put to sleep immediately and can resume within two cycles after the event.

3.1 Bit-Scalable Precision Processor

The cores employed in Dustin extend RI5CY [garofalo2020pulp], a 4-stage in-order single-issue processor. Fig. 2 shows a diagram of the cores' pipeline: submodules changed w.r.t. the baseline RI5CY are highlighted in green, whereas the entirely new blocks are shown in yellow. The baseline RI5CY supports the standard RISC-V extensions (I, M, C, and F) and implements a domain-specific extension, called XpulpV2, that introduces several features useful for DNN inference, such as hardware loops, bit manipulation instructions, load/store with post-modified access, and SIMD operations for 16- and 8-bit formats [garofalo2020pulp].

Fig. 2: Dustin’s Cores, an extension of CV32E40P.

The key efficiency-boosting enhancements of the proposed core are: (i) a new mixed-precision SIMD dot-product execution unit, integrated into the micro-architecture of RI5CY, providing support for power-of-two precision formats ranging from 16- down to 2-bit and all their possible permutations; (ii) a non-standard extension of the RISC-V ISA with a set of instructions that handle mixed-precision SIMD operations through a dynamic bit-scalable execution model. Given the ten precision combinations that the SIMD unit can execute, a standard ISA extension approach that encodes one instruction for each type and format of operation would lead to an enormous proliferation of instructions and increased complexity of the decode stage of the micro-architecture. Considering only the new dot-product instructions, which are not the only supported operations, the standard approach would increase the total number of instructions from 24 (in the baseline XpulpV2) to 120.
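The count of ten precision combinations follows from treating the two operand widths as an unordered pair drawn from the four supported power-of-two formats. The short sketch below enumerates them; the variable names are illustrative only, and the lane count assumes 32-bit SIMD registers.

```python
from itertools import combinations_with_replacement

FORMATS = (16, 8, 4, 2)  # power-of-two operand widths named in the text

# Unordered pairs with a_bits >= b_bits: swapping the operands does not create
# a new format, so four widths give 10 combinations rather than 16.
COMBOS = list(combinations_with_replacement(FORMATS, 2))

# Number of SIMD elements of the wider operand in a 32-bit register
# (assumption: both operands are packed in 32-bit registers).
LANES = {(a, b): 32 // a for a, b in COMBOS}
```

Here `COMBOS` contains the four symmetric formats (16x16, 8x8, 4x4, 2x2) plus the six mixed ones (16x8, 16x4, 16x2, 8x4, 8x2, 4x2).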

The dynamic bit-scalable approach we propose involves only the SIMD operations and exploits a concept we call virtual instructions. For the user, they work as regular instructions; the key difference is that the operation precision is not encoded in the instruction itself. The precision is specified at run-time by the content of a Control and Status Register (CSR), written in advance by the programmer to set the desired format of the operands. This approach reduces the number of dot-product instructions encoded into the ISA by 10×. Fig. 3 shows a graphical explanation of the dynamic bit-scalable execution model. Scalar instructions encode both the format and the type of the operation, meaning that the decoder alone provides the full information to forward to the EX stage. For SIMD instructions, the decoder forwards to the EX stage only the type of operation to perform, i.e., it issues the virtual instruction; the additional control signals required by the execution units to determine the format of the operands are provided by the CSR.
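A minimal behavioral sketch of this split between decoder and CSR is given below. It is not the RTL of the core: the class, the method names, and the shape of the control-signal dictionary are all hypothetical, and only two mnemonics are modeled.

```python
from dataclasses import dataclass


@dataclass
class Core:
    # Hypothetical model of the format CSR: operand widths in bits.
    fmt_a: int = 8
    fmt_b: int = 8

    def write_fmt_csr(self, a_bits: int, b_bits: int) -> None:
        """Set the SIMD operand formats ahead of the kernel."""
        if a_bits < b_bits:
            raise ValueError("operand B is expected to be the narrower operand")
        self.fmt_a, self.fmt_b = a_bits, b_bits

    def control_signals(self, mnemonic: str) -> dict:
        """Signals reaching the EX stage for one instruction."""
        if mnemonic == "sdotp":
            # Virtual instruction: the decoder supplies only the operation
            # type, the operand formats come from the CSR.
            return {"op": "sdotp", "a_bits": self.fmt_a, "b_bits": self.fmt_b}
        if mnemonic == "add":
            # Scalar instruction: type and format are both fixed by the encoding.
            return {"op": "add", "a_bits": 32, "b_bits": 32}
        raise ValueError(f"unsupported mnemonic: {mnemonic}")


core = Core()
core.write_fmt_csr(8, 4)
signals_8x4 = core.control_signals("sdotp")
core.write_fmt_csr(4, 2)
signals_4x2 = core.control_signals("sdotp")
```

The same `sdotp` mnemonic yields different control signals depending on the CSR content, while the scalar `add` is unaffected, which is the behavior the paragraph describes.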

Fig. 3: Control signals for SIMD and Scalar instructions. The SIMD instruction is a Sum-of-Dot-Product and the format is a mixed-precision 8x4. The bottom picture contains the encoding of the formats that are contained inside the CSR.

In mixed-precision convolution routines, the format of the dot-products can change between layers. The programmer sets the CSR with the desired format before calling the kernel, using the SIMD_FMT macro, as depicted in Fig. 4 (a). The overhead of this operation is negligible since convolution layers execute over millions of cycles.

Fig. 4: a) Procedure to set precision formats before calling the convolution function. b) Mixed-precision convolution inner loop with data unpacking and conversion overhead. c) Mixed-precision convolution inner loop with direct mixed-precision support.

Fig. 4 (b) and (c) highlight the benefits of the mixed-precision support at the ISA level introduced in this work. They compare a snippet of the assembly code of the innermost loop of a mixed-precision 8-bit×4-bit convolution kernel: Fig. 4 (b) targets the XpulpV2 ISA and Fig. 4 (c) the ISA extensions presented in this work. In Fig. 4 (b), once the 4-bit weights are loaded, additional unpacking and packing instructions are inserted to cast the 4-bit SIMD vector to 8-bit, since the lowest precision format supported by the XpulpV2 SIMD instructions is 8-bit. This procedure adds 6 instructions of overhead for each dot-product. Thanks to the hardware support for mixed-precision SIMD operations, the execution depicted in Fig. 4 (c) requires no additional packing/unpacking instructions, providing a significant performance boost on mixed-precision kernels.

At the micro-architectural level, we extend the ALU and the DOTP units of RI5CY to support the ISA instructions introduced above. We add extra CSRs for storing the operations' formats, and we design a mixed-precision controller to handle the selection, slicing, and routing of SIMD vector elements to the execution units of the EX stage of the pipeline.

We start by detailing the DOTP unit architecture, omitting that of the ALU for the sake of conciseness, as its design follows a similar approach. The DOTP unit computes the dot-product (dotp) operation between two SIMD registers and accumulates the partial results into a 32-bit scalar register through an adder tree, with a latency of one clock cycle. The SIMD vectors can be either symmetric or of mixed formats, within a precision range from 16-bit down to 2-bit.

Fig. 5: On the left, overview of the dot-product unit; on the right, internals of the DOTP-4.

Fig. 6: Mixed-Precision Dot Product 8x2, on the bottom right a portion of a kernel for DNN inference.

The DOTP unit implements dot-product operations with a number of multipliers equal to the number of elements in the SIMD vector; in the mixed-precision context, the highest-precision SIMD vector determines which set of multipliers is used. This defines four different “bitwidth” regions, each followed by a dedicated adder tree that sums the partial products, as shown in Fig. 5 for the 4-bit precision operation (DOTP-4). The sum-of-dot-product (sdotp), the SIMD equivalent of a MAC operation, is supported by adding a 32-bit scalar operand at the input of each adder tree.

To minimize the logic, operand B is designated to always be the narrower operand in mixed-precision SIMD operations, without loss of flexibility. Referring to Fig. 5, we introduce a slicer & router network to: a) slice operand_b according to the value coming from the CSR Format Register; b) select the correct subset of operand_b with the sub-vec selector signal coming from the mixed-precision controller; c) sign-extend (or zero-extend) the vector to match the size of operand_a in order to use the appropriate DOTP unit (e.g., an 8x4-bit operation requires DOTP-8).
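The three steps of the slicer & router network can be mirrored in a small functional model. The sketch below assumes signed operands, elements packed least-significant first, and sub-vector 0 mapped to the lowest bits of operand_b; these orderings, as well as the helper names, are assumptions and not taken from the design. The accumulator is left as an unbounded Python integer rather than wrapped to 32 bits.

```python
def sign_extend(value: int, bits: int) -> int:
    """Interpret the low `bits` of value as a two's-complement integer."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value & (1 << (bits - 1)) else value


def unpack(reg: int, bits: int) -> list:
    """Slice a 32-bit register into signed elements, least-significant first."""
    return [sign_extend(reg >> (i * bits), bits) for i in range(32 // bits)]


def pack(values, bits: int) -> int:
    """Pack signed elements into a register image (helper for the example)."""
    reg = 0
    for i, v in enumerate(values):
        reg |= (v & ((1 << bits) - 1)) << (i * bits)
    return reg


def sdotp(acc: int, op_a: int, op_b: int, a_bits: int, b_bits: int, sub_vec: int = 0) -> int:
    """Mixed-precision sum-of-dot-product on two 32-bit SIMD registers."""
    a = unpack(op_a, a_bits)                 # operand A fixes the number of lanes
    b_all = unpack(op_b, b_bits)             # a) slice operand_b by its format
    lanes = len(a)
    b = b_all[sub_vec * lanes:(sub_vec + 1) * lanes]  # b) select the sub-vector
    # c) extension to the width of operand A is implicit with Python integers
    return acc + sum(x * y for x, y in zip(a, b))


# 8x4 example: four 8-bit activations against eight packed 4-bit weights.
activations = pack([1, -2, 3, -4], 8)
weights = pack([1, 2, 3, -1, 7, -8, 0, 1], 4)
first_half = sdotp(10, activations, weights, 8, 4, sub_vec=0)
second_half = sdotp(10, activations, weights, 8, 4, sub_vec=1)
```

In the 8x4 case one operand_b register feeds two consecutive dot-products, which is what makes the explicit unpacking instructions of the baseline ISA unnecessary.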

Since dotp operations are critical from a timing closure viewpoint, replicating the hardware resources over different bitwidth “regions” of the DOTP unit avoids impacting the critical path of the RI5CY core, at the cost of additional area and power. To mitigate the effect of hardware replication on the dynamic power consumption of the system, all the input operands of the DOTP unit are register-gated, so that the regions not involved in the current SIMD operation do not switch.

The last addition to the core is the mixed-precision controller, which selects the proper sub-group of elements from the second source-operand register (operand B). The controller function, depicted in Fig. 6, enables a mixed-precision operation to scroll through the sub-portions of the register sequentially. To this end, we identify a MAC operation in the ID stage and increment a counter used to select the sub-group of operands. Depending on the precision selected, this counter is capped at its proper value (e.g., the counter counts from 0 to 3 in the case of 8x2). This basic mechanism alone does not cover the reuse of operands: a set of weights (or inputs) can be reused multiple times depending on the kernel. A reuse counter has therefore been added to the CSR to inform the mixed-precision controller after how many operations the sub-operand selector needs to advance. QNN kernels follow a uniform pattern in their computation, and the combination of sequential and reuse information is enough to deal with mixed-precision computation. To give maximum flexibility to the programmer, we also make it possible to control the sub-group of operands via software by directly writing the value into the counter. This feature is useful if the application includes an operation pattern that the mixed-precision controller cannot handle automatically.
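The interplay of the sub-group counter, the reuse counter, and the software override can be sketched as a small state machine. This is a behavioral illustration under stated assumptions: the class and method names are hypothetical, the number of sub-groups is taken as the ratio of the two operand widths (which gives 0 to 3 for 8x2, as in the text), and the counter is assumed to wrap to 0 after its last value.

```python
class MixedPrecisionController:
    """Behavioral sketch of the sub-group selector for operand B."""

    def __init__(self, a_bits: int, b_bits: int, reuse: int = 1):
        self.groups = a_bits // b_bits  # e.g. 8x2 -> 4 sub-groups (counter 0..3)
        self.reuse = reuse              # MACs issued before the selector advances
        self.sub_vec = 0
        self._issued = 0

    def on_mac(self) -> int:
        """Called when a MAC is identified in ID; returns the sub-group to use."""
        current = self.sub_vec
        self._issued += 1
        if self._issued == self.reuse:
            self._issued = 0
            # Assumption: the counter wraps around after the last sub-group.
            self.sub_vec = (self.sub_vec + 1) % self.groups
        return current

    def write_counter(self, value: int) -> None:
        """Software override of the selector for irregular access patterns."""
        self.sub_vec = value % self.groups
        self._issued = 0


sequential = MixedPrecisionController(8, 2)
seq_trace = [sequential.on_mac() for _ in range(6)]

with_reuse = MixedPrecisionController(8, 4, reuse=2)
reuse_trace = [with_reuse.on_mac() for _ in range(6)]
```

With `reuse=2`, each sub-group of operand B is used for two consecutive MACs before the selector moves on, modeling a kernel that applies the same weights to two different inputs.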

3.2 Vector Lockstep Execution Mode

The inner kernel of intensive workloads can be parallelized on multiple cores that execute the same instruction on different data. On a cluster of general-purpose cores using a MIMD execution model, this translates into a loss of efficiency due to the extra energy needed by each core to fetch its own instructions. To counteract this effect, in the Dustin cluster we introduce support for a novel Vector Lockstep Execution Mode (VLEM), where all cores execute the same instructions cycle-by-cycle. When the cluster is configured in VLEM, core 0 acts as the leader core and the other 15 act as followers. While VLEM is active, the IF stages and private caches of the follower cores are clock-gated to save energy, and only the leader core fetches instructions and forwards them to the follower cores. Fig. 7 provides a high-level overview of the two execution models. To enter or exit VLEM, all cores have to i) synchronize on a barrier, and ii) write to a memory-mapped register.

Fig. 7: Cluster’s diagram for MIMD and VLEM execution models.

It is crucial to make sure that all cores are in sync when entering VLEM and stay in sync during VLEM execution. Load/store operations simultaneously accessing the same TCDM bank are a potential source of desynchronization. In MIMD mode, the TCDM interconnect grants access to the cores in round-robin order, delaying concurrent accesses to the same bank. The core’s memory accesses are carried out using a request/grant handshake: if the grant is not asserted, the core stalls until it arrives. To keep cores aligned in VLEM, this mechanism is extended to hold all grant signals until all memory accesses have been completed. The grants are then released simultaneously, preserving the synchronization.

Fig. 8 shows a simplified example of this behavior with 3 cores. In MIMD mode (Fig. 8a), the 3 cores try to access the same bank simultaneously. All three cores assert their request signals concurrently, but the grants are given sequentially, from core 0 to core 2. As soon as a core receives its grant, it can keep executing the rest of the code; until then, it stalls. In VLEM (Fig. 8b), the starting point is similar, but the grants for cores 0 and 1 are held until core 2 can also be served. As all memory accesses are served simultaneously, the cores remain synced.
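The difference between the two grant policies in the 3-core example can be sketched as follows. This is a simplified model under stated assumptions, not the interconnect logic itself: conflicting accesses are served one per cycle in core order (a stand-in for the round-robin arbiter), and the function names are hypothetical.

```c
#include <assert.h>

#define N_CORES 3  /* matches the 3-core example of Fig. 8 */

/* Cycle in which each core receives its grant when accesses to the same
 * bank are served one per cycle in core order. */
static void mimd_grant_cycles(const int bank[N_CORES], int grant[N_CORES])
{
    for (int i = 0; i < N_CORES; i++) {
        grant[i] = 0;
        for (int j = 0; j < i; j++)
            if (bank[j] == bank[i])
                grant[i]++;  /* wait behind each lower-numbered core on this bank */
    }
}

/* In VLEM every grant is held until the last access has been served. */
static void vlem_grant_cycles(const int bank[N_CORES], int grant[N_CORES])
{
    int last = 0;
    mimd_grant_cycles(bank, grant);
    for (int i = 0; i < N_CORES; i++)
        if (grant[i] > last)
            last = grant[i];
    for (int i = 0; i < N_CORES; i++)
        grant[i] = last;  /* all grants released in the same cycle */
}
```

For three requests on the same bank, the first function yields grant cycles 0, 1 and 2, so the cores drift apart, while the second yields cycle 2 for every core, so they resume together.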

Fig. 8: Request/Grant handshake for memory accesses at the core’s interface. All cases present 3 concurrent memory requests on the same bank. a) MIMD case, sequentially served; b) VLEM case, sequentially served but Grant held by LKS unit for synchronization; c) Same bank and same address in VLEM, which triggers broadcasting and resolution in one cycle; d) Performance overheads for different VLEM optimizations.

Keeping cores always in sync can lead to performance overhead whenever all cores repeatedly access the same bank. In MIMD mode, this is not an issue: the mechanism described in Fig. 8a desynchronizes the cores, which means the overhead is typically only paid once. In VLEM, however, without specific countermeasures, the cores would hit the serialization overhead in all successive accesses, a penalty from 2 to 16 cycles per access depending on the number of conflicts. Fortunately, a common case is that of cores accessing the same word in the same bank, e.g., a pointer to the base of a shared array such as a weight filter utilized by all cores in a DNN. To avoid any overhead in this instance, we employ a hardware broadcasting mechanism in Dustin. It works by snooping the addresses of memory loads from all cores, comparing them, and propagating a single access to memory if they are equal. The value read from memory is broadcast to all the cores, allowing 16 data accesses with one request.
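A rough cost model for one lockstep load makes the two cases concrete. It is a sketch under assumptions that this section does not state: 32 word-interleaved banks with 4-byte words, and a latency equal to the number of requests on the most contended bank. The names bank_of and vlem_load_cycles are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed parameters (not stated in this section): 16 cores, 32 banks,
 * 4-byte words, word-level interleaving. */
enum { NUM_CORES = 16, NUM_BANKS = 32, WORD_BYTES = 4 };

static unsigned bank_of(uint32_t addr)
{
    return (addr / WORD_BYTES) % NUM_BANKS;
}

/* Cycles needed to serve one lockstep load issued by all cores. */
static unsigned vlem_load_cycles(const uint32_t addr[NUM_CORES])
{
    unsigned per_bank[NUM_BANKS] = {0};
    unsigned worst = 0;
    int same_word = 1;

    assert(addr);
    for (int i = 1; i < NUM_CORES; i++)
        if (addr[i] / WORD_BYTES != addr[0] / WORD_BYTES)
            same_word = 0;
    if (same_word)
        return 1;  /* broadcast: one memory access feeds every core */

    for (int i = 0; i < NUM_CORES; i++) {
        unsigned b = bank_of(addr[i]);
        if (++per_bank[b] > worst)
            worst = per_bank[b];
    }
    return worst;  /* the most contended bank sets the latency */
}
```

Under these assumptions, 16 loads of the same word cost one cycle, 16 loads spaced 128 bytes apart all fall on one bank and cost 16 cycles, and 16 loads of consecutive words cost one cycle because they hit different banks.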

Broadcasting does not help with overheads due to conflicts occurring when accessing the same bank at different offsets. In highly arithmetic-intensive kernels, this case typically arises when work is partitioned between cores so that the base addresses for each core’s data set lie on the same bank. In this case, conflicts are typically systematic: as the cores proceed in sync, they consistently access the same bank w.r.t. each other due to word-based bank interleaving. A simple yet effective software-based countermeasure is to allocate data so that the base addresses seen by each core are never aligned in the same bank. Section 4 discusses this problem and the related software countermeasures in detail.

In Fig. 8 d) we report the performance overhead of VLEM, compared with the baseline case of execution in MIMD mode, using a CNN layer as an example. Without any countermeasure, the number of cycles is increased by 73%, translating into very poor performance. Broadcasting reduces the overhead to 50%, which is still not acceptable. By adding ad-hoc data misalignment, so that base addresses are never on the same bank, overhead is down to 3%, which is the price to pay for entering and exiting VLEM since no data conflicts are present anymore.

VLEM can be activated/deactivated by setting a memory-mapped register.

Fig. 9: a) Internal components of the Lockstep Unit that sits between the cores and memory; b) Finite State Machine that governs the behavior of the Grant and Data Synchronizer.

When VLEM is engaged, the lockstep unit (LKS) is activated. Fig. 9 a) shows the main blocks contained in the LKS. When the register is set to 0, the cluster is in MIMD mode, and all blocks in the LKS are bypassed; when it is set to 1, the unit activates the VLEM functionality. In VLEM, all memory request addresses are compared by the Broadcast Unit to implement the broadcasting functionality. If they are the same, the BRDC signal is asserted and forwarded to the Requests Silencer, which silences the requests to memory coming from follower cores and propagates only the one from the leader core. This mechanism is hidden from the cores’ viewpoint; the Grant Synchronizer sends a grant to every core so that the request/grant handshake is preserved, as shown in Fig. 8 c. In the next cycle, the TCDM memory responds with the requested data, and the Data Synchronizer broadcasts it to the cores. Fig. 9 b) shows the finite state machine that governs the behavior of the Grant and Data Synchronizer. When a broadcasting situation is detected, the FSM does not enter the WAIT GRANT state, as there is no need for synchronization: no bank conflicts happen. For other kinds of accesses, two cases can arise for both loads and stores. Case i): all accesses are on different banks, no synchronization is necessary, and control/data signals can pass through the LKS unit. Case ii): one or more bank conflicts are present. In this case, memory accesses are served sequentially, one per cycle, and the FSM enters the WAIT GRANT state. To maintain synchronization, all cores must be served simultaneously, so data received while in the WAIT GRANT state is buffered in the Data Synchronizer block. Once all contentions are resolved, the FSM enters DATA RELEASE, releasing the grant to all cores and, in the next cycle, the received data.

Entering VLEM also has a major impact on the instruction cache subsystem. Specifically, the leader core’s private cache is not touched, but the caches of follower cores terminate any in-flight cache refill and then enter sleep. On exit of VLEM, the follower’s PC will be set to the leader value, and the caches turned on again. Depending on the number of instructions executed during VLEM, switching back to MIMD mode can cause the follower cores to stall immediately due to instruction cache misses.

4 Software stack and programming model

We add dedicated extensions to the PULP GCC compiler and its hardware abstraction layer (HAL) to support the specific features of Dustin (i.e., multiple-precision arithmetic and VLEM). We define a set of intrinsics in the GCC backend for each mixed-precision variant of MAC and dot-product operations. SIMD vectors do not comply with the GNU vector extension but are represented as 32-bit opaque types (i.e., int32_t). This design choice is motivated by the fact that GCC does not handle data formats of less than eight bits, while we need support for 2-bit and 4-bit elements. This approach is transparent from the user perspective; the only limitation is that the programmer does not benefit from static type checking for multi-precision arithmetic. In addition to the compiler backend, we extend the HAL with additional support to facilitate VLEM programming. The baseline HAL provides a set of primitives for core identification, synchronization, and memory allocation.

Core identification is achieved through a function that returns the unique identifier (core_id) of the core; on the Dustin platform, the core_id is an integer value in the range [0, 15]. Programmers can exploit loop-level parallelism by partitioning loop iterations into chunks and distributing these chunks to the executing cores in a round-robin order. This approach requires including core identifiers in the loop control expressions (i.e., initialization, condition check, and increment). Core identification is extensively used in both modes (i.e., MIMD and VLEM). The primary synchronization primitive provided by the HAL is called barrier. This function stops a core until all other cores execute an associated barrier function, enforcing a synchronization point in the program flow. Programmers must explicitly add a call to this function to guarantee data consistency between adjacent code regions. However, the event unit component [glaser2020energy] provides low-overhead synchronization support, enabling power-saving policies for waiting cores. The barrier is the main synchronization mechanism adopted in Dustin during MIMD operation.
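As an illustration of how the core identifier enters the loop control expressions, the following sketch partitions an element-wise vector addition into chunks dealt to the cores in round-robin order. The function name and its parameters are hypothetical; in particular, core_id and num_cores are passed as plain arguments here, whereas on the platform they would be obtained from the HAL.

```c
#include <assert.h>

/* Adds b to a element-wise for the iterations owned by one core.
 * Chunks of `chunk` iterations are dealt to the cores in round-robin order.
 * core_id and num_cores are plain parameters in this sketch. */
static void vec_add_partitioned(int *a, const int *b, int n,
                                int chunk, int core_id, int num_cores)
{
    assert(chunk > 0 && num_cores > 0 && core_id >= 0 && core_id < num_cores);
    /* Initialization, condition and increment all depend on the partitioning. */
    for (int start = core_id * chunk; start < n; start += num_cores * chunk) {
        int end = (start + chunk < n) ? start + chunk : n;
        for (int i = start; i < end; i++)
            a[i] += b[i];
    }
}
```

Calling the function once for each core identifier touches every element exactly once, including when the vector length is not a multiple of the chunk size.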

Finally, the original PULP HAL supports static and dynamic memory allocation. Static allocation requires specifying the data size at compile time through constants (e.g., the sizes specified for arrays must necessarily be constant values). In C++ programs, constexpr expressions can replace constant literals; in any case, their result is constant and computed at compile time. Global variables can be placed in TCDM or L2 memory areas by decorating their declaration with a preprocessor macro (PI_L1 or PI_L2, respectively) mapped on a __attribute__((section(…))) directive. Following C and C++ semantics, local variables are on the stack; as discussed later, the stack memory area is allocated on the L1 memory. Dynamic allocation on the heap memory area is available through HAL primitives using a standard heap allocator based on malloc/free functions. The malloc function requires specifying the size of the memory area to allocate and returns a pointer to void that can be cast into a pointer of any type. The free function de-allocates a memory region, which becomes available for subsequent malloc calls. The HAL provides alternative malloc/free functions to allocate data in L1 or L2 memory.

From the memory access perspective, algorithms can adopt two alternative memory access patterns, strided or indirect. The strided pattern consists of a regular sequence of accesses characterized by an initial address, a distance between adjacent accesses (called stride), and the number of accesses in the sequence. E.g., a sequence can be expressed as A[i], where A is an array variable, and i is a loop induction variable ranging from a lower bound to an upper bound with a constant stride. The indirect pattern is a sequence of accesses where multiple memory requests are required to access each element.

Fig. 10: Element-wise vector addition with chunk size equal to 1 (top) and 64 (bottom). Access to the same memory bank by different cores executing the same instruction results in systematic bank conflicts.

An analysis of the memory access pattern becomes critical when VLEM is active because some cases can induce systematic bank conflicts that are detrimental to performance, as shown in the upper part of Fig. 10. In the case of a strided access pattern, programmers can avoid bank conflicts by coalescing memory accesses at the bank level. This property holds if adjacent cores executing the same instruction access addresses located on contiguous banks. The actual feasibility of this technique does not depend exclusively on the memory access pattern but also on the loop parallelization strategy. Fig. 10 illustrates a code performing element-wise vector addition for two alternative chunk size values at the two extremes of the possible range (i.e., 1 and 64). The access pattern has a unit stride, but the absence of bank conflicts depends on the chunk size. Setting the chunk size to 1 prevents bank conflicts, while the value 64 forces all the accesses onto the same bank. In the general case, the range of chunk sizes that avoids bank conflicts depends on the byte size of the array element (less than or equal to 32 bits), the word size, the number of cores, and the number of banks. The Dustin HAL provides the macros MIN_CHUNK_SIZE and MAX_CHUNCK_SIZE to retrieve the bounds of this range by specifying only the element size, since the other parameters are architecture-dependent.
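The effect of the chunk size on the first lockstep access of each core can be checked with a few lines of arithmetic. The sketch below assumes architectural parameters that this section does not state (16 cores, 32 word-interleaved banks, 4-byte words), 32-bit elements, and an array based at address 0; the function name worst_bank_load is hypothetical and unrelated to the HAL macros.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed parameters (not given in this section): 16 cores, 32 banks,
 * 4-byte words, word-level interleaving, 32-bit array elements. */
enum { NUM_CORES = 16, NUM_BANKS = 32, WORD_BYTES = 4, ELEM_BYTES = 4 };

/* Largest number of cores hitting one bank on the first lockstep access,
 * when core k starts at element k * chunk of an array based at address 0. */
static unsigned worst_bank_load(unsigned chunk)
{
    unsigned per_bank[NUM_BANKS] = {0};
    unsigned worst = 0;

    assert(chunk > 0);
    for (unsigned k = 0; k < NUM_CORES; k++) {
        uint32_t addr = k * chunk * ELEM_BYTES;
        unsigned bank = (addr / WORD_BYTES) % NUM_BANKS;
        if (++per_bank[bank] > worst)
            worst = per_bank[bank];
    }
    return worst;
}
```

Under these assumptions a chunk size of 1 places the 16 cores on 16 different banks, whereas a chunk size of 64 places all of them on the same bank. Since the cores advance in lockstep with unit stride, the same pattern repeats on every following iteration.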

Fig. 11: Code template for an application kernel accessing an intermediate buffer with an expression depending on core_id. This pattern results in systematic bank conflicts when the buffer size is a multiple of the number of banks multiplied by the word size.
Fig. 12: Memory layout of a statically allocated global buffer, highlighting the offsets of the core sub-buffers and the padding areas required to avoid bank conflicts.

Fig. 11 shows a code template characterized by an access pattern that depends on the core_id. This case is typical in libraries that perform operations on intermediate buffers (e.g., PULP-NN [garofalo2020pulp]), and it can generate bank conflicts depending on the start address of each core's buffer region. These conflicts are systematic when the buffer size is a multiple of the number of banks multiplied by the word size; in the general case, conflicts occur if the remainders of the start addresses referred to by each core, taken modulo this quantity, are not all distinct. To avoid this issue, the Dustin HAL provides a function for dynamic memory allocation of intermediate buffers on L1, called l1_misaligned_malloc. This function is a wrapper around the malloc operations that forces the starting address of the allocated memory space onto a different bank for each core. This approach enables each core to allocate its local buffer with a strong guarantee on the absence of bank conflicts in the case of strided access patterns.

Some application domains strictly require static allocation of a single memory buffer shared among the cores, precluding the use of l1_misaligned_malloc. In these cases, the Dustin HAL provides a macro L1_MISALIGN_TOTAL that computes the total amount of contiguous memory required to fulfill requirements and avoid structural bank conflicts. Additional padding space is introduced to move the start address of each core onto a different bank. Given the size in bytes of the sub-buffer used by each core, the macro returns the total size including the padding. In addition, the macro L1_MISALIGN_OFFSET returns the starting offset of the sub-buffer related to a given core_id. Fig. 12 illustrates the memory layout of a statically allocated global buffer. The padding area between adjacent sub-buffers varies with the sub-buffer size; in the worst case it is negligible (i.e., 127 bytes) when considering the actual architectural parameters for Dustin reported in Section 3.
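One possible way to compute such a padded layout is sketched below. It is not the definition of L1_MISALIGN_TOTAL or L1_MISALIGN_OFFSET: the helper names are hypothetical, the parameters (16 cores, 32 banks, 4-byte words) are assumed, and the rule used here, which makes each sub-buffer start exactly one bank after the previous one, is just one choice that satisfies the requirement. The global buffer is assumed to be word-aligned.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed parameters: 16 cores, 32 banks, 4-byte words. */
enum { NUM_CORES = 16, NUM_BANKS = 32, WORD_BYTES = 4 };
enum { BANK_SPAN = NUM_BANKS * WORD_BYTES };  /* bytes before the bank index wraps */

/* Distance between consecutive sub-buffer starts: the smallest value >= size
 * that is congruent to WORD_BYTES modulo BANK_SPAN, so that each sub-buffer
 * starts one bank after the previous one. */
static size_t misalign_pitch(size_t size)
{
    size_t pad = (BANK_SPAN + WORD_BYTES - size % BANK_SPAN) % BANK_SPAN;
    return size + pad;
}

/* Byte offset of the sub-buffer owned by core_id inside the global buffer. */
static size_t misalign_offset(size_t size, unsigned core_id)
{
    assert(core_id < (unsigned)NUM_CORES);
    return core_id * misalign_pitch(size);
}

/* Total bytes to reserve: the last sub-buffer needs no trailing padding. */
static size_t misalign_total(size_t size)
{
    return (NUM_CORES - 1) * misalign_pitch(size) + size;
}
```

With these assumed parameters, a 256-byte sub-buffer gets 4 bytes of padding, so the 16 start addresses fall on 16 consecutive banks. The largest padding this rule can introduce is 127 bytes, which agrees with the worst-case figure quoted above.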

5 Performance and Measurements

Fig. 13: Chip micrograph.
Component                 Perc. [%]   Delta [%]
Cluster                   100         5.66
  Icache                  29.2        0.08
  TCDM Memory             23.3        0.00
  Cluster Interconnect    4.2         1.29
  Cluster Peripherals     3.5         -0.11
  DMA                     2.7         -0.13
  Core Region             2.1         16.66
  Lockstep Unit           0.3         -
Core                      100         17.03
  IF Stage                8.5         3.66
  ID Stage                35.5        8.53
  EX Stage                46.5        28.01
  Load Store Unit         2.6         9.02
  CSR                     6.2         7.13
TABLE I: Detailed area breakdown of Dustin compute cluster

Fig. 13 shows a die photograph of Dustin. The SoC is implemented in TSMC 65nm CMOS technology targeting a clock frequency of 200 MHz in typical operating conditions, within a die size of 10 mm². In the following, we analyze the measurements of Dustin’s cluster, leaving aside the measurements of the SoC subsystem, which is not the focus of this work.

Fig. 14 reports Dustin’s cluster maximum operating frequency and energy per cycle at different supply voltages, ranging from 0.8 V to 1.2 V. The measurements are carried out on the silicon prototype, running matrix multiplication kernels with 8-bit precision operands, a typical high-utilization deep neural network workload. The operating frequency increases with the voltage up to the maximum of 205 MHz, measured at an operating voltage of 1.2 V. In terms of energy, we notice a significant saving when the cluster runs a matrix multiplication kernel in VLEM: about 38% lower energy per cycle compared with the MIMD mode, in all the voltage corners considered. This result on regular kernels like matrix multiplications is achieved thanks to the clock gating applied to the caches and IF stages of the follower cores, which are not used in VLEM, reducing the cluster dynamic power consumption.

Fig. 14: Voltage Sweep vs. Max Freq. vs. Energy/Cycle.
Fig. 15: Power in an inner loop of a CNN layer (MIMD vs. VLEM).

To understand the costs in terms of area and power of our contributions, we implement a baseline version of the cluster, stripped of the VLEM logic and the mixed-precision capabilities, with a full synthesis and place-and-route flow. We compare the two clusters from a physical point of view and from a performance and energy efficiency perspective. Table I reports the comparison between the two clusters in terms of area, detailing the breakdown of the main components of the cluster and the core. The Dustin cluster is 5% bigger than the reference one. The main contributors to this overhead are the cores, where most of the new logic resides. The leader and the follower cores have the same area, since the follower cores still feature IF stages to be used in MIMD mode. The additional wires the leader core uses to broadcast instructions to the other cores (when the cluster is in VLEM) add negligible overhead. Each core accounts for 2.1% of the total area of the Dustin cluster, with an area overhead of about 17% compared to RI5CY. This overhead is distributed as follows: the IF stage of the pipeline accommodates new logic to enable VLEM; the ID stage includes the mixed-precision controller and modifications to the decoder to support the new set of virtual instructions; the EX stage contributes the most (28% overhead) since it features the new extended DOTP unit; the LSU area increases as well, because the TCDM interconnect requires more driving strength to meet the 200 MHz clock frequency constraint of the design, due to the presence of the lockstep and broadcast units along the path from the cores to the TCDM banks; finally, the CSR includes new registers to store the format of SIMD mixed-precision instructions.
If we put this analysis in perspective, 5% of area overhead is an acceptable trade-off considering the significant improvements in terms of performance and energy efficiency that the mixed-precision capabilities of Dustin’s cores and the energy savings of VLEM bring to the baseline cluster.

                          SleepRunner        SamurAI         VEGA            XPULPNN                  Dustin
                          [SLEEPRUNNER]      [SAMURAI]       [VEGA]          [garofalo2020xpulpnn]    (this work)
Technology                CMOS 28nm          CMOS 28nm       CMOS 22nm       CMOS 22nm                CMOS 65nm
Silicon Proven            Yes                Yes             Yes             No                       Yes
Die Area                  0.68 mm²           4.5 mm²         12 mm²          1.05 mm²                 10 mm²
Applications              IoT GP             IoT GP + DNN    IoT GP + DNN    IoT GP + DNN + QNN       IoT GP + DNN + QNN
Cores                     Thumb-2 subset     1x RI5CY        9x RI5CY        8x RI5CY                 16x RI5CY Mixed-Precision Extended
Int Precision (bits)      32                 8,16,32         8,16,32         2,4,8,16,32              2,4,8,16,32 (plus mixed-precision)
Supply Voltage            0.4 - 0.8 V        0.45 - 0.9 V    0.5 - 0.8 V     0.6 - 0.8 V              0.8 - 1.2 V
Max Frequency             80 MHz             350 MHz         450 MHz         400 MHz                  205 MHz
Power Envelope            320 uW             96 mW           49.4 mW         19.3 - 41.6 mW           156 mW

Best Integer Performance
  32b                     31 MOPS            -               -               -                        -
  8b                      -                  1.5 GOPS        15.6 GOPS       23 GOPS                  15 GOPS
  4b                      -                  -               -               43 GOPS                  30 GOPS
  2b                      -                  -               -               72 GOPS                  58 GOPS
  8x4/2b                  -                  -               5.61 GOPS       8.27 GOPS                -
  8x4b                    -                  -               -               -                        16.7 GOPS
  8x2b                    -                  -               -               -                        18.4 GOPS
  4x2b                    -                  -               3.12 GOPS       8.6 GOPS                 33.6 GOPS

Best Integer Efficiency (efficiency @ performance)
  32b                     97 MOPS/mW         -               -               -                        -
                          @ 18.6 MOPS
  8b                      -                  230 GOPS/W      614 GOPS/W      1111 GOPS/W              303 GOPS/W
                                             @ 110 MOPS      @ 7.6 GOPS      @ 11.4 GOPS              @ 4.4 GOPS
  4b                      -                  -               -               2565 GOPS/W              570 GOPS/W
                                                                             @ 21.7 GOPS              @ 8.8 GOPS
  2b                      -                  -               -               3050 GOPS/W              1152 GOPS/W
                                                                             @ 36.2 GOPS              @ 17.3 GOPS
  8x4/2b                  -                  -               220 GOPS/W      400 GOPS/W               -
                                                             @ 2.7 GOPS      @ 4.1 GOPS
  8x4b                    -                  -               -               -                        345 GOPS/W
                                                                                                      @ 5 GOPS
  8x2b                    -                  -               -               -                        379 GOPS/W
                                                                                                      @ 5.5 GOPS
  4x2b                    -                  -               123 GOPS/W      513 GOPS/W               640 GOPS/W
                                                             @ 1.52 GOPS     @ 4.3 GOPS               @ 10 GOPS

Notes: 1) OPs = 1 8-bit (or 4-bit or 2-bit) MAC on MatMul benchmark unless differently specified. 2) Execution on SW programmable Core.

TABLE II: Comparison With State-Of-The-Art

Fig. 16: Comparison in terms of Energy Efficiency of Dustin configured in MIMD and VLEM, running Mixed-precision Matrix Multiplication kernels.

To support the previous statement, Fig. 15 reports the power breakdown of the innermost loop of a QNN convolution layer. The values shown are the results of post-layout simulations running Dustin’s cluster at 50 MHz, at a supply voltage of 1.0 V, in the typical corner (TT, 25C). We run post-layout simulations because such a study would not be possible from silicon measurements: all the components of the cluster share the same power lines, and no fine-grained breakdown can be extracted. In the convolution kernel considered, which follows the data flow presented in [garofalo2020pulp], each core of the cluster processes a different subset of the input feature map over the same set of weights to produce a different subset of the output feature map. When the cluster runs in VLEM, such a layout makes it possible to fully exploit Dustin’s broadcast feature on the QNN weights, significantly reducing the power consumption of the TCDM interconnect, in addition to the savings from the clock-gated caches and IF stages of the follower cores. The pie chart on the right of Fig. 15 shows that most of the power in VLEM is spent on computation (ID-EX), eliminating the overhead of moving the same data back and forth between the TCDM and the cores and of independently fetching the same instructions for all the cores, as happens in MIMD mode.

Fig. 17: The chart compares the execution of mixed-precision convolution kernels running on the baseline 16 cores cluster with the RI5CY core (software mixed-precision kernels) and on Dustin’s cluster in VLEM (featuring the Mixed-precision ISA extensions).

5.1 Benchmarking

To highlight the performance and the energy efficiency of the silicon prototype on QNN workloads, we benchmark heavily quantized and mixed-precision convolution kernels, varying the format of the operands from 2-bit to 16-bit, covering all the relevant operand precision permutation scenarios.

In Fig. 17 we report the performance of the kernels running on Dustin’s cluster, and we compare the results with the baseline cluster described above, featuring the RI5CY cores. In the latter case, sub-byte and mixed-precision kernels are handled purely in software, as shown in [bruschi2020enabling]. We notice that on kernels where only the activations are sub-byte operands, the performance benefits of the hardware support for mixed-precision computation range from 1.9× to 2.8×: the baseline still needs the unpacking functions on RI5CY, but only in a less arithmetic-intensive portion of the kernel. In all other configurations, the mixed-precision ISA extensions provide a significant advantage, ranging from 2× to 7.7× w.r.t. the baseline cluster, where the unpacking operations must be performed even in the innermost loop of the convolution, degrading the performance heavily.

To highlight the energy savings of VLEM on regular computing kernels, we measure the energy consumption with the cluster running the matrix multiplication in two modes: the classic MIMD mode and VLEM (enabled via software). Fig. 16 shows the related efficiency. The execution of linear kernels in VLEM achieves 1.5× better energy efficiency with no performance overhead w.r.t. the default MIMD execution.

5.2 Comparison with the state-of-the-art

Table II shows a comparison with recently published IoT end-nodes and fully programmable clusters. Compared to traditional single-core IoT end nodes [SLEEPRUNNER, SAMURAI], the proposed work delivers significantly better performance and efficiency thanks to the exploitation of parallelism. Compared to similar fully programmable multi-core IoT end-nodes [WOLF, VEGA], implemented in 40nm and 22nm technology nodes, respectively, the proposed SoC delivers similar performance and energy efficiency on an 8-bit format, despite the less scaled technology node used for its implementation. The performance goal is achieved thanks to the larger parallelism of the cluster, which compensates for the less scaled node. Most significantly, if we compare the energy efficiency, we can note that it is also similar thanks to the VLEM execution mode saving up to 38% of the overall power consumption of the cluster.

If we compare Dustin and XPULPNN [garofalo2020xpulpnn] performance in uniform-precision kernels, we note a gap ranging from 20% to 30%. This effect is related to MAC&LOAD instructions: they allow XPULPNN to compute MACs while simultaneously loading the following operand. Nevertheless, executing mixed-precision kernels adds the unpacking/packing overhead, reducing the efficacy of MAC&LOAD instructions. In this scenario, Dustin firmly outperforms XPULPNN and reaches comparable energy efficiency despite the substantial gap between the technologies used for implementation.

6 Conclusion

We presented Dustin, a fully programmable Multiple Instructions Multiple Data (MIMD) cluster integrating 16 RISC-V cores featuring 2b-to-32b bit-precision instruction set architecture (ISA) extensions enabling fine-grain tunable mixed-precision computation. The cluster can be dynamically configured into a Vector Lockstep Execution Mode (VLEM), turning off all IF stages and L1 I$ except one, reducing power consumption on average by 38% with no performance degradation. The cluster, implemented in 65nm CMOS technology, achieves a peak performance of 58 GOPS and a peak efficiency of 1.15 TOPS/W – competitive with IoT end-nodes using much more scaled and expensive technology nodes.