A near-threshold RISC-V core with DSP extensions for scalable IoT Endpoint Devices

08/30/2016
by Michael Gautschi, et al.

Endpoint devices for the Internet of Things not only need to operate within an extremely tight power envelope of a few milliwatts, but also need to be flexible in their computing capabilities, from a few kOPS to GOPS. Near-threshold (NT) operation can achieve higher energy efficiency, and performance scalability can be gained through parallelism. In this paper we describe the design of an open-source RISC-V processor core specifically designed for NT operation in tightly coupled multi-core clusters. We introduce instruction extensions and microarchitectural optimizations to increase the computational density and to minimize the pressure on the shared memory hierarchy. For typical data-intensive sensor processing workloads the proposed core is on average 3.5x faster and 3.2x more energy-efficient, thanks to a smart L0 buffer to reduce cache access contentions and support for compressed instructions. SIMD extensions, such as dot-products, and a built-in L0 storage further reduce the shared memory accesses by 8x, reducing contentions by 3.2x. With four NT-optimized cores, the cluster is operational from 0.6 V to 1.2 V, achieving a peak efficiency of 67 MOPS/mW in a low-cost 65 nm bulk CMOS technology. In a low-power 28 nm FDSOI process a peak efficiency of 193 MOPS/mW (40 MHz, 1 mW) can be achieved.
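To make the idea of a SIMD dot-product extension more concrete, the following is a minimal software sketch, not the core's actual instruction semantics. It assumes four signed 8-bit lanes packed into one 32-bit word and a 32-bit wrapping accumulator. The names `pack4_i8`, `unpack4_i8`, `sdot4_i8`, and `dot_packed` are hypothetical and chosen only for illustration. The sketch shows why packed operands reduce memory traffic: each 32-bit load brings in four elements, so a dot product over N elements needs N/2 word loads instead of 2N element loads. It does not attempt to reproduce the 8x figure quoted above, which also depends on the L0 storage.

```python
import struct


def pack4_i8(lanes):
    """Pack four signed 8-bit integers into one 32-bit word (lane 0 in the low byte)."""
    return struct.unpack("<I", struct.pack("<4b", *lanes))[0]


def unpack4_i8(word):
    """Split a 32-bit word back into four signed 8-bit lanes."""
    return list(struct.unpack("<4b", struct.pack("<I", word & 0xFFFFFFFF)))


def sdot4_i8(acc, a_word, b_word):
    """Hypothetical 4-lane signed dot product with accumulate.

    Models what a single SIMD dot-product instruction could compute:
    acc + a0*b0 + a1*b1 + a2*b2 + a3*b3, wrapped to a signed 32-bit value.
    The wrapping behaviour is an assumption of this sketch.
    """
    a = unpack4_i8(a_word)
    b = unpack4_i8(b_word)
    total = (acc + sum(x * y for x, y in zip(a, b))) & 0xFFFFFFFF
    return total - (1 << 32) if total & 0x80000000 else total


def dot_packed(xs, ys):
    """Dot product over packed words; also counts the 32-bit loads required."""
    if len(xs) != len(ys) or len(xs) % 4 != 0:
        raise ValueError("inputs must have equal length, a multiple of 4")
    acc = 0
    word_loads = 0
    for i in range(0, len(xs), 4):
        a_word = pack4_i8(xs[i:i + 4])  # one 32-bit load in hardware
        b_word = pack4_i8(ys[i:i + 4])  # one 32-bit load in hardware
        acc = sdot4_i8(acc, a_word, b_word)
        word_loads += 2
    return acc, word_loads


if __name__ == "__main__":
    xs = [1, -2, 3, -4, 5, 6, -7, 8]
    ys = [8, 7, 6, 5, -4, -3, 2, 1]
    result, loads = dot_packed(xs, ys)
    print(result, loads)
```

For the eight-element vectors in the usage example, the packed version issues 4 word loads, where an element-by-element loop would issue 16 loads.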


