Neurostream: Scalable and Energy Efficient Deep Learning with Smart Memory Cubes

01/23/2017
by   Erfan Azarkhish, et al.

High-performance computing systems are moving towards 2.5D and 3D memory hierarchies based on High Bandwidth Memory (HBM) and Hybrid Memory Cube (HMC) to mitigate the main memory bottleneck. This trend is also creating new opportunities to revisit near-memory computation. In this paper, we propose a flexible processor-in-memory (PIM) solution for scalable and energy-efficient execution of deep convolutional networks (ConvNets), one of the fastest-growing workloads for servers and high-end embedded systems. Our co-design approach consists of a network of Smart Memory Cubes (SMCs, modular extensions to the standard HMC), each augmented with a many-core PIM platform called NeuroCluster. NeuroClusters have a modular design based on NeuroStream coprocessors (for convolution-intensive computations) and general-purpose RISC-V cores. In addition, a DRAM-friendly tiling mechanism and a scalable computation paradigm are presented to efficiently harness this computational capability with very low programming effort. NeuroCluster occupies only 8% of the total Logic Base (LoB) die area in a standard HMC and achieves an average performance of 240 GFLOPS for complete execution of full-featured state-of-the-art (SoA) ConvNets within a power budget of 2.5 W. Overall, 11 W is consumed in a single SMC device, with 22.5 GFLOPS/W energy efficiency, which is 3.5X better than the best GPU implementations in similar technologies. The minor increase in system-level power and the negligible area increase make our PIM system a cost-effective and energy-efficient solution, easily scalable to 955 GFLOPS with a small network of just four SMCs.
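As a quick consistency check, the headline figures above hang together under simple back-of-the-envelope arithmetic. The sketch below is not the paper's methodology, just the abstract's own numbers recombined; the small gaps come from rounding in the abstract itself.

```python
# Back-of-the-envelope check of the abstract's headline figures.
# All inputs are numbers quoted in the abstract; the derived values
# are simple ratios, not results from the paper's evaluation.

pim_gflops = 240.0   # average ConvNet performance of one NeuroCluster
smc_power_w = 11.0   # total power consumed in a single SMC device

# System-level efficiency: ~21.8 GFLOPS/W, in line with the
# 22.5 GFLOPS/W the abstract reports (3.5X the cited GPU baselines).
efficiency = pim_gflops / smc_power_w

# Scaling to a network of four SMCs: 4 * 240 = 960 GFLOPS ideal,
# versus the 955 GFLOPS reported, i.e. ~99.5% scaling efficiency.
four_smc_ideal = 4 * pim_gflops
scaling_eff = 955.0 / four_smc_ideal

print(round(efficiency, 1), round(scaling_eff, 3))
```

The near-linear four-SMC scaling suggests the tiling mechanism keeps each cube's compute fed from its local DRAM vaults rather than serializing on inter-cube traffic.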


