A Microbenchmark Characterization of the Emu Chick

09/07/2018
by Jeffrey Young, et al.

The Emu Chick is a prototype system designed around the concept of migratory memory-side processing. Rather than transferring large amounts of data across power-hungry, high-latency interconnects, the Emu Chick moves lightweight thread contexts to near-memory cores before the beginning of each memory read. The current prototype hardware uses FPGAs to implement cache-less "Gossamer" cores for doing computational work and a stationary core to run basic operating system functions and migrate threads between nodes. In this multi-node characterization of the Emu Chick, we extend an earlier single-node investigation (Hein, et al. AsHES 2018) of the memory bandwidth characteristics of the system through benchmarks like STREAM, pointer chasing, and sparse matrix-vector multiplication (SpMV). We compare the Emu Chick hardware to architectural simulation and an Intel Xeon-based platform. Our results demonstrate that for many basic operations the Emu Chick can use available memory bandwidth more efficiently than a more traditional, cache-based architecture, although bandwidth usage suffers for computationally intensive workloads like SpMV. Moreover, the Emu Chick provides stable, predictable performance with up to 65% of the peak bandwidth utilization on a random-access pointer chasing benchmark with weak locality.

research
12/03/2018

Programming Strategies for Irregular Algorithms on the Emu Chick

The Emu Chick prototype implements migratory memory-side processing in a...
research
06/23/2017

HourGlass: Predictable Time-based Cache Coherence Protocol for Dual-Critical Multi-Core Systems

We present a hardware mechanism called HourGlass to predictably share da...
research
04/02/2018

Sparse Matrix-Matrix Multiplication on Multilevel Memory Architectures : Algorithms and Experiments

Architectures with multiple classes of memory media are becoming a commo...
research
12/14/2018

Impact of Traditional Sparse Optimizations on a Migratory Thread Architecture

Achieving high performance for sparse applications is challenging due to...
research
11/18/2016

Parallelizing Word2Vec in Multi-Core and Many-Core Architectures

Word2vec is a widely used algorithm for extracting low-dimensional vecto...
research
03/04/2021

ECM modeling and performance tuning of SpMV and Lattice QCD on A64FX

The A64FX CPU is arguably the most powerful Arm-based processor design t...
research
11/21/2022

The AMD Rome Memory Barrier

With the rapid growth of AMD as a competitor in the CPU industry, it is ...
