Massive Data-Centric Parallelism in the Chiplet Era

04/19/2023
by Marcelo Orenes-Vera, et al.

Recent works have introduced task-based parallelization schemes to accelerate graph search and sparse data-structure traversal, with some solutions scaling up to thousands of processing units (PUs) on a single chip. However, parallelizing these memory-intensive workloads across millions of cores requires both a scalable communication scheme and a cost-efficient computing node that makes multi-node systems practical, neither of which has been addressed in previous research. To address these challenges, we propose a task-oriented scalable chiplet architecture for distributed execution (Tascade), a multi-node system design that we evaluate with up to 256 distributed chips, i.e., over a million PUs. We introduce an execution model that scales to this level via proxy regions and selective cascading, which reduce overall communication and improve load balancing. In addition, package-time reconfiguration of our chiplet-based design enables creating chip products that are optimized post-silicon for different target metrics, such as time-to-solution, energy, or cost. We evaluate six applications and four datasets, across several configurations and memory technologies, to provide a detailed analysis of the performance, power, and cost of data-centric execution at massive scale. Our parallelization of Breadth-First Search with RMAT-26 across a million PUs, the largest reported in the literature, reaches 3021 GTEPS.
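The abstract describes proxy regions and selective cascading only at a high level. The sketch below is a purely illustrative interpretation, not the paper's implementation, of the general idea of coalescing task updates destined for a remote region at a local proxy buffer before forwarding them in bulk, so that duplicate visits are filtered locally and fewer cross-chip messages are sent. The function name bfs_with_proxies, the block partitioning of vertices into regions, and the per-level flush policy are assumptions made for this example.

from collections import defaultdict

def bfs_with_proxies(adj, source, num_regions):
    """adj: dict mapping vertex -> list of neighbors (vertices are 0..n-1).
    Returns (levels, cross_region_messages)."""
    n = len(adj)
    region_of = lambda v: v * num_regions // n        # simple block partition of vertices
    levels = {source: 0}
    frontier = defaultdict(set)                       # owning region -> frontier vertices
    frontier[region_of(source)].add(source)
    messages = 0
    depth = 0

    while any(frontier.values()):
        next_frontier = defaultdict(set)
        for region, verts in frontier.items():
            # Proxy buffers: coalesce updates per destination region locally
            # instead of sending one message per discovered neighbor.
            proxy = defaultdict(set)
            for v in verts:
                for u in adj[v]:
                    if u not in levels:               # local duplicate filter
                        proxy[region_of(u)].add(u)
            for dest, batch in proxy.items():
                messages += 1                         # one aggregated flush per destination
                for u in batch:
                    if u not in levels:
                        levels[u] = depth + 1
                        next_frontier[dest].add(u)
        frontier = next_frontier
        depth += 1

    return levels, messages

# Example usage on a tiny graph split across two regions:
# adj = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2]}
# levels, messages = bfs_with_proxies(adj, source=0, num_regions=2)
# levels -> {0: 0, 1: 1, 2: 1, 3: 2}

For scale, GTEPS counts 10^9 traversed edges per second. If RMAT-26 follows the common Graph500 convention of 2^26 vertices with an edge factor of 16 (roughly 2^30 edges; the abstract does not state the edge factor, so this is an assumption), then 3021 GTEPS corresponds to a single BFS completing in roughly 1.07e9 / 3.021e12, or about 0.36 ms.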
