Uber: Utilizing Buffers to Simplify NoCs for Hundreds-Cores

07/26/2016
by   Giorgos Passas, et al.

Approaching ideal wire latency using a network-on-chip (NoC) is an important practical problem for many-core systems, particularly hundreds-cores. Although other researchers have focused on optimizing large meshes, bypassing or speculating router pipelines, or creating more intricate logarithmic topologies, this paper proposes a balanced combination that trades queue buffers for simplicity. Preliminary analysis of nine benchmarks from PARSEC and SPLASH using execution-driven simulation shows that utilization rises from 2% when connecting a single core per mesh port to at least 50% [...] in concentrator and router queues is around 6x higher compared to the ideal latency of just 20 cycles. That is, a 16-port mesh suffices because queueing is the uncommon case for system performance. In this way, the mesh hop count is bounded to three, as load becomes uniform via extended concentration, and ideal latency is approached using conventional four-stage pipelines for the mesh routers together with minor logarithmic edges. A realistic Uber is also detailed, featuring the same performance as a 64-port mesh that employs optimized router pipelines, improving the baseline by 12%. [...] develops techniques to better balance load by tuning the placement of cache blocks, and compares Uber with bufferless routing.
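To give a rough sense of why a smaller mesh shortens paths, the following Python sketch compares hop counts in a plain 64-port (8x8) mesh and a plain 16-port (4x4) mesh. It is an illustration only, not the paper's model: it assumes square meshes, minimal (Manhattan) routing, uniform traffic between distinct ports, and one hop per link traversed, and it leaves out the concentrators and logarithmic edges mentioned in the abstract. The function name `mesh_hop_stats` is hypothetical.

```python
from itertools import product


def mesh_hop_stats(k):
    """Return (average, maximum) hop count between distinct ports of a k x k mesh.

    Assumptions (not taken from the paper): minimal Manhattan routing,
    uniform traffic, and one hop per link traversed.
    """
    nodes = list(product(range(k), repeat=2))
    dists = [
        abs(ax - bx) + abs(ay - by)
        for (ax, ay) in nodes
        for (bx, by) in nodes
        if (ax, ay) != (bx, by)
    ]
    return sum(dists) / len(dists), max(dists)


# Compare a 64-port mesh (8 x 8) with a 16-port mesh (4 x 4).
for k in (8, 4):
    avg, worst = mesh_hop_stats(k)
    print(f"{k * k}-port mesh: average {avg:.2f} hops, worst case {worst}")
```

Under these assumptions the average distance halves, from about 5.33 hops in the 8x8 mesh to about 2.67 in the 4x4 mesh. The abstract's hop bound may count hops differently and relies on the additional logarithmic edges, so the numbers here are not directly comparable to it.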


