Studies on the energy and deep memory behaviour of a cache-oblivious, task-based hyperbolic PDE solver

10/09/2018
by Dominic E. Charrier, et al.

We study the performance behaviour of a seismic simulation using the ExaHyPE engine, with a specific focus on memory characteristics and energy needs. ExaHyPE combines dynamically adaptive mesh refinement (AMR) with an ADER discontinuous Galerkin (ADER-DG) discretization. It is parallelized using tasks, and it is cache efficient. AMR plus ADER-DG yields a highly dynamic task graph comprising both arithmetically expensive tasks and tasks that are sensitive to memory latency. The expensive tasks, and thus the whole code, benefit from AVX vectorization, though we suffer from memory access bursts. Reducing the chip's frequency improves the code's energy-to-solution, yet it does not mitigate the burst effects. The bursts' latency penalty becomes worse once we add Intel Optane technology, increase the core count significantly, or make individual, computationally heavy tasks fall out of close caches. Thread overbooking to hide these latency penalties is counterproductive with non-inclusive caches, as it destroys the code's cache and vectorization character. In cases where memory-intensive and computationally expensive tasks overlap, ExaHyPE's cache-oblivious implementation exploits deep, non-inclusive, heterogeneous memory effectively, as main memory misses arise infrequently and slow down only a few cores. We thus propose that upcoming supercomputing simulation codes with dynamic, inhomogeneous task graphs be actively supported by thread runtimes in intermixing tasks of different compute character, and that future hardware allow codes to downclock the cores running particular task types.
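The closing proposal, that thread runtimes should intermix tasks of different compute character, can be illustrated with a minimal scheduling sketch in C++. The sketch below is a hypothetical illustration rather than ExaHyPE's actual runtime or API; the class name MixedTaskPool and its methods are assumptions introduced here. Each worker thread alternates between a compute-bound and a memory-bound queue, so that memory access bursts from many cores are less likely to coincide.

// Minimal sketch (not ExaHyPE's scheduler): interleave compute-bound and
// memory-bound tasks per worker thread so memory access bursts do not pile up.
// All names are illustrative assumptions.
#include <deque>
#include <functional>
#include <mutex>
#include <thread>
#include <vector>

class MixedTaskPool {
public:
  using Task = std::function<void()>;

  void pushCompute(Task t)     { push(computeTasks_, std::move(t)); }
  void pushMemoryBound(Task t) { push(memoryTasks_, std::move(t)); }

  // Each worker alternates between the two queues: a latency-bound task is
  // followed by an arithmetically heavy one, so memory stalls on one core can
  // be hidden behind the vectorised arithmetic running on the others.
  void run(unsigned numWorkers) {
    std::vector<std::thread> workers;
    for (unsigned i = 0; i < numWorkers; ++i) {
      workers.emplace_back([this] {
        bool preferCompute = true;
        while (true) {
          Task t;
          if (!pop(preferCompute ? computeTasks_ : memoryTasks_, t) &&
              !pop(preferCompute ? memoryTasks_ : computeTasks_, t)) {
            break;  // both queues drained
          }
          t();
          preferCompute = !preferCompute;
        }
      });
    }
    for (auto& w : workers) w.join();
  }

private:
  void push(std::deque<Task>& q, Task t) {
    std::lock_guard<std::mutex> lock(mutex_);
    q.push_back(std::move(t));
  }
  bool pop(std::deque<Task>& q, Task& t) {
    std::lock_guard<std::mutex> lock(mutex_);
    if (q.empty()) return false;
    t = std::move(q.front());
    q.pop_front();
    return true;
  }

  std::deque<Task> computeTasks_, memoryTasks_;
  std::mutex mutex_;
};

In this sketch, a caller would enqueue the AVX-heavy, predictor-style kernels via pushCompute and the latency-bound face or neighbour updates via pushMemoryBound, then call run with one worker per core.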
