Modeling memory bandwidth patterns on NUMA machines with performance counters

06/15/2021
by Daniel Goodman, et al.

Computers used for data analytics are often NUMA systems with multiple sockets per machine, multiple cores per socket, and multiple thread contexts per core. Getting peak performance out of these machines requires the correct number of threads to be placed in the correct positions on the machine. One particularly interesting aspect of the placement of memory and threads is the way it affects the movement of data around the machine, and the increased latency this can introduce to reads and writes. In this paper we describe work on modeling the bandwidth requirements of an application on a NUMA compute node based on the placement of its threads. The model is parameterized by sampling performance counters during two application runs with carefully chosen thread placements. Evaluating the model against thousands of measurements shows a median difference from predictions of 2.34%. This modeling can be used in a number of ways, ranging from performance debugging during development, where the programmer can be alerted to potentially problematic memory access patterns; to systems such as Pandia, which take an application and predict the performance and system load of a proposed thread count and placement; to libraries of data structures, such as Parallel Collections and Smart Arrays, that can abstract memory placement and thread placement issues from the user when parallelizing code.
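To illustrate the kind of placement-driven bandwidth model the abstract describes, here is a minimal sketch. It is not the paper's actual model: the function name, the two per-thread rates (stand-ins for values sampled from performance counters during the two calibration runs), and the simplifying assumption that all data is homed on socket 0 are all hypothetical.

```python
def predict_bandwidth(placement, local_rate, remote_rate, n_sockets=2):
    """Estimate per-socket memory traffic (bytes/s) for a thread placement.

    placement   -- list mapping thread index -> socket id
    local_rate  -- per-thread traffic rate when accessing local memory
    remote_rate -- per-thread traffic rate when accessing remote memory
    Assumes (for illustration only) that all data is homed on socket 0.
    """
    bw = [0.0] * n_sockets
    for socket in placement:
        if socket == 0:
            # Local access: traffic is served entirely by socket 0's memory.
            bw[0] += local_rate
        else:
            # Remote access: socket 0's memory serves the read, and the
            # same traffic also crosses the interconnect at the reader.
            bw[0] += remote_rate
            bw[socket] += remote_rate
    return bw


# Two threads on socket 0 and two on socket 1, with illustrative rates:
print(predict_bandwidth([0, 0, 1, 1], local_rate=5e9, remote_rate=3e9))
```

In a real parameterization the two rates would come from uncore counters (e.g. local versus remote DRAM reads) sampled under the two carefully chosen calibration placements; the sketch only shows how a placement then maps to predicted per-socket load.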


Related research

- Toward Efficient In-memory Data Analytics on NUMA Systems (08/05/2019)
- Bandwidth-Aware Page Placement in NUMA (03/06/2020)
- Safely Abstracting Memory Layouts (01/23/2019)
- Memory Performance of AMD EPYC Rome and Intel Cascade Lake SP Server Processors (04/07/2022)
- RecShard: Statistical Feature-Based Memory Optimization for Industry-Scale Neural Recommendation (01/25/2022)
- Ridgeline: A 2D Roofline Model for Distributed Systems (09/03/2022)
- Learning-based Dynamic Pinning of Parallelized Applications in Many-Core Systems (03/01/2018)
