Accelerating Bandwidth-Bound Deep Learning Inference with Main-Memory Accelerators

11/30/2020
by Benjamin Y. Cho, et al.

DL inference queries play an important role in diverse internet services, and a large fraction of datacenter cycles is spent processing them. In particular, the matrix-matrix multiplication (GEMM) operations of fully-connected MLP layers dominate many inference tasks. We find that, contrary to common assumptions, the GEMM operations in datacenter DL inference are memory-bandwidth bound: (1) strict query latency constraints force small-batch operation, which limits reuse and increases bandwidth demands; and (2) large and colocated models require reading the large weight matrices from main memory, again demanding high bandwidth while offering no reuse opportunities. We demonstrate the large potential of accelerating these small-batch GEMMs with processing in the main CPU memory (PIM). We develop a novel GEMM execution flow and corresponding memory-side address-generation logic that exploits GEMM locality and enables long-running PIM kernels despite the complex address-mapping functions employed by the CPU, which would otherwise destroy locality. Our evaluation of StepStone variants at the channel, device, and within-device PIM levels, along with optimizations that balance parallelism benefits against data-distribution overheads, demonstrates 12× better minimum latency than a CPU and 2.8× greater throughput under strict query latency constraints. End-to-end performance analysis of recent recommendation and language models shows that StepStone PIM outperforms a fast CPU (by up to 16×) and prior main-memory acceleration approaches (by up to 2.4× over the best prior approach).
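To make the bandwidth-bound claim concrete, the sketch below estimates the arithmetic intensity of a fully-connected layer's GEMM and compares it against an illustrative CPU machine balance. This is not from the paper: the matrix sizes, peak-FLOP rate, and memory bandwidth are assumptions chosen only to show how small batch sizes push the operation into the memory-bound region of a roofline model.

    # A minimal roofline-style sketch (not from the paper) that estimates
    # whether a fully-connected layer's GEMM, C[m,n] = A[m,k] @ B[k,n], is
    # compute- or bandwidth-bound. Hardware figures are illustrative only.

    def gemm_arithmetic_intensity(m, k, n, bytes_per_elem=4):
        """FLOPs per byte of DRAM traffic, assuming the m*k weight matrix
        must be streamed from main memory on every query (no cross-query
        cache reuse), which dominates traffic when the batch size n is
        small."""
        flops = 2.0 * m * k * n                                 # multiply-accumulates
        bytes_moved = bytes_per_elem * (m * k + k * n + m * n)  # weights + inputs + outputs
        return flops / bytes_moved

    # Assumed machine balance: ~3 TFLOP/s peak compute over ~100 GB/s DRAM
    # bandwidth gives ~30 FLOP/byte; any GEMM with a lower arithmetic
    # intensity is limited by memory bandwidth, not compute.
    PEAK_FLOPS = 3.0e12
    MEM_BW_BYTES = 100e9
    machine_balance = PEAK_FLOPS / MEM_BW_BYTES

    for batch in (1, 4, 16, 64, 256):
        ai = gemm_arithmetic_intensity(m=4096, k=4096, n=batch)
        verdict = "bandwidth-bound" if ai < machine_balance else "compute-bound"
        print(f"batch={batch:4d}  AI={ai:6.1f} FLOP/byte  -> {verdict}")

Under these assumptions, single-digit batch sizes yield well under 1 FLOP per byte, so runtime is set almost entirely by how fast the weight matrix can be streamed from DRAM; this is exactly the traffic that a main-memory PIM accelerator can serve without crossing the memory bus.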


