TensorDIMM: A Practical Near-Memory Processing Architecture for Embeddings and Tensor Operations in Deep Learning

08/08/2019
by   Youngeun Kwon, et al.

Recent studies from several hyperscalers pinpoint embedding layers as the most memory-intensive deep learning (DL) algorithms deployed in today's datacenters. This paper addresses the memory capacity and bandwidth challenges of embedding layers and their associated tensor operations. We present a vertically integrated hardware/software co-design, which includes a custom DIMM module enhanced with near-data processing cores tailored for DL tensor operations. These custom DIMMs are populated inside a GPU-centric system interconnect as a remote memory pool, which GPUs can utilize for scalable memory bandwidth and capacity expansion. A prototype implementation of our proposal on real DL systems shows an average 6.0-15.7x performance improvement on state-of-the-art recommender systems.
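To make the workload concrete, below is a minimal NumPy sketch (not the paper's implementation) of the embedding "gather-reduce" pattern that dominates recommender-system inference: each sample gathers many rows from a large embedding table and sums them element-wise. The table size, batch shape, and function name are illustrative assumptions; the point is that the operation is almost pure memory traffic, which is why near-memory processing helps.

```python
import numpy as np

def embedding_gather_reduce(table, indices):
    """Gather rows of an embedding table and sum them per sample.

    table:   (num_entries, dim) embedding table; tens of GBs in production
    indices: (batch, lookups_per_sample) sparse feature IDs
    """
    gathered = table[indices]      # (batch, lookups, dim): pure memory reads
    return gathered.sum(axis=1)    # (batch, dim): cheap element-wise reduce

# Toy-sized example (real tables are orders of magnitude larger)
rng = np.random.default_rng(0)
table = rng.random((1000, 64), dtype=np.float32)
ids = rng.integers(0, 1000, size=(8, 20))
out = embedding_gather_reduce(table, ids)
print(out.shape)  # (8, 64)
```

The arithmetic intensity here is very low (one add per element loaded), so throughput is bounded by memory bandwidth rather than compute, matching the bottleneck the paper targets.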


Related research

02/18/2019
Beyond the Memory Wall: A Case for Memory-centric HPC System for Deep Learning
As the models and the datasets to train deep learning (DL) models scale, ...

03/06/2019
Buddy Compression: Enabling Larger Memory for Deep Learning and HPC Workloads on GPUs
GPUs offer orders-of-magnitude higher memory bandwidth than traditional ...

05/10/2022
Training Personalized Recommendation Systems from (GPU) Scratch: Look Forward not Backwards
Personalized recommendation models (RecSys) are one of the most popular ...

04/05/2021
GPU Domain Specialization via Composable On-Package Architecture
As GPUs scale their low precision matrix math throughput to boost deep l...

04/28/2021
Continual Learning Approach for Improving the Data and Computation Mapping in Near-Memory Processing System
The resurgence of near-memory processing (NMP) with the advent of big da...

03/15/2023
MCR-DL: Mix-and-Match Communication Runtime for Deep Learning
In recent years, the training requirements of many state-of-the-art Deep...

04/28/2022
FuncPipe: A Pipelined Serverless Framework for Fast and Cost-efficient Training of Deep Learning Models
Training deep learning (DL) models has become a norm. With the emergence...
