Optimizing Deep Learning Recommender Systems' Training On CPU Cluster Architectures

05/10/2020
by   Dhiraj Kalamkar, et al.

During the last two years, the goal of many researchers has been to squeeze the last bit of performance out of HPC systems for AI tasks. Often this discussion is held in the context of how fast ResNet50 can be trained. Unfortunately, ResNet50 is no longer a representative workload in 2020. Thus, we focus on Recommender Systems, which account for most of the AI cycles in cloud computing centers. More specifically, we focus on Facebook's DLRM benchmark. By enabling it to run on the latest CPU hardware and software tailored for HPC, we are able to achieve more than two orders of magnitude improvement in performance (110x) on a single socket compared to the reference CPU implementation, and high scaling efficiency up to 64 sockets, while fitting ultra-large datasets. This paper discusses the optimization techniques for the various operators in DLRM and which components of the system are stressed by these different operators. The presented techniques are applicable to a broader set of DL workloads that pose the same scaling challenges and characteristics as DLRM.

