SLIDE: In Defense of Smart Algorithms over Hardware Acceleration for Large-Scale Deep Learning Systems

03/07/2019
by Beidi Chen et al.

Deep Learning (DL) algorithms are the central focus of modern machine learning systems. As data volumes keep growing, it has become customary to train large neural networks with hundreds of millions of parameters, with enough capacity to memorize these volumes and obtain state-of-the-art accuracy. To get around the costly computations associated with large models and data, the community is increasingly investing in specialized hardware for model training. However, with the end of Moore's law, there is a limit to such scaling. Progress on the algorithmic front has so far failed to demonstrate a direct advantage over powerful hardware such as NVIDIA V100 GPUs. This paper provides an exception. We propose SLIDE (Sub-LInear Deep learning Engine), which uniquely blends smart randomized algorithms, which drastically reduce computation during both training and inference, with simple multi-core parallelism on a modest CPU. SLIDE is an auspicious illustration of the power of smart randomized algorithms on CPUs in outperforming the best available GPU with an optimized implementation. Our evaluations on large industry-scale datasets, with some large fully connected architectures, show that training with SLIDE on a 44-core CPU is more than 2.7 times faster (2 hours vs. 5.5 hours) than training the same network using TensorFlow on a Tesla V100 at any given accuracy level. We provide code and benchmark scripts for reproducibility.
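
The abstract does not spell out the randomized component; in the full paper it is based on locality-sensitive hashing (LSH): hash tables over the neurons of each layer are queried with the layer's input so that only a small set of colliding ("active") neurons is ever computed. The snippet below is a minimal sketch of that idea, assuming a SimHash-style hash (signed random projections); the class and parameter names (SimHash, SparseLayer, n_tables, n_bits) are illustrative and not taken from the authors' code.

```python
# Minimal sketch of an LSH-sampled sparse layer, in the spirit of SLIDE.
# All names and hyperparameters here are illustrative assumptions.
import numpy as np

class SimHash:
    """Signed random projections: hashes a vector to a small integer bucket id."""
    def __init__(self, dim, n_bits, rng):
        self.planes = rng.standard_normal((n_bits, dim))

    def hash(self, x):
        bits = (self.planes @ x) > 0
        return int("".join("1" if b else "0" for b in bits), 2)

class SparseLayer:
    """Fully connected layer that computes only neurons whose weight vectors
    collide with the input in at least one LSH table."""
    def __init__(self, in_dim, out_dim, n_tables=4, n_bits=6, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.standard_normal((out_dim, in_dim)) * 0.01
        self.b = np.zeros(out_dim)
        self.hashes = [SimHash(in_dim, n_bits, rng) for _ in range(n_tables)]
        # Pre-hash every neuron's weight vector into each table (bucket -> neuron ids).
        self.tables = []
        for h in self.hashes:
            buckets = {}
            for j in range(out_dim):
                buckets.setdefault(h.hash(self.W[j]), []).append(j)
            self.tables.append(buckets)

    def forward(self, x):
        # Query each table with the input; the union of colliding neurons is "active".
        active = set()
        for h, buckets in zip(self.hashes, self.tables):
            active.update(buckets.get(h.hash(x), []))
        active = sorted(active) or list(range(len(self.b)))  # fall back to dense
        # Compute only the active neurons' outputs; the rest are treated as zero.
        out = np.zeros(len(self.b))
        out[active] = self.W[active] @ x + self.b[active]
        return np.maximum(out, 0.0), active  # ReLU on the sparse output

if __name__ == "__main__":
    layer = SparseLayer(in_dim=128, out_dim=1024)
    y, active = layer.forward(np.random.default_rng(1).standard_normal(128))
    print(f"computed {len(active)} of {len(y)} neurons")
```

Because only the active subset of neurons is touched per input, the cost of a forward (and backward) pass scales with the number of collisions rather than with the layer width, which is the source of the speedups reported above.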

research
03/30/2023

BOLT: An Automated Deep Learning Framework for Training and Deploying Large-Scale Neural Networks on Commodity CPU Hardware

Efficient large-scale neural network training and inference on commodity...
research
03/06/2021

Accelerating SLIDE Deep Learning on Modern CPUs: Vectorization, Quantizations, Memory Optimizations, and More

Deep learning implementations on CPUs (Central Processing Units) are gai...
research
04/21/2023

PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel

It is widely acknowledged that large models have the potential to delive...
research
12/04/2018

Auto-tuning TensorFlow Threading Model for CPU Backend

TensorFlow is a popular deep learning framework used by data scientists ...
research
08/07/2017

PowerAI DDL

As deep neural networks become more complex and input datasets grow larg...
research
01/29/2022

Distributed SLIDE: Enabling Training Large Neural Networks on Low Bandwidth and Simple CPU-Clusters via Model Parallelism and Sparsity

More than 70% of these idle compute are cheap CPUs with few cores that ar...
research
05/27/2017

AMPNet: Asynchronous Model-Parallel Training for Dynamic Neural Networks

New types of machine learning hardware in development and entering the m...
