High-Performance Deep Learning via a Single Building Block

06/15/2019
by   Evangelos Georganas, et al.
9

Deep learning (DL) is one of the most prominent branches of machine learning. Due to the immense computational cost of DL workloads, industry and academia have developed DL libraries with highly-specialized kernels for each workload/architecture, leading to numerous, complex code-bases that strive for performance, yet they are hard to maintain and do not generalize. In this work, we introduce the batch-reduce GEMM kernel and show how the most popular DL algorithms can be formulated with this kernel as the basic building-block. Consequently, the DL library-development degenerates to mere (potentially automatic) tuning of loops around this sole optimized kernel. By exploiting our new kernel we implement Recurrent Neural Networks, Convolution Neural Networks and Multilayer Perceptron training and inference primitives in just 3K lines of high-level code. Our primitives outperform vendor-optimized libraries on multi-node CPU clusters, and we also provide proof-of-concept CNN kernels targeting GPUs. Finally, we demonstrate that the batch-reduce GEMM kernel within a tensor compiler yields high-performance CNN primitives, further amplifying the viability of our approach.

READ FULL TEXT
research
04/25/2023

Harnessing Deep Learning and HPC Kernels via High-Level Loop and Tensor Abstractions on CPU Architectures

During the past decade, Deep Learning (DL) algorithms, programming syste...
research
06/02/2020

PolyDL: Polyhedral Optimizations for Creation of High Performance DL primitives

Deep Neural Networks (DNNs) have revolutionized many aspects of our live...
research
04/12/2021

Tensor Processing Primitives: A Programming Abstraction for Efficiency and Portability in Deep Learning Workloads

During the past decade, novel Deep Learning (DL) algorithms/workloads an...
research
09/16/2023

Accelerating In-Browser Deep Learning Inference on Diverse Edge Clients through Just-in-Time Kernel Optimizations

Web applications are increasingly becoming the primary platform for AI s...
research
02/06/2020

PolyScientist: Automatic Loop Transformations Combined with Microkernels for Optimization of Deep Learning Primitives

At the heart of deep learning training and inferencing are computational...
research
11/01/2021

Collage: Automated Integration of Deep Learning Backends

Strong demands for efficient deployment of Deep Learning (DL) applicatio...
research
03/17/2020

Towards High Performance, Portability, and Productivity: Lightweight Augmented Neural Networks for Performance Prediction

Writing high-performance code requires significant expertise in the prog...

Please sign up or login with your details

Forgot password? Click here to reset