Search for Optimal Systolic Arrays: A Comprehensive Automated Exploration Framework and Lessons Learned

11/28/2021
by Jie Wang, et al.

Systolic arrays have been widely used for accelerating HPC and deep learning applications. There is a plethora of previous work on the performance tuning of systolic arrays, but it usually rests on a number of oversimplified assumptions (e.g., only considering divisors for loop tiling, pruning based on off-chip data communication) to reduce the design space. In this paper, we present a comprehensive design space exploration tool named Odyssey for systolic array optimization. Odyssey does not rely on artificial assumptions to limit the design space, and yet it is highly efficient and scalable thanks to a hybrid optimization technique. For example, for a 1024x1024x1024 matrix multiplication, it finds designs that reach 90% of the optimal performance in 5 seconds with a single CPU thread. Moreover, using Odyssey, we unveil and quantify the suboptimality introduced by multiple commonly used oversimplifications in prior studies of systolic array design space exploration. For example, Odyssey results show that limiting loop tiling to divisors leads to a 39% performance loss, and pruning based on off-chip data movement results in a 45% performance loss. Finally, we apply Odyssey to explore the architecture trade-offs for matrix multiplication and convolutional neural networks, providing inspiration for possible optimizations of these two applications.
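To make the "divisors only" restriction concrete, the following minimal sketch compares the number of candidate tile sizes for a single loop of extent 1024 with and without that restriction. It is an illustration under stated assumptions, not Odyssey's implementation: the helper names `divisor_tiles` and `padded_extent` are hypothetical, and the padding model (rounding the loop extent up to a multiple of the tile size) is just one simple way to admit non-divisor tiles.

```python
import math

def divisor_tiles(n):
    """Tile sizes that divide n exactly (the restricted search space)."""
    return [t for t in range(1, n + 1) if n % t == 0]

def padded_extent(n, t):
    """Loop extent after rounding n up to a multiple of tile size t (assumed padding model)."""
    return math.ceil(n / t) * t

n = 1024
divs = divisor_tiles(n)            # divisor-only candidates for one loop
all_tiles = list(range(1, n + 1))  # every integer tile size, padding allowed
print(len(divs), len(all_tiles))

# A non-divisor tile size pays for extra padded iterations.
t = 100
overhead = padded_extent(n, t) / n - 1
print(f"tile {t}: padded extent {padded_extent(n, t)}, overhead {overhead:.1%}")
```

For one loop, the divisor-only space has 11 candidates against 1024 unrestricted ones, and the gap compounds across the three loops of a matrix multiplication. Whether a non-divisor tile pays off depends on how its padding overhead trades against better resource utilization, which is the kind of trade-off a full design space exploration has to evaluate.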


