Tensor Slicing and Optimization for Multicore NPUs

04/06/2023
by Rafael Sousa, et al.

Although code generation for Convolutional Neural Network (CNN) models has been extensively studied, performing efficient data slicing and parallelization for highly-constrained Multicore Neural Processor Units (NPUs) is still a challenging problem. Given the size of convolutions' input/output tensors and the small footprint of NPU on-chip memories, minimizing memory transactions while maximizing parallelism and MAC utilization is central to any effective solution. This paper proposes a TensorFlow XLA/LLVM compiler optimization pass for Multicore NPUs, called Tensor Slicing Optimization (TSO), which: (a) maximizes convolution parallelism and memory usage across NPU cores; and (b) reduces data transfers between host and NPU on-chip memories by using DRAM memory burst time estimates to guide tensor slicing. To evaluate the proposed approach, a set of experiments was performed using the NeuroMorphic Processor (NMP), a multicore NPU containing 32 RISC-V cores extended with novel CNN instructions. Experimental results show that TSO is capable of identifying the best tensor slicing that minimizes execution time for a set of CNN models. Speed-ups of up to 21.7% result when comparing the TSO burst-based technique to a no-burst data slicing approach. To validate the generality of the TSO approach, the algorithm was also ported to the Glow Machine Learning framework. The performance of the models was measured on both the Glow and TensorFlow XLA/LLVM compilers, revealing similar results.
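The core idea of burst-guided slicing can be illustrated with a small sketch. The cost model below (fixed burst latency plus a per-byte streaming cost), the SRAM/core parameters, and all function names are illustrative assumptions for exposition, not the paper's actual TSO pass or the NMP's real timing model.

```python
def burst_transfer_time(num_bytes, burst_bytes=64, burst_latency_ns=40.0,
                        per_byte_ns=0.5):
    """Estimate DRAM transfer time: each burst pays a fixed setup latency,
    then bytes stream at a per-byte cost (hypothetical cost model)."""
    bursts = -(-num_bytes // burst_bytes)  # ceiling division
    return bursts * burst_latency_ns + num_bytes * per_byte_ns

def choose_slice(tensor_height, row_bytes, sram_bytes, num_cores):
    """Pick a slice height (rows per transfer) that fits in on-chip SRAM,
    produces enough slices to occupy all cores, and minimizes the
    estimated total burst transfer time."""
    best = None
    for rows in range(1, tensor_height + 1):
        if rows * row_bytes > sram_bytes:        # slice must fit on-chip
            break
        num_slices = -(-tensor_height // rows)   # transfers needed
        if num_slices < num_cores:               # would under-utilize cores
            continue
        total = num_slices * burst_transfer_time(rows * row_bytes)
        if best is None or total < best[1]:
            best = (rows, total)
    return best  # (rows_per_slice, estimated_ns) or None if infeasible

# Example: a 224-row feature map, 672 bytes per row, 32 KiB SRAM, 32 cores.
print(choose_slice(tensor_height=224, row_bytes=672,
                   sram_bytes=32 * 1024, num_cores=32))
```

The interplay this captures is the one the abstract describes: larger slices amortize per-burst latency but shrink the number of slices available for parallel execution, so the best slicing depends on the DRAM burst estimate rather than on-chip capacity alone.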
