A Highly Configurable Hardware/Software Stack for DNN Inference Acceleration

11/29/2021
by Suvadeep Banerjee, et al.

This work presents an efficient agile design methodology for domain-specific accelerators. We employ feature-by-feature enhancement of a vertical development stack and apply it to the TVM/VTA inference accelerator. We have enhanced the VTA design space and enabled end-to-end support for additional workloads. This has been accomplished by augmenting the VTA micro-architecture and instruction set architecture (ISA), as well as by extending the TVM compilation stack to support a wide range of VTA configurations.

The Chisel-based VTA simulation implementation (tsim) has been enhanced with fully pipelined versions of the ALU and GEMM execution units. In tsim, the memory interface width can now range from 8 to 64 bytes, and instruction field widths have been made more flexible to support larger scratchpads. Two new instructions have been added: an element-wise 8-bit multiplication to support depthwise convolution, and a load with a configurable pad value to support max pooling. Support for more layers and improved double buffering round out the changes.

Fully pipelining the ALU and GEMM units is highly effective: ResNet-18 runs in 4.9x fewer cycles under the default configuration, with minimal area change. Configurations offering a further 11.5x reduction in cycle count, at the cost of 12x greater area, can also be instantiated. We show many points on the area-performance Pareto curve, illustrating the trade-offs among execution unit sizing, memory interface width, and scratchpad sizing. Finally, VTA can now run MobileNet 1.0 and all ResNet layers, including the previously disabled pooling and fully connected layers. The TVM/VTA stack has always offered end-to-end workload evaluation on RTL in minutes; with our modifications, it now supports a much larger number of feasible configurations spanning a wide range of cost versus performance. All of the capabilities described here are available in open-source forks, and a subset has already been upstreamed.
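To make the configuration space concrete, the sketch below writes out a VTA hardware configuration in the style of the stock vta_config.json schema from the open-source TVM/VTA repository. The field values are illustrative defaults, and LOG_BUS_WIDTH as the name of the memory-interface-width knob is an assumption based on the vta-hw sources rather than a detail confirmed by this paper.

import json

# Sketch of a VTA hardware configuration, following the stock
# vta_config.json schema from the open-source TVM/VTA repository.
# Values are illustrative; LOG_BUS_WIDTH is an assumed field name.
vta_config = {
    "TARGET": "tsim",          # Chisel-based cycle-accurate simulator
    "HW_VER": "0.0.2",
    # GEMM core shape: a (2^LOG_BATCH x 2^LOG_BLOCK) input tile is
    # multiplied against a (2^LOG_BLOCK x 2^LOG_BLOCK) weight tile.
    "LOG_BATCH": 0,
    "LOG_BLOCK": 4,            # 16x16 GEMM tile
    # Operand widths (log2 bits): 8-bit inputs/weights, 32-bit accumulators.
    "LOG_INP_WIDTH": 3,
    "LOG_WGT_WIDTH": 3,
    "LOG_ACC_WIDTH": 5,
    # Assumed memory-interface-width knob (log2 bits): 2^6 bits = 8 bytes;
    # a value of 9 would give the 64-byte interface discussed above.
    "LOG_BUS_WIDTH": 6,
    # Scratchpad sizes (log2 bytes); larger scratchpads are what the
    # more flexible instruction field widths enable.
    "LOG_UOP_BUFF_SIZE": 15,   # 32 KiB micro-op buffer
    "LOG_INP_BUFF_SIZE": 15,   # 32 KiB input scratchpad
    "LOG_WGT_BUFF_SIZE": 18,   # 256 KiB weight scratchpad
    "LOG_ACC_BUFF_SIZE": 17,   # 128 KiB accumulator scratchpad
}

with open("vta_config.json", "w") as f:
    json.dump(vta_config, f, indent=2)

The TVM/VTA stack reads this file to parameterize both the compilation flow and the generated hardware, so sweeping fields such as LOG_BLOCK, LOG_BUS_WIDTH, and the LOG_*_BUFF_SIZE entries is, in essence, how points along the area-performance Pareto curve are instantiated.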

