High performance and energy efficient inference for deep learning on ARM processors

05/19/2021
by   Adrián Castelló, et al.

We evolve PyDTNN, a framework for distributed parallel training of Deep Neural Networks (DNNs), into an efficient inference tool for convolutional neural networks. Our optimization process on multicore ARM processors involves several high-level transformations of the original framework, such as the development and integration of Cython routines to exploit thread-level parallelism; the design and development of micro-kernels for the matrix multiplication, vectorized with ARM's NEON intrinsics, that can accommodate layer fusion; and the appropriate selection of several cache configuration parameters tailored to the memory hierarchy of the target ARM processors. Our experiments evaluate both inference throughput (measured in processed images/s) and inference latency (i.e., time-to-response), as well as energy consumption per image, when varying the level of thread parallelism and the processor power modes. The experiments with the new inference engine are reported for the ResNet50 v1.5 model on the ImageNet dataset from the MLPerf suite, using the ARM v8.2 cores in the NVIDIA Jetson AGX Xavier board. These results show superior performance compared with the widely used TFLite from Google, and slightly inferior results when compared with ArmNN, ARM's native library for DNN inference.
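The combination of a cache-blocked matrix multiplication with layer fusion described above can be sketched in plain Python/NumPy. This is only an illustration, not the paper's implementation: the block sizes `MC`, `KC`, `NC` are placeholder values (in the actual engine they are tuned to the L1/L2 cache sizes of the target ARM core), and NumPy's `@` operator stands in for the hand-written NEON micro-kernel. The ReLU activation is applied to each output block right after it is computed, while it is still warm in cache, which is the essence of fusing the activation into the GEMM.

```python
import numpy as np

# Hypothetical cache blocking parameters; in the optimized engine these
# are selected to match the memory hierarchy of the ARM processor.
MC, KC, NC = 64, 64, 128

def microkernel(a_blk, b_blk, c_blk):
    # Inner micro-kernel: c_blk += a_blk @ b_blk, updating C in place.
    # In the real engine this loop nest is vectorized with NEON intrinsics.
    c_blk += a_blk @ b_blk

def blocked_gemm_fused_relu(A, B):
    """Cache-blocked GEMM C = max(A @ B, 0) with the ReLU fused per block,
    avoiding a separate full pass over the output matrix."""
    m, k = A.shape
    _, n = B.shape
    C = np.zeros((m, n))
    for jc in range(0, n, NC):          # partition columns of B/C
        for ic in range(0, m, MC):      # partition rows of A/C
            c_blk = C[ic:ic + MC, jc:jc + NC]
            for pc in range(0, k, KC):  # accumulate over the k dimension
                microkernel(A[ic:ic + MC, pc:pc + KC],
                            B[pc:pc + KC, jc:jc + NC],
                            c_blk)
            # Fused activation: apply ReLU while the block is cache-resident.
            np.maximum(c_blk, 0.0, out=c_blk)
    return C
```

Slicing with `C[ic:ic+MC, jc:jc+NC]` yields a view, so the micro-kernel's in-place update and the per-block ReLU both write directly into `C`; the ragged edge blocks when the dimensions are not multiples of the block sizes are handled automatically by NumPy slicing.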


