High-Throughput CNN Inference on Embedded ARM big.LITTLE Multi-Core Processors

03/14/2019
by Siqi Wang, et al.

IoT edge intelligence requires Convolutional Neural Network (CNN) inference to take place on the edge device itself. The ARM big.LITTLE architecture is at the heart of common commercial edge devices. It comprises single-ISA heterogeneous multi-cores grouped into homogeneous clusters that enable performance and power trade-offs. However, the high communication overhead involved in parallelizing the computation of a convolution kernel across clusters is detrimental to throughput. We present an alternative framework called Pipe-it that employs a pipelined design to split the convolutional layers across clusters while limiting the parallelization of their respective kernels to the assigned clusters. We develop a performance prediction model that, from convolutional layer descriptors, predicts the execution time of each layer individually for every core type and core count. Pipe-it then exploits the predictions to create a balanced pipeline using an efficient design space exploration algorithm. On average, Pipe-it achieves 39% higher throughput than the highest antecedent throughput.
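To make the idea of a balanced pipeline concrete, the following is a minimal sketch and not Pipe-it's actual exploration algorithm. It assumes only two pipeline stages (one big cluster, one little cluster), a contiguous split of the layers, and per-layer time predictions that are already available. The function name `best_two_stage_split` and the millisecond figures are hypothetical, chosen purely for illustration.

```python
from typing import List, Tuple


def best_two_stage_split(big_ms: List[float],
                         little_ms: List[float]) -> Tuple[int, float]:
    """Pick the split that minimizes the slowest pipeline stage.

    Layers [0, split) run on the big cluster and layers [split, n) run on
    the little cluster. Inputs are hypothetical predicted per-layer
    execution times (ms) on each cluster. Returns (split, bottleneck_ms).
    """
    if len(big_ms) != len(little_ms):
        raise ValueError("need one prediction per layer for each cluster")

    n = len(big_ms)
    big_prefix = 0.0                 # time of the big-cluster stage so far
    little_suffix = sum(little_ms)   # time of the little-cluster stage
    best_split, best_bottleneck = 0, float("inf")

    for split in range(n + 1):
        # Steady-state pipeline throughput is set by the slowest stage.
        bottleneck = max(big_prefix, little_suffix)
        if bottleneck < best_bottleneck:
            best_split, best_bottleneck = split, bottleneck
        if split < n:
            # Move layer `split` from the little stage to the big stage.
            big_prefix += big_ms[split]
            little_suffix -= little_ms[split]

    return best_split, best_bottleneck


# Illustrative (made-up) predictions for a four-layer network.
big = [4.0, 6.0, 5.0, 3.0]
little = [9.0, 14.0, 11.0, 7.0]
split, bottleneck = best_two_stage_split(big, little)
print(f"big cluster: layers 0..{split - 1}, bottleneck {bottleneck} ms")
```

With these made-up numbers, running every layer on the big cluster alone takes 18 ms per image, while giving the last layer to the little cluster lowers the bottleneck stage to 15 ms. A real design space would also vary the number of stages and the cores assigned to each stage, which this sketch leaves out.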

10/03/2021

Heterogeneous Dual-Core Overlay Processor for Light-Weight CNNs

Light-weight convolutional neural networks (CNNs) have small complexity ...
04/16/2021

Implementing CNN Layers on the Manticore Cluster-Based Many-Core Architecture

This document presents implementations of fundamental convolutional neur...
12/20/2019

Hurry-up: Scaling Web Search on Big/Little Multi-core Architectures

Heterogeneous multi-core systems such as big/little architectures have b...
11/23/2021

Design of Many-Core Big Little μBrain for Energy-Efficient Embedded Neuromorphic Computing

As spiking-based deep learning inference applications are increasing in ...
08/07/2021

Asymmetry-aware Scalable Locking

The pursuit of power-efficiency is popularizing asymmetric multicore pro...
11/06/2015

Multi-Threaded Dense Linear Algebra Libraries for Low-Power Asymmetric Multicore Processors

Dense linear algebra libraries, such as BLAS and LAPACK, provide a relev...
05/09/2018

Performance evaluation over HW/SW co-design SoC memory transfers for a CNN accelerator

Many FPGAs vendors have recently included embedded processors in their d...