Performance evaluation over HW/SW co-design SoC memory transfers for a CNN accelerator

05/09/2018
by A. Rios-Navarro, et al.

Many FPGA vendors have recently included embedded processors in their devices, like Xilinx with ARM Cortex-A cores, together with programmable logic cells. These devices are known as Programmable System on Chip (PSoC). Their ARM cores (embedded in the processing system, or PS) communicate with the programmable logic cells (PL) using ARM-standard AXI buses. In this paper we analyze the performance of exhaustive data transfers between PS and PL for a Xilinx Zynq FPGA in a real co-design scenario for a Convolutional Neural Network (CNN) accelerator, which processes, in dedicated hardware, a stream of visual information from a neuromorphic visual sensor for classification. On the PS side, a Linux operating system is running, which collects visual events from the neuromorphic sensor into a normalized frame, then transfers these frames to the multi-layer CNN accelerator and reads back the results, using an AXI-DMA bus on a per-layer basis. As these kinds of accelerators try to process information as quickly as possible, data bandwidth becomes critical, and maintaining a well-balanced data throughput rate requires some considerations. We present and evaluate several data partitioning techniques to improve the balance between RX and TX transfers, and two different ways of managing transfers: through a polling routine at the user level of the OS, and through a dedicated interrupt-based kernel-level driver. We demonstrate that for long enough packets, the kernel-level driver solution achieves better timing when computing a CNN classification example. The main advantages of using a kernel-level driver are safer solutions and OS task scheduling, which can manage other processes important to our application, such as frame collection from sensors and their normalization.
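To make the idea of data partitioning concrete, the following is a minimal sketch of splitting one layer's payload into fixed-size chunks before handing them to a DMA engine. It is an illustration only, not the authors' technique: the function name `partition_transfer`, the 10,000-byte payload, and the 4,096-byte chunk size are all hypothetical, and no actual AXI-DMA call is made.

```python
from typing import Iterator, Tuple


def partition_transfer(total_bytes: int, chunk_bytes: int) -> Iterator[Tuple[int, int]]:
    """Yield (offset, length) pairs that cover total_bytes in order.

    Every chunk is chunk_bytes long except possibly the last one,
    which carries the remainder. Hypothetical helper for illustration.
    """
    if total_bytes < 0 or chunk_bytes <= 0:
        raise ValueError("total_bytes must be >= 0 and chunk_bytes must be > 0")
    offset = 0
    while offset < total_bytes:
        length = min(chunk_bytes, total_bytes - offset)
        yield offset, length
        offset += length


# Assumed example sizes: a 10,000-byte payload split into 4,096-byte chunks.
chunks = list(partition_transfer(total_bytes=10_000, chunk_bytes=4096))
print(chunks)  # [(0, 4096), (4096, 4096), (8192, 1808)]
```

With smaller chunks, a transmit of one chunk could in principle be interleaved with a receive of results for the previous one; whether that helps depends on the per-transfer overhead, which is what the polling and interrupt-based approaches in the abstract trade off.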

research
05/17/2019

Dynamic Vision Sensor integration on FPGA-based CNN accelerators for high-speed visual classification

Deep-learning is a cutting edge theory that is being applied to many fie...
research
12/04/2017

NEURAghe: Exploiting CPU-FPGA Synergies for Efficient and Flexible CNN Inference Acceleration on Zynq SoCs

Deep convolutional neural networks (CNNs) obtain outstanding results in ...
research
12/16/2019

A flexible FPGA accelerator for convolutional neural networks

Though CNNs are highly parallel workloads, in the absence of efficient o...
research
06/27/2023

Retrospective: A Scalable Processing-in-Memory Accelerator for Parallel Graph Processing

Our ISCA 2015 paper provides a new programmable processing-in-memory (PI...
research
02/27/2019

FixyNN: Efficient Hardware for Mobile Computer Vision via Transfer Learning

The computational demands of computer vision tasks based on state-of-the...
research
09/12/2019

Exploring the Behavior of Coherent Accelerator Processor Interface (CAPI) on IBM Power8+ Architecture and FlashSystem 900

The Coherent Accelerator Processor Interface (CAPI) is a general term fo...
research
07/06/2016

A configurable accelerator for manycores: the Explicitly Many-Processor Approach

A new approach to designing processor accelerators is presented. A new c...
