In this paper we present esp4ml, a system-level design flow that enables the rapid realization of SoC architectures for embedded machine learning. With esp4ml, SoC designers can integrate at design time many heterogeneous accelerators that can be easily connected at run-time to form various tightly-coupled pipelines (Fig. 1). These accelerator pipelines are reconfigured dynamically (and transparently to the application programmer) to support the particular embedded application that is currently running on top of Linux on the SoC processor.
To realize esp4ml, we embraced the concept of open-source hardware (OSH) [gupta17], in multiple ways. First, our main goal is to simplify the process of designing complete SoCs that can be rapidly prototyped on FPGA boards. The esp4ml users can focus on the design of specific accelerators, which is simplified with high-level synthesis (HLS), while reusing available OSH designs for the main SoC components (e.g. the Ariane risc-v processor core [ariane]). Second, esp4ml is the result of combining two existing OSH projects that have been independently developed: esp and hls4ml.
esp is a platform for developing heterogeneous SoCs that promotes the ideas of platform-based design [carloni_dac16, esp].
hls4ml is a compiler that translates ML models developed with commonly used open-source packages such as Keras and PyTorch into accelerator specifications that can be synthesized with HLS for FPGAs [Duarte_2018, hls4ml]. While originally developed for research in particle physics, hls4ml has broad applicability.
To combine these two projects and reach our main goal (we released the contributions of this paper as part of the ESP project on GitHub [esp]):
We enhanced the esp architecture to support the reconfigurable activation of pipelines of accelerators, by implementing point-to-point (p2p) communication channels among them. This is done by reusing only the preexisting interconnection infrastructure without any overhead, i.e. without any addition of channel queues, routers, or links in the network-on-chip (NoC).
We augmented the esp methodology with an application programming interface (API) that, for a given embedded application and a target SoC architecture, allows the specification of the software part to be accelerated as a simple dataflow of computational kernels.
We developed a runtime system on top of Linux that takes this dataflow and translates it into a pipeline of accelerators that are dynamically configured, managed, and kept synchronized as they access shared data. This is done in a way that is fully transparent to the application programmer.
We enhanced the SoC integration flow of esp by designing new parameterized interface circuits (synthesizable with HLS) that encapsulate accelerators designed for Vivado HLS [vivado_hls], without requiring any modification to their designs. This provides an adapter layer to bridge the ap_fifo protocol from Vivado HLS to the esp accelerator interface so that esp4ml users are only responsible for setting the appropriate parameters for DMA transactions (i.e., transaction length and offset within the virtual address space of the accelerator).
We encapsulated hls4ml into a fully automated design flow that takes an ML application developed with Keras/TensorFlow, together with the reuse-factor parameter that hls4ml uses to control parallelization, and returns an accelerator that can be integrated within a complete SoC. This required no modification to the code generated with the hls4ml compiler.
We demonstrate the successful vertical integration of these contributions by presenting a set of experimental results that we obtained with esp4ml. Specifically, we designed two complete SoC architectures, implemented them on FPGA boards, and used them to run embedded applications, which invoke various pipelines of accelerators for ML and computer vision. Compared to an Intel processor, an ARM processor, and an NVIDIA embedded GPU, energy-efficiency speedups (measured in terms of frames/Joule) are above in some cases. Furthermore, thanks to the efficient p2p-communication mechanisms of esp4ml, the execution of these applications presents a major reduction of the off-chip memory access compared to the corresponding versions that use off-chip memory for inter-accelerator communication, which is normally the most efficient accelerator cache-coherence model for non-trivial workloads with regular memory access pattern [giri_ieeemicro18].
We give a quick overview of the esp and hls4ml projects to provide basic information to read the subsequent sections.
Embedded Scalable Platforms. esp is an open-source research platform for the design of heterogeneous SoCs [carloni_dac16]. The platform combines an architecture and a methodology. The flexible tile-based architecture simplifies the integration of heterogeneous components through a combination of hardware and software sockets. The companion methodology raises the level of abstraction to system-level design by decoupling the system integration from the design and optimization of the various SoC components (accelerators, processors, etc.) [mantovani_aspdac16].
The esp tile-based architecture relies on a multi-plane packet-switched network-on-chip (NoC) as the communication medium for the entire SoC. The interface between a tile and the NoC consists of a wrapper (the hardware part of a socket) that implements the communication mechanisms together with other platform services. For example, the socket of an accelerator tile typically implements: a configurable direct-memory access (DMA) engine, interrupt-request logic, memory-mapped registers, and the register-configuration logic.
In esp, an NoC is a 2D mesh, corresponding to a grid of tiles of configurable size: e.g., Fig. 2 shows an instance of an esp SoC with two processor tiles, one memory tile, one auxiliary tile, and five accelerator tiles. An NoC plane is a set of bi-directional links of configurable width (e.g. 32 or 64 bits) that connect pairs of adjacent tiles in the NoC. The esp architecture allots two full planes of the NoC to the accelerators, which use them to efficiently move long sequences of data between their on-chip local private memories and the off-chip main memory (DRAM). These data exchanges, called either loads or stores depending on their direction, happen via DMA, i.e. without involving the processor cores, which instead typically transfer data at a finer granularity (i.e. one or a few cache lines) [mantovani_cases16]. Note that DMA requests and responses are routed through decoupled NoC planes to prevent deadlock when multiple accelerators and multiple memory tiles are present. In Section IV we show how we leverage this decoupling of the DMA queues to efficiently implement p2p communication for esp4ml.
The esp methodology supports a design flow that leverages SystemC and Cadence Stratus HLS [stratus_hls] for the specification and implementation of an accelerator to be plugged into the accelerator wrapper, as shown in Fig. 2. esp users are responsible for the core functionality of their accelerators and for adapting the template load/store functions provided in the synthesizable SystemC esp library.
HLS4ML. The hls4ml project allows designers to specify ML models and neural-network architectures for a specific task (e.g. image classification) by using common open-source software such as Keras [chollet2017keras], PyTorch [paszke2017automatic], and ONNX [onnx]. A trained ML model to be used for inference is described with a couple of standard-format files: a JSON file for the network topology and an HDF5 file for the model weights and biases. These are the inputs of the hls4ml compiler, which automatically derives a hardware implementation of the corresponding ML accelerator that can be synthesized for FPGAs using HLS tools [nane2015survey]. While hls4ml currently supports only Vivado HLS [vivado_hls], its approach can be extended to other HLS tools [Duarte_2018], possibly targeting ASIC as well.
For an ML accelerator, the trade-offs among latency, initiation interval, and FPGA-resource usage depend on the degree of parallelization of its inference logic. In hls4ml, these can be balanced by setting the reuse factor, which is a single configuration parameter that specifies the number of times a multiplier is used in the computation of a layer of neurons.
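As a rough illustration of this trade-off, the sketch below estimates how many multipliers a fully connected layer would need for a given reuse factor. It assumes a simple model in which the multiplications of a layer are divided evenly among the multipliers. The function name and the layer size are illustrative only, and hls4ml may restrict which reuse factors are valid for a given layer.

```python
import math


def multipliers_needed(n_in: int, n_out: int, reuse_factor: int) -> int:
    """Estimate the multipliers instantiated for one dense layer.

    Assumption: the layer performs n_in * n_out multiplications per
    inference and each multiplier is used reuse_factor times.
    """
    if reuse_factor < 1:
        raise ValueError("reuse_factor must be >= 1")
    return math.ceil(n_in * n_out / reuse_factor)


# Hypothetical 64x32 layer: fully parallel versus increasingly reused.
for rf in (1, 16, 64):
    print(f"reuse factor {rf}: {multipliers_needed(64, 32, rf)} multipliers")
```

A larger reuse factor lowers the multiplier count (and hence FPGA-resource usage) at the cost of more cycles per layer, which is the balance described above.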
III The Proposed Design Flow
Fig. 3 shows the esp4ml flow to design SoCs for embedded ML applications. From the esp project, we adopted the flow to design and integrate accelerators for generic computational kernels (right) and we implemented a new flow to design accelerators for ML applications, which leverages hls4ml (left). Furthermore, we enabled the runtime reconfiguration of the communication among accelerators through a software application (generated from a user-specified dataflow) and a new platform service for reconfigurable p2p communication (implemented in the wrapper of the accelerator tile).
In order to integrate accelerators compiled by hls4ml, we extended the SoC generation flow of esp to host RTL components synthesized with Vivado HLS. We designed a new template wrapper that is split into a source file for Vivado HLS synthesis directives and an RTL adapter for the esp accelerator tile. These template source files are automatically specialized for a particular instance of ML accelerator depending on input and output size as well as on precision and data type (e.g. 16-bit fixed-point).
The portion of the wrapper processed by Vivado HLS implements the control logic to make DMA transaction requests and handles the synchronization between DMA transactions and the computational kernel. Fig. 4 shows the gist of the top-level function: the LOAD function gets and unpacks data from the data read port into local memories; the COMPUTE function calls the computational kernel (e.g. generated from hls4ml); the STORE function packs the data from local memory and pushes them to the data write port. In addition, both LOAD and STORE functions set the appropriate virtual address and length for the current transaction. This information is computed based on the current iteration index of the main loop, the size of the dataset and the size of the local buffers. Some of the parameters needed are set at runtime through configuration registers (e.g. conf_size).
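The address and length computation described above can be sketched behaviorally as follows. This is a software model under stated assumptions, not the wrapper's actual HLS source: sizes are counted in data words, offsets are relative to the start of the accelerator's virtual address space, the output chunk has the same size as the input chunk, and the names `dma_transactions` and `run_accelerator` are hypothetical.

```python
def dma_transactions(dataset_size: int, buffer_size: int):
    """Yield (offset, length) for each iteration of the main loop.

    Assumption: the dataset is processed in chunks that fit the local
    buffer, and the last chunk may be shorter than the others.
    """
    if dataset_size < 0 or buffer_size <= 0:
        raise ValueError("sizes must be non-negative and buffer_size > 0")
    n_iters = -(-dataset_size // buffer_size)  # ceiling division
    for i in range(n_iters):
        offset = i * buffer_size
        length = min(buffer_size, dataset_size - offset)
        yield offset, length


def run_accelerator(data, buffer_size, kernel):
    """Model the LOAD / COMPUTE / STORE sequence on a Python list."""
    result = []
    for offset, length in dma_transactions(len(data), buffer_size):
        local_in = data[offset:offset + length]  # LOAD into local memory
        local_out = kernel(local_in)             # COMPUTE on the chunk
        result.extend(local_out)                 # STORE back in order
    return result


print(list(dma_transactions(10, 4)))
print(run_accelerator(list(range(5)), 2, lambda chunk: [2 * x for x in chunk]))
```

In this model `len(data)` plays the role of a runtime parameter such as conf_size, while `buffer_size` stands in for a design-time constant.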
The RTL portion of the wrapper includes a set of shallow FIFO queues that decouple the control requirements of the FIFO interface in Vivado HLS from the protocol of the accelerator tile in esp. In addition to FIFO queues, the wrapper binds the esp configuration registers to the corresponding signals of the accelerator, such as conf_size in Fig. 4. The list of registers is specified into an XML file for each accelerator following the default esp integration flow.
IV Point-to-Point Communication Services
Section III explains how esp4ml users can specify the accelerators for their target embedded applications. Once these are implemented as RTL intellectual property (IP) blocks, the esp graphical configuration interface can be used to pick the location of each accelerator in the SoC and generate the appropriate hardware wrappers, including routing tables, and Linux device drivers. The esp infrastructure then generates a bitstream for Xilinx FPGAs and a bootable image of Linux that can run on the embedded risc-v processor in the ESP SoC [mantovani_dac16].
The esp design flow, however, used to lack the ability to map the application dataflow onto the user-level software and to dynamically reconfigure the NoC routers to remap DMA transactions onto p2p data transfers among accelerators. Hence, we developed a new p2p platform service for esp architectures that is compatible with the generic accelerator tile wrapper.
First, we defined two additional registers common to all accelerators. The LOCATION_REG is a read-only register that exposes the x-y coordinates of an accelerator on the NoC to the operating system. The P2P_REG is the p2p configuration register, which holds the following information: whether p2p store is enabled, whether p2p load is enabled, the number of source tiles (1 to 4) for the load transactions, and the x-y coordinates of the source tiles. We also modified the esp device driver such that any registered accelerator (discovered when probe is executed) is added to a global linked list protected by a spinlock. This list allows any thread executing the code of an accelerator device driver in kernel mode to access information related to other accelerators. Since this information includes the base address of the configuration registers, a device name, already known in user space, can be mapped to the corresponding x-y coordinates. These coordinates are not exposed to user space and the application dataflow can be specified by simply using the accelerator names. Hence, the application is completely independent from the particular SoC floorplan.
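To make the contents of the P2P_REG concrete, the sketch below packs the four pieces of information listed above into a single integer and resolves source tiles by device name. The bit layout (one bit per enable flag, two bits for the source count, three bits per coordinate), the device names, and the `TILE_COORDS` table are all assumptions made for illustration. The text does not specify the actual register encoding, and in esp4ml the name lookup happens inside the kernel driver rather than in user code.

```python
COORD_BITS = 3  # hypothetical width of each x or y coordinate field


def pack_p2p_reg(store_en: bool, load_en: bool, sources=()) -> int:
    """Pack p2p settings into one register value (hypothetical layout).

    bit 0: p2p store enabled, bit 1: p2p load enabled,
    bits 2-3: number of source tiles minus one,
    then an (x, y) pair of COORD_BITS each per source tile.
    """
    if len(sources) > 4 or (load_en and not sources):
        raise ValueError("p2p load needs between 1 and 4 source tiles")
    reg = int(store_en) | (int(load_en) << 1)
    if sources:
        reg |= (len(sources) - 1) << 2
    shift = 4
    for x, y in sources:
        if not (0 <= x < 2 ** COORD_BITS and 0 <= y < 2 ** COORD_BITS):
            raise ValueError("coordinate out of range")
        reg |= (x << shift) | (y << (shift + COORD_BITS))
        shift += 2 * COORD_BITS
    return reg


# Hypothetical registry standing in for the driver's global list.
TILE_COORDS = {"denoiser.0": (1, 0), "classifier.0": (2, 1)}


def p2p_reg_for(store_en: bool, load_sources) -> int:
    """Build the register value from accelerator names only."""
    coords = [TILE_COORDS[name] for name in load_sources]
    return pack_p2p_reg(store_en, bool(coords), coords)


print(hex(p2p_reg_for(False, ["denoiser.0"])))
```

Because callers pass only names, moving an accelerator to a different tile changes the registry entry but not the application code, which mirrors the floorplan independence described above.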
To support accelerator p2p transactions we made minor modifications to the translation-lookaside buffer (TLB) and the DMA controller in the ESP accelerator tile wrapper [mantovani_cases16]. A key aspect of our implementation is that all p2p transactions are on-demand; that is, they must be initiated by the receiver. The sender accelerator tile waits for a p2p load request before forwarding data to the NoC. Implementing p2p stores on-demand is necessary to prevent long packets of data from being stalled in the NoC links while the accelerator that is downstream in the dataflow is not ready to accept them. For the same reason our solution guarantees the “consumption assumption” [song03] for all supported dataflow configurations. An accelerator tile will only request data when it has enough space to store it locally.
This mechanism is completely transparent to the accelerator, which still operates as if regular DMA transactions were to occur, while performance and energy consumption largely benefit from close-distance communication and a drastic reduction in accesses to DRAM or to the last-level cache.
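The receiver-initiated behavior can be illustrated with a small software model. This is only a sketch of the on-demand idea under simplifying assumptions (a single sender, word-granularity requests, no NoC latency), and the class and method names are hypothetical rather than part of esp4ml.

```python
from collections import deque


class Sender:
    """Upstream tile: holds its output until a load request arrives."""

    def __init__(self, data):
        self.pending = deque(data)

    def on_load_request(self, length: int):
        """Forward at most `length` words, only in response to a request."""
        n = min(length, len(self.pending))
        return [self.pending.popleft() for _ in range(n)]


class Receiver:
    """Downstream tile: requests data only when it has local space."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.buffer = []

    def request_from(self, sender: Sender) -> int:
        free = self.capacity - len(self.buffer)
        if free == 0:
            return 0  # no request is issued, so nothing enters the NoC
        words = sender.on_load_request(free)
        self.buffer.extend(words)
        return len(words)

    def consume(self):
        """Drain the local buffer, as the computational kernel would."""
        out, self.buffer = self.buffer, []
        return out


sender, receiver = Sender(range(5)), Receiver(capacity=2)
while receiver.request_from(sender):
    print(receiver.consume())
```

Since every word sent has a reserved slot at the receiver, data never waits in the network for buffer space, which is the property the consumption assumption requires.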
We built this p2p communication service without adding any NoC planes, nor queues at the NoC interface, because we rely on queues that are otherwise unused for regular DMA transactions. Specifically, we carefully reused available queues in the esp accelerator tile.
V Runtime System for Accelerators
After implementing the p2p service, we developed a software API to hide the details of memory allocation, accelerator invocation, and synchronization from user-space software. Dependencies across accelerators are specified through a simple dataflow. By modifying a template that is automatically generated for the given SoC architecture, the esp4ml users can define a dataflow of accelerator invocations. For each invocation they can specify whether to use DMA or p2p communication and they can set other accelerator-specific communication parameters.
The snippet in Fig. 5 shows an example of an automatically generated application that reads two dataflow configurations from dflow1.h. For each configuration the application spawns as many threads as the number of running accelerators to exploit all the available parallelism in the dataflow. Since accelerators that use the p2p service are automatically synchronized in hardware, the software runtime incurs minimal overhead. This is limited to the ioctl system calls that are used to start the accelerators [mantovani_cases16]. When esp4ml users set the dataflow parameters to use DMA only, dependencies are enforced with pthread primitives. Thanks to our software runtime, esp4ml users can dynamically reshape the data traffic on the NoC to activate a reconfigurable pipeline of accelerators for the given embedded application. In addition, they can tune the throughput of the system by balancing each stage of this pipeline: e.g., if a slow accelerator is feeding a faster one, multiple instances of the slower accelerator can be activated to feed a single accelerator downstream.
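The one-thread-per-accelerator structure can be sketched in software as below. This is an analogy rather than the esp4ml API: plain functions stand in for accelerators, Python queues stand in for the synchronization that the real runtime obtains from pthread primitives or from the p2p hardware, and the names `stage` and `run_pipeline` are hypothetical.

```python
import queue
import threading

_DONE = object()  # sentinel marking the end of the frame stream


def stage(kernel, q_in, q_out):
    """Stand-in for one accelerator: consume, compute, forward."""
    while True:
        frame = q_in.get()
        if frame is _DONE:
            q_out.put(_DONE)
            return
        q_out.put(kernel(frame))


def run_pipeline(kernels, frames):
    """Spawn one thread per stage and stream the frames through them."""
    queues = [queue.Queue() for _ in range(len(kernels) + 1)]
    threads = [
        threading.Thread(target=stage, args=(k, queues[i], queues[i + 1]))
        for i, k in enumerate(kernels)
    ]
    for t in threads:
        t.start()
    for frame in frames:
        queues[0].put(frame)
    queues[0].put(_DONE)
    results = []
    while True:
        item = queues[-1].get()
        if item is _DONE:
            break
        results.append(item)
    for t in threads:
        t.join()
    return results


# Two hypothetical stages applied to three frames.
print(run_pipeline([lambda x: x + 1, lambda x: x * 2], [1, 2, 3]))
```

Each stage starts working on the next frame as soon as it has forwarded the previous one, so the stages overlap in time in the same way the pipelined accelerators do.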
VI Experimental Results
Applications. Street View House Numbers (SVHN) is a real-world image dataset obtained from Google Street View pictures [svhnDataset]. SVHN is similar to the MNIST dataset, but it is ten times bigger (600,000 images split into training, test, and extra-training datasets). For SVHN, the problems become significantly harder due to the environmental noise in the pictures (including shadows and distortions). We developed two embedded applications for the SVHN dataset: digit classification and image denoising. For both, we adopted ML solutions and trained our models in Keras. Recalling the esp4ml flow overview of Fig. 1, the upper part of Fig. 6 shows concrete instances for these two applications.
For the digit classification problem, we defined a Multilayer Perceptron (MLP) with four hidden layers. The size of the fully connected network is 1024x256x128x64x32x10. We used dropout layers with a 0.2 rate to prevent overfitting during training. The trained model accuracy is 92%. For the denoising problem, we designed an autoencoder model. The network size is 1024x256x128x1024, and the compression factor in the bottleneck is 8. We added Gaussian noise to the SVHN dataset and trained the model with a 3.1% reconstruction error.
We also developed one application outside the ML domain, which is a night computer vision application consisting of three kernels: noise filtering, histogram, and histogram equalization. For the purpose of this evaluation, we darkened the SVHN dataset and we used this Night-Vision application as a pre-processing step of the MLP classifier described above.
| | NightVision & | Denoiser & | Multi-tile |
| Frames/s, Intel i7 | 1,858 | 30,435 | 82,476 |
Accelerators and SoCs. We designed two SoCs that we synthesized for FPGA with the esp4ml flow. As shown in Fig. 6, these SoCs contain many (up to ten) accelerators for the target applications and one Ariane risc-v core. Table I shows the FPGA resources usage and the dynamic power dissipation as reported by Xilinx Vivado. We designed the Classifier and the Denoiser with Keras and we compiled them with hls4ml within the esp4ml flow. We then designed a partitioned version of the Classifier, by distributing the computation across five accelerators. Finally, we designed the accelerator for the Night-Vision kernels by leveraging another HLS-based design flow within esp: i.e., we designed them in SystemC and synthesized them with Cadence Stratus HLS.
Experimental Setup. We implemented the two esp4ml SoCs of Fig. 6 on a Xilinx UltraScale+ FPGA board with a clock frequency of 78 MHz. We ran all the experiments by using this board and executing the test embedded applications on top of Linux running on the Ariane core. We compared the execution of these applications on the esp4ml SoC with the hardware accelerators versus the execution of the same applications in software on the following two platforms: (a) an Intel i7 8700K processor and (b) an NVIDIA Jetson TX1 module, which is an embedded system that combines a 256-core NVIDIA Maxwell GPU with a quad-core ARM Cortex-A57 MPCore. Based on the available datasheet, we considered values of power consumption equal to and for the ARM core and the GPU, respectively. For the Intel core, we estimated a TDP of (the nominal value is ).
Results. The three bottom lines of Table I report the performance of the three platforms measured in terms of processed frames per second. The FPGA implementations of the SoCs designed with esp4ml offer better performance compared to a commercial embedded platform like the Jetson TX1. The Intel i7 processor predictably provides the best performance, aside from the case of the Night-Vision application, which is a single-threaded program.
Fig. 7 compares the execution of the applications on the three platforms in terms of energy efficiency, measured as frames/Joule (in logarithmic scale). Notice that all the accelerator execution-time measurements include the overhead of the esp4ml runtime system managing the accelerator invocations as well as the overhead of the accelerators' Linux device drivers. The horizontal blue and red lines show the efficiency of the CPU and GPU, respectively. For the purpose of this comparison, we report the average dynamic power consumption for the two esp4ml SoCs as estimated by Xilinx Vivado for the whole SoC (i.e. not just for the accelerators active in a specific test). This is a conservative assumption, particularly if one considers that the power consumption depends on the choice of the FPGA and that a Xilinx UltraScale+ is a particularly large FPGA. Still, the esp4ml SoCs outperform both the GPU and the CPU across all three applications, yielding in some cases an energy-efficiency gain of over .
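For reference, the frames/Joule metric is simply throughput divided by power, since one watt is one joule per second. The sketch below shows the arithmetic. The throughput and power values in the example are placeholders chosen for easy checking and are not measurements from this work.

```python
def frames_per_joule(frames_per_second: float, power_watts: float) -> float:
    """Energy efficiency: (frames/s) / (J/s) = frames/J."""
    if power_watts <= 0:
        raise ValueError("power must be positive")
    return frames_per_second / power_watts


def efficiency_gain(fps_a, watts_a, fps_b, watts_b) -> float:
    """Ratio of the energy efficiency of platform A over platform B."""
    return frames_per_joule(fps_a, watts_a) / frames_per_joule(fps_b, watts_b)


# Placeholder values only: a slower but low-power platform A can still
# be more energy-efficient than a faster, power-hungry platform B.
print(frames_per_joule(1000.0, 2.0))
print(efficiency_gain(1000.0, 2.0, 4000.0, 80.0))
```

This is why a platform with lower raw throughput can still come out ahead in Fig. 7: the comparison normalizes performance by the power drawn.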
Each cluster of bars in Fig. 7 represents an execution based on a different pipeline of accelerators, with the number of accelerators varying from two to eight. The left bar of each cluster shows results for the case where the accelerators are invoked serially in a single-threaded application. The middle bars (labeled pipe) correspond to concurrent executions in a reconfigurable pipeline, as the accelerators are invoked with a multi-threaded application (one thread per accelerator). The right bar adds the esp4ml p2p communication to this pipeline execution. The results for the Night-Vision and Classifier show that the performance increases significantly when the accelerators work concurrently in a pipeline. While p2p communication does not provide a major gain in performance in this case, its main benefit is the reduction of off-chip memory accesses, which translates into a major energy saving: as shown in Fig. 8, this reduction varies between and for the target applications.
VII Related Work
As efforts in accelerators for ML continue to grow, HLS is recognized as a critical technology to build efficient optimization flows [zhang_2017]. For instance, Hao et al. recently proposed a PYNQ-Z1 based approach to design deep neural network accelerators [hao_2019]. Meanwhile, various optimization techniques to deploy deep neural networks on FPGA have been proposed [wang_2016, zhang_iccad16, zhang_2017, hao_2018]. In this context, hls4ml [Duarte_2018] is being increasingly adopted by research organizations and is raising interest in the industry [xilinx_cern, fastmachinelearning]. To date, however, most open-source projects focus on the design of accelerators in isolation. Instead, we propose the first automated open-source design flow that leverages esp and hls4ml to integrate multi-accelerator pipelines into SoCs. The ESP project initially focused on the integration of generic accelerators specified in SystemC that could operate in a pipeline through shared memory [mantovani_aspdac16]. The esp4ml flow augments ESP with support for accelerators designed with common ML APIs and enables runtime reconfiguration of pipelines with efficient p2p communication.
esp4ml is a complete system-level design flow to implement SoCs for embedded applications that leverage tightly-coupled pipelines of many heterogeneous accelerators. We realized esp4ml by building on the prior efforts of two distinct open-source projects: esp and hls4ml. In particular, we augmented esp with a HW/SW layer that enables the reconfigurable activation of accelerator pipelines through efficient point-to-point communication mechanisms. In addition, we built a library of interface circuits that make it possible, for the first time, to integrate hls4ml accelerators for machine learning into a complete SoC using only open-source hardware components. We demonstrated our work with the FPGA implementations of various SoC instances running computer-vision applications.
This work was supported in part by DARPA (C#: FA8650-18-2-7862) and in part by the National Science Foundation (A#: 1764000). The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of Air Force Research Laboratory (AFRL) and Defense Advanced Research Projects Agency (DARPA) or the U.S. Government.
We thank the developer team of hls4ml. We acknowledge the Fast Machine Learning collective as an open community of multi-domain experts and collaborators. This community was important for the development of this project.