Efficient implementation of DNNs in hardware requires rigorous exploration of the design space on different layers of abstraction, including the algorithmic, architectural, and platform layers, as depicted in Fig. 1. First, the application is defined by a dataset and by requirements in terms of constraints and objectives, e.g., accuracy, latency, etc. At the highest level in the design hierarchy is the algorithm, which is the most abstract description of the data and control flow in the form of a DNN topology. The architecture layer maps the topology to a hardware design, which is implemented on the platform. At the lowest level is the platform, which describes the hardware and its physical properties. All design layers introduce a large number of design choices as well as interdependencies (see Fig. 1).
In our methodology, the architectural layer is represented by a highly parametrizable architecture template that is used to instantiate various DNN topologies in hardware. To include hardware awareness in the algorithmic layer, we formulate hardware models for latency, power, and energy, derived from the architecture template and expressed in terms of topology hyperparameters. The hardware models together with the architecture template form a bridge between the algorithmic and platform layers. However, this one bridge is not enough, since the dependence of the application on the architecture layer cannot be formalized. To this end, we use NAS in the algorithmic layer and augment it with the hardware-awareness models of the architecture layer. The NAS performs a full cross-layer optimization, guided by optimization objectives targeting both application requirements and hardware performance. In summary, the hardware-aware NAS, in strong coupling with the architecture template and its modeling, spans a bridge from the application down to the platform layer, which allows a fully automatic design flow for optimized implementations.
The methodology is implemented in the HALF framework, which comprises a hardware-aware evolutionary NAS and an FPGA implementation framework plus a hardware library. We demonstrate the efficiency of our approach in a case study on energy-efficient FPGA implementations for atrial fibrillation detection in a real-world application scenario, using a dataset provided by the Charité in Berlin. The novel contributions of the paper are:
- A design space exploration methodology that enables cross-layer optimization of DNNs for efficient hardware implementation.
- A framework that automatically produces low-energy, low-power, or high-throughput FPGA solutions.
- A library of parameterizable low-power, ultra-low-latency hardware architectures of DNN layers.
- We demonstrate FPGA implementations for arrhythmia detection targeting different application scenarios, which outperform an embedded GPU with respect to throughput, power, and energy efficiency.
II Related Work
II-A Neural Architecture Search
For the task of anomaly detection in ECG data, [9, 11, 5, 8] demonstrate that 1D-CNNs can be applied to raw ECG samples without any hand-crafted feature extraction. Instead of manually designing DNN topologies for a specific detection task, NAS describes methods which automatically explore the search space spanned by DNN architectures. Among the most successful techniques are gradient-based and evolutionary methods. Recently, there has been less activity around RL-based approaches, as both gradient-based and evolutionary methods can produce similar results in often significantly less time [19, 12]. Nevertheless, RL methods have been explored in combination with hardware-awareness for hardware accelerators and FPGAs. The most prominent gradient-based method is DARTS, which searches for subgraphs in a larger, differentiable supergraph. Hardware-awareness was later introduced to DARTS in the form of latency minimization for mobile phones. A very similar DARTS setup was subsequently applied to FPGAs, with the layer latencies modelled as functions of the topology hyperparameters instead of lookup tables. Although DARTS is very fast, the structure of the manually designed supergraph can impose a strong bias. In contrast, evolutionary algorithms do not require a supergraph. Genetic algorithms use an encoding to describe the structure of the DNNs, which enables biology-inspired concepts like crossover and aging to be implemented. Network morphisms are a class of operators on the graph structure, which change the DNN such that retraining from scratch is not necessary [18, 3]. The LEMONADE algorithm uses network morphisms together with a Bayesian method to generate DNNs in a multi-objective search, although not hardware-aware. Both this and related work distinguish computationally "cheap" and "expensive" objectives, and skip unnecessary "expensive" computations of bad candidates. Later work augmented LEMONADE with hardware-awareness and error resilience.
II-B Automatic Hardware Generation and Hardware Architectures for DNNs
Among the most used frameworks for automatic hardware generation are FINN, the Xilinx ML Suite, which is a toolchain for development on the xDNN general processing engine, and the Deep Neural Network Development Kit (DNNDK), an SDK for the Deep Learning Processor Unit (DPU) programmable engine. However, none of them is a fully automatic approach; rather, each is a collection of tools that helps to convert a DNN into a custom FPGA accelerator, like FINN, or to compile it into a sequence of instructions executed on a programmable engine. All frameworks provide tools for DNN compression and optimization as well as runtime support.
Both the Xilinx ML Suite and DNNDK target DNN execution on programmable engines that are designed to support a wide range of DNN topologies; they trade off per-network customization for generality. In contrast, FINN uses an HLS hardware library of hardware layers and components to generate streaming architectures customized for each network. Other tools for automatic hardware generation are FlexCNN, which integrates an FPGA implementation framework into TensorFlow, and DNNBuilder, which uses software-hardware co-design to perform an end-to-end optimization of deep learning applications.
Our approach is similar to FINN in that it maps DNNs onto a set of highly optimized hardware components. Conceptually, however, the HALF framework differs from all previous approaches, as it includes NAS for hardware-optimized topologies and targets a fully automatic solution.
There are publications on augmenting NAS with hardware awareness for FPGAs. One approach proposed NAS for an accelerator capable of processing CNNs layer-by-layer, similar to xDNN and the DPU; with respect to hardware awareness, it optimizes DNNs for low latency only. Another work proposed a hardware and software co-exploration framework that uses NAS to optimize DNNs for implementation on multiple FPGAs, however without much consideration of optimizations on the level of a single FPGA; that approach is primarily focused on high throughput.
Our framework is different: we use hardware-aware NAS for dataflow-style, fully on-chip architectures customized for each network and optimized for various objectives, namely low latency, low power, and low energy.
III HALF Framework
The HALF framework comprises two main components, the hardware-aware NAS and the FPGA implementation framework (see Fig. 2). The inputs are a dataset and requirements specified in terms of application-level and hardware-level constraints and optimization objectives. The output is a hardware configuration for the selected FPGA platform that fulfills the requirements.
HALF generates the output automatically and therefore significantly accelerates the deployment of DNNs on FPGAs. The NAS takes approximately two days, depending on the complexity of the underlying search space and the dataset, while a manual search would take weeks, even without considering hardware awareness. Including hardware awareness in the NAS shortcuts the otherwise time-consuming manual design and evaluation cycles of different FPGA implementations needed to identify candidates with the best trade-offs. The automatic hardware generation and implementation take only a few hours, in contrast to a manual hardware design process that can take days or even weeks, especially if hardware components have to be designed from scratch.
III-A Neural Architecture Search
The NAS is the first step in the framework and finds optimized topologies for the implementation framework. It is based on an evolutionary algorithm; evolutionary algorithms are very flexible, as they do not impose strong restrictions on the search space or the objective functions, and in particular the latter do not need to be differentiable. We use the genetic algorithm proposed in prior work, which boosts the search via the concept of dormant genes.
For the selection strategy, we use a similar Bayesian method, which explores the Pareto frontier of DNN candidates efficiently in a two-step procedure, preselecting candidates based on computationally inexpensive objectives first. In addition to this two-step procedure, the original method uses network morphisms to increase the throughput of fully evaluated DNNs. We do not use network morphisms, since they would limit the range of mutation operations and would be a bad fit for our genetic encoding. Instead, we handle the large training workload with a dynamic workload scheduler, which leverages parallel processing on HPC systems.
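The two-step selection can be sketched as follows: candidates are first filtered on the cheap objectives alone, and only the survivors are evaluated on the expensive ones. The dominance test, encoding, and function names below are our own illustration, not the exact algorithm of the cited work.

```python
def dominates(a, b):
    """True iff objective vector a Pareto-dominates b (minimization)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def pareto_front(candidates, objectives):
    """Keep every candidate that no other candidate dominates under `objectives`."""
    return [c for c in candidates
            if not any(dominates(objectives(o), objectives(c))
                       for o in candidates if o is not c)]

def two_step_select(population, cheap, expensive):
    """Step 1: preselect on computationally cheap objectives only
    (e.g. modelled latency and power). Step 2: evaluate the expensive
    objectives (e.g. accuracy after training) for the survivors only."""
    survivors = pareto_front(population, cheap)
    return pareto_front(survivors, lambda c: cheap(c) + expensive(c))

# Candidates encoded as (power, latency, error); power and latency are "cheap".
population = [(1, 5, 0.3), (2, 4, 0.2), (3, 3, 0.1), (4, 4, 0.4)]
selected = two_step_select(population, lambda c: c[:2], lambda c: c[2:])
```

Here the last candidate is discarded in step 1, so its expensive objective is never computed.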
Hardware-awareness is incorporated in two ways, i.e., via the search space and via the optimization objectives. The search space is constrained to layers which are included in the hardware library; thus the models found by the NAS are guaranteed to be mappable to the device. This encompasses aspects like layer types and valid hyperparameter combinations, but also the quantization of the inputs, weights, and feature maps. The second dimension of hardware-awareness is introduced by the optimization objectives, described in Section IV. Before passing the found topology with its trained weights to the hardware implementation framework, preprocessing and tuning techniques such as batch-norm folding are applied to further compress the model.
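The mappability constraint on the search space can be illustrated with a minimal sketch; the layer names and hyperparameter ranges below are hypothetical placeholders, not the actual contents of the hardware library.

```python
# Hypothetical hardware-library description: supported layer types and
# valid hyperparameter combinations (placeholders, not the real library).
HW_LIBRARY = {
    "conv1d_dw": {"kernel": {1, 3, 5, 7}, "stride": {1, 2}},
    "maxpool1d": {"kernel": {2, 4},       "stride": {2, 4}},
}

def is_mappable(genotype):
    """A candidate topology stays in the search space only if every layer
    exists in the library with a valid hyperparameter combination."""
    for layer in genotype:
        spec = HW_LIBRARY.get(layer["type"])
        if spec is None:
            return False                       # layer type not implemented
        if layer["kernel"] not in spec["kernel"]:
            return False                       # unsupported kernel size
        if layer["stride"] not in spec["stride"]:
            return False                       # unsupported stride
    return True
```

Mutation and crossover operators would simply reject (or repair) any genotype for which `is_mappable` returns `False`.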
III-B FPGA Implementation Framework
The FPGA implementation framework comprises a hardware generator, a custom hardware library, a profiler, a software library, and a hardware-software implementation step. The hardware generator produces a hardware architecture of the neural network using components from the hardware library described in Section V. Using Xilinx Vivado HLS, the framework generates an IP core from the DNN topology, based on the layers of the hardware library. The model weights are also integrated into the IP core at this point, because the hardware architecture uses only on-chip memory for model storage. Additionally, it instantiates interfaces for communication with external memory and FIFO buffers for connecting the elements. The hardware generator also calculates parallelization factors for each layer, which are based on the required throughput and are mainly constrained by the target platform, i.e., the number of resources, the available memory bandwidth, and the FPGA model. While the quantization of weights and activations is provided by the NAS, the quantization of the internal accumulators is found by profiling: the profiler identifies the optimal range and precision for all accumulators in the hardware and sets the bit widths accordingly. In the last, hardware-software implementation step, the Xilinx Vivado Design Suite is used to generate a bitstream for the FPGA. The software is compiled for the processor cores of the board; it transfers input and output data to the FPGA and controls the IP core.
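The profiler's accumulator-quantization step can be sketched as: record the value range each accumulator takes on calibration data, then derive a fixed-point format that avoids overflow at the desired precision. The function and format convention below are illustrative assumptions, not the actual profiler implementation.

```python
import math

def accumulator_format(observed, frac_bits):
    """Derive a signed fixed-point format (integer_bits, total_bits) that
    covers the observed accumulator range without overflow, while keeping
    `frac_bits` fractional bits for the required precision."""
    peak = max(abs(min(observed)), abs(max(observed)))
    int_bits = max(1, math.ceil(math.log2(peak + 1)))
    return int_bits, 1 + int_bits + frac_bits  # sign + integer + fraction

# Accumulator values recorded on a calibration set (made-up numbers):
fmt = accumulator_format([-37.5, 12.0, 90.2], frac_bits=8)
```

For the example range above, 7 integer bits cover the peak magnitude of 90.2, giving a 16-bit signed accumulator.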
IV Hardware-Aware Objective Functions
We choose energy, power, and latency as the hardware-aware objectives and model them as functions of the topology and FPGA parameters. The latency is defined as

$L_{\text{total}} = \sum_{l=1}^{N} I_l \cdot R_{l-1} + L_N \quad (1)$

where $N$ is the number of layers, $I_l$ is the number of values needed to initially fill the input buffer of layer $l$ (e.g., the kernel size in the case of convolution layers), $R_{l-1}$ is the output rate of the previous layer in clock cycles, and $L_l$ is the latency of layer $l$ to produce its output. Notice that $R_{l-1}$ evaluates recursively and describes the pipelined nature of the hardware architecture. The latencies $L_l$ depend on the layer type and hyperparameters such as strides and kernel size, but also on the loop unrolling factors $u_l$ of the FPGA implementation. In Section VI, the results are reported using the throughput instead of the latency, which includes the contribution of data parallelism and is the batch size $B$ divided by the latency $L_{\text{total}}$.
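As an illustration, the recursive evaluation of the pipelined latency can be sketched in Python. The layer parameters and the max-based propagation of the output rate are simplifying assumptions of ours, not the paper's exact model.

```python
# Sketch of a pipelined latency model. Each layer is described by the
# number of values needed to fill its input buffer (e.g. the kernel size)
# and its per-output latency; all values are hypothetical cycle counts.

def pipeline_latency(layers):
    """layers: list of (fill, lat) tuples in pipeline order, where `fill`
    is the buffer fill count and `lat` the cycles per output."""
    rate = 1          # output rate of the input source, in cycles
    total = 0
    for fill, lat in layers:
        total += fill * rate   # cycles until this layer's buffer is filled
        rate = max(rate, lat)  # the slowest stage so far dominates the rate
    return total + layers[-1][1]   # plus the last layer's own latency

layers = [(3, 2), (3, 4), (1, 1)]  # e.g. two conv layers and a 1x1 layer
cycles = pipeline_latency(layers)
```

The recursion is visible in `rate`: each layer's effective output rate depends on the rates of all layers before it, which is exactly why a single slow stage throttles the whole pipeline.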
We model the effective power so that the energy can be simply described as the product of the runtime and the effective power. The total power consumption is

$P_{\text{total}} = P_{\text{mem}} + P_{\text{periph}} + P_{\text{stat}} + P_{\text{dyn}} \quad (2)$

$P_{\text{mem}}$ is the power from memory transactions and mostly does not depend on the topology, because all the weights and activations are kept on chip in our architecture. The size of the input sample does influence $P_{\text{mem}}$, but since an input must always be read and cannot be optimized away, its contribution is excluded from the power model; a model for $P_{\text{mem}}$ could easily be added to the framework, though. $P_{\text{periph}}$ stems from other peripheral components of the hardware platform; since it cannot be influenced by topology parameters, it is not modeled. $P_{\text{stat}}$ and $P_{\text{dyn}}$ are the static and dynamic power consumption of the architecture, which we model in the effective total power as

$P_{\text{eff}} = \sum_{l=1}^{N} u_l \left( P_{\text{idle},l}\,\frac{T - t_{\text{comp},l}}{T} + P_{\text{comp},l}\,\frac{t_{\text{comp},l}}{T} \right) \quad (3)$
We assume that the power scales linearly with the loop unrolling factors $u_l$. $P_{\text{idle},l}$ and $P_{\text{comp},l}$ are the power consumption of layer $l$ when idling and calculating, respectively, for an unrolling factor of one; both can be estimated with the hardware profiler of the FPGA implementation framework. $t_{\text{comp},l}$ is the time a layer is actively computing, i.e., the total number of outputs the layer produces multiplied by the latency to produce one such output. The power can thus be minimized by using no unrolling (min $u$) and stretching out the total runtime, in compliance with the latency constraints.
The total energy is the product of the effective total power and the total runtime $T$:

$E = P_{\text{eff}} \cdot T \quad (4)$
Looking at Eq. 4, minimizing $P_{\text{eff}}$ appears to be the best strategy to minimize the energy. However, since high unrolling factors $u_l$ reduce both $t_{\text{comp},l}$ and the runtime $T$ superlinearly, it is in fact high unrolling factors that reduce the total energy consumption. Moreover, the energy consumed by the entire platform is the product of $P_{\text{total}}$ and $T$. Since $P_{\text{periph}}$ can be much larger than the other contributions to $P_{\text{total}}$, minimizing $T$ is the most effective way to reduce the measurable energy consumption.
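Under the stated assumptions (power scaling linearly with the unrolling factors, separate idle and compute power per layer), the effective power and energy can be sketched as follows; the per-layer numbers are illustrative, not measured values.

```python
def effective_power(layers, T):
    """Effective power of the pipeline: each layer contributes its compute
    power while active (a fraction t_comp/T of the runtime) and its idle
    power otherwise, both scaled linearly by its unroll factor u."""
    p = 0.0
    for l in layers:
        duty = l["t_comp"] / T            # fraction of time spent computing
        p += l["u"] * (l["P_comp"] * duty + l["P_idle"] * (1.0 - duty))
    return p

def energy(layers, T):
    """Total energy as effective power times total runtime (cf. Eq. 4)."""
    return effective_power(layers, T) * T

# Illustrative per-layer profile (watts and seconds are made-up values):
layers = [
    {"u": 1, "P_idle": 0.1, "P_comp": 0.5, "t_comp": 0.4},
    {"u": 4, "P_idle": 0.1, "P_comp": 0.5, "t_comp": 0.8},
]
```

Playing with `T` in this sketch shows the trade-off from the text: stretching the runtime lowers the duty cycles and hence the effective power, while unrolling raises the power but can shrink the runtime enough to lower the energy.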
V Hardware Library
We present an HLS hardware library of custom hardware architectures for standard 1D-CNNs, depth-wise separable 1D-CNNs, and various other DNN layers and components. The hardware architectures are highly customizable, which allows the implementation of various neural topologies. The hardware library is written as a collection of C++ template functions with HLS annotations and with modularity in mind, making it easily extensible with new layers. The hardware architecture is designed to be low-power and ultra-low-latency. Primarily, this is achieved by keeping all weights and intermediate results in on-chip memory, since off-chip transfers consume more energy and introduce extra latency. External memory is only used to read input data and write results, reducing memory access to the absolute minimum. The architecture is fully pipelined, allowing all layers to operate concurrently and to start computing as soon as their inputs are ready, which reduces latency and energy consumption. The library is based on dataflow architectures, which can be easily customized for each network. The hardware modules are designed with streaming interfaces to facilitate fast design, debugging, interoperability, and ease of integration. Separate hardware modules dedicated to each layer are connected via on-chip data streams in a single top-level module called DNNU, as shown in Fig. 3. The top-level module is equipped with DMA components that allow access to external memory independently of any processor using AXI-Master interfaces.
In a pipelined architecture, there is always a bottleneck stage, which determines the latency of the entire pipeline. The latency of the bottleneck stage can be decreased by spatial parallelism, which we refer to as unrolling (from loop unrolling). The hardware library is designed with parametrizable unrolling, which parallelizes the bottleneck stages efficiently. The parametrization allows coarse-grained parallelization on the level of filters for CNN layers and neurons for fully-connected layers, and fine-grained parallelization on the level of dot products, distinguishing kernel-level and input-channel parallelism.
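One way such bottleneck-driven unrolling could be assigned is a greedy loop: repeatedly find the stage with the worst per-output rate and double its unroll factor until a throughput target or a resource budget is hit. The greedy policy and the crude cost model below are our own illustration, not the library's actual algorithm; latency is assumed to scale inversely with the unroll factor.

```python
def assign_unrolling(latencies, target, budget):
    """latencies: per-output latency of each stage at u=1 (cycles);
    target: required pipeline output rate (cycles per output);
    budget: total number of processing instances available (a crude
    stand-in for FPGA resources). Assumes latency ~ 1/u."""
    unroll = [1] * len(latencies)
    used = len(latencies)              # one instance per stage to start
    while True:
        rates = [lat / u for lat, u in zip(latencies, unroll)]
        worst = max(range(len(rates)), key=lambda i: rates[i])
        if rates[worst] <= target or used + unroll[worst] > budget:
            return unroll, rates[worst]
        used += unroll[worst]          # doubling costs that many instances
        unroll[worst] *= 2

factors, achieved = assign_unrolling([8, 2, 4], target=2, budget=100)
```

Note that only the bottleneck stages receive extra instances; already-fast stages keep an unroll factor of one, which matches the idea of parallelizing the bottleneck efficiently rather than the whole pipeline uniformly.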
VI Results
The effectiveness of our methodology is evaluated on a binary classification task for ECG-based arrhythmia detection. The dataset, provided by the university hospital Charité in Berlin, contains 16000 samples with 2 channels and a length of 60000 each, with equal amounts of positive and negative samples. Task performance is measured as the detection rate and the false alarm rate, for which we define hard acceptance limits of 90 % and 20 %, respectively.
The NAS is performed on four Nvidia Titan X GPUs for 100 generations with 20 children per generation, which takes two days to finish. The search space consists of depth-wise separable convolutions with 60 different hyperparameter configurations and max pooling with 4 different strides. All DNNs end with a global average-pooling layer followed by a fully-connected layer. The depth of the topology is chosen by the NAS but restricted to between 2 and 15 layers (final layers not included). The optimization objectives are power, energy, and latency, each with and without unrolling, and additionally the number of parameters, the detection rate, and the false alarm rate. All objectives are considered simultaneously in the Pareto frontier.
VI-A Influence of NAS Objectives on the Topology
The network topology itself has a high impact on the final performance in terms of power and energy, which we show by comparing models optimized for the three different objectives of low power with minimal parallelization (low P, min $u$), low energy with minimal parallelization (low E, min $u$), and low energy with maximal parallelization (low E, max $u$). The min $u$ case applies whenever the hardware resources are low relative to the size of the DNN model, so that full unrolling is not possible. The case (low P, max $u$) is excluded, since the objective of low power combined with maximum unrolling leads to unreasonable topologies, which by design can hardly be unrolled. The models optimized for high throughput turn out to be the same as those for energy and are thus not explicitly considered here. For the implementation strategy, we set the parallelization factor to either one or the maximum, while keeping the other hardware-related parameters fixed.
| Optimization objective | Implementation $u$ | Power [W] | Energy | Latency |
|---|---|---|---|---|
| low E, max $u$ | min | 4.4 | 4.42 | 1010 |
| low E, min $u$ | min | 5.3 | 4.46 | 841 |
| low P, min $u$ | min | 1.4 | 4.40 | 3120 |
| low E, max $u$ | max | 4.8 | 8.22 | 1.7 |
| low E, min $u$ | max | 3.7 | 7.16 | 2.0 |
| low P, min $u$ | max | 8.3 | 6.10 | 73.8 |
Table I shows that the best results in terms of energy and power are obtained if the optimization objective matches the implementation strategy. This demonstrates the effectiveness of the cross-layer optimization approach, where hardware-related parallelization factors influence the topology search, leading to better solutions.
The three topologies used in Table I have meaningful differences in their structure, as shown in Fig. 4. For the first two models, both optimized for energy, the search converged to very similar solutions: especially the first layers are identical, and the position of the striding layers is the same. A parent-child relation in terms of evolution is ruled out, since we selected the two models from two different experiments. The key difference is the third convolution layer, which becomes the bottleneck in the min $u$ case but not for max $u$. The second model uses a kernel size of one here, which lowers the latency of the entire pipeline significantly, resulting in higher energy efficiency, although it has twice as many parameters. For the case of max $u$, the first model has lower energy consumption because it needs fewer resources; thus more parallel instances can be implemented and the throughput is increased. The third topology in Fig. 4 is less deep than the energy-optimized models; without unrolling, a shallower topology requires fewer hardware resources. Its bottleneck layer is the third convolution, which takes twice as long to compute as the next-slowest layer. This imposes a high idle time on the non-bottleneck layers, which results in lower power consumption, although this model has almost three times as many parameters as the first model.
In summary, it is not the size of the model alone but its structure that determines the hardware performance. The NAS is able to find significant structural features in the DNN models based on the optimization objectives.
VI-B Efficiency of the Holistic Methodology
To demonstrate the efficiency of our holistic approach, we present solutions for three different domains, namely low-power, low-energy, and high-throughput, with optimizations applied on all design levels from the NAS down to the FPGA platform (see Table II). In each case, the target platform was selected according to the optimization goal: Pynq-Z1, Ultra96-V2, and ZCU102 for the respective domains. Additionally, we show results for the low-energy topology implemented on the Nvidia Jetson AGX Xavier embedded GPU, optimized using TensorRT.
For the low-power domain, the NAS searched for a topology that exhibits the lowest power with the unrolling factor constrained to one (low P, min $u$). In turn, the hardware implementation framework instantiated only a single, fully folded instance of the DNNU, implemented at the lowest frequency that still outperforms the real-time requirements. Although the NAS includes separate objectives for low energy and high throughput, we observe that the best model for both cases is the same one, optimized considering the maximal unrolling factor (low E, low L, max $u$). The subsequent hardware implementation step targeted different platforms but used identical strategies: achieve the highest frequency and maximally utilize the available resources by instantiating the maximal number of instances with the highest unrolling factor ($u = 40$). Table II demonstrates that each implementation achieves the best results in its targeted domain. The FPGA designs outperform the embedded GPU implementation in all shown metrics, although the GPU has a higher frequency, a larger batch size, and a model optimized with TensorRT.
VII Conclusion
We present a cross-layer optimization methodology, which allows searching for DNNs optimized for hardware. The methodology is based on a hardware-aware NAS coupled with a parametrizable hardware template. We implemented this approach in an automatic framework and demonstrate its performance by comparing power and energy for DNN models optimized for different objectives. The objectives affect the structure of the generated networks meaningfully, so that the model implementations outperform each other in their respective optimization targets. Additionally, we exploit the full potential of our framework by automatically applying hardware-related optimizations that further tune the model depending on the target platform. Considering every design aspect of the hardware implementation, we target different domains and show significant differences in the hardware metrics for a real-world application scenario of atrial fibrillation detection, outperforming the Nvidia Jetson AGX Xavier in throughput, power, and energy consumption.
This work has been funded by the German Federal Ministry of Education and Research as a participant of the pilot innovation competition "Energy-efficient AI System".
References
- Accelerating DNNs with Xilinx Alveo accelerator cards. https://www.xilinx.com/support/documentation/white_papers/wp504-accel-dnns.pdf
- DNNDK user guide. https://www.xilinx.com/support/documentation/user_guides/ug1327-dnndk-user-guide.pdf
- (2018) Efficient multi-objective neural architecture search via Lamarckian evolution. In International Conference on Learning Representations.
- (2020) Optimizing FPGA-based CNN accelerator using differentiable neural architecture search. In 2020 IEEE 38th International Conference on Computer Design (ICCD), pp. 465–468.
- (2019) Cardiologist-level arrhythmia detection and classification in ambulatory electrocardiograms using a deep neural network. Nature Medicine 25 (1), pp. 65.
- HLS library for hardware acceleration of quantized neural networks using FINN. https://github.com/Xilinx/finn-hlslib
- (2020) Hardware/software co-exploration of neural architectures. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 39 (12), pp. 4805–4815.
- (2018) ECG heartbeat classification: a deep transferable representation. In 2018 IEEE International Conference on Healthcare Informatics (ICHI), pp. 443–444.
- Real-time patient-specific ECG classification by 1-D convolutional neural networks. IEEE Transactions on Biomedical Engineering 63 (3), pp. 664–675.
- (2018) DARTS: differentiable architecture search. In International Conference on Learning Representations.
- (2017) Cardiologist-level arrhythmia detection with convolutional neural networks. arXiv preprint arXiv:1707.01836.
- Regularized evolution for image classifier architecture search. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 4780–4789.
- (2020) Automated design of error-resilient and hardware-efficient deep neural networks. Neural Computing and Applications, pp. 1–19.
- (2020) End-to-end optimization of deep learning applications. FPGA '20, New York, NY, USA, pp. 133–139.
- A genetic programming approach to designing convolutional neural network architectures. In Proceedings of the Genetic and Evolutionary Computation Conference, pp. 497–504.
- (2020) Automatically designing CNN architectures using the genetic algorithm for image classification. IEEE Transactions on Cybernetics.
- FINN: a framework for fast, scalable binarized neural network inference. In Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, pp. 65–74.
- (2018) Deep learning architecture search by neuro-cell-based evolution with function-preserving mutations. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 243–258.
- (2019) FBNet: hardware-aware efficient ConvNet design via differentiable neural architecture search, pp. 10734–10742.
- Xilinx ML Suite overview. https://www.xilinx.com/publications/events/machine-learning-live/colorado/xDNN_ML_Suite.pdf
- (2018) DNNBuilder: an automated tool for building high-performance DNN hardware accelerators for FPGAs. In 2018 IEEE/ACM International Conference on Computer-Aided Design (ICCAD), pp. 1–8.
- (2021) Rethinking co-design of neural architectures and hardware accelerators. arXiv preprint arXiv:2102.08619.
- Zynq DPU v3.2 product guide. https://www.xilinx.com/support/documentation/ip_documentation/dpu/v3_2/pg338-dpu.pdf