
Algorithm and Hardware Co-design for Reconfigurable CNN Accelerator

11/24/2021
by   Hongxiang Fan, et al.
IEEE
UCL

Recent advances in algorithm-hardware co-design for deep neural networks (DNNs) have demonstrated their potential in automatically designing neural architectures and hardware designs. Nevertheless, it remains a challenging optimization problem due to the expensive training cost and the time-consuming hardware implementation, which makes the exploration of the vast design space of neural architectures and hardware designs intractable. In this paper, we demonstrate that our proposed approach is capable of locating designs on the Pareto frontier. This capability is enabled by a novel three-phase co-design framework with the following new features: (a) decoupling DNN training from the design space exploration of hardware architecture and neural architecture, (b) providing a hardware-friendly neural architecture space by considering hardware characteristics in constructing the search cells, and (c) adopting Gaussian processes to predict accuracy, latency and power consumption, avoiding time-consuming synthesis and place-and-route processes. In comparison with the manually-designed ResNet-101, Inception-v2 and MobileNetV2, we can achieve up to 5% higher accuracy. Compared with other state-of-the-art co-design frameworks, our found network and hardware configuration can achieve 2x to 26x lower latency and 8.5x higher energy efficiency.


I Introduction

The success of deep learning, and especially neural networks (NNs), has attracted enormous research and industrial interest in applying NNs to real-life scenarios such as autonomous driving [grigorescu2020survey]. However, the heavy computational and memory demands of running NNs impose a large overhead on their hardware performance, in particular on resource-constrained platforms [fan2018real]. Currently, there are two research directions that focus on improving the hardware performance of deployed NNs. First, algorithm-level design of efficient NNs through neural architecture search (NAS) [zoph2016neural], which automatically designs NN architectures with high accuracy and low computational complexity for different scenarios [wu2019fbnet]. Second, hardware-level efforts to design highly-optimized and specialized hardware accelerators for NNs [fan2019f, liu2021toward, yu2020collaborative]. However, the algorithm-level optimization and the hardware-level design are rarely considered jointly, which can lead to solutions that are sub-optimal in terms of both algorithmic and hardware performance. For example, the authors in [fan2018real] demonstrate that a hardware architecture designed for regular convolution is not suitable for the depthwise convolution commonly used in NAS [wu2019fbnet].

To address the aforementioned sub-optimality, there is a growing demand for a method that can perform NAS to design accurate NNs and, at the same time, co-develop hardware designs customized for the NNs found. To meet this demand, reconfigurable hardware, such as field-programmable gate arrays (FPGAs), represents an ideal platform for algorithm-hardware co-design. Given its reconfigurability, an FPGA can provide highly-optimized hardware customized for the different NNs found by NAS. Previous work has applied evolutionary algorithms [linneural, colangelo2019artificial, colangelo2019evolutionary], reinforcement learning [jiang2019hardware, abdelfattah2020best] and differentiable NAS [li2020edd, fan2020optimizing] to algorithm-hardware co-design for NNs on FPGAs. However, these approaches iteratively perform network training and design space exploration multiple times, making the process time-consuming. Also, the characteristics of the accelerator are only considered during NAS in these works. To address these issues, our contributions include:


  • A novel three-phase co-design framework, which decouples network training from the design space exploration of both the hardware design and the neural architecture to avoid iterative, time-consuming optimization. A hardware-friendly neural architecture space is also proposed by considering the characteristics of the underlying hardware when constructing the search cells before the neural architecture search (Section III);

  • An accurate and efficient cross-entropy loss, latency and energy consumption model based on Gaussian process regression, together with a genetic algorithm, which enable fast design space exploration within a few minutes (Section IV);

  • A demonstration of the effectiveness of the proposed method on the ImageNet dataset. The network found and its custom hardware design lie on the Pareto frontier, and achieve better accuracy, energy efficiency and latency in comparison to other state-of-the-art co-design methods (Section V).

Fig. 1: The overview of the proposed framework.

II Background

II-A Algorithm and Hardware Co-design

The joint design of NNs and hardware has recently become an active research area [hao2019fpga]. Based on an evolutionary algorithm (EA), Lin et al. [linneural] propose a two-stage method for algorithm-hardware co-design. Although their work claims speed-ups and energy savings, their results are based entirely on simulations, and the performance is estimated without running on real hardware. EAs have also been adopted in other NAS methods [colangelo2019artificial, colangelo2019evolutionary]; however, their training cost is high and the generated NNs lack accuracy.

Reinforcement learning (RL) is another approach used for algorithm-hardware co-design [jiang2019hardware]. Nevertheless, a common drawback of these RL-based co-design approaches is that they demand a significant number of GPU hours to search both the algorithm and the hardware-defining parameters, which is infeasible for real-life applications. To reduce the search cost, differentiable NAS (DNAS) has been used for algorithm and hardware co-design [li2020edd]. However, it has been demonstrated in [li2020random] that the NNs found by DNAS only achieve accuracy similar to NNs generated by random search.

The once-for-all (OFA) approach proposed in [cai2019once] provides another paradigm for NAS. Its progressive shrinking algorithm has been demonstrated to be effective for training the supernet. However, that work only optimizes neural architectures without searching for optimal hardware architectures. Also, the neural architecture space in their paper does not consider the characteristics of the underlying hardware design, which leads to sub-optimal hardware performance. Compared with their work, we achieve both higher accuracy and better hardware performance, as demonstrated in Section V. Although [lin2021naas] searches the accelerator architecture, it only focuses on processing engine (PE) connectivity and compiler mappings for an ASIC design.

II-B Gaussian Process

A Gaussian process (GP) is a model built on Bayesian probability theory which can embody prior knowledge in the predictive model and can be used for regression of real-valued non-linear targets [rasmussen2010gaussian]. A GP is specified by a mean function and a covariance function (kernel). Common choices of kernel include polynomial, Gaussian and Matérn kernels [rasmussen2010gaussian]. The mean function represents the assumed average of the estimated data, while the kernel computes correlations between inputs and encapsulates the structure of the hypothesised function. GPs allow fast estimation, which is especially useful in design space exploration [ferianc2020improving]. However, [ferianc2020improving] only explored latency estimation for a single layer, while this paper adopts GPs to estimate the loss, latency and energy consumption of the whole NN.
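
For reference, the standard GP regression equations behind such estimators (textbook material, not specific to this paper's models): given training inputs $X$, noisy targets $y$ with noise variance $\sigma_n^2$, mean function $m(\cdot)$ and kernel $k(\cdot,\cdot)$, the prediction at a new input $x_*$ is Gaussian with

$\mu(x_*) = m(x_*) + k(x_*, X)\,\big[K(X,X) + \sigma_n^2 I\big]^{-1}\big(y - m(X)\big)$

$\sigma^2(x_*) = k(x_*, x_*) - k(x_*, X)\,\big[K(X,X) + \sigma_n^2 I\big]^{-1} k(X, x_*)$

where $K(X,X)$ is the kernel matrix over the training inputs. The predictive mean serves as the point estimate (of loss, latency or energy) and the predictive variance quantifies its uncertainty.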

III Algorithm-Hardware Co-design

The problem of algorithm-hardware co-design can be defined as follows:

$\min_{a \in \mathcal{A},\ h \in \mathcal{H},\ w_a} \ \mathcal{L}(a, h, w_a)$   (1)

Here, $\mathcal{A}$ denotes the NN architecture space and $\mathcal{H}$ represents the hardware design space. To minimize the loss $\mathcal{L}$, we aim to find the optimal hardware configuration $h \in \mathcal{H}$ and NN architecture $a \in \mathcal{A}$ with the associated weights $w_a$.

In this paper, we decouple the training of the weights and the optimization of the architecture $a$ and hardware design $h$ into two separate steps. First, we train a supernet, encompassing all our NN architecture options, with respect to its shared weights $W$ using the following objective function based on the cross-entropy (CE) loss:

$W^{*} = \arg\min_{W} \ \mathbb{E}_{a \sim \mathcal{A}} \big[\, \mathcal{L}_{CE}(a, W_a) \,\big]$   (2)

During this process, we randomly sample sub-NNs from the supernet and independently train each sampled network to minimize the overall loss. Then, once the training is finished, we perform the optimization with respect to the architecture $a$ and hardware design $h$ using the overall objective function containing both the CE loss and the hardware costs:

$\min_{a \in \mathcal{A},\ h \in \mathcal{H}} \ \mathcal{L}_{CE}(a, W^{*}_{a}) + \mathcal{L}_{HW}(a, h)$   (3)

To solve (2) and (3), a novel algorithm-hardware co-design framework is proposed, illustrated in Figure 1. To make the framework applicable to any reconfigurable hardware system, we generalize it into three phases: 1) Specify and Train, 2) Modeling and 3) Exploration. Note that the first and second phases are only required once, while the Exploration phase is performed quickly for each specific deployment scenario, which makes our framework efficient.

Phase 1: Specify and Train  To define the hardware design space for exploration, the user first specifies a reconfigurable hardware system to accelerate NNs. Then, the neural architecture search space is built based on the operations supported by the underlying hardware system. The neural architecture search space typically covers different algorithmic configurations, orderings and connections between operations inside the NNs [zoph2016neural]. Note that our framework does not apply any restrictions on the neural architecture space, and it can be changed accordingly for different reconfigurable hardware designs. Therefore, our framework is general enough to cover any reconfigurable hardware, and has the potential to achieve higher accuracy and hardware performance.

During training, in order to efficiently solve (2), we use the progressive shrinking algorithm [cai2019once] to train all the sub-networks within the supernet by randomly sampling candidate NNs. All the sub-networks share the same set of parameters within the supernet. Once the training is finished, we can quickly sample a sub-network from the supernet without extra effort. All these sub-networks form the final neural architecture space, which enables exploration at the algorithm level in the later phases.
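
The following is a minimal sketch of this weight-sharing training step. The `supernet(images, config)` interface, the sampler callback and the hyper-parameters are assumptions for illustration, and the sketch omits the staged schedule of progressive shrinking.

```python
import torch
import torch.nn.functional as F

def train_supernet(supernet, loader, sample_config, steps=1000, lr=0.05):
    """Weight-sharing training: every step draws one random sub-network and updates
    the shared supernet parameters through it.

    `supernet(images, config)` is an assumed interface that executes only the
    sub-network selected by `config`; `sample_config()` returns a random candidate
    (e.g. a per-cell expansion-ratio vector).
    """
    opt = torch.optim.SGD(supernet.parameters(), lr=lr, momentum=0.9)
    supernet.train()
    for step, (images, labels) in enumerate(loader):
        if step >= steps:
            break
        config = sample_config()                   # random candidate sub-network
        logits = supernet(images, config)
        loss = F.cross_entropy(logits, labels)
        opt.zero_grad()
        loss.backward()
        opt.step()
```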

Phase 2: Modeling  In the second phase, we model several metrics: the CE loss ($\mathcal{L}_{CE}$), latency, energy and resource consumption, to enable fast exploration in the last phase.

For the loss, latency and energy models, we adopt GP regression for fast estimation. The training data for the GP regression is obtained by randomly sampling a small number of sub-networks from the supernet. These sampled NNs are evaluated on the dataset to obtain the CE loss, and run on our reconfigurable hardware with different configurations to obtain their latency and energy consumption. For the resource model, we use a simple analytic formulation to estimate the DSP and memory consumption.

Phase 3: Exploration  In the last phase, with the regression models for loss, latency, energy and resources available, a genetic algorithm (GA) is adopted for fast design space exploration over both the neural architecture and hardware design search spaces. Our GA contains five operations: population initialization, fitness function evaluation, selection, crossover and mutation. The population is initialized with randomly generated neural architectures and hardware designs. The loss (fitness) function is defined as follows:

$\mathcal{L}_{fit} = \alpha \cdot \widehat{\mathcal{L}}_{CE}(a) + \beta \cdot \widehat{Lat}(a, h) + \gamma \cdot \widehat{Enrg}(a, h) + \mathcal{P}_{rc}(a, h)$   (4)

Here, $\alpha$, $\beta$ and $\gamma$ are user-defined hyper-parameters which denote the relative importance of the CE loss, latency and energy consumption, and $\widehat{\mathcal{L}}_{CE}$, $\widehat{Lat}$ and $\widehat{Enrg}$ are the regression results of the GP-based CE loss, latency and energy models. Different values of $\alpha$, $\beta$ and $\gamma$ may lead to different results, and this is explored in Section V. The term $\mathcal{P}_{rc}$ is defined as follows:

$\mathcal{P}_{rc}(a, h) = \begin{cases} \Phi, & \text{if } D_{used} > D_{avail} \ \text{or} \ M_{used} > M_{avail} \\ 0, & \text{otherwise} \end{cases}$   (5)

where $D_{avail}$ and $M_{avail}$ represent the available DSP and memory resources on the target hardware platform, and $D_{used}$ and $M_{used}$ denote the DSP and memory consumption provided by the resource model. The constant $\Phi$ is the penalty added to the loss function when the hardware resource consumption exceeds the budget. In our very last step, if the underlying hardware supports a precision different from the one used for training the supernet, quantization-aware finetuning [jacob2018quantization] is enabled to tailor the resultant NN to the hardware system. The process of the GA is illustrated in Figure 1.
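
A compact sketch of this exploration loop is given below. It is a simplified illustration, not the paper's implementation: the encodings, the population and mutation hyper-parameters, and the predictor callbacks (standing in for the GP models and the analytic resource model) are all assumptions.

```python
import random
from typing import Callable, Sequence

def make_random_individual(choices: Sequence[Sequence]) -> list:
    # One gene per searchable cell plus one gene per hardware parameter.
    return [random.choice(opts) for opts in choices]

def ga_explore(arch_choices: Sequence[Sequence],
               hw_choices: Sequence[Sequence],
               predict_loss: Callable[[list], float],
               predict_lat: Callable[[list], float],
               predict_enrg: Callable[[list], float],
               within_budget: Callable[[list], bool],
               alpha: float, beta: float, gamma: float,
               pop_size: int = 100, generations: int = 50,
               mut_rate: float = 0.1, penalty: float = 1e6) -> list:
    """GA over the concatenated architecture + hardware encoding (illustrative only)."""
    choices = list(arch_choices) + list(hw_choices)

    def fitness(ind: list) -> float:
        f = alpha * predict_loss(ind) + beta * predict_lat(ind) + gamma * predict_enrg(ind)
        return f if within_budget(ind) else f + penalty

    pop = [make_random_individual(choices) for _ in range(pop_size)]   # initialization
    for _ in range(generations):
        pop.sort(key=fitness)                                          # fitness evaluation
        parents = pop[: pop_size // 2]                                 # selection
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = random.sample(parents, 2)
            cut = random.randrange(1, len(a))                          # single-point crossover
            child = a[:cut] + b[cut:]
            for i, opts in enumerate(choices):                         # per-gene mutation
                if random.random() < mut_rate:
                    child[i] = random.choice(opts)
            children.append(child)
        pop = parents + children
    return min(pop, key=fitness)
```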

IV Design Space and Modelling

IV-A Design Space

The design space is composed of two parts: hardware design space and neural architecture space.

IV-A1 Hardware Design Space

This paper adopts an example design which uses a single configurable processing unit to process all layers. Although there are other designs, such as streaming designs [li2020edd, hao2019fpga] with layer-wise reconfigurability, they usually require a large amount of on-chip memory to cache all the intermediate results, which restricts the model size of the NN and limits the neural architecture space. We therefore adopt the single processing engine design, so that our search space can encompass larger CNNs. Note that, although we use the single-engine design, our framework is general enough to be applied to any reconfigurable design, such as a streaming design, by changing the hardware design space.

The adopted reconfigurable design is illustrated in Figure 2. The accelerator consists of an input buffer, a weight buffer, a convolutional (Conv) engine and other functional modules, including Shortcut (SC) [he2016deep], Pooling (Pool) and Rectified Linear Unit (ReLU) activation. The computation of the NN is performed sequentially, layer-by-layer, and only one layer is processed in the Conv engine at a time. This computational pattern allows the accelerator to support NNs even with a large number of layers, because only one layer's input data and weights need to be cached in the on-chip memory. To achieve higher hardware performance, the accelerator is designed to support 8-bit integer operations.

Fig. 2: Overview of the FPGA-based accelerator.

The Conv engine supports three types of configurable parallelism: filter parallelism ($P_F$), channel parallelism ($P_C$) and vector parallelism ($P_V$). Different types of convolutions require different combinations of $P_F$, $P_C$ and $P_V$ to achieve optimal performance. For instance, a convolution with a small number of channels can achieve lower latency with a low $P_C$ and high $P_F$ and $P_V$ values, since there is little available concurrency in the channel dimension. Our hardware design space is represented by the memory size, the bandwidth and the parallelism levels $P_F$, $P_C$ and $P_V$. $P_F$ and $P_C$ share the same domain, $P_V$ is chosen from its own discrete set, and the memory size depends on the available memory resources on the FPGA board. In total, there are 300 potential configurations for the hardware design.

IV-A2 Neural Architecture Space

The example architecture space is illustrated in Figure 3. In this paper, we argue that the design of the neural architecture search space should take the underlying hardware design into account before the NAS optimization. By analyzing the characteristics of the selected hardware architecture, we found that it is efficient at performing regular convolutions with residual additions [he2016deep]. Also, [bello2021revisiting] demonstrates that the basic building block of ResNet is still one of the most effective architectures given proper scaling strategies. Therefore, our core neural architecture search space follows the backbone of ResNet-50, which is composed of four residual blocks with gradually reduced feature map sizes and increased channel counts. In each block, we search for the number of units, ranging from 2 up to the maximal number of units in that block. In each cell, we search for the expansion ratio, chosen from three candidate values. As there are 16 searchable cells in total, the neural architecture space is vast; together with the 300 different hardware configuration options, there are billions of different combinations in our co-design space.

Fig. 3: The search space of neural architectures.
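
As a small illustration of how a point in this architecture space can be drawn and encoded for the models of Section IV-B, consider the sketch below. The per-block unit maximum and the expansion-ratio values are placeholders, not the paper's exact choices.

```python
import random

NUM_BLOCKS = 4
MAX_UNITS_PER_BLOCK = 4                 # placeholder; yields 16 searchable cells in total
EXPANSION_CHOICES = (0.25, 0.5, 1.0)    # placeholder expansion ratios

def sample_architecture():
    """Return a 16-dimensional encoding: one expansion ratio per cell, 0.0 for skipped cells."""
    encoding = []
    for _ in range(NUM_BLOCKS):
        depth = random.randint(2, MAX_UNITS_PER_BLOCK)   # active units in this block
        encoding += [random.choice(EXPANSION_CHOICES) for _ in range(depth)]
        encoding += [0.0] * (MAX_UNITS_PER_BLOCK - depth)
    return encoding
```

Such a sampler could also serve as the `sample_config` callback in the supernet-training sketch of Section III.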

IV-B Loss, Latency, Energy and Resource Models

IV-B1 Loss Model

Evaluating $\mathcal{L}_{CE}$ in Equation (2) for all 12 billion configurations on a large dataset such as ImageNet [deng2009imagenet] is time-consuming. To enable fast evaluation, we adopt GP regression to estimate $\mathcal{L}_{CE}$ for all sub-networks. To represent the neural architecture, we encode the neural architecture space, which contains 16 searchable cells, into a 16-dimensional vector, with each dimension representing the expansion ratio used in that cell; the expansion ratio is 0 if a cell is skipped. We construct a training dataset by randomly sampling and evaluating a certain number of sub-networks. Based on the encoded input vectors and the evaluated CE losses, we perform regression using a GP model with a Matérn covariance kernel and a constant mean function.
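
A minimal GPyTorch sketch of such a regressor is shown below: standard exact-GP boilerplate with a constant mean and a Matérn kernel. The smoothness parameter nu=2.5 and the training tensors X (the 16-dimensional cell encodings) and y (the measured CE losses) are assumptions for illustration.

```python
import torch
import gpytorch

class LossGP(gpytorch.models.ExactGP):
    """Exact GP with a constant mean and a Matérn kernel, mirroring the loss model described above."""
    def __init__(self, train_x, train_y, likelihood):
        super().__init__(train_x, train_y, likelihood)
        self.mean_module = gpytorch.means.ConstantMean()
        self.covar_module = gpytorch.kernels.ScaleKernel(gpytorch.kernels.MaternKernel(nu=2.5))

    def forward(self, x):
        return gpytorch.distributions.MultivariateNormal(self.mean_module(x), self.covar_module(x))

def fit_gp(train_x, train_y, iters=50, lr=0.1):
    likelihood = gpytorch.likelihoods.GaussianLikelihood()
    model = LossGP(train_x, train_y, likelihood)
    model.train(); likelihood.train()
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    mll = gpytorch.mlls.ExactMarginalLogLikelihood(likelihood, model)
    for _ in range(iters):
        opt.zero_grad()
        loss = -mll(model(train_x), train_y)   # negative marginal log-likelihood
        loss.backward()
        opt.step()
    model.eval(); likelihood.eval()
    return model, likelihood

# Usage on assumed data: X is an (N, 16) float tensor of cell encodings, y an (N,) tensor of CE losses.
# model, lik = fit_gp(X, y)
# with torch.no_grad(), gpytorch.settings.fast_pred_var():
#     est_ce = lik(model(X_new)).mean          # estimated CE loss for new encodings
```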

IV-B2 Latency and Energy Models

Measuring the hardware performance of all sub-networks on the FPGA-based design for all design parameters is time-consuming because of the synthesis and place-and-route processes needed for a real hardware implementation. We again use GP regression models to estimate the latency and energy consumption. To represent the NN together with the hardware configuration, we encode it into a 19-dimensional vector, with the first 16 dimensions representing the neural architecture and the last 3 dimensions being $P_F$, $P_C$ and $P_V$.

IV-B3 Resource Model

As DSPs and memory are the limiting resources for FPGA-based CNN accelerators [liu2018optimizing], we primarily consider DSP and memory consumption in this paper. The DSP consumption is dominated by the parallelism level used in the Conv engine and can be described as $D_{used} = P_F \times P_C \times P_V$.

The memory resources are mainly consumed by the input and weight buffers. As the input buffer needs to cache all the input feature maps of the current $l$-th layer, its usage can be represented as $M_{in} = \max_{l \in [1, L]}(C_l \times H_l \times W_l) \times DW$, where $C_l$, $H_l$ and $W_l$ denote the number of channels, height and width of the input feature map, $DW$ is the data width and $L$ is the total depth of the net. As for the weight buffer, because weights are shared along the $P_V$ parallelism, it only needs to cache the current $P_F$ filters, so its memory consumption can be formulated as $M_{wt} = \max_{l \in [1, L]}(P_F \times C_l \times K_l \times K_l) \times DW$, where $K_l$ is the kernel size of the $l$-th layer. Due to the use of the ping-pong buffer technique, the total memory consumption is $M_{used} = 2 \times (M_{in} + M_{wt})$.
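
A small sketch of this analytic resource model, following the formulation above (the exact expressions are a reconstruction, so treat them as assumptions rather than the paper's verbatim model):

```python
def dsp_used(p_f: int, p_c: int, p_v: int) -> int:
    # DSP usage is dominated by the Conv-engine parallelism (assumed product form).
    return p_f * p_c * p_v

def memory_used(layers, p_f: int, data_width: int) -> int:
    """layers: iterable of (channels, height, width, kernel_size) tuples, one per layer."""
    m_in = max(c * h * w for c, h, w, _ in layers) * data_width           # input buffer
    m_wt = max(p_f * c * k * k for c, _, _, k in layers) * data_width     # weight buffer (P_F filters)
    return 2 * (m_in + m_wt)                                              # ping-pong buffering
```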

V Experiments

The PyTorch and GPyTorch libraries are used for the implementation of the supernet training and the GP models respectively. The ImageNet [deng2009imagenet] dataset, which contains over 10,000,000 labeled images of 1000 object categories, is used for classification. The hardware design used in all experiments is implemented on an Intel Arria 10 SX660 FPGA platform using Verilog. 1 GB of DDR4 SDRAM is installed on the platform as the off-chip memory. Quartus Prime Pro 17 was used for synthesis and implementation. An Intel Xeon E5-2680 v2 CPU was used as the host processor. We train the supernet on a GPU cluster with six NVIDIA GTX 1080 Ti GPUs for days. A power meter is plugged in to measure the runtime power.

V-A Accuracy of Gaussian Process-based Models

To train our GP-based loss model, 2000 sub-networks were sampled and evaluated on ImageNet [deng2009imagenet]. We used 1500 samples for training and 500 samples for evaluation. The model was trained for 50 iterations using an Adam optimizer. The results are shown in Table I. The mean absolute error (MAE) is only 0.01005, which demonstrates that the GP-based loss model is sufficiently accurate for the modeling.

Model           Kernel Function   Mean Absolute Error
Loss Model      Matérn            0.01005
Latency Model   Matérn            0.06521 ms
Energy Model    Matérn            0.01804 W
TABLE I: Results of Gaussian process-based models.

Similarly, 4600 random samples with different network configurations and hardware designs were collected for latency and energy modeling. We used 3000 and 1600 samples for training and evaluation respectively. The training was again performed for 50 iterations using an Adam optimizer. As shown in Table I, the MAEs of our GP-based latency and energy models are only 0.06521 ms and 0.01804 W respectively. Therefore, the proposed GP-based latency and energy models can be used as accurate estimators of the latency and energy consumption.

V-B Effectiveness of Design Space Exploration

For reference and demonstration, we iterated through and evaluated all samples in the co-design space to obtain the reference Pareto frontier. The Pareto-optimal points, i.e., those that are not outperformed in loss, latency and energy simultaneously by any other point, form the Pareto frontier, drawn as blue points in Figure 4. Because the whole design space is too large to show in the figure, we randomly drew 2000 non-Pareto-optimal samples as purple points to visualize the rest of the design space.

Fig. 4: The performance of various NAS-generated NNs on different candidate hardware design. Pareto-optimal is denoted by blue points.
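
For reference, extracting such a frontier from exhaustively evaluated (loss, latency, energy) triples can be done with a straightforward dominance check; the quadratic-time sketch below is sufficient at this design-space scale.

```python
def pareto_frontier(points):
    """points: list of (loss, latency, energy) tuples; lower is better in every dimension.

    A point is Pareto-optimal if no other point is at least as good in all three
    metrics and strictly better in at least one.
    """
    def dominates(a, b):
        return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

    return [p for p in points if not any(dominates(q, p) for q in points if q is not p)]
```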

Then, to demonstrate the effectiveness of our framework, we used the GA to perform design space exploration and checked whether the found configurations match the reference Pareto frontier. The time cost for the proposed GP-based models and the GA to find one optimized design is less than one GPU hour, which demonstrates the efficiency of our framework. In contrast, other approaches [jiang2019hardware, dong2021hao] require tens to hundreds of GPU hours for searching. As mentioned in Section III, the user-defined hyper-parameters α, β and γ specified in the GA represent the importance of accuracy, latency and energy consumption respectively. We chose three sets of (α, β, γ) values to demonstrate how the GA is able to find different Pareto-optimal designs according to users' requirements. The resultant designs found by the GA are highlighted by black arrows in Figure 4, and all lie on the reference Pareto frontier. Their NN architectures and hardware configurations are illustrated in Figure 5. Therefore, our framework can effectively identify Pareto-optimal designs in the vast algorithm-hardware co-design space.

Fig. 5: Neural architecture and hardware configuration of NNs found.

We also evaluated the resultant networks on different hardware platforms, including an Intel Xeon Silver 4110 CPU and an NVIDIA GTX 1080 Ti GPU. The results are presented in Table II. The TensorRT and cuDNN libraries were used for the GPU implementation, and MKL-DNN was used to optimize the CPU implementation. The batch size was set to one for a fair comparison. Compared with the GPU and CPU implementations, the networks found for the reconfigurable FPGA-based accelerator achieve lower latency and higher energy efficiency on both platforms.

CPU GPU FPGA Acc
Lat. Enrg. Eff. Lat. Enrg. Eff. Lat. Enrg. Eff.
(ms) (FPS/W) (ms) (FPS/W) (ms) (FPS/W)
0.28 0.94 4.52 5.07 77.63%
0.30 1.06 3.66 6.27 76.30%
0.38 1.38 3.14 7.32 74.91%
TABLE II: Accuracy, latency and energy efficiency on ImageNet.

V-C Comparison with Manually Designed Networks

To demonstrate that the auto-generated NN architectures can outperform manually-designed networks in terms of accuracy, latency, energy and model size on our FPGA accelerator, we evaluated several commonly benchmarked NNs, including ResNet-101 [he2016deep], VGG-16 [simonyan2014very] and Inception-v2 [szegedy2016rethinking], on ImageNet. The hardware configurations for these networks were manually optimized. The results are shown in Figure 6. The network found with the highest accuracy is both more accurate and faster than ResNet-101. Compared with VGG-16, the network found achieves higher accuracy while substantially reducing the latency. We also compared our work with the MobileNetV2 [sandler2018mobilenetv2] implemented in [cai2019once]; our design achieves a similar latency while improving the accuracy.

Fig. 6: Comparison of accuracy and latency among our work, the manually-designed neural networks and other algorithm-hardware co-design methods.

V-D Comparison with Existing Co-Design Work

We compared our proposed approach with four other state-of-the-art co-design methods: Co-Explore [jiang2019hardware], EDD [li2020edd], HAO [dong2021hao] and OFA [cai2019once]. Although there are other co-design works, they either suffer from low accuracy [colangelo2019artificial, colangelo2019evolutionary, hao2019fpga, jiang2019accuracy_vs_eff] or are only evaluated on small datasets [abdelfattah2020best, chen2020you, fan2020optimizing], so we did not include them in our comparison. The results are shown in Figure 6, and Table III summarizes the underlying hardware platforms and implementation details. Compared with the network generated by [jiang2019hardware], our found network achieves about 6% higher accuracy, a more than 26x speed-up and nearly 8.5x higher energy efficiency. We also achieve nearly 4% higher accuracy than HAO [dong2021hao] with better hardware performance, even when the latency is normalized by the DSP consumption. In comparison with OFA, which consumes nearly twice as many DSPs, we achieve a similar latency with nearly 3% higher accuracy.

                                Platform         Number of DSPs   Latency (ms)   Accuracy   Energy Eff. (GOPS/W)
Co-Explore [jiang2019hardware]  Xilinx XC7Z015   150              95.24          70.24%     0.74
EDD [li2020edd]                 Xilinx ZCU102    2520             7.96           74.60%     -
HAO [dong2021hao]               Xilinx ZU3EG     360              22.27          72.68%     -
OFA [cai2019once]               Xilinx ZU9EG     2520             3.30           73.60%     -
Our Work                        Intel GX1150     1345             3.66           76.30%     6.27
TABLE III: Details of hardware implementations.

VI Conclusion

This paper proposes a novel algorithm-hardware co-design framework for reconfigurable NN accelerators. To reduce the search cost, we adopt a genetic algorithm and Gaussian process regression, which enable fast design space exploration within a few minutes. In comparison with manually-designed NNs on the same hardware, the network and hardware configuration generated by the proposed framework for our reconfigurable CNN accelerator achieve 1% to 5% higher accuracy while reducing the latency by 2x to 10x on the ImageNet dataset. Compared with other state-of-the-art algorithm-hardware co-design approaches, our found NNs achieve better accuracy, energy efficiency, latency and search cost. Future work includes expanding the search space with more operation choices, integrating optimization for recurrent neural networks into the current framework and supporting end-to-end automation.

Acknowledgement

The support of the United Kingdom EPSRC (No. EP/L016796/1, EP/N031768/1, EP/P010040/1, EP/V028251/1 and EP/S030069/1), the National Natural Science Foundation of China (No. 62001165), Hunan Provincial Natural Science Foundation of China (No. 2021JJ40357), Changsha Municipal Natural Science Foundation (No. kq2014079), Corerain, Maxeler, Intel and Xilinx is gratefully acknowledged.

References