I Introduction
The success of deep learning, and especially neural networks (NNs), has attracted enormous research and industrial interest in applying NNs to real-life scenarios such as autonomous driving
[grigorescu2020survey]. However, the heavy computational and memory demands of running NNs impose a large overhead on their hardware performance, in particular on resource-constrained platforms [fan2018real]. Currently, there are two research directions that focus on improving the hardware performance of deployed NNs. The first is algorithm-level design of efficient NNs through neural architecture search (NAS) [zoph2016neural], which automatically designs NN architectures with high accuracy and low computational complexity for different scenarios [wu2019fbnet]. The second is hardware-level effort to design highly optimized and specialized hardware accelerators for NNs [fan2019f, liu2021toward, yu2020collaborative]. However, most of the time, the algorithm-level optimization and the hardware-level design are not considered jointly, which can lead to solutions that are suboptimal in both algorithmic and hardware performance. For example, the authors in [fan2018real] demonstrate that a hardware architecture designed for regular convolution is not suitable for the depthwise convolution commonly used in NAS [wu2019fbnet].
To address the aforementioned suboptimality, there is a growing demand for a method that can perform NAS to design accurate NNs and, at the same time, co-develop hardware designs customized for the NNs found. To meet this demand, reconfigurable hardware, such as field-programmable gate arrays (FPGAs), represents an ideal platform for algorithm-hardware co-design. Given its reconfigurability, an FPGA can provide highly optimized hardware customized for the different NNs found by NAS. Previous work has attempted to apply evolutionary algorithms
[linneural, colangelo2019artificial, colangelo2019evolutionary], reinforcement learning [jiang2019hardware, abdelfattah2020best] and differentiable NAS [li2020edd, fan2020optimizing] to algorithm-hardware co-design for NNs on FPGAs. However, these approaches iteratively perform the network training and design space exploration multiple times, making the process time-consuming. Also, the characteristics of the accelerator are only considered during NAS in their work. To address these issues, our contributions include:

A novel three-phase co-design framework, which decouples network training from the design space exploration of both the hardware design and the neural architecture to avoid iterative, time-consuming optimization. A hardware-friendly neural architecture space is also proposed, which considers the characteristics of the underlying hardware when constructing the search cells before the neural architecture search (Section III);

Accurate and efficient cross-entropy loss, latency and energy consumption models based on Gaussian process regression, together with a genetic algorithm, which enable fast design space exploration within a few minutes (Section IV);
A demonstration of the effectiveness of the proposed method on the ImageNet dataset. The network found and its custom hardware design lie on the Pareto frontier, and achieve better accuracy, energy efficiency and latency in comparison with other state-of-the-art co-design methods (Section V).
II Background
II-A Algorithm and Hardware Co-design
The joint design of NNs and hardware has recently been an active research area [hao2019fpga]. Based on an evolutionary algorithm (EA), Lin et al. [linneural] propose a two-stage method for algorithm-hardware co-design. Although their work claims speed-up and energy savings, their results are based entirely on simulations, and the performance is estimated without running on real hardware. EAs have also been adopted in other NAS methods [colangelo2019artificial, colangelo2019evolutionary]; however, their training cost is expensive and the generated NNs lack accuracy.
Reinforcement learning (RL) is another approach used for algorithm-hardware co-design [jiang2019hardware]. Nevertheless, a common drawback of these RL-based co-design approaches is that they demand a significant number of GPU hours to search for both the algorithm- and hardware-defining parameters, which is infeasible for real-life applications. To reduce the search cost, differentiable NAS (DNAS) has been used for algorithm and hardware co-design [li2020edd]. However, it has been demonstrated in [li2020random] that the NNs found by DNAS only achieve accuracy similar to NNs generated by random search.
The once-for-all (OFA) approach proposed in [cai2019once] provides another paradigm for NAS. Its progressive shrinking algorithm has been demonstrated to be effective in training the supernet. However, their work only optimizes neural architectures without searching for the optimal hardware architecture. Also, the neural architecture space in their paper does not consider the characteristics of the underlying hardware design, which leads to suboptimal hardware performance. Compared with their work, we achieve higher accuracy and hardware performance, as demonstrated in Section V. Although [lin2021naas] searches the accelerator architecture, it only focuses on processing engine (PE) connectivity and compiler mappings for an ASIC design.
II-B Gaussian Process
A Gaussian process (GP) is a model built on Bayesian probability theory which can embody prior knowledge in the predictive model and can be used for regression of real-valued nonlinear targets [rasmussen2010gaussian]. A GP is specified by a mean function and a covariance function (kernel). Common kernel choices include polynomial, Gaussian and Matérn kernels [rasmussen2010gaussian]. The mean function represents the assumed average of the estimated data. The kernel computes correlations between inputs and encapsulates the structure of the hypothesised function. GPs allow fast estimation, which is especially useful in design space exploration [ferianc2020improving]. However, [ferianc2020improving] only explored latency estimation for a single layer, while this paper adopts GPs to estimate the loss, latency and energy consumption of the whole NN.
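The GP regression setup described above can be sketched in a few lines. The framework itself uses GPyTorch (Section V), but for illustration here is a self-contained NumPy version with a Matérn-5/2 kernel and a constant (empirical) mean; the 1-D toy data and all hyperparameter values are illustrative, not the paper's.

```python
# Minimal GP regression sketch: Matérn-5/2 kernel, constant mean.
# Illustrative re-implementation, not the authors' GPyTorch code.
import numpy as np

def matern52(X1, X2, lengthscale=1.0, variance=1.0):
    """Matérn-5/2 covariance between two sets of 1-D inputs."""
    d = np.abs(X1[:, None] - X2[None, :]) / lengthscale
    s = np.sqrt(5.0) * d
    return variance * (1.0 + s + s ** 2 / 3.0) * np.exp(-s)

def gp_predict(X_train, y_train, X_test, noise=1e-6):
    """GP posterior mean at X_test, with a constant (empirical) mean function."""
    mean = y_train.mean()                                   # constant mean
    K = matern52(X_train, X_train) + noise * np.eye(len(X_train))
    K_star = matern52(X_test, X_train)
    alpha = np.linalg.solve(K, y_train - mean)
    return mean + K_star @ alpha

X = np.linspace(0.0, 1.0, 20)
y = np.sin(2.0 * np.pi * X)            # toy regression target
pred = gp_predict(X, y, X)             # near-interpolation at training inputs
```

In the framework, the inputs would be encoded architecture/hardware vectors rather than 1-D points, and the kernel hyperparameters would be fitted by maximizing the marginal likelihood.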
III Algorithm-Hardware Co-design
The problem of algorithm-hardware co-design can be defined as follows:
$\min_{\alpha \in \mathcal{A},\, h \in \mathcal{H},\, w} \mathcal{L}(\alpha, h, w)$    (1)
Here, $\mathcal{A}$ denotes the NN architecture space and $\mathcal{H}$ represents the hardware design space. To minimize the loss $\mathcal{L}$, we aim to find the optimal hardware configuration $h$ and NN architecture $\alpha$ with the associated weights $w$.
In this paper, we decouple the training of the weights and the optimization of $\alpha$ and $h$ into two separate steps. First, we train a supernet, encompassing all our NN architecture options, with respect to the supernet weights $W$ using the following objective function based on a cross-entropy (CE) loss:
$W^{*} = \arg\min_{W} \sum_{\alpha_i \in \mathcal{A}} \mathcal{L}_{CE}(\alpha_i, W)$    (2)
During this process, we randomly sample sub-NNs from the supernet and independently train each sampled network to minimize the overall loss. Then, once the training is finished, we perform the optimization with respect to $\alpha$ and $h$ using the overall objective function containing both the CE loss and the hardware costs as follows:
$\min_{\alpha \in \mathcal{A},\, h \in \mathcal{H}} \mathcal{L}(\alpha, h, W^{*})$    (3)
Aiming at solving (2) and (3), a novel algorithm-hardware co-design framework is proposed, as illustrated in Figure 1. To make the framework applicable to any reconfigurable hardware system, we generalize it into three phases: 1) Specify and Train, 2) Modeling and 3) Exploration. Note that the first and second phases are only required once, while the Exploration is quickly performed for each specific deployment scenario, which makes our framework efficient.
Phase 1: Specify and Train — To define the hardware design space for exploration, the user first specifies a reconfigurable hardware system to accelerate NNs. Then, the neural architecture search space is built from the operations supported by the underlying hardware system. A neural architecture search space typically considers different algorithmic configurations, orderings and connections between the operations inside the NNs [zoph2016neural]. Note that our framework does not apply any restrictions on the neural architecture space, and the space can be changed accordingly for different reconfigurable hardware designs. Therefore, our framework is general enough to cover any reconfigurable hardware, and has the potential to attain higher accuracy and hardware performance.
During the training, in order to efficiently solve (2), we use the progressive shrinking algorithm [cai2019once] to train all the sub-networks within the supernet by randomly sampling candidate NNs. All the sub-networks share the same set of parameters within the supernet. Once the training is finished, we can quickly sample a sub-network from the supernet without extra effort. All these sub-networks form the final neural architecture space, which enables exploration at the algorithm level in a later phase.
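The sampling step above can be pictured with a short sketch. The per-block unit counts and candidate expansion ratios below are illustrative placeholders, not the paper's exact values; only the mechanism (sample a depth of at least 2 per block, pick a ratio per active cell, mark skipped cells) follows the text.

```python
# Hypothetical sub-network sampling for progressive-shrinking training.
import random

MAX_UNITS = [4, 4, 4, 4]        # illustrative max units per residual block
RATIOS = [0.25, 0.5, 1.0]       # illustrative candidate expansion ratios

def sample_subnet(rng=random):
    """Sample one sub-network: per-block expansion ratios, 0.0 = skipped cell."""
    config = []
    for max_units in MAX_UNITS:
        depth = rng.randint(2, max_units)        # searched depth, at least 2
        cells = [rng.choice(RATIOS) for _ in range(depth)]
        cells += [0.0] * (max_units - depth)     # pad skipped cells with 0.0
        config.append(cells)
    return config

# Each sampled config would select the matching slice of the shared
# supernet weights for one training step.
cfg = sample_subnet()
```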
Phase 2: Modeling — In the second phase, we model several metrics: the CE loss ($\mathcal{L}_{CE}$), latency, energy and resource consumption, to enable fast exploration in the last phase.
For the loss, latency and energy models, we adopt GP regression for fast estimation. The training data for the GP regression are obtained by randomly sampling a small number of sub-networks from the supernet. These sampled NNs are then evaluated on the dataset to obtain their CE loss, and run on our reconfigurable hardware under different configurations to obtain their latency and energy consumption. For the resource model, we propose a simple analytic formulation to estimate the DSP and memory consumption.
Phase 3: Exploration — In the last phase, with the regression models for loss, latency, energy and resources available, a genetic algorithm (GA) is adopted for fast design space exploration in both the neural architecture and hardware design search spaces. Our GA contains five operations: population initialization, fitness function evaluation, selection, crossover and mutation. The population is initialized with randomly generated neural architectures and hardware designs. The loss (fitness) function is defined as follows:
$\mathcal{L}_{fit}(\alpha, h) = \lambda_1 \widehat{\mathcal{L}}_{CE}(\alpha) + \lambda_2 \widehat{Lat}(\alpha, h) + \lambda_3 \widehat{E}(\alpha, h) + P(\alpha, h)$    (4)
Here, $\lambda_1$, $\lambda_2$ and $\lambda_3$ are user-defined hyperparameters which denote the importance of the CE loss, latency and energy consumption, and $\widehat{\mathcal{L}}_{CE}$, $\widehat{Lat}$ and $\widehat{E}$ are the predictions of the GP-based CE loss, latency and energy models. Different values of $\lambda_1$, $\lambda_2$ and $\lambda_3$ may lead to different results, which is explored in Section V. The penalty term $P$ is defined as follows:
$P(\alpha, h) = \begin{cases} 0, & \text{if } DSP \le DSP_{total} \text{ and } Mem \le Mem_{total} \\ p, & \text{otherwise} \end{cases}$    (5)
where $DSP_{total}$ and $Mem_{total}$ represent the available DSP and memory resources on the target hardware platform, and $DSP$ and $Mem$ denote the DSP and memory consumption provided by the resource model. The constant $p$ denotes the penalty added to the loss function when the hardware resource consumption exceeds the budget. In our very last step, if the underlying hardware supports a precision other than the one used for training the supernet, quantization-aware fine-tuning [jacob2018quantization] is enabled to tailor the resultant NNs to the hardware system. The process of the GA is illustrated in Figure 1.

IV Design Space and Modelling
IV-A Design Space
The design space is composed of two parts: hardware design space and neural architecture space.
IV-A1 Hardware Design Space
This paper adopts an example design which uses a single configurable processing unit to process the different layers. Although there are other designs, such as streaming designs [li2020edd, hao2019fpga] with layer-wise reconfigurability, these usually require a large amount of on-chip memory to cache all the intermediate results, which restricts the model size of the NN and limits the neural architecture space. In this paper, we adopt the single-processing-engine design so that our search space can encompass larger CNNs. Note that, although we use the single-engine design, our framework is general enough to be applied to any reconfigurable design, such as a streaming design, by changing the hardware design space.
The adopted reconfigurable design is illustrated in Figure 2. The accelerator consists of an input buffer, a weight buffer, a convolutional (Conv) engine and other functional modules including Shortcut (SC) [he2016deep], Pooling (Pool) and Rectified Linear Unit (ReLU) activation. The computation of the NN is performed sequentially, layer by layer, and only one layer is processed in the Conv engine at a time. This computational pattern allows the accelerator to support NNs even with a large number of layers, because only one layer's input data and weights need to be cached in the on-chip memory. To achieve higher hardware performance, the accelerator is designed to support 8-bit integer operations.

The Conv engine supports three types of configurable parallelism: filter parallelism ($P_F$), channel parallelism ($P_C$) and vector parallelism ($P_V$). Different types of convolutions require different combinations of $P_F$, $P_C$ and $P_V$ to achieve optimal performance. For instance, a convolution with a small number of channels can achieve lower latency with a low $P_C$ and high $P_F$ and $P_V$ values, since there is no concurrency available in the channel dimension. Our hardware design space is represented by the memory size $M$, the bandwidth $BW$ and the parallelism levels $P_F$, $P_C$ and $P_V$, where each parameter is restricted to a small set of candidate values and $M$ is bounded by the memory resources available on the FPGA board. In total, there are around 300 potential configurations for the hardware design.

IV-A2 Neural Architecture Space
The example architecture space is illustrated in Figure 3. In this paper, we argue that the design of the neural architecture search space should consider the underlying hardware design before the NAS optimization. By analyzing the characteristics of the selected hardware architecture, we found that it is efficient in performing regular convolution with residual addition [he2016deep]. Also, [bello2021revisiting] demonstrates that the basic building block of ResNet is still one of the most effective architectures when paired with proper scaling strategies. Therefore, our core neural architecture search space follows the backbone of ResNet-50, which is composed of four residual blocks with gradually reduced feature map sizes and increased channel counts. In each block, we search for the number of units, ranging from 2 to the maximal number of units in that block. In each cell, we search for the expansion ratio, chosen from three candidate values. There are 16 searchable cells in total in our neural architecture space; together with the roughly 300 hardware configuration options, there are more than 12 billion different combinations in our co-design space.
IV-B Loss, Latency, Energy and Resource Models
IV-B1 Loss Model
Evaluating $\mathcal{L}_{CE}$ in Equation (2) for all 12 billion configurations on a large dataset such as ImageNet [deng2009imagenet] is time-consuming. To enable fast evaluation, we adopt GP regression to estimate $\mathcal{L}_{CE}$ for all sub-networks. To represent the neural architecture, we encode the neural architecture space, which contains 16 searchable cells, into a 16-dimensional vector, with each dimension representing the expansion ratio used in that cell. The expansion ratio is 0 if a cell is skipped. We construct a training dataset by randomly sampling and evaluating a number of sub-networks. Based on the encoded input vectors and the evaluated $\mathcal{L}_{CE}$, we perform regression using a GP model with a Matérn covariance kernel and a constant mean function.
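A minimal sketch of this encoding is given below. The split of the 16 cells into four blocks and the example ratio values are illustrative assumptions; the only property taken from the text is that each cell contributes one entry, with 0 marking a skipped cell.

```python
# Hypothetical encoding of a sampled architecture into the 16-dimensional
# input vector consumed by the GP loss model.
def encode_architecture(config):
    """Flatten per-block expansion ratios into the GP model's input vector."""
    vec = [ratio for block in config for ratio in block]
    assert len(vec) == 16, "the architecture space has 16 searchable cells"
    return vec

example = [
    [0.5, 1.0, 0.0, 0.0],    # block 1: two active cells, two skipped
    [0.25, 0.5, 1.0, 0.0],   # block 2
    [1.0, 1.0, 0.5, 0.25],   # block 3
    [0.5, 0.5, 0.0, 0.0],    # block 4
]
x = encode_architecture(example)   # 16-dimensional GP input
```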
IV-B2 Latency and Energy Models
Measuring the hardware performance of all sub-networks of the FPGA-based design, for all different design parameters, is time-consuming because of the synthesis and place-and-route processes needed for a real hardware implementation. We again use GP regression models to estimate the latency and energy consumption. To represent the NN together with the hardware configuration, we encode it into a 19-dimensional vector, with the first 16 dimensions representing the neural architecture and the last 3 dimensions being $P_F$, $P_C$ and $P_V$.
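The joint encoding is then a simple concatenation; the helper name and example parameter values below are illustrative.

```python
# Hypothetical joint encoding for the latency/energy GP models:
# 16 architecture entries followed by the three parallelism levels.
def encode_codesign(arch_vec, p_f, p_c, p_v):
    """Concatenate the 16 architecture entries with P_F, P_C and P_V."""
    assert len(arch_vec) == 16
    return list(arch_vec) + [p_f, p_c, p_v]

x = encode_codesign([0.5] * 16, 4, 8, 16)   # 19-dimensional model input
```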
IV-B3 Resource Model
As DSPs and memory are the limiting resources for FPGA-based CNN accelerators [liu2018optimizing], we primarily consider DSP and memory consumption in this paper. The DSP consumption is dominated by the parallelism level used in the Conv engine and scales with the product $P_F \times P_C \times P_V$.
The memory resources are mainly consumed by the input and weight buffers. As the input buffer needs to cache all the input feature maps of the current layer, its usage can be represented as $Mem_{in} = \max_{1 \le i \le N}(C_i \times H_i \times W_i) \times DW$, where $C_i$, $H_i$ and $W_i$ denote the number of channels, height and width of the input feature map of the $i$-th layer, $DW$ is the data width and $N$ is the total depth of the network. As for the weight buffer, because weights are shared along the vector parallelism, it only needs to cache the current $P_F$ filters, so its memory consumption can be formulated as $Mem_{wt} = \max_{1 \le i \le N}(P_F \times C_i \times K_i^2) \times DW$, where $K_i$ is the kernel size of the $i$-th layer. Due to the use of the ping-pong buffer technique, the total memory consumption is $Mem = 2 \times (Mem_{in} + Mem_{wt})$.
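The analytic resource model above is straightforward to express in code. This is a sketch under the assumptions stated in the text (DSP usage scaling with the parallelism product, ping-pong doubling of both buffers); the `dsp_per_mac` factor and the example layer shapes are illustrative.

```python
# Sketch of the analytic DSP and memory resource model.
def dsp_usage(p_f, p_c, p_v, dsp_per_mac=1):
    """DSPs consumed by the Conv engine's P_F x P_C x P_V multipliers."""
    return dsp_per_mac * p_f * p_c * p_v

def memory_usage(layers, p_f, data_width=8):
    """Total buffer bits: worst-case input and weight buffers, doubled."""
    mem_in = max(c * h * w for (c, h, w, k) in layers) * data_width
    mem_wt = max(p_f * c * k * k for (c, h, w, k) in layers) * data_width
    return 2 * (mem_in + mem_wt)   # ping-pong buffering doubles both buffers

layers = [(64, 56, 56, 3), (128, 28, 28, 3)]   # (channels, H, W, kernel size)
total_bits = memory_usage(layers, p_f=4)
```

During exploration, these estimates would be compared against $DSP_{total}$ and $Mem_{total}$ to decide whether the penalty in Equation (5) applies.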
V Experiments
The PyTorch and GPyTorch libraries are used for the implementation of the supernet training and the GP models, respectively. The ImageNet [deng2009imagenet] dataset contains over 10,000,000 labeled images of 1000 object categories for classification. The hardware design used in all experiments is implemented on an Intel Arria 10 SX660 FPGA platform using Verilog. 1 GB of DDR4 SDRAM is installed on the platform as the off-chip memory. Quartus 17 Prime Pro was used for synthesis and implementation. An Intel Xeon E5-2680 v2 CPU was used as the host processor. We train the supernet on a GPU cluster with six NVIDIA GTX 1080 Ti GPUs for several days. A power meter is used to measure the runtime power.

V-A Accuracy of the Gaussian Process-based Models
To train our GP-based loss model, 2000 sub-networks were sampled and evaluated on ImageNet [deng2009imagenet]. We used 1500 samples for training and 500 samples for evaluation. The model was trained for 50 iterations using the Adam optimizer. The results are shown in Table I. The mean absolute error (MAE) is only 0.01005, which demonstrates that the GP-based loss model is sufficiently accurate for the modeling.
Model          | Kernel Function | Mean Absolute Error
Loss Model     | Matérn          | 0.01005
Latency Model  | Matérn          | 0.06521 ms
Energy Model   | Matérn          | 0.01804 W
Similarly, 4600 random samples with different network configurations and hardware designs were collected for latency and energy modeling. We used 3000 and 1600 samples for training and evaluation, respectively. The training was again performed for 50 iterations with the Adam optimizer. As shown in Table I, the MAEs of our GP-based latency and energy models are only 0.06521 ms and 0.01804 W. Therefore, the proposed GP-based models can be used as accurate estimators of the latency and energy consumption.
V-B Effectiveness of Design Space Exploration
For reference and demonstration, we iterated through and evaluated all samples in the co-design space to obtain the reference Pareto frontier. The Pareto-optimal points, which are not dominated in loss, latency and energy by any other point, form the Pareto frontier, drawn as blue points in Figure 4. Because the whole design space is too large to show in the figure, we randomly drew 2000 non-Pareto-optimal samples as purple points to visualize the rest of the design space.
Then, to demonstrate the effectiveness of our framework, we used the GA to perform design space exploration and checked whether the configurations found match the reference Pareto frontier. The proposed GP-based models and the GA find one optimized design within minutes of GPU time, which demonstrates the efficiency of our framework. In contrast, other approaches [jiang2019hardware, dong2021hao] require tens to hundreds of GPU hours for the search. As mentioned in Section III, the user-defined hyperparameters $\lambda_1$, $\lambda_2$ and $\lambda_3$ specified in the GA represent the importance of accuracy, latency and energy consumption, respectively. We chose three different settings of $\lambda_1$, $\lambda_2$ and $\lambda_3$ to demonstrate how the GA is able to find different Pareto-optimal designs according to users' requirements. The resultant designs found by the GA are highlighted by black arrows in Figure 4 and all lie on the reference Pareto frontier. Their NN architectures and hardware configurations are illustrated in Figure 5. Therefore, our framework can effectively identify Pareto-optimal designs in the vast algorithm-hardware co-design space.
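The exploration loop of Section III can be sketched as follows. The fitness function mirrors Equations (4) and (5); everything else here is a stand-in: the toy predictors replace the trained GP models, the feasibility check replaces the resource model, and the weights, population size and mutation scheme are illustrative.

```python
# Toy GA sketch of the exploration phase: weighted fitness plus a large
# penalty for infeasible designs, with selection, crossover and mutation.
import random

LAMBDA = (1.0, 0.1, 0.1)   # illustrative importance weights
PENALTY = 1e6              # penalty for exceeding the resource budget

def predict_loss(x):    return sum(x) / len(x)    # stand-in for the GP models
def predict_latency(x): return 0.1 * sum(x)
def predict_energy(x):  return 0.05 * sum(x)
def feasible(x):        return max(x) <= 1.0      # stand-in resource check

def fitness(x):
    l1, l2, l3 = LAMBDA
    f = (l1 * predict_loss(x) + l2 * predict_latency(x)
         + l3 * predict_energy(x))
    return f + (0.0 if feasible(x) else PENALTY)

def evolve(pop, generations=50, rng=random):
    for _ in range(generations):
        pop.sort(key=fitness)
        parents = pop[: len(pop) // 2]              # selection
        children = []
        while len(children) < len(pop) - len(parents):
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, len(a))          # one-point crossover
            child = a[:cut] + b[cut:]
            i = rng.randrange(len(child))           # mutation
            child[i] = rng.random()
            children.append(child)
        pop = parents + children
    return min(pop, key=fitness)

# 19-dimensional individuals, matching the co-design encoding.
pop = [[random.random() for _ in range(19)] for _ in range(20)]
best = evolve(pop)
```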
We also evaluated the resultant networks on different hardware platforms, including an Intel Xeon Silver 4110 CPU and an NVIDIA GTX 1080 Ti GPU. The results are presented in Table II. The TensorRT and CuDNN libraries were used for the GPU implementation, and MKL-DNN was used to optimize the performance of the CPU implementation. The batch size was set to one for a fair comparison. Compared with the GPU and CPU implementations, the networks found for the reconfigurable FPGA-based accelerator achieve lower latency and higher energy efficiency.
CPU Enrg. Eff. | GPU Enrg. Eff. | FPGA Lat. | FPGA Enrg. Eff. | Acc.
(FPS/W)        | (FPS/W)        | (ms)      | (FPS/W)         |
0.28           | 0.94           | 4.52      | 5.07            | 77.63%
0.30           | 1.06           | 3.66      | 6.27            | 76.30%
0.38           | 1.38           | 3.14      | 7.32            | 74.91%
V-C Comparison with Manually Designed Networks
To demonstrate that the auto-generated NN architectures can outperform manually-designed networks in terms of accuracy, latency, energy and model size on our FPGA accelerator, we evaluated several commonly benchmarked NNs, including ResNet-101 [he2016deep], VGG-16 [simonyan2014very] and Inception-v2 [szegedy2016rethinking], on ImageNet. The hardware configurations for these networks were manually optimized. The results are shown in Figure 6. The network found with the highest accuracy is both more accurate and faster than ResNet-101. Compared with VGG-16, the network found achieves higher accuracy while also reducing the latency. We also compared our work with the MobileNetV2 [sandler2018mobilenetv2] implemented in [cai2019once]; our design achieves a similar latency while improving the accuracy.
V-D Comparison with Existing Co-Design Work
We compared our proposed approach with four other state-of-the-art co-design methods: Co-Explore [jiang2019hardware], EDD [li2020edd], HAO [dong2021hao] and OFA [cai2019once]. Although there are other co-design works, they either suffer from low accuracy [colangelo2019artificial, colangelo2019evolutionary, hao2019fpga, jiang2019accuracy_vs_eff] or are only evaluated on small datasets [abdelfattah2020best, chen2020you, fan2020optimizing]; therefore, we did not include them in our comparison. The results are shown in Figure 6, and Table III summarizes the underlying hardware platforms and implementation details. Compared with the network generated by [jiang2019hardware], our network achieves around 6% higher accuracy, a 26× speed-up and nearly 8.5× higher energy efficiency. We also achieve nearly 4% higher accuracy than HAO [dong2021hao] with better hardware performance, even when the latency is normalized by the DSP consumption. In comparison with OFA, which consumes nearly twice as many DSPs, we achieve a similar latency with 2.7% higher accuracy.
Method                          | Platform        | Number of DSPs | Latency (ms) | Accuracy | Energy Eff. (GOPS/W)
Co-Explore [jiang2019hardware]  | Xilinx XC7Z015  | 150            | 95.24        | 70.24%   | 0.74
EDD [li2020edd]                 | Xilinx ZCU102   | 2520           | 7.96         | 74.60%   | N/A
HAO [dong2021hao]               | Xilinx ZU3EG    | 360            | 22.27        | 72.68%   | N/A
OFA [cai2019once]               | Xilinx ZU9EG    | 2520           | 3.30         | 73.60%   | N/A
Our Work                        | Intel GX1150    | 1345           | 3.66         | 76.30%   | 6.27
VI Conclusion
This paper proposes a novel algorithm-hardware co-design framework for reconfigurable NN accelerators. To reduce the search cost, we adopt a genetic algorithm and Gaussian process regression, which enable fast design space exploration within a few minutes. In comparison with manually-designed NNs on the same hardware, the networks and hardware configurations generated by the proposed framework for our reconfigurable CNN accelerator achieve 1% to 5% higher accuracy while reducing the latency by 2× to 10× on the ImageNet dataset. Compared with the other state-of-the-art algorithm-hardware co-design approaches, the NNs we found achieve better accuracy, energy efficiency, latency and search cost. Future work includes expanding the search space with more choices of operations, integrating optimization for recurrent neural networks into the current optimization flow and supporting end-to-end automation.
Acknowledgement
The support of the United Kingdom EPSRC (No. EP/L016796/1, EP/N031768/1, EP/P010040/1, EP/V028251/1 and EP/S030069/1), the National Natural Science Foundation of China (No. 62001165), Hunan Provincial Natural Science Foundation of China (No. 2021JJ40357), Changsha Municipal Natural Science Foundation (No. kq2014079), Corerain, Maxeler, Intel and Xilinx is gratefully acknowledged.