1. Introduction
AI algorithms have gained ever-increasing research interest, and remarkable achievements have been demonstrated, especially for deep neural networks (DNNs). To improve algorithm quality, usually accuracy, a recent approach called neural architecture search (NAS) (zoph2017learning; huang2017densely; real2018regularized) has shown great success in automatically developing DNNs that outperform human-crafted designs. Meanwhile, the optimization techniques for implementing AI algorithms on hardware are also being intensively studied. The goal of implementation is to improve hardware performance, such as latency, throughput and energy efficiency. Typical implementation techniques include kernel and DNN optimizations on GPUs and customized accelerator designs on FPGAs and ASICs (qiu2016going; zhang2018dnnbuilder; x2017High; chen2019cloud; hao2019fpga; xu2020autodnnchip). On top of these achievements, to further improve AI solution quality, AI algorithm designers and hardware developers have begun to explore joint optimization opportunities. For example, hardware-aware NAS has been drawing a lot of attention by considering hardware features during DNN design (cai2018proxylessnas; tan2019mnasnet; stamoulis2019single; wu2019fbnet; gong2019mixed; zhang2020skynet). Meanwhile, hardware/software co-design approaches (jiang2019accuracy; hao2019fpga) focus on FPGA implementation characteristics and study their influence on DNN software design.
Despite these achievements of hardware-aware NAS and hardware/software co-design, a large optimization opportunity is still missing: the hardware implementation should be simultaneously searched during NAS. Here, for general-purpose computing devices, such as CPUs and GPUs, implementation search means optimizing DNN implementations through techniques such as kernel fusion and memory access optimization. For reconfigurable devices, such as FPGAs, implementation search means optimizing a customized DNN accelerator through techniques such as quantization, loop tiling and parallelization. Hardware implementation search not only provides more accurate performance evaluations of latency and throughput, but, more importantly, provides instant guidance to hardware-aware DNN design during NAS. Currently, all existing works omit the large design space of implementation search in their NAS flows, using estimated hardware performance from a fixed implementation (cai2018proxylessnas; tan2019mnasnet; stamoulis2019single; wu2019fbnet; gong2019mixed). We provided an initial discussion of the potential of simultaneous neural architecture and implementation co-search in (hao2019nais), called NAIS. Inspired by (hao2019nais), in this work we propose a simultaneous and efficient DNN and implementation co-search methodology. We summarize our contributions as follows:
This is the first work that proposes a mathematical formulation to solve the simultaneous DNN architecture and hardware implementation co-search problem. The formulation fuses the search variables of the DNN architecture and its hardware implementation into one search space, to simultaneously maximize DNN accuracy and implementation performance.

The proposed formulation is differentiable with respect to the fused search space, which allows a gradient descent algorithm to be applied and enables an efficient differentiable DNN and implementation co-search methodology (EDD).

The formulation is unified and comprehensive. It can be applied to various hardware platforms such as GPUs, FPGAs and dedicated accelerators; it can target various performance objectives such as latency, throughput or energy; and it also formulates resource usage considering resource sharing.

We demonstrate our EDD methodology targeting three different architectures: GPU, recursive FPGA accelerator, and pipelined FPGA accelerator. Each model produced by EDD achieves accuracy similar to the best existing DNNs on ImageNet but with superior performance: our GPU-targeted DNN is 1.4x faster than the state-of-the-art Proxyless solution (cai2018proxylessnas), and our FPGA-targeted DNN delivers 1.45x higher throughput than the state-of-the-art DNNBuilder solution (zhang2018dnnbuilder).
2. Related Works
Neural architecture search (NAS) is a technique for automating the design of DNNs that can outperform hand-crafted ones (zoph2017learning; huang2017densely; real2018regularized). Among all the NAS approaches, differentiable NAS is becoming prevalent because of its high search efficiency in terms of GPU hours (liu2018darts). One typical way is to construct a DNN supernet composed of multiple branch candidates associated with architecture weights. The architecture weights are differentiable with respect to the loss function of the supernet, and are updated through stochastic gradient descent. The final searched DNNs keep the branches with larger architecture weights and eliminate the others. For example, (cai2018proxylessnas) uses binarized parameters, and (wu2019fbnet) uses Gumbel-Softmax to choose between different DNN branches. Previously published literature also introduces hardware-aware NAS (cai2018proxylessnas; tan2019mnasnet; stamoulis2019single; wu2019fbnet; gong2019mixed). Some works incorporate hardware latency into the objective of NAS (cai2018proxylessnas; stamoulis2019single; wu2019fbnet), while others treat latency as a hard constraint (tan2019mnasnet). In (zhang2020skynet), a bottom-up DNN design approach is proposed for hardware-efficient models. Meanwhile, various embedded FPGA-based accelerators have been studied to support more efficient DNN inference (qiu2016going; zhang2018dnnbuilder). Other approaches proposed in (jiang2019accuracy; hao2019fpga) involve hardware/software co-design. Specifically, researchers in (jiang2019accuracy) proposed a reinforcement-learning-based architecture search with FPGA implementation performance integrated into the reward function. The work in (hao2019fpga) proposed a bundle-based co-design methodology, where a bundle is the basic building block for both the FPGA accelerator and the DNN model. However, none of the previous works is able to explore the DNN architecture and implementation co-search space simultaneously and comprehensively.
3. EDD Problem Formulation
Our simultaneous DNN and implementation co-search method fuses the design spaces of DNN architecture search and hardware implementation search, as shown in Fig. 1. We collectively denote the variables used in DNN search and implementation search as A and I, respectively, and the fused co-search space as {A, I}. The objective of DNN search is to quickly find a DNN architecture while minimizing the accuracy loss, denoted as Acc_loss. For implementation search, we define a performance loss, denoted as Perf_loss, which can be specified by users as, for example, end-to-end inference latency, throughput, energy or DNN model complexity. We denote the resource utilization as RES, and the resource upper bound of the target hardware as RES_ub. The DNN and implementation co-search problem is to minimize the accuracy loss and performance loss simultaneously by effectively searching {A, I}:
(1)   minimize: L = Acc_loss(A, I) · Perf_loss(I) + β · C^{RES(I) − RES_ub}
In Eq. 1, Acc_loss is a function of A and I; Perf_loss and RES are functions of I. The resource upper bound RES_ub is expressed in an exponent term to introduce a large penalty when it is violated. Worth noting, in the existing hardware-aware NAS approaches, only A is searched while I is fixed during NAS; in our co-search formulation, I is also a variable.
As introduced in Sec. 2, motivated by the high search efficiency and appealing model accuracy of differentiable NAS, in this work we propose a differentiable formulation of both A and I: in Eq. 1, Acc_loss is differentiable with respect to A and I, while Perf_loss and RES are differentiable with respect to I. By descending Eq. 1 with respect to the variables in {A, I} on the validation set, A and I are searched simultaneously.
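To make the shape of Eq. 1 concrete, the combined objective can be sketched in plain Python. This is an illustrative sketch only: the function and symbol names are ours, and β and the penalty base are stand-ins for the paper's scaling factor and exponent base.

```python
import math

def edd_loss(acc_loss, perf_loss, res, res_ub, beta=1.0, base=math.e):
    """Sketch of the Eq. 1 co-search objective:
    L = Acc_loss * Perf_loss + beta * base^(RES - RES_ub)."""
    # The exponential penalty stays near zero while RES is under the
    # resource budget RES_ub, and grows rapidly once it is exceeded.
    return acc_loss * perf_loss + beta * base ** (res - res_ub)
```

With the budget respected the loss is dominated by the accuracy-performance product; violating the budget makes the penalty term explode, steering gradient descent back inside the resource constraint.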
Fig. 1 shows our proposed overall differentiable design space. The blue blocks represent the DNN search space, while the red blocks represent the hardware implementation search space. We first introduce the DNN search space in Sec. 3.1, and then the design space merged with implementation in Sec. 3.2.
3.1. NAS Design Space
The differentiable NAS space is shown as the blue blocks in Fig. 1. First, the DNN is composed of N basic building blocks, blk_i, where 1 ≤ i ≤ N. In this work, in order to design hardware-friendly DNNs and to reduce search time, we adopt the single-path DNN structure without branches (stamoulis2019single). Inside the i-th block, there are M candidate operations, denoted as op_i^m (1 ≤ m ≤ M). We let the operations be the most commonly used DNN blocks in NAS approaches, called MBConv (tan2019mnasnet). An MBConv is composed of sequential layers of conv-1×1, dwconv-k×k (depthwise convolution with kernel size k) and conv-1×1, where the number of channels expands/shrinks by a ratio of ch^m between the two conv-1×1 layers. The output of a block is calculated based on the outputs of its candidate operations. For example, in (liu2018darts), the output is the weighted sum of the operations, where the weights are determined by a Softmax function. Instead of Softmax, in this work we use the Gumbel-Softmax function from (wu2019fbnet) in order to sample only one operation out of the M candidates during feed-forward propagation, since the Gumbel-Softmax function converts the discrete non-differentiable sampling into continuous differentiable sampling. This greatly reduces the memory requirement and speeds up the feed-forward propagation. The sampling parameters organize a two-dimensional array, denoted as A, which is the primary DNN search variable.
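The Gumbel-Softmax sampling step above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation; a framework version would typically use a built-in such as PyTorch's `torch.nn.functional.gumbel_softmax`.

```python
import math
import random

def gumbel_softmax(logits, tau=1.0):
    """One Gumbel-Softmax sample over M candidate operations.
    Adding Gumbel noise to the logits and applying a temperature-scaled
    softmax yields a near-one-hot vector while keeping the sampling
    step differentiable with respect to the logits."""
    gumbels = [-math.log(-math.log(random.random())) for _ in logits]
    z = [(l + g) / tau for l, g in zip(logits, gumbels)]
    m = max(z)                          # shift for numerical stability
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]
```

As the temperature tau decreases, the sample concentrates on a single operation, approximating the discrete choice while remaining differentiable.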
3.2. Implementation Formulation
As shown in the red blocks in Fig. 1, each candidate operation has its own implementation variables, forming an implementation search space I. The primary implementation variable is the quantization q, i.e., the data precision, since it has a large impact on DNN accuracy, implementation performance and hardware resource. Rather than following a train-then-quantize manner, the quantization shall be searched together with the DNN structure to provide implementation performance feedback. Besides quantization, other implementation variables may be device-oriented. For example, the FPGA implementation design space includes parallelism, loop tiling factors, etc.
To formulate the final Perf_loss and RES of a DNN in Eq. 1, we need to capture the intermediate performance and resource of each operation and each DNN block. As shown in the bottom four blocks in Fig. 1, there are four stages to derive Eq. 1:

Stage 1: we first formulate the quantization in a differentiable way; then, for each operation candidate op_i^m under q-bit quantization, we formulate the performance as perf_i^{m,q} and the resource as res_i^{m,q}.

Stage 2: the performance and resource of op_i^m regardless of quantization, perf_i^m and res_i^m, can be derived from perf_i^{m,q} and res_i^{m,q} in Stage 1.

Stage 3: the performance and resource of the i-th DNN block, perf_i and res_i, are derived from perf_i^m and res_i^m in Stage 2.

Stage 4: the overall DNN performance loss and resource usage, Perf_loss and RES, are derived from perf_i and res_i in Stage 3.

Finally, Perf_loss and RES are plugged into Eq. 1 as the objective function during our EDD co-search.
In the following, we introduce the differentiable quantization and performance and resource formulations stage by stage.
3.2.1 Stage 1: Differentiable Quantization
As shown in the red blocks in Fig. 1, to enable a differentiable quantization formulation, we create Q quantization paths for each operation op_i^m, indicating that each operation has Q quantization choices, from q_min-bit to q_max-bit. Similar to the differentiable NAS formulation, each quantization scheme is also sampled by the Gumbel-Softmax function with a sampling parameter θ_i^{m,q}, generating a probability P_i^{m,q} for op_i^m to be quantized to q-bit. The θ_i^{m,q} values organize a three-dimensional array of size N × M × Q, denoted as Θ. In this formulation, we have the flexibility to choose different quantizations for different layers of a DNN; such mixed-precision computation can be well supported by reconfigurable hardware and dedicated accelerators.
Under q-bit quantization, the performance and resource of operation op_i^m, perf_i^{m,q} and res_i^{m,q}, should be functions of the implementation variables in I (including the quantization q). Since one operation contains multiple layers, as shown in Fig. 1, for simplicity we treat them as a whole: the q-bit quantization applies to all layers within op_i^m, and the latency and resource are the summation over all its layers. perf_i^{m,q} and res_i^{m,q} largely vary with devices and will be further discussed in Section 4.
Given the differentiable DNN search variables A, the differentiable quantization variables Θ and the perf_i^{m,q} and res_i^{m,q} formulations under each quantization scheme, the DNN and implementation search spaces are fused as shown in Fig. 2.
3.2.2 From Stage 1 to Stage 2
Given the array Θ and following the Gumbel-Softmax sampling rule, denoted as GS(·), the performance and resource of op_i^m can be computed as follows:
(2)   perf_i^m = Σ_{q=q_min}^{q_max} GS(θ_i^{m,q} | θ_i^m) · perf_i^{m,q}

(3)   res_i^m = Σ_{q=q_min}^{q_max} GS(θ_i^{m,q} | θ_i^m) · res_i^{m,q}
where both perf_i^m and res_i^m are differentiable with respect to Θ. This computes the performance and resource expectations under different quantizations, which follow the Gumbel-Softmax distribution with parameter θ_i^m.
3.2.3 From Stage 2 to Stage 3
Similar to Eq. 2 and Eq. 3, given the array A, the performance and resource of the i-th DNN block can be expressed as:
(4)   perf_i = Σ_{m=1}^{M} GS(a_i^m | a_i) · perf_i^m

(5)   res_i = Σ_{m=1}^{M} GS(a_i^m | a_i) · res_i^m
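The two expectation stages above (Eqs. 2-5) reduce to the same probability-weighted sum, applied first over quantization choices and then over operation choices. The sketch below illustrates this composition; function names and the toy numbers are ours.

```python
def expected_value(probs, values):
    """Probability-weighted sum used in Eqs. 2-5: the expectation of a
    per-candidate quantity under (Gumbel-)softmax selection probabilities."""
    return sum(p * v for p, v in zip(probs, values))

def block_perf(op_probs, quant_probs_per_op, perf_per_op_quant):
    """Stage 2 then Stage 3: average each operation's performance over
    its quantization choices, then average over the block's operations."""
    op_perf = [expected_value(qp, pv)
               for qp, pv in zip(quant_probs_per_op, perf_per_op_quant)]
    return expected_value(op_probs, op_perf)
```

Because each stage is a weighted sum of differentiable terms, the block-level perf_i and res_i remain differentiable in both Θ and A, which is what lets gradient descent drive the co-search.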
3.2.4 From Stage 3 to Stage 4 — Performance
Given the performance and resource of the i-th DNN block, we can compute the overall DNN performance loss and resource usage, which need to be tailored toward specific search objectives and different devices.
First, if the overall objective is end-to-end latency, total energy or model size, the performance loss can be expressed as the summation over all DNN blocks:
(6)   Perf_loss = γ · Σ_{i=1}^{N} perf_i
where γ scales Perf_loss to the same magnitude as Acc_loss in Eq. 1.
If the objective is throughput, the performance loss is determined by the maximum latency among all blocks. Since taking the maximum value is a non-differentiable operation, we use a smooth maximum, the LogSumExp (LSE) function (polak2012optimization), as a differentiable approximation:
(7)   Perf_loss = γ · LSE(perf_1, …, perf_N) = γ · log( Σ_{i=1}^{N} e^{perf_i} )
If there are multiple objectives, such as minimizing both latency and energy, as long as the objectives are not conflicting, we can simply let Perf_loss be the product of the individual objective terms.
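The LSE smooth maximum used for the throughput objective can be checked with a short sketch (illustrative only; the temperature parameter t is our addition, with t = 1 matching the plain LSE form):

```python
import math

def smooth_max(latencies, t=1.0):
    """LogSumExp 'smooth maximum' of per-block latencies (Eq. 7):
    differentiable everywhere, tightening toward max() as t -> 0."""
    m = max(latencies)              # shift for numerical stability
    return m + t * math.log(sum(math.exp((x - m) / t) for x in latencies))
```

The approximation always upper-bounds the true maximum, and the slack shrinks when one block's latency dominates, so minimizing it pushes down the bottleneck block, exactly what a throughput objective needs.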
3.2.5 From Stage 3 to Stage 4 — Resource
The formulation of the overall resource usage covers two situations: without and with resource sharing. Without sharing, the total resource can be computed as the summation over all blocks:
(8)   RES = Σ_{i=1}^{N} res_i
However, resource sharing is very common, especially in IP-based FPGA or ASIC accelerators. Fig. 2 demonstrates a resource sharing scenario: we assume that the m-th operation of one block and the m-th operation of another block share the same piece of computing resource; in FPGA or ASIC, this is a reusable IP. To allow sharing, the quantization and the other implementation variables of the two operations shall be the same (some accelerators allow operations of different bitwidths to share resource; in those cases this constraint is not needed), i.e., their implementation variables in I are identical.
We now discuss the resource estimation with resource sharing. As shown in Fig. 3, the i-th row is the i-th DNN block; the blue entries are the operations with the largest probability of being selected. In this example, two operations in different blocks that share the same computing resource are most likely to be chosen, so that resource shall be counted only once. If the m-th operation is not selected in any of the blocks, its resource shall not be counted at all.
To describe resource sharing, we propose the following differentiable approximation of the resource usage RES^m of operation op^m, which is shared across blocks:
(9)   RES^m = tanh( Σ_{i=1}^{N} GS(a_i^m | a_i) ) · res^m
In the above formula, GS(a_i^m | a_i) is the expectation of op^m being selected in block i. To avoid the operation resource being redundantly counted across blocks, we use tanh to suppress the summed expectation of the m-th operation to at most 1 before it is multiplied by the unit resource res^m.
Thus, the overall DNN resource is computed as:
(10)   RES = Σ_{m=1}^{M} RES^m
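The sharing-aware resource estimate of Eqs. 9-10 can be sketched as follows (a minimal illustration with our own function names, assuming the tanh-capped form described above):

```python
import math

def shared_op_resource(selection_probs, unit_res):
    """Eq. 9 sketch: selection_probs[i] is the probability that block i
    selects this operation. tanh caps the summed selection expectation
    near 1, so a resource instance shared by several blocks is counted
    at most once."""
    return math.tanh(sum(selection_probs)) * unit_res

def total_resource(selection_probs_per_op, unit_res_per_op):
    """Eq. 10 sketch: sum the sharing-aware resource over the M operations."""
    return sum(shared_op_resource(p, r)
               for p, r in zip(selection_probs_per_op, unit_res_per_op))
```

An operation selected with high probability in three different blocks still costs roughly one unit of resource, while an operation never selected contributes nothing, matching the counting rules described for Fig. 3.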
4. Device-specific Formulation
In this section we discuss the device-specific formulations, which describe the performance and resource of operation op_i^m under q-bit quantization, perf_i^{m,q} and res_i^{m,q}.
4.1. FPGA
For FPGA implementation, we follow an IP-based accelerator architecture: for each operation op^m, there is a customizable IP instance to conduct its computation. We consider two FPGA implementation architectures, recursive and pipelined: in the recursive architecture, a set of IP instances is shared and reused by all DNN blocks, while in the pipelined architecture each block has its own IP instances.
Either way, the operation performance is the operation latency. We let the operation resource be the number of DSPs, which are usually the most critical resource on FPGAs. To formulate perf_i^{m,q} and res_i^{m,q}, we introduce additional implementation variables for FPGA, the parallel factors pf of the IPs. Parallel factors describe the parallelism, i.e., how many multiplications can be done concurrently. In FPGA design, since the parallelism usually increases exponentially, such as 64, 128, 256, etc., we use the exponential form 2^{pf} to describe the parallelism.
4.1.1 Latency
As defined in Section 3.1, each operation op_i^m is composed of a set of sequential DNN layers such as convolution, batch normalization and activation, and the latency and resource of op_i^m are the summation over all its layers. For an operation with a parallel factor of pf in q-bit, its latency can be approximated as:

(11)   lat^{m,q} = α^q · cmp^m / 2^{pf}

(12)   cmp^m = k^2 · h · w · ch_in · ch_out
In the above equations, k is the convolution kernel size; h, w, ch_in and ch_out represent the data dimensions of operation op^m; α^q is the latency calibration factor under a bitwidth of q. Intuitively, a smaller bitwidth leads to shorter latency because of less off-chip data movement and less computation. For simplification, we let α^q grow proportionally with q to simulate this phenomenon.
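The latency model of Eqs. 11-12 can be sketched as a small Python function. This is a hedged sketch: the linear calibration alpha_q = q / q_max is our assumption for the "proportional to q" behavior, and q_max = 16 is an illustrative default.

```python
def fpga_latency(k, h, w, ch_in, ch_out, pf, q, q_max=16):
    """Sketch of Eqs. 11-12: total multiplications of a k x k convolution
    over an h x w feature map, divided by the 2^pf parallel multipliers,
    scaled by an assumed bitwidth calibration alpha_q = q / q_max."""
    macs = k * k * h * w * ch_in * ch_out   # cmp^m: multiplication count
    alpha_q = q / q_max                     # assumed linear calibration
    return alpha_q * macs / (2 ** pf)
```

Doubling the parallel factor halves the latency, and halving the bitwidth halves it under the assumed linear calibration, matching the intuition stated above.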
4.1.2 Resource
The resource (number of DSPs) of an IP with a parallel factor of pf in q-bit can be approximated as:

(13)   res^{m,q} = β^q · 2^{pf}
where β^q is the resource calibration factor under a bitwidth of q. On FPGA, the number of DSPs is non-linear in the bitwidth. For example, if the data precision is no higher than 8-bit, two multiplications can be calculated on one Xilinx DSP48, reducing the DSP usage by half; if the data precision is no higher than 4-bit, we assume that multiplications are computed using look-up tables (LUTs) instead. Therefore, we use a piecewise function to describe β^q: β^q = 0 when q ≤ 4; β^q = 0.5 when 4 < q ≤ 8; β^q = 1 when q > 8.
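The piecewise DSP model of Eq. 13 can be written out directly. The breakpoint values (0, 0.5, 1) follow the half-DSP48 and LUT cases described above; treat them as a sketch of the calibration rather than exact device costs.

```python
def dsp_usage(pf, q):
    """Sketch of Eq. 13: DSP count for an IP with 2^pf parallel multipliers.
    beta_q is the piecewise per-multiplier DSP cost at bitwidth q:
      q <= 4      -> multipliers built from LUTs, no DSPs
      4 < q <= 8  -> two multiplications per DSP48, half cost
      q > 8       -> one DSP per multiplication"""
    if q <= 4:
        beta_q = 0.0
    elif q <= 8:
        beta_q = 0.5
    else:
        beta_q = 1.0
    return beta_q * (2 ** pf)
```

The discontinuities at q = 4 and q = 8 are exactly why quantization must be co-searched with the parallel factors: dropping one precision tier can free half (or all) of the DSP budget for more parallelism.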
4.2. GPU
On GPUs, the most widely used performance metric is latency, so we let perf^q be the latency assuming a batch size of 1, and we assume the resource is fixed for a given GPU. Since GPU latency is relatively easy to measure, we use the normalized latency from directly measured values to represent the inference latency under q-bit data precision. Therefore, perf^q is a constant under a specific q. Currently, GPU data precision is greatly restricted by framework support. Since the current TensorRT only supports 8-bit fixed-point and 16/32-bit floating-point data, we limit the data precisions to 8/16/32-bit for now, which can easily be extended to support more bitwidths. Meanwhile, since mixed-precision inference has not been well supported by GPU development frameworks, we constrain the overall DNN to use the same data precision; that is, all operations share the same quantization sampling parameters, which simplifies Eq. 2.
4.3. Dedicated Accelerators
Besides GPUs and FPGAs, there are also dedicated ASIC accelerators for efficient DNN implementation, such as Stripes (judd2016stripes), Loom (sharify2018loom) and BitFusion (sharma2018bit), which are DNN accelerators that support dynamic data precisions efficiently. As an example, in the Loom (sharify2018loom) work, the computation latency and energy of convolution layers scale almost proportionally with the precisions of weights and activations. Our proposed method can be directly applied to such accelerators as well, by formulating the latency and energy of an operation proportionally to the data precision. We leave this for future work.
5. Overall Algorithm
First, the DNN variables A and implementation variables I are initialized, including the number of blocks N, the number of operation candidates M, and the sampling parameters A and Θ. For the recursive FPGA architecture, the algorithm initializes M shared IP instances, each with an initial parallel factor. For the pipelined FPGA architecture, the algorithm initializes parallel factors for all N × M operation candidates. For different devices, only the initializations differ; the remaining co-search follows the same procedure.
After initialization, the algorithm optimizes Eq. 1 using stochastic gradient descent on the fused search variables {A, I}, following a bi-level approach (liu2018darts). In each iteration, it first fixes A, Θ and the parallel factors, and updates the DNN weights by minimizing the training loss on the training dataset. Then, it fixes the DNN weights and updates A, Θ and the parallel factors by descending Eq. 1 on the validation set. The algorithm iterates until the DNN training converges or a fixed number of epochs is reached. Finally, the searched DNN needs to be trained from scratch on the target dataset, e.g. ImageNet, and the implementation variables, such as the parallel factors, also need to be re-tuned.
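The alternating update can be illustrated on a toy problem. This is not the EDD objective itself: the two quadratic losses, the variable names and the learning rate are all hypothetical stand-ins chosen so the gradients are analytic, purely to show the shape of the bi-level loop.

```python
def bilevel_search(steps=300, lr=0.1):
    """Toy analogue of the alternating bi-level update: the inner step
    descends a 'training loss' in the weight w with the search variable
    a fixed; the outer step descends a 'validation loss' in a with w
    fixed. Both losses are hypothetical quadratics, not Eq. 1."""
    w, a = 5.0, 5.0
    for _ in range(steps):
        # inner step: fix a, update weights on train loss (w - a)^2
        w -= lr * 2.0 * (w - a)
        # outer step: fix w, update search variable on
        # validation loss (a - 1)^2 + (w - a)^2
        a -= lr * (2.0 * (a - 1.0) - 2.0 * (w - a))
    return w, a
```

The two coupled descents settle at the joint optimum (here w = a = 1), mirroring how EDD alternates weight training with search-variable updates until both converge.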
Table 1. Accuracy and inference latency comparison (GPU: Titan RTX; FPGA: ZCU102 with CHaiDNN).

Model                                     Top-1 Err. (%)   Top-5 Err. (%)   GPU Latency   FPGA Latency
Baseline Models
GoogleNet                                 30.22            10.47            27.75 ms      13.25 ms
MobileNet-V2 (sandler2018mobilenetv2)     28.1             9.7              17.87 ms      10.85 ms
ShuffleNet-V2 (ma2018shufflenet)          30.6             11.7             21.91 ms      N/A
ResNet18                                  30.2             10.9             9.71 ms       10.15 ms
Hardware-aware NAS Models
MnasNet-A1 (tan2019mnasnet)               24.8             7.5              17.94 ms      8.78 ms
FBNet-C (wu2019fbnet)                     24.9             7.6              22.54 ms      12.21 ms
Proxyless-cpu (cai2018proxylessnas)       24.7             7.6              21.34 ms      10.81 ms
Proxyless-mobile (cai2018proxylessnas)    25.4             7.8              21.23 ms      10.78 ms
Proxyless-gpu (cai2018proxylessnas)       24.9             7.5              15.72 ms      10.79 ms
EDD-Net-1                                 25.3             7.7              11.17 ms      11.15 ms
EDD-Net-2                                 25.4             7.9              13.00 ms      7.96 ms
Table 2. EDD-Net-1 on Nvidia 1080 Ti under different data precisions (TensorRT).

Precision    32-bit Floating   16-bit Floating   8-bit Integer
Test Error   25.5%             25.3%             26.4%
Latency      2.83 ms           2.29 ms           1.74 ms
Table 3. Throughput comparison on ZC706 FPGA (16-bit fixed point).

Model       Top-1 Error (%)   Top-5 Error (%)   Throughput (ZC706)
VGG16       29.5              10.0              27.7 fps
EDD-Net-3   25.6              7.7               40.2 fps
6. Experiments
We apply our EDD co-search on a subset of the ImageNet dataset randomly sampled from 100 classes. The searched DNNs are then trained from scratch on the entire ImageNet with 1000 classes. We run the EDD search for a fixed 50 epochs. The initial DNN has N = 20 MBConv blocks. Each MBConv has a filter size and a channel expansion ratio chosen from predefined candidate sets, so there are M candidate operations within an MBConv block with different filter sizes and numbers of channels. During the search, for GPUs, the DNN weights are 8/16/32-bit and the activations are 32-bit; for FPGAs, the DNN weights are 4/8/16-bit and the activations are 16-bit fixed point.
We demonstrate our EDD methodology targeting three different hardware platforms, each with a searched DNN model, called EDD-Net. We then compare our DNNs with those searched by the state-of-the-art hardware-aware NAS approaches. The three DNNs target: (1) low-latency-oriented GPU (EDD-Net-1); (2) recursive FPGA architecture (EDD-Net-2); (3) pipelined FPGA architecture (EDD-Net-3). The three DNNs are shown in Fig. 4; each is produced through EDD within a 12-hour search on a P100 GPU.
First, for the GPU-targeted EDD-Net-1, the algorithm suggests 16-bit precision for the weights under the combined objective including both accuracy and latency. We compare EDD-Net-1 with the state-of-the-art hardware-aware NAS approaches in Table 1, where the GPU inference latency is measured on a Titan RTX GPU. EDD-Net-1 reaches accuracy similar to the state-of-the-art DNN models while achieving the shortest inference latency, 11.17 ms, which is 1.4x faster than Proxyless-gpu (cai2018proxylessnas), the previous best result reported through a NAS approach. Compared to mobile-oriented NAS results as a reference, it is 2.0x faster than FBNet-C (wu2019fbnet) and 1.6x faster than MnasNet (tan2019mnasnet). Table 2 shows the accuracy and latency of EDD-Net-1 on an Nvidia 1080 Ti GPU after retraining and fine-tuning with TensorRT under different data precisions.
Second, we intended to compare the FPGA-targeted EDD-Net-2 and EDD-Net-3 with existing FPGA/DNN co-design works such as (jiang2019accuracy) and (hao2019fpga), but neither provides accuracy results on ImageNet. Therefore, to make a relatively fair comparison for EDD-Net-2, which targets a recursive FPGA accelerator, we adopt the well-recognized CHaiDNN framework (CHaiDNN), which is also a recursive FPGA accelerator. The FPGA latency is collected by running various DNN models with CHaiDNN accelerators under the same data precision on a Xilinx ZCU102 FPGA, as shown in Table 1 (ShuffleNet (ma2018shufflenet) is currently not supported by CHaiDNN). EDD-Net-2 delivers the shortest FPGA latency among all the DNNs, 7.96 ms: it is 1.37x faster than the FPGA implementation of ProxylessNet (cai2018proxylessnas), 1.53x faster than FBNet (wu2019fbnet) and 1.1x faster than MnasNet (tan2019mnasnet). This result shows that our methodology generalizes well to FPGAs and can effectively search for FPGA-friendly DNNs.
Third, EDD-Net-3 is searched targeting a pipelined FPGA accelerator. In this case, we limit the total number of DNN blocks, because more blocks require more resource and more complicated memory control logic. Fig. 4 shows that EDD-Net-3 is shallower but has more channels and larger kernels. We compare the throughput of EDD-Net-3 with a state-of-the-art pipelined FPGA accelerator, DNNBuilder (zhang2018dnnbuilder), on a ZC706 FPGA with 900 DSPs. As shown in Table 3, under 16-bit fixed point, EDD-Net-3 achieves 1.45x higher throughput with a much higher accuracy.
7. Conclusion
In this work, we proposed EDD, a fully simultaneous, efficient differentiable DNN architecture and implementation co-search methodology, which can target different hardware devices with different performance objectives. We formulated the co-search problem as a differentiable mathematical formulation by fusing the DNN architecture search variables and the hardware implementation variables into one solution space, considering both accuracy loss and hardware performance loss. In the experiments, we demonstrated three DNN models, targeting a low-latency GPU, a recursive FPGA accelerator and a pipelined FPGA accelerator, respectively. Our EDD models deliver accuracy similar to the best existing NAS-searched DNN models on ImageNet, with performance improved by 1.4x on GPU and 1.45x on FPGA. Future work includes GPU power and resource formulation, and EDD search for dedicated accelerators.
Acknowledgement
This work is supported in part by the IBMIllinois Center for Cognitive Computing System Research (C3SR), Semiconductor Research Corporation (SRC) and Campus for Research Excellence and Technological Enterprise (CREATE) programme in Singapore.