AI algorithms have attracted ever-increasing research interest, and remarkable achievements have been demonstrated, especially for deep neural networks (DNNs). To improve algorithm quality, usually accuracy, a recent approach called neural architecture search (NAS) (zoph2017learning; huang2017densely; real2018regularized) has shown great success in automatically developing DNNs that outperform human-crafted designs. Meanwhile, the optimization techniques for implementing AI algorithms on hardware are also being intensively studied. The goal of implementation is to improve hardware performance, such as latency, throughput and energy efficiency. Typical implementation techniques include kernel and DNN optimizations on GPUs and customized accelerator designs on FPGAs and ASICs (qiu2016going; zhang2018dnnbuilder; x2017High; chen2019cloud; hao2019fpga; xu2020autodnnchip). On top of these achievements, to further improve AI solution quality, AI algorithm designers and hardware developers have begun to explore joint optimization opportunities. For example, hardware-aware NAS has been drawing a lot of attention by considering hardware features during DNN design (cai2018proxylessnas; tan2019mnasnet; stamoulis2019single; wu2019fbnet; gong2019mixed; zhang2020skynet). Meanwhile, hardware/software co-design approaches (jiang2019accuracy; hao2019fpga) focus on FPGA implementation characteristics and study their influence on DNN software design.
Despite these achievements of hardware-aware NAS and hardware/software co-design, a large optimization opportunity is still missing: the hardware implementation should be searched simultaneously during NAS. Here, for general-purpose computing devices such as CPUs and GPUs, implementation search means optimizing DNN implementations through techniques such as kernel fusion and memory access optimization. For reconfigurable devices such as FPGAs, implementation search means optimizing a customized DNN accelerator through techniques such as quantization, loop tiling and parallelization. Hardware implementation search not only provides more accurate performance evaluations of latency and throughput but, more importantly, provides instant guidance to hardware-aware DNN design during NAS. Currently, all existing works omit this large design space of implementation search from their NAS flows, using estimated hardware performance from a fixed implementation (cai2018proxylessnas; tan2019mnasnet; stamoulis2019single; wu2019fbnet; gong2019mixed). We provided an initial discussion of the potential of simultaneous neural architecture and implementation co-search, called NAIS, in (hao2019nais). Inspired by (hao2019nais), in this work we propose a simultaneous and efficient DNN and implementation co-search methodology. We summarize our contributions as follows:
This is the first work that proposes a mathematical formulation to solve the simultaneous DNN architecture and hardware implementation co-search problem. The formulation fuses the search variables for DNN architecture and its hardware implementation into one search space, to simultaneously maximize the DNN accuracy and implementation performance.
The proposed formulation is differentiable with respect to the fused search space, which allows a gradient descent algorithm to be applied and enables an efficient differentiable DNN and implementation co-search methodology (EDD).
The formulation is unified and comprehensive. It can be applied to various hardware platforms such as GPUs, FPGAs and dedicated accelerators; it can target various performance objectives such as latency, throughput or energy; and it formulates resource usage while accounting for resource sharing.
We demonstrate our EDD methodology targeting three different architectures: GPU, recursive FPGA accelerator, and pipelined FPGA accelerator. Each model produced by EDD achieves accuracy similar to the best existing DNNs on ImageNet but with superior hardware performance: our GPU-targeted DNN is faster than the state-of-the-art Proxyless solution (cai2018proxylessnas), and our FPGA-targeted DNN delivers higher throughput than the state-of-the-art DNNBuilder solution (zhang2018dnnbuilder).
2. Related Works
Neural architecture search (NAS) is a technique for automating the design of DNNs that can outperform hand-crafted ones (zoph2017learning; huang2017densely; real2018regularized). Among NAS approaches, differentiable NAS is becoming prevalent because of its high search efficiency in terms of GPU hours (liu2018darts). One typical way is to construct a DNN supernet composed of multiple branch candidates associated with architecture weights. The architecture weights are differentiable with respect to the loss function of the supernet and are updated through stochastic gradient descent. The final searched DNN keeps the branches with larger architecture weights and eliminates the others. For example, (cai2018proxylessnas) uses binarized parameters, and (wu2019fbnet) uses Gumbel-Softmax to choose between different DNN branches. Previously published literature also introduces hardware-aware NAS (cai2018proxylessnas; tan2019mnasnet; stamoulis2019single; wu2019fbnet; gong2019mixed). Some works incorporate hardware latency into the objective of NAS (cai2018proxylessnas; stamoulis2019single; wu2019fbnet), while others treat latency as a hard constraint (tan2019mnasnet). In (zhang2020skynet), a bottom-up DNN design approach is proposed for hardware-efficient models.
Meanwhile, various embedded FPGA-based accelerators have been studied to support more efficient DNN inference (qiu2016going; zhang2018dnnbuilder). Other approaches involve hardware/software co-design (jiang2019accuracy; hao2019fpga). Specifically, researchers in (jiang2019accuracy) proposed a reinforcement learning based architecture search with FPGA implementation performance integrated into the reward function. The work in (hao2019fpga) proposed a bundle-based co-design methodology, where a bundle is the basic building block for both the FPGA accelerator and the DNN model. However, none of the previous works is able to explore the DNN architecture and implementation co-search space simultaneously and comprehensively.
3. EDD Problem Formulation
Our simultaneous DNN and implementation co-search method fuses the design space of DNN architecture search and hardware implementation search, as shown in Fig. 1. We collectively denote the variables used in DNN search and implementation search as A and I, respectively, and the fused co-search space is {A, I}. The objective of DNN search is to quickly find a DNN architecture while minimizing the accuracy loss, denoted as Acc_loss. For implementation search, we define a performance loss, denoted as Perf_loss, which can be specified by users as end-to-end inference latency, throughput, energy, DNN model complexity, etc. We denote the resource utilization as RES and the resource upper bound of the target hardware as RES_ub. The DNN and implementation co-search problem is to minimize the accuracy loss and performance loss simultaneously by effectively searching {A, I}:
In Eq. 1, Acc_loss is a function of A and I; Perf_loss and RES are functions of I. The resource upper bound RES_ub is expressed in an exponent term to introduce a large penalty when it is violated. It is worth noting that in existing hardware-aware NAS approaches, only A is searched while I is fixed during NAS; in our co-search formulation, I is also a variable.
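To make the penalized objective concrete, the following sketch shows how the exponential resource term behaves. The penalty base `C`, the multiplicative combination of the two losses, and all input values are illustrative assumptions, not the paper's exact constants:

```python
import math

def co_search_loss(acc_loss, perf_loss, res, res_ub, C=math.e):
    # Accuracy loss and performance loss are minimized jointly
    # (combined multiplicatively here as an assumption); the
    # exponential term stays near zero while res <= res_ub and
    # grows rapidly once the resource bound is violated.
    return acc_loss * perf_loss + C ** (res - res_ub)

under_budget = co_search_loss(2.0, 3.0, res=0.0, res_ub=1000.0)   # penalty ~ 0
over_budget = co_search_loss(1.0, 1.0, res=150.0, res_ub=100.0)   # penalty dominates
```

Because the penalty is differentiable everywhere, it can be folded directly into gradient-based co-search instead of being enforced as a hard constraint.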
As introduced in Sec. 2, motivated by the high search efficiency and appealing model accuracy of differentiable NAS, in this work we propose a differentiable formulation for both A and I: in Eq. 1, Acc_loss is differentiable with respect to A and I, while Perf_loss and RES are differentiable with respect to I. By descending the objective with respect to the variables in {A, I} on the validation set, the DNN architecture and its implementation are searched simultaneously.
Fig. 1 shows our proposed overall differentiable design space. The blue blocks represent the DNN search space, while the red blocks represent the hardware implementation search space. We first introduce the DNN search space in Sec. 3.1, and then the design space merged with implementation in Sec. 3.2.
3.1. NAS Design Space
The differentiable NAS space is shown as the blue blocks in Fig. 1. First, the DNN is composed of M basic building blocks, blk_i, where 1 ≤ i ≤ M. In this work, in order to design hardware-friendly DNNs and to reduce search time, we adopt a single-path DNN structure without branches (stamoulis2019single). Inside the i-th block, there are N candidate operations, denoted as op_j (1 ≤ j ≤ N). We let the operations be the most commonly used DNN blocks in NAS approaches, called MBConv (tan2019mnasnet). An MBConv is composed of sequential layers of conv-1x1, dwconv-kxk (depth-wise convolution with kernel size k) and conv-1x1; between the two conv-1x1 layers, the number of channels expands/shrinks by a ratio ch. The output of a block is calculated from the outputs of its candidate operations. For example, in (liu2018darts) the output is the weighted sum of the operations, where the weights are determined by a Softmax function. Instead of Softmax, in this work we use the Gumbel-Softmax function as in (wu2019fbnet) in order to sample only one operation out of the N candidates during feedforward propagation, since Gumbel-Softmax converts the discrete non-differentiable sampling into continuous differentiable sampling. This greatly reduces the memory requirement and speeds up feedforward propagation. The sampling parameters form a two-dimensional M x N array, denoted as A, which is the primary DNN search variable.
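The Gumbel-Softmax sampling step above can be sketched as follows. This is a minimal stdlib-only illustration, not the training code; the logits and temperature are hypothetical:

```python
import math
import random

def gumbel_softmax(logits, tau=1.0):
    # Add Gumbel(0, 1) noise to each logit, then apply a softmax.
    # With a small temperature tau the result approaches a one-hot
    # sample, yet remains differentiable with respect to the logits.
    gumbels = [-math.log(-math.log(random.random())) for _ in logits]
    scaled = [(l + g) / tau for l, g in zip(logits, gumbels)]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# One supernet block: sample a single candidate operation per forward pass.
random.seed(0)
weights = gumbel_softmax([0.5, 2.0, 0.1], tau=0.1)
chosen = weights.index(max(weights))  # index of the (soft) one-hot winner
```

Because only the sampled branch is active in the forward pass, memory usage stays close to that of a single-path network rather than the full supernet.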
3.2. Implementation Formulation
As shown in the red block in Fig. 1, each candidate operation has its own implementation variables, forming an implementation search space I. The primary implementation variable is the quantization q, i.e., the data precision, since it has a large impact on DNN accuracy, implementation performance and hardware resource. Rather than in a train-then-quantize manner, the quantization shall be searched together with the DNN structure to provide implementation performance feedback. Besides quantization, other implementation variables may be device-oriented; for example, the FPGA implementation design space includes parallelism, loop tiling factors, etc.
To formulate the final Perf_loss and RES of a DNN in Eq. 1, we need to capture the intermediate performance and resource of each operation and each DNN block. As shown in the bottom four blocks of Fig. 1, there are four stages to derive Eq. 1:
Stage-1: we first formulate the quantization in a differentiable way; then, for each operation candidate op_j under q-bit quantization, we formulate the performance as Perf_{j,q} and the resource as RES_{j,q}.
Stage-2: the performance and resource of op_j regardless of quantization, Perf_j and RES_j, are derived from Perf_{j,q} and RES_{j,q} in Stage-1.
Stage-3: the performance and resource of the i-th DNN block, Perf^i and RES^i, are derived from Perf_j and RES_j in Stage-2.
Stage-4: the overall DNN performance loss and resource usage, Perf_loss and RES, are derived from Perf^i and RES^i in Stage-3.
Finally, Perf_loss and RES are plugged into Eq. 1 as the objective function during our EDD co-search.
In the following, we introduce the differentiable quantization and performance and resource formulations stage by stage.
3.2.1 Stage-1: Differentiable Quantization
As shown in the red block in Fig. 1, to enable a differentiable quantization formulation, we create Q quantization paths for each operation op_j, indicating that each operation has Q quantization choices, from 1-bit to Q-bit. Similar to the differentiable NAS formulation, each quantization scheme is also sampled by the Gumbel-Softmax function with a sampling parameter phi_{j,q}, generating the probability of op_j being quantized to q-bit. The phi parameters form a three-dimensional array of size M x N x Q, denoted as Phi, which is the primary implementation search variable in I. In this formulation, we have the flexibility to choose different quantizations for different layers of a DNN; such mixed-precision computation can be well supported by reconfigurable hardware and dedicated accelerators.
Under q-bit quantization, the performance and resource of operation op_j, Perf_{j,q} and RES_{j,q}, are functions of the implementation variables in I (including the quantization q). Since one operation contains multiple layers as shown in Fig. 1, for simplicity we treat them as a whole: the q-bit quantization applies to all layers within op_j, and the latency and resource are the summation over all layers. Perf_{j,q} and RES_{j,q} largely vary with devices and will be further discussed in Section 4.
Given the differentiable DNN search variables A, the differentiable quantization variables Phi, and the Perf_{j,q} and RES_{j,q} formulations under each quantization scheme, the DNN and implementation search spaces are fused as shown in Fig. 2.
3.2.2 From Stage-1 to Stage-2
Given the array Phi and the Gumbel-Softmax sampling rule, denoted as GS(phi_{j,q} | Phi_j), the performance and resource can be computed as:

Perf_j = Σ_q GS(phi_{j,q} | Phi_j) · Perf_{j,q},   RES_j = Σ_q GS(phi_{j,q} | Phi_j) · RES_{j,q}

where both Perf_j and RES_j are differentiable with respect to Phi. This computes the performance and resource expectation under the different quantizations, which follow the Gumbel-Softmax distribution with parameter Phi_j.
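A minimal sketch of this expectation, using a plain softmax of the sampling parameters in place of averaged Gumbel-Softmax samples (a simplifying assumption) and hypothetical per-bit-width latencies:

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of logits.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def expected_over_quant(phi, perf_per_bit):
    # phi: sampling logits over the quantization choices of one operation;
    # perf_per_bit: Perf_{j,q} for each candidate bit-width q.
    # The weighted sum collapses the quantization axis, yielding Perf_j.
    probs = softmax(phi)
    return sum(p * v for p, v in zip(probs, perf_per_bit))

# Latency of one operation under 4-/8-/16-bit candidates (hypothetical values).
perf_j = expected_over_quant([0.2, 1.5, 0.1], [1.0, 2.0, 4.0])
```

The same weighting applied to RES_{j,q} values yields RES_j, so both expectations stay differentiable in the sampling parameters.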
3.2.3 From Stage-2 to Stage-3
Analogously, the performance and resource of the i-th DNN block are the expectations over its N candidate operations, weighted by the Gumbel-Softmax sampling of the DNN search parameters in A: Perf^i = Σ_j GS(a_{i,j} | A_i) · Perf_j, and RES^i = Σ_j GS(a_{i,j} | A_i) · RES_j.
3.2.4 From Stage-3 to Stage-4 — Performance
Given the performance and resource of the i-th DNN block, we can compute the overall DNN performance loss and resource usage, which need to be tailored to the specific search objective and device.
First, if the overall objective is end-to-end latency, total energy or model size, the performance loss can be expressed as the summation over all M DNN blocks: Perf_loss = α · Σ_i Perf^i, where the coefficient α scales Perf_loss to the same magnitude as Acc_loss in Eq. 1.
If the objective is throughput, the performance loss is the maximum latency over all blocks. Since taking the maximum is a non-differentiable operation, we use a smooth maximum, the Log-Sum-Exp (LSE) function (polak2012optimization), as a differentiable approximation: Perf_loss = α · LSE(Perf^1, ..., Perf^M) = α · log Σ_i exp(Perf^i).
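A sketch of the LSE smooth maximum over per-block latencies; the temperature `t` is an illustrative knob, not a constant from the paper:

```python
import math

def smooth_max(latencies, t=1.0):
    # Log-Sum-Exp approximation of max(); differentiable everywhere.
    # Subtracting the true max first keeps the exponentials stable.
    # As t -> 0 the approximation tightens toward the true maximum
    # (LSE always upper-bounds it).
    m = max(latencies)
    return m + t * math.log(sum(math.exp((x - m) / t) for x in latencies))

lat = [3.0, 7.0, 5.0]       # hypothetical per-block latencies
approx = smooth_max(lat, t=0.1)
```

In a throughput-oriented search, gradients of this smooth maximum concentrate on the slowest (bottleneck) block, which is exactly the block the optimizer should shrink.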
If there are multiple objectives, such as minimizing both latency and energy, then as long as the objectives are not conflicting, we can simply let Perf_loss be the product of the individual objectives.
3.2.5 From Stage-3 to Stage-4 — Resource
The formulation of the overall resource usage covers two situations: without and with resource sharing. Without sharing, the total resource is the summation over all blocks: RES = Σ_i RES^i.
However, resource sharing is very common, especially in IP-based FPGA or ASIC accelerators. Fig. 2 demonstrates a resource sharing scenario. In this example, we assume that the j-th operation of one block and the j-th operation of another block share the same piece of computing resource; in FPGA or ASIC, for example, this is a reusable IP. To allow sharing, the quantization and other implementation variables of the two operations must be identical (some accelerators allow operations of different bit-widths to share resources; in those cases this constraint is not needed).
Second, we discuss resource estimation with resource sharing. As shown in Fig. 3, the i-th row corresponds to the i-th DNN block, and the blue entries are the operations with the largest probability of being selected. In this example, the operations in the same column are most likely to be chosen in two different blocks; since they share the same computing resource, that resource shall be counted only once. Conversely, if the j-th operation is not selected in any block, its resource shall not be counted at all.
To describe resource sharing, we propose the following differentiable approximation for the resource usage RES_j^shared of operation op_j, which is shared across blocks:

RES_j^shared = tanh( Σ_i GS(a_{i,j} | A_i) ) · RES_j

In the above formula, GS(a_{i,j} | A_i) · RES_j is the unit resource expectation of op_j in block i. To avoid the operation resource being redundantly counted across blocks, we use tanh to suppress the maximum total expectation of the j-th operation to 1 before multiplying by RES_j.
Thus, the overall DNN resource with sharing is computed as: RES = Σ_j RES_j^shared.
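The sharing-aware accounting can be sketched as follows, assuming tanh as the squashing function and using hypothetical selection probabilities:

```python
import math

def shared_resource(select_prob, unit_res):
    # select_prob[i][j]: probability that block i selects operation j;
    # unit_res[j]: resource of the IP implementing operation j.
    # Summing the probabilities over blocks and squashing with tanh
    # bounds each shared IP's contribution by one copy of its resource.
    n_ops = len(unit_res)
    total = 0.0
    for j in range(n_ops):
        expect = sum(row[j] for row in select_prob)
        total += math.tanh(expect) * unit_res[j]
    return total

# Two blocks both likely to pick operation 0 -> its IP is counted ~once,
# not twice, unlike a naive per-block summation.
probs = [[0.9, 0.1], [0.8, 0.2]]
res = shared_resource(probs, [100.0, 50.0])
```

Since tanh saturates near 1, an IP selected with high probability in many blocks still contributes at most one copy of its resource, while a never-selected IP contributes almost nothing.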
4. Device-Specific Formulation
In this section, we discuss the device-specific formulations that describe the performance and resource of operation op_j under q-bit quantization, Perf_{j,q} and RES_{j,q}.
4.1. FPGAs
For FPGA implementations, we follow an IP-based accelerator architecture: for each operation op_j, there is a customizable IP instance to conduct its computation. We consider the following two FPGA accelerator architectures, recursive and pipelined:
In either architecture, the operation performance Perf_{j,q} is the operation latency. We let the operation resource RES_{j,q} be the number of DSPs, which are usually the most critical resource on FPGAs. To formulate Perf_{j,q} and RES_{j,q}, we introduce additional implementation variables for FPGA: the parallel factors of the IPs, denoted as pf. A parallel factor describes the parallelism, i.e., how many multiplications can be done concurrently. Since parallelism in FPGA designs usually increases exponentially, e.g., 64, 128, 256, we use an exponential form of pf to describe the parallelism.
4.1.1 Latency — Perf_{j,q}
As defined in Section 3.1, each operation op_j is composed of a set of sequential DNN layers such as convolution, batch normalization and activation, and the latency and resource of op_j are the summation over all layers. For an operation with parallel factor pf under q-bit quantization, its latency can be approximated as:

Perf_{j,q} ≈ (k^2 · H · W · C_in · C_out / 2^pf) · λ_q

where k is the convolution kernel size; H, W, C_in and C_out represent the data dimensions of operation op_j; and λ_q is the latency calibration under bit-width q. Intuitively, a smaller bit-width leads to shorter latency because of less off-chip data movement and less computation. For simplicity, we let λ_q grow proportionally with q to model this phenomenon.
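A sketch of such a latency model; the dimension names, the `2 ** pf` parallelism and the `q / 32` calibration are illustrative assumptions rather than the paper's exact formula:

```python
def fpga_latency(k, h, w, c_in, c_out, pf, q):
    # Total multiply-accumulate count of a k x k convolution, divided by
    # the number of parallel multipliers (2 ** pf), then scaled by a
    # bit-width calibration term that shrinks latency for narrower data.
    macs = k * k * h * w * c_in * c_out
    calib = q / 32.0            # assumed: latency grows with bit-width
    return macs / (2 ** pf) * calib

# Same layer shape, two candidate bit-widths (hypothetical dimensions).
lat_16bit = fpga_latency(k=3, h=56, w=56, c_in=32, c_out=32, pf=8, q=16)
lat_8bit = fpga_latency(k=3, h=56, w=56, c_in=32, c_out=32, pf=8, q=8)
```

Note that both halving the bit-width and incrementing the exponent `pf` halve the modeled latency, which is what makes the two knobs interchangeable trade-offs during co-search.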
4.1.2 Resource — RES_{j,q}
The resource (number of DSPs) of an IP with parallel factor pf under q-bit quantization can be approximated as: RES_{j,q} ≈ 2^pf · ε_q, where ε_q is the resource calibration under bit-width q. On FPGAs, the number of DSPs is non-linear in the bit-width. For example, if the data precision is lower than 8 bits, two multiplications can be calculated on one Xilinx DSP48, halving the DSP usage; if the data precision is lower than 4 bits, we assume that multiplications are computed using lookup tables (LUTs) instead. Therefore, we use a piecewise function for ε_q: ε_q = 0 when q < 4; ε_q = 0.5 when 4 ≤ q < 8; and ε_q = 1 when q ≥ 8.
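The piecewise DSP calibration can be sketched as follows; the exact boundary handling at 4 and 8 bits is an assumption based on the description above:

```python
def dsp_calibration(q):
    # Piecewise DSP cost per multiplier as a function of bit-width q:
    # below 4 bits, multiplications map to LUTs (no DSPs); from 4 up to
    # (but excluding) 8 bits, two multiplications pack into one DSP48;
    # at 8 bits and above, each multiplication uses one DSP.
    if q < 4:
        return 0.0
    if q < 8:
        return 0.5
    return 1.0

def dsp_usage(pf, q):
    # DSP count scales with the parallelism 2 ** pf.
    return (2 ** pf) * dsp_calibration(q)
```

The discontinuities at the 4- and 8-bit boundaries are what make searched quantizations cluster at these precision breakpoints, since crossing a boundary changes the resource cost step-wise rather than smoothly.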
4.2. GPUs
On GPUs, the most widely used performance metric is latency, so we let Perf_{j,q} be the latency with batch size 1, and we assume the resource is fixed for a given GPU. Since GPU latency is relatively easy to measure, we use normalized latency from directly measured values to represent the inference latency under q-bit data precision; therefore, Perf_{j,q} is a constant for a specific q. Currently, GPU data precision is largely restricted by framework support. Since the current TensorRT only supports 8-bit fixed-point and 16-/32-bit floating-point data, we limit the data precisions to 8/16/32 bits for now, but this can easily be extended to more bit-widths. Meanwhile, since mixed-precision inference is not yet well supported by GPU development frameworks, we constrain the overall DNN to use a single data precision: all operations share the same q, which simplifies Eq. 2.
4.3. Dedicated Accelerators
Besides GPUs and FPGAs, there are also dedicated ASIC accelerators for efficient DNN implementation, such as Stripes (judd2016stripes), Loom (sharify2018loom) and Bit-Fusion (sharma2018bit), which support dynamic data precisions efficiently. As an example, in Loom (sharify2018loom), the computation latency and energy of convolution layers scale almost proportionally with the precisions of weights and activations. Our proposed method can be directly applied to such accelerators as well, by formulating the latency and energy of an operation proportionally to the data precision. We leave this for future work.
5. Overall Algorithm
First, the DNN variables A and implementation variables I are initialized, including the number of blocks M, the number of operation candidates N, and the sampling parameters in A and Phi. For the recursive FPGA architecture, the algorithm initializes one IP instance per operation candidate, each with an initial parallel factor; for the pipelined FPGA architecture, it initializes a parallel factor for every operation candidate in every block. Across devices, only the initializations differ; the remaining co-search follows the same procedure.
After initialization, the algorithm optimizes Eq. 1 using stochastic gradient descent on the fused search variables {A, I}, following a bilevel approach (liu2018darts). In each iteration, it first fixes A and I and updates the DNN weights by minimizing the training loss on the training dataset; it then fixes the DNN weights and updates A and I by descending Eq. 1 on the validation set. The algorithm iterates until the DNN training converges or a fixed number of epochs is reached. Finally, the searched DNN is trained from scratch on the target dataset, e.g. ImageNet, and the implementation variables, such as the parallel factors, are re-tuned.
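A toy sketch of the bilevel alternation, with scalar stand-ins for the DNN weights and the search variables; the quadratic losses and finite-difference gradients are illustrative only, not the actual training procedure:

```python
def bilevel_search(train_loss, val_loss, w, a, steps=100, lr=0.1, eps=1e-4):
    # Alternate two gradient steps per iteration:
    #   (1) fix the architecture/implementation variable `a`, descend
    #       the weights `w` on the training loss;
    #   (2) fix `w`, descend `a` on the validation loss.
    # Central finite differences stand in for autograd to keep the
    # sketch dependency-free.
    def grad(f, x, other):
        return (f(x + eps, other) - f(x - eps, other)) / (2 * eps)
    for _ in range(steps):
        w -= lr * grad(train_loss, w, a)   # inner step: weights
        a -= lr * grad(val_loss, a, w)     # outer step: search variables
    return w, a

# Toy quadratic losses with known minima (w* = 2, a* = -1).
w_opt, a_opt = bilevel_search(
    lambda w, a: (w - 2.0) ** 2,
    lambda a, w: (a + 1.0) ** 2,
    w=0.0, a=0.0)
```

The alternation matters because the validation-set step sees weights that have already adapted to the current architecture, which is what keeps the architecture gradient meaningful.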
Table 1. ImageNet test error and inference latency on GPU (Titan RTX) and FPGA (ZCU102 with CHaiDNN).

| Model | Top-1 Error (%) | Top-5 Error (%) | GPU Latency (Titan RTX) | FPGA Latency (ZCU102, CHaiDNN) |
|---|---|---|---|---|
| GoogleNet | 30.22 | 10.47 | 27.75 ms | 13.25 ms |
| MobileNet-V2 (sandler2018mobilenetv2) | 28.1 | 9.7 | 17.87 ms | 10.85 ms |
| ShuffleNet-V2 (ma2018shufflenet) | 30.6 | 11.7 | 21.91 ms | NA |
| *Hardware-aware NAS Models* | | | | |
| MNasNet-A1 (tan2019mnasnet) | 24.8 | 7.5 | 17.94 ms | 8.78 ms |
| FBNet-C (wu2019fbnet) | 24.9 | 7.6 | 22.54 ms | 12.21 ms |
| Proxyless-cpu (cai2018proxylessnas) | 24.7 | 7.6 | 21.34 ms | 10.81 ms |
| Proxyless-Mobile (cai2018proxylessnas) | 25.4 | 7.8 | 21.23 ms | 10.78 ms |
| Proxyless-gpu (cai2018proxylessnas) | 24.9 | 7.5 | 15.72 ms | 10.79 ms |
| EDD-Net-1 | 25.3 | 7.7 | 11.17 ms | 11.15 ms |
| EDD-Net-2 | 25.4 | 7.9 | 13.00 ms | 7.96 ms |
Table 2. EDD-Net-1 latency on Nvidia 1080 Ti (TensorRT) under different data precisions.

| | 32-bit Floating | 16-bit Floating | 8-bit Integer |
|---|---|---|---|
| Latency | 2.83 ms | 2.29 ms | 1.74 ms |
Table 3. Accuracy and throughput comparison on ZC706.

| Model | Top-1 Error (%) | Top-5 Error (%) | Throughput (ZC706) |
We apply our EDD co-search on a subset of ImageNet randomly sampled from 100 classes; the searched DNNs are then trained from scratch on the entire 1000-class ImageNet. We run the EDD search for a fixed 50 epochs. The initial DNN has 20 MBConv blocks (M = 20). Each MBConv has a choice of filter sizes and channel expansion ratios, so the candidate operations within an MBConv block differ in filter size and number of channels. During the search, for GPUs, the DNN weights are 8-/16-/32-bit and the activations are 32-bit; for FPGAs, the weights are 4-/8-/16-bit and the activations are 16-bit fixed-point.
We demonstrate our EDD methodology targeting three different hardware platforms, each with a searched DNN model, called EDD-Net, and compare our DNNs with those searched by state-of-the-art hardware-aware NAS approaches. The three DNNs target: (1) a low-latency-oriented GPU (EDD-Net-1); (2) a recursive FPGA architecture (EDD-Net-2); and (3) a pipelined FPGA architecture (EDD-Net-3). The three DNNs are shown in Fig. 4; each is produced by EDD within a 12-hour search on a P100 GPU.
First, for the GPU-targeted EDD-Net-1, the algorithm suggests 16-bit precision for the weights under the combined objective covering both accuracy and latency. We compare EDD-Net-1 with state-of-the-art hardware-aware NAS approaches in Table 1, where the GPU inference latency is measured on a Titan RTX GPU. EDD-Net-1 reaches accuracy similar to the state-of-the-art DNN models while achieving the shortest inference latency, 11.17 ms, which is 1.4× faster than Proxyless-gpu (cai2018proxylessnas), the previous best result reported through a NAS approach. Compared to mobile-oriented NAS results as a reference, it is 2.0× faster than FBNet-C (wu2019fbnet) and 1.6× faster than MNasNet (tan2019mnasnet). Table 2 shows the accuracy and latency results of EDD-Net-1 on an Nvidia 1080 Ti GPU after re-training and fine-tuning with TensorRT under different data precisions.
Second, we intended to compare the FPGA-targeted EDD-Net-2 and EDD-Net-3 with existing FPGA/DNN co-design works such as (jiang2019accuracy) and (hao2019fpga), but neither provides accuracy results on ImageNet. Therefore, to make a relatively fair comparison for EDD-Net-2, which targets a recursive FPGA accelerator, we adopt the well-recognized CHaiDNN framework (CHaiDNN), which is also a recursive FPGA accelerator. The FPGA latency in Table 1 is collected by running the various DNN models on CHaiDNN accelerators under the same data precision on a Xilinx ZCU102 FPGA; ShuffleNet (ma2018shufflenet) is currently not supported by CHaiDNN. EDD-Net-2 delivers the shortest FPGA latency among all the DNNs, 7.96 ms: it is 1.37× faster than the FPGA implementation of ProxylessNet (cai2018proxylessnas), 1.53× faster than FBNet (wu2019fbnet) and 1.1× faster than MNasNet (tan2019mnasnet). This result shows that our methodology generalizes well to FPGAs and can effectively search for FPGA-friendly DNNs.
Third, EDD-Net-3 is searched targeting a pipelined FPGA accelerator. In this case, we limit the total number of DNN blocks, because more blocks require more resource and more complicated memory control logic. Fig. 4 shows that EDD-Net-3 is shallower but has more channels and larger kernels. We compare the throughput of EDD-Net-3 with a state-of-the-art pipelined FPGA accelerator, DNNBuilder (zhang2018dnnbuilder), on a ZC706 FPGA with 900 DSPs. As shown in Table 3, under 16-bit fixed-point, EDD-Net-3 achieves 1.45× higher throughput with much higher accuracy.
7. Conclusion
In this work, we proposed EDD, a fully simultaneous, efficient differentiable DNN architecture and implementation co-search methodology that can target different hardware devices with different performance objectives. We formulated the co-search problem as a single differentiable mathematical formulation by fusing the DNN architecture search variables and the hardware implementation variables into one solution space, considering both accuracy loss and hardware performance loss. In the experiments, we demonstrated three DNN models, targeting a low-latency GPU, a recursive FPGA accelerator and a pipelined FPGA accelerator, respectively. Our EDD models deliver accuracy similar to the best existing NAS-searched DNN models on ImageNet, with performance improved by 1.4× on GPU and 1.45× on FPGA. Future work includes GPU power and resource formulation, and EDD search for dedicated accelerators.
Acknowledgments
This work is supported in part by the IBM-Illinois Center for Cognitive Computing System Research (C3SR), the Semiconductor Research Corporation (SRC) and the Campus for Research Excellence and Technological Enterprise (CREATE) programme in Singapore.