The world has seen tremendous improvements in AI algorithms as well as their high performance implementations in recent years. Remarkable achievements have been demonstrated for AI algorithms in many areas, with rapid improvements in algorithm quality and robustness. The deep neural network (DNN) is one of the most popular AI algorithms with impressive advancements, from AlexNet (Krizhevsky and others, 2012) to modern models (Szegedy and others, 2015; Simonyan and Zisserman, 2014; He and others, 2016). Meanwhile, the optimization techniques for high performance implementations of AI algorithms on hardware are also being intensively studied. Such implementation techniques include kernel and DNN optimizations on GPUs and TPUs (18; 40; 12; 16), accelerator designs on customizable hardware such as FPGAs (Zhang and others, 2015; Sharma and others, 2016; Zhang and others, 2018b, a; Chen and others, ) and AI chips (Chen and others, 2016; Yin and others, 2017).
Despite many of these accomplishments, there are still many challenges, one of which is the gap between high quality DNN models during design and their implementation performance during deployment. One reason for such a gap is the isolated design of DNNs and optimization of their implementations, where the former does not integrate sufficient hardware knowledge, and the latter does not have enough freedom to accommodate pre-designed DNNs at such a late stage. Instead, DNNs and their hardware implementations need to be designed simultaneously, i.e., DNN/implementation co-design, as illustrated in Fig. 1. We call it Neural Architecture and Implementation Search (NAIS). The outputs of NAIS include both DNNs that are of high quality of result (QoR), and implementations that are of high quality of service (QoS). The NAIS methodology brings immense optimization opportunities for:
Proposing specific hardware-oriented DNN models. For DNN deployment, there are many hardware candidates such as GPUs, cloud and edge TPUs, cloud and embedded FPGAs, each of which has largely different characteristics such as computation capability, memory capacity and bandwidth. The NAIS method will explore DNNs based on specific hardware features and search for DNNs with the best match.
Meeting resource and performance constraints. The NAIS method will search for DNNs within available hardware resources and performance constraints, which provides predictable and guaranteed performance for DNN deployment.
Shortening design cycles. While existing top-down design methods require back-and-forth efforts to find satisfying solutions, an automated NAIS flow can simultaneously find an optimized DNN model and its deployment on hardware.
In modern industry applications, as AI algorithms are increasingly adopted, high performance computing platforms are in great need, especially with reconfigurable devices for acceleration. Take autonomous driving as an example, which is one of the most demanding areas for high QoR AI algorithms and high performance computing implementation. Fig. 2 shows three types of computing platforms for autonomous driving: commodity platform composed of commercial CPUs, GPUs or DSPs, semi-customized platform composed of GPUs and FPGAs, and fully-customized platform composed of dedicated ASICs. As shown in the figure, though fully-customized platforms are most favorable in terms of high performance and low price-performance ratio (e.g. $/Gops), they suffer from high non-recurring engineering (NRE) cost, long design cycle and high risk in making mistakes, which hinders their wide adoption. In contrast, with reconfigurable devices such as FPGAs, semi-customized platforms become a competitive alternative with a good trade-off in performance and cost. Moreover, once the AI algorithms and their hardware implementations have been fully validated on FPGAs, the design can be made into ASICs to take advantage of what a fully-customized platform can offer. Thus, finding high quality AI algorithms with their optimized implementations on reconfigurable devices not only provides good solutions for semi-customized platforms, but also provides a good path to move from semi to fully customized platforms. Because of this, there is a pressing need for NAIS, an automatic co-design of AI algorithms and their optimized implementations, on GPUs, FPGAs and ASICs, given the widely varying device characteristics and the large design space of both algorithmic and implementation optimization.
Motivated by those opportunities, in this work, we propose NAIS as a simultaneous DNN/implementation co-design approach to effectively search for high quality DNN models and high performance implementations for different hardware platforms. We demonstrate how such a NAIS approach can be utilized to solve real-world applications, including autonomous driving.
2. NAIS Design Methodology
A NAIS methodology has two tasks: to search for DNNs of high QoR (e.g. accuracy), and for implementations of high QoS (e.g. latency, throughput). Such an implementation can be an optimized software stack on a given accelerator device such as GPUs, or a customized hardware accelerator on FPGAs, CGRAs, and ASICs.
Neural Architecture Search (NAS). For DNN search, most existing NAS engines can find high quality DNNs. As illustrated in Fig. 2(a), given a model search space, a NAS engine applies a certain search strategy to propose candidate DNNs, and the performance of each candidate is estimated and provided back to the NAS engine. After NAS generates a satisfying DNN, it will be implemented and deployed on GPU, FPGA or other devices. During the search, however, implementation optimization is not considered. For example, a recent hardware-aware NAS approach (Cai and others, 2018) considers directly measured inference latency on the GPU but does not explore optimization techniques. This will result in a large performance gap between estimation and final implementation, especially when there are multiple candidate devices, each requiring different optimization techniques. When targeting FPGAs, it becomes even more important that DNN search and implementation search be tightly coupled during NAS, because different accelerator implementation configurations can result in large performance variation.
Neural Architecture and Implementation Search (NAIS): Beyond NAS. To fully explore implementation optimizations and to consider the impacts of implementation on DNNs, we propose a fully simultaneous DNN/implementation co-design approach: it not only searches for neural architectures, but also searches for implementation optimizations, i.e., a Neural Architecture and Implementation Search, NAIS. As illustrated in Fig. 2(b), the NAIS search space includes both a model search space and an implementation search space. We combine the two into a co-design space and apply a joint search strategy to it. During NAIS, each solution is composed of two parts: a DNN model solution, and a corresponding implementation solution to which specific optimization techniques have been applied. After searching, the NAIS engine outputs both the DNN model and its optimized hardware implementation. The design space of NAIS is the product of the design space of DNN search and the design space of implementation optimization, which can be huge. Such a combined design space makes the co-design procedure time-consuming and hard to converge. Innovative research is needed to address this new challenge.
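To make the size of the combined space concrete, the following minimal sketch enumerates a co-design space as the Cartesian product of a model space and an implementation space. The parameter names and value ranges are hypothetical and chosen only for illustration; they are not the search spaces used in this work.

```python
from itertools import product

# Hypothetical, tiny search spaces used only for illustration.
model_space = {
    "num_layers": [8, 12, 16],
    "channels": [64, 128, 256],
}
impl_space = {
    "parallel_factor": [4, 8, 16],
    "weight_bits": [8, 16],
}


def enumerate_space(space):
    """Return every configuration of a search space as a dict."""
    keys = list(space)
    return [dict(zip(keys, values)) for values in product(*(space[k] for k in keys))]


def co_design_space(model_space, impl_space):
    """Pair every model candidate with every implementation candidate."""
    return list(product(enumerate_space(model_space), enumerate_space(impl_space)))


models = enumerate_space(model_space)
impls = enumerate_space(impl_space)
joint = co_design_space(model_space, impl_space)
print(len(models), len(impls), len(joint))  # 9 6 54
```

Even in this toy setting the joint space is the product of the two individual spaces, which is why exhaustive enumeration quickly becomes impractical and a narrowed or guided search is needed.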
In this position paper, we first prototype a NAIS methodology in the context of DNN/FPGA co-design, and show how we effectively narrow the co-design space to generate high quality DNNs and their FPGA implementations within the resource constraints of a target FPGA. We then discuss how such a NAIS design methodology can be extended for GPU in a similar fashion.
3. NAIS for FPGA
3.1. DNN/Implementation Co-design Space
The FPGA accelerator optimization problem is very complicated and requires comprehensive domain-specific knowledge. For example, the overall accelerator architecture (pipelined or folded), the number of IPs and the parallelism of each IP, data quantization, buffer allocation and data reuse each have a significant impact on the final performance. Besides, the underlying FPGA characteristics (DSP structure, block RAM, bandwidth, etc.) and available resources can be very different between FPGA devices or families.
To efficiently narrow down the combined design space of NAIS for a target FPGA, we propose to co-design both the DNN structure and its FPGA accelerator implementation using hardware-aware basic building blocks, named Bundles (Hao and others, 2019). A Bundle represents a set of sequential DNN layers, and a DNN can be constructed by replicating a Bundle a number of times with different configurations (the 'A' in NAIS). Meanwhile, a Bundle is composed of a set of configurable FPGA IPs, where each IP is well designed and highly optimized, and the Bundle is used to construct the FPGA implementation (the 'I' in NAIS). For the DNN, each Bundle replication can be configured to have a different number of channels in its layers; for the FPGA, a Bundle can be configured to have a certain number of IP instances, each with specific parallel factors, data precision, on-chip buffers, etc. When a Bundle is selected and configured, both the DNN model and its accelerator can be determined. That is, Bundles provide a stylized approach to designing both the DNNs and the FPGA implementations, thus narrowing the search space efficiently.
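The dual role of a Bundle can be sketched as a small data structure. The class names, fields and helper functions below are hypothetical and only illustrate the idea that one Bundle description yields both a layer list for the DNN and an IP list for the accelerator; the assumption that each IP is instantiated once is a simplification.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class IP:
    """A configurable FPGA IP (illustrative fields only)."""
    name: str             # e.g. "dw_conv3x3"
    parallel_factor: int  # number of parallel multipliers
    data_bits: int        # data precision


@dataclass(frozen=True)
class Bundle:
    """A sequence of DNN layers, each backed by one IP."""
    name: str
    ips: Tuple[IP, ...]


def build_dnn(bundle: Bundle, channels: List[int]) -> List[dict]:
    """Replicate the Bundle once per entry in `channels` (the 'A' side)."""
    layers = []
    for rep, ch in enumerate(channels):
        for ip in bundle.ips:
            layers.append({"replication": rep, "op": ip.name, "out_channels": ch})
    return layers


def build_accelerator(bundle: Bundle) -> List[IP]:
    """List the IP instances of the accelerator (the 'I' side).

    Assumption: each IP of the Bundle is instantiated exactly once.
    """
    return list(bundle.ips)


# A Bundle made of a depth-wise 3x3 convolution followed by a 1x1 convolution.
bundle = Bundle("dw3x3+pw1x1", (IP("dw_conv3x3", 16, 8), IP("conv1x1", 16, 8)))
dnn = build_dnn(bundle, channels=[48, 96, 192])
accelerator = build_accelerator(bundle)
print(len(dnn), [ip.name for ip in accelerator])
```

Selecting the Bundle and its configuration fixes both outputs at once, which is the sense in which Bundles narrow the co-design space.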
3.2. Overall Co-Design Flow
The inputs of the co-design flow include a machine learning task such as image classification or object detection, the resource constraints of a specific FPGA device, and a performance target such as frame rate. The outputs include both DNN models and the corresponding FPGA accelerators with their achieved performance. Inside the co-design flow, there are three major steps.
Step 1: FPGA-oriented Bundle generation. First, we design a pool of FPGA-oriented IPs considering specific FPGA characteristics such as DSP and BRAM structures. The IPs may have the same functionality but different designs. For example, to best utilize the DSP resource, a Xilinx FPGA may best support 8-bit × 10-bit multiplication IPs, while an Intel FPGA may best support 9-bit × 9-bit multiplication IPs. Based on the IPs, we build FPGA-oriented Bundles, where the data tiling, pipelining and data movement between these IPs are considered.
Step 2: Bundle selection. Second, we apply Bundle evaluation to reduce the co-design space by only selecting the most promising Bundles for future exploration. Each Bundle will be evaluated regarding its resource utilization and potential contribution to DNN accuracy. We build a Bundle-wise DNN template with fixed front-end and back-end structures, and insert one Bundle (with replications) in the middle each time (Hao and others, 2019; Zhang and others, 2019). Such Bundle-wise DNNs will be quickly trained using a small number of epochs to evaluate the accuracy. The Bundles on the resource-accuracy Pareto curve will be selected.
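Selecting the Bundles on the resource-accuracy Pareto curve can be sketched as follows. The Bundle names and the resource and accuracy numbers are made up for illustration; in practice they would come from resource estimation and the short training runs described above.

```python
def pareto_front(candidates):
    """Return names of candidates not dominated in (lower resource, higher accuracy).

    `candidates` is a list of (name, resource, accuracy) tuples.
    """
    front = []
    for name, res, acc in candidates:
        dominated = any(
            (r <= res and a >= acc) and (r < res or a > acc)
            for _, r, a in candidates
        )
        if not dominated:
            front.append(name)
    return front


# Hypothetical (bundle, DSP usage, accuracy after a few epochs) measurements.
bundles = [
    ("B1", 120, 0.61),
    ("B2", 200, 0.63),
    ("B3", 260, 0.62),  # dominated by B2: more DSPs, lower accuracy
    ("B4", 80, 0.58),
    ("B5", 130, 0.60),  # dominated by B1: more DSPs, lower accuracy
]
selected = pareto_front(bundles)
print(selected)  # ['B1', 'B2', 'B4']
```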
Step 3: Hardware-aware DNN search and update. Third, we perform hardware-aware DNN search. The inputs include the initial DNNs, performance objectives such as latency, and resource constraints. We use stochastic coordinate descent (SCD) to update three variables related to DNN structure: the number of Bundle replications; down-sampling configuration between Bundles; and the number of channels in each Bundle. During the iterations of SCD, only DNNs within the resource constraints and performance requirements are kept for downstream training. In such a way, the final generated DNNs are more structured, resulting in more efficient hardware implementations.
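The following is a minimal sketch of a stochastic coordinate descent loop over the three variables named above, under strong simplifying assumptions: the latency and DSP estimators are toy analytic functions, down-sampling and channel width are each reduced to a single scalar, and all bounds and step sizes are hypothetical. It is not the estimator or search engine used in this work.

```python
import random

STEPS = {"replications": 1, "downsamples": 1, "channels": 16}
BOUNDS = {"replications": (1, 20), "downsamples": (0, 4), "channels": (16, 512)}


def latency_ms(cfg):
    """Toy latency estimator (assumption): work shrinks 4x per down-sampling."""
    work = cfg["replications"] * cfg["channels"] ** 2 / 4 ** cfg["downsamples"]
    return work / 5000.0


def dsp_usage(cfg):
    """Toy resource estimator (assumption): DSPs scale with channel width."""
    return 2 * cfg["channels"]


def scd_search(init, target_ms, dsp_budget, tol_ms=2.0, iters=300, seed=0):
    """Move one randomly chosen coordinate per iteration toward the latency target."""
    rng = random.Random(seed)
    cfg, kept = dict(init), []
    for _ in range(iters):
        coord = rng.choice(sorted(STEPS))      # pick one coordinate at random
        lo, hi = BOUNDS[coord]
        best_gap, best_cfg = abs(latency_ms(cfg) - target_ms), cfg
        for direction in (-1, 1):              # try a unit move in each direction
            trial = dict(cfg)
            trial[coord] = min(hi, max(lo, cfg[coord] + direction * STEPS[coord]))
            if dsp_usage(trial) > dsp_budget:  # reject designs over the DSP budget
                continue
            gap = abs(latency_ms(trial) - target_ms)
            if gap < best_gap:
                best_gap, best_cfg = gap, trial
        cfg = best_cfg
        # Keep only designs that meet the performance requirement.
        if best_gap <= tol_ms and cfg not in kept:
            kept.append(dict(cfg))
    return cfg, kept


start = {"replications": 4, "downsamples": 2, "channels": 64}
final, candidates = scd_search(start, target_ms=33.0, dsp_budget=512)
print(final, len(candidates))
```

Only configurations that stay within the resource budget and land within the latency tolerance are collected in `candidates`, mirroring the filtering before downstream training.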
3.3. FPGA-oriented IP design
Since FPGA characteristics vary with device vendors and types, a well designed IP must fully consider such characteristics to achieve the maximum performance while minimizing resource utilization. We discuss the two most important factors as an illustration: the structure of DSPs and embedded block memory.
3.3.1. DSP consideration
Table 2 shows different multiplication and accumulation precision of DSPs in different FPGA devices, where the variation can be large even within the same vendor. The computational IPs should be carefully designed based on the underlying DSP structure to take full advantage of its computation capability, which, in turn, affects the DNN design.
One important factor that must be considered is the DNN's data precision. Take the Xilinx DSP48E1 and DSP48E2 as examples. Assume a simple case of two multiplications, $a \times c$ and $b \times c$, with a common multiplier $c$, where $a$, $b$ and $c$ have $p$, $q$ and $r$ bits, respectively. To increase multiplication parallelism, one possibility is to let the two multiplicands (in this case $a$ and $b$) occupy one DSP input, and let the common multiplier (in this case $c$) occupy the other input, so that the two multiplications can be conducted in one clock cycle. To ensure correctness, there must be at least $r$ empty bits between $a$ and $b$, so that the two products do not overlap with each other in the output. When using DSP48E1, which supports 18-bit × 25-bit multiplications and 48-bit accumulation, if $a$ and $b$ are both 8-bit and occupy the 25-bit operand, then $c$ must not exceed 9-bit ($25 - 8 - 8 = 9$); when using DSP48E2, $c$ can be 10-bit. In a scenario where $a$ and $b$ are activations and $c$ is the weight, if the target FPGA has DSP48E1, the DNN weights should be quantized to 9-bit or less, while with DSP48E2, the weights can be 10-bit. Similarly, if the target device is an Intel FPGA, the preferable quantization changes accordingly. For example, on Stratix V, <9-bit, 9-bit> is preferred over <10-bit, 9-bit> for weights and activations, because Stratix V DSPs support 9 × 9-bit multiplications.
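The packing trick can be checked with plain integer arithmetic. The sketch below assumes unsigned operands (signed operands need an additional correction step that is omitted here) and uses illustrative names: `a` and `b` are the two multiplicands packed into one wide operand, and `c` is the shared multiplier.

```python
def packed_multiply(a, b, c, b_bits=8, c_bits=9):
    """Compute a*c and b*c with a single wide multiplication (unsigned only)."""
    shift = b_bits + c_bits          # width of the product b*c
    packed = (a << shift) | b        # a, then c_bits empty bits, then b
    wide = packed * c                # the one multiplication a DSP would perform
    low = wide & ((1 << shift) - 1)  # lower field holds b*c
    high = wide >> shift             # upper field holds a*c
    return high, low, packed.bit_length()


# Largest 8-bit b and 9-bit c: the packed operand needs 8 + 9 + 8 = 25 bits.
print(packed_multiply(200, 255, 511))  # (102200, 130305, 25)
```

Because `b * c` always fits in `b_bits + c_bits` bits, the two products never overlap, which is exactly why the gap between the packed multiplicands must be at least as wide as the shared multiplier.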
Moreover, the DSP structure also affects the computation pattern and parallelism, which determines the detailed IP design. Fig. 5 shows an example of a convolution IP targeting Xilinx and Intel Arria V devices, respectively. On Xilinx DSPs, to conduct two multiplications in parallel by sharing a common multiplier, one kernel consumes two pieces of feature map data at a time to fully utilize one DSP. On the Intel Arria V series, where one DSP is capable of running three independent multiplications, three kernels consume three pieces of different feature map data at a time. Such differences between DSPs will result in disparate IP designs and performance, and need to be considered in the NAIS engine.
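The effect of the number of multiplications per DSP on peak throughput can be estimated with simple arithmetic. The sketch below assumes every DSP performs its full number of multiply-accumulates on every clock cycle, which is an idealized upper bound; the device numbers in the example (2,520 DSP slices, 250 MHz) are the ones quoted for the experiments in this paper.

```python
def peak_gmacs(num_dsps, macs_per_dsp_per_cycle, clock_mhz):
    """Idealized peak throughput in GMAC/s, assuming all DSPs are always busy."""
    return num_dsps * macs_per_dsp_per_cycle * clock_mhz / 1000.0


# One multiplication per DSP versus two packed multiplications per DSP.
print(peak_gmacs(2520, 1, 250))  # 630.0
print(peak_gmacs(2520, 2, 250))  # 1260.0
```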
3.3.2. BRAM consideration
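Table 1: Supported data widths of block memory in different FPGA devices.

| Vendor | Block memory | Supported data widths (bits) |
|---|---|---|
| Xilinx (42) | RAMB18E1 | 1, 2, 4, 9, 18 |
| | RAMB36E1 | 1, 2, 4, 9, 18, 36 |
| Intel (25) | MLAB | 8, 9, 10, 16, 18, 20 |
| | M9K | 1, 2, 4, 8, 9, 16, 18, 32, 36 |
| | M20K | 8, 10, 16, 20, 32, 40 |
| | M144K | 8, 9, 16, 18, 32, 36, 64, 72 |

*Only applicable for single-port RAM, simple-dual port RAM, and single-port ROM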
On-chip block memory (BRAM) is another important design factor to consider. Effectively utilizing on-chip memory for data buffering can greatly reduce the amount of off-chip data movement, thus reducing both latency and energy consumption. Table 1 shows the supported data widths of different FPGA devices. For Xilinx, the commonly used bit widths are 9, 18 and 36 in its RAMB18E1 and RAMB36E1. For Intel, the common block memory is M20K, which has a capacity of 20Kb organized into either 10- or 20-bit storage words and read/write operations. The on-chip data buffers need to be carefully allocated to align with the block memory depth and width. For example, if a continuous buffer is allocated to be 21Kb, it will occupy two blocks of Intel M20K, leaving most of the second block unused.
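The 21Kb example can be reproduced with a capacity-only model of block allocation. This sketch assumes 1Kb = 1024 bits and ignores the width and depth configuration of each block, so it gives a lower bound on the number of blocks actually needed.

```python
import math

KB = 1024  # bits per Kb (assumption: binary kilobits)


def blocks_needed(buffer_bits, block_bits):
    """Number of memory blocks a contiguous buffer occupies (capacity only)."""
    return math.ceil(buffer_bits / block_bits)


def utilization(buffer_bits, block_bits):
    """Fraction of the allocated block capacity that holds useful data."""
    return buffer_bits / (blocks_needed(buffer_bits, block_bits) * block_bits)


# A 21Kb buffer on 20Kb M20K blocks needs two blocks and uses about half of them.
print(blocks_needed(21 * KB, 20 * KB))          # 2
print(round(utilization(21 * KB, 20 * KB), 3))  # 0.525
```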
The differences in block memory structure can affect the desirable DNN designs as well. Take the buffer allocation for feature maps using Xilinx RAMB18E1 as an example. If the input feature map of a layer (one channel, 8-bit data) just fits within 4 RAM blocks, a slightly larger feature map will consume an additional block. Usually, the intermediate feature map dimensions are closely related to the original input size and up/down sampling. Therefore, resizing the input image to a size whose feature maps align with block boundaries may be better than a slightly different size as far as on-chip buffer allocation is concerned.
The discussions in this section show that the structures of DSPs and BRAMs play an important role in guiding the DNN design in a NAIS framework.
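Table 2: Multiplication and accumulation precision of DSPs in different FPGA devices.

| Device | Multiplication | Accumulation |
|---|---|---|
| Xilinx 7 series (DSP48E1) (43) | One 25 × 18 | 48-bit |
| Xilinx UltraScale (DSP48E2) (44) | One 27 × 18 | 48-bit |
| Intel Stratix V (27) | Three 9 × 9 | 64-bit |
| | Two 18 × 18 | |
| | One 18 × 36 | |
| | One 27 × 27 | |
| Intel Arria V (24) | Three 9 × 9 | 64-bit |
| Intel Stratix 10 (26) | Two 18 × 19 | 64-bit |
| Intel Arria 10 (23) | One 27 × 27 | |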
4. NAIS for GPU
There are recent works discussing hardware-aware NAS targeting GPUs (Cheng and others, 2018; Marculescu and others, 2018). However, during NAS, GPU kernel configuration and optimization were ignored, even though this is a non-trivial problem that has attracted a lot of research interest (Zhou and others, 2017; Guerreiro and others, 2015; Tsai and others, 2016). Table 3 summarizes a set of GPU architecture-specific and kernel-specific parameters, which can affect the kernel configuration and performance on a specific GPU (Guerreiro and others, 2015). These parameters vary greatly with different GPU generations. Hence, the selection of the most adequate configurations of the GPU kernels has proven to be a difficult design optimization problem (Guerreiro and others, 2015). In (Tsai and others, 2016), it is demonstrated that for just a single AlexNet layer with 4 tunable parameters, there are many possible configurations, and the performance ranges from 44.7 to 5735.8 Gflop/s on an AMD Fury X GPU. Even with GPU optimization tools, such as TensorRT (40) on top of cuDNN and cuBLAS, one kernel can still have varied performance. Fig. 6 shows the variation of GPU throughput when computing one convolution layer with different filter configurations.
To apply NAIS in DNN and GPU implementation co-design, we can generalize the aforementioned DNN/FPGA co-design methodology. For example, a GPU-oriented Bundle can be defined as well. One GPU Bundle is composed of a set of GPU kernels, which shall be configured and optimized targeting the specific GPU device (usually the GPUs used for training and for inference are different). The parameters of the Bundle may include favorable matrix shapes for a matrix-multiplication kernel, the number of threads, the batch size, etc. Such Bundle optimization problem is being intensively studied with auto-tuning tools such as (Tsai and others, 2016) and (Guerreiro and others, 2015), where (Guerreiro and others, 2015) especially targets multi-kernel optimizations. With optimized kernel Bundles, structured NAS (Zoph and others, 2017) can be applied. Similar to the normal cells and reduction cells used in (Zoph and others, 2017), we can search for DNNs with different configurations of normal Bundles and reduction Bundles, which are optimized GPU kernels in NAIS. Leveraging both GPU Bundle optimization and structured NAS search, we can develop a NAIS engine that can be naturally applied to GPU and DNN co-design.
With more advanced profiling capabilities, such as the recent MLModelScope (Dakkak and others, 2019) tool, we can easily evaluate and profile DNN models across different datasets, frameworks and hardware at scale and across the stack. With such detailed layer-wise and kernel-wise profiling data, roofline models for all kernels can be built to understand whether a kernel configuration is computation or memory bound. All those performance models can be leveraged to further the development of NAIS for GPU.
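A roofline classification of a kernel can be sketched in a few lines. The peak throughput, memory bandwidth and per-kernel operation and byte counts below are hypothetical; in practice they would come from the device specification and from profiling data.

```python
def roofline(flops, bytes_moved, peak_gflops, bandwidth_gbs):
    """Return (attainable GFLOP/s, bound type) under the basic roofline model."""
    intensity = flops / bytes_moved            # arithmetic intensity in FLOP/byte
    memory_roof = intensity * bandwidth_gbs    # throughput limit set by memory traffic
    if memory_roof < peak_gflops:
        return memory_roof, "memory"
    return peak_gflops, "compute"


# Hypothetical GPU: 10,000 GFLOP/s peak, 500 GB/s memory bandwidth.
PEAK, BW = 10000.0, 500.0
print(roofline(2e9, 4e8, PEAK, BW))  # (2500.0, 'memory')
print(roofline(8e9, 1e8, PEAK, BW))  # (10000.0, 'compute')
```

A kernel whose arithmetic intensity is below the ridge point (here 20 FLOP/byte) is limited by memory bandwidth, so changing its tiling or data reuse matters more than adding parallelism.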
Table 3: GPU architecture-specific and kernel-specific parameters (Guerreiro and others, 2015).

| Category | Parameter |
|---|---|
| Architecture specific | Max. number of blocks per SM |
| | Max. number of warps per SM |
| | Shared memory per SM |
| | Shared memory alloc. unit size |
| | Max. number of registers per SM |
| | Registers alloc. unit size |
| | Max. number of threads per SM |
| Kernel specific | Number of warps per thread block |
| | Shared memory per block |
| | Number of registers per thread |
| Architecture & kernel specific | Max. number of thread blocks |
| | Hardware utilization measure |
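To illustrate how the architecture-specific and kernel-specific parameters interact, the sketch below computes a simplified occupancy figure: the number of thread blocks that can be resident on one SM is the tightest of the block, warp, shared-memory and register limits. The formula is a simplification of the usual occupancy calculation, and all numeric values are hypothetical rather than taken from a specific GPU.

```python
import math


def round_up(value, unit):
    """Round `value` up to the next multiple of the allocation unit."""
    return math.ceil(value / unit) * unit


def active_blocks_per_sm(arch, kernel, warp_size=32):
    """Resident thread blocks per SM under block, warp, shared-memory and register limits."""
    warps = kernel["warps_per_block"]
    limits = [arch["max_blocks_per_sm"], arch["max_warps_per_sm"] // warps]
    smem = round_up(kernel["smem_per_block"], arch["smem_alloc_unit"])
    if smem > 0:
        limits.append(arch["smem_per_sm"] // smem)
    regs_per_warp = round_up(kernel["regs_per_thread"] * warp_size, arch["reg_alloc_unit"])
    limits.append(arch["max_regs_per_sm"] // (regs_per_warp * warps))
    return min(limits)


def occupancy(arch, kernel):
    """Fraction of the SM's warp slots that are occupied."""
    blocks = active_blocks_per_sm(arch, kernel)
    return blocks * kernel["warps_per_block"] / arch["max_warps_per_sm"]


# Hypothetical architecture and kernel parameters.
arch = {"max_blocks_per_sm": 32, "max_warps_per_sm": 64, "smem_per_sm": 65536,
        "smem_alloc_unit": 256, "max_regs_per_sm": 65536, "reg_alloc_unit": 256}
kernel = {"warps_per_block": 8, "smem_per_block": 12000, "regs_per_thread": 40}
print(active_blocks_per_sm(arch, kernel), occupancy(arch, kernel))  # 5 0.625
```

In this example shared memory is the limiting resource, so reducing the per-block buffer would raise occupancy while changing the thread count alone would not.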
5. NAIS for Autonomous Driving
An autonomous driving system collects a large amount of data from surrounding environment, and executes a complicated software pipeline for localization, perception, prediction, planning and control. To support a safe and robust software pipeline, a powerful computing platform as well as high quality AI algorithms are indispensable, and a NAIS approach is imperative to support both.
5.1. Computing Platforms
Currently, GPUs are the prevailing computing platforms for autonomous driving, offering high programmability, flexibility and performance. To meet this demand, Nvidia introduced Drive AGX (2), a powerful hybrid autonomous driving platform built on Nvidia Xavier, incorporating 8-core CPUs, deep learning accelerators (DLA), an integrated GPU and programmable vision accelerators (PVA). Among these components, the DLA is the most suitable for DNN-based inference, and it can be replaced by ASICs or FPGAs. Therefore, for competitive differentiation, some leading autonomous driving companies have started to adopt specialized platforms. For example, Mobileye (1) and Tesla (4) have developed their own chips to achieve outstanding AI performance and low power. FPGAs, on the other hand, have also been a popular computing platform for autonomous driving cars because of their appealing advantages such as industrial reliability, specialization, high performance and low power. There are ongoing efforts from technology companies and academic institutions on FPGA-based solutions (Nithin and others, 2014; Okuda and others, 2014). Xilinx, for example, has developed its ADAS using Zynq-7000 SoC-based FPGA devices (5). In a recent collaborative work of UIUC and XMotors (Cong and others, 2019), a hybrid GPU + FPGA computing system for autonomous driving has been proposed. Fig. 7 illustrates the hybrid system, where the GPU serves as a primary system, and the FPGA serves as a secondary system for failure fallback and for providing auxiliary information for assistive driving.
Given the emerging need for semi-customized platforms with reconfigurable devices, with a fully-customized platform as a direct next step, DNN and implementation co-design is expected to boost productivity and support platform evolution.
5.2. Autonomous Driving Algorithms
Self-driving is a comprehensive robotic capability including parking, driving and in-cabin intelligence functions, and each contains a set of varied sub-functions with different AI algorithms, input sources and performance requirements.
Varied sub-functions. A self-driving pipeline requires different functions and algorithms in different scenarios. For example, the algorithms for parking and driving can be very different: the parking task focuses on near-distance parking lot detection, localization and low speed vehicle control, while the driving task focuses on moving object, obstacle and lane detection within hundreds of meters and on high speed vehicle control. In-cabin intelligence also has multiple sub-functions such as DSM (Driver State Monitoring), voice recognition, gesture based interactions, and passenger detection. Another example can be seen in the hybrid GPU and FPGA system proposed in (Cong and others, 2019), shown in Fig. 7. In the system error mode, the FPGA executes different tasks: when the car is on a highway, it keeps driving while maintaining a minimum speed limit; when in an urban area, it slows down the car and applies a safe pull-over. Each scenario requires different DNNs to be mapped to the FPGA.
Varied input sources. The self-driving system will be provided with varied input sources. For example, parking functions usually use surrounding cameras and ultra-sonic sensors, while highway drive uses multiple front, side and rear cameras with assistance of radars. Another example is shown in Fig. 7, where the FPGA accepts input images with different resolutions: in normal mode, it may conduct traffic light detection using high resolution input images, while in error mode, it runs simplified autonomous driving pipeline using low resolution input image for object and lane detection.
Varied performance requirements. Autonomous driving algorithms need to cope with numerous and complicated driving scenarios with different performance requirements. For example, when driving in highways, the perception module requires at least 30 FPS but the number of objects to be detected may be limited to cars, lanes and traffic signs; in urban area, it requires 20 FPS but with a larger number of objects to detect; in school area, it may require 15 FPS but need a higher accuracy especially for pedestrian detection.
Given such variations in sub-functions, input sources and performance requirements, the detailed AI algorithms for each situation will be very different. Accordingly, the overall pipeline including other traditional algorithms will be significantly different, and all have to run on the same centralized electronic control unit (ECU) platform. Thus, an automatic NAIS co-design flow will enable us to explore the optimal solution under each situation.
6. Experiment Results
Table 4:

| Device | Mul. Precision | Max. GMACs | # of DSPs | Accuracy |
|---|---|---|---|---|

Table 5:

| Input Size | FM precision | Latency | Accuracy |
|---|---|---|---|

Table 6:

| Input resolution | Item | 15 FPS | 20 FPS | 30 FPS |
|---|---|---|---|---|
| 1st | Bundle | Bundle 5 | Bundle 4 | Bundle 4 |
| | Replications | 13 | 14 | 13 |
| | Max. channels | 1264 | 1008 | 1024 |
| | mAP | 46.1 | 42.4 | 43.9 |
| 2nd | Bundle | Bundle 1 | Bundle 1 | Bundle 5 |
| | Replications | 15 | 14 | 15 |
| | Max. channels | 1120 | 784 | 736 |
| | mAP | 45.4 | 44.3 | 39.7 |

Bundle 1: conv_3x3_stride1
Bundle 2: conv_5x5_stride1
Bundle 3: conv_3x3_stride1 + conv_5x5_stride1
Bundle 4: dw-conv_3x3_stride1 + conv_1x1
Bundle 5: dw-conv_5x5_stride1 + conv_1x1
We first demonstrate that FPGA-oriented IP and DNN design will have a large impact on accelerator performance. We use SkyNet (Zhang and others, 2019), a light-weight object detection network, as the baseline. Table 4 shows that different data precisions result in 30% to 50% difference in peak performance under 250MHz on Xilinx FPGA and Intel FPGA, while the accuracy does not change dramatically. It implies that the data precision sometimes is a more sensitive design factor in FPGA accelerator than in DNN model. Exploiting device-oriented NAIS co-design can take advantage of such difference in sensitivity and come up with DNNs that best match the hardware.
Table 5 shows another example regarding BRAM consideration on the Xilinx Ultra96 FPGA using RAMB18E1. It shows that when the precisions of feature map data are the same, the model accuracy shows negligible difference but the latency shows 32% and 38% difference between the two input image sizes. This is because, with one of the two sizes, the total number of bits in one image tile exceeds 18Kb and occupies two memory blocks, while with the other size, one image tile (following the same tiling rule) consumes only one memory block. Beyond the difference in computation, the former input size results in less efficient BRAM utilization and more off-chip data movement, and thus longer latency.
We then apply our NAIS methodology to an object detection task on FPGA for autonomous driving under different input image resolutions and latency constraints. The target device is the Xilinx UltraScale+ ZCU102, a large-scale FPGA with 599,550 logic cells, 32.1Mb block RAM and 2,520 DSP slices. We set the performance requirements to be 15 FPS, 20 FPS and 30 FPS, respectively, corresponding to different driving speeds in busy downtown, urban street and highway. We also consider two input resolutions. As shown in Table 6, under each constraint and input resolution, our co-design engine proposes a DNN that is built by replicating a pre-optimized Bundle, as described in Section 3.2. In each scenario, we show the Bundle used for building the DNN, as well as the number of replications and the maximum number of channels. The DNNs are trained and tested on a subset of the VOC 2012 dataset, including bike, car, bus and person, which are most related to autonomous driving. It shows that with different inputs and target performance, the generated DNNs are different. For example, for one input resolution, more light-weight depth-wise Bundles such as Bundle 4 and 5 are selected; for the other, Bundle 1 seems preferable. This result implies that such a co-design is helpful in searching for the best DNNs within performance constraints under varied circumstances.
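As a small illustration of how such results could be consumed at deployment time, the sketch below picks, for a required frame rate, the most accurate generated DNN whose design target meets that rate. The tuples are copied from the first row group of Table 6; the selection rule itself, and the assumption that a DNN generated for a given target sustains at least that frame rate, are illustrative and not part of the co-design flow.

```python
# (target FPS, Bundle, replications, max channels, mAP), first row group of Table 6.
CANDIDATES = [
    (15, "Bundle 5", 13, 1264, 46.1),
    (20, "Bundle 4", 14, 1008, 42.4),
    (30, "Bundle 4", 13, 1024, 43.9),
]

# Frame-rate requirements per driving scenario, as set in the experiment.
SCENARIO_FPS = {"busy downtown": 15, "urban street": 20, "highway": 30}


def pick_dnn(required_fps, candidates=CANDIDATES):
    """Return the highest-mAP candidate whose target FPS meets the requirement."""
    feasible = [c for c in candidates if c[0] >= required_fps]
    if not feasible:
        raise ValueError(f"no generated DNN reaches {required_fps} FPS")
    return max(feasible, key=lambda c: c[4])


for scenario, fps in SCENARIO_FPS.items():
    print(scenario, pick_dnn(fps))
```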
| DNN | Titan V | 1080 Ti | 2080 Ti |
|---|---|---|---|

| YOLO v3 | 3.3 fps | 12.7 fps | 51.5 @ () | COCO |
| SkyNet | 20.5 fps | 67.3 fps | 73.1 @ (IoU) | DAC-SDC (DJI, 2018) |
For the GPU platform, we first show a summary of popular DNN models regarding their inference latency on various GPU platforms, including Titan V, 1080 Ti and 2080 Ti. The summary is shown in Table 7, where part of the data are obtained from an open-source repository (3). The inputs use a batch size of 16 with single and half precision. In addition to powerful GPUs, we also make a performance comparison between YOLO v3 (Redmon and Farhadi, 2018) and SkyNet (Zhang and others, 2019) on an embedded GPU, Nvidia Jetson TX2, where SkyNet shows appealing real-time performance. SkyNet is a light-weight object detection network we proposed that won the 2019 DAC-SDC competition (13). It is composed of basic Bundles, where each Bundle has a depth-wise convolution layer followed by a point-wise convolution layer. Instead of a traditional top-down design method, which starts from a large DNN and prunes it until it reaches the required performance, SkyNet was designed by utilizing our proposed NAIS idea discussed in Section 3.2.
Our SkyNet design on Jetson TX2 is an initial demonstration of the potential of such NAIS approach. As a future research direction, GPU implementation shall be optimized during NAIS.
7. Related Work
For FPGA-based DNN implementations, technologies such as quantization (Qiu and others, 2016; Cheng and others, 2019) and model compression (Han and others, 2017) are used to reduce DNN model size, while FPGA resource allocation (Zhang and others, 2017) and fine-grained pipeline architecture (Zhang and others, 2018b) are proposed to deliver low latency accelerators. Other works explore FPGA accelerator parameter configuration (Motamedi and others, 2016; Zhong and others, 2017; Chen and others, ) and optimizations such as loop unrolling and pipelining, but they do not explore configurations on the DNN side. Besides, there are works on DNN and FPGA co-design, which explore both DNN model and accelerator designs. The work in (Kwon and others, 2018) discussed the DNN and accelerators for embedded vision applications. It first designed a specific DNN accelerator targeting SqueezeNet (Iandola and others, 2016), and then proposed a tailored DNN model called SqueezeNext according to the hardware utilization of different layers of SqueezeNet. Another work (Jiang and others, 2019) proposed a framework named FNAS, which is a reinforcement learning based NAS that incorporates the estimated FPGA inference latency into the reward function. However, none of these works applied simultaneous DNN and FPGA implementation search, as proposed by NAIS in our work.
On the other hand, for GPU based DNN search and implementation, NAS has seen a big success in designing high quality DNN models that outperform manually designed ones (Elsken and others, 2019). Most early NAS works focus purely on improving model accuracy, while recent works have been conducting performance-aware searches by incorporating estimated hardware performance such as inference latency on GPU or CPU into the NAS engine. One representative work (Cai and others, 2018) addressed the high memory consumption issue as well as the high computational cost of differentiable NAS, and solved the problem with a gradient-based approach to enable hardware-aware neural architecture search. Another work (Cheng and others, 2018) discussed device-aware neural architecture search by extending NAS into a multiple-objective problem. Targeting different devices, their framework came up with a Pareto frontier regarding DNN accuracy and energy. Though these works are closely related to DNN and GPU co-design, they missed the opportunities in device-oriented implementation optimizations and their possible guidance to DNN design, which is an essential goal in NAIS.
8. Conclusions and Future Work
In this paper, we proposed a DNN and implementation co-design methodology, called Neural Architecture and Implementation Search (NAIS), to improve the productivity and efficiency of mapping AI algorithms to targeted platforms. NAIS searches DNN models and the underlying hardware implementations simultaneously in a pre-defined co-design space, with the goal of converging efficiently to the best hardware-specific solution. We first demonstrated how NAIS works for DNN/FPGA co-design, and then discussed the NAIS approach for DNN/GPU co-design. The NAIS approach can generate a range of design solutions with different accuracy, latency and computational complexity, which helps to find an optimized implementation for application deployment. We believe that the NAIS design methodology can benefit both development productivity and algorithm/hardware system quality for general DNN algorithms. We also provided a detailed application-level case study on how autonomous driving can benefit from the NAIS approach. Our future work includes systematic design space definition and application-specific full-stack optimizations.
This work is supported in part by XMotors.ai, Semiconductor Research Corporation (SRC), the IBM-Illinois Center for Cognitive Computing System Research (C3SR) and Advanced Digital Sciences Center (ADSC) in Singapore. The authors would also like to thank Vibhakar Vemulapati for helpful discussions.
-  . Note: https://www.mobileye.com/our-technology/evolution-eyeq-chip/ Cited by: §5.1.
-  . Note: https://www.nvidia.com/en-us/self-driving-cars/drive-platform/hardware/ Cited by: §5.1.
-  . Note: https://github.com/ryujaehun/pytorch-gpu-benchmark Cited by: Table 7, §6.
-  . Note: https://www.teslarati.com/tesla-tsla-fsd-chip-4-years-ahead-analyst/ Cited by: §5.1.
-  . Note: https://www.xilinx.com/applications/megatrends/automotive-driver-assist.html Cited by: §5.1.
-  (2018) ProxylessNAS: direct neural architecture search on target task and hardware. arXiv:1812.00332. Cited by: §2, §7.
-  Cloud-DNN: an open framework for mapping DNN models to cloud FPGAs. In FPGA, 2019, Cited by: §1, §7.
-  Eyeriss: a spatial architecture for energy-efficient dataflow for convolutional neural networks. In ACM SIGARCH Computer Architecture News, Vol. 44, pp. 367–379. Cited by: §1.
-  (2018) Searching toward pareto-optimal device-aware neural architectures. In ICCAD, Cited by: §4, §7.
-  (2019) l2q: An ultra-low loss quantization method for DNN. In IJCNN, Cited by: §7.
-  (2019) A hybrid GPU + FPGA system design for autonomous driving cars. In SiPS, Cited by: Figure 7, §5.1, §5.2.
-  cuDNN. Note: https://developer.nvidia.com/cuDNN Cited by: §1.
-  DAC-SDC. Note: http://www.cse.cuhk.edu.hk/~byu/2019-DAC-SDC/index.html Cited by: §6.
-  (2019) Frustrated with replicating claims of a shared model? a solution. arXiv:1811.09737. Cited by: §4.
-  (2018) DAC-SDC dataset. Note: https://github.com/xyzxinyizhang/2018-DAC-System-Design-Contest Cited by: Table 8.
-  Edge TPU. Note: https://cloud.google.com/edge-tpu/ Cited by: §1.
-  (2019) Neural architecture search: a survey. Journal of Machine Learning Research 20 (55), pp. 1–21. Cited by: Figure 3, §7.
-  (2015) Multi-kernel auto-tuning on GPUs: performance and energy-aware optimization. In PDP, Cited by: §1, Table 3, §4, §4.
-  (2017) Ese: efficient speech recognition engine with sparse LSTM on FPGA. In FPGA, Cited by: §7.
-  (2019) FPGA/DNN co-design: an efficient design methodology for IoT intelligence on the edge. In DAC, Cited by: §3.1, §3.2, §3.2.
-  (2016) Deep residual learning for image recognition. In CVPR, Cited by: §1.
-  (2016) SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size. arXiv:1602.07360. Cited by: §7.
-  Intel Arria 10 DSP. Note: https://www.intel.com/content/dam/www/programmable/us/en/pdfs/literature/hb/arria-10/a10_overview.pdf Cited by: Table 2.
-  Intel Arria V DSP. Note: https://www.intel.com/content/dam/www/programmable/us/en/pdfs/literature/hb/arria-v/av_51001.pdf Cited by: Table 2.
-  Intel Block RAM. Note: https://perso-etis.ensea.fr/olivier.romain/Teaching_2A_IUT_UCP_files/ug_ram_rom.pdf Cited by: Table 1.
-  Intel Stratix 10 DSP. Note: https://www.intel.com/content/dam/www/programmable/us/en/pdfs/literature/hb/stratix-10/ug-s10-dsp.pdf Cited by: Table 2.
-  Intel Stratix V DSP. Note: https://www.intel.com/content/dam/www/programmable/us/en/pdfs/literature/wp/wp-01131-stxv-dsp-architecture.pdf Cited by: Table 2.
-  (2019) Accuracy vs. efficiency: achieving both through fpga-implementation aware neural architecture search. DAC. Cited by: §7.
-  (2012) Imagenet classification with deep convolutional neural networks. In NeurIPS, Cited by: §1.
-  (2018) Co-design of deep neural nets and neural net accelerators for embedded vision applications. In DAC, Cited by: §7.
-  (2018) Hardware-aware machine learning: modeling and optimization. In ICCAD, Cited by: §4.
-  (2016) Design space exploration of FPGA-based deep convolutional neural networks. In ASP-DAC, Cited by: §7.
-  (2014) Advanced driver assistance system using FPGA. White Paper. Cited by: §5.1.
-  (2014) A survey of technical trend of ADAS and autonomous driving. VLSI-DAT. Cited by: §5.1.
-  (2016) Going deeper with embedded FPGA platform for convolutional neural network. In FPGA, Cited by: §7.
-  (2018) Yolov3: an incremental improvement. arXiv:1804.02767. Cited by: §6.
-  (2016) From high-level deep neural models to FPGAs. In MICRO, Cited by: §1.
-  (2014) Very deep convolutional networks for large-scale image recognition. arXiv:1409.1556. Cited by: §1.
-  (2015) Going deeper with convolutions. In CVPR, Cited by: §1.
-  TensorRT. Note: https://developer.nvidia.com/tensorrt Cited by: §1, §4.
-  (2016) Performance-portable autotuning of opencl kernels for convolutional layers of deep neural networks. In MLHPC, Cited by: §4, §4.
-  Xilinx Block RAM. Note: https://www.xilinx.com/support/documentation/user_guides/ug473_7Series_Memory_Resources.pdf Cited by: Table 1.
-  Xilinx DSP48E1. Note: https://www.xilinx.com/support/documentation/user_guides/ug479_7Series_DSP48E1.pdf Cited by: Table 2.
-  Xilinx DSP48E2. Note: https://www.xilinx.com/support/documentation/user_guides/ug579-ultrascale-dsp.pdf Cited by: Table 2.
-  (2017) A high energy efficient reconfigurable hybrid neural network processor for deep learning applications. IEEE JSSC 53 (4), pp. 968–982. Cited by: §1.
-  (2015) Optimizing FPGA-based accelerator design for deep convolutional neural networks. In FPGA, Cited by: §1.
-  (2018) Caffeine: towards uniformed representation and acceleration for deep convolutional neural networks. IEEE TCAD. Cited by: §1.
-  (2017) High-performance video content recognition with long-term recurrent convolutional network for FPGA. In FPL, Cited by: §7.
-  (2018) DNNBuilder: an automated tool for building high-performance DNN hardware accelerators for FPGAs. In ICCAD, Cited by: §1, §7.
-  (2019) SkyNet: A Champion Design for DAC-SDC on Low Power Object Detection. arXiv:1906.10327. Cited by: §3.2, Table 4, Table 5, §6, §6.
-  (2017) Design space exploration of FPGA-based accelerators with multi-level parallelism. In DATE, Cited by: §7.
-  (2017) A performance analysis framework for exploiting GPU microarchitectural capability. In ICS, Cited by: §4.
-  (2017) Learning transferable architectures for scalable image recognition. arXiv:1707.07012. Cited by: §4.