DNN-Chip Predictor: An Analytical Performance Predictor for DNN Accelerators with Various Dataflows and Hardware Architectures


The recent breakthroughs in deep neural networks (DNNs) have spurred a tremendously increased demand for DNN accelerators. However, designing DNN accelerators is non-trivial as it often takes months/years and requires cross-disciplinary knowledge. To enable fast and effective DNN accelerator development, we propose DNN-Chip Predictor, an analytical performance predictor which can accurately predict DNN accelerators' energy, throughput, and latency prior to their actual implementation. Our Predictor features two highlights: (1) its analytical performance formulation of DNN ASIC/FPGA accelerators facilitates fast design space exploration and optimization; and (2) it supports DNN accelerators with different algorithm-to-hardware mapping methods (i.e., dataflows) and hardware architectures. Experiment results based on 2 DNN models and 3 different ASIC/FPGA implementations show that our DNN-Chip Predictor's predicted performance differs from chip measurements of the FPGA/ASIC implementations by no more than 17.66% across different DNN models, hardware architectures, and dataflows. We will release code upon acceptance.


1 Introduction

Deep Neural Networks (DNNs) have achieved record-breaking performance in various applications, such as image classification [1, 2] and natural language processing [3]. However, their powerful performance often comes with a prohibitive complexity [4, 5, 6, 7, 8]. Moreover, DNN-based applications often require not only high accuracy, but also aggressive hardware performance, including high throughput, low latency, and high energy efficiency. As such, there has been intensive research on DNN accelerators in order to take advantage of different hardware platforms, such as FPGAs and ASICs, for improving DNN acceleration efficiency [9, 10, 11, 12, 13, 14].

While DNN accelerators can be 1000× more efficient than general purpose computing platforms [15], developing DNN accelerators presents significant challenges, because: (1) mainstream DNNs have millions of parameters and billions of operations; (2) the design space of DNN accelerators is large due to numerous design choices of architectures, hardware IPs, DNN-to-accelerator mappings, etc.; and (3) algorithm/hardware co-design is needed because the same DNN functionality can be decomposed in different ways, each requiring different hardware IPs and leading to dramatically different hardware performance/energy/area trade-offs. Therefore, high-quality DNN accelerators often take months/years to design and require a large team of cross-disciplinary experts with knowledge in DNN algorithms, micro-architectures, and physical chip design. Such a barrier makes it difficult to scientifically explore innovative DNN accelerator designs and thus limits DNNs' more extensive applications.

To address the aforementioned challenges, we propose DNN-Chip Predictor, an analytical performance predictor which can efficiently and accurately predict DNN accelerators' performance prior to time-consuming ASIC/FPGA hardware implementation. Specifically, our Predictor formulates DNN accelerators' energy, throughput, and latency based on parameters that characterize the DNN models and the corresponding accelerators' architectures and algorithm-to-hardware mapping methods (i.e., dataflows). Such a generic Predictor (1) enables fast evaluation of DNN accelerator innovations and (2) can be used as an efficient design space exploration and optimization tool for DNN accelerators, given their large design space. To the best of our knowledge, our proposed Predictor is the first that offers the following three features simultaneously for practical and wide adoption: (1) analytical and thus fast; (2) covering both ASIC and FPGA DNN accelerators; and (3) validated against different DNN models and accelerator designs (i.e., architectures, dataflows, and process technologies).

2 Background

DNN Accelerators. There have been intensive studies of DNN accelerators. For example, the first well-optimized FPGA DNN accelerator [16] uses loop tiling; the DianNao series [13, 17] is an early effort on synthesis-based ASIC accelerators; Eyeriss proposes a row-stationary dataflow [14] to reduce expensive DRAM accesses; and Google TPUs [11, 12] use a systolic array to achieve high throughput.
DNN Accelerator Performance Prediction. DNNs often feature a high complexity, while there exist various opportunities in reuse, pipelining, and resource allocation to maximize DNN accelerators' performance. Therefore, an accurate yet fast performance predictor is desired to enable efficient design space exploration and optimization under different performance trade-offs. Various methods have been developed for predicting or simulating DNN accelerators' performance, including roofline models [16, 18] and customized analytical models that are closely tied to specific design attributes [9, 19, 20]. However, roofline models lack fine-grained estimation, and customized models are not as general as desired. Timeloop [21] and Eyeriss [22] use for and parallel-for primitives to describe the temporal and spatial mapping of DNN accelerators. Specifically, Timeloop obtains the number of memory accesses and estimates the latency by calculating the maximum isolated execution cycle across all hardware IPs based on a double-buffering assumption. Accelergy [23] proposes a configuration language to describe hardware architectures and depends on plug-ins, e.g., Timeloop, to calculate the energy as in [14]. The work in [24] adopts Halide [25], a domain-specific language for image processing applications, and proposes a modeling framework similar to that of [14]. MAESTRO [26] is the first to adopt a data-centric approach.

3 The Proposed DNN-Chip Predictor

This section presents the proposed DNN-Chip Predictor, an analytical modeling framework that formulates DNN inference accelerators' energy cost, latency, and throughput when employing different dataflows and hardware architectures. We first introduce the employed design space description method, and then describe the developed performance models. The advantages of the DNN-Chip Predictor are that it (1) matches well with actual implementation results (within 18% difference); (2) is analytical and intuitive (directly tied to the DNN model and accelerator parameters), facilitating time-efficient design space exploration and optimization; and (3) is programmer friendly and compatible with commonly used DNN frameworks (e.g., PyTorch [27]) thanks to its generic description of DNN accelerators' design space.

3.1 Design Space Description

Figure 1: A nested for-loop description of DNN accelerators' design space, using a CONV layer as an example, where the level indices 0, 1, 2, and 3 denote the four memory hierarchies (i.e., RF, NoC, GB, and DRAM, respectively), and the loop variables denote the six dimensions of a CONV layer (i.e., input/output channels, kernel width/height, and output feature map width/height, respectively).

For modeling DNN accelerators' performance given their large design space, one critical question is how to describe the whole design space, i.e., cover all possible design choices, in a way that is easy to follow. For ease of use and better visualization, we adopt a nested for-loop description [14] of the design space, as shown in Fig. 1. Specifically, we employ (1) the primitive for to describe the temporal operations of each processing element (PE) as well as the temporal data tiling and mapping operations at the DRAM, global buffer (GB), and register file (RF) levels; and (2) the primitive parallel-for to describe the spatial data tiling and mapping operations at the network-on-chip (NoC) level (i.e., in the PE array). Without loss of generality, we consider four levels of memory hierarchy, i.e., off-chip DRAM, on-chip GB, the NoC in the PE array, and the RFs within the PEs. The design space of DNN accelerators mainly includes two aspects: hardware architectures and dataflows.
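To make this description concrete, here is a minimal Python sketch (our own toy layer and tiling factors, not code from the Predictor) of the nested-loop view in Fig. 1; the NoC-level loop is the one that would be a parallel-for, written here as an ordinary loop for simplicity:

```python
# A minimal sketch (not the paper's code) of the nested-loop description in Fig. 1.
# Loop bounds below are illustrative; a real dataflow would pick tiling factors so that
# the products over all levels equal the layer dimensions.
import numpy as np

# Toy CONV layer: C input channels, K output channels, RxS kernel, ExF output map.
C, K, R, S, E, F = 4, 8, 3, 3, 8, 8
stride = 1
H, W = (E - 1) * stride + R, (F - 1) * stride + S   # input feature map size

inputs  = np.random.rand(C, H, W).astype(np.float32)
weights = np.random.rand(K, C, R, S).astype(np.float32)
outputs = np.zeros((K, E, F), dtype=np.float32)

# Tiling of the output channels K across two levels: DRAM-level tiles of GB-level tiles.
K_dram, K_gb = 2, 4          # K_dram * K_gb == K
E_noc = 4                    # "parallel-for": E_noc output rows mapped spatially onto PEs
E_rf = E // E_noc

for k0 in range(K_dram):                 # for (DRAM level): which weight tile is in the GB
    for k1 in range(K_gb):               # for (GB level): which output channel is active
        k = k0 * K_gb + k1
        for e0 in range(E_noc):          # parallel-for (NoC level): rows spread over PEs
            for e1 in range(E_rf):       # for (RF level): temporal loop inside one PE
                e = e0 * E_rf + e1
                for f in range(F):       # remaining CONV dimensions inside the PE
                    for c in range(C):
                        for r in range(R):
                            for s in range(S):
                                outputs[k, e, f] += (
                                    inputs[c, e * stride + r, f * stride + s]
                                    * weights[k, c, r, s]
                                )

# Sanity check against a direct convolution of the same layer.
ref = np.zeros_like(outputs)
for k in range(K):
    for e in range(E):
        for f in range(F):
            ref[k, e, f] = np.sum(
                inputs[:, e * stride: e * stride + R, f * stride: f * stride + S]
                * weights[k]
            )
assert np.allclose(outputs, ref)
```

However the loops are reordered or tiled, the result is the same CONV output; what changes is where each operand lives at each point in time, which is exactly what the performance models below account for.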

Hardware architecture. This can be described using a set of architecture-dependent hardware parameters and technology-dependent IP parameters. In particular, the architecture-dependent hardware parameters include the PE array architecture (e.g., spatial array, systolic array, or adder tree), the number of PEs, the NoC design (e.g., unicast, multicast, or broadcast), the memory hierarchy, and the storage capacity and communication bandwidth of each memory level; the technology-dependent IP parameters include the unit energy/delay costs of (1) a MAC operation and (2) memory accesses to the various memory levels, as well as (3) the clock frequency.
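For illustration only, the sketch below groups these parameters into configuration objects; the class names, fields, and numeric values are our own placeholders rather than the Predictor's actual interface:

```python
from dataclasses import dataclass, field

@dataclass
class HardwareConfig:
    """Architecture-dependent parameters (illustrative names, not the paper's API)."""
    pe_array: str = "spatial"          # e.g., "spatial", "systolic", "adder_tree"
    num_pes: int = 168
    noc: str = "multicast"             # e.g., "unicast", "multicast", "broadcast"
    # Storage capacity (bytes) and bandwidth (bytes/cycle) per memory hierarchy level.
    capacity: dict = field(default_factory=lambda: {"RF": 512, "GB": 108 * 1024})
    bandwidth: dict = field(default_factory=lambda: {"DRAM": 16, "GB": 64, "RF": 4})

@dataclass
class TechnologyConfig:
    """Technology-dependent unit costs (made-up placeholder numbers)."""
    e_mac_pj: float = 0.2                               # energy per MAC (pJ)
    e_access_pj: dict = field(default_factory=lambda: { # energy per access (pJ)
        "RF": 0.1, "NoC": 0.4, "GB": 1.2, "DRAM": 40.0})
    clock_mhz: float = 200.0
```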

Dataflow. This describes how a DNN is temporally and spatially scheduled to be executed in an accelerator. Specifically, a dataflow answers the following questions: (1) how are the computations mapped and scheduled in the PE array and within each PE? and (2) what are the loop ordering and tiling factors at the DRAM and global buffer levels? The former captures the design choice of holding a certain type of data locally in the PE once it is fetched from the memories, e.g., row/weight/output stationary. The latter describes how data is stored in SRAM and DRAM to support the chosen data stationarity effectively. These two questions can be described using three groups of parameters, defined in the context of the example in Fig. 1: (1) loop ordering factors for the twenty-four nested for-loops associated with the six dimensions of the 3D convolution operation and the four considered memory hierarchies (i.e., DRAM, GB, NoC, and RF); (2) loop tiling factors for the same twenty-four nested for-loops; and (3) data access locations, i.e., in which of the nested for-loops the on-chip GB and in-PE RFs are refreshed for the activations and weights.
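As an illustration (continuing the configuration sketch above; names and values are our own, not the Predictor's interface), such a dataflow could be captured by per-level loop orders, per-level tiling factors, and per-data-type refresh locations:

```python
from dataclasses import dataclass, field

@dataclass
class DataflowConfig:
    """Dataflow parameters for the nested-loop description of Fig. 1 (illustrative)."""
    # Loop ordering per memory level: outermost-to-innermost over the six CONV dims.
    loop_order: dict = field(default_factory=lambda: {
        "DRAM": ["K", "C", "E", "F", "R", "S"],
        "GB":   ["E", "F", "K", "C", "R", "S"],
        "NoC":  ["E", "K", "C", "F", "R", "S"],
        "RF":   ["F", "R", "S", "C", "K", "E"],
    })
    # Tiling factors per level; per dimension they must multiply to the layer size
    # (here consistent with the toy layer C=4, K=8, R=S=3, E=F=8 sketched earlier).
    tiling: dict = field(default_factory=lambda: {
        "DRAM": {"K": 2, "C": 1, "E": 1, "F": 1, "R": 1, "S": 1},
        "GB":   {"K": 4, "C": 1, "E": 2, "F": 1, "R": 1, "S": 1},
        "NoC":  {"K": 1, "C": 1, "E": 4, "F": 1, "R": 1, "S": 1},
        "RF":   {"K": 1, "C": 4, "E": 1, "F": 8, "R": 3, "S": 3},
    })
    # In which loop level the GB / RF contents are refreshed for each data type.
    refresh_level: dict = field(default_factory=lambda: {
        "inputs": "GB", "weights": "DRAM", "outputs": "GB"})
```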

3.2 The DNN-Chip Predictor

3.2.1 Overview

Fig. 2 shows a high-level view of the proposed DNN-Chip Predictor, which accepts DNN models (e.g., number of layers, layer structure, bit-precision, etc.), hardware architectures (e.g., memory hierarchy, number of PEs, NoC design, etc.), dataflows (e.g., row/weight/output stationary, loop tiling/ordering factors, etc.), and technology-dependent unit costs (e.g., unit energy/delay cost of a MAC operation and memory accesses to various memory hierarchies), and then outputs the estimated energy consumption, latency, and throughput when executing the DNN in a target accelerator.

Figure 2: A high-level view of the DNN-Chip Predictor.

It thus can be used to (1) validate DNN accelerator techniques prior to the time- and cost-consuming DNN ASIC/FPGA accelerator implementation, and (2) perform time-efficient design space exploration and optimization.

3.2.2 The Proposed Analytical Models

This subsection introduces the Predictor’s analytical models.

Energy Models. A DNN accelerator's energy cost includes both the computational cost ($E_{\mathrm{comp}}$) and the data movement cost ($E_{\mathrm{data}}$), where $E_{\mathrm{comp}} = e_{\mathrm{MAC}} \cdot N_{\mathrm{MAC}}$ with $e_{\mathrm{MAC}}$ being the unit energy cost of a MAC operation and $N_{\mathrm{MAC}}$ denoting the total number of MACs in the DNN. Similarly, the data movement cost can be calculated by multiplying the unit energy cost per access ($e_i^j$) with the total number of accesses ($A_i^j$) to the $i$-th memory hierarchy (e.g., GB) using the $j$-th type of data (i.e., inputs ($I$), outputs ($O$), and weights ($W$)):

$$E_{\mathrm{data}} = \sum_{i} \sum_{j \in \{I, O, W\}} e_i^j \cdot A_i^j, \qquad (1)$$

where $A_i^j$ counts only read accesses for inputs and weights, and both read and write accesses for outputs.

The key challenge is to obtain $A_i^j$ for various memory hierarchies and data types when using different DNN models, hardware architectures, and dataflows. We are the first to find that $A_i^j$ can be calculated as the product of the data volume ($V_i^j$) of the $j$-th data type involved in each refresh and the total number of such refreshes ($R_i^j$) for the $i$-th memory:

$$A_i^j = V_i^j \cdot R_i^j. \qquad (2)$$

To obtain $V_i^j$ and $R_i^j$, we propose an intuitive methodology: (1) choose a refresh location in the nested for-loops (see Fig. 1) for a given data type, which can be straightforwardly decided once the dataflow is known; (2) $R_i^j$ is the product of all the loop bounds in the for-loops above the refresh location; and (3) $V_i^j$ is the product of all the loop bounds in the for-loops below the refresh location that are associated with the particular type of data. Once $V_i^j$ and $R_i^j$ are obtained, the energy can be calculated as:

$$E_{\mathrm{DRAM}}^j = e_{\mathrm{DRAM}} \cdot V_{\mathrm{DRAM}}^j \cdot R_{\mathrm{DRAM}}^j, \qquad (3)$$
$$E_{\mathrm{GB}}^j = e_{\mathrm{GB}} \cdot V_{\mathrm{GB}}^j \cdot R_{\mathrm{GB}}^j, \qquad (4)$$
$$E_{\mathrm{NoC}}^j = e_{\mathrm{NoC}} \cdot V_{\mathrm{NoC}}^j \cdot R_{\mathrm{NoC}}^j \cdot \frac{N_{\mathrm{act}}}{N_{\mathrm{share}}^j}, \qquad (5)$$
$$E_{\mathrm{RF}}^j = e_{\mathrm{RF}} \cdot V_{\mathrm{RF}}^j \cdot R_{\mathrm{RF}}^j \cdot N_{\mathrm{act}}, \qquad (6)$$

where $N_{\mathrm{act}}$ is the number of active PEs and $N_{\mathrm{share}}^j$ is the number of PEs that share the same data.
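To make the refresh-location rule concrete, the following sketch (with our own illustrative loop bounds, refresh location, and unit energy, reusing the toy layer sketched earlier) computes $A_i^j = V_i^j \cdot R_i^j$ for the weights fetched from DRAM and the resulting energy term:

```python
# A minimal sketch of Eqs. (2)-(6): counting accesses as "volume per refresh" times
# "number of refreshes" for one data type at one memory level. The loop bounds, the
# refresh location, and the unit energy are illustrative, not taken from the paper.

# Nested loops from outermost to innermost: (dimension name, loop bound).
loops = [("K_dram", 2), ("K_gb", 4), ("E_gb", 2),             # DRAM/GB-level temporal loops
         ("E_noc", 4),                                         # NoC-level spatial loop
         ("F_rf", 8), ("C_rf", 4), ("R_rf", 3), ("S_rf", 3)]   # RF-level temporal loops

# Which CONV dimensions each data type depends on (weights do not depend on E/F).
relevant = {"weights": {"K", "C", "R", "S"},
            "inputs":  {"C", "E", "F", "R", "S"},
            "outputs": {"K", "E", "F"}}

def num_accesses(data_type, refresh_idx):
    """A_i^j = V_i^j * R_i^j for a refresh placed just above loops[refresh_idx]."""
    R_refresh = 1
    for _, bound in loops[:refresh_idx]:                # loops above the refresh location
        R_refresh *= bound
    V_refresh = 1
    for name, bound in loops[refresh_idx:]:             # loops below the refresh location
        if name.split("_")[0] in relevant[data_type]:   # ...that involve this data type
            V_refresh *= bound
    return V_refresh * R_refresh

# Example: weights held in the GB are refreshed from DRAM once per K_dram iteration,
# i.e., the refresh sits just below the outermost loop (index 1).
A_dram_w = num_accesses("weights", refresh_idx=1)       # 2 * (4*4*3*3) = 288 accesses
e_dram_pj = 40.0                                        # made-up unit energy per access (pJ)
print("DRAM weight accesses:", A_dram_w, "-> energy (pJ):", e_dram_pj * A_dram_w)
```

For this toy dataflow the count equals the total weight volume (K*C*R*S = 288), i.e., each weight is fetched from DRAM exactly once, which is what one would expect when the DRAM-level loop only tiles the output channels.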

Latency Models. Similarly, the latency of DNN accelerators can be formulated as:

$$T_{\mathrm{total}} = T_{\mathrm{setup}} + \max\left(T_{\mathrm{comp}},\, T_{\mathrm{DRAM}},\, T_{\mathrm{GB}}\right), \qquad (7)$$

where $T_{\mathrm{comp}}$, $T_{\mathrm{DRAM}}$, $T_{\mathrm{GB}}$, and $T_{\mathrm{setup}}$ denote the latency of computation in the PE array, of accessing the DRAM from the GB, of accessing the GB from an RF in the PEs, and of setting up the first set of weights and inputs, respectively. Denoting the bit precision of inputs/outputs/weights as $b_I$, $b_O$, and $b_W$, the data movement terms in Eqs. (8)-(13) follow the general form

$$T_i^j = \frac{A_i^j \cdot b_j}{BW_i^j}, \qquad (8\text{-}13)$$

where $BW_i^j$ is the memory bandwidth of the $i$-th memory hierarchy for the data type $j$.
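As a rough illustration of how these terms combine (assuming the perfect-overlap reading of Eq. (7) above; the access counts, bandwidths, PE count, and setup time below are placeholder values for the toy layer used earlier, not measured numbers), a latency estimate could be assembled as follows:

```python
# A rough latency sketch in the spirit of Eq. (7): with double buffering, computation
# and data movement overlap, so the total latency is the setup time plus the slowest of
# the per-level terms. All numbers below are illustrative placeholders.

def movement_cycles(num_accesses, bits, bandwidth_bits_per_cycle):
    """Cycles to move `num_accesses` words of `bits` bits over the given bandwidth."""
    return num_accesses * bits / bandwidth_bits_per_cycle

num_macs = 8 * 4 * 3 * 3 * 8 * 8           # K*C*R*S*E*F for the toy layer above
num_pes = 32                               # illustrative number of active PEs
t_comp = num_macs / num_pes                # one MAC per PE per cycle (idealized)

bits = 16                                  # bit precision of inputs/outputs/weights
t_dram = movement_cycles(num_accesses=288 + 400 + 512, bits=bits,
                         bandwidth_bits_per_cycle=64)   # GB <-> DRAM traffic (placeholder counts)
t_gb = movement_cycles(num_accesses=4608, bits=bits,
                       bandwidth_bits_per_cycle=256)    # RF <-> GB traffic (placeholder count)
t_setup = 100                              # cycles to load the first weights/inputs

t_total = t_setup + max(t_comp, t_dram, t_gb)
print(f"latency ~ {t_total:.0f} cycles")
```

In this toy setting the computation term dominates, so the estimate is compute-bound; with a slower DRAM bandwidth the same formula would instead expose a memory-bound design point.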

4 Experiment Results

We validate our proposed DNN-Chip Predictor by comparing its predicted performance with the chip measurement results in [14], the FPGA implementation results in [28], and synthesis results based on a commercial CMOS technology, under the same experiment settings (e.g., unit energy, clock frequency, DNN model, architecture design, and dataflow).

Figure 3: The number of (L) DRAM and (R) GB accesses in Eyeriss [29] and our Predictor for AlexNet's CONV layers.
Layer    | comp.          | RF             | NoC            | GB
         | Meas.   Pred.  | Meas.   Pred.  | Meas.   Pred.  | Meas.   Pred.
CONV1    | 16.7%   18.7%  | 79.6%   74.4%  | 1.7%    4.8%   | 2.0%    2.0%
 (diff.) |     +2.08%     |     -5.15%     |     +3.10%     |     -0.03%
CONV5    | 7.3%    7.5%   | 80.3%   79.1%  | 5.3%    7.0%   | 7.0%    6.3%
 (diff.) |     +0.26%     |     -1.16%     |     +1.64%     |     -0.74%
Table 1: The energy breakdown from Eyeriss [29] and our Predictor, for the CONV1 and CONV5 layers of AlexNet [30].

Validation against Chip Measurements. For this set of experiments, we compare our Predictor’s predicted performance with Eyeriss’s chip measurement results using their normalized unit energy [14]. First, Table 1 compares the energy breakdown of AlexNet’s first and fifth CONV layers (denoted as CONV1 and CONV5, respectively), showing that the maximum difference is 5.15% and 1.64%, respectively.

Second, Fig. 3 compares the number of DRAM/GB accesses. The difference between the predicted number of DRAM accesses and Eyeriss's measured results is between 2.18% and 12.10%, while the difference in terms of GB accesses is between -0.70% and 17.66%. Our Predictor's predicted number of DRAM accesses is smaller than that of Eyeriss because the RLC overhead of sparse activations depends on the input images, and we lack information about which set of images was used in Eyeriss's measurements. Additionally, Fig. 3 shows that the difference between the predicted number of GB accesses and Eyeriss's results is less than 5% except for the CONV1 layer, whose relatively larger prediction error is caused by its larger stride of 4. Specifically, a larger stride leads to lower utilization of the inputs fetched from the GB, whereas our current Predictor considers the generic case of stride 1, which is more common in recent DNN models. For better prediction accuracy, our Predictor can be extended to cover other stride values, i.e., more cases for the analytical models in Section 3.2.2.

Figure 4: Comparison of the inference latency of Eyeriss [29] and our Predictor when running AlexNet.

Third, Fig. 4 compares the latency of executing AlexNet's five CONV layers and shows that the predicted latency and Eyeriss's measured latency differ by 15.51%. The predicted latency is smaller than the measured one because our Predictor's analytical models do not consider the corner cycles where processing stalls occur because the memory accesses and computation cannot be fully pipelined. Finally, the predicted throughput of executing AlexNet is 46.0 GOPS while the one measured by Eyeriss is 51.6 GOPS, a prediction error of 11%.

Validation against FPGA Implementation. We compare our Predictor's predicted latency with FPGA-measured latency under the same DNN model and hardware configurations [31]. Specifically, for the FPGA baseline we use the open-source implementation of the award-winning design [31] from a state-of-the-art design contest [32]. Fig. 5 shows that our Predictor's predicted latency differs from the FPGA-synthesized results by 16.84%. Note that in FPGA implementations the GB can be partitioned into smaller chunks that are accessed simultaneously to increase parallelism and minimize latency. Our current models do not include the overhead of this partitioning, which grows when the GB is partitioned into more chunks for larger layers, leading to a larger prediction error for the CONV4/CONV5/CONV6 layers in Fig. 5.

Figure 5: Our Predictor’s predicted latency and the FPGA measured one for the 7 CONV layers of SkyNet [28].

Validation against Synthesis Results. Table 2 compares the Predictor's energy breakdown with that from the synthesis results for AlexNet's CONV3-CONV5 layers when using an in-house dedicated accelerator synthesized with a commercial 65nm CMOS technology. As shown in Table 2, the difference between our Predictor's predicted energy breakdown and that from the synthesis results is less than 5.28%.

Layer    | comp. (%)             | RF (%)                | GB (%)
         | Syn.    Pred.   Diff. | Syn.    Pred.   Diff. | Syn.   Pred.   Diff.
CONV3    | 38.76   34.49   4.26  | 60.99   65.25   4.26  | 0.24   0.25    0.01
CONV4    | 39.46   34.28   5.19  | 60.28   65.45   5.16  | 0.25   0.27    0.02
CONV5    | 31.13   25.85   5.28  | 68.65   73.91   5.26  | 0.22   0.24    0.02
Table 2: The energy breakdown from the synthesis results and our Predictor for AlexNet's CONV3-CONV5 layers.

5 Conclusion

To close the gap between the growing demand for dedicated DNN accelerators with various specifications and the time-consuming and challenging DNN accelerator design, we develop DNN-Chip Predictor, which can efficiently and effectively predict an accelerator’s energy, latency, and resource consumption. Such an analytical performance prediction tool will facilitate fast development of innovations for not only DNN accelerators but also hardware-aware efficient DNNs.

6 Acknowledgement

The work is supported by the National Science Foundation (NSF) through the Division of Electrical, Communications and Cyber Systems (ECCS) (Award number: 1934767).

References

  • [1] Karen Simonyan et al., “Very deep convolutional networks for large-scale image recognition,” CoRR, vol. abs/1409.1556, 2014.
  • [2] Yue Wang et al., “E2-Train: Training State-of-the-art CNNs with Over 80% Energy Savings,” in Advances in Neural Information Processing Systems, 2019, pp. 5139–5151.
  • [3] Wayne Xiong et al., “The microsoft 2016 conversational speech recognition system,” in Acoustics, Speech and Signal Processing (ICASSP), 2017 IEEE International Conference on. IEEE, 2017, pp. 5255–5259.
  • [4] Sicong Liu et al., “On-demand deep model compression for mobile devices: A usage-driven model selection framework,” in Proceedings of the 16th Annual International Conference on Mobile Systems, Applications, and Services. ACM, 2018, pp. 389–400.
  • [5] Yue Wang et al., “Energynet: Energy-efficient dynamic inference,” in Advances in Neural Information Processing Systems (Workshop), 2018.
  • [6] Jianghao Shen et al., “Fractional Skipping: Towards Finer-Grained Dynamic Inference,” in The Thirty-Fourth AAAI Conference on Artificial Intelligence, 2020.
  • [7] Junru Wu et al., “Deep k-Means: Re-Training and Parameter Sharing with Harder Cluster Assignments for Compressing Deep Convolutions,” in Thirty-fifth International Conference on Machine Learning, 2018.
  • [8] Yue Wang et al., “Dual dynamic inference: Enabling more efficient, adaptive and controllable deep inference,” IEEE Journal of Selected Topics in Signal Processing, 2019.
  • [9] Xiaofan Zhang et al., “DNNBuilder: an automated tool for building high-performance dnn hardware accelerators for FPGAs,” in Proc. of ICCAD, 2018.
  • [10] Yingyan Lin et al., “Predictivenet: An energy-efficient convolutional neural network via zero prediction,” in 2017 IEEE International Symposium on Circuits and Systems (ISCAS), May 2017, pp. 1–4.
  • [11] Google Inc., “Edge TPU,” https://cloud.google.com/tpu/, accessed 2019-09-01.
  • [12] Google Inc., “Edge TPU,” https://coral.withgoogle.com/docs/edgetpu/faq/, accessed 2019-09-01.
  • [13] Zidong Du et al., “Shidiannao: Shifting vision processing closer to the sensor,” in ACM SIGARCH Computer Architecture News. ACM, 2015, vol. 43, pp. 92–104.
  • [14] Yu-Hsin Chen et al., “Eyeriss: A spatial architecture for energy-efficient dataflow for convolutional neural networks,” in Computer Architecture (ISCA), 2016 ACM/IEEE 43rd Annual International Symposium on. IEEE Press, 2016, pp. 367–379.
  • [15] Song Han et al., “Eie: efficient inference engine on compressed deep neural network,” in 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA). IEEE, 2016, pp. 243–254.
  • [16] Chen Zhang et al., “Optimizing fpga-based accelerator design for deep convolutional neural networks,” in Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. ACM, 2015, pp. 161–170.
  • [17] Tianshi Chen et al., “Diannao: A small-footprint high-throughput accelerator for ubiquitous machine-learning,” in ACM Sigplan Notices. ACM, 2014, vol. 49, pp. 269–284.
  • [18] Tianqi Tang et al., “Mlpat: A power, area, timing modeling framework for machine learning accelerators,” The Second International Workshop on Domain Specific System Architecture (DOSSA), 2019.
  • [19] Zhiqiang Liu et al., “Throughput-optimized fpga accelerator for deep convolutional neural networks,” ACM Transactions on Reconfigurable Technology and Systems (TRETS), vol. 10, no. 3, pp. 17, 2017.
  • [20] Ananda Samajdar et al., “Scale-sim: Systolic cnn accelerator simulator,” arXiv preprint arXiv:1811.02883, 2018.
  • [21] Angshuman Parashar et al., “Timeloop: A systematic approach to dnn accelerator evaluation,” in 2019 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS). IEEE, 2019, pp. 304–315.
  • [22] Yu-Hsin Chen et al., “Eyeriss v2: A flexible accelerator for emerging deep neural networks on mobile devices,” arXiv preprint arXiv:1807.07928, 2018.
  • [23] Yannan Wu et al., “Accelergy: An architecture-level energy estimation methodology for accelerator designs,” 2019.
  • [24] Xuan Yang et al., “Dnn dataflow choice is overrated,” arXiv preprint arXiv:1809.04070, 2018.
  • [25] Jonathan Ragan-Kelley et al., “Halide: a language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines,” Acm Sigplan Notices, vol. 48, no. 6, pp. 519–530, 2013.
  • [26] Hyoukjun Kwon et al., “Understanding reuse, performance, and hardware cost of dnn dataflows: A data-centric approach,” in Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture. ACM, 2019, pp. 754–768.
  • [27] Adam Paszke et al., “Automatic differentiation in pytorch,” 2017.
  • [28] Cong Hao et al., “Fpga/dnn co-design: An efficient design methodology for iot intelligence on the edge,” Proc. of DAC, 2019.
  • [29] Yu-Hsin Chen et al., “Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks,” IEEE Journal of Solid-State Circuits, vol. 52, no. 1, pp. 127–138, 2017.
  • [30] Alex Krizhevsky et al., “Imagenet classification with deep convolutional neural networks,” in Advances in Neural Information Processing Systems 25, F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, Eds., pp. 1097–1105. Curran Associates, Inc., 2012.
  • [31] Xiaofan Zhang et al., “Skynet: A champion model for dac-sdc on low power object detection,” arXiv preprint arXiv:1906.10327, 2019.
  • [32] Xilinx, Nvidia, and DJI, “Dac 2019 system design contest,” 2019.