1 Introduction
Deep Neural Networks (DNNs) have achieved record-breaking performance in various applications, such as image classification [1, 2] and natural language processing [3]. However, their powerful performance often comes with prohibitive complexity [4, 5, 6, 7, 8]. Moreover, DNN-based applications often require not only high accuracy but also aggressive hardware performance, including high throughput, low latency, and high energy efficiency. As such, there has been intensive research on DNN accelerators that take advantage of different hardware platforms, such as FPGAs and ASICs, to improve DNN acceleration efficiency [9, 10, 11, 12, 13, 14].
While DNN accelerators can be 1000× more efficient than general-purpose computing platforms [15], developing DNN accelerators presents significant challenges because: (1) mainstream DNNs have millions of parameters and billions of operations; (2) the design space of DNN accelerators is large due to numerous design choices of architectures, hardware IPs, DNN-to-accelerator mappings, etc.; and (3) algorithm/hardware co-design is needed, since the same DNN functionality can be decomposed in different ways that require different hardware IPs and thus correspond to dramatically different hardware performance/energy/area trade-offs. Therefore, high-quality DNN accelerators often take months or years to design and require a large team of cross-disciplinary experts with knowledge of DNN algorithms, micro-architectures, and physical chip design. Such a barrier makes it difficult to scientifically explore innovative DNN accelerator designs and thus limits the more extensive application of DNNs.
To address the aforementioned challenges, we propose DNN-Chip Predictor, an analytical performance predictor that can efficiently and accurately predict DNN accelerators’ performance prior to time-consuming ASIC/FPGA hardware implementation. Specifically, our Predictor formulates DNN accelerators’ energy, throughput, and latency based on parameters that characterize the DNN models, the corresponding accelerators’ architectures, and the algorithm-to-hardware mapping methods (i.e., dataflows). Such a generic Predictor (1) enables fast evaluation of DNN accelerator innovations and (2) can be used as an efficient design exploration and optimization tool for DNN accelerators, given their large design space. To the best of our knowledge, our proposed Predictor is the first that offers the following three features simultaneously for practical and wide adoption: it is (1) analytical and thus fast; (2) applicable to both ASIC and FPGA DNN accelerators; and (3) validated using different DNN models and accelerator designs (i.e., architectures, dataflows, and process technologies).
2 Background
DNN Accelerators. There have been intensive studies of DNN accelerators. For example, the first well-optimized FPGA DNN accelerator [16] uses loop tiling; the DianNao series [13, 17] is an early effort on synthesis-based ASIC accelerators; Eyeriss proposes a row-stationary dataflow [14] to reduce expensive DRAM accesses; and Google TPUs [11, 12] use a systolic array to achieve high throughput.
DNN Accelerator Performance Prediction. DNNs often feature high complexity, while there exist various opportunities for reuse, pipelining, and resource allocation to maximize DNN accelerators’ performance. Therefore, an accurate yet fast performance predictor is desired to enable efficient design space exploration and optimization with different performance trade-offs.
Various methods have been developed for predicting or simulating DNN accelerators’ performance. These include roofline models [16, 18] and customized analytical models that are closely tied to specific design attributes [9, 19, 20]. However, roofline models lack fine-grained estimation, and customized models are not as general as desired. Timeloop [21] and Eyeriss [22] use for and parallel-for to describe the temporal and spatial mapping of DNN accelerators. Specifically, Timeloop obtains the number of memory accesses and estimates the latency by calculating the maximum isolated execution cycle across all hardware IPs based on a double-buffering assumption. Accelergy [23] proposes a configuration language to describe hardware architectures and depends on plug-ins, e.g., Timeloop, to calculate the energy as in [14]. The work in [24] adopts Halide [25], a domain-specific language for image processing applications, and proposes a modeling framework similar to that of [14]. MAESTRO [26] is the very first to adopt a data-centric approach.
3 The Proposed DNN-Chip Predictor
This section presents the proposed DNN-Chip Predictor, an analytical modeling framework that formulates DNN inference accelerators’ energy cost, latency, and throughput when employing different dataflows and hardware architectures. We first introduce the employed design space description method and then describe the developed performance models. The advantages of the DNN-Chip Predictor are that it (1) matches well with actual implementation results (prediction error within 18%); (2) is analytical and intuitive (it directly ties to the DNN model and accelerator parameters), which makes it easy to use for time-efficient design space exploration and optimization; and (3) is programmer-friendly and compatible with commonly used DNN frameworks (e.g., PyTorch [27]) thanks to its generic description of DNN accelerators’ design space.
3.1 Design Space Description
For modeling DNN accelerators’ performance given their large design space, one critical question is how to describe the whole design space, i.e., cover all possible design choices, in a way that is easy to follow. For ease of use and better visualization, we adopt a nested for-loop description [14] to describe the design space, as shown in Fig. 1. Specifically, we employ (1) the primitive for to describe the temporal operations of each processing element (PE) as well as the temporal data tiling and mapping operations at the DRAM, global buffer (GB), and register file (RF) levels; and (2) the primitive parallel-for to describe the spatial data tiling and mapping operations at the network-on-chip (NoC) level (i.e., in the PE array). Without loss of generality, we consider four levels of memory hierarchy, i.e., off-chip DRAM, on-chip GB, NoC in the PE array, and RF within the PEs. The design space of DNN accelerators mainly includes two aspects: hardware architectures and dataflows.
Hardware architecture. It can be described using a set of architecture-dependent hardware parameters and technology-dependent IP parameters. In particular, the architecture-dependent hardware parameters include the PE array architecture (e.g., spatial array, systolic array, or adder tree), the number of PEs, the NoC design (e.g., unicast, multicast, or broadcast), the memory hierarchies, and the storage capacity and communication bandwidth of each memory hierarchy; the technology-dependent IP parameters include the unit energy/delay costs of (1) a MAC operation and (2) memory accesses to the various memory hierarchies, as well as (3) the clock frequency.
Dataflow. This describes how a DNN is temporally and spatially scheduled to be executed in an accelerator. Specifically, a dataflow answers the following questions: (1) how are the computations mapped and scheduled in the PE array and within each PE? and (2) what are the loop ordering and tiling factors at the DRAM and global buffer levels? The former captures the design choice of holding a certain type of data locally in the PE once it has been fetched from the memories, e.g., row/weight/output stationary. The latter shows how to store data in SRAM and DRAM to accommodate the stationary data effectively. These two questions can be described using three groups of parameters, defined below in the context of the example in Fig. 1: (1) loop ordering factors for the twenty-four nested for-loops associated with the six dimensions of the 3D convolution operation and the four considered memory hierarchies (i.e., DRAM, GB, NoC, and RF); (2) loop tiling factors for the same twenty-four nested for-loops; and (3) data access locations, i.e., at which of the nested for-loops the on-chip GB and in-PE RFs are refreshed for the activations and weights.
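As an illustration of the first two parameter groups, the following Python sketch builds a twenty-four-entry loop nest from per-level loop orderings and tiling factors. It is only a sketch under stated assumptions: the dimension names (M, C, E, F, R, S), the layer sizes, the tiling factors, and the helper names are hypothetical and do not come from Fig. 1, and the validity check (the tiling factors of each dimension multiply to the layer size) is an assumed convention rather than a rule stated in the text.

```python
from dataclasses import dataclass
from math import prod

LEVELS = ["DRAM", "GB", "NoC", "RF"]    # memory hierarchy, outermost first
DIMS = ["M", "C", "E", "F", "R", "S"]   # hypothetical names for the six conv dimensions


@dataclass(frozen=True)
class Loop:
    level: str   # memory hierarchy this loop belongs to
    dim: str     # convolution dimension it iterates over
    bound: int   # loop tiling factor

    @property
    def primitive(self) -> str:
        # Spatial mapping across the PE array uses parallel-for; the rest is temporal.
        return "parallel-for" if self.level == "NoC" else "for"


def build_loop_nest(ordering, tiling):
    """ordering[level] lists dims outermost first; tiling[level][dim] is the bound."""
    return [Loop(level, dim, tiling[level][dim])
            for level in LEVELS
            for dim in ordering[level]]


def covers_layer(nest, layer_shape):
    """Assumed validity check: per-dimension bounds multiply to the layer size."""
    return all(prod(l.bound for l in nest if l.dim == d) == layer_shape[d]
               for d in DIMS)


# Hypothetical layer and tiling factors, chosen only for illustration.
layer_shape = {"M": 64, "C": 32, "E": 16, "F": 16, "R": 3, "S": 3}
tiling = {
    "DRAM": {"M": 4, "C": 2, "E": 2, "F": 1, "R": 1, "S": 1},
    "GB":   {"M": 2, "C": 4, "E": 2, "F": 16, "R": 1, "S": 1},
    "NoC":  {"M": 8, "C": 1, "E": 4, "F": 1, "R": 3, "S": 1},
    "RF":   {"M": 1, "C": 4, "E": 1, "F": 1, "R": 1, "S": 3},
}
ordering = {level: list(DIMS) for level in LEVELS}
ordering["GB"] = ["C", "M", "E", "F", "R", "S"]   # reorder one level as an example

nest = build_loop_nest(ordering, tiling)
for loop in nest:
    print(f"{loop.primitive:12s} {loop.dim} in range({loop.bound})  # {loop.level}")
```

Changing `ordering` or `tiling` yields a different point in the design space while the layer itself stays the same, which is the sense in which these factors parameterize a dataflow.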
3.2 The DNNChip Predictor
3.2.1 Overview
Fig. 2 shows a high-level view of the proposed DNN-Chip Predictor, which accepts DNN models (e.g., number of layers, layer structure, bit-precision, etc.), hardware architectures (e.g., memory hierarchy, number of PEs, NoC design, etc.), dataflows (e.g., row/weight/output stationary, loop tiling/ordering factors, etc.), and technology-dependent unit costs (e.g., unit energy/delay cost of a MAC operation and memory accesses to various memory hierarchies), and then outputs the estimated energy consumption, latency, and throughput when executing the DNN in a target accelerator.
It thus can be used to (1) validate DNN accelerator techniques prior to the time- and cost-consuming DNN ASIC/FPGA accelerator implementation, and (2) perform time-efficient design space exploration and optimization.
3.2.2 The Proposed Analytical Models
This subsection introduces the Predictor’s analytical models.
Energy Models. DNN accelerators’ energy cost includes both computational ($E_{comp}$) and data movement ($E_{DM}$) costs, where $E_{comp} = e_{MAC} \times N_{MAC}$, with $e_{MAC}$ denoting the unit energy cost per MAC operation and $N_{MAC}$ the total number of MACs in the DNN. Similarly, the data movement cost can be calculated by multiplying the unit energy cost per access ($e_i$) with the total number of accesses ($N_i^j$) to the $i$-th memory hierarchy (e.g., GB) using the $j$-th type of data (i.e., inputs ($I$), outputs ($O$), and weights ($W$)):
(1) 
where for inputs/weights; and for outputs.
The key challenge is to obtain the number of accesses $N_i^j$ for various memory hierarchies and data types when using different DNN models, hardware architectures, and dataflows. We are the first to find that $N_i^j$ can be calculated as the product of the $j$-th data volume ($V_{refresh}$) involved in each refresh and the total number of such refreshes ($N_{refresh}$) for the $i$-th memory:
$$N_i^j = V_{refresh} \times N_{refresh} \qquad (2)$$
To obtain the number of refreshes $N_{refresh}$ and the data volume per refresh $V_{refresh}$, we propose an intuitive methodology: (1) choose a refresh location in the nested for-loops (see Fig. 1) for a given data type, which can be straightforwardly decided once the dataflow is known; (2) $N_{refresh}$ is equal to the product of all the loop bounds in the for-loops above the refresh location; and (3) $V_{refresh}$ is equal to the product of all the loop bounds in the for-loops below the refresh location and associated with the particular type of data. Once $N_{refresh}$ and $V_{refresh}$ are obtained, the energy can be calculated as:
(3) 
(4) 
(5) 
(6) 
where is the number of active PEs and is the number of PEs that share the same data.
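The refresh-based counting described in this subsection can be sketched in a few lines of Python. This is a minimal illustration and not the authors' implementation: the loop nest, the dimension names, the assumption that weights are indexed by the M, C, R, and S dimensions, the refresh location, and the unit energy value are all hypothetical.

```python
from math import prod

# (memory level, dimension, loop bound), outermost loop first. Hypothetical values.
LOOP_NEST = [
    ("DRAM", "M", 4),
    ("DRAM", "C", 2),
    ("DRAM", "E", 2),
    ("GB", "M", 2),
    ("GB", "C", 4),
    ("GB", "E", 8),
    ("GB", "F", 16),
    ("RF", "C", 4),
    ("RF", "R", 3),
    ("RF", "S", 3),
]

# Assumption: weights are indexed by output channel, input channel, and kernel dims.
WEIGHT_DIMS = {"M", "C", "R", "S"}


def refresh_stats(nest, refresh_location, data_dims):
    """Return (number of refreshes, data volume per refresh).

    refresh_location is the index of the first loop below the refresh point.
    """
    above = nest[:refresh_location]
    below = nest[refresh_location:]
    # All loops above the refresh point trigger a refresh, whatever their dimension.
    n_refresh = prod(bound for _, _, bound in above)
    # Only loops below that are associated with this data type add to the volume.
    v_refresh = prod(bound for _, dim, bound in below if dim in data_dims)
    return n_refresh, v_refresh


def access_energy(unit_energy, n_refresh, v_refresh):
    """Energy = unit energy per access x number of accesses (volume x refreshes)."""
    return unit_energy * n_refresh * v_refresh


# Weights in the GB are refreshed from DRAM just below the three DRAM-level loops.
n_refresh, v_refresh = refresh_stats(LOOP_NEST, refresh_location=3, data_dims=WEIGHT_DIMS)
dram_weight_accesses = n_refresh * v_refresh
dram_weight_energy = access_energy(200.0, n_refresh, v_refresh)  # 200.0 is a made-up unit cost
print(n_refresh, v_refresh, dram_weight_accesses, dram_weight_energy)
```

In this example the layer holds 8 x 32 x 3 x 3 = 2304 weights, yet 4608 DRAM weight accesses are counted: the DRAM-level loop over E (bound 2) sits above the refresh location but does not index the weights, so every weight is fetched twice. This is the kind of reuse loss that the loop ordering controls.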
Latency Models. Similarly, the latency of DNN accelerators can be formulated as:
(7) 
where , , , and denote the latency of computation in the PE array, accessing the DRAM from the GB, accessing the GB from an RF in the PEs, and setting up the first set of the weights and inputs, respectively. Adopting bit precision for inputs/outputs/weights is , , we have:
(8) 
(9) 
(10) 
(11) 
(12) 
(13) 
where is the memory bandwidth for the th memory hierarchy for the data type .
4 Experiment Results
We validate our proposed DNN-Chip Predictor by comparing its predicted performance with actual chip measurements in [14], FPGA implementation results in [28], and synthesis results based on a commercial CMOS technology, under the same experiment settings (e.g., unit energy, clock frequency, DNN model, architecture design, and dataflow).
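Table 1: Energy breakdown of AlexNet’s CONV1 and CONV5 layers, Eyeriss chip measurements (Meas.) vs. Predictor predictions (Pred.).

| Layer | Component | Meas. | Pred. | Diff. |
|-------|-----------|-------|-------|-------|
| CONV1 | comp. | 16.7% | 18.7% | 2.08% |
| CONV1 | RF | 79.6% | 74.4% | 5.15% |
| CONV1 | NoC | 1.7% | 4.8% | 3.10% |
| CONV1 | GB | 2.0% | 2.0% | 0.03% |
| CONV5 | comp. | 7.3% | 7.5% | 0.26% |
| CONV5 | RF | 80.3% | 79.1% | 1.16% |
| CONV5 | NoC | 5.3% | 7.0% | 1.64% |
| CONV5 | GB | 7.0% | 6.3% | 0.74% |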
Validation against Chip Measurements. For this set of experiments, we compare our Predictor’s predicted performance with Eyeriss’s chip measurement results using their normalized unit energy [14]. First, Table 1 compares the energy breakdown of AlexNet’s first and fifth CONV layers (denoted as CONV1 and CONV5, respectively), showing that the maximum difference is 5.15% and 1.64%, respectively.
Second, Fig. 3 compares the number of DRAM/GB accesses. The difference between the predicted number of DRAM accesses and Eyeriss’s measured results is between 2.18% and 12.10%, while the difference in terms of GB accesses is between 0.70% and 17.66%. Our Predictor’s predicted number of DRAM accesses is smaller than that of Eyeriss because the RLC overhead of sparse activations depends on the input images, and we lack information about which set of images was used in Eyeriss’s measurements. Additionally, Fig. 3 shows that the difference between the predicted number of GB accesses and Eyeriss’s results is less than 5% except for the CONV1 layer, where the relatively larger prediction error is caused by its larger stride of 4. Specifically, a larger stride leads to lower utilization of the inputs fetched from the GB, whereas our current Predictor considers the generic case where the stride is 1, as it is more often seen in recent DNN models. For better prediction accuracy, our Predictor can be adjusted to cover other stride values, i.e., by adding more cases to the analytical models in Section 3.2.2.
Third, Fig. 4 compares the latency of executing AlexNet’s five CONV layers and shows that the predicted latency and Eyeriss’s measured latency differ by 15.51%. The predicted latency is smaller than the measured one because our Predictor’s analytical models do not consider the corner cycles in which memory accesses and computation cannot be fully pipelined and processing stalls occur. Finally, the predicted throughput of executing AlexNet is 46.0 GOPS while the one measured by Eyeriss is 51.6 GOPS, showing a prediction error of 11%.
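The 11% throughput error can be checked with a short calculation. The sketch below assumes that the prediction error is the absolute difference normalized by the measured value; the text does not state the normalization, and the helper name is hypothetical.

```python
def relative_error(predicted: float, measured: float) -> float:
    """Absolute prediction error relative to the measured value (assumed definition)."""
    return abs(predicted - measured) / measured


# Throughput figures quoted above, in GOPS.
throughput_error = relative_error(predicted=46.0, measured=51.6)
print(f"throughput prediction error: {throughput_error:.1%}")
```

Under this definition the error is about 10.9%, which rounds to the reported 11%.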
Validation against FPGA Implementation. We compare our Predictor’s predicted latency with FPGA-measured ones under the same DNN model and hardware configurations [31]. Specifically, for the FPGA implementation, we use the open-source implementation of the award winner [31] in a state-of-the-art design contest [32]. Fig. 5 shows that our Predictor’s predicted latency differs from the FPGA-synthesized ones by 16.84%. Note that in FPGA implementations the GB can be partitioned into smaller chunks that are accessed simultaneously to increase the parallelism and minimize the latency. Our current models do not include the overhead of this partitioning, which is larger when the GB is partitioned into more chunks for layers with a larger size, leading to a larger prediction error for the CONV4/CONV5/CONV6 layers in Fig. 5.
Validation against Synthesis Results. Table 2 compares the Predictor’s energy breakdown with that from the synthesis results for AlexNet’s CONV3–CONV5 layers on an in-house dedicated accelerator using a commercial 65nm CMOS technology. It can be seen from Table 2 that the difference between our Predictor’s predicted energy breakdown and that from the synthesis results is at most 5.28%.
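Table 2: Energy breakdown (%) of AlexNet’s CONV3–CONV5 layers, synthesis results (Syn.) vs. Predictor predictions (Pred.).

| Layer | Component | Syn. (%) | Pred. (%) | Diff. (%) |
|-------|-----------|----------|-----------|-----------|
| CONV3 | comp. | 38.76 | 34.49 | 4.26 |
| CONV3 | RF | 60.99 | 65.25 | 4.26 |
| CONV3 | GB | 0.24 | 0.25 | 0.01 |
| CONV4 | comp. | 39.46 | 34.28 | 5.19 |
| CONV4 | RF | 60.28 | 65.45 | 5.16 |
| CONV4 | GB | 0.25 | 0.27 | 0.02 |
| CONV5 | comp. | 31.13 | 25.85 | 5.28 |
| CONV5 | RF | 68.65 | 73.91 | 5.26 |
| CONV5 | GB | 0.22 | 0.24 | 0.02 |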
5 Conclusion
To close the gap between the growing demand for dedicated DNN accelerators with various specifications and the time-consuming and challenging DNN accelerator design, we develop DNN-Chip Predictor, which can efficiently and effectively predict an accelerator’s energy, latency, and resource consumption. Such an analytical performance prediction tool will facilitate fast development of innovations for not only DNN accelerators but also hardware-aware efficient DNNs.
6 Acknowledgement
The work is supported by the National Science Foundation (NSF) through the ECCS Division Of Electrical, Communication & Cyber System (Award number: 1934767).
References
[1] Karen Simonyan et al., “Very deep convolutional networks for large-scale image recognition,” CoRR, vol. abs/1409.1556, 2014.
[2] Yue Wang et al., “E2-Train: Training State-of-the-art CNNs with Over 80% Energy Savings,” in Advances in Neural Information Processing Systems, 2019, pp. 5139–5151.
[3] Wayne Xiong et al., “The Microsoft 2016 conversational speech recognition system,” in Acoustics, Speech and Signal Processing (ICASSP), 2017 IEEE International Conference on. IEEE, 2017, pp. 5255–5259.
[4] Sicong Liu et al., “On-demand deep model compression for mobile devices: A usage-driven model selection framework,” in Proceedings of the 16th Annual International Conference on Mobile Systems, Applications, and Services. ACM, 2018, pp. 389–400.
[5] Yue Wang et al., “Energynet: Energy-efficient dynamic inference,” in Advances in Neural Information Processing Systems (Workshop), 2018.
[6] Jianghao Shen et al., “Fractional Skipping: Towards Finer-Grained Dynamic Inference,” in The Thirty-Fourth AAAI Conference on Artificial Intelligence, 2020.
[7] Junru Wu et al., “Deep k-Means: Re-Training and Parameter Sharing with Harder Cluster Assignments for Compressing Deep Convolutions,” in Thirty-fifth International Conference on Machine Learning, 2018.
[8] Yue Wang et al., “Dual dynamic inference: Enabling more efficient, adaptive and controllable deep inference,” IEEE Journal of Selected Topics in Signal Processing, 2019.
[9] Xiaofan Zhang et al., “DNNBuilder: an automated tool for building high-performance dnn hardware accelerators for FPGAs,” in Proc. of ICCAD, 2018.
[10] Yingyan Lin et al., “Predictivenet: An energy-efficient convolutional neural network via zero prediction,” in 2017 IEEE International Symposium on Circuits and Systems (ISCAS), May 2017, pp. 1–4.
[11] Google Inc., “Edge TPU,” https://cloud.google.com/tpu/, accessed 2019-09-01.
[12] Google Inc., “Edge TPU,” https://coral.withgoogle.com/docs/edgetpu/faq/, accessed 2019-09-01.
[13] Zidong Du et al., “Shidiannao: Shifting vision processing closer to the sensor,” in ACM SIGARCH Computer Architecture News. ACM, 2015, vol. 43, pp. 92–104.
[14] Yu-Hsin Chen et al., “Eyeriss: A spatial architecture for energy-efficient dataflow for convolutional neural networks,” in Computer Architecture (ISCA), 2016 ACM/IEEE 43rd Annual International Symposium on. IEEE Press, 2016, pp. 367–379.
[15] Song Han et al., “Eie: efficient inference engine on compressed deep neural network,” in 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA). IEEE, 2016, pp. 243–254.
[16] Chen Zhang et al., “Optimizing fpga-based accelerator design for deep convolutional neural networks,” in Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. ACM, 2015, pp. 161–170.
[17] Tianshi Chen et al., “Diannao: A small-footprint high-throughput accelerator for ubiquitous machine-learning,” in ACM Sigplan Notices. ACM, 2014, vol. 49, pp. 269–284.
[18] Tianqi Tang et al., “Mlpat: A power, area, timing modeling framework for machine learning accelerators,” The Second International Workshop on Domain Specific System Architecture (DOSSA), 2019.
[19] Zhiqiang Liu et al., “Throughput-optimized fpga accelerator for deep convolutional neural networks,” ACM Transactions on Reconfigurable Technology and Systems (TRETS), vol. 10, no. 3, pp. 17, 2017.
[20] Ananda Samajdar et al., “Scale-sim: Systolic cnn accelerator simulator,” arXiv preprint arXiv:1811.02883, 2018.
[21] Angshuman Parashar et al., “Timeloop: A systematic approach to dnn accelerator evaluation,” in 2019 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS). IEEE, 2019, pp. 304–315.
[22] Yu-Hsin Chen et al., “Eyeriss v2: A flexible accelerator for emerging deep neural networks on mobile devices,” arXiv preprint arXiv:1807.07928, 2018.
[23] Yannan Wu et al., “Accelergy: An architecture-level energy estimation methodology for accelerator designs,” 2019.
[24] Xuan Yang et al., “Dnn dataflow choice is overrated,” arXiv preprint arXiv:1809.04070, 2018.
[25] Jonathan Ragan-Kelley et al., “Halide: a language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines,” Acm Sigplan Notices, vol. 48, no. 6, pp. 519–530, 2013.
[26] Hyoukjun Kwon et al., “Understanding reuse, performance, and hardware cost of dnn dataflows: A data-centric approach,” in Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture. ACM, 2019, pp. 754–768.
[27] Adam Paszke et al., “Automatic differentiation in pytorch,” 2017.
[28] Cong Hao et al., “Fpga/dnn co-design: An efficient design methodology for iot intelligence on the edge,” Proc. of DAC, 2019.
[29] Yu-Hsin Chen et al., “Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks,” IEEE Journal of Solid-State Circuits, vol. 52, no. 1, pp. 127–138, 2017.
[30] Alex Krizhevsky et al., “Imagenet classification with deep convolutional neural networks,” in Advances in Neural Information Processing Systems 25, F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, Eds., pp. 1097–1105. Curran Associates, Inc., 2012.
[31] Xiaofan Zhang et al., “Skynet: A champion model for dac-sdc on low power object detection,” arXiv preprint arXiv:1906.10327, 2019.
[32] Xilinx, Nvidia, and DJI, “Dac 2019 system design contest,” 2019.