1 Introduction
Deep neural networks (DNNs) have recently become the fundamental inference vehicle for a broad range of artificial intelligence applications. Unlike conventional machine learning algorithms that rely on handcrafted features, DNNs extract features and learn the hidden patterns automatically from the training data. However, the superior performance of DNNs comes at the cost of requiring an immense amount of training data and massive computational complexity. The number of operations in the winning DNNs in the ImageNet Large Scale Visual Recognition Challenge
[2] has increased exponentially over the past few years, e.g., from 1.4 GOPs for AlexNet [3] in 2012 to 38 GOPs for VGG-19 [4] in 2014. On the other hand, the slowdown of Moore’s Law scaling makes it difficult to keep pace with DNN growth, since the gap between the computational demand from DNNs and the computational capacity available from underlying hardware resources keeps increasing. Hence, specialized hardware architectures have been developed to efficiently process DNNs. DNNs map well to a graphical processing unit (GPU) due to its parallel architecture, massive number of computational units, and high memory bandwidth. However, its performance density, the computation capacity per unit area, has almost saturated since 2011 [5]. The only improvement is due to technology scaling from 28 nm to 20 nm [5]. It is infeasible to continuously increase the chip area to accommodate larger DNNs without improving performance density. Customized application-specific integrated circuits (ASICs) and field-programmable gate array (FPGA)-based accelerators have also emerged for efficient DNN processing. These customized accelerators are designed to accelerate common low-level DNN operations, such as convolution and matrix multiplication. Hence, even though new DNN models are evolving rapidly and differ in their network architectures significantly, ASIC and FPGA-based accelerators are still capable of processing various DNNs efficiently. FPGA-based accelerators [6, 7, 8, 9, 10] provide high parallelism and fast time-to-market. An embedded FPGA platform is used as a convolver with dynamic-precision data quantization in [7] to achieve high throughput. In [8]
, a load-balance-aware pruning method is implemented on an FPGA to compress the Long Short-Term Memory model by 20×. A dynamic programming algorithm is used to map DNNs to a deeply pipelined FPGA in [9], which can achieve up to 21× and 2× higher energy efficiency relative to a central processing unit (CPU) and a GPU, respectively. Alternatively, ASIC accelerators [11, 12, 13, 14] demonstrate much better power efficiency for DNN processing relative to general-purpose processors. A model compression method is utilized in [11], where the inference engine can process the compressed network model efficiently through acceleration of sparse matrix-vector multiplications. A custom multi-chip machine-learning architecture, called DaDianNao, is presented in
[13]. With each chip implemented as a single-instruction multiple-data (SIMD)-like processor, DaDianNao achieves two orders of magnitude speedup over a GPU. In [12], an ASIC accelerator is proposed for DNN training on the server end. It uses heterogeneous processing tiles with a low-overhead synchronization mechanism. The Google Tensor Processing Unit (TPU) speeds up DNN inference with its multiply-accumulate (MAC) units arranged in a systolic structure
[14]. It has extended support for DNN training in its second version. To improve accelerator performance density, the computational resources should be fully utilized using an efficient dataflow. Since convolutional layers typically constitute over 90% of the total DNN operations [3, 15], parallelizing convolution computations can accelerate overall DNN processing significantly. The design space for convolutional layer processing comprises the processing of multiple loops and the data partitioning choices governed by limited on-chip memory [1]. In order to process the convolutional layer efficiently, loop unrolling, loop tiling, and loop interchange are often used in recent DNN accelerators [16, 1]. However, previous accelerator designs do not fully consider the target applications in the early design stage. This may lead to the choice of a suboptimal design from the DNN accelerator design space for the target applications, since the characteristics of different DNNs may vary significantly and require very different architectural designs for efficient processing.
In this article, we make the following contributions:
1) We develop an application-driven framework for architectural design space exploration of efficient DNN accelerators. This framework is based on a hardware analytical model for various DNN operations. The design space exploration task is modeled as a multi-dimensional optimization problem, which can be handled using a multi-step greedy method.
2) We use this framework in the early accelerator architecture design stage to achieve
geometric mean performance improvements ranging from 12.4% to 92.0%.
3) We use this framework to explore optimization opportunities for simultaneously
addressing diverse DNN applications.
4) We perform a sensitivity study of the relationship between DNN characteristics and the
corresponding accelerator design configurations.
The rest of the article is organized as follows. Section 2 discusses background information on DNNs. Section 3 presents our hardware analytical model for DNN operations. Section 4 describes the application-driven framework for efficient accelerator design space exploration. Section 5 presents experimental results obtained by our framework on accelerator optimization and a sensitivity study of DNN characteristics. Section 6 concludes the article.
2 Background
In this section, we present background material to help understand the rest of the article. We first describe the convolutional layer of a DNN in Section 2.1. We then explore two optimization strategies used for DNNs: computational parallelism and data reuse, in Section 2.2 and Section 2.3, respectively.
2.1 Convolutional layer of a DNN
The convolutional layer is one of the DNN building blocks. In the forward pass, the convolutional layer convolves a batch of 3D input feature maps with multiple learnable 3D kernel weights to generate a batch of 3D output feature maps. These learnable 3D kernel weights are trained to detect which features, such as edges, are present in the input feature maps. They also capture spatial and temporal dependencies in the input feature maps. As shown in Fig. 1, the convolutional layer operations can be represented by five sets of convolution loops: looping through different kernels, looping within one input feature map channel, looping through multiple input feature map channels, looping within one kernel channel, and looping through different inputs in the batch. The execution order of these loops can be interchanged to optimize the dataflow. Nof, Nox, Noy, Nif, Nkx, Nky, batch_size, and S denote the number of output feature map channels, the width and height of the output feature map, the number of input feature map channels, the width and height of the kernel window, the input batch size, and the sliding stride, respectively. These parameters are determined by the architecture of the DNNs. The on-chip memory of the accelerator may not be large enough to hold the entire data set for input feature maps, kernel weights, and output feature maps. Therefore, they need to be partitioned into smaller data chunks in order to fit in the on-chip memory. This is called loop tiling
[16]. T* parameters (e.g., Tif in Table II) denote the corresponding design variables for loop tiling. They represent the size of the data chunks stored in on-chip memory. Computational parallelism in the convolutional layer comes from loop unrolling, which increases the acceleration throughput and the resource utilization ratio. P* parameters denote the number of parallel computations in each dimension. As the computing resources can only process the data stored in on-chip memory and the tiled data set is a subset of the total data set, the design constraints are set as P* ≤ T* ≤ N*. For instance, Pof ≤ Tof ≤ Nof. The computational parallelism enabled by loop unrolling is discussed next.
2.2 Computational parallelism
Fig. 2(a)-(e) depict how the five types of loop unrolling work in the convolutional layer. Fig. 2(a) depicts loop unrolling within one kernel window. In each cycle, Pkx × Pky input pixels and kernel weights in the kernel window are read in from the on-chip memory to perform Pkx × Pky parallel multiplications followed by additions. The result is then added to the previously obtained partial sum. However, as the kernel sizes (Nkx and Nky) are usually relatively small, standalone loop unrolling within one kernel window cannot provide enough parallelism to fully utilize the accelerator compute resources [17]. Fig. 2(b) depicts loop unrolling across multiple input feature map channels. In each cycle, pixels at the same position in different channels are multiplied with the corresponding kernel weights across channels to produce Pif products. Then these products are summed up and accumulated with the partial sum. Fig. 2(c) shows loop unrolling within one input feature map channel. Pox × Poy input pixels are multiplied with the same kernel weight in parallel. The products are then accumulated to the corresponding partial sums in parallel. The input feature map sizes (Nix and Niy) are typically large enough to provide sufficient parallelism for the accelerator as long as the required data are stored in the on-chip memory. Fig. 2(d) describes loop unrolling across different kernels. Multiple kernel weights at the same location in different kernels are multiplied with the same input pixels. The products are accumulated to the corresponding partial sums for different output feature map channels. Fig. 2(e) presents loop unrolling across inputs in the batch. Multiple inputs can be processed in parallel since there is no data dependency among them. These loop unrolling types can be combined to further increase the parallelism in convolutional layer processing.
For example, loop unrolling within the kernel window, across multiple input feature map channels, and across different kernels are employed together in [7, 18, 17] while loop unrolling within one kernel window and within one input feature map channel are utilized in [19].
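As a concrete reference for the loop structure described above, the following pure-Python sketch enumerates the five sets of convolution loops sequentially (in hardware, unrolling would execute some of these iterations in parallel). All dimension values are illustrative, not taken from any real DNN.

```python
# Reference five-loop convolution: loop over batch inputs, kernels, output
# positions, input channels, and the kernel window, matching the loop
# decomposition in Fig. 1 of the text.
def conv_layer(inputs, weights, S):
    """inputs: [batch][Nif][Niy][Nix], weights: [Nof][Nif][Nky][Nkx]."""
    batch, Nif, Niy, Nix = (len(inputs), len(inputs[0]),
                            len(inputs[0][0]), len(inputs[0][0][0]))
    Nof, _, Nky, Nkx = (len(weights), len(weights[0]),
                        len(weights[0][0]), len(weights[0][0][0]))
    Noy = (Niy - Nky) // S + 1
    Nox = (Nix - Nkx) // S + 1
    out = [[[[0.0] * Nox for _ in range(Noy)]
            for _ in range(Nof)] for _ in range(batch)]
    for b in range(batch):                      # loop over inputs in the batch
        for of in range(Nof):                   # loop over different kernels
            for oy in range(Noy):               # loops within one output map
                for ox in range(Nox):
                    acc = 0.0
                    for i in range(Nif):        # loop over input channels
                        for ky in range(Nky):   # loops within kernel window
                            for kx in range(Nkx):
                                acc += (inputs[b][i][oy * S + ky][ox * S + kx]
                                        * weights[of][i][ky][kx])
                    out[b][of][oy][ox] = acc
    return out
```

Interchanging the order of these loops does not change the result, which is why the dataflow (loop order) is itself a design variable.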
2.3 Data reuse
A convolutional layer is processed by sliding the kernel windows along the 3D input feature maps where MAC operations are performed at each sliding step. Since memory access and data movement incur significant delay and energy overheads [19], data fetched from onchip memory should be reused as much as possible before being discarded.
If the loops within an input feature map (Fig. 2(c)) are unrolled, each kernel weight is broadcast to multiply with Pox × Poy different input pixels in every cycle. Thus, it is reused Pox × Poy times. If multiple inputs in the batch are processed in parallel (Fig. 2(e)), the number of times the kernel weight is reused is equal to the batch size. If both types of loop unrolling are employed, the total number of times each kernel weight is reused is:

reuse_weight = Pox × Poy × batch_size.    (1)
If loops across output feature map channels (Fig. 2(d)) are unrolled, then each input pixel is multiplied with multiple kernel weights from different kernels in parallel. Hence, each input pixel is reused Pof times. Besides, if both the loops within a kernel window (Fig. 2(a)) and within an input feature map (Fig. 2(c)) are unrolled together, then the pixels in neighboring kernel windows partially overlap as long as the sliding stride is smaller than the kernel window size. Since the overlapped pixels can be reused in the following cycle, the average number of times each input pixel is reused is Pkx Pky Pox Poy / (((Pox − 1)S + Pkx)((Poy − 1)S + Pky)). Combining the three types of loop unrolling mentioned above results in the total number of times each input pixel is reused being [1]:

reuse_input = Pof × Pkx × Pky × Pox × Poy / (((Pox − 1)S + Pkx)((Poy − 1)S + Pky)).    (2)
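The reuse counts of Eqs. (1) and (2) can be written as two small helper functions. The parameter names mirror the P* and S notation above; since the equation bodies are partially garbled in this excerpt, the expressions below are reconstructions from the surrounding text.

```python
# Data-reuse counts for the convolutional layer under loop unrolling.
def weight_reuse(P_ox, P_oy, batch_size):
    # Eq. (1): each weight multiplies P_ox * P_oy pixels per input map,
    # for batch_size inputs processed in parallel.
    return P_ox * P_oy * batch_size

def input_pixel_reuse(P_of, P_ox, P_oy, P_kx, P_ky, S):
    # Eq. (2): neighboring kernel windows overlap when S is smaller than the
    # window, so each fetched pixel serves P_kx*P_ky*P_ox*P_oy MACs on
    # average, times the P_of kernels processed in parallel.
    window_x = (P_ox - 1) * S + P_kx
    window_y = (P_oy - 1) * S + P_ky
    return P_of * P_kx * P_ky * P_ox * P_oy / (window_x * window_y)
```

With no kernel-window or feature-map unrolling (all factors 1), both counts degenerate to the expected trivial values.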
3 Analysis of DNN operation processing
In this section, we provide an analysis of hardware accelerator processing of DNN operations. We extend the analytical model for 2D convolution discussed in [1] with batch processing enabled. Based on this approach, we build analytical models for various compute-intensive DNN operations.
The number of MAC operations in the convolutional layer is Nmac = Nof × Nox × Noy × Nif × Nkx × Nky. Ideally, the total number of cycles required is Nmac/Pmac, where Pmac is the total number of MAC units, assuming 100% MAC unit efficiency. However, the available MAC units may not be fully utilized due to loop unrolling and loop tiling. In [1], the compute latency of the convolutional layer is modeled as the product of the inter-tiling cycle count and the inner-tiling latency, where
(3) 
(4) 
The inter-tiling cycle count refers to the number of data chunks used in loop tiling and the inner-tiling latency refers to the number of cycles required to process each chunk. Memory transfer latency is modeled as the maximum of the input memory cycles and the weight memory cycles, where
(5) 
(6) 
(7) 
(8) 
With the assumption that memory bandwidth is not a bottleneck and multipliers can receive the input pixels and kernel weights continuously without incurring an idle cycle, the total processing latency for the convolutional layer is equal to the maximum value of the compute and memory transfer latencies.
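The bodies of Eqs. (3)-(8) are not reproduced in this excerpt, so the following is only a simplified sketch of the model's structure: compute latency as inter-tiling iterations times per-tile cycles, memory latency as the slower of the input and weight transfers, and total latency as the maximum of the two. The dictionary keys and the bandwidth parameter are assumptions for illustration, not the paper's exact formulas.

```python
from math import ceil

def conv_latency(N, T, P, bandwidth_words_per_cycle):
    """N: full layer dims, T: tile sizes, P: unroll factors; dicts keyed by
    'of', 'ox', 'oy', 'if', 'kx', 'ky' (kernel dims may be omitted from T,
    and any unroll factor from P). Names are assumptions for this sketch."""
    # Inter-tiling iterations: how many tiles are streamed through on-chip
    # memory to cover the whole layer.
    inter = 1
    for d in ('of', 'ox', 'oy', 'if'):
        inter *= ceil(N[d] / T[d])
    # Inner-tiling latency: cycles to process one tile given the unrolling.
    inner = 1
    for d in ('of', 'ox', 'oy', 'if', 'kx', 'ky'):
        inner *= ceil(T.get(d, N[d]) / P.get(d, 1))
    compute = inter * inner
    # Rough data-traffic estimates for one input of the batch.
    input_words = N['if'] * N['ox'] * N['oy']
    weight_words = N['of'] * N['if'] * N['kx'] * N['ky']
    memory = ceil(max(input_words, weight_words) / bandwidth_words_per_cycle)
    # Under the perfect-overlap assumption, the slower side dominates.
    return max(compute, memory)
```

For a compute-bound layer the `max` returns the compute term; for a memory-bound one (small tiles, low bandwidth) it returns the transfer term, which is exactly the condition the buffer simulator below refines.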
To relax the constraint imposed by the memory bandwidth assumption made above and to increase performance estimation accuracy, an extra optional finer-grained buffer simulator has been developed to monitor on-chip data. The entire convolutional layer is divided into multiple computational blocks that can be executed in parallel. Apart from the execution latency of each computational block, the memory transfer latency is also included if the data required by the block are not stored in the buffer. This buffer simulator simulates data fetching from and storing back to off-chip memory. The number of computational blocks is a tradeoff between estimation speed and accuracy.
Depthwise separable convolution [20] is a variation of 2D convolution. It splits an ordinary 2D convolution into two parts: a 2D convolution within each channel (depthwise convolution) and a mixing of the channels using a set of 1×1 convolutions across channels (channel mixing). Compared to ordinary convolution, it has fewer parameters. Therefore, it requires less computation and is less prone to overfitting. As shown in Table I, the first part, depthwise convolution, can be fit into the 2D convolution model discussed above with the number of filter kernels set to 1. The second part, channel mixing, can be fit into the 2D convolution model with a 1×1 kernel size.
Another important layer in a DNN is the fully-connected layer, which is processed as a matrix-vector multiplication. We embed matrix-vector multiplication into 2D convolution to fit it into the analytical model described above, as shown in Fig. 3(a). The width, height, and depth of the input feature map are equal to the row number of the matrix, 1, and the column number of the matrix, respectively. The vector is mapped to a 1×1 kernel with a depth equal to the matrix column number. Similarly, matrix-matrix multiplication is embedded into 2D convolution, as shown in Fig. 3(b). The second matrix is mapped to col_2 1×1 kernels, where col_2 is the column number of the second matrix. Details of the design parameter values used to fit depthwise separable convolution (depthwise convolution and channel mixing), matrix-vector multiplication, and matrix-matrix multiplication operations into the 2D convolution cost model are shown in Table I. In matrix-vector multiplication, col and row denote the matrix column number and row number, respectively. In matrix-matrix multiplication, col_1, row_1, and col_2 denote the column and row numbers of the first matrix and the column number of the second matrix, respectively.
We validated these DNN operation analytical models against an internal FPGA implementation of three DNNs and found the timing errors to be within 10%.
Operation  Nif  Nix  Niy  Nkx  Nky  Nof  Nox  Noy  S
2D conv.  Nif  Nix  Niy  Nkx  Nky  Nof  Nox  Noy  S
Depthwise conv.  Nif  Nix  Niy  Nkx  Nky  1  Nox  Noy  S
Channel mixing  Nif  Nix  Niy  1  1  Nof  Nox  Noy  S
Matrix-vector mul.  col  row  1  1  1  1  row  1  1
Matrix-matrix mul.  col_1  row_1  1  1  1  col_2  row_1  1  1
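The parameter substitutions of Table I can be sketched as a small dispatch function that maps each operation onto the (Nif, Nix, Niy, Nkx, Nky, Nof, Nox, Noy, S) tuple consumed by the 2D convolution cost model. The operation names and keyword interface here are hypothetical, not the framework's actual API.

```python
# Map DNN operations onto the 2D convolution cost model (Table I).
def to_conv_params(op, **d):
    if op == 'conv2d':
        return (d['Nif'], d['Nix'], d['Niy'], d['Nkx'], d['Nky'],
                d['Nof'], d['Nox'], d['Noy'], d['S'])
    if op == 'depthwise':          # one filter kernel per channel: Nof = 1
        return (d['Nif'], d['Nix'], d['Niy'], d['Nkx'], d['Nky'],
                1, d['Nox'], d['Noy'], d['S'])
    if op == 'channel_mixing':     # 1x1 kernels across channels
        return (d['Nif'], d['Nix'], d['Niy'], 1, 1,
                d['Nof'], d['Nox'], d['Noy'], d['S'])
    if op == 'matvec':             # vector becomes a single 1x1 kernel
        return (d['col'], d['row'], 1, 1, 1, 1, d['row'], 1, 1)
    if op == 'matmat':             # second matrix becomes col_2 1x1 kernels
        return (d['col_1'], d['row_1'], 1, 1, 1,
                d['col_2'], d['row_1'], 1, 1)
    raise ValueError('unknown operation: ' + op)
```

Because every operation reduces to the same tuple, a single convolution latency model covers all the layer types in Table I.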
4 Application-driven architectural optimization
In this section, we discuss the proposed application-driven architectural optimization framework, which is based on the analytical operation models.
4.1 Architectural optimization flow
Fig. 4 shows the accelerator architectural optimization flow. An architecture description file is used to define the design variables of the hardware accelerator. For example, it defines variables for the compute resource organization and the allocation of on-chip memory for activations and weights. Another input is the DNN computation graph of the target application that the accelerator is optimized for. We obtain this DNN computation graph by parsing the frozen model file from TensorFlow [21]. It is a directed acyclic graph (DAG) in which a vertex represents a DNN operation and an edge defines a data dependency. The computation graph is first analyzed by a graph analyzer to generate a DNN operation stream. The total delay of the operation stream is estimated using the analytical model discussed in Section 3. We focus only on the time-consuming operations. Accelerator performance on the target application is then optimized using a multi-dimensional optimizer to obtain an optimized architectural configuration.
4.2 Computation graph analyzer
The DNN DAG is analyzed by traversing backward from the end node using depthfirst search. The operation stream is obtained such that an operation can only be appended to the stream if it has no parent node or all of its parent nodes are already processed and are in the stream.
Fig. 5(a)-(d) show an example of a DAG and the dynamic memory allocation analysis of the intermediate results. White nodes represent unprocessed operations that can only be processed if they have no parent nodes or all of their parent nodes have been processed. Then they become blue nodes, whose outputs are stored in on-chip memory. Their incoming edges are then removed, since the data dependency no longer exists after they are processed. A blue node turns grey if it has no more outgoing edges, which means no more nodes depend on it. Hence, the memory space for its outputs can be deallocated. Dynamic memory allocation is monitored throughout DAG traversal, and the maximum dynamic memory demand sets the lower bound for the on-chip buffer size of the accelerator.
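The bookkeeping of Fig. 5 can be sketched as follows: allocate a node's output when it is processed, free a node's output once its last consumer has run, and track the peak demand along the way. The graph encoding is an assumption for illustration, not the framework's actual data structure.

```python
# Track peak dynamic memory demand during a depth-first DAG traversal.
def peak_memory(out_size, parents, children):
    """out_size: node -> output bytes; parents/children: node -> list of
    neighbor node names."""
    done, live, current, peak = set(), {}, 0, 0
    remaining = {n: len(children[n]) for n in out_size}  # unserved consumers
    ready = [n for n in out_size if not parents[n]]      # no-parent nodes
    while ready:
        n = ready.pop()                 # depth-first processing order
        done.add(n)
        current += out_size[n]          # node turns "blue": output allocated
        live[n] = out_size[n]
        peak = max(peak, current)
        for p in parents[n]:            # parent turns "grey" when its last
            remaining[p] -= 1           # consumer has been processed
            if remaining[p] == 0:
                current -= live.pop(p)  # deallocate the parent's output
        for c in children[n]:
            if all(q in done for q in parents[c]):
                ready.append(c)
    return peak                         # lower bound for on-chip buffer size
```

For a simple three-node chain with unit-size outputs, at most two buffers are live at once, so the peak demand is 2.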
Variable name  Definition 

loop_order  The execution order of the convolutional loops 
PE_group  The total number of processing-element (PE) groups 
MAC/group  The number of MACs in each PE group 
buffer_bank_height  The height of the buffer bank 
buffer_bank_width  The width of the buffer bank 
weight_bank/group  The number of buffer banks per PE group for weights 
activation_bank/group  The number of buffer banks per PE group for activations 
Tif  The number of input feature map channels in loop tiling 
Tix  The width of input feature map in loop tiling 
Tiy  The height of input feature map in loop tiling 
Tof  The number of output feature map channels in loop tiling 
4.3 Multidimensional optimization
We model the architectural optimization task as a multi-dimensional optimization problem. We choose performance under an area constraint as our optimization metric, where the DNN processing latency is estimated from the analytical model described above. The design variables defined in the architecture description file are the independent variables for multi-dimensional optimization. Table II shows some of these variables. The minimum number of MAC units is constrained by the required number of parallel MAC operations per cycle:
(9) 
The weight buffer size needs to be large enough to hold weight tiles. The maximum dynamic weight demand obtained from the computation graph analyzer sets the lower bound:
(10) 
(11) 
Similarly, the constraints on the activation buffer size are:
(12) 
(13) 
The two products in Eq. (12) correspond to input feature map tiling and output feature map tiling, respectively.
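These constraints can be sketched as a feasibility check. Since the bodies of Eqs. (9)-(13) are not reproduced in this excerpt, the expressions below are reconstructions from the surrounding text and the Table II notation, not the paper's exact formulas.

```python
# Feasibility check for an accelerator configuration (reconstructed from the
# constraint descriptions around Eqs. (9)-(13)).
def feasible(P, T, nkx, nky, n_macs, wt_buf, act_buf,
             peak_wt_demand, peak_act_demand):
    """P: unroll factors keyed 'of','ox','oy','if','kx','ky'; T: tile sizes
    keyed 'of','if','ix','iy','ox','oy'; nkx/nky: kernel window dims."""
    # Eq. (9): enough MAC units for the parallel MACs issued per cycle.
    par_macs = (P['of'] * P['ox'] * P['oy'] * P['if'] * P['kx'] * P['ky'])
    # Eqs. (10)-(11): weight buffer holds a weight tile and meets the peak
    # dynamic weight demand from the graph analyzer.
    weight_tile = T['of'] * T['if'] * nkx * nky
    # Eqs. (12)-(13): activation buffer holds an input tile plus an output
    # tile, and meets the peak dynamic activation demand.
    act_tile = T['if'] * T['ix'] * T['iy'] + T['of'] * T['ox'] * T['oy']
    return (n_macs >= par_macs and
            wt_buf >= max(weight_tile, peak_wt_demand) and
            act_buf >= max(act_tile, peak_act_demand))
```

Configurations failing this check are the ones plotted at 0 GOPS in the experiments of Section 5.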
The accelerator area is estimated under the assumption of unit area for each component, e.g., MAC, control logic, on-chip buffer, and register file. The total area is then scaled according to the architectural configuration.
We use a multi-step greedy method [22] to solve the multi-dimensional optimization problem and avoid getting stuck in local optima. The pseudocode of the multi-step greedy algorithm is presented in Algorithm 1, which tracks the performance of each candidate configuration, the difference in accelerator performance between iterations, and a convergence threshold on that difference. We start with a random initial accelerator configuration and its corresponding performance and area. First, several design variables are randomly selected and searched within their design spaces one by one: at each step, all the possible values of the selected design variable are iterated with all other variables fixed. The performance and area are estimated for each configuration. At the end of these steps, the configuration with the best performance under the area constraint is chosen as the new starting point. This process is then repeated with another set of randomly selected design variables until the performance improvement converges. The number of variables selected per step trades off optimality and complexity, as the search space increases exponentially with it.
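A minimal sketch of the multi-step greedy search follows. `perf`, `area`, and the variable domains are user-supplied placeholders (assumptions, not the framework's interface), and the update rule is slightly simplified: a better feasible configuration is adopted as soon as it is found within a sweep.

```python
import random

def multistep_greedy(domains, perf, area, area_limit, n_vars=2, eps=1e-3):
    """domains: variable -> list of candidate values; perf/area: config ->
    float; keeps the best feasible configuration under the area limit."""
    # Start from a random feasible configuration.
    config = {v: random.choice(vals) for v, vals in domains.items()}
    while area(config) > area_limit:
        config = {v: random.choice(vals) for v, vals in domains.items()}
    best = perf(config)
    while True:
        prev = best
        # Randomly select n_vars design variables and sweep each one fully
        # with all other variables fixed.
        for var in random.sample(list(domains), n_vars):
            for val in domains[var]:
                cand = dict(config, **{var: val})
                if area(cand) <= area_limit and perf(cand) > best:
                    best, config = perf(cand), cand
        if best - prev < eps:           # performance improvement converged
            return config, best
```

On a toy two-variable problem (maximize x + y subject to x * y within the area budget), the search settles on the best feasible corner regardless of the random starting point.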
5 Hardware-software co-design study
In this section, we study the relationship between the accelerator architecture and the characteristics of its target DNN applications based on the optimization framework. We first optimize accelerator performance under an area constraint on the target applications through accelerator design space exploration. We provide an analysis of the characteristics of the different optimized architectures. We then explore the optimization opportunities for the accelerator architecture when multiple diverse DNN applications run simultaneously. Finally, we study the relationships between DNN applications and the resulting optimized hardware accelerators.
5.1 Accelerator architecture design space exploration
We have selected seven representative DNNs: Inception-v3 (inception) [23], DeepLab-v3 (deeplab) [24], ResNet-v1-50 (resnet) [25], Faster R-CNN (fasterRCNN) [26], PTB (ptb) [27], Wide & Deep Learning (wdl) [28], and NASNet (nasnet) [29], and use the application-driven architectural optimization framework discussed in Section 4 to optimize the accelerator performance under an area constraint.
Inception-v3 is a convolutional neural network (CNN) that uses filters with multiple sizes in the same layer. Extra 1×1 convolutions are added before the 3×3 and 5×5 convolutions to reduce the number of input feature channels, and thus the computational complexity of the network. DeepLab-v3 is a CNN aimed at semantic image segmentation. It assigns labels to every pixel of the input image. It is constructed based on ResNet-101 [25] and employs atrous spatial pyramid pooling for object segmentation at multiple scales. ResNet-v1-50 is a CNN that uses an “identity shortcut connection” to solve the vanishing gradient problem [30] during training. Faster R-CNN uses two networks for real-time object detection: a region proposal network for object boundary predictions and another network to detect objects in the bounding boxes. PTB is a recurrent neural network that uses long short-term memory units for word prediction. Wide & Deep Learning is a model for recommender systems. It jointly trains a wide linear model and a DNN for memorization and generalization, respectively. NASNet is a network that is automatically generated by AutoML, which automates the design of machine learning models. It searches for the best layers on CIFAR-10 [31] and transfers the architectures to ImageNet [2] for object detection. We select the obtained architectural configurations with the top 10% performance (in GOPS) for each DNN application as candidates for optimized configuration selection. Their design configurations are normalized for each variable and plotted in Fig. 6(a)-(g). Their performance on the seven DNN applications is shown in Fig. 7(a)-(g). The highest performance on each DNN is achieved by the architectural configuration optimized for that application using the framework. A configuration with 0 GOPS in Fig. 7 means that the architecture violates the constraints mentioned in Section 4.3 for that specific application.
inception  deeplab  resnet  fasterRCNN  ptb  wdl  nasnet  
peak input memory demand  2.8MB  12.7MB  2.4MB  30.1MB  8.0MB  20.0KB  5.3MB 
peak weight memory demand  2.1MB  12.8MB  2.4MB  0.3MB  2.0MB  8.0KB  0.2MB 
#Conv2D layers  95  38  53  33  0  0  196 
#Depthwise separable convolutions  0  17  0  13  0  0  160 
#Matrix-matrix mul. layers  0  0  0  4  41  3  1 
We can see that Fig. 6(a) and Fig. 6(c) have similar shapes. This means that the optimized architectures for Inception-v3 resemble those for ResNet-v1-50. This is consistent with the performance plots in Fig. 7(a) and Fig. 7(c), respectively, where they both achieve the highest performance on the two networks. The reason for these similarities in architecture and resulting performance is that the two networks share similar characteristics, as shown in Table III. Inception-v3 and ResNet-v1-50 have similar peak input/weight memory demands, which means that the two networks require a similar on-chip buffer size for the same data processing batch. This is why Fig. 6(a) and Fig. 6(c) have the same values for bank height, bank width, #weight banks, and #activation banks. Besides, both networks mainly comprise 2D convolutional layers. Although the depths of the two networks are different, the distributions of the feature map size and the number of feature map channels are similar, as shown in Fig. 8 and Fig. 9, respectively. Architectures optimized for DeepLab-v3 and Faster R-CNN also show similarity in terms of their architectural configurations (Fig. 6(b) and Fig. 6(d)) and performance on the two networks (Fig. 7(b) and Fig. 7(d)). They both require relatively larger on-chip memory for inputs. Therefore, there are dense horizontal lines at the 0 GOPS level in Fig. 7(b) and Fig. 7(d), because these architectural configurations violate the on-chip memory constraints.
Among all candidate configurations, we select the one with the highest geometric mean of performance on the seven DNNs. It is compared to the architectural configurations with the best performance on each individual DNN, as shown in Table IV. The selected configuration outperforms the best configuration for each DNN by 12.4% to 92.0% in terms of geometric mean performance, as shown in Table V. The different characteristics of various DNNs may lead to significantly different configurations in the design space. Thus, the target applications should be considered in the early design stage to design efficient accelerators for a broad range of DNN applications.
Best on inception  Best on deeplab  Best on resnet  Best on fasterRCNN  Best on ptb  Best on wdl  Best on nasnet  Selected optimized result  
inception  1.00  0.55  0.76  0.44  0.33  0.38  0.52  0.55 
deeplab  0.46  1.00  0.49  0.72  0.23  0.27  0.42  0.99 
resnet  0.97  0.64  1.00  0.55  0.37  0.38  0.59  0.64 
fasterRCNN  0.41  0.42  0.54  1.00  0.32  0.48  0.48  0.99 
ptb  0.26  0.34  0.27  0.52  1.00  0.67  0.17  0.34 
wdl  0.55  0.58  0.40  0.47  0.46  1.00  0.58  0.58 
nasnet  0.77  0.83  0.50  0.49  0.14  0.15  1.00  0.83 
Geometric mean  0.57  0.59  0.53  0.58  0.34  0.41  0.48  0.66 
Over best inception  Over best deeplab  Over best resnet  Over best fasterRCNN  Over best ptb  Over best wdl  Over best nasnet 
15.9%  12.4%  25.6%  14.9%  92.0%  61.2%  36.9% 
5.2 Multi-context optimization
From Fig. 6, we observe that the optimized accelerator architectures diverge for different DNN applications. Hence, in this section, we explore if there exist new optimization opportunities when very different DNN applications run simultaneously on the same hardware accelerator.
First, we mix Inception-v3 and PTB by interleaving layers from both DNNs. Then, we use the framework to optimize the accelerator architecture on this mixed DNN in a multi-threaded manner. Fig. 10 shows the resulting architectural configurations with the top 10% performance on this multi-context application. This radar chart is quite different from those shown in Fig. 6(a) and Fig. 6(e) and is not a simple combination of those two radar charts. It has a smaller #macs compared to Fig. 6(a), and smaller loop tiling sizes relative to Fig. 6(e). As shown in Table III, Inception-v3 is compute-intensive: it is dominated by 2D convolutional layers, and thus requires a relatively larger #macs for efficient processing. On the other hand, PTB is memory-intensive: it consists of a large number of matrix-matrix multiplication layers with relatively high peak input/weight demand. Hence, large tiling sizes appear in its optimized architectural configurations. However, when these two DNN applications run simultaneously on the same accelerator, the required amount of compute and memory resources is lowered in the optimized configurations. The reasons for this are twofold. First, under an area constraint, the optimized architectural configurations for the multi-context application need to maintain a balance between compute and memory resources. Second, the complementary characteristics of Inception-v3 and PTB help relax both the compute and the memory design constraints on the accelerator architecture: while MACs are mainly devoted to the convolutional layers of Inception-v3, filter weights can be transferred between the weight buffer and external memory for the matrix-matrix multiplication layers of PTB at the same time, with no or very little performance loss, since the layers of both DNNs are interleaved.
This shows that the optimal design for multi-context applications may not be a simple combination of the designs optimized for each individual application, and new optimization opportunities can be explored using our application-driven architectural design space exploration framework.
5.3 Application sensitivity analysis
It is evident that the application-driven architectural optimization framework will generate similar architectural configurations for DNNs with common characteristics. However, to better understand the reasons for the different accelerator configuration results shown in Fig. 6, we perform an application sensitivity analysis to discover the hardware-software relationship in DNN applications.
We build the Faster R-CNN network in four steps. In the first step, we build a DNN with the same number of 2D convolutional layers as in Faster R-CNN, but with relatively larger feature map sizes. The next step is to make the convolution dimensions the same as those in Faster R-CNN. Depthwise separable convolutional layers and matrix-matrix multiplication layers are then added in the following two steps. In each step, we use the architectural optimization framework to generate the architectural configurations and select those with the top 10% performance.
Fig. 11(a)-(d) show the optimized architectural configurations obtained at each step. We can see that reducing the feature map sizes of the convolutional layers (from Fig. 11(a) to Fig. 11(b)) impacts the loop tiling design variables. Smaller tiling sizes are preferred for better performance. According to Eq. (3), reducing the feature map size while keeping the loop tiling variables unchanged may lower the efficiency of memory transactions. Thus, the values of the loop tiling variables are also reduced in Fig. 11(b). In the third step, 13 depthwise separable convolutional layers are inserted, one after every two Conv2D layers, with the same convolution dimensions as their following Conv2D layers. Comparing Fig. 11(b) and Fig. 11(c), we can see that merely adding depthwise separable convolution operations without changing the feature map size does not affect the optimized architectural configurations. In the fourth step, large matrix multiplication layers are added. The number of PE groups is increased to generate more parallel MACs for matrix multiplication processing. Besides, this also increases the values of the loop tiling variables again, since more computational parallelism can be exploited when processing larger data chunks. This is consistent with the architectural configurations optimized for PTB, as shown in Fig. 6(e), where the network only consists of large matrix multiplication layers. Fig. 6(a)-(c) and Fig. 6(g) have small design variables in the loop tiling dimensions, since there are no matrix multiplication layers or the matrix dimensions are small.
If the underlying hardware compute resources are fixed, we can perform a similar sensitivity analysis on the target network using this application-driven architectural optimization framework. The analysis results can then guide DNN model development to fit the underlying compute resources.
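Such a sensitivity sweep with the hardware held fixed might look like the following. The analytical model and the parameter values are hypothetical placeholders; the point is only the shape of the analysis, not its numbers.

```python
def latency_on_fixed_hw(fmap, channels, parallel_macs=512):
    """Toy analytical model: 3x3 convolution MAC count divided by a fixed
    hardware parallelism. Model form and values are illustrative assumptions."""
    macs = fmap * fmap * 9 * channels * channels
    return macs / parallel_macs

# Sweep one network dimension while the hardware stays fixed; the slope of
# the resulting curve tells the model designer which network dimension is
# cheapest to grow on this accelerator.
sweep = {c: latency_on_fixed_hw(fmap=56, channels=c) for c in (32, 64, 128)}
```

In this toy model latency grows quadratically with channel count, so a designer targeting this fixed accelerator would prefer spending their accuracy budget on dimensions with a gentler slope.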
6 Conclusion
In this article, we proposed an application-driven accelerator architectural optimization framework. The framework explores the accelerator design space and optimizes the architectural configuration for the target applications based on analytical models, using a multi-step greedy method to solve the multidimensional optimization problem. We used this framework to optimize the accelerator architectural configuration for seven selected DNN applications, and showed that the configuration optimized jointly for all seven DNNs achieves geometric-mean performance improvements ranging from 12.4% to 92.0% over the configurations optimized only for each individual DNN. In addition, we explored using the framework to optimize the accelerator architectural configuration when complementary DNN applications run simultaneously. Furthermore, the framework can guide DNN model development so that models run more efficiently on a fixed hardware accelerator.
References
 [1] Y. Ma, Y. Cao, S. Vrudhula, and J.-S. Seo, “Optimizing loop operation and dataflow in FPGA acceleration of deep convolutional neural networks,” in Proc. ACM/SIGDA Int. Symp. Field-Programmable Gate Arrays, 2017, pp. 45–54.
 [2] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei, “ImageNet Large Scale Visual Recognition Challenge,” Int. Journal Computer Vision, vol. 115, no. 3, pp. 211–252, Dec. 2015.
 [3] A. Krizhevsky, I. Sutskever, and G. Hinton, “ImageNet classification with deep convolutional neural networks,” in Proc. Advances Neural Information Processing Syst., 2012, pp. 1097–1105.
 [4] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” in Proc. Int. Conf. Learning Representations, 2015.
 [5] X. Xu, Y. Ding, S. X. Hu, M. Niemier, J. Cong, Y. Hu, and Y. Shi, “Scaling for edge inference of deep neural networks,” Nature Electronics, vol. 1, no. 4, p. 216, 2018.
 [6] N. Suda, V. Chandra, G. Dasika, A. Mohanty, Y. Ma, S. Vrudhula, J.-S. Seo, and Y. Cao, “Throughput-optimized OpenCL-based FPGA accelerator for large-scale convolutional neural networks,” in Proc. ACM/SIGDA Int. Symp. Field-Programmable Gate Arrays, 2016, pp. 16–25.
 [7] J. Qiu, J. Wang, S. Yao, K. Guo, B. Li, E. Zhou, J. Yu, T. Tang, N. Xu, S. Song, Y. Wang, and H. Yang, “Going deeper with embedded FPGA platform for convolutional neural network,” in Proc. ACM/SIGDA Int. Symp. Field-Programmable Gate Arrays, 2016, pp. 26–35.
 [8] S. Han, J. Kang, H. Mao, Y. Hu, X. Li, Y. Li, D. Xie, H. Luo, S. Yao, Y. Wang, H. Yang, and W. J. Dally, “ESE: Efficient speech recognition engine with sparse LSTM on FPGA,” in Proc. ACM/SIGDA Int. Symp. Field-Programmable Gate Arrays, 2017, pp. 75–84.
 [9] C. Zhang, D. Wu, J. Sun, G. Sun, G. Luo, and J. Cong, “Energy-efficient CNN implementation on a deeply pipelined FPGA cluster,” in Proc. Int. Symp. Low Power Electronics Design, 2016, pp. 326–331.

 [10] R. Zhao, W. Song, W. Zhang, T. Xing, J.-H. Lin, M. Srivastava, R. Gupta, and Z. Zhang, “Accelerating binarized convolutional neural networks with software-programmable FPGAs,” in Proc. ACM/SIGDA Int. Symp. Field-Programmable Gate Arrays, 2017, pp. 15–24.
 [11] S. Han, X. Liu, H. Mao, J. Pu, A. Pedram, M. A. Horowitz, and W. J. Dally, “EIE: Efficient inference engine on compressed deep neural network,” in Proc. Int. Symp. Computer Architecture, 2016, pp. 243–254.
 [12] S. Venkataramani, A. Ranjan, S. Banerjee, D. Das, S. Avancha, A. Jagannathan, A. Durg, D. Nagaraj, B. Kaul, P. Dubey, and A. Raghunathan, “ScaleDeep: A scalable compute architecture for learning and evaluating deep networks,” in Proc. Int. Symp. Computer Architecture, 2017, pp. 13–26.
 [13] Y. Chen, T. Luo, S. Liu, S. Zhang, L. He, J. Wang, L. Li, T. Chen, Z. Xu, N. Sun, and O. Temam, “DaDianNao: A machine-learning supercomputer,” in Proc. IEEE/ACM Int. Symp. Microarchitecture, Dec. 2014, pp. 609–622.

 [14] N. P. Jouppi, C. Young, N. Patil, D. Patterson, G. Agrawal, R. Bajwa, S. Bates, S. Bhatia, N. Boden, A. Borchers, R. Boyle, P.-L. Cantin, C. Chao, C. Clark, J. Coriell, M. Daley, M. Dau, J. Dean, B. Gelb, T. V. Ghaemmaghami, R. Gottipati, W. Gulland, R. Hagmann, C. R. Ho, D. Hogberg, J. Hu, R. Hundt, D. Hurt, J. Ibarz, A. Jaffey, A. Jaworski, A. Kaplan, H. Khaitan, D. Killebrew, A. Koch, N. Kumar, S. Lacy, J. Laudon, J. Law, D. Le, C. Leary, Z. Liu, K. Lucke, A. Lundin, G. MacKean, A. Maggiore, M. Mahony, K. Miller, R. Nagarajan, R. Narayanaswami, R. Ni, K. Nix, T. Norrie, M. Omernick, N. Penukonda, A. Phelps, J. Ross, M. Ross, A. Salek, E. Samadiani, C. Severn, G. Sizikov, M. Snelham, J. Souter, D. Steinberg, A. Swing, M. Tan, G. Thorson, B. Tian, H. Toma, E. Tuttle, V. Vasudevan, R. Walter, W. Wang, E. Wilcox, and D. H. Yoon, “In-datacenter performance analysis of a tensor processing unit,” in Proc. Int. Symp. Computer Architecture, 2017, pp. 1–12.
 [15] Y. Ma, N. Suda, Y. Cao, J. Seo, and S. Vrudhula, “Scalable and modularized RTL compilation of convolutional neural networks onto FPGA,” in Proc. Int. Conf. Field Programmable Logic Applications, Aug. 2016, pp. 1–8.
 [16] C. Zhang, P. Li, G. Sun, Y. Guan, B. Xiao, and J. Cong, “Optimizing FPGA-based accelerator design for deep convolutional neural networks,” in Proc. ACM/SIGDA Int. Symp. Field-Programmable Gate Arrays, 2015, pp. 161–170.
 [17] M. Motamedi, P. Gysel, V. Akella, and S. Ghiasi, “Design space exploration of FPGA-based deep convolutional neural networks,” in Proc. Asia South Pacific Design Automation Conf., Jan. 2016, pp. 575–580.
 [18] H. Li, X. Fan, L. Jiao, W. Cao, X. Zhou, and L. Wang, “A high performance FPGA-based accelerator for large-scale convolutional neural networks,” in Proc. Int. Conf. Field Programmable Logic Applications, Aug. 2016, pp. 1–9.
 [19] Y. Chen, J. Emer, and V. Sze, “Eyeriss: A spatial architecture for energy-efficient dataflow for convolutional neural networks,” in Proc. ACM/IEEE Int. Symp. Computer Architecture, June 2016, pp. 367–379.

 [20] F. Chollet, “Xception: Deep learning with depthwise separable convolutions,” in Proc. IEEE Conf. Computer Vision Pattern Recognition, July 2017, pp. 1800–1807.
 [21] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, M. Kudlur, J. Levenberg, R. Monga, S. Moore, D. G. Murray, B. Steiner, P. Tucker, V. Vasudevan, P. Warden, M. Wicke, Y. Yu, and X. Zheng, “TensorFlow: A system for large-scale machine learning,” in Proc. USENIX Conf. Operating Syst. Design Implementation, 2016, pp. 265–283.
 [22] P. Schuetz and A. Caflisch, “Efficient modularity optimization by multistep greedy algorithm and vertex mover refinement,” Physical Review E, vol. 77, no. 4, p. 046112, 2008.
 [23] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, “Rethinking the inception architecture for computer vision,” in Proc. IEEE Conf. Computer Vision Pattern Recognition, 2016, pp. 2818–2826.
 [24] L.-C. Chen, G. Papandreou, F. Schroff, and H. Adam, “Rethinking atrous convolution for semantic image segmentation,” arXiv preprint arXiv:1706.05587, 2017.
 [25] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proc. IEEE Conf. Computer Vision Pattern Recognition, 2016, pp. 770–778.
 [26] S. Ren, K. He, R. Girshick, and J. Sun, “Faster RCNN: Towards realtime object detection with region proposal networks,” in Proc. Advances Neural Information Processing Syst., 2015, pp. 91–99.
 [27] W. Zaremba, I. Sutskever, and O. Vinyals, “Recurrent neural network regularization,” arXiv preprint arXiv:1409.2329, 2014.
 [28] H.-T. Cheng, L. Koc, J. Harmsen, T. Shaked, T. Chandra, H. Aradhye, G. Anderson, G. Corrado, W. Chai, M. Ispir, R. Anil, Z. Haque, L. Hong, V. Jain, X. Liu, and H. Shah, “Wide & deep learning for recommender systems,” in Proc. Wkshp. Deep Learning Recommender Syst., 2016, pp. 7–10.
 [29] B. Zoph, V. Vasudevan, J. Shlens, and Q. V. Le, “Learning transferable architectures for scalable image recognition,” arXiv preprint arXiv:1707.07012, 2017.
 [30] S. Hochreiter, “The vanishing gradient problem during learning recurrent neural nets and problem solutions,” Int. Journal Uncertainty, Fuzziness Knowledge-Based Syst., vol. 6, no. 2, pp. 107–116, Apr. 1998.
 [31] A. Krizhevsky, “Learning multiple layers of features from tiny images,” Master’s thesis, University of Toronto, 2009.