Software-Defined Design Space Exploration for an Efficient AI Accelerator Architecture

03/18/2019 ∙ by Ye Yu, et al.

Deep neural networks (DNNs) have been shown to outperform conventional machine learning algorithms across a wide range of applications, e.g., image recognition, object detection, robotics, and natural language processing. However, the high computational complexity of DNNs often necessitates extremely fast and efficient hardware. The problem gets worse as the size of neural networks grows exponentially. As a result, customized hardware accelerators have been developed to accelerate DNN processing without sacrificing model accuracy. However, previous accelerator design studies have not fully considered the characteristics of the target applications, which may lead to sub-optimal architecture designs. On the other hand, new DNN models have been developed for better accuracy, but their compatibility with the underlying hardware accelerator is often overlooked. In this article, we propose an application-driven framework for architectural design space exploration of DNN accelerators. This framework is based on a hardware analytical model of individual DNN operations. It models the accelerator design task as a multi-dimensional optimization problem. We demonstrate that it can be efficaciously used in application-driven accelerator architecture design. Given a target DNN, the framework can generate efficient accelerator design solutions with optimized performance and area. Furthermore, we explore the opportunity to use the framework for accelerator configuration optimization under simultaneous diverse DNN applications. The framework is also capable of improving neural network models to best fit the underlying hardware resources.

1 Introduction

Fig. 1: Convolutional layer of a DNN [1]

Deep neural networks (DNNs) have recently become the fundamental inference vehicle for a broad range of artificial intelligence applications. Unlike conventional machine learning algorithms that rely on handcrafted features, DNNs extract features and learn hidden patterns automatically from the training data. However, the superior performance of DNNs comes at the cost of an immense amount of training data and massive computational complexity. The number of operations in the winning DNNs of the ImageNet Large Scale Visual Recognition Challenge [2] has increased exponentially over the past few years, e.g., from 1.4 GOPs for AlexNet [3] in 2012 to 38 GOPs for VGG-19 [4] in 2014. On the other hand, the slowdown of Moore's Law scaling makes it difficult to keep pace with DNN growth: the gap between the computational demand of DNNs and the computational capacity of the underlying hardware keeps increasing. Hence, specialized hardware architectures have been developed to process DNNs efficiently. DNNs map well to a graphics processing unit (GPU) due to its parallel architecture, massive number of computational units, and high memory bandwidth. However, GPU performance density, i.e., the computation capacity per unit area, has almost saturated since 2011 [5]; the only improvement since then has come from technology scaling from 28 nm to 20 nm [5]. It is infeasible to keep increasing chip area to accommodate larger DNNs without improving performance density.

Customized application-specific integrated circuit (ASIC)- and field-programmable gate array (FPGA)-based accelerators have also emerged for efficient DNN processing. These accelerators are designed to speed up common low-level DNN operations, such as convolution and matrix multiplication. Hence, even though new DNN models evolve rapidly and differ significantly in their network architectures, ASIC- and FPGA-based accelerators remain capable of processing various DNNs efficiently. FPGA-based accelerators [6, 7, 8, 9, 10] provide high parallelism and fast time-to-market. An embedded FPGA platform is used as a convolver with dynamic-precision data quantization in [7] to achieve high throughput. In [8], a load-balance-aware pruning method is implemented on an FPGA to compress a Long Short-Term Memory model by 20×. A dynamic programming algorithm is used to map DNNs to a deeply pipelined FPGA cluster in [9], which achieves up to 21× and 2× better energy efficiency relative to a central processing unit (CPU) and a GPU, respectively. Alternatively, ASIC accelerators [11, 12, 13, 14] demonstrate much better power efficiency for DNN processing than general-purpose processors. A model compression method is utilized in [11], where the inference engine processes the compressed network model efficiently by accelerating sparse matrix-vector multiplications. A custom multi-chip machine-learning architecture, called DaDianNao, is presented in [13]. With each chip implemented as a single-instruction multiple-data (SIMD)-like processor, DaDianNao achieves two orders of magnitude speedup over a GPU. In [12], an ASIC accelerator is proposed for DNN training on the server end; it uses heterogeneous processing tiles with a low-overhead synchronization mechanism. The Google Tensor Processing Unit (TPU) speeds up DNN inference with its multiply-accumulate (MAC) units arranged in a systolic array [14]; its second version adds support for DNN training.

To improve accelerator performance density, the computational resources should be fully utilized through an efficient dataflow. Since convolutional layers typically constitute over 90% of total DNN operations [3, 15], parallelizing convolution computations can accelerate overall DNN processing significantly. The design space for convolutional layer processing comprises the processing order of multiple nested loops and the data partitioning choices imposed by limited on-chip memory [1]. In order to process the convolutional layer efficiently, loop unrolling, loop tiling, and loop interchange are often used in recent DNN accelerators [16, 1]. However, previous accelerator designs do not fully consider the target applications in the early design stage. This may lead to the choice of a sub-optimal design from the DNN accelerator design space for the target applications, since the characteristics of different DNNs may vary significantly and require very different architectural designs for efficient processing.

In this article, we make the following contributions:
1) We develop an application-driven framework for architectural design space exploration of efficient DNN accelerators. This framework is based on a hardware analytical model for various DNN operations. The design space exploration task is modeled as a multi-dimensional optimization problem, which can be handled using a multi-step greedy method.
2) We use this framework in the early accelerator architecture design stage to achieve geometric mean performance improvements ranging from 12.4% to 92.0%.
3) We use this framework to explore optimization opportunities for simultaneously addressing diverse DNN applications.
4) We perform a sensitivity study of the relationship between DNN characteristics and the corresponding accelerator design configurations.

The rest of the article is organized as follows. Section 2 discusses background information on DNNs. Section 3 presents our hardware analytical model for DNN operations. Section 4 describes the application-driven framework for efficient accelerator design space exploration. Section 5 presents experimental results obtained by our framework on accelerator optimization and sensitivity study of DNN characteristics. Section 6 concludes the article.

2 Background

In this section, we present background material to help understand the rest of the article. We first describe the convolutional layer of a DNN in Section 2.1. We then explore two optimization strategies used for DNNs: computational parallelism and data reuse, in Section 2.2 and Section 2.3, respectively.

Fig. 2: Loop unrolling [1]: (a) within one kernel window, (b) across input feature map channels, (c) within one input feature map, (d) across output feature map channels, and (e) across inputs in a batch

2.1 Convolutional layer of a DNN

The convolutional layer is one of the main DNN building blocks. In the forward pass, it convolves a batch of 3D input feature maps with multiple learnable 3D kernels to generate a batch of 3D output feature maps. These learnable kernel weights are trained to detect which features, such as edges, are present in the input feature maps; they also capture spatial and temporal dependencies in the input feature maps. As shown in Fig. 1, the convolutional layer can be represented by five sets of convolution loops: looping through different kernels, looping within one input feature map channel, looping through multiple input feature map channels, looping within one kernel channel, and looping through different inputs in the batch. The execution order of these loops can be interchanged to optimize the dataflow. Nof, Nox, Noy, Nif, Nkx, Nky, batch_size, and S denote the number of output feature map channels, the width and height of the output feature map, the number of input feature map channels, the width and height of the kernel window, the input batch size, and the sliding stride, respectively. These parameters are determined by the architecture of the DNN. The on-chip memory of the accelerator may not be large enough to hold the entire data set of input feature maps, kernel weights, and output feature maps; they therefore need to be partitioned into smaller data chunks that fit in the on-chip memory. This is called loop tiling [16]. The T* parameters (e.g., Tif) denote the corresponding loop tiling design variables; they represent the size of the data chunks stored in on-chip memory. Computational parallelism in the convolutional layer comes from loop unrolling, which increases the acceleration throughput and the resource utilization ratio. The P* parameters denote the number of parallel computations in each dimension. Since the computing resources can only process data stored in on-chip memory, and the tiled data set is a subset of the total data set, the design constraints are P* ≤ T* ≤ N*; for instance, Pof ≤ Tof ≤ Nof. The computational parallelism enabled by loop unrolling is discussed next.
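To make these parameters concrete, the short sketch below checks the P* ≤ T* ≤ N* constraint and derives the number of tiles and the number of MACs kept busy per cycle; the dictionary layout and the numeric values are illustrative assumptions, not taken from the paper.

```python
# Illustrative sketch (not the authors' code) of how the N*, T*, and P*
# parameters of one convolutional layer relate under P* <= T* <= N*.

from math import ceil

# Layer dimensions (N*), tile sizes (T*), and unrolling factors (P*);
# the kernel window (kx, ky) is typically not tiled, so T = N there.
N = {"if": 256, "ix": 28, "iy": 28, "kx": 3, "ky": 3, "of": 512}
T = {"if": 64,  "ix": 14, "iy": 14, "kx": 3, "ky": 3, "of": 128}
P = {"if": 16,  "ix": 7,  "iy": 7,  "kx": 1, "ky": 1, "of": 16}

for d in N:
    # Parallel units only operate on on-chip data, and a tile is a subset of
    # the full layer, hence the constraint P* <= T* <= N* in every dimension.
    assert P[d] <= T[d] <= N[d], f"constraint violated on dimension {d}"

tiles_per_dim = {d: ceil(N[d] / T[d]) for d in N}   # data chunks fetched per dim
parallel_macs = 1
for d in P:                                         # MACs kept busy per cycle
    parallel_macs *= P[d]

print("tiles per dimension:", tiles_per_dim)
print("parallel MAC operations per cycle:", parallel_macs)
```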

2.2 Computational parallelism

Fig. 2(a)-(e) depict how the five types of loop unrolling work in the convolutional layer. Fig. 2(a) depicts loop unrolling within one kernel window. In each cycle, Pkx × Pky input pixels and kernel weights in the kernel window are read from on-chip memory to perform Pkx × Pky parallel multiplications, followed by additions that sum the products. The result is then added to the previously obtained partial sum. However, as the kernel sizes (Nkx and Nky) are usually relatively small, stand-alone loop unrolling within one kernel window cannot provide enough parallelism to fully utilize the accelerator compute resources [17]. Fig. 2(b) depicts loop unrolling across multiple input feature map channels. In each cycle, pixels at the same position in different channels are multiplied with the corresponding kernel weights across channels to produce Pif products; these products are then summed and accumulated with the partial sum. Fig. 2(c) shows loop unrolling within one input feature map channel. Pox × Poy input pixels are multiplied with the same kernel weight in parallel, and the products are accumulated to the corresponding partial sums in parallel. The input feature map sizes (Nix and Niy) are typically large enough to provide sufficient parallelism for the accelerator as long as the required data are stored in on-chip memory. Fig. 2(d) describes loop unrolling across different kernels. Multiple kernel weights at the same location in different kernels are multiplied with the same input pixels, and the products are accumulated to the corresponding partial sums of different output feature map channels. Fig. 2(e) presents loop unrolling across inputs in the batch. Multiple inputs can be processed in parallel since there is no data dependency among them. These loop unrolling types can be combined to further increase the parallelism of convolutional layer processing. For example, loop unrolling within the kernel window, across multiple input feature map channels, and across different kernels are employed together in [7, 18, 17], while loop unrolling within one kernel window and within one input feature map channel is utilized in [19].

2.3 Data reuse

A convolutional layer is processed by sliding the kernel windows along the 3D input feature maps where MAC operations are performed at each sliding step. Since memory access and data movement incur significant delay and energy overheads [19], data fetched from on-chip memory should be reused as much as possible before being discarded.

If the loops within an input feature map (Fig. 2(c)) are unrolled, each kernel weight is broadcast to multiply with Pox × Poy different input pixels in every cycle. Thus, it is reused Pox × Poy times. If multiple inputs in the batch are processed in parallel (Fig. 2(e)), the number of times the kernel weight is reused is equal to the batch size. If both types of loop unrolling are employed, the total number of times each kernel weight is reused is:

Reuse_wt = Pox × Poy × batch_size    (1)

If the loops across output feature map channels (Fig. 2(d)) are unrolled, each input pixel is multiplied with multiple kernel weights from different kernels in parallel; hence, each input pixel is reused Pof times. Moreover, if the loops within a kernel window (Fig. 2(a)) and within an input feature map (Fig. 2(c)) are unrolled together, the pixels in neighboring kernel windows partially overlap as long as the sliding stride is smaller than the kernel window size. Since the overlapped pixels can be reused in the following cycle, the average number of times each input pixel is reused is Pkx × Pky × Pox × Poy / (((Pox − 1)S + Pkx)((Poy − 1)S + Pky)). Combining the three types of loop unrolling mentioned above, the total number of times each input pixel is reused is [1]:

Reuse_px = Pof × Pkx × Pky × Pox × Poy / (((Pox − 1)S + Pkx)((Poy − 1)S + Pky))    (2)
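The two reuse counts follow directly from the unrolling factors. The sketch below implements Eqs. (1) and (2) as reconstructed above; the function names and the example values are hypothetical.

```python
# Sketch of the data-reuse counts in Eqs. (1) and (2); names and values are
# illustrative, not the authors' code.

def weight_reuse(Pox, Poy, batch_size):
    """Times each kernel weight is reused when the loops within one input
    feature map and across the batch are both unrolled (Eq. 1)."""
    return Pox * Poy * batch_size

def pixel_reuse(Pof, Pox, Poy, Pkx, Pky, S):
    """Average times each input pixel is reused when the loops across output
    channels, within a kernel window, and within a feature map are unrolled
    together (Eq. 2); the denominator accounts for the overlap of
    neighboring kernel windows."""
    loaded_x = (Pox - 1) * S + Pkx   # distinct input columns actually fetched
    loaded_y = (Poy - 1) * S + Pky   # distinct input rows actually fetched
    return Pof * (Pkx * Pky * Pox * Poy) / (loaded_x * loaded_y)

# Example: 3x3 kernel, 14x14 output tile, 32 parallel output channels, stride 1.
print(weight_reuse(Pox=14, Poy=14, batch_size=4))              # 784
print(pixel_reuse(Pof=32, Pox=14, Poy=14, Pkx=3, Pky=3, S=1))  # 220.5
```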

3 Analysis of DNN operation processing

In this section, we provide an analysis of hardware accelerator processing of DNN operations. We extend the analytical model for 2D convolution discussed in [1] with batch processing enabled. Based on this approach, we build analytical models for various compute-intensive DNN operations.

The number of MAC operations in the convolutional layer is Nmac = Nif × Nkx × Nky × Nof × Nox × Noy. Ideally, assuming 100% MAC utilization, the total number of cycles required is Nmac/Pmac, where Pmac is the total number of MAC units. However, the available MAC units may not be fully utilized due to loop unrolling and loop tiling. In [1], the compute latency of the convolutional layer is modeled as the product of the inter-tiling cycle count and the inner-tiling latency:

Cycles_inter = ⌈Nif/Tif⌉ × ⌈Nix/Tix⌉ × ⌈Niy/Tiy⌉ × ⌈Nof/Tof⌉    (3)
Latency_inner = ⌈Nkx × Nky / (Pkx × Pky)⌉ × ⌈Tif/Pif⌉ × ⌈Tox/Pox⌉ × ⌈Toy/Poy⌉ × ⌈Tof/Pof⌉    (4)

Here, the inter-tiling cycle count is the number of data chunks (tiles) used in loop tiling, the inner-tiling latency is the number of cycles required to process each chunk, and Tox and Toy are the output tile dimensions implied by Tix, Tiy, the kernel size, and the stride. Memory transfer latency is modeled as the maximum of the input memory cycles and the weight memory cycles:

Cycles_mem_in = Cycles_inter × Latency_mem_in    (5)
Latency_mem_in = ⌈Tix × Tiy × Tif / BW_in⌉    (6)
Cycles_mem_wt = Cycles_inter × Latency_mem_wt    (7)
Latency_mem_wt = ⌈Nkx × Nky × Tif × Tof / BW_wt⌉    (8)

where BW_in and BW_wt denote the memory bandwidths (in words per cycle) available for activations and weights, respectively. Under the assumption that memory bandwidth is not a bottleneck and the multipliers can receive input pixels and kernel weights continuously without idle cycles, the total processing latency of the convolutional layer is the maximum of the compute and memory transfer latencies.
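A compact sketch of this latency model, under the same assumptions and with the bandwidth terms BW_in and BW_wt treated as given constants (the function and dictionary-key names are ours, not the authors'):

```python
# Sketch of the convolutional-layer latency model in Eqs. (3)-(8) as
# reconstructed above; names are assumptions, not the authors' code.

from math import ceil

def compute_latency(N, T, P):
    """N: layer dimensions, T: tile sizes (including the derived output tile
    dimensions T['ox'], T['oy']), P: loop unrolling factors."""
    inter = (ceil(N["if"] / T["if"]) * ceil(N["ix"] / T["ix"]) *
             ceil(N["iy"] / T["iy"]) * ceil(N["of"] / T["of"]))              # Eq. (3)
    inner = (ceil(N["kx"] * N["ky"] / (P["kx"] * P["ky"])) *
             ceil(T["if"] / P["if"]) * ceil(T["ox"] / P["ox"]) *
             ceil(T["oy"] / P["oy"]) * ceil(T["of"] / P["of"]))              # Eq. (4)
    return inter * inner

def memory_latency(N, T, bw_in, bw_wt):
    """bw_in, bw_wt: assumed bandwidths in words per cycle for activations
    and weights."""
    inter = (ceil(N["if"] / T["if"]) * ceil(N["ix"] / T["ix"]) *
             ceil(N["iy"] / T["iy"]) * ceil(N["of"] / T["of"]))
    cycles_in = inter * ceil(T["ix"] * T["iy"] * T["if"] / bw_in)            # Eqs. (5), (6)
    cycles_wt = inter * ceil(N["kx"] * N["ky"] * T["if"] * T["of"] / bw_wt)  # Eqs. (7), (8)
    return max(cycles_in, cycles_wt)

def total_latency(N, T, P, bw_in, bw_wt):
    # Total latency = max of compute latency and memory transfer latency.
    return max(compute_latency(N, T, P), memory_latency(N, T, bw_in, bw_wt))
```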

To relax the memory bandwidth assumption made above and increase performance estimation accuracy, an optional finer-grained buffer simulator has been developed to monitor on-chip data. The entire convolutional layer is divided into multiple computational blocks that can be executed in parallel. In addition to the execution latency of each computational block, the memory transfer latency is included whenever the data required by the block are not already stored in the buffer. The buffer simulator models data fetching from and storing back to off-chip memory. The number of computational blocks trades off estimation speed against accuracy.
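A minimal sketch of such a buffer simulator, assuming a least-recently-used on-chip buffer and fixed per-block compute and per-chunk transfer costs (all simplifying assumptions of this sketch, not details given in the paper):

```python
# Minimal sketch of the optional finer-grained buffer simulator; the block
# granularity, eviction policy, and latency bookkeeping are assumptions.

from collections import OrderedDict

def simulate_layer(blocks, buffer_capacity, compute_cycles, transfer_cycles):
    """blocks: iterable of (block_id, data_ids); data_ids are the data chunks
    each computational block needs on chip before it can execute."""
    resident = OrderedDict()    # data chunks currently held in the on-chip buffer
    total = 0
    for _block_id, data_ids in blocks:
        missing = [d for d in data_ids if d not in resident]
        total += len(missing) * transfer_cycles   # fetch only what is not buffered
        for d in data_ids:
            resident[d] = True
            resident.move_to_end(d)               # mark as most recently used
        while len(resident) > buffer_capacity:    # evict least recently used chunk
            resident.popitem(last=False)
        total += compute_cycles                   # execution latency of the block
    return total
```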

Depthwise separable convolution [20] is a variant of 2D convolution. It splits an ordinary 2D convolution into two parts: a 2D convolution within each channel (depthwise convolution) and a mixing of the channels using a set of 1×1 convolutions across channels (channel mixing). Compared to ordinary convolution, it has fewer parameters; it therefore requires less computation and is less prone to overfitting. As shown in Table I, the first part, depthwise convolution, fits the 2D convolution model discussed above with the number of filter kernels set to 1. The second part, channel mixing, fits the 2D convolution model with a 1×1 kernel size.

Fig. 3: Model transformations of matrix-vector and matrix-matrix multiplications

Another important layer in a DNN is the fully-connected layer, which is processed as a matrix-vector multiplication. We embed matrix-vector multiplication into 2D convolution to fit it into the analytical model described above, as shown in Fig. 3(a). The width, height, and depth of the input feature map are set to the row number of the matrix, 1, and the column number of the matrix, respectively. The vector is transformed into a 1×1 kernel with a depth equal to the matrix column number. Similarly, matrix-matrix multiplication is embedded into 2D convolution, as shown in Fig. 3(b): the second matrix is transformed into col_2 1×1 kernels, where col_2 is the column number of the second matrix. Details of the design parameter values used to fit depthwise separable convolution (depthwise convolution and channel mixing), matrix-vector multiplication, and matrix-matrix multiplication into the 2D convolution cost model are shown in Table I. In matrix-vector multiplication, col and row denote the matrix column and row numbers, respectively. In matrix-matrix multiplication, col_1, row_1, and col_2 denote the column and row numbers of the first matrix and the column number of the second matrix, respectively.

We validated these DNN operation analytical models against an internal FPGA implementation of three DNNs and found the timing errors to be within 10%.

Operation | Nif | Nix | Niy | Nkx | Nky | Nof | Nox | Noy | S
2D conv. | Nif | Nix | Niy | Nkx | Nky | Nof | Nox | Noy | S
Depthwise conv. | Nif | Nix | Niy | Nkx | Nky | 1 | Nox | Noy | S
Channel mixing | Nif | Nix | Niy | 1 | 1 | Nof | Nox | Noy | S
Matrix-vector mul. | col | row | 1 | 1 | 1 | 1 | row | 1 | 1
Matrix-matrix mul. | col_1 | row_1 | 1 | 1 | 1 | col_2 | row_1 | 1 | 1
TABLE I: Operation parameter values used in the cost model
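The mappings in Table I can be written down directly as parameter dictionaries for the 2D convolution cost model; the helper functions below are illustrative sketches, not part of the framework's API.

```python
# Sketch of the Table I mappings onto the 2D-convolution cost model
# (parameter names follow Table I; function names are hypothetical).

def depthwise_conv_params(Nif, Nix, Niy, Nkx, Nky, Nox, Noy, S):
    # Depthwise convolution: one filter kernel per channel, so Nof = 1.
    return dict(Nif=Nif, Nix=Nix, Niy=Niy, Nkx=Nkx, Nky=Nky,
                Nof=1, Nox=Nox, Noy=Noy, S=S)

def channel_mixing_params(Nif, Nix, Niy, Nof, Nox, Noy, S):
    # Channel mixing: a 1x1 convolution across channels.
    return dict(Nif=Nif, Nix=Nix, Niy=Niy, Nkx=1, Nky=1,
                Nof=Nof, Nox=Nox, Noy=Noy, S=S)

def matvec_params(row, col):
    # Matrix-vector multiplication embedded as a 1x1 convolution (Fig. 3(a)).
    return dict(Nif=col, Nix=row, Niy=1, Nkx=1, Nky=1,
                Nof=1, Nox=row, Noy=1, S=1)

def matmat_params(row_1, col_1, col_2):
    # Matrix-matrix multiplication: col_2 kernels of size 1x1 (Fig. 3(b)).
    return dict(Nif=col_1, Nix=row_1, Niy=1, Nkx=1, Nky=1,
                Nof=col_2, Nox=row_1, Noy=1, S=1)
```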

4 Application-driven architectural optimization

In this section, we discuss the proposed application-driven architectural optimization framework that is based on the analytical operation models.

4.1 Architectural optimization flow

Fig. 4 shows the accelerator architectural optimization flow. An architecture description file defines the design variables of the hardware accelerator, for example, the organization of the compute resources and the allocation of on-chip memory for activations and weights. The other input is the DNN computation graph of the target application for which the accelerator is optimized. We obtain this computation graph by parsing the frozen model file exported from TensorFlow [21]. It is a directed acyclic graph (DAG) in which each vertex represents a DNN operation and each edge defines a data dependency. The computation graph is first analyzed by a graph analyzer to generate a DNN operation stream; we focus only on the time-consuming operations. The total delay of the operation stream is estimated using the analytical model discussed in Section 3. Accelerator performance on the target application is then optimized with a multidimensional optimizer to obtain an optimized architectural configuration.

Fig. 4: Architectural optimization flow

4.2 Computation graph analyzer

The DNN DAG is analyzed by traversing backward from the end node using depth-first search. The operation stream is obtained such that an operation can only be appended to the stream if it has no parent node or all of its parent nodes are already processed and are in the stream.

Fig. 5(a)-(d) show an example of a DAG and dynamic memory allocation analysis of the intermediate results. White nodes represent unprocessed operations that can only be processed if they have no parent nodes or all of their parent nodes have been processed. Then they become blue nodes whose outputs are stored in on-chip memory. Their incoming edges are then removed since data dependency no longer exists after they are processed. A blue node turns to grey if it has no more outgoing edges, which means no more nodes depend on it. Hence, the memory space for its outputs can be deallocated. Dynamic memory allocation is monitored throughout DAG traversal and the maximum dynamic memory demand sets the lower bound for on-chip buffer size of the accelerator.
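A sketch of this analysis, assuming the graph is given as a parent-list dictionary with per-node output sizes (the data format and function name are assumptions, not the authors' implementation):

```python
# Sketch of the computation-graph analyzer: a post-order, depth-first
# traversal backward from the end node yields an operation stream in which
# every node appears only after all of its parents, while a running
# allocation count tracks the peak on-chip memory needed for intermediates.

def analyze_graph(parents, output_bytes, end_node):
    """parents[n]   -> list of nodes whose outputs node n consumes
                       (every node is a key; leaf nodes map to an empty list)
       output_bytes -> output size of each node
       Returns (operation stream, peak dynamic memory demand)."""
    remaining = {n: 0 for n in parents}          # outstanding consumers per node
    for n in parents:
        for p in parents[n]:
            remaining[p] += 1

    stream, visited = [], set()

    def visit(n):                                # depth-first, backward from end
        if n in visited:
            return
        visited.add(n)
        for p in parents[n]:
            visit(p)
        stream.append(n)                         # only after all parents are in

    visit(end_node)

    current = peak = 0
    for n in stream:
        current += output_bytes[n]               # "blue": output held on chip
        peak = max(peak, current)
        for p in parents[n]:
            remaining[p] -= 1
            if remaining[p] == 0:                # "grey": no more dependents,
                current -= output_bytes[p]       # so its output is deallocated
    return stream, peak
```

The returned peak value is the lower bound on the accelerator's on-chip buffer size described above.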

Fig. 5: Illustration of dynamic memory allocation analysis
Variable name Definition
loop_order The execution order of the convolutional loops
PE_group The total number of processing-element (PE) groups
MAC/group The number of MACs in each PE group
buffer_bank_height The height of the buffer bank
buffer_bank_width The width of the buffer bank
weight_bank/group The number of buffer banks per PE group for weights
activation_bank/group The number of buffer banks per PE group for activations
Tif The number of input feature map channels in loop tiling
Tix The width of input feature map in loop tiling
Tiy The height of input feature map in loop tiling
Tof The number of output feature map channels in loop tiling
TABLE II: Accelerator architecture design variables

4.3 Multidimensional optimization

We model the architectural optimization task as a multidimensional optimization problem. We choose performance under an area constraint as our optimization metric, where the DNN processing latency is estimated from the analytical model described above. The design variables defined in the architecture description file are the independent variables for multidimensional optimization. Table II shows some of these variables. The minimum number of MAC units is constrained by the required number of parallel MAC operations per cycle:

PE_group × MAC/group ≥ Pkx × Pky × Pif × Pox × Poy × Pof    (9)

The weight buffer size needs to be large enough to hold weight tiles. The maximum dynamic weight demand obtained from the computation graph analyzer sets the lower bound:

PE_group × weight_bank/group × buffer_bank_height × buffer_bank_width ≥ Nkx × Nky × Tif × Tof    (10)
PE_group × weight_bank/group × buffer_bank_height × buffer_bank_width ≥ Weight_demand_max    (11)

Similarly, the constraints on the activation buffer size are:

PE_group × activation_bank/group × buffer_bank_height × buffer_bank_width ≥ Tix × Tiy × Tif + Tox × Toy × Tof    (12)
PE_group × activation_bank/group × buffer_bank_height × buffer_bank_width ≥ Activation_demand_max    (13)

The Tix × Tiy × Tif and Tox × Toy × Tof products in Eq. (12) correspond to input feature map tiling and output feature map tiling, respectively.

The accelerator area is estimated under the assumption of unit area for each component, e.g., MAC, control logic, on-chip buffer, and register file. The total area is then scaled according to the architectural configuration.
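The constraints in Eqs. (9)-(13) and the unit-area estimate can be combined into a simple validity check used during the search. The sketch below follows the Table II variable names; the dictionary keys and the per-component unit areas are assumed inputs, not values given in the paper.

```python
# Sketch of the design-constraint checks (Eqs. (9)-(13)) and the unit-area
# estimate; key names follow Table II, unit areas are assumptions.

def is_valid(cfg, P, T, N, max_wt_demand, max_act_demand):
    """cfg: Table II design variables; P/T/N: unrolling, tiling, and layer
    parameters; the demand terms come from the computation-graph analyzer."""
    macs = cfg["PE_group"] * cfg["MAC_per_group"]
    bank_words = cfg["buffer_bank_height"] * cfg["buffer_bank_width"]
    wt_words = cfg["PE_group"] * cfg["weight_bank_per_group"] * bank_words
    act_words = cfg["PE_group"] * cfg["activation_bank_per_group"] * bank_words

    parallel_macs = (P["kx"] * P["ky"] * P["if"] *
                     P["ox"] * P["oy"] * P["of"])                         # Eq. (9)
    wt_tile = N["kx"] * N["ky"] * T["if"] * T["of"]                       # Eq. (10)
    act_tile = T["ix"] * T["iy"] * T["if"] + T["ox"] * T["oy"] * T["of"]  # Eq. (12)

    return (macs >= parallel_macs and
            wt_words >= max(wt_tile, max_wt_demand) and                   # Eqs. (10), (11)
            act_words >= max(act_tile, max_act_demand))                   # Eqs. (12), (13)

def estimate_area(cfg, unit_area):
    """Scale the assumed per-component unit areas by the configuration."""
    bank_words = cfg["buffer_bank_height"] * cfg["buffer_bank_width"]
    n_banks = cfg["PE_group"] * (cfg["weight_bank_per_group"] +
                                 cfg["activation_bank_per_group"])
    return (cfg["PE_group"] * cfg["MAC_per_group"] * unit_area["mac"] +
            n_banks * bank_words * unit_area["buffer_word"] +
            cfg["PE_group"] * unit_area["control"])
```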

We use a multi-step greedy method [22] to solve the multidimensional optimization problem and avoid getting stuck in local optima. The pseudocode is presented in Algorithm 1, where Perf(C), ΔPerf, and θ denote the performance of configuration C, the difference in accelerator performance between iterations, and the convergence threshold on performance, respectively. We start with a random initial accelerator configuration and its corresponding performance and area. First, k design variables are randomly selected and searched one by one: at each step, all possible values of the selected design variable are iterated with all other variables fixed, and the performance and area are estimated for each resulting configuration. At the end of the k steps, the configuration with the best performance under the area constraint is chosen as the new starting point. This process is then repeated with another k randomly chosen design variables until the performance improvement converges. The step parameter k trades off optimality against complexity, as the search space grows exponentially with k.

1:Start with a random initial valid accelerator configuration C
2:do
3:     Pool ← {C}
4:     Randomly pick k design variables (v1, …, vk)
5:     for i ← 1 to k do
6:         for all C′ in Pool do
7:              for all possible values of vi do
8:                   C″ ← C′ with vi set to that value
9:                   Add C″ to Pool if it satisfies the area constraint
10:      C* ← the configuration C′ in Pool with the highest Perf(C′), where C* satisfies the area constraint
11:     ΔPerf ← Perf(C*) − Perf(C)
12:     C ← C*
13:while ΔPerf > θ
Algorithm 1 Multi-step greedy method
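A runnable sketch of Algorithm 1, assuming each design variable has a finite list of allowed values and that evaluate() and area_ok() wrap the analytical performance model and the area estimate described above (all of these names are ours, not the framework's):

```python
# Sketch of the multi-step greedy search in Algorithm 1; the configuration
# encoding, evaluate(), area_ok(), and the threshold eps are assumptions.

import random

def multi_step_greedy(domains, evaluate, area_ok, k=2, eps=1e-3):
    """domains: {design variable: list of allowed values};
    evaluate(cfg) -> performance estimate; area_ok(cfg) -> area constraint met."""
    cfg = {v: random.choice(vals) for v, vals in domains.items()}
    while not area_ok(cfg):                      # line 1: random initial valid config
        cfg = {v: random.choice(vals) for v, vals in domains.items()}
    best_perf = evaluate(cfg)

    while True:
        picked = random.sample(sorted(domains), k)        # line 4: pick k variables
        pool = [dict(cfg)]
        for v in picked:                                  # lines 5-9: expand the pool
            pool = [{**c, v: val} for c in pool for val in domains[v]]
        feasible = [c for c in pool if area_ok(c)]        # keep area-feasible configs
        if not feasible:
            continue                                      # retry with other variables
        new_cfg = max(feasible, key=evaluate)             # line 10: greedy selection
        new_perf = evaluate(new_cfg)
        if new_perf - best_perf <= eps:                   # lines 11, 13: converged
            return cfg, best_perf
        cfg, best_perf = new_cfg, new_perf                # line 12: new starting point
```

Because the pool keeps the current configuration and every combination of values for the k picked variables, its size grows exponentially with k, which is the optimality-versus-complexity trade-off noted above.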
Fig. 6: Radar charts of accelerator configurations with top 10% performance

5 Hardware-software co-design study

In this section, we study the relationship between the accelerator architecture and the characteristics of its target DNN applications based on the optimization framework. We first optimize accelerator performance under an area constraint on the target applications through accelerator design space exploration. We provide an analysis of the characteristics of the different optimized architectures. We then explore the optimization opportunities for the accelerator architecture when multiple diverse DNN applications run simultaneously. Finally, we study the relationships between DNN applications and the resulting optimized hardware accelerators.

Fig. 7: Performance on selected DNN applications of accelerator configurations with top 10% performance

5.1 Accelerator architecture design space exploration

We have selected seven representative DNNs: Inception-v3 (inception) [23], DeepLabv3 (deeplab) [24], ResNet-v1-50 (resnet) [25], Faster R-CNN (fasterRCNN) [26], PTB (ptb) [27], Wide & Deep Learning (wdl) [28], and NASNet (nasnet) [29], and use the application-driven architectural optimization framework discussed in Section 4 to optimize the accelerator performance under an area constraint.

Inception-v3 is a convolutional neural network (CNN) that uses filters of multiple sizes in the same layer. Extra 1×1 convolutions are added before the 3×3 and 5×5 convolutions to reduce the number of input feature channels, and thus the computational complexity of the network. DeepLabv3 is a CNN aimed at semantic image segmentation: it assigns a label to every pixel of the input image. It is built on ResNet-101 [25] and employs atrous spatial pyramid pooling for object segmentation at multiple scales. ResNet-v1-50 is a CNN that uses "identity shortcut connections" to address the vanishing gradient problem [30] during training. Faster R-CNN uses two networks for real-time object detection: a region proposal network for object boundary prediction and a second network to detect objects in the proposed bounding boxes. PTB is a recurrent neural network that uses long short-term memory units for word prediction. Wide & Deep Learning is a model for recommender systems that jointly trains a wide linear model and a DNN for memorization and generalization, respectively. NASNet is a network generated automatically by AutoML, which automates the design of machine learning models; it searches for the best layers on CIFAR-10 [31] and transfers the resulting architectures to ImageNet [2].

We select the obtained architectural configurations with top 10% performance (in GOPS) for each DNN application as candidates for optimized configuration selection. Their design configurations are normalized for each variable and plotted in Fig. 6(a)-(g). Their performance on the seven DNN applications is shown in Fig. 7(a)-(g). The highest performance on each DNN is achieved by the architectural configuration optimized for that application using the framework. A configuration with 0 GOPS in Fig. 7 means that the architecture violates the constraints mentioned in Section 4.3 for that specific application.

inception deeplab resnet fasterRCNN ptb wdl nasnet
peak input memory demand 2.8MB 12.7MB 2.4MB 30.1MB 8.0MB 20.0KB 5.3MB
peak weight memory demand 2.1MB 12.8MB 2.4MB 0.3MB 2.0MB 8.0KB 0.2MB
#Conv2D layers 95 38 53 33 0 0 196
#Depthwise separable convolutions 0 17 0 13 0 0 160
#Matrix-matrix mul. layers 0 0 0 4 41 3 1
TABLE III: Summary of the selected DNNs

We can see that Fig. 6(a) and Fig. 6(c) have similar shapes. This means that the optimized architectures for Inception-v3 resemble those for ResNet-v1-50. This is consistent with the performance plots in Fig. 7(a) and Fig. 7(c), respectively, where they both achieve the highest performance on the two networks. The reason for this architecture and resulting performance similarities is that the two networks share similar characteristics, as shown in Table III. Inception-v3 and ResNet-v1-50 have similar peak input/weight memory demands, which means that the two networks require similar on-chip buffer size for the same data processing batch. This is why Fig. 6(a) and Fig. 6(c) have the same values for bank height, bank width, #weight banks, and #activation banks. Besides, both networks mainly comprise 2D convolutional layers. Although the depths of the two networks are different, the distributions of the feature map size and the number of feature map channels are similar, as shown in Fig. 8 and Fig. 9, respectively. Architectures optimized for DeepLabv3 and Faster R-CNN also show similarity in terms of their architectural configurations (Fig. 6(b) and Fig. 6(d)) and performance on the two networks (Fig. 7(b) and Fig. 7(d)). They both require relatively larger on-chip memory for inputs. Therefore, there are dense horizontal lines at 0 GOPS level in Fig. 7(b) and Fig. 7(d) because these architectural configurations violate on-chip memory constraints.

Among all candidate configurations, we select the one with the highest geometric mean of performance on the seven DNNs. It is compared to the architectural configurations with the best performance on each individual DNN, as shown in Table IV. The selected configuration outperforms the best configuration for each DNN by 12.4% to 92.0% in terms of geometric mean performance, as shown in Table V. The different characteristics of various DNNs may lead to significantly different configurations in the design space. Thus, the target applications should be considered in the early design stage to design efficient accelerators for a broad range of DNN applications.

Fig. 8: Feature map size comparisons of Inception-v3 and ResNet-v1-50
Fig. 9: Feature map channel comparisons of Inception-v3 and ResNet-v1-50
Best on inception Best on deeplab Best on resnet Best on fasterRCNN Best on ptb Best on wdl Best on nasnet Selected optimized result
inception 1.00 0.55 0.76 0.44 0.33 0.38 0.52 0.55
deeplab 0.46 1.00 0.49 0.72 0.23 0.27 0.42 0.99
resnet 0.97 0.64 1.00 0.55 0.37 0.38 0.59 0.64
fasterRCNN 0.41 0.42 0.54 1.00 0.32 0.48 0.48 0.99
ptb 0.26 0.34 0.27 0.52 1.00 0.67 0.17 0.34
wdl 0.55 0.58 0.40 0.47 0.46 1.00 0.58 0.58
nasnet 0.77 0.83 0.50 0.49 0.14 0.15 1.00 0.83
Geometric mean 0.57 0.59 0.53 0.58 0.34 0.41 0.48 0.66
TABLE IV: Performance (GOPS) comparisons of the selected optimized configuration and the best configuration for each DNN. Performance on each DNN normalized to that of the best configuration
Over best inception Over best deeplab Over best resnet Over best fasterRCNN Over best ptb Over best wdl Over best nasnet
15.9% 12.4% 25.6% 14.9% 92.0% 61.2% 36.9%
TABLE V: Average performance improvements of the selected result over the best configuration for each DNN

5.2 Multi-context optimization

From Fig. 6, we observe that the optimized accelerator architectures diverge for different DNN applications. Hence, in this section, we explore if there exist new optimization opportunities when very different DNN applications run simultaneously on the same hardware accelerator.

Fig. 10: Radar chart of accelerator configurations for a multi-context application

First, we mix Inception-v3 and PTB by interleaving layers from both DNNs. Then, we use the framework to optimize the accelerator architecture on this mixed DNN in a multithreaded manner. Fig. 10 shows the resulting architectural configurations with top 10% performance on this multi-context application. This radar chart is quite different from those shown in Fig. 6(a) and Fig. 6(e) and is not a simple combination of those two radar charts. It has a smaller #macs compared to Fig. 6(a), and smaller loop tiling sizes relative to Fig. 6(e). As shown in Table III, Inception-v3 is compute-intensive: it is dominated by 2D convolutional layers and thus requires a relatively large #macs for efficient processing. On the other hand, PTB is memory-intensive: it consists of a large number of matrix-matrix multiplication layers with relatively high peak input/weight demand. Hence, large tiling sizes appear in its optimized architectural configurations. However, when these two DNN applications run simultaneously on the same accelerator, the required amounts of compute and memory resources are lowered in the optimized configurations. The reasons for this are two-fold. First, under an area constraint, the optimized architectural configurations for the multi-context application need to maintain a balance between compute and memory resources. Second, the complementary characteristics of Inception-v3 and PTB help relax both the compute and the memory design constraints on the accelerator architecture: while the MACs are mainly devoted to the convolutional layers of Inception-v3, filter weights can simultaneously be transferred between the weight buffer and external memory for the matrix-matrix multiplication layers of PTB, with little or no performance loss, since the layers of both DNNs are interleaved. This shows that the optimal design for multi-context applications may not be a simple combination of designs optimized for each individual application, and new optimization opportunities can be explored using our application-driven architectural design space exploration framework.

5.3 Application sensitivity analysis

It is evident that the application-driven architectural optimization framework will generate similar architectural configurations for DNNs with common characteristics. However, to better understand the reasons for the different accelerator configuration results shown in Fig. 6, we perform an application sensitivity analysis to discover the hardware-software relationship in DNN applications.

We build the Faster R-CNN network in four steps. In the first step, we build a DNN with the same number of 2D convolutional layers as that in Faster R-CNN, but with relatively larger feature map sizes. The next step is to make the convolution dimensions the same as those in the Faster R-CNN. Depthwise separable convolutional layers and matrix-matrix multiplication layers are then added in the following two steps. In each step, we use the architectural optimization framework to generate the architectural configurations and select those with top 10% performance.

Fig. 11: Radar charts of accelerator configurations at each step

Fig. 11(a)-(d) show the optimized architectural configurations obtained at each step. We can see that reducing the feature map sizes of the convolutional layers (from Fig. 11(a) to Fig. 11(b)) impacts the loop tiling design variables. Smaller tiling sizes are preferred for better performance. According to Eq. (3), reducing the feature map size while keeping loop tiling variables unchanged may lower the efficiency of memory transactions. Thus, the value of loop tiling variables is also reduced in Fig. 11(b). In the third step, 13 depthwise separable convolutional layers are inserted every two Conv2D layers with the same convolution dimensions as their following Conv2D layers. Comparing Fig. 11(b) and Fig. 11(c), we can see that just adding depthwise separable convolution operations without changing the feature map size does not affect the optimized architectural configurations. In the fourth step, large matrix multiplication layers are added. The number of PE groups is increased to generate more parallel MACs for matrix multiplication processing. Besides, this also increases the value of loop tiling variables again since more computational parallelism can be exploited when processing larger data chunks. This is consistent with the architectural configurations optimized for PTB, as shown in Fig. 6(e), where the network only consists of large matrix multiplication layers. Fig. 6(a)-(c) and Fig. 6(g) have small design variables in loop tiling dimensions since there are no matrix multiplication layers or the matrix dimension is small.

If the underlying hardware compute resource is fixed, we can perform a similar sensitivity analysis on the target network using this application-driven architectural optimization framework. The analysis results can guide DNN model development to fit the underlying compute resource.

6 Conclusion

In this article, we proposed an application-driven accelerator architectural optimization framework. This framework explores the accelerator design space and optimizes the architectural configuration for the target applications based on the analytical models. We use a multi-step greedy method to solve the multi-dimensional optimization problem. We use this framework to optimize the accelerator architectural configuration for seven selected DNN applications. We show that the architectural configuration optimized for all the seven DNNs can achieve geometric mean performance improvements ranging from 12.4% to 92.0% over the configurations optimized only for each individual DNN. In addition, we explore the opportunity to use the framework for accelerator architectural configuration optimization when complementary DNN applications run simultaneously. Furthermore, the framework can be used to guide DNN model development for running on the fixed hardware accelerator more efficiently.

References

  • [1] Y. Ma, Y. Cao, S. Vrudhula, and J.-S. Seo, “Optimizing loop operation and dataflow in FPGA acceleration of deep convolutional neural networks,” in Proc. ACM/SIGDA Int. Symp. Field-Programmable Gate Arrays, 2017, pp. 45–54.
  • [2] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei, “ImageNet Large Scale Visual Recognition Challenge,” Int. Journal Computer Vision, vol. 115, no. 3, pp. 211–252, Dec. 2015.
  • [3] A. Krizhevsky, I. Sutskever, and G. Hinton, “Imagenet classification with deep convolutional neural networks,” in Proc. Advances Neural Information Processing Syst., 2012, pp. 1097–1105.
  • [4] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” in Proc. Int. Conf. Learning Representations, 2015.
  • [5] X. Xu, Y. Ding, S. X. Hu, M. Niemier, J. Cong, Y. Hu, and Y. Shi, “Scaling for edge inference of deep neural networks,” Nature Electronics, vol. 1, no. 4, p. 216, 2018.
  • [6] N. Suda, V. Chandra, G. Dasika, A. Mohanty, Y. Ma, S. Vrudhula, J.-S. Seo, and Y. Cao, “Throughput-optimized OpenCL-based FPGA accelerator for large-scale convolutional neural networks,” in Proc. ACM/SIGDA Int. Symp. Field-Programmable Gate Arrays, 2016, pp. 16–25.
  • [7] J. Qiu, J. Wang, S. Yao, K. Guo, B. Li, E. Zhou, J. Yu, T. Tang, N. Xu, S. Song, Y. Wang, and H. Yang, “Going deeper with embedded FPGA platform for convolutional neural network,” in Proc. ACM/SIGDA Int. Symp. Field-Programmable Gate Arrays, 2016, pp. 26–35.
  • [8] S. Han, J. Kang, H. Mao, Y. Hu, X. Li, Y. Li, D. Xie, H. Luo, S. Yao, Y. Wang, H. Yang, and W. J. Dally, “ESE: Efficient speech recognition engine with sparse LSTM on FPGA,” in Proc. ACM/SIGDA Int. Symp. Field-Programmable Gate Arrays, 2017, pp. 75–84.
  • [9] C. Zhang, D. Wu, J. Sun, G. Sun, G. Luo, and J. Cong, “Energy-efficient CNN implementation on a deeply pipelined FPGA cluster,” in Proc. Int. Symp. Low Power Electronics Design, 2016, pp. 326–331.
  • [10] R. Zhao, W. Song, W. Zhang, T. Xing, J.-H. Lin, M. Srivastava, R. Gupta, and Z. Zhang, “Accelerating binarized convolutional neural networks with software-programmable FPGAs,” in Proc. ACM/SIGDA Int. Symp. Field-Programmable Gate Arrays, 2017, pp. 15–24.
  • [11] S. Han, X. Liu, H. Mao, J. Pu, A. Pedram, M. A. Horowitz, and W. J. Dally, “EIE: Efficient inference engine on compressed deep neural network,” in Proc. Int. Symp. Computer Architecture, 2016, pp. 243–254.
  • [12] S. Venkataramani, A. Ranjan, S. Banerjee, D. Das, S. Avancha, A. Jagannathan, A. Durg, D. Nagaraj, B. Kaul, P. Dubey, and A. Raghunathan, “Scaledeep: A scalable compute architecture for learning and evaluating deep networks,” in Proc. Int. Symp. Computer Architecture, 2017, pp. 13–26.
  • [13] Y. Chen, T. Luo, S. Liu, S. Zhang, L. He, J. Wang, L. Li, T. Chen, Z. Xu, N. Sun, and O. Temam, “DaDianNao: A machine-learning supercomputer,” in Proc. IEEE/ACM Int. Symp. Microarchitecture, Dec. 2014, pp. 609–622.
  • [14] N. P. Jouppi, C. Young, N. Patil, D. Patterson, G. Agrawal, R. Bajwa, S. Bates, S. Bhatia, N. Boden, A. Borchers, R. Boyle, P.-l. Cantin, C. Chao, C. Clark, J. Coriell, M. Daley, M. Dau, J. Dean, B. Gelb, T. V. Ghaemmaghami, R. Gottipati, W. Gulland, R. Hagmann, C. R. Ho, D. Hogberg, J. Hu, R. Hundt, D. Hurt, J. Ibarz, A. Jaffey, A. Jaworski, A. Kaplan, H. Khaitan, D. Killebrew, A. Koch, N. Kumar, S. Lacy, J. Laudon, J. Law, D. Le, C. Leary, Z. Liu, K. Lucke, A. Lundin, G. MacKean, A. Maggiore, M. Mahony, K. Miller, R. Nagarajan, R. Narayanaswami, R. Ni, K. Nix, T. Norrie, M. Omernick, N. Penukonda, A. Phelps, J. Ross, M. Ross, A. Salek, E. Samadiani, C. Severn, G. Sizikov, M. Snelham, J. Souter, D. Steinberg, A. Swing, M. Tan, G. Thorson, B. Tian, H. Toma, E. Tuttle, V. Vasudevan, R. Walter, W. Wang, E. Wilcox, and D. H. Yoon, “In-datacenter performance analysis of a tensor processing unit,” in Proc. Int. Symp. Computer Architecture, 2017, pp. 1–12.
  • [15] Y. Ma, N. Suda, Y. Cao, J. Seo, and S. Vrudhula, “Scalable and modularized RTL compilation of convolutional neural networks onto FPGA,” in Proc. Int. Conf. Field Programmable Logic Applications, Aug. 2016, pp. 1–8.
  • [16] C. Zhang, P. Li, G. Sun, Y. Guan, B. Xiao, and J. Cong, “Optimizing FPGA-based accelerator design for deep convolutional neural networks,” in Proc. ACM/SIGDA Int. Symp. Field-Programmable Gate Arrays, 2015, pp. 161–170.
  • [17] M. Motamedi, P. Gysel, V. Akella, and S. Ghiasi, “Design space exploration of FPGA-based deep convolutional neural networks,” in Proc. Asia South Pacific Design Automation Conf., Jan. 2016, pp. 575–580.
  • [18] H. Li, X. Fan, L. Jiao, W. Cao, X. Zhou, and L. Wang, “A high performance FPGA-based accelerator for large-scale convolutional neural networks,” in Proc. Int. Conf. Field Programmable Logic Applications, Aug. 2016, pp. 1–9.
  • [19] Y. Chen, J. Emer, and V. Sze, “Eyeriss: A spatial architecture for energy-efficient dataflow for convolutional neural networks,” in Proc. ACM/IEEE Int. Symp. Computer Architecture, June 2016, pp. 367–379.
  • [20] F. Chollet, “Xception: Deep learning with depthwise separable convolutions,” in Proc. IEEE Conf. Computer Vision Pattern Recognition, July 2017, pp. 1800–1807.
  • [21] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, M. Kudlur, J. Levenberg, R. Monga, S. Moore, D. G. Murray, B. Steiner, P. Tucker, V. Vasudevan, P. Warden, M. Wicke, Y. Yu, and X. Zheng, “Tensorflow: A system for large-scale machine learning,” in Proc. USENIX Conf. Operating Syst. Design Implementation, 2016, pp. 265–283.
  • [22] P. Schuetz and A. Caflisch, “Efficient modularity optimization by multistep greedy algorithm and vertex mover refinement,” Physical Review E, vol. 77, no. 4, p. 046112, 2008.
  • [23] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, “Rethinking the inception architecture for computer vision,” in Proc. IEEE Conf. Computer Vision Pattern Recognition, 2016, pp. 2818–2826.
  • [24] L.-C. Chen, G. Papandreou, F. Schroff, and H. Adam, “Rethinking atrous convolution for semantic image segmentation,” arXiv preprint arXiv:1706.05587, 2017.
  • [25] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proc. IEEE Conf. Computer Vision Pattern Recognition, 2016, pp. 770–778.
  • [26] S. Ren, K. He, R. Girshick, and J. Sun, “Faster R-CNN: Towards real-time object detection with region proposal networks,” in Proc. Advances Neural Information Processing Syst., 2015, pp. 91–99.
  • [27] W. Zaremba, I. Sutskever, and O. Vinyals, “Recurrent neural network regularization,” arXiv preprint arXiv:1409.2329, 2014.
  • [28] H.-T. Cheng, L. Koc, J. Harmsen, T. Shaked, T. Chandra, H. Aradhye, G. Anderson, G. Corrado, W. Chai, M. Ispir, R. Anil, Z. Haque, L. Hong, V. Jain, X. Liu, and H. Shah, “Wide & deep learning for recommender systems,” in Proc. Wkshp. Deep Learning Recommender Syst., 2016, pp. 7–10.
  • [29] B. Zoph, V. Vasudevan, J. Shlens, and Q. V. Le, “Learning transferable architectures for scalable image recognition,” arXiv preprint arXiv:1707.07012, vol. 2, no. 6, 2017.
  • [30] S. Hochreiter, “The vanishing gradient problem during learning recurrent neural nets and problem solutions,” Int. Journal Uncertainty, Fuzziness Knowledge-Based Syst., vol. 6, no. 2, pp. 107–116, Apr. 1998.
  • [31] A. Krizhevsky, “Learning multiple layers of features from tiny images,” Master’s thesis, University of Toronto, 2009.