Packing Sparse Convolutional Neural Networks for Efficient Systolic Array Implementations: Column Combining Under Joint Optimization

11/07/2018
by   H. T. Kung, et al.

This paper describes a novel approach of packing sparse convolutional neural networks for their efficient systolic array implementations. By combining subsets of columns in the original filter matrix associated with a convolutional layer, we increase the utilization efficiency of the systolic array substantially (e.g., 4x) due to the increased density of nonzeros in the resulting packed filter matrix. In combining columns, for each row, all filter weights but the one with the largest magnitude are pruned. We retrain the remaining weights to preserve high accuracy. We demonstrate that, in mitigating data privacy concerns, the retraining can be accomplished with only fractions of the original dataset (e.g., 10% for CIFAR-10). We study the effectiveness of this joint optimization for both high utilization and classification accuracy with ASIC and FPGA designs based on efficient bit-serial implementations of multiplier-accumulators. We present analysis and empirical evidence on the superior performance of our column combining approach against prior art under metrics such as energy efficiency (3x) and inference latency (12x).


1 Introduction

Many recent hardware-based state-of-the-art deep learning accelerators use systolic arrays for efficient implementations of convolutional neural networks (CNNs). They leverage properties of systolic arrays such as parallel processing under the dataflow architecture, regular layout of processing elements, efficient inter-processor communication, and minimized I/O by being able to reuse the same data fetched from the memory many times [32, 31, 49]. These systems, such as the Google TPU [27], the ShiDianNao accelerators [17], and numerous other efforts, including [40, 10, 61, 68, 63], have achieved low power consumption and high throughput.

In recent years, there have also been significant algorithmic advances for CNNs which have enabled orders of magnitude reduction in both the number of network parameters and the amount of computation for inference compared to prior well studied networks such as VGG-16 [55]. One of the more important model reduction techniques is weight pruning [22], which has shown that the majority of weights in a trained CNN can be pruned (set to 0) without significantly impacting the accuracy of the network.

However, it can be challenging to efficiently utilize the regular structure of systolic arrays given that these nonzero CNN weights are distributed in an irregular manner. In traditional approaches, zero weights still occupy systolic cells in the systolic array.

In this paper we propose a novel approach, called column combining, which can pack sparse convolutional networks for their efficient systolic array implementations. In combining columns to increase the percentage of nonzeros in a packed CNN, within a group of combined columns, we prune all weights on conflicting rows but the one with the largest magnitude. We then bring up the classification accuracy of the pruned network via retraining. We iteratively perform these column-combining and network-retraining steps to improve both utilization efficiency of the systolic array and the classification accuracy of the network until a target model size is reached.

Thus our proposed column combining approach leverages a joint optimization opportunity present in CNNs. That is, for a CNN, we can optimize its topology to fit the structure of the underlying computing hardware, such as systolic arrays, while preserving most of its classification accuracy via network retraining.

The main contributions of the paper are summarized as follows:

  • Column combining algorithm (Section 3) for packing sparse CNNs with unstructured sparsity for their efficient systolic array implementations. The method can retrain remaining filter weights after column-combine pruning using only fractions of the original training dataset, mitigating data privacy concerns (Section 6). To ease data routing, a row permuting scheme is described (Section 3.5) for a systolic array to output contiguous data items for those columns to be combined together in the systolic array of the next layer.

  • Joint optimization methodology (Algorithm 1 in Section 3) aiming at achieving two objectives simultaneously—high utilization efficiency of the systolic array and high classification accuracy of the CNN. The methodology leverages opportunities presented in CNNs in training for both efficiency and accuracy simultaneously.

  • Bit-serial systolic arrays (Section 4.2) to allow efficient multiplexing of multiple data streams into a single column in the array in support of column combining. In addition, bit-serial implementations provide flexibility in supporting accumulations at various precisions for the multiplier-accumulators (MACs) of systolic cells. In this work, we assume 8-bit weights and input data with bit-serial implementations of 32-bit accumulations, except in Section 7.1.2 where we use 16-bit accumulations. Our approach extends naturally to other precisions.

  • Cross-layer pipelining (Section 3.6) for CNN inference over a series of systolic arrays, one for each layer. This dramatically reduces the inference latency per input sample (e.g., an input image) and is especially important for realtime applications where it is common to process samples one at a time.

  • ASIC and FPGA designs to validate performance gains of our approaches (Section 7) in energy efficiency, area efficiency, and latency.

2 Background and Related Work

In this section, we first provide a brief review of the basic principle of using systolic arrays for the implementations of CNNs and introduce terminologies that we will use throughout. Then, we review related ASIC and FPGA accelerators for CNN inference, advances in CNN design, weight pruning, and input and weight quantization, all of which have led to large reductions in both model size and computation cost for training and inference.

2.1 Systolic arrays for Convolutional Layers

It is well known that the bulk of the computation of a convolutional layer for a CNN can be viewed as a matrix-matrix multiplication. Suppose that a convolutional layer has N filters operating on a data volume of depth M, as depicted in Figure 4(a). Then, the result of the convolution computation is the matrix product of the filter matrix and the data matrix, as depicted in Figure 4(b).
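
To make this equivalence concrete, here is a minimal sketch (ours, not from the paper) checking that a pointwise (1x1) convolution with N filters over an M-channel feature map is exactly the product of the N x M filter matrix and the M x (H*W) data matrix; the sizes below are illustrative.

```python
import numpy as np

# Minimal sketch: a pointwise (1x1) convolution as a matrix-matrix multiplication.
N, M, H, W = 4, 3, 5, 5                   # illustrative sizes (not from the paper)
filters = np.random.randn(N, M)           # filter matrix: one row per filter
fmap = np.random.randn(M, H, W)           # input data volume of depth M

data_matrix = fmap.reshape(M, H * W)      # one column per spatial position
out_matrix = filters @ data_matrix        # the convolution result as a matrix product
out_fmap = out_matrix.reshape(N, H, W)    # back to an N-channel output volume

# Cross-check against an explicit per-position dot product.
ref = np.einsum('nm,mhw->nhw', filters, fmap)
assert np.allclose(out_fmap, ref)
```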

(a) Computation of a convolutional layer
(b) Equivalent matrix-matrix multiplication
(c) Weight-stationary Systolic Array
Figure 4:

(a) Computation of a convolutional layer, (b) viewed as a matrix multiplication, and (c) deployed in a weight-stationary systolic array, with skewed input data and output results.

Figure 4(c) depicts a systolic array design for this matrix multiplication. It is a weight-stationary systolic array in the sense that filter weights stored in the array will not move during computation, whereas input data continuously move bottom-to-top and result data accumulate left-to-right. For systolic array synchronization, items in the data and result matrices are properly skewed, as shown in the figure. We assume this weight-stationary systolic array design throughout the paper.

2.2 ASIC and FPGA Accelerators for CNNs

Over the past several years, there has been extensive work on constructing efficient ASIC and FPGA designs for CNNs which generally consider well studied networks such as LeNet-5 [34], AlexNet [30], and VGG-16 [55] including [52, 45, 67, 66, 70, 69, 43, 44, 46, 53, 54, 56, 7, 42]. One of the main considerations for such systems is minimizing the number of off-chip DRAM accesses for fetching the CNN weights, input samples and intermediate layer results, as these incur significant energy consumption [22]. Therefore, a main focus of accelerator design is mapping CNN computations in such a way that input and weights are fetched only once for all usages within a layer [9, 11, 17, 39]. Another orthogonal direction is designing memory systems that are more suitable to the regular structure of CNN inference computation [48, 13, 62, 60]. In Section 7.1, we show our design achieves state-of-the-art performance in terms of energy efficiency.

FPGAs allow for faster development time and therefore are often used to explore various new research areas for CNNs, such as low-precision and binary networks [14, 59], novel training regimes [15], and model compression through weight pruning or novel CNN structures [21, 16]. In Section 7, we validate the performance of our filter matrix packing algorithm with an FPGA implementation. Additionally, we compare our implementation to previous state-of-the-art FPGA results [57, 43, 16, 70].

2.3 CNNs with Simplified Filter Structures

Figure 5 compares standard CNNs to two recent CNN variants: separable convolution [12, 25] and shift convolution [65]. Separable convolution decouples a standard convolution layer into two smaller convolution layers (depthwise convolution and pointwise convolution) in order to reduce both model size and amount of computation. Each pointwise filter has only a single weight for each channel, and therefore does not utilize neighboring pixels in the spatial dimensions (width and height). Shift convolution replaces the depthwise convolution layer with a shift operation that does not require any learned weights. In Section 4, we leverage shift convolution to construct a network that consists only of pointwise layers, as it greatly simplifies the structure of computation in each layer.

Figure 5: Standard, separable, and shift convolution.

2.4 Weight Pruning During Training

Weight pruning methods aim to reduce the number of weights in a trained CNN by removing (pruning) unimportant weights. These pruning techniques have shown that many well studied networks such as AlexNet and VGG-16 have a large number of weights (up to 90%) that can be pruned without any impact on classification accuracy [64, 41, 19, 26, 24, 38].

In Section 3, we propose an iterative pruning procedure, similar to CondenseNet [26], but at the finest granularity of individual weights. This iterative pruning method gradually removes the smallest magnitude weights during training. This leads to sparse models (as low as 10% nonzero in each convolution layer) which still achieve similar performance to the baseline dense models. Additionally, as outlined in Section 3, we prune weights in such a way as to improve the utilization efficiency of the CNN when deployed in the systolic array design for sparse CNNs described in Section 4.

2.5 Input and Weight Quantization

Quantization is another direction in accelerating inference computations. In this work, we take a simple linear fixed-point quantization scheme [35]. We quantize both the inputs and weights to an 8-bit fixed-point representation from the 32-bit floating-point representation [36, 20] used during training. This quantization has been shown to lead to minimal performance degradation even on challenging datasets [35]. Within a layer, the accumulation is done with 32-bit integers, which adds complexity to the bit-serial systolic array design and is discussed in Section 4.2.
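
As an illustration, the sketch below implements a simple symmetric linear fixed-point quantizer; the exact scheme of [35], including how the scale is chosen, may differ from this assumption.

```python
import numpy as np

# Illustrative symmetric linear quantizer (an assumption, not the exact scheme of [35]).
def quantize_int8(x, scale):
    """Map float values to 8-bit integers: q = round(x / scale), clipped to [-128, 127]."""
    return np.clip(np.round(x / scale), -128, 127).astype(np.int8)

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.randn(16).astype(np.float32)
scale = np.abs(w).max() / 127.0                      # one possible choice of scale
w_q = quantize_int8(w, scale)
print(np.abs(dequantize(w_q, scale) - w).max())      # quantization error is bounded by scale/2
```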

3 Column Combining

As discussed in Section 2.4, training a CNN with weight pruning leads to small but highly sparse models with unstructured nonzero weights, which is not directly amenable to efficient implementation in systolic arrays traditionally designed for dense matrix-matrix multiplication. In this section, we propose a column combining algorithm, which is an iterative training procedure that jointly optimizes the CNN for both classification accuracy and utilization efficiency when deployed in the proposed systolic array described in Section 4.

3.1 Terminologies and definitions

Suppose that we are given the filter matrix of weights associated with a convolutional layer of the CNN (see Figure 4(a)). The columns of this filter matrix which have nonzero weights on a row are said to be conflicting on the row, and the row is said to be a conflicting row for these columns. By column combining, we mean combining a given group of columns into a single combined column. In a combined column, for the columns which conflict on a row, all nonzero weights on the row are pruned except for the one with the largest magnitude. We refer to this pruning process as column-combine pruning.

We say a group of columns has a certain number of conflicts if that total number of weights will be pruned when combining the columns in the group. We say that a group of columns meets the limited-conflict condition for a given value if the group has at most that many conflicts per row on average. The value can be less than 1; for example, a value of 0.5 means that for every two rows at most one weight is pruned on average.
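
The helper below (our own code, not from the paper; `beta` is just our name for the per-row conflict budget) counts how many weights would be pruned per row if a candidate group of columns were combined, and checks the limited-conflict condition.

```python
import numpy as np

# Sketch: count conflicts for a candidate column group and check the
# limited-conflict condition (helper names and structure are our own).
def conflicts_per_row(filter_matrix, group):
    """Average number of weights pruned per row when combining the columns in `group`."""
    cols = filter_matrix[:, group]                        # rows x len(group)
    nonzeros_per_row = (cols != 0).sum(axis=1)
    pruned_per_row = np.maximum(nonzeros_per_row - 1, 0)  # one weight per row survives
    return pruned_per_row.mean()

def meets_limited_conflict(filter_matrix, group, beta):
    return conflicts_per_row(filter_matrix, group) <= beta

F = np.array([[2, 0, 0],
              [0, -3, 1],
              [0, 0, 4]])
print(conflicts_per_row(F, [0, 1, 2]))                    # 1/3: only the middle row conflicts
print(meets_limited_conflict(F, [0, 1, 2], beta=0.5))     # True
```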

3.2 Column Combining Overview

Given a sparse filter matrix, we first partition it into column groups, and then for each group we combine its columns to form a single combined column by applying column-combine pruning. We aim at achieving two objectives simultaneously. First, we pack the given sparse filter matrix into a dense matrix, called a packed filter matrix, with as few combined columns as possible to allow efficient systolic array implementations. Second, we minimize the impact of column-combine pruning on classification accuracy.

For high-density packing, we adopt a dense-column-first combining policy that favors selections of combining columns which result in high-density combined columns, where the density of a column is the percentage of nonzeros in the column. For high classification accuracy, we then retrain the remaining weights after column-combine pruning. The algorithm involves three parameters: the maximum number of combined columns per group, the initial pruning percentage, and the average number of conflicts allowed per row for each group. Typical values are 8 for the maximum number of combined columns per group and 0.5 for the average number of conflicts allowed per row.

Figure 6 depicts a column combining example. In (a), a filter matrix associated with a sparse convolutional layer is divided along columns into three groups (blue, green, and red). The zero-valued weights due to previous pruning steps are omitted for illustration clarity. The objective of column grouping is to select columns that, when combined, achieve high packing efficiency (i.e., are mostly nonzeros). As we show in Section 5, a high packing efficiency translates to a high utilization efficiency, as more MACs will perform useful computation by storing nonzero weights. A small number of conflicting elements are allowed between the columns in a group. For instance, in the blue group, -3 in column 1 conflicts with -7 in column 3 and -8 in column 5. The conflicting -3 and -7 weights are pruned and -8 is kept as it has the largest magnitude. In (b), each group is combined into a single column in order to be loaded into a column in the proposed systolic array (as discussed in Section 4).

Figure 6: Example of combining columns.
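
The sketch below (our own code; the matrix values are made up and merely echo the blue-group conflict described above) combines one column group into a single packed column, keeping only the largest-magnitude weight per row and recording which original column, i.e., input channel, survives on each row.

```python
import numpy as np

# Sketch: combine one column group, keeping the largest-magnitude weight per row.
def combine_group(filter_matrix, group):
    cols = filter_matrix[:, group]                        # rows x len(group)
    keep = np.abs(cols).argmax(axis=1)                    # index of the surviving weight per row
    combined = cols[np.arange(cols.shape[0]), keep]       # values of the packed column
    channels = np.asarray(group)[keep]                    # input channel each row will read
    combined[(cols != 0).sum(axis=1) == 0] = 0            # rows with no weight at all stay zero
    return combined, channels

F = np.array([[-3., 0., -7., 0., -8.],
              [ 0., 5.,  0., 0.,  0.],
              [ 2., 0.,  0., 6.,  0.]])
col, chan = combine_group(F, [0, 2, 4])   # combine columns 1, 3 and 5 (0-indexed here)
print(col)                                # row 0 keeps -8: the largest-magnitude conflict
print(chan)                               # and reads it from the fifth original column
```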

3.3 Column Combining Algorithm

The column combining scheme, outlined in Section 3.2, joins columns in a sparse filter matrix that do not have significant conflicts. Algorithm 1, which calls Algorithm 2 and Algorithm 3, is the top level algorithm used to train sparse CNNs that can be implemented with systolic arrays of high utilization efficiency. The training process works in an iterative fashion, where at each iteration the model is pruned and packed so that it fits more efficiently in the systolic array. In Section 5, we provide analysis on the effect that each parameter of Algorithm 1 has on both classification accuracy and utilization efficiency.

Input: a CNN with convolution layers;
the maximum number of combined columns per group;
the initial pruning factor (e.g., a percentage of weights);
the number of conflicts (i.e., pruned weights) allowed on average per row for each group;
the target number of nonzero weights after column combining, used as the stopping criterion
Output: a pruned version of the CNN with combined columns for each of the sparse convolution layers;
the column groups for each of the layers
1 Prune and retrain until the target number of nonzero weights is reached
2 while the number of nonzero weights exceeds the target do
3       for each convolution layer do
4             Step 1: Perform initial pruning by removing the smallest magnitude weights up to the initial pruning percentage
5             Step 2: Form groups by combining columns (group-columns, Algorithm 2)
6             Step 3: Prune conflicts within the groups (group-prune, Algorithm 3)
7       end for
8       Step 4: Retrain the network; decay the initial pruning percentage by a constant factor
9 end while
Algorithm 1 Iterative Training with Column Combining
Input: a filter matrix;
the maximum number of combined columns per group;
the number of conflicts (i.e., pruned weights) allowed on average per row in each group
Output: the groups of columns in the filter matrix
1 Start with every column ungrouped
2 loop
3       exit if all columns are in a group
4       select an ungrouped column
5       compute the densities between the column and each existing group (pairwise-density)
6       compute the number of conflicting weights between the column and each existing group (pairwise-overlap)
7       select the group with the highest density while satisfying both the group size and the overlap constraints (densest-group)
8       add the column to the selected group (a new group is started when no existing group satisfies the constraints)
9 end loop
Algorithm 2 Column Grouping (group-columns)
Input: a filter matrix;
the groups of columns in the filter matrix
Output: the filter matrix with conflicting entries within each group pruned
1 Iterate over groups and prune all but one entry per row in each group
2 for each group do
3       take the submatrix of the filter matrix containing the columns in the group
4       for each row of the submatrix do
5             find the largest magnitude weight in the row
6             prune (set to 0) every other nonzero weight in the row
7       end for
8 end for
Algorithm 3 Column-Combine Pruning (group-prune)
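
The Python sketch below captures the flavor of Algorithms 2 and 3 under the dense-column-first policy. The parameter names (`gamma` for the group-size limit, `beta` for the per-row conflict budget), the greedy candidate order, and the tie-breaking are our own choices and may differ from the paper's exact pairwise-density, pairwise-overlap, and densest-group procedures.

```python
import numpy as np

# Rough sketch of column grouping (Algorithm 2) and column-combine pruning
# (Algorithm 3); details are our own simplifications.
def group_columns(F, gamma, beta):
    """Greedily assign each column to the densest feasible group."""
    rows, cols = F.shape
    groups = []                                            # each group is a list of column indices
    for c in range(cols):                                  # consider every ungrouped column once
        best, best_density = None, -1.0
        for g in groups:
            sub = F[:, g + [c]] != 0
            overlap = np.maximum(sub.sum(axis=1) - 1, 0).sum()   # weights pruned if combined
            density = sub.any(axis=1).mean()                     # fraction of occupied rows
            if len(g) < gamma and overlap <= beta * rows and density > best_density:
                best, best_density = g, density
        if best is None:
            groups.append([c])                             # start a new group
        else:
            best.append(c)                                 # add to the densest feasible group
    return groups

def group_prune(F, groups):
    """Within each group, keep only the largest-magnitude weight on each row."""
    F = F.copy()
    for g in groups:
        sub = F[:, g]
        keep = np.abs(sub).argmax(axis=1)
        mask = np.zeros_like(sub, dtype=bool)
        mask[np.arange(sub.shape[0]), keep] = True
        F[:, g] = np.where(mask, sub, 0)
    return F
```

Running `group_prune(F, group_columns(F, gamma=8, beta=0.5))` on a sparse filter matrix yields a matrix in which each group can be packed into a single systolic array column.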

3.4 Explanations for Column Combining Algorithm

The limited-conflict condition assures that, in column combining for each group (Algorithm 2), at most the allowed number of weights are pruned per row on average (Algorithm 3). This helps minimize the impact of column-combine pruning on classification accuracy. The fact that each group can have at most a fixed number of columns (e.g., 8) limits the degree of multiplexing that systolic cells (described in Section 4.2) need to support, while still allowing that many columns to be combined in order to achieve high packing density.

The initial pruning at the beginning of each iteration can decrease the number of iterations required to reach a target number of nonzero weights for the sparse CNN. This is useful when column-combine pruning is set to be less aggressive by using a relatively small conflict budget (e.g., 0.5 conflicts per row) in order to minimize its impact on classification accuracy. Each iteration retrains the network resulting from the initial pruning and column-combine pruning. This mitigates the impact of these pruning operations on classification accuracy. Finally, we note that the dense-column-first combining policy is analogous to that of some popular bin-packing algorithms which pack large items first.

3.5 Row Permutation for Contiguous Column Groups

We can permute rows of a filter matrix of the current layer to ensure that the columns from the same group for the next layer are output next to each other. In Figure 7, systolic arrays for various layers are denoted as rectangles with a thick black boundary. In (a), a systolic array of eight columns for layer i+1 is for an original sparse filter matrix of this layer consisting of three column groups, indicated in three colors, for column combining. In (b), column combining is performed on the three column groups, which results in a reduced systolic array of three columns for layer i+1. This reduced systolic array is for a packed filter matrix consisting of three combined columns. A relatively expensive switchbox function is needed for routing output of layer i to input of the reduced systolic array for layer i+1. In (c), by permuting the rows of the layer i filter matrix according to the column groups in layer i+1, we avoid the expensive switchbox. A simple counter that counts the data items in each group can be used instead.

Note that such row permutations are valid, as the column combining operation on a filter matrix is not affected by row permutations on the previous filter matrix. Thus, row permutations for layer i can be determined by the column groups of a row permuted filter matrix for layer i+1. This makes the columns within each group contiguous and removes the need to reorder the output using a switchbox at inference runtime.

Figure 7: Applying Column Combining and Row Permutation.
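
As a small sketch (our own construction, with hypothetical groups), the row permutation for layer i can be read directly off the column groups chosen for layer i+1, so that rows feeding the same next-layer group come out of the layer i array contiguously.

```python
# Sketch: derive the row order for layer i from the column groups of layer i+1
# (row r of layer i produces the input channel consumed by column r of layer i+1).
def row_permutation_from_groups(groups_next_layer):
    """Concatenate the next layer's column groups to obtain the new row order."""
    return [col for group in groups_next_layer for col in group]

groups = [[0, 2, 5], [1, 3], [4, 6, 7]]          # hypothetical groups for layer i+1
print(row_permutation_from_groups(groups))       # [0, 2, 5, 1, 3, 4, 6, 7]
# With the rows of layer i permuted this way, a simple counter suffices to route
# each contiguous run of outputs into the corresponding combined column.
```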

3.6 Cross-layer Pipelining of CNN Inference under Column Combining and Row Permutation

In many realtime scenarios, single sample latency is a more important metric than throughput, as an input sample (e.g., an image) must be processed as soon as it is received by the system, and therefore cannot be processed in large batches.

To address this concern, we propose cross-layer pipelining, in the sense that we pipe the output data elements of the previous layer immediately as input to the next layer as soon as they exit the systolic array. Figure 8 shows this pipelining approach for three sparse CNN layers (Layer i, Layer i+1, and Layer i+2), each deployed in a separate systolic array after column combining and row permutation have been applied to each layer. The dashed lines emitted from each layer output denote that each data element is immediately pipelined into the next layer. In Section 7.4, we show that this approach reduces the inference latency for our ASIC implementation of LeNet-5 by 3.5x. Having the effect of narrowing systolic arrays for convolutional layers of a CNN, column combining can reduce data skew, which further reduces the latency.

Figure 8: Pipelining CNN inference across three layers with column combining and row permutation applied to each layer.
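
As a rough first-order latency model (entirely our own, with illustrative numbers rather than figures from the paper), the benefit of cross-layer pipelining can be seen by comparing running the layers back-to-back with streaming a single sample through all arrays at once: without pipelining, each layer pays the full stream length again, whereas with pipelining each layer only adds its fill and skew delay.

```python
# Illustrative first-order latency model (our own assumptions, not the paper's numbers):
# each systolic array contributes a fill/skew delay proportional to its dimensions,
# and a layer's input is a stream of `stream_len` items.
def latency_sequential(layers, stream_len):
    # Each layer waits for the previous layer to finish its whole output stream.
    return sum(rows + cols + stream_len for rows, cols in layers)

def latency_pipelined(layers, stream_len):
    # The stream flows through all arrays once; each layer only adds its fill/skew delay.
    return sum(rows + cols for rows, cols in layers) + stream_len

layers = [(32, 32), (32, 16), (32, 8)]           # (rows, columns) per layer, hypothetical
print(latency_sequential(layers, 1024))          # 3224
print(latency_pipelined(layers, 1024))           # 1176
```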

4 Systolic Array System Description for Column Combining

In this section, we describe the systolic array system and its components in support of the proposed column combining approach presented in Section 3.

4.1 Systolic Array System Components

The systolic array system for column combined CNNs is shown in Figure 9. The filter weights corresponding to layers of a CNN are stored in the weight buffer. The weights for a CNN layer can then be loaded into the MX cells of the systolic array (discussed in Section 4.2) before matrix multiplication is performed with the input data. The input data is loaded from the input buffer and passed through the shift block (discussed in Section 4.3). The shift block performs shift operations, as depicted in Figure 5, and passes the output to the systolic array in a bit-serial fashion, which then performs matrix multiplication with the weights stored in the systolic cells. The output of each row in the systolic array is passed to the ReLU block (discussed in Section 4.4), which performs the ReLU activation function. Finally, the result from the ReLU block is passed to the quantization block and stored in the output buffer.

Figure 9: Systolic array system.

4.2 Bit-serial Systolic Arrays

In this section, we describe our bit-serial implementation of systolic arrays for matrix multiplication. Figure 10 shows our proposed bit-serial MAC design, which is used across all systolic array implementations, for 8-bit input Xi and 8-bit filter weight W. The white logic elements implement the bit-serial multiplication between the input Xi and the absolute value of the filter weight. The blue logic elements negate the product based on the sign of the filter weight. The pink full adder performs bit-serial addition between the product and the input accumulation Yi.
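
A behavioral sketch of this MAC is given below (our own simplification of Figure 10: we model the arithmetic bit by bit and assume unsigned 8-bit inputs, rather than reproducing the exact gate-level pipeline).

```python
# Behavioral sketch of a bit-serial MAC (assumes unsigned 8-bit input data).
def bit_serial_mac(x, w, y_in, x_bits=8, acc_bits=32):
    """Compute y_out = y_in + x * w, consuming the input x one bit at a time."""
    w_mag, w_neg = abs(w), w < 0
    product = 0
    for i in range(x_bits):                     # input bits arrive LSB first
        if (x >> i) & 1:
            product += w_mag << i               # bit-serial multiplication with |w|
    if w_neg:
        product = -product                      # negate based on the weight sign
    y_out = y_in + product                      # bit-serial addition of the accumulation
    assert -(1 << (acc_bits - 1)) <= y_out < (1 << (acc_bits - 1))
    return y_out

assert bit_serial_mac(13, -7, 100) == 100 - 13 * 7
```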

We illustrate the scheme with a bit-serial systolic array for multiplying a filter matrix and a data matrix, as depicted in Figure 18(a). We pre-store in each systolic cell (or simply cell) the corresponding filter weight of the filter matrix. Data arrive from the bottom of the array. Matrix multiplication results come out from the right side of the array.

First, consider a simple scenario where each systolic cell has balanced I/O and computation time. This is the case when input data, filter weights and accumulation values use words of the same length. Suppose that they are all 8-bit. In this case, under the bit-serial MAC implementation of Figure 10, we will have a systolic cell as depicted in Figure 14(a), or a BL cell in Figure 19. In the corresponding systolic array, as depicted in Figure 18(a), for data synchronization purposes, neighboring input and accumulation data streams are skewed by one clock to accommodate the communication delay between the cells. However, this simple scenario is not applicable to high-precision accumulation that is necessary for holding the partial result of matrix multiplication [35].

Figure 10: Bit-serial multiplier-accumulator (MAC).

To accommodate high-precision accumulation, bit-serial systolic cells will have longer computation time than I/O. Suppose that input data and filter weights are 8-bit and accumulation values are 32-bit. In this case, under a bit-serial MAC implementation of Figure 10 with k = 32, we have the systolic cell as depicted in Figure 14(b). In the corresponding systolic array, as depicted in Figure 18(b), there is a 24-clock gap between words in each input data stream. The gap allows for the additional computation time required beyond the I/O time.

We can fill in these gaps for each cell by processing four independent input data streams simultaneously in an interleaved manner, while expanding the processing power and accumulation data path by 4x, as depicted in Figure 14(c) and the IL cell in Figure 19. The corresponding systolic array is depicted in Figure 18(c), with more details in Figure 23(b).

(a) Balanced Cell
(b) Unbalanced Cell
(c) Interleaved Cell
Figure 14: Systolic cells under different computation settings.
(a)
(b)
(c)
Figure 18: Systolic arrays under mixed precision settings.
Figure 19: Systolic cell types used for the corresponding systolic array in Figure 23.
(a)
(b)
(c)
Figure 23: Three types of systolic arrays based on the three cell designs in Figure 19.

Given the input channel groups determined by the column combining algorithm, we now describe an efficient systolic array implementation which can utilize the combined input channel groups. In Figure 19, the multiplexed input (MX) cell takes in two x inputs from two input channels, utilizes one of them inside each MAC, and forwards both x inputs to the cell above. Note that while for illustration simplicity this figure shows only two instances of input x, in our ASIC and FPGA designs we pack up to 8 channels (requiring 8 instances of input x) into a single cell. This highlights the importance of the bit-serial design: in the case of 8 inputs, each cell takes in only 8 bits per cycle, as opposed to a bit-parallel design where each cell would require 64 inputs per cycle in the case of 8-bit input.

Figure 23(c) shows how a systolic array connects the MX cells. In this example, for the first column, the first and third rows (filters) use input channel 1, denoted by the weights stored within those cells, and the second row uses input channel 2, denoted by the weight stored in that cell. As shown, these channel indexes are after row permutation (Section 3.5), and are therefore guaranteed to be contiguous.
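
The behavior of an MX cell can be sketched as follows (our own behavioral model, not the RTL): the cell stores one weight together with the index of the combined input channel it consumes, and forwards all incoming channel streams unchanged to the cell above.

```python
# Behavioral sketch of a multiplexed-input (MX) systolic cell.
class MXCell:
    def __init__(self, weight, channel_index):
        self.weight = weight
        self.channel_index = channel_index          # fixed after column combining

    def step(self, x_channels, y_in):
        """x_channels: one input value per combined channel arriving this step."""
        x = x_channels[self.channel_index]          # multiplex: use only the selected channel
        y_out = y_in + x * self.weight              # MAC with the stored weight
        return x_channels, y_out                    # inputs go up, accumulation goes right

cell = MXCell(weight=-8, channel_index=2)           # hypothetical values
_, y = cell.step([1, 4, 3], y_in=10)                # consumes channel index 2 (value 3)
assert y == 10 + 3 * (-8)
```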

4.3 Shift Block

Figure 26 shows the design for the shift operation. Based on the direction of the spatial translation specified by the shift control signal, the memory controller fetches the corresponding 8-bit input maps from the input buffer to the register array, which generates the input to the systolic arrays in a bit-serial fashion. We use double buffering to prefetch the next data tile so that the output time can overlap with the data transfer overhead from the input buffer to the register arrays.

(a)
(b)
Figure 26: Shift and ReLU blocks.
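
For reference, a per-channel spatial shift of the kind performed by the shift block can be sketched as follows (our own zero-padded implementation; the hardware's edge handling and addressing may differ).

```python
import numpy as np

# Sketch: shift one H x W channel by (dy, dx), filling vacated positions with zeros.
def shift_channel(fmap, dy, dx):
    out = np.zeros_like(fmap)
    H, W = fmap.shape
    dst_y = slice(max(dy, 0), H + min(dy, 0))
    dst_x = slice(max(dx, 0), W + min(dx, 0))
    src_y = slice(max(-dy, 0), H + min(-dy, 0))
    src_x = slice(max(-dx, 0), W + min(-dx, 0))
    out[dst_y, dst_x] = fmap[src_y, src_x]
    return out

x = np.arange(9, dtype=float).reshape(3, 3)
print(shift_channel(x, 1, 0))      # contents move down by one row; the top row becomes zeros
```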

4.4 ReLU and Quantization

Figure 26 shows the design for the ReLU operation. The 32-bit input stream comes in a bit-serial fashion and is stalled in a register array until the last bit arrives. The sign of the integer number represented by the 32-bit input stream is determined by the most significant bit (the 32nd bit). If the 32nd bit is 1, then the multiplexer outputs a 32-bit stream of 0s; otherwise, the multiplexer simply outputs the input stream. The output from the ReLU block is then re-quantized and saved in the output buffer. This output can then be transferred to the input buffer to be used as input for the following layer.
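
The sketch below models this path behaviorally (our own simplification of Figure 26: the requantization step, shown here as a shift and clamp, is illustrative and not the exact hardware scheme).

```python
# Behavioral sketch of the bit-serial ReLU followed by a crude requantization.
def relu_requantize(bits_lsb_first, shift=8):
    """bits_lsb_first: the 32 bits of a two's-complement accumulation, LSB first."""
    assert len(bits_lsb_first) == 32
    if bits_lsb_first[31]:                      # MSB set: the accumulation is negative
        return 0                                # ReLU outputs a stream of 0s
    value = sum(b << i for i, b in enumerate(bits_lsb_first[:31]))
    return min(value >> shift, 255)             # illustrative requantization back to 8 bits

negative = [1] * 32                             # -1 in two's complement
print(relu_requantize(negative))                # 0
positive = [0] * 32; positive[10] = 1           # the value 1024
print(relu_requantize(positive))                # 4
```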

5 Performance Analysis for the Column Combining Algorithm

We analyze our column combining approach described in Section 3 on two datasets: MNIST [33] (28x28 greyscale images of handwritten digits) and CIFAR-10 [29] (32x32 RGB images of objects). We evaluate the approach on three well studied networks: LeNet-5 [33] on MNIST, and VGG-16 [55] and ResNet-20 [23] on CIFAR-10. Each convolution layer in all networks is replaced by shift followed by pointwise convolution (shift convolution in Figure 5) to fit our systolic array system design covered in Section 4. All networks are trained using Stochastic Gradient Descent (SGD) with an initial learning rate of 0.05 for LeNet-5 and 0.2 for VGG-16 and ResNet-20. A Nesterov momentum of 0.9 [51] is used for all networks. A cosine shape learning rate schedule [37] is used to decay the learning rate over each iteration of Algorithm 1, ending at 20% of the initial learning rate. After the target number of weights has been reached, 100 additional epochs of training are performed while the learning rate is decayed further. Unless stated otherwise, the same column combining parameter settings are used for all networks.
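
For concreteness, one possible cosine-shaped schedule decaying to 20% of the initial rate over an iteration of Algorithm 1 is sketched below; the exact schedule of [37] and the one used in the paper's training runs may differ in detail.

```python
import math

# Sketch of a cosine-shaped learning rate schedule ending at 20% of the initial rate.
def cosine_lr(epoch, epochs_per_iteration, lr_init, lr_floor_frac=0.2):
    lr_min = lr_floor_frac * lr_init
    t = (epoch % epochs_per_iteration) / epochs_per_iteration    # progress within the iteration
    return lr_min + 0.5 * (lr_init - lr_min) * (1 + math.cos(math.pi * t))

print(cosine_lr(0, 20, 0.2))    # 0.2 at the start of an iteration
print(cosine_lr(19, 20, 0.2))   # approaches 0.04 (20% of 0.2) at the end
```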

5.1 Iterative Training with Column Combining

Training a network with column combining occurs over a series of iterations (Algorithm 1), where, at each iteration, weights are pruned to decrease the model size and increase the utilization efficiency when deployed in the systolic array. After each pruning stage, retraining is performed to recover the loss in classification accuracy due to the pruned weights. Figure 30(a) shows the classification accuracy and number of nonzero weights for the ResNet-20 model over each training epoch. The dashed vertical lines denote the beginning of an iteration of Algorithm 1, where initial pruning and column-combine pruning are performed. At each epoch, the number of weights in the model is shown by the red line. The first iteration of pruning decreases the model size most substantially (from 740K to 440K nonzero weights), and subsequent pruning stages decrease the model size by smaller amounts due to the decayed pruning percentage. When the target number of nonzero weights is reached, 125K in this instance, a final 100 epochs of training is performed, which improves the classification accuracy by an additional 5%.

(a) Iterative Training with Column Combining
(b) Impact of the Number of Columns per Group
(c) Impact of the Limited-Conflict Condition
Figure 30: (a) Classification accuracy and number of nonzero weights over training epochs (grey vertical lines denote pruning). (b) Increasing the number of columns per group allows for more columns to be added to a single combined column. (c) Increasing the allowed number of conflicts per row greatly improves utilization efficiency while minimally impacting classification accuracy.

5.2 Impact of Number of Columns per Group

The number of columns allowed to be added to a group during column grouping (Algorithm 2) is determined by the group size parameter. Figure 30(b) shows the classification accuracy and utilization efficiency for 5 ResNet-20 models for the CIFAR-10 dataset trained using Algorithm 1 while varying the maximum number of columns per group from 1 to 16. When only a single column is allowed per group, no column combining or column-combine pruning is performed. This network is equivalent to a standard systolic array operating on sparse filter matrices and achieves a low utilization efficiency. Note that, for this analysis, utilization efficiency and packing efficiency are interchangeable. As the group size is increased, the utilization efficiency improves, with the classification accuracy dropping by approximately 1% due to column-combine pruning. Beyond a certain group size, there is no further improvement in utilization efficiency, as columns cannot be further combined due to the higher degree of conflicts between the remaining nonzero weights.

5.3 Impact of the Limited-Conflict Condition

The limited-conflict condition, as described in Section 3.1, allows for a limited number of conflicting entries per row on average between columns within a group. All but the largest magnitude weight among conflicting weights are pruned during column-combine pruning (Algorithm 3). Figure 30(c) shows how classification accuracy and utilization efficiency vary as a function of the allowed number of conflicts per row for 5 ResNet-20 networks trained on the CIFAR-10 dataset. Larger values allow for more conflicts between the columns in a group and therefore prune more weights, possibly with relatively large magnitudes, in order to achieve higher utilization efficiency across all layers in the CNN. This dramatically increases the utilization efficiency from 52% to 93%. As discussed in the previous subsection, column-combine pruning has a small impact on classification accuracy (around 1%) since retraining is performed after each round of pruning in order to allow the remaining weights to adjust to the loss of the pruned weights.

5.4 Dramatic Tiling Reduction in Partitioned Matrix Multiplication with Column Combining

When a systolic array is smaller than the weights of a convolutional layer, matrix multiplication can be performed in multiple passes, where each pass executes matrix multiplication between a submatrix of the layer weights and the corresponding input data. Figure 33(a) shows how this partitioning process is performed on a sparse filter matrix (96 rows by 94 columns) which is larger than the systolic array (32 rows by 32 columns). The filter matrix is partitioned into 9 tiles, each with a maximum size of 32 by 32, and the input data is tiled in a similar manner along the columns, but not along the rows (batch size x image width x image height).

The full matrix multiplication is performed by alternating between weight loads and matrix multiplications for each of the submatrices (tiles). The filter matrix and input data enter the systolic array in a skewed fashion, as depicted, in order to maintain synchronization within the systolic array. Note that every systolic cell is busy all the time, either doing the matrix multiplication computation or loading the weights for the next tile. ReLU and quantization are performed on the output of the systolic array after the final tile for a set of rows in the filter matrix. (Note that in Section 7, we evaluate settings where the CNN must be partitioned into tiles, as shown in Figure 33(a), and also settings where each layer can fit entirely into a systolic array, which does not require partitioning.)

(a)
(b)
Figure 33: (a) Partitioned matrix multiplication with systolic array alternating between weight loads and matrix multiplication. (b) Column Combining reduces the number of tiles.

We have used a ResNet-20 model in the performance study for our proposed column combining scheme. For illustration purposes, consider here the third layer of the model. Figure 33(b) shows a sparse filter matrix and a corresponding packed filter matrix after column combining, which is stored in the systolic array with MX cells (described in Section 4). As in Figure 33(a), the sparse filter matrix has 96 rows and 94 columns, with only 16% of the weights being nonzeros. For a 32x32 systolic array, this sparse filter matrix is partitioned into 9 tiles (denoted by the red lines) in order to perform the full matrix multiplication. The packed filter matrix is the output after column combining, which has arranged the 94 columns of the sparse filter matrix into 17 groups. Each group is a single column in the packed filter matrix format, which can then be loaded into the systolic array. This packed format has 89% nonzeros and requires only 3 tiles to perform matrix multiplication (a 3x reduction).
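
The tile counts quoted above follow from simple ceiling arithmetic; the sketch below (our own helper) reproduces them for the layer in Figure 33.

```python
import math

# Sketch: number of tiles needed to run a (rows x cols) filter matrix on an
# (array_rows x array_cols) systolic array.
def num_tiles(rows, cols, array_rows=32, array_cols=32):
    return math.ceil(rows / array_rows) * math.ceil(cols / array_cols)

print(num_tiles(96, 94))   # 9 tiles for the sparse filter matrix of Figure 33(a)
print(num_tiles(96, 17))   # 3 tiles once its 94 columns are packed into 17 groups
```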

Figure 36(a) shows the number of tiles required to perform matrix multiplication with a 32x32 systolic array for each layer in ResNet-20 models trained using Algorithm 1 under three different parameter settings. All three settings otherwise share the same training configuration. The baseline setting trains the CNN with standard pruning but without column combining or column-combine pruning. The column-combine setting uses the same CNN trained in the baseline setting, but allows for column combining without column-combine pruning. Finally, the column-combine pruning setting trains the CNN with column combining and performs column-combine pruning to remove conflicting entries. The column-combine setting only reduces the number of tiles over the baseline setting by 10% at most. By comparison, the column-combine pruning setting reduces the number of tiles by a substantial margin across all layers and achieves a 5x reduction in the largest layer (layer 19). Generally, this shows that it is difficult to effectively combine sparse columns, as a single conflict in any row for a potential group will make the combination invalid. By adding a modest amount of column-combine pruning, the combining algorithm is able to substantially improve the utilization efficiency and decrease the number of tiles.

(a) Number of Tiles in Systolic Array
(b) Training with Limited Data
Figure 36: (a) Number of tiles for a 32x32 systolic array. (b) Comparing training a new model to training a pretrained model with column combining on limited datasets.

6 Column Combining with Limited Datasets

In many real world scenarios, customers may provide pretrained models to vendors to be deployed on their device (e.g., a mobile device). In these settings, a customer may not wish to divulge datasets used to train the model to the vendor for a number of reasons, such as the dataset containing sensitive private information or being a competitive advantage. In this scenario, model pruning is difficult, as pruning weights without retraining leads to a significant degradation in classification accuracy.

We propose that these data privacy concerns can be mostly mitigated by providing only a subset of the original dataset to perform the proposed column combining iterative training process. Figure 36(b) compares the effects of column combining on a pretrained dense ResNet-20 model, trained on the full CIFAR-10 training dataset, to a new network trained from scratch (as in Section 5.1), over different fractions of training data. The largest difference in performance between the two approaches is when only 1% of the full training data is used (a 15% difference in classification accuracy), as the weights in the pretrained model are already initialized to reasonable values. At 15% of the full training data, the pretrained model can achieve over 90% classification accuracy. This shows that a small amount of training data can be sufficient to perform column combining while still maintaining a relatively high classification accuracy. By comparison, training a new model requires 35% of the training dataset to achieve over 90% classification accuracy. Our model pruning and retraining method can be viewed as part of a larger area of research shared by teacher-student networks [50, 58] and curriculum learning [8].

7 Hardware Implementation Experiments and Performance Evaluation

In this section, we evaluate the performance of our column combining system described in Section 4 based on design experiments in both ASIC and FPGA. Throughout, we compare designs in terms of performance metrics concerning accuracy, throughput, area efficiency and energy efficiency. Additionally, we pay attention to performance for a single or a small number of input samples, e.g., the end-to-end latency or energy requirement in processing a single input sample, such as a 28x28 grey-scale image with LeNet-5. As stated earlier in Section 3.6, in realtime scenarios, single sample latency is a more important metric than throughput, as an input sample must be processed immediately for early prediction.

Section 7.1 describes our ASIC implementation and compares it against a baseline systolic array without column combining on three different CNNs (LeNet-5, VGG-16, and ResNet-20). In addition, we compare our ASIC design with other state-of-the-art ASIC accelerators for LeNet-5. In Section 7.2, we provide a mathematical analysis on optimality in energy efficiency. We argue that for CNNs such as LeNet-5, which incur a relatively small I/O energy compared to MAC operations, packing these CNNs with column combining leads to systolic array designs which are near optimal in energy efficiency. Section 7.3 compares our FPGA implementation with other FPGA CNN accelerators on CIFAR-10. Section 7.4 compares single-sample latency of our ASIC implementations with and without cross-layer pipelining.

7.1 ASIC Implementation and Evaluation

We synthesize our ASIC design using the Synopsys Design Compiler [2] with the 45nm NanGate Open Cell Library [3] and CACTI 7.0 [1]. We estimate the hardware performance of static random-access memory (SRAM) with CACTI 7.0 and synthesize the remaining components of the design, including the systolic arrays with MX cells (Section 4.2), Shift (Section 4.3), and ReLU and Quantization (Section 4.4), using the Synopsys Design Compiler.

We analyze our ASIC implementation across two scenarios. First, in Section 7.1.1, we compare the bit-serial systolic array without column combining (Figure 23(b)) to our bit-serial design with column combining (Figure 23(c)), where a single systolic array is used to process all CNN layers with tiling as presented in Section 5.4. Then, in Section 7.1.2, we compare our column combining ASIC implementation against prior ASIC implementations of LeNet-5. In the second scenario, we can fit each layer entirely into a systolic array and therefore do not require tiling.

7.1.1 Systolic Array Comparison using Tiling

To analyze our ASIC implementation of column combining, we implement the three networks discussed in Section 5 (LeNet-5, VGG-16 and ResNet-20) using a single systolic array of size 32x32 and perform partitioned matrix multiplication as shown in Figure 33(a). For this scenario, 32-bit accumulation is used for all networks. We report energy consumption for processing one input sample for each CNN across the three column combining algorithm parameter settings presented in Section 5.4. The baseline setting uses standard pruning without column combining, the column-combine setting allows for column combining without column-combine pruning, and the column-combine pruning setting performs column-combine pruning to improve utilization efficiency by removing conflicting entries.

Figure 37 depicts the throughput, number of tiles required to perform matrix multiplication across all layers, energy consumption per input sample, and classification accuracy for each CNN across the three parameter settings. For all three CNN structures, the column-combine pruning setting greatly reduces the energy consumption and number of tiles, by 4x to 6x over the other two settings. Furthermore, the column-combine pruning setting has greater throughput compared to the other settings across all networks.

Figure 37: Performance of baseline and column combining ASIC implementations using tiling as in Section 5.4.

7.1.2 Comparison Against Prior Designs on LeNet-5

We compare our ASIC implementation of LeNet-5, trained on MNIST, to prior state-of-the-art CNN accelerator designs. For this scenario, we use 16-bit accumulations for the systolic array, as all layers are small in terms of filter sizes and therefore do not require 32-bit accumulations. With 16-bit accumulations, a single MAC operation takes half the number of cycles compared with 32-bit accumulations. All other designs use LeNet-5 (except for SpiNNaker [28], which uses a Deep Belief Network, and TrueNorth [6], which uses a Spiking Neural Network). Two SC-DCNN [47] designs were chosen for comparison: SC-DCNN (type a) has higher classification accuracy while SC-DCNN (type b) has higher energy efficiency. To compare with these designs, we implement two configurations of LeNet-5, Ours (design 1) and Ours (design 2), by running the column-combining algorithm with two different target numbers of nonzero weights (8K for design 1 and 5K for design 2). Both designs use the same column combining parameter settings.

Table 1 compares all designs in terms of accuracy, area efficiency, and energy efficiency. Generally, our design has both the highest area efficiency and energy efficiency across all the designs. Compared to SC-DCNN (type a), our design 1 achieves a 2.2x improvement in area efficiency and a 3x improvement in energy efficiency, while also attaining a higher classification accuracy. Similarly, our design 2 realizes a higher classification accuracy than SC-DCNN (type b), while achieving a 1.4x improvement in area efficiency and a 1.7x improvement in energy efficiency.

Design              Network   Platform   Accuracy   Area Eff.   Energy Eff.
Ours (design 1)     CNN       ASIC       98.32%     46603       658053
Ours (design 2)     CNN       ASIC       97.61%     64716       869402
SC-DCNN (type a)    CNN       ASIC       98.26%     21439       221287
SC-DCNN (type b)    CNN       ASIC       96.64%     45946       510734
2x Xeon W5580       CNN       CPU        98.46%     2.5         4.2
Tesla C2075         CNN       GPU        98.46%     4.5         3.2
SpiNNaker           DBN       ARM        95.00%     N/A         166.7
TrueNorth           SNN       ASIC       99.42%     2.3         9259

Table 1: Comparison of our ASIC implementations of LeNet-5 to other CNN accelerators on MNIST.

7.2 Optimality in Energy Efficiency

We provide an analysis showing that our systolic array design can achieve near-optimal energy efficiency. The total energy consumption of processing an input sample is

    E_total = E_MAC + E_SRAM,

where E_MAC and E_SRAM are the energy consumption for all MAC computations and for SRAM, respectively. Let e denote the energy consumption for a single MAC operation, N the number of MAC operations performed by the pruned network when deployed in the systolic array (so that E_MAC = e · N), and N_opt the optimal number of MAC operations. Let r denote the ratio between E_SRAM and E_MAC. Suppose that all designs have the same e and E_SRAM. Then the energy efficiency of a design is

    Energy Eff. = 1 / (e · N + E_SRAM),

and the optimal energy efficiency is

    Optimal Energy Eff. = 1 / (e · N_opt + E_SRAM).

We have observed from synthesized results that when the input size is relatively small, r tends to be small. This is the case for both LeNet-5 and ResNet-20. In this case, we have

    Energy Eff. / Optimal Energy Eff. = (e · N_opt + E_SRAM) / (e · N + E_SRAM) ≈ N_opt / N.

Note that N_opt / N is the packing efficiency achievable by column combining. Thus, when r is small, the ratio between the energy efficiency and the optimal energy efficiency is mostly determined by the packing efficiency.

Consider, for example, the scenario depicted in Figure 30(c). Column combining can achieve a packing efficiency of about 94.5% with a modest degradation of classification accuracy of about 0.7% in absolute percentage. Thus, in this case, the energy efficiency of our design is about 94.5% of the optimal energy efficiency, for small r.

7.3 FPGA Implementation and Evaluation

For our FPGA implementation, we use the Xilinx XCKU035-1FBVA676C chip [5]. We synthesize our design using the Xilinx Vivado Design Suite (2017.4) [4]. We use 32-bit accumulation for the systolic array implementation.

Table 2 compares our ResNet-20 implementation to other FPGA implementations for CIFAR-10 in terms of classification accuracy and energy efficiency. We notice that our design achieves an accuracy of 93.1%, which is around 5-6% higher than the other models. Moreover, our design achieves a 3x improvement in energy efficiency over the next best design. While it is possible for the other designs to increase their accuracy by using more hardware, it is hard for them to attain an energy efficiency as high as that of our design.

                                    [57]    [70]     [16]    Ours
Frequency (MHz)                     N/A     143      100     150
Precision (data/weight)             N/A     1        16      8
Classification Accuracy             N/A     87.73%   88.3%   93.1%
Energy Efficiency (frames/joule)    6109    1320     36      18830

Table 2: Comparison of our ResNet-20 model to state-of-the-art FPGA implementations for CIFAR-10.

7.4 Dramatic Reduction in End-to-end Inference Latency with Cross-layer Pipelining

In this section, we evaluate the FPGA performance of cross-layer pipelining, described in Section 3.6, in terms of reduced end-to-end inference latency for a single sample on LeNet-5 and ResNet-20. We found that cross-layer pipelining reduces the latency significantly compared to designs without pipelining, for both LeNet-5 and ResNet-20.

Furthermore, we compare our column combined ResNet-20 model with cross-layer pipelining on FPGA to other hardware implementations, including GPU, CPU and FPGA accelerators, trained on CIFAR-10. Table 3 shows the classification accuracy and end-to-end latency for a single input sample of each design. The latency of 652 microseconds for [18] shown in Table 3 only includes the latency for all convolutional layers (hence the entry >652). Our design achieves an end-to-end latency roughly 12x smaller than the next best implementation, while also obtaining a higher classification accuracy.

                               CPU [70]   GPU [70]   [70]     [18]     Ours
Classification Accuracy       88.42%     88.42%     88.42%   85.88%   93.1%
Latency (microseconds/frame)  14800      730        5940     >652     55.68

Table 3: Comparison of our ResNet-20 model with cross-layer pipelining to state-of-the-art CNN accelerators for CIFAR-10.

8 Conclusion

In this paper, for CNN inference, we have presented a solution to a long-standing parallel processing challenge: how one can make efficient use of regular parallel processing arrays, such as systolic arrays, for sparse computations. Specifically, for a given sparse CNN, we have proposed a novel approach of using column combining to pack the filter matrix associated with each convolutional layer for its efficient systolic array implementation. In combining columns, we prune all weights on conflicting rows but the one with the largest magnitude. We then bring up the classification accuracy of the pruned network via retraining. We iterate on these column-combining and network-retraining steps to improve both the utilization efficiency of the systolic array and the classification accuracy of the network. This joint optimization has become feasible for sparse CNN inference. That is, for a CNN, we can optimize its topology to fit the structure of the underlying computing hardware, such as a given systolic array, while preserving most of its classification accuracy via network retraining.

Being able to transform sparse computations to fit highly efficient regular processor arrays is powerful. As demonstrated in the paper, our proposed column combining approach can increase the utilization efficiency of a systolic array by approximately 4x, with a slight increase in the complexity of systolic cells for providing multiplexing (MX) support. This has led to superior performance of our proposed method against prior art under metrics such as energy efficiency (3x) and inference latency (12x).

References

  • [1] Cacti: An integrated cache and memory access time, cycle time, area, leakage, and dynamic power model. https://github.com/HewlettPackard/cacti.
  • [2] Design compiler: Rtl synthesis. https://www.synopsys.com/support/training/rtl-synthesis/design-compiler-rtl-synthesis.html.
  • [3] Nangate freepdk45 open cell library. http://www.nangate.com/?page_id=2325.
  • [4] Vivado design suite - hlx editions productivity. multiplied. https://www.xilinx.com/products/design-tools/vivado.html.
  • [5] Xilinx inc. xcku035-1fbva676c. https://www.digikey.ca/product-detail/en/xilinx-inc/XCKU035-1FBVA676C/122-1989-ND/6132038.
  • [6] Filipp Akopyan, Jun Sawada, Andrew Cassidy, Rodrigo Alvarez-Icaza, John Arthur, Paul Merolla, Nabil Imam, Yutaka Nakamura, Pallab Datta, Gi-Joon Nam, Brian Taba, Michael Beakes, Bernard Brezzo, Jente Kuang, Rajit Manohar, William Risk, Bryan Jackson, and Dharmendra Modha. Truenorth: Design and tool flow of a 65 mW 1 million neuron programmable neurosynaptic chip. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 34(10):1537–1557, 2015.
  • [7] Suyoung Bang, Jingcheng Wang, Ziyun Li, Cao Gao, Yejoong Kim, Qing Dong, Yen-Po Chen, Laura Fick, Xun Sun, Ron Dreslinski, Trevor Mudge, Hun Seok Kim, David Blaauw, and Dennis Sylvester. 14.7 A 288µW programmable deep-learning processor with 270KB on-chip weight storage using non-uniform memory hierarchy for mobile intelligence. In Solid-State Circuits Conference (ISSCC), 2017 IEEE International, pages 250–251. IEEE, 2017.
  • [8] Yoshua Bengio, Jérôme Louradour, Ronan Collobert, and Jason Weston. Curriculum learning. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 41–48. ACM, 2009.
  • [9] Tianshi Chen, Zidong Du, Ninghui Sun, Jia Wang, Chengyong Wu, Yunji Chen, and Olivier Temam. Diannao: A small-footprint high-throughput accelerator for ubiquitous machine-learning. ACM Sigplan Notices, 49(4):269–284, 2014.
  • [10] Yu-Hsin Chen, Tushar Krishna, Joel S Emer, and Vivienne Sze. Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks. IEEE Journal of Solid-State Circuits, 52(1):127–138, 2017.
  • [11] Yunji Chen, Tao Luo, Shaoli Liu, Shijin Zhang, Liqiang He, Jia Wang, Ling Li, Tianshi Chen, Zhiwei Xu, Ninghui Sun, and Olivier Temam. Dadiannao: A machine-learning supercomputer. In Proceedings of the 47th Annual IEEE/ACM International Symposium on Microarchitecture, pages 609–622. IEEE Computer Society, 2014.
  • [12] François Chollet. Xception: Deep learning with depthwise separable convolutions. arXiv preprint arXiv:1610.02357, 2016.
  • [13] Jason Clemons, Chih-Chi Cheng, Iuri Frosio, Daniel Johnson, and Stephen W Keckler. A patch memory system for image processing and computer vision. In Microarchitecture (MICRO), 2016 49th Annual IEEE/ACM International Symposium on, pages 1–13. IEEE, 2016.
  • [14] Matthieu Courbariaux, Itay Hubara, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio. Binarized neural networks: Training deep neural networks with weights and activations constrained to+ 1 or-1. arXiv preprint arXiv:1602.02830, 2016.
  • [15] Roberto DiCecco, Lin Sun, and Paul Chow. Fpga-based training of convolutional neural networks with a reduced precision floating-point library. In Field Programmable Technology (ICFPT), 2017 International Conference on, pages 239–242. IEEE, 2017.
  • [16] Caiwen Ding, Siyu Liao, Yanzhi Wang, Zhe Li, Ning Liu, Youwei Zhuo, Chao Wang, Xuehai Qian, Yu Bai, Geng Yuan, Xiaolong Ma, Yipeng Zhang, Jian Tang, Qinru Qiu, Xue Lin, and Bo Yuan. Circnn: accelerating and compressing deep neural networks using block-circulant weight matrices. In Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture, pages 395–408. ACM, 2017.
  • [17] Zidong Du, Robert Fasthuber, Tianshi Chen, Paolo Ienne, Ling Li, Tao Luo, Xiaobing Feng, Yunji Chen, and Olivier Temam. Shidiannao: Shifting vision processing closer to the sensor. In ACM SIGARCH Computer Architecture News, volume 43, pages 92–104. ACM, 2015.
  • [18] J. H. Lin, T. Xing, R. Zhao, Z. Zhang, M. B. Srivastava, Z. Tu, and R. K. Gupta. Binarized convolutional neural networks with separable filters for efficient hardware acceleration. In CVPR Workshops, pages 344–352, 2017.
  • [19] Scott Gray, Alec Radford, and Diederik Kingma. Gpu kernels for block-sparse weights. https://s3-us-west-2.amazonaws.com/openai-assets/blocksparse/blocksparsepaper.pdf, 2017. [Online; accessed 12-January-2018].
  • [20] Philipp Gysel, Mohammad Motamedi, and Soheil Ghiasi. Hardware-oriented approximation of convolutional neural networks. arXiv preprint arXiv:1604.03168, 2016.
  • [21] Song Han, Junlong Kang, Huizi Mao, Yiming Hu, Xin Li, Yubin Li, Dongliang Xie, Hong Luo, Song Yao, Yu Wang, Huazhong Yang, and William J. Dally. Ese: Efficient speech recognition engine with sparse lstm on fpga. In Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, pages 75–84. ACM, 2017.
  • [22] Song Han, Huizi Mao, and William J Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. arXiv preprint arXiv:1510.00149, 2015.
  • [23] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
  • [24] Yihui He, Xiangyu Zhang, and Jian Sun. Channel pruning for accelerating very deep neural networks. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2017.
  • [25] Andrew G Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.
  • [26] Gao Huang, Shichen Liu, Laurens van der Maaten, and Kilian Q Weinberger. Condensenet: An efficient densenet using learned group convolutions. arXiv preprint arXiv:1711.09224, 2017.
  • [27] Norman P. Jouppi, Cliff Young, Nishant Patil, David Patterson, Gaurav Agrawal, Raminder Bajwa, Sarah Bates, Suresh Bhatia, Nan Boden, Al Borchers, Rick Boyle, Pierre-luc Cantin, Clifford Chao, Chris Clark, Jeremy Coriell, Mike Daley, Matt Dau, Jeffrey Dean, Ben Gelb, Tara Vazir Ghaemmaghami, Rajendra Gottipati, William Gulland, Robert Hagmann, C. Richard Ho, Doug Hogberg, John Hu, Robert Hundt, Dan Hurt, Julian Ibarz, Aaron Jaffey, Alek Jaworski, Alexander Kaplan, Harshit Khaitan, Daniel Killebrew, Andy Koch, Naveen Kumar, Steve Lacy, James Laudon, James Law, Diemthu Le, Chris Leary, Zhuyuan Liu, Kyle Lucke, Alan Lundin, Gordon MacKean, Adriana Maggiore, Maire Mahony, Kieran Miller, Rahul Nagarajan, Ravi Narayanaswami, Ray Ni, Kathy Nix, Thomas Norrie, Mark Omernick, Narayana Penukonda, Andy Phelps, Jonathan Ross, Matt Ross, Amir Salek, Emad Samadiani, Chris Severn, Gregory Sizikov, Matthew Snelham, Jed Souter, Dan Steinberg, Andy Swing, Mercedes Tan, Gregory Thorson, Bo Tian, Horia Toma, Erick Tuttle, Vijay Vasudevan, Richard Walter, Walter Wang, Eric Wilcox, and Doe Hyun Yoon. In-datacenter performance analysis of a tensor processing unit. In Proceedings of the 44th Annual International Symposium on Computer Architecture, ISCA ’17, pages 1–12, New York, NY, USA, 2017. ACM.
  • [28] Muhammad Mukaram Khan, David R Lester, Luis A Plana, A Rast, Xin Jin, Eustace Painkras, and Stephen B Furber. Spinnaker: mapping neural networks onto a massively-parallel chip multiprocessor. In Neural Networks, 2008. IJCNN 2008. (IEEE World Congress on Computational Intelligence). IEEE International Joint Conference on, pages 2849–2856. IEEE, 2008.
  • [29] Alex Krizhevsky, Vinod Nair, and Geoffrey Hinton. The cifar-10 dataset, 2014.
  • [30] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.
  • [31] H. T. Kung. Why systolic architectures? IEEE Computer, 15:37–46, 1982.
  • [32] H. T. Kung and C. E. Leiserson. Systolic arrays (for vlsi). In Sparse Matrix Proceedings 1978, pages 256–282. Society for Industrial and Applied Mathematics, 1979.
  • [33] Yann LeCun. The mnist database of handwritten digits. http://yann.lecun.com/exdb/mnist/, 1998.
  • [34] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
  • [35] Darryl Lin, Sachin Talathi, and Sreekanth Annapureddy. Fixed point quantization of deep convolutional networks. In International Conference on Machine Learning, pages 2849–2858, 2016.
  • [36] Zhouhan Lin, Matthieu Courbariaux, Roland Memisevic, and Yoshua Bengio. Neural networks with few multiplications. arXiv preprint arXiv:1510.03009, 2015.
  • [37] Ilya Loshchilov and Frank Hutter. Sgdr: Stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983, 2016.
  • [38] Jian-Hao Luo, Jianxin Wu, and Weiyao Lin. Thinet: A filter level pruning method for deep neural network compression. arXiv preprint arXiv:1707.06342, 2017.
  • [39] Yufei Ma, Yu Cao, Sarma Vrudhula, and Jae-sun Seo. Optimizing loop operation and dataflow in fpga acceleration of deep convolutional neural networks. In Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, pages 45–54. ACM, 2017.
  • [40] Rick Merritt. Arm at risk on ai chip market. EE Times India, April 2018.
  • [41] Sharan Narang, Eric Undersander, and Gregory F. Diamos. Block-sparse recurrent neural networks. CoRR, abs/1711.02782, 2017.
  • [42] Jian Ouyang, Ephrem Wu, Jing Wang, Yupeng Li, and Hanlin Xie. Xpu: A programmable fpga accelerator for diverse workloads. Hot Chips, 2017.
  • [43] Jinhwan Park and Wonyong Sung. Fpga based implementation of deep neural networks using on-chip memory only. In Acoustics, Speech and Signal Processing (ICASSP), 2016 IEEE International Conference on, pages 1011–1015. IEEE, 2016.
  • [44] Jongse Park, Hardik Sharma, Divya Mahajan, Joon Kyung Kim, Preston Olds, and Hadi Esmaeilzadeh. Scale-out acceleration for machine learning. In Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture, pages 367–381. ACM, 2017.
  • [45] Jiantao Qiu, Jie Wang, Song Yao, Kaiyuan Guo, Boxun Li, Erjin Zhou, Jincheng Yu, Tianqi Tang, Ningyi Xu, Sen Song, Yu Wang, and Huazhong Yang. Going deeper with embedded fpga platform for convolutional neural network. In Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, pages 26–35. ACM, 2016.
  • [46] Brandon Reagen, Paul Whatmough, Robert Adolf, Saketh Rama, Hyunkwang Lee, Sae Kyu Lee, José Miguel Hernández-Lobato, Gu-Yeon Wei, and David Brooks. Minerva: Enabling low-power, highly-accurate deep neural network accelerators. In ACM SIGARCH Computer Architecture News, volume 44, pages 267–278. IEEE Press, 2016.
  • [47] Ao Ren, Zhe Li, Caiwen Ding, Qinru Qiu, Yanzhi Wang, Ji Li, Xuehai Qian, and Bo Yuan. Sc-dcnn: Highly-scalable deep convolutional neural network using stochastic computing. ACM SIGOPS Operating Systems Review, 51(2):405–418, 2017.
  • [48] Minsoo Rhu, Natalia Gimelshein, Jason Clemons, Arslan Zulfiqar, and Stephen W Keckler. vdnn: Virtualized deep neural networks for scalable, memory-efficient neural network design. In Microarchitecture (MICRO), 2016 49th Annual IEEE/ACM International Symposium on, pages 1–13. IEEE, 2016.
  • [49] R. Rojas. Neural Networks - A Systematic Introduction, Chapter 18: Hardware for Neural Networks. Springer-Verlag, 1996.
  • [50] Adriana Romero, Nicolas Ballas, Samira Ebrahimi Kahou, Antoine Chassang, Carlo Gatta, and Yoshua Bengio. Fitnets: Hints for thin deep nets. arXiv preprint arXiv:1412.6550, 2014.
  • [51] Sebastian Ruder. An overview of gradient descent optimization algorithms. arXiv preprint arXiv:1609.04747, 2016.
  • [52] Murugan Sankaradas, Venkata Jakkula, Srihari Cadambi, Srimat Chakradhar, Igor Durdanovic, Eric Cosatto, and Hans Peter Graf. A massively parallel coprocessor for convolutional neural networks. In Application-specific Systems, Architectures and Processors, 2009. ASAP 2009. 20th IEEE International Conference on, pages 53–60. IEEE, 2009.
  • [53] Yongming Shen, Michael Ferdman, and Peter Milder. Escher: A cnn accelerator with flexible buffering to minimize off-chip transfer. In Field-Programmable Custom Computing Machines (FCCM), 2017 IEEE 25th Annual International Symposium on, pages 93–100. IEEE, 2017.
  • [54] Yongming Shen, Michael Ferdman, and Peter Milder. Maximizing cnn accelerator efficiency through resource partitioning. In Proceedings of the 44th Annual International Symposium on Computer Architecture, pages 535–547. ACM, 2017.
  • [55] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
  • [56] Linghao Song, Xuehai Qian, Hai Li, and Yiran Chen. Pipelayer: A pipelined reram-based accelerator for deep learning. In High Performance Computer Architecture (HPCA), 2017 IEEE International Symposium on, pages 541–552. IEEE, 2017.
  • [57] Steven K. Esser, Paul A. Merolla, John V. Arthur, Andrew S. Cassidy, Rathinakumar Appuswamy, Alexander Andreopoulos, David J. Berg, Jeffrey L. McKinstry, Timothy Melano, Davis R. Barch, Carmelo di Nolfo, Pallab Datta, Arnon Amir, Brian Taba, Myron D. Flickner, and Dharmendra S. Modha. Convolutional networks for fast, energy-efficient neuromorphic computing. Proceedings of the National Academy of Sciences, 2016.
  • [58] Antti Tarvainen and Harri Valpola. Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results. In Advances in neural information processing systems, pages 1195–1204, 2017.
  • [59] Yaman Umuroglu, Nicholas J Fraser, Giulio Gambardella, Michaela Blott, Philip Leong, Magnus Jahre, and Kees Vissers. Finn: A framework for fast, scalable binarized neural network inference. In Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, pages 65–74. ACM, 2017.
  • [60] Chao Wang, Lei Gong, Qi Yu, Xi Li, Yuan Xie, and Xuehai Zhou. Dlau: A scalable deep learning accelerator unit on fpga. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 36(3):513–517, 2017.
  • [61] Shihao Wang, Dajiang Zhou, Xushen Han, and Takeshi Yoshimura. Chain-nn: An energy-efficient 1d chain architecture for accelerating deep convolutional neural networks. In 2017 Design, Automation & Test in Europe Conference & Exhibition (DATE), pages 1032–1037. IEEE, 2017.
  • [62] Ying Wang, Huawei Li, and Xiaowei Li. Re-architecting the on-chip memory sub-system of machine-learning accelerator for embedded devices. In Proceedings of the 35th International Conference on Computer-Aided Design, page 13. ACM, 2016.
  • [63] Xuechao Wei, Cody Hao Yu, Peng Zhang, Youxiang Chen, Yuxin Wang, Han Hu, Yun Liang, and Jason Cong. Automated systolic array architecture synthesis for high throughput cnn inference on fpgas. In Design Automation Conference (DAC), 2017 54th ACM/EDAC/IEEE, pages 1–6. IEEE, 2017.
  • [64] Wei Wen, Chunpeng Wu, Yandan Wang, Yiran Chen, and Hai Li. Learning structured sparsity in deep neural networks. In Advances in Neural Information Processing Systems, pages 2074–2082, 2016.
  • [65] Bichen Wu, Alvin Wan, Xiangyu Yue, Peter Jin, Sicheng Zhao, Noah Golmant, Amir Gholaminejad, Joseph Gonzalez, and Kurt Keutzer. Shift: A zero flop, zero parameter alternative to spatial convolutions. arXiv preprint arXiv:1711.08141, 2017.
  • [66] Chen Zhang, Zhenman Fang, Peipei Zhou, Peichen Pan, and Jason Cong. Caffeine: Towards uniformed representation and acceleration for deep convolutional neural networks. In Computer-Aided Design (ICCAD), 2016 IEEE/ACM International Conference on, pages 1–8. IEEE, 2016.
  • [67] Chen Zhang, Peng Li, Guangyu Sun, Yijin Guan, Bingjun Xiao, and Jason Cong. Optimizing fpga-based accelerator design for deep convolutional neural networks. In Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, pages 161–170. ACM, 2015.
  • [68] Shijin Zhang, Zidong Du, Lei Zhang, Huiying Lan, Shaoli Liu, Ling Li, Qi Guo, Tianshi Chen, and Yunji Chen. Cambricon-x: An accelerator for sparse neural networks. In The 49th Annual IEEE/ACM International Symposium on Microarchitecture, page 20. IEEE Press, 2016.
  • [69] Lei Zhao, Youtao Zhang, and Jun Yang. Aep: An error-bearing neural network accelerator for energy efficiency and model protection. In Computer-Aided Design (ICCAD), 2017 IEEE/ACM International Conference on, pages 765–771. IEEE, 2017.
  • [70] Ritchie Zhao, Weinan Song, Wentao Zhang, Tianwei Xing, Jeng-Hau Lin, Mani Srivastava, Rajesh Gupta, and Zhiru Zhang. Accelerating binarized convolutional neural networks with software-programmable fpgas. In Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, pages 15–24. ACM, 2017.