1 Introduction
Many recent state-of-the-art hardware-based deep learning accelerators use systolic arrays for efficient implementations of convolutional neural networks (CNNs). They leverage properties of systolic arrays such as parallel processing under the dataflow architecture, regular layout of processing elements, efficient inter-processor communication, and minimized I/O achieved by reusing the same data fetched from memory many times [32, 31, 49]. These systems, such as the Google TPU [27], the ShiDianNao accelerators [17], and numerous other efforts, including [40], [10], [61], [68], [63], have achieved low power consumption and high throughput.
In recent years, there have also been significant algorithmic advances for CNNs which have enabled orders-of-magnitude reductions in both the number of network parameters and the amount of computation for inference compared to prior well-studied networks such as VGG-16 [55]. One of the more important model reduction techniques is weight pruning [22], which has shown that the majority of weights in a trained CNN can be pruned (set to 0) without significantly impacting the accuracy of the network.
However, it can be challenging to utilize the regular structure of systolic arrays efficiently when the nonzero weights of a pruned CNN are distributed in an irregular manner. In traditional approaches, zero weights still occupy cells in the systolic array.
In this paper we propose a novel approach, called column combining, which packs sparse convolutional networks for efficient systolic array implementation. When combining columns to increase the percentage of nonzeros in a packed CNN, we prune, within a group of combined columns, all weights on conflicting rows except the one with the largest magnitude. We then restore the classification accuracy of the pruned network via retraining. We iteratively perform these column-combining and network-retraining steps to improve both the utilization efficiency of the systolic array and the classification accuracy of the network until a target model size is reached.
Thus our proposed column combining approach leverages a joint optimization opportunity present in CNNs: for a CNN, we can optimize its topology to fit the structure of the underlying computing hardware, such as systolic arrays, while preserving most of its classification accuracy via network retraining.
The main contributions of the paper are summarized as follows:

Column combining algorithm (Section 3) for packing sparse CNNs with unstructured sparsity for efficient systolic array implementation. The method can retrain the remaining filter weights after column-combine pruning using only a fraction of the original training dataset, mitigating data privacy concerns (Section 6). To ease data routing, a row permuting scheme is described (Section 3.5) that lets a systolic array output contiguous data items for the columns to be combined together in the systolic array of the next layer.

Joint optimization methodology (Algorithm 1 in Section 3) aiming at achieving two objectives simultaneously: high utilization efficiency of the systolic array and high classification accuracy of the CNN. The methodology leverages opportunities presented in CNNs to train for both efficiency and accuracy simultaneously.

Bit-serial systolic arrays (Section 4.2) that allow efficient multiplexing of multiple data streams into a single column of the array in support of column combining. In addition, bit-serial implementations provide flexibility in supporting accumulation at various precisions in the multiplier-accumulators (MACs) of the systolic cells. In this work, we assume bit-serial implementations with 32-bit accumulation (except in Section 7.1.2, where we use 16-bit accumulation) and 8-bit weights and input data. Our approach extends naturally to other precisions.

Cross-layer pipelining (Section 3.6) for CNN inference over a series of systolic arrays, one for each layer. This dramatically reduces the inference latency per input sample (e.g., an input image) and is especially important for real-time applications where it is common to process samples one at a time.

ASIC and FPGA designs to validate the performance gains of our approaches (Section 7) in energy efficiency, area efficiency, and latency.
2 Background and Related Work
In this section, we first provide a brief review of the basic principle of using systolic arrays to implement CNNs and introduce the terminology we use throughout. Then, we review related ASIC and FPGA accelerators for CNN inference, advances in CNN design, weight pruning, and input and weight quantization, all of which have led to large reductions in both model size and computation cost for training and inference.
2.1 Systolic Arrays for Convolutional Layers
It is well known that the bulk of the computation of a convolutional layer for a CNN can be viewed as a matrix-matrix multiplication. Suppose that a convolutional layer has N filters operating on a data volume of depth M, as depicted in Figure (a). Then the result of the convolution computation is the matrix product of the filter matrix and the data matrix, as depicted in Figure (b).
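This view can be made concrete with a short sketch (shapes and names here are illustrative, not from the paper): the N filters are flattened into the rows of a filter matrix, the input volume is unrolled into a data matrix with one column per output position (an im2col layout), and the layer reduces to a single matrix product.

```python
# Sketch: a convolutional layer expressed as a matrix-matrix product.
# Shapes are illustrative; stride 1 and no padding are assumed.
import numpy as np

def conv_as_matmul(filters, data):
    """filters: (N, M, K, K); data: (M, H, W). Returns (N, H-K+1, W-K+1)."""
    N, M, K, _ = filters.shape
    _, H, W = data.shape
    Ho, Wo = H - K + 1, W - K + 1
    # Filter matrix: one row per filter, flattened over (channel, K, K).
    F = filters.reshape(N, M * K * K)
    # Data matrix: one column per output position (im2col).
    cols = np.stack([data[:, i:i+K, j:j+K].ravel()
                     for i in range(Ho) for j in range(Wo)], axis=1)
    return (F @ cols).reshape(N, Ho, Wo)
```

The filter matrix F is exactly what gets pre-stored in the weight-stationary systolic array, while the columns of the data matrix stream through it.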
(a) Computation of a convolutional layer, (b) viewed as a matrix multiplication, and (c) deployed in a weight-stationary systolic array, with skewed input data and output results.
Figure (c) depicts a systolic array design for this matrix multiplication. It is a weight-stationary systolic array in the sense that the filter weights stored in the array do not move during computation, whereas input data continuously move bottom-to-top and result data accumulate left-to-right. For systolic array synchronization, items in the data and result matrices are properly skewed, as shown in the figure. We assume this weight-stationary systolic array design throughout the paper.
2.2 ASIC and FPGA Accelerators for CNNs
Over the past several years, there has been extensive work on constructing efficient ASIC and FPGA designs for CNNs, which generally consider well-studied networks such as LeNet-5 [34], AlexNet [30], and VGG-16 [55], including [52, 45, 67, 66, 70, 69, 43, 44, 46, 53, 54, 56, 7, 42]. One of the main considerations for such systems is minimizing the number of off-chip DRAM accesses for fetching the CNN weights, input samples, and intermediate layer results, as these incur significant energy consumption [22]. Therefore, a main focus of accelerator design is mapping CNN computations such that inputs and weights are fetched only once for all usages within a layer [9, 11, 17, 39]. Another orthogonal direction is designing memory systems that are better suited to the regular structure of CNN inference computation [48, 13, 62, 60]. In Section 7.1, we show that our design achieves state-of-the-art performance in terms of energy efficiency.
FPGAs allow for faster development time and therefore are often used to explore various new research areas for CNNs, such as low-precision and binary networks [14, 59], novel training regimes [15], and model compression through weight pruning or novel CNN structures [21, 16]. In Section 7, we validate the performance of our filter matrix packing algorithm with an FPGA implementation. Additionally, we compare our implementation to previous state-of-the-art FPGA results [57, 43, 16, 70].
2.3 CNNs with Simplified Filter Structures
Figure 5 compares standard CNNs to two recent CNN variants: separable convolution [12, 25] and shift convolution [65]. Separable convolution decouples a standard convolution layer into two smaller convolution layers (depthwise convolution and pointwise convolution) in order to reduce both model size and amount of computation. Each pointwise filter has only a single weight per channel, and therefore does not utilize neighboring pixels in the spatial dimensions (width and height). Shift convolution replaces the depthwise convolution layer with a shift operation that does not require any learned weights. In Section 4, we leverage shift convolution to construct a network that consists only of pointwise layers, as this greatly simplifies the structure of computation in each layer.
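The shift operation can be sketched as a per-channel spatial translation (illustrative code; `np.roll` wraps around at the borders, whereas an actual shift convolution would zero-pad, and the offsets here are arbitrary examples rather than learned or fixed assignments from the paper):

```python
# Sketch of the shift operation that replaces depthwise convolution:
# each channel is translated by a fixed per-channel (dy, dx) offset.
import numpy as np

def shift_layer(x, offsets):
    """x: (C, H, W); offsets: per-channel (dy, dx) spatial translations.
    Note: np.roll wraps around; real shift convolution zero-pads instead."""
    out = np.zeros_like(x)
    for c, (dy, dx) in enumerate(offsets):
        out[c] = np.roll(np.roll(x[c], dy, axis=0), dx, axis=1)
    return out
```

Because the shift itself has no weights, all learned parameters live in the 1x1 pointwise layers, which map cleanly onto the filter-matrix form used by the systolic array.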
2.4 Weight Pruning During Training
Weight pruning methods aim to reduce the number of weights in a trained CNN by removing (pruning) unimportant weights. These pruning techniques have shown that many well-studied networks such as AlexNet and VGG-16 have a large fraction of weights (up to 90%) that can be pruned without any impact on classification accuracy [64, 41, 19, 26, 24, 38].
In Section 3, we propose an iterative pruning procedure, similar to CondenseNet [26], but at the finest granularity of individual weights. This iterative pruning method gradually removes the smallest magnitude weights during training. This leads to sparse models (as low as 10% nonzero in each convolution layer) which still achieve similar performance to the baseline dense models. Additionally, as outlined in Section 3, we prune weights in such a way as to improve the utilization efficiency of the CNN when deployed in the systolic array design for sparse CNNs described in Section 4.
2.5 Input and Weight Quantization
Quantization is another direction in accelerating inference computation. In this work, we adopt a simple linear fixed-point quantization scheme [35]. We quantize both the inputs and weights to an 8-bit fixed-point representation from the 32-bit floating-point representation [36, 20] used during training. This quantization has been shown to cause minimal performance degradation even on challenging datasets [35]. Within a layer, accumulation is done with 32-bit integers, which adds complexity to the bit-serial systolic array design, as discussed in Section 4.2.
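A minimal sketch of linear fixed-point quantization, assuming a symmetric per-tensor scale chosen from the maximum absolute value (the scale-selection rule here is an assumption, not necessarily the exact scheme of [35]):

```python
# Sketch: symmetric linear quantization of float values to signed
# fixed point. The max-abs scale rule is an illustrative assumption.
import numpy as np

def quantize_linear(x, bits=8):
    """Quantize float array x to signed `bits`-bit integers plus a scale."""
    qmax = 2 ** (bits - 1) - 1                 # 127 for 8-bit
    scale = np.max(np.abs(x)) / qmax or 1.0    # avoid divide-by-zero on all-zeros
    q = np.clip(np.round(x / scale), -qmax - 1, qmax).astype(np.int32)
    return q, scale                            # dequantize with q * scale
```

With 8-bit inputs and weights, each product fits in 16 bits, but summing many products per output requires the wider (32-bit) accumulator discussed in Section 4.2.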
3 Column Combining
As discussed in Section 2.4, training a CNN with weight pruning leads to small but highly sparse models with unstructured nonzero weights, which are not directly amenable to efficient implementation in systolic arrays traditionally designed for dense matrix-matrix multiplication. In this section, we propose a column combining algorithm, an iterative training procedure that jointly optimizes the CNN for both classification accuracy and utilization efficiency when deployed in the proposed systolic array described in Section 4.
3.1 Terminology and Definitions
Suppose that we are given the filter matrix of weights associated with a convolutional layer of the CNN (see Figure (a)). The columns of this filter matrix which have nonzero weights on a common row are said to be conflicting on the row, and the row is said to be a conflicting row for these columns. By column combining, we mean combining a given group of columns into a single combined column. In a combined column, for columns which conflict on a row, all nonzero weights on the row are pruned except for the one with the largest magnitude. We refer to this pruning process as column-combine pruning.
We say a group of columns has x conflicts if a total of x weights will be pruned when combining the columns in the group. We say that a group of columns meets the limited-conflict condition for a certain value β if the group has at most β conflicts per row on average. The value β can be less than 1. For example, if β = 0.5, then for every two rows at most one weight is pruned on average.
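These definitions can be made concrete with a short sketch (my own notation: `beta` is the bound on average conflicts per row; on each row, all nonzeros but one in the group would be pruned):

```python
# Sketch of the conflict definitions for a candidate column group.
import numpy as np

def num_conflicts(F, group):
    """F: 2-D weight matrix; group: list of column indices.
    One conflict per weight that would be pruned when combining."""
    nz_per_row = (F[:, group] != 0).sum(axis=1)
    # On each row, all nonzero weights but one are pruned.
    return int(np.maximum(nz_per_row - 1, 0).sum())

def meets_limited_conflict(F, group, beta):
    """Limited-conflict condition: at most beta conflicts per row on average."""
    return num_conflicts(F, group) <= beta * F.shape[0]
```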
3.2 Column Combining Overview
Given a sparse filter matrix, we first partition it into column groups, and then for each group we combine its columns to form a single combined column by applying column-combine pruning. We aim to achieve two objectives simultaneously. First, we pack the given sparse filter matrix into a dense matrix, called a packed filter matrix, with as few combined columns as possible to allow efficient systolic array implementations. Second, we minimize the impact of column-combine pruning on classification accuracy.
For high-density packing, we adopt a dense-column-first combining policy that favors selecting columns whose combination results in high-density combined columns, where the density of a column is the percentage of nonzeros in the column. For high classification accuracy, we then retrain the remaining weights after column-combine pruning. The algorithm involves several parameters: g (the maximum number of combined columns per group), α (the initial pruning percentage), and β (the average number of conflicts per row allowed for each group). Typical values are g = 8 and β = 0.5.
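A greedy sketch of how the dense-column-first policy might be realized (this is my reconstruction under the parameters g and β defined in the surrounding text, not the paper's exact Algorithm 2):

```python
# Sketch: greedy dense-column-first grouping. Columns are visited in
# decreasing density and placed into the first group that stays within
# the size limit g and the limited-conflict bound beta.
import numpy as np

def group_columns(F, g=8, beta=0.5):
    def conflicts(cols):
        nz = (F[:, cols] != 0).sum(axis=1)
        return int(np.maximum(nz - 1, 0).sum())
    order = np.argsort(-(F != 0).mean(axis=0))     # densest columns first
    groups = []
    for c in order:
        for grp in groups:
            if len(grp) < g and conflicts(grp + [int(c)]) <= beta * F.shape[0]:
                grp.append(int(c))
                break
        else:                                       # no existing group fits
            groups.append([int(c)])
    return groups
```

The number of groups returned is the number of columns the packed filter matrix (and hence the systolic array) needs for this layer.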
Figure 6 depicts a column combining example. In (a), a filter matrix associated with a sparse convolutional layer is divided along its columns into three groups (blue, green, and red). The zero-valued weights due to previous pruning steps are omitted for illustration clarity. The objective of column grouping is to select columns that, when combined, achieve high packing efficiency (i.e., are mostly nonzeros). As we show in Section 5, high packing efficiency translates to high utilization efficiency, as more MACs perform useful computation by storing nonzero weights. A small number of conflicting elements are allowed between the columns in a group. For instance, in the blue group, a weight of magnitude 3 in column 1 conflicts with weights of magnitude 7 in column 3 and 8 in column 5. The weights of magnitude 3 and 7 are pruned, and 8 is kept as it has the largest magnitude. In (b), each group is combined into a single column that can be loaded into a column of the proposed systolic array (as discussed in Section 4).
3.3 Column Combining Algorithm
The column combining scheme, outlined in Section 3.2, joins columns of a sparse filter matrix that do not have significant conflicts. Algorithm 1, which calls Algorithm 2 and Algorithm 3, is the top-level algorithm used to train sparse CNNs that can be implemented with systolic arrays of high utilization efficiency. The training process works in an iterative fashion, where at each iteration the model is pruned and packed so that it fits more efficiently in the systolic array. In Section 5, we provide analysis of the effect that each parameter of Algorithm 1 has on both classification accuracy and utilization efficiency.
3.4 Explanations for Column Combining Algorithm
The limited-conflict condition ensures that in column combining for each group (Algorithm 2) at most β weights are pruned per row on average (Algorithm 3). This helps minimize the impact of column-combine pruning on classification accuracy. The fact that each group can have at most g columns (e.g., g = 8) limits the degree of multiplexing that systolic cells (described in Section 4.2) need to support, while still allowing as many as g columns to be combined in order to achieve high packing density.
The initial pruning at the beginning of each iteration decreases the number of iterations required to reach a target number of nonzero weights for the sparse CNN. This is useful when column-combine pruning is set to be less aggressive by using a relatively small β (e.g., β = 0.5) in order to minimize its impact on classification accuracy. Each iteration retrains the network resulting from the initial pruning and column-combine pruning, which mitigates the impact of these pruning operations on classification accuracy. Finally, we note that the dense-column-first combining policy is analogous to popular bin-packing heuristics that pack large items first.
3.5 Row Permutation for Contiguous Column Groups
We can permute rows of the filter matrix of the current layer to ensure that the columns from the same group for the next layer are output next to each other. In Figure 7, systolic arrays for the various layers are denoted as rectangles with a thick black boundary. In (a), a systolic array of eight columns for layer i+1 corresponds to the original sparse filter matrix of this layer, consisting of three column groups, indicated in three colors, for column combining. In (b), column combining is performed on the three column groups, which results in a reduced systolic array of three columns for layer i+1. This reduced systolic array holds a packed filter matrix consisting of three combined columns. A relatively expensive switchbox is needed to route the output of layer i to the input of the reduced systolic array for layer i+1. In (c), by permuting the rows of the layer i filter matrix according to the column groups of layer i+1, we avoid the expensive switchbox; a simple counter that counts the data items in each group can be used instead.
Note that such row permutations are valid, as the column combining operation on a filter matrix is not affected by row permutations on the previous filter matrix. Thus, row permutations for layer i can be determined from the column groups of the row-permuted filter matrix for layer i+1. This makes the columns within each group contiguous and removes the need to reorder the output with a switchbox at inference time.
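The permutation itself is simply a flattening of the next layer's column groups (a minimal sketch; group contents are illustrative):

```python
# Sketch: derive the row order for layer i's outputs from the column
# groups of layer i+1, so each group receives a contiguous block of
# input channels and a counter suffices for routing.
def row_permutation(next_layer_groups):
    """next_layer_groups: column groups of layer i+1, as lists of the
    layer-i output-channel indices they consume. Returns the order in
    which layer i's output rows should be arranged."""
    return [ch for grp in next_layer_groups for ch in grp]
```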
3.6 Cross-layer Pipelining of CNN Inference under Column Combining and Row Permutation
In many realtime scenarios, single sample latency is a more important metric than throughput, as an input sample (e.g., an image) must be processed as soon as it is received by the system, and therefore cannot be processed in large batches.
To address this concern, we propose cross-layer pipelining, in the sense that each output data element from the previous layer is piped immediately into the next layer as soon as it exits the systolic array. Figure 8 shows this pipelining approach for three sparse CNN layers (layer i, layer i+1, and layer i+2), each deployed in a separate systolic array after column combining and row permutation have been applied. The dashed lines emitted from each layer's output denote that each data element is immediately pipelined into the next layer. In Section 7.4, we show that this approach reduces the inference latency of our ASIC implementation of LeNet-5 by 3.5×. By narrowing the systolic arrays for the convolutional layers of a CNN, column combining also reduces data skew, which further reduces latency.
4 Systolic Array System Description for Column Combining
In this section, we describe the systolic array system and its components in support of the proposed column combining approach presented in Section 3.
4.1 Systolic Array System Components
The systolic array system for column-combined CNNs is shown in Figure 9. The filter weights corresponding to the layers of a CNN are stored in the weight buffer. The weights for a CNN layer can then be loaded into the MX cells of the systolic array (discussed in Section 4.2) before matrix multiplication is performed with the input data. The input data is loaded from the input buffer and passed through the shift block (discussed in Section 4.3). The shift block performs shift operations, as depicted in Figure 5, and passes the output to the systolic array in a bit-serial fashion; the array then performs matrix multiplication with the weights stored in the systolic cells. The output of each row in the systolic array is passed to the ReLU block (discussed in Section 4.4), which applies the ReLU activation function. Finally, the result from the ReLU block is passed to the quantization block and stored in the output buffer.
4.2 Bit-serial Systolic Arrays
In this section, we describe our bit-serial implementation of systolic arrays for matrix multiplication. Figure 10 shows our proposed bit-serial MAC design, which is used across all systolic array implementations, for an 8-bit input Xi and an 8-bit filter weight W. The white logic elements implement the bit-serial multiplication between the input Xi and the absolute value of the filter weight. The blue logic elements negate the product based on the sign of the filter weight. The pink full adder performs bit-serial addition between the product and the input accumulation Yi.
We illustrate the scheme with a bit-serial systolic array for multiplying a filter matrix and a data matrix, as depicted in Figure (a). We pre-store in each systolic cell (or simply cell) the corresponding filter weight of the filter matrix. Data arrive from the bottom of the array, and matrix multiplication results come out from the right side of the array.
First, consider a simple scenario where each systolic cell has balanced I/O and computation time. This is the case when input data, filter weights, and accumulation values use words of the same length. Suppose that they are all 8-bit. In this case, under the bit-serial MAC implementation of Figure 10, we have a systolic cell as depicted in Figure (a), or a BL cell in Figure 19. In the corresponding systolic array, as depicted in Figure (a), for data synchronization purposes, neighboring input and accumulation data streams are skewed by one clock to accommodate the communication delay between cells. However, this simple scenario is not applicable to the high-precision accumulation that is necessary for holding the partial results of matrix multiplication [35].
To accommodate high-precision accumulation, bit-serial systolic cells have longer computation time than I/O time. Suppose that input data and filter weights are 8-bit and accumulation values are 32-bit. In this case, under the bit-serial MAC implementation of Figure 10 with k = 32, we have the systolic cell depicted in Figure (b). In the corresponding systolic array, as depicted in Figure (b), there is a 24-clock gap between words in each input data stream. The gap allows for the additional computation time required beyond the I/O time.
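The arithmetic of such a cell can be modeled in software as a shift-and-add loop over the serially arriving input bits (a behavioral sketch only; it captures the k-bit accumulation, not the cell's timing, sign handling, or gate-level structure):

```python
# Behavioral sketch of a bit-serial MAC: one input bit arrives per
# clock, LSB first, and w * x accumulates into a k-bit register.
def bitserial_mac(w, x_bits, acc=0, k=32):
    """w: non-negative weight; x_bits: input bits, LSB first;
    acc: running accumulation; k: accumulator width in bits."""
    mask = (1 << k) - 1
    for cycle, bit in enumerate(x_bits):
        if bit:
            acc = (acc + (w << cycle)) & mask   # add weight shifted by cycle
    return acc
```

With 8 input bits and a 32-bit accumulator, the add chain needs more clocks than the 8 I/O clocks, which is exactly the 24-clock gap described above.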
We can fill these gaps by having each cell process four independent input data streams simultaneously in an interleaved manner, while expanding the processing power and accumulation data path by 4×, as depicted in Figure (c) and the IL cell in Figure 19. The corresponding systolic array is depicted in Figure (c), with more details in Figure (b).
Given the input channel groups determined by the column combining algorithm, we now describe an efficient systolic array implementation which can utilize the combined input channel groups. In Figure 19, the multiplexed-input (MX) cell takes in two x inputs from two input channels, utilizes one of them inside each MAC, and forwards both to the cell above. Note that while for illustration simplicity this figure shows only two instances of input x, in our ASIC and FPGA designs we pack up to 8 channels (requiring 8 instances of input x) into a single cell. This highlights the importance of the bit-serial design: in the case of 8 inputs, each cell takes in only 8 bits per cycle, as opposed to a bit-parallel design where each cell would require 64 input bits per cycle for 8-bit inputs.
Figure (c) shows how a systolic array connects the MX cells. In this example, for the first column, the first and third rows (filters) use input channel 1, denoted by the weights stored within those cells, and the second row uses input channel 2, denoted by the weight stored in its cell. As shown, these channel indexes are after row permutation (Section 3.5), and are therefore guaranteed to be contiguous.
4.3 Shift Block
Figure 26 shows the design of the shift block. Based on the direction of the spatial translation specified by the shift control signal, the memory controller fetches the corresponding 8-bit input maps from the input buffer into the register array, which generates the input to the systolic arrays in a bit-serial fashion. We use double buffering to prefetch the next data tile, so that the output time can overlap with the data transfer overhead from the input buffer to the register arrays.
4.4 ReLU and Quantization
Figure 26 shows the design of the ReLU block. The 32-bit input stream arrives in a bit-serial fashion and is held in a register array until the last bit arrives. The sign of the integer represented by the 32-bit input stream is determined by the most significant bit (the 32nd bit). If the 32nd bit is 1, the multiplexer outputs a 32-bit stream of 0s; otherwise it simply outputs the input stream. The output of the ReLU block is then requantized and saved in the output buffer. This output can then be transferred to the input buffer to be used as input for the following layer.
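The sign-bit test and requantization can be modeled as follows (a sketch: the right-shift rescaling and 8-bit clamp are assumptions standing in for the quantization block, whose exact scaling rule is not specified here):

```python
# Sketch: ReLU via the accumulator's sign bit, then requantization
# back to 8 bits. `acc` is a k-bit two's-complement value held as an
# unsigned integer, matching the bit-serial register contents.
def relu_requantize(acc, shift=8, k=32):
    if acc & (1 << (k - 1)):        # most significant bit = sign bit
        return 0                    # negative -> ReLU outputs zero
    return min(acc >> shift, 255)   # assumed rescale + clamp to 8-bit
```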
5 Performance Analysis for the Column Combining Algorithm
We analyze our column combining approach described in Section 3 on two datasets: MNIST [33] (28×28 greyscale images of handwritten digits) and CIFAR-10 [29] (32×32 RGB images of objects). We evaluate the approach on three well-studied networks: LeNet-5 [33] on MNIST, and VGG-16 [55] and ResNet-20 [23] on CIFAR-10. Each convolution layer in all networks is replaced by a shift followed by a pointwise convolution (shift convolution in Figure 5) to fit our systolic array system design covered in Section 4. All networks are trained using Stochastic Gradient Descent (SGD) with an initial learning rate of 0.05 for LeNet-5 and 0.2 for VGG-16 and ResNet-20. A Nesterov momentum of 0.9 [51] is used for all networks. A cosine-shaped learning rate schedule [37] is used to decay the learning rate over each iteration of Algorithm 1, ending at 20% of the initial learning rate. After the target number of weights has been reached, 100 additional epochs of training are performed while further decaying the learning rate. Unless stated otherwise, the same hyperparameter settings are used for all networks.
5.1 Iterative Training with Column Combining
Training a network with column combining occurs over a series of iterations (Algorithm 1), where, at each iteration, weights are pruned to decrease the model size and increase the utilization efficiency when deployed in the systolic array. After each pruning stage, retraining is performed to recover the loss in classification accuracy due to the pruned weights. Figure (a) shows the classification accuracy and number of nonzero weights for the ResNet-20 model over each training epoch. The dashed vertical lines denote the beginning of an iteration of Algorithm 1, where initial pruning (using α) and column-combine pruning (using g and β) are performed. At each epoch, the number of weights in the model is shown by the red line. The first iteration of pruning decreases the model size most substantially (from 740K to 440K nonzero weights), and subsequent pruning stages decrease the model size by smaller amounts due to the reduced pruning percentage. When the target number of nonzero weights is reached, 125K in this instance, a final 100 epochs of training are performed, which improves the classification accuracy by an additional 5%.
5.2 Impact of Number of Columns per Group
The number of columns allowed to be added to a group during column grouping (Algorithm 2) is determined by the parameter g. Figure (b) shows the classification accuracy and utilization efficiency of 5 ResNet-20 models for the CIFAR-10 dataset trained using Algorithm 1 while varying g from 1 to 16. At g = 1, no column combining or column-combine pruning is performed, as only a single column is allowed per group. This network is equivalent to a standard systolic array operating on sparse filter matrices and achieves a low utilization efficiency. Note that, for this analysis, utilization efficiency and packing efficiency are interchangeable. As g is increased, the utilization efficiency improves up to g = 8, with the classification accuracy dropping by approximately 1% due to column-combine pruning. For g > 8, there is no improvement in utilization efficiency, as columns cannot be further combined due to the higher degree of conflicts between the remaining nonzero weights.
5.3 Impact of the LimitedConflict Condition
The limited-conflict condition, as described in Section 3.1, allows for β conflicting entries per row on average between columns within a group. All but the largest-magnitude weight among conflicting weights are pruned during column-combine pruning (Algorithm 3). Figure (c) shows how classification accuracy and utilization efficiency vary as a function of β for 5 ResNet-20 networks trained on the CIFAR-10 dataset. Larger values of β allow for more conflicts between the columns in a group and therefore prune more weights, possibly with relatively large magnitudes, in order to achieve higher utilization efficiency across all layers in the CNN. This dramatically increases the utilization efficiency, from 52% at the smallest β to 93% at the largest. As discussed in the previous subsection, column-combine pruning has a small impact on classification accuracy (around 1%), since retraining is performed after each round of pruning to allow the remaining weights to adjust to the loss of the pruned weights.
5.4 Dramatic Tiling Reduction in Partitioned Matrix Multiplication with Column Combining
When a systolic array is smaller than the weight matrix of a convolutional layer, matrix multiplication can be performed in multiple passes, where each pass executes matrix multiplication between a submatrix of the layer weights and the corresponding input data. Figure (a) shows how this partitioning is performed on a sparse filter matrix (96 rows by 94 columns) which is larger than the systolic array (32 rows by 32 columns). The filter matrix is partitioned into 9 tiles, each with a maximum size of 32 by 32, and the input data is tiled in a similar manner along the columns, but not along the rows (batch size × image width × image height).
The full matrix multiplication is performed by alternating between weight loads and matrix multiplications for each of the submatrices (tiles). The filter matrix and input data enter the systolic array in a skewed fashion, as depicted, in order to maintain synchronization within the systolic array. Note that every systolic cell is busy all the time, either performing matrix multiplication or loading the weights for the next tile. ReLU and quantization are performed on the output of the systolic array after the final tile for a set of rows in the filter matrix. (Note that in Section 7, we evaluate settings where the CNN must be partitioned into tiles, as shown in Figure (a), and also settings where each layer fits entirely into a systolic array, requiring no partitioning.)
We have used a ResNet-20 model in the performance study of our proposed column combining scheme. For illustration purposes, consider here the third layer of the model. Figure (b) shows a sparse filter matrix and the corresponding packed filter matrix after column combining, which is stored in the systolic array with MX cells (described in Section 4). As in Figure (a), the sparse filter matrix has 96 rows and 94 columns, with only 16% of the weights being nonzero. For a 32×32 systolic array, this sparse filter matrix is partitioned into 9 tiles (denoted by the red lines) in order to perform the full matrix multiplication. The packed filter matrix is the output after column combining, which has arranged the 94 columns of the sparse filter matrix into 17 groups. Each group is a single column in the packed filter matrix, which can then be loaded into the systolic array. This packed format has 89% nonzeros and requires only 3 tiles to perform matrix multiplication (a 3× reduction).
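The tile counts in this example follow from ceiling division in each dimension; a minimal sketch:

```python
# Sketch: number of tiles needed to map an R x C (packed) filter
# matrix onto an r x c systolic array.
import math

def num_tiles(rows, cols, array_rows=32, array_cols=32):
    return math.ceil(rows / array_rows) * math.ceil(cols / array_cols)
```

For the layer above, the 96×94 sparse matrix needs 3 row tiles × 3 column tiles = 9 tiles, while the packed 96×17 matrix needs 3 × 1 = 3 tiles.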
Figure (a) shows the number of tiles required to perform matrix multiplication with a 32×32 systolic array for each layer in ResNet-20 models trained using Algorithm 1 under three different parameter settings. The baseline setting trains the CNN with standard pruning but without column combining or column-combine pruning. The column-combine setting uses the same CNN trained in the baseline setting, but allows for column combining without column-combine pruning (β = 0). Finally, the column-combine pruning setting trains the CNN with column combining and performs column-combine pruning to remove conflicting entries. The column-combine setting reduces the number of tiles over the baseline setting by only 10% at most. By comparison, the column-combine pruning setting reduces the number of tiles by a substantial margin across all layers and achieves a 5× reduction in the largest layer (layer 19). Generally, this shows that it is difficult to effectively combine sparse columns, as a single conflict in any row of a potential group makes the combination invalid. By adding a modest amount of column-combine pruning (e.g., β = 0.5), the combining algorithm is able to substantially improve the utilization efficiency and decrease the number of tiles.
6 Column Combining with Limited Datasets
In many real-world scenarios, customers may provide pretrained models to vendors to be deployed on their devices (e.g., mobile devices). In these settings, a customer may not wish to divulge the datasets used to train the model to the vendor for a number of reasons, such as the dataset containing sensitive private information or constituting a competitive advantage. In this scenario, model pruning is difficult, as pruning weights without retraining leads to a significant degradation in classification accuracy.
We propose that these data privacy concerns can be largely mitigated by providing only a subset of the original dataset for the proposed iterative column combining and retraining process. Figure (b) compares the effects of column combining on a pretrained dense ResNet20 model, trained on the full CIFAR10 training dataset, to a network trained from scratch, such as the one depicted in Figure (a), over different fractions of the training data. The largest difference in performance between the two approaches occurs when only 1% of the full training data is used (a 15% difference in classification accuracy), as the weights in the pretrained model are already initialized to reasonable values. At 15% of the full training data, the pretrained model achieves over 90% classification accuracy. This shows that a small amount of training data can be sufficient to perform column combining while still maintaining a relatively high classification accuracy. By comparison, training a new model requires 35% of the training dataset to achieve over 90% classification accuracy. Our model pruning and retraining method can be viewed as part of a larger area of research shared by teacher-student networks [50, 58] and curriculum learning [8].
7 Hardware Implementation Experiments and Performance Evaluation
In this section, we evaluate the performance of our column combining system described in Section 4 based on design experiments in both ASIC and FPGA. Throughout, we compare designs in terms of performance metrics concerning accuracy, throughput, area efficiency, and energy efficiency. Additionally, we pay attention to performance on a single or small number of input samples, e.g., the end-to-end latency or energy required to process a single input sample such as a 28x28 greyscale image with LeNet5. As stated earlier in Section 3.6, in real-time scenarios, single-sample latency is a more important metric than throughput, as an input sample must be processed immediately for early prediction.
Section 7.1 describes our ASIC implementation and compares it against a baseline systolic array without column combining on three different CNNs (LeNet5, VGG16, and ResNet20). In addition, we compare our ASIC design with other state-of-the-art ASIC accelerators for LeNet5. In Section 7.2, we provide a mathematical analysis of optimality in energy efficiency. We argue that for CNNs such as LeNet5, which incur a relatively small I/O energy compared to that of the MAC operations, packing with column combining leads to systolic array designs that are near optimal in energy efficiency. Section 7.3 compares our FPGA implementation with other FPGA CNN accelerators on CIFAR10. Section 7.4 compares the single-sample latency of our ASIC implementations with and without cross-layer pipelining.
7.1 ASIC Implementation and Evaluation
We synthesize our ASIC design using the Synopsys Design Compiler [2] with the 45nm NanGate Open Cell Library [3] and CACTI 7.0 [1]. We estimate the hardware performance of static random-access memory (SRAM) with CACTI 7.0 and synthesize the remaining components of the design, including the systolic array with MX cells (Section 4.2), Shift (Section 4.3), and ReLU and Quantization (Section 4.4), using the Synopsys Design Compiler.

We analyze our ASIC implementation across two scenarios. First, in Section 7.1.1, we compare the bit-serial systolic array without column combining (Figure (b)) to our bit-serial design with column combining (Figure (c)), where a single systolic array is used to process all CNN layers with tiling, as presented in Section 5.4. Then, in Section 7.1.2, we compare our column combining ASIC implementation against prior ASIC implementations of LeNet5. In the second scenario, each layer fits entirely into a systolic array, so tiling is not required.
7.1.1 Systolic Array Comparison using Tiling
To analyze our ASIC implementation of column combining, we implement the three networks discussed in Section 5 (LeNet5, VGG16, and ResNet20) using a single systolic array of size 32×32 and perform partitioned matrix multiplication as shown in Figure (a). For this scenario, 32-bit accumulation is used for all networks. We report the energy consumption for processing one input sample for each CNN across the three column combining parameter settings presented in Section 5.4: the baseline setting uses standard pruning without column combining, the column-combine setting allows for column combining without column-combine pruning, and the column-combine pruning setting performs column-combine pruning to improve utilization efficiency by removing conflicting entries.
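The tile counts for partitioned matrix multiplication follow directly from the matrix and array dimensions: a filter matrix is covered by ceil(rows/32) × ceil(cols/32) tiles. A minimal sketch (the function name is an illustrative choice):

```python
from math import ceil

def num_tiles(rows, cols, array_size=32):
    """Tiles needed to cover a rows x cols filter matrix with an
    array_size x array_size systolic array for partitioned
    matrix multiplication."""
    return ceil(rows / array_size) * ceil(cols / array_size)

# The 96x95 sparse filter matrix from Section 5 needs 9 tiles;
# packed into 17 combined columns, only 3 tiles remain.
print(num_tiles(96, 95))  # 9
print(num_tiles(96, 17))  # 3
```

This reproduces the 3× tile reduction reported for the example layer above.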
Figure 37 depicts the throughput, the number of tiles required to perform matrix multiplication across all layers, the energy consumption per input sample, and the classification accuracy for each CNN across the three parameter settings. For all three CNN structures, the column-combine pruning setting greatly reduces the energy consumption and the number of tiles, by 4× to 6× over the other two settings. Furthermore, the column-combine pruning setting achieves greater throughput than the other settings across all networks.
7.1.2 Comparison Against Prior Designs on LeNet5
We compare our ASIC implementation of LeNet5, trained on MNIST, to prior state-of-the-art CNN accelerator designs. For this scenario, we use 16-bit accumulations for the systolic array, as all layers are small in terms of filter sizes and therefore do not require 32-bit accumulations. With 16-bit accumulations, a single MAC operation takes half the number of cycles compared with 32-bit accumulations. All other designs use LeNet5, except for SpiNNaker [28], which uses a Deep Belief Network, and TrueNorth [6], which uses a Spiking Neural Network. Two SC-DCNN [47] designs were chosen for comparison: SC-DCNN (type a) has higher classification accuracy, while SC-DCNN (type b) has higher energy efficiency. To compare with these designs, we implement two configurations of LeNet5, Ours (design 1) and Ours (design 2), by running the column-combining algorithm with two different target numbers of nonzero weights (8K for design 1 and 5K for design 2).

Table 1 compares all designs in terms of accuracy, area efficiency, and energy efficiency. Generally, our designs have both the highest area efficiency and the highest energy efficiency across all the designs. Compared to SC-DCNN (type a), our design 1 achieves a 2.2× improvement in area efficiency and a 3× improvement in energy efficiency, while also attaining a higher classification accuracy. Similarly, our design 2 realizes a higher classification accuracy than SC-DCNN (type b), while achieving a 1.4× improvement in area efficiency and a 1.7× improvement in energy efficiency.
7.2 Optimality in Energy Efficiency
We provide an analysis showing that our systolic array design can achieve near-optimal energy efficiency. The total energy consumption of processing an input sample is

E_total = E_MAC + E_SRAM = (1 + η) · e_MAC · N,

where E_MAC and E_SRAM are the energy consumption of all MAC computations and of SRAM, respectively, e_MAC is the energy consumption of a single MAC operation, N is the number of MAC operations performed by the pruned network, N_opt is the optimal (minimum) number of MAC operations, and η denotes the ratio between E_SRAM and E_MAC. Suppose that all designs have the same e_MAC and η. Then the energy efficiency of a design, measured in useful MAC operations per unit energy, is

Energy Eff. = N_opt / E_total = N_opt / ((1 + η) · e_MAC · N),

and the optimal energy efficiency, attained when only the N_opt required MAC operations are performed and the SRAM overhead is negligible, is

Optimal Energy Eff. = 1 / e_MAC.

We have observed from synthesized results that when the input size is relatively small, η tends to be small, as is the case for both LeNet5 and ResNet20. In this case, we have

Energy Eff. / Optimal Energy Eff. = (N_opt / N) / (1 + η) ≈ N_opt / N.

Note that N_opt / N is the packing efficiency achievable by column combining. Thus, when η is small, the ratio between Energy Eff. and Optimal Energy Eff. is mostly determined by the packing efficiency.

Consider, for example, the scenario depicted in Figure (c). Column combining can achieve a packing efficiency of about 94.5% with a modest degradation in classification accuracy of about 0.7% in absolute terms. Thus, in this case, the energy efficiency of our design is about 94.5% of the optimal energy efficiency for small η.
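Under the model above, the gap to optimal factors into the packing efficiency and the SRAM overhead. A tiny numerical sketch (the η values below are illustrative assumptions, not measured ratios):

```python
def efficiency_ratio(packing_eff, eta):
    """Achieved / optimal energy efficiency under the model above:
    (N_opt / N) / (1 + eta), where packing_eff = N_opt / N and
    eta = E_SRAM / E_MAC."""
    return packing_eff / (1.0 + eta)

# With the 94.5% packing efficiency from column combining and a
# small assumed SRAM-to-MAC energy ratio:
print(round(efficiency_ratio(0.945, 0.0), 3))   # 0.945
print(round(efficiency_ratio(0.945, 0.05), 3))  # 0.9
```

As η → 0, the achieved efficiency approaches the packing efficiency itself, which is the sense in which the design is near optimal.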
7.3 FPGA Implementation and Evaluation
For our FPGA implementation, we use the Xilinx XCKU035-1FBVA676C chip [5]. We synthesize our design using the Xilinx Vivado Design Suite (2017.4) [4]. We use 32-bit accumulation for the systolic array implementation.
Table 2 compares our ResNet20 implementation to other FPGA implementations for CIFAR10 in terms of classification accuracy and energy efficiency. Our design achieves an accuracy of 93.1%, which is around 5-6% higher than the other models. Moreover, our design achieves a 3× improvement in energy efficiency over the next best design. While it is possible for the other designs to increase their accuracy by using more hardware, it would be difficult for them to attain an energy efficiency as high as that of our design.
7.4 Dramatic Reduction in End-to-end Inference Latency with Cross-layer Pipelining
In this section, we evaluate the FPGA performance of cross-layer pipelining, described in Section 3.6, in terms of reduced end-to-end inference latency for a single sample on LeNet5 and ResNet20. We found that cross-layer pipelining reduces the latency significantly compared to designs without pipelining, for both LeNet5 and ResNet20.
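The intuition behind the reduction can be captured with a crude latency model (an illustrative sketch, not the paper's analysis): without pipelining, layer i+1 waits for layer i to finish completely; with cross-layer pipelining, it starts after only a short fill delay of its predecessor.

```python
def sequential_latency(layer_times):
    """Single-sample latency when layers run one after another."""
    return sum(layer_times)

def pipelined_latency(layer_times, fill_delays):
    """Single-sample latency with cross-layer pipelining: each layer
    starts once its predecessor has produced enough outputs to feed
    it (fill_delays[i] <= layer_times[i]), so only the last layer
    contributes its full time."""
    return sum(fill_delays[:-1]) + layer_times[-1]

# Three equal layers; each successor can start after 10% of its
# predecessor's work (assumed numbers for illustration).
t = [100.0, 100.0, 100.0]
d = [10.0, 10.0, 10.0]
print(sequential_latency(t))    # 300.0
print(pipelined_latency(t, d))  # 120.0
```

The deeper the network and the smaller the fill delays, the larger the latency reduction, which is consistent with the larger gain observed for ResNet20 than for LeNet5.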
Furthermore, we compare our column-combined ResNet20 model with cross-layer pipelining on FPGA to other hardware implementations, including GPU, CPU, and FPGA accelerators, trained on CIFAR10. Table 3 shows the classification accuracy and end-to-end latency for a single input sample of each design. The 652μs latency of [18] shown in Table 3 includes only the latency of the convolutional layers (thus the total is >652μs). Our design achieves an end-to-end latency substantially smaller than that of the next best implementation, while also obtaining a higher classification accuracy.
8 Conclusion
In this paper, for CNN inference, we have presented a solution to a long-standing parallel processing challenge: how to make efficient use of regular parallel processing arrays, such as systolic arrays, for sparse computations. Specifically, for a given sparse CNN, we have proposed a novel approach of using column combining to pack the filter matrix associated with each convolutional layer for its efficient systolic array implementation. In combining columns, we prune all weights on conflicting rows but the one with the largest magnitude. We then bring up the classification accuracy of the pruned network via retraining. We iterate on these column-combining and network-retraining steps to improve both the utilization efficiency of the systolic array and the classification accuracy of the network. This joint optimization makes such a design feasible for sparse CNN inference: for a given CNN, we can optimize its topology to fit the structure of the underlying computing hardware, such as a given systolic array, while preserving most of its classification accuracy via network training.
Being able to transform sparse computations to fit highly efficient regular processor arrays is powerful. As demonstrated in this paper, our proposed column combining approach can increase the utilization efficiency of a systolic array by approximately 4×, with only a slight increase in the complexity of systolic cells to provide multiplexing (MX) support. This has led to superior performance of our proposed method over prior art under metrics such as energy efficiency (3×) and inference latency (12×).
References
 [1] CACTI: An integrated cache and memory access time, cycle time, area, leakage, and dynamic power model. https://github.com/HewlettPackard/cacti.
 [2] Design Compiler: RTL synthesis. https://www.synopsys.com/support/training/rtlsynthesis/designcompilerrtlsynthesis.html.
 [3] NanGate FreePDK45 Open Cell Library. http://www.nangate.com/?page_id=2325.
 [4] Vivado Design Suite - HLx Editions. https://www.xilinx.com/products/designtools/vivado.html.
 [5] Xilinx Inc. XCKU035-1FBVA676C. https://www.digikey.ca/productdetail/en/xilinxinc/XCKU0351FBVA676C/1221989ND/6132038.

 [6] Filipp Akopyan, Jun Sawada, Andrew Cassidy, Rodrigo Alvarez-Icaza, John Arthur, Paul Merolla, Nabil Imam, Yutaka Nakamura, Pallab Datta, Gi-Joon Nam, Brian Taba, Michael Beakes, Bernard Brezzo, Jente Kuang, Rajit Manohar, William Risk, Bryan Jackson, and Dharmendra Modha. TrueNorth: Design and tool flow of a 65 mW 1 million neuron programmable neurosynaptic chip. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 34(10):1537–1557, 2015.
 [7] Suyoung Bang, Jingcheng Wang, Ziyun Li, Cao Gao, Yejoong Kim, Qing Dong, Yen-Po Chen, Laura Fick, Xun Sun, Ron Dreslinski, Trevor Mudge, Hun Seok Kim, David Blaauw, and Dennis Sylvester. 14.7 A 288µW programmable deep-learning processor with 270KB on-chip weight storage using non-uniform memory hierarchy for mobile intelligence. In Solid-State Circuits Conference (ISSCC), 2017 IEEE International, pages 250–251. IEEE, 2017.

 [8] Yoshua Bengio, Jérôme Louradour, Ronan Collobert, and Jason Weston. Curriculum learning. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 41–48. ACM, 2009.
 [9] Tianshi Chen, Zidong Du, Ninghui Sun, Jia Wang, Chengyong Wu, Yunji Chen, and Olivier Temam. DianNao: A small-footprint high-throughput accelerator for ubiquitous machine-learning. ACM SIGPLAN Notices, 49(4):269–284, 2014.
 [10] YuHsin Chen, Tushar Krishna, Joel S Emer, and Vivienne Sze. Eyeriss: An energyefficient reconfigurable accelerator for deep convolutional neural networks. IEEE Journal of SolidState Circuits, 52(1):127–138, 2017.
 [11] Yunji Chen, Tao Luo, Shaoli Liu, Shijin Zhang, Liqiang He, Jia Wang, Ling Li, Tianshi Chen, Zhiwei Xu, Ninghui Sun, and Olivier Temam. Dadiannao: A machinelearning supercomputer. In Proceedings of the 47th Annual IEEE/ACM International Symposium on Microarchitecture, pages 609–622. IEEE Computer Society, 2014.
 [12] François Chollet. Xception: Deep learning with depthwise separable convolutions. arXiv preprint arXiv:1610.02357, 2016.

 [13] Jason Clemons, Chih-Chi Cheng, Iuri Frosio, Daniel Johnson, and Stephen W. Keckler. A patch memory system for image processing and computer vision. In Microarchitecture (MICRO), 2016 49th Annual IEEE/ACM International Symposium on, pages 1–13. IEEE, 2016.
 [14] Matthieu Courbariaux, Itay Hubara, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio. Binarized neural networks: Training deep neural networks with weights and activations constrained to +1 or -1. arXiv preprint arXiv:1602.02830, 2016.
 [15] Roberto DiCecco, Lin Sun, and Paul Chow. Fpgabased training of convolutional neural networks with a reduced precision floatingpoint library. In Field Programmable Technology (ICFPT), 2017 International Conference on, pages 239–242. IEEE, 2017.
 [16] Caiwen Ding, Siyu Liao, Yanzhi Wang, Zhe Li, Ning Liu, Youwei Zhuo, Chao Wang, Xuehai Qian, Yu Bai, Geng Yuan, Xiaolong Ma, Yipeng Zhang, Jian Tang, Qinru Qiu, Xue Lin, and Bo Yuan. Circnn: accelerating and compressing deep neural networks using blockcirculant weight matrices. In Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture, pages 395–408. ACM, 2017.
 [17] Zidong Du, Robert Fasthuber, Tianshi Chen, Paolo Ienne, Ling Li, Tao Luo, Xiaobing Feng, Yunji Chen, and Olivier Temam. Shidiannao: Shifting vision processing closer to the sensor. In ACM SIGARCH Computer Architecture News, volume 43, pages 92–104. ACM, 2015.
 [18] J. H. Lin, T. Xing, R. Zhao, Z. Zhang, M. B. Srivastava, Z. Tu, and R. K. Gupta. Binarized convolutional neural networks with separable filters for efficient hardware acceleration. In CVPR Workshops, pages 344–352, 2017.
 [19] Scott Gray, Alec Radford, and Diederik Kingma. Gpu kernels for blocksparse weights. https://s3uswest2.amazonaws.com/openaiassets/blocksparse/blocksparsepaper.pdf, 2017. [Online; accessed 12January2018].
 [20] Philipp Gysel, Mohammad Motamedi, and Soheil Ghiasi. Hardwareoriented approximation of convolutional neural networks. arXiv preprint arXiv:1604.03168, 2016.
 [21] Song Han, Junlong Kang, Huizi Mao, Yiming Hu, Xin Li, Yubin Li, Dongliang Xie, Hong Luo, Song Yao, Yu Wang, Huazhong Yang, and William J. Dally. Ese: Efficient speech recognition engine with sparse lstm on fpga. In Proceedings of the 2017 ACM/SIGDA International Symposium on FieldProgrammable Gate Arrays, pages 75–84. ACM, 2017.
 [22] Song Han, Huizi Mao, and William J Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. arXiv preprint arXiv:1510.00149, 2015.

 [23] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
 [24] Yihui He, Xiangyu Zhang, and Jian Sun. Channel pruning for accelerating very deep neural networks.
 [25] Andrew G Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.
 [26] Gao Huang, Shichen Liu, Laurens van der Maaten, and Kilian Q Weinberger. Condensenet: An efficient densenet using learned group convolutions. arXiv preprint arXiv:1711.09224, 2017.

 [27] Norman P. Jouppi, Cliff Young, Nishant Patil, David Patterson, Gaurav Agrawal, Raminder Bajwa, Sarah Bates, Suresh Bhatia, Nan Boden, Al Borchers, Rick Boyle, Pierre-luc Cantin, Clifford Chao, Chris Clark, Jeremy Coriell, Mike Daley, Matt Dau, Jeffrey Dean, Ben Gelb, Tara Vazir Ghaemmaghami, Rajendra Gottipati, William Gulland, Robert Hagmann, C. Richard Ho, Doug Hogberg, John Hu, Robert Hundt, Dan Hurt, Julian Ibarz, Aaron Jaffey, Alek Jaworski, Alexander Kaplan, Harshit Khaitan, Daniel Killebrew, Andy Koch, Naveen Kumar, Steve Lacy, James Laudon, James Law, Diemthu Le, Chris Leary, Zhuyuan Liu, Kyle Lucke, Alan Lundin, Gordon MacKean, Adriana Maggiore, Maire Mahony, Kieran Miller, Rahul Nagarajan, Ravi Narayanaswami, Ray Ni, Kathy Nix, Thomas Norrie, Mark Omernick, Narayana Penukonda, Andy Phelps, Jonathan Ross, Matt Ross, Amir Salek, Emad Samadiani, Chris Severn, Gregory Sizikov, Matthew Snelham, Jed Souter, Dan Steinberg, Andy Swing, Mercedes Tan, Gregory Thorson, Bo Tian, Horia Toma, Erick Tuttle, Vijay Vasudevan, Richard Walter, Walter Wang, Eric Wilcox, and Doe Hyun Yoon. In-datacenter performance analysis of a tensor processing unit. In Proceedings of the 44th Annual International Symposium on Computer Architecture, ISCA '17, pages 1–12, New York, NY, USA, 2017. ACM.
 [28] Muhammad Mukaram Khan, David R. Lester, Luis A. Plana, A. Rast, Xin Jin, Eustace Painkras, and Stephen B. Furber. SpiNNaker: Mapping neural networks onto a massively-parallel chip multiprocessor. In Neural Networks, 2008. IJCNN 2008. (IEEE World Congress on Computational Intelligence). IEEE International Joint Conference on, pages 2849–2856. IEEE, 2008.
 [29] Alex Krizhevsky, Vinod Nair, and Geoffrey Hinton. The cifar10 dataset, 2014.
 [30] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.
 [31] H. T. Kung. Why systolic architectures? IEEE Computer, 15:37–46, 1982.
 [32] H. T. Kung and C. E. Leiserson. Systolic arrays (for vlsi). In Sparse Matrix Proceedings 1978, pages 256–282. Society for Industrial and Applied Mathematics, 1979.

 [33] Yann LeCun. The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/, 1998.
 [34] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
 [35] Darryl Lin, Sachin Talathi, and Sreekanth Annapureddy. Fixed point quantization of deep convolutional networks. In International Conference on Machine Learning, pages 2849–2858, 2016.
 [36] Zhouhan Lin, Matthieu Courbariaux, Roland Memisevic, and Yoshua Bengio. Neural networks with few multiplications. arXiv preprint arXiv:1510.03009, 2015.
 [37] Ilya Loshchilov and Frank Hutter. Sgdr: stochastic gradient descent with restarts. Learning, 10:3, 2016.
 [38] JianHao Luo, Jianxin Wu, and Weiyao Lin. Thinet: A filter level pruning method for deep neural network compression. arXiv preprint arXiv:1707.06342, 2017.
 [39] Yufei Ma, Yu Cao, Sarma Vrudhula, and Jaesun Seo. Optimizing loop operation and dataflow in fpga acceleration of deep convolutional neural networks. In Proceedings of the 2017 ACM/SIGDA International Symposium on FieldProgrammable Gate Arrays, pages 45–54. ACM, 2017.
 [40] Rick Merritt. Arm at risk on AI chip market. EE Times India, April 2018.
 [41] Sharan Narang, Eric Undersander, and Gregory F. Diamos. Blocksparse recurrent neural networks. CoRR, abs/1711.02782, 2017.
 [42] Jian Ouyang, Ephrem Wu, Jing Wang, Yupeng Li, and Hanlin Xie. Xpu: A programmable fpga accelerator for diverse workloads. Hot Chips, 2017.
 [43] Jinhwan Park and Wonyong Sung. Fpga based implementation of deep neural networks using onchip memory only. In Acoustics, Speech and Signal Processing (ICASSP), 2016 IEEE International Conference on, pages 1011–1015. IEEE, 2016.
 [44] Jongse Park, Hardik Sharma, Divya Mahajan, Joon Kyung Kim, Preston Olds, and Hadi Esmaeilzadeh. Scaleout acceleration for machine learning. In Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture, pages 367–381. ACM, 2017.
 [45] Jiantao Qiu, Jie Wang, Song Yao, Kaiyuan Guo, Boxun Li, Erjin Zhou, Jincheng Yu, Tianqi Tang, Ningyi Xu, Sen Song, Yu Wang, and Huazhong Yang. Going deeper with embedded fpga platform for convolutional neural network. In Proceedings of the 2016 ACM/SIGDA International Symposium on FieldProgrammable Gate Arrays, pages 26–35. ACM, 2016.
 [46] Brandon Reagen, Paul Whatmough, Robert Adolf, Saketh Rama, Hyunkwang Lee, Sae Kyu Lee, José Miguel HernándezLobato, GuYeon Wei, and David Brooks. Minerva: Enabling lowpower, highlyaccurate deep neural network accelerators. In ACM SIGARCH Computer Architecture News, volume 44, pages 267–278. IEEE Press, 2016.
 [47] Ao Ren, Zhe Li, Caiwen Ding, Qinru Qiu, Yanzhi Wang, Ji Li, Xuehai Qian, and Bo Yuan. Scdcnn: Highlyscalable deep convolutional neural network using stochastic computing. ACM SIGOPS Operating Systems Review, 51(2):405–418, 2017.
 [48] Minsoo Rhu, Natalia Gimelshein, Jason Clemons, Arslan Zulfiqar, and Stephen W Keckler. vdnn: Virtualized deep neural networks for scalable, memoryefficient neural network design. In Microarchitecture (MICRO), 2016 49th Annual IEEE/ACM International Symposium on, pages 1–13. IEEE, 2016.
 [49] R. Rojas. Neural Networks  A Systematic Introduction, Chapter 18: Hardware for Neural Networks. SpringerVerlag, 1996.
 [50] Adriana Romero, Nicolas Ballas, Samira Ebrahimi Kahou, Antoine Chassang, Carlo Gatta, and Yoshua Bengio. Fitnets: Hints for thin deep nets. arXiv preprint arXiv:1412.6550, 2014.
 [51] Sebastian Ruder. An overview of gradient descent optimization algorithms. arXiv preprint arXiv:1609.04747, 2016.
 [52] Murugan Sankaradas, Venkata Jakkula, Srihari Cadambi, Srimat Chakradhar, Igor Durdanovic, Eric Cosatto, and Hans Peter Graf. A massively parallel coprocessor for convolutional neural networks. In Applicationspecific Systems, Architectures and Processors, 2009. ASAP 2009. 20th IEEE International Conference on, pages 53–60. IEEE, 2009.
 [53] Yongming Shen, Michael Ferdman, and Peter Milder. Escher: A cnn accelerator with flexible buffering to minimize offchip transfer. In FieldProgrammable Custom Computing Machines (FCCM), 2017 IEEE 25th Annual International Symposium on, pages 93–100. IEEE, 2017.
 [54] Yongming Shen, Michael Ferdman, and Peter Milder. Maximizing cnn accelerator efficiency through resource partitioning. In Proceedings of the 44th Annual International Symposium on Computer Architecture, pages 535–547. ACM, 2017.
 [55] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for largescale image recognition. arXiv preprint arXiv:1409.1556, 2014.
 [56] Linghao Song, Xuehai Qian, Hai Li, and Yiran Chen. Pipelayer: A pipelined rerambased accelerator for deep learning. In High Performance Computer Architecture (HPCA), 2017 IEEE International Symposium on, pages 541–552. IEEE, 2017.
 [57] Steven K. Esser, Paul A. Merolla, John V. Arthur, Andrew S. Cassidy, Rathinakumar Appuswamy, Alexander Andreopoulos, David J. Berg, Jeffrey L. McKinstry, Timothy Melano, Davis R. Barch, Carmelo di Nolfo, Pallab Datta, Arnon Amir, Brian Taba, Myron D. Flickner, and Dharmendra S. Modha. Convolutional networks for fast, energy-efficient neuromorphic computing. Proceedings of the National Academy of Sciences, 2016.
 [58] Antti Tarvainen and Harri Valpola. Mean teachers are better role models: Weightaveraged consistency targets improve semisupervised deep learning results. In Advances in neural information processing systems, pages 1195–1204, 2017.
 [59] Yaman Umuroglu, Nicholas J Fraser, Giulio Gambardella, Michaela Blott, Philip Leong, Magnus Jahre, and Kees Vissers. Finn: A framework for fast, scalable binarized neural network inference. In Proceedings of the 2017 ACM/SIGDA International Symposium on FieldProgrammable Gate Arrays, pages 65–74. ACM, 2017.
 [60] Chao Wang, Lei Gong, Qi Yu, Xi Li, Yuan Xie, and Xuehai Zhou. Dlau: A scalable deep learning accelerator unit on fpga. IEEE Transactions on ComputerAided Design of Integrated Circuits and Systems, 36(3):513–517, 2017.
 [61] Shihao Wang, Dajiang Zhou, Xushen Han, and Takeshi Yoshimura. Chainnn: An energyefficient 1d chain architecture for accelerating deep convolutional neural networks. In 2017 Design, Automation & Test in Europe Conference & Exhibition (DATE), pages 1032–1037. IEEE, 2017.
 [62] Ying Wang, Huawei Li, and Xiaowei Li. Rearchitecting the onchip memory subsystem of machinelearning accelerator for embedded devices. In Proceedings of the 35th International Conference on ComputerAided Design, page 13. ACM, 2016.
 [63] Xuechao Wei, Cody Hao Yu, Peng Zhang, Youxiang Chen, Yuxin Wang, Han Hu, Yun Liang, and Jason Cong. Automated systolic array architecture synthesis for high throughput cnn inference on fpgas. In Design Automation Conference (DAC), 2017 54th ACM/EDAC/IEEE, pages 1–6. IEEE, 2017.
 [64] Wei Wen, Chunpeng Wu, Yandan Wang, Yiran Chen, and Hai Li. Learning structured sparsity in deep neural networks. In Advances in Neural Information Processing Systems, pages 2074–2082, 2016.
 [65] Bichen Wu, Alvin Wan, Xiangyu Yue, Peter Jin, Sicheng Zhao, Noah Golmant, Amir Gholaminejad, Joseph Gonzalez, and Kurt Keutzer. Shift: A zero flop, zero parameter alternative to spatial convolutions. arXiv preprint arXiv:1711.08141, 2017.
 [66] Chen Zhang, Zhenman Fang, Peipei Zhou, Peichen Pan, and Jason Cong. Caffeine: Towards uniformed representation and acceleration for deep convolutional neural networks. In ComputerAided Design (ICCAD), 2016 IEEE/ACM International Conference on, pages 1–8. IEEE, 2016.
 [67] Chen Zhang, Peng Li, Guangyu Sun, Yijin Guan, Bingjun Xiao, and Jason Cong. Optimizing fpgabased accelerator design for deep convolutional neural networks. In Proceedings of the 2015 ACM/SIGDA International Symposium on FieldProgrammable Gate Arrays, pages 161–170. ACM, 2015.
 [68] Shijin Zhang, Zidong Du, Lei Zhang, Huiying Lan, Shaoli Liu, Ling Li, Qi Guo, Tianshi Chen, and Yunji Chen. Cambriconx: An accelerator for sparse neural networks. In The 49th Annual IEEE/ACM International Symposium on Microarchitecture, page 20. IEEE Press, 2016.
 [69] Lei Zhao, Youtao Zhang, and Jun Yang. Aep: An errorbearing neural network accelerator for energy efficiency and model protection. In ComputerAided Design (ICCAD), 2017 IEEE/ACM International Conference on, pages 765–771. IEEE, 2017.
 [70] Ritchie Zhao, Weinan Song, Wentao Zhang, Tianwei Xing, JengHau Lin, Mani Srivastava, Rajesh Gupta, and Zhiru Zhang. Accelerating binarized convolutional neural networks with softwareprogrammable fpgas. In Proceedings of the 2017 ACM/SIGDA International Symposium on FieldProgrammable Gate Arrays, pages 15–24. ACM, 2017.