Full-stack Optimization for Accelerating CNNs with FPGA Validation

05/01/2019 ∙ by Bradley McDanel, et al. ∙ Harvard University

We present a full-stack optimization framework for accelerating inference of CNNs (Convolutional Neural Networks) and validate the approach with field-programmable gate array (FPGA) implementations. By jointly optimizing CNN models, computing architectures, and hardware implementations, our full-stack approach achieves unprecedented performance in the trade-off space characterized by inference latency, energy efficiency, hardware utilization, and inference accuracy. As a validation vehicle, we have implemented a 170MHz FPGA inference chip achieving 2.28ms latency for the ImageNet benchmark. The achieved latency is among the lowest reported in the literature while achieving comparable accuracy. Notably, our chip has 9x higher energy efficiency than other implementations achieving comparable latency. A highlight of our full-stack approach which contributes to the achieved high energy efficiency is an efficient Selector-Accumulator (SAC) architecture for implementing the multiplier-accumulator (MAC) operation present in any digital CNN hardware. For instance, compared to an FPGA implementation for a traditional 8-bit MAC, SAC substantially reduces required hardware resources (4.85x fewer Look-up Tables) and power consumption (2.48x).




1. Introduction

Due to the widespread success of Convolutional Neural Networks (CNNs) across a variety of domains, there have been extraordinary research and development efforts placed on improving inference latency, energy efficiency, and accuracy of these networks. Generally, these research efforts can be viewed from two distinct perspectives: (1) machine learning practitioners who focus on reducing the complexity of CNNs through more efficient convolution operations (Wu et al., 2017), weight and activation quantization (Jacob et al., [n. d.]), and weight pruning (Han et al., 2015) and (2) hardware architecture experts who design and build CNN accelerators with minimal power consumption and I/O cost (Du et al., 2015; Jouppi et al., 2017; Zhang et al., 2016; Wang et al., 2017).

Figure 1. An overview of the proposed full-stack optimization framework for accelerating inference of sparse CNNs. Section 3 details CNN training which includes constraints to match the proposed computing architecture. Section 4 describes the process of converting a trained sparse CNN into a packed representation and how FPGA instructions are generated for each layer in the CNN. Section 5 outlines the proposed architecture including the use of multiplication-free selector-accumulator (SAC) based systolic cells which are used to perform inference for all CNN layers on the FPGA.

However, approaching the problem from only one of these two viewpoints can lead to suboptimal solutions. For instance, as discussed in Section 2.3, many low-precision weight quantization methods omit certain significant cost factors in an end-to-end implementation such as using full-precision weights and data for the first and last layers (Zhou et al., 2017; Cai et al., 2017) or full-precision batch normalization (Ioffe and Szegedy, 2015) as in (Rastegari et al., 2016). On the other side, most CNN accelerators are designed to support some target CNNs (e.g., AlexNet (Krizhevsky et al., 2012) and VGG-16 (Simonyan and Zisserman, 2014)) at 8-bit or 16-bit precision for weights or data (Gupta et al., 2015; Dettmers, 2015). Therefore, these accelerators generally are not directly applicable to many of the recent advances in CNN design including efficient CNN structures (using, e.g., separable filters (Howard et al., 2017)), weight pruning (using, e.g., Lasso (Tibshirani, 1996)), and low-precision quantization.

To address this disparity, in this paper we propose a full-stack optimization framework, where the design of the CNN model is jointly optimized with the computing architectures and circuit implementations on which it will run. Figure 1 provides an overview of the proposed method in three stages, covered in three sections. Section 3 describes the training stage, which uses a hardware-aware quantization graph to facilitate training a CNN which can be directly implemented on an FPGA without any additional overhead.

Section 4 describes the process of generating instructions to perform inference given both the trained CNN and a systolic array of a fixed size implemented on the target FPGA. It also covers how the trained sparse and quantized CNN is coded for efficient use of FPGA memory. Section 5 describes the bit-serial systolic array architecture, which uses multiplication-free sparse systolic cells based on the Selector-Accumulator (SAC) architecture in place of the multiplier-accumulator (MAC) operation. These cells provide efficient inference and can leverage irregular sparsity in a CNN model.

Figure 2 depicts how a given systolic array implemented on an FPGA carries out a CNN inference by reusing the array across the CNN layers. Note that for systolic array synchronization, items input to and output from the array are properly skewed, as shown in the figure. When a layer has more filters than the systolic array can handle, we partition the layer into vertical tiles across filters, as shown on the left of the figure, and reuse the systolic array across these tiles. When a layer has more channels than the systolic array can handle, we partition the layer into horizontal tiles across channels (these horizontal tiles are not shown in the figure). In this paper, with column combining (Kung et al., 2018), the 128x64 systolic array implemented on our FPGA is large enough to handle all channels in each layer of the evaluation CNN models. Thus we do not use horizontal tiles.

Figure 2. CNN inference for two consecutive layers (layer L and L + 1) using a single 128x128 systolic array similar to the systolic array implemented on the FPGA reported in this paper. The systolic array alternately executes load weight and matrix multiply instructions for all tiles in a layer (six instructions in total for this example; see Section 4.2).

The novel techniques of the paper are as follows:

  • A full-stack optimization framework where the hardware structure is used to inform the CNN structure via training, and vice versa.

  • Selector-accumulator (SAC) cells which provide efficient multiplication-free inference. We replace traditional multiplier-accumulator (MAC) hardware with inexpensive SAC circuits facilitated by simple shared register chains (Section 5.2).

  • A systolic array building block for low-precision CNNs (Section 5.1) which uses shared register chains for two purposes: propagating data into adjacent cells and performing multiplication with power of two weights. Systolic arrays are used as an example of processor arrays (our register chain design may extend to other array architectures).

  • A streamlined CNN structure which achieves competitive performance on ImageNet in the mobile setting using only 1x1 convolution without residual connections (Section 

  • Input reshaping which allows parallel input of input image into the systolic array (Section 3.3).

Combining all these advances in a single system is challenging and is one of the main accomplishments of this work. We have built an efficient CNN inference engine on an FPGA (Xilinx VC707 evaluation board) and have validated its correctness by checking the output against our Python simulator’s output. All the timing and power consumption results reported in this paper are based on actual measurements obtained from this FPGA implementation. Our FPGA design is composed almost entirely of Look-up Tables (LUTs). We use DSPs only to implement the final fully connected layer. We believe our design provides a useful base for future ASIC implementation. Additionally, the concepts and architecture presented in this paper could scale across multiple components in a distributed setting to form larger systems for deep learning.

Links to our CNN training code (using PyTorch (Paszke et al., 2017)), Python code which converts a trained CNN into a packed representation for the FPGA, and Verilog code for FPGA implementation are available at https://goo.gl/8i9aJp.

2. Background and Related Work

In this section, we first summarize recent FPGA-based CNN accelerators which we compare against in Section 6. Then, we review advances in efficient CNNs we use as a starting point for our approach.

2.1. FPGA Accelerators for CNNs

In recent years, numerous FPGA designs for CNN inference have been proposed (generally targeting prominent networks such as LeNet-5 (LeCun et al., 1998), AlexNet (Krizhevsky et al., 2012), and VGG-16 (Simonyan and Zisserman, 2014)) with the key objective of designing a system with low latency and high energy efficiency. A common strategy deployed by these designs is to minimize the degree of weight and data movement, especially from off-chip memory, as they add significant overhead in terms of both latency and power consumption.

One approach to minimizing data movement is layer fusion, where multiple CNN layers are processed at the same time in a pipelined manner to allow for instant use of intermediate data without external memory accesses (Alwani et al., 2016; Li et al., 2016; Xiao et al., 2017). Another approach, used for 3x3 or larger convolutional filters, is to determine the ordering of inference computation which minimizes the number of partial sums that must be stored (Ma et al., 2017; Zhang et al., 2018). Since our streamlined CNN architecture (Section 3) uses only 1x1 filters, convolution is reduced to matrix multiplication, which can be efficiently implemented using systolic arrays. Additionally, different computation strategies are often taken for the first layer (xil, 2018), as it has only three input channels in the case of RGB images, and for the final fully connected layer (Qiu et al., 2016), where there are significantly more weights than data. In this work, we propose to use the same systolic array building block for efficient implementations of all layers in a CNN by using various full-stack optimization techniques such as input reshaping discussed in Section 3.3.

2.2. Efficient CNN Structures

Since VGG-16 (Simonyan and Zisserman, 2014) was introduced in 2014, there has been a general trend towards designing deeper CNNs through the use of residual connections (ResNets (He et al., 2016)) and concatenative connections (DenseNet (Huang et al., 2017b)) as deeper networks tend to achieve higher classification accuracy for benchmark datasets such as ImageNet (Deng et al., 2009). However, as pointed out in Table 2 of the original ResNet paper (He et al., 2016), residual connections appear to add little improvement in classification accuracy to a shallower (18 layer) CNN. Based on these observations, we have chosen to use a shallower CNN (19 layers) without any residual or concatenative connections, which we outline in Section 3.1. In our evaluation (Section 6.4.3) we show that for this shallower CNN, the exclusion of additional connections has minimal impact on classification accuracy while significantly simplifying our hardware implementation and improving its efficiency.

Additionally, several alternatives to standard convolution have recently been proposed to reduce the computation cost. Depthwise separable convolution (Chollet, 2016) dramatically reduces the number of weights and operations by separating a standard convolution layer into two smaller layers: a depthwise layer that only utilizes neighboring pixels within each input channel and a pointwise layer which operates across all channels but does not use neighboring pixels within a channel (i.e., it only uses 1x1 filters). Wu et al. showed that a channel shift operation can be used to replace the depthwise layer without significant impact on classification accuracy (Wu et al., 2017). As described in Section 3.1, our proposed CNN uses this channel shift operation immediately preceding a 1x1 convolution layer.

2.3. Weight and Data Quantization

Several methods have been proposed to quantize the CNN weights after training, using 16-bit (Gupta et al., 2015) and 8-bit (Dettmers, 2015) fixed-point representations, without dramatically impacting classification accuracy. More recently, low-precision quantization methods (i.e., 1-4 bits) such as binary (Courbariaux et al., 2015; Hu et al., 2018) and ternary quantization (Zhu et al., 2016; Zhou et al., 2017; Wang and Cheng, 2017) have also been studied, which lead to smaller models but may incur some cost to classification accuracy. Generally, for these low-precision approaches, training is still performed using full-precision weights, but the training graph is modified to include quantization operations in order to match the fixed-point arithmetic used at inference. In this paper, log quantization (Zhou et al., 2017) is adopted for weights, with each quantization point being a power of two. This allows for significantly more efficient inference, as fixed-point multiplication is replaced with bit shift operations corresponding to the power of two weight, as discussed in Section 5.

In addition to weight quantization, there are many quantization methods for activation data output from each CNN layer (Rastegari et al., 2016; Zhou et al., 2016; Cai et al., 2017; Zhou et al., 2017; Choi et al., 2018). Data quantization reduces the cost of memory access for these intermediate outputs between layers in a CNN and also the computation cost of inference. However, it has been shown that low-precision quantization of activations (i.e., 1-4 bits) leads to a significantly larger degradation in classification accuracy compared to weight quantization (Courbariaux et al., 2016; Liu et al., 2018). Due to these considerations, we use 8-bit linear quantization for data in this paper and focus on an efficient implementation of multiplication-free computations with 8-bit data.

Additionally, we note that the majority of proposed methods for low precision weights and data omit two details which are critical for efficient end-to-end system performance. First, works in this area often treat the first and last layers in a special manner by keeping the weights and data full-precision for these layers (Courbariaux et al., 2016; Liu et al., 2018; Cai et al., 2017). Second, they often explicitly omit quantization considerations of batch normalization and use standard full-precision computation as performed during training (Cai et al., 2017; Zhou et al., 2016). Since batch normalization is essential to the convergence of low-precision CNNs, this omission makes it difficult to efficiently implement many low-precision approaches. In this work, as discussed in Section 3.2, we handle both of these issues by (1) quantizing the weights and data in all layers (including the first and last layers) under a single quantization scheme and by (2) including batch normalization quantization in the training graph (depicted in Figure 6) so that it adds zero overhead during inference.

2.4. Weight Pruning

It is well known in the literature that the majority of weights in a CNN (up to 90% for large models such as VGG-16) can be set to zero (pruned) without having a significant impact on the classification accuracy (Han et al., 2015). The resulting pruned network may have sparsely distributed weights with an irregular sparsity structure, which is generally difficult to implement efficiently using conventional hardware such as GPUs. This has led to subsequent methods that propose structured pruning techniques which will result in models with nonzero weights densely distributed (Wen et al., 2016; Narang et al., 2017; Gray et al., 2017; Huang et al., 2017a; He et al., 2017; Luo et al., 2017). While these methods allow more efficient CPU and GPU implementations, they appear unable to achieve the same level of reduction in model size that unstructured pruning can achieve. (For instance, in Table 4 of (Wen et al., 2016), the highest accuracy model relative to the number of nonzero weights is achieved using unstructured pruning.)

Unlike previous work, column combining is a new pruning method which allows for sparse CNN layers, but requires that the remaining sparse weights can be packed into a denser format when deployed in hardware (Kung et al., 2018). In our proposed training pipeline, we use column combining in addition to weight and data quantization as discussed in the previous section, in order to achieve efficient sparse CNN inference. Figure 3 shows how a sparse pointwise convolution layer with power of two weights is converted into a denser format by removing all but the largest nonzero entry in each row across the combined channels when stored in a systolic array. In this example, column combining reduces the width of the small layer by a factor of 4, from 8 to 2. In Section 5, we describe a bit-serial design for efficient hardware implementation of the packed format shown on the right side of Figure 3.

Figure 3. A pointwise convolution layer (left) with four channels per group resulting from weight pruning training for column combining (Kung et al., 2018). After combining columns in the filter matrix (left), each group of four channels (shown in cream and green) is reduced into a single column (right). Note that during column combining, for each group, all entries in each row are removed (pruned) except the one with the largest magnitude.
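The packing step described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' released code: the function name `pack_columns`, the list-of-rows layout, and the example weights are assumptions made for this sketch. For each filter (row) and each group of columns it keeps the entry with the largest magnitude and records its position inside the group.

```python
def pack_columns(weights, group_size):
    """Combine each group of `group_size` columns into a single column.

    weights: list of rows (one per filter), each a list of channel weights.
    Returns (packed, indices): for every row and column group, the surviving
    weight and its position inside the group.
    """
    packed, indices = [], []
    for row in weights:
        packed_row, index_row = [], []
        for start in range(0, len(row), group_size):
            group = row[start:start + group_size]
            # Keep only the entry with the largest magnitude in this group.
            best = max(range(len(group)), key=lambda i: abs(group[i]))
            packed_row.append(group[best])
            index_row.append(best)
        packed.append(packed_row)
        indices.append(index_row)
    return packed, indices


# Hypothetical 2-filter, 8-channel layer with power of two weights.
filters = [
    [0.0, 0.5, 0.0, 0.0, 0.0, 0.0, -0.25, 0.0],
    [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.125],
]
packed, indices = pack_columns(filters, group_size=4)
print(packed)   # two columns per filter instead of eight
print(indices)  # position of each surviving weight within its group
```

With a group size of 4, the eight channels collapse into two columns, matching the factor of 4 reduction in width described for Figure 3.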

3. Streamlined CNN Architecture

In this section, we first provide an overview of our streamlined CNN structure in Section 3.1, targeted for our FPGA implementation reported in this paper. Then, we outline the various design choices to improve the utilization and efficiency of the FPGA system. Specifically, we employ a quantization-aware training graph, including quantized batch normalization (Section 3.2) and an input reshaping operation to improve the utilization of our systolic array for the first convolution layer (Section 3.3).

3.1. Proposed Streamlined CNN Architecture

Our objective in designing a CNN architecture is to achieve high classification accuracy using a simplified structure across all CNN layers which can be mapped efficiently onto a systolic array. Figure 4 shows the structure of each convolutional layer in our network. To achieve similar performance to standard 3x3 convolution using only pointwise (1x1) convolution, every layer begins with a channel shift operation as described in (Wu et al., 2017). The output of the shift operation is then applied to a sparse pointwise convolution layer, followed by batch normalization and rectified linear unit (ReLU). During training, the weights in the pointwise convolution layer are pruned with column combining using the column groups parameter (g) as in (Kung et al., 2018). For the earlier convolution layers in a network which have fewer weights, a column group size of 2 is used, which reduces the number of nonzero weights by roughly 50%. For the latter CNN layers, which are larger and have higher redundancy, a group size of 8 is used which reduces the number of nonzero weights by approximately 87.5%. Each layer is progressively pruned over the course of training, such that after training they will reach their target sparsity set by the column groups for the layer.

Figure 4. Each layer of the evaluation CNN models in this paper consists of a shift operation (Wu et al., 2017), pointwise (1x1) convolution, batch normalization and ReLU activation. A layer is parameterized with a number of filters (f), a stride (s), and column groups (g) for column combining in packing a sparse convolutional layer (Kung et al., 2018).

Figure 5 shows the evaluation models for the proposed streamlined CNN structure for CIFAR-10 (Krizhevsky et al., 2014) and ImageNet (Deng et al., 2009) datasets. As discussed in Section 2.2, we have chosen to keep the network relatively shallow (19 layers) and without any residual or concatenative connections. In Section 6, we show that this streamlined structure can achieve competitive Top-1 ImageNet classification accuracy with low latency and high energy efficiency. We evaluate ImageNet using three settings: ImageNet-Small/224, ImageNet-Small/56, and ImageNet-Large/56, where 224 and 56 refer to the width and height of the input image after the preprocessing stage. The small models have 1.5M weights and the large model has 8.5M weights after training. These evaluation models were chosen to evaluate the importance of model size and the spatial input size on classification accuracy, latency, and throughput. Additionally, as described in Section 3.3, for the settings with (56x56) input size, we use a reshaping operation to increase the number of input channels to the systolic array from 3 (for RGB images) to 48 for low-latency input.

Figure 5. The evaluation models for the CIFAR-10 and ImageNet datasets. Each network consists of 19 layers, where each layer has a structure shown in Figure 4, with the first layer not including a shift component. Downsampling is performed using strided convolution denoted by layers with a stride of 2 (s=2).

3.2. Quantization-aware Training

In order to achieve high classification accuracy using power of two weights, we add quantization operations to the CNN training graph in order to match the fixed-point weights and data used at inference. Figure 6 shows the training and inference graphs for a single layer in the CNN shown in Figure 4. As discussed in Section 2.3, this approach of injecting quantization into the training graph is known in the literature and has mainly been used to train binary and ternary networks (Courbariaux et al., 2015; Zhou et al., 2016). In our training graph, we use log quantization for the weights, which quantizes an underlying full-precision weight (shown in blue) to the nearest power of two. During training, backpropagation uses full-precision gradients to update the full-precision weight in each layer.

Figure 6. The quantized training graph (left) performs both linear quantization for input data and power of two quantization (log quantization) for weight and batch normalization parameters during training. The inference graph (right) uses the quantized version of the full-precision weights learned during training and therefore does not require any floating-point operations.
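As a rough illustration of log quantization, the sketch below rounds a full-precision weight to a signed power of two. Several details are assumptions made for this example and are not taken from the paper: rounding is done in the log domain, the exponent range `min_exp`/`max_exp` is hypothetical, and very small nonzero weights are clamped to the smallest power of two rather than pruned.

```python
import math


def log_quantize(w, min_exp=-6, max_exp=0):
    """Round a full-precision weight to a signed power of two.

    min_exp and max_exp are hypothetical bounds on the exponent.
    """
    if w == 0.0:
        return 0.0
    # Nearest exponent in the log domain (an assumption of this sketch).
    exp = round(math.log2(abs(w)))
    exp = max(min_exp, min(max_exp, exp))
    # Restore the sign of the original weight.
    return math.copysign(2.0 ** exp, w)


for w in (0.3, -0.7, 3.0, 0.0):
    print(w, "->", log_quantize(w))
```

In a quantization-aware training graph, a function like this would be applied in the forward pass only, while the gradient update is applied to the underlying full-precision weight, as described above.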

Additionally, we perform quantization of the batch normalization operations which follow each convolutional layer. For higher precision weights (e.g., 8-bit weights), these batch normalization parameters can be folded directly into the weights and bias terms of the preceding convolution layer after training, so that they have no additional overhead (Krishnamoorthi, 2018). However, for lower precision weights (such as binary or power of two weights), this folding process introduces significant quantization error, leading to a notable drop in classification accuracy. For this reason, prior works using low-precision weights employ full-precision batch normalization, incurring the corresponding full-precision computation (e.g., (Zhou et al., 2016)). For our proposed bit-serial architecture, these full-precision batch normalization operations would introduce significant overhead and break our objective of multiplication-free inference. Therefore, as shown in Figure 6, we include quantization of the batch normalization parameters in our training graph. Applying log quantization on the batch normalization scale parameters allows them to be folded into the log quantized weights without introducing any quantization error.

Batch normalization is defined as

$$\mathrm{BN}(x) = \gamma \, \frac{x - \mu}{\sigma} + \beta,$$

where $\mu$ and $\sigma$ are the mean and standard deviation of each mini batch during training and average running statistics during inference. $\gamma$ and $\beta$ are learnable parameters which are introduced to improve the representation power of the network. When followed by ReLU, as is the case in our CNN, the effects of the learnable scale parameter $\gamma$ can be captured in the following convolution layer and can therefore be omitted by setting $\gamma$ as 1 (tfb, [n. d.]). We then factor $\mu$, $\sigma$, and $\beta$ into a scale and bias term as follows

$$\mathrm{BN}(x) = \underbrace{\frac{1}{\sigma}}_{\text{scale}} \, x + \underbrace{\beta - \frac{\mu}{\sigma}}_{\text{bias}}.$$

After applying quantized batch normalization to the output from the preceding convolution layer, a non-linear activation function (ReLU) is performed, which sets all negative values to 0. Additionally, it applies 8-bit linear quantization on the data so that it matches the fixed-point computation at inference. The inference graph of Figure 6 shows how computation is performed on the FPGA during inference. The log quantized batch normalization scale factor is folded into the log quantized weights in the preceding convolution layer. Only arithmetic shift and fixed-point addition operations are performed during inference.
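The reason folding introduces no extra error can be seen with a small sketch. It assumes that, with the learnable scale set to 1, batch normalization takes the form x/σ + (β − μ/σ), so the scale is 1/σ. When both the weight and the scale are powers of two, their product is again a power of two and folding reduces to adding exponents. The helper names `fold_scale` and `bn_bias`, and the rounding of the scale in the log domain, are assumptions of this example rather than the authors' implementation.

```python
import math


def fold_scale(weight_exp, sigma):
    """Fold a log-quantized batch normalization scale into a power of two weight.

    weight_exp: exponent e of a weight with magnitude 2**e.
    sigma: standard deviation used by batch normalization.
    Returns (folded_exp, scale_exp), where the scale 1/sigma is rounded to
    2**scale_exp and the folded weight has magnitude 2**folded_exp.
    """
    scale_exp = round(math.log2(1.0 / sigma))
    # 2**e * 2**s == 2**(e + s): the folded weight is still a power of two.
    return weight_exp + scale_exp, scale_exp


def bn_bias(beta, mu, scale_exp):
    """Bias term beta - mu * scale, using the quantized scale."""
    return beta - mu * 2.0 ** scale_exp


folded_exp, scale_exp = fold_scale(weight_exp=-2, sigma=3.7)
x = 8
unfolded = (x * 2.0 ** -2) * 2.0 ** scale_exp   # convolution, then scale
folded = x * 2.0 ** folded_exp                  # single shift by folded exponent
print(folded_exp, scale_exp, unfolded == folded)
```

At inference, multiplying by the folded weight is a single arithmetic shift, which is consistent with the statement that only shifts and fixed-point additions are needed.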

3.3. Input Reshaping to Improve Utilization

For CNNs trained on ImageNet, the first convolution layer represents 10-15% of the total computation performed due to the large spatial size of the input image (3x224x224). However, as recently discussed by Xilinx (xil, 2018), the computation in this layer does not map well onto systolic architectures, because the input image only offers three input channels, meaning that the majority of the array’s input bandwidth may not be utilized. To address this imbalance, Xilinx proposes to use two systolic arrays, one systolic array specifically designated to the first layer and the other systolic array used for the remaining convolution layers in the CNN.

In this paper, rather than adding a systolic array for the input layer, we reshape the input image to the CNN to increase the utilization of our single systolic array for the first convolutional layer. Figure 7 shows the input reshaping operation for an RGB image with 3 channels and 224x224 pixels per channel. Each 2x2 block of pixels is divided into four groups (denoted 1, 2, 3, and 4) with 1 pixel per group. The pixels in the same group across all 2x2 blocks are then placed into a new set of RGB channels (4 groups, each with RGB channels, leading to 12 channels total). Each of these channels has 112x112 pixels, which is one quarter of the original input image. In Section 6, we evaluate the ImageNet-Small/56 and ImageNet-Large/56 networks with an even more aggressive reshaping operation, where we use 16 groups to convert the 3x224x224 input image into a 48x56x56 input for the CNN.

Figure 7. Reshaping the input data by decreasing the spatial size and increasing the number of channels in order to improve utilization of the systolic array in processing the first layer.
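The reshaping can be sketched as a space-to-depth style rearrangement. The code below is an illustrative sketch on nested Python lists; the function name `reshape_input` and the ordering of the output channels (group by group, with the original channels inside each group) are assumptions, since the text does not specify the exact channel order.

```python
def reshape_input(image, block):
    """Trade spatial size for channels.

    image: nested list indexed as image[channel][row][col].
    block: side length of the pixel block (2 gives 4 groups, 4 gives 16 groups).
    Returns a list of channels, each reduced by `block` in both dimensions.
    """
    channels = len(image)
    height, width = len(image[0]), len(image[0][0])
    out = []
    for dy in range(block):          # row offset of the pixel inside a block
        for dx in range(block):      # column offset of the pixel inside a block
            for c in range(channels):
                # Gather the pixel at offset (dy, dx) from every block.
                out.append([
                    [image[c][y][x] for x in range(dx, width, block)]
                    for y in range(dy, height, block)
                ])
    return out


# Small 3-channel, 4x4 example; each value encodes channel, row, and column.
image = [[[c * 100 + y * 10 + x for x in range(4)] for y in range(4)]
         for c in range(3)]
reshaped = reshape_input(image, block=2)
print(len(reshaped), len(reshaped[0]), len(reshaped[0][0]))
```

With `block=2` a 3x224x224 image becomes 12x112x112, and with `block=4` it becomes 48x56x56, matching the two settings described in the text.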

4. Configuration and Instructions

In this section, we show how our trained CNN described in Section 3 is coded for efficient configuration of the systolic array on the FPGA (Section 4.1). We then explain how the weights in each layer are divided into tiles which fit in a given systolic array and the corresponding instructions for each tile which run on the FPGA (Section 4.2).

4.1. Coding Sparse CNN Layers for FPGA

After training is complete, each convolution layer will have reached a target sparsity set by the column group parameter for the layer as described in Section 3.1. The left side of Figure 8 illustrates the weights of a pointwise convolution layer after training with 8 filters, 8 channels, and column groups of size 4. For this layer, each group of 4 channels will be combined into a single column in the systolic array on the FPGA, as there is only one nonzero entry per filter (row) in each group. The remaining nonzero weights are power of two due to the weight quantization scheme discussed in Section 3.2.

Figure 8. A sparse pointwise layer with power of two weights (left) is converted into a packed representation for efficient storage on the FPGA (right), where each group of combined channels (4 in this example) produces one 8-bit encoding per filter (row).

To illustrate the coding procedure, we have outlined 4 weights in red in the pointwise layer shown in Figure 8 which will be converted into an 8-bit representation in the packed format. The second element in this group is the only nonzero weight. The first 3 bits in the encoding store the index of the nonzero weight, which is 1 (001), corresponding to the second element in the group. Note that for larger layers, we combine up to 8 channels, requiring a 3-bit index. The remaining 5 bits are a sign bit and 4 bits to indicate the power of two weight, which together are 00110 for the outlined weight. As depicted in Figure 8, the power of two weights are ordered from smallest to largest, so a smaller weight receives a smaller 4-bit code (e.g., 0001 versus 0111). The value 0000 is used to represent a zero weight. In summary, to configure each systolic cell, only 8 bits are required.
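The 8-bit layout can be illustrated with a short bit-packing sketch. It assumes that the "first 3 bits" are the most significant bits and that a sign bit of 0 denotes a positive weight; the function names `encode_cell` and `decode_cell` are hypothetical, and the 4-bit magnitude code is treated as opaque because the mapping from codes to specific powers of two is given only in Figure 8.

```python
def encode_cell(index, negative, magnitude_code):
    """Pack one systolic cell configuration into 8 bits.

    Layout (most significant bit first, an assumption of this sketch):
    [3-bit index within the column group][1 sign bit][4-bit magnitude code].
    """
    if not 0 <= index < 8:
        raise ValueError("index must fit in 3 bits")
    if not 0 <= magnitude_code < 16:
        raise ValueError("magnitude code must fit in 4 bits")
    return (index << 5) | (int(negative) << 4) | magnitude_code


def decode_cell(byte):
    """Unpack an 8-bit cell configuration into (index, negative, code)."""
    return byte >> 5, bool((byte >> 4) & 1), byte & 0xF


# The outlined example: index 1 (001), sign bit 0, magnitude code 0110.
encoded = encode_cell(index=1, negative=False, magnitude_code=0b0110)
print(format(encoded, "08b"))
print(decode_cell(encoded))
```

Keeping the index in the top three bits lets the same 8-bit word serve groups of up to 8 combined channels without changing the layout.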

4.2. Instructions for FPGA

CNN inference is composed of a series of matrix-matrix multiplications, one per convolution layer, between the data which is input to a layer and the learned weights of a layer. When using a single systolic array to perform the matrix multiplications for all layers, generated instructions will carry out a relatively straightforward process of alternately loading weights into the systolic array and performing matrix multiplication between the data and loaded weights, in sequential order of the CNN layers. However, when a weight matrix is larger than the fixed size of the systolic array, it must be partitioned into smaller tiles, where each tile can fit into the systolic array. Then, a pair of weight loading and matrix multiplication instructions is scheduled for each tile. In this paper, we use column combining to dramatically reduce the number of columns for inference (e.g., from 512 channels to 64 columns in the systolic array via 8-way combining) and therefore require no tiling to partition the channels of any convolutional layer in our evaluation CNNs.

Figure 2 shows how inference is performed across two layers (layer L and layer L + 1) using a single systolic array. First, a load weights instruction is used to load the 128 filters by 128 channels weight matrix for layer L. Then, matrix multiplication is performed between the loaded weights and the previous layer output (layer L - 1) by passing the data into the systolic array. This matrix multiplication generates the layer L output which will be used for layer L + 1. Since the layer L + 1 weight matrix of 256x128 is larger than the systolic array of 128x128, it is partitioned into two tiles. The first tile in layer L + 1 is then loaded into the systolic array and is multiplied with the layer L output, which generates half the output for layer L + 1. The second tile in layer L + 1 is then processed in the same manner as the first tile. In total, six instructions are used: one pair of weight load and matrix multiply instructions for layer L and two pairs of instructions for layer L + 1.
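The scheduling logic in this example can be sketched as follows. This is a simplified model, not the authors' instruction generator: the function name `schedule` and the tuple representation of instructions are assumptions, and only tiling across filters is modeled, since the text states that no tiling across channels is needed.

```python
def schedule(layer_filters, array_rows):
    """Generate (operation, layer, tile) instructions for a single systolic array.

    layer_filters: number of filters in each layer, in execution order.
    array_rows: number of filters the systolic array can hold at once.
    """
    instructions = []
    for layer, filters in enumerate(layer_filters):
        tiles = -(-filters // array_rows)  # ceiling division
        for tile in range(tiles):
            # Each tile needs its weights loaded before it is multiplied.
            instructions.append(("load_weights", layer, tile))
            instructions.append(("matmul", layer, tile))
    return instructions


# Layer L has 128 filters and layer L + 1 has 256 filters, on a 128-row array.
program = schedule([128, 256], array_rows=128)
for instruction in program:
    print(instruction)
```

For the two layers in Figure 2 this yields one pair of instructions for layer L and two pairs for layer L + 1, six instructions in total.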

Figure 9 shows the FPGA instruction layout for the systolic array architecture described in Section 5. A load weight instruction is indicated when the first bit is set, with the systolic array width and height fields controlling the size of the tile being loaded into the array. A matrix multiply instruction is indicated when the second bit is set. The height of the data matrix to be multiplied with the loaded weights is set by the input width and height fields (e.g., 56x56 for the first layer).

Figure 9. The FPGA instruction layout for a 128x128 systolic array on the FPGA.

5. FPGA Design

In this section, we provide a detailed description of our FPGA design for sparse CNN inference with power of two weights. Our FPGA implementation is written in Verilog and is available at https://goo.gl/8i9aJp.

5.1. System Description

Figure 10 shows an overview of the CNN inference system as implemented on an FPGA. The parameter buffer stores the filter weights, biases, and shift directions for the channel shift operation (Wu et al., 2017). During a load weight instruction, filter weights are loaded into the systolic array (Section 5.2) and filter bias and channel shift directions are loaded into the bias and the channel shifters, respectively. During a matrix multiplication instruction, input data is loaded from the data buffer into the channel shifters, which perform the shift operations before sending the data to the systolic array in a bit-serial fashion. Each column in the systolic array takes input data from multiple input channels to support column combining shown in Figure 3. Each Selector-Accumulator (SAC) cell (Figure 11) within a column of the systolic array takes in the multiple input channels at different power of two shift offsets to select both the channel index and power of two weight index for the cell corresponding to the packed format in Figure 8.

The output from each row of the systolic array is passed to the ReLU Quantization block (Section 5.4) before storing the results back to the data buffer. Output data stored in the data buffer for the previous layer is the input data to the next layer. The output accumulator (Section 5.5) is used only in the final (fully connected) layer to reduce the feature map for each class to a single number used for prediction. The parameters for the next tile are loaded from the off-chip DRAM (not shown in Figure 10) to the parameter buffer as matrix multiplication is performed on the systolic array for the current tile. During inference, all intermediate results are stored in on-chip RAM in the data buffer.

Figure 10. System design as implemented on an FPGA.

5.2. Selector-Accumulator (SAC) for Multiplication-free Systolic Array Design

In this section, we describe our Selector-Accumulator (SAC) design for a multiplication-free systolic array for sparse matrix multiplication. We choose a bit-serial design for efficient multiplexing of multiple data streams into a single column of the array to support column combining (Kung et al., 2018). In the layout of the multiplication-free systolic array (shown in Figure 10), each column in the array takes up to eight input channels (to support column combining) into the register chain for the column in a bit-serial fashion. Each cycle, input data is shifted up to the next register in the chain. This register chain serves the standard purpose in a systolic array of propagating data through the cells in the column. However, when using power of two weights, it can serve an additional purpose of power of two weight multiplication, as each increasing position in the register chain corresponds to the input data being multiplied by an increasing power of two weight. In our design, we utilize this observation of the dual purpose nature of the register chain when using power of two weights to design an efficient systolic cell.

Figure 11a shows the selector-accumulator (SAC) cell, which takes the input data from multiple points on the register chain and selects the point corresponding to the power of two weight stored in the cell using a multiplexer. Additionally, it uses a channel index, also stored in the cell, to determine the position of the weight in the original sparse filter matrix (see Figure 8 for details on the indexing scheme). The selected element is then passed into the bit-serial accumulator shown in Figure 11b. The blue logic elements in the accumulator negate the product Y based on the sign of the power of two weight and add the result to the bit-serial accumulator (pink full-adder) before passing the result to the SAC to the right.

Figure 11. Bit-serial Selector-Accumulator (SAC).
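As a functional illustration only (not the bit-serial Verilog design), the behavior of one SAC cell can be sketched at the word level. The names `sac_cell` and `mac_cell` and the encoding of a weight as a (sign, shift) pair are assumptions made for this sketch.

```python
def sac_cell(acc_in, inputs, channel_index, shift, sign):
    """Word-level model of one selector-accumulator cell.

    inputs:        integer activations, one per channel combined into this column
    channel_index: which of the combined channels this cell's weight belongs to
    shift, sign:   the stored weight is sign * 2**shift, with sign in {-1, 0, +1}
    """
    if sign == 0:
        return acc_in  # zero weight: the partial sum passes through unchanged
    selected = inputs[channel_index] << shift  # selection replaces multiplication
    return acc_in + selected if sign > 0 else acc_in - selected

def mac_cell(acc_in, x, w):
    """Conventional multiplier-accumulator, for comparison."""
    return acc_in + x * w

# A cell holding weight -4 (sign = -1, shift = 2) for the second combined channel.
inputs = [3, 18, 7]
print(sac_cell(100, inputs, channel_index=1, shift=2, sign=-1))
print(mac_cell(100, inputs[1], -4))
```

Both calls produce the same partial sum, which is the sense in which the selector stands in for the multiplier when weights are restricted to powers of two.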

Compared with an 8-bit multiplier-accumulator (MAC), which requires 8 1-bit full adders for multiplication, the SAC requires only a multiplexer to select the power of two shift offset. We have done performance comparisons using the Xilinx Vivado design suite. We observed that, compared to a traditional 8-bit MAC, SAC substantially reduces required LUTs (4.85×), FFs (3.54×), and power consumption (2.48×), as shown in Table 1. As we discuss in Section 6.2, this dramatically reduces the hardware cost of each systolic cell and allows for a substantially larger systolic array to be implemented on the FPGA compared to standard 8-bit MAC cells.

         64×64 MAC   64×64 SAC   MAC / SAC
LUT      212388      43776       4.85×
FF       192293      54330       3.54×
Power    4.21W       1.7W        2.48×

Table 1. Comparison of FPGA resources and power for a 64×64 systolic array implemented with MAC and SAC.

Figure 12 shows an example of how a register chain is used to generate a shifted version of the input data 10010 (red), with the shift amount corresponding to the power of two weight associated with the cell, over time steps T = 0, 1, 2, etc. As depicted in Figure 12 (a), (b) and (c), suppose that the SAC requires a shifted version of the original input with two zeros prepended at the start of the bit stream (filter weight is four). Then the Accumulator (Acc) will grab the input data stream at the second register in the register chain, so the first two bits sent to the Acc are zeros (black). After 4 additional cycles, the Acc receives an input of 1001000, which is four times the original input 10010.

Figure 12. An example of sending the shifted version of input stream to the SAC cell.
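The example above can be simulated in a few lines. This is a minimal sketch that assumes the stream is sent least-significant bit first and that the registers reset to zero, which is consistent with the first two bits reaching the Acc being zeros; the helper names are hypothetical.

```python
def to_bits(value, width):
    """Split an integer into a list of bits, least-significant bit first."""
    return [(value >> i) & 1 for i in range(width)]

def from_bits(bits):
    """Reassemble an integer from a least-significant-bit-first list."""
    return sum(bit << i for i, bit in enumerate(bits))

def tap_stream(bits, delay, n_cycles):
    """Bit stream observed at a tap placed `delay` registers along the chain."""
    chain = [0] * delay  # registers between the column input and the tap
    seen = []
    for t in range(n_cycles):
        bit_in = bits[t] if t < len(bits) else 0
        chain.insert(0, bit_in)   # a new bit enters the bottom of the chain
        seen.append(chain.pop())  # the bit arriving at the tap this cycle
    return seen

stream = to_bits(0b10010, 5)
shifted = tap_stream(stream, delay=2, n_cycles=7)
print(shifted[:2])              # the first two bits seen by the Acc
print(bin(from_bits(shifted)))  # the value accumulated from the tapped stream
```

A delay of two registers prepends two zeros to the stream, so the value seen at the tap is 1001000, four times the input 10010.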

Figure 13 shows how the register chain can be shared across two consecutive SAC cells in one column of the systolic array. Suppose each of the two SAC cells may require any one of three shifted versions of the original input (corresponding to three possible power of two weights). This leads to the use of two windows, each with a span of three, on the register chain (shown in green and blue in Figure 13). The red lines in the figure show the positions where the SAC cells grab the shifted versions of the original input from the register chain. The register chain serves two purposes: (1) it shifts the input data upwards to all the SAC cells in the same column, and (2) it generates the shifted versions of the input data for the power of two multiplication.

Figure 13. (a) Register chain with per-cell window for power of two weights for two adjacent cells on a column of the systolic array. The red lines show the position where the shifted versions of the original input are grabbed from the register chain. (b) SAC without and with zero-skipping.

5.3. Energy-efficient SAC with Zero-Skipping

Each SAC can be turned off to save power when either the weight or the input to the SAC is zero. The structures of a SAC with and without the zero-skipping mechanism are shown in Figure 13(b). For the SAC with zero-skipping, the zero signal is set when either the input or the weight is 0 and is used as the enable signal for the gated clock. When the zero signal is set due to the current input being 0, the accumulation bypasses the SAC and is forwarded directly to the next SAC on the row in the systolic array. Note that due to ReLU, approximately half of the data elements are zero, meaning that the SAC will be disabled roughly half of the time. When the weight for the SAC is 0, the SAC will be disabled for the entire matrix multiplication. In Section 6.3, we show that this zero-skipping mechanism reduces power by roughly 30%.
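A behavioral sketch of zero-skipping along one row of cells is given below. It is a word-level model with hypothetical names, not the gated-clock circuit; it only shows that bypassing cells with a zero input or zero weight leaves the accumulated result unchanged while reducing the number of active cells.

```python
def row_accumulate(inputs, weights):
    """Accumulate along one row; return (result, number of active cells)."""
    acc = 0
    active = 0
    for x, w in zip(inputs, weights):
        zero = (x == 0) or (w == 0)  # the zero signal that gates the cell's clock
        if zero:
            continue  # the partial sum bypasses this cell unchanged
        active += 1
        acc += x * w
    return acc, active

inputs = [5, 0, 3, 0]    # about half the activations are zero after ReLU
weights = [2, 4, 0, -1]  # one weight is zero due to sparsity
result, active_cells = row_accumulate(inputs, weights)
print(result, active_cells)
```

In this toy row only one of the four cells does any work, yet the result equals the full dot product.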

5.4. Design of ReLU and Quantization Block

As mentioned in Section 2.3, we use an 8-bit fixed point representation for the input data to each layer. Therefore, the quantization process must convert the higher precision (32-bit) accumulator output from the systolic array back into an 8-bit range to be used as input to the next layer. In this paper, since the fixed-point scale factor is shared across all layers, this quantization step simplifies to extracting an 8-bit range from the 32-bit accumulator. This quantization step can be fused with the ReLU activation function, by setting negative accumulator outputs to 0.

Figure 14 shows the architecture of the ReLU & Quantization block. A register array is used to collect the 32-bit output from the systolic array. The 32-bit result is shifted by the smallest representable power of two weight (see Figure 8) and passed to the comparator. The comparator generates the indicator bit for the multiplexer, which clips the result to the 8-bit range before storing it back to the buffer.

Figure 14. Design of ReLU & Quantization block.
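The fused ReLU and quantization step can be sketched as a shift followed by a clip. The shift amount of 6 and the unsigned output range [0, 255] used here are assumptions for illustration; the text only states that an 8-bit range is extracted from the 32-bit accumulator and that negative values are set to 0.

```python
def relu_quantize(acc, shift=6, out_bits=8):
    """Map a signed accumulator value to an unsigned out_bits-wide activation.

    shift:    right shift applied to the accumulator (hypothetical value)
    out_bits: width of the stored activation (8 in the text)
    """
    max_value = (1 << out_bits) - 1  # assumed unsigned range [0, 255]
    if acc <= 0:
        return 0  # ReLU: negative accumulator outputs become 0
    return min(acc >> shift, max_value)  # clip to the representable range

for acc in (-500, 640, 1 << 20):
    print(acc, relu_quantize(acc))
```

Negative inputs map to 0, mid-range inputs are scaled down by the shift, and large inputs saturate at the top of the range.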

5.5. Design of Output Accumulator

Given the output channels produced by the final convolutional layer, average pooling is used to reduce the spatial components of each channel to a single averaged value. For our single systolic array implementation, we fold this pooling operation into the weights of the fully connected layer. Let $x_i^j$ and $\bar{x}^j$ denote the i-th element of the input map and the average of the input channel $j$, where $N$ is the total number of elements in each channel. Denote $\bar{x} = [\bar{x}^1, \ldots, \bar{x}^M]^T$ as the vector of channel averages, where $M$ is the total number of input channels. We have the following derivation for the output $y$ of the fully connected layer:

$$y = W\bar{x} = W\left(\frac{1}{N}\sum_{i=1}^{N} x_i\right) = \frac{1}{N}\sum_{i=1}^{N} W x_i = \frac{1}{N}\sum_{i=1}^{N} y_i \qquad (1)$$

where $x_i = [x_i^1, \ldots, x_i^M]^T$, $y_i = W x_i$, and $W$ is the weight matrix of the fully connected layer. From equation 1, we notice that each $y_i$ can be computed by carrying out the matrix multiplication between $W$ and $x_i$ with the systolic array, and $y$ can be computed by summing up all the $y_i$.

The output accumulator is used to calculate the sum of the $y_i$. The 32-bit output stream from a row in the systolic array enters the output accumulator in a bit-serial fashion. This stream is stalled in a register array until the final bit arrives. The resulting 32-bit output is added to the next 32-bit output stream. We use DSPs to carry out part of these 32-bit additions.
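The reordering that lets the pooling be folded into the fully connected layer can be checked numerically. The sketch below uses small hypothetical integer values for the weight matrix and the per-position input vectors, and compares "pool, then multiply" against "multiply at every position, then accumulate"; both sides omit the constant 1/N factor so the comparison stays in exact integer arithmetic.

```python
def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def pool_then_fc(W, X):
    """Sum each channel over all positions, then apply the weight matrix."""
    channels = len(X[0])
    channel_sums = [sum(x[j] for x in X) for j in range(channels)]
    return matvec(W, channel_sums)

def fc_then_accumulate(W, X):
    """Apply the weight matrix at every position, then accumulate the outputs."""
    total = [0] * len(W)
    for x in X:
        y_i = matvec(W, x)  # one matrix multiplication per spatial position
        total = [t + y for t, y in zip(total, y_i)]  # role of the output accumulator
    return total

W = [[1, -2, 4], [0, 8, -1]]                      # 2 classes, 3 input channels
X = [[3, 1, 0], [5, 0, 2], [1, 1, 1], [7, 2, 4]]  # 4 spatial positions
print(pool_then_fc(W, X), fc_then_accumulate(W, X))
```

The two orderings agree because matrix multiplication is linear, which is why a simple accumulator after the systolic array is sufficient.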

6. Evaluation

In this section, we first briefly reiterate the key contributions of our proposed architecture from the perspectives of both the CNN training and hardware and tie each contribution to the corresponding evaluation section (Section 6.1). Then, we evaluate the performance of our FPGA implementation against state-of-the-art FPGA accelerators on the ImageNet dataset (Section 6.2). Next, we measure the impact of zero skipping on energy efficiency (Section 6.3) for different sized systolic arrays. Finally, in Section 6.4, we analyze the impact of our streamlined CNN structure and training procedure presented in Section 3 on classification accuracy including input reshaping, using power of two weights, and the omission of residual connections from the CNN structure.

We focus on two primary performance metrics: (1) latency from image input to classification output, and (2) energy efficiency, which is the number of images the inference engine can process per joule. Note that the latter is also the number of images/sec (i.e., throughput) per watt. For high-throughput inference applications we may use multiple inference engines in parallel. If these individual engines each offer low latency and high energy efficiency, then the aggregate system will deliver high-throughput inference per watt while meeting low inference latency requirements.
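The equivalence between images per joule and throughput per watt can be written out as a small sketch. The function names are hypothetical, the power value in the example is illustrative rather than a measured result, and the sketch assumes a batch size of 1 with one image processed at a time, so that throughput is the reciprocal of latency.

```python
import math

def throughput_per_watt(latency_s, power_w):
    """Images per second divided by power, assuming throughput = 1 / latency."""
    return (1.0 / latency_s) / power_w

def images_per_joule(latency_s, power_w):
    """One image divided by the energy (power x time) spent on it."""
    return 1.0 / (power_w * latency_s)

latency_s = 2.28e-3  # per-image latency reported for the design
power_w = 4.0        # illustrative power value, not a measurement
print(throughput_per_watt(latency_s, power_w))
print(images_per_joule(latency_s, power_w))
print(math.isclose(throughput_per_watt(latency_s, power_w),
                   images_per_joule(latency_s, power_w)))
```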

6.1. Recap of Full-stack Optimization

Full-stack optimization via training has enabled the following design advances which lead to our efficient FPGA implementation presented in Section 5.

  • Using power of two for weights and the batch normalization scale parameters, outlined in Section 2.3, for all layers in the CNN (including the fully connected layer). This allows for a simplified design, where a single sparse multiplication-free systolic array is used for all CNN layers. In Section 6.4.2, we discuss the impact of the proposed quantization scheme on classification accuracy.

  • Zero-skipping of the quantized data (Section 5.3). In Section 6.3, we show that zero-skipping reduces the power consumption during matrix multiplication by roughly 30%.

  • Packing sparse CNNs using column combining (Kung et al., 2018) for efficient storage and use on FPGAs, which we describe in Section 4.1. Our ImageNet-Small/56 evaluation model has only 1.5M power of two weights, which is 40× smaller than AlexNet and 92× smaller than VGG-16 (the two CNNs used by other FPGA designs).

  • Using channel shifts (Wu et al., 2017) to replace 3×3 convolutions with 1×1 convolutions. As with column combining, this reduces the number of model parameters. Additionally, it streamlines the design of the systolic array system, as 1×1 convolution reduces to matrix multiplication.

  • Input reshaping (Section 3.3) to increase the bit-serial systolic array utilization and dramatically reduce the latency for the first convolution layer. In Section 6.4.1, we show that input reshaping alleviates the accuracy loss when using a smaller spatial input size of 56×56 instead of the conventional 224×224.

6.2. Comparing to Prior FPGA Accelerators

We compare our 170 MHz FPGA design to several state-of-the-art FPGA accelerators on the ImageNet dataset in terms of top-1 classification accuracy, latency for a single input image, and energy efficiency when no batch processing is performed (i.e., batch size of 1). By choosing these metrics, we focus on real-time scenarios where input samples must be processed immediately to meet a hard time constraint. Our evaluation model is the ImageNet-Small/56 network shown in Figure 5 with input reshaping to 48×56×56. Our FPGA can fit a systolic array of size 128 rows by 64 columns. Each of the columns can span up to 8 channels in the convolution weight matrix, i.e., when the column group parameter is set to 8, for a total of 512 channels.

Table 2 provides a comparison of our FPGA implementation with the other FPGA-based CNN accelerators. Our design achieves a per-image latency of 2.28 ms, which is among the lowest across all the designs. In addition, compared with some of the most recent works (Zhang et al., 2018; Wang et al., 2018), our design outperforms them by 5.64× and 3.26×, respectively, in terms of energy efficiency. Additionally, compared to an implementation which achieves comparable low latency (Li et al., 2016), our implementation has 9.29× higher energy efficiency.

Our design achieves the highest energy efficiency among all these designs for several reasons. First, we use a highly efficient CNN structure (Section 3.1) with only 1.5M weights (compared to 60M for AlexNet and 136M weights for VGG-16 (Ma et al., 2017; Qiu et al., 2016)). Our model in Table 2 is significantly smaller and all weights (including weights in batch normalization layers) are quantized to power of two numbers. Our top-1 accuracy is 50.84%, about 2% worse than the nearest competing design in terms of energy efficiency (Wang et al., 2018). However, our implementation has at least 3× higher energy efficiency. Second, our proposed power of two quantization (Section 2.3) enables the use of a multiplication-free systolic array (Section 5.2), where each cell contains only a selector and two full adders (see Figure 11). These cells are more efficient than those of (Ma et al., 2017) and have a simpler structure than those of (Shen et al., 2017). This allows for a large systolic array (128×64) to fit on the FPGA, thereby reducing the number of tiles required to perform inference for each sample. Moreover, by using column combining (Kung et al., 2018) we can pack sparse CNN layers for efficient systolic array implementation with high hardware utilization (Xiao et al., 2017). Additionally, DSPs are used in the Output Accumulator (Section 5.5) only for a single fully connected layer and are turned off for the rest of the layers. Finally, the zero-skipping mechanism, which we evaluate in more detail in Section 6.3, further saves power by dynamically turning off systolic cells when the data entering a cell is zero.

                        (Zhang et al., 2018)  (Qiu et al., 2016)    (Xiao et al., 2017)   (Ma et al., 2017)     (Li et al., 2016)     (Shen et al., 2017)   (Wang et al., 2018)   Ours
Xilinx FPGA Chip        VC706                 ZC706                 ZC706                 Arria-10              VC709                 Virtex-7              ZC706                 VC707
FF                      51k (12%)             127k (29%)            96k (22%)             -                     262k (30%)            348k (40%)            51k (12%)             201k (33%)
LUT                     86k (39%)             182k (83%)            148k (68%)            161k (38%)            273k (63%)            236k (55%)            86k (39%)             239k (78%)
DSP                     808 (90%)             780 (89%)             725 (80%)             1518 (100%)           2144 (59%)            3177 (88%)            808 (90%)             112 (4%)
BRAM                    303 (56%)             486 (86%)             901 (82%)             1900 (70%)            1913 (65%)            1436 (49%)            303 (56%)             834 (81%)
Accuracy (Top-1)        53.30%                64.64%                N/A                   N/A                   N/A                   55.70%                52.60%                50.84%
Frequency (MHz)         200                   150                   100                   150                   150                   100                   200                   170
Latency (ms)            5.88                  224                   17.3                  47.97                 2.56                  11.7                  5.84                  2.28
Efficiency (img./s/W)   23.6                  0.46                  6.13                  0.98                  12.93                 8.39                  40.7                  120.7

Table 2. Comparison with FPGA-based CNN accelerators.

6.3. Power Reduction by Zero Skipping

In order to evaluate the power reduction due to zero skipping, we measure the power consumption of the FPGA during matrix multiplication under two settings. The “Without Skipping” setting uses inputs which are all nonzero, meaning that every cell will be active during matrix multiplication. The “With Skipping” setting uses inputs which are half zero, in order to approximate the output of ReLU, which sets roughly half of the elements to zero.

Table 3 shows the amount of power consumption for inference for the “Without Skipping” and “With Skipping” settings for three systolic arrays of increasing sizes. For all three systolic array sizes, we observe that “With Skipping” reduces the power consumption of matrix multiplication by roughly 30%.

Array size   Without Skipping   With Skipping
32×64        1.0W               0.7W
64×64        1.7W               1.3W
128×64       3.0W               2.2W

Table 3. Power consumption comparison of zero-skipping.
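The per-size reduction can be recomputed directly from the values in Table 3. The sketch below only restates the table entries; the dictionary layout and function name are illustrative.

```python
# Power in watts from Table 3: (without skipping, with skipping).
table3 = {
    "32x64": (1.0, 0.7),
    "64x64": (1.7, 1.3),
    "128x64": (3.0, 2.2),
}

def reduction_percent(without_skipping, with_skipping):
    """Relative power saving of zero-skipping, in percent."""
    return 100.0 * (without_skipping - with_skipping) / without_skipping

for size, (without_w, with_w) in table3.items():
    print(f"{size}: {reduction_percent(without_w, with_w):.1f}% lower power")
```

For the three array sizes the computed savings fall between roughly 24% and 30%.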

6.4. Impact of Full-stack Training on Accuracy

We now evaluate the impact of the modifications to both the CNN structure and training procedure as proposed in Section 3 on classification accuracy.

6.4.1. Impact of Input Reshaping

In order to determine the effectiveness of the input reshaping operation described in Section 3.3, we compare models using the same spatial input size with and without reshaping (e.g., 3×56×56 versus 48×56×56) and models with different spatial input sizes (e.g., 3×224×224 versus 48×56×56). Additionally, we train a larger ImageNet model (ImageNet-Large/56) using input reshaping to see the best accuracy that our proposed approach can achieve when used with a small spatial input size.

Table 4 shows the classification accuracy for the four evaluated network settings. First, we observe that ImageNet-Small/56 with reshaping is able to achieve similar classification accuracy to ImageNet-Small/224 without reshaping, even with 16× fewer pixels in each channel. This shows that input reshaping allows input images with additional channels to negate the loss in accuracy due to the small spatial input size. Additionally, for the two ImageNet-Small/56 models (with and without reshaping), we see that input reshaping provides a substantial improvement of around 4% accuracy. This is especially interesting considering these two networks have identical structures except for the initial layer (48 channels with input reshaping versus 3 channels without reshaping). Finally, the ImageNet-Large/56 model achieves an impressive 67.57%, which is only 2% behind full-precision MobileNet using 224×224 input. This shows that the proposed CNN structure and power of two quantization method can achieve high classification accuracy with reshaped input.

Model                Input Reshaping   Accuracy (%)
ImageNet-Small/224   No                52.32
ImageNet-Small/56    No                46.92
ImageNet-Small/56    Yes               50.84
ImageNet-Large/56    Yes               67.57

Table 4. Evaluating impact of input reshaping.
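One way to obtain a 48×56×56 input from a 3×224×224 image is a space-to-depth rearrangement with 4×4 blocks, since 3 × 16 = 48 and 224 / 4 = 56. The sketch below is an assumption about how such a reshaping could be done; the exact procedure is defined in Section 3.3, and the channel ordering chosen here is arbitrary.

```python
def reshape_input(image, block=4):
    """Rearrange a [C][H][W] nested list into [C*block*block][H/block][W/block].

    Each output channel collects the pixels at one fixed offset (dy, dx)
    within every block x block patch of one input channel.
    """
    channels, height, width = len(image), len(image[0]), len(image[0][0])
    if height % block or width % block:
        raise ValueError("spatial size must be divisible by the block size")
    out = []
    for c in range(channels):
        for dy in range(block):
            for dx in range(block):
                out.append([[image[c][y + dy][x + dx]
                             for x in range(0, width, block)]
                            for y in range(0, height, block)])
    return out

# Small stand-in for an image: 3 channels of 8x8 pixels with unique values.
image = [[[c * 100 + y * 8 + x for x in range(8)] for y in range(8)]
         for c in range(3)]
reshaped = reshape_input(image)
print(len(reshaped), len(reshaped[0]), len(reshaped[0][0]))
```

The rearrangement keeps every pixel, so the spatial size shrinks by 16× per channel without discarding image information.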

6.4.2. Impact of Power of Two Weight Quantization

While power of two weight quantization allows for an exceedingly efficient implementation, it introduces some loss in classification accuracy when compared against a full-precision version of the same network. Additionally, if such schemes are only evaluated on easier datasets (such as CIFAR-10), the reduction in accuracy can be understated when transitioning to harder datasets (such as ImageNet). Table 5 shows the classification accuracy for the CIFAR-10 and ImageNet-Small/56 models using full-precision and power of two weights. We see that while the gap between the CIFAR-10 models is only around 2.5%, the gap for ImageNet is closer to 6%. However, as we demonstrate in Section 6.4.1, this reduction in classification accuracy can often be alleviated by increasing the model size.

CIFAR-10 ImageNet-Small/56
Full-Precision 95.28 57.16
Power of Two 92.80 50.84
Table 5. Comparing classification accuracy (%) for full-precision and power of two weights for the CIFAR-10 and ImageNet-Small/56 models.
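For readers unfamiliar with power of two quantization, a minimal sketch of mapping a full-precision weight to a signed power of two is given below. The rounding rule (nearest exponent in the log domain) and the exponent range of -6 to 0 are assumptions for illustration; the scheme actually used for training is the one described in Section 2.3.

```python
import math

def quantize_pow2(weight, min_exp=-6, max_exp=0):
    """Map a real weight to sign * 2**e with e clamped to [min_exp, max_exp].

    min_exp and max_exp are hypothetical bounds chosen for this sketch.
    """
    if weight == 0:
        return 0.0  # zero (pruned) weights stay zero
    exponent = round(math.log2(abs(weight)))       # nearest exponent in log domain
    exponent = max(min_exp, min(max_exp, exponent))  # keep within representable range
    return math.copysign(2.0 ** exponent, weight)

for w in (0.3, -0.7, 0.0, 5.0, 0.001):
    print(w, quantize_pow2(w))
```

Every nonzero weight becomes a sign and a shift amount, which is exactly the information a SAC cell needs to store.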

6.4.3. Impact of Removing Residual Connections

Figure 15 shows the impact of residual connections by evaluating the CIFAR-10 network structure with and without residual connections. In order to ensure that there is not an unseen interaction between the power of two quantization and residual connections, we compare the impact of residual connections on networks with and without quantization. We see that, regardless of quantization, networks trained without residual connections achieve similar performance to the networks trained with residual connections. This shows that residual connections have minor impact on classification accuracy for the 19 layer networks as shown by He et al. in the original ResNet paper (He et al., 2016).

Figure 15. Classification accuracy over 150 epochs for CIFAR-10 models trained with/without residual connections and with/without power of two quantization.

7. Conclusion

In this paper, we propose using full-stack optimizations for accurate, low-latency and high energy-efficiency CNN inference. We demonstrate that designs ranging from CNN model training at a high level, to those of computing structures and FPGA implementation at a low level can all be optimized simultaneously to ensure they fit one another, thereby achieving high system performance. While cross-layer optimization is a known concept in the literature, to the best of our knowledge, the system reported in this paper is one of the most comprehensive realizations based on full-stack optimization for the design of deep learning implementations on a chip.

We describe implementation details of various optimization techniques, including (1) channel shifts instead of computationally more expensive 3×3 convolutions, (2) packing sparse CNNs of irregular sparsity structure for efficient implementations on regular processor arrays, (3) quantizing data activations for power-saving with zero-skipping and efficient storage of intermediate data between layers, and (4) use of power of two weights and batch normalization for efficient computation.

Our Selector-Accumulator (SAC) design resulting from full-stack optimization with power of two weights represents an extremely efficient way of implementing MAC by selecting from a shift register rather than performing arithmetic operations. (It would be difficult to have a more efficient MAC design, short of analog implementations!) Given that MAC is the basic operation in the dot-product computation for matching data against filters, we believe our SAC result is significant.