Machine learning has rapidly proliferated into different fields of life, ranging from automotive and smart environments to medicine. Due to their high accuracy, larger and deeper Convolutional Neural Networks (CNNs) have become the key technology for many applications like advanced vision processing. However, this accuracy comes at the cost of significant computational and energy requirements, and thus requires specialized accelerator-based architectures.
I-A State-of-the-Art and Their Limitations
Plenty of work has been carried out in designing CNN accelerators, most of which focus on un-structurally sparse neural networks, like EIE, SCNN and Cnvlutin. The EIE and SCNN architectures exploit sparsity in both weights and activations for accelerating the networks, while Cnvlutin exploits sparsity in activations only. The sparsity in activations usually comes from converting all the negative activations to zeros in the Rectified Linear Unit (ReLU), as used in these designs. However, the above accelerators cannot efficiently handle advanced activation functions like Leaky-ReLU, which help the training process but do not result in high sparsity. (Footnote 1: Other non-linear activation functions like scaled exponential linear units (SELU) are not in the scope of this work.) The Tensor Processing Unit (TPU), DaDianNao and Eyeriss architectures accelerate dense neural networks, and can also be modified to provide support for structurally sparse networks. Although these architectures show good performance in terms of latency for the convolutional (CONV) layers, they offer very limited acceleration for the fully-connected (FC) layers, as we will show with the help of a motivational case study in the following.
I-B Motivational Case-Study
To achieve high performance and power/energy efficiency, state-of-the-art CNN accelerators exploit the reuse of activations, weights and partial sums, thereby increasing the data locality and reducing the number of off-chip memory accesses. In this respect, the conventional systolic array-based designs (like Google’s TPU) are very effective, because each Processing Element (PE) in the Systolic Array (SA) performs three key tasks. (1) It receives data from its upstream neighbor(s). (2) It performs the basic multiply-and-accumulate (MAC) operation(s). (3) It passes the data along with the partial result(s) to its downstream neighbor(s). Hence, for computations that involve both activation and weight reuse (i.e., CONV layers), the overall speedup of these systolic arrays is significant, as shown in Fig. 1 for AlexNet. However, in the case of only activation reuse (i.e., where one single input has to be used for multiple computations while the weights have to be used only once), the amount of speedup is very limited; see Fig. 1. Such operations dominate the FC layers, as illustrated by their dataflow in Fig. 2.
Our analysis in Fig. 1b illustrates that although the overall speedup for the CONV layers is significant, the conventional systolic array does not provide matching speedup for the FC layers. This ultimately limits the overall performance of accelerating the networks, especially when dominated by FC layers. Hence, there is a significant need for an accelerator-based architecture that can expedite both the CONV and the FC layers, to achieve a high speedup for the complete CNN. Designing such an architecture, however, bears a broad range of challenges, as discussed below.
I-C Associated Scientific Challenges
Specialized systolic arrays need to be designed that can accelerate both CONV and FC layers without incurring significant area and power/energy overheads compared to the conventional approaches. Such systolic arrays should account for the diverse dataflows of both types of layers, while fully utilizing the available memory bandwidth. For instance, the CONV layers’ acceleration needs simple, fast, yet massively parallel PEs to exploit activation, weight, and partial-sum reuse, whereas the FC layers’ acceleration can only exploit activation reuse in single-sample batch processing. Note that the FC layers’ acceleration can exploit weight reuse only in multi-sample batch processing, which is not suitable for real-time or latency-sensitive applications.
I-D Our Novel Contributions
To overcome the above research challenges, we make the following novel contributions.
MPNA: A Massively-Parallel Neural Array (Section IV): It integrates heterogeneous systolic arrays, an efficient dataflow controller and specialized on-chip memory to maximize data reuse, and other necessary architectural components to jointly accelerate FC and CONV layers.
A Design Methodology (Section III): The MPNA architecture is systematically designed using a synergistic methodology that explores different data reuse techniques and architectural alternatives. Towards this, we also present the computational complexity and data-reuse analysis for CNNs (Section III-A).
Optimized Dataflows (Section V): We propose different dataflow optimizations for efficient processing on heterogeneous systolic arrays while reducing the number of DRAM accesses and maximally using the data reuse, thereby improving the overall processing efficiency.
Hardware Implementation and Evaluation (Section VI and VII): We synthesize the complete MPNA architecture for a 28nm CMOS technology library using the ASIC design tools, and perform functional and timing validation. Our results show that the MPNA architecture offers overall performance improvement compared to a state-of-the-art accelerator, and 51% energy saving compared to the baseline architecture. MPNA achieves GOPS/W performance efficiency at MHz and consumes mW.
Before proceeding further, we first present the basics of CNNs, which are necessary to understand the contributions in the later sections.
Neural networks are composed of various layers, which are connected in cascade. Each layer receives some input from the preceding layer, performs certain operations, and forwards the result to the succeeding layer. A CNN mainly consists of four types of processing layers: (1) Convolutional, for extracting features; (2) Fully-Connected, for classification; (3) Activation, for introducing non-linearity; and (4) Pooling, for sub-sampling. Among these layers, the convolutional (CONV) layers are the most computationally intensive, while the fully-connected (FC) layers are the most memory intensive ones. Fig. 6a illustrates this observation using the percentage of weights and MAC operations required for the CONV and FC layers in the AlexNet and VGG-16 networks.
A convolutional layer receives the data from the inputs or the preceding layer, and performs the convolution operation using several filters to obtain several output feature maps, each corresponding to the output of one filter. Fig. 3 shows a detailed illustration of a single convolutional layer in terms of its 2D input feature maps, its 2D output feature maps, and the 2D kernel of a filter between each input and output feature map. An activation is referred to by its location (m,n) in an output feature map. Similarly, a weight/synapse is referred to by its location (p,q) in the 2D filter kernel between an input and an output feature map. We consider a convolutional stride of 1, unless stated otherwise. The FC layers can be considered a special case of CONV layers where the input and output are 1D arrays and, therefore, can be represented using the same terminology.
III Methodology for Designing DNN Accelerators
Our methodology for designing optimized DNN accelerator-based architectures is shown in Fig. 4. It consists of the following key steps, which are explained in detail in the subsequent sections.
Analyze different data-reuse techniques, which can be exploited for energy- and performance-efficient execution of a given DNN on the hardware accelerator.
Design and optimize individual hardware components for all the elementary functions required for the DNN execution.
Design area and power/energy-efficient processing arrays as key accelerator units, which can support the most effective types of dataflow/parallelism for high-performance execution of all the computational layers of a given DNN.
Devise an optimized hardware configuration considering key architectural parameters like the size and number of processing arrays, interconnect of components affecting the supported dataflows, on-chip buffers, data reuse and memory organization considering the available DRAM bandwidth.
Define highly efficient dataflows for reducing the total number of DRAM accesses required for the DNN inference.
Synthesize the complete hardware architecture for detailed benchmarking of area, performance/throughput, and power/energy consumption considering different DNN configurations.
III-A Computational Complexity and Data-Reuse Analysis
The CNN complexity can be estimated by analyzing the number of computations required for the CONV and FC layers. Fig. 5 illustrates the pseudocode of the CNN layer execution. Its parameters define the number of input and output feature maps, the number of rows and columns in the output feature maps, and the number of rows and columns in the filter kernels. These parameters can be used to derive other parameters, such as the number of filters.
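As a rough illustration of the kind of layer execution that Fig. 5 describes, the following Python sketch writes out a six-level loop nest for one CONV layer (stride 1, no padding) and counts the MAC operations it performs. The symbol names `I`, `J`, `M`, `N`, `P`, `Q` and the function `conv_layer` are assumptions made for this sketch; they are not notation taken from the figure.

```python
def conv_layer(inputs, weights):
    """Direct CONV layer with stride 1 and no padding (illustrative sketch).

    inputs:  I input feature maps, each of size (M + P - 1) x (N + Q - 1)
    weights: J filters, each holding I kernels of size P x Q
    Returns the J output feature maps (M x N each) and the number of MACs.
    """
    I = len(inputs)                                   # input feature maps
    J = len(weights)                                  # output feature maps
    P, Q = len(weights[0][0]), len(weights[0][0][0])  # kernel rows, columns
    M = len(inputs[0]) - P + 1                        # output rows
    N = len(inputs[0][0]) - Q + 1                     # output columns

    outputs = [[[0] * N for _ in range(M)] for _ in range(J)]
    macs = 0
    for j in range(J):                      # output feature map
        for i in range(I):                  # input feature map
            for m in range(M):              # output row
                for n in range(N):          # output column
                    for p in range(P):      # kernel row
                        for q in range(Q):  # kernel column
                            outputs[j][m][n] += (
                                inputs[i][m + p][n + q] * weights[j][i][p][q]
                            )
                            macs += 1
    return outputs, macs


# Toy CONV layer: one 3x3 input map, two filters with 2x2 kernels.
conv_in = [[[1, 2, 3], [4, 5, 6], [7, 8, 9]]]
conv_w = [[[[1, 0], [0, 1]]], [[[1, 1], [1, 1]]]]
conv_out, conv_macs = conv_layer(conv_in, conv_w)

# An FC layer as the special case of 1x1 maps and 1x1 kernels.
fc_in = [[[2]], [[3]]]
fc_w = [[[[1]], [[1]]]]
fc_out, fc_macs = conv_layer(fc_in, fc_w)
```

In this sketch the MAC count is simply the product of the six loop bounds, so the FC case degenerates to one MAC per weight.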
Table I shows the number of MAC operations required for each CONV and FC layer of the AlexNet and VGG-16 networks. Fig. 6b and c show the weight-, input activation-, and output activation-reuse factors for AlexNet and VGG-16, respectively. The data-reuse factor defines the number of MAC operations in which a specific data item is used. It can be observed from the figures that the data-reuse pattern can be classified into two main categories: (1) CONV layers, where all types of data have a significant reuse factor, and (2) FC layers, where the per-sample weight-reuse is 1. This classification is also supported by the fact that the CONV layers are more computationally intensive and the FC layers are more memory intensive, as shown in Fig. 6a. These observations, along with the data-reuse pattern, are exploited in Sections IV and V for designing a novel architecture that can maximally benefit from the data-reuse and a dataflow that reduces the number of off-chip memory accesses, respectively.
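The two categories can be made concrete with a small calculation. The sketch below computes reuse factors for a stride-1 layer from its shape. The function name `reuse_factors`, its parameter names, and the `batch` argument are assumptions of this sketch. The activation figures are upper bounds, since activations at feature-map borders are used fewer times. The example shapes are illustrative (the CONV shape matches the commonly quoted first layer of AlexNet), not values read from Fig. 6.

```python
def reuse_factors(I, J, M, N, P, Q, batch=1):
    """Reuse factors of a stride-1 layer (illustrative sketch).

    I, J: number of input / output feature maps
    M, N: rows / columns of each output feature map
    P, Q: rows / columns of each filter kernel
    batch: number of samples processed with the same weights
    """
    return {
        # Each weight is used once per output position, for every sample.
        "weight": M * N * batch,
        # Each input activation is used up to P*Q times by each of J filters.
        "input_activation": J * P * Q,
        # Each output activation accumulates I*P*Q partial products.
        "output_activation": I * P * Q,
    }


conv = reuse_factors(I=3, J=96, M=55, N=55, P=11, Q=11)
fc = reuse_factors(I=9216, J=4096, M=1, N=1, P=1, Q=1)
fc_batched = reuse_factors(I=9216, J=4096, M=1, N=1, P=1, Q=1, batch=8)
```

With a single sample the FC weight reuse collapses to 1, and it only grows with the batch size, which is the property the rest of the design builds on.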
|Observation||# of MACs/Sample||# of Weights|
IV The MPNA Architecture
Fig. 7 presents the top-level view of our MPNA architecture with detailed components, as discussed in the subsequent sub-sections.
IV-A Overview of our MPNA Architecture (Fig. 7A)
The MPNA architecture is composed of two heterogeneous systolic arrays, an accumulation unit, a pooling & activation unit, on-chip data and weight buffers, a control unit, and connectivity to DRAM. Each systolic array is specialized to support specific types of data parallelism for accelerating a specific set of configurations of computational layers while incurring minimum overheads. The systolic arrays receive data and weights from the on-chip buffers, perform MAC operations, and forward the resultant partial sums to the accumulation block. The accumulation block holds the partial outputs while the rest of their corresponding partial outputs are being computed, and accumulates them together. Once an output is complete, the accumulation block forwards it to the subsequent block for the pooling and activation operations, or sends it back to the on-chip data buffer. The data is then either used for the next layer or moved to the DRAM until the rest of the intermediate operations are completed.
IV-B Heterogeneous Systolic Arrays (Fig. 7B-D)
Systolic Array for the Convolutional Layers (SA-CONV, Fig. 7B-C): Following the advanced architectural trends in DNN systolic arrays, we design the CONV systolic array such that it can exploit the activation (input data), weight, and partial-sum reuse. Our SA-CONV integrates a massively parallel array of Processing Elements (PEs) for dense MAC processing. Each PE receives the activations (input data) from its neighboring-left PE, and the weights and partial sum from its neighboring-top PE, and passes the output to its downstream neighbor. The left-most PEs in the array receive data from the input buffers and the top-most PEs receive weights from the weight buffer. The processed data is then forwarded to the accumulation block by the bottom-most PEs. Such a systolic array requires weights from the same filter/neuron to be mapped onto the same column of the array, and weights that are to be multiplied with the same input activations to be mapped onto parallel columns. This enables high activation and weight reuse.
To support parallel weight movement during computation, we propose to include an additional register that holds the current weight values while the values that are to be used in the next iteration are moved to their respective locations. This significantly reduces the initialization time of the systolic array.
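The activation and partial-sum movement described for SA-CONV can be sketched with a small cycle-level model. The following Python code is a minimal sketch, assuming that the weights are already loaded and stay in place, that row r of the activation stream is delayed by r cycles, and that each PE holds one activation register and one partial-sum register. The function name `systolic_matmul` is hypothetical, and the additional weight register for preloading the next weights is not modeled.

```python
def systolic_matmul(X, W):
    """Cycle-level model of a weight-stationary systolic array (sketch).

    X: T input vectors of length R (one element per array row)
    W: R x C weights, W[r][c] held in PE(r, c); column c maps one filter
    Returns Y with Y[t][c] = sum_r X[t][r] * W[r][c], and the cycle count.
    """
    T, R, C = len(X), len(W), len(W[0])
    act = [[0] * C for _ in range(R)]   # activation registers (move right)
    psum = [[0] * C for _ in range(R)]  # partial-sum registers (move down)
    Y = [[0] * C for _ in range(T)]
    cycles = T + R + C - 2              # stream length plus pipeline fill/drain

    for t in range(cycles):
        new_act = [[0] * C for _ in range(R)]
        new_psum = [[0] * C for _ in range(R)]
        for r in range(R):
            for c in range(C):
                if c == 0:
                    s = t - r           # row r is skewed by r cycles
                    a_in = X[s][r] if 0 <= s < T else 0
                else:
                    a_in = act[r][c - 1]              # from the left neighbor
                p_in = psum[r - 1][c] if r > 0 else 0  # from the top neighbor
                new_act[r][c] = a_in
                new_psum[r][c] = p_in + a_in * W[r][c]  # MAC
        act, psum = new_act, new_psum
        # The bottom row hands finished sums to the accumulation block.
        for c in range(C):
            s = t - (R - 1) - c
            if 0 <= s < T:
                Y[s][c] = psum[R - 1][c]
    return Y, cycles


X = [[1, 2], [3, 4], [5, 6]]
W = [[1, 2, 3], [4, 5, 6]]
Y, cycles = systolic_matmul(X, W)
```

In this model every weight is reused for all T input vectors while staying in its PE. With T = 1, as for an FC layer processing a single sample, the fill and drain cycles dominate.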
Systolic Array for the Fully-Connected Layers (SA-FC, Fig. 7C-D): As also supported by prior studies on convolutional accelerators, the SA-CONV can provide significant throughput for multi-batch processing (with larger batch sizes), provided a reasonable size of on-chip memory. However, this can significantly affect the latency of DNN inference, which is an important parameter for almost all real-world applications. To support such scenarios, we propose a novel systolic array architecture (SA-FC), which can accelerate both the CONV and the FC layers for smaller batch sizes as well. The design is based on the observation that the weight-reuse factor per sample in all the FC layers is 1, as shown in Fig. 6b and c. This makes the SA-CONV ineffective, as highlighted in Section I. However, the overall bandwidth required for such cases is huge, especially for larger DNNs. Therefore, our proposed systolic array SA-FC can be time-multiplexed for processing bandwidth-intensive FC and computationally intensive CONV layers. Towards generalization, it can also be effectively used for multi-batch processing while incurring minimum area and power overheads when compared to SA-CONV. However, integrating both SA-FC and SA-CONV is a better design option w.r.t. the area, performance, and power/energy efficiency, as we will show in the results section. Fig. 7D shows that, unlike in SA-CONV, the SA-FC has dedicated connections from the weight buffer to each individual PE. This enables the system to update the weights in the PEs at every clock cycle, thereby providing the capability to support high-performance execution of the FC layers. The supporting dataflow for the SA-FC is illustrated in Fig. 8.
IV-C Accumulation Unit (Fig. 7E)
The accumulation unit is composed of several sub-units (equal to the total number of columns in SA-CONV and SA-FC) to support parallel processing. Each sub-unit is composed of a Scratch-Pad-Memory (SPM) for storing the partial output activations generated by the systolic arrays, and an adder for accumulating the incoming partial sums with the stored values. Once the output activations are complete, the values are forwarded to the succeeding pooling and activation block for further processing.
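The behavior of the accumulation unit can be sketched in a few lines of Python. This is a functional model only, under the assumption that each SPM entry holds one partial output activation. The class names `AccumulationSubUnit` and `AccumulationUnit` and their methods are hypothetical.

```python
class AccumulationSubUnit:
    """One sub-unit: a scratch-pad memory (SPM) plus an adder (sketch)."""

    def __init__(self, spm_entries):
        self.spm = [0] * spm_entries

    def accumulate(self, index, partial_sum):
        # Adder: combine the incoming partial sum with the stored value.
        self.spm[index] += partial_sum

    def drain(self, count):
        # Forward completed outputs and clear the entries for reuse.
        outputs = self.spm[:count]
        self.spm[:count] = [0] * count
        return outputs


class AccumulationUnit:
    """One sub-unit per systolic-array column, operating in parallel."""

    def __init__(self, columns, spm_entries):
        self.sub_units = [AccumulationSubUnit(spm_entries) for _ in range(columns)]

    def accumulate_row(self, index, partial_sums):
        # partial_sums carries one value per column of the array.
        for sub_unit, value in zip(self.sub_units, partial_sums):
            sub_unit.accumulate(index, value)

    def drain(self, count):
        return [sub_unit.drain(count) for sub_unit in self.sub_units]


unit = AccumulationUnit(columns=2, spm_entries=4)
unit.accumulate_row(0, [1, 10])   # first partial sums for output 0
unit.accumulate_row(0, [2, 20])   # remaining partial sums for output 0
unit.accumulate_row(1, [5, 50])   # partial sums for output 1
completed = unit.drain(2)
```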
IV-D Pooling and Activation Unit (Fig. 7F-I)
After the CONV and FC layers, the activation function is applied, followed by a pooling layer that reduces the size of the feature maps for subsequent layers. The MPNA provides support for the state-of-the-art MaxPooling, which is deployed in almost all modern DNNs. Since the activation functions are typically monotonically increasing functions, they can be moved after the pooling operation to curtail the number of activation function evaluations and to reduce the hardware complexity. Figs. 7F-H show that this block consists of an SPM to hold the intermediate pooling results, and a pooling and activation computation module. The MPNA architecture currently supports two activation functions that are commonly used in DNNs (i.e., ReLU and Leaky-ReLU).
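The reordering of activation and MaxPooling relies on the activation being monotonically non-decreasing, so that the maximum of the activated values equals the activation of the maximum. The following Python sketch checks this for ReLU and Leaky-ReLU on a few pooling windows. The function names and the Leaky-ReLU slope of 0.01 are assumptions of this sketch, not parameters stated for MPNA.

```python
def relu(x):
    return x if x > 0 else 0.0


def leaky_relu(x, slope=0.01):
    # slope is an assumed value for illustration only
    return x if x > 0 else slope * x


def activate_then_pool(window, activation):
    # Conventional order: one activation evaluation per element.
    return max(activation(v) for v in window)


def pool_then_activate(window, activation):
    # Reordered: one activation evaluation per pooling window.
    return activation(max(window))


windows = [
    [-3.0, 0.5, -1.0, 2.0],   # mixed signs
    [-3.0, -1.0, -2.0, -4.0],  # all negative
    [0.0, 0.0, 0.0, 0.0],      # all zero
]
results = [
    (activate_then_pool(w, f), pool_then_activate(w, f))
    for w in windows
    for f in (relu, leaky_relu)
]
```

For a 2x2 pooling window, as in the lists above, the reordering replaces four activation evaluations with one.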
V Dataflow Optimization
To effectively use the memory (both on-chip and off-chip) and the compute capabilities of our architecture, we propose a set of dataflows (Fig. 9) that can be employed depending on the configuration of the CONV and FC layers. To explain this, we first present the types of data-reuse and their dependencies on different data.
V-A Data-reuse and Their Dependencies
Input Activation-Reuse is the number of times an input activation is used by the same filter multiplied by the number of filters in a layer. To fully exploit this reuse, all the weights that operate on the feature map of the input activation, and the corresponding output activations, should be available for each available input sample.
Output Activation-Reuse is defined by the number of times partial sums are added into an output activation, which is determined by the size of the filters in a layer. To fully exploit this reuse, all the input activations and weight values corresponding to the output activation should be available.
Weight-Reuse is given as the number of times a weight value is used in the computation of a layer, which equals the size of an output feature map. To exploit this completely, all the input activations and the corresponding output feature map should be available on chip.
V-B Possible Scenarios and Corresponding Dataflows (Fig. 9)
Case 1: All input and output activations, and a set of weights () that has to be uploaded in the processing array in the next stage can be stored on-chip. Here, and represents the number of rows and columns in the systolic array. Also, the output activations in one can be accommodated in the SPM of a single accumulation sub-unit. In this case, we can avoid input and output activation movement to DRAM and will fetch the weights once only. This is very effective for the later CONV layers where the total size of the input and output activation maps is small, and the number of filter parameters is huge.
Case 2: All input and output activations can be completely stored in on-chip data buffer. The output activations in one cannot be accommodated in the SPM of a single accumulation sub-unit. In this case, if the overall weight buffer allows to accommodate (or ) complete filters, we partition the input feature maps into multiple blocks to fit the output channels in SPMs, as shown in Fig. 9
Case 3: The input and output activations cannot be completely stored on-chip. In this case, we give preference to the input activations if they can be completely stored, and hence the Case 1 dataflow can be used while moving the resultant outputs to off-chip memory.
Case 4: For all other cases, the best-possible configuration for partitioning data is selected using the methodology proposed in  with following constraints: (1) The set of filters being processed together should be a multiple of . (2) The number of weights selected from each filter at one time should be a multiple of .
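The choice between the four cases can be summarized as a simple decision function. The sketch below is a simplification under stated assumptions: sizes are given in bytes, "the output activations in one" is read as the output activations of one output feature map, and the weight-buffer conditions of Cases 1 and 2 as well as the partitioning constraints of Case 4 are not modeled. The function name `select_dataflow_case` and its parameters are hypothetical.

```python
def select_dataflow_case(in_act_bytes, out_act_bytes, out_map_bytes,
                         data_buffer_bytes, spm_bytes):
    """Pick one of the four dataflow cases for a layer (simplified sketch).

    in_act_bytes / out_act_bytes: total input / output activations of the layer
    out_map_bytes: output activations of one output feature map (assumed reading)
    data_buffer_bytes: capacity of the on-chip data buffer
    spm_bytes: SPM capacity of a single accumulation sub-unit
    """
    all_activations_fit = in_act_bytes + out_act_bytes <= data_buffer_bytes
    if all_activations_fit and out_map_bytes <= spm_bytes:
        return 1  # keep activations on-chip, fetch the weights only once
    if all_activations_fit:
        return 2  # partition the input feature maps so outputs fit the SPMs
    if in_act_bytes <= data_buffer_bytes:
        return 3  # keep inputs on-chip, move resultant outputs off-chip
    return 4      # fall back to the general partitioning methodology


DATA_BUFFER = 256 * 1024  # bytes
SPM = 256                 # bytes
cases = [
    select_dataflow_case(40_000, 40_000, 169, DATA_BUFFER, SPM),
    select_dataflow_case(40_000, 40_000, 3_025, DATA_BUFFER, SPM),
    select_dataflow_case(200_000, 200_000, 169, DATA_BUFFER, SPM),
    select_dataflow_case(300_000, 300_000, 169, DATA_BUFFER, SPM),
]
```

The activation sizes in the example calls are made-up values chosen only to exercise each branch.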
V-C Observations and Hardware Configurations
The of CONV till CONV5 (i.e., last three CONV layers) should fit in SPM of the accumulation, and the pooling and activation units. Since the size of in these layers is , we selected SPM which can hold up to elements.
For holding the input & output activations of CONV till CONV layers of the AlexNet on-chip, we selected a KB data buffer for two systolic arrays, i.e., greater than four times which is the size of the activation maps of CONV.
Systolic arrays of size 8x8 PEs, which provide significant parallelism while not requiring much off-chip memory bandwidth.
|Systolic Arrays||Size of SA-CONV = 8x8 PEs; size of SA-FC = 8x8 PEs|
|SPM||Size of SPM in each sub-unit of the Accumulation block and the Pooling & Activation block = 256B|
|Weight Buffer||Size of weight buffer = 36KB|
|Data Buffer||Size of data buffer = 256KB|
|DRAM||Size of DRAM = 2Gb; bandwidth of DRAM = 12.8GB/s|
VI Evaluation Methodology (Fig. 10)
We developed a fully functional simulator to model the behavior of the MPNA. It is integrated with CACTI 7.0 for memory models and the respective area, power, and energy estimation. The complete MPNA architecture is also designed in RTL and synthesized for a 28nm technology using Synopsys design tools. We used ModelSim for logic simulation for functional and timing validation, and obtained the critical path delay, area, and power afterwards. We compared our SA-FC architecture with SA-CONV, and our MPNA with conventional systolic array-based accelerators (as baselines), for three array sizes using AlexNet. We also compared our MPNA with several state-of-the-art accelerators such as Eyeriss, SCNN, and FlexFlow.
VII Experimental Results and Discussion
Systolic Array: We compared our SA-FC and SA-CONV to get the profile of the proposed SA-FC in terms of area and power. Fig. 11 shows that the SA-FC incurs insignificant area and power overhead compared to SA-CONV. Fig. 12a shows that SA-FC achieves 8.1x speed-up compared to using only the SA-CONV for FC layers, due to its microarchitectural enhancements that provide data to the PEs in time to generate results every clock cycle.
Key Observations for Performance Evaluation (Figs. 12b-d):
Our MPNA achieves 1.4x to 7.2x higher speed-up for AlexNet compared to the conventional systolic array-based architectures. These improvements come from the parallelism of heterogeneous computing arrays and efficient dataflows.
Further comparison of MPNA with the state-of-the-art accelerators is summarized in Table III, showing competitive characteristics of our MPNA for full CNN acceleration.
Key Observations for Power/Energy Evaluation (Figs. 12e,g):
MPNA consumes mW average power, which is dominated by pooling and activation unit due to its local memories and activation function processing.
MPNA achieves an overall energy reduction of 51% compared to the baseline architecture (Fig. 12e), due to reduced memory accesses and maximal data-reuse as a result of the optimized dataflows.
Key Observations for Area Evaluation (Figs. 12f): The area of MPNA (mm) is occupied by computational parts (mm) and on-chip memories (mm) comprising data and weights buffers. Table III shows that our MPNA consumes a competitively small area compared to other state-of-the-art accelerators.
|On-chip Memory (KB)||181.5||1024||64||288|
In this work, we demonstrate that a significant speedup for both CONV and FC layers can be achieved by a synergistic design methodology encompassing dataflow optimization, diverse types of data-reuse and the MPNA architecture with heterogeneous systolic arrays and specialized buffers. The complete architecture is synthesized in a 28nm technology with an ASIC design flow, and a comprehensive evaluation is done for area, performance, power, and energy, showing significant gains of our approach over various state-of-the-art accelerators. Our novel concepts and open-source hardware would enable further research on accelerating emerging DNNs (like Capsule neural networks).
-  V. Sze and Y. H. Chen and T. J. Yang and J. S. Emer, “Efficient Processing of Deep Neural Networks: A Tutorial and Survey,” in Proceedings of the IEEE, vol. 105, no. 12, pp. 2295-2329, Dec. 2017.
-  S. Han et al.,“EIE: Efficient Inference Engine on Compressed Deep Neural Network,” 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA), Seoul, 2016, pp. 243-254.
-  A. Parashar et al., “SCNN: An accelerator for compressed-sparse convolutional neural networks,” 2017 ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA), Toronto, ON, 2017, pp. 27-40.
-  J. Albericio, P. Judd, T. Hetherington, T. Aamodt, N. E. Jerger and A. Moshovos,“Cnvlutin: Ineffectual-Neuron-Free Deep Neural Network Computing,” 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA), Seoul, 2016, pp. 1-13.
-  G. Klambauer, T. Unterthiner, A. Mayr, and S. Hochreiter, “Self-normalizing neural networks,” in Advances in Neural Information Processing Systems 30, pp. 971–980, 2017.
-  N. P. Jouppi et al., “In-datacenter performance analysis of a tensor processing unit,” 2017 ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA), Toronto, ON, 2017, pp. 1-12.
-  T. Luo et al., “DaDianNao: A Neural Network Supercomputer,”in IEEE Transactions on Computers, vol. 66, no. 1, pp. 73-88, 1 Jan. 2017.
-  Y. Chen, T. Krishna, J. S. Emer and V. Sze,“Eyeriss: An Energy-Efficient Reconfigurable Accelerator for Deep Convolutional Neural Networks,” in IEEE Journal of Solid-State Circuits, vol. 52, no. 1, pp. 127-138, Jan. 2017.
-  S. Anwar, K. Hwang, and W. Sung, “Structured pruning of deep convolutional neural networks,” CoRR, vol. abs/1512.08571, 2015.
-  A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Proceedings of the 25th International Conference on Neural Information Processing Systems - Volume 1, NIPS’12, USA, pp. 1097–1105, 2012.
-  J. Redmon, S. Divvala, R. Girshick and A. Farhadi, “You Only Look Once: Unified, Real-Time Object Detection,” 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, 2016, pp. 779-788.
-  N. Muralimanohar, A. Shafiee, and V. Srinivas, “Cacti 7.0.” https://github.com/HewlettPackard/cacti, 2018.
-  Y. Chen, T. Krishna, J. Emer, and V. Sze, “14.5 eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks,” in 2016 IEEE International Solid-State Circuits Conference (ISSCC), pp. 262–263, Jan. 2016.
-  W. Lu, G. Yan, J. Li, S. Gong, Y. Han and X. Li, “FlexFlow: A Flexible Dataflow Accelerator Architecture for Convolutional Neural Networks,” 2017 IEEE International Symposium on High Performance Computer Architecture (HPCA), Austin, TX, 2017, pp. 553-564.
-  J. Li, G. Yan, W. Lu, S. Jiang, S. Gong, J. Wu and X. Li, “SmartShuttle: Optimizing off-chip memory accesses for deep learning accelerators,” DATE, 2018.
-  K. T. Malladi, F. A. Nothaft, K. Periyathambi, B. C. Lee, C. Kozyrakis and M. Horowitz, “Towards energy-proportional datacenter memory with mobile DRAM,” 2012 39th Annual International Symposium on Computer Architecture (ISCA), Portland, OR, 2012, pp. 37-48.