I Introduction
Machine learning has rapidly proliferated into different fields of life, ranging from automotive and smart environments to medicine. Due to their high accuracy, larger and deeper Convolutional Neural Networks (CNNs) have become the key technology for many applications like advanced vision processing. However, this accuracy comes at the cost of significant computational and energy requirements, and thus requires specialized accelerator-based architectures.
I-A State-of-the-Art and Their Limitations
Plenty of work has been carried out on designing CNN accelerators [1], most of which target unstructured sparse neural networks, e.g., EIE [2], SCNN [3], and Cnvlutin [4]. The EIE and SCNN architectures exploit sparsity in both weights and activations to accelerate the networks, while Cnvlutin exploits sparsity in activations only. The sparsity in activations usually comes from converting all the negative activations to zeros in the Rectified Linear Unit (ReLU), as used in these designs. However, the above accelerators cannot efficiently handle advanced activation functions like LeakyReLU, which help the training process but do not result in high sparsity [5]. (Other nonlinear activation functions like scaled exponential linear units (SELU) are not in the scope of this work.) The Tensor Processing Unit (TPU) [6], DaDianNao [7], and Eyeriss [8] architectures accelerate dense neural networks, and can also be modified to support structurally sparse networks [9]. Although these architectures show good latency for the convolutional (CONV) layers, they offer very limited acceleration for the fully-connected (FC) layers, as we show with the help of a motivational case study in the following.
I-B Motivational Case Study
To achieve high performance and power/energy efficiency, state-of-the-art CNN accelerators exploit the reuse of activations, weights, and partial sums, thereby increasing the data locality and reducing the number of off-chip memory accesses [8]. In this respect, the conventional systolic array-based designs (like Google’s TPU [6]) prove very effective, because each Processing Element (PE) in the Systolic Array (SA) performs three key tasks. (1) It receives data from its upstream neighbor(s). (2) It performs the basic multiply-and-accumulate (MAC) operation(s). (3) It passes the data along with the partial result(s) to its downstream neighbor(s). Hence, for computations that involve both activation and weight reuse (i.e., CONV layers), the overall speedup of these systolic arrays is significant, as shown in Fig. 1 for AlexNet [10]. However, in case of only activation reuse, i.e., where one single input has to be used for multiple computations while each weight is used only once, the amount of speedup is very limited; see Fig. 1. Such operations dominate the FC layers, as illustrated by their dataflow in Fig. 2.
Our analysis in Fig. 1b illustrates that although the overall speedup for the CONV layers is significant, the conventional systolic array does not provide a matching speedup for the FC layers. This ultimately limits the overall performance of accelerating the networks, especially when they are dominated by FC layers. Hence, there is a significant need for an accelerator-based architecture that can expedite both the CONV and the FC layers to achieve a high speedup for the complete CNN. Designing such an architecture, however, poses a broad range of challenges, as discussed below.
I-C Associated Scientific Challenges
Specialized systolic arrays need to be designed that can accelerate both CONV and FC layers without incurring significant area and power/energy overheads compared to the conventional approaches. Such systolic arrays should account for the diverse dataflows of both types of layers, while fully utilizing the available memory bandwidth. For instance, accelerating the CONV layers needs simple, fast, yet massively-parallel PEs to exploit activation, weight, and partial-sum reuse, whereas accelerating the FC layers can only exploit activation reuse in single-sample batch processing. Note that the FC layers can exploit weight reuse only in multi-sample batch processing, which is not suitable for real-time or latency-sensitive applications.
I-D Our Novel Contributions
To overcome the above research challenges, we make the following novel contributions.

MPNA: A Massively-Parallel Neural Array (Section IV): It integrates heterogeneous systolic arrays, an efficient dataflow controller, specialized on-chip memory to maximize data reuse, and other necessary architectural components to jointly accelerate FC and CONV layers.

A Design Methodology (Section III): The MPNA architecture is systematically designed using a synergistic methodology that explores different data-reuse techniques and architectural alternatives. Towards this, we also present the computational complexity and data-reuse analysis for CNNs (Section III-A).

Optimized Dataflows (Section V): We propose different dataflow optimizations for efficient processing on heterogeneous systolic arrays that reduce the number of DRAM accesses and maximize data reuse, thereby improving the overall processing efficiency.

Hardware Implementation and Evaluation (Sections VI and VII): We synthesize the complete MPNA architecture for a 28 nm CMOS technology library using ASIC design tools, and perform functional and timing validation. Our results show that the MPNA architecture offers an overall performance improvement compared to a state-of-the-art accelerator, and 51% energy saving compared to the baseline architecture (Section VII). MPNA achieves 149.7 GOPS/W performance efficiency at 280 MHz and consumes 239 mW (Table III).
II Preliminaries
Before proceeding further, we first present the basics of CNNs, which are necessary to understand the contributions in the later sections.
Neural networks are composed of various layers, which are connected in cascade. Each layer receives some input from the preceding layer, performs certain operations, and forwards the result to the succeeding layer. A CNN mainly consists of four types of processing layers: (1) Convolutional, for extracting features; (2) Fully-Connected, for classification; (3) Activation, for introducing nonlinearity; and (4) Pooling, for sub-sampling. Among these layers, the convolutional (CONV) layers are the most computationally intensive, while the fully-connected (FC) layers are the most memory intensive. Fig. 6a illustrates this observation using the percentage of weights and MAC operations required for the CONV and FC layers in the AlexNet and VGG16 networks.
A convolutional layer receives the data from the inputs or the preceding layer, and performs the convolution operation using several filters to obtain several output feature maps, each corresponding to the output of one filter. Fig. 3 shows a detailed illustration of a single convolutional layer. Each filter holds one 2D kernel per 2D input feature map, i.e., a 2D kernel connects one input feature map to one output feature map. An activation is identified by its location (m,n) in its 2D output feature map, and a weight/synapse by its location (p,q) in the 2D filter kernel between an input feature map and an output feature map. We assume a convolutional stride of 1, unless stated otherwise. The FC layers can be considered as a special case of CONV layers where the input and output are 1D arrays and, therefore, can be represented by the above terminology.
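The layer definition above can be sketched as a plain loop nest. This is a minimal illustration, not the pseudo-code of Fig. 5: the symbol names I, J (numbers of input and output feature maps), M, N (output rows and columns) and P, Q (kernel rows and columns) are assumptions introduced here, and the sketch assumes stride 1 and no padding.

```python
import numpy as np

def conv_layer(ifmaps, weights):
    """Naive stride-1, no-padding CONV layer (hypothetical helper).

    ifmaps:  shape (I, H, W)     -> I input feature maps
    weights: shape (J, I, P, Q)  -> J filters, one PxQ kernel per input map
    returns: shape (J, M, N)     -> J output feature maps
    """
    I, H, W = ifmaps.shape
    J, I_w, P, Q = weights.shape
    assert I == I_w, "each filter needs one kernel per input feature map"
    M, N = H - P + 1, W - Q + 1
    ofmaps = np.zeros((J, M, N), dtype=np.result_type(ifmaps, weights))
    for j in range(J):                      # output feature maps (filters)
        for i in range(I):                  # input feature maps
            for m in range(M):              # output rows
                for n in range(N):          # output columns
                    for p in range(P):      # kernel rows
                        for q in range(Q):  # kernel columns
                            ofmaps[j, m, n] += weights[j, i, p, q] * ifmaps[i, m + p, n + q]
    return ofmaps

def fc_layer(x, w):
    """FC layer as the special case with 1x1 feature maps and 1x1 kernels.

    x: shape (I,), w: shape (J, I); returns shape (J,).
    """
    J, I = w.shape
    return conv_layer(x.reshape(I, 1, 1), w.reshape(J, I, 1, 1)).reshape(J)
```

In the FC special case every weight takes part in exactly one MAC per input sample, which is the property the later sections build on.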
III Methodology for Designing DNN Accelerators
Our methodology for designing optimized DNN accelerator-based architectures is shown in Fig. 4. It consists of the following key steps, which are explained in detail in the subsequent sections.

Analyze different data-reuse techniques, which can be exploited for energy- and performance-efficient execution of a given DNN on the hardware accelerator.

Design and optimize individual hardware components for all the elementary functions required for the DNN execution.

Design area- and power/energy-efficient processing arrays as key accelerator units, which can support the most effective types of dataflow/parallelism for high-performance execution of all the computational layers of a given DNN.

Devise an optimized hardware configuration considering key architectural parameters like the size and number of processing arrays, the interconnect of components affecting the supported dataflows, on-chip buffers, data reuse, and memory organization, considering the available DRAM bandwidth.

Define highly efficient dataflows for reducing the total number of DRAM accesses required for the DNN inference.

Synthesize the complete hardware architecture for detailed benchmarking of area, performance/throughput, and power/energy consumption considering different DNN configurations.
III-A Computational Complexity and Data-Reuse Analysis
The CNN complexity can be estimated by analyzing the number of computations required for the CONV and FC layers. Fig. 5 illustrates the pseudo-code of the CNN layer execution. Its loop bounds are the numbers of input and output feature maps, the numbers of rows and columns in the output feature maps, and the numbers of rows and columns in the filter kernels. These parameters can be used to derive other parameters; for example, the number of filters equals the number of output feature maps. Table I shows the number of MAC operations required for the CONV and FC layers of the AlexNet and VGG16 networks. Fig. 6b and c show the reuse factors of the weights, input activations, and output activations for AlexNet and VGG16, respectively. The data-reuse factor defines the number of MAC operations in which a specific data item is used. It can be observed from the figures that the data-reuse pattern can be classified into two main categories: (1) CONV layers, where all types of data have a significant reuse factor, and (2) FC layers, where the per-sample weight reuse is 1. This classification is also supported by the fact that the CONV layers are more computationally intensive and the FC layers are more memory intensive, as shown in Fig. 6a. These observations, along with the data-reuse patterns, are exploited in Sections IV and V for designing a novel architecture that can maximally benefit from data reuse, and a dataflow that reduces the number of off-chip memory accesses, respectively.

Table I: Number of MACs per sample and number of weights of the CONV and FC layers.
Layer type  # of MACs/Sample (AlexNet)  # of MACs/Sample (VGG16)  # of Weights (AlexNet)  # of Weights (VGG16)
CONV  1.07B  15.34B  3.74M  14.71M
FC  58.62M  123.63M  58.63M  123.64M
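The AlexNet column of Table I can be approximately re-derived from layer shapes. The sketch below is an illustration under stated assumptions: it uses the commonly cited single-tower (ungrouped) AlexNet shapes, ignores biases, and the names `ALEXNET_CONV`, `ALEXNET_FC`, `conv_weights` and `conv_macs` are hypothetical. Exact totals depend on the network variant, so the results should land close to, not necessarily exactly on, the table entries.

```python
# (name, input maps I, output maps J, output rows M, output cols N, kernel rows P, kernel cols Q)
# Assumption: ungrouped AlexNet layer shapes, biases not counted.
ALEXNET_CONV = [
    ("CONV1", 3, 96, 55, 55, 11, 11),
    ("CONV2", 96, 256, 27, 27, 5, 5),
    ("CONV3", 256, 384, 13, 13, 3, 3),
    ("CONV4", 384, 384, 13, 13, 3, 3),
    ("CONV5", 384, 256, 13, 13, 3, 3),
]
# (name, number of inputs, number of outputs)
ALEXNET_FC = [("FC6", 9216, 4096), ("FC7", 4096, 4096), ("FC8", 4096, 1000)]

def conv_weights(I, J, P, Q):
    """Number of weights: one PxQ kernel per (input map, output map) pair."""
    return I * J * P * Q

def conv_macs(I, J, M, N, P, Q):
    """Number of MACs per sample: every weight is used once per output location."""
    return conv_weights(I, J, P, Q) * M * N

total_conv_macs = sum(conv_macs(I, J, M, N, P, Q) for _, I, J, M, N, P, Q in ALEXNET_CONV)
total_conv_weights = sum(conv_weights(I, J, P, Q) for _, I, J, _, _, P, Q in ALEXNET_CONV)
# For an FC layer each weight is used exactly once per sample, so MACs == weights.
total_fc_macs = sum(n_in * n_out for _, n_in, n_out in ALEXNET_FC)

print(f"CONV: {total_conv_macs / 1e9:.2f}B MACs, {total_conv_weights / 1e6:.2f}M weights")
print(f"FC:   {total_fc_macs / 1e6:.2f}M MACs")
```

The ratio of MACs to weights is what separates the two layer types: it is in the hundreds for the CONV layers and exactly 1 for the FC layers under these assumptions.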
IV The MPNA Architecture
Fig. 7 presents the top-level view of our MPNA architecture with detailed components, as discussed in the subsequent subsections.
IV-A Overview of our MPNA Architecture (Fig. 7A)
The MPNA architecture is composed of two heterogeneous systolic arrays, an accumulation unit, a pooling and activation unit, on-chip data and weight buffers, a control unit, and connectivity to DRAM. Each systolic array is specialized to support specific types of data parallelism for accelerating a specific set of configurations of computational layers while incurring minimum overheads. The systolic arrays receive data and weights from the on-chip buffers, perform MAC operations, and forward the resultant partial sums to the accumulation block. The accumulation block holds the partial outputs while the rest of their corresponding partial outputs are being computed, and accumulates them together. Once an output is complete, the accumulation block forwards it to the subsequent block for the pooling and activation operation, or sends it back to the on-chip data buffer. The data is then either used for the next layer or moved to the DRAM until the rest of the intermediate operations are completed.
IV-B Heterogeneous Systolic Arrays (Fig. 7B-D)
Based on the observations of Fig. 6 in Section III-A, we conclude that there is a need for two heterogeneous systolic arrays for processing different types of layers in a given CNN.
Systolic Array for the Convolutional Layers (SA-CONV, Fig. 7B-C): Following the advanced architectural trends in DNN systolic arrays like [6], we design the CONV systolic array so that it can exploit activation (input data), weight, and partial-sum reuse. Our SA-CONV integrates a massively parallel array of Processing Elements (PEs) for dense MAC processing. Each PE receives the activations (input data) from its left neighbor, and the weights and partial sum from its top neighbor, and passes the output to its downstream neighbor. The leftmost PEs in the array receive data from the input buffers and the topmost PEs receive weights from the weight buffer. The processed data is then forwarded to the accumulation block by the bottommost PEs. Such a systolic array requires the weights from the same filter/neuron to be mapped on the same column of the array, while the weights that are to be multiplied with the same input activations are mapped on parallel columns. This enables high activation and weight reuse.
To support parallel weight movement during computation, we propose to include an additional register in each PE that holds the current weight value while the values to be used in the next iteration are moved to their respective locations. This significantly reduces the initialization time of the systolic array.
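The dataflow described for the CONV systolic array (activations moving left to right, partial sums moving top to bottom, weights held in place) can be illustrated with a small cycle-level functional model. This is a sketch under assumptions, not the authors' RTL: the function name `systolic_matmul`, the explicit input skew of one cycle per row, and the omission of weight loading are choices made here for illustration.

```python
import numpy as np

def systolic_matmul(X, W):
    """Cycle-level model of a weight-stationary systolic array (hypothetical).

    X: (T, K) stream of T activation vectors; row r of the array is fed X[:, r].
    W: (K, L) weights; PE(r, c) holds W[r, c], so one filter maps to one column.
    Returns (Y, cycles) with Y == X @ W.
    """
    T, K = X.shape
    K_w, L = W.shape
    assert K == K_w
    act = np.zeros((K, L))    # activation register of each PE
    psum = np.zeros((K, L))   # partial-sum register of each PE
    Y = np.zeros((T, L))
    cycles = T + K + L - 2    # last result leaves the bottom-right PE
    for t in range(cycles):
        new_act = np.zeros_like(act)
        new_psum = np.zeros_like(psum)
        for r in range(K):
            for c in range(L):
                if c == 0:
                    s = t - r  # row r is skewed by r cycles
                    a_in = X[s, r] if 0 <= s < T else 0.0
                else:
                    a_in = act[r, c - 1]                   # from the left neighbor
                p_in = psum[r - 1, c] if r > 0 else 0.0    # from the top neighbor
                new_act[r, c] = a_in
                new_psum[r, c] = p_in + W[r, c] * a_in     # MAC
        act, psum = new_act, new_psum
        for c in range(L):     # bottom row forwards finished sums
            s = t - (K - 1) - c
            if 0 <= s < T:
                Y[s, c] = psum[K - 1, c]
    return Y, cycles
```

With a long activation stream (large T, as in CONV layers) the K + L - 2 fill and drain cycles are amortized; with a single FC sample (T = 1) they dominate, which is one way to see why this organization suits CONV layers better.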
Systolic Array for the Fully-Connected Layers (SA-FC, Fig. 7C-D): As also supported by the studies of [6] for convolutional accelerators, the SA-CONV can provide significant throughput for multi-batch processing (with larger batch sizes), provided a reasonable size of on-chip memory. However, this can significantly affect the latency of DNN inference, which is an important parameter for almost all real-world applications. To support such scenarios, we propose a novel systolic array architecture (SA-FC), which can accelerate both the CONV and the FC layers for smaller batch sizes as well. The design is based on the observation that the per-sample weight-reuse factor in all the FC layers is 1, as shown in Fig. 6b and c. This makes the SA-CONV ineffective, as highlighted in Section I. However, the overall bandwidth required for such cases is huge, especially for larger DNNs. Therefore, our proposed systolic array SA-FC can be time-multiplexed for processing bandwidth-intensive FC and computation-intensive CONV layers. Towards generalization, it can also be effectively used for multi-batch processing while incurring minimum area and power overheads when compared to SA-CONV. However, integrating both SA-FC and SA-CONV is a better design option w.r.t. area, performance, and power/energy efficiency, as we will show in the results section. Fig. 7D shows that, unlike in SA-CONV, the SA-FC has dedicated connections from the weight buffer to each individual PE. This enables the system to update the weights in the PEs at every clock cycle, thereby providing the capability to support high-performance execution of the FC layers. The supporting dataflow for the SA-FC is illustrated in Fig. 8.
IV-C Accumulation Unit (Fig. 7E)
The accumulation unit is composed of several subunits (equal to the total number of columns in SA-CONV and SA-FC) to support parallel processing. Each subunit is composed of a Scratch-Pad Memory (SPM) for storing the partial output activations generated by the systolic arrays, and an adder for accumulating incoming partial sums with the stored values. Once the output activations are complete, the values are forwarded to the succeeding pooling and activation block for further processing.
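The role of the accumulation subunits can be sketched with a tiled matrix-vector product, where each array column feeds one subunit and partial sums from successive weight tiles are added into the SPM. This is a behavioral illustration under assumptions: the class `AccumulationSubUnit`, the function `tiled_matvec`, the 8x8 tile size, the 256-entry SPM, and the use of a single SPM address are hypothetical choices, not the authors' design.

```python
import numpy as np

class AccumulationSubUnit:
    """One SPM plus one adder (behavioral sketch, hypothetical)."""

    def __init__(self, spm_entries=256):
        self.spm = np.zeros(spm_entries, dtype=np.int64)

    def accumulate(self, addr, partial_sum):
        # Adder: incoming partial sum + value already stored in the SPM.
        self.spm[addr] += partial_sum

    def read_and_clear(self, addr):
        # A completed output is forwarded and its SPM entry is freed.
        value = int(self.spm[addr])
        self.spm[addr] = 0
        return value

def tiled_matvec(x, W, rows=8, cols=8):
    """Compute x @ W tile by tile, as a rows x cols array would produce it."""
    K, L = W.shape
    units = [AccumulationSubUnit() for _ in range(cols)]  # one per array column
    y = np.zeros(L, dtype=np.int64)
    for c0 in range(0, L, cols):           # group of output neurons
        for r0 in range(0, K, rows):       # slice of the input vector
            tile = W[r0:r0 + rows, c0:c0 + cols]
            partial = x[r0:r0 + rows] @ tile   # one partial sum per column
            for c, p in enumerate(partial):
                units[c].accumulate(0, p)
        for c in range(min(cols, L - c0)):     # outputs are now complete
            y[c0 + c] = units[c].read_and_clear(0)
    return y
```

An output is only forwarded once all row tiles that contribute to it have been processed, which mirrors the "hold until complete" behavior described above.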
IV-D Pooling and Activation Unit (Fig. 7F-I)
After the CONV and FC layers, the activation function is applied, followed by a pooling layer that reduces the size of the feature maps for subsequent layers. The MPNA supports MaxPooling, which is deployed in almost all modern DNNs. Since the activation functions are typically monotonically increasing, they can be moved after the pooling operation to curtail the number of activation function evaluations and to reduce the hardware complexity. Figs. 7F-H show that this block consists of an SPM to hold the intermediate pooling results, and a pooling and activation computation module. The MPNA architecture currently supports two activation functions that are commonly used in DNNs, i.e., ReLU and LeakyReLU [11].
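The reordering argument can be checked numerically: for a monotonically non-decreasing function f, max(f(a), f(b)) equals f(max(a, b)), so applying the activation after MaxPooling gives the same result with fewer evaluations. The sketch below assumes a 2x2 pooling window with stride 2 and a LeakyReLU slope of 0.1; both values, and the helper names, are illustrative assumptions.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0)

def leaky_relu(x, slope=0.1):
    # Assumption: slope 0.1; any positive slope keeps the function monotonic.
    return np.where(x > 0, x, slope * x)

def max_pool_2x2(fmap):
    """2x2 max pooling with stride 2 on a 2D feature map (odd edges dropped)."""
    H, W = fmap.shape
    trimmed = fmap[:H // 2 * 2, :W // 2 * 2]
    return trimmed.reshape(H // 2, 2, W // 2, 2).max(axis=(1, 3))

rng = np.random.default_rng(1)
fmap = rng.standard_normal((6, 6))

for act in (relu, leaky_relu):
    activation_first = max_pool_2x2(act(fmap))   # 36 activation evaluations
    pooling_first = act(max_pool_2x2(fmap))      # 9 activation evaluations
    assert np.allclose(activation_first, pooling_first)
```

For a 2x2 window the number of activation evaluations drops by a factor of four; the equivalence would not hold for a non-monotonic activation function.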
V Dataflow Optimization
To effectively use the memory (both on-chip and off-chip) and the compute capabilities of our architecture, we propose a set of dataflows (Fig. 9) that can be employed depending on the configuration of the CONV and FC layers. To explain this, we first present the types of data reuse and their dependencies on different data.
V-A Data-Reuse Types and Their Dependencies

Input Activation-Reuse is the number of times an input activation is used by the same filter multiplied by the number of filters in a layer. To fully exploit this reuse, all the weights associated with the feature map index of the input activation, and the corresponding output activations, should be available for each available input sample.

Output Activation-Reuse is defined by the number of times partial sums are added into an output activation, which is determined by the size of the filters in a layer. To fully exploit this reuse, all the input activations and weight values corresponding to the output activation should be available.

Weight-Reuse is given as the number of times a weight value is used in the computation of a layer, which equals the size of the output feature map. To exploit this completely, all the input activations and the corresponding output feature map should be available on-chip.
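The three reuse definitions above can be turned into a small calculator. This is a sketch under assumptions: the symbols I, J, M, N, P, Q (input maps, output maps, output rows and columns, kernel rows and columns) and the function name `reuse_factors` are introduced here, stride 1 is assumed, and the input-activation value is the figure for interior activations (activations at the border of a feature map are used fewer times).

```python
def reuse_factors(I, J, M, N, P, Q):
    """Per-sample reuse factors of one layer (hypothetical helper).

    input_activation:  uses by one filter (P*Q at stride 1) times J filters
    output_activation: partial sums added into one output = I*P*Q
    weight:            one use per output location = M*N
    """
    return {
        "input_activation": P * Q * J,
        "output_activation": I * P * Q,
        "weight": M * N,
    }

# A CONV layer with 256 input maps, 384 filters, 13x13 outputs and 3x3 kernels
# (the shape commonly cited for AlexNet CONV3; an assumption of this sketch).
conv_example = reuse_factors(I=256, J=384, M=13, N=13, P=3, Q=3)

# An FC layer seen as a CONV layer with 1x1 maps and 1x1 kernels.
fc_example = reuse_factors(I=9216, J=4096, M=1, N=1, P=1, Q=1)

print(conv_example)
print(fc_example)
```

In the FC case the weight-reuse factor collapses to 1 while the input-activation reuse stays high, which is the asymmetry the dataflows in this section are built around.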
V-B Possible Scenarios and Corresponding Dataflows (Fig. 9)
Case 1: All input and output activations, and a set of weights (sized by the number of rows and columns of the systolic array) that has to be loaded into the processing array in the next stage, can be stored on-chip. Also, the output activations of one output feature map can be accommodated in the SPM of a single accumulation subunit. In this case, we can avoid moving input and output activations to DRAM, and fetch the weights only once. This is very effective for the later CONV layers, where the total size of the input and output activation maps is small and the number of filter parameters is huge.
Case 2: All input and output activations can be completely stored in the on-chip data buffer, but the output activations of one output feature map cannot be accommodated in the SPM of a single accumulation subunit. In this case, if the weight buffer can accommodate the complete filters being processed, we partition the input feature maps into multiple blocks to fit the output channels in the SPMs, as shown in Fig. 9.
Case 3: The input and output activations cannot be completely stored on-chip. In this case, we give preference to the input activations if they can be completely stored, and hence the dataflow of Case 1 can be used while moving the resultant outputs to off-chip memory.
Case 4: For all other cases, the best possible configuration for partitioning data is selected using the methodology proposed in [15] with the following constraints: (1) The number of filters being processed together should be a multiple of the number of columns in the systolic array. (2) The number of weights selected from each filter at one time should be a multiple of the number of rows in the systolic array.
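The four scenarios can be read as a decision procedure over buffer capacities. The sketch below is one possible reading, not the authors' controller logic: the function name `select_dataflow_case` and its parameters are hypothetical, the weight-buffer conditions of Cases 1 and 2 are deliberately left out, and Case 4 only reports that the partitioning methodology of [15] would be needed.

```python
def select_dataflow_case(ifmap_bytes, ofmap_bytes, one_ofmap_bytes,
                         data_buffer_bytes, spm_bytes):
    """Pick a dataflow case from layer and buffer sizes (illustrative sketch).

    ifmap_bytes / ofmap_bytes: total size of all input / output activations
    one_ofmap_bytes:           size of a single output feature map
    data_buffer_bytes:         capacity of the on-chip data buffer
    spm_bytes:                 capacity of one accumulation subunit's SPM
    Assumption: weight-buffer checks are omitted for brevity.
    """
    activations_fit = ifmap_bytes + ofmap_bytes <= data_buffer_bytes
    one_map_fits_spm = one_ofmap_bytes <= spm_bytes
    if activations_fit and one_map_fits_spm:
        return 1   # keep activations on-chip, fetch weights once
    if activations_fit:
        return 2   # partition input maps so outputs fit the SPMs
    if ifmap_bytes <= data_buffer_bytes:
        return 3   # keep inputs on-chip, stream outputs off-chip
    return 4       # fall back to the partitioning methodology of [15]

# Example with made-up sizes: 100 B inputs, 100 B outputs, 10 B per output map,
# a 1000 B data buffer and a 16 B SPM.
print(select_dataflow_case(100, 100, 10, 1000, 16))
```

Ordering the checks from the most to the least favorable case means a layer always gets the dataflow with the fewest DRAM transfers that its sizes allow.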
V-C Observations and Hardware Configurations
We analyzed the configuration of AlexNet [10] and defined our hardware configuration (shown in Table II) based on the following observations.

The output feature maps of CONV3 to CONV5 (i.e., the last three CONV layers) should fit in the SPM of the accumulation unit and of the pooling and activation unit. Since the size of an output feature map in these layers is 13x13, we selected an SPM which can hold up to 256 elements.

For holding the input and output activations of the targeted CONV layers of AlexNet on-chip, we selected a 256 KB data buffer for the two systolic arrays, i.e., more than four times the size of the activation maps of the reference CONV layer.

Systolic arrays of size 8x8, which provide significant parallelism while not requiring much off-chip memory bandwidth.
Table II: Hardware configuration of MPNA.
Module  Description
Systolic Arrays  Size of SA-CONV = 8x8 PEs; size of SA-FC = 8x8 PEs
SPM  Size of SPM in each subunit of the Accumulation block and the Pooling & Activation block = 256B
Weight Buffer  Size of weight buffer = 36KB
Data Buffer  Size of data buffer = 256KB
DRAM  Size of DRAM = 2Gb; bandwidth of DRAM = 12.8GB/s [16]
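The configuration of Table II can be captured in a small record to sanity-check derived quantities. This is a sketch under assumptions: the class name `MPNAConfig` and its fields are hypothetical, one byte per element is assumed from the 8-bit precision listed in Table III, and the 13x13 and 27x27 map sizes are the commonly cited AlexNet output sizes rather than values stated in the table.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class MPNAConfig:
    """Values taken from Table II; names are illustrative."""
    sa_conv_rows: int = 8
    sa_conv_cols: int = 8
    sa_fc_rows: int = 8
    sa_fc_cols: int = 8
    spm_bytes: int = 256
    weight_buffer_kb: int = 36
    data_buffer_kb: int = 256
    dram_bandwidth_gb_s: float = 12.8
    bytes_per_element: int = 1  # assumption: 8-bit fixed-point data

    @property
    def total_pes(self) -> int:
        return (self.sa_conv_rows * self.sa_conv_cols
                + self.sa_fc_rows * self.sa_fc_cols)

    @property
    def spm_elements(self) -> int:
        return self.spm_bytes // self.bytes_per_element

    def fits_in_spm(self, rows: int, cols: int) -> bool:
        """Does one rows x cols output feature map fit in a single SPM?"""
        return rows * cols <= self.spm_elements

cfg = MPNAConfig()
print(cfg.total_pes)            # two 8x8 arrays
print(cfg.fits_in_spm(13, 13))  # a 13x13 map needs 169 elements
print(cfg.fits_in_spm(27, 27))  # a 27x27 map needs 729 elements
```

Under these assumptions, maps of the later AlexNet CONV layers fit in one SPM, whereas larger maps of the earlier layers would need to be partitioned.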
VI Evaluation Methodology (Fig. 10)
We developed a fully functional simulator to model the behavior of the MPNA. It is integrated with CACTI 7.0 [12] for memory models and the respective area, power, and energy estimation. The complete MPNA architecture is also designed in RTL and synthesized for a 28 nm technology using Synopsys design tools. We used ModelSim for logic simulation for functional and timing validation, and obtained the critical path delay, area, and power afterwards. We compared our SA-FC architecture with SA-CONV, and our MPNA with conventional systolic array-based accelerators (as baselines) of several array sizes, using AlexNet. We also compared our MPNA with several state-of-the-art accelerators such as Eyeriss [13], SCNN [3], and FlexFlow [14].
VII Experimental Results and Discussion
Systolic Array: We compared our SA-FC and SA-CONV to profile the proposed SA-FC in terms of area and power. Fig. 11 shows that the SA-FC incurs insignificant area and power overheads compared to SA-CONV. Fig. 12a shows that SA-FC achieves an 8.1x speedup for the FC layers compared to using only the SA-CONV, due to its micro-architectural enhancements that provide the data to the PEs in time to generate results every clock cycle.
Key Observations for Performance Evaluation (Figs. 12b-d):

Our MPNA achieves a 1.4x to 7.2x speedup for AlexNet compared to the conventional systolic array-based architectures. These improvements come from the parallelism of the heterogeneous computing arrays and the efficient dataflows.

Further comparison of MPNA with the state-of-the-art accelerators is summarized in Table III, showing competitive characteristics of our MPNA for full CNN acceleration.
Key Observations for Power/Energy Evaluation (Figs. 12e,g):

MPNA consumes 239 mW average power (Table III), which is dominated by the pooling and activation unit due to its local memories and activation function processing.

MPNA achieves an overall energy reduction of 51% compared to the baseline architecture (Fig. 12e) due to reduced memory accesses and maximal data reuse as a result of the optimized dataflows.
Key Observations for Area Evaluation (Fig. 12f): The area of MPNA (2.34 mm², Table III) is occupied by the computational parts and the on-chip memories, which comprise the data and weight buffers. Table III shows that our MPNA consumes a competitively small area compared to other state-of-the-art accelerators.
Table III: Comparison of MPNA with state-of-the-art accelerators.
Reference  Eyeriss [8]  SCNN [3]  FlexFlow [14]  MPNA (this work)
Technology (nm)  65  16  65  28
Precision (fixed-point)  16-bit  16-bit  16-bit  8-bit
# PEs  168  64  256  128
On-chip Memory (KB)  181.5  1024  64  288
Area (mm²)  12.25  7.9  3.89  2.34
Power (mW)  278  NA  1000  239
Frequency (MHz)  100-250  1000  1000  280
Performance (GOPS)  23.1  NA  420  35.8
Efficiency (GOPS/W)  83.1  NA  300-500  149.7
Acceleration Target  CONV  CONV  CONV  CONV+FC
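The efficiency row of Table III follows from the performance and power rows, which gives a quick consistency check. The helper name `gops_per_watt` is an assumption of this sketch; the input values are the table entries, and the comparison allows for rounding in the reported figures.

```python
def gops_per_watt(performance_gops, power_mw):
    """Efficiency = performance / power, with power converted from mW to W."""
    return performance_gops / (power_mw / 1000.0)

# (performance in GOPS, power in mW, reported efficiency in GOPS/W) from Table III
reported = {
    "Eyeriss": (23.1, 278, 83.1),
    "MPNA": (35.8, 239, 149.7),
}

for name, (gops, mw, efficiency) in reported.items():
    derived = gops_per_watt(gops, mw)
    print(f"{name}: derived {derived:.1f} GOPS/W, reported {efficiency} GOPS/W")
```

Note that the designs in the table use different technology nodes and precisions, so such ratios are indicative rather than a like-for-like comparison.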
VIII Conclusion
In this work, we demonstrate that a significant speedup for both CONV and FC layers can be achieved by a synergistic design methodology encompassing dataflow optimization, diverse types of data reuse, and the MPNA architecture with heterogeneous systolic arrays and specialized buffers. The complete architecture is synthesized in a 28 nm technology with an ASIC design flow, and a comprehensive evaluation is done for area, performance, power, and energy, showing significant gains of our approach over various state-of-the-art accelerators. Our novel concepts and open-source hardware would enable further research on accelerating emerging DNNs (like Capsule neural networks).
References
[1] V. Sze, Y. H. Chen, T. J. Yang, and J. S. Emer, “Efficient Processing of Deep Neural Networks: A Tutorial and Survey,” in Proceedings of the IEEE, vol. 105, no. 12, pp. 2295–2329, Dec. 2017.
[2] S. Han et al., “EIE: Efficient Inference Engine on Compressed Deep Neural Network,” 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA), Seoul, 2016, pp. 243–254.
[3] A. Parashar et al., “SCNN: An accelerator for compressed-sparse convolutional neural networks,” 2017 ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA), Toronto, ON, 2017, pp. 27–40.
[4] J. Albericio, P. Judd, T. Hetherington, T. Aamodt, N. E. Jerger, and A. Moshovos, “Cnvlutin: Ineffectual-Neuron-Free Deep Neural Network Computing,” 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA), Seoul, 2016, pp. 1–13.
[5] G. Klambauer, T. Unterthiner, A. Mayr, and S. Hochreiter, “Self-normalizing neural networks,” in Advances in Neural Information Processing Systems 30, pp. 971–980, 2017.
[6] N. P. Jouppi et al., “In-datacenter performance analysis of a tensor processing unit,” 2017 ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA), Toronto, ON, 2017, pp. 1–12.
[7] T. Luo et al., “DaDianNao: A Neural Network Supercomputer,” in IEEE Transactions on Computers, vol. 66, no. 1, pp. 73–88, 1 Jan. 2017.
[8] Y. Chen, T. Krishna, J. S. Emer, and V. Sze, “Eyeriss: An Energy-Efficient Reconfigurable Accelerator for Deep Convolutional Neural Networks,” in IEEE Journal of Solid-State Circuits, vol. 52, no. 1, pp. 127–138, Jan. 2017.
[9] S. Anwar, K. Hwang, and W. Sung, “Structured pruning of deep convolutional neural networks,” CoRR, vol. abs/1512.08571, 2015.
[10] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Proceedings of the 25th International Conference on Neural Information Processing Systems, Volume 1, NIPS’12, USA, pp. 1097–1105, 2012.
[11] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, “You Only Look Once: Unified, Real-Time Object Detection,” 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, 2016, pp. 779–788.
[12] N. Muralimanohar, A. Shafiee, and V. Srinivas, “Cacti 7.0.” https://github.com/HewlettPackard/cacti, 2018.
[13] Y. Chen, T. Krishna, J. Emer, and V. Sze, “14.5 Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks,” in 2016 IEEE International Solid-State Circuits Conference (ISSCC), pp. 262–263, Jan. 2016.
[14] W. Lu, G. Yan, J. Li, S. Gong, Y. Han, and X. Li, “FlexFlow: A Flexible Dataflow Accelerator Architecture for Convolutional Neural Networks,” 2017 IEEE International Symposium on High Performance Computer Architecture (HPCA), Austin, TX, 2017, pp. 553–564.
[15] J. Li, G. Yan, W. Lu, S. Jiang, S. Gong, J. Wu, and X. Li, “SmartShuttle: Optimizing off-chip memory accesses for deep learning accelerators,” DATE, 2018.
[16] K. T. Malladi, F. A. Nothaft, K. Periyathambi, B. C. Lee, C. Kozyrakis, and M. Horowitz, “Towards energy-proportional datacenter memory with mobile DRAM,” 2012 39th Annual International Symposium on Computer Architecture (ISCA), Portland, OR, 2012, pp. 37–48.