1 Introduction
Deep neural networks (DNNs) have revolutionized applications ranging from computer vision to speech and natural language processing
lecun:2015:dl, and are now widely deployed in data centers Jouppi:2017; kim:2018:aml_facebook; jongsoo:2018:datacetner and edge devices Du:2017; zhang:2017:helloedge; carole:2019:edge. As modern DNNs usually require significant computation, neural network accelerators have been extensively studied in recent years to enable energy-efficient processing chen:2014:diannao; chen:2016:eyeriss; Sze:2017; Jouppi:2017; hardik:2018:bitfusion. The datapath of a neural network accelerator, comprising the arithmetic compute units and the data bus among them, lies at the heart of the design and plays an important role in the accelerator's energy consumption. With the trend toward aggressive operand quantization (8 bits and below) and near/in-memory computation, the energy consumption of memory accesses in neural network accelerators is greatly reduced. In many state-of-the-art accelerator designs Andri:2016; Gao:2017; Park:2018, the datapath can consume 40–70% of the total energy.
Conventionally, the datapath energy consumption in a neural network accelerator can be estimated as
$E_{datapath} = N_{op} \cdot E_{op} + \varepsilon$, where $N_{op}$ denotes the total number of operations of the neural network, $E_{op}$ is the datapath energy consumption of one operation, and $\varepsilon$ is a correction term that depends on the network parameters and the underlying hardware design. Previous research mainly focuses on reducing $N_{op}$, e.g., by optimizing the network topology forrest:2016:squeezenet; howard:2017; tan:2019:mnasnet or pruning the network han:2015:deepcompression; he:2017, and on reducing $E_{op}$, e.g., by network quantization moons:2016; Park:2018; hardik:2018:bitfusion or binarization
courbariaux:2016:binarized. In contrast, reducing $\varepsilon$ has received less attention. Existing works mainly exploit the sparsity of the network parameters and activations to gate the compute units and skip unnecessary computations chen:2016:eyeriss. In this work, we explore a new dimension for reducing $\varepsilon$ and the datapath energy. We show that, as most accelerators leverage spatial data reuse chen:2016:eyeriss and stream input operands into the compute array, the sequence of the input operands significantly impacts the datapath energy. Specifically, we find that the datapath energy is strongly correlated with the bit flips incurred when streaming the input operands. In this paper, we leverage the concept of Hamming distance to formalize the bit flip analysis. A series of post-training and training-aware techniques are proposed to co-design and co-optimize the accelerator and the network to reduce the Hamming distance of the input operand sequence. Experimental results based on post-layout simulation demonstrate a 3.6× datapath energy reduction on average, with even larger reductions for certain layers. The proposed techniques are compatible with other optimization knobs, e.g., pruning and quantization. The contributions of the paper can be summarized as follows:

We discover the correlation between the datapath energy and the Hamming distance of the streamed input operands, and propose Hamming distance optimization as a new direction for datapath energy optimization;

We propose a post-training optimization algorithm to reduce the Hamming distance of the neural network model, which introduces negligible hardware overhead and has no impact on the model output;

We propose a Hamming-distance-aware training algorithm, which reduces the Hamming distance of the neural network model with negligible effect on accuracy;

Experiments based on post-layout simulation demonstrate promising results (up to 8.51× datapath energy reduction) when combining the Hamming-distance-aware training and the post-processing algorithm.
2 Background: Spatial Accelerators
Modern NN accelerators usually comprise the following major components: a two-dimensional arithmetic compute array, a network-on-chip (NoC), control blocks, and an on-chip memory Sze:2017. Specifically, the on-chip memory usually consists of several levels of hierarchy, including a global buffer, an inter-unit network to facilitate data passing among the arithmetic units, and register files (RFs) within each arithmetic unit chen:2016:eyeriss. The energy of accessing different levels of the memory hierarchy can vary significantly.
To reduce accesses to the more expensive memory hierarchies, specialized processing dataflows are designed to enable data reuse across different computation units. Representative dataflows include input stationary, output stationary, and row stationary chen:2016:eyeriss; Sze:2017. The dataflow architecture dictates what data are read into the memory hierarchy and how data propagate in the compute array. Figure 3 shows two widely used designs chen:2018:eyerissv2. The design in Figure 3(a) leverages the input stationary dataflow and relies on unrolling both the input channel dimension ($C$) and the input spatial locations to map the operations spatially onto the array and exploit computation parallelism. The weights are streamed into the array and can be reused horizontally by input pixels from different spatial locations, while the partial sums are accumulated spatially across the column. Instead of saving the partial sums directly to the activation SRAM, they are usually first stored in an accumulation buffer to reduce the memory access energy. Once the partial sums are fully reduced, they may go through the nonlinear units and be stored back to the global SRAM. Similarly, the design in Figure 3(b) leverages the output stationary dataflow and relies on unrolling the output channel dimension ($M$) and the output spatial dimensions to enable data reuse. In this scheme, the weights are still streamed along the row direction, and the input activations are streamed in the orthogonal direction to be reused across different output channels.
Popular neural network layers, such as the convolution layer and the fully-connected layer, can be easily mapped to the accelerator. Consider the example of a 1-by-1 convolution in Figure 6. To map the computation onto the input stationary compute array in Figure 3(a), the input activations are prefilled with different input spatial locations unrolled horizontally and different input channels unrolled vertically. The weights are streamed sequentially into the arithmetic array: weights from different input channels are fed spatially into different rows, and weights from different output channels are streamed temporally into the same row.
The energy consumption of the accelerator is composed of the datapath energy (including the arithmetic computation energy and the data propagation energy among compute units), the memory access energy, and the control energy. When all the operands fit into the local SRAM Park:2018, the datapath and memory access energy can be approximated as
$$E = N_{op} \cdot E_{dp} + N_{op} \cdot \left(\frac{1}{r_W} + \frac{1}{r_I} + \frac{1}{r_P}\right) \cdot E_{SRAM},$$
where $r_W$, $r_I$, and $r_P$ denote the reuse factors of the weights, input activations, and partial sums, respectively, $E_{dp}$ denotes the per-operation datapath energy, and $E_{SRAM}$ denotes the SRAM access energy, which includes the SRAM read/write energy and the energy of moving data from the SRAM to the compute array.
Assume the ratio between the compute energy, the inter-unit propagation energy, and the SRAM access energy is 1:2:6 chen:2016:eyeriss. For a reasonable design point with moderate reuse factors, the ratio between the datapath energy and the SRAM energy becomes 3:4 (assuming weights and inputs are 8-bit and partial sums are 32-bit). The datapath thus consumes a significant portion of the total energy, making it crucial to reduce the datapath energy.
The datapath energy can be further divided into three parts: switching energy, glitch energy, and leakage Rabaey:2008. Both the switching energy and the glitch energy are caused by circuit nodes switching from 0 to 1 or from 1 to 0, denoted as bit flips. Leakage energy is caused by the small leakage current when the transistors are turned off; its contribution to the datapath energy is usually orders of magnitude smaller than that of glitching and switching. Hence, we ignore the leakage energy in this paper.
3 Motivation: Bit Flips Impact Datapath Energy
As described in Section 2, the datapath energy accounts for a significant portion of the total energy, and the bit flips inside the datapath are the main culprit. The datapath bit flips are determined by the values and the streaming pattern of the input operands, i.e., the weights, input activations, and partial sums. Because the activations and partial sums are input dependent, we focus on analyzing the impact of the weight matrices.
Consider the example of a 2-bit weight matrix $W$. Without loss of generality, we assume an input-stationary compute array as shown in Figure 3(a), with $W$ streamed into the array following the mapping in Figure 6. The weight sequence fed into the first row of the compute array then incurs 6 bit flips at the compute array input. To confirm the relation between the bit flips of the weight sequence and the datapath energy, we use the weight matrices of MobileNetV2 sandler:2018:mobilenetv2 trained on the Cifar100 dataset as an example and generate random input activations. We evaluate the bit flips of the weight sequence and the datapath energy consumption with post-layout simulation (see Section 6 for the detailed experimental setup). As shown in Figure 7, the total bit flips of the weight sequence and the energy consumption exhibit a strong linear relation. Moreover, for a fixed total number of bit flips, the energy is independent of both the length of the weight sequence and the bit flipping probability.
Hence, an effective approach to minimizing the datapath energy is to reduce the bit flips of the weight sequence. We observe that the bit flips can be reduced if the weight streaming sequence is carefully reordered. Consider $W$ in the example above: if we swap the second row and the third row of the matrix, we obtain a new matrix $W'$. By streaming $W'$ into the compute array instead, the bit flips are reduced. Note that swapping the rows of the weight matrix simply adjusts the order in which the output channels are generated and has no influence on the neural network's functionality. As the swapping can be performed via post-processing at the model level, no specific hardware support is needed.
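The row-swap effect can be reproduced with a short script. The sketch below uses an illustrative 2-bit matrix (the paper's original example matrix is not reproduced here), counts the bit flips of streaming each column along the output-channel direction, and shows that swapping two rows changes only the streaming cost, not the set of outputs the layer produces:

```python
def stream_bit_flips(col):
    """Total bit flips when the values in `col` are streamed one after another."""
    return sum(bin(a ^ b).count("1") for a, b in zip(col, col[1:]))

def matrix_bit_flips(W):
    """Sum of streaming bit flips over all columns (input channels).
    W is a list of rows (output channels); each column is streamed temporally."""
    cols = list(zip(*W))
    return sum(stream_bit_flips(c) for c in cols)

# Illustrative 2-bit weight matrix: rows = output channels, cols = input channels.
W = [[0b00, 0b11],
     [0b11, 0b00],
     [0b01, 0b10]]
print(matrix_bit_flips(W))          # 6

# Swapping two output channels (rows) changes only the streaming order.
W_swapped = [W[0], W[2], W[1]]
print(matrix_bit_flips(W_swapped))  # 4
```

For this illustrative matrix, the swap reduces the total bit flips from 6 to 4 with no change to the computation itself.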
Besides post-training processing of the weight matrices, another orthogonal approach is to incorporate the bit flips of the weight sequence into the training procedure and reduce the bit flips without sacrificing the model accuracy. Consider two weight matrices $W_1$ and $W_2$ that produce identical outputs: while the computation results match, the weight sequence of $W_2$ incurs fewer bit flips. Hence, without impacting the computation results, the bit flips can be reduced, and by further reordering the output channels of $W_2$, the bit flips can be reduced even further.
In the rest of the paper, we formalize our analysis of the bit flips of the weight sequence and describe our post-training and training-aware techniques for reducing them.
4 Methodology: Hamming Distance Optimization
In this section, we formalize the concept of bit flips and propose both post-training and training-aware techniques to minimize the bit flips of the streaming weights. For convenience, we use the input stationary dataflow (Figure 3(a)) as an example throughout the analysis, but the definitions, analysis, and conclusions apply to other dataflow schemes as long as the weights are streamed into the compute array. The notations used in this paper are summarized in Table 1.
$N$, $E$, $F$, $M$  Output batch, height, width, channel
$C$, $R$, $S$  Input channel, filter height, width
$W$  Model weight matrix
$B$  Bit width of model weights
$\pi$  Sequence of output channels
$\mathcal{G}$  Cluster of input channels
4.1 Problem Formulation
In coding theory, the bit difference between two binary strings is formally defined as the Hamming distance Hamming:1950. Accordingly, we define the Hamming distance between two $B$-bit numbers $a$ and $b$ as
$$\mathrm{HD}(a, b) = \sum_{i=0}^{B-1} a[i] \oplus b[i],$$
where $\oplus$ denotes the XOR operation and $x[i]$ denotes the function that extracts the $i$-th bit of the number $x$.
Consider a weight matrix $W$ with $M$ output channels and $C$ input channels.^1 As the input stationary dataflow unrolls the input channel dimension ($C$) along the compute array column direction and streams the weights of different output channels ($M$) into the array in temporal sequence, we define the Hamming distance of streaming $W$ as
$$\mathrm{HD}(W) = \sum_{c=1}^{C} \sum_{m=1}^{M-1} \mathrm{HD}(W_{m,c},\, W_{m+1,c}).$$
We also define the normalized Hamming distance (NHD) of streaming $W$ as
$$\mathrm{NHD}(W) = \frac{\mathrm{HD}(W)}{B \cdot C \cdot (M-1)}.$$
Hence, $\mathrm{HD}(W)$ captures the total bit flips of streaming $W$, and $\mathrm{NHD}(W)$ represents the bit flip probability. We show $\mathrm{NHD}(W)$ for different layers of MobileNetV2 and ResNet26 trained on Cifar100 in Figure 8; as we can see, $\mathrm{NHD}(W)$ is close to 0.5 for all the layers. In the following sections, we propose techniques to minimize $\mathrm{HD}(W)$ and thereby reduce the bit flips and the datapath energy.
^1 We assume $R = S = 1$ for the weight matrix in this case, but the definition and analysis can be easily extended to cases where $R$ and $S$ are larger than 1.
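The definitions above translate directly into code. The following sketch computes $\mathrm{HD}(W)$ and $\mathrm{NHD}(W)$ for a small integer weight matrix (rows are output channels, columns are input channels; the matrix and function names are illustrative):

```python
def hd(a, b):
    """Hamming distance between two integers: population count of XOR."""
    return bin(a ^ b).count("1")

def hd_of_streaming(W):
    """HD(W): total bit flips when every column (input channel) of W is
    streamed along the output-channel (row) direction."""
    M, C = len(W), len(W[0])
    return sum(hd(W[m][c], W[m + 1][c]) for c in range(C) for m in range(M - 1))

def nhd_of_streaming(W, B):
    """NHD(W): HD(W) normalized by the number of streamed bit positions,
    B * C * (M - 1), i.e. the per-bit flip probability."""
    M, C = len(W), len(W[0])
    return hd_of_streaming(W) / (B * C * (M - 1))

# Illustrative 2-bit matrix: 3 output channels (rows), 2 input channels (cols).
W = [[0b00, 0b11],
     [0b11, 0b00],
     [0b01, 0b10]]
print(hd_of_streaming(W))      # 6
print(nhd_of_streaming(W, 2))  # 0.75
```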
4.2 Output Channel Reordering
Inspired by the example in Section 3, a straightforward technique to minimize $\mathrm{HD}(W)$ is to reorder the sequence in which the output channels of $W$ are streamed into the compute array. Let $\pi$ denote the sequence of output channels streamed into the array and $\mathrm{HD}_{\pi}(W)$ denote the Hamming distance of streaming $W$ following $\pi$; then we have
$$\mathrm{HD}_{\pi}(W) = \sum_{c=1}^{C} \sum_{m=1}^{M-1} \mathrm{HD}(W_{\pi(m),c},\, W_{\pi(m+1),c}).$$
The output channel reordering problem is defined as follows.
Problem 1
(Output Channel Reordering) Given a weight matrix $W$, find $\pi^{*}$ such that $\mathrm{HD}_{\pi}(W)$ is minimized, i.e., $\pi^{*} = \arg\min_{\pi} \mathrm{HD}_{\pi}(W)$.
As $\pi$ is a reordering of the output channels that visits each output channel exactly once, we can map the reordering problem to a Traveling Salesman Problem (TSP) Miller:1960. Specifically, each output channel corresponds to one location to visit, and the Hamming distance between two output channels corresponds to the distance between two locations. Minimizing $\mathrm{HD}_{\pi}(W)$ is then equivalent to searching for the shortest path that visits all the locations. The complexity of solving the output channel reordering problem optimally therefore scales exponentially, which quickly becomes intractable even for moderately sized problems.
To efficiently solve the reordering problem, we propose the greedy search algorithm described in Algorithm 1. The algorithm first initializes the sequence $\pi$ by assigning the first output channel to the starting position. After that, the output channel with the smallest Hamming distance to the previously added channel is repeatedly appended to $\pi$. The complexity of the algorithm scales quadratically with the number of output channels, which is very efficient in practice.
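A minimal sketch of the greedy search, following the description of Algorithm 1; details such as tie-breaking (smallest channel index first) are our own assumptions:

```python
def greedy_reorder(W):
    """Greedy output channel reordering (sketch of Algorithm 1): start from
    output channel 0, then repeatedly append the remaining channel with the
    smallest Hamming distance to the previously chosen one. Runs in O(M^2)
    distance evaluations for M output channels."""
    def row_hd(r1, r2):
        # Hamming distance between two output channels (rows of W).
        return sum(bin(a ^ b).count("1") for a, b in zip(r1, r2))

    order, remaining = [0], set(range(1, len(W)))
    while remaining:
        prev = W[order[-1]]
        # Ties broken deterministically by picking the smallest channel index.
        nxt = min(sorted(remaining), key=lambda m: row_hd(prev, W[m]))
        order.append(nxt)
        remaining.remove(nxt)
    return order

# Single-input-channel example: streaming 0,3,1,2 costs 5 bit flips,
# while the greedy order [0, 2, 1, 3] streams 0,1,3,2 for only 3.
print(greedy_reorder([[0], [3], [1], [2]]))  # [0, 2, 1, 3]
```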
4.3 Input Channel Segmentation and Clustering
While output channel reordering helps reduce the Hamming distance of streaming $W$, its effectiveness is limited by the number of input channels $C$. We use MobileNetV2 on the Cifar100 dataset krizhevsky:2009:cifar as an example and evaluate the Hamming distance reduction for different layers. As shown in Table 2, as $C$ increases, the Hamming distance reduction diminishes significantly.
Layer  $C$  $M$  HD Reduction
layer 7  192  32  1.53×
layer 15  384  64  1.33×
layer 21  576  96  1.27×
layer 27  960  160  1.18×
layer 33  1280  320  1.21×
One straightforward method to improve the effectiveness of output channel reordering is to segment the weight matrix along the input channel direction into several small submatrices. For each submatrix, we can then use Algorithm 1 to search for the optimal output channel order to reduce the Hamming distance. We denote this method as the segment-then-reorder approach. Note that as the output channel sequence changes per segment, specific hardware support in the accumulator is required to ensure that the partial sums corresponding to the same output channel are correctly accumulated. We detail this hardware support in Section 5; it introduces negligible overhead to the accumulator. With the segment-then-reorder approach, the Hamming distance can be further reduced by 1.5–2.5× compared with direct output channel reordering (see Section 6).
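The segment-then-reorder approach can be sketched as follows; the per-segment greedy search mirrors Algorithm 1 as described above, while the helper names and segmentation policy (adjacent columns) are our own assumptions:

```python
def segment_then_reorder(W, seg_size):
    """Sketch of segment-then-reorder: split W along the input-channel
    dimension into groups of `seg_size` adjacent columns and reorder the
    output channels of each segment independently with the greedy search.
    Each segment gets its own order, so the accumulator needs an address
    LUT (Section 5) to reduce partial sums into the right output channel."""
    def row_hd(r1, r2):
        return sum(bin(a ^ b).count("1") for a, b in zip(r1, r2))

    def greedy(rows):
        order, remaining = [0], set(range(1, len(rows)))
        while remaining:
            prev = rows[order[-1]]
            nxt = min(sorted(remaining), key=lambda m: row_hd(prev, rows[m]))
            order.append(nxt)
            remaining.remove(nxt)
        return order

    C = len(W[0])
    return [greedy([row[s:s + seg_size] for row in W])
            for s in range(0, C, seg_size)]

# One order per segment: with seg_size=1, each input channel is optimized alone.
print(segment_then_reorder([[0, 3], [3, 0], [1, 2]], 1))  # [[0, 2, 1], [0, 2, 1]]
```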
As expected, the smaller each input channel group is, the better the achievable Hamming distance reduction. Hence, the segment-then-reorder algorithm favors compute arrays with a skewed aspect ratio, i.e., more columns and fewer rows. However, the aspect ratio of the compute array also impacts the reuse of different operands
chen:2016:eyeriss and the array utilization. For example, with more columns in the input stationary array, more pixels in the spatial plane are required to fill the whole array, which leads to under-utilization for small input activation sizes. While this is not a problem for small-scale arrays with few compute units, it may cause utilization issues for large arrays. To further improve the effectiveness when the number of input channels per segment is large, we propose to cluster the input channels before segmenting the weight matrix. The output channels are then reordered for each cluster separately. We denote this approach as cluster-then-reorder.
Consider the example of a weight matrix $W$ streamed into a compute array with 4 rows, which only allows 4 input channels to be streamed simultaneously. Instead of directly segmenting $W$, we can first cluster the input channels into 2 groups and then segment $W$ according to the clusters. Compared to the segment-then-reorder approach, the Hamming distance in this example is reduced from 22 to 16. Note that the clustering of the input channels does not impact the output.
Let $\mathcal{G}_1, \ldots, \mathcal{G}_k$ denote the clusters of input channels. The input channel clustering problem is then defined as follows.
Problem 2
(Input Channel Clustering) Given a weight matrix $W$, find clusters $\mathcal{G}_1, \ldots, \mathcal{G}_k$ such that the total Hamming distance of streaming each submatrix is minimized, i.e., $\min_{\mathcal{G}_1, \ldots, \mathcal{G}_k} \sum_{j=1}^{k} \min_{\pi_j} \mathrm{HD}_{\pi_j}(W_{:, \mathcal{G}_j})$, where $W_{:, \mathcal{G}_j}$ denotes the submatrix formed by the input channels in $\mathcal{G}_j$.
This is a nested optimization problem that is computationally expensive to solve optimally, even though the inner optimization loop can be solved with the proposed Algorithm 1. Hence, we propose a greedy iterative method. As shown in Algorithm 2, in the initialization step, $k$ input channels are randomly selected and the cluster sequences are initialized to minimize the Hamming distance of each selected channel. The algorithm then alternates between an assignment step and an update step for a fixed number of iterations. In the assignment step, for each input channel, we evaluate its Hamming distance under the optimal sequence of each cluster and add the channel to the cluster with the smallest Hamming distance. In the update step, we recompute the optimal sequence for each cluster of input channels.
The convergence of the proposed clustering algorithm is guaranteed if the inner loop optimization, i.e., the output channel reordering problem, is solved optimally. This is because the objective function of the clustering problem is bounded and is guaranteed to decrease in the assignment and update steps of each iteration. In practice, we use the greedy algorithm of Section 4.2 for the update step. We find that the cluster-then-reorder algorithm converges very well and consistently outperforms the segment-then-reorder algorithm.
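A compact sketch of the cluster-then-reorder iteration (Algorithm 2) under the description above; the initialization with randomly selected seed channels follows the text, while all function names and the fixed iteration count are illustrative assumptions:

```python
import random

def _hd(a, b):
    return bin(a ^ b).count("1")

def greedy_order(cols, M):
    """Greedy output-channel order (Algorithm 1) for the submatrix whose
    columns are `cols` (each a length-M list of integer weights)."""
    order, remaining = [0], set(range(1, M))
    while remaining:
        prev = order[-1]
        nxt = min(sorted(remaining),
                  key=lambda m: sum(_hd(c[prev], c[m]) for c in cols))
        order.append(nxt)
        remaining.remove(nxt)
    return order

def stream_cost(col, order):
    """Bit flips of one input channel when streamed following `order`."""
    return sum(_hd(col[a], col[b]) for a, b in zip(order, order[1:]))

def cluster_then_reorder(W, k, iters=15, seed=0):
    """Sketch of Algorithm 2: alternate (1) assigning each input channel to
    the cluster whose current output-channel order gives it the fewest bit
    flips, and (2) recomputing each cluster's greedy order."""
    M, C = len(W), len(W[0])
    cols = [[W[m][c] for m in range(M)] for c in range(C)]
    rng = random.Random(seed)
    # Initialize each cluster's order from one randomly selected seed channel.
    orders = [greedy_order([cols[s]], M) for s in rng.sample(range(C), k)]
    assign = [0] * C
    for _ in range(iters):
        for c in range(C):  # assignment step
            assign[c] = min(range(k), key=lambda j: stream_cost(cols[c], orders[j]))
        for j in range(k):  # update step
            members = [cols[c] for c in range(C) if assign[c] == j]
            if members:
                orders[j] = greedy_order(members, M)
    return assign, orders
```

Identical input channels always land in the same cluster, and the total streaming cost never exceeds that of a single shared order in practice on small examples.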
We use the layers of MobileNetV2 sandler:2018:mobilenetv2 on the Cifar100 dataset as an example and run the clustering algorithm 20 times with random initialization. The convergence plot is shown in Figure LABEL:sub@fig:convergence. The normalized Hamming distance is computed as the Hamming distance of each algorithm normalized by the Hamming distance without output channel reordering. As we can see, the clustering algorithm converges within 15 iterations, and its run-to-run variation is very small. We also compare the cluster-then-reorder algorithm with the segment-then-reorder algorithm for different layers and different numbers of channels per cluster. As shown in Figures LABEL:sub@fig:effective and LABEL:sub@fig:cluster_num, the cluster-then-reorder algorithm outperforms the baseline algorithm by up to 1.21×.
The proposed algorithm is very efficient: the complexity of the update step scales quadratically with the number of output channels $M$ (as in Algorithm 1), and the complexity of the assignment step scales with the number of input channels $C$, the number of output channels $M$, and the number of clusters $k$.
4.4 Hamming DistanceAware Training
While the techniques proposed above focus on post-training optimization, we also propose a Hamming-distance-aware training procedure to further reduce the Hamming distance of streaming $W$. The basic idea is to incorporate a Hamming distance term into the loss function and explicitly encourage its reduction:
$$\mathcal{L} = \mathcal{L}_{CE} + \lambda \cdot \mathrm{HD}(W),$$
where $\mathcal{L}_{CE}$ represents the original cross-entropy loss and the coefficient $\lambda$ explicitly controls the trade-off between the accuracy and the Hamming distance reduction.
However, there are two main problems with this loss. First, computing $\mathrm{HD}(W)$ requires extracting individual bits of the weights. For an integer $w$, the $i$-th bit is
$$w[i] = \left\lfloor \frac{w}{2^i} \right\rfloor - 2 \left\lfloor \frac{w}{2^{i+1}} \right\rfloor.$$
Because the floor function is not differentiable, $\mathrm{HD}(W)$ is not differentiable either.
Previously, the straight-through estimator (STE) has been proposed to approximate the gradients of such non-differentiable functions bengio:2013. However, directly applying the STE (i.e., treating each floor as the identity) leads to
$$\frac{\partial w[i]}{\partial w} \approx \frac{1}{2^i} - \frac{2}{2^{i+1}} = 0 \ \text{for } i < B-1, \qquad \frac{\partial w[B-1]}{\partial w} \approx \frac{1}{2^{B-1}}.$$
This indicates that only the most significant bit of the weight parameters receives a nonzero gradient and can be regularized. Hence, we propose an iterative freeze-and-regularize procedure. During training, we first add the regularization to the most significant bit; after several epochs, we freeze the most significant bit and regularize the second most significant bit. The iterative process continues until all the bits of the weights are fixed.
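The bit extraction and the per-bit-plane regularizer can be sketched as follows. This illustrates only the forward computation of the freeze-and-regularize schedule (the STE-based backward pass and the actual training loop are omitted), and all names and the example matrix are our own:

```python
import numpy as np

def bit_plane(w, i):
    """b_i(w) = floor(w / 2^i) mod 2 for non-negative integer weights.
    The extraction is non-differentiable; during training its gradient
    would be approximated with a straight-through estimator (STE)."""
    return (w >> i) & 1

def hd_loss_bit(W, i):
    """Hamming-distance regularizer restricted to bit plane i: the number
    of flips of bit i along the output-channel (row) streaming direction."""
    b = bit_plane(W, i)
    return int(np.abs(np.diff(b, axis=0)).sum())

# Freeze-and-regularize schedule (sketch): regularize the MSB first, then
# freeze it and move on to the next bit, until all B bits are fixed.
B = 4
W = np.array([[0b1010, 0b0101],
              [0b0011, 0b1100]])
schedule = [hd_loss_bit(W, i) for i in reversed(range(B))]  # MSB -> LSB
print(schedule)  # [2, 0, 0, 2]
```

During actual training, each `hd_loss_bit(W, i)` term would be scaled by $\lambda$ and added to the task loss while the already-processed (more significant) bits stay frozen.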
The second problem with the loss is that computing $\mathrm{HD}(W)$ requires the input channel clusters and the output channel orders. As the weight matrices are updated during training, both the optimal input channel clusters and the optimal output channel order can change. Hence, after each epoch of training, we run the cluster-then-reorder algorithm to re-cluster the input channels and reorder the output channels. The final training procedure is shown in Figure 13.
5 Hardware Support
In this section, we discuss the necessary hardware support for the proposed algorithms, including direct greedy reorder, segmentthenreorder, and clusterthenreorder algorithms.
The direct reorder algorithm only changes the sequence of output channel generation, so no extra hardware support is needed; a post-training pass that rearranges the weight matrices is sufficient. Consider the example in Figure 16: to swap two output channels of the first layer, both the corresponding rows of the first layer's weight matrix and the corresponding columns of the second layer's weight matrix need to be swapped.
The segment-then-reorder algorithm divides the input channels into segments and reorders the output channels of each segment separately. Hence, the same row in different segmented weight submatrices may correspond to partial sums of different output channels. To guarantee correct reduction of the partial sums, we add an output address lookup table (LUT) that translates the index of the accumulator's counter into the actual address for accumulation, as shown in Figure LABEL:sub@fig:output_lut. Figure LABEL:sub@fig:output_lut_exg shows an example of how the address LUT guides the accumulation: by modifying the LUT entries corresponding to different counter indices, the partial sums are correctly accumulated.
If we assume the output buffer depth to be $D$, the LUT needs at least $D$ entries, and each entry needs $\lceil \log_2 D \rceil$ bits. For a reasonable output buffer depth, e.g., 1024, the LUT SRAM size is less than 2 KB, which is very small and thus has negligible energy and area overhead.
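The LUT sizing argument can be checked with a one-line calculation; `lut_size_bytes` is an illustrative helper, not part of the paper's design:

```python
import math

def lut_size_bytes(depth):
    """Address-LUT cost for an output buffer with `depth` entries:
    `depth` entries of ceil(log2(depth)) bits each."""
    bits_per_entry = math.ceil(math.log2(depth))
    return depth * bits_per_entry / 8

# 1024 entries x 10 bits = 10240 bits = 1280 bytes = 1.25 KB, under 2 KB.
print(lut_size_bytes(1024))  # 1280.0
```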
Compared to the segment-then-reorder algorithm, the cluster-then-reorder algorithm also changes the order of the input channels, i.e., the columns of the weight matrices. While the clustering does not affect the correctness of the outputs, it may affect the memory fetching of the input activations. We leverage the output address LUT of the previous layer to reorder the activations, avoiding any complication in or modification to the input fetching logic: when the current layer requires a particular input channel sequence, the output address LUT of the previous layer is simply programmed to generate its output channels in that sequence.
6 Experimental Results
6.1 Experimental Setup
In this section, we report our experiments demonstrating the effectiveness of the proposed Hamming distance reduction techniques. We use MobileNetV2 sandler:2018:mobilenetv2 and ResNet26 he:2016:resnet trained on the Cifar10 and Cifar100 datasets for the evaluation. The 1-by-1 convolution layers in MobileNetV2 and the 3-by-3 convolution layers in ResNet26 are selected.^2 The layer shapes are shown in Appendix A. To evaluate the energy consumption, we run simulation on a post-layout extracted netlist. We designed an input-stationary systolic array with 8 rows and 8 columns; each PE in the array supports the multiplication and accumulation of 8-bit activations and 4-bit weights. The array is synthesized, placed, and routed using a commercial technology library, and the energy consumption is evaluated at a typical process corner. The leakage energy is ignored in the evaluation, as it is more than two orders of magnitude smaller than the dynamic energy.
^2 3-by-3 depthwise separable convolutions are not considered, as they are usually hard to map on systolic arrays and consume only a very small part of the total energy.
Dataset  Setting  Top-1 Acc  Top-5 Acc  Max HD Red.  Avg HD Red.  Max Energy Red.  Avg Energy Red.
Cifar10  Baseline  94.38  99.82  1.0×  1.0×  1.0×  1.0×
Cifar10  HD-aware  94.22  99.00  41.6×  7.55×  18.6×  6.63×
Cifar100  Baseline  78.21  94.53  1.0×  1.0×  1.0×  1.0×
Cifar100  HD-aware (increasing $\lambda$)  77.98  94.20  2.31×  1.24×  2.17×  1.26×
Cifar100  HD-aware (increasing $\lambda$)  77.47  94.07  3.19×  1.50×  2.86×  1.47×
Cifar100  HD-aware (increasing $\lambda$)  77.29  94.24  4.54×  1.76×  3.88×  1.67×
Cifar100  HD-aware (increasing $\lambda$)  77.62  94.26  5.95×  2.00×  4.92×  1.86×
6.2 PostTraining Hamming Distance Optimization
We first compare the effectiveness of the different post-training Hamming distance optimization algorithms: direct reorder, segment-then-reorder, and cluster-then-reorder. We select the 1-by-1 convolution layers from MobileNetV2 and the 3-by-3 layers from ResNet26 for the evaluation, and compare the Hamming distance of each algorithm against a baseline without any optimization. As shown in Figure 23, when the number of input channels per cluster is 8, the average Hamming distance is reduced by 1.96× and 1.54× for MobileNetV2 and ResNet26, respectively, which translates to 1.62× and 1.49× reductions of the average energy consumption.
We also compare the segment-then-reorder and cluster-then-reorder algorithms in more detail for MobileNetV2, as shown in Figure 20. The cluster-then-reorder algorithm usually achieves a higher reduction for the even layers, e.g., layer 2, layer 4, etc. These are the second 1-by-1 convolution layers in the inverted residual blocks, which have a larger number of input channels and a smaller number of output channels; for these layers, more clusters can be formed to achieve better results. For the odd layers, with a smaller number of input channels and a larger number of output channels, the two methods perform similarly.
6.3 TrainingAware Hamming Distance Optimization
Setting  Max HD Red.  Avg HD Red.  Max Energy Red.  Avg Energy Red.
Baseline  1.0×  1.0×  1.0×  1.0×
Cluster-then-reorder only  2.44×  1.96×  2.27×  1.84×
HD-aware training only  5.95×  2.00×  4.92×  1.86×
HD-aware training + cluster-then-reorder  10.2×  3.79×  8.51×  2.85×
We now evaluate the effectiveness of the training-aware Hamming distance optimization. We select MobileNetV2 and train the network on the Cifar10 and Cifar100 datasets. By controlling the regularization coefficient $\lambda$, we explore the trade-off between the accuracy and the Hamming distance reduction. For practical purposes, we constrain the accuracy degradation to within 1%. As shown in Table 3, on the Cifar10 dataset, the average Hamming distance can be reduced by 7.55×, which leads to a 6.63× reduction of the average energy across layers. On the Cifar100 dataset, the average Hamming distance reduction and the average energy reduction are 2.00× and 1.86×, respectively.
6.4 Combined Hamming Distance Optimization.
We now combine the post-training optimization techniques with the training-aware optimization algorithm. As shown in Table 4, the proposed training-aware and post-training techniques work orthogonally to each other. By combining them, for MobileNetV2 trained on Cifar100, the average Hamming distance of streaming the weight matrices is reduced by 3.79× and the average datapath energy is reduced by 2.85×.
7 Conclusion
The energy consumption of the arithmetic datapath in a neural network accelerator is heavily dependent on the Hamming distance of the input operand sequence. With the proposed Hamming-distance-aware training and post-processing algorithms, the datapath energy consumption can be significantly reduced. Evaluation with the MobileNetV2 and ResNet neural networks shows that the proposed methods achieve a 2.85× datapath energy reduction on average and up to 8.51× for certain network layers, demonstrating significant potential for energy-critical neural network accelerator designs.
References
Appendix Appendix A Layer shapes of MobileNetV2 and ResNet26
# Layer  1  2  3  4  5  6  7  8  9  10  11  12  13  14  15  16  17  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33 

$C$ (in)  16  96  24  144  24  144  32  192  32  192  32  192  64  384  64  384  64  384  64  384  96  576  96  576  96  576  160  960  160  960  160  960  320
$M$ (out)  96  24  144  24  144  32  192  32  192  32  192  64  384  64  384  64  384  64  384  96  576  96  576  96  576  160  960  160  960  160  960  320  1280
# Layer  1  2  3  4  5  6  7  8  9  10  11  12  13  14  15  16  17  18  19  20  21  22  23  24  25  26 

$C$ (in)  16  16  16  16  16  16  16  16  16  32  16  32  32  32  32  32  32  32  64  32  32  32  32  32  32  32
$M$ (out)  16  16  16  16  16  16  16  16  32  32  16  32  32  32  32  32  32  64  64  64  64  64  64  64  64  64