
Improving Efficiency in Neural Network Accelerator Using Operands Hamming Distance Optimization

02/13/2020
by   Meng Li, et al.
Facebook

Neural network accelerators are a key enabler for on-device AI inference, for which energy efficiency is an important metric. The data-path energy, including the computation energy and the data movement energy among the arithmetic units, claims a significant part of the total accelerator energy. By revisiting the basic physics of the arithmetic logic circuits, we show that the data-path energy is highly correlated with the bit flips that occur when streaming the input operands into the arithmetic units, defined as the hamming distance of the input operand matrices. Based on this insight, we propose a post-training optimization algorithm and a hamming-distance-aware training algorithm to co-design and co-optimize the accelerator and the network synergistically. Experimental results based on post-layout simulation with MobileNetV2 demonstrate on average 2.85× data-path energy reduction and up to 8.51× data-path energy reduction for certain layers.


1 Introduction

Deep neural networks (DNNs) have revolutionized applications ranging from computer vision to speech and natural language processing [lecun:2015:dl], and are now widely deployed in data centers [Jouppi:2017; kim:2018:aml_facebook; jongsoo:2018:datacetner] and edge devices [Du:2017; zhang:2017:helloedge; carole:2019:edge]. As modern DNNs usually require significant computation, neural network accelerators have been extensively studied in recent years to enable energy-efficient processing [chen:2014:diannao; chen:2016:eyeriss; Sze:2017; Jouppi:2017; hardik:2018:bitfusion].

The datapath of a neural network accelerator, including the arithmetic compute units and the data bus among them, lies at the heart of the accelerator and plays an important role in its energy consumption. With the trend of aggressive operand quantization (8 bits or below) and near/in-memory computation, the energy consumption of memory accesses in neural network accelerators is greatly reduced. In many state-of-the-art accelerator designs [Andri:2016; Gao:2017; Park:2018], the datapath can consume 40-70% of the total energy.

Conventionally, the datapath energy consumption in a neural network accelerator can be estimated as $E_{\mathrm{dp}} = N_{\mathrm{op}} \cdot e_{\mathrm{op}} \cdot \alpha$, where $N_{\mathrm{op}}$ denotes the total number of operations of the neural network, $e_{\mathrm{op}}$ is the datapath energy consumption of one operation, and $\alpha$ is a correction term that depends on the network parameters and the underlying hardware design. Previous research mainly focuses on reducing $N_{\mathrm{op}}$, e.g., by optimizing the network topology [forrest:2016:squeezenet; howard:2017; tan:2019:mnasnet] or pruning the network [han:2015:deepcompression; he:2017], and on reducing $e_{\mathrm{op}}$, e.g., by network quantization [moons:2016; Park:2018; hardik:2018:bitfusion] or binarization [courbariaux:2016:binarized]. In contrast, reducing $\alpha$ receives less attention. Existing works mainly focus on exploiting the sparsity of the network parameters and activations to gate the compute units and skip unnecessary computations [chen:2016:eyeriss].

In this work, we explore a new dimension to reduce $\alpha$ and the datapath energy. We show that as most accelerators leverage spatial data reuse [chen:2016:eyeriss] and stream input operands into the compute array, the sequence of the input operands significantly impacts the datapath energy. Specifically, we find that the datapath energy is strongly correlated with the bit flips that occur when streaming the input operands. In this paper, we leverage the concept of hamming distance to formalize the bit flip analysis. A series of post-training and training-aware techniques are proposed to co-design and co-optimize the accelerator and the network to reduce the hamming distance of the input operand sequence. Experimental results based on post-layout simulation demonstrate on average 3.6× datapath energy reduction and up to 8.51× energy reduction for certain layers. The proposed techniques are compatible with other optimization knobs, e.g., pruning and quantization. The contributions of the paper can be summarized as follows:

  • We discover the correlation between the datapath energy and the hamming distance when streaming the input operands and further propose the concept of hamming distance optimization as a new direction of datapath energy optimization;

  • We propose a post-training optimization algorithm to reduce the hamming distance of the neural network model, which introduces negligible hardware overhead and no impact on the model output;

  • We propose a hamming-distance-aware training algorithm, which reduces the hamming distance of the neural network model with negligible effect on accuracy;

  • Experiments based on the post-layout simulation demonstrate promising results (up to 8.51× datapath energy reduction) by combining the hamming-distance-aware training and the post-processing algorithm.

2 Background: Spatial Accelerators

Modern NN accelerators usually comprise the following major components: a two-dimensional arithmetic compute array, a network-on-chip (NoC), control blocks, and an on-chip memory [Sze:2017]. Specifically, the on-chip memory usually consists of several levels of hierarchy, including a global buffer, an inter-unit network to facilitate data passing among the arithmetic units, and register files (RFs) within each arithmetic unit [chen:2016:eyeriss]. The memory access energy to different memory hierarchies can vary significantly.

Figure 3: Different dataflow variants: (a) exploits the input stationary dataflow, and (b) leverages the output stationary dataflow. For these dataflow variants, weights are organized in a sequence along either the output channel or the input channel dimension and are sent into the compute array consecutively.

To reduce accesses to more expensive memory hierarchies, specialized processing dataflows are designed to enable data reuse across different computation units. Representative dataflows include input stationary, output stationary, row stationary, etc. [chen:2016:eyeriss; Sze:2017]. The dataflow architecture dictates what data get read into the memory hierarchy and how data are propagated in the compute array. Figure 3 shows two widely used designs [chen:2018:eyerissv2]. The design in Figure 3(a) leverages the input stationary dataflow and relies on unrolling both the input channel dimension and the input spatial locations to map the operations spatially onto the array and exploit the computation parallelism. The weights are streamed into the array and can be reused horizontally with input pixels from different spatial locations, while the partial sums are accumulated spatially across the column. Instead of saving the partial sums directly to the activation SRAM, they are usually stored in an accumulation buffer first to reduce the memory access energy. Once the partial sums are fully reduced, they may go through the nonlinear units and be stored back to the global SRAM. Similarly, the design in Figure 3(b) leverages the output stationary dataflow and relies on unrolling the output channel dimension and the output spatial dimensions to enable data reuse. In this scheme, the weights are still streamed along the row direction and the input activations are streamed in the orthogonal direction to be reused across different output channels.

Popular neural network layers, such as the convolution layer and the fully-connected layer, can be easily mapped to the accelerator. Consider the example of a 1-by-1 convolution in Figure 6. To map the computation onto the input stationary compute array in Figure 3(a), the input activations are pre-filled with different input spatial locations unrolled horizontally and different input channels unrolled vertically. The weights are streamed in a sequence into the arithmetic array. For the input stationary dataflow, weights from different input channels are fed spatially into different rows and weights from different output channels are streamed temporally into the same row.

Figure 6: Mapping a 1-by-1 convolution to the input stationary accelerator in Figure 3(a).

The energy consumption of the accelerator is composed of the datapath energy (including the arithmetic computation energy and the data propagation energy among compute units), the memory access energy, and the control energy. When all the operands can fit into the local SRAM [Park:2018], the datapath and memory access energy can be computed as

$E = N_{\mathrm{MAC}} \cdot e_{\mathrm{dp}} + \left(\frac{N_{\mathrm{MAC}}}{r_{\mathrm{W}}} + \frac{N_{\mathrm{MAC}}}{r_{\mathrm{IA}}} + \frac{N_{\mathrm{MAC}}}{r_{\mathrm{PS}}}\right) \cdot e_{\mathrm{SRAM}},$

where $r_{\mathrm{W}}$, $r_{\mathrm{IA}}$, and $r_{\mathrm{PS}}$ denote the reuse factors of the weights, input activations, and partial sums, respectively. $e_{\mathrm{dp}}$ denotes the datapath energy per operation and $e_{\mathrm{SRAM}}$ denotes the SRAM access energy, which includes the SRAM read/write energy and the data movement energy from the SRAM to the compute array.

Assume the ratio between the compute energy, the inter-unit propagation energy, and the SRAM access energy is 1:2:6 [chen:2016:eyeriss]. For a reasonable choice of the reuse factors, the ratio between the datapath energy and the SRAM energy becomes about 3:4 (assuming weights and inputs are 8-bit and partial sums are 32-bit). The datapath energy therefore consumes a significant portion of the total energy, and hence it is crucial to reduce it.

The datapath energy can be further divided into three parts, including switching energy, glitch energy, and leakage Rabaey:2008. Both the switching energy and glitch energy are caused by the circuit nodes switching from 0 to 1 or from 1 to 0, denoted as bit flips. Leakage energy is caused by the small leakage current when the transistors are turned off and its contribution to the datapath energy is usually orders of magnitude smaller than glitch and switching. Hence, we ignore the leakage energy in the paper.

3 Motivation: Bit Flips Impact Datapath Energy

As described in Section 2, the datapath energy accounts for a significant portion of the total energy, and the bit flips inside the datapath are its main source. The datapath bit flips are determined by the values and the streaming pattern of the input operands, i.e., weights, input activations, and partial sums. Because the activations and partial sums are input dependent, we focus on analyzing the impact of the weight matrices.

Consider the example of a 2-bit weight matrix $\mathbf{W}$. Without loss of generality, we assume an input-stationary compute array as shown in Figure 3(a), and $\mathbf{W}$ is streamed into the array following the mapping in Figure 6. The weight sequence fed into the first row of the compute array then consists of the weights of consecutive output channels for the same input channel, and the bit flips of this weight sequence at the compute array input are 6 in the example. To confirm the relation between the bit flips of the weight sequence and the datapath energy, we use the weight matrices of MobileNetV2 [sandler:2018:mobilenetv2] trained on the Cifar100 dataset as an example and generate random input activations. We evaluate the bit flips of the weight sequence and the datapath energy consumption with post-layout simulation (see Section 6 for the detailed experimental setup). As shown in Figure 7, the total bit flips of the weight sequence and the energy consumption demonstrate a strong linear relation. Moreover, given a fixed number of total bit flips, the energy is independent of the length of the weight sequence and the bit flipping probability.

Figure 7: Total bit flips of the weight sequence and the energy consumption demonstrate strong correlation: the colormap represents the average bit flip probability of the input sequence.

Hence, to minimize the datapath energy, an effective approach is to reduce the bit flips of the weight sequence. We observe that the bit flips can be reduced if the sequence of weight streaming is carefully reordered. Consider $\mathbf{W}$ in the example above. If we swap the second row and the third row of the matrix, i.e., exchange two output channels, we obtain a reordered matrix $\mathbf{W}'$. Now, by streaming $\mathbf{W}'$ into the compute array instead of $\mathbf{W}$, the bit flips are reduced. Note that swapping the rows of the weight matrix is essentially adjusting the order of generating the output channels, so there is no influence on the neural network functionality. As the swapping can be performed via post-processing at the model level, no specific hardware support is needed.
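As an illustration, the following minimal Python sketch counts the bit flips of a streamed weight sequence and shows how swapping two output channel rows can reduce them. The 2-bit matrix below is a hypothetical example for illustration, not the specific matrix used in the paper.

    import numpy as np

    def bit_flips(seq):
        # Bit transitions between consecutive values streamed into one compute-array row.
        return sum(bin(int(a) ^ int(b)).count("1") for a, b in zip(seq[:-1], seq[1:]))

    def stream_bit_flips(W):
        # Total bit flips when streaming W row by row (output channels) into the array:
        # column c of W is the temporal sequence seen by the c-th row of the array.
        return sum(bit_flips(W[:, c]) for c in range(W.shape[1]))

    # Hypothetical 2-bit weight matrix: rows = output channels, columns = input channels.
    W = np.array([[0, 3],
                  [3, 0],
                  [0, 3],
                  [3, 0]])
    print(stream_bit_flips(W))             # original output channel order
    W_swapped = W[[0, 2, 1, 3], :]         # swap output channels 1 and 2
    print(stream_bit_flips(W_swapped))     # fewer bit flips after reordering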

Besides the post-training processing of the weight matrices, another orthogonal approach is to incorporate the bit flips of the weight sequence into the training procedure and reduce the bit flips without sacrificing the model accuracy. Consider two weight matrices $\mathbf{W}^{(1)}$ and $\mathbf{W}^{(2)}$ that produce the same computation results. While the two matrices are functionally equivalent, the bit flips of the weight sequence for $\mathbf{W}^{(2)}$ are smaller. Hence, without impacting the computation results, the bit flips can be reduced. In fact, by further reordering the output channels of $\mathbf{W}^{(2)}$, the bit flips can be reduced even further.

In the following sections, we formalize our analysis of the bit flips of the weight sequence and describe our post-training and training-aware techniques to reduce them.

4 Methodology: Hamming Distance Optimization

In this section, we will formalize the concept of bit flips and propose both post-training and training-aware techniques to minimize the bit flips of the streaming weights. For convenience, we use the input stationary dataflow (e.g., Figure 3(a)) as an example throughout the analysis, but the definitions, analysis, and conclusions can be easily applied to other dataflow schemes once the weights are streamed into the compute array. The notations used in this paper are summarized in Table 1.

$N$, $H$, $W$, $K$ | Output batch, height, width, channel
$C$, $R$, $S$ | Input channel, filter height, width
$\mathbf{W}$ | Model weight matrix
$B$ | Bit width of model weights
$\pi$ | Sequence of output channels
$\mathcal{C}$ | Cluster of input channels
Table 1: Notations used in the paper.

4.1 Problem Formulation

In coding theory, the bit difference between two binary strings is formally defined as the hamming distance [Hamming:1950]. Accordingly, we define the hamming distance between two $B$-bit numbers $a$ and $b$ as

$\mathrm{HD}(a, b) = \sum_{i=1}^{B} \mathrm{bit}_i(a) \oplus \mathrm{bit}_i(b),$

where $\oplus$ denotes the XOR operation and $\mathrm{bit}_i(\cdot)$ is the function that extracts the $i$-th bit of the number.

Consider a weight matrix $\mathbf{W} \in \{0, 1, \ldots, 2^B-1\}^{K \times C}$ (we assume the filter height and width $R = S = 1$ in this case, but the definition and analysis can be easily extended to cases where $R$ and $S$ are larger than 1). As the input stationary dataflow unrolls the input channel dimension ($C$) along the compute array column direction and streams the weights of the different output channels ($K$) in temporal sequence into the array, we define the hamming distance of streaming $\mathbf{W}$ as

$\mathrm{HD}(\mathbf{W}) = \sum_{k=1}^{K-1} \sum_{c=1}^{C} \mathrm{HD}(\mathbf{W}_{k,c}, \mathbf{W}_{k+1,c}).$

We also define the normalized hamming distance (NHD) of streaming $\mathbf{W}$ as

$\mathrm{NHD}(\mathbf{W}) = \frac{\mathrm{HD}(\mathbf{W})}{(K-1) \cdot C \cdot B}.$

Hence, $\mathrm{HD}(\mathbf{W})$ captures the total bit flips of streaming $\mathbf{W}$ and $\mathrm{NHD}(\mathbf{W})$ represents the bit flip probability. We show $\mathrm{NHD}(\mathbf{W})$ for different layers of MobileNetV2 and ResNet26 trained on Cifar100 in Figure 8; as we can see, $\mathrm{NHD}(\mathbf{W})$ is close to 0.5 for all the layers. In the following sections, we will propose techniques to minimize $\mathrm{HD}(\mathbf{W})$ to reduce the bit flips and the datapath energy.
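The definitions above translate directly into code. Below is a minimal NumPy sketch of $\mathrm{HD}(\mathbf{W})$ and $\mathrm{NHD}(\mathbf{W})$; the function and variable names are illustrative choices, not taken from the paper.

    import numpy as np

    def hd(a, b):
        # Hamming distance between two non-negative integers.
        return bin(int(a) ^ int(b)).count("1")

    def hd_matrix(W):
        # HD(W): total bit flips when streaming the output channels (rows) in order.
        K, C = W.shape
        return sum(hd(W[k, c], W[k + 1, c]) for k in range(K - 1) for c in range(C))

    def nhd_matrix(W, bits):
        # NHD(W): bit flip probability of the streamed weight sequence.
        K, C = W.shape
        return hd_matrix(W) / ((K - 1) * C * bits)

    # Example: random 4-bit weights with K=64 output channels and C=32 input channels.
    rng = np.random.default_rng(0)
    W = rng.integers(0, 16, size=(64, 32))
    print(nhd_matrix(W, bits=4))  # close to 0.5 for random weights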

Figure 8: $\mathrm{NHD}(\mathbf{W})$ distribution for different layers in MobileNetV2 and ResNet26.

4.2 Output Channel Reordering

Inspired by the example in Section 3, a straightforward technique to minimize $\mathrm{HD}(\mathbf{W})$ is to reorder the sequence in which the output channels of $\mathbf{W}$ are streamed into the compute array. Let $\pi$ denote the sequence of output channels to stream into the array and $\mathrm{HD}(\mathbf{W}, \pi)$ denote the hamming distance of streaming $\mathbf{W}$ following $\pi$; then we have

$\mathrm{HD}(\mathbf{W}, \pi) = \sum_{k=1}^{K-1} \sum_{c=1}^{C} \mathrm{HD}(\mathbf{W}_{\pi(k),c}, \mathbf{W}_{\pi(k+1),c}).$

The output channel reordering problem is defined as follows.

Problem 1

(Output Channel Reordering) Given a weight matrix $\mathbf{W}$, find $\pi^*$ such that $\mathrm{HD}(\mathbf{W}, \pi)$ is minimized, i.e., $\pi^* = \arg\min_{\pi} \mathrm{HD}(\mathbf{W}, \pi)$.

As $\pi$ is a reordering of the output channels that contains each output channel exactly once, we map the reordering problem to a Traveling Salesman Problem (TSP) [Miller:1960]. Specifically, each output channel corresponds to one location to visit, and the hamming distance between two output channels $i$ and $j$, i.e., $\sum_{c=1}^{C} \mathrm{HD}(\mathbf{W}_{i,c}, \mathbf{W}_{j,c})$, corresponds to the distance between the two locations. Minimizing $\mathrm{HD}(\mathbf{W}, \pi)$ is thus equivalent to searching for the shortest path that visits all the locations, so the complexity of solving the output channel reordering problem optimally scales exponentially and quickly becomes intractable for moderate-size problems.

To efficiently solve the reordering problem, we propose a greedy search algorithm as described in Algorithm 1. The algorithm first initializes the sequence $\pi$ by assigning the first output channel to the starting position of $\pi$. After that, the unassigned output channel that has the smallest hamming distance to the previous channel in $\pi$ is repeatedly appended to $\pi$. The complexity of the algorithm scales quadratically with the number of output channels, which is very efficient in practice.

Input: weight matrix $\mathbf{W}$

Output: sequence $\pi$ that minimizes $\mathrm{HD}(\mathbf{W}, \pi)$

$\pi \leftarrow$ Initialize($\mathbf{W}$)  // place the first output channel at the starting position of $\pi$

for $k = 2, \ldots, K$ do

       append to $\pi$ the unassigned output channel with the smallest hamming distance to the last channel in $\pi$
end for
Algorithm 1: Greedy Output Channel Reordering.
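A minimal Python sketch of this greedy reordering is shown below; it is an illustrative implementation of Algorithm 1 under assumed names, not the authors' code.

    import numpy as np

    def channel_distance(W, i, j):
        # Hamming distance between output channels i and j, summed over input channels.
        return sum(bin(int(a) ^ int(b)).count("1") for a, b in zip(W[i], W[j]))

    def greedy_reorder(W):
        # Greedy output channel reordering (Algorithm 1): O(K^2) channel comparisons.
        K = W.shape[0]
        order = [0]                      # start from the first output channel
        remaining = set(range(1, K))
        while remaining:
            last = order[-1]
            nxt = min(remaining, key=lambda j: channel_distance(W, last, j))
            order.append(nxt)
            remaining.remove(nxt)
        return order

    # Usage: reorder the rows of a random 4-bit weight matrix.
    rng = np.random.default_rng(0)
    W = rng.integers(0, 16, size=(32, 16))
    pi = greedy_reorder(W)
    W_reordered = W[pi, :]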

4.3 Input Channel Segmentation and Clustering

While the output channel reordering can help reduce the hamming distance of streaming $\mathbf{W}$, its effectiveness is impacted by the number of input channels $C$. We use MobileNetV2 on the Cifar100 dataset [krizhevsky:2009:cifar] as an example and evaluate the hamming distance reduction for different layers. As shown in Table 2, as $C$ increases, the hamming distance reduction slows down significantly.


Layer | $K$ | $C$ | HD Reduction
layer 7 | 192 | 32 | 1.53
layer 15 | 384 | 64 | 1.33
layer 21 | 576 | 96 | 1.27
layer 27 | 960 | 160 | 1.18
layer 33 | 1280 | 320 | 1.21
Table 2: Hamming distance reduction of output channel reordering for various $K$ and $C$.

One straightforward method to improve the effectiveness of the output channel reordering is to segment the weight matrix along the input channel direction into several small sub-matrices. For each sub-matrix, we can use Algorithm 1 to search for the optimal output channel order to reduce the hamming distance. We denote this method as the segment-then-reorder approach (see the sketch below). It should be noted that as the output channel sequence changes across segments, specific hardware support in the accumulator is required to make sure the partial sums corresponding to the same output channel are correctly accumulated. We will detail the hardware support in Section 5; it introduces negligible overhead to the accumulator. With the segment-then-reorder approach, the hamming distance can be further reduced by 1.5-2.5× compared with the direct output channel reordering (see Section 6).
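A minimal sketch of segment-then-reorder, building on the greedy_reorder function from the sketch in Section 4.2 (again an illustrative implementation, not the authors' code):

    def segment_then_reorder(W, channels_per_segment):
        # Split W along the input channel dimension and reorder output channels per segment.
        # Assumes greedy_reorder() from the Section 4.2 sketch above.
        C = W.shape[1]
        orders = []
        for start in range(0, C, channels_per_segment):
            sub = W[:, start:start + channels_per_segment]
            orders.append(greedy_reorder(sub))  # independent output channel order per segment
        return orders  # one permutation per segment; the accumulator LUT restores alignment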

As expected, the smaller each input channel group is, the better the hamming distance reduction that can be achieved. Hence, the segment-then-reorder algorithm favors compute arrays with a skewed aspect ratio, i.e., more columns and fewer rows. However, the aspect ratio of the compute array also impacts the reuse of different operands [chen:2016:eyeriss] and the utilization. For example, with a larger number of columns in the input stationary array, it takes more pixels in the spatial plane to fill the whole array, which leads to under-utilization for small input activation sizes. While this is not a problem for small-scale arrays with a small number of compute units, it may induce utilization issues for large arrays.

Figure 12: Performance of the cluster-then-reorder algorithm: (a) convergence plot; and normalized hamming distance comparison with the segment-then-reorder algorithm for (b) different layers (8 channels per cluster) and (c) different numbers of channels per cluster.

To further improve the effectiveness when the number of input channels per segment is large, we propose to cluster the input channels before segmenting the weight matrix. Then, the output channels are reordered for each cluster separately. We denote this approach as cluster-then-reorder.

Consider the example of a weight matrix whose input channels cannot all be streamed at once: assume the compute array has 4 rows and therefore only allows streaming 4 input channels simultaneously. Instead of directly segmenting $\mathbf{W}$ along the original input channel order, we can first cluster the input channels into 2 groups and then segment $\mathbf{W}$ according to the clusters. Compared to the segment-then-reorder approach, the hamming distance in the original example can be reduced from 22 to 16. Note that the clustering of the input channels does not impact the output.

Let $\mathcal{C}_1, \ldots, \mathcal{C}_m$ denote the clusters of the input channels. The input channel clustering problem is then defined as follows.

Problem 2

(Input Channel Clustering) Given a weight matrix $\mathbf{W}$, find clusters $\mathcal{C}_1, \ldots, \mathcal{C}_m$ such that the total hamming distance of streaming each sub-matrix is minimized, i.e.,

$\min_{\mathcal{C}_1, \ldots, \mathcal{C}_m} \sum_{j=1}^{m} \min_{\pi_j} \mathrm{HD}(\mathbf{W}_{:,\mathcal{C}_j}, \pi_j),$

where $\mathbf{W}_{:,\mathcal{C}_j}$ denotes the sub-matrix of $\mathbf{W}$ formed by the input channels in cluster $\mathcal{C}_j$.

This is a nested optimization problem that is computationally expensive to solve optimally, even if the inner optimization loop can be solved with the proposed Algorithm 1. Hence, we propose a greedy iterative method to solve the nested optimization problem. As shown in Algorithm 2, in the initialization process, the input channels are randomly assigned to the $m$ clusters and the output channel sequence $\pi_j$ of each cluster is initialized to minimize its hamming distance. The algorithm then alternates between an assignment step and an update step for a total of $T$ iterations. In the assignment step, for each input channel $c$, we evaluate its hamming distance following the optimal sequence $\pi_j$ of each cluster; the input channel is then added to the cluster with the smallest hamming distance. In the update step, we re-compute the optimal sequence $\pi_j$ for each cluster of input channels.

Input: weight matrix $\mathbf{W}$, number of iterations $T$, number of clusters $m$

Output: clusters of input channels $\{\mathcal{C}_j\}$ and the optimal sequence of output channels $\{\pi_j\}$ for each cluster

$\{\mathcal{C}_j\}, \{\pi_j\} \leftarrow$ Random_Initialize($\mathbf{W}$); $t \leftarrow 0$

while $t < T$ do

       // Assignment step
       for each input channel $c$ do
             assign $c$ to the cluster $\mathcal{C}_j$ whose sequence $\pi_j$ yields the smallest hamming distance for $c$
       end for
       // Update step
       for each cluster $\mathcal{C}_j$ do
             $\pi_j \leftarrow$ Greedy_Reorder($\mathbf{W}_{:,\mathcal{C}_j}$)  // Algorithm 1
       end for
       $t \leftarrow t + 1$
end while
Algorithm 2: Cluster-then-reorder Algorithm.
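For illustration, a minimal Python sketch of the alternating procedure is given below. It reuses greedy_reorder() from the Section 4.2 sketch; the names and structure are assumptions for clarity, not the authors' implementation.

    import numpy as np

    def channel_cost(W, c, order):
        # HD contributed by input channel c when the output channels follow `order`.
        col = W[order, c]
        return sum(bin(int(a) ^ int(b)).count("1") for a, b in zip(col[:-1], col[1:]))

    def cluster_then_reorder(W, num_clusters, iters=15, seed=0):
        # Greedy alternating optimization (Algorithm 2): assignment step + update step.
        # Assumes greedy_reorder() from the Section 4.2 sketch above.
        rng = np.random.default_rng(seed)
        C = W.shape[1]
        assign = rng.integers(0, num_clusters, size=C)   # random initialization
        orders = [greedy_reorder(W[:, assign == j]) if (assign == j).any()
                  else list(range(W.shape[0])) for j in range(num_clusters)]
        for _ in range(iters):
            # Assignment step: move each input channel to the lowest-cost cluster.
            for c in range(C):
                assign[c] = min(range(num_clusters),
                                key=lambda j: channel_cost(W, c, orders[j]))
            # Update step: re-run the greedy reordering for each cluster.
            for j in range(num_clusters):
                cols = np.where(assign == j)[0]
                if len(cols):
                    orders[j] = greedy_reorder(W[:, cols])
        return assign, orders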

The convergence of the proposed clustering algorithm is guaranteed if the inner loop optimization, i.e., the output channel reordering problem, is solved optimally. This is because the objective function of the clustering problem is always bounded and is guaranteed to be reduced in the assignment and update steps of each iteration. In practice, we use the greedy algorithm described in Section 4.2 to solve the update step. We find that the cluster-then-reorder algorithm converges very well and consistently outperforms the segment-then-reorder algorithm.

We use the layers of MobileNetV2 [sandler:2018:mobilenetv2] on the Cifar100 dataset as an example and run the clustering algorithm 20 times with random initialization. The convergence plot is shown in Figure 12(a). The normalized hamming distance is computed as the hamming distance of each algorithm normalized by the hamming distance without output channel reordering. As we can see, the clustering algorithm converges within 15 iterations and the run-to-run variation of the clustering algorithm is very small. We also compare the cluster-then-reorder algorithm with the segment-then-reorder algorithm for different layers and different numbers of channels per cluster. As shown in Figures 12(b) and 12(c), the cluster-then-reorder algorithm can outperform the baseline algorithm by up to 1.21×.

The proposed algorithm is very efficient since the complexity of both the update step and the assignment step scales only polynomially with the number of input channels $C$, the number of output channels $K$, and the number of clusters $m$.

4.4 Hamming Distance-Aware Training

While the techniques proposed above focus on post-training optimization, we also propose a hamming distance-aware training procedure to further reduce the hamming distance of streaming $\mathbf{W}$. The basic idea is to incorporate the hamming distance into the loss function and explicitly encourage its reduction, as shown below:

$\mathcal{L} = \mathcal{L}_{\mathrm{CE}} + \lambda \cdot \mathrm{HD}(\mathbf{W}),$

where $\mathcal{L}_{\mathrm{CE}}$ represents the original cross-entropy loss and $\lambda$ is used to explicitly control the trade-off between the accuracy and the hamming distance reduction.

However, there are two main problems with this loss. Firstly, to compute $\mathrm{HD}(\mathbf{W})$, the bit extraction function $\mathrm{bit}_i(\cdot)$ is needed. Considering an integer $w$, to get the $i$-th bit, we have

$\mathrm{bit}_i(w) = \lfloor w / 2^{i-1} \rfloor \bmod 2.$

Because the floor operation is not differentiable, $\mathrm{HD}(\mathbf{W})$ is not differentiable as well.

Previously, the straight-through estimator (STE) has been proposed to approximate the gradients of such non-differentiable functions [bengio:2013]. However, directly applying STE to the hamming distance loss means that only the most significant bit of the weight parameters can be effectively regularized. Hence, we propose an iterative freeze-and-regularize procedure. In the network training process, we first add the regularization to the most significant bit; after several epochs, we freeze the most significant bit and regularize the second most significant bit. The iterative process continues until all the bits of the weights are fixed.
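As a concrete illustration of the regularizer, here is a minimal PyTorch-style sketch that extracts one bit plane of the quantized weights with a straight-through estimator and penalizes the hamming distance between consecutive output channels on that bit. The names (BitSTE, hd_regularizer, lam, current_bit) are hypothetical; the paper does not publish code, and this is a sketch of the general technique rather than the authors' implementation.

    import torch

    class BitSTE(torch.autograd.Function):
        # Extract bit i of a float tensor holding non-negative integer-valued (quantized) weights.
        @staticmethod
        def forward(ctx, w, i):
            return torch.floor(w / (2 ** i)) % 2

        @staticmethod
        def backward(ctx, grad_output):
            # Straight-through estimator: pass the gradient through unchanged for w, none for i.
            return grad_output, None

    def hd_regularizer(w_int, bit):
        # Hamming distance between consecutive output channels (rows) restricted to one bit plane.
        b = BitSTE.apply(w_int, bit)
        # For 0/1 values, |b_k - b_{k+1}| equals the XOR of the two bits.
        return (b[1:] - b[:-1]).abs().sum()

    # Usage inside a training loop (freeze-and-regularize): start from the most significant
    # bit and move to lower bits as the higher bits are frozen.
    # loss = cross_entropy + lam * hd_regularizer(quantized_weights, current_bit)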

The second problem is that to compute $\mathrm{HD}(\mathbf{W})$, the input channel clusters and output channel orders are needed. As the weight matrices get updated during training, both the optimal input channel clusters and the optimal output channel order can change. Hence, after each epoch of training, we leverage the cluster-then-reorder algorithm to cluster the input channels and reorder the output channels. The final training procedure is shown in Figure 13.

Figure 13: Hamming distance-aware training procedure.

5 Hardware Support

In this section, we discuss the necessary hardware support for the proposed algorithms, including direct greedy reorder, segment-then-reorder, and cluster-then-reorder algorithms.

The direct reorder algorithm only switches the sequence of the output channel generation. No extra hardware support is needed; instead, a post-training processing step that re-arranges the weight matrices of the model is sufficient. Consider the example in Figure 16. To switch the output channels of the first layer, both the rows of the weight matrix in the first layer and the corresponding columns of the weight matrix in the second layer need to be switched accordingly.

Figure 16: Post-training processing to reorder the output channels.
Figure 19: (a) Hardware support for the segment-then-reorder algorithm and (b) an example with 2 segments.

The segment-then-reorder algorithm divides the input channels into segments and reorders the output channels for each segment separately. Hence, the same row in different segmented weight sub-matrices may correspond to the partial sum of different output channels. To guarantee correct reduction of the partial sums, we add an output address lookup table (LUT) that translates the index of the counter in the accumulator to the actual address for accumulation, as shown in Figure 19(a). Figure 19(b) shows an example of how the address LUT guides the accumulation. As can be seen, by modifying the LUT entries corresponding to different counter indices, the partial sums are correctly accumulated.
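To make the mechanism concrete, the following minimal Python sketch models the accumulator behavior (it is an illustrative software model, not the actual RTL): partial sums from two segments with different output channel orders are accumulated into a shared output buffer through an address LUT.

    import numpy as np

    def accumulate_with_lut(partial_sums_per_segment, luts, num_outputs):
        # luts[s][counter] gives the output buffer address for the counter-th partial sum of segment s.
        out = np.zeros(num_outputs)
        for s, psums in enumerate(partial_sums_per_segment):
            for counter, value in enumerate(psums):
                out[luts[s][counter]] += value  # LUT translates counter index -> output channel
        return out

    # Example: 4 output channels, 2 segments producing partial sums in different (reordered) orders.
    seg0_order = [2, 0, 3, 1]   # output channel order of segment 0
    seg1_order = [1, 3, 0, 2]   # output channel order of segment 1
    psums = [np.array([.2, .1, .4, .3]), np.array([.1, .3, .0, .2])]
    print(accumulate_with_lut(psums, [seg0_order, seg1_order], num_outputs=4))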

Figure 20: Hamming distance reduction comparison for the segment-then-reorder and the cluster-then-reorder algorithm on MobileNetV2.

If we assume the output buffer depth to be $D$, the LUT needs to have at least $D$ entries and each entry needs $\lceil \log_2 D \rceil$ bits. For a reasonable output buffer depth, e.g., 1024, the LUT SRAM size is less than 2 KB, which is very small and thus has negligible energy and area overhead.

Compared to the segment-then-reorder algorithm, the cluster-then-reorder algorithm also changes the order of the input channels, i.e., the columns of the weight matrices. While the clustering does not impact the correctness of the outputs, it may impact the memory fetching of the input activations. We leverage the output address LUT to reorder the activations and thus avoid any complication or modification to the input fetching logic. For example, assume the cluster-then-reorder algorithm requires a permuted input channel sequence for the current layer. When executing the previous layer, the output address LUT can simply be set to that permutation to reorder the sequence of output channel generation.

6 Experimental Results

6.1 Experimental Setup

In this section, we report on our experiments to demonstrate the effectiveness of the proposed hamming distance reduction techniques. We use MobileNetV2 [sandler:2018:mobilenetv2] and ResNet26 [he:2016:resnet] trained on the Cifar10 and Cifar100 datasets for the evaluation. The 1-by-1 convolution layers in MobileNetV2 and the 3-by-3 convolution layers in ResNet26 are picked for evaluation (3-by-3 depthwise convolutions are not considered as they are usually hard to map onto systolic arrays and they only consume a very small part of the total energy). The layer shapes are shown in Appendix A. To evaluate the energy consumption, we use simulation on a post-layout extracted netlist. We designed an input-stationary systolic array with 8 rows and 8 columns. Each PE in the array supports the multiplication and accumulation of 8-bit activations and 4-bit weights. The array is synthesized, placed, and routed using a commercial technology library, and the energy consumption is evaluated at a typical process corner. The leakage energy is ignored in the evaluation as it is more than two orders of magnitude smaller than the dynamic energy.

Dataset | Top-1 Acc | Top-5 Acc | Best-Layer HD Reduction | Average HD Reduction | Best-Layer Energy Reduction | Average Energy Reduction
Cifar10 (baseline) | 94.38 | 99.82 | 1.0 | 1.0 | 1.0 | 1.0
Cifar10 (HD-aware) | 94.22 | 99.00 | 41.6 | 7.55 | 18.6 | 6.63
Cifar100 (baseline) | 78.21 | 94.53 | 1.0 | 1.0 | 1.0 | 1.0
Cifar100 (HD-aware) | 77.98 | 94.20 | 2.31 | 1.24 | 2.17 | 1.26
Cifar100 (HD-aware) | 77.47 | 94.07 | 3.19 | 1.50 | 2.86 | 1.47
Cifar100 (HD-aware) | 77.29 | 94.24 | 4.54 | 1.76 | 3.88 | 1.67
Cifar100 (HD-aware) | 77.62 | 94.26 | 5.95 | 2.00 | 4.92 | 1.86
Table 3: Training-aware hamming distance optimization for MobileNetV2 on the Cifar10 and Cifar100 datasets (the HD-aware rows correspond to different regularization coefficients λ).

6.2 Post-Training Hamming Distance Optimization

We first compare the effectiveness of the different post-training hamming distance optimization algorithms: the direct reorder, segment-then-reorder, and cluster-then-reorder algorithms. We select the 1-by-1 convolution layers from MobileNetV2 and the 3-by-3 layers from ResNet26 for the evaluation. We compare the hamming distance of the different algorithms with a baseline setting without any optimization. As shown in Figure 23, when the number of input channels per cluster is 8, the average hamming distance can be reduced by 1.96× and 1.54× for MobileNetV2 and ResNet26, respectively, which translates to 1.62× and 1.49× reduction of the average energy consumption.

We also provide a more detailed comparison between the segment-then-reorder and the cluster-then-reorder algorithms for MobileNetV2, as shown in Figure 20. The cluster-then-reorder algorithm usually achieves a higher reduction for the even layers, e.g., layer 2, layer 4, etc. These layers are the second 1-by-1 convolution layers in the inverted residual blocks, which have a larger number of input channels and a smaller number of output channels. For these layers, more clusters can be formed to achieve better results. For the odd layers, which have a smaller number of input channels and a larger number of output channels, the two methods perform similarly.

Figure 23: Comparison of the post-training optimization techniques for HD reduction on (a) MobileNetV2 and (b) ResNet26.

6.3 Training-Aware Hamming Distance Optimization

Method | Best-Layer HD Reduction | Average HD Reduction | Best-Layer Energy Reduction | Average Energy Reduction
Baseline | 1.0 | 1.0 | 1.0 | 1.0
Post-training CTR only | 2.44 | 1.96 | 2.27 | 1.84
HD-aware training only | 5.95 | 2.00 | 4.92 | 1.86
HD-aware training + CTR | 10.2 | 3.79 | 8.51 | 2.85
Table 4: Average hamming distance and energy reduction of the combined methods (CTR is short for the cluster-then-reorder algorithm).

We now evaluate the effectiveness of the training-aware hamming distance optimization. We select MobileNetV2 and train the network on the Cifar10 and Cifar100 datasets. By controlling the regularization coefficient λ, we explore the trade-off between the accuracy and the reduction of the hamming distance. For practical purposes, we constrain the accuracy degradation to within 1%. As shown in Table 3, on the Cifar10 dataset, the average hamming distance can be reduced by 7.55×, which leads to a 6.63× reduction of the average energy across layers. On the Cifar100 dataset, the average hamming distance reduction and the average energy reduction are 2.00× and 1.86×, respectively.

6.4 Combined Hamming Distance Optimization.

We now combine the post-training optimization techniques with the training-aware optimization algorithm. As shown in Table 4, the proposed training-aware and post-training optimization techniques work orthogonally to each other. By combining these optimization techniques, for MobileNetV2 trained on Cifar100, the average hamming distance of streaming the weight matrices can be reduced by 3.79× and the average datapath energy can be reduced by 2.85×.

7 Conclusion

The energy consumption of the arithmetic datapath in a neural network accelerator is heavily dependent on the hamming distance of the input operand sequence. With the proposed hamming-distance-aware training and post-processing algorithms, the datapath energy consumption can be significantly reduced. Evaluation with the MobileNetV2 and ResNet neural networks shows that the proposed methods achieve 2.85× datapath energy reduction on average and up to 8.51× datapath energy reduction for certain network layers, which demonstrates significant potential for energy-critical neural network accelerator designs.

References

Appendix A Layer shapes of MobileNetV2 and ResNet26

Layer # | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33
Input channels ($C$) | 16 96 24 144 24 144 32 192 32 192 32 192 64 384 64 384 64 384 64 384 96 576 96 576 96 576 160 960 160 960 160 960 320
Output channels ($K$) | 96 24 144 24 144 32 192 32 192 32 192 64 384 64 384 64 384 64 384 96 576 96 576 96 576 160 960 160 960 160 960 320 1280
Table 5: Layer shapes of 1-by-1 convolutions in MobileNetV2.
Layer # | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
Input channels ($C$) | 16 16 16 16 16 16 16 16 16 32 16 32 32 32 32 32 32 32 64 32 32 32 32 32 32 32
Output channels ($K$) | 16 16 16 16 16 16 16 16 32 32 16 32 32 32 32 32 32 64 64 64 64 64 64 64 64 64
Table 6: Layer shapes of 3-by-3 convolutions in ResNet26.