1 Introduction
The rapid improvements in the performance of general-purpose processors (GPPs) have slowed due to the breakdown of Dennard scaling and the slowing of Moore's law. At the same time, demand for compute is growing, fueled by recent progress in artificial intelligence and the ever-increasing amount of data that is gathered, stored, and processed. These trends are pushing new avenues of exploration in computer architecture. One such avenue is specialized hardware for machine learning, and specifically for deep learning.
Convolutional neural networks (CNNs) are a widely used form of deep neural networks (DNNs), achieving state-of-the-art results in applications such as image classification, computer vision tasks, and speech recognition. However, CNNs are compute intensive: for example, classification of a single image from the ImageNet dataset [8] may require billions of multiply-accumulate (MAC) operations [13]. Furthermore, demands for larger input dimensions, or deeper models, will increase the number of MAC operations per input [5].
To ease the compute intensity of CNNs, we adopt a technique often applied in GPPs — prediction, and more specifically, value prediction. Value prediction has been studied extensively in the context of GPPs [6], [9]; it can be further explored in the context of CNNs, and, in general, DNNs. For example, in contrast to GPPs, values within DNNs are more predictable as they can be constrained to a certain range or quantized to a discrete space. In addition, with DNNs, validation of the prediction correctness is sometimes unnecessary, since DNNs produce approximate results “by design”.
We propose a value prediction method which exploits the spatial correlation of activations inherent in CNNs. We argue that neighboring activations in the CNN output feature maps (ofmaps) share close values (illustrated in Fig. 1). Therefore, some activation values may be predicted according to the values of adjacent activations. By predicting an ofmap activation, an entire convolution operation between the input feature map (ifmap) and the kernel may be saved. We also show that harnessing the learning capability of DNNs compensates for the accuracy degradation that is caused by additional hardware, in our case a value predictor.
This paper makes the following contributions:

We quantify the amount of spatial correlation of zero-valued ofmap activations in three modern CNNs.

We demonstrate prediction of zero-valued activations using a method that exploits the spatial correlation inherent in CNNs, quantify the achievable savings in MAC operations, and measure the effect on model accuracy.

We show how retraining the network with our predictor embedded in the feedforward phase compensates for the degradation in accuracy and increases the predictor’s effectiveness.
2 Exploiting Spatial Correlation in CNNs
CNNs are inspired by the biological visual cortex, where cells are sensitive to specific and confined areas in the visual field, also known as the receptive field. CNNs are therefore a popular choice for applications that involve images, such as image classification and computer vision tasks, and even for speech recognition. It makes sense to process only a restricted area in the visual field, or in our context, the input image, since adjacent pixels within an image are naturally correlated (e.g., the sky is blue and the grass is green).
As the input image propagates through the CNN, features are extracted. Since CNNs are built from layers which operate in a sliding-window fashion, it stands to reason that adjacent activations within a feature map will share close values, and in particular zero values. We quickly recap how convolution layers work.
2.1 Quick CNN Recap
CNNs are comprised of convolution (CONV) layers. The basic operation of a CONV layer is a multi-dimensional convolution between the ifmaps and the corresponding filters, which constructs the ofmaps. Each CONV layer input comprises a set of ifmaps, each called a channel; the ifmap dimensions are therefore 3D (height, width, channels). A single filter is 3D as well (height, width, ifmap channels), and each convolution operation is performed between a single filter and the ifmap. A CONV layer, however, has a set of filters, and the ifmap-filter convolution is repeated for each filter in the set. Each convolution yields a 2D ofmap, which is then stacked with the other convolutions to form a 3D ofmap (height, width, channels). The convolution process is illustrated in Fig. 2.
Computing a single ofmap activation requires a number of MAC operations equal to the filter volume (filter height x width x ifmap channels). We next describe how some of these convolutions may be avoided by exploiting spatial correlation.
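As a concrete illustration, the shapes involved can be sketched with PyTorch; the layer dimensions below are arbitrary examples, not taken from the evaluated models:

```python
import torch
import torch.nn as nn

# Hypothetical CONV layer: 64 ifmap channels, 128 filters of size 3x3.
conv = nn.Conv2d(in_channels=64, out_channels=128, kernel_size=3, padding=1)

ifmap = torch.randn(1, 64, 56, 56)  # (batch, channels, height, width)
ofmap = conv(ifmap)                 # (1, 128, 56, 56): one 2D map per filter

# Each ofmap activation costs filter_h * filter_w * in_channels MACs.
macs_per_activation = 3 * 3 * 64    # = 576
```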
2.2 Potential Performance Benefits
To understand the potential performance benefits of exploiting spatial correlation for value prediction, we should quantify the existing amount of spatial correlation. We measure the degree of spatial correlation of each channel of each CONV layer ofmap using a non-overlapping sliding window. If all ofmap activations within a certain window are equal, they are considered spatially correlated. We found that strict equality of ofmap activations exists when they are equal to zero. This is a consequence of: (1) using the ReLU activation function, which "squeezes" all negative values to zero; and (2) using full-precision models, i.e., models in which floating-point representation is used for the activation values and may hold almost any number.
Spatial correlation is measured using varying window sizes from 2x2 to 5x5. Large windows filled with zeros indicate a high degree of spatial correlation, whereas small windows filled with non-zeros indicate a low degree of spatial correlation. A 1x1 window is used to measure the model sparsity.
We use three state-of-the-art models — AlexNet [3], VGG16 [11], and ResNet18 [2] — and we use ILSVRC-2012 [8] as our dataset throughout this paper. Fig. 3 illustrates the averaged results for each model. From these measurements it is apparent that zero-valued activations are spatially correlated. For example, on average, 66% of the zero-valued activations are grouped in a 2x2 window, and 47% are grouped in a 3x3 window. In addition, we observe that the deeper layers exhibit more spatial correlation than the layers at the beginning of the model (due to space constraints we do not present a per-layer spatial correlation breakdown).
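The measurement above can be sketched as follows; this is our own minimal NumPy rendition of the window count, with a toy channel, not the paper's measurement code:

```python
import numpy as np

def zero_window_fraction(ofmap_channel, k):
    """Fraction of zero-valued activations covered by all-zero,
    non-overlapping k x k windows of a single ofmap channel."""
    h, w = ofmap_channel.shape
    total_zeros = np.count_nonzero(ofmap_channel == 0)
    if total_zeros == 0:
        return 0.0
    grouped = 0
    for i in range(0, h - k + 1, k):
        for j in range(0, w - k + 1, k):
            window = ofmap_channel[i:i + k, j:j + k]
            if not window.any():      # window is entirely zero
                grouped += window.size
    return grouped / total_zeros

# Toy channel: a 4x4 zero block in one corner, non-zeros elsewhere.
ch = np.ones((8, 8), dtype=np.float32)
ch[:4, :4] = 0.0
frac = zero_window_fraction(ch, 2)    # every zero falls in an all-zero 2x2 window
```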
Zero-valued activations thus tend to be adjacent to other zero-valued activations. We propose exploiting this characteristic to reduce the number of computations.
2.3 Predicting Zero-Valued Activations
The spatial correlation characteristic of CNNs can be exploited in different ways. We exploit it to dynamically predict zero-valued ofmap activations, thereby saving the MAC operations of entire ifmap-filter convolutions.
We propose a prediction method by which zero-valued activations are predicted according to nearby zero-valued activations. First, ofmaps are divided into square, non-overlapping prediction windows. The ofmaps are padded with zeros so predictions can also be made at the margins. Next, the activations positioned on the diagonal of each window are calculated. If these activations are zero-valued, the remaining activations within that window are predicted to be zero-valued as well, thereby saving their MAC operations. Fig. 4 depicts this method. The predictor is hardware-friendly, since its resolution is based on activations that must be calculated anyway. Therefore, hardware modifications are bound to the scheduling of the convolutions [1], [12].
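A functional sketch of this diagonal rule, in NumPy, for one ofmap channel. This simulates the prediction only (in hardware, the off-diagonal MACs would simply be skipped); the function and variable names are ours:

```python
import numpy as np

def predict_zero_windows(ofmap_channel, k):
    """Simulate the predictor: for every non-overlapping k x k window whose
    diagonal activations are all zero, predict the remaining activations as
    zero. Returns the altered channel and a mask of predicted positions."""
    out = ofmap_channel.copy()
    predicted = np.zeros(out.shape, dtype=bool)
    h, w = out.shape
    for i in range(0, h - k + 1, k):
        for j in range(0, w - k + 1, k):
            window = out[i:i + k, j:j + k]        # view into `out`
            if not window.diagonal().any():       # diagonal is all zero
                off_diag = ~np.eye(k, dtype=bool) # diagonal was computed anyway
                predicted[i:i + k, j:j + k] = off_diag
                window[off_diag] = 0.0            # predict off-diagonal as zero
    return out, predicted

# 2x2 example: the diagonal is zero, so the off-diagonal 5.0 is
# (falsely) predicted to be zero.
ch = np.array([[0.0, 5.0],
               [0.0, 0.0]])
out, pred = predict_zero_windows(ch, 2)
```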
Increasing the prediction window size presents a tradeoff. On the one hand, a large window increases the number of activations that may be predicted per window. On the other hand, spatial correlation diminishes as the prediction window grows, which (1) decreases the number of windows with a zero-valued diagonal, and hence the number of prediction opportunities; and (2) increases the false prediction rate, since the area around the diagonal increases.
False predictions decrease the model accuracy. To compensate for the accuracy degradation, we can either choose a set of prediction patterns other than diagonals, according to a tolerable accuracy degradation (using an offline optimization algorithm similar to the one used in SnaPEA [1]), or we can retrain the network.
2.4 Retraining
We expect retraining to compensate for the accuracy degradation caused by false predictions. Retraining is performed on the exact same network model, with the addition of our predictor embedded in the feed-forward phase. Backpropagation is left intact, since the predictor is not trainable. Note that this method of retraining enforces constraints directly on the ofmaps, in the same way regularization enforces constraints on the weights.
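One way to embed such a predictor in the feed-forward phase is as a parameter-free module applied to each CONV layer's ofmap; below is a minimal PyTorch sketch under our own assumptions (margins are ignored, and the module name and placement are hypothetical):

```python
import torch
import torch.nn as nn

class ZeroPredictor(nn.Module):
    """Parameter-free layer emulating the predictor during retraining:
    every non-overlapping k x k window whose diagonal is all zero is
    forced to zero. It has no trainable parameters, so backpropagation
    through the surrounding layers is unaffected."""
    def __init__(self, k=2):
        super().__init__()
        self.k = k

    def forward(self, x):                    # x: (batch, channels, H, W)
        k = self.k
        b, c, h, w = x.shape
        hk, wk = (h // k) * k, (w // k) * k  # this sketch ignores margins
        # View the map as non-overlapping k x k windows.
        win = x[:, :, :hk, :wk].reshape(b, c, hk // k, k, wk // k, k)
        diag = win.diagonal(dim1=3, dim2=5)  # (b, c, hk/k, wk/k, k)
        keep = (diag.abs().sum(dim=-1) != 0) # window survives iff diag non-zero
        keep = keep.unsqueeze(3).unsqueeze(5).to(x.dtype)
        out = x.clone()
        out[:, :, :hk, :wk] = (win * keep).reshape(b, c, hk, wk)
        return out
```

During retraining, the module would be inserted after each CONV (and ReLU) layer so the loss is computed on the predictor-altered ofmaps.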
3 Evaluation
3.1 Methodology
We used PyTorch 0.4.0 [7] (Python 3.5.5) pretrained models of AlexNet [3], VGG16 [11], and ResNet18 [2], with the ILSVRC-2012 [8] dataset. We simulated our prediction method by extracting the intermediate values of each CONV hidden layer during feedforward, altering the data according to a given window size, and pushing it back to the next layer. Statistics such as false predictions are recorded by comparing the original feedforward intermediate values with the predicted feedforward values.
3.2 Prediction Accuracy
Fig. 5 illustrates the average breakdown of ofmap activations when using the prediction method described in Section 2.3 with different prediction windows. true_pred represents the relative portion of zero-valued activations that are predicted as zeros (i.e., true predictions), whereas false_pred represents the relative portion of non-zero-valued activations that are predicted as zeros (i.e., false predictions). zero_diag represents the relative portion of zero-valued activations that are arranged in diagonals and triggered a prediction. The remaining activations may be sparse zeros or any other activation value, and are marked as others.
The tradeoffs of increasing the window size are noticeable throughout our measurements. When considering activation savings, the “sweet spot” of all three models is a window size of 3x3. A larger window of 4x4 or 5x5 decreases the prediction opportunities, since fewer diagonals are equal to zero. On the other hand, a prediction using a smaller window of 2x2 incurs a relatively large overhead, as compared to the larger windows.
Overall, we are able to save a maximum of 34.5%, 27.5%, and 17.6% of the CONV layer ofmap activation computations, with a 3x3 prediction window, in AlexNet, VGG16, and ResNet18, respectively. The remaining question is how many MAC operations this saves.
Table I: MAC reduction and accuracy degradation per prediction window (entries marked § are with retraining, as described in Section 3.5).

AlexNet
  Prediction Window        2x2   3x3   4x4   5x5   3x3§
  MAC Reduction [%]       34.8  40.8  41.9  41.6  37.8
  Top-1 Degradation [%]    1.9   4.0   6.0   8.5   1.6
  Top-5 Degradation [%]    1.4   2.9   4.5   6.6   1.3

VGG16
  Prediction Window        2x2   3x3   4x4   5x5   2x2§
  MAC Reduction [%]       30.8  36.2  35.5  35.2  30.7
  Top-1 Degradation [%]    3.6   8.4  16.8  17.2   0.7
  Top-5 Degradation [%]    2.0   5.1  11.2  11.9   0.4

ResNet18
  Prediction Window        2x2   3x3   4x4   5x5   2x2§
  MAC Reduction [%]       20.8  23.5  21.9  22.0  22.7
  Top-1 Degradation [%]   11.0  17.6  20.4  20.8   2.7
  Top-5 Degradation [%]    7.6  12.6  14.8  15.5   1.7
3.3 MAC Savings
The required number of computations per activation depends on the CONV layer itself. For example, for ResNet18 CONV2, 64x3x3 MAC operations are required to compute a single activation value, as opposed to 512x3x3 for ResNet18 CONV5. Obviously, the latter activation requires more computations than the former.
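In numbers, for the two layers mentioned:

```python
# MACs per ofmap activation = ifmap channels * kernel height * kernel width
macs_conv2 = 64 * 3 * 3            # ResNet18 CONV2
macs_conv5 = 512 * 3 * 3           # ResNet18 CONV5
ratio = macs_conv5 // macs_conv2   # a CONV5 activation costs 8x more
```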
Table I presents the average savings in terms of MAC operations. Using this prediction method, we are able to save a maximum of 41.9%, 36.2%, and 23.5% MAC operations in AlexNet, VGG16, and ResNet18, respectively. However, due to the false predictions, we expect the accuracy of the models to decrease.
3.4 Impact on Model Accuracy
Table I also presents the top-1 and top-5 accuracy degradation of the three models. Interestingly, we observe that the accuracy decreases as the window size increases, whereas the false prediction rates do not vary dramatically (Fig. 5). This is probably due to the aggressiveness of the larger prediction windows, which are more likely than the smaller windows to zero out important activations; it is not only how many activations are zeroed out, but also which activations are zeroed out.
Given an acceptable top-5 accuracy loss of 3%, the following prediction windows would be chosen: for AlexNet, a 3x3 window achieves a 40.8% savings in MAC operations with 4.0% degradation in top-1 accuracy and 2.9% degradation in top-5 accuracy; for VGG16, a 2x2 window achieves a 30.8% savings in MAC operations with 3.6% degradation in top-1 accuracy and 2.0% degradation in top-5 accuracy. Unfortunately, the accuracy degradation in ResNet18 is intolerable: 11.0% in top-1 accuracy and 7.6% in top-5 accuracy. To compensate for the false predictions of our prediction method, we retrain the network.
3.5 Retraining
The pretrained ResNet18 model was used as a baseline for retraining, with the same training parameters as in [2]. By retraining ResNet18 with our prediction method and a 2x2 prediction window, we gained back 8.3% of the top-1 accuracy and 5.9% of the top-5 accuracy, leaving a degradation of 2.7% in top-1 accuracy and 1.7% in top-5 accuracy. Furthermore, the MAC savings increased by 1.9%, from 20.8% to 22.7%.
4 Related Work
Value prediction has long been proposed in the context of GPPs. At its core, value prediction is based on previously seen values [6] and operations [9]. However, the unique characteristics of DNNs have led to different prediction methods and implementations, such as early activation prediction [1] and prediction of ofmap activation signs (for ReLU) or relative size (for max pooling) [12]. These approaches are closely related to conditional computation methods such as mixture-of-experts [10] and dynamic pruning [4].
5 Conclusions and Future Work
Value prediction is a wellknown technique for speculatively resolving true data dependencies in GPPs. However, CNNs behave differently than GPPs. The algorithmic characteristics of CNNs have motivated us to research new approaches that use value prediction in CNNs, with the goal of reducing their computational intensity.
In this paper, we exploit an inherent property of CNNs: spatially correlated output feature map activations. We introduce a method to predict that a group of activations will be zero-valued according to their nearby activations, thereby reducing the required number of computations. We also demonstrate how retraining the network with the predictor embedded in the feedforward phase compensates for the loss of accuracy and improves the prediction performance. Our method reduces the number of MAC operations by 30.4%, averaged over three modern CNNs for ImageNet, with 1.7% top-1 and 1.1% top-5 accuracy degradation.
Future work will include regaining accuracy by exploring different prediction patterns, providing a knob to trade prediction performance against accuracy degradation; examining quantized models for non-zero-value prediction; and an architecture evaluation of performance gains versus power, energy, and area costs.
Acknowledgments
We would like to thank Yoav Etsion and Shahar Kvatinsky for their helpful feedback.
References
[1] V. Aklaghi, A. Yazdanbakhsh, K. Samadi, H. Esmaeilzadeh, and R. Gupta, "SnaPEA: Predictive early activation for reducing computation in deep convolutional neural networks," in Intl. Symp. on Computer Architecture (ISCA), 2018.
[2] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proc. of Computer Vision and Pattern Recognition (CVPR), 2016, pp. 770–778.
[3] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Proc. Neural Information Processing Systems (NIPS), 2012, pp. 1097–1105.
[4] J. Lin, Y. Rao, J. Lu, and J. Zhou, "Runtime neural pruning," in Advances in Neural Information Processing Systems (NIPS), 2017, pp. 2181–2191.
[5] S.-C. Lin, Y. Zhang, C.-H. Hsu, M. Skach, M. E. Haque, L. Tang, and J. Mars, "The architectural implications of autonomous driving: Constraints and acceleration," in Intl. Conf. on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2018, pp. 751–766.
[6] M. H. Lipasti and J. P. Shen, "Exceeding the dataflow limit via value prediction," in Intl. Symp. on Microarchitecture (MICRO), 1996, pp. 226–237.
[7] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer, "Automatic differentiation in PyTorch," in Neural Information Processing Systems Workshop (NIPS-W), 2017.
[8] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein et al., "ImageNet large scale visual recognition challenge," Intl. Journal of Computer Vision (IJCV), vol. 115, no. 3, pp. 211–252, 2015.
[9] Y. Sazeides and J. E. Smith, "The predictability of data values," in Intl. Symp. on Microarchitecture (MICRO), 1997, pp. 248–258.
[10] N. Shazeer, A. Mirhoseini, K. Maziarz, A. Davis, Q. Le, G. Hinton, and J. Dean, "Outrageously large neural networks: The sparsely-gated mixture-of-experts layer," arXiv preprint arXiv:1701.06538, 2017.
[11] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," in Intl. Conf. on Learning Representations (ICLR), 2015.
[12] M. Song, J. Zhao, Y. Hu, J. Zhang, and T. Li, "Prediction based execution on deep neural networks," in Intl. Symp. on Computer Architecture (ISCA), 2018, pp. 752–763.
[13] V. Sze, Y.-H. Chen, T.-J. Yang, and J. S. Emer, "Efficient processing of deep neural networks: A tutorial and survey," Proc. of the IEEE, vol. 105, no. 12, pp. 2295–2329, 2017.