Convolutional Neural Network (CNN) are widely used for various computer vision tasks ranging from image classification, object detection and segmentation to image super-resolution and so on. The architectural complexity of these networks in terms of trainable parameters and number of computations have increased significantly over the past few years owing to the availability of high performance computing resources for offline training. Heuristic design of these models results in redundancies. Quantitative estimation of redundancy in CNNs have been explored for removing redundant parameters for the purpose of deployment on resource constrained edge devices. These techniques leverage the redundancies in the network to decrease the number of parameters so as to deploy them on edge devices with constrained computational capabilities. Structural pruning of CNNs which removes redundancies at the kernel level uses certain quantitative measures like the summation of values of kernels, threshold , statistics information .
Most methods for model redundancy estimation disregard the distributed nature of the features  learned by the network. Functionality of individual units in a neural network cannot be comprehended fully since it is also affected by other units along the depth of the network . Also, qualitative analysis and interpretation is lacking in existing methods. In this work, we explore a set of methods aimed at understanding the source of robustness in a CNN and how its response is affected by altering the composition of different layers. Instead of measuring kernel redundancy based on the features learned by the individual kernel or the activation generated by them while using the training data, we propose to analyze the unit impulse responses.
In this work, we explore a set of methods aimed at understanding the source of robustness in a CNN and how its response is affected by altering the composition of different layers. Further, we also empirically present the ineffectiveness of layer-wise pruning strategies in removing redundancy. Instead of measuring kernel redundancy based on the features learned by the individual kernel or the activation generated by them for the training data, we propose to analyze the unit impulse response of the convolution operations across depth of the network.
2 Quantifying redundancy in a convolutional neural network
2.1 Theoretical backdrop
We present methods that will help gain better insight into multifaceted redundancy in a CNN contributing to its robustness. Here, a CNN is analyzed empirically to quantitatively validate existence of redundancy and to understand the causes. Consider two input samples to a CNN, and for which and are the activation obtained from the convolutional layer. Figure 1 shows a set of kernels in the layer and its activation. Robustness of the network is demonstrated by perturbing these activation and observing its effect on the accuracy of prediction. We shuffle the channel of the activation across all samples to obtain modified activation maps as shown in Figure. 2. We refer to this process as the ‘channel shuffling’.
We further explore the robustness of the network by shuffling the pixels within one channel of the input tensor to a layer as shown in Figure 3. This process is referred to as the ‘pixel shuffling’.
The change in activation of the
layer on account of either of the shuffling of the input is measured in terms of cosine similarity (CS) between the corresponding channels ofand . A robust network is expected to have a minimal change in the activation on account of channel shuffling or pixel shuffling.
CS for the channel of the activation is given by,
The effect of shuffling of input can also be measured in terms of change in accuracy of prediction while solving the classification inference on the test dataset.
2.2 Experiments and Results
Channel Shuffling: To validate the effectiveness of the empirical investigations, we perform the experiment on two CNNs, LeNet-5  and an over-parametrized version of it, where, the number of learnable parameters in each layer is increased by a factor of and is referred to as LeNet-5x10. LeNet-5 network having channels in the first layer and channels in the second is trained on the MNIST dataset111http://yann.lecun.com/exdb/mnist/ till convergence. is then shuffled across all samples in each mini-batch of size in the test set. Figure 4 shows shuffling on a sample data. LeNet-5x10 is also trained till convergence on the MNIST dataset and the experiment repeated. The plot of CS between and for each for LeNet-5 and LeNet-5x10 are shown in Figure 5(a) and Figure 5(c) respectively .
Figure 6 shows the plots of accuracy of the two network for both the experiments.
The effect of shuffling of input on the performance of the networks demonstrates the robustness of the CNNs. In the channel and pixel shuffling experiments, if shuffling certain channel of does not yield significant change in the cosine similarity of , it can be inferred that the channel is redundant. It was observed that, the cosine similarity measured in both the shuffling experiments stayed consistently high for LeNet-5x10 across several channels. Whereas, in the case of LeNet-5, cosine similarity dropped in certain cases.
Further, Figure 6 shows that shuffling (channel and pixel) had very less impact on the performance of LeNet-5x10 in comparison with LeNet-5, irrespective of the channel being perturbed. This suggests that LeNet5-x10 is more robust. Thus, it can be inferred that the kernels corresponding to the channels which produced the least change in performance measured in terms of accuracy add redundancy to the network. However, these experiments do not answer the question of how redundancy exists in the network.
3 Unit impulse response as an explainer of redundancy
3.1 Explaining effect of cascaded convolutions
In order to explain the the effect of cascaded convolutions as in a CNN, let us consider the system shown in Figure 7. It consists of two cascaded convolutions which are equivalent to two layers in a CNN. The position of convolutional kernels and are interchanged in configurations 1 and 2. However, the output of the system remains unaffected. In a similar fashion, duplicate kernels may exist along the depth of an over-parametrized CNN. The hierarchical nature of the computations in the network can thus result in similar activation in deeper layers.
3.2 Theoretical backdrop
We propose unit impulse response analysis of a CNN for understanding the redundancies. An impulse input is given as the input to a trained CNN to obtain the impulse response of each convolutional layer. Further, combination of kernels across the depth that result in similar response can be identified by measuring the similarity between the impulse response at each layer measured using normalized cross-correlation.
3.3 Experiments and Results
Impulse response analysis is performed on LeNet-5 and LeNet-5x10. The response of kernels in the two convolutional layers are shown in Figure 8. Cross-correlation between the activation of a layer is computed in a one vs. all fashion and the maximum value of correlation is presented in Figure 9. As an example, the response of 6 kernels in the first layer of LeNet-5 produces a correlation matrix as shown in Figure 9(a).
It is observed from Figure 9 that the over-parametrized LeNet-5x10 has larger number of activation having high correlation in comparison with LeNet-5. This supports the observations from the channel and pixel shuffling experiments that LeNet-5x10 has higher redundancy. The highly correlated responses are produced by duplicate kernels in a layer and also due to the effect of cascaded convolutions which produce similar response as discussed in Section 3.2.
In this work, we have presented an approach for understanding the robustness of a CNN which is observed on account of redundancy in it. Robustness of a CNN owing to redundant kernels are identified by introducing perturbations in the input to the convolutional layers in the form of channel shuffling or pixel shuffling. The proposed method was experimented on LeNet5 and over-parametrized version of it, LeNet-5x10. Further, presence of redundancies were quantitatively measured using unit impulse response of the network. Cross-correlation based analysis showed that over-parametrized network has higher redundancy resulting from cascaded kernels producing similar response.
N. Frosst and G. Hinton.
Distilling a neural network into a soft decision tree.Int. Conf. Ital. Assoc. Artif. Intell., 2017.
-  S. Han, J. Pool, J. Tran, and W. Dally. Learning both weights and connections for efficient neural network. In Adv. Neural Info. Process. Sys., pages 1135–1143, 2015.
-  Y. LeCun, Y. Bengio, and G. Hinton. Deep learning. Nature, 521(7553):436, 2015.
-  Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proc. IEEE, 86(11):2278–2324, 1998.
-  J.-H. Luo, J. Wu, and W. Lin. Thinet: A filter level pruning method for deep neural network compression. In Proc. IEEE Int. Conf. Comp. Vis., pages 5058–5066, 2017.
J. Su, J. Faraone, J. Liu, Y. Zhao, D. B. Thomas, P. H. Leong, and P. Y.
Redundancy-reduced mobilenet acceleration on reconfigurable logic for imagenet classification.In Proc. Int. Symp. Applied Reconfig. Comput. Archi. Tools. App., pages 16–28, 2018.