Interpretation of ResNet by Visualization of Preferred Stimulus in Receptive Fields

06/02/2020 ∙ by Genta Kobayashi, et al. ∙ University of Electro-Communications

One of the methods used in image recognition is the deep convolutional neural network (DCNN). A DCNN is a model in which the expressive power of features is greatly improved by deepening the hidden layers of a CNN. The architecture of CNNs is based on a model of the mammalian visual cortex. The Residual Network (ResNet) is a model that has skip connections. ResNet is an advanced model in terms of the learning method, but it lacks a biological viewpoint. In this research, we investigate the receptive fields of a ResNet on the ImageNet classification task. We find that ResNet has orientation-selective neurons and double-opponent color neurons. In addition, we suggest that some inactive neurons in the first layer of ResNet affect the classification task.

1 Introduction

In this decade, deep convolutional neural networks (DCNNs) have been used in many areas such as image processing, audio signal processing, and language processing. In the image classification task in particular, DCNNs have shown higher performance than previous work in the field of computer vision [10]. A DCNN is a model in which the expressive power of features is greatly improved by deepening the hidden layers of a convolutional neural network (CNN). A CNN is characterized by hierarchically stacked convolutional and pooling layers. Both layer types are modeled on the simple cells and complex cells of the mammalian visual cortex [3]. CNNs also incorporate constraints from a biological point of view, e.g. weight sharing and sparse activation. LeCun et al. [8] proposed a CNN model called LeNet-5 for the classification of digit images and applied the backpropagation algorithm, a gradient-based learning method, to the model. Krizhevsky et al. [7] showed the effectiveness of DCNNs on the natural image classification task. In the wake of their achievements, many researchers proposed various deep models [13, 15]. He et al. [4] also proposed a DCNN model called the residual network (ResNet), which has skip connections that bypass layers. ResNet drastically improves performance on the visual classification task.

The success of DCNNs has accelerated efforts to understand them from multiple angles. From the viewpoint of neuroscience, Yamins et al. experimentally showed the similarity between the primate visual cortex and a DCNN trained for a classification task [17]. From the engineering viewpoint, on the other hand, the mainstream approach to understanding DCNNs is to visualize their inner representations by projecting gradients backward [12, 11, 14]. These methods rely on the differentiability of the function that the DCNN computes for the task.

The basic structure of DCNNs is inspired by biology [3]; however, the non-biological improvements proposed in recent years make interpretation more difficult. For instance, ResNet is a model improved so that gradient-based learning methods work well. To understand ResNet, Liao & Poggio studied the relation between a ResNet model and the visual cortex [9]. They exploit the fact that a ResNet model is similar to a recurrent neural network with feedback connections. Their study shows the relationship between a ResNet model and a recurrent neural network, and then between the ventral stream and a model of stacked recurrent neural networks. However, that model is subject to a strong constraint and is not commonly used.

In this research, in order to understand ResNet, we examine it from the viewpoint of the development of preferred stimuli in receptive fields under the visual scene classification task with ImageNet [1, 10]. The receptive field is a basic concept of the visual cortex system. Roughly speaking, it is the part of the visual input area to which a neuron is able to respond. A preferred stimulus is one that evokes a strong response from the neuron. We use the idea of the preferred stimulus in the receptive field to reveal properties of ResNet.

2 Methods

2.1 Residual Network

He et al. proposed the concept of the Residual Network (ResNet) and presented several ResNet models, e.g. ResNet18, ResNet34, ResNet50, ResNet101, and ResNet152 [4]. ResNet contains a characteristic architecture called a “skip connection” or “residual connection”. The idea of the residual connection is to divide the mapping function explicitly into linear and non-linear parts. Let the input vector be $\mathbf{x}$, the output vector be $\mathbf{y}$, and the non-linear part of the mapping function be $F(\mathbf{x})$. Then the skip connection is represented as:

$$\mathbf{y} = F(\mathbf{x}) + \mathbf{x} \qquad (1)$$

When the dimensions of $\mathbf{x}$ and $F(\mathbf{x})$ differ, $\mathbf{x}$ is transformed by a mapping function so that the two can be summed. The original ResNets introduce a down-sampling block that contains a convolutional layer at some skip connections. Fig. 1 shows a schematic diagram of the component of ResNet called the Residual block. In order to treat the skip connection in the Residual block, we introduce pseudo feature maps for the identity part of eq. (1). In the figure, each rectangle shows a feature map, the solid arrows show connections with trainable weights, and the dashed arrows show connections with a fixed weight. For comparison, we also introduce a model named PlainNet, which excludes all the skip connections. We use ResNet34 and PlainNet34 in our experiments, since ResNet34 showed higher performance than the other ResNet models and previous DCNNs in our preliminary experiments.
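As a minimal sketch of eq. (1), the following Python treats the non-linear part as an arbitrary function and adds the input back through the skip path. The names `residual_block`, `branch`, `project`, and `toy_branch` are illustrative assumptions, not code from the models used here; `project` stands in for the down-sampling mapping applied when the dimensions differ.

```python
from typing import Callable, List, Optional

Vector = List[float]


def relu(v: Vector) -> Vector:
    """Element-wise ReLU, max(0, x)."""
    return [max(0.0, a) for a in v]


def residual_block(
    x: Vector,
    branch: Callable[[Vector], Vector],
    project: Optional[Callable[[Vector], Vector]] = None,
) -> Vector:
    """Return branch(x) + shortcut(x), the skip connection of eq. (1).

    `project` is a hypothetical stand-in for the down-sampling mapping that
    is applied to x when its dimension differs from that of branch(x).
    """
    shortcut = project(x) if project is not None else x
    out = branch(x)
    if len(out) != len(shortcut):
        raise ValueError("branch output and shortcut must have the same size")
    return [f + s for f, s in zip(out, shortcut)]


def toy_branch(v: Vector) -> Vector:
    """Toy non-linear part (assumption): scale by 0.5, then apply ReLU."""
    return relu([0.5 * a for a in v])


y = residual_block([2.0, -4.0, 0.0], toy_branch)
```

In this picture, a PlainNet-style block would return `branch(x)` alone, without the added shortcut term.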

Figure 1: Schematic diagram of the ResNet34. Each rectangle represents a feature map. The solid arrows show connections with trainable weights, and the dashed arrows show connections with a fixed weight. Conv: convolutional layer, BN: batch normalization, ReLU: ReLU function $f(x) = \max(0, x)$, Iden: identity function.

2.2 Receptive Field

In the context of the visual system, the receptive field is the area on the retina to which a neuron responds. The receptive field is considered to consist of a center and a surround area. Hubel & Wiesel showed that almost all receptive fields in the early visual cortex are very small [5], and they become larger as the hierarchy deepens. Their work inspired the Neocognitron [3], which is one of the origins of the DCNN, and has influenced much image recognition research.

In the context of CNNs, each neuron has a receptive field and also has preferred stimuli, which are patches of the input image. Fig. 2 shows an overview of the receptive field. The rightmost rectangle shows the feature map of the focused layer, and the middle and left rectangles show the intermediate feature map and the input, respectively. Each feature map has neurons arranged on a 2-dimensional lattice. When we choose a neuron in the focused feature map, we can determine the area connected to it in the intermediate feature map and in the input. Thus, the preferred stimuli for the focused neuron appear in the red rectangle. Zeiler et al. [18] showed samples from the receptive fields of a DCNN and reported the characteristics of each layer. Showing such samples is a simple way to understand the features learned by a CNN. We use the receptive field to investigate the characteristics of neurons in this research.

Figure 2: Overview of the receptive field. Each black-bordered rectangle is a neuron. The area inside the blue border on the input is the receptive field corresponding to the blue neuron in its feature map. The area inside the red border on the input is the receptive field corresponding to the red neuron in its feature map.

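Let $\mathbf{x}$ be an image of size $H \times W$; then a receptive field $R$ is a set of spatial indices, $R \subseteq \{1, \dots, H\} \times \{1, \dots, W\}$. We can formally describe the receptive field image, i.e. the image $\mathbf{x}$ restricted to the receptive field $R$, as $\mathbf{x}_R$.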
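The mapping from a neuron back to its patch of the input can be sketched as follows. This is an illustration under stated assumptions: each layer is described by a (kernel size, stride, padding) triple along one spatial axis, and the example stem (a 7x7 convolution with stride 2 and padding 3, followed by 3x3 max pooling with stride 2 and padding 1) follows common ResNet implementations rather than any setting stated in this text. All function names are hypothetical.

```python
from typing import List, Tuple

# Each layer is (kernel_size, stride, padding) along one spatial axis.
Layer = Tuple[int, int, int]


def receptive_field_size(layers: List[Layer]) -> int:
    """Side length of the input patch seen by one neuron after `layers`."""
    size, jump = 1, 1  # jump: input-pixel distance between adjacent neurons
    for kernel, stride, _padding in layers:
        size += (kernel - 1) * jump
        jump *= stride
    return size


def receptive_field_span(layers: List[Layer], index: int) -> Tuple[int, int]:
    """Inclusive input index range [lo, hi] seen by the neuron at `index`.

    Indices below 0 (or beyond the image border) fall into the zero padding.
    """
    lo = hi = index
    for kernel, stride, padding in reversed(layers):
        lo = lo * stride - padding
        hi = hi * stride - padding + kernel - 1
    return lo, hi


def crop_patch(image: List[List[float]], rows: Tuple[int, int],
               cols: Tuple[int, int]) -> List[List[float]]:
    """Receptive field image: the image restricted to the given index
    ranges, clipped to the image borders."""
    r0, r1 = max(rows[0], 0), min(rows[1], len(image) - 1)
    c0, c1 = max(cols[0], 0), min(cols[1], len(image[0]) - 1)
    return [row[c0:c1 + 1] for row in image[r0:r1 + 1]]


# Assumed stem: 7x7 conv (stride 2, padding 3), then 3x3 max pooling
# (stride 2, padding 1).
stem = [(7, 2, 3), (3, 2, 1)]
size_after_pool = receptive_field_size(stem)
span_of_neuron_10 = receptive_field_span(stem, 10)
```

Applying `receptive_field_span` once for rows and once for columns gives the index set of the receptive field, and `crop_patch` then extracts the corresponding receptive field image.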

2.3 Visualization by Using Gradient

Many researchers use gradient-based visualization methods to understand deep neural networks [2, 12, 11, 14]. The first such work is the activation maximization of Erhan et al. [2], which Simonyan et al. [12] applied to DCNNs. Activation maximization computes an input that maximizes the activation of a neuron by solving an optimization problem. Let $\theta$ denote the parameters of the neural network and let $a_i(\mathbf{x}; \theta)$ be the activation of neuron $i$ on a given input $\mathbf{x}$. Assuming a fixed $\theta$, the method is represented as

$$\mathbf{x}^{*} = \arg\max_{\mathbf{x}} \; a_i(\mathbf{x}; \theta) - \lambda \|\mathbf{x}\|_2^2 \qquad (2)$$

Since the solution we are interested in is a direction in input space, we add an $\ell_2$-norm constraint with a regularisation parameter $\lambda$. In general, the problem is solved by an iterative method such as gradient ascent because it is a non-convex optimization problem. The method can be applied to any differentiable model, but the resulting solution may be an uninformative local optimum.
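As a minimal sketch of the procedure, the following Python runs plain gradient ascent on the activation minus a squared-norm penalty, starting from a zero input. The linear toy neuron, the function name `activation_maximization`, and all numeric settings are assumptions for illustration; for a real DCNN the gradient would come from backpropagation.

```python
from typing import Callable, List

Vector = List[float]


def activation_maximization(
    grad_activation: Callable[[Vector], Vector],
    dim: int,
    weight_decay: float,
    learning_rate: float,
    steps: int,
) -> Vector:
    """Gradient ascent on a(x) - weight_decay * ||x||^2, starting from zeros."""
    x = [0.0] * dim
    for _ in range(steps):
        grad = grad_activation(x)
        # Gradient of the penalised objective: da/dx - 2 * weight_decay * x.
        x = [
            xi + learning_rate * (gi - 2.0 * weight_decay * xi)
            for xi, gi in zip(x, grad)
        ]
    return x


# Toy neuron (assumption): a linear unit a(x) = w . x, whose gradient is w.
w = [1.0, -2.0, 0.5]
x_star = activation_maximization(
    lambda x: list(w), dim=3, weight_decay=0.5, learning_rate=0.1, steps=200
)
```

For this toy neuron the penalised objective is concave, so the iterate approaches $w / (2\lambda)$; for a DCNN the problem is non-convex and the result depends on the starting point and the optimizer settings.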

3 Experiment and Results

3.1 Training ResNets

We train ResNet34 and PlainNet34 on the ImageNet dataset in the manner of He et al. [4] and Szegedy et al. [15]. The images in ImageNet have three color channels and are whitened channel-wise. We apply stochastic gradient descent with momentum and weight decay, starting from an initial learning rate that is divided by a constant factor every fixed number of epochs, and we train for a fixed total number of epochs with mini-batches. During training, the input images are randomly resized by an area scale and an aspect ratio, each drawn at random from a fixed range.
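The step-wise learning rate schedule described above can be sketched as follows. The function name and the numbers in the example are placeholders chosen only for illustration; they are not the settings used in this experiment.

```python
def step_learning_rate(initial_lr: float, factor: float,
                       step_epochs: int, epoch: int) -> float:
    """Learning rate at `epoch` when the rate is divided by `factor`
    every `step_epochs` epochs."""
    if factor <= 0 or step_epochs <= 0:
        raise ValueError("factor and step_epochs must be positive")
    return initial_lr / (factor ** (epoch // step_epochs))


# Placeholder values for illustration only (not taken from the paper).
schedule = [step_learning_rate(1.0, 2.0, 5, epoch) for epoch in (0, 4, 5, 10)]
```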

3.2 Visualization Filters

Teramoto et al. proposed a method that visualizes the preferred stimulus as a convolution filter in the second layer of VGG [16]. Let $w^{(l)}_{ij}(p, q)$ be the convolutional weight connecting channel $j$ in layer $l-1$ to channel $i$ in layer $l$, and let $p$ and $q$ be spatial indices. The method uses the weight $\tilde{w}^{(2)}_{ik}$ as the $i$-th filter in the second layer. The weight is represented as

$$\tilde{w}^{(2)}_{ik}(p, q) = \sum_{j} \bar{w}^{(2)}_{ij} \, w^{(1)}_{jk}(p, q) \qquad (3)$$

where $\bar{w}^{(2)}_{ij} = \sum_{p', q'} w^{(2)}_{ij}(p', q')$. In general, this method gives only an approximate visualization of filters in higher layers, because CNNs have non-linear functions between convolutional layers. We call this filter a “virtual filter” and apply the method to the second down-sampling layer in ResNet34. Fig. 3 shows the virtual filters and the first-layer filters in ResNet34.
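One way to read the virtual filter is as a weighted sum of the first-layer filters, where each weight is the second-layer coupling to the corresponding first-layer channel, summed over spatial positions. The sketch below follows that reading and ignores the non-linearity between the two layers, which is why the result is only approximate. The function names and the toy numbers are assumptions for illustration.

```python
from typing import List

Filter = List[float]  # a first-layer filter flattened to a list of values


def coupling_weights(second_layer_kernels: List[List[float]]) -> List[float]:
    """Sum each second-layer kernel over its spatial positions.

    second_layer_kernels[j] holds the spatial weights that connect
    first-layer channel j to the second-layer channel being visualised.
    """
    return [sum(kernel) for kernel in second_layer_kernels]


def virtual_filter(first_layer_filters: List[Filter],
                   coupling: List[float]) -> Filter:
    """Linear combination of first-layer filters with the coupling weights."""
    if len(first_layer_filters) != len(coupling):
        raise ValueError("need one coupling weight per first-layer filter")
    size = len(first_layer_filters[0])
    return [
        sum(c * f[k] for c, f in zip(coupling, first_layer_filters))
        for k in range(size)
    ]


def sort_by_coupling(coupling: List[float]) -> List[int]:
    """Channel indices in descending order of coupling weight."""
    return sorted(range(len(coupling)), key=lambda j: coupling[j], reverse=True)


# Toy example: two first-layer filters of four values each.
filters = [[1.0, 0.0, -1.0, 0.0], [0.0, 2.0, 0.0, -2.0]]
kernels = [[0.5, 0.5], [-1.0, 0.5]]  # spatial weights per input channel
coupling = coupling_weights(kernels)
virtual = virtual_filter(filters, coupling)
order = sort_by_coupling(coupling)
```

Sorting the first-layer filters by `order` mirrors the presentation in which filters are listed in descending order of their coupling weight.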

(a) Virtual filter and sorted weight values. The graph on the right shows the weight values; the x-axis is the filter index sorted in descending order of weight.
(b) Filters sorted by weight. The number above each image is the sorted index corresponding to Fig. 3(a).
Figure 3: Visualization of a filter of the down-sampling layer, shown as the bottom conv. layer of Fig. 1, in ResNet34.

Looking at the coupling coefficients of the filters in Fig. 3, we can see that the coupling to similar filters is stronger. Thus ResNet, which has a skip structure, also acquires features similar to the column structure found in biology.

3.3 Analysis of Preferred Stimulus in Receptive Fields

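We focus on the preferred stimuli in the input dataset, i.e. the stimuli that evoke strongly positive activations of a neuron in the ResNets. In order to find the preferred stimuli, we first feed the validation images of ImageNet to the DCNNs. After that, in each layer, we sort the stimuli in descending order of activation value. Let $R_i$ be the receptive field of neuron $i$ and let $a_i(\mathbf{x}_{R_i})$ be the activation value of neuron $i$ on a given receptive field image $\mathbf{x}_{R_i}$. We can then describe the mean receptive field image over the positive validation images as

$$\bar{\mathbf{x}}_{i} = \frac{1}{|X_i^{+}|} \sum_{\mathbf{x} \in X_i^{+}} \mathbf{x}_{R_i} \qquad (4)$$

The positive validation images are the validation images on which the activation of the neuron is positive; they are represented by

$$X_i^{+} = \{\, \mathbf{x} \in X_{\mathrm{val}} \mid a_i(\mathbf{x}_{R_i}) > 0 \,\} \qquad (5)$$

where $X_{\mathrm{val}}$ denotes the set of validation images.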

We show examples of the top 16 preferred stimuli for some neurons in Fig. 4 and 6, and the convolutional filters and mean receptive field images corresponding to these neurons in Fig. 5 and 7. From the samples of the receptive fields, we find that neurons in higher layers of DCNNs prefer a wider variety of features. At first glance, Fig. 4(c) and 6(c) appear to contain inconsistent samples, but Fig. 5(d) and 7(d) show that they share central features.
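The ranking and averaging steps described above can be sketched in a few lines of Python. The sketch assumes that each validation image has already been reduced to a pair of its receptive field image (flattened to a list) and the activation of the neuron on it; the function names and the toy data are illustrative assumptions.

```python
from typing import List, Tuple

Patch = List[float]  # a receptive field image flattened to a list of values


def top_k_stimuli(samples: List[Tuple[Patch, float]], k: int) -> List[Patch]:
    """Patches with the k largest activation values, in descending order."""
    ranked = sorted(samples, key=lambda s: s[1], reverse=True)
    return [patch for patch, _ in ranked[:k]]


def mean_positive_patch(samples: List[Tuple[Patch, float]]) -> Patch:
    """Average of the patches whose activation is strictly positive."""
    positive = [patch for patch, activation in samples if activation > 0.0]
    if not positive:
        raise ValueError("the neuron has no positive activation on these samples")
    size = len(positive[0])
    return [sum(p[i] for p in positive) / len(positive) for i in range(size)]


# Toy data: (receptive field image, activation of the neuron on it).
samples = [
    ([1.0, 0.0], 2.0),
    ([0.0, 1.0], 0.0),  # not positive, excluded from the mean
    ([3.0, 2.0], 0.5),
]
top_two = top_k_stimuli(samples, 2)
mean_patch = mean_positive_patch(samples)
```

A neuron whose activation is never positive has no mean receptive field image under this definition, which the sketch reports as an error rather than returning an empty result.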

(a) Channel 18 in first max-pooling layer.
(b) Channel 18 in conv. layer in layer 3.
(c) Channel 18 in conv. layer in layer 7.
Figure 4: Samples of the top 16 preferred stimulus images in ResNet34.
(a) First conv. filter of channel 18.
(b) Mean preferred stimulus image of channel 18 in first max-pooling layer.
(c) Mean preferred stimulus image of channel 18 in conv. layer in layer 3.
(d) Mean preferred stimulus image of channel 18 in conv. layer in layer 7.
Figure 5: First convolutional filter and mean preferred stimulus images in ResNet34.
(a) Channel 19 in first max-pooling layer.
(b) Channel 19 in conv. layer in layer 3.
(c) Channel 19 in conv. layer in layer 7.
Figure 6: Samples of the top 16 preferred stimulus images in PlainNet34.
(a) First conv. filter of channel 19.
(b) Mean preferred stimulus image of channel 19 in first max-pooling layer.
(c) Mean preferred stimulus image of channel 19 in conv. layer in layer 3.
(d) Mean preferred stimulus image of channel 19 in conv. layer in layer 7.
Figure 7: First convolutional filter and mean preferred stimulus images in PlainNet34.

We find that, owing to the skip connections of ResNet, the characteristics of the same channel are similar across different layers. The mean receptive field images reveal only broad tendencies; it is difficult to find the detailed properties of a neuron from them.

3.4 Visualization Using Maximization Method

We apply the activation maximization method [2, 12] to the ResNets and show the results for the neuron and for the channel in each layer in Fig. 8 and 9. Optimizing for the neuron maximizes the activation of the center neuron in a feature map, and optimizing for the channel maximizes the average activation of a channel. We optimize the input with the Adam optimizer [6], using a fixed learning rate and weight decay. We initialize the input as a zero image and run a fixed number of iterations.
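The difference between the two objectives can be made concrete with a small Python sketch that operates on a single channel of a feature map. The function names are hypothetical, and taking index `rows // 2`, `cols // 2` as the center is an assumption for feature maps with an even side length.

```python
from typing import List

FeatureMap = List[List[float]]  # one channel, indexed as [row][column]


def neuron_objective(feature_map: FeatureMap) -> float:
    """Activation of the neuron at the spatial center of the channel."""
    rows, cols = len(feature_map), len(feature_map[0])
    return feature_map[rows // 2][cols // 2]


def channel_objective(feature_map: FeatureMap) -> float:
    """Average activation over all spatial positions of the channel."""
    values = [v for row in feature_map for v in row]
    return sum(values) / len(values)


# Toy 3x3 channel with a strong response at the center.
fmap = [
    [0.0, 1.0, 0.0],
    [1.0, 4.0, 1.0],
    [0.0, 1.0, 1.0],
]
center_value = neuron_objective(fmap)
channel_value = channel_objective(fmap)
```

The neuron objective depends only on the receptive field of one position, whereas the channel objective rewards the pattern at every position of the input, which is one reason the two visualizations can look different.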

Comparing Fig. 5 and 8, we can see that the results of optimizing for the neuron are similar to the mean receptive field images. At higher layers, activation maximization reveals detailed properties, whereas the mean receptive field images show only simple trends. In particular, activation maximization for the channel produces a visually appealing visualization of the neuron, but the results vary with the experimental conditions.

((a)) One optimal input of channel 18 in first max-pooling layer.
((b)) One optimal input of channel 18 in conv. layer in layer 3.
((c)) One optimal input of channel 18 in conv. layer in layer 7.
Figure 8: Examples of visualizations by activation maximization for the neuron in ResNet34.
((a)) One optimal input of channel 18 in first max-pooling layer.
((b)) One optimal input of channel 18 in conv. layer in layer 3.
((c)) One optimal input of channel 18 in conv. layer in layer 7.
Figure 9: Examples of visualizations by activation maximization for the channel in ResNet34.

3.5 Inactive Neurons

For the validation dataset images, we find that some channels in the first max-pooling layer have no output activation; in other words, they always output zero because of the ReLU activation function. We call such a channel an “inactive neuron”. In addition, Table 1 shows that ResNet34 has more inactive neurons than PlainNet34.

To investigate the effect of the inactive neurons on classification, we perform two classification experiments in which noise is added to the inactive neurons. In the first, we add noise to all inactive neurons; in the second, we add noise to one inactive neuron selected at random for every mini-batch. The noise is applied at each spatial position of the inactive neuron. Table 1 shows the results as two loss differences: the validation loss with noise on all inactive neurons minus the original validation loss, and the validation loss with noise on a randomly selected inactive neuron minus the original validation loss. We can see that the inactive neurons of ResNet34 affect the classification task, because both differences are positive for ResNet34 and larger than those of PlainNet34.
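The detection of inactive neurons and the two noise conditions can be sketched as follows. The data layout, the function names, and in particular the Gaussian noise with standard deviation `sigma` are assumptions made for illustration; the sketch only perturbs stored activations and does not evaluate a network.

```python
import random
from typing import List

# activations[n][c] is the list of spatial activation values of channel c
# for validation image n (after ReLU, so every value is >= 0).
Activations = List[List[List[float]]]


def find_inactive_channels(activations: Activations) -> List[int]:
    """Channels whose output is zero at every position of every image."""
    num_channels = len(activations[0])
    return [
        c for c in range(num_channels)
        if all(v == 0.0 for image in activations for v in image[c])
    ]


def add_noise(image_activations: List[List[float]], channels: List[int],
              sigma: float, rng: random.Random) -> List[List[float]]:
    """Copy of one image's activations with noise added to `channels`.

    The Gaussian distribution and `sigma` are assumptions for illustration.
    """
    noisy = [list(channel) for channel in image_activations]
    for c in channels:
        noisy[c] = [v + rng.gauss(0.0, sigma) for v in noisy[c]]
    return noisy


# Toy data: two images, three channels, two spatial positions each.
acts = [
    [[0.0, 0.0], [0.3, 0.0], [0.0, 0.0]],
    [[0.0, 0.0], [0.0, 1.2], [0.0, 0.4]],
]
inactive = find_inactive_channels(acts)
rng = random.Random(0)
# Condition 1: noise on all inactive channels.
noisy_all = add_noise(acts[0], inactive, 0.1, rng)
# Condition 2: noise on one inactive channel chosen at random.
noisy_one = add_noise(acts[0], [rng.choice(inactive)], 0.1, rng)
```

In an actual experiment the perturbed activations would be passed through the remaining layers, and the change in validation loss would be compared with the unperturbed case.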

Model # of inactive neurons
ResNet34
PlainNet34
Table 1: Number of inactive neurons in the first max-pooling layer and their effect on the validation dataset in ResNet34 and PlainNet34.

4 Conclusion

We apply analysis based on receptive fields and on activation maximization to ResNets. Using both methods, we find that ResNet has orientation-selective neurons and double-opponent color neurons. Both methods characterize the lower layers well, but they are harder to apply to the higher layers. We find that there are inactive neurons in ResNet34 trained for the classification task. We speculate that this phenomenon is due to channel sharing through the skip connections. One hypothesis is that some channels, the inactive neurons, are used for features that are not similar to the features of the first convolutional layer. In future work, we need to consider methods that extend the analysis to the higher layers, and to examine evidence that supports our hypothesis.

References

  • [1] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei (2009) ImageNet: A Large-Scale Hierarchical Image Database. In CVPR09.
  • [2] D. Erhan, Y. Bengio, A. Courville, and P. Vincent (2009) Visualizing higher-layer features of a deep network. University of Montreal 1341 (3), pp. 1.
  • [3] K. Fukushima (1980) Neocognitron: a self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position. Biological Cybernetics 36 (4), pp. 193–202.
  • [4] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778.
  • [5] D. H. Hubel and T. N. Wiesel (1959) Receptive fields of single neurones in the cat’s striate cortex. The Journal of Physiology 148 (3), pp. 574–591.
  • [6] D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980.
  • [7] A. Krizhevsky, I. Sutskever, and G. E. Hinton (2012) ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 25, F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger (Eds.), pp. 1097–1105.
  • [8] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner (1998) Gradient-based learning applied to document recognition. In Proceedings of the IEEE, pp. 2278–2324.
  • [9] Q. Liao and T. Poggio (2016) Bridging the gaps between residual learning, recurrent neural networks and visual cortex. arXiv preprint arXiv:1604.03640.
  • [10] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei (2015) ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV) 115 (3), pp. 211–252.
  • [11] R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra (2017) Grad-CAM: visual explanations from deep networks via gradient-based localization. In 2017 IEEE International Conference on Computer Vision (ICCV), pp. 618–626.
  • [12] K. Simonyan, A. Vedaldi, and A. Zisserman (2013) Deep inside convolutional networks: visualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034.
  • [13] K. Simonyan and A. Zisserman (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
  • [14] J. T. Springenberg, A. Dosovitskiy, T. Brox, and M. Riedmiller (2014) Striving for simplicity: the all convolutional net. arXiv preprint arXiv:1412.6806.
  • [15] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich (2015) Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–9.
  • [16] T. Teramoto and H. Shouno (2019) A study of inner feature continuity of the VGG model. In IEICE Technical Report, pp. 239–244.
  • [17] D. L. K. Yamins, H. Hong, C. F. Cadieu, E. A. Solomon, D. Seibert, and J. J. DiCarlo (2014) Performance-optimized hierarchical models predict neural responses in higher visual cortex. Proceedings of the National Academy of Sciences 111 (23), pp. 8619–8624. https://www.pnas.org/content/111/23/8619.full.pdf
  • [18] M. D. Zeiler and R. Fergus (2014) Visualizing and understanding convolutional networks. In European Conference on Computer Vision, pp. 818–833.