In this decade, deep convolutional neural networks (DCNNs) have been used in many areas such as image processing, audio signal processing, and language processing. In particular, on image classification tasks, DCNNs have shown higher performance than previous work in computer vision. A DCNN is a model in which the expressive power of features is greatly improved by deepening the hidden layers of a convolutional neural network (CNN). A CNN is characterized by hierarchically stacked convolutional and pooling layers, an architecture modeled on the simple and complex cells of the mammalian visual cortex. CNNs also incorporate biologically motivated constraints such as weight sharing and sparse activation. LeCun et al. proposed a CNN model called LeNet-5 for the classification of digit images and trained it with the backpropagation algorithm, a gradient-based learning method. Krizhevsky et al. showed the effectiveness of DCNNs on natural image classification. In the wake of their achievements, many researchers proposed various deep models [13, 15]. He et al. also proposed a DCNN model called the residual network (ResNet), which has skip connections that bypass layers. ResNet improved the performance of visual classification tasks dramatically.
The success of DCNNs has accelerated the study of understanding them from multiple angles. From the viewpoint of neuroscience, Yamins et al. experimentally showed the similarity between the visual cortex of primates and a DCNN trained for a classification task. From the engineering viewpoint, the mainstream method of understanding DCNNs is to visualize their inner representations using gradient backward projection [12, 11, 14]. These methods exploit the differentiability of the function computed by the DCNN.
The basic structure of DCNNs is inspired by biology; however, the non-biological improvements proposed in recent years make interpretation more difficult. For instance, ResNet is a model improved so that gradient-based learning methods work well even at great depth. To understand ResNet, Liao & Poggio studied the relation between a model of ResNet and the visual cortex. They exploit the observation that a ResNet model is similar to a recurrent neural network with feedback connections. Their study relates a model of ResNet to a recurrent neural network, and then relates the ventral stream to a model of stacked recurrent neural networks. However, that model carries a strong constraint and is not commonly used.
In this research, in order to understand ResNet, we examine it from the viewpoint of the development of preferred stimuli in receptive fields under the visual scene classification task with ImageNet [1, 10]. The receptive field is a basic concept of the visual cortex: roughly speaking, it is the part of the visual input area in which a neuron is able to respond. A preferred stimulus is one that elicits a strong response from the neuron. We use the idea of preferred stimuli in receptive fields to reveal properties of ResNet.
2.1 Residual Network
He et al. proposed the concept of the residual network (ResNet) and presented several ResNet models, e.g. ResNet18, ResNet34, ResNet50, ResNet101, and ResNet152. ResNet contains a characteristic architecture called a “skip connection” or “residual connection”. The concept of the residual connection is to explicitly divide the mapping function into a linear and a non-linear part. Let x be the input vector, y the output vector, and F the non-linear part of the mapping function. Then the skip connection is represented as:

y = F(x) + x. (1)
When the dimensions of x and F(x) are different, x is transformed by a mapping function so that the two can be summed; the original ResNets introduce a down-sampling block containing a convolutional layer at those skip connections. Fig. 1 shows the schematic diagram of the component of ResNet called the residual block. In order to treat the skip connection in the residual block, we introduce pseudo feature maps for the identity part of eq. (1). In the figure, each rectangle shows a feature map, the solid arrows show connectivity with trainable weights, and the dashed ones show connectivity with a fixed weight. We also introduce a model named PlainNet, obtained by removing all skip connections, for comparison. We use ResNet34 and PlainNet34 in our experiments, since ResNet34 showed higher performance than the other ResNet models and previous DCNNs in our preliminary experiments.
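As a sanity check on eq. (1), the residual computation can be sketched in a few lines of NumPy; the block below uses a toy fully connected branch as a stand-in for the non-linear part F (the weights, sizes, and function F here are hypothetical, not those of a trained ResNet):

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, w1, w2):
    """y = F(x) + x (eq. 1), with a toy two-matrix stand-in for F."""
    f = relu(x @ w1) @ w2   # non-linear residual branch F(x)
    return f + x            # skip connection adds the identity

dim = 8
x = rng.standard_normal(dim)
w1 = rng.standard_normal((dim, dim)) * 0.1
w2 = rng.standard_normal((dim, dim)) * 0.1

y = residual_block(x, w1, w2)

# With zero weights, F vanishes and the block reduces to the identity:
y_id = residual_block(x, np.zeros((dim, dim)), np.zeros((dim, dim)))
```

The identity check makes the point of the residual formulation: even if the trainable branch contributes nothing, the block still passes its input through unchanged.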
2.2 Receptive Field
In the context of the visual system, the receptive field is the area on the retina to which a neuron responds. The receptive field is considered to consist of a center and a surround area. Hubel & Wiesel showed that almost all receptive fields in the early visual cortex are very small, and that they become larger as the hierarchy deepens. Their work inspired the Neocognitron, one of the origins of the DCNN, and has influenced much image recognition research.
In the context of CNNs, each neuron has a receptive field and also has preferred stimuli, which are patches of the input image. Fig. 2 shows an overview of the receptive field. The rightmost rectangle shows the feature map of the focused layer, and the middle and left ones show an intermediate feature map and the input, respectively. A feature map has neurons aligned on a 2-dimensional lattice. When we choose a neuron in the focused feature map, we can determine the connected areas in the middle feature map and in the input. Thus, the preferred stimuli for the focused neuron appear in the red rectangle. Zeiler et al. show samples of the receptive fields of a DCNN and report the characteristics of each layer. Displaying such samples is a simple method for understanding the trained features of a CNN. We use the receptive field to investigate the characteristics of neurons in this research.
Let x be an image and let the receptive field R be a set of spatial indices. We can then formally describe the receptive field image x_R on the receptive field R of the image x as the restriction of x to the indices in R.
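The size of this receptive field grows with depth and can be computed from each layer's kernel size and stride; a small sketch follows (the layer configuration below is illustrative, not exactly ResNet34's):

```python
def receptive_field(layers):
    """Receptive field size (in input pixels) of one neuron after a
    stack of (kernel_size, stride) layers."""
    rf, jump = 1, 1  # field size, and input-pixel distance between neighbours
    for k, s in layers:
        rf += (k - 1) * jump  # each layer widens the field by (k-1) strides
        jump *= s             # strides compound multiplicatively
    return rf

# conv7x7/2 -> maxpool3x3/2 -> conv3x3/1 -> conv3x3/1 (illustrative stack)
print(receptive_field([(7, 2), (3, 2), (3, 1), (3, 1)]))  # prints 27
```

This is why receptive fields become large quickly in deep models: every stride multiplies the growth rate of all later layers.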
2.3 Visualization by Using Gradient
Many researchers use gradient-based visualization methods to understand deep neural networks [2, 12, 11, 14]. The earliest such work is the activation maximization of Erhan et al., which Simonyan et al. applied to DCNNs. Activation maximization computes, as an optimization problem, an input that maximizes the activation of a chosen neuron. Let θ denote the parameters of the neural network and let a(θ, x) be the activation of the neuron on a given input x. Assuming a fixed θ, the method is represented as

x* = argmax_x a(θ, x) − λ ||x||².
Since the solution we are interested in is a direction in input space, we add a norm penalty with a regularization parameter λ. In general, this problem is solved by iterative gradient ascent, because it is a non-convex optimization problem. The method can be applied to any differentiable model, but the resulting solution may be an uninformative local optimum.
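As a minimal illustration of regularized activation maximization by gradient ascent, the sketch below uses a single hypothetical linear neuron a(x) = w·x, for which the optimum x* = w / (2λ) is known in closed form (real DCNN neurons are non-linear, so there the same loop can only find a local optimum):

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.standard_normal(16)   # weights of a toy linear "neuron"

def grad_objective(x, lam):
    # gradient of a(x) - lam * ||x||^2 for the linear neuron a(x) = w.x
    return w - 2.0 * lam * x

lam = 0.1    # regularization parameter lambda
lr = 0.05    # gradient-ascent step size
x = np.zeros(16)  # start from a zero "image"
for _ in range(2000):
    x += lr * grad_objective(x, lam)

# For this toy neuron the iterate converges to the analytic maximizer w/(2*lam)
```

For the linear case the iteration is a contraction toward w/(2λ); for a real network one replaces `grad_objective` with backpropagated gradients.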
3 Experiment and Results
3.1 Training ResNets
The input images have three color channels and are whitened per channel. We apply the stochastic gradient descent method with momentum and weight decay, dividing the learning rate by a constant factor every fixed number of epochs, and train for a fixed number of epochs with a fixed mini-batch size. During training, the input images are randomly resized by an area scale and a randomly chosen aspect ratio.
3.2 Visualization Filters
Teramoto et al. propose a visualization method that expresses the preferred stimulus of the second layer of VGG as a convolution filter. Let w^(l)_{c',c} be the convolutional weight connecting channel c in layer l−1 to channel c' in layer l, and let i and j be spatial indices. The method then uses, as the c-th virtual filter of the second layer, the weight

v_c(i, j) = Σ_{c'} ( w^(2)_{c,c'} ∗ w^(1)_{c'} )(i, j),

where ∗ denotes convolution. In general, this method gives an approximate visualization of filters in higher layers, because CNNs have non-linear functions between convolutional layers. We call this filter a “virtual filter”, and apply the method to the second down-sampling layer in ResNet34. Fig. 3 shows the virtual filters and the first-layer filters of ResNet34.
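The channel-wise composition of two layers' filters into a virtual filter can be sketched with NumPy (the filter sizes and channel counts below are hypothetical, not VGG's or ResNet34's):

```python
import numpy as np

def conv2d_full(a, b):
    """Full 2-D convolution of two small filters, done directly."""
    H = a.shape[0] + b.shape[0] - 1
    W = a.shape[1] + b.shape[1] - 1
    out = np.zeros((H, W))
    for i in range(a.shape[0]):
        for j in range(a.shape[1]):
            out[i:i + b.shape[0], j:j + b.shape[1]] += a[i, j] * b
    return out

rng = np.random.default_rng(0)
n_ch = 4                                   # layer-1 channels (illustrative)
w1 = rng.standard_normal((n_ch, 3, 3))     # layer-1 filters, one per channel
w2 = rng.standard_normal((n_ch, 3, 3))     # layer-2 weights into one channel

# Virtual filter: sum over channels of the composed (convolved) filters
virtual = sum(conv2d_full(w2[c], w1[c]) for c in range(n_ch))
print(virtual.shape)  # (5, 5): two 3x3 filters compose to one 5x5 filter
```

The composed filter is exact only if the intermediate non-linearity is ignored, which is precisely why the method is an approximation at higher layers.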
Looking at the coupling coefficients of the filters in Fig. 3, it can be seen that the coupling between similar filters is stronger. ResNet, with its skip structure, thus acquires features resembling the columnar structure reported as a biological finding.
3.3 Analysis of Preferred Stimulus in Receptive Fields
We focus on the preferred stimuli in the input data set, i.e., the stimuli that activate a neuron in the ResNets strongly and positively. In order to find the preferred stimuli, we first feed the validation images of ImageNet to the DCNNs. After that, in each layer, we sort the stimuli in descending order of activation value. Let R be the receptive field of a neuron and let a(x_R) be the activation value of the neuron on a given receptive field image x_R. Now, we can describe the mean receptive field image over the positive validation images X+ as

m = (1 / |X+|) Σ_{x ∈ X+} x_R.

The positive validation images X+ are the validation images on which the neuron activates positively:

X+ = { x | a(x_R) > 0 }.
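The procedure above can be sketched with random stand-in data (the patch size, image count, and activation values below are hypothetical placeholders for real receptive field crops and neuron responses):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins: 100 receptive-field patches (8x8, 3 channels) and the neuron's
# activation on each patch.
patches = rng.standard_normal((100, 8, 8, 3))
activations = rng.standard_normal(100)

# Sort stimuli in descending order of activation; keep the top 16
order = np.argsort(activations)[::-1]
top16 = patches[order[:16]]

# Positive validation set X+: patches on which the neuron activates positively
positive = patches[activations > 0]

# Mean receptive field image over X+
mean_rf = positive.mean(axis=0)
print(mean_rf.shape)  # (8, 8, 3)
```

With a trained network, `patches` would be the cropped receptive field images and `activations` the corresponding neuron outputs.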
We show a few examples of the top-16 stimuli for some neurons in Figs. 4 and 6, and the convolutional filters and the mean receptive field images corresponding to those neurons in Figs. 5 and 7. From the samples of the receptive fields, we find that the DCNNs prefer a greater variety of features in the higher layers. At first glance, the panels (c) of Figs. 4 and 6 appear to be inconsistent samples, but the corresponding panels (d) of Figs. 5 and 7 show common central features.
We find that, due to the skip connections of ResNet, the characteristics of the same channel are similar across different layers. We can also see that the mean receptive field images reveal only broad tendencies; it is difficult to find the detailed properties of a neuron from them.
3.4 Visualization Using Maximization Method
We apply the activation maximization method [2, 12] to the ResNets and show the results of optimizing for a neuron and for a channel in Figs. 8 and 9. Optimizing for a neuron maximizes the activation of the center neuron of a feature map, and optimizing for a channel maximizes the average activation of the channel. We optimize the input with the Adam optimizer using weight decay, starting from a zero image and iterating a fixed number of times.
From the comparison of Figs. 5 and 8, we can see that the results of optimizing for a neuron are similar to the mean receptive field images. At higher layers, activation maximization reveals detailed properties, whereas the mean receptive field images show only simple trends. In particular, activation maximization for a channel yields a visually appealing picture of the neuron, but the results vary with the experimental conditions.
3.5 Inactive Neurons
For the validation dataset images, we find that some channels in the first max-pooling layer have no output activation values, i.e., they output only zeros, because of the ReLU activation function. We call such a channel an “inactive neuron”. In addition, Table 1 shows that ResNet34 has more inactive neurons than PlainNet34.
To investigate the effect of the inactive neurons on classification, we perform two classification experiments that add noise to the inactive neurons. The first adds noise to all inactive neurons, and the second adds noise to one inactive neuron selected randomly every mini-batch. We apply the noise to each spatial position of the inactive neuron. Table 1 shows the results: Δall denotes the validation loss with all inactive neurons noised minus the clean validation loss, and Δrand denotes the validation loss with one randomly noised inactive neuron minus the clean validation loss. We can see that the inactive neurons of ResNet34 affect the classification task, because both Δall and Δrand of ResNet34 are positive and larger than those of PlainNet34.
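Detecting inactive neurons and running the noise-injection probe can be sketched as follows (the feature map sizes and the two channels forced to be inactive are hypothetical stand-ins for a real first max-pooling layer):

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

# Stand-in pre-activations: (images, channels, height, width)
pre_act = rng.standard_normal((50, 16, 7, 7))
pre_act[:, [3, 9]] = -1.0          # force two channels to be always negative
feat = relu(pre_act)               # post-ReLU feature maps

# A channel is an "inactive neuron" if it outputs zero on every image
inactive = np.where(feat.max(axis=(0, 2, 3)) == 0.0)[0]
print(inactive)  # [3 9]

# Noise-injection probe: perturb every spatial position of inactive channels
noised = feat.copy()
noised[:, inactive] += 0.1 * rng.standard_normal((50, len(inactive), 7, 7))
```

In the actual experiment, the classifier's validation loss on `noised` features, compared with the clean loss, gives the Δ values reported in Table 1.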
| Model | # of inactive neurons |
We applied the analyses based on receptive fields and activation maximization to ResNets. Using both methods, we found that ResNet has orientation-selective neurons and double-opponent color neurons. Both methods characterize the lower layers well, but are harder to apply to the higher layers. We also found that ResNet34 contains neurons that are inactive for the classification task. We speculate that this phenomenon is due to channel sharing through the skip connections: one hypothesis is that some channels, the inactive neurons, are used for features that are not similar to the features of the first convolutional layers. In future work, we need methods that can extend the analysis to the higher layers, and evidence to examine our hypothesis.
- J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei (2009) ImageNet: a large-scale hierarchical image database. In CVPR.
- D. Erhan, Y. Bengio, A. Courville, and P. Vincent (2009) Visualizing higher-layer features of a deep network. Technical report 1341, University of Montreal.
- K. Fukushima (1980) Neocognitron: a self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position. Biological Cybernetics 36 (4), pp. 193–202.
- K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778.
- D. H. Hubel and T. N. Wiesel (1959) Receptive fields of single neurones in the cat's striate cortex. The Journal of Physiology 148 (3), pp. 574–591.
- D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980.
- A. Krizhevsky, I. Sutskever, and G. E. Hinton (2012) ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 25, pp. 1097–1105.
- Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner (1998) Gradient-based learning applied to document recognition. Proceedings of the IEEE, pp. 2278–2324.
- Q. Liao and T. Poggio (2016) Bridging the gaps between residual learning, recurrent neural networks and visual cortex. arXiv preprint arXiv:1604.03640.
- O. Russakovsky, J. Deng, H. Su, et al. (2015) ImageNet large scale visual recognition challenge. International Journal of Computer Vision (IJCV) 115 (3), pp. 211–252.
- R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra (2017) Grad-CAM: visual explanations from deep networks via gradient-based localization. In 2017 IEEE International Conference on Computer Vision (ICCV), pp. 618–626.
- K. Simonyan, A. Vedaldi, and A. Zisserman (2013) Deep inside convolutional networks: visualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034.
- K. Simonyan and A. Zisserman (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
- J. T. Springenberg, A. Dosovitskiy, T. Brox, and M. Riedmiller (2014) Striving for simplicity: the all convolutional net. arXiv preprint arXiv:1412.6806.
- C. Szegedy, W. Liu, Y. Jia, et al. (2015) Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–9.
- Teramoto et al. (2019) A study of inner feature continuity of the VGG model. In IEICE Technical Report, pp. 239–244.
- D. L. K. Yamins, H. Hong, C. F. Cadieu, E. A. Solomon, D. Seibert, and J. J. DiCarlo (2014) Performance-optimized hierarchical models predict neural responses in higher visual cortex. Proceedings of the National Academy of Sciences 111 (23), pp. 8619–8624.
- M. D. Zeiler and R. Fergus (2014) Visualizing and understanding convolutional networks. In European Conference on Computer Vision, pp. 818–833.