In object recognition, the input of a Convolutional Neural Network (ConvNet) is usually a multi-channel image. Consequently, the filters in the first convolution layer are three-dimensional, with a depth equal to the number of input channels. The output of the first layer is again a multi-channel image in which each channel, called a feature map, is produced by one filter, so the depth of the filters in the second convolution layer equals the number of filters in the first layer. Since convolution filters are the main building block of ConvNets, it is crucial to understand what happens when the input image is convolved with these filters. We may also be able to decipher the function of each layer in a ConvNet by analyzing each filter separately. However, interpreting 3D filters in the spatial domain is not trivial. In ConvNets in particular, the third dimension of the filters is usually large, since it depends on the number of input channels, which makes the filters harder to interpret.
There is a large body of work on understanding the internal process of ConvNets through visualization of hidden units. Zeiler & Fergus (2013) visualize the hidden units using Deconvolutional Networks. To be more specific, they reconstruct the images that most strongly activate each unit. In this way, we can assess how each unit sees the world and which parts of objects activate each neuron most. Simonyan et al. (2013) find an $L_2$-regularized image for each class by maximizing the class-specific score. They also compute a class saliency map for the input image.
Girshick et al. (2014) record the activations of a specific unit by feeding many images to the ConvNet and computing their activations on that unit. The images are then sorted according to their activation on this particular unit and displayed. Since each unit in the top layers has a corresponding receptive field on the image, it is possible to see which parts of the image are important for each unit.
Mahendran & Vedaldi (2014) invert the d-dimensional representation of an image computed by a representation function. This approach tells us to what extent it is possible to reconstruct the image from that representation. By applying this method to each layer of the network, we can understand which information is preserved by each layer. Similarly, Dosovitskiy & Brox (2015) reconstruct the image by minimizing the squared Euclidean distance between the downsampled input image and the reconstructed image. Recently, Nguyen et al. (2015) showed that deep neural networks are easily fooled into making high-confidence predictions for unrecognizable images.
Even though the visualization approaches help us to better understand the internal process of ConvNets, they do not provide a tool for assessing the stability of a ConvNet against noise. To address this problem, Szegedy et al. (2013) proposed a method for finding a regularized additive noise which minimizes the score of a specific class.
Contribution: In practice, it is necessary to examine how stable ConvNets are when the input image is noisy. This can be done empirically by evaluating a ConvNet on a contaminated test set. Another way is to analyze the filters in each layer in domains other than the spatial domain. In this paper, we show how to analyze the filters of different layers in the frequency domain (Section 2). To our knowledge, this is the first time that ConvNets are analyzed in the frequency domain. Then, we empirically assess various ConvNet architectures on different object recognition datasets (Section 3). The experiments compare various choices for the loss function, the activation function and the input size. Moreover, they illustrate that training a ConvNet using a noisy training set may increase the stability of the network. Above all, we analyze the ConvNets in the frequency domain to find out why all ConvNets are sensitive to small changes in the input.
2 Analysis in the Frequency Domain
The response of convolution filters to an arbitrary noisy input can be analyzed empirically or in the frequency domain. For instance, consider two well-known edge detection filters, namely Sobel and Prewitt. In real applications, we are interested in an edge detection filter which is invariant to noise. To find out which of these filters possesses this property, we can generate hundreds of noisy images and apply each filter to the noisy inputs. Then, it is possible to compare the results with the ground truth images to quantitatively evaluate both filters. Although this is a valid approach, it has two important problems.
First, it might be practically infeasible to use this method for analyzing ConvNets trained on large-scale datasets. For example, it is not practical to use this method for evaluating the stability of the Googlenet (Szegedy et al., 2014) on the ImageNet dataset, because the ImageNet dataset consists of 150K images. To empirically quantify the stability of the Googlenet, we must generate tens of noisy images with various signal-to-noise ratios (SNRs). If the images are degraded using Gaussian noise at 10 different noise levels and 100 noisy images are generated for each level, we will end up with 150K × 10 × 100 = 150 million noisy images to be fed into the Googlenet. This is not practically possible since it would take a very long time to complete the experiment.
Second, the aforementioned empirical method does not tell us why a particular ConvNet is not invariant to noise. In other words, in order to make a ConvNet more tolerant to noise, we must know the reasons that make the ConvNet sensitive to image degradation.
A more general approach is to transform the convolution filters into the frequency domain and analyze their frequency response. High frequencies usually belong to noisy pixels. Hence, a convolution filter can be more invariant to noise if it does not respond to high frequencies. In the next section we review the Fourier transform for computing the frequency response of convolution filters.
2.1 Review of The Fourier Transform
The Fourier transform decomposes an N-dimensional signal into N-dimensional sine and cosine functions with various frequencies. The strength of each frequency is indicated by the magnitude of the sine and cosine functions at that particular frequency. Mathematically, the Fourier transform of a 3-dimensional signal is defined as follows:

$$F(\omega_x, \omega_y, \omega_z) = \iiint f(x, y, z)\, e^{-j 2\pi (\omega_x x + \omega_y y + \omega_z z)}\, dx\, dy\, dz \qquad (1)$$

In this equation, $\omega_i$ is the frequency along axis $i \in \{x, y, z\}$ and $f(x, y, z)$ is a 3D signal. In the case of ConvNets, $f$ could be a 3D convolution kernel or a 3D feature map. $F(\omega_x, \omega_y, \omega_z)$ is a complex number indicating the magnitude and the phase of the frequency triple $(\omega_x, \omega_y, \omega_z)$ in the signal $f$.
The frequency response of a filter/feature map can be obtained by computing (1) on every spatial location of the filter/feature map. Taking into account that convolution in the spatial domain is equivalent to multiplication in the frequency domain, a convolution kernel can be more tolerant to noise if the magnitude of its frequency response is low at high frequencies. The Sobel filter is usually the best choice for calculating the first derivative of an image compared with other well-known edge detection filters. To see the reason, we reduced (1) to two dimensions and calculated the frequency responses of the Sobel and the Prewitt filters (the filters are zero-padded to obtain a high-resolution frequency response). Figure 1 illustrates the responses.
It is clear that the Sobel filter (in X direction) decreases the effect of high frequencies along one axis. In contrast, the Prewitt filter is not able to suppress high frequencies along the same axis. Taking into account that high frequencies are usually the result of noisy pixels, this shows that the Sobel filter is more tolerant to noise. For this reason, it is commonly the best edge detection filter.
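The comparison above can be reproduced numerically. The following is a minimal sketch, assuming the standard 3×3 X-direction kernels and an arbitrary 64×64 zero-padded grid; the helper name `magnitude_response` and the grid size are illustrative choices, not taken from the paper.

```python
import numpy as np

# X-direction (horizontal derivative) versions of both filters.
SOBEL_X = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]], dtype=float)
PREWITT_X = np.array([[-1, 0, 1],
                      [-1, 0, 1],
                      [-1, 0, 1]], dtype=float)

SIZE = 64  # zero-padded grid size (assumption, chosen for a smooth response)


def magnitude_response(kernel, size=SIZE):
    """Zero-pad `kernel` to size x size and return |FFT| with zero frequency centred."""
    spectrum = np.fft.fft2(kernel, s=(size, size))
    return np.abs(np.fft.fftshift(spectrum))


sobel_mag = magnitude_response(SOBEL_X)
prewitt_mag = magnitude_response(PREWITT_X)

# After fftshift, index SIZE // 2 is zero frequency and index 0 is the highest
# frequency. Rows index the vertical frequency, columns the horizontal one.
highest_vertical = 0
mid_horizontal = SIZE // 2 + SIZE // 4

print("Sobel   gain:", sobel_mag[highest_vertical, mid_horizontal])
print("Prewitt gain:", prewitt_mag[highest_vertical, mid_horizontal])
```

At the highest vertical frequency the Sobel gain vanishes (up to rounding) because its smoothing taps are [1, 2, 1], whereas the Prewitt taps [1, 1, 1] leave a non-zero gain there.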
2.2 Frequency Response of ConvNets
The filters of a ConvNet can be studied in the same way that we analyzed the Sobel and the Prewitt filters. The only difference is that the filters of a ConvNet are usually 3D arrays, so they must be visualized using 4D visualization techniques. Szegedy et al. (2013) showed that adding low-magnitude noise that is barely perceptible to the human eye may cause a ConvNet to classify the noisy image incorrectly. We can look for the reason in the frequency domain. To this end, it is enough to study only the effect of the additive noise, owing to the linearity property of the Fourier transform. In other words, the Fourier transform of the noisy image can be found by separately calculating the Fourier transforms of the image and of the noise and adding the results. Mathematically:
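A sketch of the linearity property, with symbol names chosen here purely for illustration ($x$ for the clean image, $\nu$ for the additive noise, $h$ for a convolution filter with frequency response $H$, and $\mathcal{F}$ for the Fourier transform):

```latex
\mathcal{F}\{x + \nu\} = \mathcal{F}\{x\} + \mathcal{F}\{\nu\},
\qquad
\mathcal{F}\{h * (x + \nu)\} = H \cdot \mathcal{F}\{x\} + H \cdot \mathcal{F}\{\nu\}
```

The second identity combines linearity with the convolution theorem: the contribution of the noise to a filter's output is the term $H \cdot \mathcal{F}\{\nu\}$, which can be studied without reference to the image.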
Therefore, we only need to transform the noise into the frequency domain in order to analyze the effect of the additive noise on the output of a ConvNet. This is derived by the fact that .
Our goal is to find out why low-magnitude noise may cause a ConvNet to incorrectly classify an image. For this purpose, we consider the pre-trained model of the Googlenet (Szegedy et al., 2014) provided by Jia et al. (2014). It is then fine-tuned on the Caltech101 dataset (Fergus & Perona, 2004) by adjusting the weights in the classification layer and freezing the weights in the other layers. Finally, an additive noise is found by minimizing a regularized objective function. The objective is expressed in terms of the actual class label, the predicted class label, a regularizing weight, and a function that returns the loss vector of the degraded image computed over all classes. It also includes a multiplier that penalizes noise values that do not degrade the image enough for it to be misclassified by the ConvNet. We minimized this objective function on a sample image from the Caltech101 dataset. Figure 2 illustrates the frequency response of the resulting noise along with the frequency responses of the first 7 filters in the first layer of the Googlenet. Note that the maximum and minimum values of the noise are very small; their intensities are normalized for visualization purposes.
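The idea of searching for a small additive noise that flips the prediction can be illustrated on a toy model. The sketch below is not the objective used in the paper: it swaps in a random linear classifier for the ConvNet and a simple margin-plus-L2 objective, and every name in it (`W`, `find_noise`, `reg_weight`, `step`) is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for a ConvNet: a random linear classifier (assumption).
num_classes, dim = 5, 20
W = rng.normal(size=(num_classes, dim))
image = rng.normal(size=dim)


def predict(x):
    """Return the index of the class with the highest score."""
    return int(np.argmax(W @ x))


true_label = predict(image)  # treat the clean prediction as the actual label


def find_noise(x, label, reg_weight=0.01, step=0.05, max_iters=1000):
    """Gradient descent on reg_weight * ||noise||^2 + (true score - runner-up score)."""
    noise = np.zeros_like(x)
    for _ in range(max_iters):
        scores = W @ (x + noise)
        if int(np.argmax(scores)) != label:
            break  # the classifier is fooled, stop early
        masked = scores.copy()
        masked[label] = -np.inf
        runner_up = int(np.argmax(masked))
        # Gradient of the objective with respect to the noise.
        grad = 2.0 * reg_weight * noise + (W[label] - W[runner_up])
        noise -= step * grad
    return noise


noise = find_noise(image, true_label)
print("clean label:", true_label, "noisy label:", predict(image + noise))
print("noise norm / image norm:", np.linalg.norm(noise) / np.linalg.norm(image))
```

The L2 term plays the role of the regularizer that keeps the noise small, while the margin term pushes the score of the actual class below that of a competing class.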
First, we observe that the noise affects almost all the frequencies (note that on the chart, only the blue points show a magnitude near zero). Second, the frequency responses of the filters reveal that they pass not only low and mid frequencies but also high frequencies. If the frequency response of each filter is multiplied by the response of the noise (i.e. convolution in the spatial domain), the result will be another noisy image in which the effect of some frequencies is slightly reduced. In other words, the output of the first convolution layer in the Googlenet is a multi-channel noisy image, since the filters are not able to suppress the effect of the noise.
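Both observations can be checked on synthetic data. The sketch below is an illustration under stated assumptions: the noise is plain Gaussian noise rather than the optimized noise of the paper, and the 3×3 sharpening kernel is a hypothetical stand-in for a learned filter whose response is non-zero at every frequency.

```python
import numpy as np

rng = np.random.default_rng(0)
size = 64

# Gaussian noise (the standard deviation is an illustrative choice).
noise = rng.normal(scale=1.0, size=(size, size))
noise_spectrum = np.fft.fft2(noise)

# Fraction of frequency bins in which the noise has non-negligible magnitude.
active = float(np.mean(np.abs(noise_spectrum) > 1.0))
print("fraction of frequencies touched by the noise:", active)

# Stand-in for a learned filter that passes low, mid and high frequencies.
kernel = np.array([[0, -1, 0],
                   [-1, 5, -1],
                   [0, -1, 0]], dtype=float)
kernel_response = np.fft.fft2(kernel, s=(size, size))

# Multiplication in the frequency domain equals circular convolution in space.
filtered = np.real(np.fft.ifft2(kernel_response * noise_spectrum))
print("noise std before / after filtering:", noise.std(), filtered.std())
```

Because the filter does not attenuate any band, the filtered output is still noise, which mirrors the argument that the first layer hands a noisy multi-channel image to the next layer.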
When the noisy multi-channel image is passed through a max-pooling layer, it may produce another noisy image in which the magnitude of high frequencies may increase. Analyzing several ConvNets in the frequency domain shows that they tend to learn filters which respond to most of the frequencies in the image. For this reason, the noise is propagated along the network and also appears in the last convolution layer, where it may alter the output of the ConvNet.
Now the question is whether we can solve this problem. From the frequency domain perspective, it is not trivial to suppress the additive noise during the convolution process, because the noise has positive magnitude at nearly all frequencies. Hence, even discarding the effect of the noise on some frequencies is not going to solve the problem effectively, since the noise will still be passed to the next layers through the other frequencies. However, as we will show in the next section, by learning filters which are more localized in the frequency domain, the stability of the network may increase while the accuracy of the network remains the same.
In this section, we study the stability of ConvNets empirically and in the frequency domain. To this end, we utilize ConvNets with different architectures trained on various datasets. Specifically, we use the architecture in Jia et al. (2014) for training a ConvNet on the CIFAR10 dataset (Krizhevsky, 2009). We also use the pre-trained models of Alexnet (Krizhevsky et al., 2012) and Googlenet (Szegedy et al., 2014) from Jia et al. (2014) and fine-tune them on the Caltech101 dataset (Fergus & Perona, 2004). Finally, we train the architectures from Ciresan et al. (2012) and Aghdam et al. (2015) on the GTSRB dataset (Stallkamp et al., 2012). Table 1 shows the accuracy of each ConvNet trained on the original datasets. It is clear that all the ConvNets have achieved state-of-the-art results.
|Network|accuracy (%)|Network|accuracy (%)|
|---|---|---|---|
|IRCV (GTSRB)(soft+relu)|99.01|IDSIA (GTSRB)(soft+tanh)|98.77|
|Alexnet (Caltech101)(soft+relu)|87.39|Googlenet (Caltech101)(soft+relu)|91.51|
3.1 Stability of ConvNets
To empirically study the stability of the ConvNets against noise, the following procedure is conducted. First, we pick the test images from the original datasets which are correctly classified by the ConvNets. Then, noisy versions of each of these correctly classified test images are generated for every noise level considered. The same procedure is repeated on every dataset and the accuracy of the ConvNets is computed using the noisy test sets. Table 2 shows the accuracy of the ConvNets for each noise level.
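The procedure above can be sketched as follows. This is a minimal illustration, not the authors' code: the function names, the noise levels, the number of copies per image, the clipping to [0, 255] and the toy threshold classifier are all assumptions.

```python
import numpy as np


def make_noisy_copies(images, sigma, copies, rng):
    """Return `copies` Gaussian-noise versions of every image, clipped to [0, 255]."""
    repeated = np.repeat(images.astype(float), copies, axis=0)
    noisy = repeated + rng.normal(scale=sigma, size=repeated.shape)
    return np.clip(noisy, 0.0, 255.0)  # clipping is an assumption


def noisy_accuracy(predict, images, labels, sigmas, copies, rng):
    """Accuracy of `predict` on noisy versions of the correctly classified images."""
    correct = predict(images) == labels  # keep only correctly classified images
    images, labels = images[correct], labels[correct]
    results = {}
    for sigma in sigmas:
        noisy = make_noisy_copies(images, sigma, copies, rng)
        noisy_labels = np.repeat(labels, copies)
        results[sigma] = float(np.mean(predict(noisy) == noisy_labels))
    return results


# Toy usage with a hypothetical classifier that thresholds the mean intensity.
rng = np.random.default_rng(0)
images = rng.uniform(0, 255, size=(50, 8, 8, 3))


def predict(batch):
    return (batch.mean(axis=(1, 2, 3)) > 127.5).astype(int)


labels = predict(images)
accuracy = noisy_accuracy(predict, images, labels, sigmas=[1, 10, 40], copies=5, rng=rng)
print(accuracy)
```

Restricting the evaluation to correctly classified images means that any drop below 100% is attributable to the added noise alone.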
|accuracy (%) for different values of|
First, we observe that, except for the IRCV and the Alexnet, the ConvNets misclassify a few of the originally correctly classified test images once they are degraded by the weakest Gaussian noise. At this noise level, it is highly improbable that a pixel changes by more than a few intensity levels in each channel. However, this slight change in the input can lead some of the ConvNets to classify the image incorrectly. Also, as the noise level increases, the accuracy of the ConvNets decreases. This is consistent with the explanation in Section 2.2, in the sense that stronger noise increases the magnitude of all the frequencies. Since the convolution layers are not able to reduce the noise effectively, it is propagated through the ConvNet and alters the output of the final convolution layer.
Second, a squashing activation function such as tanh seems to be more tolerant to noise, since it maps inputs with high values to outputs with very close values. However, comparing the results obtained from the IRCV (relu) and the IDSIA (tanh) illustrates that a squashing activation function does not necessarily make a ConvNet more robust against additive noise.
Third, comparing the results of the CIFAR10 ConvNets trained using the softmax and the hinge loss functions illustrates that there is no golden rule that a specific loss function leads to a more stable ConvNet. We observe that both ConvNets make mistakes even at a low noise level.
Fourth, there is no clear relation between the size of the input and the stability of the ConvNet. To be more specific, the IDSIA, the IRCV and the CIFAR10 ConvNets take small input images, whereas the inputs of the Alexnet and the Googlenet are considerably larger. Nevertheless, the IRCV and the IDSIA are more stable than the Alexnet and the Googlenet. This is because objects in the GTSRB dataset are simpler than the objects in the ImageNet dataset. In addition, the GTSRB dataset has far fewer classes than the ImageNet dataset. For these reasons, a small input is enough for the IRCV and the IDSIA to be reasonably stable. In contrast, the CIFAR10 dataset contains complex objects presented in small images, so some important details of the objects are lost through down-sampling. When these images are degraded by strong noise, the frequency pattern changes dramatically, which in turn alters the classification score. In sum, the stability of a ConvNet does not depend solely on the size of the input. Instead, choosing an appropriate input size according to the number of classes and the complexity of the objects in the dataset can increase the stability of a ConvNet.
3.2 Smoothing the noisy images
It is common practice to smooth an image using a Gaussian filter before further processing. In this experiment, we investigate how smoothing affects the stability of the ConvNets. For this purpose, we follow the same procedure as in Section 3.1, but the noisy images are smoothed using a Gaussian filter before being fed to the ConvNets. Table 3 shows the results.
|accuracy (%) for different values of|
In general, smoothing images that are degraded by weak Gaussian noise has a negative impact on the accuracy. In particular, it dramatically reduces the performance of the CIFAR10 ConvNets, because images in the CIFAR10 dataset are very small and smoothing a clean image eliminates some of the important details of the objects. In contrast, smoothing images degraded by stronger Gaussian noise increases the overall accuracy of the ConvNets. The reason for this behaviour can be found in the frequency domain. Figure 3 illustrates the frequency response of the Gaussian filter.
According to this figure, the Gaussian filter passes the low frequencies but slightly reduces the effect of the mid frequencies. In addition, it significantly reduces the effect of the high frequencies. When the image is degraded by weak Gaussian noise, the noise might not significantly affect the mid and the low frequencies. However, applying a Gaussian filter to these images affects their mid frequencies considerably, which causes some of the images to be classified incorrectly. In other words, the effect of Gaussian smoothing on the mid frequencies is greater than the effect of the noise on the same frequencies. On the contrary, the mid and high frequencies are dramatically affected in images degraded at higher noise levels. As a result, applying a Gaussian smoothing filter may considerably reduce the effect of the noise, which in turn increases the accuracy of the ConvNet on these images. The immediate conclusion from this experiment is that if the filters of the ConvNets are further adjusted to weaken high frequencies and to compress the frequency response, the stability of the ConvNets may increase when the input images are noisy.
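The low-pass behaviour described above can be verified directly. The following is a minimal sketch, assuming a 5×5 kernel with a standard deviation of 1; both values, like the helper names, are illustrative choices rather than the settings used in the experiment.

```python
import numpy as np


def gaussian_kernel(size=5, sigma=1.0):
    """Normalised 2D Gaussian kernel (size and sigma are illustrative choices)."""
    axis = np.arange(size) - (size - 1) / 2.0
    one_d = np.exp(-axis ** 2 / (2.0 * sigma ** 2))
    kernel = np.outer(one_d, one_d)
    return kernel / kernel.sum()


def magnitude_response(kernel, size=64):
    """Zero-pad the kernel and return the magnitude of its 2D Fourier transform."""
    return np.abs(np.fft.fft2(kernel, s=(size, size)))


response = magnitude_response(gaussian_kernel())

# Without fftshift, index 0 is zero frequency, size // 4 a mid frequency and
# size // 2 the highest frequency along each axis.
low, mid, high = response[0, 0], response[0, 16], response[0, 32]
print("low / mid / high frequency gain:", low, mid, high)
```

With these settings the gain is 1 at zero frequency, drops to roughly 0.3 at the mid frequency and is close to zero at the highest frequency, which is the pattern the paragraph attributes to the Gaussian filter.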
3.3 Augmenting with noisy images
Augmenting data by applying transformations to the original dataset is a common practice for increasing the generalization of ConvNets. The data augmentation procedure does not usually involve adding noisy images to a dataset. In this experiment, we augment the original dataset with noisy images generated using Gaussian noise. We consider several noise levels, and 10 different noisy images are generated for each sample in the original training set. Next, all fully connected layers after the last convolution layer are frozen and only the convolution layers are trained using the noisy datasets. In this way, the fine-tuning procedure only affects the convolution filters. Finally, the ConvNets are evaluated on a noisy test set created as described in Section 3.1. Table 4 and Table 5 show the accuracy of the ConvNets on the original test set and on the noisy test set, respectively. As is clear from Table 4, the ConvNets achieve accuracies very close to those in Table 1.
|Network|accuracy (%)|Network|accuracy (%)|
|---|---|---|---|
|IRCV (noisy)|99.29|IDSIA (noisy)|98.59|
|CIFAR10 (noisy+hinge)|78.2|CIFAR10 (noisy+softmax)|76.6|
|accuracy (%) for different values of|
The results illustrate a considerable increase in the accuracy of the ConvNets, especially on the images degraded by strong Gaussian noise. In contrast to the previous experiment (i.e. smoothing using a Gaussian filter), the fine-tuned convolution kernels do not have a negative effect on the images degraded by weak Gaussian noise. Moreover, it is clear that the ConvNets learn to reduce the effect of strong noise. To investigate this issue, we computed the frequency response of the first layer of the IRCV and the CIFAR10 (softmax and hinge) ConvNets before and after augmenting the training set with noisy images. Then, the mean spectrum of the first layer of each ConvNet was computed separately. Figure 4 shows the results.
The common point in all plots is that the mean spectrum of the ConvNets fine-tuned using the noisy training set is more localized than that of the ConvNets trained without noisy images. In other words, fewer frequencies are passed through the convolution filters trained using the noisy training set. For this reason, these ConvNets are able to reduce the additive noise more effectively than the ConvNets that are trained using only the original dataset. In sum, augmenting a dataset with noisy images is advantageous, since it helps the training algorithm to learn convolution filters with a more concentrated spectrum.
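One way to compute such a mean spectrum is sketched below. It assumes that the mean spectrum is the average magnitude of the zero-padded 3D Fourier transforms of the first-layer filters; the function name, the padding shape and the random stand-in weights are hypothetical.

```python
import numpy as np


def mean_spectrum(filters, pad_shape=(8, 32, 32)):
    """Mean magnitude spectrum of a filter bank shaped (num_filters, depth, height, width)."""
    spectra = np.fft.fftn(filters, s=pad_shape, axes=(1, 2, 3))  # 3D FFT per filter
    spectra = np.fft.fftshift(spectra, axes=(1, 2, 3))           # zero frequency at the centre
    return np.abs(spectra).mean(axis=0)


# Hypothetical first-layer weights; real ones would be read from a trained model.
rng = np.random.default_rng(0)
filters = rng.normal(scale=0.1, size=(16, 3, 5, 5))
spectrum = mean_spectrum(filters)
print(spectrum.shape)

# Sanity check with averaging (box) filters, whose energy sits at zero frequency.
box_spectrum = mean_spectrum(np.ones((4, 3, 5, 5)) / 75.0)
print(box_spectrum[4, 16, 16])
```

Computing this quantity for the same layer before and after fine-tuning on noisy data, and comparing how tightly the energy clusters, is the kind of comparison the paragraph describes.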
In this paper, we studied the stability of Convolutional Neural Networks (ConvNets) against image degradation. To this end, we showed how to analyze the convolution filters in a ConvNet by visualizing their Fourier transform in 4 dimensions. Then, we studied why a ConvNet may make mistakes when the image is degraded by additive noise that is barely perceptible to the human eye. Specifically, we illustrated that additive noise affects almost all the frequencies in the image. On the other hand, analyzing the convolution kernels in the frequency domain revealed that they may respond to high frequencies as much as to mid and low frequencies. For this reason, they are not able to denoise the image effectively, and the noise is propagated across the ConvNet and alters the classification score. Moreover, our experiments on ConvNets trained on different datasets showed that there is no golden rule stating that a particular loss function or activation function yields a more stable ConvNet. Besides, the size of the input image can only affect the performance if it is not selected based on the complexity of the objects in the dataset and the number of classes. Next, we studied what happens if the noisy image is smoothed using a Gaussian smoothing kernel. The results suggest that while smoothing can be helpful on images with a low signal-to-noise ratio, it can adversely affect the results on images with a high signal-to-noise ratio. The frequency response of the Gaussian kernel shows that it reduces the effect of mid frequencies. When it is applied to a clean image, it changes the mid frequencies considerably, which in turn changes the output. In contrast, when the image is degraded by strong noise, the Gaussian kernel is able to reduce the effect of the noise effectively and more images are correctly classified.
The outcome of this experiment is that if convolution kernels are trained properly to have a more concentrated frequency response, the stability of the ConvNet may increase. Finally, we investigated this assumption by augmenting the training set with noisy images. Applying the ConvNets trained using the noisy sets to the noisy test sets showed a considerable performance boost. We analyzed the reason by computing the mean spectrum (frequency response) of the convolution filters in the first layer of the ConvNets before and after training on the noisy sets. It showed that the frequency responses of the ConvNets trained on the noisy sets are more concentrated than those of the ConvNets trained on the clean sets.
Hamed H. Aghdam is grateful for the support granted by Generalitat de Catalunya’s Agència de Gestió d’Ajuts Universitaris i de Recerca (AGAUR) through the FI-DGR 2015 fellowship.
- Aghdam et al. (2015) Aghdam, Hamed H, Heravi, Elnaz J, and Puig, Domenec. Recognizing Traffic Signs using a Practical Deep Neural Network. In Second Iberian Robotics Conference, Lisbon, 2015. Springer.
- Ciresan et al. (2012) Ciresan, D., Meier, Ueli, and Schmidhuber, Juergen. Multi-column deep neural networks for image classification. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2012. ISBN 978-1-4673-1228-8. doi: 10.1109/CVPR.2012.6248110.
- Dosovitskiy & Brox (2015) Dosovitskiy, Alexey and Brox, Thomas. Inverting Convolutional Networks with Convolutional Networks. pp. 1–15, 2015. URL arxiv.org/abs/1506.02753.
- Fergus & Perona (2004) Fergus, Rob and Perona, Pietro. Learning Generative Visual Models from Few Training Examples. In Computer Vision and Pattern Recognition (CVPR), Workshop on Generative-Model Based Vision, 2004. ISBN 0769521584.
- Girshick et al. (2014) Girshick, Ross, Donahue, Jeff, Darrell, Trevor, and Malik, Jitendra. Rich feature hierarchies for accurate object detection and semantic segmentation. CVPR 2014, pp. 2–9, 2014. ISSN 10636919. doi: 10.1109/CVPR.2014.81. URL arxiv.org/abs/1311.2524.
- Jia et al. (2014) Jia, Yangqing, Shelhamer, Evan, Donahue, Jeff, Karayev, Sergey, Long, Jonathan, Girshick, Ross, Guadarrama, Sergio, and Darrell, Trevor. Caffe: Convolutional Architecture for Fast Feature Embedding. ACM Conference on Multimedia, 2014.
- Krizhevsky et al. (2012) Krizhevsky, A, Sutskever, I, and Hinton, Ge. Imagenet classification with deep convolutional neural networks. Advances in neural information processing systems, pp. 1097–1105, 2012. ISSN 10495258.
- Krizhevsky (2009) Krizhevsky, Alex. Learning Multiple Layers of Features from Tiny Images. pp. 1–60, 2009.
- Mahendran & Vedaldi (2014) Mahendran, Aravindh and Vedaldi, Andrea. Understanding Deep Image Representations by Inverting Them. 2014.
- Nguyen et al. (2015) Nguyen, A, Yosinski, J, and Clune, J. Deep Neural Networks are Easily Fooled: High Confidence Predictions for Unrecognizable Images. CVPR 2015, 2015. URL http://arxiv.org/abs/1412.1897.
- Simonyan et al. (2013) Simonyan, Karen, Vedaldi, Andrea, and Zisserman, Andrew. Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps. arXiv preprint arXiv:1312.6034, pp. 1–8, 2013. URL arxiv.org/abs/1312.6034.
- Stallkamp et al. (2012) Stallkamp, J, Schlipsing, M, Salmen, J, and Igel, C. Man vs. computer: Benchmarking machine learning algorithms for traffic sign recognition. Neural Networks, 32:323–332, 2012. ISSN 08936080. doi: 10.1016/j.neunet.2012.02.016. URL dx.doi.org/10.1016/j.neunet.2012.02.016.
- Szegedy et al. (2013) Szegedy, Christian, Zaremba, W, and Sutskever, I. Intriguing properties of neural networks. arXiv preprint arXiv: …, pp. 1–10, 2013. URL http://arxiv.org/abs/1312.6199.
- Szegedy et al. (2014) Szegedy, Christian, Reed, Scott, Sermanet, Pierre, Vanhoucke, Vincent, and Rabinovich, Andrew. Going deeper with convolutions. pp. 1–12, 2014.
- Zeiler & Fergus (2013) Zeiler, Matthew D and Fergus, Rob. Visualizing and Understanding Convolutional Networks. arXiv preprint arXiv:1311.2901, 2013. ISSN 978-3-319-10589-5. doi: 10.1007/978-3-319-10590-1_53. URL arxiv.org/abs/1311.2901.