The human visual system (HVS) has evolved to be more sensitive to contrast than absolute luminance. Thanks to this intelligent natural selection, we perceive the world steadily despite the huge changes in illumination that we experience throughout the day or across different locations. A similar feature is also vital for machine vision to perform successfully in real-world scenarios, e.g. for an autonomous car that is driving in a motorway under diverse lighting conditions.
A recent article geirhos2017comparing investigated the impact of image quality on the performance of human and machine vision. A psychophysical experiment was conducted in which the contrast of stimuli was gradually reduced. The results of this study demonstrate that the classification accuracy of VGG-16 simonyan2014very remains on a par with human subjects, a merit that GoogLeNet szegedy2015going and AlexNet krizhevsky2012imagenet fail to match, as their performance deteriorates significantly when the contrast of the input image is decreased. This raises an interesting research question regarding the mechanisms by which VGG-16 accomplishes this desirable contrast invariant behaviour that is absent in the other two networks. Logically, there must be certain operations in place to absorb variations of contrast in the input image in order to prevent their propagation to the output of the network.
It is well established, according to numerous physiological studies (e.g. carandini1997linearity ; heeger1992normalization ), that the spike activity of receptive fields (RF) in cells of the visual cortex changes considerably according to the contrast of the presented stimuli. This is also reported (e.g. shushruth2009comparison ; angelucci2013beyond ) in typical centre-surround mechanisms of RFs that have been proposed to play an important role in making a visual system independent of illuminant changes (i.e. lightness constancy) mante2005independence . While it remains to be seen how exactly this is “implemented” in our neuronal system, a naïve equivalent of this for DNNs would be, for example, to change the parameters of a kernel depending on the contrast of neighbouring pixels. A similar approach has been demonstrated to be effective for non-learning algorithms in various computer vision tasks (e.g. arash2017ijcv ; akbarinia2017pami ). However, this is certainly not what happens in VGG-16 or any other DNN: once the network has learnt its parameters, they remain fixed in the test phase. This brings us back to the original research question we formulated: how is contrast invariance achieved in DNNs?
In order to answer this question we analysed the activation of kernels in the convolutional layers of eight prominent networks. Figure 1 illustrates the flowchart of our experiment. We fed each DNN eleven contrast levels of the same image and measured the activation of all kernels in the first five convolutional layers. For each layer, at every spatial location, we retrieved the most activated kernel. Then we computed the proportion of those kernels that remained identical for a given layer across different levels of contrast. This proportion is an indication of whether the behaviour of a layer varies according to the contrast of the stimuli (i.e. input image). We compared the corresponding proportions in each layer to examine whether there are layers that account for the contrast invariant behaviour of a DNN. It is worth mentioning that similar procedures are commonly used to decipher the mechanisms of these artificial networks with respect to certain visual features (e.g. kubilius2016deep ).
We conducted our experiment over the entire validation set of ImageNet krizhevsky2012imagenet . In our analysis a systematic pattern emerged for networks with more than one convolutional layer before the first max-pooling: the last convolutional layer before the first max-pooling showed more sensitivity to contrast by a large margin. These networks (e.g. VGG and Inception) also maintained their level of accuracy more robustly across different levels of contrast. Others with a max-pooling operator right after the first convolutional layer (e.g. GoogLeNet and AlexNet) did not exhibit this pattern, and performed significantly worse on low contrast images.
Given the facts that (i) a DNN is a feedforward model and (ii) the last convolutional layer before the first max-pooling is highly variant to the contrast of the input image, we can deduce that as a consequence the entire network achieves contrast invariance. In other words, one of the lowest layers in these networks mitigates the variation of contrast, therefore preventing its propagation to higher layers. Conceptually this is in line with the architecture of both biological hubel1962receptive and artificial networks lecun2015deep : lower areas are responsive to basic features (e.g. contrast and orientation) while higher areas encode more abstract notions (e.g. shapes and objects).
In our investigation we observed a number of curious similarities between biological and artificial vision. For instance, the invariance to contrast only emerges for networks that have multiple convolutional layers before the first max-pooling, loosely resembling the concept of a cortical area, e.g. the primary visual cortex (V1), with its six constituent layers. Interestingly, in the case of biological organisations, it has been hypothesised that contrast is a fundamental independent variable encoded by the early visual system mante2005independence . It is not all that surprising to encounter similarities between DNNs and the network inside our brains. Many fundamental hierarchical correspondences between the two have been reported previously cichy2016comparison . It seems that both systems act in accordance with the empirical theory of vision purves2011we : inference through successful behaviour. Both systems have learnt a set of visual features over a period of time (be it evolution or the training phase). Contrast certainly is among the most important low-level features for any visual system. Therefore, we believe our results can shed light on the mechanisms surrounding this visual feature.
We included eight prominent networks of different architectures in our study: Inception-V3 szegedy2016rethinking , VGG-19 simonyan2014very , VGG-16 simonyan2014very , VGG-16-3C he2017channel , ResNet-50 he2016deep , ResNet-101 he2016deep , GoogLeNet szegedy2015going , and AlexNet krizhevsky2012imagenet . For all these DNNs, except VGG-16-3C, we used the pretrained networks of the MathWorks Neural Network Toolbox. For VGG-16-3C we downloaded its Caffe version from the authors' GitHub repository and imported it into Matlab using the importCaffeNetwork function.
In this article, we restricted the conducted experiments and corresponding analysis to the first five convolutional layers of each network because the shallowest architecture of all (i.e. AlexNet) only consists of five convolutional layers. In this manner, the comparison among networks would be more consistent. We only studied convolutional layers since they are the pillars of a DNN and carry most of the operations and parameters.
We conducted our experiments on a large visual database (i.e. ImageNet krizhevsky2012imagenet ) that contains one thousand different categories of objects (e.g. plants, animals, cars and bicycles). Specifically, we used the validation set of the ImageNet Large-Scale Visual Recognition Challenge (ILSVRC 2012) russakovsky2015imagenet with a total of 50,000 test images (i.e. fifty images per object category). It is worth mentioning that all the networks studied in this work have been trained on a subset of the ImageNet database.
2.3 Generating images of different contrasts
Given an input image $I$, we generated eleven images $I_c(x, y)$ at different levels of contrast, where $(x, y)$ are pixel coordinates and $c$ is the contrast level, which took eleven values in our experiments. The same procedure was used in the psychophysical experiment of geirhos2017comparing , making our findings comparable.
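As a minimal sketch of such a contrast manipulation, the Python snippet below reduces contrast by blending every pixel toward mid-grey. The blending rule, the [0, 1] intensity range, the function name reduce_contrast and the example levels are all assumptions made for illustration; they are not taken from the text above.

```python
def reduce_contrast(image, c):
    """Return a copy of `image` at contrast level `c`.

    Assumed rule (not confirmed by the text): blend each pixel toward
    mid-grey. `image` is a list of rows of intensities in [0, 1]; `c` lies
    in [0, 1], where 1 leaves the image unchanged and 0 yields uniform grey.
    """
    if not 0.0 <= c <= 1.0:
        raise ValueError("contrast level must lie in [0, 1]")
    offset = (1.0 - c) / 2.0  # keeps mid-grey (0.5) fixed
    return [[c * pixel + offset for pixel in row] for row in image]


# Hypothetical usage: a 1x2 image rendered at three illustrative levels.
image = [[0.0, 1.0]]
versions = {c: reduce_contrast(image, c) for c in (1.0, 0.5, 0.1)}
```

Under this rule the mean intensity of a mid-grey-centred image is preserved while the spread of intensities shrinks in proportion to c.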
2.4 Experiment procedure
We have illustrated the pseudocode of our experiment in Algorithm 1. Given a network and a test image, we followed this series of steps:
Generate eleven versions of the same image at different levels of contrast.
For each contrast image, compute the activation of all the kernels in the first five convolutional layers (using the activations function of Matlab).
At every spatial location keep the most activated kernel.
In a pair-wise comparison across all levels of contrast, compute the percentage of the most activated kernels in a layer that remain identical at every spatial location.
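The steps above can be sketched in Python as follows. This is an illustration under stated assumptions, not the Matlab pipeline used in the experiments: the activations of one layer are assumed to be available as a nested list indexed as activations[k][y][x], all function names are hypothetical, and averaging over pairs is only one possible way of summarising the pair-wise comparison.

```python
from itertools import combinations


def winner_map(activations):
    """Index of the most activated kernel at every spatial location.

    `activations[k][y][x]` is the response of kernel k at location (y, x).
    Ties go to the lowest kernel index (an arbitrary choice in this sketch).
    """
    n_kernels = len(activations)
    height, width = len(activations[0]), len(activations[0][0])
    return [
        [max(range(n_kernels), key=lambda k: activations[k][y][x])
         for x in range(width)]
        for y in range(height)
    ]


def identical_fraction(winners_a, winners_b):
    """Fraction of locations whose winning kernel is the same in both maps."""
    pairs = [
        (a, b)
        for row_a, row_b in zip(winners_a, winners_b)
        for a, b in zip(row_a, row_b)
    ]
    return sum(a == b for a, b in pairs) / len(pairs)


def mean_pairwise_identical(winner_maps):
    """Average `identical_fraction` over all pairs of contrast levels."""
    scores = [identical_fraction(a, b) for a, b in combinations(winner_maps, 2)]
    return sum(scores) / len(scores)


# Toy layer with 2 kernels on a 1x2 grid, at two contrast levels.
high = winner_map([[[1.0, 0.2]], [[0.5, 0.9]]])  # winners: [[0, 1]]
low = winner_map([[[0.6, 0.6]], [[0.4, 0.5]]])   # winners: [[0, 0]]
```

In the toy example the winning kernel is the same at one of the two locations, so identical_fraction(high, low) is 0.5.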
In this manuscript, we have focused our principal analysis on the most activated kernels for those levels of contrast at which a network consistently predicts the same object for a given input image (e.g. at both 50% and 100% levels of contrast a network predicts an “owl”). The reason for this choice is that when the prediction of a network is inconsistent under two different levels of contrast, the comparison of the most activated kernels is not very meaningful (e.g. at 50% level of contrast a network predicts a “cup” and at 100% level of contrast an “owl”). Naturally, as the output of the network belongs to two different classes of object, the most activated kernels would be different as well. (Refer to supplementary materials for a comprehensive analysis of all cases, including those in which the network output differs across levels of contrast.)
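A small Python sketch of this selection step is given below. It assumes the predictions for one image are stored in a dictionary keyed by contrast level (in %); the function name and the example labels are hypothetical.

```python
from itertools import combinations


def consistent_pairs(predictions):
    """Pairs of contrast levels at which the predicted label is the same.

    `predictions` maps a contrast level (in %) to the label predicted by the
    network for one image at that level.
    """
    levels = sorted(predictions)
    return [
        (a, b)
        for a, b in combinations(levels, 2)
        if predictions[a] == predictions[b]
    ]


# Illustrative labels only, mirroring the owl/cup example in the text.
pairs = consistent_pairs({100: "owl", 50: "owl", 10: "cup"})
```

Only the (50, 100) pair survives here, so only that pair would enter the comparison of the most activated kernels.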
On the right panel of Figure 2 we have depicted an exemplary image from the validation set of ILSVRC 2012 under four different levels of contrast. Naturally, as the contrast of an image is reduced, the classification task becomes increasingly challenging. Results of a recently reported psychophysical experiment suggest that the performance of humans on the very same task remains almost intact down to about 40% level of contrast geirhos2017comparing .
On the left panel of Figure 2 we have reported the classification accuracy of the eight networks under eleven levels of contrast. Each point is the average accuracy over 50,000 images. It is worth emphasising that in this study we are not investigating which network obtains the best performance in comparison to the others; rather, we are interested in understanding the behaviour of a network as the contrast level is reduced (an intra-comparison instead of an inter-comparison).
If the classification accuracy of a network drops sharply as the contrast level is reduced, this implies the performance of that network is contrast variant, an undesirable behaviour. Contrary to that, if a network retains its own peak accuracy on low contrast inputs, this network possesses the desired behaviour of contrast invariance. According to this criterion, we can observe that Inception-V3 exhibits the greatest invariance to the contrast of the input images, while AlexNet performs the worst.
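One way to make this intra-network criterion concrete is to express the accuracy at each contrast level as a fraction of the network's own peak accuracy. The sketch below is only an illustration of that idea: the function name is hypothetical and the accuracies are made-up numbers, not results from this study.

```python
def retained_accuracy(accuracy_by_level):
    """Accuracy at each contrast level divided by the network's own peak."""
    peak = max(accuracy_by_level.values())
    if peak <= 0:
        raise ValueError("peak accuracy must be positive")
    return {level: acc / peak for level, acc in accuracy_by_level.items()}


# Made-up accuracies for a hypothetical network (contrast level in %).
example = {100: 0.8, 50: 0.8, 10: 0.4}
retained = retained_accuracy(example)
```

A value close to 1 at a low contrast level would indicate contrast invariance in the sense used above, regardless of how the peak accuracy compares with that of other networks.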
In Figure 3, for four networks, we have reported the proportion of the most activated kernels that remain identical across different levels of contrast (refer to the supplementary materials for the same analysis of all other networks). Each point represents variation in the output of a layer throughout all contrast trials. If this figure is 1 for a layer, it means that at each spatial location the same kernel always remains the most activated one, irrespective of the contrast of the input image. Contrary to that, if this figure is 0 for a layer, it means that the most activated kernel always changes according to the contrast of the input image.
The results of Inception-V3 are reported on the top left panel of Figure 3. Let us remind ourselves that the architecture of this network consists of three convolutional layers prior to the first max-pooling. We can observe that conv1_3 has the lowest percentage of identical kernels by a large margin. The results of VGG-19 are reported on the bottom left panel of Figure 3. The smallest proportion of identical kernels belongs to conv1_2 (the last convolutional layer prior to the first max-pooling).
On the top right panel of Figure 3 we have reported the results of AlexNet. Let us remind ourselves that in this architecture there is a max-pooling after the first convolutional layer. We can observe that the curves are almost flat. On the bottom right panel of Figure 3 we have reported the results of ResNet-101. We cannot detect any clear pattern in these two networks, both of which perform a max-pooling operation after the first convolutional layer.
In Table 1 we have reported the percentage of identical kernels for all studied networks (similar to Figure 3, but irrespective of the confidence of the network). Double vertical lines in this table represent a max-pooling layer. All networks with more than one convolutional layer prior to the first max-pooling (the first four rows) share two patterns:
The lowest percentage of the identical kernels always belongs to the convolutional layer right before the first max-pooling (bold figures).
The highest percentage of the identical kernels always belongs to the fifth convolutional layer (the last column of the Table).
Contrary to that, we cannot observe any clear pattern for all those networks that perform a max-pooling operation right after the first convolutional layer (the last four rows).
Table 1: Percentage of identical kernels for all studied networks. Columns are convolutional layers ordered from lower to higher.
4.1 Performance of networks
The results of our experiment across different levels of contrast (Figure 2) demonstrate that four networks (i.e. Inception-V3, VGG-19, VGG-16, and VGG-16-3C) largely retain their classification accuracy down to 40% level of contrast, which was reported to be also the case for human subjects geirhos2017comparing . This implies that these networks are invariant to the contrast of the input images, similar to our very own visual system. This desired feature is more pronounced in the case of Inception-V3, which maintains its performance down to 15% level of contrast.
Contrary to this, the performance of the other four networks (i.e. ResNet-50, ResNet-101, GoogLeNet, and AlexNet) deteriorates rapidly as contrast of the input images is reduced. For instance, although at 100% level of contrast the classification accuracy of GoogLeNet is superior to VGG-16, at 40% contrast VGG-16 is significantly better than GoogLeNet. This becomes even more noticeable at lower levels of contrast (compare the pink and green curves in Figure 2). A similar trend occurs if we compare AlexNet to VGG-16-3C (the brown and cyan curves, respectively).
The distinction between the two regimes of networks cannot be traced back to their learning procedures, as all have been trained with a subset of the ImageNet dataset with no particular data augmentation regarding the contrast of input images. The difference cannot arise from their corresponding depths either: GoogLeNet is substantially deeper than the VGG networks. Nor can the topology of the networks explain this distinction: the architecture of Inception-V3 is a directed acyclic graph (DAG) while the VGG family is serial. The number of operations executed does not appear to be decisive either: Inception-V3 performs far fewer operations than ResNet-101.
The only evident pattern is that the architecture of all contrast invariant networks consists of more than one convolutional layer prior to the first max-pooling operation: Inception-V3 has three convolutional layers before the first max-pooling, VGG-19 and VGG-16 two, and VGG-16-3C four. Contrary to this, the other four networks all perform a max-pooling operation after their first convolutional layer. This loosely, yet interestingly, resembles cortical areas of biological visual systems that are composed of multiple layers. For instance, the primary visual cortex (V1) of humans is divided into six layers douglas1998neocortex .
4.2 Analysis of layers
A clear pattern emerges for the contrast invariant networks in our analysis of the consistency of the most activated kernels across different levels of contrast (Table 1): the lowest proportion of identical kernels always belongs to the convolutional layer before the first max-pooling, by a large margin. For instance, within Inception-V3 the difference between conv1_3 and all other layers is more than 20%. Similarly, this figure is about 10% lower for conv1_2 of VGG-19 and VGG-16 with respect to the other layers of those networks. This margin is less pronounced among the convolutional layers of VGG-16-3C, perhaps because conv1_4 of this network considers a minimum spatial neighbourhood.
This observed pattern suggests a hypothesis regarding the underlying strategy of these networks: in order to achieve contrast invariance, the lower convolutional layers (before the first max-pooling) mitigate the variation of contrast in the input image. Given the fact that the architecture of DNNs is feedforward, a consequence of this hypothesis would be that higher layers become less sensitive to contrast. We observe exactly this consequence in Table 1: the highest convolutional layer we studied (i.e. the fifth) always obtains the largest value of identical kernels within each of these four networks. This is a coherent strategy: contrast is a low-level feature and its variation should not be propagated to higher layers of the hierarchy, as their purpose is detecting more generic features.
This pattern is not determined by whether a network predicts an object correctly according to the ground-truth (compare the left and right panels of Figure 4 for VGG-16 and refer to supplementary materials for all other networks). At the same time, although this pattern is not conditional on the confidence of a network, it is more noticeable at higher levels of confidence. For instance, we can observe that the margin between conv1_2 and all other layers of VGG-16 is substantially larger when the confidence of the network is between 80-100% in comparison to 0-20% (compare the purple and blue curves in both panels of Figure 4). This is also the case for conv1_3 of Inception-V3 and conv1_2 of VGG-19 (compare the purple and blue curves in their corresponding panels of Figure 3).
The reason that this pattern is influenced by the confidence of the network and not by its correctness is rather logical. As far as a network is concerned, when it has a high confidence it thinks that its prediction is correct, regardless of what the ground-truth states. The fact that this pattern becomes more pronounced in higher levels of confidence suggests that this strategy (i.e. mitigating the variation of contrast before the first max-pooling) is truly aiding the network to reach a more “correct” decision.
4.3 Contrast comparison
In order to scrutinise our inferences more thoroughly, we analysed the proportion of identical kernels in a pair-wise comparison between the 100% level of contrast and all other levels of contrast. We have reported this analysis for two networks in Figure 5 (refer to supplementary materials for the other networks). A value of 1 for the percentage of identical kernels implies that, given the input image at one level of contrast, the most activated kernels at every spatial location remain identical to those at 100% level of contrast (i.e. the activity of a layer is not affected by the contrast of the input image). Contrary to this, as this figure becomes smaller it means that the behaviour of a layer is influenced more by the contrast of the input image. We should expect to observe a sharper decline in the percentage of identical kernels for the layer that we propose mitigates the impact of contrast (i.e. the last convolutional layer before the first max-pooling) and a slower decline for other layers.
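The comparison against the 100% level can be sketched in Python as below. The sketch assumes that, for one layer, the index of the winning kernel at every spatial location has already been flattened into a list per contrast level; the function name and the toy values are hypothetical.

```python
def identical_to_reference(winners_by_level, reference=100):
    """Fraction of winners at each level that match the reference level.

    `winners_by_level` maps a contrast level (in %) to a flat list holding
    the winning kernel index at every spatial location of one layer.
    """
    ref = winners_by_level[reference]
    curve = {}
    for level, winners in winners_by_level.items():
        if len(winners) != len(ref):
            raise ValueError("all levels must cover the same locations")
        matches = sum(w == r for w, r in zip(winners, ref))
        curve[level] = matches / len(ref)
    return curve


# Toy winners for one layer at four locations (illustrative values only).
toy = {100: [3, 1, 4, 1], 50: [3, 1, 4, 2], 10: [0, 1, 2, 2]}
curve = identical_to_reference(toy)
```

For the toy layer the curve falls from 1.0 at 100% to 0.75 at 50% and 0.25 at 10%; a layer that mitigates contrast would be expected to show a steeper fall of this kind than the layers above it.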
Indeed this pattern emerges in Figure 5. For instance, we can observe that in the case of Inception-V3 at 75% level of contrast, all five convolutional layers obtain a high percentage of identical kernels. This is natural since the difference between an image at 80% and 100% levels of contrast is minimal, therefore the activity of kernels remains similar. As the level of contrast is reduced, this figure drops more sharply for conv1_3 (the yellow curve) in comparison to all the others. This corroborates our proposal that this layer acts as a mitigator of contrast in input images. By comparison, the last convolutional layer we studied, conv2_2, has a completely flat line (the purple curve). Interestingly, this invariance is also true for conv1_2 (the red curve) that at the start has a lower percentage of identical kernels in comparison to conv1_3. Let us remind ourselves that the classification accuracy of Inception-V3 does not decrease substantially down to 10% level of contrast (Figure 2). At this level of contrast more than two thirds of the winner kernels undergo a change in layer conv1_3. This suggests a possible explanation for why the network can still classify objects correctly in low contrast images.
We can observe a similar set of patterns for all the other contrast invariant networks. For instance, in the case of VGG-19, the percentage of the most activated kernels also drops more sharply for the last convolutional layer before the first max-pooling, conv1_2 (refer to the red curve on the right panel of Figure 5). This is evident to a greater extent down to 30% level of contrast, up to which the classification accuracy of the network remains intact (Figure 2), and it is less pronounced at very low levels of contrast (i.e. less than 5%). However, since the classification accuracy of VGG-19 is very low at those levels of contrast, comparison across layers is not really informative.
In this article we studied the behaviour of eight prominent DNNs under the variation of contrast in input images. We conducted our investigation on a benchmark image dataset of object recognition (ImageNet). We computed the classification accuracy of each network under eleven levels of contrast. The results of our experiments demonstrate that half of the studied networks (e.g. Inception) exhibit the desired contrast invariant feature, similar to what has been reported previously for humans geirhos2017comparing . The performance of other networks (e.g. GoogLeNet) drops sharply as the contrast of the input image is reduced. The only evident pattern to distinguish the two regimes is whether their architecture consists of more than one convolutional layer prior to the first max-pooling.
We further investigated the underlying mechanisms of this feature by analysing the most activated kernels of the first five convolutional layers. We measured the portion of the winner kernels that change as contrast of the input stimuli varies. The results of our analysis reveal that in case of the contrast invariant networks the last convolutional layer before the first max-pooling always shows the greatest alteration to contrast of the input image, while the smallest changes belong to the fifth convolutional layer (the highest layer studied in our experiment). This suggests that one layer at lower hierarchy absorbs the variation of contrast in the input images, therefore preventing its propagation to the output of the network.
One major purpose of contrast gain control mechanisms in the primary visual cortex (V1) is to construct a normalised response independent of stimulus contrast wilson2014configural . The exact mechanisms and operations involved in achieving this feature remain to be discovered. However, in general we know that the visual response properties of cortical neurons arise to a great extent as a consequence of the organisation and function of their connections with other neurons callaway2004cell . Therefore, it can be hypothesised that neurons in higher cortical areas, responsible for detecting objects, are largely invariant to contrast as a result of pooling from a different set of winner neurons. In other words, this feature is not grounded in an intrinsic operation of theirs, but rather arises from an extrinsic stream of data from lower cortical areas.
Whether this is true for biological neural networks remains to be examined; however, we observed this strategy in artificial DNNs. A sequence of convolutional layers increases the spatial region over which visual information is integrated, while pooling reduces the dimensionality of its representation. Contrast is the standard deviation of intensity relative to the mean of a region. Accordingly, those networks with more than one convolutional layer before the first max-pooling ground their contrast representation on a richer set of details. In other words, this strategy allows those networks to encode the contrast of the training images (environment) more precisely. Interestingly, V1 contrast adaptation mechanisms have been reported to match the statistics of the environment (natural images) mante2005independence .
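The definition of contrast used above (standard deviation of intensity relative to the mean of a region) can be written as a short Python function. This is a sketch under assumptions: the population standard deviation is used, the region is a list of rows of non-negative intensities, and the function name is hypothetical.

```python
from statistics import mean, pstdev


def rms_contrast(region):
    """Standard deviation of intensity divided by the mean intensity.

    `region` is a list of rows of pixel intensities. The population
    standard deviation is an assumption of this sketch.
    """
    values = [pixel for row in region for pixel in row]
    mu = mean(values)
    if mu == 0:
        raise ValueError("contrast is undefined for a region with zero mean")
    return pstdev(values) / mu


# A uniform patch has no contrast; a patch with spread intensities does.
flat = rms_contrast([[0.4, 0.4], [0.4, 0.4]])
varied = rms_contrast([[0.25, 0.75]])
```

Because the measure depends on the region over which the mean and the deviation are taken, a larger integration region gives access to more of the intensity distribution, which is the intuition behind the argument above.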
This project was funded by the Deutsche Forschungsgemeinschaft SFB/TRR 135. A preliminary version of our experiments was presented in abstract form at the European Conference on Visual Perception (ECVP) 2018 arashkarlecvp2018 .
-  Arash Akbarinia and Karl R. Gegenfurtner. Contrast invariance in deep neural networks (DNN). In Perception, volume 47, 2018.
-  Arash Akbarinia and C. Alejandro Parraga. Feedback and surround modulated boundary detection. International Journal of Computer Vision (IJCV), Jul 2017.
-  Arash Akbarinia and C Alejandro Parraga. Colour constancy beyond the classical receptive field. IEEE transactions on pattern analysis and machine intelligence (TPAMI), 40(9):2081–2094, 2018.
-  A Angelucci and S Shushruth. Beyond the classical receptive field: Surround modulation in primary visual cortex. The new visual neurosciences, pages 425–444, 2013.
-  EM Callaway. Cell types and local circuits in primary visual cortex of the macaque monkey. The visual neurosciences, 1:680–694, 2004.
-  Matteo Carandini, David J Heeger, and J Anthony Movshon. Linearity and normalization in simple cells of the macaque primary visual cortex. Journal of Neuroscience, 17(21):8621–8644, 1997.
-  Radoslaw Martin Cichy, Aditya Khosla, Dimitrios Pantazis, Antonio Torralba, and Aude Oliva. Comparison of deep neural networks to spatio-temporal cortical dynamics of human visual object recognition reveals hierarchical correspondence. Scientific reports, 6:27755, 2016.
-  Rodney Douglas and Kevan Martin. Neocortex. 1998.
-  Robert Geirhos, David HJ Janssen, Heiko H Schütt, Jonas Rauber, Matthias Bethge, and Felix A Wichmann. Comparing deep neural networks against humans: object recognition when the signal gets weaker. arXiv preprint arXiv:1706.06969, 2017.
-  Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
-  Yihui He, Xiangyu Zhang, and Jian Sun. Channel pruning for accelerating very deep neural networks. In International Conference on Computer Vision (ICCV), volume 2, page 6, 2017.
-  David J Heeger. Normalization of cell responses in cat striate cortex. Visual neuroscience, 9(2):181–197, 1992.
-  David H Hubel and Torsten N Wiesel. Receptive fields, binocular interaction and functional architecture in the cat’s visual cortex. The Journal of physiology, 160(1):106–154, 1962.
-  Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.
-  Jonas Kubilius, Stefania Bracci, and Hans P Op de Beeck. Deep neural networks as a computational model for human shape sensitivity. PLoS computational biology, 12(4):e1004896, 2016.
-  Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436, 2015.
-  Valerio Mante, Robert A Frazor, Vincent Bonin, Wilson S Geisler, and Matteo Carandini. Independence of luminance and contrast in natural scenes and in the early visual system. Nature neuroscience, 8(12):1690, 2005.
-  Dale Purves and R Beau Lotto. Why we see what we do redux: A wholly empirical theory of vision. Sinauer Associates, 2011.
-  Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. Imagenet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.
-  S Shushruth, Jennifer M Ichida, Jonathan B Levitt, and Alessandra Angelucci. Comparison of spatial summation properties of neurons in macaque V1 and V2. Journal of neurophysiology, 102(4):2069–2083, 2009.
-  Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
-  Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, Andrew Rabinovich, et al. Going deeper with convolutions. CVPR, 2015.
-  Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2818–2826, 2016.
-  Hugh R Wilson, Frances Wilkinson, JS Werner, and L Chalupa. Configural pooling in the ventral pathway. The new visual neurosciences, pages 617–626, 2014.