Manifestation of Image Contrast in Deep Networks

02/12/2019, by Arash Akbarinia, et al.

Contrast is subject to dramatic changes across the visual field, depending on the source of light and the configuration of the scene. Hence, the human visual system has evolved to be more sensitive to contrast than to absolute luminance. This feature is equally desirable for machine vision: the ability to recognise patterns even when aspects of them are transformed due to variation in local and global contrast. In this work, we thoroughly investigate the impact of image contrast on prominent deep convolutional networks, both during the training and the testing phase. The results of the conducted experiments testify to an evident deterioration in the accuracy of all state-of-the-art networks on low-contrast images. We demonstrate that "contrast-augmentation" is a sufficient condition to endow a network with invariance to contrast. This practice shows no negative side effects; quite the contrary, it might also prevent a model from other illuminance-related over-fitting. This ability can also be achieved by a short fine-tuning procedure, which opens new lines of investigation on the mechanisms involved in two networks whose weights are over 99.9% correlated yet behave utterly differently at low contrast. Further analysis suggests that the optimisation algorithm is an influential factor, however with a significantly lower effect; and while the choice of architecture manifests a negligible impact on this phenomenon, the first layers appear to be more critical.


1 Introduction

In visual perception, contrast is defined as the difference in brightness between an object and its surroundings. This is primarily determined by the source of light: the illuminance of bright sunlight is larger than that of starlight by a factor of one billion [36]. Consequently, contrast is far higher in a well-illuminated scene than in a dim environment. In addition to this global factor, the local configuration of a scene's constituents causes considerable variation in the contrast of the objects present, by creating shadows and reflections [9].

Figure 1: A sample synthetic image, shown at 100% and 40% contrast, used to test autonomous driving. The system exhibits erroneous behaviour at 40% image contrast, as depicted by the red arrow. Image credit: https://deeplearningtest.github.io/deepTest/.

Although we all prefer a colourful day, our visual perception is not impaired by faint contrast. This is expected from an artificially intelligent machine as well. Autonomous cars have been reported to cause fatal accidents in low-contrast visibility conditions [41] (see Figure 1). This emphasises the importance of designing contrast-invariant machine vision. Previous works in the literature have reported that the accuracy of the most prominent deep networks at object classification falls noticeably on low-contrast images [10, 8, 2]. Therefore, further investigation is required to address at least two fundamental research questions around this topic:

  1. How can a contrast-invariant network be achieved? What is the most influential factor: the architecture of a model, the training procedure, or a combination of both?

  2. What sort of mechanism does a network learn in order to prevent variation of contrast in the input image from propagating to its output? Is it a specific kernel, a layer, or an interaction among them?

Although both questions are equally important, it is worth acknowledging that a resolution to the former does not necessarily reveal an explanation for the latter. For instance, networks that have undergone a specific training procedure may produce the same set of outputs across different levels of contrast; nevertheless, this leaves us with little insight about the exact mechanism, in terms of neuronal operations, that leads to this feature.

Preceding related investigations have mainly focused on evaluating the performance of deep neural networks (DNNs) exposed to poor visual information. One study measured the classification accuracy of four networks on the ImageNet data set under five image distortions, including contrast reduction [8]. Similarly, [10] compared three DNNs to human observers in a psychophysical experiment in which the quality of images was degraded. The results of both studies indicate that while some networks perform substantially better than others, all decay on low-contrast images (i.e. around the 20% level of contrast).

Hitherto, the second question has received far less attention. It was addressed in a recent study [2] that analysed the activation maps of kernels in eight DNNs. Their findings suggest that the relative position of the first max-pooling is an influential factor for a network to accomplish invariance to contrast. Their methodology is in harmony with the rationale of dissecting a network into human-interpretable concepts [5]. However, the networks examined by [2] were state-of-the-art models that had been trained with distinct procedures. Therefore, their findings might be inconclusive in disentangling the representation of contrast in a neural network.

In this article we attempt to contribute to both questions. The principal capacity that deep learning has provided for computational models is to learn intricate patterns from large data sets [26]; therefore, data augmentation has been actively used as a tool to increase robustness to common image transformations and to avoid over-fitting [24, 37]. Along this line, we investigated the impact of contrast-augmentation on training and fine-tuning DNNs. The results of our experiments show that this is an effective approach to make networks invariant to image contrast, even at extremely low levels, with no side effects.

Comprehending feature representation and its disentanglement is of great interest to the research community [6]. Consequently, in order to gain insights about the features learnt by DNNs, a large body of literature has developed techniques to visualise their internal units (e.g. [42, 28]). Others have proposed a binary segmentation task to study every neuron of a network [5]. These methods are primarily applicable to high-level perceptual concepts (e.g. objects) or tangible low-level features (e.g. colours). Contrast, on the other hand, is a fundamental visual feature of natural scenes, independent of luminance [29]; thus, it is cumbersome to decipher its role through the above-stated approaches. Instead, we tackled the second question from two other angles:

  • We trained various networks under a completely controlled procedure, forcing all variables to be identical except the one singled out in each experiment. This allows us to identify the influence of each factor. Additionally, we compared the learnt weights in a pair of twin networks to localise the notion of contrast in an architecture. Previous studies suggest a concept is often encoded by a combination of several neurons [1, 11].

  • We analysed all neurons of the same network under multiple levels of contrast, comparing the activation of kernels between inputs correctly classified at lower image contrast and misclassified instances. This is a common practice in neuroscience due to the limited access to a large set of brains [21]. The hypothesis is that if a specific kernel or layer represents the concept of contrast, this would be revealed by comparing the activity of the same network in successful and failure trials.

Our analyses suggest that the first few layers of a network are the protagonists of the mechanism that achieves invariance to image contrast. However, we did not discover any individual kernel, layer, or combination of both capable of explaining how two networks whose weights are 99.9% correlated produce utterly different results.

2 Methodology

2.1 Setup

2.1.1 Data set

We conducted our experiments on the ImageNet data set [24], which is a collection of one thousand object categories. The training-set contains 1.3 million images (i.e. 1300 per category) and the validation-set consists of 50 thousand images (i.e. 50 per category).

2.1.2 Networks

We studied thirteen prominent state-of-the-art networks: VGG16 [37], VGG19 [37], ResNet50 [13], InceptionV3 [40], Xception [7], InceptionResNetV2 [38], MobileNet [15], MobileNetV2 [35], DenseNet121 [16], DenseNet169 [16], DenseNet201 [16], NASNetMobile [43], and NASNetLarge [43].

The weights of all these models were obtained from the Keras platform (https://keras.io/applications/). It is worth emphasising that each of the aforementioned networks has been trained on the ImageNet data set; however, each pretrained network has experienced a different procedure, such as in the choice of optimiser, number of epochs, or types of image augmentation.

2.1.3 Contrast manipulation

There are many possible formulations to change the contrast of an image. One approach would be to blend the input image with a grey image [8]. Another is to modulate the Michelson contrast [30]. We opted for the latter through this equation:

\tilde{I}(x, y) = c \cdot I(x, y) + \frac{1 - c}{2},   (1)

where I is the input image (with intensities normalised to [0, 1]), (x, y) are pixel coordinates and c ∈ (0, 1] is the contrast level. In this way, our findings are comparable to previous studies on this topic [10, 2].
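A minimal NumPy sketch of this manipulation is given below, assuming images normalised to [0, 1] and the reconstruction of Eq. 1 shown above; the function name is illustrative and not taken from the paper's code.

```python
import numpy as np

def adjust_contrast(image, contrast_level):
    """Rescale an image around mid-grey according to Eq. 1 (a sketch).

    `image` is assumed to be a float array in [0, 1] and `contrast_level`
    a scalar in (0, 1], where 1.0 leaves the image intact.
    """
    assert 0.0 < contrast_level <= 1.0
    return (1.0 - contrast_level) / 2.0 + image * contrast_level

# Example: reduce a (stand-in) image to 15% of its original contrast.
rng = np.random.default_rng(0)
img = rng.random((224, 224, 3))
low_contrast = adjust_contrast(img, 0.15)
```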

2.2 Experiments

2.2.1 Evaluating pretrained networks

We evaluated the classification accuracy of each of those thirteen prominent pretrained networks on the validation-set of the ImageNet data set under seven levels of contrast, namely c ∈ {1, 5, 15, 30, 50, 75, 100}% in Eq. 1. See Figure 2 for an exemplary input image.

Figure 2: An exemplary image at 5%, 15%, 50% and 100% levels of contrast.

Prior to feeding a network with an image (a minimal sketch of this pipeline follows the list):

  • First, it was resized such that its smaller edge matched the input size of the respective network and the central square was cropped accordingly (i.e. NASNetLarge receives images of size 331×331; Xception, InceptionV3, and InceptionResNetV2 299×299; and all the rest 224×224).

  • Second, it was preprocessed with the same function utilised during the original training procedure of that network.
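The sketch below illustrates this evaluation pipeline for a single image and a single contrast level. The smaller-edge resize, centre crop and per-network `preprocess_input` follow the description above; the file name and the choice of ResNet50 are merely illustrative.

```python
import numpy as np
from PIL import Image
from tensorflow.keras.applications import resnet50

def load_and_crop(path, target_size=224):
    """Resize so the smaller edge matches `target_size`, then centre-crop
    a square of that size (a sketch of the evaluation preprocessing)."""
    img = Image.open(path).convert('RGB')
    w, h = img.size
    scale = target_size / min(w, h)
    img = img.resize((round(w * scale), round(h * scale)), Image.BILINEAR)
    w, h = img.size
    left, top = (w - target_size) // 2, (h - target_size) // 2
    img = img.crop((left, top, left + target_size, top + target_size))
    return np.asarray(img, dtype=np.float32)

# Evaluate a pretrained network at 5% contrast (hypothetical file name).
model = resnet50.ResNet50(weights='imagenet')
c = 0.05
x = load_and_crop('ILSVRC2012_val_00000001.JPEG') / 255.0
x = (1.0 - c) / 2.0 + x * c                    # Eq. 1 on an image in [0, 1]
x = resnet50.preprocess_input(x[np.newaxis] * 255.0)   # network-specific step
predicted_class = np.argmax(model.predict(x), axis=-1)
```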

2.2.2 Training networks from scratch

The experiment described above sheds light on the state-of-the-art in image classification across different levels of contrast. However, each model is of a completely distinct nature (e.g. different architectures, numbers of parameters, types of optimisation, objective functions, training procedures, preprocessing functions, etc.). Therefore, studying pretrained DNNs alone does not allow us to disentangle the impact of each factor on the role of image contrast, and subsequently to discover the mechanism a network has learnt to become invariant to this feature; in other words, to generalise along the contrast dimension.

Accordingly, we studied three of those networks (InceptionV3, ResNet50 and DenseNet201) in greater detail by training various instances of them from scratch under identical conditions. We chose these models because: (i) their architectures consist of a similar number of parameters, (ii) their classification accuracy at 100% image contrast is comparable, and (iii) their accuracy at lower levels of contrast is very different.

We investigated four factors: (i) whether a specific architecture results in invariance to image contrast, (ii) the effect of the relative position of the first max-pooling [2], (iii) the choice of the optimisation algorithm, and (iv) exposure to multiple levels of contrast during the training phase.

We trained each instance of these networks for ten epochs with a batch size of 32 on a single GPU. In this experiment, we excluded all standard “augmentation” (e.g. flipping, random cropping, scaling, etc.) from our training phase, since the random nature of these procedures makes the comparison more complicated and less accurate.

2.2.3 Fine-tuning pretrained networks

We selected five of those pretrained networks, with the criterion of covering a range of performances proportional to their number of parameters (i.e. MobileNetV2, ResNet50, VGG16, NASNetMobile, and InceptionV3), and fine-tuned their weights with “contrast-augmentation”. This procedure is not augmentation in the sense of increasing the number of training images; it merely refers to manipulating the image contrast during training with a random contrast level in the range of 1 to 100% according to Eq. 1. Therefore, each network is exposed to essentially the same number of training images at each epoch.

Irrespective of the optimisation configurations of a network at its original training procedure, we fine-tuned all instances with an Adam optimiser [22] and a categorical cross-entropy objective function. We set the learning rate and decay parameters both equal to for all our experiments. This fine-tuning consists of five epochs of retraining with “contrast-augmentation” with a batch size of 32 on a single GPU. During the fine-tuning, we included the following standard “augmentation” procedures: i.e. random horizontal flipping, zooming (within a 20% scale), and shifting (within a 20% range).
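A hedged sketch of how such contrast-augmented fine-tuning could be set up in Keras follows. The learning rate, the data directory and the use of ResNet50 are placeholders; only the random contrast range, the additional augmentations, the Adam optimiser, the cross-entropy loss, the batch size and the number of epochs are taken from the description above.

```python
import numpy as np
import tensorflow as tf

def contrast_augment(image):
    """Random contrast in the 1-100% range (Eq. 1), followed by the
    network's own preprocessing; `image` arrives here as float in [0, 255]."""
    c = np.random.uniform(0.01, 1.0)
    image = (1.0 - c) * 127.5 + image * c
    return tf.keras.applications.resnet50.preprocess_input(image)

model = tf.keras.applications.ResNet50(weights='imagenet')
# The learning rate below is a placeholder, not the paper's exact value.
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-5),
              loss='categorical_crossentropy', metrics=['accuracy'])

datagen = tf.keras.preprocessing.image.ImageDataGenerator(
    preprocessing_function=contrast_augment,
    horizontal_flip=True, zoom_range=0.2,
    width_shift_range=0.2, height_shift_range=0.2)
flow = datagen.flow_from_directory('/path/to/imagenet/train',   # hypothetical
                                   target_size=(224, 224),
                                   batch_size=32, class_mode='categorical')
model.fit(flow, epochs=5)
```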

3 Results

Source code and experimental materials are available at https://goo.gl/GkdZQt.

3.1 Pretrained networks

The top-1 classification accuracy of all examined networks on the validation-set of ImageNet at different levels of contrast is illustrated in Figure 9. It is important to emphasise that each curve is divided by its value at 100% image contrast to facilitate the comparison. The objective of this experiment is not to evaluate the absolute accuracy of each network, but rather to investigate which networks can retain their performance at lower contrast levels. The absolute accuracies (top-1 and top-5) of the fine-tuned networks are reported in Table 1.

Figure 3: The top-1 classification accuracy of various networks on the validation-set of ImageNet data set. The curves are normalised to perfect accuracy on 100% level of contrast. On the left panel: those curves with a triangle shape have gone through contrast-augmented fine-tuning, initialised with the weights of the square shaped curves. On the right panel: all the rest of the pretrained models obtained from Keras platform. Interested readers are encouraged to refer to supplementary materials for a similar figure of top-5 classification accuracy.

It is evident from Figure 9 that all networks retain their peak performance down to the 75% level of contrast. Below that, the accuracy of ResNet50 and MobileNetV2 deteriorates at a sharper rate (note the red and magenta curves at the bottom of the left panel), followed by VGG16 and VGG19 (note the purple curves in the middle of both panels). The others perform very well down to the 30% level of contrast; however, they experience a large drop by the time image contrast reaches 15 and 5%. Finally, at the 1% level of contrast all pretrained networks are at chance level.

Contrary to this, all five contrast-augmented networks retain almost perfectly their peak accuracy down to 5% image contrast, and at the 1% level of contrast they score, on average, about 70% of their original performance (note the triangle-shaped curves at the top of the left panel in Figure 9). The difference among fine-tuned networks at 1% image contrast could be due to the number of parameters a model consists of (i.e. MobileNetV2 and NASNetMobile are in the order of five million parameters, substantially fewer than the others).

                           Top-1                            Top-5
Contrast level             1%    5%    15%   50%   100%     1%    5%    15%   50%   100%
InceptionV3   Original     0.00  0.49  0.74  0.77  0.77     0.17  0.72  0.91  0.93  0.94
              Fine-tuned   0.60  0.75  0.77  0.77  0.77     0.82  0.93  0.94  0.94  0.94
ResNet50      Original     0.00  0.07  0.40  0.69  0.74     0.01  0.17  0.65  0.89  0.92
              Fine-tuned   0.52  0.71  0.74  0.74  0.74     0.76  0.90  0.92  0.93  0.92
VGG16         Original     0.01  0.15  0.53  0.69  0.70     0.03  0.36  0.78  0.89  0.89
              Fine-tuned   0.50  0.65  0.70  0.71  0.71     0.70  0.87  0.90  0.90  0.90
NASNetMobile  Original     0.00  0.27  0.66  0.73  0.73     0.08  0.48  0.86  0.91  0.91
              Fine-tuned   0.45  0.70  0.71  0.73  0.73     0.59  0.88  0.90  0.91  0.91
MobileNetV2   Original     0.00  0.05  0.38  0.68  0.71     0.08  0.12  0.64  0.88  0.90
              Fine-tuned   0.40  0.66  0.69  0.71  0.71     0.55  0.88  0.90  0.90  0.90
Table 1: The classification accuracy of various networks compared with their fine-tuned, contrast-augmented offspring.

3.2 Networks trained from scratch

The top-1 classification accuracy of the networks trained from scratch on the validation-set of ImageNet at different levels of contrast is illustrated in Figure 16. It is important to emphasise that these curves, similar to those in Figure 9, are normalised to obtain perfect accuracy at full image contrast. Furthermore, let us remind ourselves that these networks have been trained for only ten epochs with no image augmentation. This explains why their overall performance is lower in comparison to the pretrained networks reported in Figure 9.

In order to accurately identify the factors influencing invariance to image contrast, we controlled every aspect of the training procedure to make it as identical as possible for all networks: e.g. preprocessing functions, weights initialisation, no shuffling in the order of images, batch size, and input size. No image augmentation was used, except for those instances labelled as “contrast-augmented”, in which the contrast of training images was randomly adjusted within the range of 1 to 100%.

We trained an instance of each architecture with two types of optimisation algorithm: Adam (learning rate set to and decay is equal to ) and SGD (learning rate set to and decay is equal to ). We used categorical cross-entropy as the objective function for all our experiments.
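As a sketch of this controlled setting, the snippet below instantiates the same architecture twice, differing only in the attached optimiser; the seed and learning rates are illustrative placeholders, not the values used in the paper.

```python
import tensorflow as tf

def build_instance(optimizer):
    """Instantiate the same architecture from identical random weights and
    attach a different optimiser, keeping every other setting fixed."""
    tf.random.set_seed(42)                     # identical weight initialisation
    model = tf.keras.applications.InceptionV3(weights=None, classes=1000)
    model.compile(optimizer=optimizer,
                  loss='categorical_crossentropy', metrics=['accuracy'])
    return model

# Learning rates below are placeholders, not the paper's values.
adam_twin = build_instance(tf.keras.optimizers.Adam(learning_rate=1e-3))
sgd_twin = build_instance(tf.keras.optimizers.SGD(learning_rate=1e-2))
```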

Within each class of architecture, all instances have approximately the same number of parameters. The “Area1” label refers to the distribution of convolutional kernels prior to the first max-pooling layer: “Area1_1” means there is one convolutional layer before the first max-pooling, while “Area1_2” and “Area1_3” refer to two and three convolutional layers, respectively.

There are three phenomena emerging from Figure 16 that are worth highlighting:

  1. All contrast-augmented networks retain their peak performance perfectly down to 5% image contrast; and at the 1% level of contrast they obtain, on average, half of their accuracy at full contrast (see the triangle-shaped curves at the top of the figure).

  2. Among the rest, those trained with the Adam optimiser perform better at lower levels of contrast than their identical twins trained with SGD. This is true irrespective of the network architecture (compare the square-shaped curves to the circle-shaped ones).

  3. Overall, the architecture of a network appears to be of minor importance, since InceptionV3, ResNet50 and DenseNet201 produce comparable results under identical conditions. However, a subtle pattern appears within each family of networks: those with multiple convolutional layers prior to the first max-pooling perform slightly better than their counterparts with a single convolutional layer [2] (e.g. compare the dark and light squares or circles).

Figure 4: The top-1 classification accuracy of various networks on the validation-set of ImageNet. The curves are normalised to perfect accuracy at the 100% level of contrast. Each legend starts with two abbreviations: the first corresponds to the network (I: InceptionV3, blue; R: ResNet50, red; D: DenseNet201, green) and the second to the optimiser (A: Adam, square; S: SGD, circle). Those with a triangle shape have gone through a contrast-augmented training procedure. Interested readers are encouraged to refer to the supplementary materials for top-5 classification accuracy.

4 Discussion

4.1 The how question

The results of the fine-tuning experiment reported in Figure 9 and Table 1 indicate three significant phenomena:

  1. Pretrained state-of-the-art networks, irrespective of their architecture or original training procedure, can become almost perfectly invariant to image contrast with a few epochs of contrast-augmented fine-tuning (compare the triangle-shaped curves in the left panel of Figure 9 to the square-shaped ones).

    Most networks experience a dramatic change in their performance across multiple levels of contrast. For instance, the original ResNet50 barely retains 10% of its peak performance at 5% image contrast, whereas its fine-tuned offspring maintains 96% of its accuracy under the same condition (see Figure 5 for an example).

  2. An obvious objection could be raised that contrast-augmented fine-tuning would harm the absolute accuracy of a network at the 100% level of contrast. The verdict is no, as is evident in Table 1. All fine-tuned networks match or exceed their original accuracy on full-contrast images, up to the second decimal point (for both top-1 and top-5 measures).

    At the same time, the gain is striking for all networks at many lower image contrasts (compare the corresponding rows in Table 1). To name one example, the original ResNet50 obtains 0 and 7% classification accuracy at contrast levels of 1 and 5%, while the fine-tuned version scores 52 and 71%, respectively.

    Figure 5: Two examples in which, at 100% image contrast, both the original ResNet50 and its fine-tuned offspring correctly classify the present objects, whereas at a lower level of contrast (15% for the first example and 50% for the second) the original model fails while the fine-tuned version succeeds.

  3. This boost in the performance of object classification for images of reduced contrast appears to be little influenced by the original accuracy of a network in those conditions. For instance, the original ResNet50 performs extremely poorly at low levels of contrast, while InceptionV3 performs relatively well in those conditions. Nevertheless, the corresponding contrast-augmented versions of both obtain close to perfect accuracy down to the 5% level of contrast.

The results of training networks under identical conditions (Figure 16) support these findings:

  1. Exposure to multiple image contrasts during the training phase is the crucial element.

  2. The choice of optimisation is a secondary factor, which also technically belongs to the training procedure.

  3. The overall impact of the network architecture is diminutive, although the lower layers appear somewhat pivotal.

These findings suggest that exposure to multiple levels of contrast allows a network to learn a set of parameters that accomplishes object classification invariant to image contrast, which can change dramatically at any time [9]. This has potential implications for distinct vision-related disciplines:

  • From a machine vision perspective, one interpretation could be that a network consisting of millions of parameters is probably capable of encoding many more features within the same architecture [27, 23]. Alternatively, looking at this inversely suggests that it should be possible to achieve the same level of accuracy at a single level of image contrast with smaller networks [19, 33].

  • From a visual perception perspective, this empowers the empirical theory of vision, inference through successful behaviour [32], by providing more evidence that past experience is indeed an essential part of a visually intelligent system.

4.1.1 The amount of training required

Analysing the evolution of the fine-tuning procedure in greater detail (see Figure 15 for ResNet50) suggests that the first epoch is essentially the most influential one, while epochs two to five allow the network to adjust its parameters more adequately to boost the performance at very low levels of image contrast (i.e. 1 and 5%).

Figure 6: The top-1 classification accuracy of fine-tuned ResNet50 in various epochs on the validation-set of ImageNet data set.

4.1.2 Degradation in other visual information

It can reasonably be objected that contrast-augmentation could cause undesired side effects on the robustness of a network to the degradation of other visual features. A contrary school of thought could argue that exposure to random image contrast compels the network to rely on a more abstract representation, and consequently to be less impaired by other image transformations. We scrutinised these opposing views by comparing the performance of a network and its fine-tuned version over the validation-set of ImageNet under a diverse set of image manipulations (refer to the supplementary materials; a sketch of these manipulations follows the list):

  • Gaussian blurring with square windows of side 3, 9 and 27 pixels. We did not observe any systematic evidence for either side of the argument; the classification accuracy of the original networks and their fine-tuned versions was essentially on par.

  • Salt & pepper noise of 1, 5 and 10%, or uniformly distributed noise. Similar verdict as for Gaussian blurring.

  • Gamma correction. Fine-tuned networks performed substantially better (maximally in the order of 20%) in the case of gamma compression, while performing slightly worse (maximally in the order of 2%) in the case of gamma expansion. This suggests that the linear contrast-augmentation has a positive effect on nonlinear contrast manipulation as well.

  • Varying illumination conditions by multiplying two colour channels at a time by a constant of value 0.25, 0.50, or 0.75 (i.e. making the image reddish, greenish, or bluish, respectively). We observed that the fine-tuned networks perform better in this experiment. This is in line with the previously reported importance of contrast in computational colour constancy [4].
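The sketch below illustrates these manipulations; the exact parameterisations (e.g. the mapping of a window size to a Gaussian sigma) are assumptions rather than the paper's implementation, and all functions expect images in [0, 1].

```python
import numpy as np
from scipy import ndimage

def gaussian_blur(img, window=9):
    """Blur with a Gaussian whose support roughly matches a `window`-pixel
    square (the sigma mapping is an assumption)."""
    sigma = window / 6.0
    return ndimage.gaussian_filter(img, sigma=(sigma, sigma, 0))

def salt_and_pepper(img, amount=0.05):
    """Set a random `amount` fraction of pixels to black or white."""
    out = img.copy()
    mask = np.random.rand(*img.shape[:2]) < amount
    out[mask] = np.random.choice([0.0, 1.0], size=(mask.sum(), 1))
    return out

def gamma_correction(img, gamma=2.0):
    """Nonlinear contrast manipulation: gamma < 1 compresses, > 1 expands."""
    return img ** gamma

def channel_scaling(img, factor=0.5, keep=0):
    """Multiply the two channels other than `keep` by `factor`
    (e.g. keep=0 with factor 0.5 yields a reddish image)."""
    out = img * factor
    out[..., keep] = img[..., keep]
    return out
```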

The results of these experiments suggest that contrast-augmentation could potentially make DNNs more robust towards other changes that occur in the illumination of a scene. This should cause no surprise, as contrast-aware non-learning algorithms have been incorporated into a wide range of computer vision applications with encouraging results (e.g. [3, 20, 31]).

4.2 The what question

4.2.1 Twin networks comparison

In order to better grasp what is internally altered that allows a model to become invariant to image contrast, we first compared the raw weights of the original networks to those of their fine-tuned offspring (see Table 2):

  • The raw weights are more than 99.9% linearly correlated, with a negligible variation across layers.

  • The absolute difference between raw weights is in the order of . Given that the average range of weights in these networks is typically in the order of , this implies that the absolute difference is merely 1%.

  • There is more than 99.9% mutual dependence between the raw weights of the original and fine-tuned networks.

PCC Difference NMI
InceptionV3
ResNet50
VGG16
NASNetMobile
MobileNetV2
Table 2: The comparison of the raw weights of an original network to those of its fine-tuned offspring. Metrics from left to right: Pearson correlation coefficient (PCC), the absolute difference, and normalised mutual information (NMI). Values are averaged over all layers of a network. Standard deviations are not shown for PCC and NMI since they were 0 up to the third decimal.

Inspecting the absolute difference and mutual information at each layer separately exhibits a common tendency among all examined networks: the first convolutional layers have a smaller mutual information and a larger absolute difference, and this relation inverts itself as we progress towards the last convolutional layer (see Figure 7 for the absolute difference and refer to the supplementary materials for the mutual information). This suggests that the biggest difference between a network and its fine-tuned offspring lies in the first few layers. This phenomenon sounds logical: given that contrast is a low-level visual feature, failing to account for its variation at the start of a feed-forward model would propagate an ever larger impact to higher layers [26]. This is in accordance with the reported contrast-invariant cells of lower cortical areas in biological vision [18].
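The per-layer metrics underlying Table 2 and Figure 7 could be computed along the following lines; this is a sketch, and the binning used for mutual information is our assumption.

```python
import numpy as np
import tensorflow as tf
from scipy.stats import pearsonr
from sklearn.metrics import normalized_mutual_info_score

def compare_layer_weights(original, finetuned, bins=64):
    """Per-layer comparison of the raw kernel weights of a network and its
    fine-tuned offspring: Pearson correlation, mean absolute difference and
    normalised mutual information over discretised weights."""
    report = {}
    for lo, lf in zip(original.layers, finetuned.layers):
        if not isinstance(lo, tf.keras.layers.Conv2D):
            continue
        wo = lo.get_weights()[0].ravel()   # kernel weights only
        wf = lf.get_weights()[0].ravel()
        pcc, _ = pearsonr(wo, wf)
        diff = np.mean(np.abs(wo - wf))
        edges = np.histogram_bin_edges(np.concatenate([wo, wf]), bins=bins)
        nmi = normalized_mutual_info_score(np.digitize(wo, edges),
                                           np.digitize(wf, edges))
        report[lo.name] = {'pcc': pcc, 'diff': diff, 'nmi': nmi}
    return report
```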

Figure 7: The absolute difference of kernel weights between an original network and its fine-tuned offspring.

In the analysis discussed above, the two networks are compared at corresponding layers, each represented through the average of all its constituent kernels. This assumes that a particular layer entirely captures the contrast of an image. A more realistic alternative could hypothesise that the differences occur at a finer level, among individual kernels, whose impact vanishes in the analysis above due to the limitations of an average representation.

One approach to highlighting true differences between the kernels of both networks is to apply nominal thresholds, similar to the idea of binary classification [5]. We examined this by counting the number of kernel pairs whose correlation falls outside of the standard variation. No systematic pattern emerged that was generic to all networks; however, we discovered discernible differences within each network. For instance, all “branch_2c” convolutional layers of ResNet50 stand out, with many constituent kernels having a low correlation between the original network and its fine-tuned version (see the blue bars in Figure 8). No other layer holds a single kernel meeting this criterion. It is worth reminding ourselves that the average correlation among all kernels is far higher.
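A sketch of such a threshold-based count over corresponding kernels is given below; the threshold value is a placeholder for the criterion used in the paper.

```python
import numpy as np
import tensorflow as tf
from scipy.stats import pearsonr

def count_uncorrelated_kernels(original, finetuned, threshold=0.5):
    """For every convolutional layer, count kernel pairs whose weights
    correlate below `threshold` between the two twin networks."""
    counts = {}
    for lo, lf in zip(original.layers, finetuned.layers):
        if not isinstance(lo, tf.keras.layers.Conv2D):
            continue
        ko = lo.get_weights()[0]           # shape (kh, kw, in, out)
        kf = lf.get_weights()[0]
        n = 0
        for i in range(ko.shape[-1]):      # one kernel per output channel
            r, _ = pearsonr(ko[..., i].ravel(), kf[..., i].ravel())
            n += int(r < threshold)
        counts[lo.name] = n
    return counts
```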

Figure 8: The comparison of the original and fine-tuned ResNet50. Blue bars refer to the total number of weakly correlated kernel weights between the two networks. Red bars refer to the total number of images in the validation-set of ImageNet with weakly correlated kernel activations between the two networks.

Another approach to comparing a pair of twin networks is to first feed each of them with an image, in order to compute the activation maps of all kernels, and then compare those correspondingly. We conducted this experiment for all layers of the original and fine-tuned ResNet50, inputting all images in the validation-set of ImageNet. Representing each layer through its average across all 50K images exhibits an overall tendency between lower and higher layers, similar to the curves in Figure 7. Alternatively, each image can be examined subject to a nominal threshold. We observed interesting correspondences between this analysis and the one for kernel weights (refer to Figure 8): most “branch_2c” convolutional layers stand out as completely different from the other layers (note the red bars in this figure).
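The layer-wise comparison of activation maps could be sketched as below, using an intermediate-output model for each twin; the input image is assumed to be already preprocessed for the networks in question.

```python
import numpy as np
import tensorflow as tf
from scipy.stats import pearsonr

def activation_correlations(original, finetuned, image):
    """Correlate the activation maps of corresponding convolutional layers
    of the two twin networks for a single preprocessed input image."""
    def activations(model):
        outputs = [l.output for l in model.layers
                   if isinstance(l, tf.keras.layers.Conv2D)]
        probe = tf.keras.Model(model.input, outputs)
        return probe.predict(image[np.newaxis])

    conv_layers = [l for l in original.layers
                   if isinstance(l, tf.keras.layers.Conv2D)]
    corr = {}
    for layer, act_o, act_f in zip(conv_layers,
                                   activations(original),
                                   activations(finetuned)):
        r, _ = pearsonr(act_o.ravel(), act_f.ravel())
        corr[layer.name] = r
    return corr
```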

Kernels of “branch_2c” perform convolution only along the dimension of feature maps, without integration over a spatial vicinity (i.e. kernels of size 1×1). This construction is reminiscent of hypercolumns in biological vision [17]. Perhaps these kernels, which expand or contract the dimensionality of feature maps, allow the network to learn an account of image contrast. Therefore, it can be contemplated that future investigations concentrating on these kernels could shed more light on the exact operations that lead to contrast invariance, at least in the case of ResNet50.

4.2.2 Successful versus failure comparison

Ultimately, any attempt to describe the mechanism of contrast invariance in DNNs (or any other feature, for that matter) should be able to explain differences at the level of individual images as well. We inspected all the discovered patterns, in addition to the one proposed in [2], against this criterion: for instance, whether kernels of “branch_2c” in ResNet50 exhibit distinct activation maps for successful and failure trials. We did not spot any systematic differences between the two conditions; in other words, the shape of the red bars in Figure 8 would be almost identical regardless of the network output (refer to the supplementary materials).

5 Conclusion

In this work we enquired into the function of image contrast in deep neural networks (DNNs). We argued that contrast is a pillar of a visual system, as manifested in the evolution of biological vision [29]. We further reasoned that invariance to this feature is a necessary asset for machine vision, evident in the case of autonomous cars [41]. We addressed two research questions: (i) how a contrast-invariant model can be accomplished, and (ii) what mechanism allows a DNN to possess this feature. We approached these questions by conducting experiments on ImageNet, a large visual data set of diverse objects. We thoroughly studied thirteen prominent networks in the literature by fine-tuning their weights with contrast-augmentation, as well as by training new instances under completely controlled conditions.

The results of our experiments show that state-of-the-art object classification networks fail to retain their peak performance on images of poor contrast; some suffer even at levels as high as 50% (refer to Figure 9 and Table 1). We demonstrated that a simple fine-tuning procedure with contrast-augmentation offers a robust solution to the how question, by allowing a model to adequately adjust its parameters to perform almost perfectly at extremely low contrast. Training new instances from scratch supported these findings by exhibiting that exposure to multiple levels of contrast is indeed the key factor (see Figure 16).

We tackled the what question by comparing the weights and activation maps of every layer, and its constituent kernels, between an original network and its fine-tuned offspring. These networks are over 99.9% correlated; however, one is invariant to image contrast while the other is heavily impaired by it. We observed a general tendency that the largest difference appears to occur in the first few layers. Other pronounced patterns emerged uniquely for each architecture; however, none explained the difference between successful and failure trials. A more profound study of those would allow future works to answer the what question.

Deciphering the exact mechanisms of a system constituted of millions of parameters is complex due to the convoluted nature of its inner operations. A DNN is a hyper-dimensional model; however, for simplicity it can be visualised as a prism of multiple areas (i.e. blocks of repetitive layers). Each of these areas could be imagined as a cuboid decomposed into layers of a different nature. Each layer is a volumetric shape sliced into kernels, which are the smallest comprehensible dimension of a network due to their three-dimensional nature. In this article (similar to previous works [42, 28, 5]) we limited our analysis to the kernel and layer dimensions. Given that hierarchical depth has been demonstrated to be of significant importance both in artificial [37, 39, 12] and biological vision [34, 14, 25], future works should focus more on the analysis of hyper dimensions (i.e. the depth of a neural network, across layers and areas).

Acknowledgements

This project was funded by the Deutsche Forschungsgemeinschaft SFB/TRR 135.

References

  • [1] P. Agrawal, R. Girshick, and J. Malik. Analyzing the performance of multilayer neural networks for object recognition. In European conference on computer vision, pages 329–344. Springer, 2014.
  • [2] A. Akbarinia and K. R. Gegenfurtner. How is contrast encoded in deep neural networks? arXiv preprint arXiv:1809.01438, 2018.
  • [3] A. Akbarinia and C. A. Parraga. Feedback and surround modulated boundary detection. International Journal of Computer Vision, pages 1–14, 2017.
  • [4] A. Akbarinia and C. A. Parraga. Colour constancy beyond the classical receptive field. IEEE transactions on pattern analysis and machine intelligence, 40(9):2081–2094, 2018.
  • [5] D. Bau, B. Zhou, A. Khosla, A. Oliva, and A. Torralba. Network dissection: Quantifying interpretability of deep visual representations. arXiv preprint arXiv:1704.05796, 2017.
  • [6] Y. Bengio, A. Courville, and P. Vincent. Representation learning: A review and new perspectives. IEEE transactions on pattern analysis and machine intelligence, 35(8):1798–1828, 2013.
  • [7] F. Chollet. Xception: Deep learning with depthwise separable convolutions. arXiv preprint arXiv:1610.02357, 2017.
  • [8] S. Dodge and L. Karam. Understanding how image quality affects deep neural networks. In Quality of Multimedia Experience (QoMEX), 2016 Eighth International Conference on, pages 1–6. IEEE, 2016.
  • [9] R. A. Frazor and W. S. Geisler. Local luminance and contrast in natural images. Vision research, 46(10):1585–1598, 2006.
  • [10] R. Geirhos, D. H. Janssen, H. H. Schütt, J. Rauber, M. Bethge, and F. A. Wichmann. Comparing deep neural networks against humans: object recognition when the signal gets weaker. arXiv preprint arXiv:1706.06969, 2017.
  • [11] A. Gonzalez-Garcia, D. Modolo, and V. Ferrari. Do semantic parts emerge in convolutional neural networks? International Journal of Computer Vision, 126(5):476–494, 2018.
  • [12] K. He, X. Zhang, S. Ren, and J. Sun. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In Proceedings of the IEEE international conference on computer vision, pages 1026–1034, 2015.
  • [13] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
  • [14] S. Hochstein and M. Ahissar. View from the top: Hierarchies and reverse hierarchies in the visual system. Neuron, 36(5):791–804, 2002.
  • [15] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.
  • [16] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger. Densely connected convolutional networks. In CVPR, volume 1, page 3, 2017.
  • [17] D. H. Hubel and T. Wiesel. Shape and arrangement of columns in cat’s striate cortex. The Journal of physiology, 165(3):559–568, 1963.
  • [18] D. H. Hubel and T. N. Wiesel. Receptive fields, binocular interaction and functional architecture in the cat’s visual cortex. The Journal of physiology, 160(1):106–154, 1962.
  • [19] F. N. Iandola, S. Han, M. W. Moskewicz, K. Ashraf, W. J. Dally, and K. Keutzer. Squeezenet: Alexnet-level accuracy with 50x fewer parameters and <0.5 MB model size. arXiv preprint arXiv:1602.07360, 2016.
  • [20] L. Itti and C. Koch. Computational modelling of visual attention. Nature reviews neuroscience, 2(3):194, 2001.
  • [21] E. R. Kandel, J. H. Schwartz, T. M. Jessell, S. A. Siegelbaum, and A. J. Hudspeth. Principles of neural science, volume 4. McGraw-Hill, New York, 2000.
  • [22] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • [23] I. Kokkinos. Ubernet: Training a universal convolutional neural network for low-, mid-, and high-level vision using diverse datasets and limited memory. In CVPR, volume 2, page 8, 2017.
  • [24] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.
  • [25] N. Kruger, P. Janssen, S. Kalkan, M. Lappe, A. Leonardis, J. Piater, A. J. Rodriguez-Sanchez, and L. Wiskott. Deep hierarchies in the primate visual cortex: What can we learn for computer vision? IEEE transactions on pattern analysis and machine intelligence, 35(8):1847–1871, 2013.
  • [26] Y. LeCun, Y. Bengio, and G. Hinton. Deep learning. nature, 521(7553):436, 2015.
  • [27] Z. Li and D. Hoiem. Learning without forgetting. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017.
  • [28] A. Mahendran and A. Vedaldi. Understanding deep image representations by inverting them. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5188–5196, 2015.
  • [29] V. Mante, R. A. Frazor, V. Bonin, W. S. Geisler, and M. Carandini. Independence of luminance and contrast in natural scenes and in the early visual system. Nature neuroscience, 8(12):1690, 2005.
  • [30] A. A. Michelson. Studies in optics. Courier Corporation, 1995.
  • [31] F. Perazzi, P. Krähenbühl, Y. Pritch, and A. Hornung. Saliency filters: Contrast based filtering for salient region detection. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pages 733–740. IEEE, 2012.
  • [32] D. Purves and R. B. Lotto. Why we see what we do redux: A wholly empirical theory of vision. Sinauer Associates, 2011.
  • [33] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi. Xnor-net: Imagenet classification using binary convolutional neural networks. In European Conference on Computer Vision, pages 525–542. Springer, 2016.
  • [34] M. Riesenhuber and T. Poggio. Hierarchical models of object recognition in cortex. Nature neuroscience, 2(11):1019, 1999.
  • [35] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen. Mobilenetv2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4510–4520, 2018.
  • [36] P. Schlyter. Radiometry and photometry in astronomy. Available: stjarnhimlen. se/comp/radfaq. html, 2009.
  • [37] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
  • [38] C. Szegedy, S. Ioffe, V. Vanhoucke, and A. A. Alemi. Inception-v4, inception-resnet and the impact of residual connections on learning. In AAAI, volume 4, page 12, 2017.
  • [39] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1–9, 2015.
  • [40] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2818–2826, 2016.
  • [41] Y. Tian, K. Pei, S. Jana, and B. Ray. Deeptest: Automated testing of deep-neural-network-driven autonomous cars. In Proceedings of the 40th International Conference on Software Engineering, pages 303–314. ACM, 2018.
  • [42] M. D. Zeiler and R. Fergus. Visualizing and understanding convolutional networks. In European conference on computer vision, pages 818–833. Springer, 2014.
  • [43] B. Zoph, V. Vasudevan, J. Shlens, and Q. V. Le. Learning transferable architectures for scalable image recognition. arXiv preprint arXiv:1707.07012, 2(6), 2017.

Appendix A Pretrained networks

A.1 Contrast reduction

The classification accuracy of various networks under seven levels of image contrast – c ∈ {1, 5, 15, 30, 50, 75, 100}% in Eq. 1 – is reported in Figure 9. Refer to Sections 3.1 and 4.1 of the principal manuscript for the relevant discussion.

Figure 9: The classification accuracy of various networks on the validation-set of ImageNet data set. The first row corresponds to top-1 and the second row corresponds to top-5. The curves are normalised to perfect accuracy on 100% level of contrast. On the left panel: those curves with a triangle shape have gone through contrast-augmented fine-tuning, initialised with the weights of the square shaped curves. On the right panel: all the rest of the pretrained models obtained from Keras platform.

A.2 Illumination manipulation

The classification accuracy of various networks under three conditions of illumination manipulation is reported in Figure 10. The reported results are the average of three conditions, i.e. reddish, greenish, and bluish (obtained by keeping one colour channel constant and multiplying the other two channels by a constant of 0.25, 0.50, or 0.75). There is no significant difference between VGG16, MobileNetV2, NASNetMobile and their corresponding contrast-augmented fine-tuned offspring. However, the fine-tuned InceptionV3 and ResNet50 score significantly better than their original counterparts.

Although not a topic of this article, it is worth highlighting that certain networks (DenseNet and VGG family) perform substantially better in this task in comparison to the others. This should be investigated in future studies.

Figure 10: The classification accuracy of various networks on the validation-set of ImageNet data set. The first row corresponds to top-1 and the second row corresponds to top-5. On the left panel: those curves with a triangle shape and dotted pattern have gone through contrast-augmented fine-tuning, initialised with the weights of the square shaped curves. On the right panel: all the rest of the pretrained models obtained from Keras platform.

A.3 Gamma correction

The classification accuracy of various networks under seven levels of gamma correction is reported in Figure 11. All contrast-augmented fine-tuned networks perform considerably better in the case of gamma compression; this is more pronounced for VGG16 and ResNet50. For gamma expansion there is no clear pattern between the original networks and their fine-tuned offspring.

Although not a topic of this article, it is worth highlighting that certain networks (NasnetLarge) perform substantially better in this task in comparison to the others. This should be investigated in future studies.

Figure 11: The classification accuracy of various networks on the validation-set of ImageNet data set. The first row corresponds to top-1 and the second row corresponds to top-5. On the left panel: those curves with a triangle shape and dotted pattern have gone through contrast-augmented fine-tuning, initialised with the weights of the square shaped curves. On the right panel: all the rest of the pretrained models obtained from Keras platform.

A.4 Gaussian blurring

The classification accuracy of various networks under three levels of Gaussian blurring – convolutional windows of side 3, 9 and 27 pixels – is reported in Figure 12. There is no significant difference between the original networks and their fine-tuned offspring.

Although not a topic of this article, it is worth highlighting that certain networks (NasnetLarge) perform substantially better in this task in comparison to the others. This should be investigated in future studies.

Figure 12: The classification accuracy of various networks on the validation-set of ImageNet data set. The first row corresponds to top-1 and the second row corresponds to top-5. On the left panel: those curves with a triangle shape and dotted pattern have gone through contrast-augmented fine-tuning, initialised with the weights of the square shaped curves. On the right panel: all the rest of the pretrained models obtained from Keras platform.

A.5 Uniform noise

The classification accuracy of various networks under three levels of uniform noise – 5, 10, and 20% noise – is reported in Figure 13. There is no significant difference between the original networks and their fine-tuned offspring.

Although not a topic of this article, it is worth highlighting that certain networks (NasnetLarge, InceptionV3, Xception, and InceptionResNetV2) perform substantially better in this task in comparison to the others. This should be investigated in future studies.

Figure 13: The classification accuracy of various networks on the validation-set of ImageNet data set. The first row corresponds to top-1 and the second row corresponds to top-5. On the left panel: those curves with a triangle shape and dotted pattern have gone through contrast-augmented fine-tuning, initialised with the weights of the square shaped curves. On the right panel: all the rest of the pretrained models obtained from Keras platform.

A.6 Salt & Pepper Noise

The classification accuracy of various networks under three levels of salt & pepper noise – 1, 5, and 10% noise – is reported in Figure 14. There is no significant difference between the original networks and their fine-tuned offspring.

Although not a topic of this article, it is worth highlighting that certain networks (NasnetLarge) perform substantially better in this task in comparison to the others. This should be investigated in future studies.

Figure 14: The classification accuracy of various networks on the validation-set of ImageNet data set. The first row corresponds to top-1 and the second row corresponds to top-5. On the left panel: those curves with a triangle shape and dotted pattern have gone through contrast-augmented fine-tuning, initialised with the weights of the square shaped curves. On the right panel: all the rest of the pretrained models obtained from Keras platform.

A.7 Amount of required training

Figure 15: The top-1 and top-5 classification accuracy of fine-tuned ResNet50 in various epochs on the validation-set of ImageNet data set.

Appendix B Training networks from scratch

B.1 Contrast reduction

The classification accuracy of various networks trained in a controlled environment under seven levels of image contrast – c ∈ {1, 5, 15, 30, 50, 75, 100}% in Eq. 1 – is reported in Figure 16. Refer to Section 3.2 of the principal manuscript for the relevant discussion.

Figure 16: The top-1 and top-5 classification accuracy of various networks on the validation-set of ImageNet. The left panel corresponds to top-1 and the right panel to top-5. The curves are normalised to perfect accuracy at the 100% level of contrast. Each legend starts with two abbreviations: the first corresponds to the network (I: InceptionV3, blue; R: ResNet50, red; D: DenseNet201, green) and the second to the optimiser (A: Adam, square; S: SGD, circle). Those with a triangle shape have gone through a contrast-augmented training procedure.

Appendix C The what question

C.1 Weights comparison

The absolute difference and mutual information between the weights of all convolutional layers of an original network and its fine-tuned offspring are reported in Figure 17. Refer to Section 4.2 of the principal manuscript for the relevant discussion.

Figure 17: Left panel: the absolute difference of kernel weights between an original network and its fine-tuned offspring. Right panel: the mutual information between the kernel weights of an original network and its fine-tuned offspring.

C.2 Activation maps comparison

Figure 18 corresponds to the red bars of Figure 8 in the principal manuscript, divided into two categories: (i) the green bars are the average over all images for which the original ResNet50 and its fine-tuned offspring both correctly classify an image at the 15% level of contrast, and (ii) the red bars refer to those instances where the original ResNet50 fails to classify correctly while the fine-tuned version succeeds. As can be observed, there is no clear difference between the two sets of bars. This suggests that although the “branch_2c” convolutional layers clearly exhibit a different behaviour in comparison to the other layers, they cannot explain the difference between successful and failure trials. Meeting this necessary condition should be addressed in future works.

Figure 18: The total number of images in the validation-set of ImageNet with weakly correlated kernel activations between ResNet50 and its contrast-augmented fine-tuned offspring. Green bars correspond to the images where both networks produce a correct output. Red bars correspond to the instances where the fine-tuned network succeeds while the original network fails.

Figure 19 shows the correlation between the activation maps of all layers of the original ResNet50 and its fine-tuned offspring. As can be observed, the green and red lines are very similar. This suggests that the correlation between the activation maps of these twin networks cannot explain the difference between them.

Figure 19: Correlation between the activation maps of ResNet50 and its contrast-augmented fine-tuned offspring for all layers. The continuous lines correspond to images at the 100% level of contrast; the dashed lines to images at the 15% level of contrast. Green refers to the images where both networks produce a correct output; red to the instances where the fine-tuned network succeeds while the original network fails.

C.3 Successful versus failure comparison

Figure 20 shows the percentage of the most activated kernels at every convolutional layer that remain identical when the image contrast is reduced from 100% to a lower level. As can be observed from this figure, as the contrast of an image is reduced, the percentage of kernels identical to those at 100% image contrast also decreases. This can explain why VGG16 performs worse when the contrast of an image is poor. However, there is no difference between the green and red lines regardless of the contrast examined. This suggests, similar to the above, that the percentage of the most activated kernels cannot explain why a network fails for certain images at low contrast while succeeding for others. This should be studied in future works.
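One way to sketch this measurement: extract the per-kernel mean activations of every convolutional layer at full and at reduced contrast, and compute the overlap of the most activated kernels; the value of `top_k` is an assumption, and the inputs are assumed to be preprocessed.

```python
import numpy as np
import tensorflow as tf

def top_kernel_overlap(model, img_full, img_low, top_k=10):
    """For every convolutional layer, the fraction of the `top_k` most
    activated kernels (by mean activation) that stay identical when the
    image contrast is reduced."""
    conv_layers = [l for l in model.layers
                   if isinstance(l, tf.keras.layers.Conv2D)]
    probe = tf.keras.Model(model.input, [l.output for l in conv_layers])
    acts_full = probe.predict(img_full[np.newaxis])
    acts_low = probe.predict(img_low[np.newaxis])
    overlap = {}
    for layer, a_f, a_l in zip(conv_layers, acts_full, acts_low):
        top_full = np.argsort(a_f.mean(axis=(0, 1, 2)))[-top_k:]
        top_low = np.argsort(a_l.mean(axis=(0, 1, 2)))[-top_k:]
        overlap[layer.name] = len(set(top_full) & set(top_low)) / top_k
    return overlap
```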

Figure 20: Comparison of the most activated kernels in the activation maps of VGG16 between images at lower levels of contrast and at 100% contrast. Green corresponds to the images where VGG16 is correct at both 100% and the lower image contrast. Red corresponds to the images where VGG16 is correct at 100% image contrast but fails at the lower image contrast.