What Do Deep CNNs Learn About Objects?

04/09/2015
by Xingchao Peng, et al.
UMass Lowell


1 Introduction

Deep convolutional neural networks learn extremely powerful image representations, yet most of that power is hidden in the millions of deep-layer parameters. What exactly do these parameters represent? Recent work has started to analyse CNN representations, finding that, e.g., they are invariant to some 2D transformations Fischer et al. (2014), but are confused by particular types of image noise Nguyen et al. (2014). In this work, we delve deeper and ask: how invariant are CNNs to object-class variations caused by 3D shape, pose, and photorealism?

To analyse deep representations, we treat the activations of a hidden layer as input features to a linear classifier, and test how well the classifier generalizes across intra-class variations due to the above factors. The hypothesis is that, if the representation is invariant to a certain factor, then similar neurons will activate whether or not that factor is present in the input image. For example, if the network is invariant to “cat” texture, then it will have similar activations on cats with and without texture, i.e., it will “hallucinate” the right texture when given a textureless cat shape. The classifier will then learn equally well from both sets of training data. If, on the other hand, the network is not invariant to texture, then the feature distributions will differ, and the classifier trained on textureless cat data will perform worse. Because factors such as object texture and background scene are difficult to isolate using 2D image data, we rely on computer graphics to generate synthetic images from 3D object models.
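The probing protocol described above can be illustrated with a toy sketch. This is not the authors' code: the 2D feature vectors are hand-made stand-ins for hidden-layer activations, a nearest-centroid rule is used as a simple linear classifier, and every name below is hypothetical. The sketch only shows the logic of the test: if textureless training features land where real test features do, the classifier transfers; if they do not, it fails.

```python
# Toy illustration of the invariance probe: train a linear classifier on
# features from one rendering condition and test it on real-image features.
# All feature values are invented stand-ins for hidden-layer activations.

def centroid(vectors):
    """Mean of a list of equal-length feature vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def train_linear(pos, neg):
    """Nearest-centroid classifier expressed as the linear rule w.x + b > 0."""
    cp, cn = centroid(pos), centroid(neg)
    w = [p - q for p, q in zip(cp, cn)]
    # The decision boundary passes through the midpoint of the two centroids.
    b = -sum(wi * (p + q) / 2 for wi, p, q in zip(w, cp, cn))
    return w, b

def accuracy(model, pos, neg):
    """Fraction of positives scored > 0 and negatives scored <= 0."""
    w, b = model
    score = lambda x: sum(wi * xi for wi, xi in zip(w, x)) + b
    correct = sum(score(x) > 0 for x in pos) + sum(score(x) <= 0 for x in neg)
    return correct / (len(pos) + len(neg))

# Background (negative) patches, used for training and for testing.
train_neg = [[0.0, 0.0], [0.2, 0.1], [0.1, 0.3]]
test_neg = [[0.1, 0.1], [0.3, 0.0], [0.0, 0.2]]

# Features of real, textured cats at test time.
test_pos = [[1.0, 1.0], [0.9, 1.1], [1.1, 0.8]]

# Case A: an invariant representation maps textureless cats to the same region.
textureless_invariant = [[1.0, 0.9], [0.9, 1.0], [1.1, 1.1]]
# Case B: a non-invariant representation maps them somewhere else entirely.
textureless_shifted = [[0.1, -1.0], [0.2, -1.2], [0.0, -0.9]]

acc_invariant = accuracy(train_linear(textureless_invariant, train_neg),
                         test_pos, test_neg)
acc_shifted = accuracy(train_linear(textureless_shifted, train_neg),
                       test_pos, test_neg)

print("invariant features:    ", acc_invariant)
print("non-invariant features:", acc_shifted)
```

In this toy setting the classifier trained on the invariant features transfers to the real test features, while the one trained on the shifted features misses every real positive, which mirrors the drop in performance the hypothesis predicts.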

2 Exploring the Invariances of CNN features

We design a series of experiments to probe CNN invariance in the context of object detection. For each experiment, we follow these steps: 1) select image rendering parameters, 2) generate a batch of synthetic 2D images with those parameters, 3) sample positive and negative patches for each object class, 4) extract hidden CNN layer activations from the patches as features, 5) train a classifier for each object category, 6) test the classifiers on real images.

CNN Model, Training, and Synthetic Data Generation. We adopt the detection method of Girshick et al. (2013), which uses the eight-layer “AlexNet” architecture with over 60 million parameters Krizhevsky et al. (2012). Our hypothesis is that the network will learn different invariances, depending on how it is trained. Therefore, we evaluate two different variants of the network: one trained on the ImageNet ILSVRC 1000-way classification task, which we call IMGNET, and the same network also fine-tuned for the PASCAL detection task, which we call PASC-FT. For both networks, we extract the last hidden layer (fc7) as the feature representation. We choose to focus on the last hidden layer as it is the most high-level representation and has learned the most invariance.

We choose a subset of factors that can easily be modeled using simple computer graphics techniques, namely, object texture and color, context/background appearance and color, 3D pose and 3D shape. We study the invariance of the CNN representation to these parameters using synthetic data. We also study the invariance to 3D rigid rotations using real data.

Object Color, Texture and Context. We begin by investigating various combinations of object colors and textures placed against a variety of background scene colors and textures. Examples of our texture and background generation settings are shown in Table 1.

We trained a series of detectors with each of the above background and object texture configurations and tested them on the PASCAL VOC test set, reporting the average precision (AP) across categories. The somewhat unexpected result is that, with PASC-FT, the generation settings RR-RR (28.9%), W-RR (31.2%), W-UG (30.1%), and RG-RR (31.2%) all achieve comparable performance, despite the fact that W-UG has no texture and no context. Results with real texture but no color in the background (RG-RR, W-RR) are the best. This indicates that the network has learned to be invariant to the color and texture of the object and its background.

      RR-RR      W-RR       W-UG         RR-UG        RG-UG        RG-RR
BG    Real RGB   White      White        Real RGB     Real Gray    Real Gray
TX    Real RGB   Real RGB   Unif. Gray   Unif. Gray   Unif. Gray   Real RGB
Table 1: Different configurations of background (BG) and object texture (TX).
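As a reading aid, the configuration names in Table 1 can be decoded mechanically: the part before the hyphen names the background setting and the part after it names the object texture setting. The sketch below is only an illustration of that naming scheme; the dictionaries and the `decode` helper are hypothetical, and the AP values are the four PASC-FT numbers quoted in the text.

```python
# Decode Table 1 configuration names of the form "<background>-<texture>".
# The helper and dictionary names are illustrative, not from the paper.

BACKGROUND = {"RR": "Real RGB", "RG": "Real Gray", "W": "White"}
TEXTURE = {"RR": "Real RGB", "UG": "Unif. Gray"}

def decode(config):
    """Split a name such as 'W-UG' into its background and texture settings."""
    bg, tx = config.split("-")
    return {"background": BACKGROUND[bg], "texture": TEXTURE[tx]}

# AP (%) with PASC-FT, as quoted in the text for four of the six settings.
REPORTED_AP = {"RR-RR": 28.9, "W-RR": 31.2, "W-UG": 30.1, "RG-RR": 31.2}

for name, ap in REPORTED_AP.items():
    setting = decode(name)
    print(f"{name}: BG={setting['background']}, "
          f"TX={setting['texture']}, AP={ap}")

# Gap between the best and worst of the four quoted settings.
spread = round(max(REPORTED_AP.values()) - min(REPORTED_AP.values()), 1)
print("spread across quoted settings:", spread)
```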

Image Pose. We also test view invariance on real images. We are interested here in objects whose frontal view differs significantly from their side view (e.g., the side view of a horse vs. its frontal view). To this end, we selected 12 categories from the PASCAL VOC training set that meet this criterion. The held-out categories included rotationally invariant objects such as bottles or tables. Next, we split the training data for these 12 categories into prominent side-view and front-view images, as shown in Table 2.

We train classifiers on data with one view removed (say, the front view) and test the resulting detector on the PASCAL VOC test set, which contains both side and front views. We also compare with random view sampling. Results, shown in Table 2, point to important and surprising conclusions regarding the representational power of the CNN features. Note that, for PASC-FT, mAP drops only slightly, from 61.2 to 59.1 (front views removed) or 60.4 (side views removed), when detectors trained with one view removed are tested on the PASCAL VOC test set.


Net               Views     aero  bike  bird  bus   car   cow   dog   hrs   mbik  shp   trn   tv    mAP
PASC-FT           all       64.2  69.7  50.0  62.6  71.0  58.5  56.1  60.6  66.8  52.8  57.9  64.7  61.2
PASC-FT           -random   62.1  70.3  49.7  61.1  70.2  54.7  55.4  61.7  67.4  55.7  57.9  64.2  60.9
PASC-FT           -front    61.7  67.3  45.1  58.6  70.9  56.1  55.1  59.0  66.1  54.2  53.3  61.6  59.1
PASC-FT           -side     62.0  70.2  48.9  61.2  70.8  57.0  53.6  59.9  65.7  53.7  58.1  64.2  60.4
PASC-FT(-front)   -front    59.7  63.1  42.7  55.3  64.9  54.4  54.0  56.1  64.2  55.1  47.4  60.1  56.4
Table 2: Results of training on different real image views. '-' denotes removing a certain view.
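The mAP column of Table 2 is the mean of the twelve per-class AP values in each row, which can be checked directly. The short sketch below recomputes it from the table entries and reports which class loses the most AP when front views are removed; the variable and function names are illustrative only.

```python
# Recompute the mAP column of Table 2 from the per-class AP values.
# Numbers are copied from the table; the names used here are illustrative.

CLASSES = ["aero", "bike", "bird", "bus", "car", "cow",
           "dog", "hrs", "mbik", "shp", "trn", "tv"]

AP = {
    "PASC-FT / all":     [64.2, 69.7, 50.0, 62.6, 71.0, 58.5,
                          56.1, 60.6, 66.8, 52.8, 57.9, 64.7],
    "PASC-FT / -random": [62.1, 70.3, 49.7, 61.1, 70.2, 54.7,
                          55.4, 61.7, 67.4, 55.7, 57.9, 64.2],
    "PASC-FT / -front":  [61.7, 67.3, 45.1, 58.6, 70.9, 56.1,
                          55.1, 59.0, 66.1, 54.2, 53.3, 61.6],
    "PASC-FT / -side":   [62.0, 70.2, 48.9, 61.2, 70.8, 57.0,
                          53.6, 59.9, 65.7, 53.7, 58.1, 64.2],
    "PASC-FT(-front) / -front": [59.7, 63.1, 42.7, 55.3, 64.9, 54.4,
                                 54.0, 56.1, 64.2, 55.1, 47.4, 60.1],
}

def mean_ap(values):
    """Mean average precision: the unweighted mean of per-class APs."""
    return sum(values) / len(values)

mean_aps = {row: round(mean_ap(values), 1) for row, values in AP.items()}
for row, value in mean_aps.items():
    print(f"{row}: mAP = {value}")

# Per-class AP change when front views are removed from the training data.
drops = {c: full - part for c, full, part in
         zip(CLASSES, AP["PASC-FT / all"], AP["PASC-FT / -front"])}
most_affected = max(drops, key=drops.get)
print("class most affected by removing front views:", most_affected)
```

The recomputed means agree with the mAP column as printed in the table.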

3D Shape. Finally, we experiment with reducing intra-class shape variation by using fewer CAD models per category. We otherwise use the same settings as in the RR-RR condition with PASC-FT. From our experiments, we find that the mAP decreases by about 5.4 points, from 28.9% to 23.53%, when using only half of the 3D models. This shows a significant boost from adding more shape variation to the training data, indicating less invariance to this factor.

3 Conclusion

We investigated the sensitivity of convnets to various factors in the training data: 3D pose, foreground texture and color, background image and color. To simulate these factors we used synthetic data generated from 3D CAD models and a few real images.

Our results demonstrate that the popular deep convnet of Krizhevsky et al. (2012) fine-tuned for detection on real images for a set of categories is indeed invariant to these factors.

For more details and results, we refer the reader to the following paper Peng et al. (2014).

References