What Happened to My Dog in That Network: Unraveling Top-down Generators in Convolutional Neural Networks

11/23/2015 ∙ by Patrick W. Gallagher, et al. ∙ University of California, San Diego

Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN".




1 Introduction

Deep learning has many recent successes; for example, deep learning approaches have made strides in automatic speech recognition (Hinton et al., 2012), in visual object recognition (Krizhevsky et al., 2012), and in machine translation (Sutskever et al., 2014). While these successes demonstrate the wide-ranging effectiveness of deep learning approaches, there yet remains useful information that current deep learning is less able to bring to bear.

To take a specific example, consider that much of current deep learning practice is dominated by approaches that proceed from input to output in a fundamentally bottom-up fashion. While current performance is extremely impressive, these strongly bottom-up characteristics leave room for one to ask whether providing deep learning with the ability to also incorporate top-down information might open a path to even better performance.

The demonstrated role of top-down information in human perception (Stroop, 1935; Cherry, 1953; Hill & Johnston, 2007; Ames Jr, 1951) provides a suggestive indication of the role that top-down information could play in deep learning. Visual illusions (such as the “Chaplin mask”) provide the clearest examples of the strong effect that top-down/prior information can have on human perception; the benefits of top-down information in human perception are widespread but subtler to notice: prominent examples include color constancy (Kaiser & Boynton, 1996) and the interpretation of visual scenes that would otherwise be relatively meaningless (e.g. the “Dalmatian” image (Marr, 1982)). Another particularly common experience is the human ability to focus on some specific conversation in a noisy room, distinguishing the relevant audio component among potentially overwhelming interference.

Motivated by the importance of top-down information in human perception, as well as by the successful incorporation of top-down information in non-deep approaches to computer vision (Borenstein & Ullman, 2008; Tu et al., 2005; Levin & Weiss, 2009), we pursue an approach to bringing top-down information into current deep network practice. The potential benefits from incorporating top-down information in deep networks include improved prediction accuracy in settings where bottom-up information is misleading or insufficiently distinctive as well as generally improved agreement when multiple classification predictions are made in a single image (such as in images containing multiple objects). A particularly appealing direction for future work is the use of top-down information to improve resistance to “adversarial examples” (Nguyen et al., 2015; Szegedy et al., 2013).

1.1 Related work

The incorporation of top-down information in visual tasks stands at the intersection of three fields: cognitive science, computer vision, and deep learning. Succinctly, we find our inspiration in cognitive science, our prior examples in computer vision, and our actual instantiation in deep learning. We consider these each in turn.

Cognitive science

Even before Stroop’s work (Stroop, 1935), it had been noted that human perception of the world is not a simple direct path from, e.g., photons reaching the retina to an interpretation of the world around us. Researchers have established a pervasive and important role for top-down information in human perception (Gregory, 1970). The most striking demonstrations of the role of top-down information in human perception come in the form of “visual illusions”, such as incorrectly perceiving the concave side of a plastic Chaplin mask to be convex (Hill & Johnston, 2007).

The benefits of top-down information are easy to overlook, simply because top-down information is often playing a role in the smooth functioning of perception. To get a sense for these benefits, consider that in the absence of top-down information, human perception would have trouble with such useful abilities as the establishment of color constancy across widely varying illumination conditions (Kaiser & Boynton, 1996) or the interpretation of images that might otherwise resemble an unstructured jumble of dots (e.g., the “Dalmatian” image (Marr, 1982)).

Non-deep computer vision

Observations of the role of top-down information in human perception have inspired many researchers in computer vision. A widely-cited work on this topic that considers both human perception and machine perception is (Kersten et al., 2004). The chain of research stretches back even to the early days of computer vision research, but more recent works demonstrating the performance benefits of top-down information in tasks such as object perception include (Borenstein & Ullman, 2008; Tu et al., 2005; Levin & Weiss, 2009).

Deep computer vision

Two recent related works in computer vision are (Cohen & Welling, 2015; Jaderberg et al., 2015). There are distinct differences in goal and approach, however. Whereas spatial transformer networks (Jaderberg et al., 2015) pursue an architectural addition in the form of what one might describe as “learned standardizing preprocessing” inside the network, our primary focus is on exploring the effects (within an existing CNN) of the types of transformations that we consider. We also investigate a method of using the explored effects (in the form of learned generators) to improve vanilla AlexNet performance on ImageNet. On the other hand, (Cohen & Welling, 2015) state that their goal is “to directly impose good transformation properties of a representation space” which they pursue via a group theoretic approach; this is in contrast to our approach centered on effects on representations in an existing CNN, namely AlexNet. They also point out that their approach is not suitable for dealing with images much larger than 108x108, while we are able to pursue an application involving the entire ImageNet dataset. Another recent work is (Dai & Wu, 2014), modeling random fields in convolutional layers; however, they do not perform image synthesis, nor do they study explicit top-down transformations.

Image generation from CNNs

As part of our exploration, we make use of recent work on generating images corresponding to internal activations of a CNN. A special purpose (albeit highly intriguing) method is presented in (Dosovitskiy et al., 2015). The method of (Mahendran & Vedaldi, 2014) is generally applicable, but the specific formulation of their inversion problem leads to generated images that significantly differ from the images the network was trained with. We find the technique of (Dosovitskiy & Brox, 2015) to be most suited to our purposes and use it in our subsequent visualizations.

Feature flows

One of the intermediate steps of our process is the computation of “feature flows”: vector fields computed using the SIFTFlow approach (Liu et al., 2011), but with CNN features used in place of SIFT features. Some existing work has touched on the usefulness of vector fields derived from “feature flows”. A related but much more theoretical diffeomorphism-based perspective is (Joshi et al., 2000). Another early reference touching on flows is (Simard et al., 1998); however, the flows here are computed from image pixels rather than from CNN features. (Taylor et al., 2010) uses feature flow fields as a means of visualizing spatio-temporal features learned by a convolutional gated RBM that is also tasked with an image analogy problem. The “image analogy” problem is also present in the work (Memisevic & Hinton, 2007) focusing on gated Boltzmann machines; here the image analogy is performed by a “field of gated experts” and the flow-fields are again used for visualizations. Rather than pursue a special purpose re-architecting to enable the performance of such “zero-shot learning”-type “image analogy” tasks, we pursue an approach that works with an existing CNN trained for object classification: specifically, AlexNet (Krizhevsky et al., 2012).

2 Generator learning

We will focus our experiments on a subset of affine image operations: rotation, scaling, and translation. In order to avoid edge effects that might arise when performing these operations on images where the object is too near the boundary of the image, we use the provided meta information to select suitable images.

2.1 Pipeline for generator learning

We select from within the 1.3M images of the ILSVRC2014 CLS-LOC task (Russakovsky et al., 2014). In our experiments, we will rotate/scale/translate the central object; we wish for the central object to remain entirely in the image under all transformations. We will use bounding box information to ensure that this will be the case: we select all images where the bounding box is centered, more square than rectangular, and occupies 40-60% of the pixels in the image. We find that 91 images satisfy these requirements; we will subsequently refer to these 91 images as our “original images”.
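The selection rule above can be sketched as a simple bounding-box filter. This is a minimal sketch under stated assumptions: the text gives only the 40-60% area range, so the tolerance used for “centered” (`center_tol`) and the aspect-ratio cutoff used for “more square than rectangular” (`max_aspect`) are hypothetical values, as are the `Box` type and the function name.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """Axis-aligned bounding box in pixel coordinates (hypothetical helper type)."""
    xmin: float
    ymin: float
    xmax: float
    ymax: float


def is_suitable(box, img_w, img_h, center_tol=0.1, max_aspect=1.5,
                min_area=0.40, max_area=0.60):
    """Return True if the box is centered, roughly square, and covers 40-60% of the image.

    center_tol and max_aspect are ASSUMED thresholds; the text does not state them.
    """
    bw = box.xmax - box.xmin
    bh = box.ymax - box.ymin
    if bw <= 0 or bh <= 0:
        return False
    # Centered: box center lies within center_tol (as a fraction of image size) of the image center.
    cx = (box.xmin + box.xmax) / 2.0
    cy = (box.ymin + box.ymax) / 2.0
    centered = (abs(cx - img_w / 2.0) <= center_tol * img_w
                and abs(cy - img_h / 2.0) <= center_tol * img_h)
    # More square than rectangular: longer side at most max_aspect times the shorter side.
    squarish = max(bw / bh, bh / bw) <= max_aspect
    # Area: fraction of image pixels covered by the box.
    area_frac = (bw * bh) / (img_w * img_h)
    return centered and squarish and min_area <= area_frac <= max_area
```

Applying such a filter to every annotated image and keeping those that pass would yield the candidate set; the exact thresholds determine how many images survive.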

2.1.1 Generating transformed image pairs

We use rotation as our running example. For ease of reference, we will use the notation $I_j[\theta]$ to denote a transformed version of original image $j$ in which the central object has been rotated to an orientation of $\theta$ degrees; the original image is $I_j[0]$. Using this notation, we can consider image pairs in which the difference between one image and the other is the amount of rotation of the central object. For example, in the pair $(I_j[\theta], I_j[\theta + 10])$, the central object is at orientation $\theta$ in the first image and at orientation $\theta + 10$ in the second.

Figure 1: Illustration of AlexNet feature flow fields associated with the specified transformations. Best viewed in color, on-screen.

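To begin, we will consider 72 regularly spaced values of the initial orientation angle $\theta$ (one every 5°), but only one value of rotation amount, 10°. This means that for each of the 91 original images, we will have 72 pairs of the form $(I_j[\theta], I_j[\theta + 10])$, in which the central object of original image $j$ is at orientation $\theta$ in the first image and rotated a further 10° in the second. These 6,552 total pairs (91 × 72) will be the focus of our subsequent processing in the rotation experiments.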
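The pair count can be checked by enumerating the pairs directly. The sketch below is illustrative only: the function name and the tuple layout (image index, initial orientation, final orientation) are assumptions, and orientations are wrapped modulo 360 for convenience.

```python
def rotation_pairs(n_images=91, n_orientations=72, rotation=10.0):
    """Enumerate (image index, initial orientation, rotated orientation) triples.

    Orientations are in degrees and regularly spaced over a full turn.
    """
    step = 360.0 / n_orientations  # 5 degrees for 72 orientations
    pairs = []
    for j in range(n_images):
        for k in range(n_orientations):
            theta = k * step
            pairs.append((j, theta, (theta + rotation) % 360.0))
    return pairs


pairs = rotation_pairs()
print(len(pairs))  # 91 * 72 pairs
```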

2.1.2 Computing AlexNet features

Next, for each transformed image pair, we use the Caffe (Jia et al., 2014) library’s pretrained reference AlexNet model to compute the AlexNet features associated with each image in the pair. Using the notation $F(I)$ to denote the collection of all AlexNet feature values resulting when image $I$ is the input, this means that we now have 6,552 “collected AlexNet features” pairs of the form $(F(I), F(I'))$, one for each image pair $(I, I')$. AlexNet has 8 layers with parameters: the first 5 of these are convolutional layers (henceforth referred to as conv1, conv2, …, conv5); the final 3 are fully connected layers (henceforth referred to as fc6, fc7, fc8). Our attention will be focused on the convolutional layers rather than the fully connected layers, since the convolutional layers retain “spatial layout” that corresponds to the original image while the fully connected layers lack any such spatial layout.

2.1.3 Computing per-layer feature flows

For ease of reference, we introduce the notation $F_\ell(I)$ to refer to the AlexNet features at layer $\ell$ when the input image is $I$. From the “entire network image features” pair $(F(I), F(I'))$ for an image pair $(I, I')$, we focus attention on one layer at a time; for layer $\ell$ the relevant pair is then $(F_\ell(I), F_\ell(I'))$. In particular, at each convolutional layer, for each such pair we will compute the “feature flow” vector field that best describes the “flow” from the values $F_\ell(I)$ to the values $F_\ell(I')$.

We compute these feature flow vector fields using the SIFTFlow method (Liu et al., 2011); however, instead of computing “flow of SIFT features”, we compute “flow of AlexNet features”. See Figure 1 for an illustration of these computed feature flow vector fields. For a layer $\ell$ feature pair $(F_\ell(I), F_\ell(I'))$ we refer to the corresponding feature flow as $V_\ell(I, I')$. Recalling that we only compute feature flows for convolutional layers, collecting the feature flow vector fields for $\ell \in \{\text{conv1}, \dots, \text{conv5}\}$ results in a total of 8,522 values (two flow components at each spatial location: $2 \times (55^2 + 27^2 + 3 \cdot 13^2) = 8{,}522$); we collectively refer to the collected-across-conv-layers feature flow vector fields as $V(I, I')$. If we flatten/vectorize these feature flow vector field collections for each pair and then row-stack these vectorized flow fields, we obtain a matrix with 6,552 rows (one row per image pair) and 8,522 columns (one column per feature flow component value in conv1 through conv5).

2.1.4 Feature flow PCA

In order to characterize the primary variations in the collected feature flow vector fields, we perform PCA on this matrix of 6,552 rows/examples and 8,522 columns/feature flow component values. We retain the first 10 eigenvectors/principal component directions (as “columns”); each of these contains 8,522 feature flow vector field component values.
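The PCA step, and the re-representation of a flow field by a mean plus 10 coefficients, can be sketched with a singular value decomposition. This is a minimal NumPy sketch, not the authors' implementation; the function names are hypothetical, and a real run would use the 6,552 × 8,522 flow matrix rather than a small synthetic one.

```python
import numpy as np


def flow_pca(flows, n_components=10):
    """PCA of a (n_pairs, n_values) matrix holding one flattened flow field per row.

    Returns the mean flow field and the leading principal directions (as rows).
    """
    mean = flows.mean(axis=0)
    centered = flows - mean
    # Rows of vt are orthonormal principal directions, ordered by decreasing singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return mean, vt[:n_components]


def encode(flow, mean, components):
    """Coefficients of one flow field on the principal directions (mean coefficient fixed at 1)."""
    return components @ (flow - mean)


def decode(coeffs, mean, components):
    """Approximate flow field rebuilt from the mean and the principal-direction coefficients."""
    return mean + coeffs @ components
```

For a flattened flow field `v`, `decode(encode(v, mean, comps), mean, comps)` gives its least-squares approximation in the span of the mean and the retained directions.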

Figure 2: PCA components of the CNN feature flow fields associated with 10° of rotation. The first, second, and third rows show, respectively, the mean, first, and second principal components (PC 1 and PC 2). The first, second, third, and fourth columns show, respectively, the results in conv1, conv2, conv3, and conv5. Best viewed on screen.

We denote these “eigen” feature flow fields as $U_1, \dots, U_{10}$, with each $U_i \in \mathbb{R}^{8522}$. Here the use of an upper case letter is intended to recall that, after reshaping, we can plot these flow fields in a spatial layout corresponding to that of the associated AlexNet convolutional layers. We also recall that these “eigen” feature flow fields were computed based on feature pairs with a 10° rotation. Together with the mean feature flow field, subsequently denoted $M$, these 10 “stacked and flattened/vectorized ‘eigen’ feature flow fields” (as “columns”) of 8,522 feature flow component values provide us the ability to re-represent each of the 6,552 “per-pair stacked and flattened/vectorized feature flow fields” in terms of 11 = 1+10 coefficients. When we re-represent the 6,552 example “stacked and flattened/vectorized ‘pair’ feature flow fields”, the coefficient associated with the mean will always be equal to 1; however, we will shortly consider a setting in which we will allow the coefficient associated with the mean to take on values other than 1.

2.1.5 Using feature flow PCA to obtain bases for expressing generators

Taken together, the mean feature flow field $M$ and the ‘eigen’ feature flow fields $U_1, \dots, U_{10}$ provide us with the ability to produce (up to some minimized value of mean squared error) “re-representations” of each of the 6,552 example “stacked and flattened/vectorized ‘10° rotation’ feature flow fields”.

These 11 vectors were determined from the case of feature flow fields associated with 10° rotations. We next seek to use these 11 vectors (together with closely related additions described shortly) as bases in terms of which we seek to determine alternative representations of flow fields associated with other amounts of rotation. In particular, we will seek to fit regression coefficients for the representation of flow fields associated with feature pairs derived when there has been rotation of varying amounts of the central object. Specifically, we will follow the steps of the feature flow computation process detailed earlier in this Section, but now using additional rotation amounts together with the previous 10°, for 6 rotation amounts in total.

The regression equation associated with each feature flow example will be of the form

$$B(\delta)\, w = V(I, I'), \qquad (1)$$

where $B(\delta)$ is a matrix containing 11 groups of 3 columns, with columns that depend on the rotation amount $\delta$; there is one such group for each of the 11 vectors above (the mean feature flow field and the 10 ‘eigen’ feature flow fields). We do this so that the basis vectors provided in $B(\delta)$ will be different in different rotation conditions, enabling better fits. The vector $w$ can similarly be regarded as containing 11 groups of 3 coefficient values, 33 values in all. Finally, the right hand side $V(I, I')$ is an instance of the collected-across-conv-layers feature flow vector fields described earlier, computed for an image pair $(I, I')$ whose central objects differ by a rotation of $\delta$. We have one of the above regression expressions for each “transform image” pair; since in our current setting we have 91 original images, 72 initial orientation angles, and 6 rotation amounts, we have a total of 91 × 72 × 6 = 39,312 such pairs. For ease of reference, we will refer to the vertically stacked basis matrices (each of the form $B(\delta)$, with $\delta$ being the value used in computing the associated “transform image” pair, as in the example regression described in Eqn. 1) as $\mathbf{B}$. Similarly, we will refer to the vertically stacked “feature flow vector field” vectors, each of the form $V(I, I')$, as $\mathbf{V}$.

Our “modified basis” regression problem is thus succinctly expressed as

$$\min_{w} \; \lVert \mathbf{B}\, w - \mathbf{V} \rVert_2^2. \qquad (2)$$

For specificity, we describe the row dimension of $\mathbf{B}$: its rows come from vertically stacked one-per-image-pair matrices, each with 8,522 rows. Thus, the total number of rows is 91 original images × 72 initial orientations × 6 rotation amounts × 8,522 entries in the feature flow field collection per image pair, or 335,016,864 rows. We will refer to the minimizing argument as $w_{ls}$.
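The stacked least-squares fit described above can be sketched in a few lines of NumPy. Two things here are assumptions made purely for illustration: the way each group of 3 columns depends on the rotation amount (a hypothetical polynomial form, each basis vector multiplied by 1, the amount, and the amount squared), and all function names. The sketch is meant to show the stacking and the fit, not to reproduce the authors' basis construction.

```python
import numpy as np


def effective_basis(base_vectors, amount):
    """Build a (n_values, 33) matrix from 11 base vectors given as columns of base_vectors.

    ASSUMPTION: each group of 3 columns is [u, amount * u, amount**2 * u];
    this polynomial form is hypothetical and chosen only for illustration.
    """
    groups = [np.column_stack([u, amount * u, amount ** 2 * u])
              for u in base_vectors.T]
    return np.hstack(groups)


def fit_coefficients(base_vectors, amounts, flows):
    """Least-squares fit over vertically stacked per-pair basis matrices and flow vectors."""
    stacked_basis = np.vstack([effective_basis(base_vectors, a) for a in amounts])
    stacked_flows = np.concatenate(flows)
    w, *_ = np.linalg.lstsq(stacked_basis, stacked_flows, rcond=None)
    return w


def predict_flow(base_vectors, amount, w):
    """Predicted flow field for a given transformation amount, including unseen amounts."""
    return effective_basis(base_vectors, amount) @ w
```

At full scale the stacked matrix would have hundreds of millions of rows, so a practical fit would more likely accumulate the 33 × 33 normal equations pair by pair than materialize the whole stack; that choice is outside what the text specifies.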

3 Illustration of use of learned generators

3.1 Transformations via learned generators

We can use these “least squares” coefficient values to “predict” feature flow fields associated with a specified number of degrees of rotation. More particularly, we can do this for rotation degree amounts other than the degree amounts used when we generated the “transform image training pairs” used in our least-squares regression calibration Eqn. 2. To obtain the desired generator, we decide what specific “number of degrees of rotation” is desired; using this specified degree amount and the 11 basis vectors (learned in the 10 degree rotation case we performed PCA on previously), we generate the corresponding “33 column effective basis matrix”. Our sought-for generator is then the product of this matrix with the least-squares coefficient vector, an element of $\mathbb{R}^{8522}$. For specificity, we will refer to the generator arising from a specified rotation angle of $\delta$ as $G(\delta)$. We could describe generators as “predicted specified feature flows”; however, since we use these to generate novel feature values in the layers of the network (and since this description is somewhat lengthy), we refer to them as “generator flow fields”, or simply “generators”.

Figure 3: Illustration of applying learned generators to CNN features of a novel input image. (a) Input image. (b) “Inverted image” (Dosovitskiy & Brox, 2015) from conv1 features of (a). (c), (d), (e), (f), (g), and (h) show “inverted images” from conv1 features obtained by applying learned generator flow fields associated with, respectively, two rotations of opposite sign, two scalings, a translation to the left, and a translation up.

We may describe the use of these learned generators as follows: given a collection of CNN features, we can apply a learned generator to obtain (approximations to) the feature values that would have arisen from applying the associated transformation to the input image.

We now seek to investigate the use of generator flow fields, generically $G$, in order to produce an approximation of the exact CNN features that would be observed if we were to e.g. rotate an original image and compute the resulting AlexNet features. As a specific example, consider a “transform image pair” $(I, I')$, in which $I'$ is $I$ with its central object rotated. In our notation, the corresponding AlexNet feature response map pair is $(F(I), F(I'))$. We seek to use our generator $G$ to provide an estimate of $F(I')$ given only $F(I)$.
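Applying a generator flow field to a layer's features amounts to warping the feature maps along the flow. The sketch below shows one simple way to do this, under assumptions not stated in the text: nearest-neighbor sampling, clamping at the borders, and a gather convention in which the output at a location is read from the input at that location plus the flow vector. The function name is hypothetical.

```python
import numpy as np


def apply_flow(features, flow):
    """Warp feature maps along a flow field.

    features: array of shape (C, H, W).
    flow:     array of shape (H, W, 2) holding a (dy, dx) vector per location.
    ASSUMED convention: out[:, y, x] = features[:, y + dy, x + dx],
    with nearest-neighbor rounding and clamping at the borders.
    """
    _, h, w = features.shape
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    src_y = np.clip(np.rint(ys + flow[..., 0]).astype(int), 0, h - 1)
    src_x = np.clip(np.rint(xs + flow[..., 1]).astype(int), 0, w - 1)
    # Every channel is sampled at the same source locations.
    return features[:, src_y, src_x]
```

Bilinear sampling would give smoother results for fractional flow vectors; nearest-neighbor is used here only to keep the sketch short.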

Visualizations of the CNN features are often difficult to interpret. To provide an interpretable evaluation of the quality of the learned generators, we use the AlexNet inversion technique of (Dosovitskiy & Brox, 2015). Applying our learned generators, the results in Fig. 3 indicate that the resulting CNN features (arrived at using information learned in a top-down fashion) closely correspond to those that would have been produced via the usual bottom-up process. As a more quantitative evaluation, we also check the RMS error and mean absolute deviation between network internal layer features “generated” using our learned generators and the corresponding feature values that would have arisen through “exact” bottom-up processing. For example, when looking at the 256 channels of AlexNet conv5, the RMS of the difference between generator-produced features and bottom-up features associated with “translate left by 30” is 4.69; the mean absolute deviation is 1.042. The RMS of the difference between generator-produced features and bottom-up features associated with “scale by 1.3x” is 1.63; the mean absolute deviation is 0.46.
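The two error measures quoted above can be computed as follows. This is a small sketch assuming that “mean absolute deviation” means the mean of the absolute elementwise differences between the two feature arrays; the function names are hypothetical.

```python
import numpy as np


def rms_error(generated, bottom_up):
    """Root mean square of the elementwise difference between two feature arrays."""
    diff = np.asarray(generated, dtype=float) - np.asarray(bottom_up, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))


def mean_abs_deviation(generated, bottom_up):
    """Mean of the absolute elementwise difference (ASSUMED reading of the term)."""
    diff = np.asarray(generated, dtype=float) - np.asarray(bottom_up, dtype=float)
    return float(np.mean(np.abs(diff)))
```

Both functions accept arrays of any matching shape, for example the 256 × 13 × 13 conv5 responses of AlexNet.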

3.2 Zero-shot learning

We have seen that the learned generators can be used to produce CNN features corresponding to various specified transformations of provided initial CNN features. We next seek to explore the use of these learned generators in support of a zero-shot learning task. We will again use our running example of rotation.

A typical example of “zero-shot learning”: “If it walks like a duck…”

We first describe a typical example of zero-shot learning. Consider the task of classifying central objects in e.g. ImageNet images of animals. A standard zero-shot learning approach to this task involves two steps. In the first step, we learn a mapping from raw input data (for example, a picture of a dog) to some intermediate representation (for example, scores associated with “semantic properties” such as “fur is present”, “wings are present”, etc.). In the second step, we assume that we have (from Wikipedia article text, for example) access to a mapping from the intermediate “semantic property” representation to class label. For example, we expect Wikipedia article text to provide us with information such as “a zebra is a hoofed mammal with fur and stripes”.

If our training data is such that we can produce accurate semantic scores for “hoofs are present”, “stripes are present”, “fur is present”, we can potentially use the Wikipedia-derived association between “zebra” and its “semantic properties” to bridge the gap between the “semantic properties” predicted from the raw input image and the class label associated with “zebra”; significantly, so long as the predicted “semantic properties” are accurate, the second part of the system can output “zebra” whether or not the training data ever contained a zebra. To quote the well-known aphorism: “If it walks like a duck, swims like a duck, and quacks like a duck, then I call that thing a duck.”

Zero-shot learning in our context

In the typical example of zero-shot learning described above, the task was to map raw input data to a vector of predicted class label probabilities. This task was broken into two steps: first map the raw input data to an intermediate representation (“semantic property scores”, in the animal example), then map the intermediate representation to a vector of predicted class label probabilities. The mapping from raw input data to intermediate representation is learned during training; the mapping from intermediate representation to class label is assumed to be provided by background information or otherwise accessible from an outside source (determined from Wikipedia, in the animal example).

In our setting, we have (initial image, transformed image) pairs. Our overall goal is to determine a mapping from (initial image, transformed image) to “characterization of specific transformation applied”. A specific instance of the overall goal might be: when presented with, e.g., an input pair (image with central object, image with central object rotated 35°) return output “35°”. Analogous to the animal example discussed above, we break this overall mapping into two steps: The first mapping step takes input pairs (initial image, transformed image) to “collected per-layer feature flow vector fields”. The second mapping step takes an input of “collected per-layer feature flow vector fields” to an output of “characterization of specific transformation applied”. Note that, in contrast to the animal example, our context uses learning in the second mapping step rather than the first.

A specific example description of this two-part process: Take some previously never-seen new image with a central object. Obtain or produce another image in which the central object has been rotated by some amount. Push the original image through AlexNet and collect the resulting AlexNet features for all layers. Push the rotated-central-object image through AlexNet and collect the resulting AlexNet features for all layers. For each layer, compute the “feature flow” vector field; this is the end of the first mapping step. The second mapping step takes the collection of computed “per-layer feature flow vector fields” and predicts the angle of rotation applied between the pair of images the process started with. In our context, we use our learned generator in this second mapping step. We now discuss the details of our approach to “zero-shot learning”.

Details of our “zero-shot learning” task

The specific exploratory task we use to evaluate the feasibility of zero-shot learning (mediated by top-down information distilled from the observed behavior of network internal feature flow fields) can be described as follows: We have generated (image-with-central-object, image-with-central-object-rotated) pairs. We have computed feature flows for these pairs. We have performed PCA on these feature flows to determine an “effective basis matrix” associated with a rotation angle of 10°. We have fit a calibration regression, resulting in the vector of least-squares coefficients with which we can make feature flow predictions in terms of the “effective basis matrix”. Our initial “zero-shot learning” prediction task will be to categorize the rotation angle used in the image pair as “greater than 60°” or “less than 60°”.
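One plausible way to carry out the second mapping step, assumed here for illustration rather than taken from the text, is to compare the observed feature flow against the flows predicted for a grid of candidate rotation amounts and then threshold the best-matching amount at 60°. The polynomial column form in `effective_basis`, the candidate grid, and all names are hypothetical.

```python
import numpy as np


def effective_basis(base_vectors, amount):
    """ASSUMED polynomial form: each base vector u contributes [u, amount*u, amount**2*u]."""
    return np.hstack([np.column_stack([u, amount * u, amount ** 2 * u])
                      for u in base_vectors.T])


def estimate_amount(flow, base_vectors, w, candidates):
    """Return the candidate amount whose predicted flow is closest to the observed flow."""
    residuals = [np.linalg.norm(effective_basis(base_vectors, a) @ w - flow)
                 for a in candidates]
    return float(candidates[int(np.argmin(residuals))])


def categorize(flow, base_vectors, w, candidates, threshold=60.0):
    """Label an observed flow as 'greater than 60' or 'less than 60' (degrees)."""
    amount = estimate_amount(flow, base_vectors, w, candidates)
    return "greater than 60" if amount > threshold else "less than 60"
```

Because the predicted flow is available for any amount, the candidate grid can include rotation amounts that never appeared in the calibration pairs, which is what makes the task “zero-shot” in the sense used here.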

Figure 4: Desired categorization output: “Rotated less than 60°”.

We compare the performance of our approach to a more standard approach: We train a CNN in a “Siamese” configuration to take as input pairs of the form (image, image-with-central-object-rotated) and to produce as output a prediction of the angle of rotation between the images in the pair. One branch of the “Siamese” network receives the initial image as input; the other branch receives input with the central object rotated. Each branch is structured to match AlexNet layers from conv1 up to pool5, that is, up to but not including fc6. We then stack the channels from the respective pool5 layers from each branch. The resulting stack is provided as input to a fully-connected layer, fc6, with 4096 units; fc7 takes these 4096 units of input and produces 4096 units of output; finally, fc8 produces a single scalar output: the probability that the rotation angle used in the image pair was “greater than 60°” (as opposed to “less than 60°”).

On a test set of 1,600 previously unseen image pairs with orientation angles ranging through 360°, our initial zero-shot learning approach yields correct categorization 74% of the time. We structure our comparison question as “How many image pairs are required to train the ‘Siamese’ network to a level of prediction performance comparable to the zero-shot approach?” We observe that with 500 training pairs, the “Siamese” network attains 62% correct categorization; with 2,500 pairs performance improves to 64%; with 12,500 pairs, to 86%; and finally with 30,000 pairs, to 96%.

3.3 (Network internal) “data augmentation”

We have previously illustrated our ability to use learned generators to produce a variety of “predicted” CNN feature response maps, each of which corresponds to some exact CNN feature response map that would have arisen in a standard bottom-up approach; we will now describe how we can use these learned generators to perform (network internal) “data augmentation”. To ground our discussion, consider an initial image $I$. If one were to perform standard “data augmentation”, one might apply a variety of rotations to the initial image, say from a possible collection of $n$ rotation angle amounts $\{\delta_1, \dots, \delta_n\}$, where we have chosen our notation to emphasize that the index is over possible rotation angle values. The “data augmentation” process would involve $n$ corresponding images, one per rotation amount. Network training then proceeds in the usual fashion: for whichever transformed input image, the corresponding collection of AlexNet features would be computed and used to produce loss values and backpropagate updates to the network parameters.

Our observation is that we can use our learned generators to produce, in a network-internal fashion, AlexNet internal features akin to those computed from the transformed input images. Specifically, in our running rotation example we learned to produce predictions (for each layer of the network) of the flow field associated with a specified rotation angle. As mentioned previously, we refer to the learned generator at a layer $\ell$ associated with a rotation angle of $\delta$ as $G_\ell(\delta)$. We regard the process of applying a learned generator, for example $G_\ell(\delta)$, to the layer $\ell$ AlexNet features, $F_\ell(I)$, as a method of producing feature values akin to those that would arise at layer $\ell$ from the rotated image. To emphasize this notion, we will denote the values obtained by applying the learned generator flow field to the layer $\ell$ AlexNet features as $\hat{F}_\ell(I; \delta)$ (with the entire collection of layers denoted as $\hat{F}(I; \delta)$). Using our newly established notation, we can express our proposed (network internal) “data augmentation” as follows: From some initial image $I$, compute AlexNet features $F(I)$. For any desired rotation, say $\delta$, determine the associated learned generator flow field $G_\ell(\delta)$. Apply this generator flow field to, for example, $F_\ell(I)$ to obtain “predicted feature values” $\hat{F}_\ell(I; \delta)$. The standard feedforward computation can then proceed from layer $\ell$ to produce a prediction, receive a loss, and begin the backpropagation process by which we can update our network parameters according to this (network internal) “generated feature example”.

Method                                                     top-1 m-view   top-5 m-view
AlexNet (Krizhevsky et al., 2012)                          60.15          83.93
AlexNet after 5 additional epochs of generator training    60.52          84.35
Table 1: ImageNet validation set accuracy (in %).

The backpropagation process involves a subtlety. Our use of the generator means that the forward path through the network experiences a warp in the features. To correctly propagate gradients during the backpropagation process, the path that the gradient values follow should experience the “(additive) inverse” of the forward warp. We can describe the additive inverse of our gridded vector field fairly simply: Every vector in the initial field should have a corresponding vector in the inverse field; the component values should be negated and the root of the “inverse vector” should be placed at the head of the “forward vector”. The “inverse field” thus cancels out the “forward field”. Unfortunately, the exact “inverse vector” root locations will not lie on the grid used by the forward vector field. We obtain an approximate inverse vector field by negating the forward vector field components. In tests, we find that this approximation is often quite good; see Fig. A2. Using this approximate inverse warp, our learned generator warp can be used in the context of network internal data augmentation during training. See Fig. 5 for an illustration.
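The approximate inverse can be illustrated on a toy warp. The sketch below assumes a nearest-neighbor gather warp with border clamping (a hypothetical stand-in for the actual warp module) and checks that negating a uniform forward field undoes the forward warp away from the clamped border; for non-uniform fields the cancellation is only approximate, as noted above.

```python
import numpy as np


def apply_flow(features, flow):
    """Nearest-neighbor gather warp: out[:, y, x] = features[:, y + dy, x + dx], clamped."""
    _, h, w = features.shape
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    src_y = np.clip(np.rint(ys + flow[..., 0]).astype(int), 0, h - 1)
    src_x = np.clip(np.rint(xs + flow[..., 1]).astype(int), 0, w - 1)
    return features[:, src_y, src_x]


def approximate_inverse(flow):
    """Approximate inverse field: negate the components, keeping the vectors on the same grid."""
    return -flow


rng = np.random.default_rng(0)
features = rng.normal(size=(2, 5, 7))
forward = np.zeros((5, 7, 2))
forward[..., 1] = 1.0  # uniform shift of one grid cell along x

warped = apply_flow(features, forward)
restored = apply_flow(warped, approximate_inverse(forward))
# Away from the clamped left border, the round trip recovers the original features.
print(np.array_equal(restored[:, :, 1:], features[:, :, 1:]))
```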

Figure 5: Illustration of network internal data augmentation, more succinctly described as “generator training”. On the left, we show a schematic of the modified AlexNet architecture we use. The primary difference is the incorporation at conv5 of a module applying a randomly selected learned generator flow field. On the right, we provide a comparison between five selected conv5 channels: (lower row) before applying the “scale 1.3x” learned generator flow field; (upper row) after applying the generator flow field.

We now discuss the use of our proposed network internal data augmentation to improve the performance of AlexNet on ImageNet. We train using the 1.3M images of the ILSVRC2014 CLS-LOC task. For each batch of 256 images during training, we randomly select one of six learned generator flow fields to apply to the initial features in conv5: +30° rotation, -30° rotation, 1.3x scaling, 0.75x scaling, translation 30 pixels left, or translation 30 pixels up. We apply the (approximate) inverse warp when backpropagating through conv5. We evaluate performance on the 50k images of the ILSVRC2014 CLS-LOC validation set; see Table 1. After 5 additional epochs of generator training, top-1 m-view accuracy rises from 60.15% to 60.52% and top-5 m-view accuracy from 83.93% to 84.35%.
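The per-batch selection described above can be sketched as a small schedule generator. The label strings, the function name, the fixed seed, and the choice of a uniform distribution are assumptions for illustration; the text only states that one of the six flow fields is chosen at random for each batch.

```python
import random

# Hypothetical labels for the six learned generator flow fields named in the text.
GENERATORS = (
    "rotate_+30", "rotate_-30", "scale_1.3x", "scale_0.75x",
    "translate_30px_left", "translate_30px_up",
)

def generator_schedule(num_batches, seed=0):
    """Yield one generator label per training batch (uniform choice is an assumption)."""
    rng = random.Random(seed)
    for _ in range(num_batches):
        yield rng.choice(GENERATORS)

# Roughly one epoch: about 1.3M images in batches of 256 (image count is approximate).
batches_per_epoch = -(-1_300_000 // 256)  # ceiling division
schedule = list(generator_schedule(batches_per_epoch))
```

Each entry of `schedule` names the flow field that would be applied at conv5 for that batch, with its negation used on the backward pass.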


This work is supported by NSF IIS-1216528 (IIS-1360566), NSF award IIS-0844566 (IIS-1360568), and a Northrop Grumman Contextual Robotics grant. We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Tesla K40 GPU used for this research. We thank Chen-Yu Lee and Jameson Merkow for their assistance, and Saining Xie and Xun Huang for helpful discussion.


  • Ames Jr (1951) Ames Jr, Adelbert. Visual perception and the rotating trapezoidal window. Psychological Monographs: General and Applied, 65(7):i, 1951.
  • Borenstein & Ullman (2008) Borenstein, Eran and Ullman, Shimon. Combined top-down/bottom-up segmentation. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 30(12):2109–2125, 2008.
  • Cherry (1953) Cherry, E Colin. Some experiments on the recognition of speech, with one and with two ears. The Journal of the acoustical society of America, 25(5):975–979, 1953.
  • Cohen & Welling (2015) Cohen, T and Welling, M. Transformation Properties of Learned Visual Representations. In ICLR, 2015.
  • Dai & Wu (2014) Dai, Jifeng and Wu, Ying-Nian. Generative Modeling of Convolutional Neural Networks. arXiv preprint arXiv:1412.6296, 2014.
  • Dosovitskiy & Brox (2015) Dosovitskiy, Alexey and Brox, Thomas. Inverting convolutional networks with convolutional networks. arXiv preprint arXiv:1506.02753, 2015.
  • Dosovitskiy et al. (2015) Dosovitskiy, Alexey, Tobias Springenberg, Jost, and Brox, Thomas. Learning to Generate Chairs With Convolutional Neural Networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1538–1546, 2015.
  • Gregory (1970) Gregory, Richard Langton. The intelligent eye. 1970.
  • Hill & Johnston (2007) Hill, Harold C and Johnston, Alan. The hollow-face illusion: Object specific knowledge, general assumptions or properties of the stimulus. 2007.
  • Hinton et al. (2012) Hinton, Geoffrey, Deng, Li, Yu, Dong, Dahl, George, Mohamed, Abdel-Rahman, Jaitly, Navdeep, Senior, Andrew, Vanhoucke, Vincent, Nguyen, Patrick, Sainath, Tara, and Kingsbury, Brian. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. In IEEE Signal Processing Magazine, 2012.
  • Jaderberg et al. (2015) Jaderberg, Max, Simonyan, Karen, Zisserman, Andrew, and Kavukcuoglu, Koray. Spatial transformer networks. arXiv preprint arXiv:1506.02025, 2015.
  • Jia et al. (2014) Jia, Yangqing, Shelhamer, Evan, Donahue, Jeff, Karayev, Sergey, Long, Jonathan, Girshick, Ross, Guadarrama, Sergio, and Darrell, Trevor. Caffe. In ACM MM, 2014.
  • Joshi et al. (2000) Joshi, Sarang C, Miller, Michael, et al. Landmark matching via large deformation diffeomorphisms. Image Processing, IEEE Transactions on, 9(8):1357–1370, 2000.
  • Kaiser & Boynton (1996) Kaiser, Peter K and Boynton, Robert M. Human color vision. 1996.
  • Kersten et al. (2004) Kersten, Daniel, Mamassian, Pascal, and Yuille, Alan. Object perception as Bayesian inference. Annu. Rev. Psychol., 55:271–304, 2004.
  • Krizhevsky et al. (2012) Krizhevsky, Alex, Sutskever, Ilya, and Hinton, Geoffrey E. ImageNet Classification with Deep Convolutional Neural Networks. In NIPS, 2012.
  • Levin & Weiss (2009) Levin, Anat and Weiss, Yair. Learning to combine bottom-up and top-down segmentation. International Journal of Computer Vision, 81(1):105–118, 2009.
  • Liu et al. (2011) Liu, Ce, Yuen, Jenny, and Torralba, Antonio. Sift flow: Dense correspondence across scenes and its applications. PAMI, 33(5):978–994, 2011.
  • Mahendran & Vedaldi (2014) Mahendran, Aravindh and Vedaldi, Andrea. Understanding Deep Image Representations by Inverting Them. IJCV, 2(60):91–110, 2014.
  • Marr (1982) Marr, David. Vision: A computational approach, 1982.
  • Memisevic & Hinton (2007) Memisevic, Roland and Hinton, Geoffrey. Unsupervised learning of image transformations. In Computer Vision and Pattern Recognition, 2007. CVPR’07. IEEE Conference on, pp. 1–8. IEEE, 2007.
  • Nguyen et al. (2015) Nguyen, Anh, Yosinski, Jason, and Clune, Jeff. Deep Neural Networks Are Easily Fooled: High Confidence Predictions for Unrecognizable Images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 427–436, 2015.
  • Russakovsky et al. (2014) Russakovsky, Olga, Deng, Jia, Su, Hao, Krause, Jonathan, Satheesh, Sanjeev, Ma, Sean, Huang, Zhiheng, Karpathy, Andrej, Khosla, Aditya, Bernstein, Michael, Berg, Alexander C., and Fei-Fei, Li. ImageNet Large Scale Visual Recognition Challenge. IJCV, 2014.
  • Simard et al. (1998) Simard, Patrice Y, LeCun, Yann A, Denker, John S, and Victorri, Bernard. Transformation invariance in pattern recognition—tangent distance and tangent propagation. In Neural networks: tricks of the trade, pp. 239–274. Springer, 1998.
  • Stroop (1935) Stroop, J Ridley. Studies of interference in serial verbal reactions. Journal of experimental psychology, 18(6):643, 1935.
  • Sutskever et al. (2014) Sutskever, Ilya, Vinyals, Oriol, and Le, Quoc V. Sequence to Sequence Learning with Neural Networks. In NIPS, 2014.
  • Szegedy et al. (2013) Szegedy, Christian, Zaremba, Wojciech, Sutskever, Ilya, Bruna, Joan, Erhan, Dumitru, Goodfellow, Ian, and Fergus, Rob. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199, 2013.
  • Taylor et al. (2010) Taylor, Graham W, Fergus, Rob, LeCun, Yann, and Bregler, Christoph. Convolutional learning of spatio-temporal features. In Computer Vision–ECCV 2010, pp. 140–153. Springer, 2010.
  • Tu et al. (2005) Tu, Zhuowen, Chen, Xiangrong, Yuille, Alan L, and Zhu, Song-Chun. Image parsing: Unifying segmentation, detection, and recognition. IJCV, 63(2):113–140, 2005.

A1 Supplementary Materials

Figure A1: Visualizations of the mean “feature flow” as computed across the 91 “original images” whose selection is described in Section 2.1. The leftmost column contains visualizations computed from features arrived at with input image pairs that differ by a 10° rotation of the central object; the center column, with input image pairs that differ by the central object being scaled by 1.3x; the rightmost column, with input image pairs that differ by the central object being scaled by 0.75x. Moving from top to bottom within each column, the feature flow fields are shown, respectively, for conv1, then pool1, then conv2.
Figure A2: Here we confirm that the “negation” of a generator flow field is (both qualitatively and quantitatively) a good approximation to the additive inverse of that generator flow field. Since the inverted images from conv1 have more detail, we perform our qualitative evaluation with conv1. Since our actual training uses conv5, we perform our quantitative evaluation with conv5. In each entry above, we begin with AlexNet conv1 features from the original “taco cat” image. The “inverted image” (Dosovitskiy & Brox, 2015) corresponding to these untouched features is found in the top left. In each other entry we apply a different learned generator followed by its “negation”. The close correspondence between the images “inverted” from the resulting features and the image “inverted” from the untouched features confirms the quality of the approximation. For the quantitative evaluation, we compare the feature values obtained by applying a generator flow field followed by its negation against the original AlexNet conv5 feature values, across the 256 channels of conv5: the approximate inverse of “rotate -30” yields 0.37 RMS (0.09 mean absolute difference); the approximate inverse of “scale by 1.3x” yields 0.86 RMS (0.19 mean absolute difference); for “translation 30 left”, the approximation incurs error at the boundaries of the flow region, yielding 3.96 RMS (but 0.9 mean absolute deviation).
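The two error measures quoted in the caption respond differently to errors concentrated in a few cells, which is the pattern the caption reports for the translation case. The sketch below uses invented toy values (not the paper's data) on a flattened 13x13 map, the spatial size of conv5 in the standard AlexNet, with all of the error placed in one boundary column.

```python
import math

def rms(a, b):
    """Root-mean-square difference between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))

def mean_abs(a, b):
    """Mean absolute difference between two equal-length sequences."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

# Toy data: the round trip matches everywhere except the last column of a
# 13x13 map, where every cell is off by 10. These numbers are illustrative only.
original = [0.0] * 169
round_trip = [10.0 if i % 13 == 12 else 0.0 for i in range(169)]

rms_err = rms(original, round_trip)
mad_err = mean_abs(original, round_trip)
```

Because RMS squares each difference before averaging, a small number of large boundary errors raises it much more than the mean absolute difference, so a high RMS paired with a modest mean absolute value points to localized rather than widespread error.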