Deep Fishing: Gradient Features from Deep Nets

by   Albert Gordo, et al.

Convolutional Networks (ConvNets) have recently improved image recognition performance thanks to end-to-end learning of deep feed-forward models from raw pixels. Deep learning is a marked departure from the previous state of the art, the Fisher Vector (FV), which relied on gradient-based encoding of local hand-crafted features. In this paper, we discuss a novel connection between these two approaches. First, we show that one can derive gradient representations from ConvNets in a similar fashion to the FV. Second, we show that this gradient representation actually corresponds to a structured matrix that allows for efficient similarity computation. We experimentally study the benefits of transferring this representation over the outputs of ConvNet layers, and find consistent improvements on the Pascal VOC 2007 and 2012 datasets.




1 Introduction

Image classification involves describing images with pre-determined labels. One of the first breakthroughs towards solving this problem was the bag-of-visual-words (BOV) [Sivic and Zisserman(2003), Csurka et al.(2004)Csurka, Dance, Fan, Willamowski, and Bray]. While the BOV simply involves counting the number of occurrences of quantized local features, approaches that encode higher-order statistics, such as the Fisher Vector (FV) [Perronnin and Dance(2007), Perronnin et al.(2010)Perronnin, Sánchez, and Mensink], led to state-of-the-art image classification results [Chatfield et al.(2011)Chatfield, Lempitsky, Vedaldi, and Zisserman, Sánchez et al.(2013)Sánchez, Perronnin, Mensink, and Verbeek]. In particular, such higher-order encodings were used by the leading teams in the 2010 and 2011 editions of the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) [Deng et al.(2009)Deng, Dong, Socher, Li, Li, and Fei-Fei, Russakovsky et al.(2015)]. FV-based approaches were however outperformed in 2012 by the work of Krizhevsky et al. [Krizhevsky et al.(2012)Krizhevsky, Sutskever, and Hinton] based on Convolutional Networks (ConvNets) [LeCun et al.(1989)LeCun, Boser, Denker, Henderson, Howard, Hubbard, and Jackel] trained in a supervised fashion on large amounts of labeled data. These models are feed-forward architectures involving multiple computational layers that alternate linear operations, e.g. convolutions, and non-linear operations, e.g. rectified linear units (ReLU). The end-to-end training of the large number of parameters inside ConvNets from pixels to the specific end-task is a key to their success. Since then, ConvNets, including improved architectures [Zeiler and Fergus(2014), Sermanet et al.(2014)Sermanet, Eigen, Zhang, Mathieu, Fergus, and LeCun, Simonyan and Zisserman(2015)], have consistently outperformed all other alternatives in subsequent editions of ILSVRC. Also, ConvNets have remarkable transferability properties when used as “universal” feature extractors [Yosinski et al.(2014)Yosinski, Clune, Bengio, and Lipson]: if one feeds an image to a ConvNet, the output of intermediate layers can be used as a representation of this image and is typically fed to linear classifiers. To the best of our knowledge, this heuristic is not based on a strong theoretical ground, but has been experimentally shown to work well in practice [Donahue et al.(2014)Donahue, Jia, Vinyals, Hoffman, Zhang, Tzeng, and Darrell, Oquab et al.(2014)Oquab, Bottou, Laptev, and Sivic, Zeiler and Fergus(2014), Chatfield et al.(2014)Chatfield, Simonyan, vedaldi, and Zisserman, Razavian et al.(2014)Razavian, Azizpour, Sullivan, and Carlsson].

Figure 1: AlexNet architecture [Krizhevsky et al.(2012)Krizhevsky, Sutskever, and Hinton]. $W_1, \ldots, W_5$ are the parameters (4D tensors) of the convolutional layers. $W_6$, $W_7$, and $W_8$ are the parameters (matrices) of the fully connected layers. Black (resp. red) arrows represent the information flow during the forward (resp. backward) pass. Inspired by the Fisher Kernel [Jaakkola and Haussler(1998)], we study the use of gradient-related information (the blue matrices) as transferable representations.

Although ConvNets and FV approaches differ significantly, several works tried to combine their benefits [Simonyan et al.(2013)Simonyan, Vedaldi, and Zisserman, Sydorov et al.(2014)Sydorov, Sakurada, and Lampert, Gong et al.(2014)Gong, Wang, Guo, and Lazebnik, Perronnin and Larlus(2015)]. Our work also attempts to get the best of both FV and ConvNet worlds. Our primary contribution is a novel approach to extract a transferable representation of an image given a pre-trained ConvNet. We draw inspiration from the FV, which is based on the theoretically well-founded Fisher Kernel (FK) proposed by Jaakkola and Haussler [Jaakkola and Haussler(1998)]. The FK involves deriving a kernel from an underlying generative model of the data by taking the gradient of the log-likelihood with respect to the model parameters. In a similar manner, given an unlabeled image, we propose to compute the gradient of a cross-entropy criterion measured between the predicted class probabilities and an equal probability output. This gradient with respect to the parameters of the fully connected layers yields very high-dimensional representations (cf. Figure 1). Our second contribution consists in leveraging the special structure of this gradient representation to design an efficient kernel. We show that our representation actually corresponds to a rank-1 matrix, for which the trace kernel can be efficiently computed. Furthermore, this kernel decomposes in our case into the product of two simpler kernels: the standard one on forward-pass features, and a second one on quantities efficiently computed by back-propagation.

The remainder of this article is organized as follows. In section 2, we review related work. In section 3, we provide more background on the FK and ConvNets. In section 4, we introduce our novel hybrid ConvNet-gradient representation as well as our associated efficient kernel. Finally, we provide experimental results on the PASCAL VOC 2007 and 2012 benchmarks in section 5, showing that our representation consistently transfers better than the standard forward-pass features.

2 Related Work

Hybrid techniques. Several works have proposed to combine the benefits of deep learning with "shallow" bag-of-patches representations based on higher-order statistics such as the FV [Perronnin and Dance(2007), Perronnin et al.(2010)Perronnin, Sánchez, and Mensink] or the VLAD [Jégou et al.(2010)Jégou, Douze, Schmid, and Pérez]. Simonyan et al. [Simonyan et al.(2013)Simonyan, Vedaldi, and Zisserman] propose to stack multiple FV layers, each defined as a set of five operations: i) FV encoding, ii) supervised dimensionality reduction, iii) spatial stacking, iv) normalization and v) PCA dimensionality reduction. They show that, when combined with the original FV, such networks lead to significant performance improvements on ImageNet. Peng et al. [Peng et al.(2014b)Peng, Zou, Qiao, and Peng] proposed a similar idea, but for action recognition. Alternatively, Sydorov et al. [Sydorov et al.(2014)Sydorov, Sakurada, and Lampert] improve on the FV framework by jointly learning the SVM classifier and the GMM visual vocabulary. Conceptually, this is similar to back-propagation as used to learn neural network parameters: the gradients corresponding to the SVM layer are back-propagated to compute the gradients with respect to the GMM parameters. Peng et al. [Peng et al.(2014a)Peng, Wang, Qiao, and Peng] proposed a similar idea for the VLAD [Jégou et al.(2010)Jégou, Douze, Schmid, and Pérez] descriptor. Finally, Gong et al. [Gong et al.(2014)Gong, Wang, Guo, and Lazebnik] address the lack of geometric invariance in ConvNets with a hybrid approach. They extract mid-level ConvNet features from large patches, embed them using the VLAD encoding, and aggregate them at multiple scales. This leads to competitive results on a number of classification tasks. While our goal, getting the best of the FV and deep frameworks, is shared with these previous works, we differ significantly, as we are the first to propose to derive gradient features from deep nets.

Deriving representations from pre-trained classifiers. Classemes [Wang et al.(2009)Wang, Hoiem, and Forsyth, Torresani et al.(2010)Torresani, Szummer, and Fitzgibbon] is a common image representation from a set of classifiers obtained by simply stacking classifier scores. Dimensionality reduction is generally applied on classeme features [Douze et al.(2011)Douze, Ramisa, and Schmid], but learning the classification and the dimensionality reduction separately is suboptimal [Gordo et al.(2012)Gordo, Rodríguez-Serrano, Perronnin, and Valveny]. Several works [Bergamo et al.(2011)Bergamo, Torresani, and Fitzgibbon, Weston et al.(2010)Weston, Bengio, and Usunier, Gordo et al.(2012)Gordo, Rodríguez-Serrano, Perronnin, and Valveny] learn an optimal embedding of images in a low-dimensional space via classifiers with an intermediate hidden layer. The first layer can be understood as a supervised dimensionality reduction step, while the second one can be interpreted as a set of classifiers in the intermediate space. A new image is represented as the output of this intermediate layer, discarding the classifiers. A natural extension is to learn deeper architectures, i.e., architectures with more than one hidden layer, and to use the output of these intermediate layers as features for the new tasks. Krizhevsky et al. [Krizhevsky et al.(2012)Krizhevsky, Sutskever, and Hinton] proposed to learn end-to-end a deep classifier based on the ConvNet architecture of LeCun et al. [LeCun et al.(1989)LeCun, Boser, Denker, Henderson, Howard, Hubbard, and Jackel]. They showed qualitatively that the output of the penultimate layer could be used for image retrieval. This finding was quantitatively validated for a number of tasks, including image classification [Donahue et al.(2014)Donahue, Jia, Vinyals, Hoffman, Zhang, Tzeng, and Darrell, Oquab et al.(2014)Oquab, Bottou, Laptev, and Sivic, Zeiler and Fergus(2014), Chatfield et al.(2014)Chatfield, Simonyan, vedaldi, and Zisserman, Razavian et al.(2014)Razavian, Azizpour, Sullivan, and Carlsson], image retrieval [Razavian et al.(2014)Razavian, Azizpour, Sullivan, and Carlsson, Babenko et al.(2014)Babenko, Slesarev, Chigorin, and Lempitsky], object detection [Girshick et al.(2014)Girshick, Donahue, Darrell, and Malik], and action recognition [Oquab et al.(2014)Oquab, Bottou, Laptev, and Sivic]. The choice of the layer(s) whose output should be used for representation purposes depends on the problem at hand. As observed by Yosinski et al. [Yosinski et al.(2014)Yosinski, Clune, Bengio, and Lipson], this choice should be driven by the distance between the base task (the one used to learn the classifier) and the target task. In this paper, we show that this heuristic of using the output of an intermediate-level fully connected layer as image representation can be related to the application of the Fisher Kernel idea to ConvNets.

3 Background on the Fisher Kernel and ConvNets

3.1 Fisher Kernel

The Fisher Kernel (FK) is a generic principle introduced to combine the benefits of generative and discriminative models for pattern recognition. Let $X$ be a sample, and let $u_\theta$ be a probability density function that models the generative process of $X$, where $\theta$ denotes the vector of parameters of $u_\theta$. In statistics, the score function is given by the gradient of the log-likelihood of the data on the model:

$$G_\theta^X = \nabla_\theta \log u_\theta(X). \tag{1}$$

This gradient describes the contribution of the individual parameters to the generative process. Jaakkola and Haussler [Jaakkola and Haussler(1998)] proposed to measure the similarity between two samples $X$ and $Y$ using the Fisher Kernel (FK), which is defined as:

$$K_{FK}(X, Y) = {G_\theta^X}^\top F_\theta^{-1} \, G_\theta^Y, \tag{2}$$

where $F_\theta$ is the Fisher Information Matrix, usually approximated by the identity matrix [Jaakkola and Haussler(1998)]. One of the benefits of the FK framework is that it comes with guarantees. The FK is indeed asymptotically at least as good as the MAP decision rule, when assuming that the classification label is included in the generative model as a latent variable (theorem 1 in [Jaakkola and Haussler(1998)]). Some extensions make the dependence of the kernel on the classification labels explicit. This includes the likelihood kernel [Fine et al.(2001)Fine, Navrátil, and Gopinath], which involves one generative model per class, and which consists in computing one FK for each generative model (and consequently for each class). This also includes the likelihood ratio kernel [Smith and Gales(2001)], which is tailored to the two-class problem, and which involves computing the gradient of the log of the ratio between the two class likelihoods. Given two classes denoted $c_1$ and $c_2$ with class-conditional probability density functions $p(\cdot|c_1)$ and $p(\cdot|c_2)$ and with collective parameters $\theta$, this yields:

$$\nabla_\theta \log \frac{p(X|c_1)}{p(X|c_2)}. \tag{3}$$

The likelihood ratio kernel is supported by strong experimental evidence [Smith and Gales(2001)] and theory [Smith and Gales(2002)]. In section 4, we extend it to derive a gradient representation from a ConvNet model.
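To make the score function and the Fisher Kernel concrete, here is a minimal sketch for a toy generative model. The model (a one-dimensional Gaussian with parameters `mu` and `sigma`), the function names, and the sample values are assumptions chosen for illustration; the Fisher Information Matrix is approximated by the identity, as mentioned above.

```python
def gaussian_score(samples, mu, sigma):
    """Score function: gradient of the log-likelihood of i.i.d. samples
    under N(mu, sigma^2), taken with respect to (mu, sigma)."""
    d_mu = sum((x - mu) / sigma**2 for x in samples)
    d_sigma = sum(((x - mu) ** 2 - sigma**2) / sigma**3 for x in samples)
    return [d_mu, d_sigma]


def fisher_kernel(X, Y, mu=0.0, sigma=1.0):
    """Fisher Kernel between two sample sets, with the Fisher Information
    Matrix approximated by the identity (a plain dot-product of scores)."""
    gx = gaussian_score(X, mu, sigma)
    gy = gaussian_score(Y, mu, sigma)
    return sum(a * b for a, b in zip(gx, gy))


if __name__ == "__main__":
    # Two hypothetical sample sets compared under the same toy model.
    print(fisher_kernel([0.5, 1.5], [2.0]))
```

The same pattern (differentiate a log-probability with respect to the model parameters, then compare the resulting vectors with a dot-product) is what the rest of the paper transfers to ConvNets.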

3.2 Convolutional Networks

Convolutional Networks (ConvNets) [LeCun et al.(1989)LeCun, Boser, Denker, Henderson, Howard, Hubbard, and Jackel] have been the de facto state-of-the-art models for image recognition since the work of Krizhevsky et al. [Krizhevsky et al.(2012)Krizhevsky, Sutskever, and Hinton]. This class of deep learning models relies on a feed-forward architecture typically composed of a stack of convolutional layers followed by a stack of fully connected layers (see Figure 1 for the standard AlexNet [Krizhevsky et al.(2012)Krizhevsky, Sutskever, and Hinton] architecture). A convolutional layer is parametrized by a 4D tensor representing a stack of 3D filters. During the forward pass, these filters are run in a sliding window fashion across the output of the previous layer (or the image itself for the first layer) in order to produce a 3D tensor: the stack of per-filter activation maps. These activation maps then pass through a non-linearity (typically a Rectified Linear Unit, or ReLU [Krizhevsky et al.(2012)Krizhevsky, Sutskever, and Hinton]) and an optional pooling stage before being fed to the next layer. Both the standard AlexNet [Krizhevsky et al.(2012)Krizhevsky, Sutskever, and Hinton] and recent improved architectures like VGGNet [Simonyan and Zisserman(2015)] use a stack of fully connected layers to transform the output activation map of the convolutional layers into class-membership probabilities. A fully connected layer consists of a simple matrix-vector multiplication followed by a non-linearity, typically ReLU for intermediate layers and a SoftMax for the last one.

Let $x_n$ be the output of layer $n$, which is also the input of layer $n+1$ (for AlexNet, $x_5$ is the flattened activation map of the fifth convolutional layer). Layer $n$ is parametrized by the 4D tensor $W_n$ if it is a convolutional layer, and by the matrix $W_n$ for a fully connected layer. A fully connected layer performs the operation $x_n = \sigma(W_n^\top x_{n-1})$, where $\sigma$ is the non-linearity. We note $y_n = W_n^\top x_{n-1}$ the output of layer $n$ before the non-linearity, and $\theta = \{W_1, \ldots, W_L\}$ the parameters of all $L$ layers of the network. Training such deep models consists in end-to-end learning of this vast number of parameters via the minimization of an error (or loss) function on a large training set of $N$ image and ground-truth label pairs $(x^{(j)}, y^{(j)})$. The typical loss function used for classification is the cross-entropy:

$$E(x, y; \theta) = -\sum_{i=1}^{C} y_i \log p(c_i | x; \theta), \tag{4}$$

where $C$ is the number of labels (categories), $y \in \{0, 1\}^C$ is the label vector of image $x$, and $p(c_i | x; \theta)$ is the predicted probability of class $c_i$ for image $x$ resulting from the forward pass.

The optimal network parameters $\theta^*$ are the ones minimizing the loss over the training set:

$$\theta^* = \arg\min_\theta \sum_{j=1}^{N} E(x^{(j)}, y^{(j)}; \theta). \tag{5}$$

This optimization problem is typically solved using Stochastic Gradient Descent (SGD) [Bottou(1998)], a stochastic approximation of batch gradient descent that takes approximate gradient steps equal on average to the true gradient $\nabla_\theta E$. Each approximate gradient step is typically performed with a small batch of labeled examples in order to efficiently leverage the caching and vectorization mechanisms of modern hardware.
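As an illustration of the cross-entropy loss described above, the following sketch evaluates it for a single example from raw class scores. The function names and the toy scores are assumptions for illustration only; terms with a zero label are skipped, which matches the convention that they contribute nothing to the sum.

```python
import math


def softmax(scores):
    """Numerically stable SoftMax over a list of class scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, labels):
    """Cross-entropy between a label vector and predicted probabilities.
    Classes with a zero label are skipped since they add nothing."""
    return -sum(y * math.log(p) for y, p in zip(labels, probs) if y > 0)


if __name__ == "__main__":
    probs = softmax([2.0, 0.5, -1.0])        # hypothetical class scores
    print(cross_entropy(probs, [1, 0, 0]))   # loss for a one-hot label
```

The same `cross_entropy` function accepts a non-binary label vector, which is how a uniform label vector can be plugged in at test time.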

A particularity of deep networks is that the gradients with respect to all parameters $\theta$ can be computed efficiently in a stage-wise fashion via a sequential application of the chain rule (“back-propagation” [LeCun et al.(1989)LeCun, Boser, Denker, Henderson, Howard, Hubbard, and Jackel]). In particular, when using ConvNets as feature extractors, the first phase consists in pre-training the network (i.e., obtaining $\theta^*$) via SGD with back-propagation on a large labeled dataset like ImageNet [Russakovsky et al.(2015)]. Then, ConvNet features can be used for different tasks using forward passes on the pre-trained network. In the following, we describe how we can also use back-propagation at test time to transfer richer Fisher Vector-like representations based on the gradient of the loss with respect to the ConvNet parameters.

4 Gradient Features from Deep Nets

We now motivate the use of gradient features from deep nets by relating the likelihood ratio kernel in equation (3) to the ConvNet objective function in equation (4). We then make the gradient equations explicit and relate our gradient features to the standard heuristic features derived from the outputs of intermediate layers. Finally, we explain how to efficiently compute the similarity between these high-dimensional representations.

4.1 Relating the likelihood ratio kernel and deep nets

The FK [Jaakkola and Haussler(1998)] and its extensions [Fine et al.(2001)Fine, Navrátil, and Gopinath, Smith and Gales(2001)] were proposed as generic frameworks to derive representations and kernels from generative models. As the standard ConvNet classification architecture does not define a generative model, such frameworks cannot be applied as-is. However, we can draw inspiration from the likelihood ratio kernel for that purpose. We start from equation (3) and note that it can be rewritten as the gradient of the log of the ratio between posterior probabilities (assuming equal class priors):

$$\nabla_\theta \log \frac{p(X|c_1)}{p(X|c_2)} = \nabla_\theta \log \frac{p(c_1|X)}{p(c_2|X)}. \tag{6}$$

In the two-class problem of [Smith and Gales(2001)], we have $p(c_2|X) = 1 - p(c_1|X)$ and equation (6) gives:

$$\nabla_\theta \log \frac{p(c_1|X)}{p(c_2|X)} = \frac{1}{1 - p(c_1|X)} \, \nabla_\theta \log p(c_1|X), \tag{7}$$

where $\nabla_\theta \log p(c_1|X)$ is the gradient of the log-posterior for class $c_1$. We underline that the previous formula is general in the sense that it can be applied beyond generative models. To extend this representation beyond the two-class case, one may compute an embedding for each class using the gradient of the corresponding log-posterior probability.

We can now observe the relation between the ConvNet objective in equation (4) for an image $x$ and label vector $y$ with these gradient of log-posterior embeddings:

$$\nabla_\theta E(x, y; \theta) = -\sum_{i=1}^{C} y_i \, \nabla_\theta \log p(c_i | x; \theta). \tag{8}$$

Consequently, the gradient of the ConvNet objective can be interpreted, up to sign, as a sum of gradient embeddings $\nabla_\theta \log p(c_i | x; \theta)$, weighted by the labels $y_i$.

To use this gradient as an image representation, as is the case of the FK, there are two main challenges to be addressed. First, we do not have access to the value of the label $y$, which we need to compute the representation of a test image according to equation (8). The simplest solution consists in using a constant uniform label vector $\bar{y} = [1/C, \ldots, 1/C]^\top$. Although $\bar{y}$ is non-informative, we experimentally validate the interest of this simple strategy. The second issue concerning the use of $\nabla_\theta E(x, \bar{y}; \theta)$ as an image representation lies in the associated computational cost. Although scalable in the number of classes, this representation is very high-dimensional. The number of parameters in current deep architectures is indeed too large to be able to use the full gradient in practice. Therefore, we propose to use only the partial derivatives with respect to the parameters of some fixed layers, in the same spirit as what is currently done with layer-activation features. These partial derivatives can be computed and compared efficiently using the chain rule and a rank-1 decomposition, as shown in the following sections. Note also that this approach can be further combined with other existing techniques, including ones specialized for deep nets (e.g. model compression [Bucilua et al.(2006)Bucilua, Caruana, and Niculescu-Mizil]) or for FV (e.g. product quantization [Jégou et al.(2011)Jégou, Douze, and Schmid]).
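As a small worked step, written in assumed notation ($E$ for the cross-entropy loss, $\bar{y}$ for the uniform label vector, $C$ for the number of classes, $p(c_i | x; \theta)$ for the predicted class probabilities), substituting $\bar{y}_i = 1/C$ into the label-weighted sum of gradient embeddings shows what the uniform choice amounts to:

```latex
\nabla_\theta E(x, \bar{y}; \theta)
  = -\sum_{i=1}^{C} \bar{y}_i \, \nabla_\theta \log p(c_i \mid x; \theta)
  = -\frac{1}{C} \sum_{i=1}^{C} \nabla_\theta \log p(c_i \mid x; \theta).
```

Under this reading, the representation is (up to sign) the plain average of the per-class log-posterior gradients, so no class is singled out when the true label is unknown.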

4.2 Gradient derivation

One remarkable property of ConvNets and other feed-forward architectures is that they are differentiable through all their layers. In the case of ConvNets, it is easy to show that the gradients of the loss with respect to the weights of the fully connected layers are:

$$\frac{\partial E}{\partial W_k} = x_{k-1} \left[ \frac{\partial E}{\partial y_k} \right]^\top. \tag{9}$$

To compute the partial derivatives of the loss with respect to the layer outputs, $\partial E / \partial y_k$, needed in Equation (9), one can apply the chain rule. In the case of fully connected layers and ReLU non-linearities, this leads to the following recursive definition and base case:

$$\frac{\partial E}{\partial y_k} = \left[ W_{k+1} \frac{\partial E}{\partial y_{k+1}} \right] \circ \mathbb{I}[y_k > 0], \qquad \frac{\partial E}{\partial y_L} = \sigma(y_L) - y, \tag{10}$$

where $\mathbb{I}[y_k > 0]$ is an indicator vector, set to one at the positions where $y_k > 0$ and to zero otherwise, $\circ$ is the Hadamard or element-wise product, $y$ is a supplied vector of labels with which to compute the loss, and $\sigma$ is here the SoftMax function. From the previous section, we use $y = \bar{y}$, i.e., we assume that all classes have equal probabilities. It is worth noticing how $\partial E / \partial y_L$ is simply a shifted version of the output probabilities, while the derivatives w.r.t. $y_k$ with $k < L$ are linear transformations of these shifted probabilities, as the Hadamard product can be rewritten as a matrix multiplication.
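The forward and backward passes described above can be sketched for a small stack of fully connected layers. This is a minimal illustration under stated assumptions, not the authors' Caffe implementation: the helper names (`matT_vec`, `mat_vec`, `forward_backward`), the storage of each weight matrix as a list of rows with shape (input size × output size), and the toy dimensions are all hypothetical. Intermediate layers use ReLU, the last layer uses SoftMax, and the loss is the cross-entropy against uniform labels.

```python
import math


def matT_vec(W, x):
    """Compute W^T x for W stored as a list of rows (d_in x d_out)."""
    return [sum(W[i][j] * x[i] for i in range(len(x))) for j in range(len(W[0]))]


def mat_vec(W, g):
    """Compute W g for W of shape (d_in x d_out) and g of length d_out."""
    return [sum(w * gj for w, gj in zip(row, g)) for row in W]


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def forward_backward(x0, weights):
    """For every fully connected layer, return the pair
    (layer input, derivative of the loss w.r.t. the pre-activation),
    using uniform labels for the backward pass."""
    inputs, pre_acts = [], []
    x = x0
    for k, W in enumerate(weights):          # forward pass
        inputs.append(x)
        y = matT_vec(W, x)
        pre_acts.append(y)
        is_last = k == len(weights) - 1
        x = softmax(y) if is_last else [max(0.0, v) for v in y]
    num_classes = len(x)
    # Base case: probabilities shifted by the uniform label 1/C.
    grad = [p - 1.0 / num_classes for p in x]
    grads = [grad]
    # Recursion: back-propagate through the weights, then mask with ReLU.
    for k in range(len(weights) - 2, -1, -1):
        back = mat_vec(weights[k + 1], grad)
        grad = [b if y > 0 else 0.0 for b, y in zip(back, pre_acts[k])]
        grads.insert(0, grad)
    return list(zip(inputs, grads))


if __name__ == "__main__":
    W1 = [[0.5, -0.3, 0.8], [0.2, 0.7, -0.6]]      # 2 inputs -> 3 hidden units
    W2 = [[0.4, -0.1], [0.3, 0.9], [-0.5, 0.2]]    # 3 hidden units -> 2 classes
    for layer_input, dE_dy in forward_backward([1.0, 2.0], [W1, W2]):
        print(layer_input, dE_dy)
```

The gradient with respect to a weight matrix is the outer product of the two returned vectors, so it never needs to be stored explicitly; only the pair is kept per layer.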

4.3 Computing similarities between gradients

Using the gradients in equation (9) as features is problematic in practice due to their high-dimensional nature with current deep architectures. In the case of AlexNet [Krizhevsky et al.(2012)Krizhevsky, Sutskever, and Hinton], $\partial E / \partial W_6$ is around 38 million floating point values, while $\partial E / \partial W_7$ and $\partial E / \partial W_8$ are around 17 and 4 million floats, respectively. Thus, explicitly computing the dot-product between the gradients is impractical. Instead, we propose to take advantage of the unique structure of our gradients (rank-1 matrices, cf. Eq. (9)) by using the trace kernel, defined for two matrices $A$ and $B$ as:

$$K_{tr}(A, B) = \mathrm{Tr}(A^\top B). \tag{11}$$

For rank-1 matrices, the trace can be decomposed as the product of two kernels. If we let $A = a u^\top$ and $B = b v^\top$, with $a, b \in \mathbb{R}^{d}$ and $u, v \in \mathbb{R}^{D}$, then:

$$K_{tr}(A, B) = \mathrm{Tr}(u a^\top b v^\top) = (a^\top b)(u^\top v). \tag{12}$$

Therefore, for two images $A$ and $B$, we can compute the similarity between gradients in a low-dimensional space without explicitly computing the gradients w.r.t. the weights $W_k$:

$$K_{tr}\!\left( \frac{\partial E}{\partial W_k}(A), \frac{\partial E}{\partial W_k}(B) \right) = \left( x_{k-1}(A)^\top x_{k-1}(B) \right) \left( \frac{\partial E}{\partial y_k}(A)^\top \frac{\partial E}{\partial y_k}(B) \right). \tag{13}$$

The left part of this equation indicates that the forward activations of the two inputs should be similar. This is the standard measure of similarity which is used between images when described by the outputs of the intermediate layers of ConvNets. However, this similarity is multiplicatively weighted by the similarity between the back-propagated quantities. This indicates that, to obtain a high value with the proposed kernel, both the forward activations and the back-propagated quantities of the two images need to be similar.
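The decomposition of the trace kernel for rank-1 matrices can be checked numerically. The sketch below, with hypothetical helper names and toy vectors, compares the explicit computation (which forms both matrices) with the factored one (two dot-products).

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def outer(a, u):
    """Rank-1 matrix a u^T, stored as a list of rows."""
    return [[ai * uj for uj in u] for ai in a]


def trace_kernel_explicit(A, B):
    """Tr(A^T B), i.e. the sum of element-wise products of A and B."""
    return sum(dot(row_a, row_b) for row_a, row_b in zip(A, B))


def trace_kernel_rank1(a, u, b, v):
    """Same value for A = a u^T and B = b v^T, without forming A or B."""
    return dot(a, b) * dot(u, v)


if __name__ == "__main__":
    # Toy stand-ins for forward activations (a, b) and
    # back-propagated quantities (u, v) of two images.
    a, u = [1.0, 2.0], [3.0, 4.0, 5.0]
    b, v = [0.5, -1.0], [2.0, 0.0, 1.0]
    print(trace_kernel_explicit(outer(a, u), outer(b, v)))
    print(trace_kernel_rank1(a, u, b, v))
```

The explicit version costs on the order of d × D operations per pair of images, against d + D for the factored version, which is what makes the kernel usable with fully connected layers holding millions of weights.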

Normalization. The $\ell_2$-normalization of the activation features consistently leads to superior results [Chatfield et al.(2014)Chatfield, Simonyan, vedaldi, and Zisserman]. In our experiments we $\ell_2$-normalize our forward and backward features independently. This is consistent with normalizing the gradient matrix using a Frobenius norm, since $\left\| \partial E / \partial W_k \right\|_F = \| x_{k-1} \|_2 \, \| \partial E / \partial y_k \|_2$.
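The link between the Frobenius norm of a rank-1 matrix and the norms of its two factors follows in one line. In the sketch below, $a$ and $u$ are generic stand-ins (assumed notation) for the forward activations and the back-propagated quantities:

```latex
\| a u^\top \|_F^2
  = \mathrm{Tr}\!\left( (a u^\top)^\top (a u^\top) \right)
  = \mathrm{Tr}\!\left( u \, a^\top a \, u^\top \right)
  = (a^\top a)(u^\top u)
  = \| a \|_2^2 \, \| u \|_2^2 .
```

Normalizing both factors independently therefore gives a gradient matrix of unit Frobenius norm, and the resulting kernel is the product of two cosine similarities.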

5 Experimental results

5.1 Datasets and evaluation protocols

We evaluate our approach to transfer features from pretrained models on two standard classification benchmarks, Pascal VOC 2007 and Pascal VOC 2012 [Everingham et al.(2010)Everingham, Van Gool, Williams, Winn, and Zisserman]. These datasets contain 9,963 and 22,531 annotated images, respectively. Each image is annotated with one or more labels corresponding to 20 object categories. The datasets include partitions for training, validation, and testing, and the accuracy is measured in terms of per-class mean average precision (mAP). The test annotations of VOC 2012 are not public, but an evaluation server with a limited number of submissions per week is available. Therefore, we use the validation set for the first part of our analysis on the VOC 2012 dataset, and evaluate on the test set only for the final experiments. We conduct all VOC 2007 experiments on the full dataset.
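For readers unfamiliar with the metric, the following sketch computes a simple non-interpolated average precision per class and averages it over classes. It is an assumption-laden illustration: the official VOC evaluation uses interpolated variants of AP, so values from this sketch may differ slightly from numbers produced by the official tools, and the function names and toy data are hypothetical.

```python
def average_precision(scores, labels):
    """Non-interpolated AP: mean of precision@k over the ranks k
    at which a positive example is retrieved."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    hits, precisions = 0, []
    for rank, i in enumerate(order, start=1):
        if labels[i]:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0


def mean_average_precision(scores_per_class, labels_per_class):
    """Average the per-class AP values over all classes."""
    aps = [average_precision(s, l)
           for s, l in zip(scores_per_class, labels_per_class)]
    return sum(aps) / len(aps)


if __name__ == "__main__":
    # Two hypothetical classes, four test images each.
    scores = [[0.9, 0.8, 0.7, 0.6], [0.1, 0.9, 0.3, 0.2]]
    labels = [[1, 0, 1, 0], [0, 1, 0, 0]]
    print(mean_average_precision(scores, labels))
```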

5.2 Implementation details

We tested our approach on two different deep ConvNets: AlexNet [Krizhevsky et al.(2012)Krizhevsky, Sutskever, and Hinton] and VGG16 [Simonyan and Zisserman(2015)]. VGG16 is a much deeper architecture than AlexNet, with many more convolutional layers, leading to superior performance, but also to much slower training and feature extraction. We used the pre-trained networks that are publicly available. Both networks were pre-trained on the ILSVRC2012 subset of ImageNet, which is disjoint from the Pascal VOC datasets, and therefore suitable for our evaluation of feature transfer.

To extract descriptors from the Pascal images, we first resize the images so that the shortest side has 227 pixels (224 in the VGG16 case), and then take the central square crop, without distorting the aspect ratio. We found this cropping technique to work well in practice. For simplicity, we do no data augmentation. The feature extraction is performed on a customized version of the Caffe library, modified to expose the back-propagation features. This allows us to extract forward and backward features of the training and testing images. At testing time we use a tempered version of SoftMax, $\sigma_\tau(y_L)_i = \exp(y_{L,i} / \tau) / \sum_j \exp(y_{L,j} / \tau)$, with a temperature $\tau > 1$, to produce softer probability distributions for back-propagation. As discussed in section 4.1, we use non-informative uniform labels for the backward pass to extract the gradient features. All forward and backward features are then $\ell_2$-normalized.
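A tempered SoftMax divides the class scores by a temperature before normalizing. The sketch below is illustrative only: the function name is hypothetical, and the default `tau=2.0` is a placeholder assumption rather than the value used in the experiments.

```python
import math


def tempered_softmax(scores, tau=2.0):
    """SoftMax with temperature tau; tau > 1 flattens the distribution,
    tau = 1 gives the standard SoftMax. The default is a placeholder."""
    if tau <= 0:
        raise ValueError("tau must be positive")
    scaled = [s / tau for s in scores]
    m = max(scaled)                      # subtract the max for stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


if __name__ == "__main__":
    scores = [4.0, 1.0, 0.0]             # hypothetical last-layer scores
    print(tempered_softmax(scores, tau=1.0))
    print(tempered_softmax(scores, tau=2.0))
```

Softer probabilities keep the back-propagated signal from vanishing when the network is very confident, since the base case of the backward pass is the probability vector shifted by the uniform label.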

To perform classification, we use the SVM implementation of scikit-learn [Pedregosa et al.(2011)]. The cost parameter of the solver was set to the default value of 1, which worked well in practice.

| Features | VOC2007 (A) | VOC2007 (V) | VOC2012 (A) | VOC2012 (V) |
|---|---|---|---|---|
| $x_5$ (Pool5) | 71.0 | 86.7 | 66.1 | 81.4 |
| $x_6$ (FC6) | 77.1 | 89.3 | 72.6 | 84.4 |
| $x_7$ (FC7) | 79.4 | 89.4 | 74.9 | 84.6 |
| $y_8$ (FC8) | 79.1 | 88.3 | 74.3 | 84.1 |
| $x_8$ (Prob) | 76.2 | 86.0 | 71.9 | 81.3 |
| $[x_5; x_6]$ | 76.4 | 89.2 | 71.6 | 84.0 |
| $\partial E / \partial W_6$ | 80.2 | 89.3 | 75.1 | 84.6 |
| $[x_6; x_7]$ | 79.1 | 89.5 | 74.3 | 84.6 |
| $\partial E / \partial W_7$ | 80.9 | 90.0 | 76.3 | 85.2 |
| $[x_7; y_8]$ | 79.7 | 89.2 | 75.3 | 84.6 |
| $\partial E / \partial W_8$ | 79.7 | 88.2 | 75.0 | 83.4 |

| Method | VOC’07 | VOC’12 |
|---|---|---|
| Proposed - AlexNet [Krizhevsky et al.(2012)Krizhevsky, Sutskever, and Hinton] | 80.9 | 76.5 |
| Proposed - VGG16 [Simonyan and Zisserman(2015)] | 90.0 | 85.3 |
| DeCAF [Donahue et al.(2014)Donahue, Jia, Vinyals, Hoffman, Zhang, Tzeng, and Darrell] from [Chatfield et al.(2014)Chatfield, Simonyan, vedaldi, and Zisserman] | 73.4 | - |
| Razavian et al. [Razavian et al.(2014)Razavian, Azizpour, Sullivan, and Carlsson] | 77.2 | - |
| Oquab et al. [Oquab et al.(2014)Oquab, Bottou, Laptev, and Sivic] | 77.7 | 78.7 |
| Zeiler et al. [Zeiler and Fergus(2014)] | - | 79.0 |
| Chatfield et al. [Chatfield et al.(2014)Chatfield, Simonyan, vedaldi, and Zisserman] | 82.4 | 83.2 |
| He et al. [He et al.(2014)He, Zhang, Ren, and Sun] | 80.1 | - |
| Wei et al. [Wei et al.(2014)Wei, Xia, Huang, Ni, Dong, Zhao, and Yan] | 81.5 | 81.7 |
| Simonyan et al. [Simonyan and Zisserman(2015)] | 89.7 | 89.3 |

Table 1: Top: Results on Pascal VOC 2007 and VOC 2012 with AlexNet (A) and VGG16 (V). Results on VOC 2012 are on the validation set. Bottom: Comparison with other ConvNet results (mAP in %).

5.3 Results and discussion

Table 1 summarizes the results, and compares our approach with the state of the art and different baselines. We extract and compare several features for each dataset and network architecture: (i) individual forward activation features, from Pool5 up to the probability layer; (ii) concatenation of forward activation features, e.g. Pool5+FC6, FC6+FC7, FC7+FC8; (iii) our proposed gradient features: $\partial E / \partial W_6$, $\partial E / \partial W_7$, and $\partial E / \partial W_8$. The similarity between $\ell_2$-normalized forward activation features is measured with the dot-product, while the similarity between gradient representations is measured using the trace kernel. We highlight the following points.

Forward activations. In all cases, FC7 is the best performing individual layer on both VOC2007 and VOC2012, independently of the network. This is consistent with previous findings. Also consistent is the fact that the probability layer performs badly in this case. More surprisingly, concatenating forward layers does not seem to bring any noticeable accuracy improvements in any setup.

Gradient representations. We compare the gradient representations with the concatenation of forward activations, since they are very related and share part of the features. On the deeper layers ($W_6$ and $W_7$) the gradient representations outperform the individual features as well as the concatenation both for AlexNet and VGG16 on both datasets. For AlexNet, the improvements over the corresponding concatenations are quite significant: +3.8% and +3.5% absolute improvement for the gradients with respect to $W_6$ on VOC2007 and VOC2012, and +1.8% and +2.0% for $W_7$. The improvements for VGG16 are more modest but still noticeable: +0.1% and +0.6% for the gradients with respect to $W_6$ and +0.5% and +0.6% for the gradients with respect to $W_7$. Larger relative improvements on less discriminative networks such as AlexNet seem to suggest that the more complex gradient representation can, to some extent, compensate for the lack of discriminative power of the network, but that one obtains diminishing returns as the power of the network increases. Once one reaches the top of the network ($W_8$), the gradient representations perform worse and these improvements diminish or disappear completely. This is expected, as the derivative with respect to $W_8$ depends heavily on the output of the probability layer, which is known to saturate. However, for the derivatives with respect to $W_6$ and $W_7$, more information is involved, leading to superior results.

Comparison with other works. Our best results are compared with the state-of-the-art on PASCAL VOC2007 and VOC2012 in Table 1. We can see that we obtain competitive performance on both datasets. We note however that our results with VGG16 are somewhat inferior to those reported in [Simonyan and Zisserman(2015)] with a similar model. We believe this might be explained by the more costly feature extraction strategy employed by Simonyan and Zisserman which involves aggregating image descriptors at multiple scales.

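| Network | Features | mean | aero | bike | bird | boat | bottle | bus | car | cat | chair | cow | table | dog | horse | mbike | person | plant | sheep | sofa | train | tv |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| AlexNet | $x_7$ (FC7) | 79.4 | 95.4 | 88.6 | 92.6 | 87.3 | 42.1 | 80.1 | 90.5 | 89.6 | 59.9 | 68.2 | 74.1 | 85.3 | 89.8 | 85.6 | 95.3 | 58.1 | 78.9 | 57.9 | 94.7 | 74.4 |
| AlexNet | $\partial E / \partial W_7$ | 80.9 | 96.6 | 89.2 | 93.8 | 89.5 | 44.9 | 81.0 | 91.9 | 89.9 | 61.2 | 70.4 | 78.5 | 86.2 | 91.4 | 87.4 | 95.7 | 60.5 | 78.8 | 62.5 | 95.2 | 73.5 |
| VGG16 | $x_7$ (FC7) | 89.3 | 99.2 | 95.9 | 99.1 | 96.9 | 63.8 | 92.8 | 95.1 | 98.1 | 70.4 | 87.8 | 84.3 | 97.0 | 97.2 | 93.5 | 97.3 | 68.6 | 92.2 | 73.3 | 98.7 | 85.5 |
| VGG16 | $\partial E / \partial W_7$ | 90.0 | 99.6 | 97.2 | 98.8 | 97.0 | 63.3 | 93.8 | 95.6 | 98.4 | 71.1 | 89.4 | 85.3 | 97.7 | 97.7 | 95.6 | 97.5 | 70.3 | 92.7 | 76.2 | 98.8 | 84.2 |

Table 2: Results on Pascal VOC2007 with AlexNet and VGG16. Comparison between the standard forward activation features and the proposed gradient features.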

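| Network (split) | Features | mean | aero | bike | bird | boat | bottle | bus | car | cat | chair | cow | table | dog | horse | mbike | person | plant | sheep | sofa | train | tv |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| AlexNet (validation) | $x_7$ (FC7) | 74.9 | 92.9 | 75.4 | 88.7 | 81.7 | 48.0 | 89.0 | 70.3 | 88.0 | 62.3 | 63.6 | 57.8 | 83.5 | 78.0 | 82.9 | 92.9 | 49.1 | 74.8 | 50.5 | 90.2 | 78.7 |
| AlexNet (validation) | $\partial E / \partial W_7$ | 76.3 | 94.3 | 77.4 | 89.5 | 82.2 | 50.8 | 90.2 | 72.4 | 89.3 | 64.8 | 63.9 | 60.3 | 84.0 | 79.6 | 84.0 | 93.2 | 50.6 | 76.7 | 52.6 | 91.8 | 79.2 |
| AlexNet (test) | $x_7$ (FC7) | 75.0 | 93.8 | 75.0 | 86.4 | 82.2 | 48.2 | 82.5 | 73.8 | 87.6 | 63.8 | 63.5 | 69.3 | 85.7 | 80.3 | 84.1 | 92.3 | 47.4 | 72.2 | 51.8 | 88.1 | 72.5 |
| AlexNet (test) | $\partial E / \partial W_7$ | 76.5 | 95.0 | 76.6 | 87.7 | 82.9 | 52.5 | 83.4 | 75.6 | 88.6 | 65.3 | 65.4 | 69.8 | 86.5 | 82.1 | 85.1 | 93.0 | 48.2 | 74.5 | 57.0 | 88.4 | 73.0 |
| VGG16 (validation) | $x_7$ (FC7) | 84.6 | 98.2 | 88.3 | 94.6 | 90.5 | 66.0 | 93.6 | 80.5 | 96.4 | 73.9 | 81.3 | 70.2 | 93.0 | 91.3 | 91.3 | 95.1 | 56.3 | 87.7 | 64.2 | 95.8 | 84.5 |
| VGG16 (validation) | $\partial E / \partial W_7$ | 85.2 | 98.6 | 89.4 | 94.7 | 91.5 | 67.2 | 94.0 | 80.9 | 96.8 | 73.7 | 83.7 | 71.9 | 93.4 | 91.6 | 91.5 | 95.4 | 56.0 | 88.3 | 65.2 | 95.5 | 85.2 |
| VGG16 (test) | $x_7$ (FC7) | 85.0 | 97.8 | 85.2 | 92.3 | 91.1 | 64.5 | 89.7 | 82.2 | 95.4 | 74.1 | 84.7 | 81.1 | 94.1 | 93.5 | 91.9 | 95.0 | 57.9 | 86.0 | 67.8 | 95.2 | 81.5 |
| VGG16 (test) | $\partial E / \partial W_7$ | 85.3 | 98.0 | 86.0 | 91.7 | 91.3 | 65.7 | 89.6 | 82.4 | 95.5 | 74.5 | 84.2 | 80.7 | 94.3 | 93.7 | 92.2 | 95.4 | 57.7 | 87.2 | 69.2 | 95.2 | 81.4 |

Table 3: Results on Pascal VOC2012 with AlexNet and VGG16. Comparison between the standard forward activation features and the proposed gradient features.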

Per-class results. We report per-class results for Pascal VOC2007 in Table 2 and for VOC2012 in Table 3. We compare the best forward features (individual FC7) with the best gradient representation (). The results on VOC2007 are on the test set. For VOC2012, we report results on both the validation and the test sets. We observe that, on both networks and datasets, the results are consistently better even when the improvements are not large. For AlexNet, the gradient representation has the best performance on 18 out of the 20 classes on VOC2007, and on all classes for VOC2012. For VGG16, the gradient representation is the best one on 17 out of the 20 classes on both VOC2007 and VOC2012 (validation). The differences between validation and test on VOC2012 are minimal.
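
The per-class counts quoted above can be checked directly against the VOC2007 rows of Table 2. The short sketch below copies the per-class values from that table; it assumes that the first pair of rows belongs to AlexNet and the second to VGG16, and that the leading value of each row is the mean over the 20 classes. The variable and function names are illustrative.

```python
# Per-class values copied from Table 2 (Pascal VOC2007, test set).
# Assumption: rows 1-2 are AlexNet and rows 3-4 are VGG16.
alexnet_fc7 = [95.4, 88.6, 92.6, 87.3, 42.1, 80.1, 90.5, 89.6, 59.9, 68.2,
               74.1, 85.3, 89.8, 85.6, 95.3, 58.1, 78.9, 57.9, 94.7, 74.4]
alexnet_grad = [96.6, 89.2, 93.8, 89.5, 44.9, 81.0, 91.9, 89.9, 61.2, 70.4,
                78.5, 86.2, 91.4, 87.4, 95.7, 60.5, 78.8, 62.5, 95.2, 73.5]
vgg16_fc7 = [99.2, 95.9, 99.1, 96.9, 63.8, 92.8, 95.1, 98.1, 70.4, 87.8,
             84.3, 97.0, 97.2, 93.5, 97.3, 68.6, 92.2, 73.3, 98.7, 85.5]
vgg16_grad = [99.6, 97.2, 98.8, 97.0, 63.3, 93.8, 95.6, 98.4, 71.1, 89.4,
              85.3, 97.7, 97.7, 95.6, 97.5, 70.3, 92.7, 76.2, 98.8, 84.2]


def count_wins(candidate, baseline):
    """Number of classes on which the candidate strictly beats the baseline."""
    return sum(c > b for c, b in zip(candidate, baseline))


def mean(values):
    """Arithmetic mean of a list of per-class results."""
    return sum(values) / len(values)


print("AlexNet: gradient wins on", count_wins(alexnet_grad, alexnet_fc7), "classes")
print("VGG16:   gradient wins on", count_wins(vgg16_grad, vgg16_fc7), "classes")
print("AlexNet means:", round(mean(alexnet_fc7), 1), round(mean(alexnet_grad), 1))
print("VGG16 means:  ", round(mean(vgg16_fc7), 1), round(mean(vgg16_grad), 1))
```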

6 Conclusions

In this paper we show a link between ConvNets as feature extractors and Fisher Vector encodings. We have introduced a gradient-based representation for features extracted with ConvNets, inspired by the Fisher Kernel framework. This representation takes advantage both of the high-quality features learned by ConvNets in an end-to-end supervised manner and of the discriminative power of gradient-based representations. We also presented an approach to compute similarities between gradients efficiently, without explicitly computing the high-dimensional gradient representations. We show that this similarity can be seen as a weighted version of the forward feature similarities that takes into account not only the features themselves, but also information back-propagated from the ConvNet objective. We tested our approach on the Pascal VOC2007 and VOC2012 benchmarks using two popular deep architectures, showing consistent improvements over using only the individual forward activation features or their combination, as is standard practice.
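
The efficient similarity computation rests on a standard identity, sketched here for a single fully connected layer y = Wx. In that setting the gradient of the objective with respect to W is the outer product of the back-propagated signal and the layer input, so the inner product between two such gradient matrices factorizes into a product of two low-dimensional dot products. The names `delta` and `x`, the helper functions, and the toy dimensions are illustrative assumptions, and any normalization applied in our experiments is omitted.

```python
import random


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def outer(u, v):
    """Outer product u v^T, returned as a list of rows."""
    return [[a * b for b in v] for a in u]


def frobenius_inner(A, B):
    """Inner product of two matrices: the sum of element-wise products."""
    return sum(dot(row_a, row_b) for row_a, row_b in zip(A, B))


def gradient_similarity(delta_a, x_a, delta_b, x_b):
    """Similarity of two rank-one gradients without forming either matrix.

    <delta_a x_a^T, delta_b x_b^T> = (delta_a . delta_b) * (x_a . x_b)
    """
    return dot(delta_a, delta_b) * dot(x_a, x_b)


def random_vector(n):
    """Random Gaussian vector standing in for activations or gradients."""
    return [random.gauss(0.0, 1.0) for _ in range(n)]


random.seed(0)
in_dim, out_dim = 6, 4  # toy sizes; real layers have thousands of units

x_a, x_b = random_vector(in_dim), random_vector(in_dim)            # forward inputs
delta_a, delta_b = random_vector(out_dim), random_vector(out_dim)  # back-propagated signals

# Explicit route: build both out_dim x in_dim gradients, then compare them.
explicit = frobenius_inner(outer(delta_a, x_a), outer(delta_b, x_b))
# Implicit route: two short dot products, no gradient matrix is materialized.
implicit = gradient_similarity(delta_a, x_a, delta_b, x_b)

print("routes agree:", abs(explicit - implicit) < 1e-9)
```

The implicit route costs a number of operations proportional to the sum of the two layer dimensions rather than to their product. The factor `dot(x_a, x_b)` is the forward feature similarity, and `dot(delta_a, delta_b)` is the weight contributed by the back-propagated information.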

References

  • [Babenko et al.(2014)Babenko, Slesarev, Chigorin, and Lempitsky] A. Babenko, A. Slesarev, A. Chigorin, and V. Lempitsky. Neural codes for image retrieval. In ECCV, 2014.
  • [Bergamo et al.(2011)Bergamo, Torresani, and Fitzgibbon] A. Bergamo, L. Torresani, and A. Fitzgibbon. PiCoDeS: Learning a compact code for novel-category recognition. In NIPS, 2011.
  • [Bottou(1998)] L. Bottou. Online algorithms and stochastic approximations. In David Saad, editor, Online Learning and Neural Networks. Cambridge University Press, Cambridge, UK, 1998.
  • [Bucilua et al.(2006)Bucilua, Caruana, and Niculescu-Mizil] C. Bucilua, R. Caruana, and A. Niculescu-Mizil. Model compression. In SIGKDD, 2006.
  • [Chatfield et al.(2011)Chatfield, Lempitsky, Vedaldi, and Zisserman] K. Chatfield, V. Lempitsky, A. Vedaldi, and A. Zisserman. The devil is in the details: an evaluation of recent feature encoding methods. In BMVC, 2011.
  • [Chatfield et al.(2014)Chatfield, Simonyan, vedaldi, and Zisserman] K. Chatfield, K. Simonyan, A. Vedaldi, and A. Zisserman. Return of the devil in the details: delving deep into convolutional nets. In BMVC, 2014.
  • [Csurka et al.(2004)Csurka, Dance, Fan, Willamowski, and Bray] G. Csurka, C. Dance, L. Fan, J. Willamowski, and C. Bray. Visual categorization with bags of keypoints. In ECCV SLCV workshop, 2004.
  • [Deng et al.(2009)Deng, Dong, Socher, Li, Li, and Fei-Fei] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.
  • [Donahue et al.(2014)Donahue, Jia, Vinyals, Hoffman, Zhang, Tzeng, and Darrell] J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and T. Darrell. DeCAF: A deep convolutional activation feature for generic visual recognition. In ICML, 2014.
  • [Douze et al.(2011)Douze, Ramisa, and Schmid] M. Douze, A. Ramisa, and C. Schmid. Combining attributes and Fisher vectors for efficient image retrieval. In CVPR, 2011.
  • [Everingham et al.(2010)Everingham, Van Gool, Williams, Winn, and Zisserman] M. Everingham, L. Van Gool, C. Williams, J. Winn, and A. Zisserman. The Pascal visual object classes (VOC) challenge. IJCV, 2010.
  • [Fine et al.(2001)Fine, Navrátil, and Gopinath] S. Fine, J. Navrátil, and R. Gopinath. A hybrid GMM/SVM approach to speaker identification. In ICASSP, 2001.
  • [Girshick et al.(2014)Girshick, Donahue, Darrell, and Malik] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, 2014.
  • [Gong et al.(2014)Gong, Wang, Guo, and Lazebnik] Y. Gong, L. Wang, R. Guo, and S. Lazebnik. Multi-scale orderless pooling of deep convolutional activation features. In ECCV, 2014.
  • [Gordo et al.(2012)Gordo, Rodríguez-Serrano, Perronnin, and Valveny] A. Gordo, J. A. Rodríguez-Serrano, F. Perronnin, and E. Valveny. Leveraging category-level labels for instance-level image retrieval. In CVPR, 2012.
  • [He et al.(2014)He, Zhang, Ren, and Sun] K. He, X. Zhang, S. Ren, and J. Sun. Spatial pyramid pooling in deep convolutional networks for visual recognition. In ECCV, 2014.
  • [Jaakkola and Haussler(1998)] T. Jaakkola and D. Haussler. Exploiting generative models in discriminative classifiers. In NIPS, 1998.
  • [Jégou et al.(2010)Jégou, Douze, Schmid, and Pérez] H. Jégou, M. Douze, C. Schmid, and P. Pérez. Aggregating local descriptors into a compact image representation. In CVPR, 2010.
  • [Jégou et al.(2011)Jégou, Douze, and Schmid] H. Jégou, M. Douze, and C. Schmid. Product quantization for nearest neighbor search. IEEE TPAMI, 2011.
  • [Krizhevsky et al.(2012)Krizhevsky, Sutskever, and Hinton] A. Krizhevsky, I. Sutskever, and G. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.
  • [LeCun et al.(1989)LeCun, Boser, Denker, Henderson, Howard, Hubbard, and Jackel] Y. LeCun, B. Boser, J. Denker, D. Henderson, R. Howard, W. Hubbard, and L. Jackel. Handwritten digit recognition with a back-propagation network. In NIPS, 1989.
  • [Oquab et al.(2014)Oquab, Bottou, Laptev, and Sivic] M. Oquab, L. Bottou, I. Laptev, and J. Sivic. Learning and transferring mid-level image representations using convolutional neural networks. In CVPR, 2014.
  • [Pedregosa et al.(2011)] F. Pedregosa et al. Scikit-learn: Machine learning in Python. JMLR, 2011.
  • [Peng et al.(2014a)Peng, Wang, Qiao, and Peng] X. Peng, L. Wang, Y. Qiao, and Q. Peng. Boosting VLAD with supervised dictionary learning and high-order statistics. In ECCV, 2014a.
  • [Peng et al.(2014b)Peng, Zou, Qiao, and Peng] X. Peng, C. Zou, Y. Qiao, and Q. Peng. Action recognition with stacked Fisher vectors. In ECCV, 2014b.
  • [Perronnin and Dance(2007)] F. Perronnin and C. Dance. Fisher kernels on visual vocabularies for image categorization. In CVPR, 2007.
  • [Perronnin and Larlus(2015)] F. Perronnin and D. Larlus. Fisher vectors meet neural networks: A hybrid classification architecture. In CVPR, 2015.
  • [Perronnin et al.(2010)Perronnin, Sánchez, and Mensink] F. Perronnin, J. Sánchez, and T. Mensink. Improving the Fisher kernel for large-scale image classification. In ECCV, 2010.
  • [Razavian et al.(2014)Razavian, Azizpour, Sullivan, and Carlsson] A. S. Razavian, H. Azizpour, J. Sullivan, and S. Carlsson. CNN features off-the-shelf: An astounding baseline for recognition. In CVPR Deep Vision Workshop, 2014.
  • [Russakovsky et al.(2015)] O. Russakovsky et al. ImageNet large scale visual recognition challenge. IJCV, 2015.
  • [Sánchez et al.(2013)Sánchez, Perronnin, Mensink, and Verbeek] J. Sánchez, F. Perronnin, T. Mensink, and J. Verbeek. Image classification with the Fisher vector: Theory and practice. IJCV, 2013.
  • [Sermanet et al.(2014)Sermanet, Eigen, Zhang, Mathieu, Fergus, and LeCun] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun. OverFeat: Integrated recognition, localization and detection using convolutional networks. In ICLR, 2014.
  • [Simonyan and Zisserman(2015)] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.
  • [Simonyan et al.(2013)Simonyan, Vedaldi, and Zisserman] K. Simonyan, A. Vedaldi, and A. Zisserman. Deep Fisher networks for large-scale image classification. In NIPS, 2013.
  • [Sivic and Zisserman(2003)] J. Sivic and A. Zisserman. Video Google: A text retrieval approach to object matching in videos. In ICCV, 2003.
  • [Smith and Gales(2001)] N. Smith and M. Gales. Speech recognition using SVMs. In NIPS, 2001.
  • [Smith and Gales(2002)] N. Smith and M. Gales. Using SVMs to classify variable length speech patterns. Technical report, Cambridge University, 2002.
  • [Sydorov et al.(2014)Sydorov, Sakurada, and Lampert] V. Sydorov, M. Sakurada, and C. Lampert. Deep Fisher kernels – End to end learning of the Fisher kernel GMM parameters. In CVPR, 2014.
  • [Torresani et al.(2010)Torresani, Szummer, and Fitzgibbon] L. Torresani, M. Szummer, and A. Fitzgibbon. Efficient object category recognition using classemes. In ECCV, 2010.
  • [Wang et al.(2009)Wang, Hoiem, and Forsyth] G. Wang, D. Hoiem, and D. Forsyth. Learning image similarity from Flickr groups using stochastic intersection kernel machines. In ICCV, 2009.
  • [Wei et al.(2014)Wei, Xia, Huang, Ni, Dong, Zhao, and Yan] Y. Wei, W. Xia, J. Huang, B. Ni, J. Dong, Y. Zhao, and S. Yan. CNN: Single-label to multi-label. arXiv, 2014.
  • [Weston et al.(2010)Weston, Bengio, and Usunier] J. Weston, S. Bengio, and N. Usunier. Large scale image annotation: Learning to rank with joint word-image embeddings. In ECML, 2010.
  • [Yosinski et al.(2014)Yosinski, Clune, Bengio, and Lipson] J. Yosinski, J. Clune, Y. Bengio, and H. Lipson. How transferable are features in deep neural networks? In NIPS, 2014.
  • [Zeiler and Fergus(2014)] M. D. Zeiler and R. Fergus. Visualizing and understanding convolutional networks. In ECCV, 2014.