1 Introduction
Image classification involves describing images with predetermined labels. One of the first breakthroughs towards solving this problem was the bag-of-visual-words (BOV) [Sivic and Zisserman(2003), Csurka et al.(2004)Csurka, Dance, Fan, Willamowski, and Bray]. While the BOV simply involves counting the number of occurrences of quantized local features, approaches that encode higher-order statistics, such as the Fisher Vector (FV) [Perronnin and Dance(2007), Perronnin et al.(2010)Perronnin, Sánchez, and Mensink], led to state-of-the-art image classification results [Chatfield et al.(2011)Chatfield, Lempitsky, Vedaldi, and Zisserman, Sánchez et al.(2013)Sánchez, Perronnin, Mensink, and Verbeek]. In particular, such higher-order encodings were used by the leading teams in the 2010 and 2011 editions of the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) [Deng et al.(2009)Deng, Dong, Socher, Li, Li, and Fei-Fei, Russakovsky et al.(2015)]. FV-based approaches were however outperformed in 2012 by the work of Krizhevsky et al. [Krizhevsky et al.(2012)Krizhevsky, Sutskever, and Hinton] based on Convolutional Networks (ConvNets) [LeCun et al.(1989)LeCun, Boser, Denker, Henderson, Howard, Hubbard, and Jackel] trained in a supervised fashion on large amounts of labeled data. These models are feedforward architectures involving multiple computational layers that alternate linear operations, e.g. convolutions, and nonlinear operations, e.g. rectified linear units (ReLU). The end-to-end training of the large number of parameters inside ConvNets, from pixels to the specific end-task, is a key to their success. Since then, ConvNets, including improved architectures [Zeiler and Fergus(2014), Sermanet et al.(2014)Sermanet, Eigen, Zhang, Mathieu, Fergus, and LeCun, Simonyan and Zisserman(2015)], have consistently outperformed all other alternatives in subsequent editions of ILSVRC. ConvNets also have remarkable transferability properties when used as “universal” feature extractors [Yosinski et al.(2014)Yosinski, Clune, Bengio, and Lipson]: if one feeds an image to a ConvNet, the output of an intermediate layer can be used as a representation of this image and typically fed to linear classifiers. To the best of our knowledge, this heuristic is not based on strong theoretical grounds, but it has been experimentally shown to work well in practice
[Donahue et al.(2014)Donahue, Jia, Vinyals, Hoffman, Zhang, Tzeng, and Darrell, Oquab et al.(2014)Oquab, Bottou, Laptev, and Sivic, Zeiler and Fergus(2014), Chatfield et al.(2014)Chatfield, Simonyan, Vedaldi, and Zisserman, Razavian et al.(2014)Razavian, Azizpour, Sullivan, and Carlsson].

Although ConvNets and FV approaches differ significantly, several works have tried to combine their benefits [Simonyan et al.(2013)Simonyan, Vedaldi, and Zisserman, Sydorov et al.(2014)Sydorov, Sakurada, and Lampert, Gong et al.(2014)Gong, Wang, Guo, and Lazebnik, Perronnin and Larlus(2015)]. Our work also attempts to get the best of both the FV and ConvNet worlds. Our primary contribution is a novel approach to extract a transferable representation of an image given a pretrained ConvNet. We draw inspiration from the FV, which is based on the theoretically well-founded Fisher Kernel (FK) proposed by Jaakkola and Haussler [Jaakkola and Haussler(1998)]. The FK involves deriving a kernel from an underlying generative model of the data by taking the gradient of the log-likelihood with respect to the model parameters. In a similar manner, given an unlabeled image, we propose to compute the gradient of a cross-entropy criterion measured between the predicted class probabilities and a uniform probability output. This gradient with respect to the parameters of the fully connected layers yields very high-dimensional representations (cf. Figure 1). Our second contribution consists in leveraging the special structure of this gradient representation to design an efficient kernel. We show that our representation actually corresponds to a rank-1 matrix, for which the trace kernel can be efficiently computed. Furthermore, this kernel decomposes in our case into the product of two simpler kernels: the standard one on forward-pass features, and a second one on quantities efficiently computed by backpropagation.

The remainder of this article is organized as follows. In section 2, we review related work. In section 3, we provide more background on the FK and ConvNets. In section 4, we introduce our novel hybrid ConvNet-gradient representation as well as the associated efficient kernel. Finally, we provide experimental results on the PASCAL VOC 2007 and 2012 benchmarks in section 5, showing that our representation consistently transfers better than the standard forward-pass features.
2 Related Work
Hybrid techniques. Several works have proposed to combine the benefits of deep learning with “shallow” bag-of-patches representations based on higher-order statistics, such as the FV [Perronnin and Dance(2007), Perronnin et al.(2010)Perronnin, Sánchez, and Mensink] or the VLAD [Jégou et al.(2010)Jégou, Douze, Schmid, and Pérez]. Simonyan et al. [Simonyan et al.(2013)Simonyan, Vedaldi, and Zisserman] propose to stack multiple FV layers, each defined as a set of five operations: i) FV encoding, ii) supervised dimensionality reduction, iii) spatial stacking, iv) normalization, and v) PCA dimensionality reduction. They show that, when combined with the original FV, such networks lead to significant performance improvements on ImageNet. Peng et al. [Peng et al.(2014b)Peng, Zou, Qiao, and Peng] propose a similar idea for action recognition. Alternatively, Sydorov et al. [Sydorov et al.(2014)Sydorov, Sakurada, and Lampert] improve on the FV framework by jointly learning the SVM classifier and the GMM visual vocabulary. Conceptually, this is similar to backpropagation as used to learn neural network parameters: the gradients corresponding to the SVM layer are backpropagated to compute the gradients with respect to the GMM parameters. Peng et al. [Peng et al.(2014a)Peng, Wang, Qiao, and Peng] propose a similar idea for the VLAD [Jégou et al.(2010)Jégou, Douze, Schmid, and Pérez] descriptor. Finally, Gong et al. [Gong et al.(2014)Gong, Wang, Guo, and Lazebnik] address the lack of geometric invariance in ConvNets with a hybrid approach. They extract mid-level ConvNet features from large patches, embed them using the VLAD encoding, and aggregate them at multiple scales. This leads to competitive results on a number of classification tasks. While our goal – getting the best of the FV and deep frameworks – is shared with these previous works, we differ significantly, as we are the first to propose deriving gradient features from deep nets.

Deriving representations from pretrained classifiers. Classemes [Wang et al.(2009)Wang, Hoiem, and Forsyth, Torresani et al.(2010)Torresani, Szummer, and Fitzgibbon] are a common image representation derived from a set of classifiers by simply stacking the classifier scores. Dimensionality reduction is generally applied to classeme features [Douze et al.(2011)Douze, Ramisa, and Schmid], but learning the classification and the dimensionality reduction separately is suboptimal [Gordo et al.(2012)Gordo, Rodríguez-Serrano, Perronnin, and Valveny]. Several works [Bergamo et al.(2011)Bergamo, Torresani, and Fitzgibbon, Weston et al.(2010)Weston, Bengio, and Usunier, Gordo et al.(2012)Gordo, Rodríguez-Serrano, Perronnin, and Valveny] learn an optimal embedding of images in a low-dimensional space via classifiers with an intermediate hidden layer. The first layer can be understood as a supervised dimensionality reduction step, while the second one can be interpreted as a set of classifiers in the intermediate space. A new image is represented as the output of this intermediate layer, discarding the classifiers.
A natural extension is to learn deeper architectures, i.e. architectures with more than one hidden layer, and to use the outputs of these intermediate layers as features for new tasks. Krizhevsky et al. [Krizhevsky et al.(2012)Krizhevsky, Sutskever, and Hinton] proposed to learn end-to-end a deep classifier based on the ConvNet architecture of LeCun et al. [LeCun et al.(1989)LeCun, Boser, Denker, Henderson, Howard, Hubbard, and Jackel]. They showed qualitatively that the output of the penultimate layer could be used for image retrieval. This finding was quantitatively validated for a number of tasks, including image classification
[Donahue et al.(2014)Donahue, Jia, Vinyals, Hoffman, Zhang, Tzeng, and Darrell, Oquab et al.(2014)Oquab, Bottou, Laptev, and Sivic, Zeiler and Fergus(2014), Chatfield et al.(2014)Chatfield, Simonyan, Vedaldi, and Zisserman, Razavian et al.(2014)Razavian, Azizpour, Sullivan, and Carlsson], image retrieval [Razavian et al.(2014)Razavian, Azizpour, Sullivan, and Carlsson, Babenko et al.(2014)Babenko, Slesarev, Chigorin, and Lempitsky], object detection [Girshick et al.(2014)Girshick, Donahue, Darrell, and Malik], and action recognition [Oquab et al.(2014)Oquab, Bottou, Laptev, and Sivic]. The choice of the layer(s) whose output should be used for representation purposes depends on the problem at hand. As observed by Yosinski et al. [Yosinski et al.(2014)Yosinski, Clune, Bengio, and Lipson], this choice should be driven by the distance between the base task (the one used to learn the classifier) and the target task. In this paper, we show that the heuristic of using the output of an intermediate-level fully connected layer as an image representation can be related to the application of the Fisher Kernel idea to ConvNets.

3 Background on the Fisher Kernel and ConvNets
3.1 Fisher Kernel
The Fisher Kernel (FK) is a generic principle introduced to combine the benefits of generative and discriminative models for pattern recognition. Let $x$ be a sample, and let $p(x|\theta)$ be a probability density function that models the generative process of $x$, where $\theta$ denotes the vector of parameters of $p$. In statistics, the score function is given by the gradient of the log-likelihood of the data on the model:

$$G_\theta^x = \nabla_\theta \log p(x|\theta). \quad (1)$$

This gradient describes the contribution of the individual parameters to the generative process. Jaakkola and Haussler [Jaakkola and Haussler(1998)] proposed to measure the similarity between two samples $x$ and $z$ using the Fisher Kernel (FK), which is defined as:

$$K(x, z) = (G_\theta^x)^\top F_\theta^{-1}\, G_\theta^z, \quad (2)$$

where $F_\theta$ is the Fisher Information Matrix, usually approximated by the identity matrix [Jaakkola and Haussler(1998)]. One of the benefits of the FK framework is that it comes with guarantees. The FK is indeed asymptotically at least as good as the MAP decision rule, when assuming that the classification label is included in the generative model as a latent variable (Theorem 1 in [Jaakkola and Haussler(1998)]). Some extensions make the dependence of the kernel on the classification labels explicit. These include the likelihood kernel [Fine et al.(2001)Fine, Navrátil, and Gopinath], which involves one generative model per class and consists in computing one FK for each generative model (and consequently for each class). They also include the likelihood ratio kernel [Smith and Gales(2001)], which is tailored to the two-class problem and involves computing the gradient of the log-likelihood of the ratio between the two class likelihoods. Given two classes denoted $c_1$ and $c_2$, with class-conditional probability density functions $p_1(x|\theta_1)$ and $p_2(x|\theta_2)$ and with collective parameters $\theta = \{\theta_1, \theta_2\}$, this yields:

$$\psi(x) = \nabla_\theta \log \frac{p_1(x|\theta_1)}{p_2(x|\theta_2)}. \quad (3)$$
The likelihood ratio kernel is supported by strong experimental evidence [Smith and Gales(2001)] and theory [Smith and Gales(2002)]. In section 4, we extend it to derive a gradient representation from a ConvNet model.
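To make the score function and the FK concrete, the following sketch computes both for a univariate Gaussian model. This is our own toy example, not taken from the cited papers, and it uses the common identity approximation of the Fisher Information Matrix:

```python
import numpy as np

def gaussian_score(x, mu, sigma):
    """Score function: gradient of log N(x; mu, sigma^2) w.r.t. (mu, sigma)."""
    d_mu = (x - mu) / sigma**2
    d_sigma = ((x - mu)**2 - sigma**2) / sigma**3
    return np.array([d_mu, d_sigma])

def fisher_kernel(x, z, mu, sigma):
    """FK between two samples, with the Fisher Information Matrix
    approximated by the identity matrix."""
    return gaussian_score(x, mu, sigma) @ gaussian_score(z, mu, sigma)

k = fisher_kernel(1.0, 2.0, mu=0.0, sigma=1.0)  # scores are [1, 0] and [2, 3]
```

Samples whose scores point in similar parameter directions obtain a high kernel value, which is the intuition behind using gradients as representations.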
3.2 Convolutional Networks
Convolutional Networks (ConvNets) [LeCun et al.(1989)LeCun, Boser, Denker, Henderson, Howard, Hubbard, and Jackel] have been the de facto state-of-the-art models for image recognition since the work of Krizhevsky et al. [Krizhevsky et al.(2012)Krizhevsky, Sutskever, and Hinton]. This class of deep learning models relies on a feedforward architecture, typically composed of a stack of convolutional layers followed by a stack of fully connected layers (see Figure 1 for the standard AlexNet [Krizhevsky et al.(2012)Krizhevsky, Sutskever, and Hinton] architecture). A convolutional layer is parametrized by a 4D tensor representing a stack of 3D filters. During the forward pass, these filters are run in a sliding-window fashion across the output of the previous layer (or the image itself for the first layer) in order to produce a 3D tensor: the stack of per-filter activation maps. These activation maps then pass through a nonlinearity (typically a Rectified Linear Unit, or ReLU [Krizhevsky et al.(2012)Krizhevsky, Sutskever, and Hinton]) and an optional pooling stage before being fed to the next layer. Both the standard AlexNet [Krizhevsky et al.(2012)Krizhevsky, Sutskever, and Hinton] and recent improved architectures like VGGNet [Simonyan and Zisserman(2015)] use a stack of fully connected layers to transform the output activation map of the convolutional layers into class-membership probabilities. A fully connected layer consists in a simple matrix-vector multiplication followed by a nonlinearity, typically a ReLU for intermediate layers and a SoftMax for the last one.
Let $x_{l-1}$ denote the output of layer $l-1$, which is also the input of layer $l$ (for AlexNet, $x_5$ is the flattened activation map of the fifth convolutional layer). Layer $l$ is parametrized by a 4D tensor if it is a convolutional layer, and by a matrix $W_l$ for a fully connected layer. A fully connected layer performs the operation $x_l = s(W_l x_{l-1})$, where $s$ is the nonlinearity. We note $z_l = W_l x_{l-1}$ the output of layer $l$ before the nonlinearity, and $\theta$ the parameters of all layers of the network. Training such deep models consists in the end-to-end learning of this vast number of parameters via the minimization of an error (or loss) function on a large training set of $N$ image and ground-truth label pairs $\{(x^{(n)}, y^{(n)})\}_{n=1}^{N}$. The typical loss function used for classification is the cross-entropy:

$$E(x, y) = -\sum_{k=1}^{K} y_k \log \hat{y}_k, \quad (4)$$

where $K$ is the number of labels (categories), $y$ is the label vector of image $x$, and $\hat{y}_k$ is the predicted probability of class $k$ for image $x$ resulting from the forward pass.
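A minimal NumPy sketch of this loss for a single image (the function names are ours; the predicted probabilities are obtained with a SoftMax over the last linear output):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())      # shift by the max for numerical stability
    return e / e.sum()

def cross_entropy(z, y):
    """E(x, y) = -sum_k y_k * log(yhat_k), with yhat = softmax(z)."""
    return -np.sum(y * np.log(softmax(z)))

loss = cross_entropy(np.array([0.0, 0.0]), np.array([1.0, 0.0]))  # log(2)
```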
The optimal network parameters $\theta^*$ are the ones minimizing the loss over the training set:

$$\theta^* = \operatorname*{arg\,min}_{\theta} \sum_{n=1}^{N} E\!\left(x^{(n)}, y^{(n)}\right). \quad (5)$$
This optimization problem is typically solved using Stochastic Gradient Descent (SGD) [Bottou(1998)], a stochastic approximation of batch gradient descent that performs approximate gradient steps equal on average to the true gradient. Each approximate gradient step is typically computed on a small batch of labeled examples in order to efficiently leverage the caching and vectorization mechanisms of modern hardware.

A particularity of deep networks is that the gradients with respect to all parameters can be computed efficiently in a stagewise fashion via a sequential application of the chain rule (“backpropagation” [LeCun et al.(1989)LeCun, Boser, Denker, Henderson, Howard, Hubbard, and Jackel]). In particular, when using ConvNets as feature extractors, the first phase consists in pretraining the network (i.e. obtaining $\theta^*$) via SGD with backpropagation on a large labeled dataset like ImageNet [Russakovsky et al.(2015)]. Then, ConvNet features can be extracted for different tasks using forward passes on the pretrained network. In the following, we describe how we can also use backpropagation at test time to transfer richer Fisher Vector-like representations based on the gradient of the loss with respect to the ConvNet parameters.

4 Gradient Features from Deep Nets
We now motivate the use of gradient features from deep nets by relating the likelihood ratio kernel in equation (3) to the ConvNet objective function in equation (4). We then make the gradient equations explicit and relate our gradient features to the standard heuristic features derived from the outputs of intermediate layers. Finally, we explain how to efficiently compute the similarity between these high-dimensional representations.
4.1 Relating the likelihood ratio kernel and deep nets
The FK [Jaakkola and Haussler(1998)] and its extensions [Fine et al.(2001)Fine, Navrátil, and Gopinath, Smith and Gales(2001)] were proposed as generic frameworks to derive representations and kernels from generative models. As the standard ConvNet classification architecture does not define a generative model, such frameworks cannot be applied as-is. However, we can draw inspiration from the likelihood ratio kernel for that purpose. We start from equation (3) and note that it can be rewritten as the gradient of the log-likelihood of the ratio between posterior probabilities (assuming equal class priors), i.e.:

$$\psi(x) = \nabla_\theta \log \frac{P(c_1|x)}{P(c_2|x)}. \quad (6)$$

In the two-class problem of [Smith and Gales(2001)], we have the two posteriors $P(c_1|x)$ and $P(c_2|x)$, and equation (6) gives:

$$\psi(x) = \phi_1(x) - \phi_2(x), \quad (7)$$

where $\phi_k(x) = \nabla_\theta \log P(c_k|x)$ is the gradient of the log-posterior for class $k$. We underline that the previous formula is general, in the sense that it can be applied beyond generative models. To extend this representation beyond the two-class case, one may compute an embedding $\phi_k(x)$ for each class using the gradient of the corresponding log-posterior probability.
We can now relate the ConvNet objective in equation (4), for an image $x$ and label vector $y$, to these gradient-of-log-posterior embeddings:

$$-\nabla_\theta E(x, y) = \sum_{k=1}^{K} y_k \nabla_\theta \log \hat{y}_k = \sum_{k=1}^{K} y_k\, \phi_k(x). \quad (8)$$

Consequently, the gradient of the ConvNet objective can be interpreted as a sum of gradient embeddings $\phi_k(x)$, weighted by the labels $y_k$.
To use this gradient as an image representation, as is the case for the FK, there are two main challenges to be addressed. First, we do not have access to the value of the label $y$, which we need to compute the representation of a test image according to equation (8). The simplest solution consists in using a constant uniform label vector $\bar{y} = (1/K, \ldots, 1/K)$. Although $\bar{y}$ is non-informative, we experimentally validate the interest of this simple strategy. The second issue concerning the use of this gradient as an image representation lies in the associated computational cost. Although scalable in the number of classes, this representation is very high-dimensional. The number of parameters in current deep architectures is indeed too large to be able to use the full gradient in practice. Therefore, we propose to use only the partial derivatives with respect to the parameters of some fixed layers, in the same spirit as what is currently done with layer-activation features. These partial derivatives can be computed and compared efficiently using the chain rule and a rank-1 decomposition, as shown in the following sections. Note also that this approach can be further combined with other existing techniques, including ones specialized for deep nets (e.g. model compression [Bucilua et al.(2006)Bucilua, Caruana, and Niculescu-Mizil]) or for the FV (e.g. product quantization [Jégou et al.(2011)Jégou, Douze, and Schmid]).
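The interpretation in equation (8) can be checked numerically. The sketch below is our own illustration: it restricts the parameters to the top-layer pre-SoftMax outputs (for which all gradients have closed forms) and uses a uniform label vector:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

K = 4
z = np.array([0.5, -1.0, 2.0, 0.0])   # arbitrary logits
y_bar = np.full(K, 1.0 / K)           # non-informative uniform label vector
p = softmax(z)

# Gradient of the cross-entropy w.r.t. the logits: dE/dz = p - y.
grad_E = p - y_bar
# Per-class embeddings: the gradient of log p_k w.r.t. z is (e_k - p),
# so row k of the matrix below holds the embedding of class k.
phi = np.eye(K) - p
# Equation (8): -grad E equals the label-weighted sum of the embeddings.
assert np.allclose(-grad_E, y_bar @ phi)
```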
4.2 Gradient derivation
One remarkable property of ConvNets and other feedforward architectures is that they are differentiable through all their layers. In the case of ConvNets, it is easy to show that the gradients of the loss with respect to the weights of the fully-connected layers are:

$$\frac{\partial E}{\partial W_l} = \frac{\partial E}{\partial z_l}\, x_{l-1}^\top. \quad (9)$$

To compute the partial derivatives of the loss with respect to the pre-nonlinearity outputs $z_l$ needed in equation (9), one can apply the chain rule. In the case of fully-connected layers and ReLU nonlinearities, this leads to the following recursive definition and base case:

$$\frac{\partial E}{\partial z_l} = \left(W_{l+1}^\top \frac{\partial E}{\partial z_{l+1}}\right) \odot \mathbb{1}[z_l > 0], \qquad \frac{\partial E}{\partial z_L} = \sigma(z_L) - y, \quad (10)$$

where $\mathbb{1}[z_l > 0]$ is an indicator vector, set to one at the positions where $z_l > 0$ and to zero otherwise, $\odot$ is the Hadamard or element-wise product, $y$ is a supplied vector of labels with which to compute the loss, and $\sigma$ is the SoftMax function. From the previous section, we use $y = \bar{y}$, i.e. we assume that all classes have equal probabilities. It is worth noticing that $\partial E / \partial z_L$ is simply a shifted version of the output probabilities, while the derivatives w.r.t. $z_l$ with $l < L$ are linear transformations of these shifted probabilities, as the Hadamard product can be rewritten as a matrix multiplication.
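The recursion above can be sanity-checked against finite differences. The sketch below is our own implementation for a small stack of fully-connected layers: it runs a forward pass with ReLU activations and a SoftMax on top, then backpropagates and forms the rank-1 weight gradients of equation (9):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def fc_gradients(x0, weights, y):
    """Gradients dE/dW_l of the cross-entropy for a ReLU fully-connected
    stack with a SoftMax on top; each gradient is the rank-1 outer product
    of the backpropagated signal dE/dz_l and the forward input x_{l-1}."""
    xs, zs = [x0], []
    for i, W in enumerate(weights):
        zs.append(W @ xs[-1])
        is_last = i == len(weights) - 1
        xs.append(softmax(zs[-1]) if is_last else np.maximum(zs[-1], 0.0))
    dz = xs[-1] - y                    # base case: dE/dz_L = softmax(z_L) - y
    grads = []
    for i in reversed(range(len(weights))):
        grads.append(np.outer(dz, xs[i]))               # dE/dW_l (rank 1)
        if i > 0:
            dz = (weights[i].T @ dz) * (zs[i - 1] > 0)  # ReLU gating
    return grads[::-1]
```

Feeding a uniform label vector to `fc_gradients` produces exactly the kind of gradient features proposed above.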
4.3 Computing similarities between gradients
Using the gradients in equation (9) as features is problematic in practice due to their high-dimensional nature with current deep architectures. In the case of AlexNet [Krizhevsky et al.(2012)Krizhevsky, Sutskever, and Hinton], the gradients with respect to the fully connected weights $W_6$, $W_7$, and $W_8$ each contain millions of floating point values. Thus, explicitly computing the dot-product between the gradients is impractical. Instead, we propose to take advantage of the special structure of our gradients (rank-1 matrices, cf. equation (9)) by using the trace kernel, defined for two matrices $A$ and $B$ as:

$$K(A, B) = \mathrm{Tr}(A^\top B). \quad (11)$$

For rank-1 matrices, the trace kernel can be decomposed as the product of two kernels. If we let $A = u v^\top$ and $B = u' v'^\top$, with $u, u' \in \mathbb{R}^m$ and $v, v' \in \mathbb{R}^n$, then $\mathrm{Tr}(A^\top B) = (u^\top u')(v^\top v')$. Therefore, for two images $x$ and $x'$, we can compute the similarity between gradients in a low-dimensional space without explicitly computing the gradients w.r.t. the weights $W_l$:

$$K\!\left(\frac{\partial E}{\partial W_l}, \frac{\partial E'}{\partial W_l}\right) = \left(x_{l-1}^\top\, x'_{l-1}\right)\left(\left(\frac{\partial E}{\partial z_l}\right)^{\!\top} \frac{\partial E'}{\partial z_l}\right), \quad (12)$$

where the primed quantities refer to image $x'$.
The left part of this equation indicates that the forward activations of the two inputs should be similar. This is the standard similarity measure used between images when they are described by the outputs of intermediate ConvNet layers. However, this similarity is multiplicatively weighted by the similarity between the backpropagated quantities. This indicates that, to obtain a high value with the proposed kernel, both the forward activations and the backpropagated quantities of the two images need to be similar.
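The rank-1 factorization of the trace kernel is easy to verify numerically; a small sketch with random rank-1 matrices (dimensions chosen arbitrarily):

```python
import numpy as np

rng = np.random.default_rng(0)
u, u2 = rng.normal(size=5), rng.normal(size=5)   # backpropagated quantities
v, v2 = rng.normal(size=7), rng.normal(size=7)   # forward activations

A, B = np.outer(u, v), np.outer(u2, v2)          # rank-1 gradient matrices

k_full = np.trace(A.T @ B)                       # explicit trace kernel
k_fast = (v @ v2) * (u @ u2)                     # forward x backward dot-products
assert np.isclose(k_full, k_fast)
```

This replaces a dot-product in the dimensionality of the full weight matrix by two dot-products in the (much smaller) dimensionalities of its two factors.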
Normalization. The normalization of the activation features consistently leads to superior results [Chatfield et al.(2014)Chatfield, Simonyan, Vedaldi, and Zisserman]. In our experiments we normalize our forward and backward features independently. This is consistent with normalizing the gradient matrix by its Frobenius norm, since $\|u v^\top\|_F = \|u\|_2\, \|v\|_2$.
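The equivalence between normalizing the two factors independently and normalizing the full gradient matrix can be checked directly (our own sketch):

```python
import numpy as np

rng = np.random.default_rng(1)
u, v = rng.normal(size=4), rng.normal(size=6)

G = np.outer(u, v)                       # rank-1 gradient matrix
# Frobenius norm of a rank-1 matrix factorizes: ||u v^T||_F = ||u|| * ||v||.
assert np.isclose(np.linalg.norm(G), np.linalg.norm(u) * np.linalg.norm(v))

# Hence normalizing u and v independently yields a unit-Frobenius-norm gradient.
G_hat = np.outer(u / np.linalg.norm(u), v / np.linalg.norm(v))
assert np.isclose(np.linalg.norm(G_hat), 1.0)
```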
5 Experimental results
5.1 Datasets and evaluation protocols
We evaluate our approach to transferring features from pretrained models on two standard classification benchmarks, Pascal VOC 2007 and Pascal VOC 2012 [Everingham et al.(2010)Everingham, Van Gool, Williams, Winn, and Zisserman]. Both datasets contain images annotated with one or more labels corresponding to 20 object categories, and include partitions for training, validation, and testing. Accuracy is measured in terms of per-class mean average precision (mAP). The test annotations of VOC 2012 are not public, but an evaluation server with a limited number of submissions per week is available. Therefore, we use the validation set for the first part of our analysis on the VOC 2012 dataset, and evaluate on the test set only for the final experiments. We conduct all VOC 2007 experiments on the full dataset.
5.2 Implementation details
We tested our approach on two different deep ConvNets: AlexNet [Krizhevsky et al.(2012)Krizhevsky, Sutskever, and Hinton] and VGG16 [Simonyan and Zisserman(2015)]. VGG16 is a much deeper architecture than AlexNet, with many more convolutional layers, leading to superior performance but also to much slower training and feature extraction. We used the pretrained networks that are publicly available (https://github.com/BVLC/caffe/wiki/ModelZoo). Both networks were pretrained on the ILSVRC2012 subset of ImageNet, which is disjoint from the Pascal VOC datasets and therefore suitable for our evaluation of feature transfer.

To extract descriptors from the Pascal images, we first resize the images so that the shortest side has a fixed number of pixels (a slightly different size in the VGG16 case), and then take the central square crop, without distorting the aspect ratio. We found this cropping technique to work well in practice. For simplicity, we do no data augmentation. The feature extraction is performed with a customized version of the caffe library (http://caffe.berkeleyvision.org), modified to expose the backpropagation features. This allows us to extract forward and backward features of the training and testing images. At testing time we use a tempered version of SoftMax, $\sigma(z/T)$ with temperature $T > 1$, to produce softer probability distributions for backpropagation. As discussed in section 4.1, we use non-informative uniform labels for the backward pass to extract the gradient features. All forward and backward features are then normalized.

To perform classification, we use the SVM implementation of scikit-learn [Pedregosa et al.(2011)] (http://scikitlearn.org/). The cost parameter of the solver was left at its default value, which worked well in practice.
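For reference, a tempered SoftMax of the kind described above can be sketched as follows. We assume the standard formulation $\sigma(z/T)$; the particular temperature value used in the experiments is an implementation choice not repeated here:

```python
import numpy as np

def tempered_softmax(z, T):
    """SoftMax with temperature T; T > 1 yields softer (higher-entropy)
    probability distributions than the plain SoftMax (T = 1)."""
    zt = (z - z.max()) / T          # shift by the max for numerical stability
    e = np.exp(zt)
    return e / e.sum()

p_sharp = tempered_softmax(np.array([2.0, 0.0, -1.0]), T=1.0)
p_soft = tempered_softmax(np.array([2.0, 0.0, -1.0]), T=4.0)
```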
5.3 Results and discussion
Table 1 summarizes the results and compares our approach with the state of the art and different baselines. We extract and compare several features for each dataset and network architecture: (i) individual forward activation features, from Pool5 up to the probability layer; (ii) concatenations of forward activation features, e.g. Pool5+FC6, FC6+FC7, FC7+FC8; (iii) our proposed gradient features: $\nabla_{W_6}E$, $\nabla_{W_7}E$, and $\nabla_{W_8}E$. The similarity between normalized forward activation features is measured with the dot-product, while the similarity between gradient representations is measured using the trace kernel. We highlight the following points.
Forward activations. In all cases, FC7 is the best performing individual layer on both VOC2007 and VOC2012, independently of the network. This is consistent with previous findings. Also consistent is the fact that the probability layer performs badly in this case. More surprisingly, concatenating forward layers does not seem to bring any noticeable accuracy improvements in any setup.
Gradient representations. We compare the gradient representations with the concatenations of forward activations, since they are closely related and share part of the features. On the deeper layers ($W_6$ and $W_7$), the gradient representations outperform both the individual features and their concatenations, for AlexNet and VGG16 on both datasets. For AlexNet, the improvements are quite significant on both VOC2007 and VOC2012, while the improvements for VGG16 are more modest but still noticeable. Larger relative improvements on less discriminative networks such as AlexNet seem to suggest that the more complex gradient representation can, to some extent, compensate for the lack of discriminative power of the network, but that one obtains diminishing returns as the power of the network increases. Once one reaches the top of the network ($W_8$), the gradient representations perform worse, and the improvements diminish or disappear completely. This is expected, as the derivative with respect to $W_8$ depends heavily on the output of the probability layer, which is known to saturate. However, for the derivatives with respect to $W_6$ and $W_7$, more information is involved, leading to superior results.
Comparison with other works. Our best results are compared with the state of the art on PASCAL VOC2007 and VOC2012 in Table 1. We obtain competitive performance on both datasets. We note however that our results with VGG16 are somewhat inferior to those reported in [Simonyan and Zisserman(2015)] with a similar model. We believe this may be explained by the more costly feature extraction strategy employed by Simonyan and Zisserman, which involves aggregating image descriptors at multiple scales.
Features  mean  aeroplane  bicycle  bird  boat  bottle  bus  car  cat  chair  cow  diningtable  dog  horse  motorbike  person  pottedplant  sheep  sofa  train  tvmonitor

AlexNet
FC7 (forward)  79.4  95.4  88.6  92.6  87.3  42.1  80.1  90.5  89.6  59.9  68.2  74.1  85.3  89.8  85.6  95.3  58.1  78.9  57.9  94.7  74.4
Gradient  80.9  96.6  89.2  93.8  89.5  44.9  81.0  91.9  89.9  61.2  70.4  78.5  86.2  91.4  87.4  95.7  60.5  78.8  62.5  95.2  73.5

VGG16
FC7 (forward)  89.3  99.2  95.9  99.1  96.9  63.8  92.8  95.1  98.1  70.4  87.8  84.3  97.0  97.2  93.5  97.3  68.6  92.2  73.3  98.7  85.5
Gradient  90.0  99.6  97.2  98.8  97.0  63.3  93.8  95.6  98.4  71.1  89.4  85.3  97.7  97.7  95.6  97.5  70.3  92.7  76.2  98.8  84.2
Features  mean  aeroplane  bicycle  bird  boat  bottle  bus  car  cat  chair  cow  diningtable  dog  horse  motorbike  person  pottedplant  sheep  sofa  train  tvmonitor

AlexNet (evaluated on the validation set)
FC7 (forward)  74.9  92.9  75.4  88.7  81.7  48.0  89.0  70.3  88.0  62.3  63.6  57.8  83.5  78.0  82.9  92.9  49.1  74.8  50.5  90.2  78.7
Gradient  76.3  94.3  77.4  89.5  82.2  50.8  90.2  72.4  89.3  64.8  63.9  60.3  84.0  79.6  84.0  93.2  50.6  76.7  52.6  91.8  79.2

AlexNet (evaluated on the test set)
FC7 (forward)  75.0  93.8  75.0  86.4  82.2  48.2  82.5  73.8  87.6  63.8  63.5  69.3  85.7  80.3  84.1  92.3  47.4  72.2  51.8  88.1  72.5
Gradient  76.5  95.0  76.6  87.7  82.9  52.5  83.4  75.6  88.6  65.3  65.4  69.8  86.5  82.1  85.1  93.0  48.2  74.5  57.0  88.4  73.0

VGG16 (evaluated on the validation set)
FC7 (forward)  84.6  98.2  88.3  94.6  90.5  66.0  93.6  80.5  96.4  73.9  81.3  70.2  93.0  91.3  91.3  95.1  56.3  87.7  64.2  95.8  84.5
Gradient  85.2  98.6  89.4  94.7  91.5  67.2  94.0  80.9  96.8  73.7  83.7  71.9  93.4  91.6  91.5  95.4  56.0  88.3  65.2  95.5  85.2

VGG16 (evaluated on the test set)
FC7 (forward)  85.0  97.8  85.2  92.3  91.1  64.5  89.7  82.2  95.4  74.1  84.7  81.1  94.1  93.5  91.9  95.0  57.9  86.0  67.8  95.2  81.5
Gradient  85.3  98.0  86.0  91.7  91.3  65.7  89.6  82.4  95.5  74.5  84.2  80.7  94.3  93.7  92.2  95.4  57.7  87.2  69.2  95.2  81.4
Per-class results. We report per-class results for Pascal VOC2007 in Table 2 and for VOC2012 in Table 3. We compare the best forward features (individual FC7) with the best gradient representation. The results on VOC2007 are on the test set. For VOC2012, we report results both on the validation and on the test sets. On both networks and datasets, the gradient results are consistently better, even when the improvements are not large. For AlexNet, the gradient representation has the best performance on 18 out of the 20 classes on VOC2007, and on all classes for VOC2012. For VGG, the gradient representation is the best on 17 out of the 20 classes, both on VOC2007 and VOC2012 (validation). The differences between validation and test on VOC2012 are minimal.
6 Conclusions
In this paper we have shown a link between ConvNets used as feature extractors and Fisher Vector encodings. We introduced a gradient-based representation for features extracted with ConvNets, inspired by the Fisher Kernel framework. This representation takes advantage of the high-quality features learned by ConvNets in an end-to-end supervised manner, and of the discriminative power of gradient-based representations. We also presented an approach to compute similarities between gradients in an efficient manner, without explicitly computing the high-dimensional gradient representations. We showed that this similarity can be seen as a weighted version of the forward feature similarity that takes into account not only the features themselves, but also information backpropagated from the ConvNet objective. We tested our approach on the Pascal VOC2007 and VOC2012 benchmarks using two popular deep architectures, showing consistent improvements over using only the individual forward activation features or their combinations, as is standard practice.
References
 [Babenko et al.(2014)Babenko, Slesarev, Chigorin, and Lempitsky] A. Babenko, A. Slesarev, A. Chigorin, and V. Lempitsky. Neural codes for image retrieval. In ECCV, 2014.
 [Bergamo et al.(2011)Bergamo, Torresani, and Fitzgibbon] A. Bergamo, L. Torresani, and A. Fitzgibbon. PiCoDeS: Learning a compact code for novel-category recognition. In NIPS, 2011.
 [Bottou(1998)] L. Bottou. Online algorithms and stochastic approximations. In David Saad, editor, Online Learning and Neural Networks. Cambridge University Press, Cambridge, UK, 1998. URL http://leon.bottou.org/papers/bottou98x.
 [Bucilua et al.(2006)Bucilua, Caruana, and Niculescu-Mizil] C. Bucilua, R. Caruana, and A. Niculescu-Mizil. Model compression. In SIGKDD, 2006.
 [Chatfield et al.(2011)Chatfield, Lempitsky, Vedaldi, and Zisserman] K. Chatfield, V. Lempitsky, A. Vedaldi, and A. Zisserman. The devil is in the details: an evaluation of recent feature encoding methods. In BMVC, 2011.
 [Chatfield et al.(2014)Chatfield, Simonyan, Vedaldi, and Zisserman] K. Chatfield, K. Simonyan, A. Vedaldi, and A. Zisserman. Return of the devil in the details: delving deep into convolutional nets. In BMVC, 2014.
 [Csurka et al.(2004)Csurka, Dance, Fan, Willamowski, and Bray] G. Csurka, C. Dance, L. Fan, J. Willamowski, and C. Bray. Visual categorization with bags of keypoints. In ECCV SLCV workshop, 2004.
 [Deng et al.(2009)Deng, Dong, Socher, Li, Li, and Fei-Fei] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.
 [Donahue et al.(2014)Donahue, Jia, Vinyals, Hoffman, Zhang, Tzeng, and Darrell] J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and T. Darrell. DeCAF: A deep convolutional activation feature for generic visual recognition. In ICML, 2014.
 [Douze et al.(2011)Douze, Ramisa, and Schmid] M. Douze, A. Ramisa, and C. Schmid. Combining attributes and Fisher vectors for efficient image retrieval. In CVPR, 2011.
 [Everingham et al.(2010)Everingham, Van Gool, Williams, Winn, and Zisserman] M. Everingham, L. Van Gool, C. Williams, J. Winn, and A. Zisserman. The Pascal visual object classes (VOC) challenge. IJCV, 2010.
 [Fine et al.(2001)Fine, Navrátil, and Gopinath] S. Fine, J. Navrátil, and R. Gopinath. A hybrid GMM/SVM approach to speaker identification. In ICASSP, 2001.
 [Girshick et al.(2014)Girshick, Donahue, Darrell, and Malik] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, 2014.
 [Gong et al.(2014)Gong, Wang, Guo, and Lazebnik] Y. Gong, L. Wang, R. Guo, and S. Lazebnik. Multi-scale orderless pooling of deep convolutional activation features. In ECCV, 2014.
 [Gordo et al.(2012)Gordo, Rodríguez-Serrano, Perronnin, and Valveny] A. Gordo, J. A. Rodríguez-Serrano, F. Perronnin, and E. Valveny. Leveraging category-level labels for instance-level image retrieval. In CVPR, 2012.
 [He et al.(2014)He, Zhang, Ren, and Sun] K. He, X. Zhang, S. Ren, and J. Sun. Spatial pyramid pooling in deep convolutional networks for visual recognition. In ECCV, 2014.
 [Jaakkola and Haussler(1998)] T. Jaakkola and D. Haussler. Exploiting generative models in discriminative classifiers. In NIPS, 1998.
 [Jégou et al.(2010)Jégou, Douze, Schmid, and Pérez] H. Jégou, M. Douze, C. Schmid, and P. Pérez. Aggregating local descriptors into a compact image representation. In CVPR, 2010.
 [Jégou et al.(2011)Jégou, Douze, and Schmid] H. Jégou, M. Douze, and C. Schmid. Product quantization for nearest neighbor search. IEEE TPAMI, 2011.

 [Krizhevsky et al.(2012)Krizhevsky, Sutskever, and Hinton] A. Krizhevsky, I. Sutskever, and G. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.
 [LeCun et al.(1989)LeCun, Boser, Denker, Henderson, Howard, Hubbard, and Jackel] Y. LeCun, B. Boser, J. Denker, D. Henderson, R. Howard, W. Hubbard, and L. Jackel. Handwritten digit recognition with a back-propagation network. In NIPS, 1989.
 [Oquab et al.(2014)Oquab, Bottou, Laptev, and Sivic] M. Oquab, L. Bottou, I. Laptev, and J. Sivic. Learning and transferring mid-level image representations using convolutional neural networks. In CVPR, 2014.

 [Pedregosa et al.(2011)] F. Pedregosa et al. Scikit-learn: Machine learning in Python. JMLR, 2011.
 [Peng et al.(2014a)Peng, Wang, Qiao, and Peng] X. Peng, L. Wang, Y. Qiao, and Q. Peng. Boosting VLAD with supervised dictionary learning and high-order statistics. In ECCV, 2014a.
 [Peng et al.(2014b)Peng, Zou, Qiao, and Peng] X. Peng, C. Zou, Y. Qiao, and Q. Peng. Action recognition with stacked Fisher vectors. In ECCV, 2014b.
 [Perronnin and Dance(2007)] F. Perronnin and C. Dance. Fisher kernels on visual vocabularies for image categorization. In CVPR, 2007.
 [Perronnin and Larlus(2015)] F. Perronnin and D. Larlus. Fisher vectors meet neural networks: A hybrid classification architecture. In CVPR, 2015.
 [Perronnin et al.(2010)Perronnin, Sánchez, and Mensink] F. Perronnin, J. Sánchez, and T. Mensink. Improving the Fisher kernel for large-scale image classification. In ECCV, 2010.
 [Razavian et al.(2014)Razavian, Azizpour, Sullivan, and Carlsson] A. S. Razavian, H. Azizpour, J. Sullivan, and S. Carlsson. CNN features off-the-shelf: An astounding baseline for recognition. In CVPR Deep Vision Workshop, 2014.
 [Russakovsky et al.(2015)] O. Russakovsky et al. ImageNet large scale visual recognition challenge. IJCV, 2015.
 [Sánchez et al.(2013)Sánchez, Perronnin, Mensink, and Verbeek] J. Sánchez, F. Perronnin, T. Mensink, and J. Verbeek. Image classification with the Fisher vector: Theory and practice. IJCV, 2013.
 [Sermanet et al.(2014)Sermanet, Eigen, Zhang, Mathieu, Fergus, and LeCun] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun. OverFeat: Integrated recognition, localization and detection using convolutional networks. In ICLR, 2014.
 [Simonyan and Zisserman(2015)] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.
 [Simonyan et al.(2013)Simonyan, Vedaldi, and Zisserman] K. Simonyan, A. Vedaldi, and A. Zisserman. Deep Fisher networks for large-scale image classification. In NIPS, 2013.
 [Sivic and Zisserman(2003)] J. Sivic and A. Zisserman. Video Google: A text retrieval approach to object matching in videos. In ICCV, 2003.
 [Smith and Gales(2001)] N. Smith and M. Gales. Speech recognition using SVMs. In NIPS, 2001.
 [Smith and Gales(2002)] N. Smith and M. Gales. Using SVMs to classify variable length speech patterns. Technical report, Cambridge University, 2002.
 [Sydorov et al.(2014)Sydorov, Sakurada, and Lampert] V. Sydorov, M. Sakurada, and C. Lampert. Deep Fisher kernels – End-to-end learning of the Fisher kernel GMM parameters. In CVPR, 2014.
 [Torresani et al.(2010)Torresani, Szummer, and Fitzgibbon] L. Torresani, M. Szummer, and A. Fitzgibbon. Efficient object category recognition using classemes. In ECCV, 2010.
 [Wang et al.(2009)Wang, Hoiem, and Forsyth] G. Wang, D. Hoiem, and D. Forsyth. Learning image similarity from Flickr groups using stochastic intersection kernel machines. In ICCV, 2009.
 [Wei et al.(2014)Wei, Xia, Huang, Ni, Dong, Zhao, and Yan] Y. Wei, W. Xia, J. Huang, B. Ni, J. Dong, Y. Zhao, and S. Yan. CNN: Single-label to multi-label. arXiv, 2014.
 [Weston et al.(2010)Weston, Bengio, and Usunier] J. Weston, S. Bengio, and N. Usunier. Large scale image annotation: Learning to rank with joint word-image embeddings. In ECML, 2010.
 [Yosinski et al.(2014)Yosinski, Clune, Bengio, and Lipson] J. Yosinski, J. Clune, Y. Bengio, and H. Lipson. How transferable are features in deep neural networks? In NIPS, 2014.
 [Zeiler and Fergus(2014)] M. D. Zeiler and R. Fergus. Visualizing and understanding convolutional networks. In ECCV, 2014.