1 Introduction
Convolutional Neural Networks (CNNs) Ref:LeCun1989 ; Ref:Krizhevsky2012 have led to leaps forward in a large number of computer vision applications. On the task of large-scale image classification, particularly ImageNet Ref:Deng2009 , CNN-family models have been dominant. CNNs are able to automatically learn rich hierarchical features from input images. However, for image datasets like PASCAL VOC Ref:Everingham2010 , where objects have large variations in shape, size, and clutter, directly adopting a CNN does not produce satisfactory results: state-of-the-art results for PASCAL VOC object classification are obtained with Bag of Visual Words (BoVW) on top of CNN features that are learned separately. As shown in Fig. 1 (a), ImageNet mainly consists of iconic-object images
, i.e., single large objects in the canonical perspective located at the center of the image. Compared with ImageNet, the structure of PASCAL VOC images tends to be much more complex: objects have large variations in location, scale, layout, and appearance; backgrounds are cluttered; and there are often multiple objects in an image. A standard pipeline includes (1) local feature extraction using off-the-shelf CNNs that are pre-trained on ImageNet; (2) sparse coding
Ref:Yang2009 or Fisher Vector Ref:Sanchez2013 encoding to aggregate local features into a global, fixed-dimensional image representation; and (3) classification on the encoded feature space. These approaches often produce results that are much better than those of a plain CNN Ref:Liu2014 ; Ref:Cimpoi2015 . Due to the complexity of BoVW-based methods, most previous works extract representations with a standalone module that cannot be trained together with other modules. Consequently, earlier modules in their pipelines, such as the CNN feature extractor, do not receive error differentials from later ones and thus cannot be fine-tuned. This has a negative impact on the overall performance. In particular, CNN features are learned from ImageNet, whose object distribution is quite different from that of PASCAL VOC; without fine-tuning, the features are likely to be less effective.
In this paper, we propose FisherNet, an end-to-end trainable neural network that takes advantage of both CNN features and the powerful Fisher Vector (FV) Ref:Sanchez2013 encoding method. FisherNet densely extracts local features at multiple scales and aggregates them with FV, which encodes and aggregates local descriptors with a universal generative Gaussian Mixture Model (GMM). We model this process as a learnable module, which we call the Fisher Layer. The Fisher Layer allows back-propagating error differentials as well as optimizing the FV codebook, eventually making the whole network trainable end-to-end.
Moreover, FisherNet learns and extracts local patch features with great efficiency. Inspired by the recent success of fast object detection algorithms such as SPPnet Ref:He2015 and Fast R-CNN Ref:Girshick2015 , we share the computation of feature extraction among patches.
Experiments show that our FisherNet significantly boosts the performance of an untrained FV, and achieves highly competitive performance on the PASCAL VOC object classification task. In testing, a FisherNet with AlexNet Ref:Krizhevsky2012 takes only  s to classify one image, and  s with VGG16 Ref:Simonyan2015 , both over  faster than the previous state-of-the-art method HCP Ref:Wei2015 .

2 Related Work
Bag of Visual Words (BoVW) based image representation is one of the most popular methods in the computer vision community, especially for image classification Ref:Yang2009 ; Ref:Wang2013 ; Ref:Doersch2013 ; Ref:Sanchez2013 . BoVW representations have been widely applied for their robustness, especially to object deformation, translation, and occlusion. The Fisher Vector (FV) Ref:Sanchez2013 is one of the most powerful BoVW-based representation methods, and has achieved state-of-the-art performance on many image classification benchmarks. The traditional FV uses hand-crafted descriptors like SIFT Ref:Lowe2004 as patch features and learns FV parameters with a Gaussian Mixture Model (GMM), so neither the patch features nor the FV parameters are trainable.
Recently, inspired by the great success of deep CNNs on image classification Ref:Krizhevsky2012 ; Ref:Simonyan2015 , many attempts have been made to combine CNNs and FV. Liu et al. Ref:Liu2014 extract activations from the first fully connected layer as patch features, and use a Sparse Coding based FV for object, scene, and fine-grained recognition. Cimpoi et al. Ref:Cimpoi2015 extract outputs of the last convolutional layer of a CNN as input descriptors, and use FV to represent images for texture recognition and object and scene classification. Dixit et al. Ref:Dixit2015 combine activations from the last fully connected layer with FV for scene classification. They all show better results than a plain CNN. However, they simply use the CNN to extract patch features, and the FV parameters are not trainable either, both of which may limit the performance of their methods.

Many researchers have also tried to improve the FV Ref:Sydorov2014 ; Ref:Simonyan2013 . Sydorov et al. Ref:Sydorov2014 extract patch features with hand-crafted SIFT, and learn the parameters of the FV and SVM jointly for object classification. They choose an alternating strategy to optimize the FV and SVM parameters: fixing the FV parameters to train the SVM, then fixing the SVM parameters to train the FV. But using SIFT makes it impossible to learn patch features end-to-end; meanwhile, the alternating optimization is incompatible with the gradient descent optimization adopted by CNNs. Different from their method, we decompose the FV into a series of network layers, insert them into a CNN, and learn both patch features and FV parameters in an end-to-end manner by standard back-propagation. As we will show in Section 4.3, learning the parameters of patch features and FV end-to-end outperforms learning only the FV parameters by a large margin. Simonyan et al. Ref:Simonyan2013 also propose a “Fisher Network” by stacking FVs on top of each other. However, their network depends on hand-crafted descriptors, and the FV parameters are also fixed once the codebook is constructed.
Recently, the “NetVLAD” framework presented by Arandjelović et al. Ref:Arandjelovic2016 develops a VLAD layer for deep networks. They choose outputs from the last convolutional layer as input descriptors, followed by a VLAD layer, and also learn all parameters of the patch features and VLAD end-to-end. But note that VLAD is just a simplified version of the FV Ref:Jegou2012 ; Ref:Sanchez2013 , and it is more difficult to embed the FV into CNN frameworks. Meanwhile, VLAD and NetVLAD can capture only first-order statistics of the data, while the FV and FisherNet capture both first- and second-order statistics, so in many applications, especially image classification, the FV is more suitable Ref:Cimpoi2015 ; Ref:Dixit2015 . Moreover, as the receptive field sizes of convolutional layers are fixed, patches taken from the last convolutional layer come at only a single scale. We share the computation of the convolutional layers among different patches, and use a Spatial Pyramid Pooling (SPP) layer Ref:He2015 to generate patch features, making it possible to extract features from patches at multiple scales.
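To make the first- versus second-order distinction concrete, here is a minimal NumPy sketch of VLAD-style aggregation. It uses hard nearest-center assignment and sums first-order residuals only; this is a didactic simplification, not the NetVLAD formulation (which uses soft assignment), and the function name and codebook are illustrative.

```python
import numpy as np

def vlad(X, centers):
    """VLAD aggregation: sum of first-order residuals per codebook center.

    X: (N, D) patch features; centers: (K, D) codebook centers.
    Each patch is hard-assigned to its nearest center; only first-order
    statistics are captured, whereas a Fisher Vector adds a second-order
    (variance) term of the same size on top.
    """
    # Squared distances from every patch to every center, shape (N, K).
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    assign = d2.argmin(axis=1)
    K, D = centers.shape
    out = np.zeros((K, D))
    for k in range(K):
        members = X[assign == k]
        if len(members):
            out[k] = (members - centers[k]).sum(axis=0)
    return out.ravel()  # K*D dims, vs the 2*K*D of a Fisher Vector
```

A patch that coincides with its nearest center contributes a zero residual, which is why purely first-order encodings discard the spread of the patches around each center.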
3 The Architecture of Deep FisherNet
The architecture of our FisherNet is shown in Fig. 2. Given an input image, we first densely extract image patches at multiple scales; these patches cover objects at different locations and scales. The image is passed through several convolutional (conv) layers, resulting in a conv feature map whose size varies with the size of the input image. Then, an SPP layer Ref:He2015 is employed to generate a fixed-size conv feature map for every patch, since the following fully connected (fc) layers require fixed-length input. The fc layers accept each of these feature maps and output a corresponding patch feature vector, eventually producing a collection of patch features from the set of feature maps. Following that, the collection is fed into our proposed Fisher Layer, which aggregates the patch features into an orderless, fixed-length image representation. Finally, this representation is used for classification. In this section, we describe the whole FisherNet architecture for object classification.
3.1 Pretrained CNN Models
As the number of training images is limited, it is impractical to train a randomly initialized CNN model on the target dataset directly. It is more advisable to fine-tune a CNN model trained on large datasets like ImageNet Ref:Deng2009 . Here we choose two CNN models, AlexNet Ref:Krizhevsky2012 and VGG16 Ref:Simonyan2015 , both of which have several conv layers with some max-pooling layers, followed by three fc layers.
3.2 SPP Layer
One way to generate patch features is to feed each patch into the CNN model separately. But this tends to be time-consuming, since it does not share the computation of overlapping patches. Here we adopt the SPP layer from SPPnet Ref:He2015 and Fast R-CNN Ref:Girshick2015 . After obtaining the feature map of the input image, the SPP layer is employed to project each patch to a fixed-size feature map. Specifically, the last max-pooling layer of the original CNN is replaced by the SPP layer. For patch $r$, its output feature map can be acquired by Eq. (1), where $x_i$ is the $i$-th activation into the SPP layer, $R_{j,r}$ is the $j$-th sub-window of patch $r$, and $y_{j,r}$ is the output from $R_{j,r}$. The number of sub-windows depends on the original CNN model (e.g.,  for AlexNet Ref:Krizhevsky2012 and  for VGG16 Ref:Simonyan2015 ). After the SPP layer, each patch feature map is passed to the following fc layers to produce patch features.

$$y_{j,r} = \max_{i \in R_{j,r}} x_i \qquad (1)$$
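The sub-window max-pooling of Eq. (1) can be sketched in NumPy as below. The pyramid levels `(1, 2)` are illustrative placeholders; the actual sub-window configurations for AlexNet and VGG16 are not recoverable from the text above.

```python
import numpy as np

def spp_pool(feature_map, patch, levels=(1, 2)):
    """Spatial pyramid pooling over one patch of a conv feature map.

    feature_map: (C, H, W) array of conv activations.
    patch: (y0, x0, y1, x1) patch bounds on the feature map (end-exclusive).
    levels: pyramid grid sizes; level n contributes n*n max-pooled bins,
            so the output length is fixed regardless of the patch size.
    """
    y0, x0, y1, x1 = patch
    region = feature_map[:, y0:y1, x0:x1]
    h, w = region.shape[1], region.shape[2]
    pooled = []
    for n in levels:
        # Split the patch into an n x n grid of sub-windows R_j and take
        # the max activation inside each one, as in Eq. (1).
        ys = np.linspace(0, h, n + 1).astype(int)
        xs = np.linspace(0, w, n + 1).astype(int)
        for i in range(n):
            for j in range(n):
                sub = region[:, ys[i]:ys[i + 1], xs[j]:xs[j + 1]]
                pooled.append(sub.max(axis=(1, 2)))
    return np.concatenate(pooled)  # length C * sum(n*n for n in levels)
```

Patches of different sizes on the feature map yield outputs of identical length, which is exactly what the following fc layers require.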
3.3 Fisher Layer
After generating patch features, we aggregate them into an image representation; this is implemented by our Fisher Layer. In this subsection, we first briefly review the FV Ref:Sanchez2013 for image classification, then present the proposed Fisher Layer of FisherNet.
3.3.1 Fisher Vector
Let $\{I_1, \dots, I_N\}$ be a set of images. Their patch features are $\{x_{nt} \in \mathbb{R}^D\}$, where $n \in \{1, \dots, N\}$, $t \in \{1, \dots, T_n\}$, and $N$ and $T_n$ are the number of images and the number of patches for image $I_n$, respectively. A $K$-component Gaussian Mixture Model (GMM) is used as the probability density function of the patch features. Let $\Theta = \{(\pi_k, \mu_k, \Sigma_k)\}_{k=1}^{K}$ be the parameters of the GMM, where $\pi_k$, $\mu_k$, and $\Sigma_k$ are the weight, mean vector, and covariance matrix of the $k$-th GMM component, respectively. Then the density $p(x)$ can be written as

$$p(x) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x; \mu_k, \Sigma_k). \qquad (2)$$

Notice that the covariance matrix is restricted to be diagonal in the FV Ref:Sanchez2013 , i.e., $\Sigma_k = \operatorname{diag}(\sigma_k^2)$. For any patch feature $x$, we define a vector $f(x) = \big[f_1^{(1)}(x); f_1^{(2)}(x); \dots; f_K^{(1)}(x); f_K^{(2)}(x)\big]$, where the sub-vectors $f_k^{(1)}(x)$ and $f_k^{(2)}(x)$ are as follows:

$$f_k^{(1)}(x) = \frac{\gamma_k(x)}{\sqrt{\pi_k}} \left( \frac{x - \mu_k}{\sigma_k} \right), \qquad (3)$$

$$f_k^{(2)}(x) = \frac{\gamma_k(x)}{\sqrt{2\pi_k}} \left( \frac{(x - \mu_k)^2}{\sigma_k^2} - 1 \right), \qquad (4)$$

where $\gamma_k(x)$ is the posterior probability as in Eq. (5):

$$\gamma_k(x) = \frac{\pi_k \, \mathcal{N}(x; \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(x; \mu_j, \Sigma_j)}. \qquad (5)$$

Then the FV of image $I_n$ is the mean-pooling of all patch representations in $I_n$, i.e., $\Phi(I_n) = \frac{1}{T_n} \sum_{t=1}^{T_n} f(x_{nt})$.
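The FV construction of Eqs. (2)-(5), followed by the power and L2 normalizations used later in Section 4.1, can be sketched as plain NumPy code. This is a reference sketch of the standard FV, not the paper's implementation; the symbol names match the equations above.

```python
import numpy as np

def fisher_vector(X, weights, mu, sigma):
    """Standard Fisher Vector of one image (Eqs. 3-5).

    X: (N, D) patch features; weights: (K,) mixture weights pi_k;
    mu, sigma: (K, D) means and per-dimension standard deviations.
    Returns a 2*K*D vector: the mean over patches of the stacked
    first- and second-order statistics of every GMM component.
    """
    N, D = X.shape
    # Posterior gamma_k(x_i) (Eq. 5), computed in the log domain for stability.
    diff = (X[:, None, :] - mu[None, :, :]) / sigma[None, :, :]      # (N, K, D)
    log_comp = (np.log(weights)[None, :]
                - 0.5 * (diff ** 2).sum(axis=2)
                - np.log(sigma).sum(axis=1)[None, :]
                - 0.5 * D * np.log(2 * np.pi))
    log_comp -= log_comp.max(axis=1, keepdims=True)
    gamma = np.exp(log_comp)
    gamma /= gamma.sum(axis=1, keepdims=True)                        # (N, K)
    # First-order (Eq. 3) and second-order (Eq. 4) statistics per patch.
    g1 = gamma[:, :, None] * diff / np.sqrt(weights)[None, :, None]
    g2 = gamma[:, :, None] * (diff ** 2 - 1) / np.sqrt(2 * weights)[None, :, None]
    # Mean-pool over patches (Section 3.3.1) and flatten.
    return np.concatenate([g1.mean(axis=0).ravel(), g2.mean(axis=0).ravel()])

def normalize_fv(z):
    """Power normalization then L2 normalization, as in Ref:Sanchez2013."""
    z = np.sign(z) * np.sqrt(np.abs(z))
    norm = np.linalg.norm(z)
    return z / norm if norm > 0 else z
```

The output dimension $2KD$ is fixed for any number of patches, which is what makes the FV a suitable orderless aggregation step.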
3.3.2 The Architecture of Fisher Layer
The parameters of a traditional FV are fixed once the codebook is constructed. The Fisher Layer, however, has parameters that remain learnable after codebook construction, and thus can be trained jointly with the CNN. To achieve this, we first make two simplifications to the original FV: 1) we drop the weights $\pi_k$, which assumes all GMM components have equal weights; 2) we simplify $\gamma_k(x)$ to the form in Eq. (6), which amounts to assuming all covariance matrices share the same determinant. Despite these small differences from the original FV, our simplified FV still inherits the advantage of capturing first- and second-order statistics. We will also show in Section 4.3 that even with these simplifications, our FisherNet still achieves better performance than the traditional FV.

$$\gamma_k(x) = \frac{\exp\!\left( -\frac{1}{2} \left\| \frac{x - \mu_k}{\sigma_k} \right\|^2 \right)}{\sum_{j=1}^{K} \exp\!\left( -\frac{1}{2} \left\| \frac{x - \mu_j}{\sigma_j} \right\|^2 \right)} \qquad (6)$$

$$f_k^{(1)}(x) = \gamma_k(x) \left( \frac{x - \mu_k}{\sigma_k} \right), \qquad f_k^{(2)}(x) = \gamma_k(x) \left( \frac{(x - \mu_k)^2}{\sigma_k^2} - 1 \right) \qquad (7)$$
Let $w_k = 1 / \sigma_k$ and $b_k = -\mu_k / \sigma_k$ (element-wise); the final form of the Fisher Layer can then be obtained as follows:

$$f_k^{(1)}(x) = \gamma_k(x) \left( w_k \odot x + b_k \right), \qquad (8)$$

$$f_k^{(2)}(x) = \gamma_k(x) \left( (w_k \odot x + b_k)^2 - 1 \right), \qquad (9)$$

$$\gamma_k(x) = \frac{\exp\!\left( -\frac{1}{2} \left\| w_k \odot x + b_k \right\|^2 \right)}{\sum_{j=1}^{K} \exp\!\left( -\frac{1}{2} \left\| w_j \odot x + b_j \right\|^2 \right)}, \qquad (10)$$

where $\odot$ is an element-wise product. Notice that Eq. (10) is just a softmax function, and $\{w_k\}$ and $\{b_k\}$ are the sets of learnable parameters for each GMM component $k$. We can observe that Eqs. (8), (9), and (10) share the same computation $w_k \odot x + b_k$, which is obviously differentiable. Meanwhile, the remaining operations are linear or element-wise squares, so we can derive gradients for all parameters via back-propagation. To make the Fisher Layer clearer, we also show its architecture in Fig. 3.
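A forward pass of the Fisher Layer described by Eqs. (8)-(10) can be sketched as follows. This is a NumPy sketch under stated assumptions: the $\gamma$-weighting and the final mean-pooling placement are carried over from the FV construction in Section 3.3.1, and `w`, `b` would be initialized from $1/\sigma_k$ and $-\mu_k/\sigma_k$.

```python
import numpy as np

def fisher_layer(X, w, b):
    """Forward pass of the simplified Fisher Layer (Eqs. 8-10), a sketch.

    X: (N, D) patch features from the fc layers.
    w, b: (K, D) learnable parameters per GMM component
          (initialized from 1/sigma_k and -mu_k/sigma_k).
    """
    # Shared computation w_k (.) x + b_k, shape (N, K, D).
    u = w[None, :, :] * X[:, None, :] + b[None, :, :]
    # Soft assignment gamma_k(x): softmax over components (Eq. 10).
    logits = -0.5 * (u ** 2).sum(axis=2)          # (N, K)
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    gamma = np.exp(logits)
    gamma /= gamma.sum(axis=1, keepdims=True)
    # First-order (Eq. 8) and second-order (Eq. 9) outputs.
    f1 = gamma[:, :, None] * u
    f2 = gamma[:, :, None] * (u ** 2 - 1)
    # Mean-pool over patches into a fixed-length 2*K*D representation.
    return np.concatenate([f1.mean(axis=0).ravel(), f2.mean(axis=0).ravel()])
```

Because every operation here is a linear map, an element-wise square, or a softmax, gradients with respect to `X`, `w`, and `b` all exist, which is what allows the layer to be trained end-to-end with the CNN.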
3.4 Loss Function
The subsections above describe the main components of our FisherNet; in this subsection, we introduce our loss. Since we focus on object classification, where the same image may contain multiple different objects, the popular softmax loss is not suitable. Here we choose the multi-class sigmoid cross-entropy loss. Specifically, suppose the predicted score vector of image $I_n$ is $s_n = [s_{n1}, \dots, s_{nC}]$ and the label vector is $y_n = [y_{n1}, \dots, y_{nC}]$, where $C$ is the number of classes and $y_{nc} = 1$ if $I_n$ contains object class $c$, or $y_{nc} = 0$ otherwise. Then the loss function is given by Eq. (11), where $\sigma(s) = 1 / (1 + e^{-s})$ is the sigmoid function. The loss can be changed for different tasks.

$$L = -\frac{1}{C} \sum_{c=1}^{C} \Big[ y_{nc} \log \sigma(s_{nc}) + (1 - y_{nc}) \log \big( 1 - \sigma(s_{nc}) \big) \Big] \qquad (11)$$
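Eq. (11) transcribes directly to NumPy. Averaging over classes is an assumption here (a sum over classes only rescales gradients); the per-class independence is what lets one image carry several positive labels.

```python
import numpy as np

def multilabel_sigmoid_ce(scores, labels):
    """Multi-class sigmoid cross-entropy loss (Eq. 11), one image.

    scores: (C,) predicted class scores; labels: (C,) with 1 where the
    object class is present and 0 otherwise. Each class is treated as an
    independent binary problem, so multiple labels per image are handled
    naturally, unlike the softmax loss.
    """
    p = 1.0 / (1.0 + np.exp(-scores))   # sigmoid per class
    eps = 1e-12                          # guard against log(0)
    return float(-np.mean(labels * np.log(p + eps)
                          + (1 - labels) * np.log(1 - p + eps)))
```

At zero scores every class predicts probability 0.5, so the loss is ln 2 regardless of the labels, a handy sanity check for an untrained classifier head.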
[Table 1: Per-class AP and mAP on the PASCAL VOC 2007 test set for SCFVC Ref:Liu2014 , Cimpoi et al. Ref:Cimpoi2015 , VGG16/19+SVM Ref:Simonyan2015 , HCP-VGG16 Ref:Wei2015 , FisherNet-AlexNet, and FisherNet-VGG16, over the 20 classes aero, bike, bird, boat, bottle, bus, car, cat, chair, cow, table, dog, horse, mbike, person, plant, sheep, sofa, train, tv. Numeric entries not recoverable.]
[Table 2: Per-class AP and mAP on the PASCAL VOC 2012 test set for Oquab et al. Ref:Oquab2015 , VGG16/19+SVM Ref:Simonyan2015 , HCP-VGG16 Ref:Wei2015 , FisherNet-AlexNet, and FisherNet-VGG16, over the same 20 classes. Numeric entries not recoverable.]
[Table 3: mAP on PASCAL VOC 2007 and 2012 for CNN-finetune, CNN-FV, CNN-FL, and FisherNet. Numeric entries not recoverable.]
[Table 4: Per-class AP on PASCAL VOC 2007 for CNN-FV and FisherNet, with the per-class improvement. Numeric entries not recoverable.]
[Table 5: Per-class AP on PASCAL VOC 2012 for CNN-FV and FisherNet, with the per-class improvement. Numeric entries not recoverable.]
4 Experiments
In this section, we report our results for object classification and provide some discussion.
4.1 Experimental Setup
Datasets and Evaluation Metrics
Two popular object classification datasets are chosen for our experiments, PASCAL VOC 2007 and 2012 Ref:Everingham2010 , which have  and  images respectively, with 20 different object classes. Both datasets have multiple labels per image. We train our model on the standard trainval set (  images for VOC 2007 and  for 2012) with only image-level labels, and test on the test set. Average Precision (AP) and mean AP (mAP) are used for evaluation.
Implementation Details
As mentioned in Section 3.1, our FisherNet is based on two CNN architectures, AlexNet Ref:Krizhevsky2012 and VGG16 Ref:Simonyan2015 , which are pre-trained on the large-scale ImageNet dataset Ref:Deng2009 .

As the dimension of the second fc layer is , using it as patch features directly would result in a very high-dimensional FV. So we remove the last fc layer (maintaining the first and second fc layers) and add two fc layers after the second fc layer: the first is a -dimensional layer for dimension reduction; the second predicts image scores (its dimension depends on the number of classes). Then we use the whole image to fine-tune this network on the target dataset. We train this network for  iterations with mini-batch size . The learning rates of these two layers and of the other layers are set to  and  respectively, and divided by  after every  iterations. Results of this whole-image fine-tuning procedure are shown as CNN-finetune in Table 3. After that, we use the SPP layer to replace the last max-pooling layer. Also, we replace the last fc layer of the fine-tuned model with our Fisher Layer and an fc layer for predicting scores. We train the FisherNet for  iterations, with mini-batch size . The learning rates of the Fisher Layer, the last fc layer, and the other layers are set to , , and , respectively. The number of GMM components is , so the final dimension of the image representation is . The momentum and weight decay are set to  and  respectively.
For the Fisher Layer, we extract patch features from the -dimensional fc layer, then run a GMM to obtain $\mu_k$ and $\sigma_k$ for initializing $w_k$ and $b_k$ as in Section 3.3.2. The other newly added layers are initialized using Gaussian distributions with mean . In all experiments, once our FisherNet is trained, we extract the trained FV and train a one-vs-all linear SVM with learning hyper-parameter . As in Ref:Sanchez2013 , we use two normalizations: power normalization ( $z \leftarrow \operatorname{sign}(z)\sqrt{|z|}$ ) and L2 normalization ( $z \leftarrow z / \|z\|_2$ ).

Other Setup
To generate patches, we densely extract patches at seven scales , with step size , which produces  to  patches per image. For data augmentation, we horizontally flip images for whole-image fine-tuning. As our FisherNet can handle images of arbitrary size, we use five image scales (resizing the longest side to one of the five scales while maintaining the aspect ratio) with horizontal flips for training FisherNet. We then average the FVs over these five scales (without flips) to train and test the SVM.
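Dense multi-scale patch sampling as described above can be sketched as a simple grid generator. The paper's seven scales and step size are lost in extraction, so the `scales` and `step` defaults below are purely illustrative placeholders.

```python
def dense_patches(img_h, img_w, scales=(64, 96, 128), step=32):
    """Densely sample square patches at multiple scales.

    NOTE: the scale and step values here are hypothetical stand-ins for
    the paper's (unrecoverable) settings. Returns (y0, x0, y1, x1) boxes
    (end-exclusive) covering the image on a regular grid per scale.
    """
    patches = []
    for s in scales:
        if s > min(img_h, img_w):
            continue  # skip scales larger than the image
        for y in range(0, img_h - s + 1, step):
            for x in range(0, img_w - s + 1, step):
                patches.append((y, x, y + s, x + s))
    return patches
```

Mapped onto the shared conv feature map, every such box becomes one SPP region, so adding scales costs only the per-patch pooling, not repeated convolutions.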
The GMM, SVM, and CNN are implemented with VLFeat Ref:Vedaldi2010 , LIBLINEAR Ref:Fan2008 , and Caffe Ref:Jia2014 , respectively. All of our experiments run on a PC with an Intel(R) i7-4790K CPU (4.00GHz), an NVIDIA GTX TitanX GPU, and 32GB RAM.

4.2 Experimental Results
Experimental results on PASCAL VOC 2007 and 2012 are shown in Table 1 and Table 2. (Results of our FisherNet on PASCAL VOC 2012 can be viewed at http://host.robots.ox.ac.uk:8080/anonymous/DJ5JTS.html and http://host.robots.ox.ac.uk:8080/anonymous/AKKQXE.html.) We can observe that our FisherNet achieves highly competitive results compared with other CNN-based methods using a single model. More importantly, our method outperforms other CNN+FV based methods Ref:Liu2014 ; Ref:Cimpoi2015 . For example, Liu et al. Ref:Liu2014 use outputs from the first fc layer as patch descriptors and encode images using a Sparse Coding based FV; Cimpoi et al. Ref:Cimpoi2015 choose activations of the last conv layer as patch features, extract them at ten different scales, and encode images using the standard FV. These comparisons demonstrate the effectiveness of jointly learning FV parameters and patch features.
Testing Time
Our method is also very efficient. It takes only  s and  s per image during testing, for AlexNet and VGG16 respectively, which is over  faster than the previous state-of-the-art method HCP Ref:Wei2015 (  s for AlexNet and  s for VGG16).
4.3 Discussion
Here we discuss the benefits of our end-to-end training for object classification. Without loss of generality, we only use AlexNet in this part. Results are shown in Table 3, where CNN-finetune denotes the whole-image fine-tuning procedure of Section 4.1; CNN-FV denotes extracting patch features using SPP, with the same patches as our FisherNet, and then representing images with the standard FV; CNN-FL denotes setting the learning rates of all layers before the Fisher Layer to  and training our FisherNet, i.e., learning only the FV parameters; and FisherNet denotes our full FisherNet training as in Section 4.1. (Some results on PASCAL VOC 2012 in Table 3 can be viewed at http://host.robots.ox.ac.uk:8080/anonymous/AN0JUF.html, http://host.robots.ox.ac.uk:8080/anonymous/38ZBIX.html, and http://host.robots.ox.ac.uk:8080/anonymous/RKWM6E.html.) We can observe that CNN-FV outperforms CNN-finetune by a large margin, which demonstrates that a BoVW-based representation can achieve better performance than a plain CNN. CNN-FL achieves only small improvements over CNN-FV; it is similar to Ref:Sydorov2014 , which learns only the FV parameters rather than learning the parameters of the FV and patch features jointly. If we train all parameters jointly, the performance is boosted considerably. Also, as Table 4 and Table 5 show, jointly learning all parameters of the patch features and the FV improves performance for all classes (  on PASCAL VOC 2007 and 2012 compared with the traditional FV). All these facts confirm the effectiveness of our learning strategy.
5 Conclusion
In this paper, we propose a novel deep FisherNet framework, in which all parameters of the patch features and the FV are learned in an end-to-end manner. Compared with the traditional FV, experiments show substantial improvements from this learning strategy. We believe that combining CNN-based patch features with traditional BoVW-based representation methods can achieve better performance than a plain CNN, and that integrating these methods into a CNN framework can improve results further. As the FV is quite an effective image representation method and has achieved state-of-the-art results on many computer vision applications such as image classification and retrieval, in the future we will explore how to apply our newly designed FisherNet to other applications.
References
 [1] R. Arandjelović, P. Gronat, A. Torii, T. Pajdla, and J. Sivic. NetVLAD: CNN architecture for weakly supervised place recognition. In CVPR, 2016.
 [2] M. Cimpoi, S. Maji, I. Kokkinos, and A. Vedaldi. Deep filter banks for texture recognition, description, and segmentation. IJCV, pages 1–30, 2015.
 [3] J. Deng, W. Dong, R. Socher, L. J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, pages 248–255, 2009.
 [4] M. Dixit, S. Chen, D. Gao, N. Rasiwasia, and N. Vasconcelos. Scene classification with semantic Fisher vectors. In CVPR, pages 2974–2983, 2015.
 [5] C. Doersch, A. Gupta, and A. A. Efros. Mid-level visual element discovery as discriminative mode seeking. In NIPS, pages 494–502, 2013.
 [6] M. Everingham, L. J. V. Gool, C. K. I. Williams, J. M. Winn, and A. Zisserman. The pascal visual object classes (VOC) challenge. IJCV, 88(2):303–338, 2010.
 [7] R. E. Fan, K. W. Chang, C. J. Hsieh, X. R. Wang, and C. J. Lin. LIBLINEAR: A library for large linear classification. JMLR, 9:1871–1874, 2008.
 [8] R. Girshick. Fast R-CNN. In ICCV, pages 1440–1448, 2015.
 [9] K. He, X. Zhang, S. Ren, and J. Sun. Spatial pyramid pooling in deep convolutional networks for visual recognition. TPAMI, 37(9):1904–1916, 2015.
 [10] H. Jégou, F. Perronnin, M. Douze, J. Sanchez, P. Perez, and C. Schmid. Aggregating local image descriptors into compact codes. TPAMI, 34(9):1704–1716, 2012.
 [11] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. In ACM MM, pages 675–678, 2014.
 [12] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, pages 1097–1105, 2012.
 [13] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel. Backpropagation applied to handwritten zip code recognition. Neural computation, 1(4):541–551, 1989.
 [14] L. Liu, C. Shen, L. Wang, A. van den Hengel, and C. Wang. Encoding high dimensional local features by sparse coding based Fisher vectors. In NIPS, pages 1143–1151, 2014.
 [15] D. G. Lowe. Distinctive image features from scale-invariant keypoints. IJCV, 60(2):91–110, 2004.

 [16] M. Oquab, L. Bottou, I. Laptev, and J. Sivic. Is object localization for free? Weakly-supervised learning with convolutional neural networks. In CVPR, pages 685–694, 2015.
 [17] J. Sánchez, F. Perronnin, T. Mensink, and J. J. Verbeek. Image classification with the Fisher Vector: Theory and practice. IJCV, 105(3):222–245, 2013.
 [18] K. Simonyan, A. Vedaldi, and A. Zisserman. Deep Fisher networks for large-scale image classification. In NIPS, pages 163–171, 2013.
 [19] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.
 [20] V. Sydorov, M. Sakurada, and C. H. Lampert. Deep Fisher kernels: End to end learning of the Fisher kernel GMM parameters. In CVPR, pages 1402–1409, 2014.
 [21] A. Vedaldi and B. Fulkerson. VLFeat: An open and portable library of computer vision algorithms. In ACM MM, pages 1469–1472, 2010.
 [22] X. Wang, B. Wang, X. Bai, W. Liu, and Z. Tu. Max-margin multiple-instance dictionary learning. In ICML, pages 846–854, 2013.
 [23] Y. Wei, W. Xia, M. Lin, J. Huang, B. Ni, J. Dong, Y. Zhao, and S. Yan. HCP: A flexible CNN framework for multi-label image classification. TPAMI, 2015.
 [24] J. Yang, K. Yu, Y. Gong, and T. Huang. Linear spatial pyramid matching using sparse coding for image classification. In CVPR, pages 1794–1801, 2009.