The Fisher vector (FV) is an extension of the popular bag-of-words (BOW) model that encodes, for each codeword, the mean and variance of the local descriptors. It consists of two steps: (i) an encoding step, which encodes the descriptors into dense, high-dimensional feature codes; and (ii) a pooling step, which pools the codes into a single vector. With several improvements [2, 3], the Fisher vector has become one of the most effective representations for image categorization.
The success of the FV representation is ascribed to its high dimensionality, but the FV also incurs a high computational cost compared with the BOW model, especially for large-scale image retrieval and object detection. For specific tasks, several simplified [4, 7, 8] and extended [2, 5, 6, 9] versions of Fisher coding have emerged. In [4], Jegou et al. proposed the VLAD representation, in which each local descriptor is assigned to its nearest visual word and the differences between the descriptors and their assigned codewords are accumulated. In [5], Perronnin et al. compressed the high-dimensional Fisher vectors through locality-sensitive hashing. Recently, Oneata et al. [7] presented approximations to the normalizations in the Fisher vector. In [8], the authors realized a fast, local-area-independent representation by representing the picture as sparse integral images. In this paper, we incorporate the locality strategy into the Fisher vector to reduce the time spent in the feature coding step.
The locality strategy has been used in linear embedding and spectral clustering, i.e., Locally Linear Embedding [10] and local spectral clustering [11]. Inspired by this strategy, many localized coding or nearest-neighbor search schemes have emerged for the BOW model, for example Locality-constrained Linear Coding (LLC) [12], Local Soft Coding (LSC) [13], Laplacian Sparse Coding [14], Local Coordinate Coding [15] and local sparse coding [16]. The same locality-preserving idea has also been used in the pooling step [17]. Locality produces an early cut-off effect that removes the unreliable longer distances. Previous work has shown the effectiveness of preserving configuration-space locality during coding, so that similar inputs lead to similar codes [14, 21]. Locality can also be viewed as a practical device where the computational cost of standard coding would be prohibitive. All the coding coefficients can be regarded as probability densities that describe the feature, which can be represented by a histogram or a Fisher vector. In this paper, we introduce LLC, LSC and the Sparse Fisher vector (SFV) from a probabilistic perspective and reveal the relationships among them; this is discussed in Section 4.
For the pooling step in image categorization, Murray and Perronnin [18] proposed generalized max pooling (GMP), which extends max pooling to the Fisher vector by constructing an objective function with a loss term. Based on this structure, we reformulate the Sparse Fisher vector, which is the original Fisher vector combined with the locality strategy.
A notable earlier idea similar to our work is the "posterior thresholding" proposed in . However, it was regarded only as an acceleration trick: no detailed theoretical justification was given and the effectiveness of the method was not explained. Our paper provides a detailed explanation of the scheme and an experimental evaluation on the image categorization task.
2 Generalized max pooling revisited
The Fisher vector is essentially the sum pooling of encoded SIFT features. A sum-pooled representation is dominated by the descriptors that occur frequently in an image. A max-pooled representation considers only the greatest response and is therefore immune to this effect, but max pooling does not apply to aggregation-based encodings such as the FV. To alleviate the problem, [18] proposed the generalized max pooling (GMP) method, which mimics the desirable properties of max pooling. They denote the code vector of each feature, and the GMP vector. GMP demands that , which indicates that is equally similar to frequent and rare features. In the bag-of-features (BOF) case, GMP is strictly equivalent to max pooling . GMP can be formalized in two ways. The first is the primal formulation:
which directly gives the result of pooling , where is the N-dimensional vector of all ones. The second is the dual formulation:
which gives the weight of each feature. is the result of weighted sum pooling.
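Both formulations admit closed-form solutions that are easy to check numerically. The following is a minimal sketch, not the authors' implementation. It assumes a code matrix `Phi` with one code per row (the convention of Section 3), a regularization weight `lam`, the primal objective $\|\Phi\xi - \mathbf{1}_N\|^2 + \lambda\|\xi\|^2$ and the dual objective $\|K\alpha - \mathbf{1}_N\|^2 + \lambda\,\alpha^\top K\alpha$ with $K = \Phi\Phi^\top$, as commonly stated for GMP [18]. All function and variable names are illustrative.

```python
import numpy as np


def gmp_primal(Phi, lam):
    """Closed-form minimizer of ||Phi xi - 1||^2 + lam ||xi||^2 (ridge regression)."""
    n_codes, dim = Phi.shape
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(dim), Phi.T @ np.ones(n_codes))


def gmp_dual(Phi, lam):
    """Per-code weights alpha = (K + lam I)^{-1} 1 and the pooled vector Phi^T alpha."""
    n_codes = Phi.shape[0]
    K = Phi @ Phi.T  # kernel matrix of code-to-code similarities
    alpha = np.linalg.solve(K + lam * np.eye(n_codes), np.ones(n_codes))
    return alpha, Phi.T @ alpha


# Toy data (assumed for illustration): three identical "frequent" codes, one "rare" code.
Phi_toy = np.array([[1.0, 0.0], [1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
xi_primal = gmp_primal(Phi_toy, lam=1e-3)
alpha_toy, xi_dual = gmp_dual(Phi_toy, lam=1e-3)
sum_pooled = Phi_toy.sum(axis=0)  # plain sum pooling, dominated by the frequent code
```

In this toy matrix, sum pooling weights the frequent direction three times as much as the rare one, whereas the GMP vector has a similarity close to one with every code. The dual weights show how this happens: each of the three repeated codes receives roughly a third of the weight given to the rare code.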
3 Sparse Fisher Vector Theory
Let be a set of N local descriptors extracted from an image. We denote by M the number of Gaussian mixture model (GMM) clusters and by D the dimension of the SIFT descriptors after PCA. Clearly. According to Section 2, the Fisher vector representation should be equally similar to each Fisher vector code, which is defined as:
where is the code matrix, of which each row represents a Fisher vector code corresponding to the descriptor .
where is the sub-vector of cluster m of the Fisher vector code corresponding to .
In , the normalization of the Fisher information matrix takes a diagonal form, which assumes the sub-vectors are independent of each other. Therefore it is natural to divide Eq. 3 into M subtasks. We denote by the m-th column of , which is the code matrix in the -th subtask:
And . If each subtask is fulfilled as follows, the whole task of Eq. 3 is fulfilled as well:
The objective function of the -th subtask in the primal formulation is:
Clearly the primal formulation does not have the sparsifying effect, so we turn to the dual formulation. According to Section 2, we denote by the code weight so that , which means that is the pooling result of code matrix with weight . is consistent with the idea of Sparse Fisher vector because it can determine whether a Fisher vector code is valid in the final image representation.
The objective function of the -th task in the dual formulation is:
For convenience, we substitute for . The analytical solution to the dual formulation is:
The analytical solution indicates that we can leverage the individual items of , which are the weights of the Fisher vector codes in the -th subtask. If a weight is zero, the corresponding descriptor makes no contribution to the pooling. In other words, the -th component of the Fisher vector code is sparsified, an idea similar to the FV sparsity encoding in .
In LLC [12], a weighted L2-norm constraint is used to ensure that the local atoms are preserved, which inspires us to use a similar regularizer to control the sparsity of . Let denote the SFV representation,
where d gives different constraints to the individual items of . Specifically,
 denotes the k largest posteriors of the m-th cluster. The analytical solution is:
When approaches infinity, will be comparatively negligible, and the solution can be written as:
Eq. 13 sparsifies the items in that are heavily constrained by d, but the weights of the unsparsified descriptors are determined by , which is time-consuming to compute. Therefore, we make a further simplification. is the kernel matrix of patch-to-patch similarities. Clearly only depends on : when a feature shows little similarity to the other features, its weight is greater. Because the Fisher vector codes are all normalized, the diagonal items of are all ones. If we ignore the off-diagonal items of , which amounts to assuming that the Fisher vector codes are orthogonal, Eq. 13 becomes: .
Because will be eliminated by normalization, the individual item of can be written as:
where is the j-th term of , and is the j-th term of . As , sparse makes be sparsified, i.e., Sparse Fisher vector.
For , we have , which corresponds to the original Fisher vector. Therefore, not only plays a role in regularization but also realizes a smooth transition between the solution for the original Fisher vector () and that for the Sparse Fisher vector ().
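In practice, the sparsification amounts to evaluating the Fisher vector gradients only for the Gaussians with the largest posteriors. The sketch below illustrates this under stated assumptions: a diagonal-covariance GMM, the standard mean and standard-deviation gradient formulas of the Fisher vector [2], selection of the `k` largest posteriors per descriptor, and no re-normalization of the retained posteriors. The function names and the per-descriptor selection rule are assumptions made for illustration, not the authors' code.

```python
import numpy as np


def gmm_posteriors(X, weights, means, sigmas):
    """Posterior gamma[n, m] = p(m | x_n) for a diagonal-covariance GMM."""
    diff = (X[:, None, :] - means[None, :, :]) / sigmas[None, :, :]
    log_gauss = (-0.5 * np.sum(diff ** 2, axis=2)
                 - np.sum(np.log(sigmas), axis=1)
                 - 0.5 * X.shape[1] * np.log(2.0 * np.pi))
    log_joint = np.log(weights)[None, :] + log_gauss
    log_joint -= log_joint.max(axis=1, keepdims=True)  # numerical stability
    gamma = np.exp(log_joint)
    return gamma / gamma.sum(axis=1, keepdims=True)


def sparse_fisher_vector(X, weights, means, sigmas, k):
    """Fisher vector that accumulates gradients only for the k most probable
    Gaussians of each descriptor (assumed selection rule). k = M gives the full FV."""
    n_desc, dim = X.shape
    n_gauss = weights.shape[0]
    gamma = gmm_posteriors(X, weights, means, sigmas)
    # Column indices of the k largest posteriors in every row (order within the k is arbitrary).
    top = np.argpartition(gamma, n_gauss - k, axis=1)[:, n_gauss - k:]
    grad_mu = np.zeros((n_gauss, dim))
    grad_sigma = np.zeros((n_gauss, dim))
    for n in range(n_desc):
        for m in top[n]:  # k iterations instead of M
            u = (X[n] - means[m]) / sigmas[m]
            grad_mu[m] += gamma[n, m] * u
            grad_sigma[m] += gamma[n, m] * (u ** 2 - 1.0)
    grad_mu /= n_desc * np.sqrt(weights)[:, None]
    grad_sigma /= n_desc * np.sqrt(2.0 * weights)[:, None]
    return np.concatenate([grad_mu.ravel(), grad_sigma.ravel()])
```

Setting `k` equal to the number of Gaussians recovers the full Fisher vector. For smaller `k`, the inner loop visits `k` Gaussians per descriptor instead of all of them, which is where the saving in the gradient step comes from; the posterior computation itself is unchanged.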
Table 1: Classification results on Pascal VOC 2007. Columns: coding method, feature dimension, accuracy (%), coding time per image (s).
4 Experimental Evaluation
To verify the effectiveness of the Sparse Fisher vector, we validate the proposed approach on the image categorization task. We first describe the image classification datasets and the experimental setup. We then compare the Sparse Fisher vector experimentally against the canonical Fisher vector on two large datasets: Caltech-101 by Fei-Fei et al. [19] and Pascal VOC 2007 [20].
4.1 Experimental setup
We compute all SIFT descriptors on overlapping pixel patches with a step size of 4 pixels. We reduce their dimensionality to 64 with PCA, so as to better fit the diagonal covariance matrix assumption.
The EM algorithm is employed to learn the parameters of the GMM, and the number of clusters ranges from 64 to 256. By default, for the Fisher vector, we calculate the gradient with respect to the mean and standard deviation. For the Sparse Fisher vector we set the neighborhood as. We streamline the standard experimental setting and employ a linear SVM. It is worth mentioning that the computing platform in our experiments is an Intel Core Duo (4 GB RAM), so the computation times differ slightly from those of the original paper. We use the original Fisher vector [1, 2] as the baseline, and the Sparse Fisher vector is built on top of it.
4.2 Pascal VOC 2007
The Pascal VOC 2007 database contains 9,963 images of 20 classes. We use the standard protocol, which consists of training on the provided trainval set and testing on the test set, and we set the BOW model as the baseline. The classification results are compared in Table 1, where M denotes the number of clusters in the GMM. We compare three quantities across coding methods: feature dimension, accuracy, and coding time per image.
For the same feature dimension, for example 8192, the FV achieves higher accuracy than BOW. This shows that the FV is more discriminative than BOW, at double the time cost. For SFV with 64 GMM clusters, we obtain an accuracy comparable to that of FV with much faster image coding. The same holds for 32, 128 and 256 clusters.
4.3 Caltech 101
The Caltech 101 dataset consists of 9,144 images of 102 classes, such as animals and flowers. Following the standard experimental setting, we use 30 images per class for training and leave the rest for testing. The other experimental settings follow the setup above. Classification results are compared in Table 2.
Table 2: Classification results on Caltech 101. Columns: coding method, accuracy (%), coding time per image (s).
Table 2 shows results similar to those in Table 1. For the same codebook size, SFV runs faster than FV with comparable accuracy, and the difference in coding time between the two methods grows with the codebook size. For example, when the codebook size is 256, the coding time per image is 10.69 s for FV and 1.98 s for SFV, which is roughly 5 times as fast.
4.4 Experiment analysis
4.4.1 Computation cost analysis
To further show the advantage of SFV in computation cost, we report the average coding time per image as a function of the codebook size and analyze the computational complexity.
Fig. 1(a) and Fig. 1(b) show the average coding time per image as a function of the codebook size on the two datasets above. On both datasets, SFV is consistently faster than FV, and the difference in computation time increases with the codebook size.
Considering the D-dimensional features and the M clusters mentioned above, we can estimate the computational complexity. There are two sub-steps in FV encoding: the first calculates the posterior probabilities and the second calculates the derivatives with respect to the GMM parameters. The computational complexity of the first sub-step is , which is the same for FV and SFV. The computational complexity of the second sub-step is and respectively. As , the total time of SFV is much less than that of FV, and the time difference increases with M, which is consistent with the experimental results.
4.4.2 Similarity correspondence between SIFT and Sparse Fisher vector
One implicit contribution of our work is that SFV better preserves similarity. To demonstrate this, 200 SIFT features are randomly selected from PASCAL VOC 2007, and we calculate the pairwise similarities using the cosine measure. The similarity correspondence is shown in Fig. 2. It shows a clear linear trend between the similarity of SFV codes and the similarity of the underlying SIFT features, whereas FV shows no such trend. The comparison confirms the effectiveness of preserving configuration-space locality during coding, which makes similar inputs correspond to similar codes [14, 21].
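A comparison of this kind can be reproduced along the following lines. This is a minimal sketch under assumptions: `inputs` holds one descriptor per row and `codes` the corresponding codes in the same order; the Pearson correlation is used here only as a convenient one-number summary of the linear trend and is not a measure reported in the text. All function names are illustrative.

```python
import numpy as np


def pairwise_cosine(features):
    """Cosine similarity between every pair of rows."""
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    unit = features / np.maximum(norms, 1e-12)  # guard against zero rows
    return unit @ unit.T


def similarity_pairs(inputs, codes):
    """Similarities of all distinct pairs, in input space and in code space."""
    rows, cols = np.triu_indices(inputs.shape[0], k=1)
    return pairwise_cosine(inputs)[rows, cols], pairwise_cosine(codes)[rows, cols]


def linear_trend(inputs, codes):
    """Pearson correlation between input-space and code-space similarities."""
    s_in, s_code = similarity_pairs(inputs, codes)
    return float(np.corrcoef(s_in, s_code)[0, 1])
```

Plotting the two arrays returned by `similarity_pairs` against each other gives a scatter plot of the kind described above; with 200 features there are 19,900 distinct pairs.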
4.5 Discussion about SFV
In the Fisher vector, local features are described by their deviation from a GMM. The probability of a feature under the GMM can be written as:
where denotes the prior of the codeword and reflects the probability that feature x belongs to the m-th cluster. So we can regard the feature coding coefficient as the probability of a feature belonging to the codebook. We notice that in both LLC [12] and LSC [13], the codewords in the codebook are independent and carry no priors, or equivalently the priors can be regarded as equal. For LSC, Eq. 15 can be rewritten as:
where I is a binary vector.
Note also that all dimensions of soft coding [13, 23] are independent of each other. In Fisher coding, the relations among different dimensions are represented by the GMM. The objective function of SFV can be represented as:
where I is a binary vector.
So when we apply the localization operation in Eq. 16, we select the codewords that belong to the k-nearest neighborhood of the feature. This can be regarded as a soft maximum of the conditional likelihood. The same is true for the LLC model. In SFV, by contrast, the prior of the codeword is incorporated when we apply the early cut-off operation, so the k-nearest neighborhood of the feature is determined as in Eq. 15. This can be regarded as a soft maximum of the posterior probability.
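The difference between the two selection rules shows up as soon as the priors are unequal. The toy example below is a sketch under assumptions (one-dimensional features, two isotropic unit-variance Gaussians, hypothetical priors of 0.1 and 0.9, and illustrative function names): ranking codewords by likelihood, as a distance-based neighborhood does, picks the closer codeword, while ranking by posterior picks the codeword with the larger prior.

```python
import numpy as np


def gaussian_likelihoods(x, means, sigma=1.0):
    """p(x | m) for isotropic Gaussians with a shared standard deviation."""
    dim = means.shape[1]
    sq_dist = np.sum((x - means) ** 2, axis=1)
    return np.exp(-0.5 * sq_dist / sigma ** 2) / (2.0 * np.pi * sigma ** 2) ** (dim / 2.0)


def top_k(scores, k):
    """Indices of the k largest scores, best first."""
    return np.argsort(scores)[::-1][:k]


x = np.array([0.4])                      # feature, closer to codeword 0
means = np.array([[0.0], [1.0]])         # two codewords
priors = np.array([0.1, 0.9])            # hypothetical unequal priors

likelihood = gaussian_likelihoods(x, means)
posterior = priors * likelihood / np.sum(priors * likelihood)

by_likelihood = top_k(likelihood, 1)     # distance-based choice (equal priors)
by_posterior = top_k(posterior, 1)       # posterior-based choice (priors included)
```

With equal priors the two rankings coincide, since the posterior is then proportional to the likelihood.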
5 Conclusion
In this paper, we have introduced a "localized" Fisher vector called the Sparse Fisher vector. Based on GMP, we sparsified the Fisher vector code matrix by adding a locality regularization term. This allows efficient image categorization without undermining performance on several public datasets, and the coding outputs preserve the similarity among input features.
The Fisher vector originates from the natural gradient in [24], so the Sparse Fisher vector can be seen as partial gradient descent. From a probabilistic perspective, the Sparse Fisher vector can also be regarded as a soft maximum of the posterior probability. Since GMP considers the uniqueness of features and weights them accordingly, we will combine it with our approach in future work.
This work was supported in part by the National Basic Research Program of China (2012CB719903).
- [1] Florent Perronnin and Christopher R. Dance. Fisher kernels on visual vocabularies for image categorization. In , 2007.
- [2] Florent Perronnin, Jorge Sánchez, and Thomas Mensink. Improving the fisher kernel for large-scale image classification. In Computer Vision - ECCV 2010, 11th European Conference on Computer Vision, Heraklion, Crete, Greece, September 5-11, 2010, Proceedings, Part IV, pages 143–156, 2010.
- [3] Jorge Sánchez, Florent Perronnin, Thomas Mensink, and Jakob J. Verbeek. Image classification with the fisher vector: Theory and practice. International Journal of Computer Vision, 105(3):222–245, 2013.
- [4] Herve Jegou, Matthijs Douze, Cordelia Schmid, and Patrick Pérez. Aggregating local descriptors into a compact image representation. In The Twenty-Third IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2010, San Francisco, CA, USA, 13-18 June 2010, pages 3304–3311, 2010.
- [5] Florent Perronnin, Yan Liu, Jorge Sánchez, and Herve Poirier. Large-scale image retrieval with compressed fisher vectors. In The Twenty-Third IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2010, San Francisco, CA, USA, 13-18 June 2010, pages 3384–3391, 2010.
- [6] Ramazan Gokberk Cinbis, Jakob J. Verbeek, and Cordelia Schmid. Segmentation driven object detection with fisher vectors. In IEEE International Conference on Computer Vision, ICCV 2013, Sydney, Australia, December 1-8, 2013, pages 2968–2975, 2013.
- [7] Dan Oneata, Jakob J. Verbeek, and Cordelia Schmid. Efficient action localization with approximately normalized fisher vectors. In 2014 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2014, Columbus, OH, USA, June 23-28, 2014, pages 2545–2552, 2014.
- [8] Koen E. A. van de Sande, Cees G. M. Snoek, and Arnold W. M. Smeulders. Fisher and VLAD with FLAIR. In 2014 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2014, Columbus, OH, USA, June 23-28, 2014, pages 2377–2384, 2014.
- [9] Jie Lin, Ling-Yu Duan, Tiejun Huang, and Wen Gao. Robust fisher codes for large scale image retrieval. In IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2013, Vancouver, BC, Canada, May 26-31, 2013, pages 1513–1517, 2013.
- [10] Sam T Roweis and Lawrence K Saul. Nonlinear dimensionality reduction by locally linear embedding. Science, 290(5500):2323–2326, 2000.
- [11] Andrew Y. Ng, Michael I. Jordan, and Yair Weiss. On spectral clustering: Analysis and an algorithm. In Advances in Neural Information Processing Systems 14 [Neural Information Processing Systems: Natural and Synthetic, NIPS 2001, December 3-8, 2001, Vancouver, British Columbia, Canada], pages 849–856, 2001.
- [12] Jinjun Wang, Jianchao Yang, Kai Yu, Fengjun Lv, Thomas S. Huang, and Yihong Gong. Locality-constrained linear coding for image classification. In The Twenty-Third IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2010, San Francisco, CA, USA, 13-18 June 2010, pages 3360–3367, 2010.
- [13] Lingqiao Liu, Lei Wang, and Xinwang Liu. In defense of soft-assignment coding. In IEEE International Conference on Computer Vision, ICCV 2011, Barcelona, Spain, November 6-13, 2011, pages 2486–2493, 2011.
- [14] Shenghua Gao, Ivor Wai-Hung Tsang, Liang-Tien Chia, and Peilin Zhao. Local features are not lonely - laplacian sparse coding for image classification. In The Twenty-Third IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2010, San Francisco, CA, USA, 13-18 June 2010, pages 3555–3561, 2010.
- [15] Kai Yu and Tong Zhang. Improved local coordinate coding using local tangents. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), June 21-24, 2010, Haifa, Israel, pages 1215–1222, 2010.
- [16] Jianchao Yang, Kai Yu, and Thomas S. Huang. Efficient highly over-complete sparse coding using a mixture model. In Computer Vision - ECCV 2010 - 11th European Conference on Computer Vision, Heraklion, Crete, Greece, September 5-11, 2010, Proceedings, Part V, pages 113–126, 2010.
- [17] Y-Lan Boureau, Nicolas Le Roux, Francis Bach, Jean Ponce, and Yann LeCun. Ask the locals: Multi-way local pooling for image recognition. In IEEE International Conference on Computer Vision, ICCV 2011, Barcelona, Spain, November 6-13, 2011, pages 2651–2658, 2011.
- [18] Naila Murray and Florent Perronnin. Generalized max pooling. pages 2473–2480, 2014.
- [19] Fei-Fei Li, Robert Fergus, and Pietro Perona. Learning generative visual models from few training examples: An incremental bayesian approach tested on 101 object categories. Computer Vision and Image Understanding, 106(1):59–70, 2007.
- [20] Mark Everingham, Luc J. Van Gool, Christopher K. I. Williams, John M. Winn, and Andrew Zisserman. The pascal visual object classes (VOC) challenge, 2010.
- [21] Shenghua Gao, Ivor Wai-Hung Tsang, and Liang-Tien Chia. Laplacian sparse coding, hypergraph laplacian sparse coding, and applications. IEEE Trans. Pattern Anal. Mach. Intell., 35(1):92–104, 2013.
- [22] Yongzhen Huang, Zifeng Wu, Liang Wang, and Tieniu Tan. Feature coding in image classification: A comprehensive study. IEEE Trans. Pattern Anal. Mach. Intell., 36(3):493–506, 2014.
- [23] Jan van Gemert, Cor J. Veenman, Arnold W. M. Smeulders, and Jan-Mark Geusebroek. Visual word ambiguity. IEEE Trans. Pattern Anal. Mach. Intell., 32(7):1271–1283, 2010.
- [24] Shun-ichi Amari. Natural gradient works efficiently in learning. Neural Computation, 10(2):251–276, 1998.