With the advancement of both stable interest region detectors and robust and distinctive descriptors, local feature based image or object retrieval has attracted a great deal of attention. In local feature based image retrieval or recognition, each image is first represented by a set of local features $X = \{x_1, \ldots, x_t, \ldots, x_T\}$, where $T$ is the number of local features. The set of features $X$ is then encoded into a fixed-length vector so that the (dis)similarity between sets of features can be calculated. The most frequently used method is the bag-of-visual-words (BoVW) representation, where feature vectors are quantized into visual words (VWs) using a visual codebook, resulting in a histogram representation of VWs.
Recently, the Fisher vector representation has attracted much attention because of its effectiveness. The Fisher vector is defined as the gradient of the log-likelihood function normalized with the Fisher information matrix. In earlier work, feature vectors were modeled by the Gaussian mixture model (GMM), and a closed-form approximation was first proposed for the Fisher information matrix of the GMM. The performance of the Fisher vector was then improved by using power and $\ell_2$ normalization. Because the Fisher vector can represent higher-order information than the BoVW representation, it has been shown to outperform the BoVW representation in both image classification and image retrieval tasks [34, 18, 19].
Another trend in the area of image retrieval is the use of binary features such as Oriented FAST and Rotated BRIEF (ORB), Fast Retina Keypoint (FREAK), Binary Robust Invariant Scalable Keypoints (BRISK), KAZE features, Accelerated-KAZE (A-KAZE), Local Difference Binary (LDB), and Learned Arrangements of Three patCH codes (LATCH). Binary features are one or two orders of magnitude faster than the Scale Invariant Feature Transform (SIFT) or Speeded Up Robust Features (SURF) features in detection and description, while providing comparable performance [39, 13]. These binary features are especially suitable for mobile visual search or augmented reality on mobile devices. While the Fisher vector is widely applied to continuous features (e.g., SIFT) that can be modeled by the GMM, to the best of our knowledge, there has been no attempt to apply the Fisher vector to the above-mentioned recent binary features for the purpose of image retrieval. Considering the significant accuracy improvement that the Fisher vector of continuous features brings to both image classification and retrieval, applying the Fisher vector to binary features is expected to bring similar benefits to binary feature based image retrieval and classification.
In this paper, we propose to apply the Fisher vector representation to binary features to improve the accuracy of binary feature based image retrieval. Table 1 shows the position of this paper. Our main contribution is to model binary features using the Bernoulli mixture model (BMM) and derive the closed-form approximation of the Fisher vector of the BMM. Experimental results show that the proposed Fisher vector outperforms the BoVW method on various types of objects. In addition, we propose a fast approximation method that accelerates the computation of the proposed Fisher vectors by one order of magnitude with comparable performance. In the experiments, we evaluate the effectiveness of both the proposed Fisher vector representation of binary features and the associated vector normalization methods. In particular, we demonstrate that a normalization method originally proposed for another vector representation also works well for the proposed Fisher vector. The proposed Fisher vector representation of binary features is general and not restricted to image features; it is also expected to be applicable to other modalities such as audio signals [12, 6].
The rest of this paper is organized as follows. In Section 2, the recent binary features that we are going to model are briefly introduced. In Section 3, we describe the BoVW and Fisher vector image representations, which have been applied to continuous feature vectors (e.g., SIFT). In Section 4, we model binary features with BMM and derive the Fisher vector of BMM, which enables us to apply the Fisher vector representation to binary features. In Section 5, the effectiveness of the Fisher vector of binary features is confirmed. Our conclusions are presented in Section 6.
2 Local binary features
Recently, binary features such as ORB, FREAK, and BRISK have attracted much attention. Binary features are one or two orders of magnitude faster than SIFT or SURF features in extraction, while providing comparable performance to SIFT and SURF. In this section, recent binary features are briefly introduced.
Most of the local binary features employ fast feature detectors. The ORB feature utilizes the Features from Accelerated Segment Test (FAST) detector, which detects pixels that are brighter or darker than neighboring pixels based on the accelerated segment test. The test is optimized to reject candidate pixels very quickly, realizing extremely fast feature detection. In order to ensure approximate scale invariance, feature points are detected from an image pyramid. The FREAK and BRISK features adopt the multi-scale version of the Adaptive and Generic Accelerated Segment Test (AGAST) detector. Although the AGAST detector is based on the same criteria as FAST, the detection is accelerated by using an optimal decision tree to decide whether each pixel satisfies the criteria.
Local binary features extract binary strings from patches of interest regions instead of extracting gradient-based high-dimensional feature vectors like SIFT. Many methods utilize binary tests in extracting binary strings. The BRIEF descriptor, a pioneering work in the area of binary descriptors, is a bit-string description of an image patch constructed from a set of binary intensity tests. Considering the $t$-th smoothed image patch $p_t$, the binary test $\tau$ for the $d$-th bit is defined by:
$$x_{td} = \tau(p_t; a_d, b_d) = \begin{cases} 1 & \text{if } I(p_t, a_d) \ge I(p_t, b_d) \\ 0 & \text{otherwise} \end{cases} \tag{1}$$
where $a_d$ and $b_d$ denote relative positions in the patch $p_t$, and $I(p_t, \cdot)$ denotes the intensity at the point. Using $D$ independent tests, we obtain a $D$-bit binary string $x_t = (x_{t1}, \ldots, x_{tD})$ for the patch $p_t$. The ORB feature employs a learning method for de-correlating BRIEF features under rotational invariance. Although the BRISK and FREAK features use different sampling patterns from BRIEF, they are also based on a set of binary intensity tests. These binary features are designed so that each bit has the same probability of being 1 or 0, and bits are uncorrelated.
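The binary intensity tests described above can be sketched in a few lines of Python. This is only an illustration: the function names, the toy patch, and the bit convention (1 when the first point is at least as bright as the second) are assumptions made here, and real descriptors such as BRIEF or ORB use learned or fixed sampling patterns on smoothed patches.

```python
import random


def binary_descriptor(patch, tests):
    """Return one bit per test pair (a, b) for a single patch.

    patch: 2-D list of intensities, assumed to be smoothed already.
    tests: list of ((row_a, col_a), (row_b, col_b)) sample positions.
    Convention assumed here: bit = 1 if intensity at a >= intensity at b.
    """
    return [1 if patch[ra][ca] >= patch[rb][cb] else 0
            for (ra, ca), (rb, cb) in tests]


def random_tests(patch_size, n_bits, seed=0):
    """Draw n_bits random point pairs inside a patch_size x patch_size patch."""
    rng = random.Random(seed)

    def point():
        return (rng.randrange(patch_size), rng.randrange(patch_size))

    return [(point(), point()) for _ in range(n_bits)]


# Toy 4x4 patch whose intensity grows from top-left to bottom-right.
patch = [[4 * r + c for c in range(4)] for r in range(4)]
tests = [((3, 3), (0, 0)), ((0, 0), (3, 3)), ((1, 2), (1, 2))]
bits = binary_descriptor(patch, tests)
print(bits)  # [1, 0, 1]
```

Each test costs one comparison, which is why such descriptors are far cheaper to compute than gradient-histogram descriptors.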
3 Image representations
In local feature based image retrieval or recognition, each image is first represented by a set of local features $X = \{x_1, \ldots, x_T\}$. The set of features is then encoded into a fixed-length vector in order to calculate the (dis)similarity between sets of features. In this section, two encoding methods are introduced.
3.1 Bag-of-Visual Words
The BoVW framework is the de facto standard for encoding local features into a fixed-length vector. In the BoVW framework, feature vectors are quantized into VWs using a visual codebook, resulting in a histogram representation of VWs. Image (dis)similarity is measured by the $\ell_1$ or $\ell_2$ distance between the normalized histograms. Although it was first proposed for an image retrieval task, it is now widely used for both image retrieval [31, 36, 17, 44] and image classification [22, 20]. The bag-of-visual-words approach has also been applied to binary features.
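As a minimal sketch of this encoding for binary features, the snippet below quantizes each feature to its nearest codeword under the Hamming distance and compares the normalized histograms. The two-word codebook and the toy features are hypothetical; how the codebook is trained is outside the scope of this sketch.

```python
def hamming(x, y):
    """Number of differing bits between two equal-length bit lists."""
    return sum(a != b for a, b in zip(x, y))


def bovw_histogram(features, codebook):
    """Count how many features fall on each visual word (nearest codeword)."""
    hist = [0.0] * len(codebook)
    for x in features:
        nearest = min(range(len(codebook)),
                      key=lambda k: hamming(x, codebook[k]))
        hist[nearest] += 1.0
    return hist


def l2_normalize(v):
    norm = sum(a * a for a in v) ** 0.5
    return [a / norm for a in v] if norm > 0 else list(v)


def l2_distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v)) ** 0.5


codebook = [[0, 0, 0, 0], [1, 1, 1, 1]]            # hypothetical visual words
image_a = [[0, 0, 0, 1], [0, 0, 0, 0], [1, 1, 1, 0]]
image_b = [[1, 1, 1, 1], [1, 0, 1, 1], [0, 0, 0, 0]]
h_a = bovw_histogram(image_a, codebook)            # [2.0, 1.0]
h_b = bovw_histogram(image_b, codebook)            # [1.0, 2.0]
print(l2_distance(l2_normalize(h_a), l2_normalize(h_b)))
```

Note that features assigned to the same word become indistinguishable after quantization, which is the information loss that residual-based and gradient-based encodings try to reduce.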
3.2 Fisher Kernel and Fisher Vector
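The Fisher kernel is a powerful tool for combining the benefits of generative and discriminative approaches. Let $X = \{x_1, \ldots, x_T\}$ denote the set of $T$ local feature vectors extracted from an image. We assume that the generation process of $x_t$ can be modeled by a probability density function $p(x_t \mid \lambda)$ whose parameters are denoted by $\lambda$. It has been proposed to describe $X$ by the gradient $G^X_\lambda$ of the log-likelihood function, which is also referred to as the Fisher score:
$$G^X_\lambda = \nabla_\lambda L(X \mid \lambda) \tag{2}$$
where $L(X \mid \lambda)$ denotes the log-likelihood function:
$$L(X \mid \lambda) = \log p(X \mid \lambda) \tag{3}$$
The gradient vector describes the direction in which parameters should be modified to best fit the data. A natural kernel on these gradients is the Fisher kernel, which is based on the idea of the natural gradient:
$$K(X, Y) = {G^X_\lambda}^{\top} F_\lambda^{-1} G^Y_\lambda \tag{4}$$
$F_\lambda$ is the Fisher information matrix of $p(X \mid \lambda)$, defined as
$$F_\lambda = E_X\left[ G^X_\lambda \, {G^X_\lambda}^{\top} \right] \tag{5}$$
Because $F_\lambda$ is symmetric and positive definite, its inverse has a Cholesky decomposition $F_\lambda^{-1} = L_\lambda^{\top} L_\lambda$. Therefore the Fisher kernel can be rewritten as a dot product between normalized gradient vectors $\mathcal{G}^X_\lambda$ with:
$$\mathcal{G}^X_\lambda = L_\lambda G^X_\lambda \tag{6}$$
The normalized gradient vector $\mathcal{G}^X_\lambda$ is referred to as the Fisher vector of $X$.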
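The step from the kernel to a dot product can be written out as follows. The notation is assumed here: $G^X_\lambda$ and $G^Y_\lambda$ are the Fisher scores of two feature sets, $F_\lambda$ is the Fisher information matrix, and $L_\lambda$ is a factor satisfying $F_\lambda^{-1} = L_\lambda^{\top} L_\lambda$. The second line is an illustrative special case (a diagonal matrix with entries $f_1, \ldots, f_K$), added here to show why a diagonal approximation makes the normalization an element-wise scaling.

```latex
\begin{align*}
K(X, Y) &= {G^X_\lambda}^{\top} F_\lambda^{-1} G^Y_\lambda
         = {G^X_\lambda}^{\top} L_\lambda^{\top} L_\lambda G^Y_\lambda
         = (L_\lambda G^X_\lambda)^{\top} (L_\lambda G^Y_\lambda), \\
F_\lambda &= \mathrm{diag}(f_1, \ldots, f_K)
  \;\Rightarrow\; L_\lambda = \mathrm{diag}(f_1^{-1/2}, \ldots, f_K^{-1/2}),
  \quad (L_\lambda G^X_\lambda)_k = f_k^{-1/2}\, (G^X_\lambda)_k .
\end{align*}
```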
The generation process of feature vectors (SIFT) has been modeled by the GMM, and the diagonal closed-form approximation of the Fisher vector has been derived. The performance of the Fisher vector was then significantly improved by using power normalization and $\ell_2$ normalization. The Fisher vector framework has achieved promising results and is becoming the new standard in both image classification and image retrieval tasks [34, 18, 19]. There are several extensions to this framework, such as the multiple-layered Fisher vector and a combination with Convolutional Neural Networks (CNN).
While the Fisher vector is widely applied to continuous features (e.g., SIFT) that can be modeled by the GMM, to the best of our knowledge, there has been no attempt to apply the Fisher vector to recent binary features such as ORB for the purpose of image retrieval. In this paper, we derive the closed-form approximation of the Fisher vector of binary features that are modeled by the Bernoulli mixture model, and evaluate the effectiveness of both the Fisher vector of binary features and the associated normalization approaches.
3.3 Vector of Locally Aggregated Descriptors
Jégou et al. proposed an efficient way of aggregating local features into a vector of fixed dimension, namely the Vector of Locally Aggregated Descriptors (VLAD). In the construction of VLAD, each descriptor is first assigned to the closest visual word in a visual codebook, in the same way as in the construction of the BoVW vector. For each of the visual words, the quantization residuals are accumulated, and the sums of residuals are concatenated into a single vector, the VLAD. VLAD can be considered a simplified non-probabilistic version of the Fisher vector. VLAD has been further improved by modifying its vector normalization or aggregation step [7, 42]. As the performance of VLAD is about the same as or a little worse than that of the Fisher vector, we focus on the Fisher vector in this paper.
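The residual accumulation can be illustrated with a short sketch. The two-word codebook and the two-dimensional descriptors below are made up for the example, and normalization steps are omitted.

```python
def vlad(descriptors, codebook):
    """Concatenate, per visual word, the summed residuals x - codeword."""
    dim = len(codebook[0])
    sums = [[0.0] * dim for _ in codebook]
    for x in descriptors:
        # Assign the descriptor to the closest codeword (squared Euclidean).
        k = min(range(len(codebook)),
                key=lambda j: sum((a - c) ** 2
                                  for a, c in zip(x, codebook[j])))
        for d in range(dim):
            sums[k][d] += x[d] - codebook[k][d]
    return [value for block in sums for value in block]


codebook = [[0.0, 0.0], [10.0, 10.0]]                 # hypothetical codebook
descriptors = [[1.0, 0.0], [0.0, 2.0], [9.0, 11.0]]
print(vlad(descriptors, codebook))  # [1.0, 2.0, -1.0, 1.0]
```

The output has one block of residual sums per visual word, so its dimension is the codebook size times the descriptor dimension.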
4 Fisher Vector for Binary Features
In this section, we model binary features with the Bernoulli distribution, and derive the Fisher vector representation of binary features.
4.1 Bernoulli Mixture Model
Let $x_t \in \{0, 1\}^D$ denote a $D$-dimensional binary feature out of the $T$ binary features extracted from an image. In modeling binary features, it is straightforward to adopt a single multivariate Bernoulli distribution. However, although many binary descriptors are designed so that the bits of the resulting binary features are uncorrelated, there are still strong dependencies among the bits. Therefore, a single multivariate Bernoulli component is inadequate to cope with the kind of complex bit dependencies that often underlie binary features. This drawback is overcome when several Bernoulli components are adequately mixed. In this paper, we model binary features with the Bernoulli mixture model (BMM). The use of the BMM instead of a single multivariate Bernoulli distribution will be justified in the experimental section.
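Let $\lambda = \{w_i, \mu_{id} \mid i = 1, \ldots, N, \; d = 1, \ldots, D\}$ denote the set of parameters for a multivariate Bernoulli mixture model with $N$ components, and let $x_{td}$ represent the $d$-th bit of $x_t$. Given the parameter set $\lambda$, the probability density function of binary features is described as:
$$p(x_t \mid \lambda) = \sum_{i=1}^{N} w_i \, p_i(x_t \mid \lambda), \qquad p_i(x_t \mid \lambda) = \prod_{d=1}^{D} \mu_{id}^{x_{td}} (1 - \mu_{id})^{1 - x_{td}} \tag{7}$$
In order to estimate the values of the parameter set $\lambda$, given a set of training binary features $x_1, \ldots, x_T$, the expectation-maximization (EM) algorithm is applied. In the expectation step, the occupancy probability $\gamma_t(i)$ (or posterior probability $p(i \mid x_t, \lambda)$) of $x_t$ being generated by the $i$-th component of the BMM is calculated as
$$\gamma_t(i) = p(i \mid x_t, \lambda) = \frac{w_i \, p_i(x_t \mid \lambda)}{\sum_{j=1}^{N} w_j \, p_j(x_t \mid \lambda)} \tag{8}$$
In the maximization step, the parameters are updated as
$$S_i = \sum_{t=1}^{T} \gamma_t(i) \tag{9}$$
$$w_i = \frac{S_i}{T} \tag{10}$$
$$\mu_{id} = \frac{1}{S_i} \sum_{t=1}^{T} \gamma_t(i) \, x_{td} \tag{11}$$
In our implementation, parameter $w_i$ is initialized with $1/N$, and $\mu_{id}$ is initialized from a uniform distribution. In our experience, these initial parameters do not have a large impact on the final result.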
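The EM procedure for a Bernoulli mixture can be sketched as below. This is a toy illustration under assumptions made here, not the implementation used in the experiments: the initial bit probabilities are drawn uniformly from [0.25, 0.75], the estimates are clipped away from 0 and 1 so that logarithms stay finite, and the E-step is computed in the log domain for numerical stability.

```python
import math
import random


def log_component(x, mu):
    """log p_i(x): sum of per-bit Bernoulli log-probabilities."""
    return sum(math.log(m if bit else 1.0 - m) for bit, m in zip(x, mu))


def occupancy(x, weights, mus):
    """E-step for one feature: posterior over components (log-sum-exp)."""
    logs = [math.log(w) + log_component(x, mu) for w, mu in zip(weights, mus)]
    top = max(logs)
    exps = [math.exp(v - top) for v in logs]
    total = sum(exps)
    return [e / total for e in exps]


def fit_bmm(data, n_components, n_iter=30, seed=0, eps=1e-3):
    """Fit mixture weights and bit probabilities with EM."""
    rng = random.Random(seed)
    dim = len(data[0])
    weights = [1.0 / n_components] * n_components
    mus = [[rng.uniform(0.25, 0.75) for _ in range(dim)]
           for _ in range(n_components)]
    for _ in range(n_iter):
        gammas = [occupancy(x, weights, mus) for x in data]      # E-step
        for i in range(n_components):                            # M-step
            s_i = max(sum(g[i] for g in gammas), 1e-12)          # avoid log(0)
            weights[i] = s_i / len(data)
            for d in range(dim):
                mu = sum(g[i] * x[d] for g, x in zip(gammas, data)) / s_i
                mus[i][d] = min(max(mu, eps), 1.0 - eps)         # clip
    return weights, mus


# Two obvious clusters of 6-bit features.
data = [[1, 1, 1, 0, 0, 0]] * 20 + [[0, 0, 0, 1, 1, 1]] * 20
weights, mus = fit_bmm(data, n_components=2)
print([round(w, 3) for w in weights])
print([[round(m, 3) for m in mu] for mu in mus])
```

On this toy data each component should settle on one of the two bit patterns, which is the behavior a single multivariate Bernoulli distribution cannot reproduce.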
4.2 Deriving the Fisher Vector of BMM
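Letting $G^X_{\mu_{id}}$ denote the Fisher score w.r.t. the parameter $\mu_{id}$, $G^X_{\mu_{id}}$ is calculated as:
$$G^X_{\mu_{id}} = \frac{\partial L(X \mid \lambda)}{\partial \mu_{id}} = \sum_{t=1}^{T} \frac{\partial}{\partial \mu_{id}} \log \sum_{j=1}^{N} w_j \, p_j(x_t \mid \lambda) = \sum_{t=1}^{T} \frac{w_i}{p(x_t \mid \lambda)} \frac{\partial p_i(x_t \mid \lambda)}{\partial \mu_{id}}, \qquad \frac{\partial p_i(x_t \mid \lambda)}{\partial \mu_{id}} = p_i(x_t \mid \lambda) \, \frac{(-1)^{1 - x_{td}}}{\mu_{id}^{x_{td}} (1 - \mu_{id})^{1 - x_{td}}}$$
Finally we obtain:
$$G^X_{\mu_{id}} = \sum_{t=1}^{T} \gamma_t(i) \, \frac{(-1)^{1 - x_{td}}}{\mu_{id}^{x_{td}} (1 - \mu_{id})^{1 - x_{td}}} \tag{12}$$
where $\gamma_t(i)$ is the occupancy probability defined in Eq. (8).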
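A small sketch can make the Fisher score concrete and check it against a finite-difference gradient of the log-likelihood. The model parameters and features below are arbitrary toy values, and the function names are chosen here for illustration. Each bit equal to 1 contributes the occupancy probability divided by the bit probability, and each bit equal to 0 contributes minus the occupancy probability divided by one minus the bit probability.

```python
import math


def component_prob(x, mu):
    """p_i(x): product of per-bit Bernoulli probabilities."""
    p = 1.0
    for bit, m in zip(x, mu):
        p *= m if bit else 1.0 - m
    return p


def log_likelihood(features, weights, mus):
    return sum(math.log(sum(w * component_prob(x, mu)
                            for w, mu in zip(weights, mus)))
               for x in features)


def fisher_score(features, weights, mus):
    """Gradient of the log-likelihood w.r.t. every bit probability mu[i][d]."""
    n, dim = len(weights), len(mus[0])
    score = [[0.0] * dim for _ in range(n)]
    for x in features:
        joint = [w * component_prob(x, mu) for w, mu in zip(weights, mus)]
        total = sum(joint)
        for i in range(n):
            gamma = joint[i] / total                 # occupancy probability
            for d in range(dim):
                if x[d]:
                    score[i][d] += gamma / mus[i][d]
                else:
                    score[i][d] -= gamma / (1.0 - mus[i][d])
    return score


weights = [0.4, 0.6]
mus = [[0.2, 0.7, 0.5], [0.8, 0.3, 0.6]]
features = [[1, 0, 1], [0, 0, 1], [1, 1, 0]]
score = fisher_score(features, weights, mus)

# Central finite difference w.r.t. mus[0][1] as a sanity check.
h = 1e-6
plus = [row[:] for row in mus]
minus = [row[:] for row in mus]
plus[0][1] += h
minus[0][1] -= h
numeric = (log_likelihood(features, weights, plus)
           - log_likelihood(features, weights, minus)) / (2 * h)
print(abs(numeric - score[0][1]) < 1e-5)
```

The agreement with the numerical gradient only validates the unnormalized score; the Fisher vector additionally scales each entry by the inverse square root of the corresponding Fisher information.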
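Then, we derive the approximate Fisher information matrix of the BMM under the following three assumptions: (1) the Fisher information matrix $F_\lambda$ is diagonal, (2) the number of binary features extracted from an image is constant and equal to $T$, and (3) the occupancy probability $\gamma_t(i)$ is peaky; there is one index $i$ such that $\gamma_t(i) \approx 1$ and $\gamma_t(j) \approx 0$ for all $j \neq i$.
As we assume the Fisher information matrix is diagonal, Eq. (5) is approximated as $F_\lambda \approx \mathrm{diag}(F_{\mu_{11}}, \ldots, F_{\mu_{ND}})$, where $F_{\mu_{id}}$ denotes the Fisher information w.r.t. $\mu_{id}$:
$$F_{\mu_{id}} = E_X\left[ \left( \frac{\partial L(X \mid \lambda)}{\partial \mu_{id}} \right)^{2} \right] \tag{13}$$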
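Then, under assumptions (2) and (3), we approximately obtain:
$$F_{\mu_{id}} = T w_i \left( \frac{\sum_{j=1}^{N} w_j \mu_{jd}}{\mu_{id}^{2}} + \frac{\sum_{j=1}^{N} w_j (1 - \mu_{jd})}{(1 - \mu_{id})^{2}} \right) \tag{14}$$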
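Please refer to the Appendix for the derivation. Finally, the Fisher vector $\mathcal{G}^X_\lambda$ is obtained by concatenating the normalized Fisher scores $F_{\mu_{id}}^{-1/2} G^X_{\mu_{id}}$ ($i = 1, \ldots, N$, $d = 1, \ldots, D$).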
4.3 Vector Normalization
The Fisher vector is further normalized with power normalization and $\ell_2$ normalization. Given a Fisher vector $\mathcal{G}^X_\lambda$, the power-normalized vector $f(\mathcal{G}^X_\lambda)$ is calculated by applying the following function to each element $z$:
$$f(z) = \mathrm{sign}(z) \, |z|^{\alpha} \tag{15}$$
In experiments, we set $\alpha = 0.5$ as recommended in prior work. After the power normalization, $\ell_2$ normalization is applied to $f(\mathcal{G}^X_\lambda)$, resulting in the final Fisher vector representation of the set of binary features. In addition, we propose to use intra normalization for this Fisher vector instead of the power and $\ell_2$ normalization. The intra normalization method was originally proposed for the VLAD representation described in Section 3.3, not for the Fisher vector. However, the purpose of intra normalization is to alleviate the problem of burstiness in visual words [16, 7], which is the same as that of power normalization. Therefore, it is also expected to work well for the Fisher vector. In the case of the Fisher vector, intra normalization is done by performing $\ell_2$ normalization within each BMM component.
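The normalization variants can be sketched as follows. The function names are chosen here for illustration, and the sketch assumes the Fisher vector is stored as a flat list in which the entries of each mixture component are contiguous.

```python
import math


def power_normalize(v, alpha=0.5):
    """Signed power: sign(z) * |z| ** alpha, applied element-wise."""
    return [math.copysign(abs(z) ** alpha, z) for z in v]


def l2_normalize(v):
    norm = math.sqrt(sum(z * z for z in v))
    return [z / norm for z in v] if norm > 0 else list(v)


def intra_normalize(v, n_components):
    """L2-normalize each component's block of the vector independently."""
    block = len(v) // n_components
    out = []
    for i in range(n_components):
        out.extend(l2_normalize(v[i * block:(i + 1) * block]))
    return out


fv = [3.0, 4.0, 0.0, 5.0]                      # toy vector: 2 components x 2 bits
print(power_normalize([4.0, -9.0, 0.0]))        # [2.0, -3.0, 0.0]
print(l2_normalize(power_normalize(fv)))        # power + L2 normalization
print(intra_normalize(fv, n_components=2))      # per-component L2 normalization
```

Both schemes damp the influence of a few dominant entries: power normalization compresses large magnitudes element-wise, while intra normalization gives every component the same total energy.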
4.4 Fast Approximated Fisher Vector
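The most computationally expensive part of the proposed Fisher vector is the calculation of the occupancy probability $\gamma_t(i)$ in Eq. (12), because $F_{\mu_{id}}$ in Eq. (14) does not depend on the input vector and can be precomputed. In this paper, we also propose to accelerate the proposed Fisher vector by using an approximate value of $\gamma_t(i)$.
First, the $i$-th component of the BMM is converted into a representative binary vector $b_i = (b_{i1}, \ldots, b_{iD})$ as
$$b_{id} = \begin{cases} 1 & \text{if } \mu_{id} \ge 0.5 \\ 0 & \text{otherwise} \end{cases} \tag{16}$$
Then, for each $x_t$, the most similar representative binary vector is found by $\hat{i} = \arg\min_i H(x_t, b_i)$, where $H(\cdot, \cdot)$ denotes the Hamming distance. This involves only the calculation of Hamming distances and can be done very fast. Finally, we obtain the approximated $\gamma_t(i)$ as
$$\gamma_t(i) \approx \begin{cases} 1 & \text{if } i = \hat{i} \\ 0 & \text{otherwise} \end{cases} \tag{17}$$
This approximation is also based on the assumption that the occupancy probability is peaky.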
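The fast approximation can be sketched as below. The 0.5 threshold for the representative vectors, the tie-breaking rule (lowest index wins), and the toy parameters are assumptions of this sketch; only the unnormalized score is computed, since the Fisher information factor can be precomputed separately.

```python
def representative_vectors(mus):
    """Binarize each component's bit probabilities (threshold assumed 0.5)."""
    return [[1 if m >= 0.5 else 0 for m in mu] for mu in mus]


def hamming(x, y):
    return sum(a != b for a, b in zip(x, y))


def hard_assign(x, reps):
    """Index of the representative vector nearest to x in Hamming distance."""
    return min(range(len(reps)), key=lambda i: hamming(x, reps[i]))


def approx_fisher_score(features, mus):
    """Fisher score with the occupancy probability replaced by a one-hot."""
    reps = representative_vectors(mus)
    score = [[0.0] * len(mus[0]) for _ in mus]
    for x in features:
        i = hard_assign(x, reps)         # gamma is 1 for i and 0 elsewhere
        for d, bit in enumerate(x):
            if bit:
                score[i][d] += 1.0 / mus[i][d]
            else:
                score[i][d] -= 1.0 / (1.0 - mus[i][d])
    return score


mus = [[0.9, 0.8, 0.1], [0.2, 0.1, 0.7]]
reps = representative_vectors(mus)       # [[1, 1, 0], [0, 0, 1]]
features = [[1, 1, 1], [0, 0, 1]]
score = approx_fisher_score(features, mus)
print(reps)
print(score)
```

In a real implementation the bit strings would be packed into machine words so that each Hamming distance reduces to XOR and population-count operations.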
In the experiments, the Stanford mobile visual search dataset (http://www.stanford.edu/~dmchen/mvs.html) is used to evaluate the effectiveness of the proposed Fisher vector in image retrieval. The dataset contains camera-phone images of CD covers, books, business cards, DVD covers, outdoor landmarks, museum paintings, print documents, and video clips. These images consist of 100 reference images and 400 query images. Because some query images are too large (10M pixels), all images are resized so that the longest sides of the images are less than 640 pixels, while keeping the original aspect ratio. Figure 1 shows example images from the dataset.
The dissimilarity between two images is defined by the Euclidean distance between either the BoVW or the Fisher vector representations of the images. As an indicator of the retrieval performance, mean average precision (MAP) [31, 17] is used. For each query, a precision-recall curve is obtained based on the retrieval results. Average precision is calculated as the area under the precision-recall curve. Finally, the MAP score is calculated as the mean of the average precisions over all queries.
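The evaluation protocol can be sketched as follows. Average precision is computed here with the common rank-based formula (the mean of the precision values at the ranks of the relevant items), which is one standard way to approximate the area under the precision-recall curve; the identifiers and toy vectors are hypothetical.

```python
import math


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def rank_database(query_vec, database):
    """Return database ids sorted by increasing distance to the query."""
    return sorted(database, key=lambda k: euclidean(query_vec, database[k]))


def average_precision(ranked_ids, relevant_ids):
    relevant = set(relevant_ids)
    hits, precision_sum = 0, 0.0
    for rank, item in enumerate(ranked_ids, start=1):
        if item in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant) if relevant else 0.0


def mean_average_precision(queries, database):
    """queries: list of (query_vector, relevant_ids) pairs."""
    scores = [average_precision(rank_database(q, database), rel)
              for q, rel in queries]
    return sum(scores) / len(scores)


database = {"ref_a": [1.0, 0.0], "ref_b": [0.0, 1.0], "ref_c": [0.7, 0.7]}
queries = [([0.9, 0.1], {"ref_c"}),    # ref_c is ranked second: AP = 0.5
           ([0.1, 0.9], {"ref_b"})]    # ref_b is ranked first:  AP = 1.0
print(mean_average_precision(queries, database))  # 0.75
```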
As a binary feature, we adopt the ORB descriptor, which is one of the most frequently used binary features. An implementation of the ORB descriptor is available in an open source library (http://opencv.org/). On average, 900 features are extracted from four scales. The parameter set $\lambda$ is estimated with the EM algorithm using one million ORB binary features extracted from the MIR Flickr collection (http://press.liacs.nl/mirflickr/). The following experiments were performed on a standard desktop PC with a Core i7 970 CPU.
5.1 Clustering Effect
First, we investigate the clustering results generated from the estimation of the parameter set $\lambda$ of the BMM with $N$ components. Figure 2 (a) represents all point pairs of the 256 binary tests used in the ORB descriptor explained in Section 2. Figures 2 (b)-(e) visualize a part of the parameter sets of four randomly selected components out of the $N$ components. In each figure, red (blue) arrows represent the five tests corresponding to the five largest (smallest) $\mu_{id}$. Each arrow connects the two sample points of a binary test, and $\mu_{id}$ represents the probability that the point at the head of the arrow is brighter than the point at its tail. Therefore, the pixel at the head of a red arrow tends to be brighter than the tail of the red arrow, while the pixel at the head of a blue arrow tends to be darker than the tail of the blue arrow. We can see that the binary tests with the largest and the smallest $\mu_{id}$ concentrate on small areas (e.g., between the areas A and B in Figure 2 (b)). Thus, Figure 2 implies that some bits of the ORB descriptor are highly correlated and that the BMM successfully captures this correlation. The result justifies the use of the BMM instead of a single multivariate Bernoulli distribution to model binary features.
5.2 Impact of Normalization
The performance of the Fisher vector of binary features is evaluated in terms of image retrieval accuracy. In particular, the effect of the normalization methods described in Section 4.3 is investigated. The following six methods are compared: (1) the bag of binary words approach with 1024 centroids (BoBW), (2) the Fisher vector without normalization (FV), (3) the Fisher vector with $\ell_2$ normalization (L2 Norm), (4) the Fisher vector with power normalization (P Norm), (5) the Fisher vector with both power and $\ell_2$ normalization (P+L2 Norm), and (6) the Fisher vector with intra normalization (In Norm).
Figure 3 shows a comparison of the Fisher vector and BoBW representations applied to binary features on eight classes. The accuracy of the Fisher vector without any normalization (FV) is disappointing compared to the BoBW framework. If $\ell_2$ or power normalization is adopted, the accuracy of the Fisher vector is significantly improved. The combination of the two normalizations further improves the performance, which is consistent with the case of SIFT+GMM. Somewhat surprisingly, in many cases, the intra normalization method outperforms the other normalization methods. With appropriate normalization methods, the accuracy improves as the number of components $N$ increases, which is also consistent with the case of SIFT+GMM. Table 2 shows the accuracy of the proposed method (In Norm, ) and the BoVW method on eight classes. We can see that the proposed Fisher vector consistently outperforms the BoVW method on different datasets. In particular, the difference in accuracy between the proposed method and the BoVW method is relatively large for the book, card, dvd, document, and video classes. These classes include many simple edges and corners (e.g., logos or text), and binary features extracted from these edges and corners are similar to each other. In the case of the BoVW method, these binary features tend to be quantized into the same VW and are therefore less discriminative. On the other hand, the proposed Fisher vector can capture the "difference" from the components of the BMM; therefore it is more discriminative, resulting in better results.
5.3 Performance for Various String Lengths
Next, we investigate the impact of the length of the binary strings on accuracy. In this experiment, the first $D$ bits of the full 256-bit string are used. This is reasonable because, in the ORB algorithm, binary tests are sorted according to their entropies; the leading bits are more important.
Figure 4 shows the accuracy of the Fisher vector with intra normalization as a function of the number of components $N$, where the length $D$ of the binary strings varies from 16 to 256. We can see that the Fisher vector of longer binary strings achieves better accuracy. However, the gain becomes smaller as the binary string becomes longer. This is because the trailing bits tend to be correlated with other bits and thus are less informative. Therefore, we can use shorter strings for efficiency at the cost of accuracy. Because the computational cost of the Fisher vector is proportional to $ND$ (this is the cost of calculating the occupancy probability $\gamma_t(i)$, which is the most computationally expensive part of the proposed Fisher vector, as described in Section 4.4), the other way to reduce the computational cost is to use a smaller $N$. Figure 4 indicates that it is better to use shorter strings, down to 64-bit strings, instead of using a smaller $N$. For instance, the MAP score of 0.733 at and is better than that of 0.714 at and . Otherwise, it is better to use a smaller $N$ instead of a smaller $D$, e.g. 0.639 at and vs. 0.663 at and . However, in this paper, we use full 256-bit strings in the other experiments to make the most of the ORB descriptor.
5.4 Increasing Database Size
We investigate the performance of the Fisher vector when the size of the database becomes large. In order to increase the size of the database, we use images in the MIR Flickr collection as distractors. Figure 5 compares the Fisher vector with intra normalization (, ) and the BoBW representation for different numbers of distractors. In Figure 5, 0, 100, 1,000, and 10,000 distractor images are added to the 100 reference images, so that the size of the database becomes 100, 200, 1,100, and 10,100, respectively. We can see that the Fisher vector achieves better performance for all database sizes. Although the accuracy of both the Fisher vector and the BoBW representations drops as the size of the database increases, the degradation of the Fisher vector is relatively small. This is because the Fisher vector can represent higher-order information than the BoBW representation and is more discriminative even with larger database sizes. In other words, the effectiveness of the proposed Fisher vector representation becomes more significant as the database size increases.
5.5 Evaluation of Fast Approximated Fisher Vector
Finally, we evaluate the performance of the approximated Fisher vector described in Section 4.4. Figure 6 compares the approximated and exact Fisher vector with intra normalization (). The distractor images in Section 5.4 are not used in this experiment. It can be seen that the approximated Fisher vector is one order of magnitude faster than the exact Fisher vector while the degradation of accuracy is only 6.4% on average and 1.6% for . Table 3 shows the average degradation of the MAP score on eight classes when the approximated Fisher vector with intra normalization () is adopted. We can see that there is not much difference in the degradation of accuracy among different classes.
This approximated Fisher vector uses two approximations. The first one is the approximation of the occupancy probability $\gamma_t(i)$; we approximate $\gamma_t(i)$ by setting the largest value to 1 and the others to 0, as in Eq. (17). This is based on the assumption that $\gamma_t(i)$ is peaky. Figure 7 (a) shows the distribution of the maximum occupancy probability $\max_i \gamma_t(i)$ for . We can confirm that $\max_i \gamma_t(i)$ is close to 1 in most cases, as expected. The other approximation is that the component with the maximum occupancy probability is approximately obtained as the component whose representative binary vector, defined in Eq. (16), is nearest in Hamming distance. Figure 7 (b) shows the accuracy of this approximation, i.e., the probability that the two selected components coincide. We can see that although the accuracy declines as $N$ increases, the accuracy of this approximation is still 57% even for . As the final approximated Fisher vector is created using a number of feature vectors, this approximation works well as shown in Figure 6.
In this paper, we proposed the application of the Fisher vector representation to binary features to improve the accuracy of binary feature based image retrieval. We derived the closed-form approximation of the Fisher vectors of binary features that are modeled by the Bernoulli mixture model. In addition, we proposed a fast approximation method that accelerates the computation of the proposed Fisher vectors by one order of magnitude with comparable performance. The effectiveness of the Fisher vectors of binary features was confirmed. There were some interesting observations; for example, the performance of the Fisher vector without power and $\ell_2$ normalization was very poor, while the Fisher vector with power and $\ell_2$ normalization outperformed the BoBW framework. The effectiveness of the proposed Fisher vector representation became more significant when the database size increased. Furthermore, we demonstrated that the intra normalization method originally proposed for VLAD also worked well for the proposed Fisher vector and outperformed the conventional normalization methods. This result encourages us to apply the intra normalization method to the Fisher vector of the GMM. In the future, we will apply the Fisher vector of binary features to image classification problems. We also expect that the proposed Fisher vector representation can be successfully applied to other modalities such as audio signals.
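We derive the Fisher information matrix under the following three assumptions: (1) the Fisher information matrix is diagonal, (2) the number of binary features extracted from an image is constant and equal to $T$, and (3) the occupancy probability $\gamma_t(i)$ is peaky. From Eq. (13), we get:
$$F_{\mu_{id}} = E\left[ \left( \sum_{t=1}^{T} \frac{\partial \log p(x_t \mid \lambda)}{\partial \mu_{id}} \right)^{2} \right] = \sum_{t=1}^{T} E\left[ \left( \frac{\partial \log p(x_t \mid \lambda)}{\partial \mu_{id}} \right)^{2} \right] + \sum_{t \neq s} E\left[ \frac{\partial \log p(x_t \mid \lambda)}{\partial \mu_{id}} \right] E\left[ \frac{\partial \log p(x_s \mid \lambda)}{\partial \mu_{id}} \right]$$
If the parameter set $\lambda$ is estimated with maximum-likelihood estimation, we have:
$$E\left[ \frac{\partial \log p(x_t \mid \lambda)}{\partial \mu_{id}} \right] = 0, \qquad F_{\mu_{id}} = T \, E\left[ \left( \frac{\partial \log p(x_t \mid \lambda)}{\partial \mu_{id}} \right)^{2} \right]$$
Using the value of the Fisher score in Eq. (12), we get:
$$F_{\mu_{id}} = T \, E\left[ \frac{\gamma_t(i)^{2}}{\mu_{id}^{2 x_{td}} (1 - \mu_{id})^{2 (1 - x_{td})}} \right]$$
Using the assumption that the occupancy probability is peaky, we approximate $\gamma_t(i)^{2}$ as $\gamma_t(i)$. Finally, using the following equations,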
[1] A. Alahi, R. Ortiz, and P. Vandergheynst. Freak: Fast retina keypoint. In Proc. of CVPR, pages 510–517, 2012.
[2] P. Alcantarilla, A. Bartoli, and A. Davison. Kaze features. In Proc. of ECCV, 2012.
[3] P. Alcantarilla, J. Nuevo, and A. Bartoli. Fast explicit diffusion for accelerated features in nonlinear scale spaces. In Proc. of BMVC, 2013.
[4] S. Amari. Natural gradient works efficiently in learning. Neural Computation, 10(2):251–276, 1998.
[5] M. Ambai and Y. Yoshida. Card: Compact and real-time descriptors. In Proc. of ICCV, 2011.
[6] X. Anguera, A. Garzon, and T. Adamek. Mask: Robust local features for audio fingerprinting. In Proc. of ICME, 2012.
[7] R. Arandjelović and A. Zisserman. All about VLAD. In Proc. of CVPR, 2013.
[8] H. Bay, A. Ess, T. Tuytelaars, and L. V. Gool. Surf: Speeded up robust features. CVIU, 110(3):346–359, 2008.
[9] M. Calonder, V. Lepetit, C. Strecha, and P. Fua. Brief: Binary robust independent elementary features. In Proc. of ECCV, pages 778–792, 2010.
[10] D. Gálvez-López and J. D. Tardós. Real-time loop detection with bags of binary words. In Proc. of IROS, pages 51–58, 2011.
[11] Y. Gong and S. Lazebnik. Iterative quantization: A procrustean approach to learning binary codes. In Proc. of CVPR, pages 817–824, 2011.
[12] J. Haitsma and T. Kalker. A highly robust audio fingerprinting system. In Proc. of ISMIR, pages 107–115, 2002.
[13] J. Heinly, E. Dunn, and J.-M. Frahm. Comparative evaluation of binary features. In Proc. of ECCV, pages 759–773, 2012.
[14] G. Irie, Z. Li, X. Wu, and S. Chang. Locally linear hashing for extracting non-linear manifolds. In Proc. of CVPR, 2014.
[15] T. Jaakkola and D. Haussler. Exploiting generative models in discriminative classifiers. In Proc. of NIPS, pages 487–493, 1998.
[16] H. Jégou, M. Douze, and C. Schmid. On the burstiness of visual elements. In Proc. of CVPR, pages 1169–1176, 2009.
[17] H. Jégou, M. Douze, and C. Schmid. Improving bag-of-features for large scale image search. IJCV, 87(3):316–336, 2010.
[18] H. Jégou, M. Douze, C. Schmid, and P. Pérez. Aggregating local descriptors into a compact image representation. In Proc. of CVPR, pages 3304–3311, 2010.
[19] H. Jégou, F. Perronnin, M. Douze, J. Sánchez, P. Pérez, and C. Schmid. Aggregating local image descriptors into compact codes. TPAMI, 34(9):1704–1716, 2012.
[20] Y. Jiang, C. Ngo, and J. Yang. Towards optimal bag-of-features for object categorization and semantic video retrieval. In Proc. of CIVR, pages 494–501, 2007.
[21] A. Juan and E. Vidal. Bernoulli mixture models for binary images. In Proc. of ICPR, pages 367–370, 2004.
[22] S. Lazebnik, C. Schmid, and J. Ponce. Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In Proc. of CVPR, pages 2169–2178, 2006.
[23] Y. Lee, J. Heo, and S. Yoon. Quadra-embedding: Binary code embedding with low quantization error. In Proc. of ACCV, 2012.
[24] S. Leutenegger, M. Chli, and R. Siegwart. Brisk: Binary robust invariant scalable keypoints. In Proc. of ICCV, pages 2548–2555, 2011.
[25] G. Levi and T. Hassner. Latch: Learned arrangements of three patch codes. In Proc. of WACV, 2016.
[26] V. Liong, J. Lu, G. Wang, P. Moulin, and J. Zhou. Deep hashing for compact binary codes learning. In Proc. of CVPR, 2015.
[27] D. G. Lowe. Distinctive image features from scale-invariant keypoints. IJCV, 60(2):91–110, 2004.
[28] E. Mair, G. D. Hager, D. Burschka, M. Suppa, and G. Hirzinger. Adaptive and generic corner detection based on the accelerated segment test. In Proc. of ECCV, 2010.
[29] K. Mikolajczyk and C. Schmid. A performance evaluation of local descriptors. TPAMI, 27(10):1615–1630, Oct. 2005.
[30] K. Mikolajczyk, T. Tuytelaars, C. Schmid, A. Zisserman, J. Matas, F. Schaffalitzky, T. Kadir, and L. V. Gool. A comparison of affine region detectors. IJCV, 60(1-2):43–72, Nov. 2005.
[31] D. Nistér and H. Stewénius. Scalable recognition with a vocabulary tree. In Proc. of CVPR, pages 2161–2168, 2006.
[32] F. Perronnin and C. Dance. Fisher kernels on visual vocabularies for image categorization. In Proc. of CVPR, 2007.
[33] F. Perronnin and D. Larlus. Fisher vectors meet neural networks: A hybrid classification architecture. In Proc. of CVPR, 2015.
[34] F. Perronnin, Y. Liu, J. Sanchez, and H. Poirier. Large-scale image retrieval with compressed fisher vectors. In Proc. of CVPR, pages 3384–3391, 2010.
[35] F. Perronnin, J. Sánchez, and T. Mensink. Improving the fisher kernel for large-scale image classification. In Proc. of ECCV, pages 143–156, 2010.
[36] J. Philbin, O. Chum, M. Isard, J. Sivic, and A. Zisserman. Object retrieval with large vocabularies and fast spatial matching. In Proc. of CVPR, pages 1–8, 2007.
[37] M. Raginsky and S. Lazebnik. Locality-sensitive binary codes from shift-invariant kernels. In Proc. of NIPS, 2009.
[38] E. Rosten and T. Drummond. Fusing points and lines for high performance tracking. In Proc. of ICCV, pages 1508–1515, 2005.
[39] E. Rublee, V. Rabaud, K. Konolige, and G. Bradski. Orb: An efficient alternative to sift or surf. In Proc. of ICCV, pages 2564–2571, 2011.
[40] K. Simonyan, A. Vedaldi, and A. Zisserman. Deep fisher networks for large-scale image classification. In Proc. of NIPS, 2013.
[41] J. Sivic and A. Zisserman. Video google: A text retrieval approach to object matching in videos. In Proc. of ICCV, pages 1470–1477, 2003.
[42] E. Spyromitros-Xioufis, S. Papadopoulos, I. Kompatsiaris, G. Tsoumakas, and I. Vlahavas. A comprehensive study over vlad and product quantization in large-scale image retrieval. TMM, 16(6):1713–1728, 2014.
[43] Y. Uchida and S. Sakazawa. Image retrieval with fisher vectors of binary features. In Proc. of ACPR, 2013.
[44] Y. Uchida, K. Takagi, and S. Sakazawa. An alternative to idf: Effective scoring for accurate image retrieval with non-parametric density ratio estimation. In Proc. of ICPR, 2012.
[45] J. Wang, S. Kumar, and S. F. Chang. Semi-supervised hashing for scalable image retrieval. In Proc. of CVPR, pages 3424–3431, 2010.
[46] X. Yang and K. Cheng. Ldb: An ultra-fast feature for scalable augmented reality on mobile devices. In Proc. of ISMAR, pages 49–57, 2012.
[47] X. Yang and K. Cheng. Local difference binary for ultra-fast and distinctive feature description. TPAMI, 36(1), 2014.