Content-Based Image Retrieval (CBIR) has been studied extensively over the last decades. CBIR refers to the possibility of organizing archives of digital pictures so that they can be searched and retrieved by using their visual content datta05 . Object recognition Ullman96 is a specialization of the basic CBIR techniques, in which the visual content of images is analyzed so that the objects contained in digital pictures are recognized and/or images containing specific objects are retrieved. CBIR and object recognition techniques are becoming increasingly popular in many web search engines, where images can be searched by using their visual content Google-Images ; Bing-Images , and in smartphone apps, where information can be obtained by pointing the smartphone camera toward a monument, a painting, or a logo Google-Goggles .
During the last few years, local descriptors such as SIFT lowe04 , SURF bay06 , BRISK leutenegger11 , and ORB rublee11 , to cite some, have been widely used to support effective CBIR and object recognition tasks. A local descriptor is generally a histogram representing statistics of the pixels in the neighborhood of an interest point (automatically) chosen in an image. Among the promising properties offered by local descriptors, we mention their ability to help mitigate the so-called semantic gap Smeulders:2000:CIR:357871.357873 , that is, the gap between the visual representation of images and their semantic content: in most cases visual similarity does not imply semantic similarity.
Executing image retrieval and object recognition tasks that rely on local features is generally resource demanding. Each digital image, whether a query or an image in the digital archive, is typically described by thousands of local descriptors. To decide whether two images match, because they contain the same or similar objects, the local descriptors of the two images need to be compared in order to identify matching patterns. This poses problems when local descriptors are used on devices with limited resources, such as smartphones, or when response time must be very short even in the presence of huge digital archives. On the one hand, the cost of extracting local descriptors, storing all descriptors of all images, and performing feature matching between two images must be reduced to allow their interactive use on devices with limited resources. On the other hand, compact representations of local descriptors and ad hoc index structures for similarity matching ZADB06Similarity are needed to allow image retrieval to scale up to very large digital picture archives. These issues have been addressed by following two different directions.
To reduce the cost of extracting, representing, and matching local visual descriptors, researchers have investigated the use of binary local descriptors, such as BRISK and ORB. Binary features are built from a set of pairwise intensity comparisons, so that each bit of the descriptor is the result of exactly one comparison. Binary descriptors are much faster to extract, are more compact than non-binary ones, and can also be matched faster by using the Hamming distance Hamming:1950:EDE rather than the Euclidean distance. For example, in rublee11 it has been shown that ORB is an order of magnitude faster than SURF, and over two orders of magnitude faster than SIFT. However, even if binary local descriptors are compact, each image is still associated with thousands of local descriptors, which makes it difficult to scale up to very large digital archives.
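To make the matching step concrete, the following minimal Python sketch computes the Hamming distance between two binary descriptors stored as byte strings. The function name `hamming_distance` and the two toy descriptors are illustrative assumptions and are not taken from any of the cited implementations; the 32-byte length mirrors the 256 bits of an ORB descriptor.

```python
def hamming_distance(a: bytes, b: bytes) -> int:
    """Number of differing bits between two equal-length binary descriptors."""
    if len(a) != len(b):
        raise ValueError("descriptors must have the same length")
    # XOR sets exactly the bits where the two descriptors disagree;
    # counting the set bits then gives the Hamming distance.
    diff = int.from_bytes(a, "big") ^ int.from_bytes(b, "big")
    return bin(diff).count("1")


# Two hypothetical 32-byte (256-bit) descriptors differing in one bit per byte.
d1 = bytes([0b10101010] * 32)
d2 = bytes([0b10101011] * 32)
print(hamming_distance(d1, d2))  # 32
```

Because the distance reduces to an XOR followed by a bit count, it can be evaluated with a handful of machine instructions per descriptor pair, which is the source of the speed advantage over the Euclidean distance.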
The information provided by each individual local feature is crucial for tasks such as image stitching and 3D reconstruction. For other tasks, such as image classification and retrieval, high effectiveness has been achieved using quantization and/or aggregation techniques, which provide a meaningful summarization of all the features extracted from an image jegou10:VLAD . One profitable outcome of quantization/aggregation techniques is that they allow us to represent an image by a single descriptor rather than by thousands of descriptors. This reduces the cost of image comparison and makes it possible to scale up the search to large databases. On the one hand, quantization methods, such as the Bag-of-Words approach (BoW) sivic03 , define a finite vocabulary of “visual words”, that is, a finite set of local descriptors to be used as representatives. Every possible local descriptor is thus represented by its closest visual word, that is, the closest element of the vocabulary. In this way images are described by a set (a bag) of identifiers of representatives, rather than by a set of histograms. On the other hand, aggregation methods, such as Fisher Vectors (FV) perronnin07 or Vectors of Locally Aggregated Descriptors (VLAD) jegou10:VLAD , analyze the local descriptors contained in an image to create statistical summaries that still preserve the effectiveness of local descriptors and can be treated as global descriptors. In both cases, index structures for approximate or similarity matching ZADB06Similarity can be used to guarantee scalability on very large datasets.
Since quantization and aggregation methods are defined and used almost exclusively in conjunction with non-binary features, the cost of extracting local descriptors and quantizing/aggregating them on the fly is still high. Recently, some approaches that attempt to integrate binary local descriptors with quantization and aggregation methods have been proposed in the literature galvez11 ; grana13 ; lee15 ; van14 ; uchida13 ; zhang13 . In these proposals, the aggregation is applied directly on top of binary local descriptors. The objective is to improve efficiency and reduce the computing resources needed for image matching by leveraging the advantages of both aggregation techniques (effective and compact image representation) and binary local features (fast feature extraction), while reducing or eliminating their disadvantages.
The contribution of this paper is an extensive comparison and analysis of aggregation and quantization methods applied to binary local descriptors, together with a novel formulation of Fisher Vectors built using the Bernoulli Mixture Model (BMM), referred to as BMM-FV. Moreover, we investigate the combination of BMM-FVs and other encodings of binary features with Convolutional Neural Network razavian2014cnn features, as a further use case for aggregations of binary features. We focus on cases where, for efficiency reasons rublee11 ; heinly12 , binary features are extracted and used to represent images. Thus, we compare aggregations of binary features in order to find the most suitable techniques to avoid direct matching. We expect this topic to be relevant for applications that use binary features on devices with limited CPU and memory resources, such as mobile and wearable devices. In these cases the combination of aggregation methods with binary local features is very useful and makes it possible to scale up image search to large collections, where direct matching is not feasible.
This paper extends our early work on aggregations of binary features amato16 by a) providing a formulation of the Fisher Vector built using the Bernoulli Mixture Model (BMM) which preserves the structure of the traditional FV built using a Gaussian Mixture Model (existing implementations of the FV can be easily adapted to work also with BMMs); b) comparing the BMM-FV against the other state-of-the-art aggregation approaches on two standard benchmarks (INRIA Holidays jegou08 and Oxford5k philbin07 ; see footnote 1); c) evaluating the BMM-FV on top of several binary local features (ORB rublee11 , LATCH levi15_LATCH , AKAZE akaze ) whose performance has not yet been reported on benchmarks for image retrieval; d) evaluating the combination of the BMM-FV with the emerging Convolutional Neural Network (CNN) features, including experiments on a large scale. The results of our experiments show that the use of aggregation and quantization methods with binary local descriptors is generally effective even if, as expected, retrieval performance is worse than that obtained by applying the same aggregation and quantization methods directly to non-binary features. The BMM-FV approach provided us with performance results that are better than those of all the other aggregation methods on binary descriptors. In addition, our results show that some aggregation methods yield a very compact image representation with a retrieval performance comparable to that of direct matching, which is the most used approach to evaluate the similarity of images described by binary local features.
Moreover, we show that the combination of BMM-FV and CNN improves the retrieval performance of the latter and achieves an effectiveness comparable with that obtained by combining CNN and FV built upon SIFTs, as previously proposed in chandrasekhar2015 . The advantage of combining BMM-FV and CNN instead of combining traditional FV and CNN is that the BMM-FV relies on binary features, whose extraction is noticeably faster than SIFT extraction.
Footnote 1: With respect to the experimental setting used in our previous work amato16 , we improved the computation of the local features before the aggregation phase, which allowed us to obtain better performance for BoW and VLAD on the INRIA Holidays dataset than that reported in amato16 .
The paper is organized as follows. Section 2 offers an overview of related work on local features, binary local features, and aggregation methods. Section 3 discusses how existing aggregation methods can be used with binary local features. It also presents our approach for applying Fisher Vectors to binary local features and explains how to combine it with CNN features. Section 4 discusses the evaluation experiments and the obtained results. Section 5 concludes.
2 Related Work
The use of local features is at the core of many computer vision applications, since it allows systems to efficiently match local structures between images. To date, the most used and cited local feature is the Scale Invariant Feature Transform (SIFT) lowe04 . The success of SIFT is due to its distinctiveness, which makes it possible to effectively find correct matches between images. However, SIFT extraction is costly because of the local image gradient computations. In bay06 integral images were used to speed up the computation, and the SURF feature was proposed as an efficient approximation of SIFT. To further reduce the cost of extracting, representing, and matching local visual descriptors, researchers have investigated binary local descriptors. These features have a compact binary representation that is not the result of a quantization, but rather is computed directly from pixel-intensity comparisons. One of the early studies in this direction was the Binary Robust Independent Elementary Features (BRIEF) calonder10 . Rublee et al. rublee11 proposed a binary feature, called ORB (Oriented FAST and Rotated BRIEF), whose extraction process is an order of magnitude faster than SURF, and two orders of magnitude faster than SIFT, according to the experimental results reported in rublee11 ; Miksik2012 ; heinly12 . Recently, several other binary local features have been proposed, such as BRISK leutenegger11 , AKAZE akaze , and LATCH levi15_LATCH .
Local features have been widely used in the literature and in applications. However, since each image is represented by thousands of local features, comparing local features within large databases requires a significant amount of memory and time. Aggregation techniques have been introduced to summarize the information contained in all the local features extracted from an image into a single descriptor. The advantage is twofold: 1) the cost of image comparison is reduced (each image is represented by a single descriptor rather than by thousands of descriptors); 2) aggregated descriptors have proved to be particularly effective for image retrieval and classification tasks.
By far, the most popular aggregation method has been the Bag-of-Words (BoW) sivic03 . BoW was initially proposed for matching objects in videos and has been studied in many other papers, such as csurka04 ; philbin07 ; jegou10:improvingBoW ; jegou10:VLAD , for classification and CBIR tasks. BoW uses a visual vocabulary to quantize the local descriptors extracted from images; each image is then represented by a histogram of occurrences of visual words. The BoW approach used in computer vision is very similar to the BoW used in natural language processing and information retrieval salton86 , thus many text indexing techniques, such as inverted files witten99 , have been applied to image search. Search results obtained using BoW in CBIR have been improved by exploiting additional geometrical information philbin07 ; perdoch09 ; tolias11:SpeededUp ; zhao13 , by applying re-ranking approaches philbin07 ; jegou08 ; chum07 ; tolias13:queryExp , or by using better encoding techniques, such as the Hamming Embedding jegou08 , soft/multiple-assignment philbin08 ; vanGemert08 ; jegou10:improvingBoW , sparse coding yang09 ; boureau10 , locality-constrained linear coding wang10 , and spatial pyramids lazebnik06 .
Recently, alternative encoding schemes, like the Fisher Vectors (FVs) perronnin07 and the Vector of Locally Aggregated Descriptors (VLAD) jegou10:VLAD , have attracted much attention because of their effectiveness in both image classification and large-scale image search. The FV uses the Fisher Kernel framework jaakkola98 to transform an incoming set of descriptors into a fixed-size vector representation. The basic idea is to characterize how a sample of descriptors deviates from an average distribution that is modeled by a parametric generative model. The Gaussian Mixture Model (GMM) mclachlan2000 is typically used as the generative model and might be understood as a “probabilistic visual vocabulary”. While BoW counts the occurrences of visual words, and so takes into account just 0-order statistics, the FV offers a more complete representation by encoding higher order statistics (first, and optionally second order) related to the distribution of the descriptors. The FV also results in a more efficient representation, since fewer visual words are required in order to achieve a given performance. However, the vector representation obtained using BoW is typically quite sparse, while that obtained using the Fisher Kernel is almost dense. This leads to storage and input/output issues that have been addressed by using dimensionality reduction techniques, such as Principal Component Analysis (PCA) bishop06 , compression with product quantization gray98 ; jegou11:PQ , and binary codes perronnin10 . In chandrasekhar2015 a fusion of FV and CNN features razavian2014cnn ; DeCaf was proposed, and other works perronnin2015 ; Simonyan2013 ; Sydorov2014 have started exploring the combination of FVs and CNNs by defining hybrid architectures.
The VLAD method, similarly to BoW, starts with the quantization of the local descriptors of an image by using a visual vocabulary learned by $k$-means. Differently from BoW, VLAD encodes the accumulated difference between the visual words and the associated descriptors, rather than just the number of descriptors assigned to each visual word. Thus, VLAD exploits more aspects of the distribution of the descriptors assigned to a visual word. As highlighted in jegou12 , VLAD might be viewed as a simplified non-probabilistic version of the FV. In the original scheme jegou10:VLAD , as for the FV, VLAD was $L_2$-normalized. Subsequently, a power normalization step was introduced for both VLAD and FV jegou12 ; perronnin10 . Furthermore, PCA dimensionality reduction and product quantization were applied, and several enhancements to the basic VLAD were proposed arandjelovic13:allAbVALD ; chen11 ; delhumeau13 ; zhao13 .
The aggregation methods have been defined and used almost exclusively in conjunction with local features that have a real-valued representation, such as SIFT and SURF. Few articles have addressed the problem of modifying the state-of-the-art aggregation methods to work with the emerging binary local features. In galvez11 ; zhang13 ; grana13 ; lee15 the use of ORB descriptors was integrated into the BoW model by using different clustering algorithms. In galvez11 the visual vocabulary was calculated by binarizing the centroids obtained using the standard $k$-means. In zhang13 ; grana13 ; lee15 the $k$-means clustering was modified to fit binary features by replacing the Euclidean distance with the Hamming distance, and by replacing the mean operation with the median operation. In van14 the VLAD image signature was adapted to work with binary descriptors: $k$-means is used for learning the visual vocabulary, and the VLAD vectors are computed in conjunction with an intra-normalization and a final binarization step. Recently, the FV scheme has also been adapted for use with binary descriptors: Uchida et al. uchida13 derived an FV where the Bernoulli Mixture Model was used instead of the GMM to model binary descriptors, while Sánchez and Redolfi sanchez15 generalized the FV formalism to a broader family of distributions, known as the exponential family, that encompasses the Bernoulli distribution as well as the Gaussian one.
3 Image Representations
In order to decide whether two images contain the same object or have similar visual content, one needs an appropriate mathematical description of each image. In this section, we describe some of the most prominent approaches to transform an input image into a numerical descriptor. First we describe the principal aggregation techniques and their application to binary local features. Then, the emerging CNN features are presented.
3.1 Aggregation of local features
In the following we review how quantization and aggregation methods have been adapted to cope with binary features. Specifically we present the BoW sivic03 , the VLAD jegou10:VLAD and the FV perronnin07 approaches.
3.1.1 Bag-of-Words
The Bag of (Visual) Words (BoW) sivic03 uses a visual vocabulary to group together the local descriptors of an image and represent each image as a set (bag) of visual words. The visual vocabulary is built by clustering the local descriptors of a dataset, e.g. by using $k$-means kmeans . The cluster centers, named centroids, act as the visual words of the vocabulary and are used to quantize the local descriptors extracted from the images. Specifically, each local descriptor of an image is assigned to its closest centroid, and the image is represented by a histogram of occurrences of the visual words. The retrieval phase is performed using text retrieval techniques, where visual words are used in place of text words and a query image is treated as a disjunctive term query. Typically, the cosine similarity measure in conjunction with a term weighting scheme, e.g. term frequency-inverse document frequency (tf-idf), is adopted for evaluating the similarity between any two images.
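The quantization and comparison steps described above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: the vocabulary is given (not learned), descriptors are real-valued tuples compared with the squared Euclidean distance, and the tf-idf weighting is omitted, so the cosine similarity is computed on raw occurrence histograms. The function names `bow_histogram` and `cosine_similarity` are hypothetical.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def nearest_word(descriptor: Vector, vocabulary: Sequence[Vector]) -> int:
    """Index of the visual word (centroid) closest to the descriptor."""
    return min(
        range(len(vocabulary)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(descriptor, vocabulary[i])),
    )


def bow_histogram(descriptors: Sequence[Vector], vocabulary: Sequence[Vector]) -> List[int]:
    """Histogram of occurrences of the visual words in one image."""
    hist = [0] * len(vocabulary)
    for d in descriptors:
        hist[nearest_word(d, vocabulary)] += 1
    return hist


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two histograms (0.0 if either is empty)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0


# Toy vocabulary with two visual words and two tiny "images".
vocabulary = [(0.0, 0.0), (10.0, 10.0)]
image_a = [(1.0, 0.0), (9.0, 9.0), (0.0, 1.0)]
image_b = [(10.0, 11.0), (11.0, 10.0)]
hist_a = bow_histogram(image_a, vocabulary)  # [2, 1]
hist_b = bow_histogram(image_b, vocabulary)  # [0, 2]
```

In a real system the histograms are sparse, which is why inverted files from text retrieval apply naturally.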
BoW and Binary Local Features
In order to extend the BoW scheme to binary features we need a clustering algorithm able to deal with binary strings and the Hamming distance. The $k$-medoids algorithm kaufman87 is suitable for this purpose, but it requires a considerable computational effort to calculate a full distance matrix between the elements of each cluster. In grana13 it was proposed to use a voting scheme, named $k$-majority, to process a collection of binary vectors and seek a set of good centroids, which become the visual words of the BoW model. An equivalent representation is also given in zhang13 ; lee15 , where the BoW model and the $k$-means clustering have been modified to fit binary features by replacing the Euclidean distance with the Hamming distance, and by replacing the mean operation with the median operation.
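A minimal sketch of the majority-voting idea follows, written as a Lloyd-style loop in which the assignment step uses the Hamming distance and the update step takes a bitwise majority vote (which, for bits, coincides with the component-wise median). The initialization from the first $k$ points, the tie-breaking toward 0, and the names `hamming` and `k_majority` are our assumptions for illustration; they are not necessarily the choices made in grana13 .

```python
from typing import List, Sequence

BinaryVector = Sequence[int]  # each component is 0 or 1


def hamming(a: BinaryVector, b: BinaryVector) -> int:
    """Number of positions in which two binary vectors differ."""
    return sum(x != y for x, y in zip(a, b))


def k_majority(points: Sequence[BinaryVector], k: int, iterations: int = 10) -> List[List[int]]:
    """Cluster binary vectors; returns k binary centroids."""
    # Naive initialization (assumption): use the first k points as centroids.
    centroids = [list(p) for p in points[:k]]
    for _ in range(iterations):
        # Assignment step: each point goes to its Hamming-nearest centroid.
        clusters: List[List[BinaryVector]] = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda i: hamming(p, centroids[i]))
            clusters[j].append(p)
        # Update step: each bit of a centroid is the majority vote of its cluster.
        updated = []
        for j, members in enumerate(clusters):
            if not members:
                updated.append(centroids[j])  # keep the centroid of an empty cluster
                continue
            n = len(members)
            # Ties are resolved toward 0 (assumption).
            updated.append([1 if 2 * sum(column) > n else 0 for column in zip(*members)])
        if updated == centroids:
            break  # converged
        centroids = updated
    return centroids


points = [
    [0, 0, 0, 0], [1, 1, 1, 1], [0, 0, 0, 1],
    [1, 1, 1, 0], [0, 0, 1, 0], [1, 1, 0, 1],
]
words = k_majority(points, k=2)  # [[0, 0, 0, 0], [1, 1, 1, 1]]
```

Unlike $k$-medoids, the update step needs only one pass over the members of each cluster, with no pairwise distance matrix.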
3.1.2 Vector of Locally Aggregated Descriptors
The Vector of Locally Aggregated Descriptors (VLAD) was initially proposed in jegou10:VLAD . As for the BoW, a visual vocabulary $\{\mu_1, \dots, \mu_K\}$ is first learned using a clustering algorithm (e.g. $k$-means). Then each local descriptor $x$ of a given image is associated with its nearest visual word $q(x)$ in the vocabulary, and for each centroid $\mu_i$ the differences $x - \mu_i$ of the vectors $x$ assigned to $\mu_i$ are accumulated:
$$v_i = \sum_{x \,:\, q(x) = \mu_i} (x - \mu_i).$$
The VLAD is the concatenation of the residual vectors $v_i$, i.e. $V = [v_1^\top \dots v_K^\top]$. All the residuals have the same size $D$, which is equal to the size of the used local features. Thus the dimensionality of the whole vector $V$ is fixed too and it is equal to $DK$.
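The accumulation of residuals can be illustrated with the following Python sketch. It assumes a given vocabulary of real-valued centroids, assigns each descriptor to its nearest centroid under the squared Euclidean distance, and returns the unnormalized concatenation of the per-centroid residual sums; normalization steps are left out. The function name `vlad` and the toy data are hypothetical.

```python
from typing import List, Sequence

Vector = Sequence[float]


def vlad(descriptors: Sequence[Vector], centroids: Sequence[Vector]) -> List[float]:
    """Unnormalized VLAD: concatenation of per-centroid residual sums."""
    k, dim = len(centroids), len(centroids[0])
    residuals = [[0.0] * dim for _ in range(k)]
    for x in descriptors:
        # Nearest visual word under the squared Euclidean distance.
        i = min(
            range(k),
            key=lambda j: sum((a - b) ** 2 for a, b in zip(x, centroids[j])),
        )
        # Accumulate the difference between the descriptor and its centroid.
        for t in range(dim):
            residuals[i][t] += x[t] - centroids[i][t]
    # Concatenate the k residual vectors into one vector of size k * dim.
    return [value for residual in residuals for value in residual]


centroids = [(0.0, 0.0), (4.0, 4.0)]
descriptors = [(1.0, 0.0), (0.0, 1.0), (5.0, 3.0)]
signature = vlad(descriptors, centroids)  # [1.0, 1.0, 1.0, -1.0]
```

The output length depends only on the vocabulary size and the descriptor dimensionality, not on the number of descriptors in the image.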
VLAD and Binary Local Features
A naive way to apply the VLAD scheme to binary local descriptors is to treat binary vectors as a particular case of real-valued vectors. In this way, the $k$-means algorithm can be used to build the visual vocabulary and the differences between the centroids and the descriptors can be accumulated as usual. This approach has also been used in van14 , where a variation of the VLAD image signature, called BVLAD, has been defined to work with binary features. Specifically, the BVLAD is the binarization (by thresholding) of a VLAD obtained using power-law, intra-normalization, $L_2$ normalization and multiple PCA. In the following we do not evaluate the performance of the BVLAD, because the binarization of the final image signature is out of the scope of this paper.
Similarly to BoW, various binary clustering algorithms (e.g. $k$-medoids and $k$-majority) and the Hamming distance can be used to build the visual vocabulary and associate each binary descriptor with its nearest visual word. However, as we will see, the use of binary centroids may provide less discriminant information during the computation of the residual vectors.
3.1.3 Fisher Vector
The Fisher Kernel jaakkola98 is a powerful framework, adopted in the context of image classification in perronnin07 as an efficient tool to encode image local descriptors into a fixed-size vector representation. The main idea is to derive a kernel function to measure the similarity between two sets of data, such as the sets of local descriptors extracted from two images. The similarity of two sample sets $X$ and $Y$ is measured by analyzing the difference between the statistical properties of $X$ and $Y$, rather than by comparing $X$ and $Y$ directly. To this end, a probability distribution $p(\cdot|\lambda)$ with some parameters $\lambda \in \mathbb{R}^m$ is first estimated on a training set and is used as a generative model over the space of all the possible data observations. Then each set $X$ of observations is represented by a vector, named Fisher Vector, that indicates the direction in which the parameter $\lambda$ of the probability distribution $p(\cdot|\lambda)$ should be modified to best fit the data in $X$. In this way, two samples are considered similar if the directions given by their respective Fisher Vectors are similar. Specifically, as proposed in jaakkola98 , the similarity between two sample sets $X$ and $Y$ is measured using the Fisher Kernel, defined as
$$K(X, Y) = (G_\lambda^X)^\top F_\lambda^{-1} G_\lambda^Y,$$
where $F_\lambda$ is the Fisher Information Matrix (FIM) and $G_\lambda^X = \nabla_\lambda \log p(X|\lambda)$ is referred to as the score function.
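The computation of the Fisher Kernel is costly due to the multiplication by the inverse of the FIM. However, by using the Cholesky decomposition $F_\lambda^{-1} = L_\lambda^\top L_\lambda$, it is possible to rewrite the Fisher Kernel as a Euclidean dot-product, i.e. $K(X, Y) = (\mathcal{G}_\lambda^X)^\top \mathcal{G}_\lambda^Y$, where $\mathcal{G}_\lambda^X = L_\lambda G_\lambda^X$, with $G_\lambda^X$ the score function of $X$, is the Fisher Vector (FV) of $X$ perronnin10 .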
Note that the FV is a fixed-size vector whose dimensionality depends only on the dimensionality of the parameter $\lambda$. The FV is further divided by $|X|$ in order to avoid the dependence on the sample size sanchez13 and $L_2$-normalized because, as proved in perronin10:improvingFK ; sanchez13 , this is a way to cancel out the fact that different images contain different amounts of image-specific information (e.g. the same object at different scales).
The distribution $p(\cdot|\lambda)$, which models the generative process in the space of the data observations, can be chosen in various ways. The Gaussian Mixture Model (GMM) is typically used to model the distribution of non-binary features considering that, as pointed out in mclachlan2000 , any continuous distribution can be approximated arbitrarily well by an appropriate finite Gaussian mixture. Since the Bernoulli distribution models an experiment that has only two possible outcomes (0 and 1), a reasonable alternative to characterize the distribution of a set of binary features is to use a Bernoulli Mixture Model (BMM).
FV and Binary Local Features
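In this work we derive and test an extension of the FV built using the BMM, called BMM-FV, to encode binary features. Specifically, we chose $p(\cdot|\lambda)$ to be a multivariate Bernoulli mixture with $K$ components and parameters $\lambda = \{w_k, \mu_{kd}\}_{k=1,\dots,K;\; d=1,\dots,D}$:
$$p(x_t|\lambda) = \sum_{k=1}^{K} w_k\, p_k(x_t), \qquad p_k(x_t) = \prod_{d=1}^{D} \mu_{kd}^{x_{td}} (1-\mu_{kd})^{1-x_{td}},$$
with $\sum_{k=1}^{K} w_k = 1$ and $w_k > 0$ for all $k$. The constraints on the weights are handled through the soft-max parametrization $w_k = \exp(\alpha_k) / \sum_{i=1}^{K} \exp(\alpha_i)$.
Given a set $X = \{x_t,\ t = 1, \dots, T\}$ of $D$-dimensional binary vectors $x_t \in \{0,1\}^D$ and assuming that the samples are independent, we have that the score vector $G_\lambda^X$ with respect to the parameter $\lambda = \{\alpha_k, \mu_{kd}\}$ is calculated (see Appendix A) as the concatenation of
$$G_{\alpha_k}^X = \sum_{t=1}^{T} \big(\gamma_t(k) - w_k\big), \qquad G_{\mu_{kd}}^X = \sum_{t=1}^{T} \gamma_t(k)\, \frac{x_{td} - \mu_{kd}}{\mu_{kd}(1-\mu_{kd})},$$
where $\gamma_t(k) = p(k|x_t, \lambda)$ is the occupancy probability (or posterior probability). The occupancy probability $\gamma_t(k)$ represents the probability for the observation $x_t$ to be generated by the $k$-th Bernoulli and it is calculated as
$$\gamma_t(k) = \frac{w_k\, p_k(x_t)}{\sum_{j=1}^{K} w_j\, p_j(x_t)}.$$
The FV of $X$ is then obtained by normalizing the score $G_\lambda^X$ by the matrix $L_\lambda$, which is the square root of the inverse of the FIM, and by the sample size $T$. In Appendix B we provide an approximation of the FIM under the assumption that the occupancy probability $\gamma_t(k)$ is sharply peaked on a single value of $k$ for each descriptor $x_t$, obtained following an approach very similar to that used in sanchez13 for the GMM case. By using our FIM approximation, we obtained the following normalized gradient:
$$\mathcal{G}_{\alpha_k}^X = \frac{1}{T\sqrt{w_k}} \sum_{t=1}^{T} \big(\gamma_t(k) - w_k\big), \qquad \mathcal{G}_{\mu_{kd}}^X = \frac{1}{T\sqrt{w_k}} \sum_{t=1}^{T} \gamma_t(k)\, \frac{x_{td} - \mu_{kd}}{\sqrt{\mu_{kd}(1-\mu_{kd})}}.$$
The final BMM-FV is the concatenation of $\mathcal{G}_{\alpha_k}^X$ and $\mathcal{G}_{\mu_{kd}}^X$ for $k = 1, \dots, K$, $d = 1, \dots, D$, and is therefore of dimension $K(D+1)$.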
Table 1: Comparison between the BMM-FV in our formalization and the BMM-FV of Uchida et al. uchida13 (one of the gradient components is not explicitly derived in uchida13 ).
An extension of the FV using the BMM has also been carried out in uchida13 ; sanchez15 . Our approach differs from the one proposed in uchida13 in the approximation of the square root of the inverse of the FIM (i.e., $L_\lambda$). It is worth noting that our formalization preserves the structure of the traditional FV derived by using the GMM, where Gaussian means and variances are replaced by Bernoulli means $\mu_{kd}$ and variances $\mu_{kd}(1-\mu_{kd})$ (see Table 1).
In sanchez15 , the FV formalism was generalized to a broader family of distributions, known as the exponential family, that encompasses the Bernoulli distribution as well as the Gaussian one. However, sanchez15 lacks an explicit definition of the FV and of the FIM approximation in the case of the BMM, which was out of the scope of that work. Our formulation differs from that of sanchez15 in the choice of the parameters used in the gradient computation of the score function (see footnote 2). A similar difference also holds for the FV computed on the GMM, given that in sanchez15 the score function is computed w.r.t. the natural parameters of the Gaussian distribution rather than the mean and the variance parameters, which are typically used in the literature for the FV representation perronnin10 ; perronnin07 ; sanchez13 . Unfortunately, the authors of sanchez15 did not experimentally compare the FVs obtained with and without the natural parameters.
Footnote 2: A Bernoulli distribution of parameter $\mu$ can be written as an exponential-family distribution $p(x) = \exp\big(\eta x - \log(1 + e^{\eta})\big)$, where $\eta = \log\frac{\mu}{1-\mu}$ is the natural parameter. In sanchez15 the score function is computed considering the gradient w.r.t. the natural parameters, while in this paper we used the gradient w.r.t. the standard parameter $\mu$ of the Bernoulli (as also done in uchida13 ).
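Sánchez et al. sanchez13 highlight that the FV derived from the GMM can be computed in terms of the following $0$-order and $1$-order statistics: $S_k^0 = \sum_{t=1}^{T} \gamma_t(k) \in \mathbb{R}$, $S_k^1 = \sum_{t=1}^{T} \gamma_t(k)\, x_t \in \mathbb{R}^D$, where $\gamma_t(k)$ is the occupancy probability of the $k$-th mixture component for the descriptor $x_t$. Our BMM-FV can also be written in terms of these statistics as
$$\mathcal{G}_{\alpha_k}^X = \frac{S_k^0 - T w_k}{T\sqrt{w_k}}, \qquad \mathcal{G}_{\mu_{kd}}^X = \frac{S_{kd}^1 - \mu_{kd}\, S_k^0}{T\sqrt{w_k\, \mu_{kd}(1-\mu_{kd})}}.$$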
Finally, we applied power-law and $L_2$ normalization to improve the effectiveness of the BMM-FV approach.
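The whole encoding pipeline can be sketched in plain Python as follows. The sketch assumes that the normalized gradients have the same structure as in the GMM case with Bernoulli means and variances in place of the Gaussian ones: the weight component is the sum over descriptors of (posterior minus weight), the mean component is the sum of posterior times (bit minus mean) divided by the Bernoulli standard deviation, and both are scaled by 1 / (T sqrt(w_k)). The function name `bmm_fisher_vector`, the per-component ordering of the output, and the power-law exponent 0.5 are our assumptions; the mixture parameters are taken as given rather than learned, and the descriptor set is assumed to be non-empty.

```python
import math
from typing import List, Sequence


def bmm_fisher_vector(
    descriptors: Sequence[Sequence[int]],
    weights: Sequence[float],
    means: Sequence[Sequence[float]],
    power: float = 0.5,
) -> List[float]:
    """Sketch of a BMM-FV for T binary descriptors of length D.

    weights: K mixture weights w_k (positive, summing to 1).
    means:   K x D Bernoulli means mu_kd, assumed strictly inside (0, 1).
    power:   exponent of the power-law normalization (0.5 is an assumption).
    """
    T, K, D = len(descriptors), len(weights), len(means[0])
    g_alpha = [0.0] * K
    g_mu = [[0.0] * D for _ in range(K)]

    for x in descriptors:
        # log(w_k * p_k(x)), computed in the log domain for numerical stability.
        log_joint = [
            math.log(weights[k])
            + sum(math.log(means[k][d] if x[d] else 1.0 - means[k][d]) for d in range(D))
            for k in range(K)
        ]
        top = max(log_joint)
        unnorm = [math.exp(v - top) for v in log_joint]
        total = sum(unnorm)
        gamma = [u / total for u in unnorm]  # occupancy (posterior) probabilities

        for k in range(K):
            g_alpha[k] += gamma[k] - weights[k]
            for d in range(D):
                mu = means[k][d]
                g_mu[k][d] += gamma[k] * (x[d] - mu) / math.sqrt(mu * (1.0 - mu))

    # Scale by 1 / (T * sqrt(w_k)) and concatenate component by component.
    fv: List[float] = []
    for k in range(K):
        scale = 1.0 / (T * math.sqrt(weights[k]))
        fv.append(scale * g_alpha[k])
        fv.extend(scale * v for v in g_mu[k])

    # Signed power-law followed by L2 normalization.
    fv = [math.copysign(abs(v) ** power, v) for v in fv]
    norm = math.sqrt(sum(v * v for v in fv))
    return [v / norm for v in fv] if norm > 0 else fv


X = [[1, 0, 1], [1, 1, 1], [0, 0, 0]]
fv = bmm_fisher_vector(X, [0.5, 0.5], [[0.8, 0.6, 0.7], [0.2, 0.3, 0.1]])
```

With $K = 2$ components and $D = 3$ bits the output has $K(D+1) = 8$ entries, regardless of how many descriptors the image contains.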
3.2 Combination of Convolutional Neural Network Features and Aggregations of Binary Local Feature
Convolutional Neural Networks (CNNs) LeCun2015 have brought breakthroughs in the computer vision area by improving the state of the art in several domains, such as image retrieval, image classification, object recognition, and action recognition. Deep CNNs allow a machine to automatically learn representations of data with multiple levels of abstraction, which can be used for detection or classification tasks. CNNs are neural networks specialized for data that have a grid-like topology, such as image data. The applied discrete convolution operation results in a multiplication by a matrix which has several entries constrained to be equal to other entries. Three important ideas are behind the success of CNNs: sparse connectivity, parameter sharing, and equivariant representations DeepLearning2016 .
In image retrieval, the activations produced by an image within the top layers of the CNN have been successfully used as high-level descriptors of the visual content of the image DeCaf . The results reported in razavian2014cnn show that these CNN features, compared by using the Euclidean distance, achieve state-of-the-art quality in terms of mean Average Precision (mAP). Most of the papers reporting results obtained using the CNN features maintain the Rectified Linear Unit (ReLU) transform DeCaf ; razavian2014cnn ; chandrasekhar2015 , i.e., negative activation values are discarded by replacing them with 0. Values are typically $L_2$-normalized babenko2014neural ; razavian2014cnn ; chandrasekhar2015 and we did the same in this work. In Section 4.2 we describe the CNN model used in our experiments.
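As a small illustration of this post-processing, the sketch below applies the ReLU transform to a vector of activations and then normalizes the result to unit Euclidean length. It assumes that the normalization in question is the $L_2$ one; the function name `relu_l2` and the toy activation vector are hypothetical and do not come from a real network.

```python
import math
from typing import List, Sequence


def relu_l2(activations: Sequence[float]) -> List[float]:
    """Replace negative activations with 0, then scale to unit L2 norm."""
    rectified = [max(0.0, a) for a in activations]
    norm = math.sqrt(sum(a * a for a in rectified))
    # An all-zero vector is returned unchanged to avoid dividing by zero.
    return [a / norm for a in rectified] if norm > 0 else rectified


feature = relu_l2([3.0, -1.0, 4.0])  # [0.6, 0.0, 0.8]
```

After this step all descriptors lie on the unit sphere, so Euclidean distances between them are bounded by 2.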
Recently, in chandrasekhar2015 it has been shown that the information provided by the FV built upon SIFT helps to further improve the retrieval performance of the CNN features, and a combination of FV and CNN features has been used as well chandrasekhar2015 ; amato16:JOCCH . However, the benefits of such combinations are clouded by the cost of extracting SIFTs, which can be considered too high with respect to the cost of computing the CNN features (see Table 2). Since the extraction of binary local features is up to two times faster than SIFT, in this work we also investigate the combination of CNN features with aggregations of binary local features, including the BMM-FV.
We combined the BMM-FV and the CNN features using the following approach. Each image was represented by a pair $(c, f)$, where $c$ and $f$ were respectively the CNN descriptor and the BMM-FV of the image. Then, we evaluated the distance between two pairs $(c_1, f_1)$ and $(c_2, f_2)$ as the convex combination of the $L_2$ distances between the CNN descriptors (i.e. $\|c_1 - c_2\|_2$) and between the BMM-FV descriptors (i.e. $\|f_1 - f_2\|_2$). In other words, we defined the distance between two pairs $(c_1, f_1)$ and $(c_2, f_2)$ as
$$d\big((c_1, f_1), (c_2, f_2)\big) = \alpha\, \|c_1 - c_2\|_2 + (1 - \alpha)\, \|f_1 - f_2\|_2,$$
with $0 \le \alpha \le 1$. Choosing $\alpha = 0$ corresponds to using only the FV approach, while $\alpha = 1$ corresponds to using only the CNN features. Please note that in our case both the FV and the CNN features are $L_2$-normalized, so the distance function between the CNN descriptors has the same range of values as the distance function between the BMM-FV descriptors.
Similarly, combinations between CNN features and other image descriptors, such as GMM-FV, VLAD, and BoW, can be considered by using the convex combination of the respective distances. Please note that whenever the two distances do not have the same range, they should be rescaled before the convex combination (e.g. by dividing each distance function by its maximum value).
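The combined distance can be sketched as follows, assuming that both descriptors are compared with the Euclidean distance and that the weight of the CNN term is a value in [0, 1], so that 0 selects the aggregated descriptor only and 1 the CNN descriptor only. The names `combined_distance` and `alpha` and the toy vectors are illustrative assumptions; the sketch also assumes that the two distances already share the same range, so no rescaling is applied.

```python
import math
from typing import Sequence


def euclidean(u: Sequence[float], v: Sequence[float]) -> float:
    """Euclidean (L2) distance between two vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def combined_distance(
    cnn_1: Sequence[float],
    agg_1: Sequence[float],
    cnn_2: Sequence[float],
    agg_2: Sequence[float],
    alpha: float,
) -> float:
    """Convex combination of the CNN distance and the aggregated-descriptor distance."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * euclidean(cnn_1, cnn_2) + (1.0 - alpha) * euclidean(agg_1, agg_2)


# Toy unit-norm descriptors: the CNN parts differ, the aggregated parts coincide.
c1, f1 = [1.0, 0.0], [1.0, 0.0]
c2, f2 = [0.0, 1.0], [1.0, 0.0]
d_mixed = combined_distance(c1, f1, c2, f2, alpha=0.5)
```

Sweeping `alpha` over a grid on a validation set is one simple way to choose the trade-off between the two descriptors.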
Table 2: Average time costs for computing various image representations using a CPU implementation. The cost of computing the CNN feature of an image was estimated using a pre-learned AlexNet model and the Caffe framework caffe2014 on an Intel i7 3.5 GHz. The values related to the FV refer only to the cost of aggregating the local descriptors of an image into a single vector and encompass neither the cost of extracting the local features nor the learning of the Gaussian or Bernoulli Mixture Model, which is performed off-line. The cost of computing the FV varies proportionally with $TKD$, where $T$ is the number of local features extracted from an image, $K$ is the number of mixture components (Gaussian or Bernoulli), and $D$ is the dimensionality of each local feature; we reported the approximate cost for a fixed choice of $T$, $K$, and $D$ on an Intel i7 3.5 GHz. The cost of SIFT/ORB local feature extraction was estimated according to heinly12 for a fixed number of features per image.
4 Experiments
In this section we evaluate and compare the performance of the techniques described in this paper to aggregate binary local descriptors. Specifically, in the Subsection 4.3 we compare the BoW, the VLAD, the FV based on the GMM, and the BMM-FV approach to aggregate ORB binary features. Since the BMM-FV achieved the best results over the other tested approaches, in the Subsection 4.4 we further evaluate the performance of the BMM-FVs using different binary features (ORB, LATCH, AKAZE) and combining them with the CNN features. Finally, in the Subsection 4.5, we report experimental results on large scale.
The experiments were conducted using two benchmark datasets, namely INRIA Holidays jegou08 and Oxford5k philbin07 , that are publicly available and often used in the context of image retrieval jegou10:VLAD ; zhao13 ; jegou08 ; arandjelovic12:rootsift ; perronnin10 ; jegou12 ; tolias14 .
INRIA Holidays jegou08 is a collection of 1,491 images which mainly contains personal holiday photos. The images are of high resolution and represent a large variety of scene types (natural, man-made, water, fire effects, etc.). The dataset contains 500 queries, each of which represents a distinct scene or object. For each query a list of positive results is provided. As done by the authors of the dataset, we resized the images to a maximum of pixels ( pixels for the smaller dimension) before extracting the local descriptors.
Oxford5k philbin07 consists of 5,062 images collected from Flickr. The dataset comprises 11 distinct Oxford buildings together with distractors. There are 55 query images: 5 queries for each building. The collection is provided with a comprehensive ground truth. For each query there are four image sets: Good (clear pictures of the object represented in the query), OK (images where more than 25% of the object is clearly visible), Bad (images where the object is not present) and Junk (images where less than 25% of the object is visible or images with a high level of distortion).
As in many other articles, e.g. jegou10:VLAD ; jegou08 ; philbin08 ; jegou12 , all the learning stages (clustering, etc.) were performed off-line using independent image collections. Flickr60k dataset jegou08 was used as training set for INRIA Holidays. It is composed of images randomly extracted from Flickr. The experiments on Oxford5k were conducted performing the learning stages on Paris6k dataset philbin08 , that contains high resolution images obtained from Flickr by searching for famous Paris landmarks.
For large-scale experiments we combined the Holidays dataset with the 1 million MIRFlickr dataset huiskes08 , used as distractor set as also done in jegou08 ; Amato2016 . Compared to Holidays, the Flickr dataset is slightly biased, because it includes low-resolution images and more photos of humans.
4.2 Experimental settings
In the following we report some details on how the features for the various approaches were extracted.
- Local features.
- Visual Vocabularies and Bernoulli/Gaussian Mixture Models.
The visual vocabularies used for building the BoW and VLAD representations were computed using several clustering algorithms, i.e. k-medoids, k-majority and k-means. The k-means algorithm was applied to the binary features by treating the binary vectors as real-valued vectors. The parameters of the BMM and of the GMM (where is the number of mixture components and is the dimension of each local descriptor) were learned independently by optimizing a maximum-likelihood criterion with the Expectation Maximization (EM) algorithm bishop06 . EM is an iterative method that is deemed to have converged when the change in the likelihood function, or alternatively in the parameters , falls below some threshold . As stopping criterion we used the convergence in -norm of the mean parameters, choosing . As suggested in bishop06 , the BMM/GMM parameters used in the EM algorithm were initialized with: (a) for the mixing coefficients and ; (b) random values chosen uniformly in the range , for the BMM means ; (c) centroids precomputed using k-means for the GMM means ; (d) mean variance of the clusters found using k-means for the diagonal elements of the GMM covariance matrices.
All the learning stages, i.e. k-means, k-medoids, k-majority and the estimation of GMM/BMM, were performed using on the order of 1M descriptors randomly selected from the local features extracted from the training sets (namely Flickr60k for INRIA Holidays and Paris6k for Oxford5k).
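As an illustration of the EM procedure described above for the Bernoulli case, the following is a minimal pure-Python sketch, not the implementation used in the experiments. Several details are assumptions made for the example: the function name `fit_bmm`, uniform mixing coefficients at start, means drawn uniformly from (0.25, 0.75), a Euclidean norm in the stopping test, and the tolerance value.

```python
import math
import random


def fit_bmm(data, n_components, tol=1e-3, max_iter=200, seed=0):
    """Fit a Bernoulli mixture to binary vectors (lists of 0/1) with EM.

    Returns (weights, means); means[k][d] is P(bit d = 1 | component k).
    """
    rng = random.Random(seed)
    n, dim = len(data), len(data[0])
    eps = 1e-6  # keeps probabilities away from 0 and 1 so logs stay finite

    weights = [1.0 / n_components] * n_components
    # Assumed initialisation range for the Bernoulli means.
    means = [[rng.uniform(0.25, 0.75) for _ in range(dim)]
             for _ in range(n_components)]

    for _ in range(max_iter):
        # E-step: responsibilities, computed in the log domain for stability.
        log_on = [[math.log(m) for m in row] for row in means]
        log_off = [[math.log(1.0 - m) for m in row] for row in means]
        resp = []
        for x in data:
            log_p = [
                math.log(weights[k])
                + sum(log_on[k][d] if x[d] else log_off[k][d]
                      for d in range(dim))
                for k in range(n_components)
            ]
            top = max(log_p)
            unnorm = [math.exp(v - top) for v in log_p]
            total = sum(unnorm)
            resp.append([u / total for u in unnorm])

        # M-step: re-estimate mixing weights and Bernoulli means.
        new_means = []
        for k in range(n_components):
            n_k = sum(r[k] for r in resp)
            weights[k] = (n_k + eps) / (n + n_components * eps)
            row = [sum(r[k] * x[d] for r, x in zip(resp, data)) / (n_k + eps)
                   for d in range(dim)]
            new_means.append([min(max(m, eps), 1.0 - eps) for m in row])

        # Stop when the means move less than tol (Euclidean norm, assumed).
        change = math.sqrt(sum((a - b) ** 2
                               for new_row, old_row in zip(new_means, means)
                               for a, b in zip(new_row, old_row)))
        means = new_means
        if change < tol:
            break
    return weights, means
```

With a single component the procedure reduces to computing the per-bit frequency of ones, which is a convenient sanity check before running it on real descriptors.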
- BoW, VLAD, FV.
The various encodings of the local features (as well as the visual vocabularies and the BMM/GMM) were computed by using our Visual Information Retrieval library, which is publicly available on GitHub (https://github.com/ffalchi/it.cnr.isti.vir). These representations are all parametrized by a single integer . It corresponds to the number of centroids (visual words) used in BoW and VLAD, and to the number of mixture components of GMM/BMM used in FV representations.
For the FVs, we used only the components associated with the mean vectors because, as happened in the non-binary case, we observed that the components related to the mixture weights do not improve the results.
As a common post-processing step perronin10:improvingFK ; jegou12 , both the FVs and the VLADs were power-law normalized and subsequently $L_2$-normalized. The power-law normalization is parametrized by a constant $\beta$ and it is defined as $x \mapsto \operatorname{sign}(x)\,|x|^{\beta}$, applied to each component. In our experiments we used .
We also applied PCA to reduce VLAD and FV dimensionality. The projection matrices were estimated on the training datasets.
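The post-processing of FVs and VLADs described above (power-law normalization followed by L2 normalization) can be sketched as follows. The function names are hypothetical, and the default exponent `beta=0.5` is an assumption taken from common practice in the literature, not a value confirmed by the text above.

```python
import math


def power_law_normalize(vector, beta=0.5):
    """Apply sign(x) * |x|**beta to every component (beta is assumed)."""
    return [math.copysign(abs(x) ** beta, x) for x in vector]


def l2_normalize(vector):
    """Scale the vector to unit Euclidean norm; zero vectors are returned as-is."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm > 0 else list(vector)


def postprocess(vector, beta=0.5):
    """Power-law normalization followed by L2 normalization."""
    return l2_normalize(power_law_normalize(vector, beta))
```

For example, `postprocess([4.0, -9.0, 0.0])` first maps the vector to `[2.0, -3.0, 0.0]` and then divides by its norm, the square root of 13.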
- CNN features.
We used the pre-trained HybridNet zhou2014 model, downloaded from the Caffe Model Zoo (https://github.com/BVLC/caffe/wiki/Model-Zoo). The architecture of HybridNet is the same as the BVLC Reference CaffeNet (https://github.com/BVLC/caffe/tree/master/models/bvlc_reference_caffenet), which mimics the original AlexNet krizhevsky2012 , with minor variations as described in caffe2014 . It has 8 weight layers (5 convolutional + 3 fully-connected). The model has been trained on categories ( scene categories from Places Database zhou2014 and object categories from ImageNet deng2009 ) with about million images.
In the test phase we used Caffe and we extracted the output of the first fully-connected layer (fc6) after applying the Rectified Linear Unit (ReLU) transform. The resulting 4,096-dimensional descriptors were $L_2$-normalized.
As a preprocessing step we warped the input images to the canonical resolution of RGB (as also done in DeCaf ).
- Feature comparison and performance measure.
The cosine similarity in conjunction with a term weighting scheme (e.g., tf-idf) is adopted for evaluating the similarity between BoW representations, while the Euclidean distance is used to compare VLAD, FV and CNN-based image signatures. Please note that the Euclidean distance is equivalent to the cosine similarity whenever the vectors are $L_2$-normalized, as in our case (see footnote 7).
Footnote 7: To search a database for the objects similar to a query we can use either a similarity function or a distance function. In the first case, we search for the objects with the greatest similarity to the query. In the latter case, we search for the objects with the lowest distance from the query. A similarity function is said to be equivalent to a distance function if the ranked list of the results to a query is the same. For example, the Euclidean distance between two vectors ($d(x_1, x_2) = \|x_1 - x_2\|_2$) is equivalent to the cosine similarity ($s_{\cos}(x_1, x_2) = x_1 \cdot x_2 / (\|x_1\|_2 \|x_2\|_2)$) whenever the vectors are $L_2$-normalized (i.e. $\|x_1\|_2 = \|x_2\|_2 = 1$). In fact, in such a case, $s_{\cos}(x_1, x_2) = 1 - \tfrac{1}{2}\, d(x_1, x_2)^2$, which implies that the ranked list of the results to a query is the same (i.e., $d(x_1, x_2) \le d(x_1, x_3)$ iff $s_{\cos}(x_1, x_2) \ge s_{\cos}(x_1, x_3)$).
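The equivalence between Euclidean distance and cosine similarity on unit-length vectors can be checked numerically. The sketch below uses small made-up vectors (not data from the experiments) and hypothetical helper names; it ranks a toy database both ways so that the two orderings can be compared.

```python
import math


def l2_normalize(v):
    """Scale a non-zero vector to unit Euclidean norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_by_distance(query, database):
    """Indices of database items, closest first."""
    return sorted(range(len(database)),
                  key=lambda i: euclidean(query, database[i]))


def rank_by_similarity(query, database):
    """Indices of database items, most similar first."""
    return sorted(range(len(database)),
                  key=lambda i: -cosine_similarity(query, database[i]))


# Toy data, normalized to unit length as in the setting described above.
query = l2_normalize([1.0, 2.0, 3.0])
database = [l2_normalize(v) for v in
            ([3.0, 2.0, 1.0], [1.0, 2.0, 2.5], [-1.0, 0.0, 1.0], [0.0, 1.0, 0.0])]
```

On these vectors `rank_by_distance(query, database)` and `rank_by_similarity(query, database)` return the same ordering, and each squared distance equals 2 minus twice the corresponding cosine similarity.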
The image comparison based on the direct matching of the local features (i.e. without aggregation) was performed adopting the distance ratio criterion proposed in lowe04 ; heinly12 . Specifically, candidate matches to local features of the image query are identified by finding their nearest neighbors in the database of images. Matches are discarded if the ratio of the distances between the two closest neighbors is above the 0.8 threshold. Similarity between two images is computed as the percentage of matching pairs with respect to the total local features in the query image.
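A minimal sketch of the distance ratio criterion for binary descriptors is given below. It is a simplification made for illustration: descriptors are stored as Python integers, compared with the Hamming distance, and matched between one query image and one candidate image rather than against a whole database. The function names and the handling of the degenerate zero-distance case are assumptions.

```python
def hamming(a, b):
    """Hamming distance between two binary descriptors stored as ints."""
    return bin(a ^ b).count("1")


def ratio_test_matches(query_feats, candidate_feats, max_ratio=0.8):
    """Return (query_index, candidate_index) pairs that pass the ratio test."""
    matches = []
    for qi, q in enumerate(query_feats):
        dists = sorted((hamming(q, c), ci)
                       for ci, c in enumerate(candidate_feats))
        if len(dists) < 2:
            continue  # the ratio needs two neighbours
        (d1, best), (d2, _) = dists[0], dists[1]
        # Keep the match only if d1 / d2 <= max_ratio; a zero second
        # distance means an ambiguous match and is discarded (assumption).
        if d2 > 0 and d1 <= max_ratio * d2:
            matches.append((qi, best))
    return matches


def direct_matching_similarity(query_feats, candidate_feats, max_ratio=0.8):
    """Fraction of query features that have an accepted match."""
    if not query_feats:
        return 0.0
    accepted = ratio_test_matches(query_feats, candidate_feats, max_ratio)
    return len(accepted) / len(query_feats)
```

The similarity is normalized by the number of local features in the query image, as in the description above, so it is not symmetric in its two arguments.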
The retrieval performance of each method was measured by the mean average precision (mAP). In the experiments on INRIA Holidays, we computed the average precision after removing the query image from the ranking list. In the experiments on Oxford5k, we removed the junk images from the ranking before computing the average precision, as recommended in philbin07 and in the evaluation package provided with the dataset.
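The evaluation protocol above can be illustrated with a short sketch of average precision in which some items (the query itself on Holidays, the junk images on Oxford5k) are removed from the ranking before scoring. This is the textbook, non-interpolated definition with hypothetical function names; the evaluation packages shipped with the datasets may compute the area under the precision-recall curve slightly differently.

```python
def average_precision(ranking, positives, ignored=()):
    """Average precision of a ranked list of item ids.

    ranking: item ids ordered from most to least similar.
    positives: ids of the relevant items for the query.
    ignored: ids removed from the ranking before scoring
             (e.g. the query image itself or junk images).
    """
    positives = set(positives)
    ignored = set(ignored)
    hits = 0
    rank = 0
    precision_sum = 0.0
    for item in ranking:
        if item in ignored:
            continue
        rank += 1
        if item in positives:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(positives) if positives else 0.0


def mean_average_precision(runs):
    """runs: list of (ranking, positives, ignored) tuples, one per query."""
    return sum(average_precision(*run) for run in runs) / len(runs)
```

For instance, with ranking `["q", "a", "x", "b", "j", "c"]`, positives `{"a", "b", "c"}` and ignored items `{"q", "j"}`, the relevant items land at ranks 1, 3 and 4 of the filtered list.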
4.3 Comparison of Various Encodings of Binary Local Features
In Table 3 we summarize the retrieval performance of various aggregation methods applied to ORB features, i.e. the BoW, the VLAD, the FV based on the GMM, and the BMM-FV. In addition, in the last line of the table we report the results obtained without any aggregation, which we refer to as the direct matching of local features and which was performed adopting the distance ratio criterion described in Subsection 4.2.
In our experiments the FV derived as in uchida13 obtained very similar performance to that of our BMM-FV, thus we have reported just the results obtained by using our formulation. Furthermore, we have not experimentally evaluated the FVs computed using the gradient with respect to the natural parameters of a BMM or a GMM as described in sanchez15 , because the evaluation of the retrieval performance obtained using or not the natural parameters in the derivation of the score function is a more general topic which deserves to be further investigated outside the specific context of the encoding of binary local features.
|Method||Local Feature||Learning method||Centroids/components||dim||mAP Holidays||mAP Oxford5k|
|BoW||SIFT PCA 64||k-means||20,000||20,000||43.7||35.4|
|VLAD||SIFT PCA 64||k-means||64||4,096||55.6||37.8|
|FV||SIFT PCA 64||GMM||64||4,096||59.5||41.8|
Among the various baseline aggregation methods (i.e. without using PCA), the BMM-FV approach achieves the best retrieval performance, that is a mAP of 49.6% on Holidays and 24.3% on Oxford5k. PCA dimensionality reduction from to components, applied to the BMM-FV, marginally reduces the mAP on Oxford5k, while on Holidays it allows us to reach 51.3%, which is, for this dataset, the best result among all the aggregation techniques tested on ORB binary features.
Good results are also achieved using VLAD in conjunction with k-means, which obtains a mAP of 47.8% on Holidays and 23.6% on Oxford5k.
The BoW representation reaches a mAP of 44.9%/44.2%/37.9% on Holidays and 22.2%/22.8%/18.8% on Oxford5k using respectively k-means/k-majority/k-medoids for learning a visual vocabulary of visual words.
The GMM-FV method gives results slightly worse than BoW: a mAP of 42.0% on Holidays and 20.4% on Oxford5k. The use of PCA to reduce the dimensions from to leaves the results of GMM-FV on Oxford5k substantially unchanged, while slightly improving the mAP on Holidays (42.6%).
Finally, the worst performance is that of VLAD in combination with vocabularies learned by k-majority (32.4% on Holidays and 16.6% on Oxford5k) and k-medoids (30.6% on Holidays and 15.6% on Oxford5k).
It is interesting to note that on INRIA Holidays, the VLAD with k-means, the BoW with k-means/k-majority, and the FVs are better than direct matching. In fact, the mAP of direct matching of ORB descriptors is 38.1% on INRIA Holidays, while on Oxford5k the direct matching reached a mAP of 31.7%.
In Table 5 we also report the performance of our derivation of the BMM-FV varying the number of Bernoulli mixture components and investigating the impact of the PCA dimensionality reduction in the case of .
In Table (6(a)) we can see that with the Holidays dataset, the mAP grows from 32.0% when using only 4 mixtures to 54.7% when using . On Oxford5k, mAP varies from 14.3% to 27.4%, respectively, for and .
Table (6(b)) shows that the best results are achieved when reducing the full size BMM-FV to with a mAP of 52.6% for Holidays and 25.1% for Oxford5k.
Analysis of the results
Summing up, the results show that in the context of binary local features the BMM-FV outperforms the compared aggregation methods, namely the BoW, the VLAD and the GMM-FV. The performance of the BMM-FV is an increasing function of the number of Bernoulli mixtures. However, for large , the improvement tends to be smaller and the dimensionality of the FV becomes very large (e.g. dimensions using ). Hence, for high values of , the benefit of the improved accuracy is not worth the computational overhead (both for the BMM estimation and for the cost of storage/comparison of FVs).
The PCA reduction of BMM-FV is effective since it can provide a very compact image signature with just a slight loss in accuracy, as shown in the case of (Table 6(b)). Dimension reduction does not necessarily reduce the accuracy. On the contrary, limited reduction tends to improve the retrieval performance of the FV representations.
For the computation of VLAD, k-means is more effective than k-majority/k-medoids clustering, since the use of non-binary centroids gives more discriminant information during the computation of the residual vectors used in VLAD.
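To make the role of the residual vectors concrete, the following sketch computes an unnormalized VLAD by assigning each descriptor to its nearest centroid and accumulating the differences. It is an illustrative simplification with a hypothetical function name, not the implementation of the library used in the experiments. Note that with binary centroids every residual component can only be -1, 0 or 1, whereas real-valued k-means centroids give finer-grained residuals.

```python
def vlad(descriptors, centroids):
    """Unnormalized VLAD: concatenated sums of residuals per centroid.

    descriptors: list of vectors (binary vectors are treated as real-valued).
    centroids: list of real-valued vectors of the same dimension.
    """
    dim = len(centroids[0])
    acc = [[0.0] * dim for _ in centroids]
    for x in descriptors:
        # Nearest centroid under squared Euclidean distance.
        nearest = min(
            range(len(centroids)),
            key=lambda k: sum((a - b) ** 2 for a, b in zip(x, centroids[k])),
        )
        for d in range(dim):
            acc[nearest][d] += x[d] - centroids[nearest][d]
    # Concatenate the per-centroid residual sums into one vector.
    return [value for row in acc for value in row]
```

The resulting vector has as many components as the number of centroids times the descriptor dimension, and would normally be post-processed (power-law and L2 normalization) before comparison.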
For the BoW approach, k-means and k-majority perform equally well and better than k-medoids. However, k-majority is preferable in this case because the cost of the quantization process is significantly reduced by using the Hamming distance, rather than the Euclidean one, for the comparison between centroids and binary local features.
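The quantization advantage mentioned above comes from keeping the centroids binary, so that assignment only needs XOR and bit counting. The sketch below illustrates a k-majority style clustering under simplifying assumptions: descriptors are Python integers, initial centroids are supplied by the caller, ties in the majority vote resolve to 0, and empty clusters keep their previous centroid. All function names are hypothetical.

```python
def hamming(a, b):
    """Hamming distance between two binary descriptors stored as ints."""
    return bin(a ^ b).count("1")


def assign(descriptors, centroids):
    """Index of the nearest centroid (Hamming distance) for each descriptor."""
    return [min(range(len(centroids)), key=lambda k: hamming(x, centroids[k]))
            for x in descriptors]


def majority_centroid(members, n_bits):
    """Bitwise majority vote over the members of a cluster."""
    centroid = 0
    for bit in range(n_bits):
        ones = sum((m >> bit) & 1 for m in members)
        if 2 * ones > len(members):  # ties resolve to 0 (assumption)
            centroid |= 1 << bit
    return centroid


def k_majority(descriptors, centroids, n_bits, max_iter=20):
    """Alternate Hamming assignment and majority-vote updates until stable."""
    centroids = list(centroids)
    for _ in range(max_iter):
        labels = assign(descriptors, centroids)
        updated = []
        for k, old in enumerate(centroids):
            members = [x for x, lab in zip(descriptors, labels) if lab == k]
            updated.append(majority_centroid(members, n_bits) if members else old)
        if updated == centroids:
            break
        centroids = updated
    return centroids, assign(descriptors, centroids)
```

Once the vocabulary is learned, building a BoW histogram only requires calling `assign` on the descriptors of an image and counting the labels.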
Both BMM-FV and VLAD, with only , outperform BoW. However, as happens for non-binary features (see Table 4), the loss in accuracy of BoW representation is comparatively lower when the variability of the images is limited, as for the Oxford5k dataset.
As expected, BMM-FV outperforms GMM-FV, since the probability distribution of binary local features is better described using mixtures of Bernoulli distributions rather than mixtures of Gaussians. The results of our experiments also show that the use of BMM-FV is still effective even if compared with the direct matching strategy. In fact, the retrieval performance of BMM-FV on Oxford5k is just slightly worse than traditional direct matching of local features, while on INRIA Holidays the BMM-FV even outperforms the direct matching result.
For completeness, in Table 4, we also report the results of the same baseline encoding approaches applied to non-binary features (both full-size SIFT and SIFT PCA-reduced to 64 components) taken from the literature jegou10:VLAD ; jegou12 . As expected, aggregation methods in general exhibit better performance in combination with SIFT/SIFTPCA than with ORB, especially for the Oxford5k dataset. However, it is worth noting that on INRIA Holidays the BMM-FV outperforms the BoW on SIFT/SIFTPCA and reaches performance similar to that of the FV built upon SIFTs.
The FV and VLAD get considerable benefit from performing PCA of SIFT local descriptors before the aggregation phase, as the PCA rotation decorrelates the descriptor components. This suggests that techniques such as VLAD with k-means and GMM-FV, which treat binary vectors as real-valued vectors, may also benefit from the use of PCA before the aggregation phase.
In conclusion, it is important to point out that there are several applications where binary features need to be used to improve efficiency, at the cost of some reduction in effectiveness heinly12 . We showed that in this case the use of encoding techniques represents a valid alternative to direct matching.
4.4 Combination of CNNs and Aggregations of Binary Local Feature
In this section we evaluate the retrieval performance of the combination of CNN features with the aggregations of binary local features, following the approach described in Section 3.2. We considered the INRIA Holidays dataset and we used the output of the first fully-connected layer (fc6) of the HybridNet zhou2014 model as CNN feature. In fact, in chandrasekhar2015 several experiments on INRIA Holidays have shown that HybridNet fc6 achieves a better mAP than other outputs (e.g. pool5, fc6, fc7, fc8) of several pre-trained CNN models: the OxfordNet simonyan2014 , the AlexNet krizhevsky2012 , the PlacesNet zhou2014 and the HybridNet itself.
Figure 1 shows the mAP obtained by combining HybridNet fc6 with different aggregations of ORB binary features, namely the BMM-FV, the GMM-FV, the VLAD, and the BoW. Interestingly, with the exception of the GMM-FV, the retrieval performance obtained after the combination is very similar for the various aggregation techniques. On the one hand, this confirms that the GMM-FV is not the best choice for encoding binary features. On the other hand, since each aggregation technique computes statistical summaries of the same set of local descriptors, it suggests that the additional information provided by the various aggregated descriptors helps almost equally to improve the retrieval performance of the CNN feature. Thus, in the following we further investigate combinations of CNNs and the BMM-FV, which, even if only by a small margin, reaches the best performance for all the tested values of the parameter .
In Table 6 we report the mAP obtained combining the HybridNet fc6 feature with the BMM-FV computed for three different kinds of binary local features, namely ORB, LATCH and AKAZE, using mixtures of Bernoulli. It is worth noting that all three BMM-FVs give a similar improvement when combined with the HybridNet fc6, although they have rather different mAP results (see first row of Table 6), which are substantially lower than that of the CNN (last row of Table 6). The intuition is that the additional information provided by using a specific BMM-FV together with the CNN feature, rather than the CNN feature alone, does not depend very much on the binary feature used.
For each tested BMM-FV there seems to exist an optimal weight to be used in the convex combination (equation (4)). When ORB binary features were used, the optimal weight was obtained around 0.5, which corresponds to giving the same importance to both the FV and the CNN feature. For the less effective BMM-FVs built upon LATCH and AKAZE, the optimal weight was , which means that the CNN feature is given slightly more importance than the BMM-FV in the convex combination.
The use of ORB or AKAZE led to the best performance, a mAP of 79.2%. This is a relative improvement of 4.9% with respect to the use of the CNN feature alone, which in our case reached 75.5%. So we obtain the same relative improvement as chandrasekhar2015 but using a less expensive FV representation. Indeed, in chandrasekhar2015 the fusion of HybridNet fc6 and a FV computed on 64-dimensional PCA-reduced SIFTs, using mixtures of Gaussians, led to a relative improvement of with respect to the use of the CNN feature alone (see also Table 8).
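The relative improvement quoted above is the gain in mAP divided by the mAP of the CNN feature alone. A one-function sketch (the function name is hypothetical) makes the arithmetic explicit for the figures reported in the text, 75.5% for the CNN feature alone and 79.2% for the combination.

```python
def relative_improvement(baseline_map, combined_map):
    """Relative mAP improvement, in percent, over the baseline."""
    return 100.0 * (combined_map - baseline_map) / baseline_map


# Figures from the text: CNN alone 75.5% mAP, best combination 79.2% mAP.
holidays_gain = relative_improvement(75.5, 79.2)
```

Here `holidays_gain` is about 4.9, since (79.2 - 75.5) / 75.5 is roughly 0.049.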
However, the cost of integrating a traditional FV built upon SIFTs with CNN features may be considered too high, especially for systems that need to process images in real time. For example, according to heinly12 and as shown in Table 2, the SIFT extraction (about features per image), the PCA reduction to dimensions, and the FV aggregation with require more than seconds per image, while the CNN feature extraction is 4 times faster (i.e., about ms per image). On the other hand, extracting ORB binary features (about features per image, each of dimension ) and aggregating them using a BMM-FV with requires less than ms, which is in line with the cost of CNN extraction ( ms). In our tests, the cost of integrating the already extracted BMM-FV and CNN features was negligible in the search phase, using a sequential scan to search a dataset, also thanks to the fact that both the BMM-FV and the CNN features are compared using the inexpensive Euclidean distance.
Since, as observed in levi15_LATCH ; akaze , the ORB extraction is faster than that of LATCH and AKAZE, in the following we focus just on the ORB binary feature. In Figure 2 we show the results obtained by combining HybridNet fc6 with the BMM-FVs obtained using . We observed that the performance of the CNN feature is improved also when it is combined with the less effective BMM-FV built using Bernoulli. The BMM-FV with achieves the best effectiveness (mAP of 79.5%) for . However, since the cost of computing and storing the FV increases with the number of Bernoulli, the improvement obtained using with respect to that of is not worth the extra cost of using a bigger value of .
The BMM-FV with is still high dimensional, so to reduce the cost of storing and comparing FVs, we also evaluated the combination after the PCA dimensionality reduction. As already observed, limited dimensionality reduction tends to improve the accuracy of the single FV representation. In fact, the BMM-FV with achieved a mAP of when reduced from to dimensions. However, as shown in Table 7 and Table 8, when the PCA-reduced version of the BMM-FV was combined with HybridNet fc6, the overall relative improvement in mAP was , which is less than that obtained using the full-sized BMM-FV. This result is not surprising, given that the dimensionality reduction may discard part of the additional information that the FV representation provides when combined with the CNN feature.
Finally, in Table 8 we summarize the relative improvement achieved by combining BMM-FV and HybridNet fc6, and we compare the obtained results with the relative improvement achieved in chandrasekhar2015 , where the more expensive FV built upon SIFTs was used. We observed that the BMM-FV achieved similar or even better relative improvements with an evident advantage from the computational point of view, because it uses binary local features.
4.5 Large-Scale Experiments
| ||FV full dim: Holidays||FV full dim: Holidays + MIRFlickr||FV PCA-reduced: Holidays||FV PCA-reduced: Holidays + MIRFlickr|
|Maximum relative mAP improvement||4.9%||13.4%||4.0%||13.7%|
In order to evaluate the behavior of feature combinations on a large scale, we have used a set of up to one million images. More precisely, as in jegou08 , we merged the INRIA Holidays dataset with a public large-scale dataset (MIRFlickr-1M huiskes08 ) used as distraction set; the mAP was measured using the Holidays ground-truth.
Table 9 reports results obtained using both the BMM-FV alone and the combinations with the HybridNet fc6 CNN feature. Given the results reported in the previous section we focus on the BMM-FV encoding of ORB binary features. All the feature combinations show an improvement with respect to the single use of the CNN feature (mAP of ) or BMM-FV (mAP of respectively using the full length/PCA-reduced descriptor). This reflects the very good behavior of feature combinations also in the large-scale case.
The mAP reaches a maximum using a weight between 0.4 and 0.5, that is giving almost the same weight to the BMM-FV and the CNN feature during the combination. The results obtained using the full length BMM-FV and the PCA-reduced version are similar. The latter performs slightly better and achieved a maximum mAP of 67.2%, which corresponds to a 13.7% relative mAP improvement with respect to using the CNN feature alone. It is worth noting that the relative mAP improvement obtained in the large-scale setting is much greater than that obtained without the distraction set. This suggests that the information provided by the BMM-FV during the combination helps in discerning the visual content of images particularly in the presence of distractor images.
Since extracting binary features is much faster than extracting non-binary ones, the computational gain of combining CNN features with BMM-FV encodings of ORB over traditional FV encodings of SIFT is especially notable in the large-scale scenario. For example, extracting SIFTs from the INRIA Holidays + MIRFlickr dataset ( images) would have required more than 13 days (about 1,200 ms per image) while ORB extraction took less than 8 hours (about 26 ms per image).
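The extraction times quoted above follow from multiplying the per-image cost by the number of images. The sketch below reproduces that arithmetic; the image count of 1,001,491 (the 1,491 Holidays images plus one million MIRFlickr images) is an assumption made for the example, and the function name is hypothetical.

```python
def total_extraction_seconds(n_images, ms_per_image):
    """Total sequential extraction time in seconds."""
    return n_images * ms_per_image / 1000.0


# Assumed collection size: 1,491 Holidays images + 1,000,000 MIRFlickr images.
N_IMAGES = 1_001_491

sift_days = total_extraction_seconds(N_IMAGES, 1200) / 86_400  # seconds per day
orb_hours = total_extraction_seconds(N_IMAGES, 26) / 3_600     # seconds per hour
```

Under this assumption the SIFT figure comes out at a little under 14 days and the ORB figure at a little over 7 hours, consistent with the bounds stated in the text.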
Motivated by recent results obtained on one hand with the use of aggregation methods applied to local descriptors, and on the other with the definition of binary local features, this paper has performed an extensive comparison of techniques that mix the two approaches by using aggregation methods on binary local features. The use of aggregation methods on binary local features is motivated by the need to increase efficiency and reduce computing resources for image matching on a large scale, at the expense of some degradation in the accuracy of retrieval algorithms. Combining the two approaches makes it possible to execute image retrieval on a very large scale and to reduce the cost of feature extraction and representation. Thus we expect the results of our empirical evaluation to be useful for people working with binary local descriptors.
Moreover, we investigated how aggregations of binary local features work in conjunction with the CNN pipeline in order to improve the retrieval performance of the latter. We showed that the BMM-FV built upon ORB binary features can be profitably used for this purpose, even if a relatively small number of Bernoulli is used. In fact, the relative improvement in the retrieval performance obtained combining CNN features with the BMM-FV is similar to that previously obtained in chandrasekhar2015 where a combination of the CNN features with the more expensive FV built on SIFT was proposed. Experimental evaluation on a large scale confirms the effectiveness and scalability of our proposal.
It is also worth mentioning that the BMM-FV approach is very general and could be applied to any binary feature. Recent works based on CNNs suggest that binary feature aggregation techniques could be further applied to deep features. In fact, on one hand, local features based on CNNs, aggregated with VLAD and FV approaches, have been proposed to obtain robustness to geometric deformations Ng_2015_CVPR_Workshops ; Uricchio_2015_ICCV_Workshops . On the other hand, binarization of global CNN features has also been proposed in Lin_2015_CVPR_Workshops ; Lai_2015_CVPR . Thus, as future work, we plan to test the BMM-FV approach over binary deep local descriptors, leveraging the local and binary approaches mentioned above.
Acknowledgements. This work was partially funded by: EAGLE, Europeana network of Ancient Greek and Latin Epigraphy, co-funded by the European Commission, CIP-ICT-PSP.2012.2.1 - Europeana and creativity, Grant Agreement n. 325122; and Smart News, Social sensing for breaking news, co-funded by the Tuscany region under the FAR-FAS 2014 program, CUP CIPE D58C15000270008.
- (1) Alcantarilla, P.F., Nuevo, J., Bartoli, A.: Fast explicit diffusion for accelerated features in nonlinear scale spaces. In: British Machine Vision Conference (BMVC) (2013)
- (2) Amato, G., Falchi, F., Gennaro, C., Vadicamo, L.: Deep Permutations: Deep Convolutional Neural Networks and Permutation-Based Indexing, pp. 93–106. Springer International Publishing, Cham (2016). DOI 10.1007/978-3-319-46759-7_7. URL http://dx.doi.org/10.1007/978-3-319-46759-7_7
- (3) Amato, G., Falchi, F., Vadicamo, L.: How effective are aggregation methods on binary features? In: Proceedings of the 11th Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications, vol. 4, pp. 566–573 (2016)
- (4) Amato, G., Falchi, F., Vadicamo, L.: Visual recognition of ancient inscriptions using convolutional neural network and fisher vector. J. Comput. Cult. Herit. 9(4), 21:1–21:24 (2016). DOI 10.1145/2964911. URL http://doi.acm.org/10.1145/2964911
- (5) Arandjelovic, R., Zisserman, A.: Three things everyone should know to improve object retrieval. In: Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pp. 2911–2918 (2012)
- (6) Arandjelovic, R., Zisserman, A.: All about VLAD. In: Computer Vision and Pattern Recognition (CVPR), 2013 IEEE Conference on, pp. 1578–1585 (2013). DOI 10.1109/CVPR.2013.207
- (7) Babenko, A., Slesarev, A., Chigorin, A., Lempitsky, V.: Neural codes for image retrieval. In: Computer Vision–ECCV 2014, pp. 584–599. Springer (2014). DOI 10.1007/978-3-319-10590-1_38. URL http://dx.doi.org/10.1007/978-3-319-10590-1_38
- (8) Bay, H., Tuytelaars, T., Van Gool, L.: Surf: Speeded up robust features. In: A. Leonardis, H. Bischof, A. Pinz (eds.) Computer Vision - ECCV 2006, Lecture Notes in Computer Science, vol. 3951, pp. 404–417. Springer Berlin Heidelberg (2006). DOI 10.1007/11744023_32. URL http://dx.doi.org/10.1007/11744023_32
- (9) Bing images. URL http://www.bing.com/images/
- (10) Bishop, C.M.: Pattern Recognition and Machine Learning. Information Science and Statistics. Springer (2006)
- (11) Boureau, Y.L., Bach, F., LeCun, Y., Ponce, J.: Learning mid-level features for recognition. In: Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on, pp. 2559–2566 (2010)
- (12) Calonder, M., Lepetit, V., Strecha, C., Fua, P.: Brief: Binary robust independent elementary features. In: K. Daniilidis, P. Maragos, N. Paragios (eds.) Computer Vision - ECCV 2010, Lecture Notes in Computer Science, vol. 6314, pp. 778–792. Springer Berlin Heidelberg (2010)
- (13) Chandrasekhar, V., Lin, J., Morère, O., Goh, H., Veillard, A.: A practical guide to cnns and fisher vectors for image instance retrieval. CoRR abs/1508.02496 (2015). URL http://arxiv.org/abs/1508.02496
- (14) Chen, D., Tsai, S., Chandrasekhar, V., Takacs, G., Chen, H., Vedantham, R., Grzeszczuk, R., Girod, B.: Residual enhanced visual vectors for on-device image matching. In: Signals, Systems and Computers (ASILOMAR), 2011 Conference Record of the Forty Fifth Asilomar Conference on, pp. 850–854 (2011). DOI 10.1016/j.sigpro.2012.06.005. URL http://dx.doi.org/10.1016/j.sigpro.2012.06.005
- (15) Chum, O., Philbin, J., Sivic, J., Isard, M., Zisserman, A.: Total recall: Automatic query expansion with a generative feature model for object retrieval. In: Computer Vision, 2007. ICCV 2007. IEEE 11th International Conference on, pp. 1–8 (2007)
- (16) Csurka, G., Dance, C., Fan, L., Willamowski, J., Bray, C.: Visual categorization with bags of keypoints. Workshop on statistical learning in computer vision, ECCV 1(1-22), 1–2 (2004)
- (17) Datta, R., Li, J., Wang, J.Z.: Content-based image retrieval: Approaches and trends of the new age. In: Proceedings of the 7th ACM SIGMM International Workshop on Multimedia Information Retrieval, MIR ’05, pp. 253–262. ACM, New York, NY, USA (2005)
- (18) Delhumeau, J., Gosselin, P.H., Jégou, H., Pérez, P.: Revisiting the VLAD image representation. In: Proceedings of the 21st ACM International Conference on Multimedia, MM 2013, pp. 653–656. ACM, New York, NY, USA (2013). DOI 10.1145/2502081.2502171. URL http://doi.acm.org/10.1145/2502081.2502171
- (19) Deng, J., Dong, W., Socher, R., Li, L., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pp. 248–255 (2009). DOI 10.1109/CVPR.2009.5206848
- (20) Donahue, J., Jia, Y., Vinyals, O., Hoffman, J., Zhang, N., Tzeng, E., Darrell, T.: Decaf: A deep convolutional activation feature for generic visual recognition. CoRR abs/1310.1531 (2013). URL http://arxiv.org/abs/1310.1531
- (21) Galvez-Lopez, D., Tardos, J.: Real-time loop detection with bags of binary words. In: Intelligent Robots and Systems (IROS), 2011 IEEE/RSJ International Conference on, pp. 51–58 (2011)
- (22) van Gemert, J.C., Geusebroek, J.M., Veenman, C.J., Smeulders, A.W.: Kernel codebooks for scene categorization. In: D. Forsyth, P. Torr, A. Zisserman (eds.) Computer Vision - ECCV 2008, Lecture Notes in Computer Science, vol. 5304, pp. 696–709. Springer Berlin Heidelberg (2008)
- (23) Goodfellow, I., Bengio, Y., Courville, A.: Deep learning (2016). URL http://www.deeplearningbook.org. Book in preparation for MIT Press
- (24) Google Goggles. URL http://www.google.com/mobile/goggles/
- (25) Google images. URL https://images.google.com/
- (26) Grana, C., Borghesani, D., Manfredi, M., Cucchiara, R.: A fast approach for integrating ORB descriptors in the bag of words model. In: C.G.M. Snoek, L.S. Kennedy, R. Creutzburg, D. Akopian, D. Wüller, K.J. Matherson, T.G. Georgiev, A. Lumsdaine (eds.) IS&T/SPIE Electronic Imaging. International Society for Optics and Photonics (2013)
- (27) Gray, R.M., Neuhoff, D.L.: Quantization. Information Theory, IEEE Transactions on 44(6), 2325–2383 (1998). DOI 10.1109/18.720541. URL http://dx.doi.org/10.1109/18.720541
- (28) Hamming, R.W.: Error detecting and error correcting codes. The Bell System Technical Journal 29(2), 147–160 (1950). DOI 10.1002/j.1538-7305.1950.tb00463.x
- (29) Heinly, J., Dunn, E., Frahm, J.M.: Comparative evaluation of binary features. In: Computer Vision - ECCV 2012, Lecture Notes in Computer Science, pp. 759–773. Springer Berlin Heidelberg (2012)
- (30) Householder, A.: The Theory of Matrices in Numerical Analysis. A Blaisdell book in pure and applied sciences: introduction to higher mathematics. Blaisdell Publishing Company (1964)
- (31) Huiskes, M.J., Lew, M.S.: The mir flickr retrieval evaluation. In: MIR ’08: Proceedings of the 2008 ACM International Conference on Multimedia Information Retrieval. ACM, New York, NY, USA (2008)
- (32) Jaakkola, T., Haussler, D.: Exploiting generative models in discriminative classifiers. In: Advances in Neural Information Processing Systems 11, pp. 487–493. MIT Press (1998). URL http://dl.acm.org/citation.cfm?id=340534.340715
- (33) Jégou, H., Douze, M., Schmid, C.: Hamming embedding and weak geometric consistency for large scale image search. In: D. Forsyth, P. Torr, A. Zisserman (eds.) European Conference on Computer Vision, LNCS, vol. I, pp. 304–317. Springer (2008)
- (34) Jégou, H., Douze, M., Schmid, C.: Improving bag-of-features for large scale image search. International Journal of Computer Vision 87(3), 316–336 (2010). DOI 10.1007/s11263-009-0285-2. URL http://dx.doi.org/10.1007/s11263-009-0285-2
- (35) Jégou, H., Douze, M., Schmid, C.: Product quantization for nearest neighbor search. Pattern Analysis and Machine Intelligence, IEEE Transactions on 33(1), 117–128 (2011). DOI 10.1109/TPAMI.2010.57
- (36) Jégou, H., Douze, M., Schmid, C., Pérez, P.: Aggregating local descriptors into a compact image representation. In: IEEE Conference on Computer Vision & Pattern Recognition (2010). DOI 10.1109/CVPR.2010.5540039
- (37) Jégou, H., Perronnin, F., Douze, M., Sànchez, J., Pérez, P., Schmid, C.: Aggregating local image descriptors into compact codes. IEEE Transactions on Pattern Analysis and Machine Intelligence 34(9), 1704–1716 (2012). DOI 10.1109/TPAMI.2011.235
- (38) Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick, R., Guadarrama, S., Darrell, T.: Caffe: Convolutional architecture for fast feature embedding. In: Proceedings of the ACM International Conference on Multimedia, pp. 675–678. ACM (2014). DOI 10.1145/2647868.2654889. URL http://doi.acm.org/10.1145/2647868.2654889
- (39) Kaufman, L., Rousseeuw, P.: Clustering by means of medoids. In: Y. Dodge (ed.) An introduction to L1-norm based statistical data analysis, Computational Statistics & Data Analysis, vol. 5 (1987)
- (40) Krapac, J., Verbeek, J., Jurie, F.: Modeling Spatial Layout with Fisher Vectors for Image Categorization. In: ICCV 2011 - International Conference on Computer Vision, pp. 1487–1494. IEEE, Barcelona, Spain (2011)
- (41) Krizhevsky, A., Sutskever, I., Hinton, G.E.: Imagenet classification with deep convolutional neural networks. In: F. Pereira, C. Burges, L. Bottou, K. Weinberger (eds.) Advances in Neural Information Processing Systems 25, pp. 1097–1105. Curran Associates, Inc. (2012)
- (42) Lai, H., Pan, Y., Liu, Y., Yan, S.: Simultaneous feature learning and hash coding with deep neural networks. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2015)
- (43) Lazebnik, S., Schmid, C., Ponce, J.: Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In: Computer Vision and Pattern Recognition, 2006 IEEE Computer Society Conference on, vol. 2 (2006)
- (44) LeCun, Y., Bengio, Y., Hinton, G.: Deep learning. Nature 521(7553), 436–444 (2015). DOI 10.1038/nature14539
- (45) Lee, S., Choi, S., Yang, H.: Bag-of-binary-features for fast image representation. Electronics Letters 51(7), 555–557 (2015)
- (46) Leutenegger, S., Chli, M., Siegwart, R.: Brisk: Binary robust invariant scalable keypoints. In: Computer Vision (ICCV), 2011 IEEE International Conference on, pp. 2548–2555 (2011)
- (47) Levi, G., Hassner, T.: LATCH: learned arrangements of three patch codes. CoRR abs/1501.03719 (2015)
- (48) Lin, K., Yang, H.F., Hsiao, J.H., Chen, C.S.: Deep learning of binary hash codes for fast image retrieval. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops (2015)
- (49) Lloyd, S.: Least squares quantization in pcm. Information Theory, IEEE Transactions on 28(2), 129–137 (1982). DOI 10.1109/TIT.1982.1056489. URL http://dx.doi.org/10.1109/TIT.1982.1056489
- (50) Lowe, D.G.: Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision 60(2), 91–110 (2004). DOI 10.1023/B:VISI.0000029664.99615.94. URL http://dx.doi.org/10.1023/B:VISI.0000029664.99615.94
- (51) McLachlan, G., Peel, D.: Finite Mixture Models. Wiley series in probability and statistics. Wiley (2000)
- (52) Miksik, O., Mikolajczyk, K.: Evaluation of local detectors and descriptors for fast feature matching. In: Pattern Recognition (ICPR), 2012 21st International Conference on, pp. 2681–2684 (2012)
- (53) Perd’och, M., Chum, O., Matas, J.: Efficient representation of local geometry for large scale object retrieval. In: Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pp. 9–16 (2009)
- (54) Perronnin, F., Dance, C.: Fisher kernels on visual vocabularies for image categorization. In: Computer Vision and Pattern Recognition, 2007. CVPR ’07. IEEE Conference on, pp. 1–8 (2007). DOI 10.1109/CVPR.2007.383266
- (55) Perronnin, F., Larlus, D.: Fisher Vectors Meet Neural Networks: A Hybrid Classification Architecture. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3743–3752 (2015)
- (56) Perronnin, F., Liu, Y., Sànchez, J., Poirier, H.: Large-scale image retrieval with compressed fisher vectors. In: Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on, pp. 3384–3391 (2010). DOI 10.1109/CVPR.2010.5540009
- (57) Perronnin, F., Sànchez, J., Mensink, T.: Improving the fisher kernel for large-scale image classification. In: Computer Vision - ECCV 2010, Lecture Notes in Computer Science, vol. 6314, pp. 143–156. Springer Berlin Heidelberg (2010). DOI 10.1007/978-3-642-15561-1_11. URL http://dx.doi.org/10.1007/978-3-642-15561-1_11
- (58) Philbin, J., Chum, O., Isard, M., Sivic, J., Zisserman, A.: Object retrieval with large vocabularies and fast spatial matching. In: Computer Vision and Pattern Recognition (CVPR), 2007 IEEE Conference on, pp. 1–8 (2007). DOI 10.1109/CVPR.2007.383172
- (59) Philbin, J., Chum, O., Isard, M., Sivic, J., Zisserman, A.: Lost in quantization: Improving particular object retrieval in large scale image databases. In: Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on, pp. 1–8 (2008). DOI 10.1109/CVPR.2008.4587635
- (60) Razavian, A.S., Azizpour, H., Sullivan, J., Carlsson, S.: CNN features off-the-shelf: an astounding baseline for recognition. In: Computer Vision and Pattern Recognition Workshops (CVPRW), 2014 IEEE Conference on, pp. 512–519. IEEE (2014). DOI 10.1109/CVPRW.2014.131
- (61) Rublee, E., Rabaud, V., Konolige, K., Bradski, G.: Orb: An efficient alternative to sift or surf. In: Computer Vision (ICCV), 2011 IEEE International Conference on, pp. 2564–2571 (2011)
- (62) Salton, G., McGill, M.J.: Introduction to Modern Information Retrieval. McGraw-Hill, Inc., New York, NY, USA (1986)
- (63) Sànchez, J., Perronnin, F., Mensink, T., Verbeek, J.: Image classification with the fisher vector: Theory and practice. International Journal of Computer Vision 105(3), 222–245 (2013). DOI 10.1007/s11263-013-0636-x. URL http://dx.doi.org/10.1007/s11263-013-0636-x
- (64) Sànchez, J., Redolfi, J.: Exponential family fisher vector for image classification. Pattern Recognition Letters 59, 26–32 (2015). DOI 10.1016/j.patrec.2015.03.010
- (65) Simonyan, K., Vedaldi, A., Zisserman, A.: Deep fisher networks for large-scale image classification. In: C.J.C. Burges, L. Bottou, M. Welling, Z. Ghahramani, K.Q. Weinberger (eds.) Advances in Neural Information Processing Systems 26, pp. 163–171. Curran Associates, Inc. (2013)
- (66) Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. CoRR abs/1409.1556 (2014). URL http://arxiv.org/abs/1409.1556
- (67) Sivic, J., Zisserman, A.: Video google: A text retrieval approach to object matching in videos. In: Proceedings of the Ninth IEEE International Conference on Computer Vision, ICCV ’03, vol. 2, pp. 1470–1477. IEEE Computer Society (2003). DOI 10.1109/ICCV.2003.1238663
- (68) Smeulders, A.W.M., Worring, M., Santini, S., Gupta, A., Jain, R.: Content-based image retrieval at the end of the early years. IEEE Trans. Pattern Anal. Mach. Intell. 22(12), 1349–1380 (2000)
- (69) Sydorov, V., Sakurada, M., Lampert, C.H.: Deep fisher kernels - end to end learning of the fisher kernel gmm parameters. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2014)
- (70) Tolias, G., Avrithis, Y.: Speeded-up, relaxed spatial matching. In: Computer Vision (ICCV), 2011 IEEE International Conference on, pp. 1653–1660 (2011). DOI 10.1109/ICCV.2011.6126427
- (71) Tolias, G., Furon, T., Jégou, H.: Orientation covariant aggregation of local descriptors with embeddings. In: D. Fleet, T. Pajdla, B. Schiele, T. Tuytelaars (eds.) Computer Vision - ECCV 2014, Lecture Notes in Computer Science, vol. 8694, pp. 382–397. Springer International Publishing (2014)
- (72) Tolias, G., Jégou, H.: Local visual query expansion: Exploiting an image collection to refine local descriptors. Research Report RR-8325 (2013). URL https://hal.inria.fr/hal-00840721
- (73) Uchida, Y., Sakazawa, S.: Image retrieval with fisher vectors of binary features. In: Pattern Recognition (ACPR), 2013 2nd IAPR Asian Conference on, pp. 23–28 (2013)
- (74) Ullman, S.: High-Level Vision - Object Recognition and Visual Cognition. MIT Press (1996)
- (75) Uricchio, T., Bertini, M., Seidenari, L., Del Bimbo, A.: Fisher encoded convolutional bag-of-windows for efficient image retrieval and social image tagging. In: The IEEE International Conference on Computer Vision (ICCV) Workshops (2015)
- (76) Van Opdenbosch, D., Schroth, G., Huitl, R., Hilsenbeck, S., Garcea, A., Steinbach, E.: Camera-based indoor positioning using scalable streaming of compressed binary image signatures. In: IEEE International Conference on Image Processing (2014)
- (77) Wang, J., Yang, J., Yu, K., Lv, F., Huang, T., Gong, Y.: Locality-constrained linear coding for image classification. In: Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on, pp. 3360–3367 (2010)
- (78) Witten, I.H., Moffat, A., Bell, T.C.: Managing gigabytes: compressing and indexing documents and images. Morgan Kaufmann (1999)
- (79) Yang, J., Yu, K., Gong, Y., Huang, T.: Linear spatial pyramid matching using sparse coding for image classification. In: Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pp. 1794–1801 (2009)
- (80) Yue-Hei Ng, J., Yang, F., Davis, L.S.: Exploiting local features from deep networks for image retrieval. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops (2015)
- (81) Zezula, P., Amato, G., Dohnal, V., Batko, M.: Similarity Search: The Metric Space Approach, Advances in Database Systems, vol. 32. Springer (2006)
- (82) Zhang, Y., Zhu, C., Bres, S., Chen, L.: Encoding local binary descriptors by bag-of-features with hamming distance for visual object categorization. In: P. Serdyukov, P. Braslavski, S. Kuznetsov, J. Kamps, S. Rüger, E. Agichtein, I. Segalovich, E. Yilmaz (eds.) Advances in Information Retrieval, Lecture Notes in Computer Science, vol. 7814, pp. 630–641. Springer Berlin Heidelberg (2013)
- (83) Zhao, W., Jégou, H., Gravier, G.: Oriented pooling for dense and non-dense rotation-invariant features. In: BMVC - 24th British Machine Vision Conference (2013)
- (84) Zhou, B., Lapedriza, A., Xiao, J., Torralba, A., Oliva, A.: Learning deep features for scene recognition using places database. In: Z. Ghahramani, M. Welling, C. Cortes, N. Lawrence, K. Weinberger (eds.) Advances in Neural Information Processing Systems 27, pp. 487–495. Curran Associates, Inc. (2014)
Appendix A Score vector computation
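In the following, we report the computation of the score function $G^X_\lambda$, defined as the gradient of the log-likelihood of the data $X = \{x_1, \dots, x_T\}$ with respect to the parameters $\lambda$ of a Bernoulli Mixture Model. Throughout this appendix, we use the notation $[\![\cdot]\!]$ for the Iverson bracket, which equals one if its argument is true and zero otherwise.
Under the independence assumption, the Fisher score with respect to the generic parameter $\lambda_k$ is expressed as:
$$G^X_{\lambda_k} = \sum_{t=1}^{T} \frac{\partial \log p(x_t|\lambda)}{\partial \lambda_k} = \sum_{t=1}^{T} \frac{1}{p(x_t|\lambda)} \frac{\partial}{\partial \lambda_k}\left[\sum_{i=1}^{K} w_i\, p_i(x_t)\right].$$
To compute $\frac{\partial}{\partial \lambda_k}\left[\sum_{i=1}^{K} w_i\, p_i(x_t)\right]$, we first observe that
$$\frac{\partial w_i}{\partial \alpha_k} = w_k\left([\![i=k]\!] - w_i\right), \qquad \frac{\partial p_i(x_t)}{\partial \mu_{kd}} = [\![i=k]\!]\; p_k(x_t)\, \frac{x_{td}-\mu_{kd}}{\mu_{kd}(1-\mu_{kd})}.$$
Hence, the Fisher score with respect to the parameter $\alpha_k$ is obtained as
$$G^X_{\alpha_k} = \sum_{t=1}^{T} \frac{1}{p(x_t|\lambda)} \sum_{i=1}^{K} p_i(x_t)\, w_k\left([\![i=k]\!] - w_i\right) = \sum_{t=1}^{T} \left(\gamma_t(k) - w_k\right),$$
and the Fisher score related to the parameter $\mu_{kd}$ is
$$G^X_{\mu_{kd}} = \sum_{t=1}^{T} \frac{w_k}{p(x_t|\lambda)} \frac{\partial p_k(x_t)}{\partial \mu_{kd}} = \sum_{t=1}^{T} \gamma_t(k)\, \frac{x_{td}-\mu_{kd}}{\mu_{kd}(1-\mu_{kd})}.$$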
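As a numerical sanity check of the derivation in this appendix, the following Python sketch evaluates the Fisher score of a small BMM in closed form. It assumes mixture weights parameterised by a softmax over $\alpha_k$ and the occupancy probability $\gamma_t(k) = w_k\, p_k(x_t) / p(x_t|\lambda)$; the function names (`bmm_score`, `log_likelihood`) and the toy data are illustrative only and are not part of the original work.

```python
import math

def bernoulli_pdf(x, mu):
    """p_k(x) = prod_d mu_d^x_d * (1 - mu_d)^(1 - x_d) for a binary vector x."""
    p = 1.0
    for xd, md in zip(x, mu):
        p *= md if xd else (1.0 - md)
    return p

def softmax(alpha):
    """Mixture weights w_k = exp(alpha_k) / sum_i exp(alpha_i) (assumed parameterisation)."""
    m = max(alpha)
    e = [math.exp(a - m) for a in alpha]
    s = sum(e)
    return [v / s for v in e]

def log_likelihood(X, alpha, mu):
    """Log-likelihood of independent binary descriptors X under the BMM."""
    w = softmax(alpha)
    return sum(
        math.log(sum(wk * bernoulli_pdf(x, mk) for wk, mk in zip(w, mu)))
        for x in X
    )

def bmm_score(X, alpha, mu):
    """Closed-form score: sum_t (gamma_t(k) - w_k) for alpha_k and
    sum_t gamma_t(k) * (x_td - mu_kd) / (mu_kd * (1 - mu_kd)) for mu_kd."""
    w = softmax(alpha)
    K, D = len(mu), len(mu[0])
    g_alpha = [0.0] * K
    g_mu = [[0.0] * D for _ in range(K)]
    for x in X:
        joint = [w[k] * bernoulli_pdf(x, mu[k]) for k in range(K)]
        px = sum(joint)
        for k in range(K):
            gamma = joint[k] / px  # occupancy probability of component k
            g_alpha[k] += gamma - w[k]
            for d in range(D):
                g_mu[k][d] += gamma * (x[d] - mu[k][d]) / (mu[k][d] * (1.0 - mu[k][d]))
    return g_alpha, g_mu

# Toy example: four 3-bit descriptors and a 2-component mixture.
X = [(1, 0, 1), (0, 0, 1), (1, 1, 0), (1, 0, 0)]
alpha = [0.2, -0.4]
mu = [[0.7, 0.2, 0.6], [0.3, 0.8, 0.4]]
g_alpha, g_mu = bmm_score(X, alpha, mu)
```

Comparing `bmm_score` against central finite differences of `log_likelihood` is a simple way to confirm that the closed forms match the gradient. The components with respect to $\alpha$ sum to zero, because the occupancy probabilities and the weights both sum to one.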
Appendix B Approximation of the Fisher Information Matrix
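Our derivation of the FIM is based on the assumption (see also perronnin10 ; sanchez13 ) that for each observation $x_t$ the distribution of the occupancy probability $\gamma_t(\cdot)$ is sharply peaked, i.e. there is one Bernoulli index $k$ such that $\gamma_t(k) \approx 1$ and $\gamma_t(i) \approx 0$, $\forall i \neq k$. This assumption implies that
$$\gamma_t(k)\,\gamma_t(i) \approx [\![i=k]\!]\;\gamma_t(k) \qquad \forall\, i, k,$$
where $[\![\cdot]\!]$ is the Iverson bracket.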
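The elements of the FIM are defined as:
$$[F_\lambda]_{i,j} = \mathbb{E}_{x \sim p(\cdot|\lambda)}\left[\frac{\partial \log p(x|\lambda)}{\partial \lambda_i}\, \frac{\partial \log p(x|\lambda)}{\partial \lambda_j}\right].$$
Hence, the FIM is symmetric and can be written as the block matrix
$$F_\lambda = \begin{bmatrix} F_{\alpha,\alpha} & F_{\alpha,\mu} \\ F_{\alpha,\mu}^{\top} & F_{\mu,\mu} \end{bmatrix}.$$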
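The definition of the FIM as an expectation of products of score components can be evaluated exactly for a very small model by enumerating all binary vectors in $\{0,1\}^D$. The sketch below does so under the same assumptions as before (softmax-parameterised weights, parameters ordered as all $\alpha_k$ followed by all $\mu_{kd}$); the names `per_sample_score` and `exact_fim` and the toy parameters are hypothetical, and the enumeration is only feasible for tiny $D$.

```python
import itertools

def per_sample_score(x, w, mu):
    """Return p(x | lambda) and the gradient of log p(x | lambda), flattened as
    (alpha_1..alpha_K, mu_11..mu_KD). Weights w are assumed to come from a softmax."""
    K, D = len(mu), len(mu[0])
    joint = []
    for k in range(K):
        pk = 1.0
        for d in range(D):
            pk *= mu[k][d] if x[d] else 1.0 - mu[k][d]
        joint.append(w[k] * pk)
    px = sum(joint)
    gamma = [j / px for j in joint]  # occupancy probabilities
    s_alpha = [gamma[k] - w[k] for k in range(K)]
    s_mu = [
        gamma[k] * (x[d] - mu[k][d]) / (mu[k][d] * (1.0 - mu[k][d]))
        for k in range(K)
        for d in range(D)
    ]
    return px, s_alpha + s_mu

def exact_fim(w, mu):
    """F[i][j] = sum over all x in {0,1}^D of p(x) * score_i(x) * score_j(x)."""
    K, D = len(mu), len(mu[0])
    n = K + K * D
    F = [[0.0] * n for _ in range(n)]
    for x in itertools.product((0, 1), repeat=D):
        px, s = per_sample_score(x, w, mu)
        for i in range(n):
            for j in range(n):
                F[i][j] += px * s[i] * s[j]
    return F

# Toy model: 2 components over 3-bit descriptors.
w = [0.6, 0.4]
mu = [[0.9, 0.1, 0.8], [0.2, 0.7, 0.3]]
F = exact_fim(w, mu)
```

The first $K$ rows and columns of `F` form the block for the weight parameters and the remaining $KD$ the block for the Bernoulli means, so the symmetric block structure can be inspected directly on such a toy model.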
By using the definition of the occupancy probability (i.e. $\gamma_t(k) = w_k\, p_k(x_t) / p(x_t|\lambda)$) and the fact that $p_k$ is the distribution of a $D$-dimensional Bernoulli of mean $\mu_k$, we have the following useful equalities: