1 Introduction
Content-Based Image Retrieval (CBIR) is a relevant topic studied by many scientists over the last decades. CBIR refers to the possibility of organizing archives containing digital pictures so that they can be searched and retrieved by using their visual content datta05 . A specialization of the basic CBIR techniques is object recognition Ullman96 , where the visual content of images is analyzed so that objects contained in digital pictures are recognized, and/or images containing specific objects are retrieved. Techniques of CBIR and object recognition are becoming increasingly popular in many web search engines, where images can be searched by using their visual content GoogleImages ; BingImages , and in smartphone apps, where information can be obtained by pointing the smartphone camera toward a monument, a painting, or a logo GoogleGoggles .
During the last few years, local descriptors, such as SIFT lowe04 , SURF bay06 , BRISK leutenegger11 , and ORB rublee11 , to cite a few, have been widely used to support effective CBIR and object recognition tasks. A local descriptor is generally a histogram representing statistics of the pixels in the neighborhood of an interest point (automatically) chosen in an image. Among the promising properties offered by local descriptors, we mention the possibility of helping to mitigate the so-called semantic gap Smeulders:2000:CIR:357871.357873 , that is, the gap between the visual representation of images and their semantic content: in most cases visual similarity does not imply semantic similarity.
Executing image retrieval and object recognition tasks relying on local features is generally resource demanding. Each digital image, whether a query or an image in the digital archive, is typically described by thousands of local descriptors. In order to decide that two images match, because they contain the same or similar objects, the local descriptors of the two images need to be compared to identify matching patterns. This poses some problems when local descriptors are used on devices with low resources, as for instance smartphones, or when response time must be very fast even in the presence of huge digital archives. On one hand, the cost of extracting local descriptors, storing all descriptors of all images, and performing feature matching between two images must be reduced to allow their interactive use on devices with limited resources. On the other hand, compact representations of local descriptors and ad hoc index structures for similarity matching ZADB06Similarity are needed to allow image retrieval to scale up to very large digital picture archives. These issues have been addressed by following two different directions.
To reduce the cost of extracting, representing, and matching local visual descriptors, researchers have investigated the use of binary local descriptors, as for instance BRISK and ORB. Binary features are built from a set of pairwise intensity comparisons; thus, each bit of the descriptor is the result of exactly one comparison. Binary descriptors are much faster to extract, are obviously more compact than non-binary ones, and can also be matched faster by using the Hamming distance Hamming:1950:EDE rather than the Euclidean distance. For example, in rublee11 it has been shown that ORB is an order of magnitude faster than SURF, and over two orders faster than SIFT. However, note that even if binary local descriptors are compact, each image is still associated with thousands of local descriptors, making it difficult to scale up to very large digital archives.
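As an illustrative sketch (function and descriptor values here are ours, not taken from any specific library), matching two binary descriptors with the Hamming distance reduces to a bitwise XOR followed by a population count:

```python
import numpy as np

def hamming_distance(a: np.ndarray, b: np.ndarray) -> int:
    """Hamming distance between two binary descriptors packed as uint8 arrays.

    A 256-bit descriptor such as ORB's is stored as 32 bytes; the distance
    is the number of differing bits, computed with XOR + popcount.
    """
    return int(np.unpackbits(np.bitwise_xor(a, b)).sum())

# Two toy 256-bit descriptors
d1 = np.zeros(32, dtype=np.uint8)
d2 = np.zeros(32, dtype=np.uint8)
d2[0] = 0b00000111  # differs from d1 in 3 bits
print(hamming_distance(d1, d2))  # 3
```

Since XOR and popcount are single machine instructions on most CPUs, this is why Hamming matching of binary descriptors is so much faster than Euclidean matching of real-valued ones.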
The use of the information provided by each local feature is crucial for tasks such as image stitching and 3D reconstruction. For other tasks, such as image classification and retrieval, high effectiveness has been achieved using quantization and/or aggregation techniques, which provide a meaningful summarization of all the features extracted from an image jegou10:VLAD . One profitable outcome of using quantization/aggregation techniques is that they allow us to represent an image by a single descriptor rather than thousands of descriptors. This reduces the cost of image comparison and allows the search to scale up to large databases. On one hand, quantization methods, as for instance the Bag-of-Words approach (BoW) sivic03 , define a finite vocabulary of “visual words”, that is, a finite set of local descriptors to be used as representatives. Every possible local descriptor is thus represented by its closest visual word, that is, the closest element of the vocabulary. In this way images are described by a set (a bag) of identifiers of representatives, rather than a set of histograms. On the other hand, aggregation methods, as for instance Fisher Vectors (FV) perronnin07 or Vectors of Locally Aggregated Descriptors (VLAD) jegou10:VLAD , analyze the local descriptors contained in an image to create statistical summaries that still preserve the discriminative power of local descriptors and allow treating them as global descriptors. In both cases, index structures for approximate or similarity matching ZADB06Similarity can be used to guarantee scalability on very large datasets.
Since quantization and aggregation methods are defined and used almost exclusively in conjunction with non-binary features, the cost of extracting local descriptors and quantizing/aggregating them on the fly is still high. Recently, some approaches that attempt to integrate binary local descriptors with quantization and aggregation methods have been proposed in the literature galvez11 ; grana13 ; lee15 ; van14 ; uchida13 ; zhang13 . In these proposals, the aggregation is directly applied on top of binary local descriptors. The objective is to improve efficiency and reduce the computing resources needed for image matching by leveraging the advantages of both aggregation techniques (effective compact image representation) and binary local features (fast feature extraction), while reducing, or eliminating, the disadvantages.
The contribution of this paper is an extensive comparison and analysis of aggregation and quantization methods applied to binary local descriptors, together with a novel formulation of the Fisher Vector built using the Bernoulli Mixture Model (BMM), referred to as BMMFV. Moreover, we investigate the combination of BMMFVs and other encodings of binary features with Convolutional Neural Network features razavian2014cnn as a further use case of binary feature aggregation. We focus on cases where, for efficiency reasons rublee11 ; heinly12 , binary features are extracted and used to represent images. Thus, we compare aggregations of binary features in order to find the most suitable techniques to avoid direct matching. We expect this topic to be relevant for applications that use binary features on devices with low CPU and memory resources, as for instance mobile and wearable devices. In these cases the combination of aggregation methods with binary local features is very useful and allows image search to scale up to large collections, where direct matching is not feasible.
This paper extends our early work on aggregations of binary features amato16 by a) providing a formulation of the Fisher Vector built using the Bernoulli Mixture Model (BMM) which preserves the structure of the traditional FV built using a Gaussian Mixture Model (existing implementations of the FV can be easily adapted to work also with BMMs); b) comparing the BMMFV against the other state-of-the-art aggregation approaches on two standard benchmarks (INRIA Holidays jegou08 and Oxford5k philbin07 )^{1}; c) evaluating the BMMFV on top of several binary local features (ORB rublee11 , LATCH levi15_LATCH , AKAZE akaze ) whose performance has not yet been reported on image retrieval benchmarks; d) evaluating the combination of the BMMFV with the emerging Convolutional Neural Network (CNN) features, including experiments on a large scale.
^{1}With respect to the experimental setting used in our previous work amato16 , we improved the computation of the local features before the aggregation phase, which allowed us to obtain better performances for BoW and VLAD on the INRIA Holidays dataset than those reported in amato16 .
The results of our experiments show that the use of aggregation and quantization methods with binary local descriptors is generally effective even if, as expected, retrieval performance is worse than that obtained applying the same aggregation and quantization methods directly to non-binary features. The BMMFV approach provided us with performance results that are better than all the other aggregation methods on binary descriptors. In addition, our results show that some aggregation methods lead to very compact image representations with a retrieval performance comparable to direct matching, which is currently the most used approach to evaluate the similarity of images described by binary local features. Moreover, we show that the combination of BMMFV and CNN improves the latter's retrieval performance and achieves effectiveness comparable with that obtained combining CNN and FV built upon SIFTs, previously proposed in chandrasekhar2015 . The advantage of combining BMMFV and CNN instead of combining the traditional FV and CNN is that BMMFV relies on binary features, whose extraction is noticeably faster than SIFT extraction.
The paper is organized as follows. Section 2 offers an overview of related work on local features, binary local features, and aggregation methods.
Section 3 discusses how existing aggregation methods can be used with binary local features. It also presents our approach for applying Fisher Vectors to binary local features and how to combine it with CNN features. Section 4 discusses the evaluation experiments and the obtained results. Section 5 concludes.
2 Related Work
The search for effective representations of visual features for images has received much attention over the last two decades. The use of local features, such as SIFT lowe04 and SURF bay06 , is at the core of many computer vision applications, since it allows systems to efficiently match local structures between images. To date, the most used and cited local feature is the Scale Invariant Feature Transform (SIFT) lowe04 . The success of SIFT is due to its distinctiveness, which enables effectively finding correct matches between images. However, SIFT extraction is costly due to the local image gradient computations. In bay06 integral images were used to speed up the computation and the SURF feature was proposed as an efficient approximation of SIFT. To further reduce the cost of extracting, representing, and matching local visual descriptors, researchers have investigated binary local descriptors. These features have a compact binary representation that is not the result of a quantization, but rather is computed directly from pixel-intensity comparisons. One of the early studies in this direction was the Binary Robust Independent Elementary Features (BRIEF) calonder10 . Rublee et al. rublee11 proposed a binary feature, called ORB (Oriented FAST and Rotated BRIEF), whose extraction process is an order of magnitude faster than SURF, and two orders faster than SIFT, according to the experimental results reported in rublee11 ; Miksik2012 ; heinly12 . Recently, several other binary local features have been proposed, such as BRISK leutenegger11 , AKAZE akaze , and LATCH levi15_LATCH .
Local features have been widely used in the literature and in applications; however, since each image is represented by thousands of local features, a significant amount of memory and time is required to compare local features within large databases. Aggregation techniques have been introduced to summarize the information contained in all the local features extracted from an image into a single descriptor. The advantage is twofold: 1) reduction of the cost of image comparison (each image is represented by a single descriptor rather than thousands of descriptors); 2) aggregated descriptors have proved to be particularly effective for image retrieval and classification tasks.
By far the most popular aggregation method has been the Bag-of-Words (BoW) sivic03 . BoW was initially proposed for matching objects in videos and has been studied in many other papers, such as csurka04 ; philbin07 ; jegou10:improvingBoW ; jegou10:VLAD , for classification and CBIR tasks. BoW uses a visual vocabulary to quantize the local descriptors extracted from images; each image is then represented by a histogram of occurrences of visual words. The BoW approach used in computer vision is very similar to the BoW used in natural language processing and information retrieval salton86 ; thus many text indexing techniques, such as inverted files witten99 , have been applied to image search. Search results obtained using BoW in CBIR have been improved by exploiting additional geometrical information philbin07 ; perdoch09 ; tolias11:SpeededUp ; zhao13 , applying re-ranking approaches philbin07 ; jegou08 ; chum07 ; tolias13:queryExp , or using better encoding techniques, such as Hamming Embedding jegou08 , soft/multiple assignment philbin08 ; vanGemert08 ; jegou10:improvingBoW , sparse coding yang09 ; boureau10 , locality-constrained linear coding wang10 , and spatial pyramids lazebnik06 .
Recently, alternative encoding schemes, like the Fisher Vector (FV) perronnin07 and the Vector of Locally Aggregated Descriptors (VLAD) jegou10:VLAD , have attracted much attention because of their effectiveness in both image classification and large-scale image search. The FV uses the Fisher Kernel framework jaakkola98 to transform an incoming set of descriptors into a fixed-size vector representation. The basic idea is to characterize how a sample of descriptors deviates from an average distribution that is modeled by a parametric generative model. The Gaussian Mixture Model (GMM) mclachlan2000 is typically used as the generative model and might be understood as a “probabilistic visual vocabulary”. While BoW counts the occurrences of visual words and so takes into account just 0-order statistics, the FV offers a more complete representation by encoding higher-order statistics (first, and optionally second order) related to the distribution of the descriptors. The FV also results in a more efficient representation, since fewer visual words are required in order to achieve a given performance. However, the vector representation obtained using BoW is typically quite sparse, while that obtained using the Fisher Kernel is almost dense. This leads to some storage and input/output issues that have been addressed by using techniques of dimensionality reduction, such as Principal Component Analysis (PCA) bishop06 , compression with product quantization gray98 ; jegou11:PQ , and binary codes perronnin10 . In chandrasekhar2015 a fusion of FV and CNN features razavian2014cnn ; DeCaf was proposed, and other works perronnin2015 ; Simonyan2013 ; Sydorov2014 have started exploring the combination of FVs and CNNs by defining hybrid architectures.
The VLAD method, similarly to BoW, starts with the quantization of the local descriptors of an image by using a visual vocabulary learned by k-means. Differently from BoW, VLAD encodes the accumulated difference between the visual words and the associated descriptors, rather than just the number of descriptors assigned to each visual word. Thus, VLAD exploits more aspects of the distribution of the descriptors assigned to a visual word. As highlighted in jegou12 , VLAD might be viewed as a simplified non-probabilistic version of the FV. In the original scheme jegou10:VLAD , as for the FV, VLAD was L2-normalized. Subsequently, a power normalization step was introduced for both VLAD and FV jegou12 ; perronnin10 . Furthermore, PCA dimensionality reduction and product quantization were applied, and several enhancements to the basic VLAD were proposed arandjelovic13:allAbVALD ; chen11 ; delhumeau13 ; zhao13 .
Aggregation methods have been defined and used almost exclusively in conjunction with local features that have a real-valued representation, such as SIFT and SURF. Few articles have addressed the problem of modifying the state-of-the-art aggregation methods to work with the emerging binary local features. In galvez11 ; zhang13 ; grana13 ; lee15 the use of ORB descriptors was integrated into the BoW model by using different clustering algorithms. In galvez11 the visual vocabulary was calculated by binarizing the centroids obtained using the standard k-means. In zhang13 ; grana13 ; lee15 the k-means clustering was modified to fit the binary features by replacing the Euclidean distance with the Hamming distance, and by replacing the mean operation with the median operation. In van14 the VLAD image signature was adapted to work with binary descriptors: k-means is used for learning the visual vocabulary and the VLAD vectors are computed in conjunction with an intra-normalization and a final binarization step. Recently, the FV scheme has also been adapted for use with binary descriptors: Uchida et al. uchida13 derived a FV where the Bernoulli Mixture Model was used instead of the GMM to model binary descriptors, while Sanchez and Redolfi sanchez15 generalized the FV formalism to a broader family of distributions, known as the exponential family, that encompasses the Bernoulli distribution as well as the Gaussian one.
3 Image Representations
In order to decide whether two images contain the same object or have similar visual content, one needs an appropriate mathematical description of each image. In this section, we describe some of the most prominent approaches to transform an input image into a numerical descriptor. First we describe the principal aggregation techniques and their application to binary local features. Then, the emerging CNN features are presented.
3.1 Aggregation of local features
In the following we review how quantization and aggregation methods have been adapted to cope with binary features. Specifically we present the BoW sivic03 , the VLAD jegou10:VLAD and the FV perronnin07 approaches.
3.1.1 BagofWords
The Bag of (Visual) Words (BoW) sivic03 uses a visual vocabulary to group together the local descriptors of an image and represent each image as a set (bag) of visual words. The visual vocabulary is built by clustering the local descriptors of a dataset, e.g. by using k-means kmeans . The cluster centers, named centroids, act as the visual words of the vocabulary and are used to quantize the local descriptors extracted from the images. Specifically, each local descriptor of an image is assigned to its closest centroid and the image is represented by a histogram of occurrences of the visual words. The retrieval phase is performed using text retrieval techniques, where visual words are used in place of text words and a query image is treated as a disjunctive term-query. Typically, the cosine similarity measure in conjunction with a term weighting scheme, e.g. term frequency-inverse document frequency (tf-idf), is adopted for evaluating the similarity between any two images.
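The quantization step just described can be sketched as follows (a toy example with hypothetical 2-D descriptors and a 3-word vocabulary; real vocabularies contain thousands of higher-dimensional words):

```python
import numpy as np

def bow_histogram(descriptors: np.ndarray, centroids: np.ndarray) -> np.ndarray:
    """Assign each local descriptor to its nearest centroid (visual word)
    and return the histogram of visual-word occurrences."""
    # Euclidean nearest-centroid assignment
    dists = np.linalg.norm(descriptors[:, None, :] - centroids[None, :, :], axis=2)
    words = dists.argmin(axis=1)
    return np.bincount(words, minlength=len(centroids)).astype(float)

def cosine_similarity(h1: np.ndarray, h2: np.ndarray) -> float:
    return float(h1 @ h2 / (np.linalg.norm(h1) * np.linalg.norm(h2)))

# Toy vocabulary of 3 visual words in a 2-D descriptor space
vocab = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]])
img = np.array([[0.1, 0.0], [0.9, 1.1], [1.0, 1.0], [4.8, 5.2]])
h = bow_histogram(img, vocab)
print(h)  # [1. 2. 1.]
```

In a full system the raw histograms would additionally be weighted by tf-idf before computing the cosine similarity.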
BoW and Binary Local Features
In order to extend the BoW scheme to deal with binary features, we need a clustering algorithm able to deal with binary strings and the Hamming distance. The k-medoids algorithm kaufman87 is suitable for this scope, but it requires a computational effort to calculate a full distance matrix between the elements of each cluster. In grana13 it was proposed to use a voting scheme, named k-majority, to process a collection of binary vectors and seek a set of good centroids, which become the visual words of the BoW model. An equivalent representation is given also in zhang13 ; lee15 , where the BoW model and the k-means clustering have been modified to fit the binary features by replacing the Euclidean distance with the Hamming distance, and by replacing the mean operation with the median operation.
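A minimal sketch of the majority-voting centroid update at the core of such binary clustering schemes, assuming descriptors are unpacked into 0/1 vectors (the function name is ours, for illustration only):

```python
import numpy as np

def majority_centroid(cluster: np.ndarray) -> np.ndarray:
    """Bitwise-majority centroid of a cluster of binary descriptors
    (rows are descriptors, columns are bits): each bit of the centroid
    takes the most frequent value in that position, which yields the
    binary vector minimizing the total Hamming distance to the cluster."""
    return (cluster.sum(axis=0) * 2 >= len(cluster)).astype(np.uint8)

cluster = np.array([[1, 0, 1, 1],
                    [1, 0, 0, 1],
                    [1, 1, 1, 0]], dtype=np.uint8)
print(majority_centroid(cluster))  # [1 0 1 1]
```

This bitwise vote plays the role of the mean in k-means: the assignment step uses the Hamming distance, and the update step recomputes each centroid by majority.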
3.1.2 Vector of Locally Aggregated Descriptors
The Vector of Locally Aggregated Descriptors (VLAD) was initially proposed in jegou10:VLAD . As for the BoW, a visual vocabulary $\{\mu_1,\dots,\mu_K\}$ is first learned using a clustering algorithm (e.g. k-means). Then each local descriptor $x_t$ of a given image is associated with its nearest visual word $NN(x_t)$ in the vocabulary, and for each centroid $\mu_i$ the differences of the vectors assigned to it are accumulated: $v_i = \sum_{x_t : NN(x_t)=\mu_i} (x_t - \mu_i)$. The VLAD is the concatenation of the residual vectors $v_i$, i.e. $v = [v_1^\top \cdots v_K^\top]$. All the residuals have the same size $D$, which is equal to the size of the used local features. Thus the dimensionality of the whole vector is fixed too and is equal to $KD$.
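As a sketch, the VLAD aggregation described above can be written as follows (toy data; in practice $K$ is typically 64-256 and $D$ is the local feature dimensionality):

```python
import numpy as np

def vlad(descriptors: np.ndarray, centroids: np.ndarray) -> np.ndarray:
    """Accumulate, for each centroid, the residuals of the descriptors
    assigned to it, then concatenate and L2-normalize the result.
    Output dimensionality is K * D."""
    K, D = centroids.shape
    v = np.zeros((K, D))
    # nearest-centroid assignment for every descriptor
    assign = np.linalg.norm(
        descriptors[:, None, :] - centroids[None, :, :], axis=2).argmin(axis=1)
    for x, k in zip(descriptors, assign):
        v[k] += x - centroids[k]
    v = v.ravel()
    n = np.linalg.norm(v)
    return v / n if n > 0 else v

vocab = np.array([[0.0, 0.0], [10.0, 10.0]])              # K=2 visual words
descs = np.array([[1.0, 0.0], [0.0, 1.0], [11.0, 10.0]])  # 3 local descriptors
print(vlad(descs, vocab).shape)  # (4,)
```

Note how, unlike BoW, the output keeps the direction of each residual rather than only counting assignments per word.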
VLAD and Binary Local Features
A naive way to apply the VLAD scheme to binary local descriptors is to treat binary vectors as a particular case of real-valued vectors. In this way, the k-means algorithm can be used to build the visual vocabulary and the differences between the centroids and the descriptors can be accumulated as usual. This approach has also been used in van14 , where a variation of the VLAD image signature, called BVLAD, has been defined to work with binary features. Specifically, the BVLAD is the binarization (by thresholding) of a VLAD obtained using power-law, intra-normalization, L2-normalization and multiple PCA. We have not evaluated the performance of the BVLAD because the binarization of the final image signature is out of the scope of this paper.
Similarly to BoW, various binary clustering algorithms (e.g. k-medoids and k-majority) and the Hamming distance can be used to build the visual vocabulary and associate each binary descriptor with its nearest visual word. However, as we will see, the use of binary centroids may provide less discriminant information during the computation of the residual vectors.
3.1.3 Fisher Vector
The Fisher Kernel jaakkola98 is a powerful framework adopted in the context of image classification in perronnin07 as an efficient tool to encode image local descriptors into a fixed-size vector representation. The main idea is to derive a kernel function to measure the similarity between two sets of data, such as the sets of local descriptors extracted from two images. The similarity of two sample sets $X$ and $Y$ is measured by analyzing the difference between the statistical properties of $X$ and $Y$, rather than comparing $X$ and $Y$ directly. To this scope, a probability distribution $u_\lambda$ with some parameters $\lambda$ is first estimated on a training set and used as a generative model over the space of all the possible data observations. Then each set $X$ of observations is represented by a vector, named Fisher Vector, that indicates the direction in which the parameters $\lambda$ of the probability distribution should be modified to best fit the data in $X$. In this way, two samples are considered similar if the directions given by their respective Fisher Vectors are similar. Specifically, as proposed in jaakkola98 , the similarity between two sample sets $X$ and $Y$ is measured using the Fisher Kernel, defined as $K(X,Y) = G_\lambda^{X\,\top} F_\lambda^{-1} G_\lambda^{Y}$, where $F_\lambda$ is the Fisher Information Matrix (FIM) and $G_\lambda^{X} = \nabla_\lambda \log u_\lambda(X)$ is referred to as the score function. The computation of the Fisher Kernel is costly due to the multiplication by the inverse of the FIM. However, by using the Cholesky decomposition $F_\lambda^{-1} = L_\lambda^\top L_\lambda$, it is possible to rewrite the Fisher Kernel as a Euclidean dot-product, i.e. $K(X,Y) = \mathscr{G}_\lambda^{X\,\top}\, \mathscr{G}_\lambda^{Y}$, where $\mathscr{G}_\lambda^{X} = L_\lambda\, G_\lambda^{X}$ is the Fisher Vector (FV) of $X$ perronnin10 .
Note that the FV is a fixed-size vector whose dimensionality only depends on the dimensionality of the parameters $\lambda$. The FV is further divided by the sample size $T$ in order to avoid the dependence on the sample size sanchez13 , and L2-normalized because, as proved in perronin10:improvingFK ; sanchez13 , this is a way to cancel out the fact that different images contain different amounts of image-specific information (e.g. the same object at different scales).
The distribution $u_\lambda$, which models the generative process in the space of the data observations, can be chosen in various ways. The Gaussian Mixture Model (GMM) is typically used to model the distribution of non-binary features considering that, as pointed out in mclachlan2000 , any continuous distribution can be approximated arbitrarily well by an appropriate finite Gaussian mixture. Since the Bernoulli distribution models an experiment that has only two possible outcomes (0 and 1), a reasonable alternative to characterize the distribution of a set of binary features is to use a Bernoulli Mixture Model (BMM).
FV and Binary Local Features
In this work we derive and test an extension of the FV built using a BMM, called BMMFV, to encode binary features. Specifically, we chose $u_\lambda$ to be a multivariate Bernoulli mixture with $K$ components and parameters $\lambda = \{w_k, \mu_{kd},\ k=1,\dots,K,\ d=1,\dots,D\}$:

$u_\lambda(x) = \sum_{k=1}^{K} w_k\, u_k(x)$ (1)

where

$u_k(x) = \prod_{d=1}^{D} \mu_{kd}^{x_d}\,(1-\mu_{kd})^{1-x_d}$ (2)

and

$\sum_{k=1}^{K} w_k = 1, \quad w_k \ge 0 \ \ \forall k.$ (3)

To avoid enforcing explicitly the constraints in (3), we used the soft-max formalism krapac11 ; sanchez13 for the weight parameters: $w_k = \exp(\alpha_k)\,/\,\sum_{j=1}^{K} \exp(\alpha_j)$.
Given a set $X = \{x_1,\dots,x_T\}$ of $D$-dimensional binary vectors and assuming that the samples are independent, the score vector with respect to the parameters $\alpha_k$ and $\mu_{kd}$ is calculated (see Appendix A) as the concatenation of

$G_{\alpha_k}^{X} = \sum_{t=1}^{T} \big(\gamma_t(k) - w_k\big), \qquad G_{\mu_{kd}}^{X} = \sum_{t=1}^{T} \gamma_t(k)\, \frac{x_{td} - \mu_{kd}}{\mu_{kd}(1-\mu_{kd})},$

where $\gamma_t(k)$ is the occupancy probability (or posterior probability). The occupancy probability represents the probability for the observation $x_t$ to be generated by the $k$-th Bernoulli and it is calculated as

$\gamma_t(k) = \frac{w_k\, u_k(x_t)}{\sum_{j=1}^{K} w_j\, u_j(x_t)}.$

The FV of $X$ is then obtained by normalizing the score by the matrix $L_\lambda$, which is the square root of the inverse of the FIM, and by the sample size $T$. In Appendix B we provide an approximation of the FIM under the assumption that the occupancy probability is sharply peaked on a single value of $k$ for each descriptor $x_t$, obtained following an approach very similar to that used in sanchez13 for the GMM case. By using our FIM approximation, we get the following normalized gradients:

$\mathscr{G}_{\alpha_k}^{X} = \frac{1}{T\sqrt{w_k}} \sum_{t=1}^{T} \big(\gamma_t(k) - w_k\big), \qquad \mathscr{G}_{\mu_{kd}}^{X} = \frac{1}{T\sqrt{w_k}} \sum_{t=1}^{T} \gamma_t(k)\, \frac{x_{td} - \mu_{kd}}{\sqrt{\mu_{kd}(1-\mu_{kd})}}.$
The final BMMFV is the concatenation of $\mathscr{G}_{\alpha_k}^{X}$ and $\mathscr{G}_{\mu_{kd}}^{X}$ for $k=1,\dots,K$ and $d=1,\dots,D$, and is therefore of dimension $K(D+1)$.
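A compact sketch of the computation above, assuming a BMM with weights $w$ and means $\mu$ already learned offline (all names and the toy data are illustrative):

```python
import numpy as np

def bmm_fv(X, w, mu):
    """Sketch of the BMMFV of a set X of T binary D-dimensional descriptors,
    given a Bernoulli Mixture Model with K weights w and K x D means mu.
    Returns the concatenation of the K weight gradients and the K*D mean
    gradients, i.e. a vector of dimension K*(D+1)."""
    T, D = X.shape
    # log-likelihood of each descriptor under each Bernoulli component
    log_p = X @ np.log(mu).T + (1 - X) @ np.log(1 - mu).T      # T x K
    log_wp = np.log(w) + log_p
    # occupancy (posterior) probabilities gamma_t(k), computed stably
    gamma = np.exp(log_wp - log_wp.max(axis=1, keepdims=True))
    gamma /= gamma.sum(axis=1, keepdims=True)
    # normalized gradients w.r.t. mixture weights and Bernoulli means
    g_w = (gamma - w).sum(axis=0) / (T * np.sqrt(w))           # K
    g_mu = gamma.T @ X - gamma.sum(axis=0)[:, None] * mu       # K x D
    g_mu /= T * np.sqrt(w)[:, None] * np.sqrt(mu * (1 - mu))
    return np.concatenate([g_w, g_mu.ravel()])

rng = np.random.default_rng(0)
X = (rng.random((50, 8)) > 0.5).astype(float)   # 50 toy 8-bit descriptors
w = np.array([0.5, 0.5])                        # K = 2 mixture weights
mu = np.clip(rng.random((2, 8)), 0.1, 0.9)      # Bernoulli means in (0, 1)
fv = bmm_fv(X, w, mu)
print(fv.shape)  # (18,)  i.e. K*(D+1) = 2*9
```

The power-law and L2-normalization steps discussed later would then be applied to the resulting vector.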
Table 1: Normalized gradients of the GMMFV sanchez13 , of the BMMFV in our formalization, and of the BMMFV of Uchida et al. uchida13 ($L_\lambda$ was not explicitly derived in uchida13 ).
An extension of the FV by using the BMM has also been carried out in uchida13 ; sanchez15 . Our approach differs from the one proposed in uchida13 in the approximation of the square root of the inverse of the FIM (i.e., $L_\lambda$). It is worth noting that our formalization preserves the structure of the traditional FV derived by using the GMM, where Gaussian means and variances are replaced by Bernoulli means $\mu_{kd}$ and variances $\mu_{kd}(1-\mu_{kd})$ (see Table 1).
In sanchez15 , the FV formalism was generalized to a broader family of distributions, known as the exponential family, that encompasses the Bernoulli distribution as well as the Gaussian one. However, sanchez15 lacks an explicit definition of the FV and of the FIM approximation in the case of BMM, which was out of the scope of their work. Our formulation differs from that of sanchez15 in the choice of the parameters used in the gradient computation of the score function^{2}. A similar difference holds also for the FV computed on the GMM, given that in sanchez15 the score function is computed w.r.t. the natural parameters of the Gaussian distribution rather than the mean and variance parameters which are typically used in the literature for the FV representation perronnin10 ; perronnin07 ; sanchez13 . Unfortunately, the authors of sanchez15 did not experimentally compare the FVs obtained using or not using the natural parameters.
^{2}A Bernoulli distribution of parameter $\mu$ can be written as an exponential distribution $u(x) = \exp\!\big(x\theta - \log(1+e^{\theta})\big)$, where $\theta = \log\frac{\mu}{1-\mu}$ is the natural parameter. In sanchez15 the score function is computed considering the gradient w.r.t. the natural parameters, while in this paper we used the gradient w.r.t. the standard parameter $\mu$ of the Bernoulli (as also done in uchida13 ).
Sánchez et al. sanchez13 highlight that the FV derived from a GMM can be computed in terms of the 0-order and 1-order statistics $S_k^{0} = \sum_{t=1}^{T} \gamma_t(k)$ and $S_k^{1} = \sum_{t=1}^{T} \gamma_t(k)\, x_t$. Our BMMFV can also be written in terms of these statistics as

$\mathscr{G}_{\alpha_k}^{X} = \frac{S_k^{0} - T w_k}{T\sqrt{w_k}}, \qquad \mathscr{G}_{\mu_{kd}}^{X} = \frac{S_{kd}^{1} - S_k^{0}\,\mu_{kd}}{T\sqrt{w_k}\,\sqrt{\mu_{kd}(1-\mu_{kd})}}.$
We finally applied power-law and L2-normalization to improve the effectiveness of the BMMFV approach.
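These two normalization steps can be sketched as:

```python
import numpy as np

def power_law_l2(v: np.ndarray, alpha: float = 0.5) -> np.ndarray:
    """Signed power-law normalization followed by L2 normalization,
    commonly applied to FV/VLAD vectors to reduce the influence of
    bursty components (alpha = 0.5 gives the signed square root)."""
    v = np.sign(v) * np.abs(v) ** alpha
    n = np.linalg.norm(v)
    return v / n if n > 0 else v

v = np.array([4.0, -9.0, 0.0])
print(power_law_l2(v))  # signed square roots, rescaled to unit L2 norm
```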
3.2 Combination of Convolutional Neural Network Features and Aggregations of Binary Local Features
Convolutional Neural Networks (CNNs) LeCun2015 have brought breakthroughs in the computer vision area by improving the state of the art in several domains, such as image retrieval, image classification, object recognition, and action recognition. Deep CNNs allow a machine to automatically learn representations of data with multiple levels of abstraction, which can be used for detection or classification tasks. CNNs are neural networks specialized for data that has a grid-like topology, such as image data. The applied discrete convolution operation results in a multiplication by a matrix which has several entries constrained to be equal to other entries. Three important ideas are behind the success of CNNs: sparse connectivity, parameter sharing, and equivariant representations DeepLearning2016 .
In image retrieval, the activations produced by an image within the top layers of a CNN have been successfully used as high-level descriptors of the visual content of the image DeCaf . The results reported in razavian2014cnn show that these CNN features, compared by using the Euclidean distance, achieve state-of-the-art quality in terms of mAP. Most of the papers reporting results obtained using CNN features maintain the Rectified Linear Unit (ReLU) transform DeCaf ; razavian2014cnn ; chandrasekhar2015 , i.e., negative activation values are discarded by replacing them with 0. Values are typically L2-normalized babenko2014neural ; razavian2014cnn ; chandrasekhar2015 , and we did the same in this work. In Section 4.2 we describe the CNN model used in our experiments.
Recently, in chandrasekhar2015 it has been shown that the information provided by the FV built upon SIFT helps to further improve the retrieval performance of CNN features, and a combination of FV and CNN features has been used as well chandrasekhar2015 ; amato16:JOCCH . However, the benefits of such combinations are clouded by the cost of extracting SIFTs, which can be considered too high with respect to the cost of computing the CNN features (see Table 2). Since the extraction of binary local features is up to two orders of magnitude faster than SIFT, in this work we also investigate the combination of CNN features with aggregations of binary local features, including the BMMFV.
We combined the BMMFV and CNN features using the following approach. Each image was represented by a couple $(c, f)$, where $c$ and $f$ are respectively the CNN descriptor and the BMMFV of the image. Then, we evaluated the distance between two couples $(c_1, f_1)$ and $(c_2, f_2)$ as the convex combination between the distance of the CNN descriptors (i.e. $d_{CNN}(c_1, c_2)$) and the distance of the BMMFV descriptors (i.e. $d_{FV}(f_1, f_2)$). In other words, we defined the distance between two couples $(c_1, f_1)$ and $(c_2, f_2)$ as

$d\big((c_1, f_1), (c_2, f_2)\big) = \alpha\, d_{CNN}(c_1, c_2) + (1-\alpha)\, d_{FV}(f_1, f_2)$ (4)

with $\alpha \in [0, 1]$. Choosing $\alpha = 0$ corresponds to using only the FV approach, while $\alpha = 1$ corresponds to using only the CNN features. Please note that in our case both the FV and the CNN features are L2-normalized, so the distance function between the CNN descriptors has the same range of values as the distance function between the BMMFV descriptors.
Similarly, combinations between CNN features and other image descriptors, such as GMMFV, VLAD, and BoW, can be considered by using the convex combination of the respective distances. Please note that whenever the ranges of the two distances are not the same, the distances should be rescaled before the convex combination (e.g. by dividing each distance function by its maximum value).
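A sketch of the convex combination of distances in Equation (4), assuming L2-normalized descriptors (the toy vectors below are illustrative):

```python
import numpy as np

def combined_distance(c1, f1, c2, f2, alpha=0.5):
    """Convex combination of the CNN-descriptor distance and the BMMFV
    distance; alpha = 1 uses only the CNN features, alpha = 0 only the
    BMMFV. Descriptors are assumed L2-normalized so that the two
    distances share the same range."""
    d_cnn = np.linalg.norm(np.asarray(c1) - np.asarray(c2))
    d_fv = np.linalg.norm(np.asarray(f1) - np.asarray(f2))
    return alpha * d_cnn + (1 - alpha) * d_fv

c1, c2 = np.array([1.0, 0.0]), np.array([0.0, 1.0])  # toy CNN descriptors
f1, f2 = np.array([0.0, 1.0]), np.array([0.0, 1.0])  # toy BMMFV descriptors
print(combined_distance(c1, f1, c2, f2, alpha=1.0))  # CNN distance only
```

In practice $\alpha$ is tuned on a validation set, trading off the two modalities.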
Table 2: Computing time per image (ms) for the CNN feature, the FV encoding (on SIFT and on ORB), SIFT extraction, and ORB extraction.
Average time costs for computing various image representations using a CPU implementation. The cost of computing the CNN feature of an image was estimated using a pre-learned AlexNet model and the Caffe framework caffe2014 on an Intel i7 3.5 GHz. The values related to the FV refer only to the cost of aggregating the local descriptors of an image into a single vector and do not encompass the cost of extracting the local features, nor the learning of the Gaussian or Bernoulli Mixture Model, which is calculated offline. The cost of computing the FV varies proportionally with $nKD$, where $n$ is the number of local features extracted from an image, $K$ is the number of Gaussian/Bernoulli mixture components, and $D$ is the dimensionality of each local feature; we reported the approximate cost for the settings used in our experiments, measured on an Intel i7 3.5 GHz. The cost of SIFT/ORB local feature extraction was estimated according to heinly12 by considering the average number of features extracted per image.
4 Experiments
In this section we evaluate and compare the performance of the techniques described in this paper for aggregating binary local descriptors. Specifically, in Subsection 4.3 we compare the BoW, the VLAD, the FV based on the GMM, and the BMMFV approaches for aggregating ORB binary features. Since the BMMFV achieved the best results among the tested approaches, in Subsection 4.4 we further evaluate the performance of the BMMFVs using different binary features (ORB, LATCH, AKAZE) and combining them with the CNN features. Finally, in Subsection 4.5, we report experimental results on a large scale.
In the following, we first introduce the datasets used in the evaluations (Subsection 4.1) and describe our experimental setup (Subsection 4.2). We then report the results and their analysis.
4.1 Datasets
The experiments were conducted using two benchmark datasets, namely INRIA Holidays jegou08 and Oxford5k philbin07 , that are publicly available and often used in the context of image retrieval jegou10:VLAD ; zhao13 ; jegou08 ; arandjelovic12:rootsift ; perronnin10 ; jegou12 ; tolias14 .
INRIA Holidays jegou08 is a collection of images which mainly contains personal holiday photos. The images are of high resolution and represent a large variety of scene types (natural, man-made, water, fire effects, etc.). The dataset contains 500 queries, each of which represents a distinct scene or object. For each query a list of positive results is provided. As done by the authors of the dataset, we resized the images to a maximum of pixels ( pixels for the smaller dimension) before extracting the local descriptors.
Oxford5k philbin07 consists of images collected from Flickr. The dataset comprises distinct Oxford buildings together with distractors. There are query images: queries for each building. The collection is provided with a comprehensive ground truth. For each query there are four image sets: Good (clear pictures of the object represented in the query), OK (images where more than of the object is clearly visible), Bad (images where the object is not present), and Junk (images where less than of the object is visible, or images with a high level of distortion).
As in many other articles, e.g. jegou10:VLAD ; jegou08 ; philbin08 ; jegou12 , all the learning stages (clustering, etc.) were performed offline using independent image collections. The Flickr60k dataset jegou08 was used as the training set for INRIA Holidays. It is composed of images randomly extracted from Flickr. The experiments on Oxford5k were conducted by performing the learning stages on the Paris6k dataset philbin08 , which contains high-resolution images obtained from Flickr by searching for famous Paris landmarks.
For large-scale experiments we combined the Holidays dataset with the 1 million MIRFlickr dataset huiskes08 , used as a distractor set, as also done in jegou08 ; Amato2016 . Compared to Holidays, the MIRFlickr dataset is slightly biased, because it includes low-resolution images and more photos of humans.
4.2 Experimental settings
In the following we report some details on how the features for the various approaches were extracted.
 Local features.

In the experiments we used ORB rublee11 , LATCH levi15_LATCH , and AKAZE akaze binary local features, which were extracted using OpenCV (Open Source Computer Vision Library, http://opencv.org/). We detected up to local features per image.
 Visual Vocabularies and Bernoulli/Gaussian Mixture Models.

The visual vocabularies used for building the BoW and VLAD representations were computed using several clustering algorithms, i.e. k-medoids, k-majority, and k-means. The k-means algorithm was applied to the binary features by treating the binary vectors as real-valued vectors. The parameters of the BMM and of the GMM (where K is the number of mixture components and D is the dimension of each local descriptor) were learned independently by optimizing a maximum-likelihood criterion with the Expectation-Maximization (EM) algorithm bishop06 . EM is an iterative method that is deemed to have converged when the change in the likelihood function, or alternatively in the parameters, falls below some threshold. As stopping criterion we used the convergence in norm of the mean parameters. As suggested in bishop06 , the BMM/GMM parameters used in the EM algorithm were initialized as follows: (a) equal values for the mixing coefficients; (b) random values chosen uniformly for the BMM means; (c) centroids precomputed using k-means for the GMM means; (d) the mean variance of the clusters found using k-means for the diagonal elements of the GMM covariance matrices.

All the learning stages, i.e. k-means, k-medoids, k-majority, and the estimation of the GMM/BMM, were performed using on the order of 1M descriptors randomly selected from the local features extracted on the training sets (namely Flickr60k for INRIA Holidays and Paris6k for Oxford5k).
 BoW, VLAD, FV.

The various encodings of the local features (as well as the visual vocabularies and the BMM/GMM) were computed using our Visual Information Retrieval library, which is publicly available on GitHub (https://github.com/ffalchi/it.cnr.isti.vir). These representations are all parametrized by a single integer K, which corresponds to the number of centroids (visual words) used in BoW and VLAD, and to the number of mixture components of the GMM/BMM used in the FV representations.
For the FVs, we used only the components associated with the mean vectors because, as in the non-binary case, we observed that the components related to the mixture weights do not improve the results.
As a common post-processing step perronin10:improvingFK ; jegou12 , both the FVs and the VLADs were power-law normalized and subsequently L2-normalized. The power-law normalization is parametrized by a constant β and is defined component-wise as x_i ← sign(x_i) |x_i|^β.
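For illustration, this post-processing step can be sketched as follows (the default β = 0.5, i.e. the signed square root, is the value commonly used in the literature and is assumed here; the function name is ours):

```python
import numpy as np

def power_law_l2(v, beta=0.5):
    """Power-law normalize a FV/VLAD vector, then L2-normalize it.

    beta = 0.5 (signed square root) is a common choice in the
    literature, assumed here for illustration.
    """
    v = np.sign(v) * np.abs(v) ** beta
    norm = np.linalg.norm(v)
    return v / norm if norm > 0 else v
```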
We also applied PCA to reduce the VLAD and FV dimensionality. The projection matrices were estimated on the training datasets.
 CNN features.

We used the pre-trained HybridNet zhou2014 model, downloaded from the Caffe Model Zoo (https://github.com/BVLC/caffe/wiki/Model-Zoo). The architecture of HybridNet is the same as the BVLC Reference CaffeNet (https://github.com/BVLC/caffe/tree/master/models/bvlc_reference_caffenet), which mimics the original AlexNet krizhevsky2012 , with minor variations as described in caffe2014 . It has weight layers ( convolutional + fully-connected). The model has been trained on categories ( scene categories from the Places Database zhou2014 and object categories from ImageNet deng2009 ) with about million images.

In the test phase we used Caffe and extracted the output of the first fully-connected layer (fc6) after applying the Rectified Linear Unit (ReLU) transform. The resulting 4,096-dimensional descriptors were L2-normalized.
As a preprocessing step we warped the input images to the canonical RGB resolution (as also done in DeCaf ).
 Feature comparison and performance measure.

The cosine similarity in conjunction with a term weighting scheme (e.g., tf-idf) is adopted for evaluating the similarity between BoW representations, while the Euclidean distance is used to compare VLAD, FV, and CNN-based image signatures. Please note that the Euclidean distance is equivalent to the cosine similarity whenever the vectors are L2-normalized, as in our case. Indeed, to search a database for the objects similar to a query we can use either a similarity function or a distance function: in the first case, we search for the objects with the greatest similarity to the query; in the latter, for the objects with the lowest distance from the query. A similarity function s is said to be equivalent to a distance function d if the ranked list of the results to any query is the same. For example, the Euclidean distance between two vectors x and y (d(x, y) = ||x − y||_2) is equivalent to the cosine similarity (s(x, y) = x · y / (||x||_2 ||y||_2)) whenever the vectors are L2-normalized (i.e. ||x||_2 = ||y||_2 = 1). In fact, in such a case, d(x, y)^2 = 2 − 2 s(x, y), which implies that the ranked list of the results to a query is the same (i.e., d(x, y) ≤ d(x, z) iff s(x, y) ≥ s(x, z)).
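The equivalence between the Euclidean distance and the cosine similarity for L2-normalized vectors is easy to verify numerically (a minimal sketch):

```python
import numpy as np

rng = np.random.default_rng(0)

def l2_normalize(v):
    """Scale a vector to unit Euclidean norm."""
    return v / np.linalg.norm(v)

# Two random L2-normalized vectors
x, y = (l2_normalize(rng.standard_normal(8)) for _ in range(2))

euclid = np.linalg.norm(x - y)
cosine = float(x @ y)

# For unit vectors: ||x - y||^2 = 2 - 2 * cos(x, y)
assert abs(euclid**2 - (2 - 2 * cosine)) < 1e-9
```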
The image comparison based on the direct matching of the local features (i.e. without aggregation) was performed adopting the distance ratio criterion proposed in lowe04 ; heinly12 . Specifically, candidate matches to the local features of the query image are identified by finding their nearest neighbors in the database of images. Matches are discarded if the ratio of the distances between the two closest neighbors is above the 0.8 threshold. The similarity between two images is computed as the percentage of matching pairs with respect to the total number of local features in the query image.
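A minimal sketch of this matching criterion for binary descriptors follows (NumPy only; in practice OpenCV's BFMatcher with NORM_HAMMING and knnMatch provides the same functionality; the helper name and the bit-packed uint8 representation are our assumptions):

```python
import numpy as np

def ratio_test_matches(query_desc, db_desc, ratio=0.8):
    """Lowe's distance-ratio criterion for binary descriptors.

    Descriptors are bit-packed uint8 arrays; Hamming distances are
    computed by XOR-ing and counting differing bits. Returns the
    fraction of query descriptors with an accepted match.
    """
    # Pairwise Hamming distances between packed binary descriptors
    xor = query_desc[:, None, :] ^ db_desc[None, :, :]
    dists = np.unpackbits(xor, axis=2).sum(axis=2)
    matched = 0
    for row in dists:
        two = np.partition(row, 1)[:2]          # two smallest distances
        if two[1] > 0 and two[0] / two[1] < ratio:
            matched += 1
    return matched / len(query_desc)
```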
The retrieval performance of each method was measured by the mean average precision (mAP). In the experiments on INRIA Holidays, we computed the average precision after removing the query image from the ranking list. In the experiments on Oxford5k, we removed the junk images from the ranking before computing the average precision, as recommended in philbin07 and in the evaluation package provided with the dataset.
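For reference, the average precision of a ranked result list (with the query and junk images already removed, as described above) can be computed as follows (a minimal sketch with hypothetical helper names):

```python
import numpy as np

def average_precision(ranked_ids, positives):
    """AP of a ranked result list given the set of relevant image ids."""
    hits, precision_sum = 0, 0.0
    for rank, img_id in enumerate(ranked_ids, start=1):
        if img_id in positives:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(positives) if positives else 0.0

def mean_average_precision(runs):
    """runs: list of (ranked_ids, positives) pairs, one per query."""
    return float(np.mean([average_precision(r, p) for r, p in runs]))
```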
4.3 Comparison of Various Encodings of Binary Local Features
In Table 3 we summarize the retrieval performance of various aggregation methods applied to ORB features, i.e. the BoW, the VLAD, the FV based on the GMM, and the BMMFV. In addition, in the last line of the table we report the results obtained without any aggregation, which we refer to as the direct matching of local features, performed adopting the distance ratio criterion previously described in Subsection 4.2.
In our experiments, the FV derived as in uchida13 obtained performance very similar to that of our BMMFV, thus we report just the results obtained using our formulation. Furthermore, we have not experimentally evaluated the FVs computed using the gradient with respect to the natural parameters of a BMM or a GMM, as described in sanchez15 , because the retrieval performance obtained with or without the natural parameters in the derivation of the score function is a more general topic that deserves further investigation outside the specific context of encoding binary local features.
Table 3: Retrieval performance (mAP, %) of various aggregation methods applied to ORB features on INRIA Holidays and Oxford5k.

Method  Local Feature  Learning method  K  dim  mAP Holidays  mAP Oxford5k
BoW  ORB  k-means  20,000  20,000  44.9  22.2
BoW  ORB  k-majority  20,000  20,000  44.2  22.8
BoW  ORB  k-medoids  20,000  20,000  37.9  18.8
VLAD  ORB  k-means  64  16,384  47.8  23.6
    PCA 1,024  46.0  23.2
    PCA 128  30.9  19.3
VLAD  ORB  k-majority  64  16,384  32.4  16.6
VLAD  ORB  k-medoids  64  16,384  30.6  15.6
FV  ORB  GMM  64  16,384  42.0  20.4
    PCA 1,024  42.6  20.3
    PCA 128  35.5  19.6
FV  ORB  BMM  64  16,384  49.6  24.3
    PCA 1,024  51.3  23.4
    PCA 128  44.6  19.1
No aggr.  ORB  –  –  –  38.1  31.7
Table 4: Retrieval performance (mAP, %) of the same baseline aggregation methods applied to non-binary SIFT and PCA-reduced SIFT features, taken from the literature jegou10:VLAD ; jegou12 .

Method  Local Feature  Learning method  K  dim  mAP Holidays  mAP Oxford5k
BoW  SIFT  k-means  20,000  20,000  40.4  –
BoW  SIFT PCA 64  k-means  20,000  20,000  43.7  35.4
VLAD  SIFT  k-means  64  8,192  52.6  –
    PCA 128  51.0  –
VLAD  SIFT PCA 64  k-means  64  4,096  55.6  37.8
    PCA 128  55.7  28.7
FV  SIFT  GMM  64  8,192  49.5  –
    PCA 128  49.2  –
FV  SIFT PCA 64  GMM  64  4,096  59.5  41.8
    PCA 128  56.5  30.1
Among the various baseline aggregation methods (i.e. without using PCA), the BMMFV approach achieves the best retrieval performance, that is, a mAP of 49.6% on Holidays and 24.3% on Oxford5k. PCA dimensionality reduction from 16,384 to 1,024 components, applied to the BMMFV, marginally reduces the mAP on Oxford5k, while on Holidays it allows us to reach 51.3%, which is, for this dataset, the best result among all the aggregation techniques tested on ORB binary features.
Good results are also achieved using VLAD in conjunction with k-means, which obtains a mAP of 47.8% on Holidays and 23.6% on Oxford5k.
The BoW representation yields a mAP of 44.9%/44.2%/37.9% on Holidays and 22.2%/22.8%/18.8% on Oxford5k, using respectively k-means/k-majority/k-medoids for the learning of a visual vocabulary of 20,000 visual words.
The GMMFV method gives results slightly worse than the BoW: 42.0% mAP on Holidays and 20.4% mAP on Oxford5k. The use of PCA to reduce the dimensionality from 16,384 to 1,024 leaves the results of the GMMFV on Oxford5k substantially unchanged, while it slightly improves the mAP on Holidays (42.6%).
Finally, the worst performances are those of VLAD in combination with vocabularies learned by k-majority (32.4% on Holidays and 16.6% on Oxford5k) and k-medoids (30.6% on Holidays and 15.6% on Oxford5k).
It is interesting to note that on INRIA Holidays, the VLAD with k-means, the BoW with k-means/k-majority, and the FVs perform better than the direct matching. In fact, the mAP of the direct matching of ORB descriptors is 38.1% on Holidays, while on Oxford5k the direct matching reaches a mAP of 31.7%, higher than any of the tested aggregations.


In Table 5 we also report the performance of our derivation of the BMMFV, varying the number K of Bernoulli mixture components and investigating the impact of PCA dimensionality reduction in the case of K = 64.
In Table 6(a) we can see that, on the Holidays dataset, the mAP grows from 32.0% when using only 4 mixtures to 54.7% with the largest number of mixtures tested. On Oxford5k, the mAP varies from 14.3% to 27.4%, respectively.
Table 6(b) shows that the best results are achieved when reducing the full-size BMMFV to 4,096 dimensions, with a mAP of 52.6% for Holidays and 25.1% for Oxford5k.
Analysis of the results
Summing up, the results show that in the context of binary local features the BMMFV outperforms the compared aggregation methods, namely the BoW, the VLAD, and the GMMFV. The performance of the BMMFV is an increasing function of the number of Bernoulli mixtures. However, for large K, the improvement tends to be smaller and the dimensionality of the FV becomes very large (e.g. 32,768 dimensions using K = 128). Hence, for high values of K, the benefit of the improved accuracy is not worth the computational overhead (both for the BMM estimation and for the cost of storing/comparing the FVs).
The PCA reduction of the BMMFV is effective, since it can provide a very compact image signature with just a slight loss in accuracy, as shown in the case of K = 64 (Table 6(b)). Dimensionality reduction does not necessarily reduce the accuracy; conversely, a limited reduction tends to improve the retrieval performance of the FV representations.
For the computation of VLAD, k-means is more effective than k-majority/k-medoids clustering, since the use of non-binary centroids gives more discriminant information during the computation of the residual vectors used in VLAD.
For the BoW approach, k-means and k-majority perform equally well and better than k-medoids. However, k-majority is preferable in this case because the cost of the quantization process is significantly reduced by using the Hamming distance, rather than the Euclidean one, for the comparison between centroids and binary local features.
Both the BMMFV and the VLAD, with only K = 64, outperform the BoW. However, as happens for non-binary features (see Table 4), the loss in accuracy of the BoW representation is comparatively lower when the variability of the images is limited, as for the Oxford5k dataset.
As expected, the BMMFV outperforms the GMMFV, since the probability distribution of binary local features is better described using mixtures of Bernoulli rather than mixtures of Gaussian. The results of our experiments also show that the use of the BMMFV is still effective even when compared with the direct matching strategy. In fact, the retrieval performance of the BMMFV on Oxford5k is just slightly worse than the traditional direct matching of local features, while on INRIA Holidays the BMMFV even outperforms the direct matching result.
For completeness, in Table 4 we also report the results of the same baseline encoding approaches applied to non-binary features (both full-size SIFT and SIFT PCA-reduced to 64 components), taken from the literature jegou10:VLAD ; jegou12 . As expected, aggregation methods in general exhibit better performance in combination with SIFT/SIFT-PCA than with ORB, especially on the Oxford5k dataset. However, it is worth noting that on INRIA Holidays the BMMFV outperforms the BoW on SIFT/SIFT-PCA and reaches performance similar to that of the FV built upon SIFTs.
The FV and VLAD get considerable benefit from performing PCA of the SIFT local descriptors before the aggregation phase, as the PCA rotation decorrelates the descriptor components. This suggests that techniques such as VLAD with k-means and the GMMFV, which treat binary vectors as real-valued vectors, may also benefit from the use of PCA before the aggregation phase.
In conclusion, it is important to point out that there are several applications where binary features need to be used to improve efficiency, at the cost of some reduction in effectiveness heinly12 . We showed that in this case the use of the encoding techniques represents a valid alternative to the direct matching.
4.4 Combination of CNNs and Aggregations of Binary Local Features
In this section we evaluate the retrieval performance of the combination of CNN features with the aggregations of binary local features, following the approach described in Section 3.2. We considered the INRIA Holidays dataset and used the output of the first fully-connected layer (fc6) of the HybridNet zhou2014 model as the CNN feature. In fact, in chandrasekhar2015 several experiments on INRIA Holidays have shown that HybridNet fc6 achieves better mAP results than other outputs (e.g. pool5, fc6, fc7, fc8) of several pre-trained CNN models: the OxfordNet simonyan2014 , the AlexNet krizhevsky2012 , the PlacesNet zhou2014 , and the HybridNet itself.
Figure 1 shows the mAP obtained by combining HybridNet fc6 with different aggregations of ORB binary features, namely the BMMFV, the GMMFV, the VLAD, and the BoW. Interestingly, with the exception of the GMMFV, the retrieval performance obtained after the combination is very similar for the various aggregation techniques. On the one hand, this confirms that the GMMFV is not the best choice for encoding binary features; on the other hand, since each aggregation technique computes statistical summaries of the same set of local descriptors, it suggests that the additional information provided by the various aggregated descriptors helps almost equally to improve the retrieval performance of the CNN feature. Thus, in the following we further investigate combinations of CNNs and the BMMFV, which, even if only slightly, reaches the best performance for all the tested values of the parameter α.
Table 6: mAP (%) on INRIA Holidays obtained by combining the HybridNet fc6 feature with the BMMFV (K=64) of ORB, LATCH, and AKAZE binary features, for various values of α.

Method  dim (ORB / LATCH / AKAZE)  α  mAP (ORB / LATCH / AKAZE)
BMMFV (K=64)  16,384 / 16,384 / 32,768  0  49.6 / 46.3 / 43.7
Combination of BMMFV (K=64) and HybridNet fc6  20,480 / 20,480 / 36,864  0.1  66.4 / 64.7 / 59.2
  0.2  74.8 / 73.8 / 68.7
  0.3  77.4 / 76.8 / 74.3
  0.4  79.1 / 77.5 / 77.3
  0.5  79.2 / 78.3 / 78.0
  0.6  79.0 / 78.5 / 79.2
  0.7  78.7 / 77.7 / 78.7
  0.8  77.8 / 76.7 / 77.5
  0.9  76.4 / 76.3 / 76.2
HybridNet fc6  4,096  1  75.5
In Table 6 we report the mAP obtained by combining the HybridNet fc6 feature with the BMMFV computed for three different kinds of binary local features, namely ORB, LATCH, and AKAZE, using K = 64 Bernoulli mixture components. It is worth noting that all three BMMFVs give a similar improvement when combined with HybridNet fc6, although they have rather different mAP results (see the first row of Table 6), which are substantially lower than that of the CNN (last row of Table 6). The intuition is that the additional information provided by a specific BMMFV, with respect to using the CNN feature alone, does not depend very much on the binary feature used.
For each tested BMMFV there seems to exist an optimal α to be used in the convex combination (equation (4)). When ORB binary features were used, the optimal α was around 0.5, which corresponds to giving the same importance to both the FV and the CNN feature. For the less effective BMMFVs built upon LATCH and AKAZE, the optimal α was 0.6, which means that the CNN feature is weighted slightly more than the BMMFV in the convex combination.
The use of ORB or AKAZE led to the best performance, namely a mAP of 79.2%. This corresponds to a relative improvement of 4.9% with respect to the single use of the CNN feature, which in our case achieved 75.5%. We thus obtain the same relative improvement as chandrasekhar2015 but using a less expensive FV representation. Indeed, in chandrasekhar2015 the fusion of HybridNet fc6 and a FV computed on 64-dimensional PCA-reduced SIFTs, using 256 Gaussian mixtures, led to a relative improvement of 4.9% with respect to the use of the CNN feature alone (see also Table 8).
However, the cost of integrating the traditional FV built upon SIFTs with CNN features may be considered too high, especially for systems that need to process images in real time. For example, according to heinly12 and as shown in Table 2, the SIFT extraction, the PCA reduction to 64 dimensions, and the FV aggregation together take about 4 times longer per image than the CNN feature extraction. On the other hand, extracting ORB binary features and aggregating them using a BMMFV with K = 64 requires a computing time in line with the cost of the CNN extraction. In our tests, the cost of combining the already extracted BMMFV and CNN features was negligible in the search phase, using a sequential scan to search the dataset, thanks also to the fact that both BMMFV and CNN features are compared using the not too costly Euclidean distance.
Since, as observed in levi15_LATCH ; akaze , ORB extraction is faster than LATCH and AKAZE, in the following we focus just on the ORB binary feature. In Figure 2 we show the results obtained by combining HybridNet fc6 with the BMMFVs built using various numbers K of Bernoulli mixtures. We observed that the performance of the CNN feature improves even when it is combined with the less effective BMMFV built using K = 32 Bernoulli mixtures. The BMMFV with K = 128 achieves the best effectiveness (mAP of 79.5%). However, since the cost of computing and storing the FV increases with the number of Bernoulli mixtures, the improvement obtained using K = 128 with respect to K = 64 is not worth the extra cost of using a bigger value of K.
The BMMFV with K = 64 is still high-dimensional, so to reduce the cost of storing and comparing the FV, we also evaluated the combination after PCA dimensionality reduction. As already observed, limited dimensionality reduction tends to improve the accuracy of the single FV representation. In fact, the BMMFV with K = 64 achieved a mAP of 52.6% when reduced from 16,384 to 4,096 dimensions. However, as shown in Table 7 and Table 8, when the PCA-reduced version of the BMMFV was combined with HybridNet fc6, the overall relative improvement in mAP was 3.9%, which is less than that obtained using the full-sized BMMFV. This result is not surprising, given that after the dimensionality reduction part of the additional information provided by the FV representation in the combination with the CNN feature may be lost.
Finally, in Table 8 we summarize the relative improvement achieved by combining the BMMFV and HybridNet fc6, and we compare the obtained results with the relative improvement achieved in chandrasekhar2015 , where the more expensive FV built upon SIFTs was used. We observed that the BMMFV leads to similar or even better relative improvements, with an evident advantage from the computational point of view, because it uses binary local features.
Table 7: mAP (%) on INRIA Holidays obtained using the full-size BMMFV and its PCA-reduced version, alone and in combination with HybridNet fc6.

Method  dim (FV full / FV PCA-reduced)  α  mAP (FV full / FV PCA-reduced)
BMMFV (K=64)  16,384 / 4,096  0  49.6 / 52.6
Combination of BMMFV (K=64) and HybridNet fc6  20,480 / 8,192  0.1  66.4 / 66.3
  0.2  74.8 / 73.9
  0.3  77.4 / 77.3
  0.4  79.1 / 78.5
  0.5  79.2 / 78.4
  0.6  79.0 / 78.5
  0.7  78.7 / 78.1
  0.8  77.8 / 77.7
  0.9  76.4 / 76.4
HybridNet fc6  4,096  1  75.5
Table 8: Relative mAP improvement (%) obtained by combining each FV with HybridNet fc6, with respect to using the CNN feature alone.

FV method  Local Feature  K  dim  Relative improvement (%)
BMMFV  ORB  128  32,768  5.2
BMMFV  ORB  64  16,384  4.9
BMMFV  AKAZE  64  32,768  4.9
BMMFV  LATCH  64  16,384  4.0
BMMFV + PCA  ORB  64  4,096  3.9
BMMFV  ORB  32  8,192  3.5
FV chandrasekhar2015  SIFT PCA 64  256  32,768  4.9
4.5 Large-Scale Experiments
Table 9: mAP (%) on INRIA Holidays, with and without the MIRFlickr-1M distractor set, using the BMMFV (full-size and PCA-reduced) alone and in combination with HybridNet fc6.

Method  dim (FV full / FV PCA-reduced)  α  mAP FV full (Holidays / Holidays+MIRFlickr)  mAP FV PCA-reduced (Holidays / Holidays+MIRFlickr)
BMMFV (K=64)  16,384 / 16,384  0  49.6 / 31.0  52.6 / 34.9
Combination of BMMFV (K=64) and HybridNet fc6  20,480 / 8,192  0.1  66.4 / 47.0  66.3 / 50.7
  0.2  74.8 / 59.3  73.9 / 61.9
  0.3  77.4 / 64.0  77.3 / 65.6
  0.4  79.1 / 67.1  78.5 / 67.2
  0.5  79.2 / 66.5  78.4 / 66.9
  0.6  79.0 / 65.7  78.5 / 65.7
  0.7  78.7 / 64.4  78.1 / 64.4
  0.8  77.8 / 62.5  77.7 / 62.8
  0.9  76.4 / 60.7  76.4 / 60.8
HybridNet fc6  4,096  1  75.5 / 59.1  75.5 / 59.1
Maximum relative mAP improvement      4.9% / 13.4%  4.0% / 13.7%
In order to evaluate the behavior of the feature combinations on a large scale, we used a set of up to one million images. More precisely, as in jegou08 , we merged the INRIA Holidays dataset with a public large-scale dataset (MIRFlickr-1M huiskes08 ) used as a distractor set; the mAP was measured using the Holidays ground truth.
Table 9 reports the results obtained using both the BMMFV alone and its combinations with the HybridNet fc6 CNN feature. Given the results reported in the previous section, we focus on the BMMFV encoding of ORB binary features. All the feature combinations show an improvement with respect to the single use of the CNN feature (mAP of 59.1%) or of the BMMFV alone (mAP of 31.0% and 34.9%, respectively, using the full-length and the PCA-reduced descriptor). This reflects the very good behavior of the feature combinations also in the large-scale case.
The mAP reaches a maximum for α between 0.4 and 0.5, that is, giving (almost) the same weight to the BMMFV and the CNN feature in the combination. The results obtained using the full-length BMMFV and its PCA-reduced version are similar. The latter performs slightly better and achieves a maximum mAP of 67.2%, which corresponds to a relative mAP improvement of 13.7% with respect to using the CNN feature alone. It is worth noting that the relative mAP improvement obtained in the large-scale setting is much greater than that obtained without the distractor set. This suggests that the information provided by the BMMFV in the combination helps in discerning the visual content of images, particularly in the presence of distractor images.
Since the extraction of binary features is much faster than that of non-binary ones, the computational gain of combining CNN features with BMMFV encodings of ORB, rather than with traditional FV encodings of SIFT, is especially notable in the large-scale scenario. For example, extracting SIFTs from the INRIA Holidays + MIRFlickr dataset would have required more than 13 days (about 1,200 ms per image), while the ORB extraction took less than 8 hours (about 26 ms per image).
5 Conclusion
Motivated by recent results obtained on one hand with the use of aggregation methods applied to local descriptors, and on the other with the definition of binary local features, this paper has presented an extensive comparison of techniques that mix the two approaches by applying aggregation methods to binary local features. The use of aggregation methods on binary local features is motivated by the need for increasing efficiency and reducing computing resources for image matching on a large scale, at the expense of some degradation in the accuracy of retrieval algorithms. Combining the two approaches makes it possible to execute image retrieval on a very large scale while reducing the cost of feature extraction and representation. We thus expect the results of our empirical evaluation to be useful for people working with binary local descriptors.
Moreover, we investigated how aggregations of binary local features work in conjunction with the CNN pipeline in order to improve the latter's retrieval performance. We showed that the BMMFV built upon ORB binary features can be profitably used for this purpose, even if a relatively small number of Bernoulli mixtures is used. In fact, the relative improvement in retrieval performance obtained by combining CNN features with the BMMFV is similar to that previously obtained in chandrasekhar2015 , where a combination of the CNN features with the more expensive FV built on SIFT was proposed. Experimental evaluation on a large scale confirms the effectiveness and scalability of our proposal.
It is also worth mentioning that the BMMFV approach is very general and could be applied to any binary feature. Recent works based on CNNs suggest that binary feature aggregation techniques could be further applied to deep features. In fact, on one hand, local features based on CNNs, aggregated with the VLAD and FV approaches, have been proposed to obtain robustness to geometric deformations Ng_2015_CVPR_Workshops ; Uricchio_2015_ICCV_Workshops . On the other hand, binarization of global CNN features has also been proposed in Lin_2015_CVPR_Workshops ; Lai_2015_CVPR . Thus, as future work, we plan to test the BMMFV approach on binary deep local descriptors, leveraging the local and binary approaches mentioned above.

Acknowledgements.
This work was partially funded by: EAGLE, Europeana network of Ancient Greek and Latin Epigraphy, co-funded by the European Commission, CIP-ICT-PSP.2012.2.1 Europeana and creativity, Grant Agreement n. 325122; and Smart News, Social sensing for breaking news, co-funded by the Tuscany region under the FAR-FAS 2014 program, CUP CIPE D58C15000270008.

References
 (1) Alcantarilla, P.F., Nuevo, J., Bartoli, A.: Fast explicit diffusion for accelerated features in nonlinear scale spaces. In: British Machine Vision Conference (BMVC) (2013)
 (2) Amato, G., Falchi, F., Gennaro, C., Vadicamo, L.: Deep Permutations: Deep Convolutional Neural Networks and Permutation-Based Indexing, pp. 93–106. Springer International Publishing, Cham (2016). DOI 10.1007/9783319467597_7. URL http://dx.doi.org/10.1007/9783319467597_7
 (3) Amato, G., Falchi, F., Vadicamo, L.: How effective are aggregation methods on binary features? In: Proceedings of the 11th Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications, vol. 4, pp. 566–573 (2016)
 (4) Amato, G., Falchi, F., Vadicamo, L.: Visual recognition of ancient inscriptions using convolutional neural network and fisher vector. J. Comput. Cult. Herit. 9(4), 21:1–21:24 (2016). DOI 10.1145/2964911. URL http://doi.acm.org/10.1145/2964911

 (5) Arandjelovic, R., Zisserman, A.: Three things everyone should know to improve object retrieval. In: Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pp. 2911–2918 (2012)
 (6) Arandjelovic, R., Zisserman, A.: All about VLAD. In: Computer Vision and Pattern Recognition (CVPR), 2013 IEEE Conference on, pp. 1578–1585 (2013). DOI 10.1109/CVPR.2013.207
 (7) Babenko, A., Slesarev, A., Chigorin, A., Lempitsky, V.: Neural codes for image retrieval. In: Computer Vision–ECCV 2014, pp. 584–599. Springer (2014). DOI 10.1007/9783319105901_38. URL http://dx.doi.org/10.1007/9783319105901_38
 (8) Bay, H., Tuytelaars, T., Van Gool, L.: SURF: Speeded up robust features. In: A. Leonardis, H. Bischof, A. Pinz (eds.) Computer Vision - ECCV 2006, Lecture Notes in Computer Science, vol. 3951, pp. 404–417. Springer Berlin Heidelberg (2006). DOI 10.1007/11744023_32. URL http://dx.doi.org/10.1007/11744023_32
 (9) Bing images. URL http://www.bing.com/images/

 (10) Bishop, C.M.: Pattern Recognition and Machine Learning. Information Science and Statistics. Springer (2006)
 (11) Boureau, Y.L., Bach, F., LeCun, Y., Ponce, J.: Learning mid-level features for recognition. In: Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on, pp. 2559–2566 (2010)
 (12) Calonder, M., Lepetit, V., Strecha, C., Fua, P.: BRIEF: Binary robust independent elementary features. In: K. Daniilidis, P. Maragos, N. Paragios (eds.) Computer Vision – ECCV 2010, Lecture Notes in Computer Science, vol. 6314, pp. 778–792. Springer Berlin Heidelberg (2010)
 (13) Chandrasekhar, V., Lin, J., Morère, O., Goh, H., Veillard, A.: A practical guide to CNNs and Fisher vectors for image instance retrieval. CoRR abs/1508.02496 (2015). URL http://arxiv.org/abs/1508.02496
 (14) Chen, D., Tsai, S., Chandrasekhar, V., Takacs, G., Chen, H., Vedantham, R., Grzeszczuk, R., Girod, B.: Residual enhanced visual vectors for on-device image matching. In: Signals, Systems and Computers (ASILOMAR), 2011 Conference Record of the Forty Fifth Asilomar Conference on, pp. 850–854 (2011). DOI 10.1016/j.sigpro.2012.06.005. URL http://dx.doi.org/10.1016/j.sigpro.2012.06.005
 (15) Chum, O., Philbin, J., Sivic, J., Isard, M., Zisserman, A.: Total recall: Automatic query expansion with a generative feature model for object retrieval. In: Computer Vision, 2007. ICCV 2007. IEEE 11th International Conference on, pp. 1–8 (2007)
 (16) Csurka, G., Dance, C., Fan, L., Willamowski, J., Bray, C.: Visual categorization with bags of keypoints. Workshop on statistical learning in computer vision, ECCV 1(122), 1–2 (2004)
 (17) Datta, R., Li, J., Wang, J.Z.: Content-based image retrieval: Approaches and trends of the new age. In: Proceedings of the 7th ACM SIGMM International Workshop on Multimedia Information Retrieval, MIR ’05, pp. 253–262. ACM, New York, NY, USA (2005)
 (18) Delhumeau, J., Gosselin, P.H., Jégou, H., Pérez, P.: Revisiting the VLAD image representation. In: Proceedings of the 21st ACM International Conference on Multimedia, MM 2013, pp. 653–656. ACM, New York, NY, USA (2013). DOI 10.1145/2502081.2502171. URL http://doi.acm.org/10.1145/2502081.2502171
 (19) Deng, J., Dong, W., Socher, R., Li, L., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pp. 248–255 (2009). DOI 10.1109/CVPR.2009.5206848
 (20) Donahue, J., Jia, Y., Vinyals, O., Hoffman, J., Zhang, N., Tzeng, E., Darrell, T.: DeCAF: A deep convolutional activation feature for generic visual recognition. CoRR abs/1310.1531 (2013). URL http://arxiv.org/abs/1310.1531
 (21) Galvez-Lopez, D., Tardos, J.: Real-time loop detection with bags of binary words. In: Intelligent Robots and Systems (IROS), 2011 IEEE/RSJ International Conference on, pp. 51–58 (2011)
 (22) van Gemert, J.C., Geusebroek, J.M., Veenman, C.J., Smeulders, A.W.: Kernel codebooks for scene categorization. In: D. Forsyth, P. Torr, A. Zisserman (eds.) Computer Vision – ECCV 2008, Lecture Notes in Computer Science, vol. 5304, pp. 696–709. Springer Berlin Heidelberg (2008)

 (23) Goodfellow, I., Bengio, Y., Courville, A.: Deep Learning (2016). URL http://www.deeplearningbook.org. Book in preparation for MIT Press
 (24) Google Goggles. URL http://www.google.com/mobile/goggles/
 (25) Google images. URL https://images.google.com/
 (26) Grana, C., Borghesani, D., Manfredi, M., Cucchiara, R.: A fast approach for integrating ORB descriptors in the bag of words model. In: C.G.M. Snoek, L.S. Kennedy, R. Creutzburg, D. Akopian, D. Wüller, K.J. Matherson, T.G. Georgiev, A. Lumsdaine (eds.) IS&T/SPIE Electronic Imaging. International Society for Optics and Photonics (2013)
 (27) Gray, R.M., Neuhoff, D.L.: Quantization. Information Theory, IEEE Transactions on 44(6), 2325–2383 (1998). DOI 10.1109/18.720541. URL http://dx.doi.org/10.1109/18.720541
 (28) Hamming, R.W.: Error detecting and error correcting codes. The Bell System Technical Journal 29(2), 147–160 (1950). DOI 10.1002/j.1538-7305.1950.tb00463.x
 (29) Heinly, J., Dunn, E., Frahm, J.M.: Comparative evaluation of binary features. In: Computer Vision – ECCV 2012, Lecture Notes in Computer Science, pp. 759–773. Springer Berlin Heidelberg (2012)
 (30) Householder, A.: The Theory of Matrices in Numerical Analysis. A Blaisdell book in pure and applied sciences: introduction to higher mathematics. Blaisdell Publishing Company (1964)
 (31) Huiskes, M.J., Lew, M.S.: The MIR Flickr retrieval evaluation. In: MIR ’08: Proceedings of the 2008 ACM International Conference on Multimedia Information Retrieval. ACM, New York, NY, USA (2008)
 (32) Jaakkola, T., Haussler, D.: Exploiting generative models in discriminative classifiers. In: Advances in Neural Information Processing Systems 11, pp. 487–493. MIT Press (1998). URL http://dl.acm.org/citation.cfm?id=340534.340715
 (33) Jégou, H., Douze, M., Schmid, C.: Hamming embedding and weak geometric consistency for large scale image search. In: D. Forsyth, P. Torr, A. Zisserman (eds.) European Conference on Computer Vision, LNCS, vol. I, pp. 304–317. Springer (2008)
 (34) Jégou, H., Douze, M., Schmid, C.: Improving bag-of-features for large scale image search. International Journal of Computer Vision 87(3), 316–336 (2010). DOI 10.1007/s11263-009-0285-2. URL http://dx.doi.org/10.1007/s11263-009-0285-2
 (35) Jégou, H., Douze, M., Schmid, C.: Product quantization for nearest neighbor search. Pattern Analysis and Machine Intelligence, IEEE Transactions on 33(1), 117–128 (2011). DOI 10.1109/TPAMI.2010.57
 (36) Jégou, H., Douze, M., Schmid, C., Pérez, P.: Aggregating local descriptors into a compact image representation. In: IEEE Conference on Computer Vision & Pattern Recognition (2010). DOI 10.1109/CVPR.2010.5540039
 (37) Jégou, H., Perronnin, F., Douze, M., Sànchez, J., Pérez, P., Schmid, C.: Aggregating local image descriptors into compact codes. IEEE Transactions on Pattern Analysis and Machine Intelligence 34(9), 1704–1716 (2012). DOI 10.1109/TPAMI.2011.235
 (38) Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick, R., Guadarrama, S., Darrell, T.: Caffe: Convolutional architecture for fast feature embedding. In: Proceedings of the ACM International Conference on Multimedia, pp. 675–678. ACM (2014). DOI 10.1145/2647868.2654889. URL http://doi.acm.org/10.1145/2647868.2654889
 (39) Kaufman, L., Rousseeuw, P.: Clustering by means of medoids. In: Y. Dodge (ed.) An introduction to L1-norm based statistical data analysis, Computational Statistics & Data Analysis, vol. 5 (1987)
 (40) Krapac, J., Verbeek, J., Jurie, F.: Modeling spatial layout with Fisher vectors for image categorization. In: ICCV 2011 – International Conference on Computer Vision, pp. 1487–1494. IEEE, Barcelona, Spain (2011)
 (41) Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. In: F. Pereira, C. Burges, L. Bottou, K. Weinberger (eds.) Advances in Neural Information Processing Systems 25, pp. 1097–1105. Curran Associates, Inc. (2012)
 (42) Lai, H., Pan, Y., Liu, Y., Yan, S.: Simultaneous feature learning and hash coding with deep neural networks. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2015)
 (43) Lazebnik, S., Schmid, C., Ponce, J.: Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In: Computer Vision and Pattern Recognition, 2006 IEEE Computer Society Conference on, vol. 2 (2006)
 (44) LeCun, Y., Bengio, Y., Hinton, G.: Deep learning. Nature 521(7553), 436–444 (2015). DOI 10.1038/nature14539
 (45) Lee, S., Choi, S., Yang, H.: Bag-of-binary-features for fast image representation. Electronics Letters 51(7), 555–557 (2015)
 (46) Leutenegger, S., Chli, M., Siegwart, R.: BRISK: Binary robust invariant scalable keypoints. In: Computer Vision (ICCV), 2011 IEEE International Conference on, pp. 2548–2555 (2011)
 (47) Levi, G., Hassner, T.: LATCH: Learned arrangements of three patch codes. CoRR abs/1501.03719 (2015)
 (48) Lin, K., Yang, H.F., Hsiao, J.H., Chen, C.S.: Deep learning of binary hash codes for fast image retrieval. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops (2015)
 (49) Lloyd, S.: Least squares quantization in PCM. Information Theory, IEEE Transactions on 28(2), 129–137 (1982). DOI 10.1109/TIT.1982.1056489. URL http://dx.doi.org/10.1109/TIT.1982.1056489
 (50) Lowe, D.G.: Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision 60(2), 91–110 (2004). DOI 10.1023/B:VISI.0000029664.99615.94. URL http://dx.doi.org/10.1023/B:VISI.0000029664.99615.94
 (51) McLachlan, G., Peel, D.: Finite Mixture Models. Wiley series in probability and statistics. Wiley (2000)
 (52) Miksik, O., Mikolajczyk, K.: Evaluation of local detectors and descriptors for fast feature matching. In: Pattern Recognition (ICPR), 2012 21st International Conference on, pp. 2681–2684 (2012)
 (53) Perd’och, M., Chum, O., Matas, J.: Efficient representation of local geometry for large scale object retrieval. In: Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pp. 9–16 (2009)
 (54) Perronnin, F., Dance, C.: Fisher kernels on visual vocabularies for image categorization. In: Computer Vision and Pattern Recognition, 2007. CVPR ’07. IEEE Conference on, pp. 1–8 (2007). DOI 10.1109/CVPR.2007.383266
 (55) Perronnin, F., Larlus, D.: Fisher vectors meet neural networks: A hybrid classification architecture. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3743–3752 (2015)
 (56) Perronnin, F., Liu, Y., Sànchez, J., Poirier, H.: Large-scale image retrieval with compressed Fisher vectors. In: Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on, pp. 3384–3391 (2010). DOI 10.1109/CVPR.2010.5540009
 (57) Perronnin, F., Sànchez, J., Mensink, T.: Improving the Fisher kernel for large-scale image classification. In: Computer Vision – ECCV 2010, Lecture Notes in Computer Science, vol. 6314, pp. 143–156. Springer Berlin Heidelberg (2010). DOI 10.1007/978-3-642-15561-1_11. URL http://dx.doi.org/10.1007/978-3-642-15561-1_11
 (58) Philbin, J., Chum, O., Isard, M., Sivic, J., Zisserman, A.: Object retrieval with large vocabularies and fast spatial matching. In: Computer Vision and Pattern Recognition (CVPR), 2007 IEEE Conference on, pp. 1–8 (2007). DOI 10.1109/CVPR.2007.383172
 (59) Philbin, J., Chum, O., Isard, M., Sivic, J., Zisserman, A.: Lost in quantization: Improving particular object retrieval in large scale image databases. In: Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on, pp. 1–8 (2008). DOI 10.1109/CVPR.2008.4587635
 (60) Razavian, A.S., Azizpour, H., Sullivan, J., Carlsson, S.: CNN features off-the-shelf: An astounding baseline for recognition. In: Computer Vision and Pattern Recognition Workshops (CVPRW), 2014 IEEE Conference on, pp. 512–519. IEEE (2014). DOI 10.1109/CVPRW.2014.131
 (61) Rublee, E., Rabaud, V., Konolige, K., Bradski, G.: ORB: An efficient alternative to SIFT or SURF. In: Computer Vision (ICCV), 2011 IEEE International Conference on, pp. 2564–2571 (2011)
 (62) Salton, G., McGill, M.J.: Introduction to Modern Information Retrieval. McGraw-Hill, Inc., New York, NY, USA (1986)
 (63) Sànchez, J., Perronnin, F., Mensink, T., Verbeek, J.: Image classification with the Fisher vector: Theory and practice. International Journal of Computer Vision 105(3), 222–245 (2013). DOI 10.1007/s11263-013-0636-x. URL http://dx.doi.org/10.1007/s11263-013-0636-x
 (64) Sànchez, J., Redolfi, J.: Exponential family Fisher vector for image classification. Pattern Recognition Letters 59, 26–32 (2015). DOI 10.1016/j.patrec.2015.03.010
 (65) Simonyan, K., Vedaldi, A., Zisserman, A.: Deep Fisher networks for large-scale image classification. In: C.J.C. Burges, L. Bottou, M. Welling, Z. Ghahramani, K.Q. Weinberger (eds.) Advances in Neural Information Processing Systems 26, pp. 163–171. Curran Associates, Inc. (2013)
 (66) Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. CoRR abs/1409.1556 (2014). URL http://arxiv.org/abs/1409.1556
 (67) Sivic, J., Zisserman, A.: Video Google: A text retrieval approach to object matching in videos. In: Proceedings of the Ninth IEEE International Conference on Computer Vision, ICCV ’03, vol. 2, pp. 1470–1477. IEEE Computer Society (2003). DOI 10.1109/ICCV.2003.1238663
 (68) Smeulders, A.W.M., Worring, M., Santini, S., Gupta, A., Jain, R.: Content-based image retrieval at the end of the early years. IEEE Trans. Pattern Anal. Mach. Intell. 22(12), 1349–1380 (2000)
 (69) Sydorov, V., Sakurada, M., Lampert, C.H.: Deep Fisher kernels – End to end learning of the Fisher kernel GMM parameters. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2014)
 (70) Tolias, G., Avrithis, Y.: Speeded-up, relaxed spatial matching. In: Computer Vision (ICCV), 2011 IEEE International Conference on, pp. 1653–1660 (2011). DOI 10.1109/ICCV.2011.6126427
 (71) Tolias, G., Furon, T., Jégou, H.: Orientation covariant aggregation of local descriptors with embeddings. In: D. Fleet, T. Pajdla, B. Schiele, T. Tuytelaars (eds.) Computer Vision – ECCV 2014, Lecture Notes in Computer Science, vol. 8694, pp. 382–397. Springer International Publishing (2014)
 (72) Tolias, G., Jégou, H.: Local visual query expansion: Exploiting an image collection to refine local descriptors. Research Report RR-8325 (2013). URL https://hal.inria.fr/hal-00840721
 (73) Uchida, Y., Sakazawa, S.: Image retrieval with Fisher vectors of binary features. In: Pattern Recognition (ACPR), 2013 2nd IAPR Asian Conference on, pp. 23–28 (2013)
 (74) Ullman, S.: High-Level Vision – Object Recognition and Visual Cognition. MIT Press (1996)
 (75) Uricchio, T., Bertini, M., Seidenari, L., Del Bimbo, A.: Fisher encoded convolutional bag-of-windows for efficient image retrieval and social image tagging. In: The IEEE International Conference on Computer Vision (ICCV) Workshops (2015)
 (76) Van Opdenbosch, D., Schroth, G., Huitl, R., Hilsenbeck, S., Garcea, A., Steinbach, E.: Camera-based indoor positioning using scalable streaming of compressed binary image signatures. In: IEEE International Conference on Image Processing (2014)
 (77) Wang, J., Yang, J., Yu, K., Lv, F., Huang, T., Gong, Y.: Locality-constrained linear coding for image classification. In: Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on, pp. 3360–3367 (2010)
 (78) Witten, I.H., Moffat, A., Bell, T.C.: Managing Gigabytes: Compressing and Indexing Documents and Images. Morgan Kaufmann (1999)
 (79) Yang, J., Yu, K., Gong, Y., Huang, T.: Linear spatial pyramid matching using sparse coding for image classification. In: Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pp. 1794–1801 (2009)
 (80) Yue-Hei Ng, J., Yang, F., Davis, L.S.: Exploiting local features from deep networks for image retrieval. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops (2015)
 (81) Zezula, P., Amato, G., Dohnal, V., Batko, M.: Similarity Search: The Metric Space Approach, Advances in Database Systems, vol. 32. Springer (2006)
 (82) Zhang, Y., Zhu, C., Bres, S., Chen, L.: Encoding local binary descriptors by bag-of-features with Hamming distance for visual object categorization. In: P. Serdyukov, P. Braslavski, S. Kuznetsov, J. Kamps, S. Rüger, E. Agichtein, I. Segalovich, E. Yilmaz (eds.) Advances in Information Retrieval, Lecture Notes in Computer Science, vol. 7814, pp. 630–641. Springer Berlin Heidelberg (2013)
 (83) Zhao, W., Jégou, H., Gravier, G.: Oriented pooling for dense and non-dense rotation-invariant features. In: BMVC – 24th British Machine Vision Conference (2013)
 (84) Zhou, B., Lapedriza, A., Xiao, J., Torralba, A., Oliva, A.: Learning deep features for scene recognition using Places database. In: Z. Ghahramani, M. Welling, C. Cortes, N. Lawrence, K. Weinberger (eds.) Advances in Neural Information Processing Systems 27, pp. 487–495. Curran Associates, Inc. (2014)
Appendix A Score vector computation
In the following, we report the computation of the score function $G_{\lambda}^{X}$, defined as the gradient of the log-likelihood of an observation $X=\{x_1,\dots,x_T\}$ with respect to the parameters $\lambda$ of a Bernoulli Mixture Model. Throughout this appendix we use the notation $[\,\cdot\,]$ to represent the Iverson bracket, which equals one if its argument is true and zero otherwise.
Under the independence assumption, the Fisher score with respect to a generic parameter $\lambda$ is expressed as $G_{\lambda}^{X}=\sum_{t=1}^{T}\nabla_{\lambda}\log p(x_t|\lambda)$. To compute $\nabla_{\lambda}\log p(x_t|\lambda)$, we first observe that
(5) $\dfrac{\partial p_i(x_t)}{\partial \mu_{id}} = p_i(x_t)\,\dfrac{x_{td}-\mu_{id}}{\mu_{id}(1-\mu_{id})}$
and, using the soft-max parametrization of the mixture weights $w_i=\exp(\alpha_i)/\sum_{j=1}^{K}\exp(\alpha_j)$,
(6) $\dfrac{\partial w_j}{\partial \alpha_i} = w_i\big([i=j]-w_j\big).$
Hence, denoting by $\gamma_t(i)=w_i\,p_i(x_t)/\sum_{j=1}^{K} w_j\,p_j(x_t)$ the occupancy probability, the Fisher score with respect to the parameter $\mu_{id}$ is obtained as
(7) $G_{\mu_{id}}^{X} = \sum_{t=1}^{T}\gamma_t(i)\,\dfrac{x_{td}-\mu_{id}}{\mu_{id}(1-\mu_{id})}$
and the Fisher score related to the parameter $\alpha_i$ is
(8) $G_{\alpha_i}^{X} = \sum_{t=1}^{T}\big(\gamma_t(i)-w_i\big).$
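As a sanity check, the scores above can be computed numerically. The sketch below is our own minimal illustration (the function name, array shapes, and the soft-max weight parametrization are assumptions, not part of the paper):

```python
import numpy as np

def bmm_fisher_scores(X, w, mu, eps=1e-12):
    """Fisher scores of a Bernoulli Mixture Model.

    X  : (T, D) binary local descriptors of one image
    w  : (K,)   mixture weights, summing to one
    mu : (K, D) Bernoulli means in (0, 1)
    Returns (G_alpha, G_mu) of shapes (K,) and (K, D).
    """
    # log p_i(x_t) for every descriptor/component pair: (T, K)
    log_p = X @ np.log(mu + eps).T + (1 - X) @ np.log(1 - mu + eps).T
    log_wp = np.log(w + eps) + log_p
    # occupancy probabilities gamma_t(i), normalized over components
    log_norm = np.logaddexp.reduce(log_wp, axis=1, keepdims=True)
    gamma = np.exp(log_wp - log_norm)                       # (T, K)
    # score w.r.t. the soft-max parameters alpha_i
    G_alpha = (gamma - w).sum(axis=0)                       # (K,)
    # score w.r.t. the Bernoulli means mu_id
    diff = X[:, None, :] - mu[None, :, :]                   # (T, K, D)
    G_mu = (gamma[:, :, None] * diff / (mu * (1 - mu))[None]).sum(axis=0)
    return G_alpha, G_mu
```

A useful property for testing: since $\sum_i \gamma_t(i)=1$ and $\sum_i w_i=1$, the components of `G_alpha` always sum to zero.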
Appendix B Approximation of the Fisher Information Matrix
Our derivation of the FIM is based on the assumption (see also perronnin10 ; sanchez13 ) that for each observation $x_t$ the distribution of the occupancy probability $\gamma_t(i)$ is sharply peaked, i.e., there is one Bernoulli index $b$ such that $\gamma_t(b)\approx 1$ and $\gamma_t(i)\approx 0$ for all $i\neq b$. This assumption implies that $\gamma_t(i)\,\gamma_t(j)\approx 0$ whenever $i\neq j$, and then
(9) $\gamma_t(i)\,\gamma_t(j) \approx \gamma_t(i)\,[i=j],$
where $[\,\cdot\,]$ is the Iverson bracket.
The elements of the FIM are defined as:
(10) $F_{\lambda,\lambda'} = \mathbb{E}_{x\sim p(x|\lambda)}\!\left[\dfrac{\partial \log p(x|\lambda)}{\partial \lambda}\,\dfrac{\partial \log p(x|\lambda)}{\partial \lambda'}\right].$
Hence, the FIM is symmetric and can be written as the block matrix
$F=\begin{pmatrix} F_{\alpha,\alpha} & F_{\alpha,\mu} \\ F_{\mu,\alpha} & F_{\mu,\mu} \end{pmatrix}.$
By using the definition of the occupancy probability (i.e., $\gamma(i)=w_i\,p_i(x)/p(x|\lambda)$) and the fact that $p_i$ is the distribution of a $D$-dimensional Bernoulli of mean $\mu_i=(\mu_{i1},\dots,\mu_{iD})$, we have the following useful equalities:
(11) $\mathbb{E}_x[\gamma(i)] = \sum_{x} w_i\,p_i(x) = w_i$
(12) $\mathbb{E}_x[\gamma(i)\,x_d] = w_i\sum_{x} x_d\,p_i(x) = w_i\,\mu_{id}$
(13) $\mathbb{E}_x[\gamma(i)\,\gamma(j)] \approx \mathbb{E}_x\big[\gamma(i)\,[i=j]\big] = w_i\,[i=j]$
(14) $\mathbb{E}_x\big[\gamma(i)\,\gamma(j)\,(x_d-\mu_{id})(x_e-\mu_{je})\big] \approx w_i\,\mu_{id}(1-\mu_{id})\,[d=e]\,[i=j]$
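The first two equalities hold exactly and can be verified by brute-force enumeration over $\{0,1\}^D$ for a small mixture. The snippet below is a self-contained check of our own (the parameter values are arbitrary, chosen only for illustration):

```python
import itertools
import numpy as np

# Tiny Bernoulli mixture with arbitrary illustrative parameters.
rng = np.random.default_rng(1)
K, D = 3, 6
w = rng.dirichlet(np.ones(K))          # mixture weights, sum to one
mu = rng.uniform(0.1, 0.9, (K, D))     # Bernoulli means

E_gamma = np.zeros(K)                  # accumulates E[gamma(i)]
E_gx = np.zeros((K, D))                # accumulates E[gamma(i) x_d]
for bits in itertools.product([0, 1], repeat=D):
    x = np.array(bits, dtype=float)
    p_i = np.prod(mu**x * (1 - mu)**(1 - x), axis=1)  # p_i(x) per component
    p_x = float(w @ p_i)                              # mixture likelihood p(x)
    gamma = w * p_i / p_x                             # occupancy probabilities
    E_gamma += gamma * p_x
    E_gx += np.outer(gamma, x) * p_x

assert np.allclose(E_gamma, w)               # equality: E[gamma(i)] = w_i
assert np.allclose(E_gx, w[:, None] * mu)    # equality: E[gamma(i) x_d] = w_i mu_id
```

The remaining two equalities involve the sharp-peak approximation and therefore hold only approximately, with error shrinking as the components become well separated.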