I Introduction
Image retrieval based on visual queries is a topic of intensive research interest, since it finds many applications in visual search [1, 2, 3], detection of copyright violations [4], recommendation services [5] and object or person identification [6]. For much of the past decade, the state-of-the-art for content-based image retrieval was to encode the image by first describing salient points using a locally-invariant feature descriptor, such as SIFT [7], or an image decomposition (e.g., wavelets [8, 9, 10, 11]). A visual vocabulary is then learned offline using K-means or mixture-of-Gaussians (MoG) clustering [12], which quantizes the feature space into cells (visual words). The SIFT cell assignments of each database (or query) image are then produced and aggregated in order to obtain a compact representation that can be used for visual-query-based retrieval. Notable contributions in this domain have relied on the bag-of-words (BoW) image representation [13], where the SIFTs assigned to each visual word are aggregated into a histogram used for retrieval purposes. Amongst the successful extensions to BoW are feature soft-assignment [14], spatial matching methods [15, 16, 17] and indexing methods [13, 18].
Despite the success of BoW approaches, their large storage and memory-access requirements make them unsuitable for image retrieval within large image datasets (e.g., tens of millions of images). For such problems, the vector of locally aggregated descriptors (VLAD) [19] was introduced as a non-probabilistic variant of the Fisher-vector image descriptor [20] that encodes the distribution of SIFT assignments according to cluster centers. VLAD has been shown to achieve very competitive retrieval performance to BoW methods with orders-of-magnitude reduction in complexity and memory footprint, i.e., requiring 16–256 bytes per image instead of the tens of kilobytes required by BoW methods [19]. With such a reduced memory footprint, it has been shown that a standard multicore server can load and retain the VLADs of a billion-image dataset in its random access memory [21]. This facilitates the scale-up of visual search to big data by using standard cloud-computing clusters comprising groups of tens or even hundreds of such servers [22].
However, with increasing dataset sizes and complexity in retrieval, there is a need for deeper models with larger learning capacity that can learn more complex representations. To this end, deep convolutional neural networks (CNNs) have recently come to the forefront in visual recognition [23, 24, 25, 26]. Deep CNNs, as well as hybrid neural-network variants like the FV-NN approach of Perronnin et al. [27], have the potential to go beyond "shallow" learned encodings like VLAD because they are scalable with training. For example, deep CNNs trained discriminatively on a large and diverse labelled dataset like ImageNet [28] have been shown to outperform Fisher vectors for image classification [26]. In addition, recent work [29, 2, 30, 31] demonstrates that features extracted from intermediate layers of a deep CNN are actually transferable to image retrieval. The aggregated features tend to provide a rich semantic representation of the image which, in the case of retrieval, has been shown to offer comparable, if not substantially better, performance than VLAD and Fisher-vector descriptors [29, 32, 33]. However, deep CNNs are not without disadvantages: beyond the high computational cost that is inherent with a large training set and numerous layers [26, 34], the CNN activations lack geometric invariance [33, 35], which is why the VLAD descriptor and its variants remain a viable option, especially for fine-grained search.

In this paper, we are interested in the problem of designing a visual-query-based retrieval system that is capable of handling both small- and large-size "object", or, more broadly, region-of-interest (ROI) queries over image datasets. Given a ROI representing a visual query, the proposed system should return all images from the database containing this query, with matching complexity and storage requirements that remain of the order of standard encodings. This is considerably more challenging than whole-image retrieval, as the query object may be occluded or distorted, or be seen from different viewpoints and distances in relevant images [15]. This is also the reason why the original VLAD proposal does not perform as well for this problem [21]. We therefore propose a new Voronoi-based encoding (VE), in which we spatially partition the image into Voronoi cells using a hierarchical K-means, and thus compute multiple descriptors over cells. We couple this with an adaptive search algorithm that minimizes the overall computation for similarity identification by first finding the cells most representative of the query and then deriving a novel single-score metric for the image over these cells. We also propose a novel product quantization framework (based on symmetric distance computation) for our proposal. Finally, we show that our proposed framework is agnostic to the descriptor basis by testing on both a Voronoi-based VLAD descriptor and a Voronoi-based deep CNN feature descriptor, and assessing performance against their respective state-of-the-art variants.
Overall, our system design for object retrieval adheres to the following principles:

The system should provide a substantial improvement over the base descriptor's (VLAD or CNN) mean Average Precision (mAP) when ROI queries are small relative to the image size.

The system should maintain mAP competitive with the base descriptor representations under ROI queries occupying a sizeable proportion (or the entirety) of images.

The system should be amenable to big-data processing, i.e., its descriptors' size and matching complexity should be comparable to those of the base descriptor.
In the following section we discuss the background and related work, with Table I summarizing the nomenclature. In Section III we present the offline and online components of our proposed system and Section IV presents the extension of the proposed approach to quantized representations. Section V presents experimental results for FastVDCNN on the Holidays dataset [36] and FastVVLAD on the Caltech Cars (Rear) dataset [37], and Section VI draws concluding remarks.
II Background and Related Work
II-A Vector of Locally Aggregated Descriptors
VLAD is a fixed-size compact image representation that stores first-order information associated with clusters of image salient points [19, 38]. In essence, VLAD is intrinsically related to the Fisher-vector image descriptor [39].
In the offline part of the VLAD encoding, based on a training set of $d$-dimensional SIFT descriptors derived from $T$ training images, a visual-word vocabulary is first learned using K-means clustering. This vocabulary comprises $K$ clusters with $d$-dimensional centroids $\mathbf{c}_1, \ldots, \mathbf{c}_K$.
For each new test image (out of a test dataset comprising $N$ images), interest points are detected (using an affine-invariant detector) and described using $d$-dimensional SIFT descriptors, thus forming a descriptor ensemble $\{\mathbf{x}_1, \ldots, \mathbf{x}_n\}$. The descriptors $\mathbf{x}_i$, $1 \le i \le n$, are assigned to the nearest cluster in the vocabulary via a cluster assignment function $q(\cdot)$. VLAD then stores the residuals of the SIFT assignments from their associated centroids. The $d$-dimensional VLAD encoding for the $k$th cluster, $\mathbf{v}_k$, is given by [19, 38]:
$\mathbf{v}_k = \sum_{i:\, q(\mathbf{x}_i) = \mathbf{c}_k} \left( \mathbf{x}_i - \mathbf{c}_k \right)$ (1)
The VLAD encodings $\mathbf{v}_k$ for each cluster are concatenated into a single descriptor with fixed dimension $D = Kd$, which is independent of the number $n$ of SIFT descriptors found in the image. The VLAD vectors are then sign square-rooted and normalized [38], and the vectors across all $N$ images of the test dataset are thus aggregated into a single matrix.
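As a concrete illustration of the residual aggregation in (1) and the subsequent sign-square-root and normalization steps, the following sketch implements the VLAD encoding in NumPy (a minimal sketch; function and variable names are ours, not the paper's, and the vocabulary is assumed to be already learned):

```python
import numpy as np

def vlad_encode(descriptors, centroids):
    """Aggregate local descriptors into a VLAD vector.

    descriptors: (n, d) array of SIFT-like descriptors.
    centroids:   (K, d) visual-word centroids from K-means.
    Returns a (K*d,) sign-square-rooted, L2-normalized VLAD.
    """
    K, d = centroids.shape
    # Nearest-centroid assignment q(x_i).
    dists = ((descriptors[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
    assign = dists.argmin(axis=1)
    v = np.zeros((K, d))
    for k in range(K):
        members = descriptors[assign == k]
        if len(members):
            v[k] = (members - centroids[k]).sum(axis=0)  # residual sum, Eq. (1)
    v = v.ravel()
    v = np.sign(v) * np.sqrt(np.abs(v))                  # signed square root
    n = np.linalg.norm(v)
    return v / n if n > 0 else v
```

The output dimension is $Kd$ regardless of how many descriptors the image contains, which is the property that makes VLAD a fixed-size representation.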
In a practical system, the SIFT descriptor length $d$ is typically 128; if the feature space is coarsely quantized with $K$ set to 64, then the VLAD image descriptor has $D = 8192$ dimensions. Further dimensionality reduction is achieved with principal component analysis (PCA) (learned on an independent training set), thus further reducing the memory footprint per image descriptor [40, 41]. The projection matrix $\mathbf{U}$ used by VLAD comprises only the eigenvectors corresponding to the $D'$ largest eigenvalues of the covariance matrix [40, 41]. The projected VLAD of each image in the test dataset is then normalized, thereby completing the offline part of the VLAD generation.

During online ROI-query-based retrieval, after the VLAD encoding and projection of the ROI query has been carried out, the similarity between that and the (projected) VLAD of a test-dataset image, $\mathbf{q}$ and $\mathbf{t}$, can simply be measured using the squared Euclidean distance [38]. With normalized vectors, this is a monotonic function of the inner product, such that:
$S(\mathbf{q}, \mathbf{t}) = \mathbf{q}^{\top}\mathbf{t} = 1 - \tfrac{1}{2}\,\|\mathbf{q} - \mathbf{t}\|_2^2$ (2)
where the similarity score ranges between $-1$ (completely dissimilar) and $1$ (perfect match).
TABLE I: Nomenclature

Symbol | Definition
$D$ | dimensions of unprojected descriptor
$D'$, $D'_b$ | dimensions of PCA-projected and truncated descriptor and descriptor blocks (resp.)
$T$ | number of training-set images
$N$ | number of test-dataset images
$\mathbf{U}$ | PCA projection matrix
$\boldsymbol{\Lambda}$, $\boldsymbol{\Lambda}_m$ | diagonal eigenvalue matrix, diagonal eigenvalue submatrix
$\mathbf{q}$, $\mathbf{t}$, and $\hat{\mathbf{q}}$, $\hat{\mathbf{t}}$ | PCA-projected descriptor of a query ROI and a test image (resp.), and whitening-and-normalization-based product quantization (WNPQ) descriptor of the same
$B$, $B_m$ | number of bits for quantized descriptor and constituent block
$k$, $k^*$ | number of quantization centroids per descriptor and descriptor block
$S(\text{des1}, \text{des2})$ | similarity score between descriptors "des1" and "des2"
$M$ | number of quantization subspaces (blocks) for product quantization (PQ)
$\mathcal{C}_m$ | PQ codebook per quantization block $m$, $1 \le m \le M$
$L$ | number of levels (scales) used for Voronoi-based encoding (VE)
$V_l$ | number of Voronoi cells per level $l$, $1 \le l \le L-1$
$C$ | total number of Voronoi cells in VE
$\ell^*$ | level at which Phase 1 exits in FastVE adaptive search, $0 \le \ell^* \le L-1$
$s_l$ | similarity score for the cell with maximum similarity to the query per level $l$, $0 \le l \le \ell^*$
$\delta_l$ | difference between number of interest points in query and cell corresponding to $s_l$, $0 \le l \le \ell^*$
$\bar{w}_l$ | L1-normalized Gaussian weighting per level for FastVE, $0 \le l \le \ell^*$
$A$ | number of cells accessed in FastVE
$\boldsymbol{\Sigma}_m$ | covariance matrix per descriptor block $m$, $1 \le m \le M$
II-B MultiVLAD
For ROI-based retrieval, VLAD and the similarity measure of (2) will produce suboptimal results for small ROIs, because information encoded from the remaining parts of the dataset image will distort the similarity scoring [21].
Lazebnik et al. [42] introduced the concept of spatially partitioning an image into a rectangular grid over multiple scales and encoding per block, as a method of incorporating spatial information; this has found application in both image classification [42, 43] and retrieval [44]. Similarly, the recently-proposed MultiVLAD descriptor [21] attempts to improve VLAD performance for small ROIs by spatially partitioning the dataset images into a rectangular grid over three scales and computing a VLAD descriptor per block.
At the finest scale (level 2), nine VLADs are encoded over a 3×3 rectangular grid. At medium scale (level 1), four VLADs are encoded over a 2×2 grid, where each block is composed of 2×2 blocks from the finest scale. Finally, a single VLAD is encoded over the whole dataset image (level 0). At each scale, MultiVLAD excludes featureless regions near image borders by adjusting the grid boundary. Moreover, each VLAD is PCA-projected and truncated to a 128-dimensional vector. The similarity is thus computed between the VLAD encoded over the query ROI and each of the 14 VLAD descriptors via (2), and the dataset image is assigned a similarity score to the ROI equal to the maximum similarity over its constituent VLADs.
For ROI queries occupying about 11% of image real estate, the MultiVLAD descriptor has been shown to outperform the single (128 × 14)-D VLAD (computed over the whole image) in terms of mAP. However, MultiVLAD achieves 20% lower mAP than the (128 × 14)-D VLAD when queries occupy a sizeable proportion of the image [21]. In addition, it incurs a 14-fold penalty in storage and matching complexity in comparison to the baseline 128-D VLAD.
II-C Deep Convolutional Neural Networks for Retrieval
Deep CNNs are feedforward neural networks comprising multiple layers, typically trained for classification on large, a-priori-labelled datasets, such as ImageNet [28]. It has recently been shown that features extracted from intermediate layers are transferable to other visual recognition tasks, including image retrieval [29, 30, 31]. While descriptors derived from these extracted features have been shown to match or outperform "shallow" learned methods, such as VLAD, they suffer from a lack of geometric invariance [33, 35]. Recent works [33, 32, 45, 30] have proposed patch-based methods for overcoming the lack of geometric invariance of the descriptor in instance retrieval. Our proposal is more in line with the grid-based spatial search methods of Carlsson et al. [30, 32], as we are not explicitly computing a global descriptor over features extracted from multiple patches like CNN+VLAD [33] and CKN-mix [45] (which require additional computational preprocessing, e.g., for learning encoding centers). Therefore, we compare performance against a generic grid-based spatial search, which we refer to as MultiCNN.

A MultiCNN descriptor can be devised analogously to MultiVLAD, i.e., by dividing the image at level $l$ into an $(l+1) \times (l+1)$ grid and computing the similarity score between two images as the global maximum inner product over all partitions. For the general case of $L$ levels, the total number of partitions $P$ (incl. the whole image as level 0) is:
$P = \sum_{l=0}^{L-1} (l+1)^2$ (3)
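As a sanity check on (3), the three-level configuration used by MultiVLAD (3×3, 2×2 and whole-image grids) yields the 14 descriptors mentioned above; a minimal sketch:

```python
def multicnn_partitions(L):
    """Total grid partitions for L levels, with an (l+1) x (l+1) grid at
    level l and the whole image as level 0; cf. Eq. (3)."""
    return sum((l + 1) ** 2 for l in range(L))
```

For example, `multicnn_partitions(3)` evaluates to 1 + 4 + 9 = 14.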
II-D Product Quantization
In order to further reduce the search complexity and the required memory footprint when handling large datasets, $D'$-dimensional vectors are typically quantized to produce compact $B$-bit representations [46].
Consider a $D'$-dimensional query vector $\mathbf{q}$. A global K-means clustering approach can be used to map $\mathbf{q}$ to a vector in a codebook $\mathcal{C}$. For a quantizer with $k$ centroids, the total number of bits used to encode $\mathbf{q}$ is $B = \lceil \log_2 k \rceil$. However, for a 64-bit encoding of a 128-D query vector (0.5 bits per dimension), $k = 2^{64}$ centroids must be learned using K-means, which is clearly infeasible. The learning and storage requirements of this quantization problem can be reduced either via traditional approximate-nearest-neighbor (ANN) algorithms [47, 48, 49, 50], or more recent advances [51, 52, 53]. In this paper, we focus on an efficient method for ANN search, named product quantization (PQ) [46], which uses multiple subquantizers rather than a single global quantizer [46]. PQ considers each unquantized vector $\mathbf{q}$ as the concatenation of $M$ subvectors, $\mathbf{q}_1, \ldots, \mathbf{q}_M$, each with equal dimension $D'_b = D'/M$. Each subvector is encoded from its own subcodebook $\mathcal{C}_m$, learned using K-means and considered to be of size $k^*$ for all $m$, $1 \le m \le M$. As such, the new codebook $\mathcal{C}$ is the Cartesian product of the subcodebooks, with total size $k = (k^*)^M$:
$\mathcal{C} = \mathcal{C}_1 \times \mathcal{C}_2 \times \cdots \times \mathcal{C}_M$ (4)
Crucially, via this vector-partitioning approach, the learning complexity and storage requirement are reduced to that of $M$ subcodebooks of $k^*$ centroids each. The total number of bits used to encode each $\mathbf{q}$ is now given by $B = M B_m$, where $B_m$ is the number of bits used to encode each subvector $\mathbf{q}_m$, i.e., $B_m = \lceil \log_2 k^* \rceil$.
Previous work proposed PQ with asymmetric distance computation (ADC) [19, 38], which only encodes the vectors of the test dataset, and PQ with symmetric distance computation (SDC) [46], where both query and test vectors are quantized. By not encoding the query vectors, ADC reduces the overall quantization distortion, thus enhancing the discriminatory power of the system. On the other hand, in SDC the distances between any two subcodewords in the $m$th subspace are precomputed and stored in a lookup table, thus enabling efficient ANN search by simple lookup-table accesses. Experimental results [19, 38] have shown that ADC and SDC variants of PQ-based VLAD achieve comparable retrieval performance to unquantized VLAD representations with four- to ten-fold reduction in storage and search complexity.
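The encode/lookup pipeline of SDC-based PQ can be sketched as follows (a minimal illustration with subcodebooks passed in as given; in practice they are learned per subspace with K-means, and all names are ours):

```python
import numpy as np

def pq_encode(x, subcodebooks):
    """Encode x as M subcodeword indices, one per subspace."""
    sub = np.split(x, len(subcodebooks))
    return [int(((C - s) ** 2).sum(axis=1).argmin())
            for s, C in zip(sub, subcodebooks)]

def sdc_tables(subcodebooks):
    """Precompute codeword-to-codeword squared distances per subspace."""
    return [((C[:, None, :] - C[None, :, :]) ** 2).sum(-1) for C in subcodebooks]

def sdc_distance(code_a, code_b, tables):
    """Approximate squared distance between two encoded vectors,
    using lookup-table reads only (no floating-point vector math)."""
    return sum(float(t[i, j]) for t, (i, j) in zip(tables, zip(code_a, code_b)))
```

Because the subspaces are disjoint, the summed table lookups equal the squared distance between the two reconstructed (quantized) vectors, which is the SDC approximation to the true distance.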
III Proposed Voronoi-based Encoding and its Fast Online Implementation
The Voronoi-based encoding proposed in Subsection III-A constitutes the offline component of our system. Subsection III-B describes the proposed acceleration for online Voronoi-based ROI-query search and possibilities for memory compaction to reduce storage requirements.
III-A Voronoi-based Encoding and Compact Descriptors
Instead of spatially partitioning the images into a rectangular grid, we propose to partition the image into Voronoi cells over $L$ levels (scales), using hierarchical spatial K-means clustering. The key intuition is that objects that may constitute ROI queries tend to appear as clusters of salient points, potentially interspersed with featureless regions in the image. Therefore, a ROI-oriented partitioning must attempt to adaptively isolate these spatial clusters at multiple levels.
Initially, the entire image is encoded; this comprises level 0 of the Voronoi-based encoding. For level 1, a spatial K-means is computed over the interest-point locations in the whole image, which effectively partitions the image into $V_1$ Voronoi cells. Next, for level 2, a spatial K-means is computed over the interest-point locations within each level-1 Voronoi cell, thus partitioning each cell into $V_2$ constituent cells. In general, for level $l$, $1 \le l \le L-1$, each of the cells of the previous level is partitioned into $V_l$ cells. A base descriptor, whether this be VLAD or aggregated deep CNN features, is encoded over each cell following the description of Section II-A, giving a total of

$C = 1 + \sum_{l=1}^{L-1} \prod_{j=1}^{l} V_j$ (5)

encodings per image. When PCA-projecting each cell descriptor, we aggregate each level into a single matrix.
A three-level Voronoi partitioning for an image from the Caltech Cars image dataset is illustrated in Fig. 1. The detected points are shown in color in the left image of Fig. 1, and the level-1 and level-2 Voronoi cells are superimposed with dashed lines on the middle and right images (resp.), with their corresponding descriptors appearing in different colors.
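The hierarchical spatial partitioning described above can be sketched as follows (a minimal Lloyd's K-means over interest-point coordinates; all names are illustrative, and tiny cells with fewer points than $V$ are left unsplit as a simplification):

```python
import numpy as np

def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's K-means over 2-D point coordinates."""
    rng = np.random.default_rng(seed)
    centres = points[rng.choice(len(points), k, replace=False)]
    for _ in range(iters):
        d = ((points[:, None] - centres[None]) ** 2).sum(-1)
        labels = d.argmin(1)
        for j in range(k):
            if (labels == j).any():
                centres[j] = points[labels == j].mean(0)
    return labels, centres

def voronoi_partition(points, V, L):
    """Hierarchically split interest-point locations into Voronoi cells.
    Returns a list of L levels; each level is a list of index arrays."""
    levels = [[np.arange(len(points))]]          # level 0: whole image
    for _ in range(1, L):
        next_cells = []
        for idx in levels[-1]:
            if len(idx) < V:                     # too few points to split
                next_cells.append(idx)
                continue
            labels, _ = kmeans(points[idx], V)
            next_cells += [idx[labels == j] for j in range(V)
                           if (labels == j).any()]
        levels.append(next_cells)
    return levels
```

Each level is a full partition of the detected interest points, so a base descriptor can then be aggregated over the points of every cell at every level.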
In essence, there are two variables to consider when implementing a Voronoi-based encoding: the number of levels $L$ and the number of Voronoi cells $V_l$, $1 \le l \le L-1$, to encode. For the purposes of this paper, we will consider $V_l = V$ to be constant for all levels. In addition, it is worth noting that for the Voronoi-based encoding, we construct a single PCA projection matrix $\mathbf{U}$ using the entire images of the training set. This is because we found that there is very little gain in retrieval performance when learning separate PCA projection matrices for each Voronoi partition level, mostly due to sufficient variability in the scale of ROIs in the training images alone. Finally, given we are dealing with PCA on high-dimensional data, for the case where the unprojected cell-descriptor dimension $D$ is greater than the training size $T$, we use the manipulation described by Bishop [54]. In essence, we define the covariance matrix for the $T \times D$ training descriptor matrix $\mathbf{X}$ as $\frac{1}{T}\mathbf{X}\mathbf{X}^{\top}$, which provides a lower-dimensional ($T \times T$) matrix to work with. Following singular value decomposition, we then rotate the derived projection matrix $\mathbf{V}$ into the original covariance data space to obtain the projection matrix $\mathbf{U}$, using the equivalence:

$\mathbf{U} = \frac{1}{\sqrt{T}}\,\mathbf{X}^{\top}\mathbf{V}\,\boldsymbol{\Lambda}^{-1/2}$ (6)

where $\boldsymbol{\Lambda}$ is the diagonal matrix of eigenvalues of $\frac{1}{T}\mathbf{X}\mathbf{X}^{\top}$.
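A sketch of this manipulation (our reading of Bishop's small-sample PCA trick, under the assumption of zero-mean, row-wise descriptors; names are illustrative): eigendecompose the small $T \times T$ matrix, then rotate the eigenvectors back into the $D$-dimensional data space.

```python
import numpy as np

def small_sample_pca(X, n_components):
    """X: (T, D) row-wise training descriptors, assumed zero-mean, D >> T."""
    T = X.shape[0]
    S_small = X @ X.T / T                        # T x T instead of D x D
    evals, V = np.linalg.eigh(S_small)           # ascending eigenvalues
    order = np.argsort(evals)[::-1][:n_components]
    evals, V = evals[order], V[:, order]
    # Rotate into data space: u_i = X^T v_i / sqrt(T * lambda_i).
    U = X.T @ V / np.sqrt(T * evals)
    return U, evals                              # U: (D, n_components)
```

The columns of `U` are unit-norm eigenvectors of the full $D \times D$ covariance matrix, obtained without ever forming it.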
We conclude this subsection by summarizing the VLAD and deep CNN descriptors utilized for each Voronoi partition.
III-A1 Voronoi-based VLAD (VVLAD)
We require a detector that is robust to scale and viewpoint changes, while also detecting enough points in salient regions to allow for reliable partitioning. Therefore, for VVLAD, we use the Hessian Affine detector [55, 56], which is based on the multiscale determinant of the Hessian matrix (computed locally), and detects affine covariant regions. SIFT descriptors are produced based on the detected points. It is worth noting that: (i) salient point detection is an implicit step in each VLAD computation and not additional processing; (ii) unlike MultiVLAD, there is no need to preprocess the image and exclude featureless regions. As shown in the example of Fig. 1, smaller Voronoi cells are adaptively formed around regions of tight clusters of detected points.
III-A2 Voronoi-based Deep CNN (VDCNN)
In this case, the salient-point detection constitutes additional preprocessing. Nevertheless, this can be achieved efficiently by using the FAST corner detector [57], which classifies a pixel as a corner based on its intensity relative to a set of contiguous pixels. As for the case of VVLAD, the image is partitioned into Voronoi cells based on the location of detected points. Since the deep CNN must take a rectangular input image segment, we compute a bounding box over the constituent points of each cell, and then resize and subtract an average image, as per convention, before feeding into the pretrained deep CNN. Given that the cells are treated independently, the feedthrough can be done in parallel, using multiple copies of the network. In terms of the deep CNN descriptor specifics, we use the CNN-S architecture [58], pretrained on ILSVRC-2012 with batch normalization [59]. The network is sufficiently deep to provide a rich semantic representation of the image/image partitions without overfitting to the classification task. The conventional approach to generating a feature descriptor from the network is to simply extract one of the fully-connected layers [29, 26, 32]. Instead, we extract the last max-pooling layer (Layer 13) of the network, which precedes the fully-connected layers and should be less tuned to the classification task. From this layer, we generate a 512-D feature descriptor by averaging the CNN activations over the spatial dimensions. We can also (optionally) apply PCA projection and truncation to achieve further compaction to 128 dimensions.
III-B Fast Online Implementation: Adaptive Search and Image Similarity Score
Conventionally, we could assign an image score as the global maximum similarity to a query over all $C$ cells, using (2) for each cell. However, the proposed Voronoi partitioning essentially gives us a tree of spatial Voronoi cells where, for $L$ levels, the "leaf" Voronoi cells exist at the bottom of the tree. Given that there is inherent mutual information between a cell and its constituent cells, rather than accessing data for all levels and measuring similarity over all cells of the tree indiscriminately, we can design an adaptive search with top-to-bottom tree pruning to find the Voronoi cells most relevant to the query. This reduces the overall execution time and memory accesses when performing a retrieval task, which makes our proposal applicable to very large image databases containing millions of images. The top-to-bottom search is carried out in two phases.
Phase 1: Considering the cell of level $l$ with maximum similarity to the query [measured via (2)], in Phase 1 of the search we assume that either this cell or a constituent cell within it (at level $l+1$) will attain high similarity to the query. If the cell of level $l$ is found to attain the highest similarity to the query, we terminate the search for that image at level $l$ and proceed to Phase 2. On the other hand, if we find that a constituent cell of level $l+1$ attains the maximum similarity, we repeat Phase 1 for that cell and its constituent cells at the next level ($l+2$), until we reach the bottom of the tree, in which case we move to Phase 2.
Phase 2: Let us denote the maximum similarity found by Phase 1 for each level $l$ as $s_l$ and assume that Phase 1 exited at level $\ell^*$, $0 \le \ell^* \le L-1$. Rather than assigning $s_{\ell^*}$ as the similarity score between the ROI query and the test image in the dataset, we perform a weighted sum over all $s_l$. To this end, we first compute the difference $\delta_l$, $0 \le l \le \ell^*$, between the number of interest points in the query and the number of interest points in the image-dataset cell corresponding to $s_l$. This difference is subsequently used within a scaled inverse function. The weight $w_l$ for $s_l$ ($0 \le l \le \ell^*$) is thus defined as:
$w_l = \exp\!\left(-\,\delta_l^2 / \sigma\right)$ (7)
where $\sigma$ controls the order (set as the modal order of magnitude over all $\delta_l^2$). The weight vector over all levels is normalized so that the image score can be ranked independently of the level at which Phase 1 terminated. Denoting the normalized weight as $\bar{w}_l$, the proposed similarity score between a ROI query and a dataset image after Phase 2 is:
$S = \sum_{l=0}^{\ell^*} \bar{w}_l\, s_l$ (8)
For example, for a three-level partition, if a query object is small relative to the image size, we expect that the total number of interest points over the query would be comparable to that of a level-2 cell. Hence, the level-2 maximum dot product should receive the largest weighting when computing the similarity score. This is expected to be a more robust similarity scoring than just taking a global maximum over all cells (as in MultiVLAD) as the similarity score, since we account for relevant information from all levels.
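The two phases can be sketched as follows (a minimal illustration under stated assumptions: cells are hypothetical dictionaries holding precomputed similarity scores and interest-point counts, and the Gaussian weighting of the point-count difference follows our reading of (7); all names are ours):

```python
import math

def fastve_score(root, query_points, sigma=1e4):
    """Two-phase adaptive scoring over a tree of Voronoi cells.

    root: {"sim": float, "points": int, "children": [...]} per cell.
    """
    best, level_best = root, []
    # Phase 1: descend while a constituent cell beats its parent.
    while True:
        level_best.append(best)
        kids = best.get("children", [])
        if not kids:
            break                                 # bottom of the tree
        cand = max(kids, key=lambda c: c["sim"])
        if cand["sim"] <= best["sim"]:            # parent wins: exit here
            break
        best = cand
    # Phase 2: weight each level's best similarity by point-count agreement.
    w = [math.exp(-((query_points - c["points"]) ** 2) / sigma)
         for c in level_best]
    z = sum(w)
    return sum(wi / z * c["sim"] for wi, c in zip(w, level_best))
```

With a small query, the level whose winning cell has a point count closest to the query's receives the largest weight, mirroring the intuition in the paragraph above.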
Summary: We term this two-phase search, coupled with the Voronoi partitioning, Fast Voronoi-based encoding (FastVE), because it reduces the expected number of cells that are accessed at runtime. The upper bound for the matching complexity is now:
$A \le 1 + \sum_{l=1}^{L-1} V_l$ (9)
inner products per image, instead of the $C$ inner products required using a global maximum-similarity measure that considers all cells. Due to the weights of (8), along with the Voronoi-based encoding we also store, per image, the number of interest points per cell, comprising $C$ additional values.
It is worth noting that further storage compaction of the FastVE is feasible using level projection. Via level projection, we can adhere to memory constraints of a practical deployment for very large image datasets by only storing the PCA-projected cell descriptors for the last level, and computing the cell descriptors for the upper levels at runtime by aggregating smaller-cell descriptors. Given that such storage compaction is of secondary importance in the overall unquantized and quantized FastVE design, we include its details as supplementary information in Appendix A.
IV Product Quantization for Efficient VE Search
Given that quantized descriptor representations offer significantly higher compaction than unquantized ones, we extend VE and FastVE to quantized representations via a specially-designed product quantization framework.
IV-A Product Quantization based on Symmetric Distance Computation for Voronoi-based Encoding
We consider PQ based on SDC for the proposed VE approach (refer to Section II-D for nomenclature and symbol definitions), where both the query vector and the test-dataset vector are quantized [46]. We opted for SDC-based rather than ADC-based PQ because ADC-based methods require the precomputation and storage of distances between VE query and test vectors, which is not feasible in a large-scale image retrieval system where potentially any image could form a query.
In SDC-based PQ, the nearest neighbour to a query vector $\mathbf{q}$ can be approximated by optimizing the distance function between the quantized query and quantized test vectors. The distance function is typically the squared Euclidean distance [46]:

$\hat{d}(\mathbf{q}, \mathbf{t}) = \sum_{m=1}^{M} \left\| Q_m(\mathbf{q}_m) - Q_m(\mathbf{t}_m) \right\|_2^2$ (10)

with $M$ the number of subquantizer blocks of (4) and $Q_m(\cdot)$ the $m$th subquantizer.
The key intuition behind the modified PQ for VE is to treat the constituent Voronoi cells as images and apply PQ on each query and test cell. A single PQ codebook is learned using K-means clustering on a training set. Each cell descriptor from the test dataset is thus considered as a concatenation of $M$ subvectors of $D'_b$ elements each, with each subvector being encoded from its corresponding subcodebook $\mathcal{C}_m$. In this way, there is also no dependency on the level $l$, as we quantize the cell descriptors from a single PQ codebook. All possible distance values between the $i$th and $j$th subcodebook vectors in the $m$th subspace, $1 \le m \le M$, are precomputed and stored in a lookup table, thus enabling efficient ANN search by simple lookup-table accesses.
As the subspaces are orthogonal, we normalize the product quantization of each cell descriptor's subquantizer block $m$, $1 \le m \le M$, by normalizing the columns of the PQ subcodebooks individually before computing and storing the distance values. For the $j$th subcodebook vector $\mathbf{c}_{m,j}$ in the $m$th subspace, the normalization term is given by $\sqrt{M}\,\|\mathbf{c}_{m,j}\|_2$. As such, the distance value to be stored between the $i$th and $j$th subcodebook vectors is

$d_m(i,j) = \left\| \frac{\mathbf{c}_{m,i}}{\sqrt{M}\,\|\mathbf{c}_{m,i}\|_2} - \frac{\mathbf{c}_{m,j}}{\sqrt{M}\,\|\mathbf{c}_{m,j}\|_2} \right\|_2^2$ (11)
Quantizing from the normalized PQ subcodebooks, the distance function between a subspace-normalized and quantized query cell descriptor $\hat{\mathbf{q}}$ and a subspace-normalized and quantized test cell descriptor $\hat{\mathbf{t}}$, with subcodeword indices $i_m$ and $j_m$ in the $m$th subspace, is now simply

$\hat{d}(\hat{\mathbf{q}}, \hat{\mathbf{t}}) = \sum_{m=1}^{M} d_m(i_m, j_m)$ (12)

which is analogous to the squared Euclidean distance of normalized vectors. Importantly, this bounds the similarity score between $-1$ and $1$, which facilitates performance comparisons.
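Our reading of the subspace normalization can be sketched as follows: scaling every subcodeword to norm $1/\sqrt{M}$ makes any concatenation of one codeword per block a unit-norm vector, so the stored distances of (11)–(12) behave like squared distances between normalized vectors (names are illustrative):

```python
import numpy as np

def normalise_subcodebooks(subcodebooks):
    """Scale every subcodeword to norm 1/sqrt(M), M = number of blocks."""
    M = len(subcodebooks)
    return [C / (np.sqrt(M) * np.linalg.norm(C, axis=1, keepdims=True))
            for C in subcodebooks]
```

Since any reconstruction then has unit norm, a similarity of the form $1 - \hat{d}/2$ is guaranteed to lie in $[-1, 1]$, matching the bound stated above.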
Fig. 2 illustrates two indicative examples of SDC-based PQ on VLAD descriptors of two different dimensions $D'$, both with and without subspace normalization. The retrieval performance is measured in terms of mean average precision (mAP) on the Holidays dataset [36], using whole-image queries. It is evident from the results of the figure that, for a given dimension $D'$, subspace normalization actually improves retrieval performance, peaking at an intermediate block size. At this block size, the subspace dimensionality is sufficient such that each subspace is optimally regularized. In addition, we observe that the performance margin between the VLAD descriptor and its subspace-normalized counterpart increases significantly with the dimension $D'$.
Essentially, we want the block size to be large enough that we encode over a sufficient number of bits; however, beyond a certain block size, we end up normalizing over too few dimensions. In this regard, it is interesting to consider the limit case where $D'_b = 1$, i.e., $M = D'$. There, subspace normalization results in storing just the sign per cell-descriptor component. In this extreme case, the similarity between cells can be computed very efficiently by using the Hamming distance, i.e., without accessing any lookup tables.
Concerning storage requirements, assuming that the components of the unquantized cell representation are kept as 32-bit floating-point numbers, their offline storage requirement is $32\,C\,D'$ bits per test image. On the other hand, our product-quantized cell descriptor requires $C\,B$ bits per test image, which is independent of the dimension $D'$. In addition, for the entire test dataset, the total storage requirement for the quantization lookup tables is $32\,M\,(k^*)^2$ bits. As the test dataset grows in size, this value becomes negligible in comparison to the storage requirement for the product-quantized descriptors.
Finally, with regards to the search complexity, the inner products have been replaced by read accesses to the lookup tables. As such, the product-quantized FastVE now has an upper bound on complexity of $M\,A$ reads, which is independent of the descriptor dimension per cell.
IV-B Optimal Bit Allocation via Whitening and Subspace Normalization
Given the presence of multiple cells in the Voronoibased encoding, it is important to derive an appropriate bit allocation strategy that minimizes the quantization distortion.
Assumption 1. We consider successive samples of each subspace-normalized VE component (dimension) $e$ ($1 \le e \le D'_b$) to be modelled by independent, normally-distributed random variables, with corresponding variance $\sigma_e^2$.

Under Assumption 1, the normalized random vectors of all subspaces $m$, $1 \le m \le M$, can then be represented by independent and identically-distributed multivariate Gaussians (the Gaussian assumption is necessary for some of the theoretical derivations, but has also been shown to hold in practice [60, 61]), with corresponding diagonal covariance matrices $\boldsymbol{\Sigma}_m$. The rate-distortion function for independent, normally-distributed random variables [62] can be extended to the multivariate case in order to derive the optimal bit-allocation strategy for VE. This leads to the following proposition.
Proposition 1.
Under Assumption 1, optimal bit allocation after subspace normalization in VE can be achieved by balancing the variances of the subspaces.
Proof:
See Appendix B. ∎
Indeed, recent work [61] employs an optimized product quantization (OPQ) that effectively leads to balanced subspace variances by assigning principal components to subspaces with the objective of balancing the product of eigenvalues per subspace. This corresponds to performing a permutation of the principal components to achieve balanced variances. Jegou et al. [38] propose balancing the component variances with a random orthogonal rotation, but this removes the decorrelation achieved by PCA. A different approach is proposed by Brandt et al. [60]: one can achieve a constant quantization distortion per subspace by varying the number of bits assigned to each principal component, at the cost of increased training and runtime complexity. Finally, Spyromitros-Xioufis et al. [63] consider the effects of applying a random orthogonal rotation on PCA-projected and whitened VLAD vectors prior to product quantization. However, whitening inherently balances the subspace variances by setting $\boldsymbol{\Sigma}_m$ to the identity matrix for all $m$, which also preserves decorrelation and mitigates descriptor bias from visual-word/component co-occurrences [40, 41]. As such, we propose a simple and effective solution for the bit allocation that adheres to the theoretical result of Proposition 1: we use a whitening approach after PCA (and prior to the product quantization), together with the subspace normalization described in the previous section (and shown to be beneficial by the tests of Fig. 2). Specifically, per cell, we can express the relationship between a projected descriptor $\mathbf{t}$ and its whitened and normalized counterpart $\hat{\mathbf{t}}_m$ in the $m$th subspace as:

$\hat{\mathbf{t}}_m = \dfrac{\boldsymbol{\Lambda}_m^{-1/2}\,\mathbf{t}_m}{\sqrt{M}\,\bigl\|\boldsymbol{\Lambda}_m^{-1/2}\,\mathbf{t}_m\bigr\|_2}$ (13)

where $\boldsymbol{\Lambda}_m$ is the diagonal subspace matrix of eigenvalues of the training-set covariance matrix, with its $e$th diagonal element associated with the $e$th largest eigenvector in the $m$th subspace.
The advantage of using whitening and normalization against previous approaches is that there is no need for any additional preprocessing, such as learning a rotation matrix or variability in the bit allocation across the principal components. We term our approach whitening & normalization based product quantization (WNPQ).
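The WN-PQ pipeline described above can be sketched as follows. All dimensions and the random "codebooks" are illustrative stand-ins (in practice the per-block codebooks would be learned with k-means); this is a sketch of the idea, not the authors' implementation.

```python
import numpy as np

# Sketch of the WN-PQ pipeline: PCA-project, whiten by inverse square roots
# of the training-set eigenvalues, L2-normalize, then product-quantize.
# Dimensions and the random codebooks are illustrative stand-ins.
rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 64))                 # training descriptors
mu = X.mean(axis=0)
eigval, eigvec = np.linalg.eigh(np.cov(X - mu, rowvar=False))
order = np.argsort(eigval)[::-1][:32]           # keep top 32 components
P = eigvec[:, order] / np.sqrt(eigval[order])   # projection + whitening

def encode(v, codebooks, block=8):
    """Whiten, normalize, then assign each block to its nearest centroid."""
    w = (v - mu) @ P
    w /= np.linalg.norm(w)
    return [int(np.argmin(((cb - w[j * block:(j + 1) * block]) ** 2).sum(axis=1)))
            for j, cb in enumerate(codebooks)]

codebooks = [rng.normal(size=(16, 8)) for _ in range(4)]
codes = encode(rng.normal(size=64), codebooks)   # 4 indices, one per block
assert len(codes) == 4 and all(0 <= c < 16 for c in codes)
```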
V Experimental Evaluation
V-A Datasets
We measure performance on the Holidays and Caltech Cars (Rear) test image datasets. For both datasets, a set of predefined queries and hand-annotated ground truth is used.
Caltech + Stanford Cars [37, 64]:
This test dataset consists of 1155 (360 × 240) photographs of cars taken from the rear. We test on a subset of 416 images from the Caltech Cars (Rear) dataset, from which we select 10 query images and perform three tests: (i) we mimic a surveillance test by selecting only the license plates as ROI queries; (ii) we select a section of the car trunk as mid-scale ROI queries; and (iii) we use the whole images as queries. An example of the query subset is given in the left part of Fig. 3. For the license plate test, we manually create “good” and “junk” ground-truth files over matching images [15]; the “junk” ground truth comprises any image in which the query (i.e., the license plate) is barely visible or not distinguishable by the interest point detector. To provide a more rigorous and diversified test, we combine the Caltech Cars subset with an independent set of 1000 distractor images from the Stanford Cars dataset [64], comprising various car models and orientations, giving the Caltech + Stanford Cars dataset.
Holidays [36]: The Holidays test dataset consists of 1491 images, mainly holiday photos. There are 500 provided “whole image” queries, each of a distinct scene or object. In order to test at a smaller scale, we also select salient regions from a subset of 40 query images and submit them as ROI queries to our system. An example ROI query with its corresponding matching image set is shown in the right part of Fig. 3.
V-B Setup
Unless stated otherwise, all vectors are whitened and renormalized post-PCA. Retrieval performance is measured by creating a ranked list and computing the mean average precision (mAP) over all queries. Matching complexity is defined as the number of multiply-accumulate (MAC) operations for unquantized descriptors, or the number of lookup-table reads for quantized descriptors. Per descriptor, we report the matching complexity averaged over all tests and normalized to the baseline 128-D descriptor complexity, along with the descriptor storage size in bytes.
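The ranked-list evaluation described above can be sketched as follows; `average_precision` and `mean_average_precision` are illustrative helper names, not the paper's code.

```python
# Sketch of the evaluation protocol: rank the database by descriptor
# similarity and compute mean average precision (mAP) over all queries.
def average_precision(ranked_ids, relevant):
    """AP for one query: ranked list of database ids vs. ground-truth set."""
    hits, precision_sum = 0, 0.0
    for rank, item in enumerate(ranked_ids, start=1):
        if item in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / max(len(relevant), 1)

def mean_average_precision(runs):
    """runs: list of (ranked_ids, relevant_set) pairs, one per query."""
    return sum(average_precision(r, g) for r, g in runs) / len(runs)

# Relevant items returned at ranks 1 and 3 -> AP = (1/1 + 2/3) / 2 = 5/6
assert abs(average_precision(["a", "b", "c"], {"a", "c"}) - 5 / 6) < 1e-9
```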
Caltech + Stanford Cars: Due to the specificity of the Caltech + Stanford Cars dataset, together with the lower ROI resolutions, using a deep CNN pretrained on ImageNet is not a viable option. For example, ImageNet (the ILSVRC-2012 dataset [59]) does not contain any substantial number of images (and associated labels) corresponding to car license plates; therefore, a pretrained deep CNN descriptor is not suitable for such images. Indeed, we have established that, on this dataset, the utilized deep CNN descriptor is outperformed by the VLAD descriptor variants, particularly on license-plate queries. Thus, we use this dataset to test how the proposed Voronoi-based encoding performs with the “shallow” learned VLAD descriptor of Subsection III-A.1.
For the VLAD computation, we follow the design of Subsection II-A. The PCA projection matrix, visual word centers and PQ codebook are learned on an independent dataset of 2000 car images from the Stanford Cars dataset [64]. For Fast-V-VLAD, we use a 128-D VLAD per cell and compile a ranked list from the relevant similarity score. For VLAD, we use 128-D, 768-D and 1664-D sizes, in order to align the VLAD matching complexity with that of the Fast-V-VLAD descriptor. We configure the Multi-VLAD descriptor in line with our Voronoi partitioning, which results in a (128 × 14)-D descriptor size per image [21], as derived from (3).
WN-PQ Parameter Selection:
The block sizes for the quantized 128-D, 768-D and 1664-D VLADs were chosen to align the matching complexity of the quantized 1664-D VLAD with that of the quantized Fast-V-VLAD, whilst providing the 768-D VLAD as a solution with mid-range complexity; the quantized V-VLAD uses the same block size for all cell VLADs, as does the quantized Multi-VLAD. Notably, we fix the number of centroids per block quantizer for all experiments: higher values increase the computational load of each block quantizer, whilst also increasing the storage requirement of the lookup tables, which is an important detriment as these tables need to be sufficiently small to fit in cache memory [46].
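A back-of-the-envelope sketch of the lookup-table footprint discussed above, under assumed parameters: 256 centroids per block quantizer, symmetric distance computation (one k × k table of precomputed centroid-to-centroid inner products per block quantizer), and 4-byte entries. These assumptions are consistent with the 262 kB single-codebook figure quoted in Section V-D, but the exact configuration is not stated here.

```python
# Back-of-the-envelope lookup-table footprint under assumed parameters:
# k = 256 centroids per block quantizer, symmetric distance computation
# (one k x k table per block quantizer), 4-byte (float32) entries.
def lut_bytes(num_blocks, k=256, bytes_per_entry=4):
    # one k x k table of precomputed centroid-to-centroid inner products
    # per block quantizer
    return num_blocks * k * k * bytes_per_entry

print(lut_bytes(1))   # a single shared codebook needs 262144 B (~262 kB)
```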
Holidays: The Holidays dataset provides a less controlled test for our system. The scenes in the Holidays dataset are better represented by a deep CNN architecture trained on ImageNet, particularly due to their high resolution. In line with prior work [65], we have confirmed that deep CNNs substantially outperform VLAD descriptors on this dataset. Therefore, we use this dataset to test how the proposed Voronoi-based encoding performs with the deep CNN descriptor of Subsection III-A.2.
For the utilized CNN-S architecture [58], all images and image partitions are resized to the network input size and fed into the network after subtracting an average image. The final feature descriptor is 512-D, which can then be normalized, PCA-projected to 128-D and whitened. However, following an approach similar to the instance-retrieval pipeline for the VLAD descriptor, we normalize, sign-square-root and renormalize the feature descriptor prior to PCA and whitening, with the intention of minimising the burstiness of dimensions and thus adding to descriptor invariance [38]. It is worth noting that, contrary to recent works [65, 66], we do not manually rotate the images in the Holidays dataset, as we do not deem this to be a fair representation of data ‘in the wild’.
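The descriptor post-processing chain described above can be sketched as follows. The matrix `W` is a random stand-in for a PCA + whitening projection learned offline, and the final re-normalization is an assumption of this sketch rather than a stated step.

```python
import numpy as np

# Sketch of the descriptor post-processing: L2-normalize, signed square
# root (to reduce burstiness), re-normalize, then PCA-project and whiten
# (512-D -> 128-D). W is a placeholder for a matrix learned offline.
def postprocess(feature, W):
    f = feature / np.linalg.norm(feature)
    f = np.sign(f) * np.sqrt(np.abs(f))      # sign-square-root step
    f = f / np.linalg.norm(f)
    f = f @ W                                 # PCA projection + whitening
    return f / np.linalg.norm(f)              # assumed final normalization

rng = np.random.default_rng(2)
W = rng.normal(size=(512, 128))               # placeholder learned matrix
out = postprocess(rng.normal(size=512), W)
assert out.shape == (128,) and abs(np.linalg.norm(out) - 1.0) < 1e-9
```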
Parameter selection for the Voronoi partitioning: For the VE in all cases, the FAST corner detector [57] is used, and we learn the PCA projection matrix and PQ codebook on a subset of 4000 images from the ILSVRC-2010 validation set. For Fast-VD-CNN, we use a 128-D CNN feature descriptor per cell and compile a ranked list from the relevant similarity score. We compare this with the 128-D and the full unprojected 512-D CNN feature descriptors. For Multi-CNN, we use the same grid partitioning as Multi-VLAD, thus producing a (128 × 14)-D size per image.
WN-PQ Parameter Selection: We choose the block sizes for the quantized 128-D and 512-D CNN feature descriptors analogously to the VLAD case. For the quantized Fast-VD-CNN, we use the same block size for all cell descriptors, as does the quantized Multi-CNN. As with the VLAD descriptors, we fix the number of centroids per block quantizer for all experiments, to keep the storage requirement for the lookup tables constant.
V-C Results with Unquantized Descriptors
This section summarises performance using unquantized descriptors on the Caltech + Stanford Cars and Holidays datasets.
Caltech + Stanford Cars:
Table II summarizes the retrieval performance of all unquantized VLAD methods on the Caltech + Stanford Cars dataset. The first observation is that the Fast-V-VLAD descriptor offers competitive performance to the larger 1664-D VLAD, whilst decreasing the matching complexity by more than 50%. In addition, Fast-V-VLAD performs significantly better on license plate queries than both the 128-D VLAD and its 768-D VLAD complexity counterpart, yielding respective mAP gains of over 200% and 41%. Fast-V-VLAD maintains consistently good mAP even with the larger ROIs of car trunks and whole-image queries, and is only outperformed by VLAD on whole-image queries, by a margin of (up to) 7%. Finally, Fast-V-VLAD maintains competitive performance to Multi-VLAD on all query types, whilst offering lower dimensionality and matching complexity.
Holidays:
Table III summarises the retrieval performance for the 500 whole-image queries and the 40 smaller ROI queries on the Holidays dataset. Interestingly, the Fast-VD-CNN remains competitive on whole-image queries. This is attributed to the Fast-VD-CNN similarity score of (8), which considers all partition levels and thereby provides robustness against false positives. The Fast-VD-CNN is found to outperform Multi-CNN for whole-image queries and to maintain very competitive performance on ROI queries, while offering more than 50% reduction in the matching complexity. (We have also validated that this saving translates into a practical runtime saving: by adding a large distractor set, thereby scaling the dataset size to 150K images, we found that Fast-VD-CNN based retrieval is 40% faster than Multi-CNN retrieval, with execution time comparable to that of the baseline 512-D CNN feature descriptor.) Fast-VD-CNN was also found to substantially outperform the lower-dimensional CNN feature descriptors for ROI queries (gains exceeding 50% in mAP). Given that the utilized CNN-S descriptor derived from Layer 13 is limited to 512 dimensions [58], we also benchmarked using the first fully connected layer (FC1), which allows for a large 4096-D feature descriptor. Nevertheless, the FC1 descriptor performed significantly worse than our 512-D Layer 13 descriptor for both ROI and whole-image queries, scoring mAP of 28.3% and 71.4%, respectively. This serves as additional validation for our choice of CNN layer.
V-D Results with Quantized Descriptors
We now examine performance when quantization is integrated into all approaches under comparison.
WN-PQ against other quantization methods: We first compare the proposed WN-PQ method against other state-of-the-art methods, namely the parametric optimized product quantization (OPQ) [61] and product quantization with a random rotation preprocessing (RR-PQ) [38]. As the OPQ and RR-PQ descriptors are not normalized, we use the squared Euclidean distance metric for these methods and compare retrieval performance on both datasets. The results of Fig. 4 show that the proposed WN-PQ method outperforms RR-PQ and, for the majority of the tests, also outperforms OPQ. Essentially, WN-PQ maintains its high retrieval performance when the dimensionality is increased from 128-D to the 768-D and 1664-D VLAD descriptors. Because the proposed WN-PQ was shown to provide the best overall performance, and to ensure a fair comparison, we use it to quantize all the descriptors under comparison.
TABLE II: Retrieval mAP of unquantized VLAD variants on Caltech + Stanford Cars.

Method | Dimensions | Matching Complexity | Descriptor Storage (bytes) | License Plates | Trunk | Whole Image
VLAD [38] | 128 | 1 | 512 | 0.148 | 0.669 | 0.729
VLAD [38] | 768 | 6 | 3.07k | 0.348 | 0.739 | 0.780
VLAD [38] | 1664 | 13 | 6.66k | 0.512 | 0.722 | 0.785
Proposed Fast-V-VLAD | 128 × 13 | 6.55 | 6.66k | 0.490 | 0.745 | 0.728
Multi-VLAD [21] | 128 × 14 | 14 | 7.17k | 0.493 | 0.780 | 0.732
TABLE III: Retrieval mAP of unquantized CNN-based descriptors on Holidays.

Method | Dimensions | Matching Complexity | Descriptor Storage (bytes) | Query ROI | Whole Image
CNN (Layer 13) | 128 | 1 | 512 | 0.339 | 0.757
CNN (Layer 13) | 512 | 4 | 2.05k | 0.369 | 0.767
Proposed Fast-VD-CNN | 128 × 13 | 6.11 | 6.66k | 0.674 | 0.761
Multi-CNN [21, 58] | 128 × 14 | 14 | 7.17k | 0.678 | 0.737
Caltech + Stanford Cars: Table IV summarises the performance of the various descriptors with WN-PQ on the Caltech + Stanford Cars dataset. On whole images, coupled with the aggregated similarity score, Fast-V-VLAD offers superior performance to the 128-D VLAD, with an mAP gain of 6%. The 1664-D VLAD, which is now of comparable complexity to Fast-V-VLAD, is outperformed on the small license plate queries (an mAP gain of 9% for Fast-V-VLAD), but remains superior for whole-image queries. However, it is worth mentioning that the gain of Fast-V-VLAD on small queries outweighs any loss on larger queries, thus making it favorable. Finally, the quantized Multi-VLAD offers marginally superior mAP to Fast-V-VLAD, albeit at the cost of twice the matching complexity and a higher descriptor storage size. (It is worth noting that, because we use a single PQ codebook for quantizing all cell components of Fast-V-VLAD and Fast-VD-CNN, all quantized systems incur an additional cost for storing the lookup tables. For example, although the 1664-D VLAD offers a lower storage size than Fast-V-VLAD, it requires an additional 1.7 MB to store the lookup tables, versus 262 kB for Fast-V-VLAD. As mentioned previously, however, the significance of this additional storage cost diminishes as the test dataset size increases.)
Holidays: For the Holidays dataset, the quantized Fast-VD-CNN maintains its mAP gain on ROI queries over the quantized 128-D and 512-D CNN feature descriptors, while its descriptor storage is reduced by a factor of 16 compared to its unquantized counterpart. In addition, Fast-VD-CNN still performs better than quantized Multi-CNN on whole-image queries, with an mAP gain of 4%.
TABLE IV: Retrieval mAP of WN-PQ quantized VLAD variants on Caltech + Stanford Cars.

Method | Dimensions | Matching Complexity | Descriptor Storage (bytes) | License Plates | Trunk | Whole Image
WN-PQ VLAD | 128 | 1 | 32 | 0.112 | 0.606 | 0.626
WN-PQ VLAD | 768 | 3 | 96 | 0.257 | 0.663 | 0.677
WN-PQ VLAD | 1664 | 6.5 | 208 | 0.404 | 0.696 | 0.702
Proposed WN-PQ Fast-V-VLAD | 128 × 13 | 6.43 | 416 | 0.440 | 0.713 | 0.661
WN-PQ Multi-VLAD | 128 × 14 | 14 | 448 | 0.449 | 0.769 | 0.652
TABLE V: Retrieval mAP of WN-PQ quantized CNN-based descriptors on Holidays.

Method | Dimensions | Matching Complexity | Descriptor Storage (bytes) | Query ROI | Whole Image
WN-PQ CNN (Layer 13) | 128 | 1 | 32 | 0.248 | 0.674
WN-PQ CNN (Layer 13) | 512 | 4 | 128 | 0.280 | 0.706
Proposed WN-PQ Fast-VD-CNN | 128 × 13 | 6.13 | 416 | 0.550 | 0.684
WN-PQ Multi-CNN | 128 × 14 | 14 | 448 | 0.603 | 0.656
V-E Further Improvements on Whole-image Search
The experimental results of the previous section show that Fast-V-VLAD and Fast-VD-CNN clearly outperform their counterparts for ROI image search, while remaining competitive for whole-image search. The performance on whole-image queries is primarily controlled by the dimension of the level-0 (whole image) component. In the previous section, we set the dimension uniformly across all components of the Voronoi-based descriptor, i.e., a 128-D descriptor per cell. As a result, the mAP of the Voronoi-based descriptors on whole images is comparable to that of the 128-D reference descriptors. One option to tailor performance towards whole-image queries or smaller ROI queries is to taper the dimension across levels; we leave this as a topic for future study.
Another approach to boost performance for whole-image queries is to account for multiple scales in both the query and the dataset images. In other words, rather than applying Voronoi partitioning only to the dataset images, we can also apply Voronoi partitioning to the query image over multiple levels and submit each query partition as a subquery. Notably, using Fast-VD-CNN for the dataset image encodings, each subquery is matched only against the representative cells of the dataset images (i.e., between 4 and 7 cells), which are determined by the adaptive search proposed in Section III-B. The inner product between the original query image and a dataset image is taken as the average inner product over all subqueries. While this incurs a linear increase in the search complexity (proportional to the number of subqueries), it scales better than the quadratic search complexity of Carlsson et al. [32, 30], where exhaustive search amongst all subqueries is carried out.
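The multi-scale subquery scoring described above can be sketched as follows. The selection of representative cells here (simply taking the top-scoring cells) is a simplified placeholder for the adaptive search of Section III-B, and `image_score` is an illustrative name.

```python
import numpy as np

# Sketch of multi-scale query scoring: each query cell is a subquery,
# matched against a few representative dataset cells; the final image
# score is the average inner product over all subqueries. The cell
# selection here is a placeholder for the adaptive search.
def image_score(query_cells, dataset_cells, representatives=4):
    per_subquery = []
    for q in query_cells:                              # one subquery per cell
        sims = sorted((float(q @ c) for c in dataset_cells), reverse=True)
        per_subquery.append(np.mean(sims[:representatives]))
    return float(np.mean(per_subquery))

# With identical unit-norm encodings the score reaches its maximum of 1.0.
v = np.ones(4) / 2.0                                   # unit-norm vector
assert abs(image_score([v], [v, v, v, v]) - 1.0) < 1e-9
```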
Table VI compares the retrieval performance of the proposed Fast-VD-CNN descriptor against current state-of-the-art approaches on the Holidays dataset that use networks pretrained on ImageNet. The Fast-VD-CNN descriptor is generated under the configuration of Section V-B, albeit now also partitioning the queries and resizing the image partitions accordingly. Beyond benchmarking against the grid-based spatial search method of Carlsson et al. [32], we also compare our results with the recently-proposed CNN+VLAD [33], CKN-mix [45], the hybrid FV+NN approach of Perronnin et al. [27], as well as lower-dimensional but more computationally-intensive proposals that perform competitively [65, 66, 67]. (In particular: the SPoC descriptor [65] offers the best performance-to-dimensionality ratio, but utilizes a deeper and more computationally-heavy CNN (144M parameters vs. 76M parameters for our architecture) and a larger image input size; the R-MAC based descriptor uses Siamese learning with supervised whitening; and NetVLAD requires additional processing (soft assignment and normalizations within the NetVLAD layer) to encode VLAD from the network activations. On the contrary, under the chosen configuration, the proposed Fast-VD-CNN allocates only 128 dimensions per cell and accesses between 4 and 7 cells for each image subquery.) Evidently, the additional scale and location invariance provided by the Voronoi partitioning leads to the proposed Fast-VD-CNN achieving competitive performance to other CNN-derived frameworks and hybrid variants, without manually rotating the images, and despite the fact that our feature descriptor is built directly from a pretrained network and incurs modest computational and storage requirements.
TABLE VI: Whole-image retrieval mAP on Holidays.

Method | Dimensions | Whole Image
Proposed Fast-VD-CNN | 1.66K (128) | 0.821
FV+NN (Perronnin et al.) [27] | 4K | 0.835
CNN + VLAD [33] | 2K | 0.802
CNN (Carlsson et al.) [32] | 4K–15K | 0.769
CKN-mix [45] | 4K | 0.829
SPoC (w/o center prior) [65] | 256 | 0.802
R-MAC [66] | 512 | 0.825
NetVLAD [67] | 256 | 0.799
VI Conclusion
We proposed a novel descriptor design, termed Voronoi-based encoding (VE), for region-of-interest image retrieval. We have shown how VE can fit into a practical ROI-based retrieval system via the proposed fast search, memory-efficient design, product-quantization based lossy compression, and robust similarity scoring mechanisms. We tested retrieval performance on two datasets, using VLAD and a deep CNN as our descriptor bases. Our results show that the approach is descriptor-agnostic: the proposed Fast-V-VLAD and Fast-VD-CNN maintain competitive retrieval performance over diverse ROI queries on both datasets, and significantly improve on the retrieval performance (or implementation efficiency) of their respective descriptor variants with a grid spatial search when dealing with smaller ROI queries. Moreover, the improved geometric invariance results in competitive retrieval performance to the current state-of-the-art on whole-image queries.
Appendix A Level Projection for VE Storage Compaction
In order to decrease the storage requirements for unquantized Voronoi-based encoded (VE) representations, the descriptor over two constituent cells (i.e., spatially-neighboring cells belonging to the same cell of the upper level) can be approximated by the normalized sum of the two cell descriptors:
x_{a∪b} ≈ (x_a + x_b) / ‖x_a + x_b‖_2   (14)
where x_a and x_b denote the PCA-projected descriptors of the two constituent cells. This holds because both PCA and whitening are linear mappings; therefore, if we do not consider the vector truncation and subsequent normalization of the individual cell vectors, the additivity property holds in the projected domain as well. Given that directionality is preserved under normalization, (14) provides an approximation to the normalized encoding computed directly over the two cells. Therefore, we can trade off computation for memory by solely storing the last-level PCA-projected descriptors and computing the cell encodings of all lower levels at runtime, via repeated application of (14) amongst constituent cells and renormalization before carrying out the similarity measurement of (2). This is an appealing proposition for practical systems, because vectorized addition and scaling for normalization is extremely inexpensive on modern SIMD-based architectures. As such, this approach requires storing only the last-level cell descriptors, instead of the cell vectors of all levels. Naturally, there is a dependency on the projection error, which will evidently be greater the fewer dimensions are retained post-PCA.
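The storage-compaction scheme above can be sketched as follows; only last-level cell descriptors are stored, and upper-level encodings are recovered at runtime by summing constituent cell vectors and re-normalizing, per (14). The binary grouping of cells is illustrative, whereas the actual hierarchy groups spatially-neighboring Voronoi cells.

```python
import numpy as np

# Sketch of level projection: store only last-level cell descriptors and
# recover upper-level encodings by summing constituent cell vectors and
# re-normalizing. The binary grouping of cells is illustrative.
def parent_encoding(child_vectors):
    s = np.sum(child_vectors, axis=0)
    return s / np.linalg.norm(s)

rng = np.random.default_rng(3)
leaves = rng.normal(size=(4, 128))                  # stored last-level cells
level1 = [parent_encoding(leaves[0:2]), parent_encoding(leaves[2:4])]
level0 = parent_encoding(np.array(level1))          # whole-image encoding
assert abs(np.linalg.norm(level0) - 1.0) < 1e-9
```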
We can integrate product quantization with a modified level projection for quantized VE storage compaction. As before, we only store the last-level (quantized) descriptors offline. However, since the inner product satisfies the distributive law, we can now directly approximate the inner product between a query encoding q and an upper-level cell descriptor as a normalized summation of per-cell inner products:
⟨q, x_{a∪b}⟩ ≈ (⟨q, x_a⟩ + ⟨q, x_b⟩) / ‖x_a + x_b‖_2   (15)
where each inner product on the right-hand side is read from a lookup table.
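The lookup-table based scoring of the quantized case can be sketched as follows; per-query inner products against all centroids are precomputed once, so scoring a stored descriptor reduces to table reads plus a summation. The codebook size, block dimension, and function names are illustrative.

```python
import numpy as np

# Sketch of lookup-table based inner-product scoring for quantized
# descriptors: precompute per-block query-to-centroid inner products once,
# then score each stored descriptor by table reads and a summation.
rng = np.random.default_rng(4)
k, block, blocks = 16, 8, 4
codebook = rng.normal(size=(k, block))       # one shared codebook, as in VE

def build_lut(query):
    """Per-block inner products between the query and every centroid."""
    return np.stack([query[j * block:(j + 1) * block] @ codebook.T
                     for j in range(blocks)])      # shape (blocks, k)

def lut_score(lut, codes):
    """codes: stored centroid index per block of a dataset descriptor."""
    return float(sum(lut[j, c] for j, c in enumerate(codes)))

query = rng.normal(size=block * blocks)
lut = build_lut(query)
print(lut_score(lut, [0, 5, 9, 15]))   # approximate inner product via reads
```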
Appendix B Proof of Proposition 1
Proof:
In order to optimize the bit allocation to the various descriptions (subspaces), we optimize the rate-distortion expression:
R(D) = (1/2) log_2 ( |Σ| / ∏_{s=1}^{S} D_s )   (16)
where ∏_{s=1}^{S} D_s is the product of the subcomponent distortions, |Σ| is the determinant of the covariance matrix, and D is the overall distortion value. The minimum rate for a given D is attained when all distortions are equal, i.e., when the D_s are equal for all s.
Using results derived from rate-distortion theory [68], the distortion of the d-th component of the s-th subspace can be approximated by:
D_{s,d} ≈ c σ_{s,d}^2 2^{-2R̄}   (17)
where c is a variable determined by the univariate Gaussian of the normalized components, σ_{s,d}^2 is the corresponding component variance, and R̄ is the average number of bits encoded per dimension. Due to the independence property, the product of D_{s,d} over the components of the s-th subspace yields D_s ≈ c_S |Σ_s| 2^{-2R̄B}, where the variable c_S is now determined by the multivariate Gaussian distribution of the normalized subspace random vectors. This distribution is independent of the subspace and, as such, c_S is constant for all s. Similarly, if the size of the bit encoding and the block dimension B are fixed per subspace, then 2^{-2R̄B} is a constant for all s. For D_s to be equal for all s, |Σ_s| must be constant, independent of the subspace. ∎
References
 [1] R. Arandjelovic and A. Zisserman, “Three things everyone should know to improve object retrieval,” in Proc. IEEE Int. Conf. on Comput. Vis. and Patt. Rec. (CVPR), 2012, pp. 2911–2918.
 [2] S. Bai, S. Sun, X. Bai, Z. Zhang, and Q. Tian, “Smooth neighborhood structure mining on multiple affinity graphs with applications to contextsensitive similarity,” in European Conference on Computer Vision. Springer, 2016, pp. 592–608.
 [3] A. Chadha and Y. Andreopoulos, “Regionofinterest retrieval in large image datasets with voronoi vlad,” in Int. Conf. on Computer Vision Syst. Springer, 2015, pp. 218–227.
 [4] J. Liu, Z. Huang, H. Cai, H. T. Shen, C. W. Ngo, and W. Wang, “Near-duplicate video retrieval: Current research and future trends,” ACM Computing Surveys (CSUR), vol. 45, no. 4, p. 44, 2013.
 [5] X. Xu, W. Geng, R. Ju, Y. Yang, T. Ren, and G. Wu, “OBSIR: Object-based stereo image retrieval,” in Multimedia and Expo (ICME), 2014 IEEE International Conference on. IEEE, 2014, pp. 1–6.
 [6] L. Zheng, L. Shen, L. Tian, S. Wang, J. Wang, and Q. Tian, “Scalable person re-identification: A benchmark,” in Proc. of the IEEE Int. Conf. on Comput. Vis., 2015, pp. 1116–1124.
 [7] D. G. Lowe, “Distinctive image features from scale-invariant keypoints,” Int. J. of Comput. Vis., vol. 60, no. 2, pp. 91–110, 2004.
 [8] V. Spiliotopoulos et al., “Quantization effect on VLSI implementations for the 9/7 DWT filters,” in Proc. IEEE Int. Conf. Acoustics, Speech, and Signal Processing, 2001, ICASSP’01, vol. 2. IEEE, 2001, pp. 1197–1200.
 [9] Y. Andreopoulos and M. van der Schaar, “Incremental refinement of computation for the discrete wavelet transform,” IEEE Trans. on Signal Process., vol. 56, no. 1, pp. 140–157, 2008.
 [10] Y. Andreopoulos et al., “A new method for complete-to-overcomplete discrete wavelet transforms,” in Proc. 14th IEEE Int. Conf. on Digital Signal Process., DSP 2002, vol. 2. IEEE, 2002, pp. 501–504.
 ——, “A local wavelet transform implementation versus an optimal row-column algorithm for the 2d multilevel decomposition,” in Proc. IEEE Int. Conf. on Image Process., ICIP 2001, vol. 3. IEEE, 2001, pp. 330–333.
 [12] N. Kontorinis et al., “Statistical framework for video decoding complexity modeling and prediction,” IEEE Trans. on Circ. and Syst. for Video Technol., vol. 19, no. 7, pp. 1000–1013, 2009.
 [13] J. Sivic and A. Zisserman, “Video Google: A text retrieval approach to object matching in videos,” in Proc. Ninth IEEE Int. Conf. on Comput. Vis. IEEE, 2003, pp. 1470–1477.
 [14] J. Philbin et al., “Lost in quantization: Improving particular object retrieval in large scale image databases,” in Proc. IEEE Int. Conf. on Comput. Vis. and Patt. Rec., 2008, pp. 1–8.
 [15] J. Philbin, O. Chum, M. Isard, J. Sivic, and A. Zisserman, “Object retrieval with large vocabularies and fast spatial matching,” in Proc. IEEE Int. Conf. on Comput. Vis. and Patt. Recogn., 2007, pp. 1–8.
 [16] I. González-Díaz, C. E. Baz-Hormigos, and F. Diaz-de-Maria, “A generative model for concurrent image retrieval and ROI segmentation,” IEEE Trans. Multimedia, vol. 16, no. 1, pp. 169–183, 2014.
 [17] Z. Zhong, J. Zhu, and C. Hoi, “Fast object retrieval using direct spatial matching,” IEEE Trans. Multimedia, vol. 17, no. 8, pp. 1391–1397, 2015.
 [18] L. Zheng, S. Wang, Z. Liu, and Q. Tian, “Fast image retrieval: Query pruning and early termination,” IEEE Trans. Multimedia, vol. 17, no. 5, pp. 648–659, 2015.
 [19] H. Jégou, M. Douze, C. Schmid, and P. Pérez, “Aggregating local descriptors into a compact image representation,” in Proc. IEEE Int. Conf. on Comput. Vis. and Patt. Rec., 2010, pp. 3304–3311.
 [20] F. Perronnin and C. Dance, “Fisher kernels on visual vocabularies for image categorization,” in Proc. IEEE Int. Conf. on Comput. Vis. and Patt. Recogn., 2007.
 [21] R. Arandjelovic and A. Zisserman, “All about VLAD,” in Proc. IEEE Int. Conf. on Comput. Vis. and Patt. Recogn., 2013, pp. 1578–1585.
 [22] P. Lu, Q. Sun, K. Wu, and Z. Zhu, “Distributed online hybrid cloud management for profitdriven multimedia cloud computing,” IEEE Trans. Multimedia, vol. 17, no. 8, pp. 1297–1308, 2015.
 [23] R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich feature hierarchies for accurate object detection and semantic segmentation,” in Proc. IEEE Int. Conf. on Comput. Vis. and Patt. Recogn., 2014, pp. 580–587.
 [24] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun, “Overfeat: Integrated recognition, localization and detection using convolutional networks,” arXiv preprint arXiv:1312.6229, 2013.
 [25] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. FeiFei, “Largescale video classification with convolutional neural networks,” in Proc. IEEE Int. Conf. on Comput. Vis. and Patt. Recogn., 2014, pp. 1725–1732.
 [26] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification with deep convolutional neural networks,” in Advances in neural information processing systems, 2012, pp. 1097–1105.
 [27] F. Perronnin and D. Larlus, “Fisher vectors meet neural networks: A hybrid classification architecture,” in Proc. IEEE Int. Conf. on Comput. Vis. and Patt. Recogn., 2015, pp. 3743–3752.
 [28] J. Deng, W. Dong, R. Socher, L.J. Li, K. Li, and L. FeiFei, “ImageNet: A largescale hierarchical image database,” in Proc. IEEE Int. Conf. on Comput. Vis. and Patt. Recogn. IEEE, 2009, pp. 248–255.
 [29] A. Babenko, A. Slesarev, A. Chigorin, and V. Lempitsky, “Neural codes for image retrieval,” in Europ. Conf. in Comput. Vis. (ECCV). Springer, 2014, pp. 584–599.
 [30] H. Azizpour, A. Razavian, J. Sullivan, A. Maki, and S. Carlsson, “Factors of transferability for a generic convnet representation,” 2014.
 [31] J. Yosinski, J. Clune, Y. Bengio, and H. Lipson, “How transferable are features in deep neural networks?” in Advances in Neural Information Processing Systems, 2014, pp. 3320–3328.
 [32] A. Razavian, H. Azizpour, J. Sullivan, and S. Carlsson, “CNN features offtheshelf: an astounding baseline for recognition,” in Proc. IEEE Int. Conf. on Comput. Vis. and Patt. Recogn. Workshops, 2014, pp. 806–813.
 [33] Y. Gong, L. Wang, R. Guo, and S. Lazebnik, “Multiscale orderless pooling of deep convolutional activation features,” in Europ. Conf. in Comput. Vis. (ECCV). Springer, 2014, pp. 392–407.
 [34] K. Simonyan and A. Zisserman, “Very deep convolutional networks for largescale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
 [35] V. Chandrasekhar, J. Lin, O. Morère, H. Goh, and A. Veillard, “A practical guide to CNNs and Fisher vectors for image instance retrieval,” arXiv preprint arXiv:1508.02496, 2015.
 [36] H. Jegou et al., “Hamming embedding and weak geometric consistency for large scale image search,” in Europ. Conf. in Comput. Vis. Springer, 2008, pp. 304–317.
 [37] R. Fergus, P. Perona, and A. Zisserman, “Object class recognition by unsupervised scaleinvariant learning,” in Proc. IEEE Int. Conf. on Comput. Vis. and Patt. Recogn., vol. 2, 2003, pp. II–264.
 [38] H. Jégou, F. Perronnin, M. Douze, J. Sánchez, P. Pérez, and C. Schmid, “Aggregating local image descriptors into compact codes,” IEEE Trans. on Patt. Anal. and Mach. Intell., vol. 34, no. 9, pp. 1704–1716, 2012.
 [39] F. Perronnin, Y. Liu, J. Sánchez, and H. Poirier, “Largescale image retrieval with compressed fisher vectors,” in Proc. IEEE Int. Conf. on Comput. Vis. and Patt. Recogn., 2010, pp. 3384–3391.
 [40] H. Jégou and O. Chum, “Negative evidences and co-occurences in image retrieval: The benefit of PCA and whitening,” in Proc. Europ. Conf. in Comput. Vis. (ECCV). Springer, 2012, pp. 774–787.
 [41] O. Chum and J. Matas, “Unsupervised discovery of co-occurrence in sparse high dimensional data,” in Proc. IEEE Int. Conf. on Comput. Vis. and Pat. Recogn. (CVPR). IEEE, 2010, pp. 3416–3423.
 [42] S. Lazebnik et al., “Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories,” in Proc. IEEE Int. Conf. on Comput. Vis. and Patt. Rec., vol. 2, 2006, pp. 2169–2178.
 [43] E. Mantziou, S. Papadopoulos, and Y. Kompatsiaris, “Large-scale semi-supervised learning by approximate Laplacian eigenmaps, VLAD and pyramids,” in 14th Int. Workshop on Image Anal. for Mult. Interactive Services (WIAMIS). IEEE, 2013, pp. 1–4.
 [44] R. Zhou, Q. Yuan, X. Gu, and D. Zhang, “Spatial pyramid VLAD,” in IEEE Vis. Comm. and Image Proc. Conf. IEEE, 2014, pp. 342–345.
 [45] M. Paulin, J. Mairal, M. Douze, Z. Harchaoui, F. Perronnin, and C. Schmid, “Convolutional patch representations for image retrieval: an unsupervised approach,” International Journal of Computer Vision, to appear.
 [46] H. Jegou, M. Douze, and C. Schmid, “Product quantization for nearest neighbor search,” IEEE Trans. on Patt. Anal. and Mach. Intell., vol. 33, no. 1, pp. 117–128, 2011.
 [47] M. Datar, N. Immorlica, P. Indyk, and V. S. Mirrokni, “Localitysensitive hashing scheme based on pstable distributions,” in Proc. of the twentieth annual symposium on Computational geometry. ACM, 2004, pp. 253–262.
 [48] Y. Weiss, A. Torralba, and R. Fergus, “Spectral hashing,” in Advances in neural information processing systems, 2009, pp. 1753–1760.
 [49] A. Torralba, R. Fergus, and Y. Weiss, “Small codes and large image databases for recognition,” in Proc. IEEE Int. Conf. on Comput. Vis. and Patt. Recogn. IEEE, 2008, pp. 1–8.
 [50] H. Xu, J. Wang, Z. Li, G. Zeng, S. Li, and N. Yu, “Complementary hashing for approximate nearest neighbor search,” in 2011 IEEE Int. Conf. on Comput. Vis. (ICCV). IEEE, 2011, pp. 1631–1638.
 [51] Y. Li, R. Wang, H. Liu, H. Jiang, S. Shan, and X. Chen, “Two birds, one stone: Jointly learning binary code for largescale face image retrieval and attributes prediction,” in Proc. of the IEEE Int. Conf. on Comp. Vis., 2015, pp. 3819–3827.
 [52] D. Song, W. Liu, R. Ji, D. A. Meyer, and J. R. Smith, “Top rank supervised binary coding for visual search,” in Proc. of the IEEE Int. Conf. on Comput. Vis., 2015, pp. 1922–1930.
 [53] T. Ji, X. Liu, C. Deng, L. Huang, and B. Lang, “Queryadaptive hash code ranking for fast nearest neighbor search,” in Proc. of the ACM Int. Conf. on Multimedia. ACM, 2014, pp. 1005–1008.
 [54] C. M. Bishop, Pattern recognition and machine learning. springer, 2006.
 [55] K. Mikolajczyk et al., “A comparison of affine region detectors,” Int. J. of Comput. Vis., vol. 65, no. 12, pp. 43–72, 2005.
 [56] K. Mikolajczyk and C. Schmid, “An affine invariant interest point detector,” in Proc. Europ. Conf. in Comput. Vis. (ECCV). Springer, 2002, pp. 128–142.
[57] E. Rosten and T. Drummond, “Machine learning for high-speed corner detection,” in Europ. Conf. in Comput. Vis. Springer, 2006, pp. 430–443.
 [58] K. Chatfield, K. Simonyan, A. Vedaldi, and A. Zisserman, “Return of the devil in the details: Delving deep into convolutional nets,” arXiv preprint arXiv:1405.3531, 2014.
 [59] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” arXiv preprint arXiv:1502.03167, 2015.
 [60] J. Brandt, “Transform coding for fast approximate nearest neighbor search in high dimensions,” in Proc. IEEE Int. Conf. on Comput. Vis. and Patt. Recogn. IEEE, 2010, pp. 1815–1822.
 [61] T. Ge, K. He, Q. Ke, and J. Sun, “Optimized product quantization,” IEEE Trans. on Patt. Anal. and Mach. Intell., vol. 36, no. 4, pp. 744–755, 2014.
 [62] T. M. Cover and J. A. Thomas, Elements of information theory. John Wiley & Sons, 2012.
[63] E. Spyromitros-Xioufis, S. Papadopoulos, I. Y. Kompatsiaris, G. Tsoumakas, and I. Vlahavas, “A comprehensive study over VLAD and product quantization in large-scale image retrieval,” IEEE Trans. Multimedia, vol. 16, no. 6, pp. 1713–1728, 2014.
[64] J. Krause, M. Stark, J. Deng, and L. Fei-Fei, “3d object representations for fine-grained categorization,” in IEEE Int. Conf. on Comput. Vis. Workshops (ICCVW). IEEE, 2013, pp. 554–561.
[65] A. Babenko and V. Lempitsky, “Aggregating local deep features for image retrieval,” in Proc. of the IEEE Int. Conf. on Comput. Vis., 2015, pp. 1269–1277.
[66] F. Radenović, G. Tolias, and O. Chum, “CNN image retrieval learns from BoW: Unsupervised fine-tuning with hard examples,” arXiv preprint arXiv:1604.02426, 2016.
 [67] R. Arandjelović, P. Gronat, A. Torii, T. Pajdla, and J. Sivic, “NetVLAD: CNN architecture for weakly supervised place recognition,” arXiv preprint arXiv:1511.07247, 2015.
 [68] A. Gersho and R. M. Gray, Vector quantization and signal compression. Springer Science & Business Media, 2012, vol. 159.
 [69] S. Chopra, R. Hadsell, and Y. LeCun, “Learning a similarity metric discriminatively, with application to face verification,” in Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on, vol. 1. IEEE, 2005, pp. 539–546.
 [70] O. Chum et al., “Total recall: Automatic query expansion with a generative feature model for object retrieval,” in Proc. IEEE Int. Conf. on Comput. Vis., 2007, pp. 1–8.
 [71] R. M. Gray, “Vector quantization,” IEEE ASSP Mag., vol. 1, no. 2, pp. 4–29, 1984.
[72] G. Shakhnarovich, T. Darrell, and P. Indyk, “Nearest-neighbor methods in learning and vision: Theory and practice, chapter 3,” 2006.