1 Introduction
Finding a single vector that represents a set of local descriptors extracted from an image is an important problem in computer vision. Such a single-vector representation provides several benefits. First, it retains the power of the local descriptors, such as a set of SIFT descriptors [1]. Second, the representation vectors can be used for image retrieval (compared using standard metrics such as the Euclidean distance) or for classification (as input to robust classifiers such as SVMs). Furthermore, they can be readily used with recent advanced indexing techniques [2, 3] for large-scale image retrieval.
A wide range of methods for representing a set of local vectors by a single vector has been proposed in the literature: bag-of-visual-words (BoW) [4], Fisher vector [5], vector of locally aggregated descriptors (VLAD) [6] and its improvements [7, 8], residual enhanced visual vector [9], super vector coding [10], vector of locally aggregated tensors (VLAT) [11, 12], which is a higher-order (tensor) version of VLAD, triangulation embedding (Temb) [13], sparse coding [14], local coordinate coding (LCC) [15], locality-constrained linear coding [16], which is a fast version of LCC, and local coordinate coding using local tangents (TLCC) [17], which is a higher-order version of LCC. Among these methods, VLAD [18] and VLAT [12] are well-known embedding methods for image retrieval [18, 12], while TLCC [17] is one of the successful embedding methods for image classification.
VLAD is designed for image retrieval, whereas TLCC is designed for image classification. They are derived from different motivations: for VLAD, the motivation is to characterize the distribution of residual vectors over the Voronoi cells learned by a quantizer; for TLCC, the motivation is to linearly approximate^{1} a nonlinear function in a high dimensional space. Despite these differences, we show, based on our original analysis, that VLAD is actually a simplified version of TLCC. The consequence of this analysis is significant: we can start from the idea of linear approximation of a function to develop powerful embedding methods for the image retrieval problem.
^{1} Here, “linear approximation” means that the nonlinear function $f(x)$ defined on $\mathbb{R}^d$ is approximated by a linear function (w.r.t. $\phi(x)$) defined on $\mathbb{R}^D$, where $D > d$.
In order to compute the single representation, all aforementioned methods include two main steps in the processing: embedding and aggregating. The embedding step uses a visual vocabulary (a set of anchor points) to map each local descriptor to a high dimensional vector while the aggregating step converts the set of mapped high dimensional vectors to a single vector. This paper focuses on the first step. In particular, we develop a new embedding method which can be seen as the generalization of TLCC and VLAT.
In the next sections, we first present a brief description of TLCC. Importantly, we derive the relationship between TLCC and VLAD. We then present our motivation for designing the new embedding method.
1.1 TLCC
TLCC [17] is designed for the image classification problem. Its goal is to linearly approximate a nonlinear smooth function $f(x)$, e.g., a nonlinear classification function, defined on a high dimensional feature space $\mathbb{R}^d$. Note that $f$ is an implicit function; we do not need to know its form explicitly. TLCC finds an embedding scheme $\phi: \mathbb{R}^d \rightarrow \mathbb{R}^D$ mapping each $x \in \mathbb{R}^d$ as
$$x \mapsto \phi(x) \in \mathbb{R}^D \qquad (1)$$
such that $f(x)$ can be well approximated by a linear function, namely $w^T\phi(x)$. To solve this problem, the authors of TLCC relied on the idea of coordinate coding, defined below. They showed that, with a suitable choice of coordinate coding, the function $f(x)$ can be linearly approximated.
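Definition 1.1
Coordinate Coding [15]
A coordinate coding of a point $x \in \mathbb{R}^d$ is a pair $(\gamma(x), C)$^{2}, where $C = [v_1, \dots, v_n] \in \mathbb{R}^{d \times n}$ is a set of $n$ anchor points (bases), and $\gamma$ is a map of $x \in \mathbb{R}^d$ to $\gamma(x) = [\gamma_{v_1}(x), \dots, \gamma_{v_n}(x)]^T \in \mathbb{R}^n$ such that
$$\sum_{j=1}^{n} \gamma_{v_j}(x) = 1 \qquad (2)$$
It induces the following physical approximation of $x$ in $\mathbb{R}^d$:
$$x' = \sum_{j=1}^{n} \gamma_{v_j}(x)\, v_j \qquad (3)$$
A good coordinate coding should ensure that $x'$ is close to $x$.^{3}
^{2} $C$ is the same for all $x$.
^{3} Although the reconstruction error condition for a good coordinate coding, i.e., that $x'$ is close to $x$, is not explicitly mentioned in the original definition of coordinate coding, it can be inferred from the objective functions of LCC [15] and TLCC [17].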
Let be coordinate coding of . Under assumption that is Lipschitz smooth, the authors showed (in lemma 2.2 [17]) that, for all
(4) | |||||
1.2 TLCC as generalization of VLAD
Although TLCC is designed for classification problem and its motivation is different from VLAD, TLCC can be seen as a generalization of VLAD. Specifically, if we add the following constraint to
(6) |
then we have . The RHS of (4) becomes
(7) |
where is the anchor point corresponding to the nonzero element of . One solution for minimizing (7) under constraints (2) and (6) is the K-means algorithm. When K-means is used, we have
(8) |
where is set of anchor points learned by K-means. Now, considering (5), if we ignore its first term, i.e., removing components attached with , we have which becomes the embedding used in VLAD.
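The reduction described in this subsection can be illustrated with a minimal NumPy sketch. It assumes the usual coordinate-coding conventions (coefficients sum to one, and the reconstruction is the coefficient-weighted sum of the anchors). The function names, the toy anchors, and the test point are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def hard_coordinate_coding(x, anchors):
    """One-hot coding: all weight goes to the anchor nearest to x (K-means assignment)."""
    dists = np.linalg.norm(anchors - x, axis=1)   # distance from x to every anchor
    gamma = np.zeros(len(anchors))
    gamma[np.argmin(dists)] = 1.0                 # exactly one nonzero coefficient
    return gamma

def residual_embedding(x, anchors, gamma):
    """Stack gamma_j * (x - v_j) over all anchors (VLAD-style residual code)."""
    return (gamma[:, None] * (x - anchors)).ravel()

anchors = np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 4.0]])  # toy vocabulary, one anchor per row
x = np.array([3.0, 1.0])

gamma = hard_coordinate_coding(x, anchors)
x_rec = gamma @ anchors        # reconstruction: sum_j gamma_j * v_j (the nearest anchor here)
phi = residual_embedding(x, anchors, gamma)
print(gamma, x_rec, phi)
```

With a one-hot coding only the block of the nearest anchor is nonzero, which is why the residual code coincides with the per-descriptor VLAD embedding before aggregation.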
1.3 Motivation for designing new embedding method
The relationship between TLCC and VLAD suggests that if we can find an embedding $\phi(x)$ such that $f(x)$ can be well approximated linearly ($f(x) \approx w^T\phi(x)$), then $\phi(x)$ can be a powerful feature for the image retrieval problem. TLCC starts from the assumption that $f$ is Lipschitz smooth and approximates $f(x)$ using only its first-order approximations at the anchor points, i.e., $f(x)$ is approximated as the sum of weighted tangents at the anchor points. In their work [17], the authors do not show how to generalize the approximation to higher-order information.
In this paper, we propose to use the idea of Taylor expansion for function approximation. We propose a general formulation that allows a nonlinear function to be linearly approximated using not only first-order but also higher-order information.
The embedded vectors resulting from the proposed function approximation process are used as new image representations in our image retrieval framework. In the following sections, we refer to our Function Approximation-based embedding method as FAemb. To facilitate the use of the embedded features in large-scale image search, we further derive a fast version. The main idea is to relax the function approximation bound such that the embedded features can be computed efficiently, i.e., in an analytic form. The proposed embedding methods are evaluated in an image search context under various settings: when the images are represented by medium-length vectors, short vectors, or binary vectors. The experimental results show that the proposed methods give a performance boost over the state of the art on the standard public image retrieval benchmarks.
Our previous work introduced the FAemb method in [19]. This paper substantially extends that work: we detail the computational complexity of FAemb (Section 3.2); we propose a fast version of FAemb, i.e., F-FAemb (Section 4); and we add a number of new experiments, i.e., results on large-scale datasets (Oxford105k and Flickr1M) and results when the single representation is compressed to compact binary codes (see Section 5.4). We also add new experiments in which Convolutional Neural Network (CNN) features are used instead of SIFT local features to describe the images; a comparison to recent CNN/deep learning-based image retrieval is also provided (Section 5.5).
2 Preliminaries
In this section, we review the background needed for the detailed discussion of our new embedding method in Section 3.
Taylor’s theorem for high dimensional variables
Definition 2.1
Multi-index [20]: A multi-index is a -tuple of nonnegative integers. Multi-indices are generally denoted by :
where . If is a multi-index, we define
where
Theorem 2.2
(Taylor’s theorem for high dimensional variables [20]) Suppose : of class of ^{4}^{4}4It means that all partial derivatives of up to (and including) order exist and continuous. on . If and , then
(9) |
where is Lagrange remainder given by
(10) |
for some .
Corollary 2.3
If is of class of on and for and , then
(11) |
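For reference, the standard textbook statements that this subsection relies on can be written as follows. This is a sketch in notation we assume here (the symbols $f$, $a$, $h$, $c$, $k$, $d$, $M$ are our own labels); it may differ in detail from the exact form of (9)-(11).

```latex
% Multi-index notation for d variables
\alpha = (\alpha_1, \dots, \alpha_d), \quad
|\alpha| = \alpha_1 + \dots + \alpha_d, \quad
\alpha! = \alpha_1! \cdots \alpha_d!, \quad
h^{\alpha} = h_1^{\alpha_1} \cdots h_d^{\alpha_d}, \quad
\partial^{\alpha} f = \frac{\partial^{|\alpha|} f}{\partial x_1^{\alpha_1} \cdots \partial x_d^{\alpha_d}}

% Taylor expansion of order k with Lagrange remainder (f of class C^{k+1})
f(a + h) = \sum_{|\alpha| \le k} \frac{\partial^{\alpha} f(a)}{\alpha!}\, h^{\alpha} + R_{a,k}(h), \qquad
R_{a,k}(h) = \sum_{|\alpha| = k+1} \frac{\partial^{\alpha} f(a + c h)}{\alpha!}\, h^{\alpha}, \quad c \in (0, 1)

% Remainder bound when |\partial^{\alpha} f| \le M for all |\alpha| = k+1
|R_{a,k}(h)| \le \frac{M}{(k+1)!}\, \|h\|_1^{k+1}
```

The bound follows from the multinomial theorem, since $\sum_{|\alpha| = k+1} |h^{\alpha}| / \alpha! = \|h\|_1^{k+1} / (k+1)!$.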
3 Embedding based on function approximation (FAemb)
3.1 Derivation of FAemb
Our embedding method is inspired by function approximation based on Taylor’s theorem. It follows from the lemma below.
Lemma 3.1
If : is of class of on and is Lipschitz continuous with constant and is coordinate coding of , then
(12) | |||||
If , then (12) becomes
(13) | |||||
In the case of , is approximated as sum of its weighted tangents at anchor points.
If , then (12) becomes
(14) |
where
(15) |
and is vectorization function flattening the matrix to a vector by putting its consecutive columns into a column vector. is Hessian matrix.
In the case of $k = 2$, $f(x)$ is approximated as the sum of its weighted quadratic approximations at the anchor points. Note that in both (13) and (14), we do not need to know the explicit form of the function $f$. In the rest of the paper, we focus on the function approximation bound.
To achieve a good approximation, the coordinate coding should be selected such that the RHS of (13) and (14) are small enough.
The result derived from (13) is that, with respect to the coding , a high dimensional nonlinear function in can be approximated by linear form where can be defined as and the embedded vector can be defined as
(16) |
where is a nonnegative scaling factor to balance two types of codes.
In order to make a good approximation of , in following sections, we put our interest on case where is approximated by using up to second-order derivatives defined by (3.1). The result derived from (3.1) is that the nonlinear function can be approximated by linear form where can be defined as and the embedded vector -FAemb can be defined as
(17) |
where are nonnegative scaling factors to balance three types of codes.
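The idea that a weighted sum of second-order Taylor expansions at the anchor points is linear in an embedded vector can be checked numerically. The sketch below is an assumption-laden illustration: the block layout (coefficient, coefficient times residual, coefficient times flattened outer product of the residual), the function names, the test function, and the toy anchors are all ours, and the scaling factors must be strictly positive here because the weights divide by them.

```python
import numpy as np

def embed(x, anchors, gamma, s=(1.0, 1.0, 1.0)):
    """Per anchor v_j: [s0*g_j, s1*g_j*(x - v_j), s2*g_j*vec((x - v_j)(x - v_j)^T)]."""
    blocks = []
    for g, v in zip(gamma, anchors):
        r = x - v
        blocks += [[s[0] * g], s[1] * g * r, s[2] * g * np.outer(r, r).ravel()]
    return np.concatenate(blocks)

def linear_weights(f, grad, hess, anchors, s=(1.0, 1.0, 1.0)):
    """Per anchor v_j: [f(v_j)/s0, grad f(v_j)/s1, 0.5*vec(Hessian f(v_j))/s2]."""
    blocks = []
    for v in anchors:
        blocks += [[f(v) / s[0]], grad(v) / s[1], 0.5 * hess(v).ravel() / s[2]]
    return np.concatenate(blocks)

# Toy smooth function with analytic derivatives (illustrative only).
f = lambda p: np.sin(p[0]) + p[0] * p[1] ** 2
grad = lambda p: np.array([np.cos(p[0]) + p[1] ** 2, 2.0 * p[0] * p[1]])
hess = lambda p: np.array([[-np.sin(p[0]), 2.0 * p[1]], [2.0 * p[1], 2.0 * p[0]]])

anchors = np.array([[0.4, 0.1], [0.6, 0.3], [0.5, 0.1]])
gamma = np.array([0.5, 0.3, 0.2])          # any coefficients that sum to one
x = np.array([0.5, 0.2])

phi = embed(x, anchors, gamma, s=(1.0, 2.0, 3.0))
w = linear_weights(f, grad, hess, anchors, s=(1.0, 2.0, 3.0))
linear_form = w @ phi

# Direct evaluation of the weighted quadratic approximations at the anchors.
direct = sum(g * (f(v) + grad(v) @ (x - v) + 0.5 * (x - v) @ hess(v) @ (x - v))
             for g, v in zip(gamma, anchors))
print(linear_form, direct, f(x))
```

The linear form and the direct evaluation agree for any positive scaling factors, because the scaling applied to the embedding is undone in the weights. The factors only change how the three blocks are balanced when embedded vectors are compared with each other.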
As mentioned in the previous section, to get a good approximation of , the RHS of (3.1) should be small enough. Furthermore, from the definition of coordinate coding 1.1, should ensure that the reconstruction error should be small. Putting these two criteria together, we find which minimize the following constrained objective function
(18) |
where is the parameter that regulates to the importance of the function approximation bound in the objective function.
Equivalently, given a set of training samples (descriptors) , let be coefficient corresponding to base of sample ; be coefficient vector of sample ; . We find which minimize the following constrained objective function
(19) |
3.2 The offline learning of coordinate coding and the online embedding
3.2.1 The offline learning of coordinate coding via alternating optimization
In order to minimize (19), we propose to iteratively optimize it by alternatingly optimizing with respect to and while holding the other fixed.
For learning the anchor points , the optimization is an unconstrained regularized least squares problem. We use trust-region method [21] to solve this problem.
For learning the coefficients , the optimization is equivalent to a regularized least squares problem with linear constraint. This problem can be solved by optimizing over each sample individually. To find of each sample , we use Newton’s method [22].
The offline learning of coordinate coding for FAemb is summarized in the Algorithm 1. In the Algorithm 1, , , are values of , , at iteration , respectively. The objective function value after each iteration in the Algorithm 1 always does not increase (by the decreasing or unchanging of the objective value on both and steps). It can also be validated that the objective function value is lower-bounded, i.e., not smaller than . Those two points indicate the convergence of our algorithm. The empirical results show that the Algorithm 1 takes a few iterations to converge. Fig. 1 shows the example of the convergence of the algorithm.
3.2.2 The online embedding and its complexity
After learning anchor points , given a new descriptor , we achieve by minimizing (18) using learned . From , we get the embedded vector -FAemb by using (17).
The computational complexity of the online embedding
The computational complexity of the online embedding depends on the computation of the coefficient using Newton’s method and the computation of (given ) using (17). It is worth noting that in our experiments, the number of anchor points is less than the dimension of descriptor.
Computing :
FAemb uses Newton’s method [22] for finding .
The main cost in the iteration of Newton’s method lies in (i) computing the Hessian of objective function (19) and (ii) computing the Newton step .
The complexity for computing Hessian of (19) w.r.t. is . For finding the updating step , Newton’s method solves the following equation
(20) |
where is solution at the iteration.
The size of 1st, 2nd, 3rd matrices in (20) is , and , respectively. So, the complexity for solving (20) is . Overall, the complexity in one iteration of Newton’s method is .
For the stopping of Newton’s method, follow [22], we define a tolerance on the objective function, i.e., the algorithm is terminated when , where denotes the optimum value of objective function; is solution at the iteration. [22] showed that this stopping criterion is equivalent to , where is Newton decrement at the iteration and defined by the following equation which takes only in complexity.
(21) |
Given a coordinate coding with 8 anchor points, a tolerance , we experiment on 100k descriptors and have the observation that 50 iterations on average for meeting the stopping criterion^{5}^{5}5The step-size for updating , i.e., at each iteration is selected by empirical experiments and equals to 0.1 in our experiments.. Overall, the complexity of FAemb for finding is .
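The Newton iteration with a decrement-based stopping rule can be sketched as below. This is not the objective (19) itself: as a loudly labeled assumption, the absolute value of each coefficient is replaced by the smooth surrogate sqrt(g^2 + delta) so that the Hessian exists everywhere, and a backtracking line search is used instead of a fixed step size. The names `smoothed_objective` and `newton_sum_to_one` and all parameter values are hypothetical.

```python
import numpy as np

def smoothed_objective(g, x, C, mu, delta):
    """Value, gradient, Hessian of ||x - C g||^2 + mu * sum_j w_j * sqrt(g_j^2 + delta)."""
    w = np.abs(x[:, None] - C).sum(axis=0) ** 3     # assumed weight: l1 distance to anchor, cubed
    root = np.sqrt(g ** 2 + delta)                  # smooth stand-in for |g_j| (assumption)
    value = np.sum((x - C @ g) ** 2) + mu * np.sum(w * root)
    grad = -2.0 * C.T @ (x - C @ g) + mu * w * g / root
    hess = 2.0 * C.T @ C + np.diag(mu * w * delta / root ** 3)
    return value, grad, hess

def newton_sum_to_one(x, C, mu=0.01, delta=1e-2, tol=1e-10, max_iter=100):
    """Equality-constrained Newton's method for the constraint sum(g) = 1."""
    n = C.shape[1]
    g = np.full(n, 1.0 / n)                         # feasible starting point
    ones = np.ones((n, 1))
    for iteration in range(max_iter):
        value, grad, hess = smoothed_objective(g, x, C, mu, delta)
        # KKT system: [[H, 1], [1^T, 0]] [step; nu] = [-grad; 0]
        kkt = np.block([[hess, ones], [ones.T, np.zeros((1, 1))]])
        step = np.linalg.solve(kkt, np.append(-grad, 0.0))[:n]
        decrement_sq = step @ hess @ step           # squared Newton decrement
        if decrement_sq / 2.0 <= tol:               # stopping criterion on the decrement
            break
        t = 1.0                                     # backtracking (Armijo) line search
        while t > 1e-12 and (smoothed_objective(g + t * step, x, C, mu, delta)[0]
                             > value - 0.25 * t * decrement_sq):
            t *= 0.5
        g = g + t * step
    return g, iteration

rng = np.random.default_rng(0)
C = rng.normal(size=(5, 4))                         # 4 anchors in 5 dimensions (columns)
x = rng.normal(size=5)
gamma, n_iter = newton_sum_to_one(x, C)
print(gamma, gamma.sum(), n_iter)
```

Each step stays on the constraint set because the KKT system forces the step entries to sum to zero. At the solution the gradient is a constant vector, which is the first-order optimality condition for a sum-to-one constraint.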
Computing (using (17), given ): From (17), the complexity mainly depends on the computing the tensor between and which takes . So, the computational complexity for computing is .
From above analysis, we find that the computational complexity of the whole embedding process of FAemb is dominated by the computing of .
3.3 Relationship to other methods
Compare to TLCC [17], our assumption on in lemma 3.1 is different from the assumption of TLCC (lemma [17]), i.e., our assumption only needs that is Lipschitz continuous while TLCC assumes that all are Lipschitz continuous, . Our objective function (18) is different from TLCC (4), i.e., we rely on norm of () in the second term while TLCC uses norm; we solve the constraint on the coefficient in our optimization process while TLCC does not. FAemb approximates using up to its second order derivatives while TLCC approximates only using its first order derivatives.
FAemb can also be seen as the generalization of VLAT [11]. Similar to the relationship of TLCC and VLAD presented in Section 1.2, if we add the constraint (6) to , the objective function (19) will become
(22) | |||||
where is anchor point corresponding to the nonzero element of .
If we relax norm in the second term of into norm, we can use K-means algorithm for minimizing (22). After learning by using K-means, given an input descriptor , we have
(23) |
Now, consider (17), if we ignore the first and the second terms, i.e., removing components attached with and , we have: which becomes the embedding used in VLAT.
In practice, to make a fair comparison between FAemb and VLAT, we remove the first and the second terms of (17). This makes the embedded vectors produced by two methods have same dimension. It is worth noting that in (17), as matrix is symmetric, only the diagonal and upper part are kept while flattening it into vector. The size of VLAT and FAemb is then .
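Keeping only the diagonal and upper part of a symmetric matrix can be sketched as follows. The helper name `flatten_symmetric` and the toy residual are assumptions for illustration; the resulting length for a 45-dimensional descriptor follows from the count d(d+1)/2.

```python
import numpy as np

def flatten_symmetric(m):
    """Return the diagonal and upper-triangular entries of a square matrix, row by row."""
    rows, cols = np.triu_indices(m.shape[0])
    return m[rows, cols]

r = np.arange(1.0, 4.0)                     # toy residual with d = 3
code = flatten_symmetric(np.outer(r, r))    # second-order block for one anchor
per_anchor_45 = len(flatten_symmetric(np.zeros((45, 45))))
print(code, per_anchor_45)
```

For d = 3 the block has 6 entries instead of 9, and for d = 45 it has 1,035 entries per anchor instead of 2,025.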
4 Fast embedding based on function approximation (F-FAemb)
FAemb requires an iterative optimization at the online embedding step. While FAemb is applicable to small and medium-size datasets, it may not be suitable for large-scale datasets. In this section, we develop a fast version of FAemb. The main idea is to find a reasonable relaxation of the function approximation bound of FAemb (i.e., the RHS of (14)) such that the coefficient vector can be computed efficiently, i.e., in closed form.
4.1 Derivation of F-FAemb
The relaxed bound is based on the following observation
(24) |
Thus,
(25) |
We define the relaxed bound as
(26) |
(25) means that the relaxed bound is still an upper bound on the function approximation error, i.e., the LHS of (14). This relaxed bound allows an analytic solution for the embedding, as shown in Section 4.2.
Similar to FAemb, in order to ensure a good reconstruction error (which is necessary condition for a good coordinate coding) and a good function approximation, we jointly minimize over the reconstruction error and the bound . Specifically, the coordinate coding is learned by minimizing the following constrained objective function
(27) |
4.2 The offline learning of coordinate coding and the online embedding
4.2.1 The offline learning of coordinate coding via alternating optimization
Similar to FAemb, in order to optimize (27), we alternatingly optimize with respect to and while holding the other fixed.
For learning the anchor points , the optimization problem is unconstrained regularized least squares. We use trust-region method [21] for solving.
For computing the coefficients , we can solve over each sample individually. The optimization problem is equivalent to a regularized least squares problem with linear constraint. Thus, we achieve the closed-form for the solution.
Let ; ; . Let
(28) |
We have the closed-form for the coefficient vector as
(29) |
where
is identity matrix having size of
; is column vector having elements equaling to .
It is worth noting that although the function bound of F-FAemb is a relaxed version of the function bound of FAemb, F-FAemb provides an optimal solution for the coefficient vector while FAemb does not. This explains why the performance of F-FAemb is competitive with that of FAemb in our experiments.
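Why a regularized least squares problem with a sum-to-one constraint admits a closed form can be seen in a small sketch. It is not a transcription of (29): the quadratic per-anchor penalty, its weights, and the function name `closed_form_coding` are assumptions made for illustration. Because the coefficients sum to one, x - C g equals -(C - x 1^T) g, so the objective is a quadratic form g^T A g and the Lagrange conditions give g proportional to A^{-1} 1.

```python
import numpy as np

def closed_form_coding(x, C, mu=0.01):
    """Minimise ||x - C g||^2 + mu * sum_j w_j * g_j^2 subject to sum(g) = 1."""
    n = C.shape[1]
    B = C - x[:, None]                      # shifted anchors, one per column
    w = np.abs(B).sum(axis=0) ** 3          # hypothetical per-anchor penalty weights
    A = B.T @ B + mu * np.diag(w)           # objective is g^T A g on the constraint set
    z = np.linalg.solve(A, np.ones(n))      # direction A^{-1} 1
    return z / z.sum(), A                   # rescale so the coefficients sum to one

rng = np.random.default_rng(1)
C = rng.normal(size=(6, 4))                 # 4 anchors in 6 dimensions
x = rng.normal(size=6)
gamma, A = closed_form_coding(x, C)

# Cross-check against a direct solve of the KKT system.
n = C.shape[1]
kkt = np.block([[2.0 * A, np.ones((n, 1))], [np.ones((1, n)), np.zeros((1, 1))]])
gamma_kkt = np.linalg.solve(kkt, np.append(np.zeros(n), 1.0))[:n]
print(gamma, gamma_kkt)
```

A single linear solve replaces the iterative procedure, which is the source of the speed-up that a closed-form coding offers at embedding time.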
4.2.2 The online embedding and its complexity
After learning anchor points , given a new input , we use (29) for computing the coefficient vector . After getting the coefficient vector, the embedded vector is achieved by using (17). Similar to FAemb, the values of , in (17) are assigned to 0.
Computing : From (29), the computational complexity for computing , and is , and , respectively. So the overall computational complexity for computing is .
Computing : As presented in Section 3.2.2, the computational complexity for computing is .
As in our experiments, the number of anchor points is less than the dimension of descriptor, the complexity of the whole embedding process of F-FAemb is dominated by the computing of .
4.3 The computational complexity comparison between FAemb/F-FAemb and other methods
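Setting | FAemb | F-FAemb | VLAT | Fisher |
---|---|---|---|---|
Descriptor dimension | 45 | 45 | 45 | 64 |
Number of anchor points | 16 | 16 | 16 | 128 |
CPU time per descriptor (ms) | 8.32 | 0.78 | 0.25 | 1.42 |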
In this section, we compare the computational complexity to embed a local descriptor of FAemb/F-FAemb and other methods which also use high order (i.e., second order) information for embedding such as VLAT [11, 12], Fisher [23, 18].
The fifth row of Table I presents the asymptotic complexity (hence the constant of the complexity is different for each method). Note that, although the dimension of local descriptors () and the number of anchor points () of methods are different, the dimension of the embedded vector produced by methods are comparable. It is worth noting that for Fisher[23, 18], although the complexity is
, it has a large constant, i.e., by computing posterior probabilities, the gradient with respect to both the mean and the standard deviation.
The timing row of Table I reports the time needed to embed a local descriptor. For Fisher, we use the implementation provided by VLFeat [24], which is optimized with mex files. We re-implement standard VLAT [11], as no Matlab implementation is available. The experiments are run on a single processor core (2.60 GHz Intel CPU). We report CPU times, which are larger than elapsed times because CPU time accumulates over all active threads. Note that the timings for VLAT and Fisher are measured by embedding each local descriptor separately; the reason is given in Section 5.3.
According to the timing row of Table I, F-FAemb is about 10 times faster than FAemb (0.78 vs. 8.32). F-FAemb is slower than VLAT but faster than Fisher. In practice, F-FAemb takes less than 1 second to embed an image with 1,000 local descriptors. This efficient computation allows F-FAemb to be used in large-scale problems and in applications requiring fast retrieval. The embedding can be sped up further by optimizing the implementation, e.g., using mex files. In our experiments, when using mex file implementation for computing (17), F-FAemb takes only ms to embed a local descriptor.
5 Experiments
This section presents results of the proposed FAemb and F-FAemb embedding methods and compares them to the state of the art. Specifically, in Section 5.3, we compare FAemb and F-FAemb to other methods, namely VLAD [18], Fisher [18, 5], Temb [13] and VLAT [11], when the same test bed is used. In Section 5.4, we compare FAemb and F-FAemb to the state of the art under various settings, i.e., when the images are represented by mid-size vectors, short vectors, or binary vectors. Furthermore, in Section 5.5, we present the results when Convolutional Neural Network features are used as the local features to describe the images. A comparison to recent CNN/deep learning-based image retrieval is also provided.
5.1 Dataset and evaluation metric
INRIA holidays [25] consists of 1,491 images of different locations and objects, 500 of them being used as queries. The search quality is measured by mean average precision (mAP), with the query removed from the ranked list.
In order to evaluate the search quality on large scale, we merge Holidays dataset with 1M negative images downloaded from Flickr [26], forming the Holidays+Flickr1M dataset. For this large scale dataset, following common practice [13], we evaluate search quality on the short representations of the aggregated vectors. For all learning stages, we use a subset from the independent dataset Flickr60k provided with Holidays.
Oxford buildings [27] consists of 5,063 images of buildings and 55 query images corresponding to 11 distinct buildings in Oxford. Each query image contains a bounding box indicating the region of interest. When local SIFT features are used, we follow the standard protocol [7, 8, 13]: the bounding boxes are cropped and then used as the queries.
This dataset is often referred to as Oxford5k. The search quality is measured by mAP computed over the 55 queries. Images are annotated as either relevant, not relevant, or junk, which indicates that it is unclear whether a user would consider the image as relevant or not. Following the recommended configuration [7, 13, 6], the junk images are removed from the ranking before computing the mAP.
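The evaluation protocol with junk removal can be sketched as below. This is a plain (rectangular) average precision, which we assume for simplicity; official evaluation scripts may interpolate precision differently. The function name and the toy identifiers are hypothetical.

```python
def average_precision(ranked_ids, relevant, junk=frozenset()):
    """Average precision of a ranked list after dropping junk items from the ranking."""
    hits = 0
    rank = 0
    precision_sum = 0.0
    for image_id in ranked_ids:
        if image_id in junk:          # junk images are removed before ranks are counted
            continue
        rank += 1
        if image_id in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant) if relevant else 0.0

ranked = ["a", "j", "b", "c", "d"]    # toy ranked list returned for one query
ap = average_precision(ranked, relevant={"a", "c"}, junk={"j"})
print(ap)
```

In the toy list the relevant images end up at ranks 1 and 3 once the junk image is skipped, so the average precision is (1/1 + 2/3) / 2. The mAP is the mean of this quantity over all queries.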
In order to evaluate the search quality on large scale, Oxford5k is extended with 100k negative images [27], forming the Oxford105k dataset. For all learning stages, we use the Paris6k dataset [28].
5.2 Implementation details
5.2.1 Local descriptors
Local descriptors are detected by the Hessian-affine detector [29] and described by the SIFT local descriptor [1]. The RootSIFT variant [30] is used in all our experiments. For VLAT, FAemb, and F-FAemb, the SIFT descriptors are first reduced from 128 to 45 dimensions using PCA. For experiments with Fisher vectors, following [18], we reduce the SIFT descriptors to 64 dimensions using PCA. This makes the dimension of VLAT, FAemb, and F-FAemb comparable to the dimension of the compared embedding methods.
5.2.2 Whitening and aggregating the embedded vectors
Whitening
Successful instance embedding methods include several feature post-processing steps. In [13], the authors showed that applying whitening improves the discriminative power of the embedded vectors and hence the retrieval results. In particular, given $\phi(x) \in \mathbb{R}^D$, we obtain the whitened embedded vector $\phi_w(x)$ by
$$\phi_w(x) = \mathrm{diag}\left(\lambda_1^{-1/2}, \dots, \lambda_D^{-1/2}\right) P^T \phi(x) \qquad (30)$$
where $\lambda_i$ is the $i$-th largest eigenvalue and $P \in \mathbb{R}^{D \times D}$ is the matrix formed by the eigenvectors associated with the largest eigenvalues of the covariance matrix computed from the learning embedded vectors $\phi(x)$.
In [13], the authors further indicated that discarding the first few components, i.e., those associated with the largest eigenvalues, improves the localization of the whitened embedded vectors. We apply this truncation in our experiments. Its setting is detailed in Section 5.3.
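A minimal NumPy sketch of PCA whitening with truncation of the leading components is given below. The mean-centering step, the eigenvalue floor `eps`, and the function names are our assumptions and are not stated in the text.

```python
import numpy as np

def learn_whitening(train_vectors, n_discard=0, eps=1e-12):
    """Learn (mean, projection) from training vectors stored one per row."""
    mean = train_vectors.mean(axis=0)
    cov = np.cov(train_vectors, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)                 # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][n_discard:]          # descending, first n_discard dropped
    scale = 1.0 / np.sqrt(np.maximum(eigvals[order], eps))
    return mean, eigvecs[:, order] * scale                 # each column scaled by lambda^(-1/2)

def whiten(vectors, mean, projection):
    """Project centered vectors onto the scaled eigenvectors."""
    return (vectors - mean) @ projection

rng = np.random.default_rng(2)
train = rng.normal(size=(500, 6)) @ rng.normal(size=(6, 6))   # correlated toy data
mean, projection = learn_whitening(train, n_discard=2)
whitened = whiten(train, mean, projection)
print(whitened.shape)
```

On the training set the whitened components have identity covariance, and discarding two leading components leaves four of the six dimensions.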
Aggregating
Let $\mathcal{X}$ be the set of local descriptors describing the image. Sum-pooling [31] and max-pooling [32, 33] are two common methods for aggregating the set of whitened embedded vectors of the image into a single vector. Sum-pooling lacks discriminability because the aggregated vector is influenced more by frequently occurring uninformative descriptors than by rarely occurring informative ones. Max-pooling equalizes the influence of frequent and rare descriptors. However, classical max-pooling approaches can only be applied to BoW or sparse coding features. Recently, the authors of [13] introduced a new aggregation method for image retrieval, named democratic aggregation. This method bears similarity to generalized max-pooling [34], which was proposed for image classification. Democratic aggregation can be applied to general features such as VLAD, Fisher vector [18], and Temb [13]. The authors of [13] showed that democratic aggregation achieves better performance than sum-pooling. The main idea of democratic aggregation is to find a weight for each local descriptor in the set such that
(31) |
In summary, the process for producing the single vector from the set of local descriptors describing the image is as follows. First, we map each and whiten , producing . We then use democratic aggregation to aggregate vectors to the single vector by
(32) |
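One way to compute such weights is a damped Sinkhorn-style scaling of the Gram matrix, sketched below under stated assumptions. We assume the democratic condition asks every weighted descriptor to have the same similarity to the aggregated vector, we clip negative similarities to zero, and we fix the number of iterations. The function names and the toy data are hypothetical.

```python
import numpy as np

def democratic_weights(phi, n_iter=100):
    """Find lam so that lam_i * sum_j lam_j * K_ij is (approximately) equal for all i."""
    K = np.maximum(phi @ phi.T, 0.0)        # Gram matrix, negatives clipped (assumption)
    lam = np.ones(len(phi))
    for _ in range(n_iter):
        sigma = lam * (K @ lam)             # current contribution of each descriptor
        lam = lam / np.sqrt(sigma)          # damped update with exponent 0.5
    return lam

def democratic_aggregate(phi):
    """Weighted sum of the embedded vectors stored one per row."""
    lam = democratic_weights(phi)
    return lam @ phi, lam

rng = np.random.default_rng(3)
phi = np.abs(rng.normal(size=(20, 8)))      # toy nonnegative embedded vectors
psi, lam = democratic_aggregate(phi)
contributions = lam * (phi @ psi)           # similarity of each weighted vector to the aggregate
print(contributions)
```

With these weights every descriptor contributes equally to the self-similarity of the aggregated vector, which is the sense in which frequent and rare descriptors are balanced.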
5.2.3 Power-law normalization
The burstiness visual elements [35], i.e., numerous descriptors almost similar within the same image, strongly affects the measure of similarity between two images. In order to reduce the effect of burstiness, we follow the common practical setting [6, 13]: applying power-law normalization [36] to the final image representation and subsequently normalize it. The power-law normalization is applied to each component of by , where is a constant. We standardly set in our experiments.
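Power-law normalization followed by normalization to unit length can be sketched as below. The exponent 0.5 used as the default and the choice of the Euclidean norm are common choices that we assume here; the function name is hypothetical.

```python
import numpy as np

def power_law_normalize(psi, alpha=0.5):
    """Apply sign(a) * |a|^alpha to each component, then scale to unit Euclidean norm."""
    out = np.sign(psi) * np.abs(psi) ** alpha
    norm = np.linalg.norm(out)
    return out / norm if norm > 0 else out

psi = np.array([4.0, -9.0, 0.0])
normalized = power_law_normalize(psi)
print(normalized)
```

In the toy vector the ratio between the two nonzero components shrinks from 9/4 to 3/2 in magnitude, which illustrates how large (bursty) components are damped.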
5.2.4 Rotation normalization and dimension reduction
The power-law normalization suppresses visual burstiness but not the frequent co-occurrences [37] that also corrupts the similarity measure. In order to reduce the effect of co-occurrences, we follow [37, 13], i.e., rotating data with a whitening matrix learned on aggregated vectors from the learning set. The results with the applying of this rotation are noted as +RN. When evaluating the short representations, we keep only first components, after RN, of aggregated vectors.
5.3 Impact of parameters and comparison between embedding methods
In this section, we compare FAemb and F-FAemb to other state-of-the-art methods, including VLAD [18], Fisher [18, 5], Temb [13] and VLAT [11], under the same test bed, i.e., whitening, democratic aggregation, and power-law normalization are applied to all compared embedding methods. We reimplement VLAD and VLAT in our framework. For Fisher, we use the VLFeat library [24]. It is worth noting that, in the design of these methods, the embedding and the sum aggregation are combined into one formulation. Hence, to apply whitening and democratic aggregation, we first apply the VLAD/VLAT/Fisher embedding to each local feature separately. We then apply whitening and democratic aggregation to the set of embedded vectors as usual. For Temb [13], we use the implementation provided by the authors.
Follow the suggestion in [13], for Temb and VLAD methods, we discard first (=128) components of . The final dimension of is therefore . For Fisher, we discard first components of ; this makes the dimension of Fisher equals to the dimension of VLAD and Temb. For VLAT, FAemb and F-FAemb methods, we discard first components of . The final dimension of is therefore . The value of in (19) and (27) is selected by empirical experiments and is fixed to for all FAemb, F-FAemb results reported bellow.
The comparison between the implementation of VLAD, VLAT and Fisher in this paper and their improved versions [18, 12] on Holidays dataset is presented in Table II. It is worth noting that even with a lower dimension, the implementation of VLAD/VLAT/Fisher in our framework (RootSIFT descriptors, VLAD/VLAT/Fisher embedding, whitening, democratic aggregation and power-law normalization) achieves better retrieval results than their improved versions reported by the authors [18, 12].
Method | Dimension | mAP |
---|---|---|
VLAD [18] | 16,384 | 58.7 |
VLAD (this paper) | 8,064 | 67.4 |
VLAD (this paper) | 16,256 | 68.3 |
Fisher[18] | 16,384 | 62.5 |
Fisher (this paper) | 8,064 | 68.2 |
Fisher (this paper) | 16,256 | 69.3 |
VLAT_{improved} [12] | 9,000 | 70.0 |
VLAT (this paper) | 7,245 | 70.9 |
VLAT (this paper) | 15,525 | 72.7 |
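The dimensions reported for the implementations of this paper are consistent with a simple count, sketched below. The anchor counts (64 and 128 for VLAD and Fisher, 8 and 16 for VLAT) and the number of discarded components are inferred by us from the reported numbers and the descriptor dimensions; they are assumptions, not values read from the table.

```python
def first_order_dim(d, n, discarded):
    """VLAD-style code: n blocks of size d, minus the discarded components."""
    return n * d - discarded

def fisher_dim(d, n, discarded):
    """Fisher-style code: mean and deviation gradients, 2*d per component."""
    return 2 * n * d - discarded

def second_order_dim(d, n, discarded):
    """VLAT-style code: n blocks of size d(d+1)/2, minus the discarded components."""
    return n * d * (d + 1) // 2 - discarded

vlad_dims = [first_order_dim(128, n, discarded=128) for n in (64, 128)]
fisher_dims = [fisher_dim(64, n, discarded=128) for n in (64, 128)]
vlat_dims = [second_order_dim(45, n, discarded=45 * 46 // 2) for n in (8, 16)]
print(vlad_dims, fisher_dims, vlat_dims)
```

Under these assumptions the counts reproduce 8,064 and 16,256 for VLAD and Fisher, and 7,245 and 15,525 for VLAT.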
5.3.1 Impact of parameters
The main parameter here is the number of anchor points . The analysis for this parameter is shown in Fig. 2 and Fig. 3 for Holidays and Oxford5k datasets, respectively. We can see that the mAP increases with the increasing of for all four methods. For all methods, the improvement tends to be smaller for larger . This phenomenon has been discussed in [13]. For larger vocabularies, the interaction between descriptors is less important than for small ones. For VLAT, FAemb and F-FAemb, we do not report the results for as with this setting, the democratic aggregation is very time consuming. It has been indicated in [13] that when dimension of the embedded vector is high, e.g. , the benefit of democratic aggregation is not worth the computational overhead.
5.3.2 Comparison between embedding methods
We find the following observations are consistent on both Holidays and Oxford5k datasets.
The mAP of FAemb is slightly better than the mAP of F-FAemb at small , i.e., . When is large, i.e. , F-FAemb and FAemb achieve very competitive results.
At same , FAemb, F-FAemb, and VLAT have same dimension. However, FAemb and F-FAemb improve the mAP over VLAT by a fair margin. For examples, at , the improvement of FAemb over VLAT is +1.8% and +3.9% on Holidays and Oxford5k, respectively. At , the improvement is about +3% on both datasets.
At comparable dimensions, FAemb and F-FAemb significantly improve the mAP over VLAD, Temb, Fisher. For examples, comparing FAemb at () with VLAD/Temb at (), the gain of FAemb over VLAD/Temb is +7.5%/+2% on Holidays and +8.1%/+5% on Oxford5k.
5.4 Comparison with the state of the art
In this section, we compare our framework with methods that use a similar representation, i.e., that represent an image by a single vector. The efficient computation of F-FAemb not only allows it to use more anchor points for the function approximation but also allows it to work on large-scale datasets. Thus, we focus on F-FAemb when comparing to the state of the art. The main differences between the compared embedding methods are shown in Table III.
In VLAT_{improved} [12], VLAD_{LCS} [7] and CVLAD [38], PCA and sum pooling are applied on each Voronoi cell separately. The pooled vectors are then concatenated to produce the single representation. In addition to the methods listed in Table III, we also compare with the recent embedding method VLAD_{LCS}+Exemplar SVM (VLAD_{LCS}+ESVM) [39] and with Convolutional Neural Network features. We consider the recent work [40] as the baseline for CNN-based image retrieval. In [39], the authors use exemplar SVMs (linear SVMs trained with one positive example only and a vast collection of negative examples) as encoders. For each image, its VLAD_{LCS} [7] representation is used as the positive example for training an exemplar SVM. The weight vector (hyperplane) of the trained exemplar SVM is used as the new representation. In [40], the authors use the deep Convolutional Neural Network (CNN) model proposed in [41] for extracting the image representation. The network is first trained on the ImageNet dataset [42]. It is then retrained on the Landmarks dataset [40], which contains images that are more relevant to the Holidays and Oxford5k datasets. The activation values invoked by an image within the top layers of the network are used as the image representation. It is worth noting that training the CNN [40] is a supervised task that comes with challenges including (i) the requirement for large amounts of labeled training data (according to [40], collecting the labeled Landmarks images is a nontrivial task) and (ii) the high computational cost and the need for GPUs. In contrast to the CNN [40], the training for our embedding is totally unsupervised, requiring only several tens of thousands of unlabeled images and no GPUs. It is also worth noting that in [40], when evaluating on the Holidays dataset, the authors rotate all images in the dataset to the normal orientation; when evaluating on the Oxford5k dataset, they use the full queries instead of the cropped queries. Both of these improve their results. In our experiments, we follow the literature [18, 8, 7, 13], i.e., we use the original images for Holidays and cropped queries for Oxford5k.
Method | Local desc. | Do PCA/whitening? | Aggr. method |
---|---|---|---|
BoW [18] | SIFT | No | Sum |
VLAD [18] | SIFT | No | Sum |
Fisher [18] | SIFT | No | Sum |
VLAD_{intra} [8] | RSIFT | No | Sum |
VLAT_{improved} [12] | SIFT | PCA | Sum |
VLAD_{LCS} [7] | RSIFT | PCA | Sum |
CVLAD [38] | RSIFT | PCA | Sum |
Temb [13] | RSIFT | Whitening | Democratic |
FAemb | RSIFT | Whitening | Democratic |
F-FAemb | RSIFT | Whitening | Democratic |
5.4.1 Evaluation on Holidays and Oxford5k datasets
Method | Codebook size / anchor points | D | mAP (Hol.) | mAP (Ox5k) |
---|---|---|---|---|
BoW [18] | 200k | 200k | 54.0 | 36.4 |
VLAD [18] | 128 | 8,192 | 55.6 | 37.8 |
VLAD [18] | 256 | 16,384 | 58.7 | - |
Fisher [18] | 256 | 16,384 | 62.5 | - |
VLAD_{LCS} [7] | 64 | 8,192 | 65.8 | 51.7 |
VLAD_{LCS}+ESVM [39] | 64 | 8,192 | 78.3 | 57.5 |
VLAD_{intra} [8] | 64 | 8,192 | 56.5 | 44.8 |
VLAD_{intra} [8] | 256 | 32,536 | 65.3 | 55.8 |
CVLAD [38] | 32 | 32,768 | 68.8 | 42.7 |
VLAT_{improved} [12] | 64 | 9,000 | 70.0 | - |
CNN [40] | - | 4,096 | 79.3 | 54.5 |
Temb [13] | 64 | 8,064 | 72.2 | 61.2 |
Temb [13] | 128 | 16,256 | 73.8 | 62.7 |
FAemb | 8 | 7,245 | 72.7 | 63.6 |
FAemb | 16 | 15,525 | 75.8 | 67.7 |
F-FAemb | 8 | 7,245 | 72.2 | 63.4 |
F-FAemb | 16 | 15,525 | 75.5 | 67.6 |
F-FAemb | 32 | 32,085 | 77.0 | 70.7 |
With rotation normalization | ||||
Temb +RN [13] | 64 | 8,064 | 77.1 | 67.6 |
Temb +RN [13] | 128 | 16,256 | 76.8 | 66.5 |
FAemb +RN | 8 | 7,245 | 76.2 | 66.7 |
FAemb +RN | 16 | 15,525 | 78.7 | 70.9 |
F-FAemb +RN | 8 | 7,245 | 75.5 | 66.1 |
F-FAemb +RN | 16 | 15,525 | 78.6 | 70.3 |
F-FAemb +RN | 32 | 32,085 | 80.7 | 74.2 |
Table IV shows the results of FAemb, F-FAemb, and the compared methods on Holidays and Oxford5k datasets.
Without RN post-processing, F-FAemb outperforms or is competitive with most compared methods. CNN features [40] achieve the best performance on the Holidays dataset; their mAP is higher than that of F-FAemb () . However, on the Oxford5k dataset, F-FAemb outperforms CNN features [40] by a fair margin, i.e., .
When RN is used, it boosts the performance of Temb, FAemb and F-FAemb alike. The performance of F-FAemb +RN at is slightly lower than that of Temb+RN at . However, at higher dimension, i.e., , F-FAemb +RN outperforms all Temb+RN results by a fair margin.
The efficient computation of F-FAemb allows it to use a high number of anchor points, i.e., ; at this setting, F-FAemb +RN outperforms all compared methods on both datasets. The gain is more significant on the Oxford5k dataset, i.e., F-FAemb +RN outperforms the recent embedding method VLAD_{LCS}+ESVM [39] and outperforms the CNN features [40] . It is worth noting that the dimension of the CNN features is lower than that of F-FAemb +RN; we evaluate the performance of F-FAemb +RN for short representations in Section 5.4.3. It is also worth noting that in [43], the authors report strong results on the Holidays dataset, i.e., mAP = 84. However, to describe an image patch, that work uses two types of local descriptors: the HOG feature [44] and the color feature [36]. VLAT is applied to each type of descriptor separately, and the two resulting embedded vectors are concatenated, producing a single representation of 1.7M dimensions. Hence, that work is not directly comparable to ours and to the other works in Table III, in which only the SIFT feature is used.
5.4.2 Evaluation on large scale dataset: Oxford105k
Method | D | mAP (Oxford105k) |
---|---|---|
VLAD_{LCS} [7] | 8,192 | 45.6 |
Temb+RN [13] | 8,064 | 61.1 |
CNN [40] | 4,096 | 51.2 |
F-FAemb +RN | 7,245 | 64.3 |
F-FAemb +RN | 15,525 | 68.1 |
The Oxford105k dataset is used for large scale testing in a few benchmarks [13, 7, 40]. The comparative mAP between methods is shown in Table V. The results show that even with a lower dimension, the proposed F-FAemb (at ) outperforms the compared methods (VLAD_{LCS}, Temb+RN) by a large margin. The best result of F-FAemb (i.e. at ) sets up state-of-the-art performance on this large scale dataset. It outperforms the current state of the art, i.e., Temb+RN [13] .
5.4.3 Evaluation on short representation
Method | D | mAP (Hol.) | mAP (Ox5k) | mAP (Ox105k) | mAP (Hol.+Fl1M) |
---|---|---|---|---|---|
Temb+RN [13] | 1,024 | 72.0 | 56.2 | 50.2 | 49.4 |
F-FAemb +RN | 1,024 | 70.8 | 58.2 | 53.2 | 68.5 |
Temb+RN [13] | 512 | 70.0 | 52.8 | 46.1 | 46.9 |
F-FAemb +RN | 512 | 69.0 | 53.9 | 50.9 | 65.3 |
Temb+RN [13] | 256 | 65.7 | 47.2 | 40.8 | 43.7 |
F-FAemb +RN | 256 | 67.5 | 45.6 | 44.5 | 61.9 |
Temb+RN [13] | 128 | 61.5 | 40.0 | 33.9 | 38.7 |
VLAD_{LCS} [7] | 128 | - | 32.2 | 26.2 | 39.2 |
F-FAemb +RN | 128 | 63.0 | 39.4 | 37.1 | 58.0 |
CNN [40] | 4,096 | 79.3 | 54.5 | 51.2 | - |
F-FAemb +RN | 4,096 | 74.1 | 63.7 | 62.2 | 72.5 |
As the F-FAemb features are high-dimensional, a question of their performance on short representations arises. In this section, we evaluate the performance of F-FAemb at short representations achieved by keeping only first components after the rotation normalization of aggregated vectors. Table VI reports comparative mAP for varying dimensionality.
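The truncation step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the input vector has already been rotated by the rotation normalization, so that its leading components are the ones to keep. The function name `shorten` and the final L2 re-normalization are assumptions added for illustration.

```python
import math


def shorten(rotated_vec, d_short):
    """Keep only the first d_short components of an already-rotated vector.

    The L2 re-normalization is an assumption (it makes Euclidean and
    cosine comparisons equivalent); the text only states the truncation.
    """
    head = list(rotated_vec[:d_short])
    norm = math.sqrt(sum(v * v for v in head))
    if norm == 0.0:
        return head  # nothing to normalize
    return [v / norm for v in head]


# Toy usage: a 3-dimensional "aggregated vector" shortened to 2 dimensions.
short = shorten([3.0, 4.0, 12.0], 2)
```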
Compared to Temb+RN at the same dimension, on the Holidays dataset Temb+RN is slightly better than our method at D = 1024 and 512, while at D = 256 and 128, F-FAemb+RN outperforms Temb+RN. On the Oxford5k dataset, our method outperforms Temb+RN at D = 1024 and 512, while at D = 256 and 128, Temb+RN is slightly better than ours. On the large scale datasets Oxford105k and Holidays+Flickr1M, our method significantly improves the mAP over Temb+RN. On Oxford105k, the gains are , , and at , , and dimensions, respectively. On Holidays+Flickr1M, the gains are , , , and at , , , and dimensions, respectively.
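As a worked check, the per-dimension differences between F-FAemb +RN and Temb+RN can be recomputed directly from the rows of Table VI. The sketch below only restates the table entries and subtracts them; the dictionary layout and the helper name `gains` are illustrative assumptions.

```python
# mAP values copied from Table VI: {D: {method: {dataset: mAP}}}
TABLE_VI = {
    1024: {"Temb+RN": {"Ox105k": 50.2, "Hol.+Fl1M": 49.4},
           "F-FAemb+RN": {"Ox105k": 53.2, "Hol.+Fl1M": 68.5}},
    512: {"Temb+RN": {"Ox105k": 46.1, "Hol.+Fl1M": 46.9},
          "F-FAemb+RN": {"Ox105k": 50.9, "Hol.+Fl1M": 65.3}},
    256: {"Temb+RN": {"Ox105k": 40.8, "Hol.+Fl1M": 43.7},
          "F-FAemb+RN": {"Ox105k": 44.5, "Hol.+Fl1M": 61.9}},
    128: {"Temb+RN": {"Ox105k": 33.9, "Hol.+Fl1M": 38.7},
          "F-FAemb+RN": {"Ox105k": 37.1, "Hol.+Fl1M": 58.0}},
}


def gains(dataset):
    """Return {D: mAP(F-FAemb+RN) - mAP(Temb+RN)} for one dataset."""
    return {
        d: round(rows["F-FAemb+RN"][dataset] - rows["Temb+RN"][dataset], 1)
        for d, rows in TABLE_VI.items()
    }


for name in ("Ox105k", "Hol.+Fl1M"):
    print(name, gains(name))
```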
Compared to CNN features [40] at the same dimension (the last two rows of Table VI), the performance of CNN features is higher than that of our method on the Holidays dataset, but the differences in favor of our method on the Oxford5k and Oxford105k datasets are much larger. The gains of our method over CNN features on Oxford5k and Oxford105k are and , respectively.
5.4.4 Evaluation on binary representation
Two main problems which need to be considered in large scale image search are fast searching and efficient storage. An attractive approach for handling those problems is to represent each image by very compact codes, i.e., binary codes [45, 46, 47, 48, 49].
In this section, we further evaluate the performance of the proposed F-FAemb when the single representation is compressed to compact binary codes. In order to achieve compact codes for the single representation, we use the state-of-the-art hashing method Iterative Quantization (ITQ) [45]. The ITQ has two main steps: the first step is to apply PCA for dimensionality reduction; the second step is to seek an optimal rotation matrix which rotates the projected data to binary values such that the quantization error is minimized.
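The two ITQ steps described above can be sketched in NumPy as follows. This is a minimal illustration of the standard alternating scheme (PCA projection, then alternating between binarization and an orthogonal Procrustes update of the rotation), not the code used in the experiments. The function names, the random initialization, and the iteration count are assumptions.

```python
import numpy as np


def itq_train(X, n_bits, n_iter=50, seed=0):
    """Learn ITQ parameters from X of shape (n_samples, dim).

    Returns (mean, W, R): the data mean, the PCA projection (dim, n_bits)
    and the learned rotation (n_bits, n_bits).
    """
    rng = np.random.default_rng(seed)
    mean = X.mean(axis=0)
    Xc = X - mean
    # Step 1: PCA, keeping the top n_bits principal directions.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    W = Vt[:n_bits].T
    V = Xc @ W
    # Step 2: find a rotation R that minimizes ||sign(VR) - VR||_F.
    # Start from a random orthogonal matrix (assumed initialization).
    R, _ = np.linalg.qr(rng.standard_normal((n_bits, n_bits)))
    for _ in range(n_iter):
        B = np.where(V @ R >= 0, 1.0, -1.0)   # fix R, update the codes
        U, _, Vh = np.linalg.svd(V.T @ B)     # fix B, orthogonal Procrustes
        R = U @ Vh
    return mean, W, R


def itq_encode(X, mean, W, R):
    """Map real-valued vectors to boolean codes of length n_bits."""
    return ((X - mean) @ W @ R) >= 0
```

Binary codes produced this way are typically compared with the Hamming distance, which is what makes search fast and storage compact.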
We compare our F-FAemb to the recent embedding method Temb [13] when both of them are compressed to binary codes.
Dataset | Method | 128 bits | 256 bits | 512 bits | 1024 bits |
---|---|---|---|---|---|
Holidays | Temb [13] | 39.2 | 46.5 | 53.0 | 57.3 |
Holidays | F-FAemb | 40.1 | 47.9 | 54.5 | 59.7 |
Ox5k | Temb [13] | 27.1 | 33.1 | 38.5 | 43.4 |
Ox5k | F-FAemb | 26.4 | 33.8 | 40.7 | 45.9 |
Ox105k | Temb [13] | 25.9 | 31.6 | 37.7 | 42.9 |
Ox105k | F-FAemb | 24.2 | 32.0 | 38.5 | 44.7 |
The comparative results are presented in Table VII. On the Holidays dataset, F-FAemb achieves better results than Temb for all code lengths; the improvement grows as the code length increases. On the Oxford5k and Oxford105k datasets, Temb is better than F-FAemb at low code length, i.e., 128-bit codes. However, F-FAemb outperforms Temb when the number of bits is increased, i.e. ; the improvement is clearer at high code lengths.
5.5 Results when CNN are used as local features
In this section, we further evaluate the proposed F-FAemb when the image is described by a set of CNN features, which are the state-of-the-art image representation for various computer vision tasks [50].
5.5.1 Configuration
Specifically, instead of using a set of local SIFT features to describe the image as in the previous experiments, we extract CNN activations for local patches at multiple scale levels. We then take the union of all the patches from the image, regardless of scale. This union set can be considered as the local features describing the image. We use the output of the last fully connected layer of the pretrained AlexNet model [51] as the CNN features representing the patches. We extract CNN activations at 3 levels. For the first level, we simply take the 4096-dimensional CNN activations for the whole image. For the second and the third levels, we extract CNN activations for all , patches sampled with a stride of 30 pixels. In order to make the computation of the embedding more efficient, we use PCA to reduce the 4096-dimensional features to 45-dimensional features. The same processing (i.e., F-FAemb embedding, whitening, democratic aggregating, rotation normalization, power normalization) is applied on the set of CNN features to produce the single representation.
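The three-level sampling can be sketched as a simple enumeration of patch boxes. The stride of 30 pixels comes from the text; the patch sizes used in the example call (128 and 64 pixels) and the 256×256 image size are hypothetical placeholders chosen only for illustration, and the function names are assumptions.

```python
def patch_origins(width, height, patch, stride=30):
    """Top-left corners of all patch x patch windows that fit in the image."""
    xs = range(0, width - patch + 1, stride)
    ys = range(0, height - patch + 1, stride)
    return [(x, y) for y in ys for x in xs]


def multiscale_boxes(width, height, patch_sizes, stride=30):
    """Union of (x, y, w, h) boxes over all levels, regardless of scale.

    Level 1 is the whole image; each further level is a dense grid of
    square patches of one size.
    """
    boxes = [(0, 0, width, height)]
    for p in patch_sizes:
        boxes.extend((x, y, p, p) for x, y in patch_origins(width, height, p, stride))
    return boxes


# Hypothetical sizes: the actual patch sizes are not assumed here.
boxes = multiscale_boxes(256, 256, patch_sizes=(128, 64))
print(len(boxes))
```

Each box would then be passed through the network to obtain one 4096-dimensional activation vector, and the resulting set plays the role of the local descriptors.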
5.5.2 Comparison to the state of the art
We compare our CNN features-based F-FAemb with state-of-the-art methods that use Convolutional Neural Networks or other deep learning techniques for image retrieval, i.e., Convolutional Kernel Networks (CKN) [52], the combination of Fisher Vector encoding and Deep Neural Network (FV-DNN) [53], Multiscale Orderless Pooling (MOP-CNN) [54], and CNN features (CNN) [50]. We also compare F-FAemb to very recent works: Sum Pooling of Convolutional features (SPoC) [55] and Regional Maximum Activation of Convolutional features (R-MAC) [56].
Among mentioned works, FV-DNN [53], MOP-CNN [54], CNN [50] rely on the outputs of a fully connected layer, while the recent SPoC [55] and R-MAC [56] apply the pooling (sum pooling or max pooling) on the activations of a convolutional layer for producing the single representation. It is also worth mentioning that the recent work [57] applies the Fisher Vector encoding on the outputs of a convolutional layer. That work, however, evaluates on the texture recognition problem, not on image retrieval.
Note that when evaluating on the Oxford5k dataset, FV-DNN [53] and CNN [50] report results with “full query”; SPoC [55] and CKN [52] report results with both “full query” and “crop query”, while R-MAC [56] reports results with “crop query” only. It is also worth noting that in [50], at the retrieval stage on Oxford5k, the authors use a spatial search, i.e., for each image they extract multiple patches of different sizes/scales and compute a CNN representation for each patch. The distance between a query patch and a reference image is defined as the minimum distance between the query patch and the respective reference patches. The distance between the query and the reference image is set to the average distance of each query patch to the reference image. This spatial search is costly, and it differs from ours and from [52, 53, 54, 55, 56], in which only a single representation is matched per image.
The aforementioned works apply PCA/whitening on the single representation. Thus, for clear presentation, we omit the +RN notation when presenting results of F-FAemb. The comparative mAP is shown in Table VIII. The short vector representations of F-FAemb (i.e., when ) are obtained by keeping only the first components after the rotation normalization step.
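The spatial search of [50] described above can be sketched as follows, next to the single-vector comparison it is contrasted with. This is an illustrative sketch only: the use of the Euclidean distance between patch descriptors and all function names are assumptions.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def patch_to_image_distance(query_patch, ref_patches):
    """Minimum distance from one query patch to any reference patch."""
    return min(euclidean(query_patch, r) for r in ref_patches)


def spatial_search_distance(query_patches, ref_patches):
    """Average, over query patches, of the patch-to-image distance."""
    total = sum(patch_to_image_distance(q, ref_patches) for q in query_patches)
    return total / len(query_patches)


def single_vector_distance(query_vec, ref_vec):
    """Single-representation matching: one distance per image pair."""
    return euclidean(query_vec, ref_vec)


# Toy 2-D "descriptors": two query patches against two reference patches.
d = spatial_search_distance([[0.0, 0.0], [3.0, 4.0]], [[0.0, 0.0], [0.0, 4.0]])
```

The nested loops make the cost per image pair grow with the product of the patch counts, and every patch descriptor of every reference image has to be stored, whereas the single-vector scheme stores one vector and computes one distance.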
Method | D | mAP, Hol. | mAP, Oxford5k (full query) | mAP, Oxford5k (crop query) |
---|---|---|---|---|
F-FAemb | 7,245 | 86.0 | 59.5 | 56.3 |
F-FAemb | 4,096 | 85.4 | 57.8 | 54.0 |
CKN[52] | 4,096 | 82.9 | 55.4 | 56.5 |
FV-DNN[53] | 4,096 | 84.7 | - | - |
CNN[50] | 4,096 | 84.3 | 68.0 | - |
F-FAemb | 2,048 | 83.5 | 53.4 | 50.7 |
MOP-CNN[54] | 2,048 | 80.2 | - | - |
F-FAemb | 512 | 78.1 | 41.2 | 39.3 |
R-MAC[56] | 512 | - | - | 66.9 |
F-FAemb | 256 | 74.0 | 36.6 | 34.5 |
R-MAC[56] | 256 | - | - | 56.1 |
SPoC [55] | 256 | 80.2 | 58.9 | 53.1 |
On the Holidays dataset, at the same dimension, F-FAemb slightly improves over FV-DNN[53], CNN[50] and considerably improves over MOP-CNN[54], CKN[52]. When short representation is applied, i.e. , the recent SPoC [55] outperforms F-FAemb. However, it is worth noting that F-FAemb achieves state-of-the-art results at its full dimension, i.e., mAP = at .
On the Oxford5k dataset, when the “full query” is used, F-FAemb outperforms CKN [52] while it is lower than CNN [50]. However, it is worth noting that the spatial search used in [50] is costly, as it has two drawbacks. First, all the patch vectors of the image have to be stored. This increases the memory requirements by a factor of where is the number of extracted patches per image. Second, the complexity of computing the distance between two images is increased by a factor of . In contrast to [50], F-FAemb and the other methods store only a single representation per image and compute only one Euclidean distance when comparing two images. When the short representation is applied, SPoC [55] outperforms F-FAemb. In [55], the authors show that convolutional features are robust to PCA compression, i.e., applying PCA on the single representation improves the mAP rather than decreasing it as with other methods. When the “crop query” is used, the recent R-MAC [56] gives very strong results; it outperforms all compared methods.
Comparing F-FAemb + SIFT features (Table VI, Table IV) to F-FAemb + CNN features (Table VIII), we observe that the CNN feature configuration used here (Section 5.5.1) achieves better results than SIFT features on the Holidays dataset, which contains general images. However, when testing on the Oxford5k dataset, which contains particular objects (buildings), the SIFT features achieve better performance than the CNN features. To ease comparison for other researchers, we summarize our best results in Table IX, in which CNN features are used for the Holidays dataset and SIFT features are used for the Oxford5k dataset (crop query).
Method | D | mAP, Holidays | mAP, Oxford5k (crop query) |
---|---|---|---|
F-FAemb | 15,525 | 86.8 | 70.3 |
F-FAemb | 7,245 | 86.0 | 66.1 |
F-FAemb | 4,096 | 85.4 | 63.7 |
F-FAemb | 1,024 | 81.4 | 58.2 |
F-FAemb | 512 | 78.1 | 53.9 |
F-FAemb | 256 | 74.0 | 45.6 |
6 Conclusion
Embedding local features to high dimensional space is a crucial step for producing the single powerful image representation in many state-of-the-art large scale image search systems. In this paper, by departing from the goal of linear approximation of a nonlinear function in high dimensional space, we first propose a novel embedding method. The proposed embedding method, FAemb, can be seen as the generalization of several well-known embedding methods such as VLAD, TLCC, VLAT. In order to speed up the embedding process, we then derive the fast version of FAemb, in which the embedded vector can be efficiently computed, i.e., in the closed-form. The proposed embedding methods are evaluated with different state-of-the-art local features such as SIFT, CNN, in image search context under various settings. The experimental results show that the proposed embedding methods give a performance boost over the state of the art on several standard public image retrieval benchmarks.
Appendix A Proof of Lemma 3.1
By assumption, we have (i) is of class of . Because is Lipschitz continuous with constant , we have . So for , we have (ii) . Together, (i) and (ii) ensure that the condition of Corollary 2.3 holds.
We have