I Introduction
Over the past decades, image retrieval has received widespread attention. Representations derived by aggregating Scale-Invariant Feature Transform (SIFT) [16] features have been shown to be effective for image retrieval [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]. Later, image retrieval methods based on Convolutional Neural Networks (CNNs) [17] achieved excellent performance [18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30]. These methods represent an image as a description vector and rank the database images by the Euclidean distances between their feature vectors and that of the query.
Manifolds are fundamental to perception [31]. For example, to recognize faces, the brain equates all images from the same manifold but distinguishes between images from different manifolds. In an image collection, objects and landmarks are depicted under various conditions, such as different viewing angles and illumination. As a consequence, query and relevant images are often connected by a sequence of images in which consecutive images are similar. The descriptors of these images form a manifold in the descriptor space. As shown in Fig. 1, some negative samples are far from the query image along paths in the kNN graph (geodesic distances) yet are closer to the query than some positive samples in terms of Euclidean distance.
However, the data in an image collection is incomplete in most situations. The images do not vary smoothly, and the data sampled from the intrinsic manifolds are too sparse. As a result, the manifolds reconstructed from these sparse data contain holes and are neither continuous nor smooth. The holes lead to large errors in the computed geodesic distances. Moreover, existing manifold learning methods [32, 33, 34, 35, 36, 37, 38] are not appropriate for the image retrieval task, because most of them cannot process a query image and they incur considerable additional computational cost, especially for large-scale databases. Except for IsoMap [32] and LLE [33], these manifold learning methods cannot handle a new image at query time. For IsoMap and LLE, the computational cost for a query image is high because the kNN of the query image must be computed.
To address these problems, we propose the iterative manifold embedding (IME) layer, which explores the intrinsic manifolds from incomplete data. The weights of the IME layer are learned offline with an unsupervised strategy. In the online query stage, the IME layer maps the features of query images into the embedding space with negligible additional computational cost.
The IME layer addresses the problem of sample holes in two ways. (1) Points that share similar neighbours tend to be similar to each other [37]. We employ this intuition to reduce the topological instability of the kNN graph: the second-order proximity information suppresses the interference of the sample holes. (2) We use the Euclidean distances to correct the errors in the computed geodesic distances. Based on the corrected geodesic distances, we embed the data into a low-dimensional space that preserves the intrinsic geometry of the data. These steps are repeated over several iterations to construct a stable and robust kNN graph from incomplete data. To adapt the algorithm to the image retrieval task, we simplify and approximate IME by a linear mapping, which we call the IME layer. The IME layer is an integrated and simplified version of IME that reduces the computational cost and the estimation error of geodesic distances for query images. In the online retrieval stage, the query image is embedded by the IME layer with negligible additional computational cost. Working as an additional fully connected layer, the proposed IME layer can be directly connected to CNNs [39, 26, 27, 28, 29]. For SIFT-based representations [9, 11, 12, 13, 14, 15], the IME layer can work as a transform matrix that maps the vector representation into a low-dimensional space and preserves the original neighbourhood relationships.
We conduct extensive experiments on five public standard image retrieval datasets, covering landmarks and logos. The experimental results show that the proposed manifold-based embedding significantly improves the performance of global representation vectors. The proposed IME layer achieves a significant boost in performance over the related dimension reduction and manifold learning methods. Without reranking, our IME layer still outperforms the state-of-the-art methods that rely on search reranking as a post-processing step on most datasets. On a set of five thousand images with 2048 dimensions, the IME layer is at least twenty-seven times faster than other manifold learning methods [32, 33] at query time. The computational time of IsoMap [32] and LLE [33] increases as the database grows, while the cost of our IME layer does not change. On the large-scale dataset that contains 27000 images, our IME layer is more than 120 times faster than other manifold learning methods. Therefore, our IME layer is efficient and effective for large-scale image retrieval.
The main contributions of this paper are summarized as follows:

We propose an iterative manifold embedding (IME) approach, which explores the intrinsic manifold from incomplete data. To suppress the interference of the sample holes, we employ the second-order proximity and the original Euclidean distances to correct the geodesic distances during the iteration process. By reconstructing the manifold of database images, our IME method reduces the dimensions of the original representation vectors and enhances the discriminative power of the embedded representations.

We propose the iterative manifold embedding (IME) layer, an integrated and simplified version of IME that accelerates its calculation. The weights of the IME layer are learned offline by ridge regression from the original representations and the embedded representations. With an embedding time below 2 milliseconds, the trained IME layer can be directly connected to CNNs [39, 26, 27, 28, 29] or work independently as a transform matrix that maps SIFT-based representations [9, 11, 12, 13, 14, 15].
The paper is organized as follows. In Section II we discuss the previous work related to manifold learning and manifoldbased methods for image retrieval. Then, we illustrate the formulation of the proposed algorithm and derive the solution in detail in Section III. The experimental results are described in Section IV. Finally, Section V concludes the paper and Section VI introduces our future work.
II Related work
In this section, we review previous related work from two aspects: manifold learning methods and manifold-based image retrieval methods.
To the best of our knowledge, very few manifold learning methods can be directly applied to image retrieval. Instead, most manifold learning methods focus on dimension reduction and data visualization. Some manifold-based methods are applied to image retrieval in the search reranking process. Our IME layer embeds the original representations during the image representation process, based on the image manifold reconstructed from incomplete data.
II-A Manifold learning
Our work is related to manifold learning and dimension reduction methods, such as IsoMap [32], LLE [33], Laplacian Eigenmap [34], SNE [35], t-SNE [36], LINE [37] and LargeVis [38].
The most popular manifold learning method may be IsoMap [32], which preserves shortest graph path distances using MDS [40]. IsoMap first constructs the kNN graph of the data and then computes the shortest path distances between all pairs of points on that graph. Finally, the distance vectors are embedded into a low-dimensional space. By exploiting the local symmetries of linear reconstructions, LLE [33] is able to learn the global structure of nonlinear manifolds. Laplacian Eigenmap [34] uses a geometrically motivated algorithm to construct a representation for data sampled from a low-dimensional manifold embedded in a higher-dimensional space. SNE [35] minimizes the Kullback-Leibler divergences between the original and induced distributions to preserve neighbour identities as well as possible. t-SNE [36] later replaced the Gaussian distribution of SNE [35] with the Student t-distribution to solve the crowding problem. By exploiting the first-order and second-order proximity between vertices, LINE [37] designs an objective function that preserves both the local and global network structures. Instead of building a large number of trees to obtain a highly accurate kNN graph, LargeVis [38] uses neighbour exploring techniques to improve the accuracy of the graph.
Except for IsoMap [32] and LLE [33], these manifold learning methods cannot be used directly for image retrieval, because they cannot produce an embedded representation for a query image. To map a new query image, IsoMap [32] estimates the geodesic distances between the query image and the database images on the constructed kNN graph, and then reduces the dimensions of the geodesic distance vector. LLE [33] computes the kNN of a query image and represents the query image by their weighted sum. For both IsoMap [32] and LLE [33], computing the embedded representation of a query image is expensive because the kNN of the query image must be found. In addition, the sample holes seriously interfere with the stability of the kNN graph, so IsoMap [32] and LLE [33] are not robust, especially for incomplete data. Our IME layer embeds the original representation of a query image quickly and is not sensitive to the interference of sample holes.
II-B Manifold-based image retrieval
Some manifold-based methods are applied to image retrieval as post-processing on top of convolutional features, e.g., query expansion [41, 42] and diffusion on region manifolds [43, 44, 45]. These methods leverage the information of the image manifold in the search reranking process and achieve outstanding performance.
In query expansion [41], a number of highly ranked results that satisfy strong spatial constraints with respect to the original query are reissued as a new query during search reranking. Average query expansion (AQE) [41] is now a standard post-processing step for image retrieval methods, owing to its efficiency and significant performance boost. However, AQE only explores the first-order neighbourhood of query images. Recursive average query expansion [41] further improves the results by explicitly crawling the image manifold, but it considerably increases the query time. Unlike query expansion, which exploits the manifold of images at query time, diffusion [43, 45] constructs the neighbourhood graph of the dataset offline and uses this information at query time to search on the manifold. In recent work [45], diffusion on the image manifold is used to compute the reranking scores in the search reranking process.
Working as post-processing, these manifold-based image retrieval methods require considerable additional reranking cost at query time. Without post-processing, our IME layer still achieves better performance than these state-of-the-art image retrieval methods on most datasets, and its additional computational cost in the online retrieval stage is negligible (less than 2 milliseconds).
III The proposed approach
The diagram of the proposed method is shown in Fig. 2. We learn the weights of the iterative manifold embedding (IME) layer in the first, offline stage. Then, in the online image retrieval stage, the proposed IME layer rapidly embeds the original representations into IME representations. While constructing the kNN graph, we exploit the second-order proximity to suppress the interference of sample holes. In the manifold embedding step, the Euclidean distances of the original representations are used to correct the calculation error of the geodesic distances. To better reconstruct the intrinsic manifold from incompletely sampled data, we repeat the kNN graph construction and manifold embedding steps several times. Through suitable approximation and simplification, the IME layer embeds the original representations into IME representations that preserve the intrinsic manifold of the database images in negligible additional time. The IME layer is the integrated version of IME; learned by ridge regression, it reduces the estimation error of geodesic distances and the loss of discriminative information in dimension reduction.
In this section, we describe our novel IME layer for image retrieval in detail. First, Section III-A formulates the proposed iterative embedding method, which embeds the original representations according to the intrinsic manifold of the images. Then, Section III-B implements the proposed embedding with a fully connected layer, which we call the IME layer.
III-A Iterative manifold embedding
The iterative manifold embedding (IME) algorithm alternates between two steps, which are detailed in Algorithm 1. In the first step, we construct the kNN graph and calculate the geodesic distances, taking the second-order proximity into account. In the second step, the original representations are embedded into manifold-based representations that preserve the corrected geodesic distances.
The first step constructs the kNN graph and calculates the geodesic distances of the original representations, where each image is described by an n-dimensional original representation vector and d is the database scale.
To construct the first-order kNN graph, each point is connected only to its k nearest neighbours, based on the Euclidean distances between pairs of images in the input space. The neighbourhood relations are coarsely represented as a weighted first-order graph over the data, with weighted edges between neighbouring images. The weights of the edges in the first-order graph are defined as the following matrix:
(1) 
In this matrix, an entry either marks the edge between two images as cut in the first-order graph or records the path distance of the edge that connects them.
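As an illustration of this construction, the following minimal Python sketch builds a first-order kNN weight matrix for a toy point set. The function name `knn_graph`, the use of infinity to mark a cut edge, and the symmetrisation of the neighbour relation are assumptions made for the example rather than details confirmed by Eq. 1.

```python
import math


def knn_graph(points, k):
    """Return a d x d weight matrix for a k-nearest-neighbour graph.

    Kept edges carry the Euclidean distance between the two points;
    cut edges are marked with math.inf (an assumption of this sketch).
    """
    d = len(points)
    weights = [[math.inf] * d for _ in range(d)]
    for i in range(d):
        weights[i][i] = 0.0
        # Sort all other points by their Euclidean distance to point i.
        by_distance = sorted(
            (math.dist(points[i], points[j]), j) for j in range(d) if j != i
        )
        for dist, j in by_distance[:k]:
            # Assumption: store the edge in both directions (undirected graph).
            weights[i][j] = dist
            weights[j][i] = dist
    return weights


# Toy data: three nearby points and one isolated point on a line.
toy_points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (10.0, 0.0)]
first_order = knn_graph(toy_points, k=1)
```

With `k=1`, the isolated point is linked only to its single nearest neighbour, while points 0 and 2 are not directly connected even though they are close, which is the kind of cut path the text attributes to a sparse kNN graph.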
To suppress the interference of the sample holes, we exploit the second-order proximity in the constructed graph. Following the idea in LINE [37], we assume that vertices sharing many connections to other vertices are similar to each other. Unlike LINE [37], we use this intuition to construct a stable graph G in a simple but effective way. The weights of the edges in graph G are calculated from the weights in the first-order graph:
(2) 
As in the first-order graph, an entry can mark two images as not directly connected in the second-order graph G. Otherwise, the weight gives the distance between two directly connected images in the second-order graph, which is more robust for sparse samples. When the samples are sparse, the k nearest neighbours based on Euclidean distances are sometimes far-fetched and unstable. The second-order graph uses the intuition that vertices sharing connections to other vertices are similar to each other to keep the connections stable.
We estimate the geodesic distances between all pairs of images on the manifold by computing their shortest path distances in the graph G. The shortest paths are computed with the Floyd-Warshall algorithm [46].
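The shortest-path step can be sketched as follows. This is a plain-Python Floyd-Warshall over a dense weight matrix in which infinity marks a missing edge; the function name `floyd_warshall` and the toy graph are hypothetical. Its cost is cubic in the number of images, which suits an offline stage.

```python
import math


def floyd_warshall(weights):
    """All-pairs shortest path distances for a dense weight matrix.

    weights[i][j] is the edge length, or math.inf if there is no edge.
    The input matrix is left unmodified.
    """
    d = len(weights)
    dist = [row[:] for row in weights]
    for via in range(d):
        for i in range(d):
            through = dist[i][via]
            if through == math.inf:
                continue  # no path from i to the intermediate vertex yet
            for j in range(d):
                candidate = through + dist[via][j]
                if candidate < dist[i][j]:
                    dist[i][j] = candidate
    return dist


INF = math.inf
# Hypothetical 4-image chain graph: 0 - 1 - 2 - 3.
chain = [
    [0.0, 1.0, INF, INF],
    [1.0, 0.0, 1.0, INF],
    [INF, 1.0, 0.0, 8.0],
    [INF, INF, 8.0, 0.0],
]
geodesic = floyd_warshall(chain)
```

In this toy chain, the geodesic distance between images 0 and 3 is the sum of the three edge lengths along the only path that joins them.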
Although the second-order proximity is used to suppress the interference of sample holes, the geodesic distances still contain estimation errors. To further reduce the reconstruction error of the image manifold, we correct the matrix of geodesic distances with the Euclidean distances in the second step. We embed the representations into a low-dimensional space that best preserves the manifold's estimated intrinsic geometry. The manifold-based representations, one vector per image, are computed by minimizing the cost function
(3) 
where the norm is a matrix norm and a parameter controls the strength of the correction. A conversion function converts distances to similarities. Given a distance matrix, many conversion functions can be used in practice, such as a quadratic function or the t-distribution. With the quadratic function as the conversion function, the solution of Eq. 3 is equivalent to the globally optimal solution of the cost function in IsoMap [32]. We compare different conversion functions in Section IV. In order to retain more discriminative information, we constrain the embedded representations to be orthonormal.
The global minimum of Eq. 3 is achieved by setting the representations to the top eigenvectors of the similarity matrix
(4) 
Consider the p-th largest eigenvalue of this matrix together with the p-th eigenvector. The p-th component of the m-dimensional manifold-based representation of an image is then obtained from this eigenvalue and from that image's component of the p-th eigenvector. The dimension m affects the discriminative power of the representation, and its selection is a compromise between computational cost and accuracy.
The manifold-based representation exploits neighbourhood relationships to represent the features of an image. In this paper, we use only a small number of highly credible neighbours to construct the kNN graph. The kNN graph constructed in the first step is therefore sparse and contains few negative neighbour pairs that represent different objects, but some important connection paths are cut off, and the sample holes harm the reliability of the graph. The correction operations in Eq. 2 and Eq. 4 address this problem by exploiting the second-order proximity and the Euclidean distances. The performance of the embedded representations improves significantly as a result.
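A minimal NumPy sketch of the eigenvector step is given below, assuming a symmetric similarity matrix is already available. The function name `top_eigenvector_embedding` and the toy similarity matrix are hypothetical, and the sketch returns unscaled orthonormal eigenvectors; whether the components are additionally scaled by the eigenvalues is left open here.

```python
import numpy as np


def top_eigenvector_embedding(similarity, m):
    """Embed d items with the m leading eigenvectors of a d x d matrix.

    Returns a (d, m) array whose row i is the embedding of item i,
    together with the matching eigenvalues in descending order.
    """
    similarity = np.asarray(similarity, dtype=float)
    # Symmetrise defensively so that eigh's symmetry assumption holds.
    symmetric = 0.5 * (similarity + similarity.T)
    eigenvalues, eigenvectors = np.linalg.eigh(symmetric)  # ascending order
    order = np.argsort(eigenvalues)[::-1][:m]
    return eigenvectors[:, order], eigenvalues[order]


# Hypothetical similarity matrix built from random 3-D points.
rng = np.random.default_rng(0)
points = rng.normal(size=(6, 3))
squared = ((points[:, None, :] - points[None, :, :]) ** 2).sum(axis=-1)
toy_similarity = 1.0 / (1.0 + squared)

embedding, leading_values = top_eigenvector_embedding(toy_similarity, m=2)
```

Because `numpy.linalg.eigh` returns orthonormal eigenvectors, the columns of `embedding` satisfy the orthonormality constraint mentioned in the text; a larger `m` keeps more eigenvectors at a higher storage and comparison cost.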
Although these strategies are straightforward, remarkable performance is achieved within a few iterations. We repeat the two steps above to improve the stability of the graph. In each iteration, the Euclidean distances are updated based on the manifold-based representations. The iteration process is stated in detail in Algorithm 1. After a few iterations, we obtain the final m-dimensional IME representations, one vector per image. In this way, we map the original representations into a low-dimensional space that preserves the geometry of the image manifold.
III-B Integration as IME layer
The iterative manifold embedding (IME) method proposed above embeds the original representations into manifold-based representations. However, applying IME directly to a query image increases the feature extraction time of the query and the estimation error of the geodesic distances. Similar to IsoMap [32], the IME method of Section III-A needs to compute the shortest paths from a query image to all database images and estimate the geodesic distances at query time. Both methods are applied to image retrieval by linking the query into the graph of geodesic distances of the training data. First, the k nearest neighbours of the query are found in the training data. Then, the shortest geodesic distances from the query to each point in the training data are computed and transformed into a similarity vector by the conversion function. Finally, the similarity vector, corrected by the Euclidean distances, is projected into the IME representation by the eigenvector matrix of the training data. The additional computational cost is proportional to the database scale. To reduce the estimation error and the computational cost, in this section the IME method is implemented in simplified form by a fully connected layer, which we call the IME layer.
The IME layer can be regarded as a transform matrix that integrates the estimation of geodesic distances with dimension reduction. We learn the transform matrix from the original representations and the IME representations of the database images. The weights of the IME layer, i.e. the transform matrix, are obtained by minimizing the following objective function:
(5) 
Here, the two matrices hold the original representations and the IME representations of the d images in the retrieval database, respectively, and n and m are the dimensions of the original representation and the IME representation, respectively. The norm is a matrix norm. The first term in the objective function is the transformation cost, which minimizes the difference between the representations computed by the IME algorithm and the representations mapped by the IME layer. The second term is a regularization term, whose weight is controlled by a regularization parameter.
This is equivalent to a ridge regression problem and has a closed-form solution. Setting the gradient of the objective function to zero gives
(6) 
Then, this reduces to
(7) 
where I is the identity matrix. Since n, the dimension of the original representation, is low, solving this problem (which needs to be done only once, at learning time) is extremely fast.
The IME layer is an integrated and simplified version of IME, and its computational complexity is low at both the learning and retrieval steps. It simplifies the calculation and integrates the estimation of geodesic distances with dimension reduction through ridge regression. This integration reduces the estimation error and the loss of discriminative information. In IME, the representation of a query image is computed directly from the corrected geodesic distances projected by the eigenvector matrix of the training data, so the estimation error of the geodesic distances affects the result significantly. The performance of IME also depends heavily on the parameters of the computation. The IME layer reduces the number of parameters and omits the computation of the shortest geodesic distances from the query to each point in the training data, which avoids the accumulation of errors in intermediate steps. In practice, our integration strategy can also be applied to other manifold learning methods [32, 33, 34, 35, 36, 37, 38] for the image retrieval task. For SIFT-based representations [9, 11, 12, 13, 14, 15], the IME layer can work as a transform matrix that maps the vector representations into the embedding space. It can also be directly connected to CNNs [39, 26, 27, 28, 29] as an additional fully connected layer for CNN-based representations.
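To make the closed-form solution concrete, the sketch below assumes that the objective is the squared Frobenius norm of W X - Y plus a regularization weight times the squared Frobenius norm of W, with X the n x d original representations, Y the m x d IME representations and W the m x n transform matrix. Under that assumption, setting the gradient to zero gives W = Y X^T (X X^T + reg I)^(-1), which only requires solving an n x n linear system. The names `fit_ime_layer` and `reg`, the matrix layout and the synthetic data are hypothetical.

```python
import numpy as np


def fit_ime_layer(X, Y, reg):
    """Ridge-regression weights W (m x n) such that W @ X approximates Y.

    X: (n, d) original representations, one column per database image.
    Y: (m, d) target embedded representations, one column per image.
    reg: positive regularization weight (assumed notation).
    """
    n = X.shape[0]
    gram = X @ X.T + reg * np.eye(n)  # (n, n), symmetric positive definite
    # Solve gram @ W.T = X @ Y.T instead of forming an explicit inverse.
    return np.linalg.solve(gram, X @ Y.T).T


# Synthetic check: targets generated by a known linear map.
rng = np.random.default_rng(0)
n, m, d = 8, 3, 50
X = rng.normal(size=(n, d))
W_true = rng.normal(size=(m, n))
Y = W_true @ X

W = fit_ime_layer(X, Y, reg=1e-6)

# At query time the layer is a single matrix-vector product.
query = rng.normal(size=n)
embedded_query = W @ query
```

The query-time cost in this sketch depends only on n and m, not on the number of database images d, which matches the claim that the cost of the IME layer does not grow with the database.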
IV Experiment
This section presents the experimental setup and investigates the accuracy of our approach for image retrieval on five public datasets. To evaluate the efficiency and effectiveness of our IME layer, we compare it with the related manifold learning methods and the state-of-the-art image retrieval methods.
IV-A Datasets
We evaluate the performance of our IME layer on five standard datasets for image retrieval. Mean average precision (mAP) is used as the performance measure on all datasets.
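For reference, mean average precision can be computed as in the following sketch. It implements plain, non-interpolated average precision; benchmark-specific details such as the handling of ignored ("junk") images in some evaluation protocols are omitted, and the function names and toy rankings are hypothetical.

```python
def average_precision(ranked_ids, relevant_ids):
    """Non-interpolated average precision of one ranked result list."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, image_id in enumerate(ranked_ids, start=1):
        if image_id in relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this recall point
    return precision_sum / len(relevant)


def mean_average_precision(rankings, ground_truth):
    """Mean of the per-query average precisions."""
    scores = [average_precision(rankings[q], ground_truth[q]) for q in rankings]
    return sum(scores) / len(scores)


# Two hypothetical queries with their ranked results and relevant images.
toy_rankings = {"q1": ["a", "x", "b"], "q2": ["y", "c"]}
toy_ground_truth = {"q1": ["a", "b"], "q2": ["c"]}
toy_map = mean_average_precision(toy_rankings, toy_ground_truth)
```

Here the first query scores (1/1 + 2/3) / 2 and the second scores 1/2, so the toy mAP is their mean.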
Two are well-known image retrieval benchmarks: Oxford5k [47] and Paris6k [48]. Oxford5k contains 5062 images collected from Flickr by searching for particular Oxford landmarks. The Paris6k dataset contains 6412 photographs from Flickr associated with Paris landmarks. In each dataset, 55 queries corresponding to 11 buildings are manually annotated, and the performance is measured using mean average precision (mAP) over the 55 queries.
For large-scale image retrieval, we experiment on the Oxford105k and Paris106k datasets, which add 100k distractor images from Flickr [47].
The fifth dataset is the recently introduced instance search dataset INSTRE [49]. It contains various everyday 3D or planar objects, from buildings to logos, with many variations such as different scales, rotations and occlusions. Some objects cover only a small part of the image, which makes the dataset challenging. It consists of 28543 images from 250 different object classes: 100 classes with images retrieved from online sources, 100 classes with images taken by the dataset creators, and 50 classes consisting of pairs from the second category. Unlike the original protocol [49], which uses all database images as queries, we evaluate the performance in the same way as recent work [45]. The INSTRE dataset is randomly split into 1250 queries, 5 per class, and 27293 database images, and a bounding box defines the query region. The query and database sets do not overlap.
IV-B Implementation details
Our IME layer can be combined with both SIFT-based and CNN-based representations. For CNN-based representations, we employ the network fine-tuned for image retrieval [29] to extract the representation vectors. This fine-tuned ResNet-101 produces 2048-dimensional representations. We extract regions at 3 different scales as in R-MAC [23], and we additionally include the full image as a region. In this fashion, each image has 21 regions on average. The regional representations are aggregated and renormalized to unit norm to construct the original representations, exactly as in R-MAC [23]. For SIFT-based representations, we employ triangulation embedding [13] to aggregate the RootSIFT descriptors [50]. In practice, we use 8064-dimensional representations with a vocabulary size of 64.
The weights of the correction term and the regularization term are held fixed throughout our experiments. Time measurements are reported on a 32-core Intel Xeon 2.2GHz CPU.
IV-C Impact of different components
In this section, we conduct a series of experiments on the second-order proximity, the similarity computation, the parameters of IME, the IME layer and various original representations.
Second-order proximity. To demonstrate the effectiveness of the second-order proximity information, Table I compares the results of computing the geodesic distances on the first-order graph and on graph G. The weights of the edges in graph G are calculated from both the first-order and second-order relationships of the database images. The performance of graph G is consistently better on all datasets, which shows that graph G is more reliable for calculating geodesic distances. The second-order neighbour relationship effectively suppresses the interference of the manifold's sample holes. As a result, the constructed kNN graph is more stable and robust.
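Table I: mAP when the geodesic distances are computed on the first-order graph versus graph G.
Graph | INSTRE | Oxford5k | Oxford105k | Paris6k | Paris106k |
First-order graph | 80.7 | 89.5 | 85.9 | 92.5 | 85.6 |
Graph G | 82.4 | 92.0 | 87.2 | 96.6 | 93.3 |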
Similarity computation. We compare different conversion functions, such as the quadratic function and the t-distribution, in Fig. 3. These conversion operators are used in IsoMap [32] and t-SNE [36], respectively.
Fig. 3 shows the performance obtained with the different functions for converting distances into similarities. The t-distribution performs much better than the quadratic function on all datasets. It enhances the discrimination of moderate geodesic distances and suppresses the differences between large geodesic distances. Because of error accumulation, the calculation error of large geodesic distances is large. By suppressing the differences between large geodesic distances, the t-distribution reduces the error in the similarity between points that are far apart, whereas the quadratic function increases it. Therefore, we employ a Student t-distribution with one degree of freedom (which is the same as the Cauchy distribution) as the conversion function in our IME layer.
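The contrast between the two conversion functions can be illustrated numerically. The exact forms used in the experiments are not shown in the text, so the sketch below assumes the commonly used ones: a negative half squared distance for the quadratic function (as in classical MDS before centring) and the kernel 1 / (1 + distance^2) for the Student t-distribution with one degree of freedom (as in t-SNE). The function names and sample distances are hypothetical.

```python
def quadratic_similarity(distance):
    """Assumed quadratic conversion: negative half squared distance."""
    return -0.5 * distance ** 2


def student_t_similarity(distance):
    """Assumed Student t (one degree of freedom, Cauchy) kernel."""
    return 1.0 / (1.0 + distance ** 2)


# Similarity gap between two moderate distances and two large distances
# that differ by the same amount (one unit).
moderate_gap_t = student_t_similarity(1.0) - student_t_similarity(2.0)
large_gap_t = student_t_similarity(10.0) - student_t_similarity(11.0)
moderate_gap_q = quadratic_similarity(1.0) - quadratic_similarity(2.0)
large_gap_q = quadratic_similarity(10.0) - quadratic_similarity(11.0)

print("t-distribution gaps:", moderate_gap_t, large_gap_t)
print("quadratic gaps:     ", moderate_gap_q, large_gap_q)
```

Under these assumed forms, the t-distribution shrinks the gap between the two large distances far below the gap between the two moderate ones, while the quadratic function magnifies it, which is the behaviour the paragraph above describes.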
Parameters of IME. We evaluate the performance under different parameters of IME: the number of iterations, the dimension m of the IME representation, the number of neighbours k and the strength of the correction.
Fig. 4(a) shows the performance versus the number of iterations. Our IME layer performs well with only a small number of iterations; too many iterations lead to overfitting of the embedding. We use 2 iterations in the remaining experiments in this paper, as this gives better performance at a moderate training time.
The performance for various dimensions of the embedded representations is compared in Fig. 4(b). We achieve 92.0 and 96.6 mAP on the Oxford5k and Paris6k datasets, respectively, with the 2048-dimensional IME representation. The high-dimensional representation preserves more discriminative information, so it performs better than the low-dimensional representation. As shown by the dotted lines in Fig. 4(b), the mAP of the original representation is 83.9 and 93.8 on the Oxford5k and Paris6k datasets, respectively. Our method outperforms the original representation even when the final IME representation has 128 dimensions, one sixteenth of the original 2048. The results show that our low-dimensional IME representation still preserves the intrinsic manifold, and that the reconstructed manifold is effective for finding similar images in the image retrieval task.
Fig. 4(c,d) shows the performance versus the number of neighbours k used to construct the kNN graph, with separate values for the first and second iterations. The choice of k depends to some degree on the sparsity of the dataset. A small k is better suited to a sparser dataset. With a small k, most of the neighbours are positive. With a large k, the graph contains many negative neighbours, especially for a sparse dataset. The kNN graph constructed with too large a k is therefore unstable for a sparse dataset, and the geodesic distances computed on it have large errors. A dense dataset is more tolerant: even with a large k, its neighbours are little affected by noise because more of them are stable. Paris6k contains more similar objects than Oxford5k, so the images in Oxford5k are sparser and those in Paris6k are denser. Therefore, Paris6k is almost unaffected by k, whereas Oxford5k degrades as k grows.
We evaluate the effect of various correction weights and report the results in Fig. 4(e,f), with separate correction strengths for the first and second iterations. The results show that the performance of our method does not rely heavily on the correction weights, and we keep them fixed in all the other experiments. The estimation error of the geodesic distances is larger in the first iteration, and the correction strengths of the two iterations differ in their impact. If we do not correct the geodesic distances with the Euclidean distances during the iterative embedding, the results are 89.9 and 95.7 on the Oxford5k and Paris6k datasets, respectively, as shown by the dotted lines in Fig. 4(e,f). These mAPs of the uncorrected geodesic distances are lower than those of the corrected geodesic distances, 92.0 and 96.6 on the Oxford5k and Paris6k datasets, respectively. The results demonstrate that the correction is important for geodesic distances computed from incomplete data.
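Table II: mAP and average additional query time (in seconds) of PCA, IME, the IME layer and other manifold learning methods.
Method | mAP Oxford5k | mAP Paris6k | mAP INSTRE | Time Oxford5k (s) | Time Paris6k (s) | Time INSTRE (s) |
PCA [51] | 82.6 | 91.5 | 62.2 | 0.002 | 0.002 | 0.002 |
IsoMap [32] | 77.9 | 91.8 | 68.6 | 0.378 | 0.403 | 3.483 |
LLE [33] | 51.7 | 40.5 | 42.7 | 0.054 | 0.066 | 0.249 |
IME | 83.5 | 93.4 | 75.9 | 0.907 | 0.937 | 7.659 |
IME layer | 92.0 | 96.6 | 82.4 | 0.002 | 0.002 | 0.002 |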
IME layer. The proposed IME layer is an integrated and simplified version of IME. Table II shows the accuracy and the average additional query time of PCA, IME, the IME layer and other manifold learning methods. The IME layer achieves the highest mAP, and its additional computational cost is roughly identical to that of PCA. On the Oxford5k and Paris6k datasets, our IME layer is at least twenty-seven times faster than IsoMap [32] and LLE [33], the manifold learning methods that can be applied to image retrieval, and significantly outperforms them in mAP. The computational cost of the IME layer is independent of the database scale, while the cost of IsoMap [32] and LLE [33] is proportional to the number of database images. On the large-scale INSTRE dataset, our method is more than 120 times faster than them. The IME layer also performs better than IME, because the integration reduces the calculation error of the geodesic distances between the query image and the database images as well as the loss of discriminative information in dimension reduction. The results demonstrate that our IME layer is effective and efficient for image retrieval.
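The speed-up factors quoted above follow directly from the query times in Table II, as the short calculation below shows. The timings are copied from the table; the variable names are hypothetical.

```python
# Average additional query times in seconds, taken from Table II.
ime_layer_time = 0.002
baseline_times = {
    "IsoMap": {"Oxford5k": 0.378, "Paris6k": 0.403, "INSTRE": 3.483},
    "LLE": {"Oxford5k": 0.054, "Paris6k": 0.066, "INSTRE": 0.249},
}

# Speed-up of the IME layer over each baseline on each dataset.
speedup = {
    method: {dataset: t / ime_layer_time for dataset, t in times.items()}
    for method, times in baseline_times.items()
}

for method, per_dataset in speedup.items():
    for dataset, factor in per_dataset.items():
        print(f"{method:7s} {dataset:9s} {factor:7.1f}x")
```

The smallest ratio on Oxford5k and Paris6k is that of LLE on Oxford5k, about 27, and the smallest ratio on INSTRE is that of LLE, about 124.5; the ratios for IsoMap are considerably larger.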
Various original representations. We conduct experiments on both SIFT descriptors and CNN features. Table III presents the accuracy of SIFT-based and CNN-based representations with and without the IME layer. The final IME representation is reduced to 2048 dimensions in this experiment. The results demonstrate that our IME layer is effective for various features. The IME layer can be directly connected to a CNN as a trained fully connected layer. For other features, our IME layer can work as a transform matrix that maps the aggregated representations into the embedding space.
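To make the role of the transform matrix concrete, the following is a minimal sketch of how a learned linear map could be applied at query time. It assumes, purely for illustration, that the matrix is fitted by ordinary least squares against embeddings computed offline; this fitting rule, the function names and the random stand-in data are assumptions of the sketch and not the training procedure of the paper.

```python
import numpy as np


def fit_linear_map(X, Y):
    """Least-squares W such that X @ W approximates Y (assumed fitting rule)."""
    W, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return W


def l2_normalize(A, eps=1e-12):
    """Scale each vector to unit Euclidean norm."""
    return A / np.maximum(np.linalg.norm(A, axis=-1, keepdims=True), eps)


def embed(features, W):
    """Apply the matrix as a bias-free fully connected layer, then normalize."""
    return l2_normalize(features @ W)


def rank(query_vec, db_vecs):
    """Database indices sorted by Euclidean distance to the query."""
    dists = np.linalg.norm(db_vecs - query_vec, axis=1)
    return np.argsort(dists)


rng = np.random.default_rng(0)
X = rng.normal(size=(200, 32))   # stand-in for original representations
Y = rng.normal(size=(200, 8))    # stand-in for offline manifold embeddings
W = fit_linear_map(X, Y)

db = embed(X, W)                                  # computed once, offline
q = embed(X[5] + 1e-6 * rng.normal(size=32), W)   # near-duplicate of item 5
order = rank(q, db)
```

The point of the sketch is the query-time cost: embedding a query is one matrix-vector product followed by a normalization, with no nearest-neighbour search over the database, which is consistent with the PCA-like timings in Table II.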
Feature  Oxford5k  Oxford105k 

SIFT [13]  52.7  27.6 
SIFT+IME layer  62.2  31.3 
CNN [29]  83.9  80.8 
CNN+IME layer  92.0  87.2 
Datasets  
Method  Dimensions  INSTRE  Oxford5k  Oxford105k  Paris6k  Paris106k 
Original representations  
CroW [24]  512  –  68.2  63.2  79.8  71.0 
RMAC [27]  512  47.7  77.7  70.1  84.1  76.8 
RMAC [29]  2048  62.6  83.9  80.8  93.8  89.9 
Dimension reduction and manifold learning  
PCA [51]  512  50.0  78.2  74.7  91.0  85.4 
ICA [52]  512  50.3  77.5  73.7  90.8  85.2 
IsoMap [32]  512  69.7  77.8  64.7  91.8  69.6 
LLE [33]  512  60.2  64.0  47.6  50.7  21.7 
IME layer  512  83.1  91.2  85.1  96.3  92.5 
PCA [51]  1024  58.7  80.8  78.0  91.7  86.8 
ICA [52]  1024  58.7  81.6  78.0  92.1  87.3 
IsoMap [32]  1024  69.1  78.1  65.4  92.1  72.1 
LLE [33]  1024  50.4  58.8  42.3  45.0  16.2 
IME layer  1024  82.8  91.8  86.2  96.5  92.9 
PCA [51]  2048  62.2  82.6  79.1  91.5  86.5 
ICA [52]  2048  62.2  82.7  79.1  91.5  86.5 
IsoMap [32]  2048  68.6  77.9  68.3  91.8  76.4 
LLE [33]  2048  42.7  51.7  34.9  40.5  14.7 
IME layer  2048  82.4  92.0  87.2  96.6  93.3 
Search reranking  
QE [41]  2048  70.5  89.6  88.3  95.3  92.7 
SCSM [53]  2048  71.4  89.1  87.3  95.4  92.5 
Diffusion [45]  2048  80.5  87.1  86.8  96.5  95.4 
IME layer  2048  82.4  92.0  87.2  96.6  93.3 
IV-D Comparison with the state-of-the-art
We compare our method with state-of-the-art approaches that use global representations. Table IV summarizes the results. Our IME layer significantly outperforms all the existing dimension reduction and manifold learning methods on all datasets. Without post-processing, our IME layer still outperforms the state-of-the-art image retrieval methods with reranking on most datasets.
The first part of the table shows the results of methods that employ global image representations and do not perform any form of spatial verification or query expansion at query time. The 2048-dimensional RMAC vectors [29] in the first part are employed as the original CNN-based representations in this paper.
We compare our IME layer with related dimension reduction and manifold learning methods in the second part of the table. Except for IsoMap [32] and LLE [33], nonlinear manifold learning methods cannot be directly applied to image retrieval. We consistently outperform these methods for various dimensions on all datasets. In one case (namely, on INSTRE), our method is more than 13 mAP points ahead of the best competitor [32]. In the online retrieval stage, our IME layer requires less computation than IsoMap [32] and LLE [33], and the same computation as PCA [51] and ICA [52].
The third part of the table shows the results of state-of-the-art methods that employ global representations and perform search reranking (e.g., spatial verification [53], query expansion (QE) [41] or diffusion [45]) at query time. Without reranking, our method still outperforms the state-of-the-art methods with post-processing on most datasets. These methods with post-processing [53, 41, 45] run the image scoring and ranking steps more than once. It is worth noting that our IME layer is much faster than these methods and has comparable performance.
For the Oxford105k and Paris106k datasets, we use the incomplete data (the images in Oxford5k and Paris6k respectively) and 5000 noisy images from the 100k Flickr distractor images [47] to learn the weights of the IME layer in the offline stage, which takes a few minutes. In comparison, diffusion [45] employs all 100 thousand images to construct the manifold and takes many hours. Our IME layer adds less than 2 milliseconds per query in the online retrieval stage, while diffusion [45] requires about 14 seconds. Although only sparse samples are used to learn its weights, our IME layer still achieves good performance on large-scale image retrieval. The results demonstrate that our IME layer is effective at reconstructing the image manifold from incomplete data.
In Fig. 5 we present some query examples using the original representation and the IME representation respectively. The images with gray borders are cropped query images. The images with green borders are positive retrieval results (ground truth) and the images with red borders are negative results. The IME layer significantly improves the retrieval results by mapping the original representations into the embedding space. Owing to the reconstructed continuous manifold of images, images that contain the same objects under different viewing angles and illumination are closer in the embedding space.
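All accuracy figures in this section are mean average precision (mAP) values. As a reminder of what the metric measures, the sketch below computes a simplified, non-interpolated average precision for each query and averages it over queries. It is an illustration under stated assumptions: the official evaluation scripts of the benchmarks may differ in details such as the handling of junk images and interpolation, and the query ids and rankings are toy data.

```python
def average_precision(ranked_ids, positives):
    """Non-interpolated AP: mean of precision at the rank of each positive."""
    positives = set(positives)
    if not positives:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, item in enumerate(ranked_ids, start=1):
        if item in positives:
            hits += 1
            precision_sum += hits / rank
    # Positives that never appear in the ranking contribute zero precision.
    return precision_sum / len(positives)


def mean_average_precision(rankings, ground_truth):
    """mAP in percent; both arguments are dicts keyed by query id."""
    aps = [average_precision(rankings[q], ground_truth[q]) for q in rankings]
    return 100.0 * sum(aps) / len(aps)


# Toy example: two queries with hand-written rankings and positives.
rankings = {"q1": [3, 1, 2, 4], "q2": [7, 8, 9]}
ground_truth = {"q1": {3, 2}, "q2": {8}}
score = mean_average_precision(rankings, ground_truth)
```

In the toy example, the first query has positives at ranks 1 and 3 and the second at rank 2, so the per-query values are 5/6 and 1/2.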
V Conclusion
In this paper we propose a manifold learning method called the iterative manifold embedding (IME) layer and demonstrate its efficiency and effectiveness for image retrieval. Through an unsupervised strategy, the weights of the IME layer are learned from incomplete data.
Our IME layer introduces manifold learning into image retrieval. We address the problem of sample holes in manifold learning by using second-order proximity information and by correcting geodesic distances with Euclidean distances. To reduce the additional computational cost and the estimation error of geodesic distances at query time, we integrate the manifold-based embedding into an approximate linear mapping.
Experiments on five standard retrieval datasets demonstrate that our IME layer significantly outperforms related dimension reduction and manifold learning methods with equivalent or lower computational complexity. Without search reranking, our method still outperforms the state-of-the-art methods with search reranking on most datasets.
VI Future Work
The main limitation of our IME layer is that the offline learning is time-consuming. Especially for large datasets, constructing the kNN graph and calculating the geodesic distances is expensive. In future work we will try to speed up the offline learning stage by parallel computing and other strategies. The IME layer is learned as a linear transform in this work. In future work we will also employ nonlinear transforms and try to learn more layers end-to-end.
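To illustrate where the offline cost comes from, here is a minimal sketch of the two steps named above: building a kNN graph from descriptors and computing all-pairs geodesic distances with the Floyd-Warshall shortest-path algorithm [46]. The dense-matrix layout, the function names and the toy data are assumptions made for the sketch; it is not the implementation used in the experiments.

```python
import numpy as np


def knn_graph(X, k):
    """Symmetric kNN graph as a dense matrix; missing edges are np.inf."""
    n = X.shape[0]
    # All pairwise Euclidean distances: O(n^2) time and memory.
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    graph = np.full((n, n), np.inf)
    np.fill_diagonal(graph, 0.0)
    # Column 0 of the sort is the point itself (assuming no duplicate points).
    nn = np.argsort(d, axis=1)[:, 1:k + 1]
    rows = np.repeat(np.arange(n), k)
    cols = nn.ravel()
    graph[rows, cols] = d[rows, cols]
    graph[cols, rows] = d[cols, rows]   # make the graph undirected
    return graph


def geodesic_distances(graph):
    """All-pairs shortest paths (Floyd-Warshall): O(n^3) time, O(n^2) memory."""
    g = graph.copy()
    for m in range(g.shape[0]):
        # Allow paths that pass through intermediate node m.
        g = np.minimum(g, g[:, m, None] + g[None, m, :])
    return g


# Toy manifold: 20 points on a half circle of radius 1.
theta = np.linspace(0.0, np.pi, 20)
X = np.stack([np.cos(theta), np.sin(theta)], axis=1)
geo = geodesic_distances(knn_graph(X, k=2))
euclid = np.linalg.norm(X[0] - X[-1])
```

On this toy arc the Euclidean distance between the two end points is 2, while the graph distance follows the curve and is close to the arc length. The cubic running time of the shortest-path step is what makes the offline stage grow quickly with the number of images.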
Acknowledgment
This work was supported by the National Natural Science Foundation of China under Grant 61531019, Grant 61601462, and Grant 71621002. The authors would like to thank the Associate Editor and the anonymous reviewers for their contributions to improve the quality of this paper.
References

 [1] J. Sivic and A. Zisserman, “Video google: A text retrieval approach to object matching in videos,” in IEEE International Conference on Computer Vision, 2003, p. 1470.
 [2] H. Jégou, M. Douze, and C. Schmid, “Improving bag-of-features for large scale image search,” International journal of computer vision, vol. 87, no. 3, pp. 316–336, 2010.
 [3] H. Jégou, C. Schmid, H. Harzallah, and J. Verbeek, “Accurate image search using the contextual dissimilarity measure,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 32, no. 1, pp. 2–11, 2010.
 [4] J. Philbin, O. Chum, M. Isard, J. Sivic, and A. Zisserman, “Lost in quantization: Improving particular object retrieval in large scale image databases,” in Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on, 2008, pp. 1–8.
 [5] J. C. van Gemert, C. J. Veenman, A. W. Smeulders, and J. M. Geusebroek, “Visual word ambiguity,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 32, no. 7, pp. 1271–1283, 2010.
 [6] Y.-H. Kuo, W.-H. Cheng, H.-T. Lin, and W. H. Hsu, “Unsupervised semantic feature discovery for image object retrieval and tag refinement,” IEEE Transactions on Multimedia, vol. 14, no. 4, pp. 1079–1090, 2012.
 [7] Y. Gao, M. Shi, D. Tao, and C. Xu, “Database saliency for fast image retrieval,” IEEE Transactions on Multimedia, vol. 17, no. 3, pp. 359–369, 2015.
 [8] J. Wang, J. Yang, K. Yu, F. Lv, T. Huang, and Y. Gong, “Locality-constrained linear coding for image classification,” in Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on. IEEE, 2010, pp. 3360–3367.
 [9] H. Jégou, F. Perronnin, M. Douze, J. Sánchez, P. Pérez, and C. Schmid, “Aggregating local image descriptors into compact codes,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 34, no. 9, pp. 1704–1716, 2012.
 [10] E. Spyromitros-Xioufis, S. Papadopoulos, I. Y. Kompatsiaris, G. Tsoumakas, and I. Vlahavas, “A comprehensive study over VLAD and product quantization in large-scale image retrieval,” IEEE Transactions on Multimedia, vol. 16, no. 6, pp. 1713–1728, 2014.
 [11] F. Perronnin and C. Dance, “Fisher kernels on visual vocabularies for image categorization,” in IEEE Conference on Computer Vision and Pattern Recognition, 2007, pp. 1–8.
 [12] F. Perronnin, J. Sánchez, and T. Mensink, “Improving the Fisher kernel for large-scale image classification,” in Computer Vision - ECCV 2010, European Conference on Computer Vision, Heraklion, Crete, Greece, September 5-11, 2010, Proceedings, 2010, pp. 143–156.
 [13] H. Jégou and A. Zisserman, “Triangulation embedding and democratic aggregation for image search,” in Computer Vision and Pattern Recognition, 2014, pp. 3310–3317.
 [14] T.-T. Do, Q. D. Tran, and N.-M. Cheung, “FAemb: a function approximation-based embedding method for image retrieval,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 3556–3564.
 [15] S. S. Husain and M. Bober, “Improving large-scale image retrieval through robust aggregation of local descriptors,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2016.
 [16] D. G. Lowe, “Distinctive image features from scale-invariant keypoints,” International Journal of Computer Vision, vol. 60, no. 2, pp. 91–110, 2004.
 [17] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel, “Backpropagation applied to handwritten zip code recognition,” Neural computation, vol. 1, no. 4, pp. 541–551, 1989.
 [18] A. Sharif Razavian, H. Azizpour, J. Sullivan, and S. Carlsson, “CNN features off-the-shelf: an astounding baseline for recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2014, pp. 806–813.
 [19] Y. Gong, L. Wang, R. Guo, and S. Lazebnik, “Multi-scale orderless pooling of deep convolutional activation features,” in European conference on computer vision. Springer, 2014, pp. 392–407.
 [20] A. Babenko, A. Slesarev, A. Chigorin, and V. Lempitsky, “Neural codes for image retrieval,” in European conference on computer vision. Springer, 2014, pp. 584–599.
 [21] A. S. Razavian, J. Sullivan, S. Carlsson, and A. Maki, “Visual instance retrieval with deep convolutional networks,” ITE Transactions on Media Technology and Applications, vol. 4, no. 3, pp. 251–258, 2016.
 [22] A. Babenko and V. Lempitsky, “Aggregating local deep features for image retrieval,” in Proceedings of the IEEE international conference on computer vision, 2015, pp. 1269–1277.
 [23] G. Tolias, R. Sicre, and H. Jégou, “Particular object retrieval with integral max-pooling of CNN activations,” ICLR, 2016.
 [24] Y. Kalantidis, C. Mellina, and S. Osindero, “Cross-dimensional weighting for aggregated deep convolutional features,” in European Conference on Computer Vision. Springer, 2016, pp. 685–701.
 [25] A. Chadha and Y. Andreopoulos, “Voronoi-based compact image descriptors: Efficient region-of-interest retrieval with VLAD and deep-learning-based descriptors,” IEEE Transactions on Multimedia, 2017.
 [26] R. Arandjelovic, P. Gronat, A. Torii, T. Pajdla, and J. Sivic, “NetVLAD: CNN architecture for weakly supervised place recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 5297–5307.
 [27] F. Radenovic, G. Tolias, and O. Chum, “CNN image retrieval learns from BoW: Unsupervised fine-tuning with hard examples,” in European Conference on Computer Vision. Springer, 2016, pp. 3–20.
 [28] A. Gordo, J. Almazan, J. Revaud, and D. Larlus, “Deep image retrieval: Learning global representations for image search,” in European Conference on Computer Vision. Springer, 2016, pp. 241–257.
 [29] A. Gordo, J. Almazan, and J. Revaud, “End-to-end learning of deep visual representations for image retrieval,” 2016.
 [30] X. Jian, S. Cunzhao, Q. Chengzuo, W. Chunheng, and X. Baihua, “Part-based weighting aggregation of deep convolutional features for image,” arXiv preprint arXiv:1705.01247, 2017.
 [31] H. S. Seung and D. D. Lee, “The manifold ways of perception,” Science, vol. 290, no. 5500, pp. 2268–2269, 2000.
 [32] J. B. Tenenbaum, V. De Silva, and J. C. Langford, “A global geometric framework for nonlinear dimensionality reduction,” Science, vol. 290, no. 5500, pp. 2319–2323, 2000.
 [33] S. T. Roweis and L. K. Saul, “Nonlinear dimensionality reduction by locally linear embedding,” Science, vol. 290, no. 5500, pp. 2323–2326, 2000.
 [34] M. Belkin and P. Niyogi, “Laplacian eigenmaps and spectral techniques for embedding and clustering,” in NIPS, vol. 14, no. 14, 2001, pp. 585–591.
 [35] G. Hinton and S. Roweis, “Stochastic neighbor embedding,” in NIPS, vol. 15, 2002, pp. 833–840.
 [36] L. v. d. Maaten and G. Hinton, “Visualizing data using t-SNE,” Journal of Machine Learning Research, vol. 9, no. Nov, pp. 2579–2605, 2008.
 [37] J. Tang, M. Qu, M. Wang, M. Zhang, J. Yan, and Q. Mei, “LINE: Large-scale information network embedding,” in Proceedings of the 24th International Conference on World Wide Web. ACM, 2015, pp. 1067–1077.
 [38] J. Tang, J. Liu, M. Zhang, and Q. Mei, “Visualization large-scale and high-dimensional data,” arXiv preprint arXiv:1602.00370, 2016.
 [39] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” ICLR 2015, 2015.
 [40] J. B. Kruskal and M. Wish, Multidimensional Scaling. BOOK ON DEMAND POD, 1978.
 [41] O. Chum, J. Philbin, J. Sivic, M. Isard, and A. Zisserman, “Total recall: Automatic query expansion with a generative feature model for object retrieval,” in Computer Vision, 2007. ICCV 2007. IEEE 11th International Conference on. IEEE, 2007, pp. 1–8.
 [42] H. Xie, Y. Zhang, J. Tan, L. Guo, and J. Li, “Contextual query expansion for image retrieval,” IEEE Transactions on Multimedia, vol. 16, no. 4, pp. 1104–1114, 2014.
 [43] M. Donoser and H. Bischof, “Diffusion processes for retrieval revisited,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2013, pp. 1320–1327.
 [44] Z. Gao, J. Xue, W. Zhou, S. Pang, and Q. Tian, “Democratic diffusion aggregation for image retrieval,” IEEE Transactions on Multimedia, vol. 18, no. 8, pp. 1661–1674, 2016.
 [45] A. Iscen, G. Tolias, Y. Avrithis, T. Furon, and O. Chum, “Efficient diffusion on region manifolds: Recovering small objects with compact cnn representations,” in Computer Vision and Pattern Recognition (CVPR), 2017 IEEE Conference on, 2017.
 [46] R. W. Floyd, “Algorithm 97: shortest path,” Communications of the ACM, vol. 5, no. 6, p. 345, 1962.
 [47] J. Philbin, O. Chum, M. Isard, J. Sivic, and A. Zisserman, “Object retrieval with large vocabularies and fast spatial matching,” in Computer Vision and Pattern Recognition, 2007. CVPR’07. IEEE Conference on. IEEE, 2007, pp. 1–8.
 [48] J. Philbin, O. Chum, M. Isard, and J. Sivic, “Lost in quantization: Improving particular object retrieval in large scale image databases,” in Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on. IEEE, 2008, pp. 1–8.
 [49] S. Wang and S. Jiang, “INSTRE: a new benchmark for instance-level object retrieval and recognition,” ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM), vol. 11, no. 3, p. 37, 2015.
 [50] R. Arandjelovic and A. Zisserman, “Three things everyone should know to improve object retrieval,” in Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on. IEEE, 2012, pp. 2911–2918.
 [51] H. Abdi and L. J. Williams, “Principal component analysis,” Wiley interdisciplinary reviews: computational statistics, vol. 2, no. 4, pp. 433–459, 2010.
 [52] A. Hyvärinen, J. Karhunen, and E. Oja, Independent component analysis. John Wiley & Sons, 2004, vol. 46.
 [53] X. Shen, Z. Lin, J. Brandt, and Y. Wu, “Spatially-constrained similarity measure for large-scale object retrieval,” IEEE transactions on pattern analysis and machine intelligence, vol. 36, no. 6, pp. 1229–1241, 2014.