Iterative Manifold Embedding Layer Learned by Incomplete Data for Large-scale Image Retrieval

07/14/2017 ∙ by Jian Xu, et al. ∙ 0

Existing manifold learning methods are not appropriate for the image retrieval task, because most of them cannot process a query image and they incur considerable additional computational cost, especially for large-scale databases. Therefore, we propose the iterative manifold embedding (IME) layer, whose weights are learned off-line by an unsupervised strategy, to explore the intrinsic manifolds from incomplete data. On a large-scale database that contains 27000 images, the IME layer is more than 120 times faster than other manifold learning methods at embedding the original representations at query time. In the off-line learning stage, we iteratively embed the original descriptors of database images, which lie on a manifold in a high-dimensional space, into manifold-based representations to generate the IME representations. From the original descriptors and the IME representations of database images, we estimate the weights of the IME layer by ridge regression. In the on-line retrieval stage, we employ the IME layer to map the original representation of the query image with negligible time cost (2 milliseconds). We experiment on five public standard datasets for image retrieval. The proposed IME layer significantly outperforms related dimension reduction methods and manifold learning methods. Without post-processing, our IME layer outperforms state-of-the-art image retrieval methods that use post-processing on most datasets, and needs less computational cost.


I Introduction

Over the past decades, image retrieval has received widespread attention. Representations derived by aggregating Scale-Invariant Feature Transform (SIFT) [16] features have been shown to be effective for image retrieval [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]. After that, image retrieval methods based on Convolutional Neural Networks (CNNs) [17] achieved excellent performance [18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30]. These methods represent an image as a description vector and rank the database images by the Euclidean distances between their feature vectors and that of the query.
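The ranking step described above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function name `rank_by_euclidean` and the toy descriptors are assumptions made for the example.

```python
import numpy as np

def rank_by_euclidean(query, database):
    """Return database indices sorted from nearest to farthest, plus the distances."""
    # database: (N, dim) array with one descriptor per row; query: (dim,) array.
    dists = np.linalg.norm(database - query, axis=1)
    return np.argsort(dists, kind="stable"), dists

# Toy descriptors (made-up numbers, for illustration only).
database = np.array([[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]])
query = np.array([1.0, 0.2])
order, dists = rank_by_euclidean(query, database)
```

Here `order` lists the database images from most to least similar under the Euclidean metric, which is the baseline that manifold-based methods try to improve.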

Manifolds are fundamental to perception [31]. For example, to recognize faces, the brain equates all images from the same manifold but distinguishes between images from different manifolds. In an image collection, objects and landmarks are depicted in various conditions, such as different viewing angles and illumination. As a consequence, query and relevant images are often connected by a sequence of images, where consecutive images are similar. The descriptors of these images form a manifold in the descriptor space. As shown in Fig. 1, some negative samples are far from the query image along paths in the k-NN graph (geodesic distances) but are closer to the query than some positive samples in terms of Euclidean distance.


Fig. 1: Image manifold. The image with the gray border is the query. The Euclidean distance to the query image is expressed by the distance between the positions of the query and the reference images in the figure. Contour lines correspond to the geodesic distance. The images with green and red borders are positive and negative samples respectively. The wires show the neighbourhood relationships of images. As shown above, a negative sample is closer to the query than some positive samples in Euclidean space. However, it is far from the query image along the wires on the manifold. The shortest path between the negative sample and the query is shown by the images with blue borders.

However, the data in an image collection is incomplete in most situations. The images do not vary smoothly, and the data sampled from the intrinsic manifolds are too sparse. As a result, the manifolds reconstructed from these sparse data have holes and are neither continuous nor smooth. The holes lead to large errors in the computed geodesic distances. Moreover, existing manifold learning methods [32, 33, 34, 35, 36, 37, 38] are not appropriate for the image retrieval task, because most of them cannot process a query image and they incur considerable additional computational cost, especially for large-scale databases. Except for IsoMap [32] and LLE [33], these manifold learning methods cannot handle a new image at query time. For IsoMap and LLE, the computational cost for a query image is high because the k-NN of the query image must be computed.

To address the above problems, we propose the iterative manifold embedding (IME) layer to explore the intrinsic manifolds from incomplete data. The weights of the IME layer are learned off-line by an unsupervised strategy. In the on-line query stage, our IME layer maps the features of query images into the embedding space with very little, even negligible, additional computational cost.

The IME layer addresses the problem of sample holes in two ways. (1) Points that share similar neighbours tend to be similar to each other [37]. We employ this natural intuition to reduce the topological instability of the k-NN graph. The information of second-order proximity suppresses the interference of the sample holes. (2) We utilize the Euclidean distances to correct the calculation error of the geodesic distances. Based on the corrected geodesic distances, we embed the data into a low-dimensional space and preserve the intrinsic geometry of the data. These steps are repeated several times to construct a stable and robust k-NN graph from incomplete data. To adapt our algorithm to the image retrieval task, we simplify and approximate IME by a linear mapping, called the IME layer in this paper. The IME layer is the integrated and simplified version of IME, which reduces the computational cost and the estimation error of geodesic distances for query images. The query image is embedded by the IME layer with very little, even negligible, additional computational cost in the on-line retrieval stage. Working as an additional fully connected layer, the proposed IME layer can be directly connected to CNNs [39, 26, 27, 28, 29]. For SIFT-based representations [9, 11, 12, 13, 14, 15], the IME layer can work as a transform matrix that maps the vector representation into a low-dimensional space and preserves the original neighbourhood relationships.

We conduct extensive experiments on five public standard image retrieval datasets, including landmarks and logos. Experimental results show that our proposed algorithm for manifold-based embedding significantly improves the performance of global representation vectors. The proposed IME layer significantly outperforms the related dimension reduction methods and manifold learning methods. Without reranking, our IME layer still outperforms the state-of-the-art methods based on search reranking in a post-processing step on most datasets. On a set of five thousand images with 2048 dimensions, the IME layer is up to twenty-seven times faster than other manifold learning methods [32, 33] at query time. The computational time of IsoMap [32] and LLE [33] increases as the scale of the database grows, while the cost of our IME layer does not change. On the large-scale dataset that contains 27000 images, our IME layer is more than 120 times faster than other manifold learning methods. Therefore, our IME layer is efficient and effective for large-scale image retrieval.

The main contributions of this paper are summarized as follows:

  • We propose an iterative manifold embedding (IME) approach, which explores the intrinsic manifold from incomplete data. To suppress the interference of the sample holes, we employ the second-order proximity and the original Euclidean distances to correct the geodesic distances during the iteration process. By reconstructing the manifold of database images, our IME method reduces the dimensions of the original representation vectors and enhances the discrimination of the embedded representations.

  • We propose the iterative manifold embedding (IME) layer, the integrated and simplified version of IME, to simplify and accelerate the calculation of IME. The weights of the IME layer are learned off-line from the original representations and the embedded representations by ridge regression. With an embedding time below 2 milliseconds, the trained IME layer can be directly connected to CNNs [39, 26, 27, 28, 29] or work independently as a transform matrix to map the SIFT-based representations [9, 11, 12, 13, 14, 15].

The paper is organized as follows. In Section II we discuss the previous work related to manifold learning and manifold-based methods for image retrieval. Then, we illustrate the formulation of the proposed algorithm and derive the solution in detail in Section III. The experimental results are described in Section IV. Finally, Section V concludes the paper and Section VI introduces our future work.

II Related work

In this section, we review several previous related works from two aspects: manifold learning methods and manifold-based image retrieval methods.

To the best of our knowledge, very few manifold learning methods can be directly applied to image retrieval. Instead, most manifold learning methods pay attention to dimension reduction and data visualizations. Some manifold-based methods are applied to image retrieval in the search reranking process. Our IME layer embeds the original representations in image representation process, based on the image manifold reconstructed by incomplete data.

II-A Manifold learning

Our work is related to the manifold learning and dimension reduction methods, such as IsoMap [32], LLE [33], Laplacian Eigenmap [34], SNE [35], t-SNE [36], LINE [37] and LargeVis [38].

The most popular method for manifold learning may be IsoMap [32], which preserves the shortest graph path distances by the MDS [40] method. IsoMap first constructs the k-NN graph of the data, and then computes the shortest path distances between all pairs of points according to the k-NN graph. Finally, the distance vectors are embedded into a low-dimensional space. By exploiting the local symmetries of linear reconstructions, LLE [33] is able to learn the global structure of nonlinear manifolds. Laplacian Eigenmap [34] constructs a representation for data sampled from a low-dimensional manifold embedded in a higher-dimensional space by a geometrically motivated algorithm. SNE [35] minimizes the Kullback-Leibler divergences between the original and induced distributions to preserve neighbour identities as well as possible. After that, t-SNE [36] uses the Student t-distribution instead of the Gaussian distribution of SNE [35] to solve the crowding problem. By exploiting the first-order proximity and the second-order proximity between the vertices, LINE [37] designs an objective function that preserves both the local and global network structures. Instead of building a large number of trees to obtain a highly accurate k-NN graph, LargeVis [38] uses neighbour exploring techniques to improve the accuracy of the graph.

Except for IsoMap [32] and LLE [33], these manifold learning methods cannot be used directly for image retrieval, because they provide no embedded representation for a query image. To map a new query image, IsoMap [32] estimates the geodesic distances between the query image and the database images through the constructed k-NN graph, and then reduces the dimensions of the geodesic distance vector. LLE [33] computes the k-NN of a query image and represents the query image by their weighted sum. For both IsoMap [32] and LLE [33], the computational cost of the embedded representation of a query image is high because the k-NN of the query image must be computed. The sample holes seriously interfere with the stability of the k-NN graph. Therefore, IsoMap [32] and LLE [33] are not robust, especially for incomplete data. Our IME layer embeds the original representation of a query image quickly, and is not sensitive to the interference of sample holes.

II-B Manifold-based image retrieval

Some manifold-based methods are applied to image retrieval in the context of convolutional features as the post-processing, e.g., query expansion [41, 42] and diffusion on region manifold [43, 44, 45]. Manifold-based methods that leverage the information of image manifold in the search reranking process are introduced into image retrieval and achieve outstanding performance.

In query expansion [41], a number of the highly ranked results that satisfy strong spatial constraints with respect to the original query are reissued as a new query in the search reranking process. Average query expansion (AQE) [41] is now used as a standard post-processing step of image retrieval methods, due to its efficiency and significant performance boost. However, AQE only explores the first-order neighbourhood of query images. Recursive average query expansion [41] further improves the results by explicitly crawling the image manifold, but it considerably increases the query time. Unlike query expansion, which exploits the manifold of images at query time, diffusion [43, 45] constructs the neighbourhood graph of the dataset off-line and uses this information at query time to search on the manifold. In the recent work [45], diffusion on the image manifold is used to compute the reranking scores in the search reranking process.
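As a rough illustration of the idea behind average query expansion, the sketch below averages the query with its top-ranked results and searches again. It is a simplified variant on L2-normalised global descriptors: the spatial verification used in [41] is omitted, and the function name and the parameter `top_k` are assumptions made for this example.

```python
import numpy as np

def average_query_expansion(query, database, top_k=2):
    """One round of average query expansion on unit-norm global descriptors."""
    sims = database @ query                           # cosine similarity for unit vectors
    top = np.argsort(-sims, kind="stable")[:top_k]    # first-order neighbours of the query
    expanded = query + database[top].sum(axis=0)      # mean of query and top results, up to scale
    expanded /= np.linalg.norm(expanded)              # back to unit norm
    return np.argsort(-(database @ expanded), kind="stable"), expanded

rng = np.random.default_rng(0)
db = rng.normal(size=(6, 4))
db /= np.linalg.norm(db, axis=1, keepdims=True)       # toy unit-norm descriptors
order, expanded = average_query_expansion(db[0].copy(), db, top_k=2)
```

Because only the top-ranked results enter the average, a single round reaches just the first-order neighbourhood of the query, which is the limitation noted above.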

Working as the post-processing, these manifold-based image retrieval methods need much additional computational cost of reranking at query time. Without post-processing, our IME layer still achieves better performance than these state-of-the-art image retrieval methods on most datasets. And the additional computational cost of our IME layer is ignorable (less than 2 milliseconds) in on-line retrieval stage.


Fig. 2: Flow chart of the image retrieval framework which is based on proposed IME layer. We learn the weights of iterative manifold embedding (IME) layer in the first off-line stage. Then, we use the IME layer to embed the original representations according to the intrinsic manifold rapidly in the on-line image retrieval stage. While constructing the k-NN graph, we exploit the information of second-order proximity to suppress the interference of sample holes. The Euclidean distances of original representations are utilized to correct the calculation error of geodesic distances in the manifold embedding step. In order to better reconstruct the intrinsic manifold by incomplete sampled data, we repeat the k-NN graph construction and manifold embedding steps many times. By adequate approximation and simplification, IME layer embeds the original representations into IME representations and preserves the intrinsic manifold of database images.

III The proposed approach

The diagram of the proposed method is shown in Fig. 2. We learn the weights of iterative manifold embedding (IME) layer in the first off-line stage. Then, we embed the original representations into IME representations rapidly by the proposed IME layer in the on-line image retrieval stage. While constructing the k-NN graph, we exploit the information of second-order proximity to suppress the interference of sample holes. The Euclidean distances of original representations are utilized to correct the calculation error of geodesic distances in the manifold embedding step. In order to better reconstruct the intrinsic manifold by incomplete sampled data, we repeat the k-NN graph construction and manifold embedding steps many times. By adequate approximation and simplification, IME layer embeds the original representations into IME representations which preserve the intrinsic manifold of database images with ignorable additional time. IME layer is the integration version of IME, which reduces the estimation error of geodesic distances and the loss of discriminative information in dimension reduction by ridge regression.

In this section, we describe our novel IME layer for image retrieval in detail. Firstly, we show the formulation of proposed iterative embedding method which embeds the original representation according to the intrinsic manifold of images in Section III-A. Then in Section III-B the proposed embedding is equivalently implemented by the fully connected layer, which is called IME layer.

III-A Iterative manifold embedding

The iterative manifold embedding (IME) algorithm has two cyclic steps, which are detailed in Algorithm 1. In the first step, we construct the k-NN graph and calculate the geodesic distances, taking the information of second-order proximity into account. In the second step, the original representations are embedded into manifold-based representations that preserve the corrected geodesic distances.

The first step constructs the k-NN graph and calculates the geodesic distances of the original representations $X = [x_1, x_2, \dots, x_d]$ (where $x_i$ is the n-dimensional original representation vector of image $i$ and $d$ is the database scale).

To construct the first-order k-NN graph $G_1$, each point is connected only to its k nearest neighbours based on the Euclidean distances $d_E(i, j)$ between pairs of images $i, j$ in the input space. The neighbourhood relations are coarsely represented as a weighted first-order graph $G_1$ over the data, with edges of weight $d_E(i, j)$ between neighbouring images. The weights of the edges in the first-order graph $G_1$ are defined as the following matrix:

$$W_1(i, j) = \begin{cases} d_E(i, j), & x_j \in \mathcal{N}_k(x_i) \\ \infty, & \text{otherwise} \end{cases} \qquad (1)$$

where $\mathcal{N}_k(x_i)$ is the set of the k nearest neighbours of $x_i$. $W_1(i, j) = \infty$ indicates that the edge between images $i$ and $j$ is cut in the first-order graph $G_1$. $W_1(i, j) = d_E(i, j)$ indicates that image $i$ is connected to $j$ by an edge with path distance $d_E(i, j)$.

To suppress the interference of the sample holes, we exploit the second-order proximity in the constructed graph $G$. Following the idea in LINE [37], we assume that vertices sharing many connections to other vertices are similar to each other. Unlike LINE [37], we use this natural intuition to construct a stable graph in a simple but effective way. The weights $W$ of the edges in graph $G$ are calculated from the weights $W_1$ of the first-order graph $G_1$:

(2)

Similar to the first-order weights $W_1$, $W(i, j) = \infty$ indicates that image $i$ is not directly connected to $j$ in the second-order graph $G$. The weight $W(i, j)$ indicates the distance between directly connected images $i$ and $j$ in the second-order graph, which is more robust for sparse samples. When the samples are sparse, the k nearest neighbours based on the Euclidean distances are sometimes far-fetched and unstable. The second-order graph uses the natural intuition that vertices sharing connections to other vertices are similar to each other to ensure the stability of the connections.
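The intuition that vertices sharing neighbours are similar can be illustrated with a simple overlap measure. The sketch below uses the Jaccard overlap of neighbour sets purely as an illustration; it is an assumption chosen for the example and is not the weighting of Eq. 2. The function name and the toy adjacency lists are hypothetical.

```python
def shared_neighbour_overlap(neighbours, i, j):
    """Jaccard overlap of the neighbour sets of vertices i and j."""
    a, b = set(neighbours[i]), set(neighbours[j])
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Toy adjacency lists (hypothetical): vertices 0 and 1 share two neighbours,
# vertices 0 and 4 share none.
neighbours = {0: [1, 2, 3], 1: [0, 2, 3], 4: [5, 6, 7]}
close = shared_neighbour_overlap(neighbours, 0, 1)
far = shared_neighbour_overlap(neighbours, 0, 4)
```

A pair with a high overlap keeps strong evidence of proximity even when a single first-order edge is unreliable, which is why second-order information helps with sparse samples.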

We estimate the geodesic distances between all pairs of images on the manifold by computing their shortest path distances in the graph $G$. The shortest paths are computed by the Floyd-Warshall algorithm [46].
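The geodesic estimation can be sketched as follows. For simplicity the example builds a plain first-order k-NN graph (missing edges marked with infinity, an assumption of this sketch) and does not apply the second-order weighting of Eq. 2; the function names are chosen for the illustration, and distinct points are assumed so that each point is its own nearest neighbour.

```python
import numpy as np

def knn_graph(X, k):
    """Symmetric k-NN graph as a dense weight matrix; np.inf marks a missing edge."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)   # pairwise Euclidean distances
    W = np.full_like(D, np.inf)
    nearest = np.argsort(D, axis=1, kind="stable")[:, 1:k + 1]  # skip column 0 (the point itself)
    rows = np.repeat(np.arange(len(X)), k)
    W[rows, nearest.ravel()] = D[rows, nearest.ravel()]
    W = np.minimum(W, W.T)                                      # keep an edge if either end selects it
    np.fill_diagonal(W, 0.0)
    return W

def floyd_warshall(W):
    """All-pairs shortest path distances on a dense weight matrix."""
    G = W.copy()
    for mid in range(len(G)):
        # Relax every pair (i, j) through the intermediate vertex `mid`.
        G = np.minimum(G, G[:, mid, None] + G[None, mid, :])
    return G

# Four points on a line: with k=1 the end points are linked only through the middle points.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
W = knn_graph(X, k=1)
geodesic = floyd_warshall(W)
```

The cubic cost of Floyd-Warshall in the number of images is paid off-line only, which matches the two-stage design of the method.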

Although the information of second-order proximity is utilized to suppress the interference of sample holes, the geodesic distances still have estimation error. To further reduce the reconstruction error of the image manifold, we correct the matrix of geodesic distances by Euclidean distances in the second step. We embed the representations into a low-dimensional space that best preserves the manifold's estimated intrinsic geometry. The manifold-based representations $Y$ (where $y_i$ is the manifold-based representation vector of image $i$) are computed by minimizing the cost function

(3)

where $\|\cdot\|_F$ denotes the Frobenius norm of a matrix and the parameter $\alpha$ denotes the strength of the correction. $\tau(\cdot)$ is a conversion function that converts distances to similarity. Given a distance matrix $D$, many conversion functions can be used in practice, such as the quadratic function and the t-distribution. The solution of Eq. 3 using the quadratic function as the conversion function is equivalent to the global optimal solution of the cost function in IsoMap [32]. We compare different conversion functions in Section IV. In order to retain more discriminative information, we constrain the embedded representations to be orthonormal.

Input: original representations $X$, number of iterations $T$.
Output: iterative manifold embedding (IME) representations $Y$.
  Initialize the current representations with $X$ and the iteration counter with 0;
  while the iteration counter is smaller than $T$ do
     Step 1. Construct the graph and calculate the geodesic distances:
       Compute the Euclidean distances of the updated representations;
       Construct the first-order neighbour graph based on the Euclidean distances by Eq. 1;
       Calculate the weights of the edges in graph $G$ by Eq. 2;
       Compute the geodesic distances by the Floyd-Warshall algorithm [46];
     Step 2. Map the representations into the embedding space that best preserves the manifold's estimated intrinsic geometry:
       Correct the geodesic distances by the Euclidean distances;
       Compute the d-dimensional representation by solving Eq. 3;
     Update the current representations with the embedded representations;
     Increase the iteration counter by 1;
  end while
  Return the current representations as $Y$;
Algorithm 1 Iterative Manifold Embedding

The global minimum of Eq. 3 is achieved by setting the representations to the top eigenvectors of the similarity matrix

(4)

Let $\lambda_p$ be the p-th largest eigenvalue of this similarity matrix, and $v_p^i$ be the i-th component of the p-th eigenvector. Then, the p-th component of the m-dimensional manifold-based representation $y_i$ is obtained from $v_p^i$. The dimension $m$ impacts the discrimination of the representation $y_i$. The selection of $m$ is a compromise between computational cost and accuracy.
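For the quadratic conversion function, which the text relates to IsoMap [32], the eigenvector step corresponds to classical MDS. The sketch below shows that special case only: the double centering and the scaling of each eigenvector by the square root of its eigenvalue follow the IsoMap convention and are assumptions here, and the Euclidean correction and the t-distribution conversion of IME are not reproduced.

```python
import numpy as np

def classical_mds(D, m):
    """Embed points in m dimensions from a distance matrix D (IsoMap's last step)."""
    N = len(D)
    H = np.eye(N) - np.ones((N, N)) / N         # centering matrix
    S = -0.5 * H @ (D ** 2) @ H                 # quadratic conversion: distances to similarities
    eigvals, eigvecs = np.linalg.eigh(S)        # eigenvalues in ascending order
    top = np.argsort(eigvals)[::-1][:m]         # indices of the m largest eigenvalues
    scale = np.sqrt(np.clip(eigvals[top], 0.0, None))
    return eigvecs[:, top] * scale              # row i is the embedding of point i

# Distances between collinear points are reproduced by a one-dimensional embedding.
pts = np.array([0.0, 1.0, 3.0])
D = np.abs(pts[:, None] - pts[None, :])
Y = classical_mds(D, m=1)
recovered = np.abs(Y[:, 0][:, None] - Y[:, 0][None, :])
```

Keeping fewer eigenvectors lowers the cost of storing and comparing representations, at the price of discarding part of the geometry, which is the trade-off governed by the embedding dimension.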

The manifold-based representation exploits the neighbourhood relationships to represent the feature of an image. In this paper, we use only a small number of highly credible neighbours to construct the k-NN graph. The k-NN graph constructed in the first step is sparse and has few negative neighbour pairs that represent different objects. However, some important connection paths are cut off, and the sample holes harm the reliability of the graph. The correction operations in Eq. 2 and Eq. 4 solve this problem by exploiting the information of second-order proximity and Euclidean distances. The performance of the embedded representations is improved significantly in this way.

Although these strategies are straightforward, remarkable performance is achieved within a few iterations. We repeat the above two steps to improve the stability of the graph $G$. In each iteration, the Euclidean distances are updated based on the manifold-based representations. The iteration process is stated in detail in Algorithm 1. After a few iterations, we obtain the final m-dimensional iterative manifold embedding representations $Y$ (where $y_i$ is the IME representation vector of image $i$). In this way, we map the original representations into a low-dimensional space that preserves the geometry of the image manifold.

III-B Integration as IME layer

The iterative manifold embedding (IME) method proposed above embeds the original representations into manifold-based representations. However, if we directly apply IME to a query image, the embedding increases the feature extraction time of the query and the estimation error of the geodesic distances. Similar to IsoMap [32], the IME proposed in Section III-A needs to compute the shortest paths from a query image to all database images and estimate the geodesic distances at query time. This is implemented in image retrieval by linking the query into the graph of geodesic distances of the training data. First, the k nearest neighbours of the query are found in the training data. Then, the shortest geodesic distances from the query to each point in the training data are computed and transformed into a similarity vector by the conversion function. Finally, the similarity vector corrected by the Euclidean distance is projected into the IME representation by the eigenvector matrix of the training data. The additional computational cost is proportional to the database scale. In order to reduce the estimation error and the computational cost, the IME method is equivalently implemented and simplified by a fully connected layer in this section, which is called the IME layer in this paper.

The IME layer can be regarded as a transform matrix that integrates the estimation of geodesic distances with dimension reduction. We learn the transform matrix from the original representations and the IME representations of database images. We minimize the following objective function to calculate the weights of the IME layer as the transform matrix $W$:

$$\min_{W} \; J(W) = \|WX - Y\|_F^2 + \mu \|W\|_F^2 \qquad (5)$$

where $X \in \mathbb{R}^{n \times d}$ and $Y \in \mathbb{R}^{m \times d}$ are the original representations and the IME representations of the $d$ images in the retrieval database respectively, $n$ and $m$ are the dimensions of the original representation and the IME representation respectively, and $\|\cdot\|_F$ denotes the Frobenius norm of a matrix. The first term in the objective function is the transformation cost term, which minimizes the difference between the representations computed by the IME algorithm and the representations mapped by the IME layer. $\|W\|_F^2$ is a regularization term, and $\mu$ controls the weight of the regularization.

This is a ridge regression problem and has a closed-form solution. Setting the gradient to zero to minimize the objective function $J(W)$ gives

$$\frac{\partial J}{\partial W} = 2(WX - Y)X^\top + 2\mu W = 0 \qquad (6)$$

Then, this reduces to

$$W = YX^\top \left(XX^\top + \mu I\right)^{-1} \qquad (7)$$

where $I$ is the $n \times n$ identity matrix. Since $n$ is the dimension of the original representation, which is low, solving this problem (which needs to be solved only once, at learning time) is extremely fast.
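The closed-form ridge regression step can be sketched as below. The sketch assumes that the columns of `X` and `Y` hold the original and the embedded representations of the database images; the function name, the value of `mu`, and the random toy data are assumptions made for illustration.

```python
import numpy as np

def fit_linear_layer(X, Y, mu=1.0):
    """Ridge regression in closed form: W minimising ||W X - Y||_F^2 + mu ||W||_F^2."""
    n = X.shape[0]                              # X: (n, d) originals, Y: (m, d) targets
    A = X @ X.T + mu * np.eye(n)                # symmetric positive definite for mu > 0
    # W A = Y X^T; since A is symmetric, solve A W^T = X Y^T instead of forming an inverse.
    return np.linalg.solve(A, X @ Y.T).T

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 50))                    # 50 toy descriptors of dimension 8
W_true = rng.normal(size=(3, 8))
Y = W_true @ X                                  # toy targets of dimension 3
W = fit_linear_layer(X, Y, mu=1e-8)
query_embedding = W @ X[:, 0]                   # embedding a query is one matrix-vector product
```

The learning cost depends on an n-by-n linear system, and embedding a query costs a single matrix-vector product that is independent of the database size.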

The IME layer is the integrated and simplified version of IME. Its computational complexity is low at both the learning and the retrieval steps. The IME layer simplifies the calculation and integrates the estimation of geodesic distances with dimension reduction by ridge regression. Through this integration, we reduce the estimation error and the loss of discriminative information. In IME, the representation of a query image is computed directly from the corrected geodesic distances projected by the eigenvector matrix of the training data. Therefore, the estimation error of the geodesic distances affects the result significantly. The performance of IME also depends heavily on the parameters of the computing process. The IME layer reduces the number of parameters and omits the computation of the shortest geodesic distances from the query to each point in the training data. The cumulative computation error of the intermediate steps is thereby avoided. In practice, our integration strategy can also be applied to other manifold learning methods [32, 33, 34, 35, 36, 37, 38] for the image retrieval task. For SIFT-based representations [9, 11, 12, 13, 14, 15], the IME layer can work as the transform matrix to map the vector representations into the embedding space. It can also be directly connected to CNNs [39, 26, 27, 28, 29] as an additional fully connected layer for CNN-based representations.

IV Experiment

This section presents the experimental setup and investigates the accuracy of our approaches for image retrieval on five public datasets. To evaluate the efficiency and effectiveness of our IME layer, we compare the IME layer with the related manifold learning methods and the state-of-the-art image retrieval methods.

IV-A Datasets

We evaluate the performance of our IME layer on five standard datasets for image retrieval. Mean average precision (mAP) is used as the performance measure on all datasets.

Two are well-known image retrieval benchmarks: Oxford5k [47] and Paris6k [48]. Oxford5k contains 5062 images collected from Flickr by searching for particular Oxford landmarks. Paris6k dataset contains 6412 photographs from Flickr associated with Paris landmarks. 55 queries corresponding to 11 buildings are manually annotated. The performance is measured using mean average precision (mAP) over the 55 queries.
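For reference, mean average precision can be computed as in the sketch below. It follows the plain textbook definition; the official evaluation protocols of the benchmarks may differ in details such as images that are ignored during scoring, so the function names and the toy ranking are illustrative assumptions only.

```python
def average_precision(ranked_ids, relevant_ids):
    """Average precision of one ranked list given the relevant ids."""
    relevant = set(relevant_ids)
    hits, precision_sum = 0, 0.0
    for rank, item in enumerate(ranked_ids, start=1):
        if item in relevant:
            hits += 1
            precision_sum += hits / rank        # precision at each relevant hit
    return precision_sum / len(relevant) if relevant else 0.0

def mean_average_precision(rankings, ground_truth):
    """Mean of the per-query average precisions."""
    aps = [average_precision(r, g) for r, g in zip(rankings, ground_truth)]
    return sum(aps) / len(aps)

# Toy query: relevant images 3 and 2 are returned at ranks 1 and 3.
ap = average_precision([3, 1, 2], [3, 2])
```

In this toy case the precision is 1 at the first hit and 2/3 at the second, so the average precision is 5/6.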

For large-scale image retrieval, we experiment on the Oxford105k and Paris106k datasets, which add 100k distractor images from Flickr [47].

The fifth dataset is the recently introduced instance search dataset called INSTRE [49]. It contains various everyday 3D or planar objects, from buildings to logos, with many variations such as different scales, rotations and occlusions. Some objects cover only a small part of the image, making it a challenging dataset. It consists of 28543 images from 250 different object classes: 100 classes with images retrieved from on-line sources, 100 classes with images taken by the dataset creators, and 50 classes consisting of pairs from the second category. Different from the original protocol [49], which uses all database images as queries, we evaluate the performance in the same way as the recent work [45]. The INSTRE dataset is randomly split into 1250 queries, 5 per class, and 27293 database images, while a bounding box defines the query region. The query and the database sets have no overlap.

IV-B Implementation details

Our IME layer can combine with both SIFT-based representations and CNN-based representations. For CNN-based representations, we employ the fine-tuned network for image retrieval [29] to extract the representation vectors. This fine-tuned ResNet101 produces 2048 dimensional representations. We extract regions at 3 different scales as in R-MAC [23], and we additionally include the full image as a region. In this fashion, each image has 21 regions on average. The regional representations are aggregated and re-normalized to unit norm in order to construct the original representations, which is exactly as in R-MAC [23]. For SIFT-based representations, we employ the triangulation embedding [13] to aggregate the RootSIFT descriptors [50]. In practice, we employ the 8064-dimensional representations of which the vocabulary size is 64.

The weights of the correction term and regularization term are set as and respectively, throughout our experiments. Time measurements are reported with a 32-core Intel Xeon 2.2GHz CPU.

IV-C Impact of different components

In this section, we conduct a series of experiments on second-order proximity, similarity computation, the parameters of IME, the IME layer and various original representations.

Second-order proximity. To demonstrate the effectiveness of second-order proximity information, we compare in Table I the results of employing the first-order graph and the graph G respectively to compute the geodesic distances. The weights of the edges in graph G are calculated based on both the first-order and the second-order relationships of database images. The performance of graph G is consistently better on all datasets. The results show that graph G is more reliable for calculating geodesic distances. The second-order neighbour relationship is effective in suppressing the interference of the manifold's sample holes. As a result, the constructed k-NN graph is more stable and robust.

Graph               INSTRE   Oxford5k   Oxford105k   Paris6k   Paris106k
First-order graph   80.7     89.5       85.9         92.5      85.6
Graph G             82.4     92.0       87.2         96.6      93.3
TABLE I: Performance comparison (mAP) between the first-order graph and the second-order graph G

Fig. 3: Performance comparison of different methods to convert distances into similarity. T-distribution has better performance, because it reduces the calculation error of similarity between the points that are far apart and enhances the discrimination of moderate geodesic distances.

Similarity computation. We compare different conversion functions in Fig. 3, such as the quadratic function and the t-distribution. These conversion operators are used in IsoMap [32] and t-SNE [36] respectively.

Fig. 3 shows the performance of the different functions that convert distances into similarity. The t-distribution performs much better than the quadratic function on all datasets. It enhances the discrimination of moderate geodesic distances and suppresses the differences between large geodesic distances. Due to error accumulation, the calculation error of large geodesic distances is large. The t-distribution reduces the calculation error of the similarity between points that are far apart by suppressing the differences between large geodesic distances, while the quadratic function increases the calculation error. Therefore, we employ a Student t-distribution with one degree of freedom (which is the same as the Cauchy distribution) as the conversion function in our IME layer.
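The contrast between the two conversion functions can be made concrete with a small numerical sketch. The kernel forms below are assumptions: an unnormalised Student t kernel with one degree of freedom, and the element-wise quadratic term that IsoMap applies before double centering. The toy distances are made up for illustration.

```python
import numpy as np

def t_similarity(D):
    """Student t kernel with one degree of freedom (Cauchy), unnormalised."""
    return 1.0 / (1.0 + D ** 2)

def quadratic_similarity(D):
    """Element-wise quadratic conversion (IsoMap additionally double-centres the matrix)."""
    return -0.5 * D ** 2

D = np.array([1.0, 5.0, 50.0, 100.0])           # two moderate and two large distances
s_t = t_similarity(D)
s_q = quadratic_similarity(D)
gap_moderate_t, gap_large_t = s_t[0] - s_t[1], s_t[2] - s_t[3]
gap_moderate_q, gap_large_q = s_q[0] - s_q[1], s_q[2] - s_q[3]
```

Under the t kernel the gap between the two large distances nearly vanishes while the moderate distances stay well separated; under the quadratic conversion the large distances dominate, so their estimation errors are amplified.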

Parameters of IME. We evaluate the performance of different parameters of IME: the number of iterations, the dimension of the IME representation, the number of neighbours and the strength of correction.

Fig. 4(a) shows the performance versus the number of iterations. Our IME layer performs well with just a small number of iterations. Too many iterations lead to overfitting of the embedding. We set the number of iterations to 2 in the remaining experiments in this paper due to the better performance and moderate training time.

The performance comparison of various dimensions of the embedded representation is shown in Fig. 4(b). We achieve 92.0 and 96.6 mAP on Oxford5k and Paris6k datasets respectively when we employ the 2048-dimensional IME representation. The high-dimensional representation preserves more discriminative information, so it performs better than the low-dimensional representation. As shown in Fig. 4(b) by dotted lines, the mAP of the original representation is 83.9 and 93.8 on Oxford5k and Paris6k datasets respectively. Our method achieves better performance than the original representation even when the final IME representation has 128 dimensions, one sixteenth of the original representation. The results show that our low-dimensional IME representation still preserves the intrinsic manifold. The reconstructed manifold is effective for searching similar images in the image retrieval task.

Fig. 4(c,d) shows the performance versus the number of neighbours used to construct the k-NN graph; the number of neighbours is set separately for the first and second iterations. The selection of this parameter depends to some degree on the sparsity of the dataset. A small number of neighbours is better suited to a sparser dataset. With a small number of neighbours, most of the neighbours are positive. With a large number, there are many negative neighbours in the graph, especially for a sparse dataset. Therefore the k-NN graph constructed with too many neighbours is unstable for a sparse dataset, and the geodesic distances computed from it have a large calculation error. A dense dataset is more adaptable: even with a large number of neighbours, the neighbours in a dense dataset are little influenced by noise, because more of them are stable. There are more similar objects in Paris6k than in Oxford5k, so the images in Oxford5k are sparser and the images in Paris6k are denser. Therefore Paris6k is almost unaffected by the number of neighbours, while Oxford5k degrades as the number of neighbours grows.
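The effect of too many neighbours on geodesic distances can be reproduced on a toy manifold. The sketch below assumes a plain symmetric k-NN graph with Euclidean edge lengths and Floyd-Warshall shortest paths; it leaves out the second-order weighting and the Euclidean correction, and the semicircle data and function names are illustrative assumptions rather than the experimental setup.

```python
import math


def knn_graph(points, k):
    """Symmetric k-NN graph: entry [i][j] is the Euclidean edge length or inf."""
    n = len(points)
    dist = [[math.inf] * n for _ in range(n)]
    for i in range(n):
        dist[i][i] = 0.0
        ranked = sorted((math.dist(points[i], points[j]), j)
                        for j in range(n) if j != i)
        for length, j in ranked[:k]:
            dist[i][j] = dist[j][i] = length
    return dist


def geodesic_distances(dist):
    """Floyd-Warshall all-pairs shortest paths over the graph."""
    n = len(dist)
    g = [row[:] for row in dist]
    for m in range(n):
        for i in range(n):
            for j in range(n):
                if g[i][m] + g[m][j] < g[i][j]:
                    g[i][j] = g[i][m] + g[m][j]
    return g


# Toy manifold: 13 points on a unit semicircle (true end-to-end arc length is pi).
points = [(math.cos(math.pi * i / 12), math.sin(math.pi * i / 12)) for i in range(13)]

small_k = geodesic_distances(knn_graph(points, 2))
large_k = geodesic_distances(knn_graph(points, 12))
print(small_k[0][12])  # stays close to the arc length
print(large_k[0][12])  # collapses towards the straight chord of length 2
```

With few neighbours the shortest path follows the curve, while with many neighbours a direct edge across the semicircle short-circuits the manifold, which is the kind of geodesic error described above for sparse data.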

We evaluate the effect of various correction weights and report the results in Fig. 4(e,f); the strength of correction is set separately for the first and second iterations. The results show that the performance of our method does not heavily rely on the correction weights. We use the same correction weights in all the other experiments. The estimation error of geodesic distance is larger in the first iteration, so the correction weight of the first iteration has a more significant impact than that of the second. If we do not correct the geodesic distances by Euclidean distances during iterative embedding, the results are 89.9 and 95.7 on Oxford5k and Paris6k datasets respectively, as shown in Fig. 4(e,f) by dotted lines. These mAPs are lower than the mAPs obtained with corrected geodesic distances, 92.0 and 96.6 on Oxford5k and Paris6k datasets respectively. The results demonstrate that the correction is important for geodesic distances computed from incomplete data.


Fig. 4: Performance comparison of different parameters on three datasets when varying the number of iterations, the dimension of the IME representation, the number of neighbours and the correction weights. The number of iterations is set to 2 due to its better performance. Even if the dimension is reduced to 128, the IME representation performs better than the original representation. The mAP of the original representation is shown in (b) by dotted lines. Our method does not rely heavily on the correction weights, and we set them to 2 in other experiments.
Columns 2-4: mAP. Columns 5-7: average additional query time (seconds).
Method Oxford5k Paris6k INSTRE Oxford5k Paris6k INSTRE
PCA [51] 82.6 91.5 62.2 0.002 0.002 0.002
IsoMap [32] 77.9 91.8 68.6 0.378 0.403 3.483
LLE [33] 51.7 40.5 42.7 0.054 0.066 0.249
IME 83.5 93.4 75.9 0.907 0.937 7.659
IME layer 92.0 96.6 82.4 0.002 0.002 0.002
TABLE II: Performance comparison with other manifold learning methods for image retrieval

IME layer. The proposed IME layer is an integrated and simplified version of IME. The accuracy and average additional query time of PCA, IME, IME layer and other manifold learning methods are shown in Table II. IME layer achieves the best mAP with the lowest additional query time, which is roughly identical to that of PCA. On Oxford5k and Paris6k datasets, our IME layer is more than twenty-seven times faster than IsoMap [32] and LLE [33], the manifold learning methods that can be applied to image retrieval, and significantly outperforms them on mAP. The computational cost of IME layer is unrelated to the scale of the database, while the cost of IsoMap [32] and LLE [33] is proportional to the number of database images. On the large-scale dataset INSTRE, our method is more than 120 times faster than them. The IME layer also performs better than IME, because the integration reduces the calculation error of the geodesic distances between the query image and database images and the loss of discriminative information in dimension reduction. The results demonstrate that our IME layer is effective and efficient for image retrieval.
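The speed-up figures quoted above follow directly from the query-time columns of Table II. The short sketch below simply divides the reported times by the IME layer time; the dictionary layout and variable names are illustrative, and the numbers are the rounded values printed in the table.

```python
# Average additional query time in seconds, copied from Table II.
query_time = {
    "IsoMap": {"Oxford5k": 0.378, "Paris6k": 0.403, "INSTRE": 3.483},
    "LLE": {"Oxford5k": 0.054, "Paris6k": 0.066, "INSTRE": 0.249},
}
ime_layer_time = 0.002  # seconds, the same on all three datasets in Table II

# Ratio of each method's query time to the IME layer's query time.
speedup = {
    method: {dataset: t / ime_layer_time for dataset, t in times.items()}
    for method, times in query_time.items()
}

for method, ratios in speedup.items():
    print(method, {dataset: round(r, 1) for dataset, r in ratios.items()})
```

The smallest ratio on Oxford5k and Paris6k is that of LLE on Oxford5k (about 27), and the smallest ratio on INSTRE is that of LLE (about 124.5), which is where the "twenty-seven times" and "120 times" statements come from.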

Various original representations. We experiment on both SIFT descriptors and CNN features. Table III presents the accuracy of SIFT-based and CNN-based representations with and without the IME layer. The dimension of the final IME representation is reduced to 2048 in this experiment. The results demonstrate that our IME layer is effective for various features. For a CNN, the IME layer can be directly connected as a trained fully connected layer. For other features, our IME layer can work as a transform matrix that maps the aggregated representations into the embedding space.
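As a rough illustration of how a single transform matrix can play this role, the sketch below fits a linear map from original descriptors to off-line embeddings with closed-form ridge regression, the estimator named in the abstract, and then applies it to a query with one matrix product. The function names, the regularisation value, the absence of a bias term and the random stand-in data are all assumptions made for the example.

```python
import numpy as np


def fit_ime_layer(X, Y, alpha=1.0):
    """Closed-form ridge regression: W = (X^T X + alpha * I)^-1 X^T Y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(d), X.T @ Y)


def apply_ime_layer(W, q):
    """Embed one query descriptor (or a batch) with a single matrix product."""
    return q @ W


rng = np.random.default_rng(0)
X = rng.standard_normal((200, 16))   # stand-in for original database descriptors
true_map = rng.standard_normal((16, 8))
Y = X @ true_map                     # stand-in for off-line IME representations

W = fit_ime_layer(X, Y, alpha=1e-3)  # learned once, off-line
query = rng.standard_normal(16)      # stand-in for a query descriptor
embedded = apply_ime_layer(W, query)
print(embedded.shape)
```

Because the query-time work is one matrix product, its cost depends only on the descriptor and embedding dimensions and not on the number of database images, which is consistent with the constant query time reported for the IME layer.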

Feature Oxford5k Oxford105k
SIFT [13] 52.7 27.6
SIFT+IME layer 62.2 31.3
CNN [29] 83.9 80.8
CNN+IME layer 92.0 87.2
TABLE III: Performance of IME layer for various features
Datasets
Method Dimensions INSTRE Oxford5k Oxford105k Paris6k Paris106k
Original representations
CroW [24] 512 - 68.2 63.2 79.8 71.0
R-MAC [27] 512 47.7 77.7 70.1 84.1 76.8
R-MAC [29] 2048 62.6 83.9 80.8 93.8 89.9
Dimension reduction and manifold learning
PCA [51] 512 50.0 78.2 74.7 91.0 85.4
ICA [52] 512 50.3 77.5 73.7 90.8 85.2
IsoMap [32] 512 69.7 77.8 64.7 91.8 69.6
LLE [33] 512 60.2 64.0 47.6 50.7 21.7
IME layer 512 83.1 91.2 85.1 96.3 92.5
PCA [51] 1024 58.7 80.8 78.0 91.7 86.8
ICA [52] 1024 58.7 81.6 78.0 92.1 87.3
IsoMap [32] 1024 69.1 78.1 65.4 92.1 72.1
LLE [33] 1024 50.4 58.8 42.3 45.0 16.2
IME layer 1024 82.8 91.8 86.2 96.5 92.9
PCA [51] 2048 62.2 82.6 79.1 91.5 86.5
ICA [52] 2048 62.2 82.7 79.1 91.5 86.5
IsoMap [32] 2048 68.6 77.9 68.3 91.8 76.4
LLE [33] 2048 42.7 51.7 34.9 40.5 14.7
IME layer 2048 82.4 92.0 87.2 96.6 93.3
Search reranking
QE [41] 2048 70.5 89.6 88.3 95.3 92.7
SCSM [53] 2048 71.4 89.1 87.3 95.4 92.5
Diffusion [45] 2048 80.5 87.1 86.8 96.5 95.4
IME layer 2048 82.4 92.0 87.2 96.6 93.3
TABLE IV: Performance comparison with the state-of-the-art methods.

Fig. 5: Sample retrieval results of original representation and IME representation. The images with gray border are query images. The images with green border are positive retrieval results (ground-truth) and the images with red border are negative results. Because the reconstructed manifold of images is continuous and smooth, images that contain the same objects from different viewing angles and under various illumination are closer in the embedding space.

IV-D Comparison with the state-of-the-art

We compare with the state-of-the-art approaches with global representation. Table IV summarizes the results. Our IME layer significantly outperforms all the existing dimension reduction and manifold learning methods on all datasets. Without post-processing, our IME layer still outperforms the state-of-the-art image retrieval methods with reranking on most datasets.

In the first part of the table, we show results of the methods that employ global representations of images and do not perform any form of spatial verification or query expansion at query time. The 2048-dimensional R-MAC vectors [29] in the first part are employed as the original CNN-based representations in this paper.

We compare our IME layer with related dimension reduction and manifold learning methods in the second part of the table. Except for IsoMap [32] and LLE [33], other nonlinear manifold learning methods cannot be directly applied to image retrieval. We consistently outperform them for various dimensions on all datasets. In one case (namely, on INSTRE), our method is more than 13 mAP points ahead of the best competitor [32]. Our IME layer requires less computational cost than IsoMap [32] and LLE [33] in the on-line image retrieval stage, and the same computational cost as PCA [51] and ICA [52].

In the third part of the table, we show the results of state-of-the-art methods that employ global representations and perform search reranking (e.g., spatial verification [53], query expansion (QE) [41] or diffusion [45]) at query time. Without reranking, our method still outperforms the state-of-the-art methods with post-processing on most datasets. These methods with post-processing [53, 41, 45] perform the image scoring and ranking steps more than once. It is worth noting that our IME layer is much faster than these methods and has comparable performance.

For Oxford105k and Paris106k datasets, we use the incomplete data (the images in Oxford5k and Paris6k respectively) and 5000 noisy images from the 100k Flickr distractor images [47] to learn the weights of the IME layer in the off-line stage, which takes a few minutes. In comparison, diffusion [45] employs all 100 thousand images to construct the manifold and takes many hours. Our IME layer adds less than 2 milliseconds per query in the on-line retrieval stage, while diffusion [45] requires about 14 seconds. Although only sparse samples are employed to learn its weights, our IME layer still achieves good performance on large-scale image retrieval. The results demonstrate that our IME layer is effective at reconstructing the image manifold from incomplete data.

In Fig. 5 we present some query examples using original representation and IME representation respectively. The images with gray border are cropped query images. The images with green border are positive retrieval results (ground-truth) and the images with red border are negative results. IME layer significantly improves the retrieval results by mapping the original representations into the embedding space. Because the reconstructed manifold of images is continuous, images that contain the same objects from different viewing angles and under various illumination are closer in the embedding space.

V Conclusion

In this paper we propose a manifold learning method called iterative manifold embedding (IME) layer and demonstrate its efficiency and effectiveness for image retrieval. Through the unsupervised strategy, the weights of IME layer are learned by incomplete data.

Our IME layer introduces manifold learning into image retrieval. We address the sample holes problem in manifold learning by using the information of second-order proximity and the correction of geodesic distances by Euclidean distances. In order to reduce the additional computational cost and the estimation error of geodesic distances at query time, we integrate the manifold-based embedding into an approximate linear mapping.

Experiments on five standard retrieval datasets demonstrate that our IME layer significantly outperforms related dimension reduction and manifold learning methods with equivalent or lower computational complexity. Without search reranking, our method still outperforms the state-of-the-art methods with search reranking on most datasets.

VI Future Work

The main limitation of our IME layer is that the off-line learning is time-consuming. Especially for large datasets, the cost of constructing the k-NN graph and calculating the geodesic distances is large. In future work, we will try to speed up the off-line learning stage by parallel computing and other strategies. The IME layer is learned via a linear transform in this work. We will also employ a nonlinear transform and try to learn more layers end-to-end.

Acknowledgment

This work was supported by the National Natural Science Foundation of China under Grant 61531019, Grant 61601462, and Grant 71621002. The authors would like to thank the Associate Editor and the anonymous reviewers for their contributions to improve the quality of this paper.

References

  • [1] J. Sivic and A. Zisserman, “Video google: A text retrieval approach to object matching in videos,” in IEEE International Conference on Computer Vision, 2003, p. 1470.
  • [2] H. Jégou, M. Douze, and C. Schmid, “Improving bag-of-features for large scale image search,” International journal of computer vision, vol. 87, no. 3, pp. 316–336, 2010.
  • [3] H. Jegou, C. Schmid, H. Harzallah, and J. Verbeek, “Accurate image search using the contextual dissimilarity measure,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 32, no. 1, pp. 2–11, 2010.
  • [4] J. Philbin, O. Chum, M. Isard, J. Sivic, and A. Zisserman, “Lost in quantization: Improving particular object retrieval in large scale image databases,” in Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on, 2008, pp. 1–8.
  • [5] J. C. van Gemert, C. J. Veenman, A. W. Smeulders, and J. M. Geusebroek, “Visual word ambiguity.” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 32, no. 7, pp. 1271–83, 2010.
  • [6] Y.-H. Kuo, W.-H. Cheng, H.-T. Lin, and W. H. Hsu, “Unsupervised semantic feature discovery for image object retrieval and tag refinement,” IEEE Transactions on Multimedia, vol. 14, no. 4, pp. 1079–1090, 2012.
  • [7] Y. Gao, M. Shi, D. Tao, and C. Xu, “Database saliency for fast image retrieval,” IEEE Transactions on Multimedia, vol. 17, no. 3, pp. 359–369, 2015.
  • [8] J. Wang, J. Yang, K. Yu, F. Lv, T. Huang, and Y. Gong, “Locality-constrained linear coding for image classification,” in Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on.   IEEE, 2010, pp. 3360–3367.
  • [9] H. Jégou, F. Perronnin, M. Douze, J. Sánchez, P. Pérez, and C. Schmid, “Aggregating local image descriptors into compact codes.” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 34, no. 9, pp. 1704–16, 2012.
  • [10] E. Spyromitros-Xioufis, S. Papadopoulos, I. Y. Kompatsiaris, G. Tsoumakas, and I. Vlahavas, “A comprehensive study over vlad and product quantization in large-scale image retrieval,” IEEE Transactions on Multimedia, vol. 16, no. 6, pp. 1713–1728, 2014.
  • [11] F. Perronnin and C. Dance, “Fisher kernels on visual vocabularies for image categorization,” in IEEE Conference on Computer Vision and Pattern Recognition, 2007, pp. 1–8.
  • [12] F. Perronnin, J. Sánchez, and T. Mensink, “Improving the fisher kernel for large-scale image classification,” in Computer Vision - ECCV 2010, European Conference on Computer Vision, Heraklion, Crete, Greece, September 5-11, 2010, Proceedings, 2010, pp. 143–156.
  • [13] H. Jégou and A. Zisserman, “Triangulation embedding and democratic aggregation for image search,” in Computer Vision and Pattern Recognition, 2014, pp. 3310–3317.
  • [14] T.-T. Do, Q. D. Tran, and N.-M. Cheung, “Faemb: a function approximation-based embedding method for image retrieval,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 3556–3564.
  • [15] S. S. Husain and M. Bober, “Improving large-scale image retrieval through robust aggregation of local descriptors,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2016.
  • [16] D. G. Lowe, “Distinctive image features from scale-invariant keypoints,” International Journal of Computer Vision, vol. 60, no. 2, pp. 91–110, 2004.
  • [17] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel, “Backpropagation applied to handwritten zip code recognition,” Neural computation, vol. 1, no. 4, pp. 541–551, 1989.
  • [18] A. Sharif Razavian, H. Azizpour, J. Sullivan, and S. Carlsson, “Cnn features off-the-shelf: an astounding baseline for recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2014, pp. 806–813.
  • [19] Y. Gong, L. Wang, R. Guo, and S. Lazebnik, “Multi-scale orderless pooling of deep convolutional activation features,” in European conference on computer vision.   Springer, 2014, pp. 392–407.
  • [20] A. Babenko, A. Slesarev, A. Chigorin, and V. Lempitsky, “Neural codes for image retrieval,” in European conference on computer vision.   Springer, 2014, pp. 584–599.
  • [21] A. S. Razavian, J. Sullivan, S. Carlsson, and A. Maki, “Visual instance retrieval with deep convolutional networks,” ITE Transactions on Media Technology and Applications, vol. 4, no. 3, pp. 251–258, 2016.
  • [22] A. Babenko and V. Lempitsky, “Aggregating local deep features for image retrieval,” in Proceedings of the IEEE international conference on computer vision, 2015, pp. 1269–1277.
  • [23] G. Tolias, R. Sicre, and H. Jégou, “Particular object retrieval with integral max-pooling of cnn activations,” ICLR, 2016.
  • [24] Y. Kalantidis, C. Mellina, and S. Osindero, “Cross-dimensional weighting for aggregated deep convolutional features,” in European Conference on Computer Vision.   Springer, 2016, pp. 685–701.
  • [25] A. Chadha and Y. Andreopoulos, “Voronoi-based compact image descriptors: Efficient region-of-interest retrieval with vlad and deep-learning based descriptors,” IEEE Transactions on Multimedia, 2017.
  • [26] R. Arandjelovic, P. Gronat, A. Torii, T. Pajdla, and J. Sivic, “Netvlad: Cnn architecture for weakly supervised place recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 5297–5307.
  • [27] F. Radenovic, G. Tolias, and O. Chum, “Cnn image retrieval learns from bow: Unsupervised fine-tuning with hard examples,” in European Conference on Computer Vision.   Springer, 2016, pp. 3–20.
  • [28] A. Gordo, J. Almazan, J. Revaud, and D. Larlus, “Deep image retrieval: Learning global representations for image search,” in European Conference on Computer Vision.   Springer, 2016, pp. 241–257.
  • [29] A. Gordo, J. Almazan, and J. Revaud, “End-to-end learning of deep visual representations for image retrieval,” 2016.
  • [30] X. Jian, S. Cunzhao, Q. Chengzuo, W. Chunheng, and X. Baihua, “Part-based weighting aggregation of deep convolutional features for image,” arXiv preprint arXiv:1705.01247, 2017.
  • [31] H. S. Seung and D. D. Lee, “The manifold ways of perception,” Science, vol. 290, no. 5500, pp. 2268–2269, 2000.
  • [32] J. B. Tenenbaum, V. De Silva, and J. C. Langford, “A global geometric framework for nonlinear dimensionality reduction,” science, vol. 290, no. 5500, pp. 2319–2323, 2000.
  • [33] S. T. Roweis and L. K. Saul, “Nonlinear dimensionality reduction by locally linear embedding,” science, vol. 290, no. 5500, pp. 2323–2326, 2000.
  • [34] M. Belkin and P. Niyogi, “Laplacian eigenmaps and spectral techniques for embedding and clustering,” in NIPS, vol. 14, no. 14, 2001, pp. 585–591.
  • [35] G. Hinton and S. Roweis, “Stochastic neighbor embedding,” in NIPS, vol. 15, 2002, pp. 833–840.
  • [36] L. v. d. Maaten and G. Hinton, “Visualizing data using t-sne,” Journal of Machine Learning Research, vol. 9, no. Nov, pp. 2579–2605, 2008.
  • [37] J. Tang, M. Qu, M. Wang, M. Zhang, J. Yan, and Q. Mei, “Line: Large-scale information network embedding,” in Proceedings of the 24th International Conference on World Wide Web.   ACM, 2015, pp. 1067–1077.
  • [38] J. Tang, J. Liu, M. Zhang, and Q. Mei, “Visualizing large-scale and high-dimensional data,” arXiv preprint arXiv:1602.00370, 2016.
  • [39] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” ICLR 2015, 2015.
  • [40] J. B. Kruskal and M. Wish, Multidimensional Scaling.   BOOK ON DEMAND POD, 1978.
  • [41] O. Chum, J. Philbin, J. Sivic, M. Isard, and A. Zisserman, “Total recall: Automatic query expansion with a generative feature model for object retrieval,” in Computer Vision, 2007. ICCV 2007. IEEE 11th International Conference on.   IEEE, 2007, pp. 1–8.
  • [42] H. Xie, Y. Zhang, J. Tan, L. Guo, and J. Li, “Contextual query expansion for image retrieval,” IEEE Transactions on Multimedia, vol. 16, no. 4, pp. 1104–1114, 2014.
  • [43] M. Donoser and H. Bischof, “Diffusion processes for retrieval revisited,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2013, pp. 1320–1327.
  • [44] Z. Gao, J. Xue, W. Zhou, S. Pang, and Q. Tian, “Democratic diffusion aggregation for image retrieval,” IEEE Transactions on Multimedia, vol. 18, no. 8, pp. 1661–1674, 2016.
  • [45] A. Iscen, G. Tolias, Y. Avrithis, T. Furon, and O. Chum, “Efficient diffusion on region manifolds: Recovering small objects with compact cnn representations,” in Computer Vision and Pattern Recognition (CVPR), 2017 IEEE Conference on, 2017.
  • [46] R. W. Floyd, “Algorithm 97: shortest path,” Communications of the ACM, vol. 5, no. 6, p. 345, 1962.
  • [47] J. Philbin, O. Chum, M. Isard, J. Sivic, and A. Zisserman, “Object retrieval with large vocabularies and fast spatial matching,” in Computer Vision and Pattern Recognition, 2007. CVPR’07. IEEE Conference on.   IEEE, 2007, pp. 1–8.
  • [48] J. Philbin, O. Chum, M. Isard, J. Sivic, and A. Zisserman, “Lost in quantization: Improving particular object retrieval in large scale image databases,” in Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on.   IEEE, 2008, pp. 1–8.
  • [49] S. Wang and S. Jiang, “Instre: a new benchmark for instance-level object retrieval and recognition,” ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM), vol. 11, no. 3, p. 37, 2015.
  • [50] R. Arandjelovic and A. Zisserman, “Three things everyone should know to improve object retrieval,” in Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on.   IEEE, 2012, pp. 2911–2918.
  • [51] H. Abdi and L. J. Williams, “Principal component analysis,” Wiley interdisciplinary reviews: computational statistics, vol. 2, no. 4, pp. 433–459, 2010.
  • [52] A. Hyvärinen, J. Karhunen, and E. Oja, Independent component analysis.   John Wiley & Sons, 2004, vol. 46.
  • [53] X. Shen, Z. Lin, J. Brandt, and Y. Wu, “Spatially-constrained similarity measure for large-scale object retrieval,” IEEE transactions on pattern analysis and machine intelligence, vol. 36, no. 6, pp. 1229–1241, 2014.