Towards Visual Feature Translation

Most existing visual search systems are built upon fixed kinds of visual features, which prevents features from being reused across different systems or when a system is upgraded with a new type of feature. Such a setting is inflexible and costly in time and memory, a problem that can be mitigated if visual features can be "translated" across systems. In this paper, we make the first attempt towards visual feature translation to break through the barrier of using features across different visual search systems. To this end, we propose a Hybrid Auto-Encoder (HAE) to translate visual features, which learns a mapping by minimizing the translation and reconstruction errors. Based upon HAE, an Undirected Affinity Measurement (UAM) is further designed to quantify the affinity among different types of visual features. Extensive experiments have been conducted on several public datasets with 16 different types of widely-used features in visual search systems. Quantitative results show the encouraging possibility of feature translation. For the first time, the affinity among widely-used features like SIFT and DELF is reported.



1 Introduction

Visual features serve as the basis for most existing visual search systems. In a typical setting, a visual search system can only handle predefined features extracted from the image set offline. Such a setting prevents a given kind of visual feature from being reused across different systems. Moreover, when upgrading a visual search system, a time-consuming step is needed to extract new features and to build the corresponding index, while the previous features and index are simply discarded. Breaking through such a setting, if possible, would be highly beneficial. For instance, the existing features and index can be efficiently reused when updating old features with new ones, which can significantly save time and memory. As another example, images can be efficiently archived with only their respective features for cross-system retrieval.

Figure 1: Two potential applications of visual feature translation. Top: In cross-feature retrieval, Feature A is translated to Feature AB, which can be used to search images that are represented and indexed by Feature B. Bottom: In the merger of retrieval systems, Feature A used in System A is efficiently translated to Feature AB, instead of the expensive process of re-extracting the entire dataset in System A with Feature B.

However, reusing features is not an easy task. Different types of features vary in dimension and distribution, which prevents them from being reused directly. Therefore, a feature “translator” is needed to transform across different types of features, which, to the best of our knowledge, remains unexplored in the literature. Intuitively, given a set of images extracted with different types of features, one can leverage the feature pairs to learn the corresponding feature translator.

Figure 2: The overall flowchart of the proposed visual feature translation. In Stage I, different handcrafted or learning-based features are extracted from the image set for training. In Stage II, the mappings from source features to target features are learned by our HAE with the two encoders and the shared decoder. Then the encoder of the source features and the shared decoder are used in inference. In Stage III, the UAM is calculated to quantify the affinity among different types of visual features, which is further visualized by employing the Minimum Spanning Tree.

In this paper, we make the first attempt to investigate visual feature translation. Concretely, we propose a Hybrid Auto-Encoder (HAE) that learns a mapping from source features to target features by minimizing the translation and reconstruction errors. HAE consists of two encoders and one decoder. In training, the source and target features are encoded into a latent space by corresponding encoders. Features in this latent space are sent to a shared decoder to produce the translated features and reconstructed features. Then the reconstruction and translation errors are minimized by optimizing the objective function. In inference, the encoder of source features and the shared decoder are used for translation. The proposed HAE further provides a way to characterize the affinity among different types of visual features. Based upon HAE, an Undirected Affinity Measurement (UAM) is further proposed, which provides, also for the first time, a quantification of the affinity among different types of visual features. We also discover that UAM can predict the translation quality before the actual translation happens.

We train HAE on the Google-Landmarks dataset [16] and evaluate a total of 16 different types of widely-used features in the visual search community [2, 4, 19, 21, 29, 36, 41, 44, 52]. The test is conducted on three benchmark datasets, i.e., Oxford5k [40], Paris6k [37], and Holidays [18]. Quantitative results show the encouraging possibility of feature translation. In particular, HAE works relatively well for feature pairs such as V-CroW to V-SPoC (measured by the mAP decrease on the Oxford5k benchmark) and R-rMAC to R-CroW (measured by the mAP decrease on the Holidays benchmark). Interestingly, visual feature translation provides some intriguing results (see Fig. 4 later in our experiments). For example, when translating from SIFT to DELF, characteristics like rotation or viewpoint invariance can be highlighted, which provides a new way to absorb the merits of handcrafted features into learning-based ones.

In short, our contributions can be summarized as below:

  • We are the first to address the problem of visual feature translation, which fills in the gaps between different features.

  • We are the first to quantify the affinity among different types of visual features in retrieval, which can be used to predict the quality of feature translation.

  • The proposed scheme innovates in several detailed designs, such as the HAE for training the translator and the UAM for quantifying the affinity.

The rest of this paper is organized as follows. Section 2 reviews the related work. The proposed feature translation and feature relation mining algorithms are introduced in Section 3. Quantitative experiments are given in Section 4. Finally, we conclude this work in Section 5.

2 Related Work

Visual Feature. Early endeavors mainly include holistic features (e.g., color histogram [15] and shape [7]) and handcrafted local descriptors [6, 20, 30, 31, 33, 39, 47, 49], such as SIFT [29] and ORB [45]. Then, different aggregation schemes (e.g., Fisher Vector [36] and VLAD [19]) were proposed to encode local descriptors. Along with the proliferation of neural networks, deep visual features have dominated visual search [1, 4, 5, 12, 16, 21, 32, 41, 43, 52]. For instance, the local feature DELF [16] and the global feature produced by GeM [41] pooling are both prominent for representing images. Detailed surveys of visual features can be found in [50, 56].

Transfer Learning. Transfer learning [35, 51] aims to improve the learning of the target task using the knowledge in the source domain. It can be subdivided into instance transfer, feature transfer, parameter transfer, and relation transfer. Our work is related to, but not identical to, feature transfer. Feature transfer [3, 9, 11, 13, 24, 27, 34, 42, 53] is usually based on the hypothesis that the source domain and target domain share some characteristics. It aims to find a common feature space for both source and target domains, which serves as a new representation to improve the learning of the target task. For instance, Structural Correspondence Learning (SCL) [8] uses pivot features to learn a mapping from features of both domains to a shared feature space. As another example, Joint Geometrical and Statistical Alignment (JGSA) [54] learns two coupled projections that project features of both domains into subspaces where the geometrical and distribution shifts are reduced. More recently, deep learning has been introduced into feature transfer [25, 26, 28, 46], in which neural networks are used to find the common feature spaces. In contrast, visual feature translation aims to learn a mapping that translates features from the source space to the target space, and the translated features are used directly in the target space.

3 Visual Feature Translation

Fig. 2 shows the overall flowchart of the proposed visual feature translation. Firstly, source and target feature pairs are extracted from the image set for training in Stage I. Then, feature translation based on HAE is learned in Stage II. After translation, the affinity among different types of features is quantified and visualized in Stage III.

3.1 Preprocessing

As shown in Stage I of Fig. 2, we prepare the source and target features for training the subsequent translator. For handcrafted features such as SIFT [29], the local descriptors are first extracted by the designed procedures. These local descriptors are then aggregated by encoding schemes to produce the global features. For learning-based features such as V-MAC [44, 52], the feature maps are first extracted by neural networks, followed by a pooling layer or an encoding scheme to produce the feature vectors. In our settings, we investigate a total of 16 different types of features, which are listed in Table 1. The feature sets are arranged into pairs, each consisting of a set of source features and a set of target features. The implementation is detailed in Section 4.1.

3.2 Learning to Translate

Inspired by [55], we adopt the Auto-Encoder to learn an energy model for translation, in which low energy is attributed to the data manifold that reflects the data distribution. To fit the task of translating different types of features, a Hybrid Auto-Encoder (HAE) is further proposed, which is shown in Stage II of Fig. 2. For training HAE, the source features and the target features are input to the model, which outputs the translated features and the reconstructed features. The energy scalar is calculated from the translation and reconstruction errors. A low energy scalar indicates that the input pair has similar data distributions.

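Algorithm 1 The Training of HAE
Input: The source and target feature sets, and the two encoders and the shared decoder with their parameters.
Output: The learned translator, i.e., the source encoder and the shared decoder.

1: while not converged do
2:     Get the latent source features by encoding the source features with the source encoder.
3:     Get the latent target features by encoding the target features with the target encoder.
4:     Get the translated features by translation: decode the latent source features with the shared decoder.
5:     Get the reconstructed features by reconstruction: decode the latent target features with the shared decoder.
6:     Optimize Eq. 1.
7: end while
8: return the source encoder and the shared decoder.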

Formally speaking, HAE consists of two encoders, one for the source features and one for the target features, and one shared decoder. In training, the source feature is encoded into a latent feature by the source encoder, and the target feature is likewise encoded by the target encoder. The two latent features are then decoded by the shared decoder to obtain the translated feature and the reconstructed feature, respectively. The energy function is typically a distance metric in the Euclidean space. The encoders and the decoder are parameterized by their respective weights, which can be learned by minimizing the following objective function:


where the first term is defined as the translation error and the second term as the reconstruction error.
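Since the objective is described as the sum of a translation term and a reconstruction term, one way to write it is sketched below. The notation is an assumption introduced only for illustration: $V_s$ and $V_t$ denote paired source and target features, $E_s$ and $E_t$ the two encoders, $D$ the shared decoder, $\theta$ their parameters, and $\epsilon$ the energy function.

```latex
\mathcal{L}(\theta_{E_s}, \theta_{E_t}, \theta_{D})
  = \mathbb{E}_{(V_s, V_t)}\Big[\epsilon\big(V_t,\; D(E_s(V_s))\big)\Big]
  + \mathbb{E}_{V_t}\Big[\epsilon\big(V_t,\; D(E_t(V_t))\big)\Big]
```

Under this reading, the first expectation compares the translated feature with the target feature, and the second compares the reconstructed feature with the same target feature.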

In inference, only the source encoder and the shared decoder are used to translate features from the source space to the target space. The algorithm for training the HAE is summarized in Alg. 1.
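The data flow described above can be sketched in a few lines of plain Python. This is a minimal illustration under several assumptions, not the trained model: `ToyHAE` is a hypothetical name, each encoder and the decoder is reduced to a single bias-free linear layer with random weights, the energy is taken to be the Euclidean distance, and the gradient-based optimization of the objective is omitted.

```python
import math
import random


def linear(weights, x):
    """Apply a bias-free linear layer; weights is a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def relu(x):
    return [max(0.0, v) for v in x]


def l2_normalize(x):
    norm = math.sqrt(sum(v * v for v in x)) or 1.0  # guard against zero vectors
    return [v / norm for v in x]


def euclidean(a, b):
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def init_layer(rng, n_out, n_in):
    return [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]


class ToyHAE:
    """Two encoders (source, target) and one shared decoder (assumed layout)."""

    def __init__(self, src_dim, tgt_dim, latent_dim, seed=0):
        rng = random.Random(seed)
        self.enc_src = init_layer(rng, latent_dim, src_dim)
        self.enc_tgt = init_layer(rng, latent_dim, tgt_dim)
        self.dec = init_layer(rng, tgt_dim, latent_dim)

    def decode(self, z):
        # The shared decoder maps any latent feature into the target space.
        return l2_normalize(linear(self.dec, z))

    def translate(self, v_src):
        # Inference path: source encoder followed by the shared decoder.
        return self.decode(relu(linear(self.enc_src, v_src)))

    def reconstruct(self, v_tgt):
        # Training-only path: target encoder followed by the shared decoder.
        return self.decode(relu(linear(self.enc_tgt, v_tgt)))

    def errors(self, pairs):
        """Return (mean translation error, mean reconstruction error)."""
        n = len(pairs)
        trans = sum(euclidean(t, self.translate(s)) for s, t in pairs) / n
        recon = sum(euclidean(t, self.reconstruct(t)) for s, t in pairs) / n
        return trans, recon

    def objective(self, pairs):
        # Sum of the translation and reconstruction terms.
        trans, recon = self.errors(pairs)
        return trans + recon
```

A training loop would repeatedly evaluate `objective` on batches of (source, target) pairs and update all three weight sets; only `translate` is needed once training is done.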

In the following, we introduce a key theorem guaranteeing that the manifolds of the translated and target features tend to be consistent after optimizing the objective function.

Theorem 1: The upper and lower bounds of the difference between the translation error and the reconstruction error are and zero, under the assumption that the reconstruction error is smaller than the translation error.

Proof: The objective function for a data point can be written as:


According to the Cauchy-Schwarz inequality, we have:


According to the Euclidean triangle inequality, we obtain:


Considering the assumption, we get:


Combining Eqs. 3, 4 and 5, we obtain:


Averaging all the data points, we get the final inequality:


Thus, the upper and lower bounds of the difference between the translation and reconstruction errors are derived under the assumption.
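A bound of this kind can be illustrated with the triangle inequality alone. The sketch below rests on assumptions introduced here for illustration: the energy is the unsquared Euclidean distance, and $v_t$, $v_{st}$ and $v_{tt}$ denote a target feature, its translated counterpart and its reconstruction. It is not claimed to match the exact statement of Theorem 1.

```latex
\begin{aligned}
\|v_t - v_{st}\| &\le \|v_t - v_{tt}\| + \|v_{tt} - v_{st}\|
  && \text{(triangle inequality)} \\
\|v_t - v_{st}\| - \|v_t - v_{tt}\| &\le \|v_{tt} - v_{st}\|
  && \text{(rearranging)} \\
0 \;\le\; \|v_t - v_{st}\| - \|v_t - v_{tt}\| &\le \|v_{tt} - v_{st}\|
  && \text{(assuming reconstruction error} \le \text{translation error)}
\end{aligned}
```

Averaging over all data points preserves both inequalities, so under these assumptions the gap between the two errors is controlled by how far the translated features lie from the reconstructed ones.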

We then get the following characteristics for our visual feature translation:

Characteristic I: Saturation. It is difficult for the translated features to outperform the target features. This phenomenon is inherent in the feature translation process. According to Eq. 1, the translation and reconstruction errors are minimized during optimization. However, they can hardly reach zero due to the information loss caused by the Auto-Encoder architecture.

Characteristic II: Asymmetry. The convertibility of translation differs between A2B and B2A (A2B abbreviates the translation from features A to features B). The networks for translating different types of features are asymmetric by nature. HAE approximates the manifold between features, but this approximation relies on the translation and reconstruction errors, which are not the same for A2B and B2A.

Characteristic III: Homology. In general, homologous features tend to have high convertibility. In contrast, convertibility is not guaranteed for heterogeneous features. Homologous features refer to features extracted by the same extractor but encoded or pooled by different methods (e.g., DELF-FV [16, 36] and DELF-VLAD [16, 19], or V-CroW [21] and V-SPoC [4]), and heterogeneous features refer to features extracted by different extractors. This characteristic is analyzed in detail in Section 4.2.

Algorithm 2 Affinity Calculation and Visualization
Input: The number n of different types of features, the feature pairs, and the learned translator.
Output: The directed affinity matrix and the undirected affinity matrix.

1: for i = 1 : n, j = 1 : n do
2:     Calculate the directed affinity between features i and j by Eq. 8.
3: end for
4: for i = 1 : n, j = 1 : n do
5:     Calculate the row-normalized and column-normalized entries by Eq. 9 and Eq. 10.
6: end for
7: Calculate the undirected affinity matrix by Eq. 11.
8: Generate the MST based on the undirected affinity matrix by Kruskal's algorithm.
9: Visualize the MST.
10: return the directed and undirected affinity matrices.

3.3 Feature Relation Mining

HAE provides a way to characterize the affinity between feature pairs according to Theorem 1. Therefore, the total affinity among different types of features can be quantified as shown in Stage III of Fig. 2. Firstly, we use the difference between the translation and reconstruction errors as a Directed Affinity Measurement (DAM) and calculate the directed affinity matrix, which forms a directed graph over all the feature pairs. Secondly, in order to quantify the total affinity among different types of features, we design an Undirected Affinity Measurement (UAM) based on the directed affinity matrix. The resulting undirected affinity matrix is symmetric and forms a complete graph. Thirdly, we visualize the local similarity between features by using the Minimum Spanning Tree (MST) of the complete graph.

Figure 3: The visualization of the MST based on the undirected affinity matrix of popular visual search features. The length of each edge is the average value of the results on the Holidays, Oxford5k and Paris6k datasets. The images are the retrieval results for a query image of the Pantheon with the corresponding features in the main trunk of the MST. Close feature pairs such as R-SPoC and R-CroW have similar ranking lists.

Directed Affinity Measurement. We assume that, after optimization, the reconstruction error is smaller than the translation error. This intuitive assumption is verified later in our experiments in Section 4.3. According to Theorem 1, when minimizing the objective function, the translation error is forced to approximate the reconstruction error. If the translation error is close to the reconstruction error, the translation from source to target features behaves like the reconstruction of the target features, which indicates that the source and target features may have high affinity. Therefore, we regard the difference between the translation and reconstruction errors as the affinity measurement between two features, termed the Directed Affinity Measurement between the source and target features. The element at row i and column j of the directed affinity matrix is defined as follows:


where the encoders and the decoder are parameterized by their learned weights.
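As a concrete reading of this definition, the directed affinity of one feature pair can be computed from three aligned lists of vectors. The sketch below assumes the Euclidean distance as the energy and averages over the data points; the function name `directed_affinity` is hypothetical.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def directed_affinity(targets, translated, reconstructed):
    """Mean translation error minus mean reconstruction error.

    All three arguments are lists of equal length whose k-th entries
    belong to the same image (an assumption of this sketch).
    """
    n = len(targets)
    trans_err = sum(euclidean(t, st) for t, st in zip(targets, translated)) / n
    recon_err = sum(euclidean(t, tt) for t, tt in zip(targets, reconstructed)) / n
    return trans_err - recon_err


# Two target vectors, a poor translation and a better reconstruction.
example = directed_affinity(
    targets=[[1.0, 0.0], [0.0, 1.0]],
    translated=[[0.0, 0.0], [0.0, 0.0]],
    reconstructed=[[1.0, 0.0], [0.0, 0.5]],
)
```

Here the translation errors are 1 and 1, the reconstruction errors are 0 and 0.5, so `example` is 1.0 - 0.25 = 0.75. A smaller value means the translation behaves more like the reconstruction.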

Undirected Affinity Measurement. Due to the asymmetry characteristic, the directed affinity matrix is asymmetric, which makes it unsuitable as the total affinity measurement of feature pairs. We therefore design an Undirected Affinity Measurement (UAM) to quantify the overall affinity among different types of features. Specifically, we treat A2B and B2A as a unified whole, so the rows and columns of the directed affinity matrix are considered consistently. For the rows of the directed affinity matrix, the element at row i and column j of the matrix with normalized rows is defined as:


where the minimum and maximum are taken over row i of the directed affinity matrix, so that each row is normalized to a common range.

In a similar way, for the columns of the directed affinity matrix, the element at row i and column j of the matrix with normalized columns is defined as:


where the minimum and maximum are taken over column j of the directed affinity matrix, so that each column is normalized to a common range.

Then, the undirected affinity matrix is defined as:


If the entry at row i and column j of the undirected affinity matrix has a small value, feature i and feature j are similar, and vice versa.
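The normalization and symmetrization steps can be sketched as follows. Two points are assumptions of this sketch rather than statements of the method: the normalization is taken to be min-max scaling to [0, 1], and the four normalized views of a pair (row- and column-normalized, in both directions) are combined by a plain average, which is only one way to obtain a symmetric matrix.

```python
def min_max_rows(matrix):
    """Min-max normalize each row of a matrix to [0, 1]."""
    out = []
    for row in matrix:
        lo, hi = min(row), max(row)
        span = (hi - lo) or 1.0  # avoid division by zero for constant rows
        out.append([(v - lo) / span for v in row])
    return out


def transpose(matrix):
    return [list(col) for col in zip(*matrix)]


def undirected_affinity(dam):
    """Build a symmetric affinity matrix from a directed one."""
    rows = min_max_rows(dam)
    # Column normalization is row normalization of the transpose.
    cols = transpose(min_max_rows(transpose(dam)))
    n = len(dam)
    # Assumed combination: average the four views of the pair (i, j).
    return [
        [(rows[i][j] + rows[j][i] + cols[i][j] + cols[j][i]) / 4.0 for j in range(n)]
        for i in range(n)
    ]


# A tiny asymmetric directed affinity matrix for two feature types.
uam = undirected_affinity([[0.0, 2.0], [4.0, 1.0]])
```

For this input both normalized matrices equal `[[0, 1], [1, 0]]`, so `uam` is `[[0.0, 1.0], [1.0, 0.0]]`. Whatever combination rule is used, the essential property is that entry (i, j) equals entry (j, i).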

The Visualization. We use the Minimum Spanning Tree (MST) to visualize the relationship of features based on the undirected affinity matrix. Kruskal’s algorithm [23] is used to find the MST. This algorithm first creates a forest in which each vertex is a separate tree. Then the edge with minimum weight that connects two different trees is repeatedly added to the forest, which combines two trees into a single tree. The final output forms an MST of the complete graph. The MST helps us to understand the most related feature pairs (connected by an edge), as well as their affinity score (the length of the edge). The overall procedure is summarized in Alg. 2. The visualization result of the affinity among popular visual features with a query example can be found in Fig. 3.
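The forest-merging procedure described above can be sketched with a union-find structure. The function name `kruskal_mst` and the small weight matrix are illustrative; the input is assumed to be a symmetric matrix of edge weights for a complete graph, as produced by an undirected affinity measurement.

```python
def kruskal_mst(n, weights):
    """Return the MST edges (weight, i, j) of a complete undirected graph."""
    parent = list(range(n))  # each vertex starts as its own tree

    def find(x):
        # Follow parent links to the root, shortening the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Consider every undirected edge once, lightest first.
    edges = sorted((weights[i][j], i, j) for i in range(n) for j in range(i + 1, n))
    mst = []
    for w, i, j in edges:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:  # the edge joins two different trees
            parent[root_i] = root_j
            mst.append((w, i, j))
            if len(mst) == n - 1:
                break
    return mst


# Symmetric toy weights for four feature types.
toy_weights = [
    [0, 1, 4, 3],
    [1, 0, 2, 5],
    [4, 2, 0, 6],
    [3, 5, 6, 0],
]
tree = kruskal_mst(4, toy_weights)
```

For the toy matrix, `tree` is `[(1, 0, 1), (2, 1, 2), (3, 0, 3)]`: the three lightest edges happen to connect all four vertices without forming a cycle.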

4 Experiments

We show the experiments in this section. We first introduce the experimental settings. Then the translation performance of our HAE is reported. Finally, we visualize and analyze the results of relation mining.

4.1 Experimental Settings

Training Dataset. The Google-Landmarks dataset [16] contains more than 1M images captured at various landmarks all over the world. We randomly pick 40,000 images from this dataset to train the HAE, and pick 4,000 other images to train PCA whitening [4, 17] and to create the codebooks for local descriptors.

Test Dataset. We use the Holidays, Oxford5k and Paris6k datasets for testing. The Holidays dataset [18] has 1,491 images with various scene types and 500 query images. The Oxford5k dataset [37] consists of 5,062 images, which have been manually annotated to generate a comprehensive ground truth for 55 query images. Similarly, the Paris6k dataset [38] consists of 6,412 images with 55 query images. Since the scalability of retrieval algorithms is not our main concern, we do not use the distractor dataset Flickr100k [38]. Recently, the work in [40] revisited the labels and queries of both Oxford5k and Paris6k. Because the images remain the same and the characteristics of the features are therefore unaffected, we do not use the revisited datasets as our test datasets. The mean average precision (mAP) is used to evaluate the retrieval performance. We translate the source features of the reference images to the target space, and the target features of the query images are used for testing.
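For reference, the mAP metric can be sketched as below. This is the plain textbook definition with hypothetical function names; the official evaluation protocols of the benchmarks may differ in details such as the handling of ignored images, which this sketch does not model.

```python
def average_precision(ranked_ids, relevant_ids):
    """Average of the precision values at each rank where a relevant item appears."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, item in enumerate(ranked_ids, start=1):
        if item in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant)


def mean_average_precision(rankings, ground_truth):
    """Mean of the per-query average precision.

    rankings maps a query id to its ranked list of database ids;
    ground_truth maps a query id to the ids that are relevant to it.
    """
    aps = [average_precision(rankings[q], ground_truth[q]) for q in rankings]
    return sum(aps) / len(aps)


rankings = {"q1": ["a", "b", "c", "d"], "q2": ["x", "y"]}
ground_truth = {"q1": ["a", "c"], "q2": ["y"]}
score = mean_average_precision(rankings, ground_truth)
```

In the example, the first query scores (1/1 + 2/3) / 2 and the second scores 1/2, so `score` is their mean, about 0.667. In the cross-feature setting, the ranked lists would come from comparing target features of the queries against translated features of the reference images.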

Features. L1 normalization and square root [2] are applied to SIFT [29]. The original extraction approach (at most 1,000 local representations per image) is applied to DELF [16]. The codebooks of FV [36] and VLAD [19] are created for SIFT and DELF. The codebooks of FV are formed from the components of a Gaussian Mixture Model (GMM), and the dimension of this feature is reduced by PCA whitening. The aggregated features are termed SIFT-FV and DELF-FV. The codebooks of VLAD are formed from central points, and the dimension of this feature is likewise reduced by PCA whitening. The aggregated features are termed SIFT-VLAD and DELF-VLAD. For off-the-shelf deep features, we use ImageNet [10] pre-trained VGG-16 (abbreviated as V) [48] and ResNet101 (abbreviated as R) [14] to produce the feature maps. Max-pooling (MAC) [44, 52], average-pooling (SPoC) [4], weighted sum-pooling (CroW) [21], and regional max-pooling (rMAC) [52] are then used to pool the feature maps. The extracted features are termed V-MAC, V-SPoC, V-CroW, V-rMAC, R-MAC, R-SPoC, R-CroW and R-rMAC, respectively. For fine-tuned deep features, we consider the generalized mean-pooling (GeM) and regional generalized mean-pooling (rGeM) [41]. The extracted features are termed V-GeM, V-rGeM, R-GeM and R-rGeM, respectively.
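Three of the pooling schemes mentioned above are simple enough to sketch directly. The code below is illustrative only: a feature map is represented as a list of channels, each holding its flattened spatial activations; the function names are hypothetical; the exponent `p=3.0` is just an example value; and the weighted (CroW) and regional (rMAC, rGeM) variants are not covered.

```python
import math


def mac(feature_map):
    """Max-pooling: keep the largest activation of each channel."""
    return [max(channel) for channel in feature_map]


def spoc(feature_map):
    """Average-pooling: mean activation of each channel."""
    return [sum(channel) / len(channel) for channel in feature_map]


def gem(feature_map, p=3.0, eps=1e-6):
    """Generalized mean-pooling: (mean of x^p)^(1/p) per channel."""
    pooled = []
    for channel in feature_map:
        # Clamp to eps so that fractional powers stay well defined.
        mean_pow = sum(max(v, eps) ** p for v in channel) / len(channel)
        pooled.append(mean_pow ** (1.0 / p))
    return pooled


def l2_normalize(vec):
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


# Two channels with three spatial positions each.
fmap = [[1.0, 2.0, 3.0], [0.5, 0.5, 0.5]]
descriptor = l2_normalize(gem(fmap))
```

With `p=1` the generalized mean reduces to average-pooling, and it approaches max-pooling as `p` grows, which is one way to see why features pooled from the same feature maps are closely related.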

Feature                   Holidays  Oxford5k  Paris6k
DELF-FV [16, 36]             83.42     73.38    83.06
DELF-VLAD [16, 19]           84.61     75.31    82.54
R-CroW [21]                  86.38     61.73    75.46
R-GeM [41]                   89.08     84.47    91.87
R-MAC [44, 52]               88.53     60.82    77.74
R-rGeM [41]                  89.32     84.60    91.90
R-rMAC [52]                  89.08     68.46    83.00
R-SPoC [4]                   86.57     62.36    76.75
V-CroW [21]                  83.17     68.38    79.79
V-GeM [41]                   84.57     82.71    86.85
V-MAC [44, 52]               74.18     60.97    72.65
V-rGeM [41]                  85.06     82.30    87.33
V-rMAC [52]                  83.50     70.84    83.54
V-SPoC [4]                   83.38     66.43    78.47
SIFT-FV [2, 29, 36]          61.77     36.25    36.91
SIFT-VLAD [2, 29, 19]        63.92     40.49    41.49
Table 1: The mAP (%) of target features.

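Holidays (rows are source features; columns are target features in the order DELF-FV, DELF-VLAD, R-CroW, R-GeM, R-MAC, R-rGeM, R-rMAC, R-SPoC, V-CroW, V-GeM, V-MAC, V-rGeM, V-rMAC, V-SPoC, SIFT-FV, SIFT-VLAD):
DELF-FV 8.9 12.3 38.8 46.2 48.3 42.7 26.9 38.5 26.7 30.9 42.6 29.2 21.9 26.3 48.4 52.2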
DELF-VLAD 10.3 9.2 40.9 44.4 52.6 44.1 26.6 38.5 24.9 28.4 41.2 28.6 20.1 25.9 49.7 52.5
R-CroW 25.0 30.5 7.8 26.2 17.1 23.8 8.5 7.3 18.5 28.8 37.7 24.8 15.4 15.2 43.5 46.3
R-GeM 24.1 30.7 22.6 12.8 26.4 11.5 12.8 25.1 20.0 16.8 34.3 17.5 15.3 21.2 43.7 47.1
R-MAC 29.0 36.0 14.3 26.1 20.5 31.2 11.4 15.6 19.2 30.7 32.6 29.1 13.5 19.0 45.8 50.3
R-rGeM 23.8 30.0 19.4 12.4 25.7 11.3 10.5 21.1 20.8 18.6 37.1 14.7 13.8 21.6 44.4 47.5
R-rMAC 27.7 31.4 11.0 25.0 18.3 23.1 9.3 11.4 20.4 27.9 34.5 21.6 12.3 19.4 45.3 49.3
R-SPoC 25.0 30.0 7.4 25.5 15.7 24.2 9.1 7.1 18.1 26.9 37.0 23.2 13.5 16.1 44.3 45.2
V-CroW 27.4 31.5 28.9 34.1 30.9 30.6 18.2 28.1 8.5 19.5 15.9 17.3 9.5 9.6 39.9 41.5
V-GeM 26.0 31.2 35.0 32.7 37.6 31.8 20.7 33.5 15.2 11.4 21.8 8.0 11.3 18.3 41.3 41.2
V-MAC 40.5 43.7 46.9 51.3 49.2 48.5 29.7 45.4 24.1 31.9 24.8 35.2 17.7 28.8 48.8 52.8
V-rGeM 27.2 31.4 31.7 30.6 40.6 29.4 19.0 32.1 17.6 11.1 24.3 8.8 9.2 18.2 40.0 42.2
V-rMAC 30.0 34.6 32.3 43.5 39.5 36.9 19.9 33.9 17.8 24.4 20.0 21.6 10.6 18.9 45.4 48.4
V-SPoC 24.2 29.0 25.5 30.8 28.9 30.7 15.8 25.8 8.5 18.4 19.4 17.0 9.8 8.4 37.8 38.4
SIFT-FV 63.1 69.0 66.7 74.9 77.5 74.3 62.6 66.0 68.0 70.4 67.9 70.1 66.4 64.0 16.0 13.3
SIFT-VLAD 65.1 69.3 68.1 77.2 78.7 76.7 63.5 67.8 67.3 70.0 68.9 69.2 66.5 64.8 9.7 20.2

Oxford5k:
DELF-FV 11.0 17.6 32.3 55.1 38.5 52.0 31.1 35.1 26.3 39.8 34.7 35.9 31.5 24.3 31.8 34.5
DELF-VLAD 13.9 8.6 27.5 50.3 34.4 45.4 26.3 31.8 19.9 29.5 37.9 30.5 25.3 20.0 32.4 34.6
R-CroW 37.9 39.9 9.7 43.3 16.6 39.0 19.4 10.8 22.8 40.3 37.9 33.8 31.0 22.8 32.9 36.5
R-GeM 37.2 29.0 21.0 14.7 23.9 14.4 21.7 20.2 24.5 20.2 31.7 27.4 26.7 32.7 32.1 37.5
R-MAC 47.5 39.9 20.3 52.2 23.8 47.5 24.7 22.9 31.4 41.6 34.8 41.3 36.1 31.4 31.0 37.4
R-rGeM 34.3 32.0 20.1 14.2 23.8 13.4 18.4 17.3 24.4 21.4 34.0 21.5 26.9 27.9 30.8 35.9
R-rMAC 41.0 31.7 10.8 41.1 16.9 36.7 19.3 12.8 25.2 43.6 31.2 31.2 24.5 27.0 31.8 36.2
R-SPoC 39.2 36.2 9.0 43.5 18.1 36.8 17.4 10.9 22.7 37.8 37.2 36.8 26.3 23.2 30.6 36.5
V-CroW 28.9 29.8 29.5 50.5 26.6 52.2 24.6 31.3 5.3 28.3 12.5 23.8 14.3 3.9 28.2 31.6
V-GeM 25.2 24.0 26.4 36.6 26.6 38.5 25.7 27.2 12.6 6.4 22.9 5.2 14.0 16.8 29.2 34.6
V-MAC 45.2 46.3 42.5 60.9 37.6 61.3 37.3 39.2 18.4 40.5 19.8 41.0 20.4 22.2 31.9 36.8
V-rGeM 27.1 23.4 30.9 36.2 23.1 34.4 21.5 25.6 12.5 7.4 22.0 6.1 11.9 12.6 27.2 32.0
V-rMAC 40.0 41.8 36.2 53.8 34.4 56.4 29.7 30.0 11.3 28.8 12.4 27.0 12.4 14.3 29.5 34.6
V-SPoC 35.1 35.7 31.9 50.4 27.7 50.6 28.0 32.5 7.7 28.3 15.9 25.1 13.9 4.8 28.2 33.2
SIFT-FV 67.9 69.8 57.2 81.7 56.8 80.7 61.8 57.9 63.0 78.5 57.4 76.9 64.7 61.8 21.2 22.5
SIFT-VLAD 67.3 70.2 57.2 82.0 56.8 81.0 61.7 58.5 61.9 77.8 57.6 77.3 64.8 60.4 17.9 22.3

Paris6k:
DELF-FV 13.7 14.9 26.4 34.9 39.4 36.5 31.1 29.9 17.6 24.9 28.5 23.1 22.2 20.5 20.9 25.5
DELF-VLAD 15.1 9.2 26.8 34.9 32.5 40.2 27.9 26.1 16.3 20.6 33.4 25.0 21.8 14.2 24.1 25.3
R-CroW 29.3 30.1 15.3 34.3 22.1 30.0 21.2 16.8 21.6 25.6 32.1 26.0 25.8 20.9 24.2 30.0
R-GeM 24.7 23.1 24.5 16.2 21.7 14.8 18.2 24.0 14.4 11.4 23.4 14.9 16.8 14.8 24.0 26.0
R-MAC 32.9 39.0 22.7 38.4 29.9 36.2 26.9 24.4 25.7 28.7 33.6 29.8 29.6 21.9 26.3 30.1
R-rGeM 23.5 22.9 20.1 15.7 21.7 13.8 18.1 18.2 14.8 12.2 23.9 10.4 14.8 14.9 23.1 26.7
R-rMAC 29.2 28.0 14.2 30.5 19.6 24.4 18.3 15.9 18.4 21.4 31.5 19.4 18.8 18.1 23.4 28.1
R-SPoC 29.3 28.3 13.7 32.7 20.7 29.7 20.1 15.9 20.2 24.4 33.0 22.9 23.4 17.2 22.7 27.4
V-CroW 27.3 26.8 26.2 35.7 26.4 38.1 28.8 29.3 5.5 17.2 10.6 17.6 13.4 5.5 19.8 23.5
V-GeM 21.7 18.4 20.4 27.9 24.3 28.4 16.5 26.0 9.3 8.5 13.7 7.7 11.4 7.9 15.3 19.7
V-MAC 40.5 43.0 43.5 54.2 38.1 50.6 40.5 43.5 18.8 25.4 21.5 28.8 19.6 22.6 26.5 30.6
V-rGeM 19.5 20.5 21.7 22.9 22.1 24.2 17.1 21.9 10.2 8.0 14.5 9.0 10.7 9.6 18.2 20.8
V-rMAC 32.7 32.5 30.1 39.7 32.0 42.9 29.6 33.7 11.2 19.3 14.6 19.2 13.6 11.9 24.5 28.0
V-SPoC 24.4 30.9 29.9 40.6 31.5 42.1 27.8 30.1 8.2 21.1 16.1 19.2 15.6 7.8 18.5 22.2
SIFT-FV 65.0 67.9 61.7 83.3 70.4 81.4 66.2 62.1 64.9 73.2 65.0 72.9 66.9 62.7 18.2 19.7
SIFT-VLAD 65.2 68.2 62.3 82.8 69.5 83.1 65.6 63.4 65.0 73.9 65.2 73.4 65.8 64.7 14.2 22.6
Table 2: The mAP(%) difference between target and translated features on three public datasets: Holidays (Green), Oxford5k (Blue) and Paris6k (Brown) in the first, second and third blocks, respectively.

Network Architecture. The task-specific network architectures have a fixed latent feature space of 128 dimensions. The encoder consists of fully-connected layers with ReLU activations, with layer widths 2048-1024-512-256-128 or 512-256-128 for encoding features of 2048 or 512 dimensions, respectively. The decoder mirrors the encoder in reverse order, depending on the dimension of the output features, and the output features are L2 normalized. We use the Adam [22] optimizer to minimize the objective function for all feature pairs, with the same optimizer hyperparameters in all our experiments.
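The two encoder settings listed above follow a simple halving pattern, which can be captured in a small helper. This is a sketch with hypothetical function names; the halving rule and the default latent width of 128 are read off the listed layer widths and are only meant to reproduce those two settings.

```python
def encoder_widths(input_dim, latent_dim=128):
    """Layer widths of an encoder that halves the width down to latent_dim."""
    widths = [input_dim]
    while widths[-1] > latent_dim:
        widths.append(widths[-1] // 2)
    if widths[-1] != latent_dim:
        # Only dimensions such as 512 or 2048 reach latent_dim exactly.
        raise ValueError("input_dim cannot be halved down to latent_dim")
    return widths


def decoder_widths(output_dim, latent_dim=128):
    """The decoder uses the encoder widths in reverse order."""
    return list(reversed(encoder_widths(output_dim, latent_dim)))


# Translator from a 2048-d source feature to a 512-d target feature.
source_encoder = encoder_widths(2048)
shared_decoder = decoder_widths(512)
```

Here `source_encoder` is `[2048, 1024, 512, 256, 128]` and `shared_decoder` is `[128, 256, 512]`. Because every encoder ends in the same latent width, any source encoder can be paired with a decoder of any target dimension.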

Figure 4: The retrieval results for query images of the Eiffel Tower (top) and the Arc de Triomphe (bottom) with the target features and the translated features. The images are resized for better viewing, and the interesting results are marked with red bounding boxes.

Figure 5: The heat maps of the directed affinity matrix (left) and the undirected affinity matrix (right). The values are the averaged results on the Holidays, Oxford5k and Paris6k datasets.

4.2 Translation Results

The performance of the target features is shown in Table 1. After translation, Characteristic I (Saturation) is revealed immediately: it is difficult for the translated features to exceed the performance of the target features. Therefore, we use the mAP difference between target and translated features to show the translation results.

As shown in Table 2, we use a color map normalized according to the minimum (white) and maximum (colored) values to show the results of each dataset. From the results, we find that although there are still a few differences between datasets, the trend of the colored values is almost the same. Characteristic II (Asymmetry) and Characteristic III (Homology) can then be observed naturally. For example, translating from R-rGeM to V-MAC yields a higher performance than translating from V-MAC to R-rGeM (on the Oxford5k dataset), and the white grids cluster around the diagonal.

For further analysis, the results can be divided into three groups: high convertibility, inferior convertibility and low convertibility. Firstly, the high convertibility results appear mostly in the translation between homologous features. An example is the translation from V-CroW to V-SPoC, whose mAP drops on the Holidays, Oxford5k and Paris6k datasets are reported in Table 2. Secondly, the inferior results are found between heterogeneous features such as R-based features and V-based features. Examples are the translation from R-GeM to V-GeM and the translation from V-rGeM to R-rMAC, whose mAP decreases on the three datasets are likewise reported in Table 2. Thirdly, the low convertibility results also emerge between heterogeneous features. For example, when translating from SIFT-FV to DELF-FV, the performance is not high. Another example is the translation from DELF-VLAD to R-GeM, in which the former is extracted by ResNet50 and the latter by ResNet101. We attribute this to the different depths of the network architectures, the different training procedures and the different encoding/pooling schemes.

Some cross-feature retrieval results are shown in Fig. 4. The first column shows a successful translation from V-CroW to V-SPoC, where the ranking lists are almost the same. The second column shows an inferior translation from R-GeM to V-GeM. Interestingly, when querying an image of the Arc de Triomphe at night, the images of the Arc de Triomphe during the day are retrieved by the translated features and get high ranks, which inspires the integration of feature translation to improve cross-model retrieval. The most exciting result lies in the third column: although the translation from SIFT-FV to DELF-FV suffers from low performance, characteristics like rotation or viewpoint invariance can be highlighted by translation, which bridges the merits of the handcrafted features to the learning-based features. For example, the images taken from the bottom view of the Eiffel Tower and the Arc de Triomphe get high ranks (Rank@8 and Rank@2). The rotated images of them also have high ranks (both Rank@4). Then, in the fourth column, we show that these characteristics do not symmetrically exist in the reverse translation from DELF-FV to SIFT-FV. We attribute this to the limited ability of SIFT-FV.

4.3 Relation Mining Results

After calculating the directed affinity matrix and the undirected affinity matrix, we average the values over the three datasets and draw the heat maps. As shown in Fig. 5 (left), the values of the directed affinity matrix verify our assumption that the reconstruction error is smaller than the translation error, as all the values are positive. As shown in Fig. 5 (right), the positions of light and dark colors are almost the same as those of the translation results in Table 2, which indicates that the UAM can be used to predict the translation quality between two given features. To better study the relationship between features, we visualize the MST based on the undirected affinity matrix in Fig. 3. The images are the ranking lists for a query image with the corresponding features. Since the results of leaf nodes connected in the MST (R-CroW and R-SPoC) are very similar, we mainly show the results of nodes in the trunk of the MST. Closer features return more similar ranking lists, which supports the rationality of our affinity measurement from another perspective.

5 Conclusion

In this work, we present the first attempt to investigate visual feature translation, as well as the first attempt to quantify the affinity among different types of features in visual search. In particular, we propose a Hybrid Auto-Encoder (HAE) to translate visual features. Based on HAE, we design an Undirected Affinity Measurement (UAM) to quantify the affinity. Extensive experiments have been conducted on several public datasets with 16 different types of widely-used features in visual search. Quantitative results show the encouraging possibility of feature translation.


  • [1] R. Arandjelovic, P. Gronat, A. Torii, T. Pajdla, and J. Sivic. Netvlad: Cnn architecture for weakly supervised place recognition. In CVPR, 2016.
  • [2] R. Arandjelovic and A. Zisserman. Three things everyone should know to improve object retrieval. In CVPR, 2012.
  • [3] A. Argyriou, T. Evgeniou, and M. Pontil. Multi-task feature learning. In NIPS, 2007.
  • [4] A. Babenko and V. Lempitsky. Aggregating local deep features for image retrieval. In ICCV, 2015.
  • [5] A. Babenko, A. Slesarev, A. Chigorin, and V. Lempitsky. Neural codes for image retrieval. In ECCV, 2014.
  • [6] H. Bay, T. Tuytelaars, and L. Van Gool. Surf: Speeded up robust features. In ECCV, 2006.
  • [7] S. Belongie, J. Malik, and J. Puzicha. Shape matching and object recognition using shape contexts. PAMI, 2002.
  • [8] J. Blitzer, R. McDonald, and F. Pereira. Domain adaptation with structural correspondence learning. In EMNLP, 2006.
  • [9] W. Dai, G.-R. Xue, Q. Yang, and Y. Yu. Co-clustering based classification for out-of-domain documents. In SIGKDD, 2007.
  • [10] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale hierarchical image database. In CVPR, 2009.
  • [11] L. Duan, I. W. Tsang, and D. Xu. Domain transfer multiple kernel learning. PAMI, 2012.
  • [12] A. Gordo, J. Almazán, J. Revaud, and D. Larlus. Deep image retrieval: Learning global representations for image search. In ECCV, 2016.
  • [13] A. Gretton, D. Sejdinovic, H. Strathmann, S. Balakrishnan, M. Pontil, K. Fukumizu, and B. K. Sriperumbudur. Optimal kernel choice for large-scale two-sample tests. In NIPS, 2012.
  • [14] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.
  • [15] J. Huang, S. R. Kumar, M. Mitra, W.-J. Zhu, and R. Zabih. Image indexing using color correlograms. In CVPR, 1997.
  • [16] H. Noh, A. Araujo, J. Sim, T. Weyand, and B. Han. Large-scale image retrieval with attentive deep local features. In ICCV, 2017.
  • [17] H. Jégou and O. Chum. Negative evidences and co-occurences in image retrieval: The benefit of pca and whitening. In ECCV, 2012.
  • [18] H. Jegou, M. Douze, and C. Schmid. Hamming embedding and weak geometric consistency for large scale image search. In ECCV, 2008.
  • [19] H. Jegou, M. Douze, C. Schmid, and P. Pérez. Aggregating local descriptors into a compact image representation. In CVPR, 2010.
  • [20] H. Jegou, C. Schmid, H. Harzallah, and J. Verbeek. Accurate image search using the contextual dissimilarity measure. PAMI, 2010.
  • [21] Y. Kalantidis, C. Mellina, and S. Osindero. Cross-dimensional weighting for aggregated deep convolutional features. In ECCV, 2016.
  • [22] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. In ICLR, 2014.
  • [23] J. B. Kruskal. On the shortest spanning subtree of a graph and the traveling salesman problem. AM MATH SOC, 1956.
  • [24] J. Liu, M. Shah, B. Kuipers, and S. Savarese. Cross-view action recognition via view knowledge transfer. In CVPR, 2011.
  • [25] M. Long, Y. Cao, J. Wang, and M. I. Jordan. Learning transferable features with deep adaptation networks. In ICML, 2015.
  • [26] M. Long, J. Wang, Y. Cao, J. Sun, and P. S. Yu. Deep learning of transferable representation for scalable domain adaptation. TKDE, 2016.
  • [27] M. Long, J. Wang, G. Ding, J. Sun, and P. S. Yu. Transfer joint matching for unsupervised domain adaptation. In CVPR, 2014.
  • [28] M. Long, H. Zhu, J. Wang, and M. I. Jordan. Deep transfer learning with joint adaptation networks. In ICML, 2017.
  • [29] D. G. Lowe. Distinctive image features from scale-invariant keypoints. IJCV, 2004.
  • [30] M. Perďoch, O. Chum, and J. Matas. Efficient representation of local geometry for large scale object retrieval. In CVPR, 2009.
  • [31] K. Mikolajczyk, T. Tuytelaars, C. Schmid, A. Zisserman, J. Matas, F. Schaffalitzky, T. Kadir, and L. Van Gool. A comparison of affine region detectors. IJCV, 2005.
  • [32] J. Y.-H. Ng, F. Yang, and L. S. Davis. Exploiting local features from deep networks for image retrieval. In CVPRW, 2015.
  • [33] D. Nister and H. Stewenius. Scalable recognition with a vocabulary tree. In CVPR, 2006.
  • [34] S. J. Pan, I. W. Tsang, J. T. Kwok, and Q. Yang. Domain adaptation via transfer component analysis. TNN, 2011.
  • [35] S. J. Pan, Q. Yang, et al. A survey on transfer learning. TKDE, 2010.
  • [36] F. Perronnin, Y. Liu, J. Sánchez, and H. Poirier. Large-scale image retrieval with compressed fisher vectors. In CVPR, 2010.
  • [37] J. Philbin, O. Chum, M. Isard, J. Sivic, and A. Zisserman. Object retrieval with large vocabularies and fast spatial matching. In CVPR, 2007.
  • [38] J. Philbin, O. Chum, M. Isard, J. Sivic, and A. Zisserman. Lost in quantization: Improving particular object retrieval in large scale image databases. In CVPR, 2008.
  • [39] D. Qin, S. Gammeter, L. Bossard, T. Quack, and L. Van Gool. Hello neighbor: Accurate object retrieval with k-reciprocal nearest neighbors. In CVPR, 2011.
  • [40] F. Radenović, A. Iscen, G. Tolias, Y. Avrithis, and O. Chum. Revisiting oxford and paris: Large-scale image retrieval benchmarking. In CVPR, 2018.
  • [41] F. Radenović, G. Tolias, and O. Chum. Fine-tuning cnn image retrieval with no human annotation. PAMI, 2018.
  • [42] R. Raina, A. Battle, H. Lee, B. Packer, and A. Y. Ng. Self-taught learning: transfer learning from unlabeled data. In ICML, 2007.
  • [43] A. S. Razavian, H. Azizpour, J. Sullivan, and S. Carlsson. Cnn features off-the-shelf: an astounding baseline for recognition. In CVPRW, 2014.
  • [44] A. S. Razavian, J. Sullivan, S. Carlsson, and A. Maki. Visual instance retrieval with deep convolutional networks. MTA, 2016.
  • [45] E. Rublee, V. Rabaud, K. Konolige, and G. Bradski. Orb: An efficient alternative to sift or surf. 2011.
  • [46] O. Sener, H. O. Song, A. Saxena, and S. Savarese. Learning transferrable representations for unsupervised domain adaptation. In NIPS, 2016.
  • [47] K. Simonyan, A. Vedaldi, and A. Zisserman. Learning local feature descriptors using convex optimisation. PAMI.
  • [48] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
  • [49] J. Sivic and A. Zisserman. Video google: A text retrieval approach to object matching in videos. In ICCV, 2003.
  • [50] A. W. Smeulders, M. Worring, S. Santini, A. Gupta, and R. Jain. Content-based image retrieval at the end of the early years. PAMI, 2000.
  • [51] C. Tan, F. Sun, T. Kong, W. Zhang, C. Yang, and C. Liu. A survey on deep transfer learning. In ICANN, 2018.
  • [52] G. Tolias, R. Sicre, and H. Jégou. Particular object retrieval with integral max-pooling of cnn activations. In ICLR, 2016.
  • [53] E. Tzeng, J. Hoffman, N. Zhang, K. Saenko, and T. Darrell. Deep domain confusion: Maximizing for domain invariance. arXiv preprint arXiv:1412.3474, 2014.
  • [54] J. Zhang, W. Li, and P. Ogunbona. Joint geometrical and statistical alignment for visual domain adaptation. In CVPR, 2017.
  • [55] J. Zhao, M. Mathieu, and Y. LeCun. Energy-based generative adversarial network. In ICLR, 2017.
  • [56] L. Zheng, Y. Yang, and Q. Tian. Sift meets cnn: A decade survey of instance retrieval. PAMI, 2017.