Domain adaptive image retrieval includes single-domain retrieval and cross-domain retrieval. Most existing image retrieval methods focus only on single-domain retrieval, which assumes that the distributions of retrieval databases and queries are similar. In practical applications, however, there are large discrepancies between retrieval databases, which are often captured under ideal illumination/pose/background/camera conditions, and queries, which are usually obtained under uncontrolled conditions. In this paper, motivated by practical applications, we focus on the challenging cross-domain retrieval problem. To address it, we propose an effective method named Probability Weighted Compact Feature Learning (PWCF), which provides inter-domain correlation guidance to improve cross-domain retrieval accuracy and learns a series of compact binary codes to improve retrieval speed. First, we derive our loss function through Maximum A Posteriori Estimation (MAP): a Bayesian Perspective (BP) induced focal-triplet loss, a BP induced quantization loss and a BP induced classification loss. Second, we propose a common manifold structure between domains to explore the potential correlation across domains. Because the original feature representation is biased by the inter-domain discrepancy, this manifold structure is difficult to construct. Therefore, we propose a new feature named Histogram Feature of Neighbors (HFON) from the sample statistics perspective. Extensive experiments on various benchmark databases validate that our method outperforms many state-of-the-art image retrieval methods for domain adaptive image retrieval. The source code is available at https://github.com/fuxianghuang1/PWCF
Domain adaptive image retrieval, which includes single-domain retrieval and cross-domain retrieval, is an important task for many practical applications. Single-domain retrieval refers to image retrieval in which the queries and the database both come from the same domain. In contrast, cross-domain retrieval allows the queries and the database to come from different domains, which is more flexible and applicable in real-world scenarios. In practice, retrieval databases are often captured under ideal illumination/pose/background/camera conditions, whereas queries are usually obtained under uncontrolled conditions, which leads to a large discrepancy between the databases and the queries. For example, mobile product image search aims at identifying a product, or retrieving similar products from the online shopping domain, based on a photo captured in unconstrained scenarios by a mobile phone camera.
However, as shown in Fig. 1, most existing methods focus only on single-domain retrieval, and their performance deteriorates rapidly on cross-domain retrieval. Few solutions to the cross-domain retrieval problem have been proposed. DARN simultaneously integrates attributes and a visual similarity constraint into retrieval feature learning to solve the cross-domain retrieval problem. However, attributes are usually insufficient, and sorting high-dimensional features requires a lot of computation, resulting in slow retrieval.
Recently, owing to the low storage cost and high computational efficiency of binary codes, hashing algorithms have been widely used in many applications [6, 8, 13, 15, 24, 26, 30, 33, 36]. Hashing aims to map the high-dimensional content features of samples into Hamming space (binary space) and to generate a set of low-dimensional binary codes that represent the samples. Consequently, the cost of data storage is greatly reduced, and retrieval is accelerated because the Hamming distance can be computed with a bitwise operation (XOR).
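To make the XOR-based Hamming distance concrete, the following is a minimal sketch in Python with NumPy. The helper names `pack_codes` and `hamming_distances` are hypothetical and are not taken from the released PWCF code; the sketch only illustrates why binary codes are cheap to store and compare.

```python
import numpy as np

def pack_codes(bits):
    """Pack an (n, r) array of 0/1 bits into (n, ceil(r / 8)) uint8 bytes."""
    return np.packbits(np.asarray(bits, dtype=np.uint8), axis=1)

# Lookup table: number of set bits in every possible byte value.
_POPCOUNT = np.array([bin(v).count("1") for v in range(256)], dtype=np.uint8)

def hamming_distances(query, database):
    """Hamming distance from one packed query code to every packed database code."""
    xor = np.bitwise_xor(database, query)      # a bit is 1 exactly where the codes differ
    return _POPCOUNT[xor].sum(axis=1)          # count the differing bits per row

db_bits = np.array([[1, 0, 1, 1, 0, 0, 1, 0],
                    [1, 0, 1, 1, 0, 0, 1, 1],
                    [0, 1, 0, 0, 1, 1, 0, 1]])
query_bits = np.array([[1, 0, 1, 1, 0, 0, 1, 0]])

dists = hamming_distances(pack_codes(query_bits)[0], pack_codes(db_bits))
ranking = np.argsort(dists, kind="stable")     # database indices, nearest first
```

Here an 8-bit code occupies a single byte, and ranking the database only needs XOR and a table lookup.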
However, most existing hashing methods [14, 16, 18, 19, 21, 23, 29, 34] assume that the distributions of retrieval databases and queries are similar and ignore the inter-domain discrepancy, which makes it difficult for them to accurately capture correlations between cross-domain samples. Consequently, although most existing hashing methods have achieved significant performance for single-domain retrieval, they perform poorly when queries and databases come from different domains.
To address the above problem, we propose an effective domain adaptive image retrieval method named Probability Weighted Compact Feature Learning (PWCF), which takes into account the similarity/dissimilarity relations between samples from different domains to learn compact binary feature representations. Inspired by transfer learning (TL), we transfer knowledge across domains and explore cross-domain sample correlations. Our goal is to use the available labeled data as a source domain to help learn the projection matrix and obtain more discriminative binary codes. Unlike existing transfer hashing methods [37, 20], we do not simply add source domain data to expand the training set for better retrieval in the target domain; instead, we focus on exploring the correlation of samples and the data distribution discrepancy between domains to achieve good performance on cross-domain retrieval. To improve the performance of cross-domain retrieval, we derive our loss functions from a Bayesian Perspective. Specifically, we derive the BP induced focal-triplet loss, the BP induced quantization loss and the BP induced classification loss by seeking the Maximum A Posteriori Estimation (MAP) solution, in order to promote the correlation between samples from different domains in Hamming space, ensure the discrimination of binary codes, and reduce the information error caused by quantization.
Besides, since the underlying manifold structure across different domains is extremely helpful for capturing meaningful nearest-neighbor correlations between domains, we propose a common manifold that captures the inherent neighborhood structure of the source domain and the target domain, to further ensure that the correlation between domains is preserved in Hamming space. However, the similarity between samples from different domains is difficult to measure in terms of the original content features: because of the inter-domain discrepancy, samples of the same class from different domains may not be close. To handle this problem, we consider the distribution characteristics of the nearest neighbors of each sample in its own domain and propose a new statistical feature called Histogram Feature of Neighbors (HFON), from the perspective of sample statistics, to reduce the influence of the data distribution discrepancy between domains. The main contributions and novelties of this paper are summarized as follows:
In this paper, we propose an effective domain adaptive image retrieval method named Probability Weighted Compact Feature Learning (PWCF) to achieve fast and accurate retrieval. Fig. 2 shows the framework of our PWCF. To the best of our knowledge, we are the first to propose a new and practical adaptive cross-domain retrieval problem.
We propose loss functions named BP induced focal-triplet loss, BP induced quantization loss, and BP induced classification loss, which seek the Maximum A Posteriori Estimation (MAP) solution to explore the similarity/dissimilarity between samples from different domains, ensure discrimination, and reduce the information error.
In our PWCF, we propose a Histogram Feature of Neighbors (HFON) from the perspective of statistics to reduce the influence of the domain disparity and a common manifold structure based on HFON to further preserve the correlation between samples from different domains.
Extensive experiments on various benchmark databases have been conducted. The experimental results verify that our method outperforms many state-of-the-art image retrieval methods for both cross-domain retrieval and single-domain retrieval.
Suppose that we have target samples unlabeled and source samples labeled . , where
is the label vector, of which the maximum item indicates the assigned class of. We denote and . We aim to learn a set of compact binary codes and to represent the samples, where is the corresponding binary codes of and is the corresponding binary codes of . and represent the original content feature dimension of each sample and the length of binary codes, respectively. In PWCF, both the data of the source domain and the target domain are used to learn a projection . Then, the -dimensional feature ( i.e., the continuous real values of binary codes) is denoted as . The binary codes are quantified as . Here sgn() is the sign function, which returns if and otherwise. In this paper, is the norm for vectors and Frobenius norm for matrices.
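The projection-then-sign step described above can be sketched as follows. This is only an illustration under stated assumptions: the names `X`, `W`, `V` and `B` are chosen here for readability, the samples are stored as columns, and the convention that sgn maps positive values to +1 and everything else to -1 is assumed.

```python
import numpy as np

def binarize(X, W):
    """Project d-dimensional features (columns of X) with W (d x r) and take the sign."""
    V = W.T @ X                       # continuous r-dimensional features, shape (r, n)
    B = np.where(V > 0, 1, -1)        # assumed convention: +1 if positive, -1 otherwise
    return V, B

rng = np.random.default_rng(0)
X = rng.standard_normal((6, 4))                       # 4 toy samples with d = 6
W, _ = np.linalg.qr(rng.standard_normal((6, 3)))      # r = 3, orthonormal columns
V, B = binarize(X, W)

# Squared Frobenius norm of the gap between the codes and their real-valued relaxation.
quantization_error = np.linalg.norm(B - V) ** 2
```

The quantity `quantization_error` is the kind of gap between binary codes and continuous features that a quantization loss is meant to keep small.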
In order to achieve higher accuracy, we hope to explore the correlation of different samples. Given a triplet , let represent the pair-wise similarity between and . means they have the same label. Instead, means they have different labels.
Without loss of generality, let
be the posterior probability of feature representation, , for triplet sample set , , . Here we suppose that , , are the -dimensional features of samples
, respectively. With the assumption of conditional independence of each pair and Bayesian formulation, the joint posterior probability density function of the triplet training set can be generally represented as:
where is the likelihood probability and , and
are the prior probability of
-dimensional feature. We suppose that the likelihood probability density function to be exponential distribution, considering that the exponential distribution has shown fast convergence to a stable state. Letand . Considering the sample pair similarity, the likelihood probability density function is expressed as:
where , and is the XOR operation. is the margin. The purpose of this setting is to make the samples of the same class closer and the samples of different classes farther.
We aim to seek the solution of Maximum A Posteriori Estimation (MAP) of Eq. (1) from the Bayesian perspective. To mitigate the influence of the likelihood probability of hard pairs on the posterior probability maximization, we add a modulating factor to the likelihood probability where and . In other words, the modulating factor reduces the contribution of easy pairs and penalizes more on those hard pairs. For convenience, the same (different) labeled samples which have a large (small) distance are named as hard pairs, and the same (different) labeled samples that have a small (large) distance are named as easy pairs.
Besides, considering the quantization loss and discrimination of binary code, the prior probability is written as , where , and are hyper-parameters.
is a classifier and we will discuss the detail later in classification loss. By taking the natural logarithm, our objective function is written as:
In optimization, we consider the case that positive pairs and negative pairs both exist. In order to construct triplets, we can set as the anchor, is similar to the anchor, and is dissimilar to the anchor. Then Eq. (3) is
where denotes the operator of , which makes sure and improves the convergence. Clearly, if without , the probability of the exponential probability in Eq.(2) may be larger than 1. So, in Eq.(4), the is naturally resulted with clear probabilistic interpretation.
BP induced Focal-triplet loss. The first term in Eq. (3) is a variant of the standard triplet loss named BP induced focal-triplet loss. Enumerating all sample pairs would make training very time-consuming, so we only pick some cross-domain triplets, which are more effective at promoting inter-domain correlations. In other words, we construct cross-domain triplets before training. Specifically, for each sample, if it comes from the source domain, we select a positive sample and a negative sample from the target domain. Otherwise, if it is from the target domain, we select a positive sample and a negative sample from the source domain. Since there is no label in the target domain, we first use the source domain data to predict the pseudo-labels of the target domain by the KNN algorithm. We can get cross-domain triplets. For ease of understanding, let represent all selected triplets, where and come from different domains. If , then . Otherwise, if , then . Then, the BP induced focal-triplet loss can be written as:
where is the weight of group selected triplet and . As shown in Fig. 3, the focal-triplet loss, which is a variant of the standard triplet loss, assigns different importance to different triplets by down-weighting easy pairs and up-weighting hard pairs. In the training phase, we choose the hard triplets, i.e., those with the largest intra-class distance and the smallest inter-class distance, to improve the training speed. Because of the data distribution discrepancy between domains, the Euclidean distance between original content features extracted from different domains may not reflect the similarity of samples. So we use the Histogram Feature of Neighbors rather than the original content feature to calculate the distance between samples across domains. The Histogram Feature of Neighbors will be explained in detail in the next section.
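The triplet construction described above (KNN pseudo-labels for the target domain, then a hard positive and a hard negative taken from the other domain) could look roughly like the sketch below. The function names, the majority-vote rule and the value of `k` are assumptions made for illustration. The sketch measures Euclidean distance on whatever features are passed in, so HFON vectors can be supplied in place of raw features for the cross-domain selection.

```python
import numpy as np

def knn_pseudo_labels(Xs, ys, Xt, k=3):
    """Label each target sample by majority vote of its k nearest source samples."""
    d2 = ((Xt[:, None, :] - Xs[None, :, :]) ** 2).sum(axis=2)
    nn = np.argsort(d2, axis=1)[:, :k]
    return np.array([np.bincount(ys[row]).argmax() for row in nn])

def cross_domain_triplets(Xa, ya, Xo, yo):
    """For each anchor, pick the farthest same-class sample (hard positive) and the
    nearest different-class sample (hard negative) from the other domain."""
    d2 = ((Xa[:, None, :] - Xo[None, :, :]) ** 2).sum(axis=2)
    triplets = []
    for i in range(len(Xa)):
        same = yo == ya[i]
        if same.all() or not same.any():
            continue                                   # no valid positive or negative
        pos = np.where(same)[0][np.argmax(d2[i, same])]
        neg = np.where(~same)[0][np.argmin(d2[i, ~same])]
        triplets.append((i, int(pos), int(neg)))
    return triplets

Xs = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]])   # labeled source
ys = np.array([0, 0, 1, 1])
Xt = np.array([[0.2, 0.1], [5.1, 5.2], [0.1, 0.9]])               # unlabeled target

yt = knn_pseudo_labels(Xs, ys, Xt, k=3)
triplets = cross_domain_triplets(Xs, ys, Xt, yt)   # source anchors, target pos/neg
```

Calling `cross_domain_triplets(Xt, yt, Xs, ys)` gives the symmetric case with target anchors and source positives and negatives.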
BP induced quantization loss. The second term in Eq. (4) is named BP induced quantization loss, which aims to reduce quantization error between binary codes and low-dimensional feature representation obtained by mapping (i.e., the continuous real values of binary codes). The BP induced quantization loss can be formulated as:
BP induced classification loss. The third term in Eq. (4) is named BP induced classification loss. Inspired by SDH , we consider that good binary codes should be discriminative. We take advantage of the label information to train a classifier and represent the predicted label of the sample. We want the labels predicted from the binary codes to be as close to the true labels as possible. In this paper, to avoid the negative impact of pseudo-labels, we only use the source domain samples when we calculate the classification loss. The BP induced classification loss can be formulated as:
The regularization , i.e., the last term in Eq. (3), is used to avoid trivial solution and overfitting.
We argue that the nearest-neighbor relationship of samples in a single domain is regular. In other words, if two samples from different domains are similar, the classes of their neighbors in their respective domains should be similar. Based on this assumption, we propose a statistical feature named Histogram Feature of Neighbors (HFON) to reduce the domain disparity. Specifically, we use to represent the HFON vector of and is the number of classes. We find nearest neighbors of each sample in their respective domains and calculate the probability of each class of these nearest-neighbor samples. The element of the HFON can be written as where , represents the total number of samples belonging to class in the nearest neighbors of the sample. Fig. 4 shows the details of the Histogram Feature of Neighbors.
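As a minimal sketch of the HFON computation described above, the function below finds the nearest neighbors of every sample inside one domain and returns the fraction of neighbors that fall into each class. The function name `hfon`, the exclusion of a sample from its own neighbor list, and the toy value of `k` are assumptions. For the target domain, the labels passed in would be the KNN pseudo-labels.

```python
import numpy as np

def hfon(X, labels, num_classes, k):
    """Histogram Feature of Neighbors for the rows of X, computed within one domain."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    np.fill_diagonal(d2, np.inf)            # assumption: a sample is not its own neighbor
    nn = np.argsort(d2, axis=1)[:, :k]      # indices of the k nearest neighbors
    H = np.zeros((len(X), num_classes))
    for i, row in enumerate(nn):
        # Count neighbors per class and normalize by k to obtain class probabilities.
        H[i] = np.bincount(labels[row], minlength=num_classes) / k
    return H

X = np.array([[0.0], [1.0], [2.0], [10.0], [11.0], [12.0]])
labels = np.array([0, 0, 0, 1, 1, 1])
H = hfon(X, labels, num_classes=2, k=2)
```

Each row of `H` sums to 1, and two samples from different domains can be compared through their rows even when their raw features are not directly comparable.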
The underlying manifold structure across different domains is extremely helpful for capturing meaningful nearest neighbors across domains. Therefore, we want to keep the common manifold structure by taking advantage of local similarities. To minimize the representation error of the low-dimensional features between different neighbor samples, the manifold loss can be written as:
Similar to LPP , is the Laplacian matrix and . Here is a sparse symmetric matrix with having the weight of the edge connecting and , and 0 if there is no such connection. To reduce the domain disparity, let when and come from the same domain. Otherwise, when and come from different domains where and denote the HFON with respect to and , respectively.
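The LPP-style construction can be illustrated with a small numerical check. The weight matrix `S` below is a hypothetical toy example and does not reproduce the within-domain and HFON-based cross-domain weights of the paper. The sketch only shows how a graph Laplacian is formed from symmetric weights and that the trace form of a manifold loss equals a weighted sum of squared distances between connected features.

```python
import numpy as np

def graph_laplacian(S):
    """L = D - S, where D is the diagonal degree matrix of the symmetric weights S."""
    return np.diag(S.sum(axis=1)) - S

def manifold_loss(V, L):
    """tr(V L V^T) for r-dimensional features stored as the columns of V."""
    return float(np.trace(V @ L @ V.T))

S = np.array([[0.0, 1.0, 0.5],        # toy symmetric edge weights between 3 samples
              [1.0, 0.0, 0.0],
              [0.5, 0.0, 0.0]])
rng = np.random.default_rng(0)
V = rng.standard_normal((4, 3))       # 3 samples with 4-dimensional features
L = graph_laplacian(S)
loss = manifold_loss(V, L)

# Equivalent pairwise form: 0.5 * sum_ij S_ij * ||v_i - v_j||^2
pairwise = 0.5 * sum(S[i, j] * np.sum((V[:, i] - V[:, j]) ** 2)
                     for i in range(3) for j in range(3))
```

Minimizing the trace therefore pulls together the low-dimensional features of samples joined by large edge weights.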
Overall objective function. Finally, the overall objective function is rewritten as:
where the constraint, , is used to make be orthogonal projections in order to guarantee the discrimination of binary codes.
In this paper, we adopt an alternating optimization procedure to iteratively optimize , , and . As the non-convex sgn() function makes Eq. (8) an NP-hard problem, we relax the sgn() function as its signed magnitude .
-Step. Given , and , updating is a typical optimization problem with orthogonality constraints. Let be the partial derivative of the objective function Eq. (9) with respect to and is represented as:
where contains all selected cross-domain triplets and . Based on the orthogonal constraint optimization procedure in 
, we can define a skew-symmetric matrix as
. Then, we adopt a Crank-Nicolson-like scheme to update the orthogonal matrix
where denotes the step size. We empirically set . By solving Eq. (11), we can get
and . We iteratively update several times based on Eq. (12) with the Barzilai-Borwein (BB) method . In addition, please note that when iteratively optimizing , the initial is set to be the updated one in the last round. For the first round, is initialized by PCA.
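A Crank-Nicolson-like update under an orthogonality constraint is commonly written with a Cayley transform, and the sketch below follows that standard form. It assumes the skew-symmetric matrix is built as `A = G W^T - W G^T` from the gradient `G`, uses a toy objective chosen only for illustration, and takes a fixed hypothetical step size instead of the Barzilai-Borwein rule.

```python
import numpy as np

def cayley_step(W, G, tau):
    """One Crank-Nicolson-like update that keeps W^T W = I.

    W: (d, r) matrix with orthonormal columns; G: (d, r) gradient at W.
    """
    d = W.shape[0]
    A = G @ W.T - W @ G.T                       # skew-symmetric: A^T = -A
    I = np.eye(d)
    # Solve (I + tau/2 A) W_new = (I - tau/2 A) W instead of forming an inverse.
    return np.linalg.solve(I + 0.5 * tau * A, (I - 0.5 * tau * A) @ W)

# Toy objective (an assumption for illustration): f(W) = -tr(W^T M W).
rng = np.random.default_rng(0)
Z = rng.standard_normal((5, 5))
M = Z @ Z.T
M /= np.linalg.norm(M, 2)                       # scale so the largest eigenvalue is 1
W, _ = np.linalg.qr(rng.standard_normal((5, 2)))

f = lambda W: -np.trace(W.T @ M @ W)
G = -2.0 * M @ W                                # gradient of f at W
W_new = cayley_step(W, G, tau=0.05)             # tau is a hypothetical step size
```

Because the Cayley transform of a skew-symmetric matrix is orthogonal, `W_new` keeps orthonormal columns without any explicit re-orthogonalization.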
-Step. Given , and , taking the partial derivative of the objective function with respect to and setting it to zero, we derive
-Step. Given , and , by relaxing the sign function , the solution can be obtained
-Step. Given , and , we obtain the approximate solution for hash codes by relaxing the sign function.
The details of the proposed algorithm are described in Algorithm 1.
Computation Complexity: Since the hard triplets and the Laplacian matrix can be pre-computed, the total computation cost of our PWCF in Algorithm 1 is and linear to the number of samples, where . In practice, , , and will be much less than . Hence, the binary codes learning is efficient.
Datasets: We perform the experiments on four groups of benchmark datasets:
The VLCS  dataset aggregates photos from Caltech101, LabelMe, Pascal VOC2007 and SUN09, which provides a 5-way multi-class benchmark on five common classes: bird, car, chair, dog and person. In our experiments, every image is represented by a 4096-d CNN feature vector . In the VOC2007&Caltech101 dataset, the VOC2007 dataset with 3376 images is used as the source domain and the Caltech101 dataset with 1415 images is used as the target domain.
The Cross-dataset Testbed  is a Decaf7 based cross-dataset image classification dataset, which contains 40 categories of images from 3 domains: 3,847 images in Caltech256, 4,000 images in ImageNet, and 2,626 images for SUN. In our experiments, each image is represented by a 4096-d CNN feature vector . Caltech256 is used as the source domain and ImageNet is used as the target domain in the Caltech256&ImageNet dataset.
The Office-Home dataset  consists of images from 4 different domains: Artistic images (i.e., paintings, sketches and/or artistic depictions), Clip Art images, Product images without background and Real-World images (i.e., regular images captured with a camera). For each domain, the dataset contains images of 65 object categories found typically in Office and Home settings. In our experiments, each image is represented as a 4096-d feature by VGG-16. Each domain serves in turn as the source domain and as the target domain.
Implementation details: We choose eleven state-of-the-art hashing methods, including SH , ITQ , DSH , LSH , SGH , OCH , GTH , ITQ+ , LapITQ+ , KSH  and SDH  as baselines. We use the public codes and suggested parameters of these methods from the corresponding authors. For our PWCF, we empirically set to 1e2, to 1, to 1e3 and
to 1e4. For unsupervised methods, we use all the training samples, including those of the source domain and the target domain, in the training phase. For a fair comparison, we introduce a NoTL method that only uses the target domain to train the ITQ model. For supervised methods, we use the training samples and labels of the source domain for training. All of the methods use identical training sets and testing sets. Specifically, for each dataset, we randomly select 500 images from the target domain as the testing set (queries) and use the remaining images as the training set. In the testing phase, class labels are used to determine whether a sample returned for a given query is considered a true positive. Moreover, the widely used criterion, i.e., mean average precision (MAP), is used as the performance metric. To reduce the randomness of sampling, we run each algorithm 10 times and report the mean MAP. We also show the precision and recall curves.
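The evaluation protocol above (Hamming ranking, label-based true positives, mean average precision) can be sketched as follows. Note that MAP here denotes mean average precision, not Maximum A Posteriori. The function names and the toy codes are assumptions; codes are taken to be in {-1, +1} with one code per row.

```python
import numpy as np

def average_precision(relevant_sorted):
    """AP for one query, given a 0/1 relevance vector in ranked order."""
    rel = np.asarray(relevant_sorted, dtype=float)
    if rel.sum() == 0:
        return 0.0
    hits = np.cumsum(rel)                        # number of true positives so far
    ranks = np.arange(1, len(rel) + 1)
    return float((hits / ranks * rel).sum() / rel.sum())

def mean_average_precision(Bq, yq, Bd, yd):
    """MAP with Hamming ranking for codes in {-1, +1}."""
    r = Bq.shape[1]
    aps = []
    for code, label in zip(Bq, yq):
        dist = (r - Bd @ code) / 2               # Hamming distance via inner product
        order = np.argsort(dist, kind="stable")  # nearest database items first
        aps.append(average_precision(yd[order] == label))
    return float(np.mean(aps))

Bd = np.array([[1, 1], [1, -1], [-1, -1]])       # toy database codes
yd = np.array([0, 1, 0])
Bq = np.array([[1, 1]])                          # one toy query
yq = np.array([0])
score = mean_average_precision(Bq, yq, Bd, yd)
```

In the toy example the relevant items sit at ranks 1 and 3, so the average precision is (1/1 + 2/3) / 2.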
To verify the performance of our method in the scenarios of domain adaptive retrieval, we report the retrieval performance including cross-domain retrieval and single-domain retrieval on MNIST&USPS, VOC2007&Caltech101, and Caltech256&ImageNet databases when the code length is set as 16, 32, 48, 64, 96 and 128, respectively. For cross-domain retrieval, the training samples from the source domain are used as retrieval database. For the single-domain retrieval, the training samples from the target domain are used as retrieval database.
To further demonstrate the versatility and cross-domain retrieval performance of our method, extensive experiments were carried out on Office-Home. For simplicity, Artistic images, Clip Art images, Product images, and Real-World images are abbreviated as A, C, P and R, respectively. AC means that Artistic is the source domain and Clip Art is the target domain.
In Table 1, we report the MAP scores (%) of all the compared methods and our PWCF on MNIST&USPS, VOC2007&Caltech101, and Caltech256&ImageNet for cross-domain retrieval. Our PWCF outperforms the compared methods on all databases in most cases. To further verify the effectiveness of our method, we also evaluate single-domain retrieval; the results are shown in Table 2. Our method is superior to the compared methods in both cross-domain retrieval and single-domain retrieval.
In Table 3, we report the MAP scores (%) of all the compared methods and our PWCF with 64 bits on Office-Home for cross-domain retrieval. Regardless of how the source and target domains are chosen, our method performs better than the others. The results indicate that our PWCF generalizes well in practical applications. We also show the influence of the number of retrieved samples on the cross-domain retrieval task. Fig. 5 (a) shows the precision when the number of retrieved samples varies from 0 to 1000, and Fig. 5 (b) shows the recall over the same range. From the figures, we can see that our PWCF consistently presents competitive retrieval performance compared to the baselines, which demonstrates the efficacy of our PWCF.
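Precision and recall as functions of the number of retrieved samples can be computed from a ranked relevance list as in the sketch below. The function name and the toy relevance vector are assumptions, and the cutoffs start at 1 because precision is undefined when nothing is retrieved.

```python
import numpy as np

def precision_recall_at(relevant_sorted, cutoffs):
    """Precision and recall after retrieving the top-N items, for each N in cutoffs."""
    rel = np.asarray(relevant_sorted, dtype=float)
    hits = np.cumsum(rel)                         # true positives among the top-N
    total = rel.sum()                             # all relevant items in the database
    precision = [hits[n - 1] / n for n in cutoffs]
    recall = [hits[n - 1] / total if total > 0 else 0.0 for n in cutoffs]
    return precision, recall

ranked_relevance = [1, 0, 1, 1, 0]                # toy ranked list for one query
precision, recall = precision_recall_at(ranked_relevance, cutoffs=[1, 2, 4])
```

Averaging these values over all queries at each cutoff gives curves of the kind plotted against the number of retrieved samples.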
The proposed PWCF is solved with a variable alternating strategy, and the convergence can be guaranteed. We present the convergence curves of the objective function in Fig. 6, from which we see that PWCF can quickly converge to an optimal solution within several iterations.
We investigate six variants of PWCF in Table 4: (1) PWCF-T is the PWCF variant without the BP induced focal-triplet loss . (2) PWCF-F is the PWCF variant that replaces the BP induced focal-triplet loss with the standard triplet loss. (3) PWCF-M is the PWCF variant without the manifold loss . (4) PWCF-C is the PWCF variant without the BP induced classification loss . (5) PWCF-H is the PWCF variant without the Histogram Feature of Neighbors, which calculates the weight matrix by using the original content features. In this variant, the hard triplets are also constructed from the original features without the Histogram Feature of Neighbors. (6) PWCF-Q is the PWCF variant without the BP induced quantization loss . We report the results for different code lengths on the MNIST&USPS dataset for single-domain and cross-domain retrieval.
We can see that the four parts of our model have different effects on retrieval performance. Comparing PWCF-T and PWCF-F with PWCF shows that a triplet loss benefits the training of PWCF and that our proposed focal-triplet loss is better than the standard triplet loss. Comparing PWCF-M with PWCF shows that the underlying manifold structure across different domains is very helpful for capturing the correlation between samples. Comparing PWCF-H with PWCF shows that the proposed histogram features reduce the impact of the data distribution discrepancy between domains, and that the Euclidean distance of the original features is not a suitable measure of similarity between cross-domain samples.
In this paper, we propose an effective domain adaptive retrieval method named Probability Weighted Compact Feature Learning (PWCF), which learns compact binary codes to represent images. First, we propose the BP induced focal-triplet loss, the BP induced quantization loss and the BP induced classification loss from the Bayesian perspective to optimize the compact binary features of samples from different domains. Besides, the underlying manifold structure across different domains is used to capture meaningful nearest neighbors across domains and to further explore the potential correlation. To address the data distribution discrepancy issue, we propose a Histogram Feature of Neighbors (HFON) to measure the similarity/dissimilarity between samples from different domains. The experimental results show that our PWCF achieves higher retrieval performance in both cross-domain and single-domain retrieval scenarios, which verifies that our method outperforms many state-of-the-art image retrieval methods.
Acknowledgement: This work was supported by the National Science Fund of China under Grants (61771079), Chongqing Youth Talent Program, and the Fundamental Research Funds of Chongqing (No. cstc2018jcyjAX0250).
Numerical Solution of Partial Differential Equations. By G. D. Smith. pp. viii, 179. 25s. 1965. (Oxford University Press). Mathematical Gazette 50 (374), pp. 179–449.
International Conference on Machine Learning 50 (1), pp. I–647.
European Conference on Computer Vision.
International Conference on Artificial Intelligence, pp. 2248–2254.
Computer Vision and Pattern Recognition, pp. 1971–1978.
Transfer feature learning with joint distribution adaptation. In IEEE International Conference on Computer Vision.
IEEE Transactions on Neural Networks and Learning Systems.