Deep hashing establishes efficient and effective image retrieval by end-to-end learning of deep representations and hash codes from similarity data. We present a compact coding solution, focusing on the deep learning-to-quantization approach, which has shown superior performance over hashing solutions for similarity retrieval. We propose Deep Triplet Quantization (DTQ), a novel approach to learning deep quantization models from similarity triplets. To enable more effective triplet training, we design a new triplet selection approach, Group Hard, that randomly selects hard triplets in each image group. To generate compact binary codes, we further apply triplet quantization with weak orthogonality during triplet training. The quantization loss reduces codebook redundancy and enhances the quantizability of deep representations through back-propagation. Extensive experiments demonstrate that DTQ can generate high-quality and compact binary codes, yielding state-of-the-art image retrieval performance on three benchmark datasets: NUS-WIDE, CIFAR-10, and MS-COCO.
Approximate nearest neighbors (ANN) search has been widely applied to retrieve large-scale multimedia data in search engines and social networks. Due to its low storage cost and fast retrieval speed, learning to hash has become increasingly popular in the ANN research community; it transforms high-dimensional media data into compact binary codes and generates similar binary codes for similar data items. This paper focuses on data-dependent hashing schemes for efficient image retrieval, which have achieved better performance than data-independent hashing methods, e.g. Locality-Sensitive Hashing (LSH) (Gionis et al., 1999).
A rich line of hashing methods has been proposed to enable efficient ANN search using Hamming distance (Kulis and Darrell, 2009; Gong and Lazebnik, 2011; Norouzi and Blei, 2011; Fleet et al., 2012; Liu et al., 2012; Wang et al., 2012; Zhang et al., 2014b). Recently, deep hashing methods (Xia et al., 2014; Lai et al., 2015; Shen et al., 2015; Erin Liong et al., 2015; Zhu et al., 2016; Li et al., 2016; Liu et al., 2016; Do et al., 2016; Cao et al., 2017b; Jain et al., 2017) have shown that both image representation and hash coding can be learned more effectively using deep neural networks, yielding state-of-the-art results on many benchmark datasets. In particular, it proves crucial to jointly preserve similarity and control the quantization error of converting continuous representations to binary codes (Zhu et al., 2016; Li et al., 2016; Liu et al., 2016; Cao et al., 2017b). However, a pivotal weakness of these deep hashing methods is that they first learn continuous deep representations, and then convert them into hash codes in a separate binarization step. By continuous relaxation, i.e. solving the original discrete optimization of hash codes with continuous optimization, the optimization problem deviates significantly from the original hashing objective. As a result, these methods cannot learn exactly compact binary hash codes in their optimization.

To address the limitation of continuous relaxation, Deep Quantization Network (DQN) (Cao et al., 2016) and Deep Visual-Semantic Quantization (DVSQ) (Cao et al., 2017a) were proposed to integrate quantization methods (Ge et al., 2014; Zhang et al., 2014a; Wang et al., 2016) with deep learning. A quantization method represents each point by a short binary code formed by the index of the nearest center; it generates natively binary codes and empirically achieves better performance than hashing methods for ANN search. However, previous deep quantization methods are either point-wise methods that rely on expensive class-label information, or pairwise methods that cannot capture the relative similarity between images, i.e. a pair of images should not be seen as absolutely similar or dissimilar; rather, there should be a continuous spectrum from very similar to very dissimilar relations.
Recently, the triplet loss (Norouzi et al., 2012) has been studied for computer vision problems. The triplet loss captures relative similarity: it only requires bringing anchor images closer to positive samples than to negative samples, hence it fits ranking tasks naturally and achieves better performance than point-wise and pairwise losses for retrieval tasks. However, how to enable effective triplet training for deep learning to quantization with only pairwise similarity available remains a challenge. Note that, without effective triplet selection, a previous deep hashing method with the triplet loss (Lai et al., 2015) could not achieve superior results. Hence, how to select good triplets for effective training in deep quantization also remains an open problem.

Towards these open problems, this paper presents Deep Triplet Quantization (DTQ) for efficient and effective image retrieval, which introduces a novel triplet training strategy to deep quantization, offering superior retrieval performance. The proposed solution comprises four main components: 1) a novel triplet selection module, Group Hard, to mine good triplets for effective triplet training; 2) a standard deep convolutional neural network (CNN), e.g. AlexNet or ResNet, for learning deep representations; 3) a well-specified triplet loss for pulling together similar pairs and pushing away dissimilar pairs; and 4) a novel triplet quantization loss with a weak orthogonality constraint for converting the deep representations of different samples (the anchor, positive, and negative samples) in the triplets into B-bit compact binary codes. The weak orthogonality reduces the redundancy of codebooks and controls the quantizability of deep representations. Comprehensive empirical evidence shows that the proposed DTQ can generate compact binary codes and yield state-of-the-art retrieval results on three image retrieval benchmarks: NUS-WIDE, CIFAR-10, and MS-COCO.

Existing hashing methods can be categorized into unsupervised hashing and supervised hashing (Kulis and Darrell, 2009; Gong and Lazebnik, 2011; Norouzi and Blei, 2011; Fleet et al., 2012; Liu et al., 2012; Wang et al., 2012; Liu et al., 2013; Gong et al., 2013; Yu et al., 2014; Zhang et al., 2014b; Wang et al., 2015). Please refer to (Wang et al., 2018) for a comprehensive survey.
Unsupervised hashing methods learn hash functions that encode data points to binary codes by training on unlabeled data. Typical learning criteria include reconstruction error minimization (Salakhutdinov and Hinton, 2007; Gong and Lazebnik, 2011; Jegou et al., 2011) and graph learning (Weiss et al., 2009; Liu et al., 2011). Supervised hashing explores supervised information (e.g. given similarity or relevance feedback) to learn compact hash codes. Binary Reconstruction Embedding (BRE) (Kulis and Darrell, 2009) pursues hash functions by minimizing the squared errors between the distances of data points and the distances of their corresponding hash codes. Minimal Loss Hashing (MLH) (Norouzi and Blei, 2011) and Hamming Distance Metric Learning (Norouzi et al., 2012) learn hash codes by minimizing triplet loss functions based on the similarity of data points. Supervised Hashing with Kernels (KSH) (Liu et al., 2012) and Supervised Discrete Hashing (SDH) (Shen et al., 2015) build discrete binary codes by minimizing the Hamming distances across similar pairs and maximizing the Hamming distances across dissimilar pairs.

As deep convolutional networks (Krizhevsky et al., 2012; He et al., 2016) yield advantageous performance on many computer vision tasks, deep hashing methods have recently attracted wide attention. CNNH (Xia et al., 2014) adopts a two-stage strategy in which the first stage learns binary hash codes and the second stage learns a deep-network-based hash function to fit the codes. DNNH (Lai et al., 2015) improves CNNH with a simultaneous feature learning and hash coding pipeline such that deep representations and hash codes are optimized by the triplet loss. DHN (Zhu et al., 2016) and HashNet (Cao et al., 2017b) improve DNNH by jointly preserving the pairwise semantic similarity and controlling the quantization error, simultaneously optimizing the pairwise cross-entropy loss and the quantization loss via a multi-task approach.
Quantization methods (Cao et al., 2016, 2017a), which represent each point by a short code formed by the index of the nearest center, have been shown to give more powerful representation ability than hashing for approximate nearest neighbor search. To the best of our knowledge, Deep Quantization Network (DQN) (Cao et al., 2016) and Deep Visual-Semantic Quantization (DVSQ) (Cao et al., 2017a) are the only two prior works on deep learning to quantization. DQN jointly learns deep representations via a pairwise cosine loss and a product quantization loss (Jegou et al., 2011) for generating compact binary codes. DVSQ proposes a point-wise adaptive-margin hinge loss exploring class labels, and a visual-semantic quantization loss for inner-product search.
There are several key differences between our work and previous deep learning-to-quantization methods. 1) Our work introduces a novel triplet training strategy to the deep quantization framework for efficient similarity retrieval. It is worth noting that DTQ can learn compact binary codes when only relative similarity information is available, which is more general than the label-based quantization method DVSQ. 2) During the triplet learning procedure, DTQ adopts a novel triplet mining strategy, Group Hard, resulting in faster convergence and better search accuracy. 3) DTQ proposes a novel triplet quantization loss with a weak orthogonality constraint to reduce coding redundancy. An end-to-end architecture joining the above three terms yields both efficient and effective image retrieval.
In similarity retrieval, we are given training points {x_i}, where some pairs of points x_i and x_j come with pairwise similarity labels s_ij: s_ij = 1 if x_i and x_j are similar, while s_ij = 0 if x_i and x_j are dissimilar. The goal of deep learning to quantization is to learn a composite quantizer from the input space to a binary coding space through deep networks, which encodes each point x into a B-bit binary code b such that the supervision in the training data is maximally preserved. In supervised hashing, the similarity pairs are readily available from semantic labels or from relevance feedback in the click-through data of many image search engines.
We propose Deep Triplet Quantization (DTQ), an end-to-end architecture joining deep learning and quantization, as shown in Figure 1. DTQ has four key components: 1) a novel triplet selection module, Group Hard, to mine an appropriate number of good triplets for effective triplet training; 2) a standard deep convolutional neural network (CNN), e.g. AlexNet, VGG, or ResNet, for learning deep representations; 3) a well-specified triplet loss for pulling together similar pairs and pushing away dissimilar pairs; and 4) a novel triplet quantization loss with a weak orthogonality constraint for converting the deep representations of different samples (the anchor, positive, and negative samples) in triplets into B-bit compact binary codes and controlling the quantizability of the deep representations.
We train a convolutional network from image triplets (a, p, n). Each triplet (a, p, n) is constructed from pairwise similarity data as follows: for each anchor image a, we find a positive image p with s_ap = 1 (a and p are similar), and a negative image n with s_an = 0 (a and n are dissimilar). Given a triplet (a, p, n), the deep network maps the triplet into a learned feature space as (f(a), f(p), f(n)). We require that an anchor image a is closer to all positive images p than to all negative images n, where the relative similarity between the images in a triplet is measured by the squared Euclidean distances between their deep features, ||f(a) - f(p)||^2 and ||f(a) - f(n)||^2. Thus the triplet loss is

    L_triplet = sum over (a, p, n) in T of max(0, ||f(a) - f(p)||^2 - ||f(a) - f(n)||^2 + delta),    (1)

where delta is a margin that is enforced between positive and negative pairs, and T is the set of all possible triplets in the training set. Compared to the widely-used point-wise and pairwise metric-learning losses (Cao et al., 2016, 2017a) in previous deep quantization methods, the triplet loss (1) only requires anchor samples to be more similar to positive samples than to negative samples, by a specified margin. This establishes a relative similarity relation between images, which is much more reasonable than the absolute similarity relation used in previous point-wise or pairwise approaches.
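The triplet loss of Eq. (1) can be sketched in a few lines of NumPy; this is an illustrative batched implementation operating on precomputed features (the deep network is outside its scope), and the margin value used below is only an example:

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge-style triplet loss of Eq. (1) for a batch of feature triplets.

    Each argument is an (N, D) array of deep features. A triplet incurs zero
    loss once the anchor is closer to the positive than to the negative by
    at least `margin` (in squared Euclidean distance).
    """
    d_pos = np.sum((anchor - positive) ** 2, axis=1)  # ||f(a) - f(p)||^2
    d_neg = np.sum((anchor - negative) ** 2, axis=1)  # ||f(a) - f(n)||^2
    per_triplet = np.maximum(0.0, d_pos - d_neg + margin)
    return per_triplet.mean()
```

Note how the hinge makes "easy" triplets (negative already far enough away) contribute exactly zero, which is what motivates the hard-triplet mining discussed next.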
However, as the dataset gets larger, the number of triplets grows cubically, and generating all possible triplets would produce many easy triplets with zero loss in Eq. (1), which do not contribute to training and slow down convergence. Note that, without a sophisticated triplet selection procedure, previous deep hashing methods with the triplet loss (Lai et al., 2015) could not achieve superior performance. Consequently, it is crucial to mine good triplets for effective triplet training and faster convergence. In this paper, we propose a novel triplet selection module, Group Hard, which keeps the number of mined valid triplets neither too large nor too small. The core idea is that we first randomly split the training data into several groups G_1, ..., G_L, then randomly select one hard negative sample for each anchor-positive pair within each group. The proposed triplet selection method is formulated as
    T_GH = union over l, over a in G_l, over p in P_a^l, of { (a, p, rand(N_hard(a, p))) },    (2)

where P_a^l is the group of positive samples consisting of the samples similar to the anchor a in the l-th group G_l, and rand(.) is the random function that randomly chooses one negative sample from the group of hard negative samples N_hard(a, p) = {n in G_l : s_an = 0 and the triplet (a, p, n) has non-zero loss}. Here a hard negative sample is defined as one giving a non-zero loss value for the triplet (a, p, n). Note that mining only the triplets with the hardest negative images would select the outliers in the dataset and make it impossible to learn the ground-truth relative similarity. Thus the proposed DTQ only selects negative examples with moderate hardness, based on random sampling from the hard-negative set rather than always taking the hardest negative in Eq. (2).

As training proceeds, the average triplet loss becomes smaller and the number of hard triplets shrinks. To ensure that there are enough hard triplets in each epoch for effective triplet training, we design a decay strategy for the number of groups: if the actual number of valid hard triplets falls below the minimum number of valid hard triplets (the constant MINTRIPLETS in Algorithm 1), the number of groups is halved until only one group remains.

Complexity: Similar to previous work on triplet training (Zhao et al., 2017), we can prune the triplets with zero loss, resulting in a valid triplet set whose size is much smaller than the number of all possible triplets. Through the proposed Group Hard selection strategy, which chooses one negative sample for each anchor-positive pair in each group, the number of candidate triplets for training is further reduced. Furthermore, all the selected triplets are hard triplets (non-zero loss in Eq. (1)), and the total amount can be controlled within a suitable range by adjusting the number of groups, resulting in effective triplet training and higher retrieval accuracy.
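The Group Hard idea above can be sketched as follows. This is a simplified offline-selection sketch under stated assumptions: features are precomputed for the epoch, similarity is given as a 0/1 matrix, and the function name `group_hard_triplets` is hypothetical:

```python
import numpy as np

def group_hard_triplets(features, similarity, num_groups, margin=1.0, rng=None):
    """Group Hard sketch: split data into random groups; for each
    anchor-positive pair in a group, randomly pick ONE hard negative
    (a dissimilar sample whose triplet loss in Eq. (1) is non-zero).

    features:   (N, D) array of deep representations
    similarity: (N, N) 0/1 array, similarity[i, j] == 1 iff i and j are similar
    Returns a list of (anchor, positive, negative) index triplets.
    """
    rng = np.random.default_rng() if rng is None else rng
    groups = np.array_split(rng.permutation(len(features)), num_groups)
    triplets = []
    for group in groups:
        for a in group:
            for p in group:
                if a == p or similarity[a, p] != 1:
                    continue
                d_pos = np.sum((features[a] - features[p]) ** 2)
                # hard negatives: dissimilar samples with non-zero hinge loss
                hard = [n for n in group
                        if similarity[a, n] == 0
                        and d_pos - np.sum((features[a] - features[n]) ** 2) + margin > 0]
                if hard:
                    # random sampling (not the hardest) avoids fitting outliers
                    triplets.append((a, p, rng.choice(hard)))
    return triplets
```

Sampling one random hard negative per pair, instead of the hardest, is what the text calls "moderate hardness"; it also caps the triplet count at the number of anchor-positive pairs per group.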
While triplet training with Group Hard selection enables effective image retrieval, efficient image retrieval is enabled by a novel triplet quantization model. As each batch used for training the deep neural networks is composed of triplets, the proposed quantization model should be compatible with the triplet training. For each triplet, every image representation z (of the anchor, positive, or negative sample) is quantized using a set of M codebooks C_1, ..., C_M, where each codebook C_m contains K codewords, and each codeword is a D-dimensional cluster-centroid vector as in K-means. Corresponding to the M codebooks, we partition the binary codeword-assignment vector into M indicator vectors b_1, ..., b_M, where each indicator vector b_m indicates which one (and only one) of the K codewords in the m-th codebook is used to approximate the data point z. To enable knowledge sharing across the anchor, positive, and negative samples in the triplets, we propose a triplet quantization approach that shares the codebooks across all samples in all triplets. To mitigate the degeneration issue of K-means, we further impose a weak orthogonality penalty across the codebooks, which reduces the redundancy of the multiple codebooks and improves the compactness of the binary codes. The proposed triplet quantization model with the weak-orthogonality constraint is defined as

    L_quant = sum over z of ||z - sum_{m=1}^{M} C_m b_m||^2 + nu * sum over m != m' of ||C_m^T C_m'||_F^2,
    s.t. ||b_m||_0 = 1, b_m in {0, 1}^K,    (3)

where ||.||_0 is the l0 norm that simply counts the number of a vector's non-zero elements, and nu is the hyperparameter that controls the degree of orthogonality. The constraint ||b_m||_0 = 1 guarantees that only one codeword in each codebook can be activated to approximate the input data, which leads to compact binary codes. The underlying reason for using M codebooks instead of a single codebook to approximate each input data point is to further reduce the quantization error, as a single codebook yields significantly lossy compression and a large performance drop.
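A NumPy sketch of the loss in Eq. (3) follows. It is an illustrative reconstruction, not the reference implementation: assignments are stored as integer codeword indices (equivalent to the one-hot indicator vectors under the l0 constraint), and the weak-orthogonality term is realized here as squared cross-codebook inner products, one plausible reading of the penalty described above:

```python
import numpy as np

def triplet_quantization_loss(features, codebooks, assignments, nu=0.1):
    """Sketch of the weak-orthogonal triplet quantization loss of Eq. (3).

    features:    (N, D) deep representations (anchors, positives, and negatives
                 alike, since codebooks are shared across all triplet members)
    codebooks:   (M, K, D) array: M codebooks of K codewords each
    assignments: (N, M) integers; assignments[i, m] is the single codeword of
                 codebook m approximating point i (the one-hot constraint)
    nu:          weight of the weak-orthogonality penalty (hyperparameter)
    """
    M, K, D = codebooks.shape
    # reconstruction = sum over codebooks of each point's selected codeword
    recon = codebooks[np.arange(M), assignments].sum(axis=1)  # (N, D)
    quant_err = np.sum((features - recon) ** 2)
    # weak orthogonality: penalize inner products between codewords that
    # belong to DIFFERENT codebooks, reducing codebook redundancy
    flat = codebooks.reshape(M * K, D)
    gram = flat @ flat.T                                   # (M*K, M*K)
    cross = np.kron(1 - np.eye(M), np.ones((K, K)))        # cross-codebook mask
    ortho_pen = np.sum((gram * cross) ** 2)
    return quant_err + nu * ortho_pen
```

With nu = 0 this reduces to the plain additive-quantization reconstruction error, which is exactly the DTQ-O ablation discussed later.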
We enable efficient and effective image retrieval in an end-to-end architecture by integrating the triplet training procedure (1), the triplet selection module (2), and the weak-orthogonal quantization (3) into a unified deep triplet quantization (DTQ) model:

    min over Theta, C, B of L_triplet + lambda * L_quant,    (4)

where lambda is a hyperparameter trading off the triplet loss L_triplet and the triplet quantization loss L_quant, and Theta denotes the set of learnable parameters of the deep network. Through the joint optimization problem (4), we can learn binary codes that jointly preserve similarity via the triplet learning procedure and control the quantization error of binarizing continuous representations into compact binary codes. A notable advantage of joint optimization is that it improves the quantizability of the learned deep representations, so that they can be quantized more effectively by our weak-orthogonal quantizer (3), yielding more accurate binary codes.
Approximate nearest neighbor (ANN) search by maximum inner-product similarity is a powerful tool for quantization methods (Du and Wang, 2014). Given a database of binary codes, we follow (Cao et al., 2016, 2017a) and adopt the Asymmetric Quantizer Distance (AQD) as the metric, which computes the inner-product similarity between a given query q and the reconstruction of a database point x as

    AQD(q, x) = f(q)^T ( sum_{m=1}^{M} C_m b_m^x ),    (5)

where f(q) is the deep representation of the query and b_m^x are the indicator vectors of the database point. The inner products between f(q) and all M x K codewords can be precomputed and stored in a query-specific lookup table, which is then used to compute the AQD between the query and all database points; each computation entails only M table lookups and M - 1 additions, and is only slightly more costly than computing a Hamming distance.
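The lookup-table trick of Eq. (5) can be sketched as follows, assuming the query feature has already been extracted by the network (the function name `aqd_search` is illustrative):

```python
import numpy as np

def aqd_search(query_feature, codebooks, db_assignments):
    """Asymmetric Quantizer Distance (Eq. (5)) via a query-specific lookup table.

    query_feature:  (D,) deep representation f(q) of the query
    codebooks:      (M, K, D) shared codebooks
    db_assignments: (N, M) codeword indices of the N database points
    Returns the (N,) inner-product similarities; larger means more similar.
    """
    # Precompute inner products between the query and every codeword: M*K entries
    table = codebooks @ query_feature                          # (M, K)
    # Scoring each database point then costs M lookups and M-1 additions
    M = codebooks.shape[0]
    return table[np.arange(M), db_assignments].sum(axis=1)     # (N,)
```

Because the table depends only on the query, its O(M*K*D) cost is amortized over the entire database scan, which is what makes AQD competitive with Hamming-distance ranking.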
The DTQ optimization problem in Equation (4) consists of three sets of variables: the deep convolutional neural network parameters Theta, the shared codebooks C, and the binary codes B. We adopt an alternating optimization paradigm (Long et al., 2016), which iteratively updates one set of variables with the remaining variables fixed.
Learning Theta. The network parameters Theta can be efficiently optimized via the standard back-propagation (BP) algorithm. We adopt the automatic differentiation capabilities of TensorFlow.
Learning C. We update the codebooks C by rewriting Equation (4) with C as the unknown variable in matrix form:

    min over C of ||Z - C B||_F^2 + nu * sum over m != m' of ||C_m^T C_m'||_F^2,    (6)

where Z stacks the deep representations and B stacks the corresponding indicator vectors. We adopt gradient descent to update C:

    C <- C - eta * dL(C)/dC,    (7)

where L(C) denotes the objective in (6) and eta is a learning rate. We can further speed up computation by first solving (6) with nu = 0, which admits an analytic least-squares solution, and then using this solution as the starting point of the gradient descent.
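The analytic warm start mentioned above (the nu = 0 case of Eq. (6)) is an ordinary least-squares problem, since the quantization term is quadratic in C once the assignments are fixed. A sketch, with the helper name `update_codebooks` being an assumption of this example:

```python
import numpy as np

def update_codebooks(features, assignments, M, K):
    """Closed-form codebook update for Eq. (6) with nu = 0.

    Stacks the one-hot indicators into a matrix B of shape (N, M*K) and
    solves min_C ||X - B C||^2, i.e. a linear regression from indicators
    onto the deep features X. Returns codebooks of shape (M, K, D).
    """
    N, D = features.shape
    B = np.zeros((N, M * K))
    # column index of the selected codeword of codebook m is m*K + assignment
    B[np.arange(N)[:, None], assignments + np.arange(M) * K] = 1.0
    # lstsq handles rank deficiency (e.g. codewords never assigned)
    C_flat, *_ = np.linalg.lstsq(B, features, rcond=None)
    return C_flat.reshape(M, K, D)
```

In the full objective this solution is then refined by a few gradient steps that account for the weak-orthogonality term.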
Learning B. As each binary code is independent of the rest of B, the optimization over B decomposes into one subproblem per triplet member (three per triplet):

    min over b_1, ..., b_M of ||z - sum_{m=1}^{M} C_m b_m||^2,   s.t. ||b_m||_0 = 1, b_m in {0, 1}^K.    (8)

This is essentially a high-order Markov Random Field (MRF) problem. As the MRF problem is generally NP-hard, we resort to the Iterated Conditional Modes (ICM) algorithm (Zhang et al., 2014a), which solves the indicators alternately. Specifically, with the other indicators fixed, we update b_m by exhaustively checking all the codewords in codebook C_m, finding the codeword with the minimal objective in (8), and setting the corresponding entry of b_m to 1 and the rest to 0. The ICM algorithm is guaranteed to converge to a local minimum, and can be terminated when the maximum number of iterations is reached. The training procedure of DTQ is summarized in Algorithm 1.
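The ICM update for one point can be sketched as below; integer indices again stand in for the one-hot indicators, and `max_iter` is an illustrative stopping parameter:

```python
import numpy as np

def icm_assign(feature, codebooks, assignments, max_iter=3):
    """Iterated Conditional Modes for Eq. (8), for a single data point.

    Each sweep updates one codebook's indicator at a time: with the other
    M-1 indicators fixed, it exhaustively tries all K codewords and keeps
    the one minimizing the reconstruction error. Stops at a local minimum
    (no indicator changed) or after max_iter sweeps.
    """
    M, K, D = codebooks.shape
    assignments = assignments.copy()
    for _ in range(max_iter):
        changed = False
        for m in range(M):
            # residual the m-th codebook must explain, given the others
            others = sum(codebooks[j, assignments[j]] for j in range(M) if j != m)
            residual = feature - others
            errs = np.sum((codebooks[m] - residual) ** 2, axis=1)  # (K,)
            best = int(np.argmin(errs))
            if best != assignments[m]:
                assignments[m] = best
                changed = True
        if not changed:
            break  # local minimum reached
    return assignments
```

Each sweep costs O(M*K*D) per point, which is why a small fixed number of sweeps suffices in practice.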


Table 1: MAP results of all methods on the three benchmark datasets.

Method    | NUS-WIDE                           | CIFAR-10                           | MS-COCO
          | 8 bits  16 bits  24 bits  32 bits  | 8 bits  16 bits  24 bits  32 bits  | 8 bits  16 bits  24 bits  32 bits
ITQ-CCA   | 0.526   0.575    0.572    0.594    | 0.315   0.354    0.371    0.414    | 0.501   0.566    0.563    0.562
BRE       | 0.550   0.607    0.605    0.608    | 0.306   0.370    0.428    0.438    | 0.535   0.592    0.611    0.622
KSH       | 0.618   0.651    0.672    0.682    | 0.489   0.524    0.534    0.558    | 0.492   0.521    0.533    0.534
SDH       | 0.645   0.688    0.704    0.711    | 0.356   0.461    0.496    0.520    | 0.541   0.555    0.560    0.564
CNNH      | 0.586   0.609    0.628    0.635    | 0.461   0.476    0.476    0.472    | 0.505   0.564    0.569    0.574
DNNH      | 0.638   0.652    0.667    0.687    | 0.525   0.559    0.566    0.558    | 0.551   0.593    0.601    0.603
DHN       | 0.668   0.702    0.713    0.716    | 0.512   0.568    0.594    0.603    | 0.607   0.677    0.697    0.701
HashNet   | 0.613   0.662    0.687    0.699    | 0.621   0.643    0.660    0.667    | 0.625   0.687    0.699    0.718
DQN       | 0.721   0.735    0.747    0.752    | 0.527   0.551    0.558    0.564    | 0.649   0.653    0.666    0.685
DVSQ      | 0.780   0.790    0.792    0.797    | 0.715   0.727    0.730    0.733    | 0.704   0.712    0.717    0.720
DTQ       | 0.795   0.798    0.799    0.801    | 0.785   0.789    0.790    0.792    | 0.758   0.760    0.764    0.767

We conduct extensive experiments to evaluate the efficacy of the proposed DTQ approach against several state-of-the-art shallow and deep hashing methods on three image retrieval benchmark datasets: NUS-WIDE, CIFAR-10, and MS-COCO. Project code and detailed configurations will be available at https://github.com/thuml.

The evaluation is conducted on three widely used image retrieval benchmark datasets: NUS-WIDE, CIFAR-10, and MS-COCO.
NUS-WIDE (http://lms.comp.nus.edu.sg/research/NUSWIDE.htm) (Chua et al., 2009) is a public image dataset containing 269,648 images in 81 ground-truth categories. We follow the experimental protocols of (Cao et al., 2016, 2017a): we randomly sample 5,000 images as query points, use the remaining images as the database, and randomly sample 10,000 images from the database for training.
CIFAR-10 (http://www.cs.toronto.edu/kriz/cifar.html) is a public dataset with 60,000 tiny images in 10 classes. We follow the protocol of (Cao et al., 2016): we randomly select 100 images per class as the query set and 500 images per class for training, with the remaining images as the database.
MS-COCO (http://mscoco.org) (Lin et al., 2014) is a dataset for image recognition, segmentation, and captioning. The current release contains 82,783 training images and 40,504 validation images, where each image is labeled with some of the 80 semantic concepts. We randomly sample 5,000 images as query points, use the rest as the database, and randomly sample 10,000 images from the database for training.
Following the standard evaluation protocol of previous work (Xia et al., 2014; Lai et al., 2015; Zhu et al., 2016; Cao et al., 2017a, b), the similarity information for hash function learning and for ground-truth evaluation is constructed from image labels: if two images i and j share at least one label, they are similar and s_ij = 1; otherwise they are dissimilar and s_ij = 0. Although we use ground-truth image labels to construct the similarity information, the proposed DTQ can learn compact binary codes whenever only similarity information is available, making it more general than label-based hashing and quantization methods (Cao et al., 2016, 2017a).
We compare the retrieval performance of DTQ with ten state-of-the-art hashing methods, including the supervised shallow hashing methods BRE (Kulis and Darrell, 2009), ITQ-CCA (Gong and Lazebnik, 2011), KSH (Liu et al., 2012), and SDH (Shen et al., 2015), and the supervised deep hashing methods CNNH (Xia et al., 2014), DNNH (Lai et al., 2015), DHN (Zhu et al., 2016), DQN (Cao et al., 2016), HashNet (Cao et al., 2017b), and DVSQ (Cao et al., 2017a). We evaluate retrieval quality based on three standard evaluation metrics: Mean Average Precision (MAP), Precision-Recall curves (PR), and Precision curves with respect to the number of top returned samples (P@N). To enable a direct comparison to the published results, all methods use identical training and test sets. We follow (Cao et al., 2016, 2017b, 2017a) and adopt MAP@5000 for the NUS-WIDE dataset, MAP@5000 for the MS-COCO dataset, and MAP@54000 for the CIFAR-10 dataset.

Our implementation of DTQ is based on TensorFlow. For the shallow hashing methods, we use the 4096-dimensional DeCAF features (Donahue et al., 2014) as image features. For the deep hashing methods, we use the original images as input and adopt AlexNet (Krizhevsky et al., 2012) as the backbone architecture. We fine-tune the layers conv1-fc7 copied from the AlexNet model pre-trained on ImageNet and train the last hash layer via back-propagation. As the last layer is trained from scratch, we set its learning rate to be 10 times that of the lower layers. We use mini-batch stochastic gradient descent (SGD) with 0.9 momentum as the solver, and cross-validate the learning rate over a range with a multiplicative step-size. We fix the number of codewords K in each codebook as in (Cao et al., 2017a); for each point, the binary code over all M codebooks then requires M log2 K bits. We fix the mini-batch size of triplets in each iteration, and set the initial number of groups per dataset for NUS-WIDE, MS-COCO, and CIFAR-10. We select the hyperparameters of the proposed DTQ and of all comparison methods using three-fold cross-validation.

The MAP results of all methods are listed in Table 1, showing that the proposed DTQ substantially outperforms all the comparison methods. Specifically, compared to SDH (Shen et al., 2015), the best shallow hashing method with deep features as input, DTQ achieves absolute increases of 11.1%, 33.0% and 20.7% in average MAP on NUS-WIDE, CIFAR-10, and MS-COCO, respectively. Compared to DVSQ (Cao et al., 2017a), the state-of-the-art deep quantization method with class labels as supervised information, DTQ outperforms DVSQ by margins of 0.8%, 6.2% and 4.9% in average MAP on NUS-WIDE, CIFAR-10, and MS-COCO, respectively.
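For reference, the MAP@N metric used throughout the tables can be computed as below; this is a generic sketch (the function name is illustrative), operating on the 0/1 relevance of each query's ranked retrieval list:

```python
import numpy as np

def mean_average_precision(retrieved_relevance, top_n):
    """MAP@N: mean over queries of Average Precision on the top-N returns.

    retrieved_relevance: (Q, R) 0/1 array; row q gives the ground-truth
    relevance of query q's retrieval list, already sorted by similarity.
    """
    aps = []
    for rel in retrieved_relevance[:, :top_n]:
        hits = np.flatnonzero(rel)          # ranks (0-based) of relevant items
        if len(hits) == 0:
            aps.append(0.0)
            continue
        # precision at each relevant rank, averaged over the relevant items
        precisions = (np.arange(len(hits)) + 1) / (hits + 1)
        aps.append(precisions.mean())
    return float(np.mean(aps))
```

So MAP@5000 on NUS-WIDE, for instance, would score each query's top 5,000 database returns this way and average over the 5,000 queries.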
The MAP results reveal several interesting insights. 1) Shallow hashing methods cannot learn discriminative deep representations and hash codes in an end-to-end framework, which explains why they are surpassed by deep hashing methods. 2) The deep quantization methods DQN and DVSQ learn less lossy binary codes by jointly preserving similarity information and controlling the quantization error, significantly outperforming the pioneering methods CNNH and DNNH, which do not reduce the quantization error.
The proposed DTQ improves substantially over the state-of-the-art DVSQ in three important respects: 1) DTQ introduces a novel triplet training strategy to the deep quantization framework for efficient similarity retrieval. It is worth noting that DTQ can learn compact binary codes when only similarity information is available, which is more general than the label-based method DVSQ. 2) During the learning of the triplet loss, DTQ adopts a novel triplet mining strategy, Group Hard, which mines an appropriate amount of good triplets for each epoch, resulting in effective triplet training and better performance. 3) DTQ is the first method to apply weak-orthogonal quantization during triplet training, and back-propagating the triplet quantization loss remarkably enhances the quantizability of the deep representations.
The retrieval performance in terms of Precision-Recall curves (PR) and Precision curves with respect to different numbers of top returned samples (P@N) is shown in Figures 2 and 3, respectively. These metrics are widely used in deployed practical systems. The proposed DTQ significantly outperforms all the comparison methods by large margins under these two evaluation metrics. In particular, DTQ achieves much higher precision at lower recall levels or smaller numbers of top samples than all compared baselines. This is very desirable for precision-oriented retrieval, where users focus on the top-N returned results with a small N. This justifies the value of our model for practical retrieval systems.


Table 2: MAP results of DTQ and its five variants on the three benchmark datasets.

Method  | NUS-WIDE                           | CIFAR-10                           | MS-COCO
        | 8 bits  16 bits  24 bits  32 bits  | 8 bits  16 bits  24 bits  32 bits  | 8 bits  16 bits  24 bits  32 bits
DTQ-H   | 0.753   0.758    0.763    0.769    | 0.741   0.747    0.751    0.754    | 0.708   0.714    0.722    0.729
DTQ-T   | 0.719   0.722    0.727    0.731    | 0.663   0.670    0.672    0.679    | 0.714   0.720    0.728    0.734
DTQ-2   | 0.752   0.757    0.761    0.768    | 0.718   0.722    0.726    0.731    | 0.717   0.725    0.733    0.739
DTQ-Q   | 0.769   0.773    0.777    0.781    | 0.750   0.761    0.763    0.765    | 0.721   0.727    0.734    0.740
DTQ-O   | 0.785   0.787    0.780    0.788    | 0.771   0.777    0.779    0.781    | 0.739   0.745    0.750    0.758
DTQ     | 0.795   0.798    0.799    0.801    | 0.785   0.789    0.790    0.792    | 0.758   0.760    0.764    0.767

We investigate five variants of DTQ: 1) DTQ-T replaces the triplet loss in (1) with the widely-used pairwise cross-entropy loss (Zhu et al., 2016; Cao et al., 2017b); 2) DTQ-H removes the Group Hard module, so no appropriate amount of good triplets is mined for each epoch during the learning of the triplet loss, as in (Lai et al., 2015); 3) DTQ-2 is the two-step variant of DTQ, which first learns the deep representations for all images and then generates compact binary codes via the weak-orthogonal quantization; 4) DTQ-Q replaces the proposed triplet quantization with the product quantization (Jegou et al., 2011) used in DQN (Cao et al., 2016); 5) DTQ-O removes the weak orthogonality penalty for redundancy reduction, i.e. sets nu = 0.
The MAP results for DTQ and its five variants with respect to different code lengths on the three benchmark datasets, NUS-WIDE, CIFAR-10, and MS-COCO, are reported in Table 2.
Triplet Loss. DTQ outperforms DTQ-T by very large margins of 7.4%, 11.8% and 3.8% in average MAP on NUS-WIDE, CIFAR-10, and MS-COCO, respectively. DTQ-T uses the widely-used pairwise cross-entropy loss (Zhu et al., 2016; Cao et al., 2017b), which achieved state-of-the-art results on previous similarity retrieval tasks. It is worth noting that the triplet loss is a learning-to-rank method that brings the anchor and positive samples closer while pushing away the negative samples. DTQ with the triplet loss is therefore better suited to similarity retrieval tasks and naturally achieves much better performance than DTQ-T.
Quantizability. Another observation is that, by jointly preserving similarity information in the deep representations of image triplets and controlling the quantization error of compact binary codes, DTQ outperforms DTQ-2 by 3.9%, 6.4% and 3.4% in average MAP on NUS-WIDE, CIFAR-10, and MS-COCO. This shows that end-to-end quantization improves the quantizability of deep feature representations and yields much more accurate retrieval results.
Triplet Quantization. After replacing the proposed triplet quantization with the product quantization (Jegou et al., 2011) used in DQN (Cao et al., 2016), DTQ-Q yields significantly lossy compression and incurs a notable performance drop of 2.3%, 2.9% and 3.2% in average MAP on NUS-WIDE, CIFAR-10, and MS-COCO, respectively. This shows that the proposed triplet quantization with weak orthogonality can learn more compact binary codes and enable more effective retrieval than product quantization.
Weak-Orthogonal Quantization. Finally, by removing the weak orthogonality penalty, DTQ-O incurs a performance drop of 1.3%, 1.2% and 1.4% in average MAP on the three datasets, NUS-WIDE, CIFAR-10, and MS-COCO, respectively. This confirms the importance of removing codebook redundancy and improving the compactness of binary codes for efficient image retrieval.
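One plausible instantiation of such a penalty discourages overlap between different codebooks by penalizing their pairwise inner products; the Frobenius-norm form below is an illustrative assumption, not necessarily the paper's exact term:

```python
import numpy as np

def orthogonality_penalty(codebooks):
    """Weak-orthogonality penalty (sketch): penalize inner products between
    codewords of different codebooks so each codebook encodes complementary,
    non-redundant information. Frobenius-norm form is an assumption."""
    penalty = 0.0
    for i in range(len(codebooks)):
        for j in range(i + 1, len(codebooks)):
            penalty += np.sum((codebooks[i] @ codebooks[j].T) ** 2)
    return penalty

# Two mutually orthogonal codebooks incur zero penalty.
c1 = np.array([[1.0, 0.0]])
c2 = np.array([[0.0, 1.0]])
print(orthogonality_penalty([c1, c2]))  # → 0.0
```

The penalty is "weak" in the sense that orthogonality is encouraged through the loss rather than enforced as a hard constraint.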
Group Hard. By using the proposed triplet mining strategy, Group Hard, DTQ outperforms DTQ-H by large margins of 3.8%, 4.0% and 4.4% in average MAP on the three benchmark datasets, NUS-WIDE, CIFAR-10, and MS-COCO, respectively. As shown in Figure 4, without mining an appropriate number of hard triplets, the Group All training of the triplet loss quickly stagnates, leading to suboptimal convergence quality and MAP results. The proposed strategy, Group Hard, randomly samples a proper number of useful triplets with hard examples from several randomly partitioned groups, resulting in effective training, faster convergence, and more accurate retrieval.
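The group-then-mine procedure can be sketched as follows. The group count, margin, and the hardness test (non-zero triplet loss) are illustrative choices, not the paper's exact settings:

```python
import numpy as np

def group_hard_triplets(labels, embeddings, n_groups=4, rng=None):
    """Group Hard (sketch): randomly partition the data into groups, then
    within each group keep only 'hard' triplets, i.e. those whose triplet
    loss is non-zero under a margin of 1.0 (illustrative hardness test)."""
    if rng is None:
        rng = np.random.default_rng(0)
    idx = rng.permutation(len(labels))          # random partition of the data
    triplets = []
    for group in np.array_split(idx, n_groups):
        for a in group:
            for p in group:
                if p == a or labels[p] != labels[a]:
                    continue                    # positive must share the label
                for n in group:
                    if labels[n] == labels[a]:
                        continue                # negative must differ
                    d_pos = np.sum((embeddings[a] - embeddings[p]) ** 2)
                    d_neg = np.sum((embeddings[a] - embeddings[n]) ** 2)
                    if d_pos - d_neg + 1.0 > 0:     # hard: violates the margin
                        triplets.append((a, p, n))
    return triplets
```

Restricting mining to each group keeps the candidate pool small and the selected hard triplets fresh across an epoch.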


Method      | 8 bits | 16 bits | 24 bits | 32 bits
DTQ-online  | 0.703  | 0.708   | 0.710   | 0.713
DTQ         | 0.785  | 0.789   | 0.790   | 0.792

Online Selection. Selecting all batch samples as negatives is also known as online triplet selection in the literature. Here we conduct a new experiment that uses online triplet selection and selects all hard negative samples in a batch (samples per batch = 192) for each anchor-positive pair. The results are reported in Table 3. Due to the low ratio of valid hard triplets in each batch during triplet training, DTQ-online (with online triplet selection) fails to achieve satisfactory retrieval results compared with the proposed DTQ.
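A minimal sketch of online hard-negative selection for a single anchor-positive pair, assuming a margin of 1.0 and squared L2 distance:

```python
import numpy as np

def online_hard_negatives(anchor_idx, positive_idx, labels, embeddings,
                          margin=1.0):
    """Online selection (sketch): for one anchor-positive pair, scan the
    current batch and keep every negative that violates the margin. With few
    true hard negatives per batch, most of the scan is wasted, matching the
    low valid-triplet ratio discussed in the text."""
    a, p = embeddings[anchor_idx], embeddings[positive_idx]
    d_pos = np.sum((a - p) ** 2)
    hard = []
    for n, lab in enumerate(labels):
        if lab == labels[anchor_idx]:
            continue                              # same class: not a negative
        d_neg = np.sum((a - embeddings[n]) ** 2)
        if d_pos - d_neg + margin > 0:            # violates the margin
            hard.append(n)
    return hard
```

As the batch is small relative to the dataset, the pool of candidate negatives is limited, which is one reason online selection underperforms here.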
As online triplet selection cannot achieve satisfactory results, we adopt offline triplet selection, which selects the valid hard triplets at the beginning of each epoch. However, the offline strategy may generate too many candidate triplets and require a huge number of batches per epoch, leaving hard triplets outdated for training and potentially wasting most batches of each epoch. To alleviate this outdated effect of hard triplets in offline selection, we split the data into groups and select hard triplets within each group, which substantially reduces the number of candidate training triplets.
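A rough count illustrates how grouping shrinks the candidate pool. The helper below assumes a class-balanced dataset and balanced random groups, which is an idealization for illustration only:

```python
def candidate_triplet_count(n_per_class, n_classes):
    """Count (anchor, positive, negative) combinations in a dataset with
    `n_classes` classes of `n_per_class` samples each (toy counting helper,
    assuming perfectly balanced classes)."""
    n = n_per_class * n_classes
    pos_pairs = n_classes * n_per_class * (n_per_class - 1)   # ordered (a, p)
    negatives = n - n_per_class                               # per anchor
    return pos_pairs * negatives

# Splitting 1000 samples (10 classes x 100) into 10 balanced groups of 100
# (10 per class) shrinks the candidate pool by about two orders of magnitude.
full = candidate_triplet_count(100, 10)
grouped = 10 * candidate_triplet_count(10, 10)
print(full, grouped)  # → 89100000 810000
```

Since triplet counts grow cubically with group size, even a modest number of groups yields a large reduction.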
We conduct an experiment to count the number of outdated hard triplets during training, shown in Figure 6. By splitting the training data into groups, the number of outdated hard triplets is significantly reduced, leading to much better MAP results than the original offline triplet selection over the full training set. This validates the effectiveness of the proposed offline selection strategy, Group Hard.
We show t-SNE visualizations of the binary codes and illustrations of the top 10 returned images to better understand the impressive performance improvement of DTQ.
Visualization of Representations. Figure 5 shows the t-SNE visualizations (van der Maaten and Hinton, 2008) of the deep representations learned by DVSQ (Cao et al., 2017a), DTQ-2, and DTQ on the CIFAR-10 dataset. The deep representations of the proposed DTQ exhibit clear discriminative structures, with data points of different categories well separated, while the deep representations of DVSQ (Cao et al., 2017a) exhibit relatively vague structures. This validates that, by introducing triplet training to deep quantization, the representations generated by DTQ are more discriminative than those generated by DVSQ, enabling more accurate image retrieval. The deep representations of DTQ are also more discriminative than those of the two-step variant DTQ-2, showing the efficacy of jointly preserving similarity information in the deep representations of image triplets and controlling the quantization error of compact binary codes via backpropagation.
Illustration of Top 10 Results. Figure 7 illustrates the top 10 returned images of DTQ and the best deep hashing baseline, DVSQ (Cao et al., 2017a), for three query images on the three datasets NUS-WIDE, CIFAR-10, and MS-COCO, respectively. DTQ yields much more relevant and user-desired retrieval results than the state-of-the-art baseline.
This paper proposed Deep Triplet Quantization (DTQ) for efficient image retrieval, which introduces a triplet training strategy into the deep quantization framework. Through a novel triplet selection module, Group Hard, an appropriate number of hard triplets are selected for effective triplet training and faster convergence. To enable efficient image retrieval, DTQ learns compact binary codes by jointly optimizing a novel triplet quantization loss with weak orthogonality. Comprehensive experiments justify that DTQ generates compact binary codes and yields state-of-the-art retrieval performance on three benchmark datasets, NUS-WIDE, CIFAR-10, and MS-COCO.
This work is supported by the National Key R&D Program of China (2016YFB1000701) and NSFC grants (61772299, 61672313, 71690231).