In the big data era, large-volume, high-dimensional multimedia data is ubiquitous in social networks and search engines. This raises the major challenge of how to efficiently retrieve information from large-scale databases (Wang et al., 2018). To guarantee retrieval efficiency and quality, approximate nearest neighbour (ANN) search has attracted increasing attention in recent years. Parallel to traditional indexing methods, hashing is one of the most advantageous ANN methods, as it transforms high-dimensional multimedia data into compact binary codes and enables efficient XOR operations to accelerate computation in Hamming space. In this paper, we focus on learning-to-hash methods, which build upon data-dependent binary encoding schemes for efficient image retrieval and have demonstrated superior performance over data-independent hashing methods, e.g., LSH (Gionis et al., 1999).
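To make the efficiency argument concrete, the following minimal sketch shows how the Hamming distance between two binary codes reduces to an XOR followed by a bit count. It assumes, purely for illustration, that each code is packed into a Python integer; the function names are hypothetical and not part of the proposed method.

```python
def hamming_distance(a, b):
    """Number of differing bits between two binary codes packed into ints."""
    return bin(a ^ b).count("1")


def rank_by_hamming(query, database):
    """Indices of database codes, sorted by Hamming distance to the query."""
    return sorted(range(len(database)),
                  key=lambda i: hamming_distance(query, database[i]))


# Three illustrative 8-bit codes.
codes = [0b10110010, 0b10110011, 0b01001101]
print(rank_by_hamming(0b10110010, codes))
```

Because the comparison uses only bitwise operations on short codes, a linear scan over a large database remains cheap compared with distance computations on the original high-dimensional features.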
Generally, learning-to-hash methods can be divided into unsupervised and supervised groups. Compared with unsupervised methods (Zhang et al., 2018; Gong et al., 2013), supervised methods (Shen et al., 2015; Kang et al., 2016; Zhang et al., 2014, 2019b, 2019a) can yield better performance with the support of label supervision. With the rapid development of deep neural networks, deep hashing methods (Zhu et al., 2016; Cao et al., 2016; Li et al., 2015; Cao et al., 2017; Yang et al., 2016; Luo et al., 2018a, b) have demonstrated superior performance over non-deep hashing methods and achieved state-of-the-art results on public benchmarks.
However, in mainstream deep hashing frameworks, human-annotated labels only supervise the distribution alignment of the hash code embedding; they fail to trigger context-aware visual representation learning, let alone optimal binary code generation. Moreover, the correlations between features and semantics are not well explored for generating semantically consistent binary codes. Furthermore, existing supervised methods are vulnerable to imbalanced distributions of semantic labels: models tend to grasp concepts that appear frequently in the training data and disregard infrequent ones, which severely restricts the expressive capacity of the hash codes. Hence, existing deep hashing methods may fail to generate optimal binary codes for efficient image retrieval.
In this paper, we propose a novel Deep Collaborative Discrete Hashing (DCDH) method, which constructs a discriminative common discrete space via dual-stream learning, as illustrated in Figure 1. The main idea of the proposed framework is to construct a semantic-invariant space by bridging the gap between the visual space and the semantic space. Specifically, (1) we develop a bilinear representation learning framework, which fuses and strengthens visual-semantic correlations to learn context-aware binary codes. (2) We employ the outer product of visual features and label embeddings to generate more expressive representations than element-wise product or plain concatenation. To the best of our knowledge, this is one of the first attempts to utilize the outer product to capture pairwise correlations between heterogeneous spaces. (3) We seamlessly integrate our framework with the focal loss to enhance the discriminability of the generated binary codes and to mitigate the class-imbalance problem by reducing the weights on well-classified concepts and increasing the weights on rare concepts. (4) Extensive experiments conducted on benchmark datasets demonstrate that DCDH is capable of generating more discriminative and informative binary codes and yields state-of-the-art performance.
2. The Proposed Approach
2.1. Problem Formulation
Given a set of images, each paired with its corresponding one-hot label vector, deep hashing aims to encode the data as fixed-length binary codes. In our method, we mainly focus on pairwise similarity-preserving hashing. In particular, we construct the similarity information from the ground-truth labels: if two images share at least one common label, we define them as semantically similar and set the corresponding similarity entry to 1; otherwise, the entry is set to -1, indicating that they are dissimilar.
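The construction of the pairwise similarity information can be sketched as follows. This is a minimal illustration that assumes labels are given as 0/1 vectors and that similar and dissimilar pairs are encoded as 1 and -1, the values used in Section 2.2; the function name is hypothetical.

```python
def similarity_matrix(labels):
    """Pairwise similarity from 0/1 label vectors.

    Entry (i, j) is 1 if images i and j share at least one label, else -1.
    """
    n = len(labels)
    S = [[-1] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if any(a and b for a, b in zip(labels[i], labels[j])):
                S[i][j] = 1
    return S


# Three images annotated over three concepts.
labels = [[1, 0, 1], [0, 0, 1], [0, 1, 0]]
S = similarity_matrix(labels)
print(S)
```

Note that the resulting matrix only records whether two images overlap in at least one concept; how many concepts they share is not preserved.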
2.2. Deep Visual Embedding Network
The purpose of the visual embedding network is to generate discriminative hash codes such that similar pairs can be distinguished from dissimilar pairs. Specifically, the Hamming distance between the codes of two images should be minimized when their similarity entry is 1 and maximized when it is -1. To preserve the pairwise similarities (Liu et al., 2012), our work adopts a smooth loss defined on the inner product between binary codes, supervised by the binary similarity matrix.
However, it is difficult to generate discrete outputs directly. We therefore define the binary codes through the deep visual embedding network and its parameters.
In this paper, we design an end-to-end feature learning network that extends the pretrained AlexNet (Krizhevsky et al., 2012) model for discriminative visual embedding learning. Based on this backbone network, we replace the final classifier layer with a fully connected layer that transforms the convolutional feature maps into continuous codes with one dimension per hash bit. Subsequently, we apply the hyperbolic tangent (tanh) as the activation function to approximate the non-differentiable signum (sgn) function. To control the quantization error and bridge the gap between the binary codes and their relaxation, we add an extra penalty term that keeps the binary codes and the continuous network outputs as close as possible. We adopt a matrix-form loss function so that the gradient can be back-propagated to the network parameters. Hence, the problem in (2) is transformed into a relaxed problem in which a weighting parameter controls the penalty term.
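As a rough illustration of the relaxed objective, the sketch below evaluates a similarity-preserving term on inner products plus a quantization penalty. The exact functional form is an assumption made here for illustration: a squared error between the inner products of relaxed and binary codes and the code length times the similarity entry, plus a weighted squared distance between relaxed and binary codes. The names `relax`, `binarize`, `pairwise_loss` and `lam` are hypothetical, and codes are stored with one row per sample.

```python
import math


def relax(outputs):
    """Element-wise tanh, a smooth surrogate for the signum function."""
    return [[math.tanh(x) for x in row] for row in outputs]


def binarize(outputs):
    """Element-wise sign, mapping zero to +1 so that codes stay in {-1, 1}."""
    return [[1 if x >= 0 else -1 for x in row] for row in outputs]


def pairwise_loss(V, B, S, lam):
    """Assumed form: sum_ij (v_i . b_j - k * S_ij)^2 + lam * sum_i ||v_i - b_i||^2."""
    k = len(B[0])  # code length
    sim = 0.0
    for i, v in enumerate(V):
        for j, b in enumerate(B):
            inner = sum(x * y for x, y in zip(v, b))
            sim += (inner - k * S[i][j]) ** 2
    quant = sum((x - y) ** 2 for v, b in zip(V, B) for x, y in zip(v, b))
    return sim + lam * quant


S = [[1, 1, -1], [1, 1, -1], [-1, -1, 1]]
raw = [[2.0, 1.5], [1.0, 3.0], [-2.0, -1.0]]  # illustrative network outputs
V, B = relax(raw), binarize(raw)
print(pairwise_loss(V, B, S, lam=1.0))
```

Under this assumed form, the loss vanishes when the relaxed codes coincide with binary codes whose inner products match the similarity matrix, and it grows as the network outputs drift towards zero, which is the behaviour the penalty term is meant to discourage.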
2.3. Deep Class Encoding Network
In pairwise-preserving hashing methods, labels are always exploited as a similarity measurement between data points by applying the element-wise inner product. However, solely using the similarity matrix to supervise hash code learning inevitably results in severe information loss and thus highly restricts the expressive ability of the generated hash codes, especially in multi-label cases. To be more specific, an image annotated with multiple labels (such as 'ocean', 'beach' and 'water') contains underlying semantic connections between concepts, while a single class vector may hinder the conceptual bridge at a fine-grained level. The purpose of the label encoding network is to capture the original semantic information and preserve it in a flexible, fixed-dimensional continuous space. Similarly to the visual embedding network, the loss function of the label encoding network is defined on the output of the label encoding network and its parameters.
By providing complementary views of semantics, the label encoding network potentially guides the visual embedding network to learn beneficial context-aware representations.
2.4. Semantic Invariant Structure Construction
To disentangle the relationships between the abstract concepts and the visual features, we apply the outer product to fuse the visual and label embeddings. Distinct from the conventional element-wise product or plain concatenation, the outer product allows the high-level label encoding and the low-level visual feature embedding to interactively influence each other. In this way, we can capture the pairwise correlations between the features of an image and its corresponding label, enabling the discovery of common latent attributes. By applying the outer product, we first obtain the pairwise interactions between label and image features. After training, the semantic information is well separated according to the related regions in the image. The latent vector is obtained by reshaping the pairwise correlation matrix into a vector, which can then be projected to the discrete space to generate hash codes. The generated codes are more discriminative since the outer product operator ensures that the bits reflect the image regions associated with the corresponding semantic information. To construct the semantic-invariant structure, the objective function is formulated over the output of a fusion network, with its own parameters, applied to the outer product of the two embeddings.
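The fusion step can be illustrated with a small sketch. It assumes the visual and label embeddings are plain vectors and that the correlation matrix is flattened in row-major order before being passed to the fusion network; the function names are hypothetical.

```python
def outer_product(v, t):
    """Pairwise correlation matrix: entry (i, j) is v[i] * t[j]."""
    return [[vi * tj for tj in t] for vi in v]


def fuse(v, t):
    """Flatten the outer product row by row into one latent vector."""
    return [x for row in outer_product(v, t) for x in row]


visual = [1.0, 2.0]       # illustrative visual embedding
label = [3.0, 4.0, 5.0]   # illustrative label embedding
print(fuse(visual, label))  # length len(visual) * len(label)

# When both embeddings have the same length, the element-wise product is
# just the diagonal of the outer product, so the outer product retains
# strictly more pairwise interactions.
a, b = [1.0, 2.0], [3.0, 4.0]
M = outer_product(a, b)
print([M[i][i] for i in range(2)], [x * y for x, y in zip(a, b)])
```

The latent vector grows with the product of the two embedding sizes, which is the cost paid for modelling every visual-label interaction explicitly.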
Furthermore, we introduce the focal loss (Lin et al., 2017) to mitigate the side effects of class imbalance. The resulting objective involves a hyper-parameter and a classification layer with its own parameters, and is evaluated for each sample and each class with an adaptive factor that reduces the weight on well-classified concepts.
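For reference, a minimal sketch of the focal loss in its standard form from Lin et al. (2017), applied independently to each class of a multi-label prediction, is given below. The focusing parameter `gamma` and the absence of an additional class-balancing weight are assumptions made for illustration and may differ from the exact configuration used in our objective.

```python
import math


def focal_loss(probs, targets, gamma=2.0, eps=1e-12):
    """Sum over classes of -(1 - p_t)^gamma * log(p_t).

    probs:   predicted probabilities, one per class
    targets: 0/1 ground-truth labels, one per class
    p_t is the probability assigned to the correct outcome for each class.
    """
    total = 0.0
    for p, y in zip(probs, targets):
        p_t = p if y == 1 else 1.0 - p
        p_t = max(p_t, eps)  # guard against log(0)
        total += -((1.0 - p_t) ** gamma) * math.log(p_t)
    return total


easy = focal_loss([0.9], [1])  # well-classified concept
hard = focal_loss([0.1], [1])  # poorly classified concept
print(easy, hard)
```

With `gamma = 0` the expression reduces to the ordinary cross-entropy; larger values shrink the contribution of confidently classified concepts so that hard or rare concepts dominate the gradient.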
2.5. Collaborative Learning
To learn context-aware and discriminative hash codes, we adopt a joint learning framework consisting of the visual embedding network, the label encoding network and the semantic-invariant structure construction. The overall objective is a weighted sum of the visual embedding network loss, the label encoding network loss and the semantic-invariant structure construction loss, where two coefficients weight the importance of the different terms.
2.6. Optimization
The proposed DCDH needs to jointly optimize Eqn. (3), (4), and (6). Since these have similar forms, we only illustrate the detailed optimization of Eqn. (6); the remaining equations can be solved similarly. Specifically, we adopt an iterative alternating optimization scheme, that is, we update one variable with the others fixed.
Learning the network parameters: The network parameters can be optimized via the standard back-propagation algorithm, using the automatic differentiation techniques in PyTorch (Paszke et al., 2017).
Learning the binary codes: We aim to optimize the binary codes with all hyper-parameters fixed, and we rewrite Eqn. (6) accordingly.
The rewritten objective is expressed with the trace norm. Since the focal loss is independent of the binary code update, it can be treated as a constant when learning the binary codes.
It is challenging to directly optimize the binary codes due to their discrete constraint. Inspired by (Shen et al., 2015), we learn the binary codes with the discrete cyclic coordinate descent (DCC) strategy, in which the non-differentiable variable is solved in a bit-by-bit manner. Problem (9) can therefore be reformulated as a minimization over one bit at a time.
In this reformulation, each of the three matrices involved is split into one row and the matrix formed by the remaining rows, i.e., the matrix excluding that row. Eventually, we obtain the optimal solution of problem (2.6), which is used to update the binary codes.
The network parameters can be efficiently optimized through the standard back-propagation algorithm, using the automatic differentiation techniques of PyTorch (Paszke et al., 2017).
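The bit-by-bit update of the binary codes can be sketched as follows, in the spirit of the DCC strategy of Shen et al. (2015). The sketch assumes, for illustration only, that the sub-problem takes the generic form of minimizing ||VᵀB||² - 2 tr(BᵀQ) over binary B, where V holds the continuous codes and Q collects the linear terms; under the assumed squared-error loss with a quantization penalty, Q = kVS + λV. All matrices are stored as lists of rows with shape bits × samples, and every function name is hypothetical.

```python
def sgn(x):
    """Sign with zero mapped to +1 so that codes stay in {-1, 1}."""
    return 1 if x >= 0 else -1


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def build_q(V, S, k_bits, lam):
    """Assumed linear term Q = k * V S + lam * V."""
    n = len(S)
    return [[k_bits * sum(row[i] * S[i][j] for i in range(n)) + lam * row[j]
             for j in range(n)] for row in V]


def objective(B, V, Q):
    """f(B) = ||V^T B||_F^2 - 2 tr(B^T Q)."""
    k, n = len(B), len(B[0])
    quad = sum(sum(V[m][i] * B[m][j] for m in range(k)) ** 2
               for i in range(n) for j in range(n))
    return quad - 2.0 * sum(dot(B[m], Q[m]) for m in range(k))


def dcc_update(B, V, Q, sweeps=3):
    """Cyclically update one row (bit) of B with all other rows fixed."""
    k, n = len(B), len(B[0])
    B = [row[:] for row in B]
    for _ in range(sweeps):
        for l in range(k):
            # Inner products of row l of V with the other rows of V.
            g = [dot(V[m], V[l]) if m != l else 0.0 for m in range(k)]
            for j in range(n):
                coupling = sum(B[m][j] * g[m] for m in range(k))
                B[l][j] = sgn(Q[l][j] - coupling)
    return B


S = [[1, 1, -1], [1, 1, -1], [-1, -1, 1]]
V = [[0.9, 0.8, -0.7], [-0.6, -0.9, 0.8]]  # 2 bits x 3 samples
Q = build_q(V, S, k_bits=2, lam=1.0)
B0 = [[1, -1, 1], [1, 1, 1]]
B1 = dcc_update(B0, V, Q)
print(objective(B0, V, Q), objective(B1, V, Q))
```

Each row update is the exact minimizer of the assumed objective with the other rows held fixed, so the objective value cannot increase from one sweep to the next.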
2.7. Out of Sample Extension
Based on the proposed optimization method, we can obtain the optimal binary codes for all the training data and the optimized visual embedding learning network. Our learning framework can then easily generate the binary code of a new query by passing it through the visual network followed by the signum function.
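The out-of-sample extension amounts to a forward pass followed by a sign operation, as sketched below. The function `visual_network` is a hypothetical stand-in (a single linear layer with tanh activation) for the trained visual embedding network, and the weights are illustrative values rather than learned parameters.

```python
import math


def visual_network(x, weights):
    """Hypothetical stand-in for the trained visual embedding network."""
    return [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in weights]


def encode_query(x, weights):
    """Binary code of a new query: signum of the network output."""
    return [1 if o >= 0 else -1 for o in visual_network(x, weights)]


weights = [[1.0, -1.0], [0.5, 0.5], [-2.0, 0.0]]  # 3 bits, 2 input features
print(encode_query([2.0, 1.0], weights))
```

Only the visual stream is needed at query time, so no label information is required for unseen images.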
3. Experiments
We conduct extensive experiments to evaluate our method against several state-of-the-art hashing methods on NUS-WIDE and MIRFlickr. NUS-WIDE contains 269,648 images with 81 tags. Following (Li et al., 2015), we select a subset of 195,834 images that are included in the 21 most frequent classes. MIRFlickr contains 25,000 images from the Flickr website, in which each image is tagged with at least one of 38 concepts. Following the evaluation splits in (Shen et al., 2015; Kang et al., 2016), we randomly sample 2,100 and 1,700 images as query sets for NUS-WIDE and MIRFlickr, respectively, and the rest are used for training.
We compare our DCDH with 10 state-of-the-art hashing methods, which include 4 non-deep hashing methods (i.e. KSH (Liu et al., 2012), SDH (Shen et al., 2015), COSDISH (Kang et al., 2016), LFH (Zhang et al., 2014)), 4 deep hashing methods (i.e. DPSH (Li et al., 2015), DHN (Zhu et al., 2016), DVSQ (Cao et al., 2017), DTQ (Liu et al., 2018)), 1 unsupervised method (i.e. ITQ (Gong et al., 2013)) and 1 data-independent method (i.e. LSH (Gionis et al., 1999)). For fair comparison, we employ 4096-dimensional deep features extracted from AlexNet (Krizhevsky et al., 2012) for the non-deep methods. Two evaluation metrics, i.e., Mean Average Precision (MAP) and Precision@top K, are used for performance comparison.
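The two evaluation metrics can be computed as in the following sketch. It assumes each query yields a ranked list of 0/1 relevance flags (1 when the retrieved image shares at least one label with the query) and that average precision is normalised by the number of relevant items retrieved; evaluation protocols differ in such details, so this should be read as one common convention rather than the exact protocol used here.

```python
def precision_at_k(relevant, k):
    """Fraction of relevant items among the top-k of a ranked list."""
    top = relevant[:k]
    return sum(top) / len(top) if top else 0.0


def average_precision(relevant):
    """Mean of the precision values at the ranks of the relevant items."""
    hits, total = 0, 0.0
    for rank, rel in enumerate(relevant, start=1):
        if rel:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


def mean_average_precision(ranked_lists):
    """Average of the per-query average precision."""
    return sum(average_precision(r) for r in ranked_lists) / len(ranked_lists)


queries = [[1, 0, 1, 0], [0, 1]]  # relevance flags for two queries
print(precision_at_k(queries[0], 2), mean_average_precision(queries))
```

MAP rewards placing relevant items early in the ranking, whereas Precision@top K only measures how many of the first K results are relevant.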
The MAP results of different methods on NUS-WIDE and MIRFlickr are reported in Table 1. (1) Generally, taking advantage of semantic information, supervised methods achieve better retrieval performance than unsupervised methods, although ITQ obtains competitive results. (2) Deep hashing methods outperform shallow methods in most cases, since deep hashing methods benefit from learning discriminative representations and non-linear hash functions. (3) From the MAP results in Table 1 and the Precision@top K curves in Figure 2, we observe that DCDH outperforms the other comparison methods by a large margin. Our proposed method always produces the best performance on both benchmarks, which emphasizes the importance of constructing the semantic-invariant structure and excavating the underlying semantic correlation.
3.2. Ablation Study
We investigate two variants of DCDH: 1) DCDH-V uses the visual feature embedding network alone to generate hash codes. 2) DCDH-S uses only the semantic encoding and the visual feature embedding to generate hash codes. We report the results of the DCDH variants in Table 2 with 12-bit and 48-bit hash codes on NUS-WIDE and MIRFlickr. Compared with the full DCDH model, both DCDH-S and DCDH-V incur a drop in MAP. DCDH-S achieves better performance than DCDH-V because it employs the supervision of encoded labels. The result further reveals the importance of mining the semantic correlation between semantic information and local visual features.
[Figure panels: (a) 12 bits @ NUS-WIDE; (b) 12 bits @ MIRFlickr]
4. Conclusion
In this paper, we proposed a novel deep supervised hashing framework, which collaboratively explores visual feature representation learning, semantic-invariant structure construction, and label distribution correction. A discriminative common discrete Hamming space was constructed by concurrently considering the shared and model-specific semantic information from visual features and context annotations. Moreover, the class-imbalance problem was addressed so that both frequent and rare concepts are leveraged. Extensive experimental results demonstrate the superiority of the proposed joint learning framework.
Acknowledgements. This work is supported by ARC Discovery Project DP190102353.
References
- Deep visual-semantic quantization for efficient image retrieval. In CVPR, pp. 1328–1337.
- Deep quantization network for efficient image retrieval. In AAAI, pp. 3457–3463.
- Similarity search in high dimensions via hashing. In VLDB, pp. 518–529.
- Iterative quantization: a procrustean approach to learning binary codes for large-scale image retrieval. IEEE TPAMI 35 (12), pp. 2916–2929.
- Column sampling based discrete supervised hashing. In AAAI, pp. 1230–1236.
- ImageNet classification with deep convolutional neural networks. In NIPS, pp. 1097–1105.
- Feature learning based deep supervised hashing with pairwise labels. pp. 1711–1717.
- Focal loss for dense object detection. In ICCV, pp. 2980–2988.
- Deep triplet quantization. In ACMM.
- Supervised hashing with kernels. In CVPR, pp. 2074–2081.
- Collaborative learning for extremely low bit asymmetric hashing. CoRR abs/1809.09329.
- Robust discrete code modeling for supervised hashing. Pattern Recognition 75, pp. 128–135.
- Automatic differentiation in PyTorch.
- Supervised discrete hashing. In CVPR, pp. 37–45.
- A survey on learning to hash. IEEE Trans. PAMI 40 (4), pp. 769–790.
- Zero-shot hashing via transferring supervised knowledge. In ACMM.
- Supervised hashing with latent factor models. In SIGIR, pp. 173–182.
- Scalable supervised asymmetric hashing with semantic and latent factor embedding. IEEE Trans. IP 99 (99), pp. 1–16.
- Binary multi-view clustering. IEEE Trans. PAMI 99 (99), pp. 1–9.
- SADIH: semantic-aware discrete hashing. In AAAI, pp. 12–19.
- Deep hashing network for efficient similarity retrieval. In AAAI, pp. 2415–2421.