1 Introduction and related works
Content-based image retrieval (CBIR) is one of the fundamental problems in computer vision and has numerous applications, such as image captioning and visual search (Google Images, Flickr). State-of-the-art image search systems involve a major component which maps an input image into a visual representation. The output vector is then searched against a database via Euclidean distance-based comparison or quantization schemes in order to retrieve relevant images [1, 2]. Due to the exponential growth of image data, it is necessary to reduce memory storage and search time in CBIR systems. An interesting approach for achieving these requirements is binary hashing [3, 4, 5, 6, 7, 8]. Technically, instead of producing a real-valued vector as the final representation, the hashing approach maps an input into a compact binary code via data-dependent or data-independent algorithms. Consequently, the produced binary codes dramatically reduce memory storage. In addition, thanks to advances in computer hardware, we are able to compute the Hamming distance between binary codes within a single clock cycle via the POPCNT instruction [9].

Data-dependent approaches utilize training datasets to learn hashing models, and thus they usually outperform data-independent approaches. Along with the spectacular rise of deep learning, recent data-dependent hashing methods try to construct end-to-end models which are able to simultaneously learn image representations and binary codes [3, 4, 5, 7]. Thanks to joint optimization, the binary codes are able to retain label information, which increases their discriminative power. However, because the hashing network has to produce binary output, the loss function involves binary constraints which are commonly expressed through the non-differentiable sign function, e.g., [10, 11]. One workaround is to use approximations of the sign function. For instance, [12] used a logistic function to relax the binary constraints to range constraints. Although such functions are differentiable, they cause the vanishing gradient problem when training via stochastic gradient descent (SGD) [13]. On the other hand, [14] resolved the binary constraints by assuming that the absolute-value function and its regularization term are differentiable everywhere; however, this assumption may cause some degradation in performance.

In [7], the authors proposed a supervised hashing neural network (SH-BDNN). They used the idea of the penalty method [15] to deal with the binary constraints on the output codes and optimized the model via L-BFGS. Nevertheless, SH-BDNN is not an end-to-end model: the feature representation and the binary codes are not jointly optimized.
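As a side note on the efficiency claim above, the Hamming-distance comparison of packed binary codes reduces to an XOR followed by a population count, which is what the POPCNT instruction computes in hardware. A minimal sketch (the 8-bit codes below are hypothetical):

```python
def hamming_distance(code_a: int, code_b: int) -> int:
    """Hamming distance between two binary codes packed into integers.

    XOR marks the differing bits; counting the set bits gives the
    distance (hardware POPCNT performs this count in one instruction).
    """
    return bin(code_a ^ code_b).count("1")

# Two hypothetical 8-bit codes differing in exactly two bit positions.
print(hamming_distance(0b10110010, 0b10011010))  # 2
```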
Our specific contributions are: (i) we propose an end-to-end deep neural network (SH-E2E) for supervised hashing which integrates the three components of a visual search system, i.e., the feature extraction component, the dimension reduction component, and the binary code learning component; (ii) we introduce a learning scheme which not only copes with the binary constraints but is also scalable to large-scale training; (iii) solid experiments on three image retrieval benchmarks show that our method significantly outperforms other supervised hashing methods.
2 Methodology
2.1 Binary constrained loss function
Given a set of m training images X and network parameters W, let H be the output of the network, i.e., H ∈ {−1, 1}^{m×L} contains the binary codes of length L corresponding to the input, one code per row. We not only want the output to be binary but also to achieve other hashing properties. One of them is similarity preservation, i.e., samples belonging to the same class should have similar codes, while samples belonging to different classes should have different codes. In other words, we want to minimize the difference between the code similarity (1/L)HH^T and the label similarity matrix S, where:
$$S_{ij} = \begin{cases} 1 & \text{if } x_i \text{ and } x_j \text{ belong to the same class} \\ -1 & \text{otherwise} \end{cases} \qquad (1)$$
From this definition, we define the loss function:
$$\min_{W} J = \frac{\alpha}{2m}\left\|\frac{1}{L} H H^{T} - S\right\|^{2} \quad \text{s.t. } H \in \{-1, 1\}^{m \times L} \qquad (2)$$
where the hyperparameter α controls the similarity-preserving property. Because of the binary constraints in Equation (2), minimizing the loss function becomes a mixed-integer programming problem, which is NP-hard. Inspired by [7], we relax the constraints by introducing an auxiliary variable B representing the desired binary codes. Letting B ∈ {−1, 1}^{m×L}, the loss function is formulated as:
$$\min_{W, B} J = \frac{\alpha}{2m}\left\|\frac{1}{L} H H^{T} - S\right\|^{2} + \frac{\lambda_1}{2m}\left\|H - B\right\|^{2} \quad \text{s.t. } B \in \{-1, 1\}^{m \times L} \qquad (3)$$
The second term plays the role of a measure of constraint violation. If we set λ1 sufficiently large, we can strongly force the output to be binary, which makes it easier to reach a feasible solution. Additionally, by introducing B, we can optimize Equation (3) with an alternating method (Section 2.3). Finally, we integrate the independence and balance properties introduced in [11] by directly attaching them to the objective function:
$$\min_{W, B} J = \frac{\alpha}{2m}\left\|\frac{1}{L} H H^{T} - S\right\|^{2} + \frac{\lambda_1}{2m}\left\|H - B\right\|^{2} + \frac{\lambda_2}{2}\left\|\frac{1}{m} H^{T} H - I\right\|^{2} + \frac{\lambda_3}{2m}\left\|H^{T}\mathbf{1}_{m \times 1}\right\|^{2} \quad \text{s.t. } B \in \{-1, 1\}^{m \times L} \qquad (4)$$
where λ2 and λ3 are hyperparameters and 1_{m×1} is an m-dimensional column vector of ones.
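The loss above can be sketched in NumPy. This is an illustrative reading, not the paper's implementation: the rows of H are taken as codes, S follows Equation (1), and the hyperparameter values and toy data are placeholders:

```python
import numpy as np

def similarity_matrix(labels):
    """Eq. (1): S_ij = 1 if samples i and j share a label, -1 otherwise."""
    labels = np.asarray(labels)
    return np.where(labels[:, None] == labels[None, :], 1.0, -1.0)

def hashing_loss(H, B, S, alpha=1.0, lam1=1.0, lam2=0.1, lam3=0.1):
    """Eq. (4): similarity + binary-violation + independence + balance terms.

    H : (m, L) real-valued network outputs; B : (m, L) auxiliary binary codes.
    """
    m, L = H.shape
    sim   = alpha / (2 * m) * np.sum((H @ H.T / L - S) ** 2)
    binv  = lam1 / (2 * m) * np.sum((H - B) ** 2)             # constraint violation
    indep = lam2 / 2 * np.sum((H.T @ H / m - np.eye(L)) ** 2)  # bit independence
    bal   = lam3 / (2 * m) * np.sum((H.T @ np.ones(m)) ** 2)   # bit balance
    return sim + binv + indep + bal

# Toy example: 4 samples, 3-bit codes, two classes.
rng = np.random.default_rng(0)
H = rng.standard_normal((4, 3))
B = np.sign(H)                       # auxiliary binary variable
S = similarity_matrix([0, 0, 1, 1])
print(hashing_loss(H, B, S))
```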
2.2 Network architecture
Figure 1 illustrates the overall architecture of the proposed end-to-end deep hashing network. The network is composed of three major components: (i) a feature learning component which extracts visual image representations, (ii) a dimension reduction layer, and (iii) a hashing component which produces binary codes.
In CBIR systems, the feature extractor can be built from hand-crafted features [6] or learnable features such as convolutional neural networks [2]. In order to build an end-to-end system which can optimize all components jointly, we choose a deep convolutional network as the feature extractor. The design is flexible: any convolutional network, such as VGG [16] or AlexNet [17], can be used. For fair experiments, we utilize AlexNet as the feature extractor, which is consistent with the other hashing methods compared. In the proposed design, we remove the last (softmax) layer of AlexNet and consider its last fully connected layer as the image representation. Thus, the output of the feature extractor component is a 1024-dimensional real-valued vector.

The dimension reduction component (the DR layer) is a fully connected layer initialized by PCA computed on the outputs of AlexNet's last fully connected layer over the training set. Specifically, let W_DR and b_DR be the weights and bias of the layer, respectively:
$$W_{DR} = E^{T}, \qquad b_{DR} = -E^{T}\bar{x} \qquad (5)$$

where the columns of E are the eigenvectors extracted from the covariance matrix of the training features and x̄ is the mean of the features of the training set. We use the identity function as the activation function of this layer.
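The PCA initialization of the DR layer (Equation (5)) can be sketched as follows; the feature matrix X (one fully-connected feature per row) and the target dimension d are placeholders:

```python
import numpy as np

def pca_init(X, d):
    """Initialize the DR layer from training features X (one row per image).

    Returns weights W (d, D) and bias b (d,) such that W @ x + b equals
    the top-d PCA projection of the centered feature x, as in Eq. (5).
    """
    x_bar = X.mean(axis=0)                   # mean feature vector
    cov = np.cov(X - x_bar, rowvar=False)    # covariance matrix of the features
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigenvalues in ascending order
    E = eigvecs[:, ::-1][:, :d]              # top-d eigenvectors, shape (D, d)
    W = E.T                                  # Eq. (5): W_DR = E^T
    b = -E.T @ x_bar                         #          b_DR = -E^T x_bar
    return W, b

# Toy check: the layer centers the feature and rotates it onto the top-d axes.
X = np.random.default_rng(1).standard_normal((100, 8))
W, b = pca_init(X, d=3)
y = W @ X[0] + b        # 3-dimensional reduced representation
print(y.shape)          # (3,)
```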
The last component, namely the hashing optimizer, is constructed from several fully connected layers. The output of this component has the same length as the required binary code.
2.3 Training
The training procedure is summarized in Algorithm 1, in which B^(t) denotes the binary codes of the training set at the t-th iteration and W^(t,k) denotes SH-E2E's parameters at the k-th inner loop of the t-th outer loop. First of all, we use a pretrained AlexNet to initialize the weights of SH-E2E. The DR layer is generated as discussed in Section 2.2 and the remaining layers are initialized randomly. In order to encourage the algorithm to converge faster, the binary variable B is initialized by ITQ [18].

We apply an alternating approach to minimize the loss function (4). In each iteration, we sample a mini-batch of images from the training set together with the corresponding binary codes from B. Additionally, we create the similarity matrix (Equation (1)) corresponding to the mini-batch. Since B has already been computed, we can fix that variable and update the network parameters W by standard backpropagation via SGD. Over the iterations of the inner loop, we are able to exhaustively use all training data. After learning SH-E2E on the whole training set, we update B and then start the optimization procedure again until a stopping criterion is reached, i.e., after T iterations.
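The alternating scheme above can be outlined with a toy experiment in which a linear map stands in for the network. This is a simplification, not the paper's procedure: the batch size, learning rate, iteration counts, and random-sign initialization of B (the paper uses ITQ) are illustrative, and only the gradients of the similarity and violation terms are included:

```python
import numpy as np

rng = np.random.default_rng(0)
m, D, L = 64, 16, 8                      # samples, feature dim, code length
X = rng.standard_normal((m, D))          # stand-in for image features
labels = rng.integers(0, 2, size=m)
S = np.where(labels[:, None] == labels[None, :], 1.0, -1.0)

W = 0.01 * rng.standard_normal((D, L))   # toy "network": H = X @ W
B = np.sign(X @ W)                       # binary variable (ITQ init in the paper)
B[B == 0] = 1
lam1, lr = 1.0, 1e-3

for t in range(20):                      # outer loop: alternate over W and B
    for _ in range(10):                  # inner loop: SGD updates with B fixed
        idx = rng.choice(m, size=16, replace=False)        # mini-batch
        Hb = X[idx] @ W
        n = len(idx)
        # gradient of similarity + binary-violation terms w.r.t. W
        resid = Hb @ Hb.T / L - S[np.ix_(idx, idx)]
        grad = X[idx].T @ (2 * resid @ Hb / (L * n) + lam1 * (Hb - B[idx]) / n)
        W -= lr * grad
    B = np.sign(X @ W)                   # update binary codes with W fixed
    B[B == 0] = 1

print(np.mean(np.abs(X @ W - B)))        # mean |H - B| after training
```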
Compared to the recent supervised hashing method SH-BDNN [7], the proposed framework has two advantages. First, SH-E2E is an end-to-end hashing network which combines both feature extraction and binary code optimization in a unified framework; it differs from SH-BDNN, which requires precomputed image features as inputs. Secondly, thanks to SGD, SH-E2E can be trained on large-scale datasets because the network builds the similarity matrix from a batch of samples; it differs from SH-BDNN, which uses the whole training set in each iteration.
3 Experiments
3.1 Datasets
MNIST [19] comprises 70k grayscale images of 10 handwritten digits which are divided into 2 sets: 60k training images and 10k testing images.
CIFAR-10 [20] includes 60k RGB images categorized into 10 classes, divided into 50k training images and 10k testing images.
SUN397 [21] is a large-scale dataset which contains 108,754 images categorized into 397 classes. Following the setting from [8], we select the 42 categories which have more than 500 images each, resulting in 35k images. For each class, we randomly sample 100 images as test samples; the resulting 4,200 images form the testing set. The remaining images are used as training samples.
3.2 Implementation details
We implement SH-E2E in MATLAB with the MatConvNet library [22]. All experiments are conducted on a workstation (Intel(R) Xeon(R) CPU E5-2650 @ 2.20GHz) with a Titan X GPU. The last component includes 3 fully connected layers in which the sigmoid function is used as the activation function. The numbers of units in the binary optimizer are empirically selected as described in Table 1.

Code length  Layer 1  Layer 2  Layer 3
8            90       20       8
16           90       30       16
24           100      40       24
32           120      50       32
48           140      80       48
Regarding the hyperparameters of the loss function, we fine-tune α over a range of values, together with λ1, λ2 and λ3. For training the network, we select suitable values of the learning rate and weight decay, and the same mini-batch size is used for all experiments. Other settings in the algorithm are fixed across experiments.

3.3 Comparison with other supervised hashing methods
We compare the proposed method with other supervised hashing methods, i.e., SH-BDNN [7], ITQ-CCA [18], KSH [23], BRE [24] and SDH [25]. In order to make fair comparisons, following [7], we use the pretrained AlexNet to extract features from its last fully connected layer and then use PCA to map the 4096-dimensional features to a lower-dimensional space, i.e., an 800-dimensional space. The reduced features are used as inputs for the compared methods.
The comparative results are shown in Fig. 2. On the MNIST dataset, SH-E2E achieves fair improvements over the compared methods. The performance of SH-E2E saturates when the code length is higher than 16, i.e., its mAP is 98.03%, 98.26% and 98.21% at code lengths 16, 24 and 32, respectively. Similar results can be observed on the CIFAR-10 dataset: as shown in Figure 2(b), SH-E2E outperforms the other supervised hashing methods by a fair margin, beating the most competitive method, SH-BDNN, by 4% to 7.5% at different code lengths. On the SUN397 dataset, SH-E2E and the second best, SH-BDNN, achieve comparable results, while both significantly outperform the other methods.
3.4 Comparison with other endtoend hashing networks
We also compare the proposed deep network with other end-to-end supervised hashing architectures, i.e., Lai et al. [12], DHN [26], DQN [27], DSRH [28] and DRSCH [10]. For the end-to-end comparisons, we follow two different settings for the MNIST and CIFAR-10 datasets.
The results of [12, 26, 10, 28, 27] are cited directly from the corresponding papers. Although those works propose approaches which simultaneously learn image features and binary codes by combining CNN layers with a binary quantization layer, they rely on smooth approximations of the sign function (e.g., in [28], [12] and [10]), which may degrade performance. Thanks to the new learning scheme for dealing with binary constraints and the effectiveness of the proposed network architecture, the proposed SH-E2E outperforms the other end-to-end methods by fair margins, as observed in Table 2 and Table 3. Specifically, under Setting 1, the proposed SH-E2E and DHN [26] achieve comparable results, while these two methods significantly outperform the other methods. Under Setting 2, the proposed SH-E2E consistently outperforms DRSCH [10] and DSRH [28] at all code lengths.
4 Conclusions
We propose a new deep network architecture which efficiently learns compact binary codes of images. The proposed network comprises three components, i.e., a feature extractor, a dimension reduction layer and a binary code optimizer, trained in an end-to-end framework. In addition, we propose a new learning scheme which copes with the binary constraints and also allows the network to be trained on large-scale datasets. The experimental results on three benchmarks show the improvements of the proposed method over state-of-the-art supervised hashing methods.
Acknowledgements
This research is supported by the National Research Foundation Singapore under its AI Singapore Programme (Award number: AISG100E2018005).
References
[1] Thanh-Toan Do and Ngai-Man Cheung, "Embedding based on function approximation for large scale image search," TPAMI, 2018.

[2] Giorgos Tolias, Ronan Sicre, and Hervé Jégou, "Particular object retrieval with integral max-pooling of CNN activations," in ICLR, 2016.
[3] Kevin Lin, Jiwen Lu, Chu-Song Chen, and Jie Zhou, "Learning compact binary descriptors with unsupervised deep neural networks," in CVPR, 2016, pp. 1183–1192.
[4] Bohan Zhuang, Guosheng Lin, Chunhua Shen, and Ian Reid, "Fast training of triplet-based deep binary embedding networks," in CVPR, 2016, pp. 5955–5964.
[5] Jie Feng, Svebor Karaman, and Shih-Fu Chang, "Deep image set hashing," in WACV, 2017, pp. 1241–1250.
[6] Thanh-Toan Do, Dang-Khoa Le Tan, Trung T. Pham, and Ngai-Man Cheung, "Simultaneous feature aggregating and hashing for large-scale image search," in CVPR, 2017.
[7] Thanh-Toan Do, Anh-Dzung Doan, and Ngai-Man Cheung, "Learning to hash with binary deep neural network," in ECCV, 2016, pp. 219–234.
[8] Thanh-Toan Do, Anh-Dzung Doan, Duc-Thanh Nguyen, and Ngai-Man Cheung, "Binary hashing with semidefinite relaxation and augmented Lagrangian," in ECCV, 2016, pp. 802–817.
[9] Jun Wang, Wei Liu, Sanjiv Kumar, and Shih-Fu Chang, "Learning to hash for indexing big data—a survey," Proceedings of the IEEE, vol. 104, no. 1, pp. 34–57, 2016.
[10] Ruimao Zhang, Liang Lin, Rui Zhang, Wangmeng Zuo, and Lei Zhang, "Bit-scalable deep hashing with regularized similarity learning for image retrieval and person re-identification," TIP, pp. 4766–4779, 2015.
 [11] Venice Erin Liong, Jiwen Lu, Gang Wang, Pierre Moulin, and Jie Zhou, “Deep hashing for compact binary codes learning,” in CVPR, 2015.
 [12] Hanjiang Lai, Yan Pan, Ye Liu, and Shuicheng Yan, “Simultaneous feature learning and hash coding with deep neural networks,” in CVPR, 2015, pp. 3270–3278.
[13] Geoffrey E. Hinton, Simon Osindero, and Yee-Whye Teh, "A fast learning algorithm for deep belief nets," Neural Computation, vol. 18, no. 7, pp. 1527–1554, 2006.
 [14] Haomiao Liu, Ruiping Wang, Shiguang Shan, and Xilin Chen, “Deep supervised hashing for fast image retrieval,” in CVPR, 2016, pp. 2064–2072.
[15] Jorge Nocedal and Stephen J. Wright, Numerical Optimization, Chapter 17, Springer, 2nd edition, 2006.
 [16] K. Simonyan and A. Zisserman, “Very deep convolutional networks for largescale image recognition,” CoRR, vol. abs/1409.1556, 2014.

[17] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton, "Imagenet classification with deep convolutional neural networks," in NIPS, 2012, pp. 1097–1105.
[18] Yunchao Gong and Svetlana Lazebnik, "Iterative quantization: A procrustean approach to learning binary codes," in CVPR, 2011.

[19] Yann LeCun and Corinna Cortes, "The MNIST database of handwritten digits," http://yann.lecun.com/exdb/mnist/.
[20] Alex Krizhevsky, "Learning multiple layers of features from tiny images," Tech. Rep., University of Toronto, 2009.

[21] Jianxiong Xiao, James Hays, Krista A. Ehinger, Aude Oliva, and Antonio Torralba, "SUN database: Large-scale scene recognition from abbey to zoo," in CVPR, 2010, pp. 3485–3492.
[22] A. Vedaldi and K. Lenc, "MatConvNet – convolutional neural networks for MATLAB," in ACM Multimedia, 2015.
[23] Wei Liu, Jun Wang, Rongrong Ji, Yu-Gang Jiang, and Shih-Fu Chang, "Supervised hashing with kernels," in CVPR, 2012, pp. 2074–2081.
 [24] Brian Kulis and Trevor Darrell, “Learning to hash with binary reconstructive embeddings,” in NIPS, 2009, pp. 1042–1050.
 [25] Fumin Shen, Chunhua Shen, Wei Liu, and Heng Tao Shen, “Supervised discrete hashing,” in CVPR, 2015.
 [26] Han Zhu, Mingsheng Long, Jianmin Wang, and Yue Cao, “Deep hashing network for efficient similarity retrieval.,” in AAAI, 2016, pp. 2415–2421.
 [27] Yue Cao, Mingsheng Long, Jianmin Wang, Han Zhu, and Qingfu Wen, “Deep quantization network for efficient image retrieval.,” in AAAI, 2016, pp. 3457–3463.
 [28] Fang Zhao, Yongzhen Huang, Liang Wang, and Tieniu Tan, “Deep semantic ranking based hashing for multilabel image retrieval,” in CVPR, 2015.