1 Introduction
Content-based image retrieval (CBIR) or text-based retrieval (TBR) has played a major role in practical computer vision applications. In some scenarios, however, what should we do if example queries are not available or it is difficult to describe them with keywords? To address this problem, sketch-based image retrieval (SBIR) [13, 19, 46, 69, 42, 47, 3, 12, 27, 21, 57, 6, 7, 20, 62, 43, 49] has recently been developed and is becoming popular in the information retrieval area (as shown in Fig. 1). Compared to traditional retrieval approaches, a sketch query can more efficiently and precisely express the shape, pose and fine-grained details of the search target, which is intuitive to humans and far more convenient than describing it with a “hundred” words in text.
However, SBIR is challenging since humans draw free-hand sketches without any reference and focus only on the salient object structures. As such, the shapes and scales in sketches are usually distorted compared to natural images. To deal with this problem, some studies have attempted to bridge the domain gap between sketches and natural images for SBIR. These methods can be roughly divided into two groups: hand-crafted methods and cross-domain deep learning-based methods.
Hand-crafted SBIR first generates approximate sketches by extracting edge or contour maps from the natural images. After that, hand-crafted features (e.g., SIFT [39], HOG [8], gradient field HOG [18, 19], histogram of edge local orientations (HELO) [48, 46] and Learned KeyShapes (LKS) [47]) are extracted for both sketches and edge-maps of natural images, which are then fed into “Bag-of-Words” (BoW) methods to generate the representations for SBIR. The major limitation of hand-crafted methods is that the domain gap between sketches and natural images cannot be well remedied, as it is difficult to match edge maps to non-aligned sketches with large variations and ambiguity.
To further alleviate the above domain shift issue, convolutional neural networks (CNNs) [24] have recently been used to learn domain-transformable features from sketches and images with end-to-end frameworks [49, 43, 62]. Being able to better handle the domain gap, deep methods typically achieve higher performance than hand-crafted ones for both category-level [13, 19, 46, 69, 42, 47, 12] and fine-grained [49, 62, 27] SBIR tasks.
Despite this progress, current deep SBIR methods still face severe challenges. In particular, these methods tend to perform well when each gallery image contains only a single object with a simple contour shape on a clean background (e.g., “Moon”, “Eiffel tower” and “Pyramid” in the shape-based Flickr15K dataset [19]). In practice, however, objects in gallery images may appear from various viewpoints with relatively complex backgrounds (e.g., a rhinoceros in bushes). In such cases, current methods fail to handle the significant geometric distortions between free-hand sketches and natural images, resulting in unsatisfactory performance.
Moreover, little attention has been paid to the search efficiency of SBIR. Most SBIR techniques apply nearest neighbor (NN) search to continuous-valued features (hand-crafted or deeply learned), whose computational cost grows with the gallery size. Such methods become inappropriate for large-scale SBIR tasks in certain realistic scenarios (e.g., on wearable or mobile devices). Therefore, being able to conduct fast SBIR on a substantial number of images with limited computational and memory resources is crucial for practical applications.
To address the above issues, in this paper, we introduce a novel Deep Sketch Hashing (DSH) framework for fast free-hand SBIR, which incorporates the learning of binary codes and deep hash functions into a unified framework. Specifically, DSH speeds up SBIR by embedding sketches and natural images into two sets of compact binary codes, aiming at not only preserving their pairwise semantic similarities, but also leveraging the intrinsic category correlations. Unlike previous methods with Siamese [43, 57] or triplet CNNs [49, 62] only utilizing images and sketches, we propose a novel semi-heterogeneous deep architecture including three CNNs, where a unique middle-level network fed with “sketch-tokens” is developed to effectively diminish the aforementioned geometric distortion between free-hand sketches and natural images. The contributions of this work mainly include:

To the best of our knowledge, DSH is the first hashing work specifically designed for category-level SBIR, where both binary codes and deep hash functions are learned in a joint end-to-end framework. DSH aims to generate binary codes which can successfully capture the cross-view relationship (between images and sketches) as well as the intrinsic semantic correlations between different categories. To this end, an efficient alternating optimization scheme is applied to produce high-quality hash codes.

A novel semi-heterogeneous deep architecture is developed in DSH as the hash function, where natural images, free-hand sketches and the auxiliary sketch-tokens are fed into three CNNs (as shown in Fig. 3). Particularly, natural images and their corresponding sketch-tokens are fed into a heterogeneous late-fusion net, while the CNNs for sketches and sketch-tokens share the same weights during training. As such, the architecture in DSH can better remedy the domain gap between images and sketches compared to previous SBIR deep nets.

The experiments consistently illustrate the superior performance of DSH compared to the state-of-the-art methods, while achieving a significant reduction in both retrieval time and memory load.
Related Work
Hashing techniques [16, 33, 58, 38, 34, 17, 70, 35, 14, 66, 36, 44, 37, 51, 25, 32] have recently been successfully applied to encode high-dimensional features into compact similarity-preserving binary codes, which enables extremely fast similarity search by the use of Hamming distances. Inspired by this, some recent SBIR works [1, 15, 40, 52, 54, 56] have incorporated existing hashing methods for efficient retrieval. For instance, LSH [16] and ITQ [17] are adopted for sketch-based image [1] and 3D model [15] retrieval tasks, respectively. In fact, among various hashing methods, cross-modality hashing [30, 64, 68, 26, 31, 2, 53, 71, 67, 10, 23, 5, 4], which learns binary codes by preserving the correlations between heterogeneous representations from different modalities, is more related to SBIR problems. However, none of the above hashing techniques is specifically designed for SBIR, and all of them neglect the intrinsic relationship between free-hand sketches and natural images, resulting in unsatisfactory performance.
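To make the speed argument concrete, the following minimal Python sketch shows how the Hamming distance between two binary codes reduces to an XOR followed by a bit count. The helper names `pack_bits` and `hamming_distance` are hypothetical and chosen only for this illustration; they are not part of any method cited above.

```python
def pack_bits(bits):
    """Pack a sequence of 0/1 values into a single Python integer."""
    value = 0
    for bit in bits:
        value = (value << 1) | bit
    return value


def hamming_distance(code_a, code_b):
    """Count the bit positions in which two packed codes differ."""
    return bin(code_a ^ code_b).count("1")


# Two illustrative 8-bit codes (toy values, not taken from any dataset).
a = pack_bits([1, 0, 1, 1, 0, 0, 1, 0])
b = pack_bits([1, 1, 1, 0, 0, 0, 1, 1])
print(hamming_distance(a, b))  # 3
```

Because the comparison touches only a few machine words per item, it is far cheaper than a distance computation on continuous-valued features of comparable discriminative power.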
In the next section, we will introduce the detailed architecture of our deep hash nets in DSH, then elaborate on our hashing objective function.
2 Deep Sketch Hashing
To help better understand this section, we first introduce some notation. Let , where is a natural image and is its corresponding sketch-token computed from ; be the set of free-hand sketches ; and and indicate the numbers of the samples in and , respectively. Additionally, define the label matrix , where if belongs to class and otherwise; for sketches is defined in the same way. We aim to learn two sets of bit binary codes and for and , respectively.
2.1 Semiheterogeneous Deep Architecture
As previously stated, SBIR is a very challenging task due to the large geometric distortion between sketches and images. Inspired by [29, 47], in this work, we propose to adopt an auxiliary image representation as a bridge to mitigate the geometric distortion between sketches and natural images. In particular, a set of edge structures, called “sketch-tokens”, is detected from natural images using supervised middle-level information in the form of hand-drawn sketches. In practice, given an image we obtain an initial sketch-token, where each pixel is assigned a score for the likelihood of it being a contour point. We then use 60% of the maximum score (same as [47]) to threshold each pixel and obtain the final sketch-tokens as shown in Fig. 2.
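The thresholding step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the score map is a plain list of lists, the function name `threshold_sketch_token` is hypothetical, and treating scores equal to the cutoff as contour pixels is a choice made here, not something specified in the text.

```python
def threshold_sketch_token(score_map, ratio=0.6):
    """Binarize a contour-score map at `ratio` times its maximum score."""
    max_score = max(max(row) for row in score_map)
    cutoff = ratio * max_score
    # Assumption: pixels scoring at or above the cutoff are kept as contour.
    return [[1 if score >= cutoff else 0 for score in row] for row in score_map]


# Toy 2x3 score map; the maximum is 1.0, so the cutoff is 0.6.
scores = [[0.1, 0.5, 0.9],
          [0.7, 0.2, 1.0]]
print(threshold_sketch_token(scores))  # [[0, 0, 1], [1, 0, 1]]
```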
Sketch-tokens have two advantages: (1) they reflect only essential edges of natural images without detailed texture information; (2) unlike ordinary edge-maps (e.g., Canny), they have very similar stroke patterns and appearance to free-hand sketches. Next, we will show how to design the DSH architecture with the help of sketch-tokens.
We propose a novel semi-heterogeneous deep architecture, where three CNNs are developed as hash functions to encode free-hand sketches, natural images and auxiliary sketch-tokens into binary codes. As shown in Fig. 3, the DSH framework includes the following two parts:
1) Cross-weight Late-fusion Net: A heterogeneous net with two parallel CNNs is developed, termed C1Net (Bottom) and C2Net (Middle). Particularly, C1Net (Bottom) is slightly modified from AlexNet [24] and contains 5 convolutional (conv) layers and 2 fully connected (fc) layers for natural image inputs, while C2Net is configured with 4 convolutional layers and 2 fully connected layers for the corresponding sketch-token inputs. The detailed parameters are listed in Table 1. Inspired by the recent multimodal deep framework [45], we connect the pooling3, fc_a and fc_b layers of C1Net (Bottom) and C2Net (Middle) with cross-weights. In this way, we exploit high-level interactions between the two nets to maximize the mutual information across both modalities, while the information from each individual net is also preserved. Finally, similar to [30, 10], we late-fuse C1Net (Bottom) and C2Net (Middle) into a unified binary coding layer hash_C1 so that the learned codes can fully benefit from both natural images and their corresponding sketch-tokens.
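Table 1. Layer configurations of C1Net and C2Net.

| Net | Layer | Kernel Size | Stride | Pad | Output |
|---|---|---|---|---|---|
| C1Net (Natural Image) | input | | | | 3×227×227 |
| | conv1 | 11×11 | 4 | 0 | 96×55×55 |
| | pooling1 | 3×3 | 2 | 0 | 96×27×27 |
| | conv2 | 5×5 | 1 | 2 | 256×27×27 |
| | pooling2 | 3×3 | 2 | 0 | 256×13×13 |
| | conv3 | 3×3 | 1 | 1 | 384×13×13 |
| | conv4 | 3×3 | 1 | 1 | 384×13×13 |
| | conv5 | 3×3 | 1 | 1 | 384×13×13 |
| | pooling3 | 3×3 | 2 | 1 | 256×7×7 |
| | fc_a | 7×7 | 1 | 0 | 4096×1×1 |
| | fc_b | 1×1 | 1 | 0 | 1024×1×1 |
| | hash_C1 | 1×1 | 1 | 0 | (code length)×1×1 |
| C2Net (Free-hand sketch / Sketch-tokens) | input | | | | 1×200×200 |
| | conv1 | 14×14 | 3 | 0 | 64×63×63 |
| | pooling1 | 3×3 | 2 | 0 | 64×31×31 |
| | conv2_1 | 3×3 | 1 | 1 | 128×31×31 |
| | conv2_2 | 3×3 | 1 | 1 | 128×31×31 |
| | pooling2 | 3×3 | 2 | 0 | 128×15×15 |
| | conv3_1 | 3×3 | 1 | 1 | 256×15×15 |
| | conv3_2 | 3×3 | 1 | 1 | 256×15×15 |
| | pooling3 | 3×3 | 2 | 0 | 256×7×7 |
| | fc_a | 7×7 | 1 | 0 | 4096×1×1 |
| | fc_b | 1×1 | 1 | 0 | 1024×1×1 |
| | hash_C2 | 1×1 | 1 | 0 | (code length)×1×1 |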
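The spatial sizes in Table 1 can be checked with the standard output-size formula for convolution and pooling layers, (input + 2·pad − kernel) / stride + 1. The short Python sketch below traces both nets from input to pooling3; the names `out_size`, `trace`, `C1_LAYERS` and `C2_LAYERS` are hypothetical helpers introduced for this check, and the (kernel, stride, pad) triples are read from Table 1.

```python
def out_size(in_size, kernel, stride, pad):
    """Spatial output size of a conv/pooling layer (floor division)."""
    return (in_size + 2 * pad - kernel) // stride + 1


def trace(in_size, layers):
    """Return the spatial size after each (kernel, stride, pad) layer."""
    sizes = []
    for kernel, stride, pad in layers:
        in_size = out_size(in_size, kernel, stride, pad)
        sizes.append(in_size)
    return sizes


# conv1, pooling1, conv2, pooling2, conv3, conv4, conv5, pooling3
C1_LAYERS = [(11, 4, 0), (3, 2, 0), (5, 1, 2), (3, 2, 0),
             (3, 1, 1), (3, 1, 1), (3, 1, 1), (3, 2, 1)]
# conv1, pooling1, conv2_1, conv2_2, pooling2, conv3_1, conv3_2, pooling3
C2_LAYERS = [(14, 3, 0), (3, 2, 0), (3, 1, 1), (3, 1, 1),
             (3, 2, 0), (3, 1, 1), (3, 1, 1), (3, 2, 0)]

print(trace(227, C1_LAYERS))  # [55, 27, 27, 13, 13, 13, 13, 7]
print(trace(200, C2_LAYERS))  # [63, 31, 31, 31, 15, 15, 15, 7]
```

Both nets end with a 7×7 map before fc_a, which is consistent with the 7×7 kernel listed for fc_a in each net.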
2) Shared-weight Sketch Net: For free-hand sketch inputs, we develop the C2Net (Top) with configurations shown in Table 1. Specifically, considering the similar characteristics and implicit correlations existing between sketch-tokens and free-hand sketches as mentioned above, we design a Siamese architecture for C2Net (Middle) and C2Net (Top) to share the same deep weights in conv and fc layers during the optimization (see Fig. 3). As such, the hash codes of free-hand sketches learned via the shared-weight net (from hash_C2) will mitigate the geometric difference between images and sketches during SBIR.
Deep Hash Functions: Denote by the deep weights in C1Net (Bottom) and the shared weights in C2Net (Middle) and C2Net (Top). For natural images and their sketch-tokens, we form the deep hash function from the cross-weight late-fusion net of C1Net (Bottom) and C2Net (Middle). Similarly, the shared-weight sketch net (i.e., C2Net (Top)) is regarded as the hash function for free-hand sketches. In this way, hash codes learned from the above deep hash functions can lead to more reasonable SBIR, especially when a significant sketch-image distortion exists. Next, we will introduce the DSH objective of joint learning of binary codes and hash functions.
2.2 Objective Formulation of DSH
1) Cross-view Pairwise Loss: We first define the cross-view similarity matrix of and as , where the element of denotes the cross-view similarity between and . The inner product of learned and should sufficiently approximate the similarity matrix . Thus, we consider the following problem:
(1)  
where is the Frobenius norm and is the element-wise product. The cross-view similarity matrix can be defined by semantic label information as if and otherwise. By Eq. (1), the binary codes of natural images and sketches from the same category will be pulled as close as possible, and pushed far apart otherwise.
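The reason an inner product of binary codes can stand in for a similarity score is the identity ⟨a, b⟩ = m − 2·d_H(a, b) for two codes of length m with entries in {−1, +1}: a large inner product means a small Hamming distance. The Python sketch below checks this on toy codes. The ±1 encoding and the helper names are assumptions made for this illustration.

```python
def inner_product(a, b):
    """Inner product of two codes with entries in {-1, +1}."""
    return sum(x * y for x, y in zip(a, b))


def hamming(a, b):
    """Number of positions where the two codes disagree."""
    return sum(x != y for x, y in zip(a, b))


# Toy 8-bit codes that differ in exactly two positions.
a = [1, -1, 1, 1, -1, 1, -1, -1]
b = [1, -1, -1, 1, -1, 1, 1, -1]
m = len(a)
print(inner_product(a, b), m - 2 * hamming(a, b))  # 4 4
```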
2) Semantic Factorization Loss: Beyond the cross-view similarity, we also consider preserving the intra-set semantic relationships for both the image set and the sketch set . However, the given 0/1 label matrices and can only provide binary measurements (i.e., the samples belong to the same category or not), which causes all different categories to have equivalent distance (e.g., “cheetah” will be as different from “tiger” as from “dolphin”). Thus, directly using such discrete label information will implicitly make all categories independent and discard the latent correlation of high-level semantics.
Inspired by the recent development of word embeddings [41], in this paper, we overcome the above drawback by utilizing the NLP word-vector toolbox (https://code.google.com/archive/p/word2vec/; the model is trained on the first billion characters from Wikipedia) to map the independent labels into the high-level semantic space. As such, the intrinsic semantic correlation among different labels can be quantitatively measured and captured (e.g., the semantic embedding of “cheetah” will be closer to “tiger” but further from “dolphin”). As semantic embeddings intentionally guide the learning of high-quality binary codes, we optimize the following semantic factorization problem:
(2)  
where is the word embedding model, and , is the dimension of word embedding. is the shared basis of the semantic factorization for both views. Note that the shared basis we used helps to preserve the latent semantic correlations which also benefits crossview code learning in SBIR.
Final Objective Function: Unlike previous hashing methods using continuous relaxation during code learning, we keep the binary constraints in the DSH optimization. By recalling Eq. (1) and Eq. (2), we obtain our final objective function:
(3)  
Here, and are the balance parameters. The last two regularization terms aim to minimize the quantization loss between binary codes , and deep hash functions , . Similar regularization terms are also used in [50, 36] for effective hash code learning. Next, we will elaborate on how to optimize problem (3).
3 Optimization
It is clear that problem (3) is non-convex and non-smooth, and is in general NP-hard due to the binary constraints. To address this, we propose an alternating optimization algorithm, which sequentially updates , , and deep hash functions in an iterative fashion. In practice, we first pre-train C1Net (Bottom) and C2Net (Top) as classification nets using natural images and sketches with corresponding semantic labels. After that, the pre-trained models are applied in our semi-heterogeneous deep model as in Fig. 3 and then optimized with the following alternating steps.
Update Step.
By fixing all variables except for , Eq. (3) reduces to a classic quadratic regression problem
(4) 
which can be solved analytically as
(5) 
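As a hedged illustration of why a quadratic regression sub-problem of this kind admits an analytical solution, consider a generic two-view least-squares problem. The symbols $E_1, E_2$ (fixed regression targets), $B_1, B_2$ (fixed code matrices) and $D$ (the shared basis being solved for) are names assumed for this sketch only and are not taken from Eq. (4). Setting the gradient with respect to $D$ to zero gives the solution, provided the matrix being inverted is non-singular:

```latex
\min_{D}\; \|E_1 - D B_1\|_F^2 + \|E_2 - D B_2\|_F^2
\\[4pt]
\nabla_D = -2\,(E_1 - D B_1)\,B_1^{\top} - 2\,(E_2 - D B_2)\,B_2^{\top} = 0
\\[4pt]
D\,\bigl(B_1 B_1^{\top} + B_2 B_2^{\top}\bigr) = E_1 B_1^{\top} + E_2 B_2^{\top}
\\[4pt]
D = \bigl(E_1 B_1^{\top} + E_2 B_2^{\top}\bigr)\bigl(B_1 B_1^{\top} + B_2 B_2^{\top}\bigr)^{-1}
```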
Update Step.
By fixing all other variables, we optimize by the following equation
(6)  
We further rewrite (6) as
(7)  
where and .
It is challenging to directly optimize with discrete constraints. Inspired by the discrete cyclic coordinate descent (DCC) [51], we learn each row of by fixing all other rows, i.e., each time we only optimize one single bit of all samples. We denote , , and as the rows of , , and respectively, . For convenience, we also have
(8) 
It is not difficult to show Eq.(7) can be rewritten w.r.t. as
(9)  
Thus, the closedform solution for the row of can be obtained by
(10) 
In this way, the binary codes can be optimized bit by bit and finally reach a stationary point.
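The bit-by-bit idea can be illustrated on a deliberately simplified objective. The sketch below applies discrete cyclic coordinate descent to min ‖X − DB‖² over B with entries in {−1, +1}; this stand-in objective, the function names `objective` and `dcc`, and the toy data are all assumptions for illustration and do not reproduce Eq. (7) or Eq. (10). With all other rows fixed, the objective is linear in one row of B, so the optimal row is the sign of a residual term and each update can only keep or lower the objective.

```python
import random


def objective(X, D, B):
    """Squared Frobenius norm of X - D B (X: d x n, D: d x m, B: m x n)."""
    d, n, m = len(X), len(X[0]), len(B)
    total = 0.0
    for r in range(d):
        for c in range(n):
            approx = sum(D[r][k] * B[k][c] for k in range(m))
            total += (X[r][c] - approx) ** 2
    return total


def dcc(X, D, B, sweeps=3):
    """Discrete cyclic coordinate descent: update one bit (row of B) at a time."""
    d, n, m = len(X), len(X[0]), len(B)
    B = [row[:] for row in B]
    # Q = D^T X (m x n) and G = D^T D (m x m) stay fixed during the sweeps.
    Q = [[sum(D[r][k] * X[r][c] for r in range(d)) for c in range(n)]
         for k in range(m)]
    G = [[sum(D[r][k] * D[r][j] for r in range(d)) for j in range(m)]
         for k in range(m)]
    for _ in range(sweeps):
        for k in range(m):
            for c in range(n):
                # Contribution of all other bits of sample c.
                others = sum(G[k][j] * B[j][c] for j in range(m) if j != k)
                B[k][c] = 1 if Q[k][c] - others >= 0 else -1
    return B


random.seed(0)
X = [[random.gauss(0, 1) for _ in range(6)] for _ in range(4)]
D = [[random.gauss(0, 1) for _ in range(3)] for _ in range(4)]
B0 = [[random.choice([-1, 1]) for _ in range(6)] for _ in range(3)]
B1 = dcc(X, D, B0)
print(objective(X, D, B1) <= objective(X, D, B0))  # True
```

Because every single-bit update is an exact minimizer given the other bits, the objective is monotonically non-increasing, which is the property that lets such a scheme settle at a stationary point.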
Update Step.
By fixing all other variables, we learn hash code with a similar formulation to Eq.(10).
and Update Step.
Once and are obtained, we update parameters and of C1Net and C2Net according to the following Euclidean loss:
(11) 
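By first computing the partial gradients and , we can obtain by the chain rule. We then use the standard mini-batch back-propagation (BP) scheme to simultaneously update and for our entire deep architecture. In practice, the above procedure can be easily achieved by deep learning toolboxes (e.g., Caffe [22]).

Table 2. Comparison with previous SBIR methods on the TU-Berlin Extension and Sketchy datasets: MAP, precision@200, retrieval time per query and memory load.

| Methods | Dimension | TU-Berlin Extension: MAP | Precision@200 | Retrieval time per query | Memory load (MB) | Sketchy: MAP | Precision@200 | Retrieval time per query | Memory load (MB) |
|---|---|---|---|---|---|---|---|---|---|
| HOG [8] | 1296 | 0.091 | 0.120 | 1.43 | | 0.115 | 0.159 | 0.53 | |
| GF-HOG [18] | 3500 | 0.119 | 0.148 | 4.13 | | 0.157 | 0.177 | 1.41 | |
| SHELO [46] | 1296 | 0.123 | 0.155 | 1.44 | | 0.161 | 0.182 | 0.50 | |
| LKS [47] | 1350 | 0.157 | 0.204 | 1.51 | | 0.190 | 0.230 | 0.56 | |
| Siamese CNN [43] | 64 | 0.322 | 0.447 | 7.70 | 99.8 | 0.481 | 0.612 | 2.76 | 35.4 |
| SaN [63] | 512 | 0.154 | 0.225 | 0.53 | | 0.208 | 0.292 | 0.21 | |
| GN Triplet* [49] | 1024 | 0.187 | 0.301 | 1.02 | | 0.529 | 0.716 | 0.41 | |
| 3D shape* [57] | 64 | 0.054 | 0.072 | 7.53 | 99.8 | 0.084 | 0.079 | 2.64 | 35.6 |
| Siamese-AlexNet | 4096 | 0.367 | 0.476 | 5.35 | | 0.518 | 0.690 | 1.68 | |
| Triplet-AlexNet | 4096 | 0.448 | 0.552 | 5.35 | | 0.573 | 0.761 | 1.68 | |
| DSH (Proposed) | 32 (bits) | 0.358 | 0.486 | 5.57 | 0.78 | 0.653 | 0.797 | 2.55 | 0.28 |
| | 64 (bits) | 0.521 | 0.655 | 7.03 | 1.56 | 0.711 | 0.858 | 2.82 | 0.56 |
| | 128 (bits) | 0.570 | 0.694 | 1.05 | 3.12 | 0.783 | 0.866 | 3.53 | 1.11 |

‘*’ denotes that we directly use the public models provided by the original papers without any fine-tuning on the TU-Berlin Extension or Sketchy datasets.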
As shown in Fig. 4, we iteratively update in each epoch. As such, DSH can be finally optimized within epochs in total, where . Notice that the overall objective is lower-bounded, thus the convergence of (3) is always guaranteed by the coordinate descent used in our optimization. The overall DSH is summarized in Algorithm 1.
Once the DSH model is trained, given a sketch query , we can compute its binary code with C2Net (Top). For the retrieval database, the unified hash code of each image and sketch-token pair is computed as with C1Net (Bottom) and C2Net (Middle).
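Once both sides are encoded, retrieval reduces to ranking gallery codes by Hamming distance to the query code. The following Python sketch shows this ranking step on toy 4-bit codes stored as integers; the function name `rank_gallery` and the example codes are hypothetical, and ties are broken by gallery index purely for determinism.

```python
def rank_gallery(query_code, gallery_codes):
    """Return gallery indices sorted by Hamming distance to the query."""
    distances = [bin(query_code ^ code).count("1") for code in gallery_codes]
    return sorted(range(len(gallery_codes)), key=lambda i: (distances[i], i))


# Toy 4-bit codes; distances to the query are 0, 3, 1 and 2 respectively.
query = 0b1011
gallery = [0b1011, 0b0000, 0b1010, 0b0111]
print(rank_gallery(query, gallery))  # [0, 2, 3, 1]
```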
4 Experiments
In this section, we conduct extensive evaluations of DSH on the two largest SBIR datasets: TU-Berlin Extension and Sketchy. Our method is implemented using Caffe with dual K80 GPUs for training our deep models and MATLAB 2015b on an i7 4790K CPU for binary coding. Our trained deep models can be downloaded from https://github.com/ymcidence/DeepSketchHashing.
4.1 Datasets and Protocols
Datasets: TU-Berlin [11] Extension contains 250 object categories with 80 free-hand sketches for each category. We also use 204,489 extended natural images associated with TU-Berlin provided by [65] as our natural image retrieval gallery. Sketchy [49] is a newly released dataset originally for fine-grained SBIR, in which 75,471 hand-drawn sketches of 12,500 objects (images) from 125 categories are included. To better fit the task of large-scale SBIR in our paper, we collect another 60,502 natural images (an average of 484 images/category) ourselves from ImageNet [9] to form a new retrieval gallery with 73,002 images in total. Similar to previous hashing evaluations, we randomly select 10 and 50 sketches from each category as the query sets for TU-Berlin and Sketchy respectively, and the remaining sketches and gallery images are used for training (all natural images are used as both training sets and retrieval galleries).

Table 3. Comparison with cross-modality hashing methods and cross-view feature learning methods (MAP and precision@200 at 32, 64 and 128 bits).

| Category | Method | TU-Berlin MAP, 32 bits | 64 bits | 128 bits | TU-Berlin Precision@200, 32 bits | 64 bits | 128 bits | Sketchy MAP, 32 bits | 64 bits | 128 bits | Sketchy Precision@200, 32 bits | 64 bits | 128 bits |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Cross-Modality Hashing Methods (binary codes) | CMFH [10] | 0.149 | 0.202 | 0.180 | 0.168 | 0.282 | 0.241 | 0.320 | 0.490 | 0.190 | 0.489 | 0.657 | 0.286 |
| | CMSSH [2] | 0.121 | 0.183 | 0.175 | 0.143 | 0.261 | 0.233 | 0.206 | 0.211 | 0.211 | 0.371 | 0.376 | 0.375 |
| | SCM-Seq [64] | 0.211 | 0.276 | 0.332 | 0.298 | 0.372 | 0.454 | 0.306 | 0.417 | 0.671 | 0.442 | 0.529 | 0.758 |
| | SCM-Orth [64] | 0.217 | 0.301 | 0.263 | 0.312 | 0.420 | 0.470 | 0.346 | 0.536 | 0.616 | 0.467 | 0.650 | 0.776 |
| | CVH [26] | 0.214 | 0.294 | 0.318 | 0.305 | 0.411 | 0.449 | 0.325 | 0.525 | 0.624 | 0.459 | 0.641 | 0.773 |
| | SePH [30] | 0.198 | 0.270 | 0.282 | 0.307 | 0.380 | 0.398 | 0.534 | 0.607 | 0.640 | 0.694 | 0.741 | 0.768 |
| | DCMH [23] | 0.274 | 0.382 | 0.425 | 0.332 | 0.467 | 0.540 | 0.560 | 0.622 | 0.656 | 0.730 | 0.771 | 0.784 |
| Proposed | DSH | 0.358 | 0.521 | 0.570 | 0.486 | 0.655 | 0.694 | 0.653 | 0.711 | 0.783 | 0.797 | 0.858 | 0.866 |
| Cross-View Feature Learning Methods (continuous-value vectors) | CCA [55] | 0.276 | 0.366 | 0.365 | 0.333 | 0.482 | 0.536 | 0.361 | 0.555 | 0.705 | 0.379 | 0.610 | 0.775 |
| | XQDA [28] | 0.191 | 0.197 | 0.201 | 0.263 | 0.278 | 0.278 | 0.460 | 0.557 | 0.550 | 0.607 | 0.715 | 0.727 |
| | PLSR [59] | 0.141 (4096-d) | | | 0.215 (4096-d) | | | 0.462 (4096-d) | | | 0.623 (4096-d) | | |
| | CVFL [60] | 0.289 (4096-d) | | | 0.407 (4096-d) | | | 0.675 (4096-d) | | | 0.803 (4096-d) | | |

PLSR and CVFL are both based on reconstructing partial data to approximate full data, so the dimensions are fixed to 4096-d.
Compared Methods and Implementation Details: We first compare the proposed DSH with several previous SBIR methods, including hand-crafted HOG [8], GF-HOG [18], SHELO [46], LKS [47]; and deep learning-based Siamese CNN [43], Sketch-a-Net (SaN) [63], GN Triplet [49], 3D shape [57]. For HOG, GF-HOG, SHELO, Siamese CNN and 3D shape, we first need to compute Canny edge-maps from natural images and then extract the features. In detail, we compute GF-HOG via a BoW scheme with a codebook size of 3500; for HOG, SHELO and LKS, we exactly follow the best settings used in [47]. Due to the lack of stroke order information in the Sketchy dataset, we only use a single deep channel SaN in our experiments as in [62]. We fine-tune Siamese CNN and SaN on the TU-Berlin and Sketchy datasets, while the public models of GN Triplet and 3D shape are only allowed for direct feature extraction without any re-training. Additionally, we add Siamese-AlexNet (with contrastive loss) and Triplet-AlexNet (with triplet ranking loss) as baselines, both of which are constructed and trained by ourselves on the two datasets. Particularly, the semantic pairwise/triplet supervision for our Siamese/Triplet-AlexNet is constructed in the same way as in [43]/[61], respectively.
Moreover, DSH is also compared with state-of-the-art cross-modality hashing techniques: Collective Matrix Factorization Hashing (CMFH) [10], Cross-Modal Semi-Supervised Hashing (CMSSH) [2], Cross-View Hashing (CVH) [26], Semantic Correlation Maximization (SCM-Seq and SCM-Orth) [64], Semantics-Preserving Hashing (SePH) [30] and Deep Cross-Modality Hashing (DCMH) [23]. Note that since DCMH is a deep hashing method originally for image-text retrieval, in our experiments, we modify it into a Siamese net by replacing the text embedding channel with an identical parallel image channel. In addition, another four cross-view feature embedding methods, CCA [55], PLSR [59], XQDA [28] and CVFL [60], are used for comparison. Except for DCMH, each image and sketch in both datasets is represented by 4096-d AlexNet [24] fc7 and 512-d SaN fc7 deep features, respectively. Since these hashing and feature embedding methods need pairwise data with corresponding labels as inputs, in our experiments, we further construct these deep features into 100,000 sample pairs per dataset (400 pairs per category for the 250 TU-Berlin Extension categories and 800 pairs per category for the 125 Sketchy categories) to train all of the above cross-modality methods.
For the proposed DSH, we train our deep model using SGD on Caffe with an initial learning rate of 0.001, a momentum of 0.9 and a batch size of 64. We decrease every epoch and terminate the optimization after 15 epochs. For both datasets, our balance parameters are set to = and = via cross validation on training sets.
In the test phase, we report the mean average precision (MAP) and precision at top-rank 200 (precision@200) to evaluate the category-level SBIR. For all hashing methods, we also evaluate the precision of Hamming distance with radius 2 (HD2) and the precision-recall curves. Additionally, we report the retrieval time per query () from image galleries and memory loads (MB) for compared methods.
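For reference, the two main metrics can be sketched as follows. Each query is represented by a ranked list of 0/1 relevance flags (1 when the retrieved image shares the query's category). The function names are hypothetical, and normalizing average precision by the number of relevant items found in the ranked list is an assumption of this sketch; when the whole gallery is ranked, this equals the total number of relevant items.

```python
def precision_at_k(relevance, k):
    """Fraction of the top-k retrieved items that are relevant."""
    return sum(relevance[:k]) / k


def average_precision(relevance):
    """Mean of the precision values at each rank where a relevant item occurs."""
    hits = 0
    total = 0.0
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


def mean_average_precision(relevance_lists):
    """Average of the per-query average precision values."""
    return sum(average_precision(r) for r in relevance_lists) / len(relevance_lists)


# Toy ranked result for one query: relevant items at ranks 1, 3 and 4.
ranked = [1, 0, 1, 1, 0]
print(precision_at_k(ranked, 2))             # 0.5
print(round(average_precision(ranked), 4))   # 0.8056
```

For precision@200 as used here, `k` would be 200 and the ranked list would come from sorting the gallery by distance to the query.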
4.2 Results and Discussions
DSH vs. SBIR Baselines: In Table 2, we demonstrate the comparison of MAP and precision@200 over all SBIR methods on two datasets. Generally, deep learning-based methods can achieve much better performance than hand-crafted methods, and the results on Sketchy are higher than those on TU-Berlin Extension since the data in Sketchy is relatively simpler with fewer categories. Our 128-bit DSH leads to superior results with 0.138/0.142 and 0.210/0.105 improvements (MAP/precision@200) over the best-performing comparison methods on the two datasets, respectively. This is because the semi-heterogeneous deep architecture of DSH is specifically designed for category-level SBIR by effectively introducing the auxiliary sketch-tokens to mitigate the geometric distortion between free-hand sketches and natural images. The other deep methods (Siamese CNN, GN Triplet and 3D shape) only incorporate images and sketches as training data with a simple multi-channel deep structure. Among the compared methods, we notice that 3D shape produces worse SBIR performance than previous papers [57, 62] reported. In [62], the images from the retrieval gallery all contain well-aligned objects with perfect background removal, thus the edge-maps computed from such images can well represent the objects and have almost identical stroke patterns to free-hand sketches, which guarantees a good SBIR performance. However, in our tasks, all images in the retrieval gallery are realistic with relatively complex backgrounds and there is still a big dissimilarity between the computed edge-maps and sketches. Therefore, 3D shape features extracted from our edge-maps become ineffective. Similar problems also exist in SaN, HOG and SHELO. In addition, the retrieval time and memory load are listed in Table 2. Our DSH can achieve significantly faster speed with much lower memory load compared to conventional SBIR methods during retrieval.
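The memory loads reported for DSH in Table 2 can be reproduced with simple arithmetic, which also shows where the saving comes from. The sketch below assumes tightly packed bits and 1 MB = 2^20 bytes; for real-valued features it further assumes 8 bytes per value. These are assumptions that happen to match the reported figures, not details stated in the text, and the function names are hypothetical.

```python
def binary_code_memory_mb(num_items, bits):
    """Memory for `num_items` packed binary codes of `bits` bits each, in MB."""
    return num_items * bits / 8 / (1024 ** 2)


def real_valued_memory_mb(num_items, dims, bytes_per_value=8):
    """Memory for real-valued features, assuming `bytes_per_value` bytes each."""
    return num_items * dims * bytes_per_value / (1024 ** 2)


TU_BERLIN_GALLERY = 204489  # gallery size given in Section 4.1
SKETCHY_GALLERY = 73002     # gallery size given in Section 4.1

print(round(binary_code_memory_mb(TU_BERLIN_GALLERY, 32), 2))   # 0.78
print(round(binary_code_memory_mb(TU_BERLIN_GALLERY, 128), 2))  # 3.12
print(round(binary_code_memory_mb(SKETCHY_GALLERY, 64), 2))     # 0.56
print(round(real_valued_memory_mb(TU_BERLIN_GALLERY, 64), 1))   # 99.8
```

A 64-dimensional real-valued feature under these assumptions occupies 64 times the space of a 64-bit code, which matches the gap between 99.8 MB and 1.56 MB on TU-Berlin Extension.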
DSH vs. Cross-modality Hashing: We also compare our DSH with cross-modality hashing/feature learning methods in Table 3. As mentioned before, we use the learned deep features as the inputs for non-deep methods to achieve a fair comparison with our DSH. In particular, SCM-Orth and SePH always lead to high accuracies among compared non-deep hashing methods on both datasets. With its deep end-to-end structure, DCMH can achieve better results than non-deep hashing methods, while CMFH and CMSSH produce the weakest results due to un(semi-)supervised learning mechanisms. For cross-view feature learning schemes, CCA and CVFL achieve superior performance on TU-Berlin Extension and Sketchy datasets, respectively. Our DSH can consistently outperform all other methods in Table 3. The superior performance of DSH is also demonstrated in 64-bit precision-recall curves and HD2 curves along different code lengths (shown in Fig. 5) by comparing the Area Under the Curve (AUC). Besides, we illustrate the t-SNE visualization in Fig. 7, where the analogous DSH distributions of the test sketches and image gallery intuitively reflect the effectiveness of DSH codes. Lastly, some query examples with top-20 SBIR retrieval results are shown in Fig. 6.
[Figure panels: (a) Image retrieval gallery; (b) Test sketch queries]
DSH Component Analysis: We have evaluated the effectiveness of different components of DSH in Table 4. Specifically, we construct a heterogeneous deep net by only using the C2Net (Top) and C1Net (Bottom) channels with the same binary coding scheme. Using only images and sketches in this way lowers the MAP on both datasets (0.497 vs. 0.570 and 0.682 vs. 0.783 in Table 4), which demonstrates the importance of sketch-tokens for mitigating the geometric distortion. We also observe that using only the cross-view pairwise loss term or only the semantic factorization loss term results in worse performance than applying the full model, since the cross-view similarities and the intrinsic semantic correlations captured in DSH can complement each other and simultaneously benefit the final MAPs.
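Table 4. Effectiveness (MAP) of different components of DSH.

| Method | TU-Berlin Extension | Sketchy |
|---|---|---|
| C2Net (Top) + C1Net (Bottom) only | 0.497 | 0.682 |
| C2Net (Top) + C2Net (Middle) only | 0.379 | 0.507 |
| Using Cross-view Pairwise Loss only | 0.522 | 0.715 |
| Using Semantic Factorization Loss only | 0.485 | 0.667 |
| Our proposed full DSH model | 0.570 | 0.783 |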
5 Conclusion
In this paper, we proposed a novel deep hashing framework, named deep sketch hashing (DSH), for fast sketch-based image retrieval (SBIR). Particularly, a semi-heterogeneous deep architecture was designed to encode free-hand sketches and natural images, together with the auxiliary sketch-tokens which can effectively mitigate the geometric distortion between the two modalities. To train DSH, binary codes and deep hash functions were jointly optimized in an alternating manner. Extensive experiments validated the superiority of DSH over the state-of-the-art methods in terms of retrieval accuracy and time/storage complexity.
References
 [1] K. Bozas and E. Izquierdo. Large scale sketch based image retrieval using patch hashing. In International Symposium on Visual Computing, 2012.
 [2] M. M. Bronstein, A. M. Bronstein, F. Michel, and N. Paragios. Data fusion through crossmodality metric learning using similaritysensitive hashing. In CVPR, 2010.
 [3] X. Cao, H. Zhang, S. Liu, X. Guo, and L. Lin. Symfish: A symmetryaware flip invariant sketch histogram shape descriptor. In CVPR, 2013.
 [4] Y. Cao, M. Long, and J. Wang. Correlation hashing network for efficient crossmodal retrieval. arXiv preprint arXiv:1602.06697, 2016.
 [5] Y. Cao, M. Long, J. Wang, Q. Yang, and P. S. Yu. Deep visualsemantic hashing for crossmodal retrieval. In KDD, 2016.
 [6] Y. Cao, C. Wang, L. Zhang, and L. Zhang. Edgel index for largescale sketchbased image search. In CVPR, 2011.
 [7] Y. Cao, H. Wang, C. Wang, Z. Li, L. Zhang, and L. Zhang. Mindfinder: interactive sketchbased image search on millions of images. In ACM MM, 2010.
 [8] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In CVPR, 2005.
 [9] J. Deng, W. Dong, R. Socher, L.J. Li, K. Li, and L. FeiFei. Imagenet: A largescale hierarchical image database. In CVPR, pages 248–255. IEEE, 2009.
 [10] G. Ding, Y. Guo, and J. Zhou. Collective matrix factorization hashing for multimodal data. In CVPR, 2014.
 [11] M. Eitz, J. Hays, and M. Alexa. How do humans sketch objects? ACM Trans. Graph., 31(4):44–1, 2012.
 [12] M. Eitz, K. Hildebrand, T. Boubekeur, and M. Alexa. An evaluation of descriptors for largescale image retrieval from sketched feature lines. Computers & Graphics, 34(5):482–498, 2010.
 [13] M. Eitz, K. Hildebrand, T. Boubekeur, and M. Alexa. Sketchbased image retrieval: Benchmark and bagoffeatures descriptors. IEEE Transactions on Visualization and Computer Graphics, 17(11):1624–1636, 2011.
 [14] V. Erin Liong, J. Lu, G. Wang, P. Moulin, and J. Zhou. Deep hashing for compact binary codes learning. In CVPR, 2015.
 [15] T. Furuya and R. Ohbuchi. Hashing crossmodal manifold for scalable sketchbased 3d model retrieval. In International Conference on 3D Vision, 2014.
 [16] A. Gionis, P. Indyk, R. Motwani, et al. Similarity search in high dimensions via hashing. In VLDB, volume 99, pages 518–529, 1999.
 [17] Y. Gong, S. Lazebnik, A. Gordo, and F. Perronnin. Iterative quantization: A procrustean approach to learning binary codes for largescale image retrieval. TPAMI, 35(12):2916–2929, 2013.
 [18] R. Hu, M. Barnard, and J. Collomosse. Gradient field descriptor for sketch based retrieval and localization. In ICIP, 2010.
 [19] R. Hu and J. Collomosse. A performance evaluation of gradient field hog descriptor for sketch based image retrieval. Computer Vision and Image Understanding, 117(7):790–806, 2013.
 [20] R. Hu, T. Wang, and J. Collomosse. A bagofregions approach to sketchbased image retrieval. In ICIP, 2011.
 [21] S. James, M. J. Fonseca, and J. Collomosse. Reenact: Sketch based choreographic design from archival dance footage. In ICMR, 2014.
 [22] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. In ACM MM, 2014.
 [23] Q.-Y. Jiang and W.-J. Li. Deep cross-modal hashing. arXiv preprint arXiv:1602.02255, 2016.
 [24] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In NIPS, 2012.
 [25] B. Kulis and K. Grauman. Kernelized locality-sensitive hashing for scalable image search. In CVPR, 2009.
 [26] S. Kumar and R. Udupa. Learning hash functions for cross-view similarity search. In IJCAI, 2011.
 [27] K. Li, K. Pang, Y.-Z. Song, T. Hospedales, H. Zhang, and Y. Hu. Fine-grained sketch-based image retrieval: The role of part-aware attributes. In WACV, 2016.
 [28] S. Liao, Y. Hu, X. Zhu, and S. Z. Li. Person re-identification by local maximal occurrence representation and metric learning. In CVPR, 2015.
 [29] J. J. Lim, C. L. Zitnick, and P. Dollár. Sketch tokens: A learned mid-level representation for contour and object detection. In CVPR, 2013.
 [30] Z. Lin, G. Ding, M. Hu, and J. Wang. Semantics-preserving hashing for cross-view retrieval. In CVPR, 2015.
 [31] L. Liu, Z. Lin, L. Shao, F. Shen, G. Ding, and J. Han. Sequential discrete hashing for scalable cross-modality similarity retrieval. TIP, 26(1):107–118, 2017.
 [32] L. Liu and L. Shao. Sequential compact code learning for unsupervised image hashing. TNNLS, 27(12):2526–2536, 2016.
 [33] L. Liu, M. Yu, and L. Shao. Multiview alignment hashing for efficient image search. TIP, 24(3):956–966, 2015.

 [34] L. Liu, M. Yu, and L. Shao. Projection bank: From high-dimensional data to medium-length binary codes. In ICCV, 2015.
 [35] L. Liu, M. Yu, and L. Shao. Latent structure preserving hashing. IJCV, 2016.
 [36] W. Liu, C. Mu, S. Kumar, and S.-F. Chang. Discrete graph hashing. In NIPS, 2014.
 [37] W. Liu, J. Wang, R. Ji, Y.-G. Jiang, and S.-F. Chang. Supervised hashing with kernels. In CVPR, 2012.
 [38] W. Liu, J. Wang, S. Kumar, and S.-F. Chang. Hashing with graphs. In ICML, 2011.
 [39] D. G. Lowe. Object recognition from local scale-invariant features. In ICCV, 1999.
 [40] Y. Matsui, K. Aizawa, and Y. Jing. Sketch2manga: Sketch-based manga retrieval. In ICIP, 2014.
 [41] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality. In NIPS, 2013.
 [42] S. Parui and A. Mittal. Similarity-invariant sketch-based image retrieval in large databases. In ECCV, 2014.
 [43] Y. Qi, Y.-Z. Song, H. Zhang, and J. Liu. Sketch-based image retrieval via siamese convolutional neural network. In ICIP, 2016.
 [44] M. Raginsky and S. Lazebnik. Locality-sensitive binary codes from shift-invariant kernels. In NIPS, 2009.
 [45] S. Rastegar, M. Soleymani, H. R. Rabiee, and S. Mohsen Shojaee. MDL-CW: A multimodal deep learning framework with cross weights. In CVPR, 2016.
 [46] J. M. Saavedra. Sketch based image retrieval using a soft computation of the histogram of edge local orientations (shelo). In ICIP, 2014.
 [47] J. M. Saavedra, J. M. Barrios, and S. Orand. Sketch based image retrieval using learned keyshapes (lks). 2015.

 [48] J. M. Saavedra and B. Bustos. An improved histogram of edge local orientations for sketch-based image retrieval. In Joint Pattern Recognition Symposium, 2010.
 [49] P. Sangkloy, N. Burnell, C. Ham, and J. Hays. The sketchy database: learning to retrieve badly drawn bunnies. ACM Transactions on Graphics, 35(4):119, 2016.
 [50] F. Shen, W. Liu, S. Zhang, Y. Yang, and H. T. Shen. Learning binary codes for maximum inner product search. In ICCV, 2015.
 [51] F. Shen, C. Shen, W. Liu, and H. T. Shen. Supervised discrete hashing. In CVPR, 2015.
 [52] B. Siddiquie, B. White, A. Sharma, and L. S. Davis. Multimodal image retrieval for complex queries using small codes. In ACM MM.
 [53] J. Song, Y. Yang, Y. Yang, Z. Huang, and H. T. Shen. Inter-media hashing for large-scale retrieval from heterogeneous data sources. In ACM SIGMOD, 2013.
 [54] X. Sun, C. Wang, C. Xu, and L. Zhang. Indexing billions of images for sketch-based retrieval. In ACM MM, pages 233–242, 2013.
 [55] B. Thompson. Canonical correlation analysis. Encyclopedia of statistics in behavioral science, 2005.
 [56] K.-Y. Tseng, Y.-L. Lin, Y.-H. Chen, and W. H. Hsu. Sketch-based image retrieval on mobile devices using compact hash bits. In ACM MM, pages 913–916, 2012.
 [57] F. Wang, L. Kang, and Y. Li. Sketch-based 3D shape retrieval using convolutional neural networks. In CVPR, 2015.
 [58] Y. Weiss, A. Torralba, and R. Fergus. Spectral hashing. In NIPS, 2008.
 [59] H. Wold. Partial least squares. Encyclopedia of statistical sciences, 1985.
 [60] W. Xie, Y. Peng, and J. Xiao. Cross-view feature learning for scalable social image analysis. In AAAI, 2014.
 [61] T. Yao, F. Long, T. Mei, and Y. Rui. Deep semantic-preserving and ranking-based hashing for image retrieval. In IJCAI, 2016.
 [62] Q. Yu, F. Liu, Y.-Z. Song, T. Xiang, T. M. Hospedales, and C. C. Loy. Sketch me that shoe. In CVPR, 2016.
 [63] Q. Yu, Y. Yang, Y.-Z. Song, T. Xiang, and T. Hospedales. Sketch-a-net that beats humans. In BMVC, 2015.
 [64] D. Zhang and W.-J. Li. Large-scale supervised multimodal hashing with semantic correlation maximization. In AAAI, 2014.
 [65] H. Zhang, S. Liu, C. Zhang, W. Ren, R. Wang, and X. Cao. Sketchnet: sketch classification with web images. In CVPR, 2016.
 [66] Z. Zhang, Y. Chen, and V. Saligrama. Efficient training of very deep neural networks for supervised hashing. In CVPR, 2016.
 [67] Y. Zhen and D.-Y. Yeung. Co-regularized hashing for multimodal data. In NIPS, 2012.
 [68] J. Zhou, G. Ding, and Y. Guo. Latent semantic sparse hashing for cross-modal similarity search. In ACM SIGIR, 2014.
 [69] R. Zhou, L. Chen, and L. Zhang. Sketch-based image retrieval on a large scale database. In ACM MM, pages 973–976, 2012.
 [70] H. Zhu, M. Long, J. Wang, and Y. Cao. Deep hashing network for efficient similarity retrieval. In AAAI, 2016.
 [71] X. Zhu, Z. Huang, H. T. Shen, and X. Zhao. Linear cross-modal hashing for efficient multimedia search. In ACM MM, 2013.