1 Introduction
Matching real images with free-hand sketches has recently aroused extensive research interest in computer vision, multimedia and machine learning, forming the task of sketch-based image retrieval (SBIR). Differing from conventional text-image cross-modal retrieval, SBIR covers a more practical scenario where the targeted candidate images are conceptually unintelligible but visualizable to the user. Several works handle the SBIR task by learning real-valued representations [16, 17, 24, 25, 49, 50, 52, 56, 58, 68, 69]. As an extension of conventional data hashing techniques [20, 51, 21, 53], cross-modal hashing [4, 13, 71, 37, 34, 27, 5, 6] shows great potential in retrieving heterogeneous data with high efficiency thanks to computationally cheap Hamming-space matching, and was recently adopted for large-scale SBIR in [39] with impressive performance. Entering the era of big data, it is therefore both feasible and desirable to seek binary representation learning methods for fast SBIR.

However, the aforementioned works suffer from an obvious drawback. Given a fixed set of categories shared by training and test data, these methods achieve sound SBIR performance, which is a relatively easy task since visual knowledge from all concepts has been explored during parameter learning. In a real-life scenario, however, there is no guarantee that the training categories cover all concepts of potential retrieval queries and database candidates. An extreme case occurs when test data belong to an entirely different set of classes, excluding the trained categories. Unfortunately, experiments show that existing cross-modal hashing and SBIR methods generally fail on this occasion, as the learned retrieval model has no conceptual knowledge about what to find.
Considering both the train-test category exclusion and retrieval efficiency, a novel but realistic task arises: zero-shot SBIR hashing. Fig. 1 briefly illustrates the difference between our task and the conventional SBIR task. In conventional SBIR and cross-modal hashing, the categories of training data include those of test data, marked as 'A' and 'B' in Fig. 1. In contrast, for the zero-shot task, though training data still belong to classes 'A' and 'B', test sketches and images come from other categories, i.e., 'plane' and 'cat' in this case. In the rest of this paper, we denote the training and test categories as seen and unseen classes, since they are respectively known and unknown to the retrieval model.
Our zero-shot SBIR hashing setting is a special case of zero-shot learning in that it infers knowledge beyond the training samples. However, existing works basically focus on single-modal zero-shot recognition [55, 74, 75, 32] and are not suitable for efficient image retrieval. In [66], an inspiring zero-shot hashing scheme is proposed for large-scale data retrieval. Although [66] suggests a reasonable zero-shot train-test split close to Fig. 1 for retrieval experiments, it is still not capable of cross-modal hashing and SBIR.
Regarding the drawbacks and the challenging task discussed above, a novel zero-shot sketch-image hashing (ZSIH) model is proposed in this paper, simultaneously delivering (1) cross-modal hashing, (2) SBIR and (3) zero-shot learning. Leveraging state-of-the-art deep learning and generative hashing techniques, we formulate our deep network according to the following problems:

(1) Not all regions in an image or sketch are informative for cross-modal mapping.

(2) The heterogeneity between image and sketch data needs to be mitigated during training to produce unified binary codes for matching.

(3) Since visual knowledge alone is inadequate for zero-shot SBIR hashing, a back-propagatable deep hashing solution transferring semantic knowledge to the unseen classes is desirable.
The contributions of this work are summarized as follows:

To the best of our knowledge, ZSIH is the first zero-shot hashing work for large-scale SBIR.

We propose an end-to-end three-network structure for deep generative hashing, handling the train-test category exclusion and search efficiency with an attention model, Kronecker fusion and graph convolution.

The ZSIH model produces reasonable retrieval performance under the zero-shot setting, while existing methods generally fail.
Related Works. General cross-modal binary representation learning methods [4, 13, 71, 34, 57, 43, 37, 27, 6, 62, 18, 54, 65, 5, 42] aim to map large-scale heterogeneous data with low computational cost. SBIR, including fine-grained SBIR, learns shared representations to specifically mitigate the expressional gap between hand-drawn sketches and real images [16, 17, 24, 25, 47, 48, 49, 50, 52, 56, 58, 61, 68, 69, 76, 77], while the efficiency issue is not considered. Zero-shot learning [19, 32, 74, 75, 55, 46, 7, 2, 63, 3, 35, 11, 8, 26, 73, 14, 36, 64, 28, 67, 40] is also related to our work, though it does not originally focus on cross-modal retrieval. Among existing research, zero-shot hashing (ZSH) [66] and deep sketch hashing (DSH) [39] are the two closest works to this paper. DSH [39] considers fast SBIR with a deep hashing technique, but it fails to handle the zero-shot setting. ZSH [66] extends the traditional zero-shot task to a retrieval scheme.
2 The Proposed ZSIH Model
This work focuses on free-hand SBIR with deep binary codes under the zero-shot setting, where only image and sketch data belonging to the seen categories are used for training. The proposed deep networks are expected to be capable of encoding and matching unseen sketches with images whose categories have never appeared during training.
We consider a multi-modal data collection $\mathcal{O} = \{(x_i, y_i)\}_{i=1}^{N}$ from the seen categories $\mathcal{C}^{s}$, covering both real images $x_i$ and sketches $y_i$, where $N$ indicates the set size. For simplicity of presentation, it is assumed that image and sketch data with the same index $i$, i.e., $x_i$ and $y_i$, share the same category label. Additionally, similar to many conventional zero-shot learning algorithms, our model requires a set of semantic representations $\mathcal{S} = \{s_i\}_{i=1}^{N}$ for transferring supervised knowledge to the unseen data. The aim is to learn two deep hashing functions $F(\cdot)$ for images and $G(\cdot)$ for sketches. Given a set of image-sketch data belonging to the unseen categories $\mathcal{C}^{u}$ for test, the proposed deep hashing functions encode these unseen data into binary codes, i.e., $F, G: \mathbb{R}^{d} \rightarrow \{0, 1\}^{M}$, where $d$ refers to the original data dimensionality and $M$ is the targeted hash code length. Concretely, as the proposed model handles SBIR under the zero-shot setting, there should be no intersection between the seen categories for training and the unseen classes for test, i.e., $\mathcal{C}^{s} \cap \mathcal{C}^{u} = \emptyset$.
2.1 Network overview
The proposed ZSIH model is an end-to-end deep neural network for zero-shot sketch-image hashing. The architecture of ZSIH is illustrated in Fig. 2. It is composed of three concatenated deep neural networks, i.e., the image/sketch encoders and the multi-modal network, to tackle the problems discussed above.

2.1.1 Image/sketch encoding networks
As shown in Fig. 2, the networks with light blue and grey background refer to the binary encoders $F(\cdot)$ for images and $G(\cdot)$ for sketches respectively. An image or sketch is first fed into a set of corresponding convolutional layers to produce a feature map, and then the attention model mixes informative parts into a single feature vector for further operation. The AlexNet [33] before the last pooling layer is used to obtain the feature map. We introduce the attention mechanism to solve issue 1; its structure is close to [58], with weighted pooling producing a 256-D feature. Binary encoding is performed by a fully-connected layer taking input from the attention model with a non-linearity. During training, $F$ and $G$ are regularized by the output of the multi-modal network, so these two encoders are able to learn modality-free representations for zero-shot sketch-image matching.

2.1.2 Multi-modal network as code learner
The multi-modal network only functions during training. It learns joint representations for sketch-image hashing, handling problem 2 of modality heterogeneity. One possible solution is to introduce a fused representation layer taking inputs from both the image and the sketch modality for further encoding. Inspired by Hu et al. [23], we find a Kronecker-product fusion layer suitable for our model, which is discussed in Sec. 2.2. As shown in Fig. 2, the Kronecker layer takes inputs from the image and sketch attention models and produces a single feature vector for each pair of data points. We index the training images and sketches in a coherent category order, so the proposed network is able to learn compact codes for both images and sketches with clear categorical information.
However, simply mitigating the modality heterogeneity does not fully solve the challenges in ZSIH. As mentioned in problem 3, for zero-shot tasks it is essential to leverage the semantic information of training data to generalize knowledge from the seen categories to the unseen ones. As suggested by many zero-shot learning works [32, 19, 66], semantic representations, e.g., word vectors [44], implicitly determine the category-level relations between data points from different classes. Based on this, during the joint code learning process we enhance the hidden neural representations with the semantic relations within a batch of training data using graph convolutional networks (GCNs) [10, 31]. As can be observed in Fig. 2, two graph convolutional layers are built in the multi-modal network, successively following the Kronecker layer. In this way, in-batch data points with strong latent semantic relations are entitled to interact during gradient computation. Note that the output length of the second graph convolutional layer for each data point is exactly the target hash code length $M$. The formulation of the semantic graph convolution layer is given in Sec. 2.3.
To obtain binary codes as the supervision of $F$ and $G$, we introduce the stochastic generative model of [9] for hashing. A back-propagatable structure of stochastic neurons is built on top of the second graph convolutional layer, producing hash codes. As shown in Fig. 2, a decoding model is placed on top of the stochastic neurons, reconstructing the semantic information. By maximizing the decoding likelihood with gradient-based methods, the whole network is able to learn semantic-aware hash codes, which also accords with our perspective on issue 3 for zero-shot sketch-image hashing. We elaborate on this design in Sec. 2.4 and 2.5.

2.2 Fusing sketch and image with a Kronecker layer
Sketch-image feature fusion plays an important role in our task, as addressed in problem 2 of Sec. 1. An information-rich fused neural representation is needed for accurate encoding and decoding. To this end, we utilize recent advances in Kronecker-product-based feature learning [23] as the fusion network. Denoting the attention model outputs of a sketch-image pair from the same category as $\tilde{x}$ and $\tilde{y}$, a non-linear data fusion operation can be derived as

$$\mathbf{f} = \sigma\left(\mathcal{W} \,\bar{\times}_1\, \tilde{x} \,\bar{\times}_2\, \tilde{y}\right). \quad (1)$$

Here $\mathcal{W}$ is a third-order tensor of fusion parameters and $\bar{\times}_i$ denotes the tensor dot product; we use the left subscript to indicate on which axis the tensor dot operates. Decomposing $\mathcal{W}$ with the Tucker decomposition [60], the fused output of the Kronecker layer in our model is derived as

$$\mathbf{f} = \sigma\left((W_x \tilde{x}) \otimes (W_y \tilde{y})\right), \quad (2)$$

resulting in a 65536-D feature vector. Here $\otimes$ is the Kronecker product operation between two tensors, and $W_x$, $W_y$ are trainable linear transformation parameters. $\sigma(\cdot)$ refers to the activation function, which is the non-linearity of [45] for this layer.

The Kronecker layer [23] is supposed to be a better choice for feature fusion in ZSIH than many conventional methods such as layer concatenation or a factorized model [70]. This is because the Kronecker layer largely expands the feature dimensionality of the hidden states with a limited number of parameters, and consequently stores more expressive structural relations between sketches and images.
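As a concrete illustration, the decomposed fusion of Eq. (2) can be sketched in a few lines of NumPy. The weight scales and the ReLU stand-in for the layer's non-linearity are our assumptions, not the paper's exact configuration:

```python
import numpy as np

def kronecker_fusion(x, y, Wx, Wy):
    """Kronecker-product fusion of a 256-D image feature x and a 256-D
    sketch feature y: transform each branch linearly, then take the
    Kronecker product, giving a 256 * 256 = 65536-D fused vector."""
    hx = Wx @ x                    # transformed image-branch feature
    hy = Wy @ y                    # transformed sketch-branch feature
    fused = np.kron(hx, hy)        # Kronecker product of the two vectors
    return np.maximum(fused, 0.0)  # ReLU as a placeholder non-linearity

rng = np.random.default_rng(0)
x, y = rng.normal(size=256), rng.normal(size=256)
Wx = 0.01 * rng.normal(size=(256, 256))
Wy = 0.01 * rng.normal(size=(256, 256))
f = kronecker_fusion(x, y, Wx, Wy)
print(f.shape)  # (65536,)
```

Note the parameter economy this illustrates: two 256x256 matrices (about 131K weights) yield a 65536-D fused state, whereas plain concatenation of the same inputs would give only 512 dimensions.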
2.3 Semanticrelationenhanced hidden representation with graph convolution
In this subsection, we describe how the categorical semantic relations are enhanced in our ZSIH model using GCNs. Consider a batch of training data consisting of $n$ category-coherent sketch-image pairs with their semantic representations $\{s_i\}_{i=1}^{n}$, and denote by $H^{(l)}$ the hidden state of the $l$-th layer in the multi-modal network for this batch, to be fed into a graph convolutional layer. As mentioned in Sec. 2.1.2, for our graph convolutional layers each training batch is regarded as an $n$-vertex graph. A convolutional filter parameterized by $\Theta^{(l)}$ can then be applied to $H^{(l)}$, producing the $(l+1)$-th hidden state $H^{(l+1)}$. As suggested by [31], this can be approached with a layer-wise propagation rule, i.e.,

$$H^{(l+1)} = \sigma\left(\tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}} H^{(l)} \Theta^{(l)}\right), \quad (3)$$

using the first-order approximation of the localized graph filter [10, 22]. Again, $\sigma(\cdot)$ is the activation function and $\Theta^{(l)}$ refers to the linear transformation parameter. $\tilde{A} = A + I_n$ is a self-connected in-batch adjacency and $\tilde{D}$ is defined by $\tilde{D}_{ii} = \sum_j \tilde{A}_{ij}$. As can be seen in Fig. 2, the in-batch adjacency is determined by the semantic representations, of which each entry can be computed with a Gaussian kernel on the semantic distance, i.e., $A_{ij} = \exp\left(-\|s_i - s_j\|_2^2 / \sigma_A^2\right)$. In the proposed ZSIH model, two graph convolutional layers are built; the per-point output dimension of the second layer equals the target code length $M$. We choose a standard non-linearity for the first layer and the sigmoid function for the second one to restrict the output values between 0 and 1.

Intuitively, the graph convolutional layer proposed by [31] can be construed as performing elementary row transformations on a batch of data from a fully-connected layer before activation, according to the graph Laplacian of $\tilde{A}$. In this way, the semantic relations between different data points are intensified within the network hidden states, benefiting our zero-shot hashing model in exploring semantic knowledge. Traditionally, correlating different deep representations can be tackled by adding a trace-like regularization term to the learning objective. However, this introduces additional hyper-parameters to balance the loss terms, and the hidden states of different data points within the network remain isolated.
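A minimal sketch of this batch-wise graph convolution follows; the Gaussian-kernel adjacency and its bandwidth value are assumptions for illustration, and the function names are ours:

```python
import numpy as np

def semantic_adjacency(S, bandwidth=1.0):
    """Self-connected in-batch adjacency from semantic vectors S (n x d).
    A Gaussian kernel on pairwise squared distances is assumed here."""
    d2 = ((S[:, None, :] - S[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / bandwidth ** 2) + np.eye(len(S))

def gcn_layer(H, A_tilde, W, act=np.tanh):
    """One propagation step of [31]: H' = act(D^-1/2 A D^-1/2 H W),
    mixing each row of H with its semantically close batch neighbours."""
    d_inv_sqrt = np.diag(1.0 / np.sqrt(A_tilde.sum(axis=1)))
    return act(d_inv_sqrt @ A_tilde @ d_inv_sqrt @ H @ W)

rng = np.random.default_rng(0)
S = rng.normal(size=(4, 300))  # 4 word vectors of a toy batch
H = rng.normal(size=(4, 8))    # hidden states from the fusion layer
W = rng.normal(size=(8, 3))    # filter parameters, code length M = 3
H_next = gcn_layer(H, semantic_adjacency(S), W)
print(H_next.shape)  # (4, 3)
```

The key design point visible here is that the adjacency multiplies the whole batch at once, so gradient signal flows between semantically related pairs without any extra regularization term.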
2.4 Stochastic neurons and decoding network
The encoder-decoder model for ZSIH is introduced in this subsection. Inspired by [9], a set of latent probability variables $p \in [0, 1]^{M}$ is obtained from the output of the second graph convolutional layer, corresponding to the hash code $b \in \{0, 1\}^{M}$ for a sketch-image pair $(x, y)$ with the semantic feature $s$. The stochastic neurons [9] are imposed on $p$ to produce binary codes through a sampling procedure:

$$b_m = \begin{cases} 1 & \text{if } p_m \geq \epsilon_m, \\ 0 & \text{otherwise}, \end{cases} \qquad m = 1, \dots, M, \quad (4)$$

where $\epsilon_m \sim \mathcal{U}(0, 1)$ are random variables. As proved in [9], this structure is differentiable, allowing error back-propagation from the decoder to the previous layers. Therefore, the posterior of $b$, i.e., $p(b \mid x, y)$, is approximated by a Multinoulli distribution:

$$p(b \mid x, y) = \prod_{m=1}^{M} p_m^{b_m} (1 - p_m)^{1 - b_m}. \quad (5)$$
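The sampling step of Eq. (4) is a one-liner. The straight-through note in the comments describes one common way to make such neurons back-propagatable, not necessarily the exact estimator of [9]:

```python
import numpy as np

def stochastic_neurons(p, rng):
    """Sample binary codes from probabilities p in [0,1]^M:
    b_m = 1 if p_m >= eps_m with eps_m ~ U(0,1), so E[b_m] = p_m.
    In the backward pass, db/dp is commonly approximated as identity
    (straight-through), which keeps the sampling differentiable in
    practice."""
    eps = rng.uniform(size=p.shape)
    return (p >= eps).astype(float)

rng = np.random.default_rng(0)
p = np.full(10000, 0.3)
b = stochastic_neurons(p, rng)
print(round(b.mean(), 2))  # close to 0.3
```

Because the code bits are unbiased samples of the probabilities, averaging many draws recovers $p$, which is what makes Monte Carlo gradient estimation over these neurons sensible.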
We follow the idea of generative hashing to build a decoder on top of the stochastic neurons. During the optimization of ZSIH, this decoder is regularized by the semantic representations using the following Gaussian likelihood with the reparametrization trick [30], i.e.,

$$p(s \mid b) = \mathcal{N}\left(s;\, \mu(b),\, \mathrm{diag}(\sigma^2(b))\right), \quad (6)$$

where $\mu(\cdot)$ and $\sigma^2(\cdot)$ are implemented by fully-connected layers with identity activations. To this end, the whole network can be trained end-to-end. The learning objective is given in the next subsection.
2.5 Learning objective and optimization
The learning objective of the whole network for a batch of sketch and image data is defined as follows:

$$\min_{\Theta}\; \mathcal{L} = \mathbb{E}_{b \sim p(b \mid x, y)}\left[\log p(b \mid x, y) - \log p(s \mid b)\right] + \|F(x) - b\|_2^2 + \|G(y) - b\|_2^2. \quad (7)$$

Concretely, the expectation term in Eq. (7) simulates the variational-like learning objectives [30, 9] of a generative model. However, we are not exactly lower-bounding any data prior distribution, since that is generally not feasible for our ZSIH network. Eq. (7) is an empirically-built loss, simultaneously maximizing the output code entropy via the $\log p(b \mid x, y)$ term and preserving the semantic knowledge for the zero-shot task via $-\log p(s \mid b)$. The single-modality encoding functions $F$ and $G$ are trained towards the stochastic neuron outputs of the multi-modal network using L2 losses. The sketch-image similarities are reflected by assigning related sketches and images the same code. To this end, $F$ and $G$ are able to encode out-of-sample data without additional category information, as the imposed training codes are semantic-knowledge-aware.

The gradient of our learning objective w.r.t. the network parameters $\Theta$ can be estimated by a Monte Carlo process, sampling $b$ with the small random signal $\epsilon$ according to Eq. (4), which can be derived as

$$\nabla_{\Theta}\, \mathbb{E}_{b}\left[\log p(b \mid x, y) - \log p(s \mid b)\right] \approx \frac{1}{K} \sum_{k=1}^{K} \nabla_{\Theta}\left[\log p(b^{(k)} \mid x, y) - \log p(s \mid b^{(k)})\right], \quad b^{(k)} = \mathbb{1}\left(p \geq \epsilon^{(k)}\right). \quad (8)$$

As $\log p(b \mid x, y)$ forms an inverse cross-entropy loss and $p(s \mid b)$ is reparametrized, this estimated gradient can be easily computed. Alg. 1 illustrates the whole training process of the proposed ZSIH model, where the update operator refers to the Adam optimizer [29] for adaptive gradient scaling. Different from many existing deep cross-modal and zero-shot hashing models [5, 39, 66, 27] which require alternating optimization procedures, ZSIH can be efficiently and conveniently trained end-to-end with SGD.
2.6 Outofsample extension
Once the network of ZSIH is trained, it is able to hash image and sketch data from the unseen classes for matching. The codes can be obtained as follows:

$$B^{(x)} = \mathbb{1}\left(F(X^{u}) \geq \tfrac{1}{2}\right), \quad B^{(y)} = \mathbb{1}\left(G(Y^{u}) \geq \tfrac{1}{2}\right) \in \{0, 1\}^{N_{te} \times M}, \quad (9)$$

where $N_{te}$ is the size of the test data. As shown in Fig. 2, the encoding networks $F$ and $G$ stand on their own: semantic representations of test data are not required, and there is no need to pass test data through the multi-modal network. Thus, encoding test data involves no extra overhead and is efficient.
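Under this thresholding reading of Eq. (9), out-of-sample encoding reduces to a forward pass plus a cut at 0.5 (the threshold value is our assumption, consistent with sigmoid outputs in [0, 1]; the toy encoder below is purely illustrative):

```python
import numpy as np

def encode(forward_pass, data, threshold=0.5):
    """Hash unseen data: run the modality-specific encoder (F or G),
    then binarize its [0,1] outputs at the given threshold."""
    return (forward_pass(data) >= threshold).astype(int)

# Toy stand-in for a trained encoder F: any map into [0, 1]^M.
toy_F = lambda X: 1.0 / (1.0 + np.exp(-X @ np.ones((4, 3))))
codes = encode(toy_F, np.array([[1.0, -2.0, 0.5, 0.0],
                                [-1.0, 2.0, -0.5, 0.0]]))
print(codes)
```

Matching then happens entirely in Hamming space, where code comparison costs only XOR and popcount operations.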
Table 1: Zero-shot retrieval mAP@all comparison with hashing baselines.

| Method | Cross-Modal | Binary Code | Zero-Shot | Sketchy (Extended) 32 bits | 64 bits | 128 bits | TU-Berlin (Extended) 32 bits | 64 bits | 128 bits |
|---|---|---|---|---|---|---|---|---|---|
| ZSH [66] | | ✓ | ✓ | 0.146 | 0.165 | 0.168 | 0.132 | 0.139 | 0.153 |
| CCA [59] | ✓ | | | 0.092 | 0.089 | 0.084 | 0.083 | 0.074 | 0.062 |
| CMSSH [4] | ✓ | ✓ | | 0.094 | 0.096 | 0.111 | 0.073 | 0.077 | 0.080 |
| CMFH [13] | ✓ | ✓ | | 0.115 | 0.116 | 0.125 | 0.114 | 0.118 | 0.135 |
| SCM-Orth [71] | ✓ | ✓ | | 0.105 | 0.107 | 0.093 | 0.089 | 0.092 | 0.095 |
| SCM-Seq [71] | ✓ | ✓ | | 0.092 | 0.100 | 0.084 | 0.084 | 0.087 | 0.072 |
| CVH [34] | ✓ | ✓ | | 0.076 | 0.075 | 0.072 | 0.065 | 0.061 | 0.055 |
| SePH-Rand [37] | ✓ | ✓ | | 0.108 | 0.097 | 0.094 | 0.071 | 0.065 | 0.070 |
| SePH-KM [37] | ✓ | ✓ | | 0.069 | 0.066 | 0.071 | 0.067 | 0.068 | 0.065 |
| DSH [39] | ✓ | ✓ | | 0.137 | 0.164 | 0.165 | 0.119 | 0.122 | 0.146 |
| ZSIH | ✓ | ✓ | ✓ | 0.232 | 0.254 | 0.259 | 0.201 | 0.220 | 0.234 |
3 Experiments
3.1 Implementation details
The proposed ZSIH model is implemented with the popular deep learning toolbox TensorFlow [1]. We use AlexNet [33] pre-trained on ImageNet [12], up to the last pooling layer, to build our image and sketch CNNs. The attention mechanism is inspired by Song et al. [58], but without the shortcut connection. The attended 256-D feature is obtained by a weighted pooling operation according to the attention map. All configurations of our network are provided in Fig. 2. We obtain the semantic representation of each data point using the 300-D word vector [44] of the class name. When a class name is not included in the word vector dictionary, it is replaced by a synonym. For all of our experiments, the training batch size is 250, and the network is trained end-to-end.

3.2 Zero-shot experimental settings
To perform SBIR with binary codes under the newly-defined zero-shot cross-modal setting, the experiments of this work are conducted on two large-scale sketch datasets, i.e., Sketchy [52] and TU-Berlin [15], with extended images obtained from [39]. We follow the SBIR evaluation metrics of [39], where sketch queries and image retrieval candidates with the same label are marked as relevant, and our retrieval performances are reported based on nearest-neighbour search in the Hamming space.

Sketchy Dataset [52] (Extended). This dataset originally consists of hand-drawn sketches and corresponding images from 125 categories, considerably enlarged by the extended real images provided by Liu et al. [39]. We randomly pick 25 classes of sketches and images as the unseen test set for SBIR, and data from the remaining 100 seen classes are used for training. During the test phase, the sketches from the unseen classes are taken as retrieval queries, while the retrieval gallery is built from all the images of the unseen categories. Note that the test classes are not present during training for zero-shot retrieval.

TU-Berlin Dataset [15] (Extended). The TU-Berlin dataset contains sketches belonging to 250 categories. We also utilize the extended natural images provided in [39, 72]. 30 classes of images and sketches are randomly selected to form the retrieval gallery and query set respectively, and the rest of the data are used for training. Since the quantities of real images from different classes are extremely imbalanced, we additionally require each test category to have at least 400 images when picking the test set.
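The split protocol above can be mirrored by a short helper; the function name and the rejection rule for under-populated classes are ours, written for illustration only:

```python
import random

def zero_shot_split(class_to_images, n_unseen, min_images=0, seed=0):
    """Randomly choose unseen test classes (optionally requiring at
    least `min_images` images each, as for TU-Berlin); the rest are
    the seen training classes, so the two sets never intersect."""
    rng = random.Random(seed)
    eligible = [c for c, imgs in class_to_images.items()
                if len(imgs) >= min_images]
    unseen = set(rng.sample(eligible, n_unseen))
    seen = sorted(c for c in class_to_images if c not in unseen)
    return seen, sorted(unseen)

# Toy example: 10 classes, half with enough images to be test classes.
data = {f"class_{i}": list(range(500 if i % 2 == 0 else 100))
        for i in range(10)}
seen, unseen = zero_shot_split(data, n_unseen=2, min_images=400)
print(len(seen), len(unseen))  # 8 2
```

Fixing the random seed keeps the seen/unseen partition reproducible across baseline runs, which matters when every compared method must see the same split.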
3.3 Comparison with existing methods
Table 2: Comparison with SBIR and zero-shot baselines (64-bit results for hashing models).

| Type | Method | Sketchy (Extended) mAP@all | Precision@100 | Feature Dimension | Retrieval Time (s) | TU-Berlin (Extended) mAP@all | Precision@100 | Feature Dimension | Retrieval Time (s) |
|---|---|---|---|---|---|---|---|---|---|
| SBIR | Softmax Baseline | 0.099 | 0.176 | 4096 | – | 0.083 | 0.139 | 4096 | – |
| | Siamese CNN [49] | 0.143 | 0.183 | 64 | – | 0.122 | 0.153 | 64 | – |
| | SaN [69] | 0.104 | 0.129 | 512 | – | 0.096 | 0.112 | 512 | – |
| | GN Triplet [52] | 0.211 | 0.310 | 1024 | – | 0.189 | 0.241 | 1024 | – |
| | 3D Shape [61] | 0.062 | 0.070 | 64 | – | 0.057 | 0.063 | 64 | – |
| | DSH (64 bits) [39] | 0.164 | 0.227 | 64 (binary) | – | 0.122 | 0.198 | 64 (binary) | – |
| Zero-Shot | CMT [55] | 0.084 | 0.096 | 300 | – | 0.065 | 0.082 | 300 | – |
| | DeViSE [19] | 0.071 | 0.078 | 300 | – | 0.067 | 0.075 | 300 | – |
| | SSE [74] | 0.108 | 0.154 | 100 | – | 0.096 | 0.133 | 220 | – |
| | JLSE [75] | 0.126 | 0.178 | 100 | – | 0.107 | 0.165 | 220 | – |
| | SAE [32] | 0.210 | 0.302 | 300 | – | 0.161 | 0.210 | 300 | – |
| | ZSH (64 bits) [66] | 0.165 | 0.217 | 64 (binary) | – | 0.139 | 0.174 | 64 (binary) | – |
| Proposed | ZSIH (64 bits) | 0.254 | 0.340 | 64 (binary) | – | 0.220 | 0.291 | 64 (binary) | – |
As cross-modal hashing for SBIR under the zero-shot setting has, to the best of our knowledge, never been proposed before, the number of potentially related existing baselines is limited. Our task can be regarded as a combination of conventional cross-modal hashing, SBIR and zero-shot learning. Therefore, we adopt existing methods according to these themes for retrieval performance evaluation, using seen-unseen splits identical to ours for training and testing the selected baselines. The deep-learning-based baselines are retrained end-to-end using the zero-shot setting mentioned above. For the non-deep baselines, we extract the respective AlexNet [33] fc_7 features pre-trained on the seen sketches and images as model training inputs for a fair comparison with our deep model.
Table 3: Ablation study, 64-bit mAP@all.

| Description | Sketchy | TU-Berlin |
|---|---|---|
| Kron. layer → concatenation | 0.228 | 0.207 |
| Kron. layer → MFB [70] | 0.236 | 0.211 |
| Stochastic neurons → bit regularization | 0.187 | 0.158 |
| Decoder → classifier | 0.162 | 0.133 |
| Without GCNs | 0.233 | 0.171 |
| GCNs → word-vector fusion | 0.219 | 0.176 |
| Large kernel bandwidth for GCNs | 0.062 | 0.055 |
| Small kernel bandwidth for GCNs | 0.241 | 0.202 |
| ZSIH (full model) | 0.254 | 0.220 |
Cross-Modal Hashing Baselines. Several state-of-the-art cross-modal hashing works are introduced, including CMSSH [4], CMFH [13], SCM [71], CVH [34], SePH [37] and DSH [39], where DSH [39] can also be regarded as an SBIR model and is thus closely related to our work. In addition, CCA [59] is considered as a conventional cross-modal baseline, though it learns real-valued joint representations.
Zero-Shot Baselines. Existing zero-shot learning works are not originally designed for cross-modal search. We select a set of state-of-the-art zero-shot learning algorithms as benchmarks, including CMT [55], DeViSE [19], SSE [74], JLSE [75], SAE [32] and the zero-shot hashing model ZSH [66]. For CMT [55], DeViSE [19] and SAE [32], two sets of 300-D embedding functions are trained for sketches and images with the word vectors [44] as the semantic information for nearest-neighbour retrieval, and the classifiers used in these works are ignored. SSE [74] and JLSE [75] are based on seen-unseen class mapping, so the output embedding sizes are set to 100 and 220 for the Sketchy [52] and TU-Berlin [15] datasets respectively. We train two modality-specific encoders of ZSH [66] simultaneously for our task.
Sketch-Image Mapping Baselines. Siamese CNN [49], SaN [69], GN Triplet [52], 3D Shape [61] and DSH [39] are involved as SBIR baselines. We follow the instructions of the original papers to build and train the networks under our zero-shot setting. A softmax baseline is additionally introduced, which computes distances between 4096-D AlexNet [33] features pre-trained on the seen classes for nearest-neighbour search.
Results and Analysis. The zero-shot cross-modal retrieval mean average precisions (mAP@all) of ZSIH and several hashing baselines are given in Tab. 1, while the corresponding precision-recall (PR) curves and precision@100 scores are illustrated in Fig. 3. The performance margins between ZSIH and the selected baselines are significant, suggesting that existing cross-modal hashing methods fail to handle our zero-shot task. ZSH [66] turns out to be the only well-known zero-shot hashing model, and it attains relatively better results than the other baselines; however, it is originally designed for single-modal data retrieval. DSH [39] leads the SBIR performance under the conventional cross-modal hashing setting, but we observe a dramatic performance drop when extending it to the unseen categories. Some retrieval results are provided in Fig. 4. Fig. 5 shows the 32-bit t-SNE [41] results of ZSIH on the training and test sets, where a clearly scattered map on the unseen classes can be observed. We also illustrate the retrieval performance w.r.t. the number of seen classes in Fig. 5. It can be seen that ZSIH produces acceptable retrieval performance as long as an adequate number of seen classes is provided to explore the semantic space.

The comparisons with SBIR and zero-shot baselines are shown in Tab. 2, where a performance margin akin to that of Tab. 1 can be observed. To some extent, the SBIR baselines based on positive-negative samples, e.g., Siamese CNN [49] and GN Triplet [52], have the ability to generalize the learned representations to unseen classes. SAE [32] produces the closest performance to ZSIH among the zero-shot learning baselines. Similar to ZSH [66], these zero-shot baselines suffer from the problem of mitigating the modality heterogeneity. Furthermore, most of the methods in Tab. 2 learn real-valued representations, which leads to poor retrieval efficiency when performing nearest-neighbour search in the high-dimensional continuous space.
3.4 Ablation study
Some ablation study results are reported in this subsection to justify the design choices of our proposed model.

Baselines. The baselines in this subsection are built by modifying parts of the original ZSIH model. To demonstrate the effectiveness of the Kronecker layer for data fusion, we introduce two baselines by replacing the Kronecker layer [23] with conventional feature concatenation and with the multi-modal factorized bilinear pooling (MFB) layer [70]. Regularizing the output bits with quantization error and bit decorrelation losses identical to [38] is also considered as a baseline, replacing the stochastic neurons [9]. The impact of the semantic-aware encoding-decoding design is evaluated by substituting a classifier for the semantic decoder. We introduce another baseline by replacing the graph convolutional layers [31] with conventional fully-connected layers. Fusing the word embedding into the multi-modal network is also tested as a replacement for graph convolution. Several different settings of the Gaussian kernel bandwidth for the GCN adjacency are also reported.

Results and Analysis. The ablation study results are shown in Tab. 3. To keep the paper concise, we only report the 64-bit mAP on the two datasets. The reported baselines typically underperform the proposed model. Both feature concatenation and MFB [70] produce reasonable retrieval performance, but the figures are still clearly lower than those of our original design. We speculate this is because the Kronecker layer considerably expands the hidden state dimensionality and, therefore, the network is able to store more information for cross-modal hashing. When testing the bit regularization baseline similar to [38], we experience an unstable training procedure that easily leads to overfitting; the quantization error and bit decorrelation losses introduce additional hyper-parameters to the model, making training difficult. Replacing the semantic decoder with a classifier results in a dramatic performance fall, as the classifier provides essentially no semantic information and fails to generalize knowledge from the seen classes to the unseen ones. The graph convolutional layers [31] also play an important role in our model, as the mAP drops markedly when they are removed. Graph convolution enhances hidden representations within the neural network by correlating data points that are semantically close, benefiting our zero-shot task. As to the hyper-parameters, a large kernel bandwidth generally leads to a tightly-related graph adjacency, making data points from different categories hard to distinguish. On the contrary, an extremely small bandwidth yields a sparsely-connected graph with binary-like edges, where only data points from the same category are linked. This is also sub-optimal in exploring the semantic relations for zero-shot tasks.

4 Conclusion
In this paper, a novel but realistic task, efficient large-scale zero-shot SBIR hashing, was studied and successfully tackled by the proposed zero-shot sketch-image hashing (ZSIH) model. We designed an end-to-end three-network deep architecture to learn shared binary representations and encode sketch/image data. Modality heterogeneity between sketches and images was mitigated by a Kronecker layer with attended features. Semantic knowledge was introduced to assist visual information via graph convolutions and a generative hashing scheme. Experiments suggest that the proposed ZSIH model significantly outperforms existing methods on our zero-shot SBIR hashing task.
References
 [1] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, et al. TensorFlow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467, 2016.
 [2] Z. Akata, S. Reed, D. Walter, H. Lee, and B. Schiele. Evaluation of output embeddings for fine-grained image classification. In CVPR, 2015.
 [3] Z. Al-Halah, M. Tapaswi, and R. Stiefelhagen. Recovering the missing link: Predicting class-attribute associations for unsupervised zero-shot learning. In CVPR, 2016.
 [4] M. M. Bronstein, A. M. Bronstein, F. Michel, and N. Paragios. Data fusion through cross-modality metric learning using similarity-sensitive hashing. In CVPR, 2010.
 [5] Y. Cao, M. Long, J. Wang, and S. Liu. Collective deep quantization for efficient cross-modal retrieval. In AAAI, 2017.
 [6] Y. Cao, M. Long, J. Wang, Q. Yang, and P. S. Yu. Deep visual-semantic hashing for cross-modal retrieval. In ACM SIGKDD, 2016.
 [7] S. Changpinyo, W.-L. Chao, B. Gong, and F. Sha. Synthesized classifiers for zero-shot learning. In CVPR, 2016.
 [8] S. Changpinyo, W.-L. Chao, and F. Sha. Predicting visual exemplars of unseen classes for zero-shot learning. In ICCV, 2017.
 [9] B. Dai, R. Guo, S. Kumar, N. He, and L. Song. Stochastic generative hashing. In ICML, 2017.
 [10] M. Defferrard, X. Bresson, and P. Vandergheynst. Convolutional neural networks on graphs with fast localized spectral filtering. In NIPS, 2016.
 [11] B. Demirel, R. G. Cinbis, and N. I. Cinbis. Attributes2Classname: A discriminative model for attribute-based unsupervised zero-shot learning. In ICCV, 2017.
 [12] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.
 [13] G. Ding, Y. Guo, and J. Zhou. Collective matrix factorization hashing for multimodal data. In CVPR, 2014.
 [14] Z. Ding, M. Shao, and Y. Fu. Low-rank embedded ensemble semantic dictionary for zero-shot learning. In CVPR, 2017.
 [15] M. Eitz, J. Hays, and M. Alexa. How do humans sketch objects? ACM Transactions on Graphics, 31(4):44–1, 2012.
 [16] M. Eitz, K. Hildebrand, T. Boubekeur, and M. Alexa. An evaluation of descriptors for large-scale image retrieval from sketched feature lines. Computers & Graphics, 34(5):482–498, 2010.
 [17] M. Eitz, K. Hildebrand, T. Boubekeur, and M. Alexa. Sketch-based image retrieval: Benchmark and bag-of-features descriptors. IEEE Transactions on Visualization and Computer Graphics, 17(11):1624–1636, 2011.
 [18] V. Erin Liong, J. Lu, Y.-P. Tan, and J. Zhou. Cross-modal deep variational hashing. In ICCV, 2017.
 [19] A. Frome, G. S. Corrado, J. Shlens, S. Bengio, J. Dean, T. Mikolov, et al. DeViSE: A deep visual-semantic embedding model. In NIPS, 2013.
 [20] A. Gionis, P. Indyk, R. Motwani, et al. Similarity search in high dimensions via hashing. In VLDB, 1999.
 [21] Y. Gong, S. Lazebnik, A. Gordo, and F. Perronnin. Iterative quantization: A procrustean approach to learning binary codes for large-scale image retrieval. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(12):2916–2929, 2013.
 [22] D. K. Hammond, P. Vandergheynst, and R. Gribonval. Wavelets on graphs via spectral graph theory. Applied and Computational Harmonic Analysis, 30(2):129–150, 2011.
 [23] G. Hu, Y. Hua, Y. Yuan, Z. Zhang, Z. Lu, S. S. Mukherjee, T. M. Hospedales, N. M. Robertson, and Y. Yang. Attribute-enhanced face recognition with neural tensor fusion networks. In ICCV, 2017.
 [24] R. Hu and J. Collomosse. A performance evaluation of gradient field HOG descriptor for sketch-based image retrieval. In CVIU, 2013.
 [25] R. Hu, T. Wang, and J. Collomosse. A bag-of-regions approach to sketch-based image retrieval. In ICIP, 2011.
 [26] H. Jiang, R. Wang, S. Shan, Y. Yang, and X. Chen. Learning discriminative latent attributes for zero-shot classification. In ICCV, 2017.
 [27] Q.-Y. Jiang and W.-J. Li. Deep cross-modal hashing. In CVPR, 2017.
 [28] N. Karessli, Z. Akata, B. Schiele, and A. Bulling. Gaze embeddings for zero-shot image classification. In CVPR, 2017.
 [29] D. Kingma and J. Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
 [30] D. Kingma and M. Welling. Auto-encoding variational Bayes. In ICLR, 2014.
 [31] T. N. Kipf and M. Welling. Semi-supervised classification with graph convolutional networks. In ICLR, 2017.
 [32] E. Kodirov, T. Xiang, and S. Gong. Semantic autoencoder for zero-shot learning. In CVPR, 2017.
 [33] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.
 [34] S. Kumar and R. Udupa. Learning hash functions for cross-view similarity search. In IJCAI, 2011.
 [35] C. H. Lampert, H. Nickisch, and S. Harmeling. Attribute-based classification for zero-shot visual object categorization. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(3):453–465, 2014.
 [36] Y. Li, D. Wang, H. Hu, Y. Lin, and Y. Zhuang. Zero-shot recognition using dual visual-semantic mapping paths. In CVPR, 2017.
 [37] Z. Lin, G. Ding, M. Hu, and J. Wang. Semantics-preserving hashing for cross-view retrieval. In CVPR, 2015.
 [38] V. E. Liong, J. Lu, G. Wang, P. Moulin, and J. Zhou. Deep hashing for compact binary codes learning. In CVPR, 2015.
 [39] L. Liu, F. Shen, Y. Shen, X. Liu, and L. Shao. Deep sketch hashing: Fast free-hand sketch-based image retrieval. In CVPR, 2017.
 [40] Y. Long, L. Liu, Y. Shen, L. Shao, and J. Song. Towards affordable semantic searching: Zero-shot retrieval via dominant attributes. In AAAI, 2018.
 [41] L. v. d. Maaten and G. Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9(Nov):2579–2605, 2008.
 [42] D. Mandal, K. N. Chaudhury, and S. Biswas. Generalized semantic preserving hashing for n-label cross-modal retrieval. In CVPR, 2017.
 [43] J. Masci, M. M. Bronstein, A. M. Bronstein, and J. Schmidhuber. Multimodal similarity-preserving hashing. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(4):824–830, 2014.
 [44] T. Mikolov, K. Chen, G. Corrado, and J. Dean. Efficient estimation of word representations in vector space. In ICLR Workshop, 2013.
 [45] V. Nair and G. E. Hinton. Rectified linear units improve restricted Boltzmann machines. In ICML, 2010.
 [46] M. Norouzi, T. Mikolov, S. Bengio, Y. Singer, J. Shlens, A. Frome, G. S. Corrado, and J. Dean. Zero-shot learning by convex combination of semantic embeddings. In ICLR, 2014.
 [47] K. Pang, Y.-Z. Song, T. Xiang, and T. Hospedales. Cross-domain generative learning for fine-grained sketch-based image retrieval. In BMVC, 2017.
 [48] S. Parui and A. Mittal. Similarity-invariant sketch-based image retrieval in large databases. In ECCV, 2014.
 [49] Y. Qi, Y.-Z. Song, H. Zhang, and J. Liu. Sketch-based image retrieval via siamese convolutional neural network. In ICIP, 2016.
 [50] J. M. Saavedra. Sketch based image retrieval using a soft computation of the histogram of edge local orientations (SHELO). In ICIP, 2014.
 [51] R. Salakhutdinov and G. Hinton. Semantic hashing. International Journal of Approximate Reasoning, 50(7), 2009.
 [52] P. Sangkloy, N. Burnell, C. Ham, and J. Hays. The sketchy database: Learning to retrieve badly drawn bunnies. ACM Transactions on Graphics, 35(4):119, 2016.
 [53] Y. Shen, L. Liu, and L. Shao. Unsupervised deep generative hashing. In BMVC, 2017.
 [54] Y. Shen, L. Liu, L. Shao, and J. Song. Deep binaries: Encoding semantic-rich cues for efficient textual-visual cross retrieval. In ICCV, 2017.
 [55] R. Socher, M. Ganjoo, C. D. Manning, and A. Ng. Zero-shot learning through cross-modal transfer. In NIPS, 2013.
 [56] J. Song, Y.-Z. Song, T. Xiang, and T. Hospedales. Fine-grained image retrieval: The text/sketch input dilemma. In BMVC, 2017.
 [57] J. Song, Y. Yang, Y. Yang, Z. Huang, and H. T. Shen. Inter-media hashing for large-scale retrieval from heterogeneous data sources. In ACM SIGMOD, 2013.
 [58] J. Song, Q. Yu, Y.-Z. Song, T. Xiang, and T. M. Hospedales. Deep spatial-semantic attention for fine-grained sketch-based image retrieval. In ICCV, 2017.
 [59] B. Thompson. Canonical correlation analysis. Encyclopedia of Statistics in Behavioral Science, 2005.
 [60] L. R. Tucker. Some mathematical notes on three-mode factor analysis. Psychometrika, 31(3):279–311, 1966.
 [61] F. Wang, L. Kang, and Y. Li. Sketch-based 3D shape retrieval using convolutional neural networks. In CVPR, 2015.
 [62] B. Wu, Q. Yang, W.-S. Zheng, Y. Wang, and J. Wang. Quantized correlation hashing for fast cross-modal search. In IJCAI, 2015.
 [63] Y. Xian, Z. Akata, G. Sharma, Q. Nguyen, M. Hein, and B. Schiele. Latent embeddings for zero-shot classification. In CVPR, 2016.
 [64] X. Xu, F. Shen, Y. Yang, D. Zhang, H. T. Shen, and J. Song. Matrix tri-factorization with manifold regularizations for zero-shot learning. In CVPR, 2017.
 [65] E. Yang, C. Deng, W. Liu, X. Liu, D. Tao, and X. Gao. Pairwise relationship guided deep hashing for cross-modal retrieval. In AAAI, 2017.
 [66] Y. Yang, Y. Luo, W. Chen, F. Shen, J. Shao, and H. T. Shen. Zero-shot hashing via transferring supervised knowledge. In ACM MM, 2016.
 [67] M. Ye and Y. Guo. Zero-shot classification with discriminative semantic representation learning. In CVPR, 2017.
 [68] Q. Yu, F. Liu, Y.-Z. Song, T. Xiang, T. M. Hospedales, and C.-C. Loy. Sketch me that shoe. In CVPR, 2016.
 [69] Q. Yu, Y. Yang, Y.-Z. Song, T. Xiang, and T. Hospedales. Sketch-a-Net that beats humans. In BMVC, 2015.
 [70] Z. Yu, J. Yu, J. Fan, and D. Tao. Multi-modal factorized bilinear pooling with co-attention learning for visual question answering. In ICCV, 2017.
 [71] D. Zhang and W.-J. Li. Large-scale supervised multimodal hashing with semantic correlation maximization. In AAAI, 2014.
 [72] H. Zhang, S. Liu, C. Zhang, W. Ren, R. Wang, and X. Cao. SketchNet: Sketch classification with web images. In CVPR, 2016.
 [73] L. Zhang, T. Xiang, and S. Gong. Learning a deep embedding model for zero-shot learning. In CVPR, 2017.
 [74] Z. Zhang and V. Saligrama. Zero-shot learning via semantic similarity embedding. In ICCV, 2015.
 [75] Z. Zhang and V. Saligrama. Zero-shot learning via joint latent similarity embedding. In CVPR, 2016.
 [76] Y. Zhen and D.-Y. Yeung. Co-regularized hashing for multimodal data. In NIPS, 2012.
 [77] R. Zhou, L. Chen, and L. Zhang. Sketch-based image retrieval on a large scale database. In ACM MM, 2012.