1 Introduction
Touchscreen devices, such as smartphones and iPads, enable users to draw freehand sketches conveniently. These sketches are highly iconic, succinct, and abstract representations, and in some scenarios convey richer and more accurate information than text. Consequently, they have spawned many novel applications. One of the most representative examples is Sketch-Based Image Retrieval (SBIR), which has attracted significant attention from the computer vision community during the past decades [14, 47, 34, 22, 51, 32]. For the SBIR task, learning good representations for both sketches and photos is of vital importance and is considered a challenging problem [47, 48, 34, 28, 29, 38].
We give an illustrative example of the SBIR task in Figure 1. Given a query sketch, the objective is to find all photos that are semantically related to it, e.g., photos from the same category. This task appears easy for humans but is very difficult for a machine. The main challenge is the significant gap between the data representations of the two domains: sketches consist of highly iconic, abstract and sparse lines, whilst photos are composed of dense color pixels with rich texture information.
Recently, many works have been proposed to address this problem. A popular approach is to construct a good intermediate representation, e.g., by converting photos to edge maps [47, 34, 22, 51] or by translating sketches into the photo domain using generative models [51]. Another widely adopted approach is to learn a semi-heterogeneous network in an end-to-end manner [32, 22]. These approaches have one thing in common: they all aim to find a feature space in which the gap between sketches and photos is minimized. Therefore, a loss function that aims to minimize the domain discrepancy and the intra-class distance, e.g., the semantic factorization loss [22] or the semantic loss [51], is usually constructed.
In this work, we argue that minimizing only the domain discrepancy and the intra-class distance is not sufficient for accurate SBIR. Even if the distance (in a certain feature space) between samples from the same class is small, irrelevant samples may still lie near the query sample, leading to poor retrieval results. Motivated by this intuition, we propose a novel loss function, named Euclidean Margin Softmax (EMS), that minimizes the intra-class distance and maximizes the inter-class distance simultaneously. We show that the EMS loss yields highly accurate retrieval results, which surpass all existing algorithms by a large margin.
Further, to accompany the proposed loss function, we introduce a conditional neural network architecture that can incorporate our prior knowledge about which domain a sample comes from. Specifically, our base network is the ResNeXt model [43] with the Squeeze-and-Excitation (SE) module [10], since it not only has high representation power but also enables us to encode the conditional information conveniently. In each SE module, the convolutional features are first squeezed into a low-dimensional embedding by an encoder network and then decoded to generate a channel-wise attention vector, which is applied to the original feature maps. Based on the SE module, we can simply append one binary code, which indicates which domain the sample comes from, to the low-dimensional embedding. Through extensive experiments, we show that this change is simple yet highly effective.
Contributions. We highlight the main contributions of this work. (1) We propose a novel loss function that simultaneously minimizes the intra-class distance and maximizes the inter-class distance. (2) We propose a novel conditional network architecture that incorporates additional information about the domain of each sample, which helps to boost the retrieval performance. (3) We obtain new state-of-the-art results on several competitive SBIR tasks. In addition, we show that our model can be directly extended to address the challenging zero-shot SBIR task.
2 Related Work
SBIR. The task of Sketch-Based Image Retrieval (SBIR) aims at retrieving images that have a similar semantic meaning to the query sketch. A typical solution is to learn a shared embedding space for both sketches and images. Such a common space facilitates ranking images by their similarity to a sketch. Previous methods [25, 4, 11, 28] employed hand-crafted features to represent the sketches. Recent deep learning based architectures enable cross-domain learning in an end-to-end manner [38, 47, 48, 34]. To accelerate retrieval in large-scale datasets, hashing based models [22, 32, 51, 46] have also been studied.
Feature embedding. Representation learning is widely studied in the computer vision community [8, 43, 12, 10, 35, 17], but sketch-based representation learning is relatively less studied [48, 34]. Many deep architectures have been investigated, such as ResNet, ResNeXt, and DenseNet. Among them, ResNeXt [43] incorporates both residual learning and group convolution, and is thus adopted as the building structure in this paper. The recent SE module [10] makes networks capable of emphasizing the relatively important channels of the feature maps, and is thus also used here. To effectively embed cross-domain features, several hashing methods [5, 2, 18, 49, 20, 13] and cross-modal embedding methods [1, 37, 21, 44] have been studied.
Loss functions. Many metric learning based methods [9, 26, 42, 40] learn deep features with loss functions based on the Euclidean distance. To make the learned features more discriminative, other variants of loss functions have been investigated recently, such as the contrastive loss [3, 7], the triplet loss [31], and the L-Softmax [24], A-Softmax [23] and LMCL [39] losses. In particular, the contrastive and triplet losses aim at increasing the Euclidean distance margin, while the L-Softmax, A-Softmax, and LMCL losses are designed to expand the angular margin. Remarkably, the latter three losses make simple modifications to the softmax loss and greatly improve the performance on face recognition tasks. Furthermore, the prototypical loss [33] is also a variant of softmax that incorporates the Euclidean distance.
3 Approach
In this section, we first give an overview of the proposed approach, and then introduce the conditional network architecture, the Euclidean Margin Softmax loss and a hashing method for efficient retrieval.
3.1 Overview
Problem Setup. We formulate the sketch-based image retrieval (SBIR) task following [22, 51]. We denote the set of realistic photos and the set of sketches as , and respectively. In the supervised setting, the sketch set is split into training and test sets that share the same label set . Given a query sketch in the test sketch set, the goal of our SBIR task is to retrieve the best matched , such that . Note that the same photo set is used both for training and as the retrieval gallery, following the setting defined in [22].
In the zero-shot SBIR task, we divide into and individually; and we have source domain , , and target domain , ; and . The zero-shot SBIR model is trained on the source data , and tested on the target domain .
To facilitate the sketch retrieval tasks, we project the photos and sketches into a single shared metric space. We introduce a novel architecture, CSE-ResNeXt-101, as illustrated in Figure 2. The network is optimized so that sketches and photos of the same class are close to each other, while those of different classes are far apart. By virtue of this optimization, the learned metric space has a large inter-class distance and a small intra-class distance over the photo and sketch sets. Given one sketch or a photo , our CSE-ResNeXt-101 network extracts its feature vector as the output of the last fully connected layer.
3.2 Network Architecture
The key challenge of learning the space for SBIR is how to efficiently preserve the semantic consistency across the sketch and photo domains. In the sketch domain, we have highly iconic and abstract sketches with various levels of deformation, whereas the photos are natural images with color, texture and shape information. To address this challenge and facilitate SBIR, we propose a novel network architecture, Conditional SE ResNeXt-101 (CSE-ResNeXt-101), as shown in Figure 2. It is composed of forty CSE ResNeXt blocks. Each block integrates the Conditional SE (CSE) module into the ResNeXt block. The CSE module is illustrated in Figure 3.
Rather than using independent subnetworks to process the photos and sketches separately, CSE-ResNeXt-101 learns a single network to analyze the photos and sketches jointly. Such a SiameseNet-like network is inspired by the fact that SiameseNet is efficient in learning an embedding space across different domains (e.g., image-text embedding [41] or person re-identification [36]).
Another novelty of our CSE-ResNeXt-101 is the newly proposed Conditional SE (CSE) module, which builds on the SE module [10]. Networks with the SE module have achieved remarkable performance on the ILSVRC 2017 classification task. The SE module also provides an explicit mechanism to reweight the importance of channels after each block in the network. Due to the nature of our cross-domain task, we can generalize the SE module to the CSE module with a very simple change. As shown in Figure 3, our CSE module uses an autoencoder subnetwork followed by a sigmoid activation. Within the space learned by the autoencoder, a binary code is added to indicate whether the input image is a sketch or a photo. The outputs of the sigmoid activation are used as an attention vector over the feature channels. This conditional autoencoder structure thus helps to capture different characteristics of the input images, conditioned on which domain they come from.
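The squeeze, condition and excite steps described above can be sketched in a few lines of framework-free Python. This is only an illustrative sketch under stated assumptions: the function names (`cse_attention`, `linear`), the argument layout, and the ReLU after the encoder (borrowed from the standard SE block) are hypothetical and not taken from the paper's implementation.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def linear(weights, bias, x):
    """Dense layer; `weights` holds one row of coefficients per output unit."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def cse_attention(feature_maps, is_sketch, enc_w, enc_b, dec_w, dec_b):
    """Rescale channels with a domain-conditioned attention vector.

    feature_maps: list of C channels, each a flat list of spatial activations.
    is_sketch:    domain flag, appended to the embedding as a 0/1 code.
    dec_w:        rows of length (embedding size + 1) to accept the extra bit.
    """
    # Squeeze: global average pooling gives one scalar per channel.
    squeezed = [sum(ch) / len(ch) for ch in feature_maps]
    # Encode into a low-dimensional embedding (ReLU is an assumption).
    embedding = [max(0.0, v) for v in linear(enc_w, enc_b, squeezed)]
    # Condition: append the binary domain code to the embedding.
    conditioned = embedding + [1.0 if is_sketch else 0.0]
    # Decode into one attention weight in (0, 1) per channel.
    attention = [sigmoid(v) for v in linear(dec_w, dec_b, conditioned)]
    # Excite: multiply every channel by its attention weight.
    return [[a * v for v in ch] for a, ch in zip(attention, feature_maps)]
```

Because the domain bit enters only through the decoder, the same convolutional weights serve both domains while the channel weighting can differ between sketches and photos.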
3.3 Euclidean Margin Softmax
Softmax loss revisited. The softmax loss is widely used in classification tasks. Given a feature vector with its ground-truth class , this loss can be written as
(1) 
where (i.e., ) denotes the activation of the th category; in total, we have categories and training instances. and are the class weight parameters of the th category, which are optimized by Eq. (1). The class label of a testing instance is computed as . In binary classification, if , the instance will be assigned the class label , and vice versa.
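For reference, the standard softmax (cross-entropy) loss with linear class activations can be sketched as follows. This is a minimal sketch of the textbook formulation, assuming that each activation is an inner product with per-class weights plus a bias; the names `softmax_loss` and `predict` are illustrative only.

```python
import math


def class_activations(x, weights, biases):
    """Activation of every category: inner product with class weights plus bias."""
    return [sum(w * xi for w, xi in zip(w_j, x)) + b_j
            for w_j, b_j in zip(weights, biases)]


def softmax_loss(features, labels, weights, biases):
    """Mean negative log-likelihood of the ground-truth classes."""
    total = 0.0
    for x, y in zip(features, labels):
        logits = class_activations(x, weights, biases)
        # Subtract the maximum before exponentiating for numerical stability.
        top = max(logits)
        log_norm = top + math.log(sum(math.exp(z - top) for z in logits))
        total += log_norm - logits[y]
    return total / len(features)


def predict(x, weights, biases):
    """Assign the class with the largest activation."""
    logits = class_activations(x, weights, biases)
    return max(range(len(logits)), key=logits.__getitem__)
```

With all weights and biases at zero, every class is equally likely and the loss equals the logarithm of the number of classes, which is a convenient sanity check.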
EMS loss. The SBIR task involves sketches and photos, which come from two different domains. The loss function should be optimized to learn a shared space with a very large inter-class distance and a small intra-class distance over the photo and sketch sets. To this end, we propose a novel loss function, the Euclidean Margin Softmax (EMS) loss. EMS is a generalization of the softmax loss in Eq. (1). It is defined as
(2) 
We explain the parameters in Eq. (2); and indicates the feature extracted by the last layer of the CSE-ResNeXt-101 network. (1) is the center of the th category. We treat the centers as parameters and update them dynamically, rather than directly using the average feature as the center. (2) In Eq. (2), we employ the Euclidean distance to measure the confidence of being . (3) is the margin constant, which helps take into account the different data distributions of each class. Particularly, in the binary classification case, we categorize as class 1 if , and otherwise as class 2.
EMS vs. prototypical loss. We discuss the difference between the EMS loss and the prototypical loss [33], which is
(3) 
The prototypical loss is used in one-shot classification, where only a few training instances are available for each class. Thus is directly computed as the mean of the training instances; in contrast, our EMS loss is a generalized softmax loss, which optimizes from the training data. Furthermore, the prototypical loss optimizes over image instances only, while our EMS loss aims at optimizing the training instances of different classes across different domains. Therefore, a margin constant is introduced to balance enlarging the inter-class distance against shrinking the intra-class distance.
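To make the contrast concrete, the sketch below implements a softmax over negative Euclidean distances to class centers, with a margin that scales the distance to the ground-truth center. The placement of the margin is an assumption made for illustration (one plausible reading of "margin constant"), not a transcription of Eq. (2); `margin_distance_softmax` and `class_means` are hypothetical names. The second helper computes centers as class means, in the spirit of the prototypical loss, whereas EMS treats the centers as learnable parameters.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def margin_distance_softmax(features, labels, centers, margin=1.0):
    """Softmax loss over negative distances to class centers.

    Assumption: the distance to the ground-truth center is multiplied by
    `margin`, so a sample must lie `margin` times closer to its own center
    than to any other center to reach the same confidence.
    """
    total = 0.0
    for x, y in zip(features, labels):
        logits = [-(margin if j == y else 1.0) * euclidean(x, c)
                  for j, c in enumerate(centers)]
        top = max(logits)
        log_norm = top + math.log(sum(math.exp(z - top) for z in logits))
        total += log_norm - logits[y]
    return total / len(features)


def class_means(features, labels, num_classes):
    """Centers as per-class feature means (prototypical-style centers).

    Assumes every class has at least one instance.
    """
    dim = len(features[0])
    sums = [[0.0] * dim for _ in range(num_classes)]
    counts = [0] * num_classes
    for x, y in zip(features, labels):
        counts[y] += 1
        for d in range(dim):
            sums[y][d] += x[d]
    return [[s / counts[k] for s in sums[k]] for k in range(num_classes)]
```

Under this assumed form, a margin of 1 gives a plain softmax over negative distances, and larger margins penalize samples that are not clearly closer to their own center than to the others.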
EMS vs. angular margin loss. We further discuss the difference between the EMS loss and angular margin losses. Besides the Euclidean distance, loss functions based on the angular distance, such as A-Softmax [23] and LMCL [39], are also employed to learn a shared space in many tasks, e.g., face recognition. These angular margin losses aim at learning a discriminative distribution on a hypersphere. As shown in Table 1, they define different similarity functions for the instances of different classes. Note that is an artificial piecewise function that serves as the extension of , in order to overcome its non-monotonicity. Nevertheless, it is nontrivial to define the in the A-Softmax function, as stated in [39]. The scalar in LMCL is used to expand the range of the similarity function; otherwise, the output of the softmax function would be close to the uniform distribution over all categories.
Table 1: Similarity functions of A-Softmax [23] and LMCL [39].
Theoretical Analysis of EMS. The property of EMS is determined by the value of . Intuitively, a larger makes the decision boundaries closer to the corresponding prototypes and the distribution of features more compact. In this case, the metric space can be highly discriminative. However, a large introduces instability into training, due to the intrinsic variance among the samples in each category. Thus it is necessary to find the minimum that ensures that, for every sample, and in metric space, the maximum intra-class distance is smaller than the minimum inter-class distance. We prove that is required in all cases. In the binary case, is the minimum value of . As the number of categories grows, the minimum value decreases monotonically, so is sufficient to guarantee perfect discrimination. It is also necessary in multi-class cases, since, if two prototypes are far from the others, their relationship resembles that of the binary case. The details of the proof can be found in the Appendix.
Zero-shot retrieval tasks. In addition to the classical categorical SBIR tasks, our framework can be easily extended to the zero-shot SBIR task. Specifically, in the zero-shot SBIR task, we have the source ( and ) and target ( and ) data. Our model is trained on the source data and directly tested on the target data.
3.4 Dimension Reduction Hashing
To accelerate retrieval, we propose a simple hashing scheme to encode the features generated by the CSE-ResNeXt-101 network from one sketch or a photo . Our hashing scheme projects into a low-dimensional binary space. This is a post-processing step after training the CNN. The hashing scheme is implemented as an autoencoder, whose encoder performs the dimension-reduction mapping and whose decoder performs the inverse mapping. The objective function is
(4) 
where is the reconstruction loss, which maintains the structure among the centers in the low-dimensional space; and the scatter loss prevents the prototypes from being close to each other in the low-dimensional space. is the number of categories. Our hashing scheme can encode all photos and the query sketch into a low-dimensional binary space by a function after the encoder. The Hamming distance can thus be used for the retrieval tasks.
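Once the encoder output is binarized, retrieval reduces to ranking gallery codes by Hamming distance. The sketch below illustrates this step only. It assumes that binarization is done by thresholding at zero (a sign-like function, which is an assumption here), and the helper names are hypothetical.

```python
def binarize(vector):
    """Map a real-valued code to bits by thresholding at zero (assumed)."""
    return [1 if v > 0 else 0 for v in vector]


def hamming(a, b):
    """Number of positions at which two equal-length bit codes differ."""
    return sum(x != y for x, y in zip(a, b))


def retrieve(query_code, gallery_codes):
    """Gallery indices sorted by increasing Hamming distance to the query.

    `sorted` is stable, so ties keep their original gallery order.
    """
    return sorted(range(len(gallery_codes)),
                  key=lambda i: hamming(query_code, gallery_codes[i]))
```

In practice the bit codes would be packed into machine words so that the distance is a XOR followed by a population count, but the ranking logic is the same.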
Table 2: MAP of category-level SBIR.
Methods                   TU-Berlin Extension   Sketchy Extension
HOG [4]                   0.091                 0.115
GF-HOG [11]               0.119                 0.157
SHELO [28]                0.123                 0.182
LKS [29]                  0.157                 0.190
Siamese CNN [27]          0.322                 0.481
SaN [48]                  0.154                 0.208
GN Triplet [30]           0.187                 0.529
3D Shape [38]             0.072                 0.084
Siamese-AlexNet [22]      0.367                 0.518
Triplet-AlexNet [22]      0.448                 0.573
CSE-ResNeXt-101           0.820                 0.958
4 Experiments
In this section, we evaluate our method on two tasks: Category-level SBIR (CSBIR) and Zero-Shot Category-level SBIR (ZSCSBIR).
4.1 Datasets and Settings
Category-level SBIR. Our model is evaluated on two large-scale sketch-photo datasets: TU-Berlin [6] Extension and Sketchy [30] Extension. The former includes 20,000 sketches uniformly distributed among 250 categories. Additionally, 204,489 natural images provided in [50] are used as the photo gallery. The Sketchy database consists of 75,471 hand-drawn sketches and 12,500 corresponding photos from 125 categories. It was extended by another 60,502 photos for the CSBIR task in [22]. Following the settings in [22, 51], 10/50 sketches from each category are picked as the query set for the TU-Berlin/Sketchy dataset, and the rest are used for training. All gallery photos are used in both the training and testing phases.
Zero-Shot Category-level SBIR. We also compare the performance of our model with previous methods on the Zero-Shot Category-level SBIR task, where we still follow the setting of category-level retrieval but split the categories into source and target domains as stated in Sec. 3.1: we randomly select 30/25 categories as the target domain for the TU-Berlin/Sketchy Extension datasets, respectively. As in [32], we only select categories that contain more than 400 images to form the target domain. We train our network on the source domain with the same process as for standard Category-level SBIR and directly test it on the target domain.
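The category split described above can be sketched as follows. This is a minimal illustration, not the authors' script: the function name `split_categories`, the dictionary of per-category image counts, and the fixed seed are all assumptions.

```python
import random


def split_categories(image_counts, num_target, min_images=400, seed=0):
    """Randomly pick target categories among those with enough images.

    image_counts: dict mapping category name to its number of photos.
    Returns (source, target) as two disjoint sets of category names.
    """
    rng = random.Random(seed)
    # Only categories with more than `min_images` photos are eligible.
    eligible = sorted(c for c, n in image_counts.items() if n > min_images)
    if len(eligible) < num_target:
        raise ValueError("not enough eligible categories for the target domain")
    target = set(rng.sample(eligible, num_target))
    # Every remaining category belongs to the source domain.
    source = set(image_counts) - target
    return source, target
```

For the setting in the text, `num_target` would be 30 for TU-Berlin Extension and 25 for Sketchy Extension.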
Implementation. Our method is implemented using PyTorch on a single Titan X GPU. We use the Adam optimizer [15] with parameters . The learning rate is set to and linearly decayed to during training. We construct the batches with the size, and train the networks for 15 epochs. The models and code will be released upon acceptance.
4.2 Results on Category-level SBIR
Table 3: MAP of hashing-based category-level SBIR for different code lengths.
Methods                  TU-Berlin Extension               Sketchy Extension
                         32 bits   64 bits   128 bits      32 bits   64 bits   128 bits
Cross-Modality Hashing Methods (binary codes)
CMFH [5]                 0.149     0.202     0.180         0.320     0.490     0.190
CMSSH [2]                0.121     0.183     0.175         0.206     0.211     0.211
SCM-Seq [49]             0.211     0.276     0.332         0.306     0.417     0.671
SCM-Orth [49]            0.217     0.301     0.263         0.346     0.536     0.616
CVH [18]                 0.214     0.294     0.318         0.325     0.525     0.624
SePH [20]                0.198     0.270     0.282         0.534     0.607     0.640
DCMH [13]                0.274     0.382     0.425         0.560     0.622     0.656
DSH [22]                 0.358     0.521     0.570         0.653     0.711     0.783
GDH [51]                 0.563     0.690     0.651         0.724     0.811     0.784
Cross-View Feature Learning Methods (real-value vectors)
CCA [37]                 0.276     0.366     0.365         0.361     0.555     0.705
XQDA [19]                0.191     0.197     0.201         0.460     0.557     0.550
PLSR [21]                0.141 (4096-d)                    0.462 (4096-d)
CVFL [44]                0.289 (4096-d)                    0.675 (4096-d)
Ours
CSE-ResNeXt-101          0.791     0.817     0.819         0.949     0.952     0.958
Competitors. There are several categories of competitors, as listed in Table 2: (1) hand-crafted feature based models: LKS [29], SHELO [28], HOG [4] and GF-HOG [11]; (2) deep learning based models: 3D Shape [38], Sketch-a-Net (SaN) [48], GN Triplet [30], Siamese CNN [27], Siamese-AlexNet and Triplet-AlexNet [22]. The Mean Average Precision (MAP) is reported.
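For readers unfamiliar with the metric, MAP for category-level retrieval can be computed as sketched below. This is a minimal sketch of the standard definition, assuming that a gallery item is relevant when its category matches that of the query and that the whole gallery is ranked; the function names are illustrative.

```python
def average_precision(ranked_labels, query_label):
    """Average of precision values at the ranks of relevant items.

    ranked_labels: gallery categories, ordered from most to least similar.
    """
    hits = 0
    precision_sum = 0.0
    for rank, label in enumerate(ranked_labels, start=1):
        if label == query_label:
            hits += 1
            precision_sum += hits / rank
    # A query without any relevant gallery item contributes zero.
    return precision_sum / hits if hits else 0.0


def mean_average_precision(rankings, query_labels):
    """Mean of the per-query average precision."""
    aps = [average_precision(r, q) for r, q in zip(rankings, query_labels)]
    return sum(aps) / len(aps)
```

Since the full gallery is ranked, dividing by the number of hits is the same as dividing by the number of relevant gallery items for that query.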
Results. The results are summarized in Table 2. Remarkably, our model outperforms all the competitors by a very large margin. It achieves a MAP improvement of 0.372/0.385 on TU-Berlin/Sketchy Extension over the state-of-the-art real-valued method, Triplet-AlexNet. This demonstrates the efficacy of our model. Note that the improved performance is due to both the novel structure and the EMS loss function used here. We give further analysis in the ablation study.
4.3 Hashing Results on Category-level SBIR
Table 4: Results of zero-shot category-level SBIR on TU-Berlin Extension.
Methods             Dimension   MAP     Precision@100
Siamese CNN [27]    64          0.122   0.153
SaN [48]            512         0.096   0.112
GN Triplet [30]     1024        0.189   0.241
3D Shape [38]       64          0.057   0.063
DSH [22]            64          0.122   0.198
SSE [52]            100         0.096   0.133
JLSE [53]           100         0.107   0.165
SAE [16]            300         0.161   0.210
ZSH [45]            64          0.139   0.174
ZSIH [32]           64          0.220   0.291
Ours                512         0.259   0.369
Ours                64          0.165   0.252
Competitors. (1) Our hashing model is compared against 8 cross-modal hashing methods: Collective Matrix Factorization Hashing (CMFH) [5], Cross-Modal Semi-Supervised Hashing (CMSSH) [2], Cross-View Hashing (CVH) [18], Semantic Correlation Maximization (SCM-Seq and SCM-Orth) [49], Semantics-Preserving Hashing (SePH) [20], Deep Cross-Modality Hashing (DCMH) [13], Deep Sketch Hash (DSH) [22] and Generative Domain-Migration Hashing (GDH) [51]. (2) We also compare against 4 cross-view feature embedding methods: CCA [37], PLSR [21], XQDA [19] and CVFL [44]. We again report the MAP.
Training cost. Our hashing scheme is applied as a post-processing step, in order to make our framework comparable to previous hashing-based SBIR models. Given the features computed by CSE-ResNeXt-101, our hashing scheme is trained for 10,000 steps, and the whole process finishes within 1 minute on our computer.
Results. We summarize our results in Table 3. Our method achieves the best performance among all hashing-based methods and cross-modal learning methods. Notably, our model improves MAP by more than 0.12 in all settings compared with GDH [51], the state-of-the-art method on this task. This further demonstrates the effectiveness of our proposed framework on SBIR tasks.
Our model can be trained efficiently. We only need to train a single network, CSE-ResNeXt-101, in an end-to-end manner, and edge maps are not used. In contrast, the training process of the other competitors is more complex. For example, GDH [51] uses cycle-consistent GANs to transfer sketches into photos. DSH [22] requires precomputed edge maps to bridge the gap between sketches and photos, and a semantic representation (word vectors) is used as a prior on the relationships among categories. These methods are trained with two optimization steps in each iteration: one for the binary codes and the other for the network parameters.
4.4 Results of Zero-Shot CSBIR
Competitors. We compare our method on the ZSCSBIR task with 5 SBIR methods: 3D Shape [38], Sketch-a-Net (SaN) [48], GN Triplet [30], Siamese CNN [27] and DSH [22], and with 5 zero-shot methods: SSE [52], JLSE [53], SAE [16], ZSH [45] and ZSIH [32].
Results. We report the results in Table 4. Note that even when our model is applied directly to the ZSCSBIR task, it still beats all the other ZSCSBIR competitors. This result validates the ability of the proposed CSE-ResNeXt-101 to generalize to unseen categories. We also note that, under the zero-shot setting, our hashing result shows a larger drop relative to the non-hashing result. This makes sense, since our hashing method is not optimized for ZSCSBIR. In particular, our hashing scheme only learns to discriminatively binarize the prototypes of classes in the source domain, while the prototypes of unseen classes might be indiscernible in the form of binary codes.
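The Precision@100 column of Table 4 measures the fraction of relevant items among the top-ranked results. A minimal sketch, assuming category match as the relevance criterion and division by `k` even when the gallery is shorter than `k` (both assumptions; the function name is illustrative):

```python
def precision_at_k(ranked_labels, query_label, k=100):
    """Fraction of the top-k retrieved items that share the query category."""
    top = ranked_labels[:k]
    return sum(label == query_label for label in top) / k
```

Averaging this value over all query sketches gives the reported figure.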
4.5 Ablation Study
We conduct an extensive ablation study to evaluate each component of our model.
Network structure. In this part, we compare the performance of different network structures for the embedding. Sketches and photos can contain similar semantic information but are very different in appearance: sketches use strokes to roughly express the contours of the main objects, while photos use colorful patches to express objects and backgrounds. Deep networks are thought to learn low-level features in early layers and high-level features in later layers. Intuitively, we can use a semi-heterogeneous network as in [22, 32] to process the two domains of images with two branches of CNNs that merge at a later stage. A natural question is how many layers should be used to embed the low-level features of sketches and photos separately. We conduct an ablation study to answer this question.
As shown in Figure 4, we investigate 6 possible merging positions at which the sketch and photo branches, which process sketches and photos separately (i.e., without the SiameseNet style), are aggregated. The results are reported in Table 5. We use both the softmax and EMS losses to optimize the different variants of our network. From these results, we conclude that the variants with earlier aggregation perform better, possibly because more parameters are shared. These results also show that deep CNNs are capable of dealing with two domains of images simultaneously. This validates that the proposed network is reasonable for addressing the SBIR task.
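Table 5: MAP of network variants with different merging positions, trained with the softmax and EMS losses.
Merging position   softmax   EMS ()
5                  0.670     0.678
4                  0.705     0.745
3                  0.727     0.768
2                  0.730     0.764
1                  0.738     0.775
0                  0.737     0.780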
Network modules. We compare various types of CNNs as well as their variants with Squeeze-and-Excitation (SE) modules [10] and the newly proposed conditional SE (CSE) module. Intrinsically, the SE and CSE modules learn different attention weights over feature channels, and thus provide our networks with a dynamic and implicit feature selection mechanism. The MAP results are shown in Table 6. All networks are trained and tested in the same setting. We find that (1) deeper networks perform better, due to their larger capacity for learning from cross-modal images; (2) the SE module enhances the ability of CNNs to process inputs from multiple domains. Moreover, our CSE module is better than the SE module on the SBIR task, since the auxiliary binary code helps it learn to select the important sketch/photo feature channels.
Networks            TU-Berlin Extension   Sketchy Extension
AlexNet             0.528                 0.863
VGG-16              0.676                 0.930
DenseNet-121        0.768                 0.942
ResNet-101          0.772                 0.945
SE-ResNet-101       0.790                 0.947
CSE-ResNet-101      0.801                 0.949
ResNeXt-101         0.780                 0.949
SE-ResNeXt-101      0.807                 0.954
CSE-ResNeXt-101     0.820                 0.958
Table 6: Performance of different network variants, all of which are pretrained on ImageNet. We use EMS loss with to train these networks.
Analysis of components in Hashing. We train an autoencoder as the dimension-reduction embedding for hashing, and our objective function consists of two parts: the reconstruction loss and the scatter loss. We also try adding a quantization loss to these two losses to ensure intra-class compactness in the low-dimensional space. The results reported in Table 7 show that the quantization loss does not improve performance, while its optimization is quite time-consuming.
Objective   TU-Berlin Extension          Sketchy Extension
            32      64      128          32      64      128
s           0.799   0.810   0.823        0.947   0.953   0.958
r+q         0.097   0.014   0.008        0.054   0.079   0.024
r+s         0.791   0.817   0.819        0.949   0.952   0.958
q+s         0.795   0.805   0.824        0.945   0.952   0.957
r+q+s       0.791   0.814   0.820        0.949   0.954   0.957
Table 7: MAP of binary codes produced by encoders trained with different combinations of objective terms. r, q and s denote the reconstruction loss, quantization loss and scatter loss, respectively. We use the CSE-ResNeXt-101 structure.
Different loss functions. Besides the model architecture, the objective function is also a crucial component in learning discriminative features. We compare our EMS loss with (i) two angular margin losses, A-Softmax [23] and LMCL [39], and the original softmax loss; (ii) the squared Euclidean Margin Softmax loss (squared EMS), which uses the squared Euclidean distance instead of the Euclidean distance in EMS; and (iii) the prototypical loss [33]. We also compare the performance of our model using different margins . We use the same CSE-ResNeXt-101 in all cases. The MAP on the two datasets is reported in Table 8. The results again show the advantage of our EMS loss over the other types of losses.
Table 8: MAP of CSE-ResNeXt-101 trained with different loss functions.
Loss                      TU-Berlin Extension   Sketchy Extension
prototypical loss [33]    0.403                 0.858
prototypical loss [33]    0.405                 0.854
squared EMS ()            0.799                 0.957
Softmax                   0.747                 0.932
A-Softmax [23] ()         0.776                 0.944
LMCL [39] ()              0.828                 0.954
EMS ()                    0.408                 0.869
EMS ()                    0.812                 0.950
EMS ()                    0.820                 0.958
EMS ()                    0.828                 0.956
EMS ()                    0.823                 0.958
EMS ()                    0.810                 0.959
EMS ()                    0.798                 0.955
Euclidean margin vs. angular margin. In Table 8, our EMS loss performs better than A-Softmax and at least as well as LMCL. Specifically, LMCL achieves performance comparable to EMS, which shows that both the angular distance and the Euclidean distance can learn discriminative deep embeddings. However, EMS has only one hyperparameter, which can be easily cross-validated (we suggest in most cases), while in LMCL one needs to decide the scale and the margin , which depend heavily on the number of categories in the dataset. The performance of A-Softmax is slightly worse than that of LMCL and EMS, since the prototypes of difficult categories will have small angular distances if optimized by the A-Softmax loss, as explained in [39].
Figure 5 visualizes the distribution of the minimum inter-class distance for the instances of each category. On both datasets, there is a noticeable peak at a small value in the histograms corresponding to the A-Softmax loss. This peak indicates a small inter-class distance among some classes, so some photos in one class would be wrongly retrieved by sketches of another class. In contrast, this phenomenon does not occur with the LMCL and EMS losses, as shown in Figure 5.
EMS vs. prototypical loss. The results in Table 8 show that the EMS loss with a margin outperforms the prototypical loss to a large extent. When , the networks trained with the prototypical loss [33] and the EMS loss achieve almost the same results, due to their similar formulation. In addition, we notice that the two ways of updating the class centers in the prototypical loss yield nearly identical results.
The effects of different margin . The performance of our model grows as the value of increases, because a larger margin forces the network to learn more discriminative representations. But when is too large, the performance stops rising and even begins to fall. As revealed in Figure 6: (1) when , the intra-class distances of some instances are greater than the minimum inter-class distance, which leads to poor retrieval performance; (2) when and , the minimum inter-class distances are generally greater than the intra-class distances, which explains the better performance in these cases. However, the standard deviation within some categories does not decrease with larger due to intrinsic variance, which explains why the performance stops rising or begins to fall when is large.
5 Conclusion
In this paper, we have introduced two innovations: a novel loss function and a conditional network architecture. The proposed Euclidean Margin Softmax (EMS) loss enables us to learn highly discriminative features, which facilitates highly accurate SBIR, while the Conditional Squeeze-and-Excitation (CSE) block allows us to incorporate the domain information of each sample explicitly. Both the loss and the architecture are intuitive and simple to implement. On two popular benchmark SBIR datasets, the proposed model achieves new state-of-the-art results.
References

[1]
G. Andrew, R. Arora, J. Bilmes, and K. Livescu.
Deep canonical correlation analysis.
In
International Conference on Machine Learning
, pages 1247–1255, 2013. 
[2]
M. M. Bronstein, A. M. Bronstein, F. Michel, and N. Paragios.
Data fusion through crossmodality metric learning using
similaritysensitive hashing.
In
Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on
, pages 3594–3601. IEEE, 2010.  [3] S. Chopra, R. Hadsell, and Y. LeCun. Learning a similarity metric discriminatively, with application to face verification. In Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on, volume 1, pages 539–546. IEEE, 2005.
 [4] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on, volume 1, pages 886–893. IEEE, 2005.
 [5] G. Ding, Y. Guo, and J. Zhou. Collective matrix factorization hashing for multimodal data. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2075–2082, 2014.
 [6] M. Eitz, J. Hays, and M. Alexa. How do humans sketch objects? ACM Trans. Graph., 31(4):44–1, 2012.
 [7] R. Hadsell, S. Chopra, and Y. LeCun. Dimensionality reduction by learning an invariant mapping. In null, pages 1735–1742. IEEE, 2006.
 [8] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
 [9] J. Hu, J. Lu, and Y.P. Tan. Discriminative deep metric learning for face verification in the wild. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1875–1882, 2014.
 [10] J. Hu, L. Shen, and G. Sun. Squeezeandexcitation networks. arXiv preprint arXiv:1709.01507, 7, 2017.
 [11] R. Hu, M. Barnard, and J. Collomosse. Gradient field descriptor for sketch based retrieval and localization. In Image Processing (ICIP), 2010 17th IEEE International Conference on, pages 1025–1028. IEEE, 2010.
 [12] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger. Densely connected convolutional networks. In CVPR, volume 1, page 3, 2017.
 [13] Q.Y. Jiang and W.J. Li. Deep crossmodal hashing. CoRR, 2016.
 [14] T. Kato, T. Kurita, N. Otsu, and K. Hirata. A sketch retrieval method for full color image databasequery by visual example. In Pattern Recognition, 1992. Vol. I. Conference A: Computer Vision and Applications, Proceedings., 11th IAPR International Conference on, pages 530–533. IEEE, 1992.
 [15] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
 [16] E. Kodirov, T. Xiang, and S. Gong. Semantic autoencoder for zeroshot learning. arXiv preprint arXiv:1704.08345, 2017.

[17]
A. Krizhevsky, I. Sutskever, and G. E. Hinton.
Imagenet classification with deep convolutional neural networks.
In Advances in neural information processing systems, pages 1097–1105, 2012. 
[18]
S. Kumar and R. Udupa.
Learning hash functions for crossview similarity search.
In
IJCAI proceedingsinternational joint conference on artificial intelligence
, volume 22, page 1360, 2011.  [19] S. Liao, Y. Hu, X. Zhu, and S. Z. Li. Person reidentification by local maximal occurrence representation and metric learning. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2197–2206, 2015.
 [20] Z. Lin, G. Ding, M. Hu, and J. Wang. Semanticspreserving hashing for crossview retrieval. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3864–3872, 2015.
 [21] H. Liu, Z. Ma, J. Han, Z. Chen, and Z. Zheng. Regularized partial least squares for multilabel learning. International Journal of Machine Learning and Cybernetics, 9(2):335–346, 2018.
 [22] L. Liu, F. Shen, Y. Shen, X. Liu, and L. Shao. Deep sketch hashing: Fast freehand sketchbased image retrieval. In Proc. CVPR, pages 2862–2871, 2017.
 [23] W. Liu, Y. Wen, Z. Yu, M. Li, B. Raj, and L. Song. Sphereface: Deep hypersphere embedding for face recognition. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), volume 1, page 1, 2017.
 [24] W. Liu, Y. Wen, Z. Yu, and M. Yang. Largemargin softmax loss for convolutional neural networks. In ICML, pages 507–516, 2016.
 [25] D. G. Lowe. Object recognition from local scaleinvariant features. In Computer vision, 1999. The proceedings of the seventh IEEE international conference on, volume 2, pages 1150–1157. Ieee, 1999.
 [26] J. Lu, G. Wang, W. Deng, P. Moulin, and J. Zhou. Multimanifold deep metric learning for image set classification. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1137–1145, 2015.
 [27] Y. Qi, Y.Z. Song, H. Zhang, and J. Liu. Sketchbased image retrieval via siamese convolutional neural network. In Image Processing (ICIP), 2016 IEEE International Conference on, pages 2460–2464. IEEE, 2016.
 [28] J. M. Saavedra. Sketch based image retrieval using a soft computation of the histogram of edge local orientations (shelo). In Image Processing (ICIP), 2014 IEEE International Conference on, pages 2998–3002. IEEE, 2014.
 [29] J. M. Saavedra, J. M. Barrios, and S. Orand. Sketch based image retrieval using learned keyshapes (lks). In BMVC, volume 1, page 7, 2015.
 [30] P. Sangkloy, N. Burnell, C. Ham, and J. Hays. The sketchy database: learning to retrieve badly drawn bunnies. ACM Transactions on Graphics (TOG), 35(4):119, 2016.
 [31] F. Schroff, D. Kalenichenko, and J. Philbin. Facenet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 815–823, 2015.
 [32] Y. Shen, L. Liu, F. Shen, and L. Shao. Zeroshot sketchimage hashing. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3598–3607, 2018.
 [33] J. Snell, K. Swersky, and R. Zemel. Prototypical networks for fewshot learning. In Advances in Neural Information Processing Systems, pages 4077–4087, 2017.
 [34] J. Song, Q. Yu, Y.Z. Song, T. Xiang, and T. M. Hospedales. Deep spatialsemantic attention for finegrained sketchbased image retrieval. In ICCV, pages 5552–5561, 2017.
 [35] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1–9, 2015.
 [36] R. R. Varior, M. Haloi, and G. Wang. Gated siamese convolutional neural network architecture for human reidentification. In ECCV, 2016.
 [37] J. Vía, I. Santamaría, and J. Pérez. Canonical correlation analysis (cca) algorithms for multiple data sets: Application to blind simo equalization. In Signal Processing Conference, 2005 13th European, pages 1–4. IEEE, 2005.
 [38] F. Wang, L. Kang, and Y. Li. Sketchbased 3d shape retrieval using convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1875–1883, 2015.
 [39] H. Wang, Y. Wang, Z. Zhou, X. Ji, Z. Li, D. Gong, J. Zhou, and W. Liu. Cosface: Large margin cosine loss for deep face recognition. arXiv preprint arXiv:1801.09414, 2018.
 [40] J. Wang, Y. Song, T. Leung, C. Rosenberg, J. Wang, J. Philbin, B. Chen, and Y. Wu. Learning finegrained image similarity with deep ranking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1386–1393, 2014.
 [41] L. Wang, Y. Li, and S. Lazebnik. Learning deep structurepreserving imagetext embeddings. In CVPR, 2016.
 [42] Y. Wen, K. Zhang, Z. Li, and Y. Qiao. A discriminative feature learning approach for deep face recognition. In European Conference on Computer Vision, pages 499–515. Springer, 2016.
 [43] S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He. Aggregated residual transformations for deep neural networks. In Computer Vision and Pattern Recognition (CVPR), 2017 IEEE Conference on, pages 5987–5995. IEEE, 2017.
 [44] W. Xie, Y. Peng, and J. Xiao. Crossview feature learning for scalable social image analysis. In AAAI, pages 201–207, 2014.
 [45] Y. Yang, Y. Luo, W. Chen, F. Shen, J. Shao, and H. T. Shen. Zeroshot hashing via transferring supervised knowledge. In Proceedings of the 2016 ACM on Multimedia Conference, pages 1286–1295. ACM, 2016.
 [46] S. K. Yelamarthi, S. K. Reddy, A. Mishra, and A. Mittal. A zeroshot framework for sketch based image retrieval. In European Conference on Computer Vision, pages 316–333. Springer, Cham, 2018.
 [47] Q. Yu, F. Liu, Y.Z. Song, T. Xiang, T. M. Hospedales, and C.C. Loy. Sketch me that shoe. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 799–807, 2016.
 [48] Q. Yu, Y. Yang, F. Liu, Y.Z. Song, T. Xiang, and T. M. Hospedales. Sketchanet: A deep neural network that beats humans. International journal of computer vision, 122(3):411–425, 2017.
 [49] D. Zhang and W.J. Li. Largescale supervised multimodal hashing with semantic correlation maximization. In AAAI, volume 1, page 7, 2014.
 [50] H. Zhang, S. Liu, C. Zhang, W. Ren, R. Wang, and X. Cao. Sketchnet: Sketch classification with web images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1105–1113, 2016.
 [51] J. Zhang, F. Shen, L. Liu, F. Zhu, M. Yu, L. Shao, H. T. Shen, and L. Van Gool. Generative domainmigration hashing for sketchtoimage retrieval. In Proceedings of the European Conference on Computer Vision (ECCV), pages 297–314, 2018.
 [52] Z. Zhang and V. Saligrama. Zeroshot learning via semantic similarity embedding. In Proceedings of the IEEE international conference on computer vision, pages 4166–4174, 2015.
 [53] Z. Zhang and V. Saligrama. Zeroshot learning via joint latent similarity embedding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6034–6042, 2016.
6 Appendix: Theoretical Analysis of EMS Loss
In this section, we (1) give formal definitions of the maximum intraclass distance and the minimum interclass distance, and (2) derive the condition on the margin in the EMS loss that is sufficient and necessary to ensure that the maximum intraclass distance is smaller than the minimum interclass distance, regardless of the number of categories.
6.1 Definition
Since we treat both sketch and photo as instances, we define the merged dataset as:
where and are mappings that map photos/sketches into a feature space, and represent photo, sketch and category respectively. They are described in detail in Sec. 3.1. For convenience, we also define .
Maximum Intraclass Distance and Minimum Interclass Distance
For category , the maximum intraclass distance can be defined by
and the minimum interclass distance:
We formulate our objective, namely that the maximum intraclass distance is smaller than the minimum interclass distance, as the proposition :
(5) 
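As an illustration of the two definitions above, the following minimal sketch computes both quantities on a small labeled set of feature vectors. It assumes plain Euclidean distance and a per-category check; the function names (`max_intraclass_distance`, `min_interclass_distance`, `objective_holds`) and the toy data are hypothetical and are not taken from the paper.

```python
import numpy as np


def pairwise_distances(a, b):
    """Euclidean distance between every row of a and every row of b."""
    diff = a[:, None, :] - b[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


def max_intraclass_distance(features, labels, k):
    """Largest distance between two instances that both belong to category k."""
    members = features[labels == k]
    return pairwise_distances(members, members).max()


def min_interclass_distance(features, labels, k):
    """Smallest distance from an instance of category k to an instance of any other category."""
    members = features[labels == k]
    others = features[labels != k]
    return pairwise_distances(members, others).min()


def objective_holds(features, labels):
    """True if, for every category, the intraclass maximum is below the interclass minimum."""
    return all(
        max_intraclass_distance(features, labels, k)
        < min_interclass_distance(features, labels, k)
        for k in np.unique(labels)
    )


# Toy merged dataset (assumed): photo and sketch features are treated alike.
features = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 0.0], [6.0, 0.0]])
labels = np.array([0, 0, 1, 1])
print(objective_holds(features, labels))
```

When the objective holds, every instance of the query's category is ranked ahead of every instance of the other categories, whichever instance serves as the query.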
Solving the Problem with EMS
Instead of optimizing the distance among instances directly as indicated by Eq. 5, the proposed EMS loss uses prototypes to characterize the distribution of instances in feature space. If this EMS loss is well optimized, instances will be closer to their corresponding prototype than other prototypes in feature space. This relationship can be described as
(6) 
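The prototype relationship described above can be checked numerically. The sketch below tests only the margin-free statement that each instance is closer to its own prototype than to any other prototype. The helper name `closer_to_own_prototype`, and the convention that labels are integer indices into the prototype array, are assumptions made for illustration.

```python
import numpy as np


def closer_to_own_prototype(features, labels, prototypes):
    """Boolean mask: True where an instance is nearer to its own prototype than to all others.

    Assumes at least two prototypes and integer labels indexing into `prototypes`.
    """
    # (N, K) matrix of Euclidean distances from each instance to each prototype.
    dists = np.linalg.norm(features[:, None, :] - prototypes[None, :, :], axis=-1)
    rows = np.arange(len(features))
    own = dists[rows, labels]
    # Mask out the own-prototype column so that the minimum runs over rivals only.
    rival = dists.copy()
    rival[rows, labels] = np.inf
    return own < rival.min(axis=1)
```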
6.2 Finding Boundaries for
The following derivation is based on the assumption that our EMS loss can be well optimized, i.e. Eq. 6 holds , and the assumption that . Now the question is: what is the range of that is sufficient and necessary for ? In the rest of this section, we first calculate the closed form of and then prove that if , then . To this end, we only need to find the lower bound of , with regard to the number of categories: . Next, we prove . Finally, we prove is sufficient and necessary for .
Lemma 1
If , is an n-ball (a ball in n-dimensional space) with center and radius
Proof
If ,
Lemma 2
If , then
Proof
By slightly expanding to , the region becomes , where
So we can conclude that . Thus
(8) 
Now we rewrite (Eq. 7) as
(9) 
Lemma 3
is sufficient and necessary for .
Proof
We can write . Now , which are two n-balls with the same radius and different centers. The maximum intraclass distance is the diameter of each n-ball:
The minimum interclass distance is the distance between two centers minus the diameter:
Letting , we obtain or . We discard the latter solution, since it only makes sense when . So in the binary-class case, is sufficient and necessary for .
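The two-ball geometry used in this proof can be illustrated numerically. The sketch below assumes, purely for illustration, that the well-optimized constraint takes the multiplicative form m·‖x − c_own‖² ≤ ‖x − c_other‖² with m > 1; this form, the symbol m, and the function names are assumptions, and the exact constraint should be read from Eq. 6. Under this assumed form each class region is a ball with center (m·c_own − c_other)/(m − 1) and radius √m·‖c_own − c_other‖/(m − 1).

```python
import numpy as np


def class_ball(c_own, c_other, m):
    """Center and radius of {x : m*||x - c_own||^2 <= ||x - c_other||^2}, for m > 1 (assumed form)."""
    center = (m * c_own - c_other) / (m - 1.0)
    radius = np.sqrt(m) * np.linalg.norm(c_own - c_other) / (m - 1.0)
    return center, radius


def binary_case_distances(c1, c2, m):
    """Maximum intraclass and minimum interclass distance for two prototypes."""
    o1, r1 = class_ball(c1, c2, m)
    o2, r2 = class_ball(c2, c1, m)
    max_intra = 2.0 * r1                               # diameter of one ball
    min_inter = np.linalg.norm(o1 - o2) - r1 - r2      # gap between the two balls
    return max_intra, min_inter


c1, c2 = np.array([0.0, 0.0]), np.array([1.0, 0.0])
for m in (4.0, 16.0):
    max_intra, min_inter = binary_case_distances(c1, c2, m)
    print(m, max_intra, min_inter, max_intra < min_inter)
```

For this assumed form, the two quantities cross at m = 7 + 4√3 ≈ 13.93, so the objective fails at m = 4 and holds at m = 16. This value follows from the assumption above rather than from the source text.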
Lemma 4
is necessary for .
Proof
Consider an extreme condition, where two prototypes are far from the other prototypes. We notice that
Since we place no constraints on the location of prototypes, this condition can always hold, regardless of the value of . When all the remaining prototypes satisfy this condition for both , we have and , which is the same as in the binary case. So becomes necessary to ensure the correctness of , and thus Lemma 4 holds.
Lemma 5
is sufficient for .
Proof
According to (Eq. 9), to prove that proposition (4) holds, we have to show that every distinct pair satisfies . We remove a category , where and , from and form such that . Supposing satisfies Eq. 9, we have
where  
When is not changed and prototypes are not moved,
and
Thus, is satisfied for any pair where , even if we directly adopt when . So we can conclude that is sufficient for . By Lemma 3, , we can conclude that is sufficient for .