Learning Large Euclidean Margin for Sketch-based Image Retrieval

by   Peng Lu, et al.
West Virginia University
FUDAN University

This paper addresses the problem of Sketch-Based Image Retrieval (SBIR), for which bridging the gap between the data representations of sketch images and photo images is considered the key. Previous works mostly focus on learning a feature space that minimizes intra-class distances for both sketches and photos. In contrast, we propose a novel loss function, named Euclidean Margin Softmax (EMS), that minimizes intra-class distances and maximizes inter-class distances simultaneously. It enables us to learn a feature space with high discriminability, leading to highly accurate retrieval. In addition, this loss function is applied to a conditional network architecture, which can incorporate the prior knowledge of whether a sample is a sketch or a photo. We show that the conditional information can be conveniently incorporated into the recently proposed Squeeze and Excitation (SE) module, leading to a conditional SE (CSE) module. Extensive experiments are conducted on two widely used SBIR benchmark datasets. Our approach, although very simple, achieves new state-of-the-art results on both datasets, surpassing existing methods by a large margin.



1 Introduction

Touch-screen devices, such as smartphones and iPads, enable users to draw free-hand sketches conveniently. These sketches are highly iconic, succinct, and abstract representations, and in some scenarios they convey richer and more accurate information than text. Consequently, they have spawned many novel applications. One of the most representative examples is Sketch-Based Image Retrieval (SBIR), which has attracted significant attention from the computer vision community during the past decades [14, 47, 34, 22, 51, 32]. For the SBIR task, learning good representations for both sketches and photos is of vital importance and is considered a challenging problem [47, 48, 34, 28, 29, 38].

We give an illustrative example of the SBIR task in Figure 1. Given a query sketch, the objective is to find all relevant photos that are semantically related to it, e.g., photos that come from the same category. This task appears easy for humans but is very difficult for a machine. The main challenge is the significant gap between the data representations in the two domains: sketches are represented by highly iconic, abstract and sparse lines, whilst photos are composed of dense color pixels with rich texture information.

Figure 1: An illustration of the SBIR task. Photos with green/red border are relevant/irrelevant photos, respectively.

Recently, many works have been proposed to address this problem. A popular approach is constructing a good intermediate representation, i.e., converting photos to edge maps [47, 34, 22, 51] or translating sketches into the photo domain using generative models [51]. Another widely adopted approach is learning a semi-heterogeneous network in an end-to-end manner [32, 22]. These approaches have one thing in common – they all aim to find a feature space in which the gap between sketches and photos is minimized. Therefore, a loss function, e.g., semantic factorization loss [22] and semantic loss [51], that aims to minimize the domain discrepancy and intra-class distance is usually constructed.

In this work, we argue that minimizing only the domain discrepancy and intra-class distance is not sufficient for accurate SBIR. Even if the distance (in a certain feature space) between samples from the same class is small, irrelevant samples may still lie near the query sample, leading to poor retrieval results. Motivated by this intuition, we propose a novel loss function, named Euclidean Margin Softmax (EMS), that minimizes the intra-class distance and maximizes the inter-class distance simultaneously. We show that the EMS loss yields highly accurate retrieval results, which surpass all existing algorithms by a large margin.

Further, to accompany the proposed loss function, we introduce a conditional neural network architecture that can incorporate our prior knowledge about which domain a sample comes from. Specifically, our base network is the ResNeXt model [43] with the Squeeze and Excitation (SE) module [10], since it not only has high representation power but also enables us to encode the conditional information conveniently. In each SE module, the convolutional features are first squeezed into a low-dimensional embedding by an encoder network, and then decoded to generate a channel-wise attention vector, which is applied to the original feature maps. Based on the SE module, we can simply append one binary code, which indicates which domain the sample comes from, to the low-dimensional embedding. Through extensive experiments, we show that this change is simple yet highly effective.

Contributions. We highlight the main contributions of this work. (1) We propose a novel loss function that simultaneously minimizes intra-class distances and maximizes inter-class distances. (2) We propose a novel conditional network architecture that can incorporate additional information about which domain a sample belongs to, which helps to boost the retrieval performance. (3) We obtain new state-of-the-art results on several competitive SBIR tasks. In addition, we show that our model can be directly extended to address the challenging zero-shot SBIR task.

2 Related Work

SBIR. The task of Sketch-Based Image Retrieval (SBIR) aims at retrieving the images that have a similar semantic meaning to the query sketch. A typical solution is to learn a shared embedding space for both sketches and images. Such a common space facilitates ranking the similarity between sketches and images. Previous methods [25, 4, 11, 28] employed hand-crafted features to represent the sketches. Recent deep learning based architectures enable cross-domain learning in an end-to-end manner [38, 47, 48, 34]. To accelerate retrieval in large-scale datasets, hashing based models [22, 32, 51, 46] have also been studied.

Feature embedding. Representation learning has been widely studied in the computer vision community [8, 43, 12, 10, 35, 17], but sketch-based representation learning is relatively less studied [48, 34]. Many deep architectures have been investigated, such as ResNet, ResNeXt, and DenseNet. Among them, ResNeXt [43] incorporates both residual learning and group convolution, and thus it is adopted as the building structure in this paper. The recent SE module [10] makes networks capable of selecting relatively important channels of the feature maps, and thus it is also used here. To effectively embed cross-domain features, several hashing methods [5, 2, 18, 49, 20, 13] and cross-modal embedding methods [1, 37, 21, 44] have also been studied.

Loss functions. Many metric learning based methods [9, 26, 42, 40] proposed learning deep features with loss functions based on the Euclidean distance. To make the learned features more discriminative, other variants of loss functions have been investigated recently, such as the contrastive loss [3, 7], the triplet loss [31], and the L-softmax [24], A-softmax [23] and LMCL [39] losses. In particular, the contrastive and triplet losses aim at increasing the Euclidean distance margin, while the L-softmax, A-softmax, and LMCL losses are designed to expand the angular margin. Remarkably, the latter three losses make simple modifications to the softmax loss and greatly improve the performance on face recognition tasks. Furthermore, the prototypical loss [33] is also a variant of softmax that incorporates the Euclidean distance.

3 Approach

In this section, we first give an overview of the proposed approach, and then introduce the conditional network architecture, the Euclidean Margin Softmax (EMS) loss, and a hashing method for efficient retrieval.

3.1 Overview

Problem Setup. We follow the formulation of the sketch-based image retrieval (SBIR) task in [22, 51]. We are given a set of real photos and a set of sketches. In the supervised setting, the sketch set is split into training and test sets that share the same label set. Given a query sketch from the test sketch set, the goal of our SBIR task is to retrieve the best-matching photos, i.e., photos whose label is the same as that of the query. Note that the same photo set is used as both the training set and the retrieval gallery, following the setting defined in [22].

In the zero-shot SBIR task, we divide the label set into two disjoint subsets, which define a source domain and a target domain, each with its own photos and sketches. The zero-shot SBIR model is trained on the source data and tested on the target domain.

To facilitate the sketch retrieval tasks, we project the photos and sketches into a single shared metric space. We introduce a novel architecture, CSE-ResNeXt-101, as illustrated in Figure 2. The architecture is optimized so that sketches and photos of the same class lie close to each other, while those of different classes lie far apart. By virtue of such an optimization process, the learned metric space has large inter-class and small intra-class distances over the photo and sketch sets. Given a sketch or a photo, our CSE-ResNeXt-101 network extracts its feature vector as the output of the last fully connected layer.
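Once sketches and photos live in one metric space, retrieval reduces to a nearest-neighbor search. The snippet below is a minimal sketch of that step, assuming the features have already been extracted; the function names `euclidean` and `retrieve` and the toy 2-D values are illustrative and not taken from the paper.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

def retrieve(query_feature, gallery_features, top_k):
    """Return indices of the top_k gallery items closest to the query."""
    order = sorted(range(len(gallery_features)),
                   key=lambda i: euclidean(query_feature, gallery_features[i]))
    return order[:top_k]

# Toy 2-D features (made-up values): one sketch query and three gallery photos.
sketch = [0.0, 0.0]
photos = [[3.0, 4.0], [0.1, 0.2], [1.0, 1.0]]
ranking = retrieve(sketch, photos, top_k=2)
```

In practice the gallery features would be computed once and cached, since the same photo set serves every query.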

3.2 Network Architecture

Figure 2: Overview of our CSE-ResNeXt-101 structure. Our network embeds both sketches and photos into a unified metric space. Please refer to the supplementary for the network details.

The key challenge of learning the space for SBIR is how to efficiently preserve the semantic consistency across the sketch and photo domains. In the sketch domain, we have highly iconic and abstract sketches with various levels of deformation, whereas the photos are natural images with color, texture and shape information. To address this challenge and facilitate SBIR, we propose a novel network architecture, Conditional SE ResNeXt-101 (CSE-ResNeXt-101), as shown in Figure 2. It is composed of 33 CSE ResNeXt blocks. Each block integrates the Conditional SE (CSE) module into the ResNeXt block. The CSE module is illustrated in Figure 3.

Rather than using independent sub-networks to separately process the photos and sketches, CSE-ResNeXt-101 directly learns a single sub-network to jointly analyze the photos and sketches. Such a SiameseNet-like network is inspired by the fact that SiameseNet is efficient in learning the embedding space across different domains (e.g., image-text embedding [41], or person re-identification [36]).

Another novelty of our CSE-ResNeXt-101 comes from the newly proposed Conditional SE (CSE) module, which extends the SE module [10]. Networks with the SE module have achieved remarkable performance on the ILSVRC 2017 classification task. The SE module also provides an explicit mechanism to re-weight the importance of channels after each block in the network. Due to the nature of our cross-domain task, we can generalize SE to the CSE module with a very simple change. As shown in Figure 3, our CSE module utilizes an auto-encoder sub-network followed by a sigmoid activation. Within the space learned by the auto-encoder, a binary code is added to indicate whether the input image is a sketch or a photo. The outputs of the sigmoid activation are used as an attention vector over the feature channels. This conditional auto-encoder structure can thus help to capture different characteristics of input images conditioned on which domain they come from.

Figure 3: The structure of conditional SE module.
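The data flow of the CSE module can be sketched in a few lines. The code below is a rough illustration in plain Python, not the authors' implementation: the ReLU in the encoder follows the usual SE design and is an assumption here, as are the encoding of the domain bit (1.0 for sketch, 0.0 for photo), the layer sizes, and the random weights.

```python
import math
import random

def linear(vec, weights, bias):
    """Dense layer: weights is a list of rows, one row per output unit."""
    return [sum(w * v for w, v in zip(row, vec)) + b for row, b in zip(weights, bias)]

def cse_module(feature_maps, domain_bit, w_enc, b_enc, w_dec, b_dec):
    """Re-weight the channels of feature_maps (C x H x W nested lists).

    domain_bit: 1.0 for a sketch, 0.0 for a photo (this encoding is an assumption).
    """
    # Squeeze: global average pooling gives one scalar per channel.
    squeezed = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in feature_maps]
    # Encoder: project to a low-dimensional embedding (ReLU assumed, as in SE).
    embedding = [max(0.0, z) for z in linear(squeezed, w_enc, b_enc)]
    # Conditioning: append the binary domain code to the embedding.
    embedding.append(domain_bit)
    # Decoder + sigmoid: one attention weight in (0, 1) per channel.
    attention = [1.0 / (1.0 + math.exp(-z)) for z in linear(embedding, w_dec, b_dec)]
    # Excitation: scale every channel by its attention weight.
    scaled = [[[a * v for v in row] for row in ch] for a, ch in zip(attention, feature_maps)]
    return scaled, attention

def random_matrix(rows, cols, rng):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
C, R = 4, 2                                   # channels and reduced width (hypothetical sizes)
w_enc, b_enc = random_matrix(R, C, rng), [0.0] * R
w_dec, b_dec = random_matrix(C, R + 1, rng), [0.0] * C   # +1 input for the domain bit
x = [[[float(c + 1)] * 3 for _ in range(3)] for c in range(C)]

_, att_sketch = cse_module(x, 1.0, w_enc, b_enc, w_dec, b_dec)
_, att_photo = cse_module(x, 0.0, w_enc, b_enc, w_dec, b_dec)
```

The only difference from a plain SE block in this sketch is the extra decoder input, so the same feature maps can receive different channel weights depending on the domain.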

3.3 Euclidean Margin Softmax

Softmax loss revisited. Softmax loss is widely used in classification tasks. Given a feature vector $x_i$ with its ground-truth class $y_i$, this loss can be written as

$$L_{s} = -\frac{1}{N}\sum_{i=1}^{N} \log \frac{e^{f_{y_i}}}{\sum_{j=1}^{C} e^{f_j}} \qquad (1)$$

where $f_j$ (i.e., $W_j^{T} x_i + b_j$) denotes the activation of the $j$-th category; in total, we have $C$ categories and $N$ training instances. $W_j$ and $b_j$ are the class weight parameters of the $j$-th category, which are optimized by Eq (1). The class label of a testing instance is computed as $\arg\max_j f_j$. In binary classification, if $f_1 > f_2$, the instance is assigned the class label 1, and vice versa.
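As a small worked example of the standard softmax loss and the arg-max prediction rule, the following sketch computes the mean negative log-likelihood from raw class activations. The function names and the toy activations are illustrative only.

```python
import math

def softmax_loss(activations, labels):
    """Mean negative log-likelihood of the true class under a softmax."""
    total = 0.0
    for f, y in zip(activations, labels):
        shift = max(f)                                    # subtract the max for numerical stability
        log_norm = shift + math.log(sum(math.exp(v - shift) for v in f))
        total += log_norm - f[y]
    return total / len(labels)

def predict(f):
    """Predicted class: index of the largest activation."""
    return max(range(len(f)), key=lambda j: f[j])

# Two toy instances with two classes each (made-up activations).
loss = softmax_loss([[2.0, 0.5], [0.1, 1.2]], [0, 1])
```

With equal activations the loss of a single instance is the logarithm of the number of classes, which is a convenient sanity check.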

EMS loss. The SBIR task has the sketches and photos, which are from two different domains. The loss function should be optimized to learn a shared space, which has very large inter-class distance, and small intra-class distance over the photo and sketch set. To this end, we propose a novel loss function – Euclidean Margin Softmax (EMS) loss. The EMS is a generalization of softmax loss in Eq (1). It is defined by,


We explain the parameters in Eq (2). The feature is extracted by the last layer of the CSE-ResNeXt-101 network. (1) Each category has a center. We treat the centers as parameters and update them dynamically, rather than directly using the average feature as the center. (2) In Eq (2), we employ the Euclidean distance between a feature and a class center to measure the confidence that the feature belongs to that class. (3) The margin constant helps to take into account the different data distributions of each class. In particular, in the binary classification case, a sample is assigned to class 1 if the margin condition holds and to class 2 otherwise.
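The exact form of Eq (2) is not reproduced here, so the following is only a rough sketch of a Euclidean-distance softmax with a margin. It assumes that the margin multiplies the distance to the true-class center, so that a margin of 1 reduces to a plain softmax over negative distances; this form, and the names `ems_loss`, `centers` and `margin`, are assumptions for illustration rather than the paper's definition.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

def ems_loss(features, labels, centers, margin):
    """Softmax over negative distances to class centers, with a margin.

    ASSUMPTION: the margin scales the distance to the true-class center.
    """
    total = 0.0
    for x, y in zip(features, labels):
        logits = [-euclidean(x, c) for c in centers]
        logits[y] *= margin                  # penalize the true-class distance more strongly
        shift = max(logits)                  # numerical stability
        log_norm = shift + math.log(sum(math.exp(v - shift) for v in logits))
        total += log_norm - logits[y]
    return total / len(labels)

# Toy setup (made-up values): two class centers and one sample of class 0.
centers = [[0.0, 0.0], [2.0, 0.0]]
sample = [[0.5, 0.0]]
loss_no_margin = ems_loss(sample, [0], centers, margin=1.0)
loss_with_margin = ems_loss(sample, [0], centers, margin=2.0)
```

Under this assumed form, a sample that is not exactly at its class center incurs a larger loss as the margin grows, which pushes features closer to their own center relative to the others.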

EMS vs. prototypical loss. We discuss the difference between the EMS loss and the prototypical loss [33], which is


The prototypical loss is used in few-shot classification, where only a few training instances are available for each class. Thus each class prototype is directly computed as the mean of the training instances of that class; in contrast, our EMS loss is a generalized softmax loss, which optimizes the class centers from the training data. Furthermore, the prototypical loss optimizes over image instances only, while our EMS aims at optimizing the training instances of different classes across different domains. Therefore a margin constant is introduced to balance enlarging the inter-class distance against shrinking the intra-class distance.
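To make the contrast concrete, the snippet below shows how prototypes are obtained as plain class means, which is the averaging step described for the prototypical loss. The helper name `class_means` and the toy data are illustrative; learned centers, as used by EMS, would instead be free parameters updated by back-propagation.

```python
def class_means(features, labels, num_classes):
    """Prototype of each class = mean of its feature vectors.

    Every class is assumed to have at least one instance.
    """
    dim = len(features[0])
    sums = [[0.0] * dim for _ in range(num_classes)]
    counts = [0] * num_classes
    for x, y in zip(features, labels):
        counts[y] += 1
        for d in range(dim):
            sums[y][d] += x[d]
    return [[s / counts[k] for s in sums[k]] for k in range(num_classes)]

# Three toy features (made-up values): two of class 0, one of class 1.
prototypes = class_means([[0.0, 0.0], [2.0, 2.0], [4.0, 0.0]], [0, 0, 1], 2)
```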

EMS vs. angular margin loss. We further discuss the difference between the EMS loss and angular margin losses. Besides the Euclidean distance, angular distance based loss functions, such as A-Softmax [23] and LMCL [39], are also employed for learning a shared space in many tasks, e.g., face recognition. These angular margin losses aim at learning a discriminative distribution on a hypersphere. As shown in Table 1, they define different similarity functions for the instances of different classes. Note that $\psi(\theta)$ is an artificial piece-wise function that serves as an extension of $\cos(m\theta)$, in order to overcome its non-monotonicity. Nevertheless, it is non-trivial to define $\psi(\theta)$ in the A-softmax function, as stated in [39]. The scalar $s$ in LMCL is used to expand the range of the similarity function; otherwise, the output of the softmax function would be close to the uniform distribution over all categories.

Table 1: The similarity functions of A-Softmax and LMCL.
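For reference, the similarity functions of the two angular margin losses, as they are usually stated in the original A-Softmax [23] and LMCL [39] papers, can be written as follows. The notation ($\theta_j$ for the angle between the feature $x$ and the weight of class $j$, $y$ for the ground-truth class, $m$ for the margin, $s$ for the scale) is chosen here for illustration and may differ from the notation of Table 1.

```latex
\begin{align*}
\text{A-Softmax:}\quad
  & \text{true class: } \|x\|\,\psi(\theta_{y}), \qquad
    \text{other classes: } \|x\|\cos\theta_{j},\\
  & \psi(\theta) = (-1)^{k}\cos(m\theta) - 2k, \quad
    \theta \in \left[\tfrac{k\pi}{m}, \tfrac{(k+1)\pi}{m}\right],\ k \in \{0,\dots,m-1\},\\[4pt]
\text{LMCL:}\quad
  & \text{true class: } s\,(\cos\theta_{y} - m), \qquad
    \text{other classes: } s\cos\theta_{j}.
\end{align*}
```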

Theoretical analysis of EMS. The property of EMS is determined by the value of the margin. Intuitively, a larger margin moves the decision boundaries closer to the corresponding prototypes and makes the distribution of features more compact. In this case, the metric space can be highly discriminative. However, a large margin introduces instability into training, due to the intrinsic variance among samples in each category. Thus it is necessary to find the minimum margin that ensures that, for every sample in the metric space, the maximum intra-class distance is smaller than the minimum inter-class distance. We prove that a lower bound on the margin is required in all cases. In the binary case, this bound is the minimum value of the margin. As the number of categories grows, the minimum value decreases monotonically, so the binary-case bound is sufficient to guarantee perfect discrimination. It is also necessary in multi-class cases, since, if two prototypes are far from the others, their relationship resembles the binary case. The details of the proof can be found in the Appendix.
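The separation criterion used in this analysis, maximum intra-class distance smaller than minimum inter-class distance, can be checked empirically on a set of features. The following is a minimal sketch of such a check; the function name `separation` and the toy points are illustrative and not part of the paper.

```python
import math
from itertools import combinations

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

def separation(features, labels):
    """Return (max intra-class distance, min inter-class distance) over all pairs."""
    max_intra, min_inter = 0.0, math.inf
    for (x, yx), (z, yz) in combinations(zip(features, labels), 2):
        d = euclidean(x, z)
        if yx == yz:
            max_intra = max(max_intra, d)
        else:
            min_inter = min(min_inter, d)
    return max_intra, min_inter

# Toy 2-D features (made-up values): two well-separated classes.
feats = [[0.0, 0.0], [1.0, 0.0], [5.0, 0.0], [6.0, 0.0]]
max_intra, min_inter = separation(feats, [0, 0, 1, 1])
well_separated = max_intra < min_inter
```

When the condition holds, every sample's nearest neighbors from other classes are farther away than any sample of its own class, which is the property that makes retrieval by distance reliable.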

Zero-shot retrieval tasks. In addition to the classical categorical SBIR tasks, our framework can be easily extended to the zero-shot SBIR task. Specifically, in the zero-shot SBIR task, we have source data and target data, each consisting of photos and sketches. Our model is trained on the source data and directly tested on the target data.

3.4 Dimension Reduction Hashing

To accelerate retrieval, we propose a simple hashing scheme to encode the features generated by the CSE-ResNeXt-101 network from a sketch or a photo. Our hashing scheme projects the features into a low-dimensional binary space. This is a post-processing step after training the CNN. The hashing scheme is implemented as an auto-encoder, whose encoder performs the dimension-reduction mapping and whose decoder performs the inverse mapping. The objective function is composed of,


Here, the reconstruction loss maintains the structure among the centers in the low-dimensional space, and the scatter loss prevents the prototypes from being close to each other in that space. The number of categories also enters the objective. Our hashing scheme can encode all photos and the query sketch into a low-dimensional binary space by applying a binarization function after the encoder. The Hamming distance can thus be used for retrieval.
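The retrieval side of the hashing scheme can be illustrated with a short sketch. It assumes that binarization is a simple threshold at zero applied to the encoder output, which is an assumption since the binarization function is not spelled out here; the names `binarize`, `hamming` and `rank_gallery` and the toy codes are likewise illustrative.

```python
def binarize(code):
    """Threshold a real-valued code at zero (assumed binarization rule)."""
    return [1 if v > 0 else 0 for v in code]

def hamming(a, b):
    """Number of positions in which two binary codes differ."""
    return sum(p != q for p, q in zip(a, b))

def rank_gallery(query_code, gallery_codes):
    """Gallery indices sorted by increasing Hamming distance to the query."""
    return sorted(range(len(gallery_codes)),
                  key=lambda i: hamming(query_code, gallery_codes[i]))

# Toy encoder outputs (made-up values) for one query sketch and three photos.
query = binarize([0.3, -1.2, 2.0, -0.1])
gallery = [binarize(v) for v in ([-0.5, 0.4, -1.0, 0.9],
                                 [0.2, -0.3, 1.5, -2.0],
                                 [0.7, 0.1, 0.6, -0.4])]
order = rank_gallery(query, gallery)
```

Because `sorted` is stable, gallery items at the same Hamming distance keep their original order.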

| Methods | TU-Berlin Extension | Sketchy Extension |
| --- | --- | --- |
| HOG [4] | 0.091 | 0.115 |
| GF-HOG [11] | 0.119 | 0.157 |
| SHELO [28] | 0.123 | 0.182 |
| LKS [29] | 0.157 | 0.190 |
| Siamese CNN [27] | 0.322 | 0.481 |
| SaN [48] | 0.154 | 0.208 |
| GN Triplet [30] | 0.187 | 0.529 |
| 3D Shape [38] | 0.072 | 0.084 |
| Siamese-AlexNet [22] | 0.367 | 0.518 |
| Triplet-AlexNet [22] | 0.448 | 0.573 |
| CSE-ResNeXt-101 | 0.820 | 0.958 |

Table 2: MAP results of CSBIR on TU-Berlin Extension and Sketchy Extension datasets.

4 Experiments

In this section, we evaluate our method on two tasks: Category-level SBIR (CSBIR) and Zero-Shot Category-level SBIR (ZS-CSBIR).

4.1 Datasets and Settings

Category-level SBIR. Our model is evaluated on two large-scale sketch-photo datasets: TU-Berlin [6] Extension and Sketchy [30] Extension. The former includes 20,000 sketches uniformly distributed among 250 categories. Additionally, 204,489 natural images provided in [50] are utilized as the photo gallery. The Sketchy database consists of 75,471 hand-drawn sketches and 12,500 corresponding photos from 125 categories. It was extended by another 60,502 photos for CSBIR task in [22]. Following the settings in [22, 51], 10/50 sketches from each category are picked as the query set for TU-Berlin/Sketchy dataset, and the rest are used for training. All gallery photos are used in both training and testing phase.

Zero-Shot Category-level SBIR. We also compare the performance of our model with previous methods on the Zero-Shot Category-level SBIR task, where we still follow the setting of category-level retrieval but split the categories into source and target domains as stated in Sec. 3.1: we randomly select 30/25 categories as the target domain for the TU-Berlin/Sketchy Extension datasets, respectively. As in [32], we only select categories that contain more than 400 images to form the target domain. We train our network on the source domain with the same process as in standard Category-level SBIR and directly test it on the target domain.
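The category split described above can be sketched as follows. The helper `zero_shot_split`, the seed, and the toy category counts are hypothetical; only the rule itself (random target categories drawn from those with more than 400 images) comes from the text.

```python
import random

def zero_shot_split(images_per_category, num_target, min_images=400, seed=0):
    """Randomly pick target categories among those with more than min_images images."""
    eligible = sorted(c for c, n in images_per_category.items() if n > min_images)
    rng = random.Random(seed)                 # fixed seed for a reproducible split
    target = set(rng.sample(eligible, num_target))
    source = set(images_per_category) - target
    return source, target

# Toy image counts (made-up values).
counts = {"cat": 500, "dog": 450, "cup": 100, "car": 800}
source, target = zero_shot_split(counts, num_target=2)
```

Categories that are too small to qualify as targets stay in the source domain, so the two sets remain disjoint and together cover all categories.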


Our method is implemented in PyTorch on a single TitanX GPU. We use the Adam optimizer [15] with parameters . The learning rate is set to and linearly decayed to during training. We construct the batches with the size , and train the networks for 15 epochs. The models and code will be released upon acceptance.

4.2 Results on Category-level SBIR

| Category | Methods | TU-Berlin 32 bits | TU-Berlin 64 bits | TU-Berlin 128 bits | Sketchy 32 bits | Sketchy 64 bits | Sketchy 128 bits |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Cross-Modality Hashing Methods (binary codes) | CMFH [5] | 0.149 | 0.202 | 0.180 | 0.320 | 0.490 | 0.190 |
| | CMSSH [2] | 0.121 | 0.183 | 0.175 | 0.206 | 0.211 | 0.211 |
| | SCM-Seq [49] | 0.211 | 0.276 | 0.332 | 0.306 | 0.417 | 0.671 |
| | SCM-Orth [49] | 0.217 | 0.301 | 0.263 | 0.346 | 0.536 | 0.616 |
| | CVH [18] | 0.214 | 0.294 | 0.318 | 0.325 | 0.525 | 0.624 |
| | SePH [20] | 0.198 | 0.270 | 0.282 | 0.534 | 0.607 | 0.640 |
| | DCMH [13] | 0.274 | 0.382 | 0.425 | 0.560 | 0.622 | 0.656 |
| | DSH [22] | 0.358 | 0.521 | 0.570 | 0.653 | 0.711 | 0.783 |
| | GDH [51] | 0.563 | 0.690 | 0.651 | 0.724 | 0.811 | 0.784 |
| Cross-View Feature Learning Methods (real-value vectors) | CCA [37] | 0.276 | 0.366 | 0.365 | 0.361 | 0.555 | 0.705 |
| | XQDA [19] | 0.191 | 0.197 | 0.201 | 0.460 | 0.557 | 0.550 |
| | PLSR [21] | 0.141 (4096-d) | | | 0.462 (4096-d) | | |
| | CVFL [44] | 0.289 (4096-d) | | | 0.675 (4096-d) | | |
| Ours | CSE-ResNeXt-101 | 0.791 | 0.817 | 0.819 | 0.949 | 0.952 | 0.958 |

Table 3: MAP results of Hashing CSBIR. Our model is compared against the previous SBIR methods on TU-Berlin Extension and Sketchy Extension. 32, 64, and 128 represent the lengths of the generated hashing codes.

Competitors. There are several categories of competitors, as listed in Table 2: (1) hand-crafted feature based models: LKS [29], SHELO [28], HOG [4] and GF-HOG [11]; (2) deep learning based models: 3D Shape [38], Sketch-a-Net (SaN) [48], GN Triplet [30], Siamese CNN [27], Siamese-AlexNet and Triplet-AlexNet [22]. The Mean Average Precision (MAP) is reported.
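For readers unfamiliar with the metric, MAP can be computed from ranked relevance lists as in the following minimal sketch. It uses the common definition of average precision (mean of the precision values at the ranks of the relevant items); the function names and toy rankings are illustrative, and the exact evaluation protocol of the benchmarks may differ in details.

```python
def average_precision(ranked_relevance):
    """AP for one query: ranked_relevance[i] is truthy if the i-th result is relevant."""
    hits, precision_sum = 0, 0.0
    for rank, rel in enumerate(ranked_relevance, start=1):
        if rel:
            hits += 1
            precision_sum += hits / rank      # precision at this relevant rank
    return precision_sum / hits if hits else 0.0

def mean_average_precision(all_rankings):
    """MAP: mean of the per-query average precisions."""
    return sum(average_precision(r) for r in all_rankings) / len(all_rankings)

# Two toy queries (made-up relevance judgments).
map_score = mean_average_precision([[1, 0, 1], [0, 1]])
```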

Results. The results are summarized in Table 2. Remarkably, our model outperforms all the competitors by a very large margin. It achieves a MAP improvement of 0.372/0.385 on TU-Berlin/Sketchy Extension over the state-of-the-art real-valued method, Triplet-AlexNet. This demonstrates the efficacy of our model. Note that the improved performance is due to the novel structure and the EMS loss function used here. We give further analysis in the ablation study.

4.3 Hashing Results on Category-level SBIR

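| Methods | Dimension | TU-Berlin Extension MAP | TU-Berlin Extension Precision@100 |
| --- | --- | --- | --- |
| Siamese CNN [27] | 64 | 0.122 | 0.153 |
| SaN [48] | 512 | 0.096 | 0.112 |
| GN Triplet [30] | 1024 | 0.189 | 0.241 |
| 3D Shape [38] | 64 | 0.057 | 0.063 |
| DSH [22] | 64 | 0.122 | 0.198 |
| SSE [52] | 100 | 0.096 | 0.133 |
| JLSE [53] | 100 | 0.107 | 0.165 |
| SAE [16] | 300 | 0.161 | 0.210 |
| ZSH [45] | 64 | 0.139 | 0.174 |
| ZSIH [32] | 64 | 0.220 | 0.291 |
| Ours | 512 | 0.259 | 0.369 |
| Ours | 64 | 0.165 | 0.252 |
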
Table 4: ZS-CSBIR results on TU-Berlin Extension and Sketchy Extension. : denotes the length of hashing codes; the rest are the real value features.

Competitors. (1) Our hashing model is compared against 8 cross-modal hashing methods: Collective Matrix Factorization Hashing (CMFH) [5], Cross-Modal Semi-Supervised Hashing (CMSSH) [2], Cross-View Hashing (CVH) [18], Semantic Correlation Maximization (SCM-Seq and SCM-Orth) [49], Semantics-Preserving Hashing (SePH) [20], Deep Cross-Modality Hashing (DCMH) [13], Deep Sketch Hash (DSH) [22] and Generative Domain-Migration Hashing (GDH) [51]. (2) We also compare against 4 cross-view feature embedding methods: CCA [37], PLSR [21], XQDA [19] and CVFL [44]. We again report the MAP.

Training cost. Our hashing scheme is taken as a post-processing step, in order to make our framework comparable to previous hashing based SBIR models. With the computed features by CSE-ResNeXt-101, our hashing scheme is trained for 10000 steps; and the whole process can be finished within 1 minute on our computer.

Results. We summarize our results in Table 3. Our method achieves the best performance among all hashing-based methods and cross-modal learning methods. Notably, our model improves the MAP by more than 0.12 in all conditions compared with GDH [51], which is the state-of-the-art method on this task. This further demonstrates the effectiveness of our proposed framework on the SBIR tasks.

Our model can be trained efficiently. We only need to train a single network, CSE-ResNeXt-101, in an end-to-end manner, and edge maps are not used. In contrast, the training process of the other competitors is more complex. For example, GDH [51] utilizes cycle-consistent GANs to transfer sketches into photos. DSH [22] requires pre-computed edge maps to bridge the gap between sketches and photos, and uses semantic representations (word vectors) as a prior on the inter-relationship among categories. These methods are trained with two optimization steps in each iteration: one for the binary codes and the other for the network parameters.

4.4 Results of Zero-Shot CSBIR

Competitors. We compare our method on the ZS-CSBIR task with 5 SBIR methods: 3D Shape [38], Sketch-a-Net (SaN) [48], GN Triplet [30], Siamese CNN [27] and DSH [22]; and with 5 zero-shot methods: SSE [52], JLSE [53], SAE [16], ZSH [45] and ZSIH [32].

Results. We report the results in Table 4. Note that even when our model is applied directly to the ZS-CSBIR task, it still beats all the other ZS-CSBIR competitors. This result validates the ability of the proposed CSE-ResNeXt-101 to generalize to unseen categories. We also note that, under the zero-shot setting, our hashing result shows a larger drop relative to the non-hashing result. This makes sense since our hashing method is not optimized for ZS-CSBIR. In particular, our hashing scheme only learns to discriminatively binarize the prototypes of classes in the source domain, while the prototypes of unseen classes might be indiscernible in the form of binary codes.
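Table 4 also reports Precision@100. As a minimal sketch of that metric, the function below computes precision at a cutoff k from a ranked relevance list. It divides by k even when fewer than k results are returned, which is one common convention and an assumption here; the function name and toy lists are illustrative.

```python
def precision_at_k(ranked_relevance, k):
    """Fraction of the top-k results that are relevant (divides by k)."""
    top = ranked_relevance[:k]
    return sum(1 for r in top if r) / k

# Toy ranked list (made-up relevance judgments).
p_at_2 = precision_at_k([1, 0, 1, 1], 2)
p_at_4 = precision_at_k([1, 0, 1, 1], 4)
```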

4.5 Ablation Study

We conduct an extensive ablation study to evaluate each component of our model.

Figure 4: Illustration of bifurcated ResNeXt-101. This network is an example of semi-heterogeneous network aggregating at stage 4.

Network structure. In this part, we compare the performance of different network structures for the embedding. Sketches and photos can contain similar semantic information but are very different in appearance: sketches use strokes to roughly express the contours of the main objects, while photos use colorful patches to express objects and backgrounds. Deep networks are thought to learn low-level features in early layers and high-level features in later layers. Intuitively, we can use a semi-heterogeneous network as in [22, 32] to process the two domains with two CNN branches that merge at a later stage. A natural question is how many layers should be used to embed the low-level features of sketches and photos separately. We conduct an ablation study to answer this question.

As shown in Figure 4, we investigate 6 possible merging positions at which the sketch and photo branches, which process the sketches and photos separately (i.e., without the SiameseNet style), can be aggregated. The results are reported in Table 5. We consider both softmax and EMS for optimizing the different variants of our network. From these results, we conclude that the variants with earlier aggregation perform better, possibly because more parameters are shared. These results also reveal that deep CNNs are capable of dealing with two domains of images simultaneously. This validates that the proposed network is a reasonable choice for the SBIR task.

| Merging position | softmax | EMS () |
| --- | --- | --- |
| 5 | 0.670 | 0.678 |
| 4 | 0.705 | 0.745 |
| 3 | 0.727 | 0.768 |
| 2 | 0.730 | 0.764 |
| 1 | 0.738 | 0.775 |
| 0 | 0.737 | 0.780 |

Table 5: Performance of semi-heterogeneous networks aggregating at different stages on TU-Berlin Extension.

Network modules. We compare various types of CNNs as well as their variants with Squeeze-and-Excitation (SE) modules [10] and the newly proposed conditional SE (CSE) module. Intrinsically, the SE and CSE modules enhance the ability to learn different attention over feature channels, and thus provide our networks with a dynamic and implicit feature selection mechanism. The MAP results are shown in Table 6. All networks are trained and tested in the same setting. We find that (1) deeper networks perform better, due to their larger capacity for learning cross-modal images; (2) the SE module enhances the ability of CNNs to process inputs from multiple domains. Moreover, our CSE module is better than the SE module on the SBIR task, since the auxiliary binary code helps SE learn to select the important sketch/photo feature channels.

| Network | TU-Berlin Extension | Sketchy Extension |
| --- | --- | --- |
| AlexNet | 0.528 | 0.863 |
| VGG-16 | 0.676 | 0.930 |
| DenseNet-121 | 0.768 | 0.942 |
| ResNet-101 | 0.772 | 0.945 |
| SE-ResNet-101 | 0.790 | 0.947 |
| CSE-ResNet-101 | 0.801 | 0.949 |
| ResNeXt-101 | 0.780 | 0.949 |
| SE-ResNeXt-101 | 0.807 | 0.954 |
| CSE-ResNeXt-101 | 0.820 | 0.958 |

Table 6: Performance of different network variants, all of which are pre-trained on ImageNet. We use the EMS loss with a fixed margin to train these networks.

Analysis of components in hashing. We train an auto-encoder for the dimension-reduction embedding used for hashing, and our objective function consists of two parts: the reconstruction loss and the scatter loss. We also try to combine a quantization loss with these two losses to encourage intra-class compactness in the low-dimensional space. The results reported in Table 7 show that the quantization loss does not improve the performance, while its optimization process is quite time consuming.

| Objective | TU-Berlin 32 | TU-Berlin 64 | TU-Berlin 128 | Sketchy 32 | Sketchy 64 | Sketchy 128 |
| --- | --- | --- | --- | --- | --- | --- |
| s | 0.799 | 0.810 | 0.823 | 0.947 | 0.953 | 0.958 |
| r+q | 0.097 | 0.014 | 0.008 | 0.054 | 0.079 | 0.024 |
| r+s | 0.791 | 0.817 | 0.819 | 0.949 | 0.952 | 0.958 |
| q+s | 0.795 | 0.805 | 0.824 | 0.945 | 0.952 | 0.957 |
| r+q+s | 0.791 | 0.814 | 0.820 | 0.949 | 0.954 | 0.957 |

Table 7: MAP of binary codes produced by encoders trained with objective functions of different composition. r, q and s represent the reconstruction loss, quantization loss, and scatter loss, respectively. We use the CSE-ResNeXt-101 structure.

Different loss functions. Besides the model architecture, the objective function is also a crucial component in learning discriminative features. We compare our EMS loss with (i) two angular margin losses, A-Softmax [23] and LMCL [39], and the original softmax loss; (ii) the squared Euclidean Margin Softmax loss (squared EMS), which uses the squared Euclidean distance instead of the Euclidean distance in EMS; and (iii) the prototypical loss [33]. We also compare the performance of our model using different margins. We use the same CSE-ResNeXt-101. The MAP on the two datasets is reported in Table 8. The results again show the advantage of our EMS loss over the other types of losses.

Figure 5: Distribution of minimum inter-class distance. We use to represent angular distance.
| Loss | TU-Berlin Extension | Sketchy Extension |
| --- | --- | --- |
| prototypical loss [33] | 0.403 | 0.858 |
| prototypical loss [33] | 0.405 | 0.854 |
| squared EMS () | 0.799 | 0.957 |
| Softmax | 0.747 | 0.932 |
| A-Softmax [23] () | 0.776 | 0.944 |
| LMCL [39] () | 0.828 | 0.954 |
| EMS () | 0.408 | 0.869 |
| EMS () | 0.812 | 0.950 |
| EMS () | 0.820 | 0.958 |
| EMS () | 0.828 | 0.956 |
| EMS () | 0.823 | 0.958 |
| EMS () | 0.810 | 0.959 |
| EMS () | 0.798 | 0.955 |

Table 8: Performance of models with different losses. means that we take the class centers as the parameters, and update by back-propagation.

Euclidean margin vs. angular margin. Our EMS loss achieves better performance than both A-Softmax and LMCL in Table 8. Specifically, LMCL achieves performance comparable to EMS, which shows that both angular distance and Euclidean distance can be used to learn discriminative deep embeddings. But EMS has only one hyper-parameter, which can be easily cross-validated (we suggest in most cases), while in LMCL one needs to decide the scale and the margin, which are highly dependent on the number of categories in the dataset. The performance of A-Softmax is slightly worse than that of LMCL and EMS, since the prototypes of difficult categories will have small angular distances if optimized by the A-Softmax loss, as explained in [39].

Figure 5 visualizes the distribution of the minimum inter-class distance for instances of each category. On both datasets, there is a noticeable peak at a small value in the angular-distance histograms corresponding to the A-Softmax loss. This peak indicates a small inter-class distance among some classes, so some photos in one class would be wrongly retrieved by sketches in another class. In contrast, this phenomenon does not appear for the LMCL and EMS losses, as shown in Figure 5.

EMS vs. prototypical loss. The results in Table 8 show that the EMS loss with a margin substantially outperforms the prototypical loss (for example, 0.828 vs. 0.405 MAP on TU-Berlin Extension). The EMS setting that scores 0.408 / 0.869 in Table 8 performs almost the same as the prototypical loss [33] (0.403 / 0.858 and 0.405 / 0.854), which is expected given their similar formulation. In addition, we notice that the two ways of updating the class centers in the prototypical loss yield almost the same results.

The effects of different margins. The performance of our model grows as the margin increases, because a larger margin forces the network to learn more discriminative representations. But when the margin is too large, the performance stops rising and even begins to fall. As revealed in Figure 6: (1) at the smaller margin, the intra-class distances of some instances are greater than the minimum inter-class distance, which leads to poor retrieval performance; (2) at the larger margins, the minimum inter-class distances are generally greater than the intra-class distances, which explains the better performance in these cases. However, the standard deviation within some categories does not decrease with a larger margin due to intrinsic variance, which explains why the performance stops rising or begins to fall when the margin is large.

Figure 6: Visualization of inter-class and intra-class distance on Sketchy Extension dataset. The blue lines denote average minimum inter-class distance. The orange bars represent the distribution of intra-class distance of each category.

5 Conclusion

In this paper, we have introduced two innovations: a novel loss function and a conditional network architecture. The proposed Euclidean Margin Softmax (EMS) loss enables us to learn highly discriminative features, which facilitates highly accurate SBIR, while the Conditional Squeeze and Excitation (CSE) block allows us to explicitly incorporate the domain information of each sample. Both the loss and the architecture are intuitive and simple to implement. On two popular SBIR benchmark datasets, the proposed model has achieved new state-of-the-art results.


References

  • [1] G. Andrew, R. Arora, J. Bilmes, and K. Livescu. Deep canonical correlation analysis. In International Conference on Machine Learning, pages 1247–1255, 2013.
  • [2] M. M. Bronstein, A. M. Bronstein, F. Michel, and N. Paragios. Data fusion through cross-modality metric learning using similarity-sensitive hashing. In Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on, pages 3594–3601. IEEE, 2010.
  • [3] S. Chopra, R. Hadsell, and Y. LeCun. Learning a similarity metric discriminatively, with application to face verification. In Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on, volume 1, pages 539–546. IEEE, 2005.
  • [4] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on, volume 1, pages 886–893. IEEE, 2005.
  • [5] G. Ding, Y. Guo, and J. Zhou. Collective matrix factorization hashing for multimodal data. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2075–2082, 2014.
  • [6] M. Eitz, J. Hays, and M. Alexa. How do humans sketch objects? ACM Trans. Graph., 31(4):44–1, 2012.
  • [7] R. Hadsell, S. Chopra, and Y. LeCun. Dimensionality reduction by learning an invariant mapping. In Computer Vision and Pattern Recognition, 2006 IEEE Computer Society Conference on, volume 2, pages 1735–1742. IEEE, 2006.
  • [8] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
  • [9] J. Hu, J. Lu, and Y.-P. Tan. Discriminative deep metric learning for face verification in the wild. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1875–1882, 2014.
  • [10] J. Hu, L. Shen, and G. Sun. Squeeze-and-excitation networks. arXiv preprint arXiv:1709.01507, 7, 2017.
  • [11] R. Hu, M. Barnard, and J. Collomosse. Gradient field descriptor for sketch based retrieval and localization. In Image Processing (ICIP), 2010 17th IEEE International Conference on, pages 1025–1028. IEEE, 2010.
  • [12] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger. Densely connected convolutional networks. In CVPR, volume 1, page 3, 2017.
  • [13] Q.-Y. Jiang and W.-J. Li. Deep cross-modal hashing. CoRR, 2016.
  • [14] T. Kato, T. Kurita, N. Otsu, and K. Hirata. A sketch retrieval method for full color image database-query by visual example. In Pattern Recognition, 1992. Vol. I. Conference A: Computer Vision and Applications, Proceedings., 11th IAPR International Conference on, pages 530–533. IEEE, 1992.
  • [15] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • [16] E. Kodirov, T. Xiang, and S. Gong. Semantic autoencoder for zero-shot learning. arXiv preprint arXiv:1704.08345, 2017.
  • [17] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.
  • [18] S. Kumar and R. Udupa. Learning hash functions for cross-view similarity search. In IJCAI proceedings-international joint conference on artificial intelligence, volume 22, page 1360, 2011.
  • [19] S. Liao, Y. Hu, X. Zhu, and S. Z. Li. Person re-identification by local maximal occurrence representation and metric learning. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2197–2206, 2015.
  • [20] Z. Lin, G. Ding, M. Hu, and J. Wang. Semantics-preserving hashing for cross-view retrieval. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3864–3872, 2015.
  • [21] H. Liu, Z. Ma, J. Han, Z. Chen, and Z. Zheng. Regularized partial least squares for multi-label learning. International Journal of Machine Learning and Cybernetics, 9(2):335–346, 2018.
  • [22] L. Liu, F. Shen, Y. Shen, X. Liu, and L. Shao. Deep sketch hashing: Fast free-hand sketch-based image retrieval. In Proc. CVPR, pages 2862–2871, 2017.
  • [23] W. Liu, Y. Wen, Z. Yu, M. Li, B. Raj, and L. Song. Sphereface: Deep hypersphere embedding for face recognition. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), volume 1, page 1, 2017.
  • [24] W. Liu, Y. Wen, Z. Yu, and M. Yang. Large-margin softmax loss for convolutional neural networks. In ICML, pages 507–516, 2016.
  • [25] D. G. Lowe. Object recognition from local scale-invariant features. In Computer vision, 1999. The proceedings of the seventh IEEE international conference on, volume 2, pages 1150–1157. Ieee, 1999.
  • [26] J. Lu, G. Wang, W. Deng, P. Moulin, and J. Zhou. Multi-manifold deep metric learning for image set classification. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1137–1145, 2015.
  • [27] Y. Qi, Y.-Z. Song, H. Zhang, and J. Liu. Sketch-based image retrieval via siamese convolutional neural network. In Image Processing (ICIP), 2016 IEEE International Conference on, pages 2460–2464. IEEE, 2016.
  • [28] J. M. Saavedra. Sketch based image retrieval using a soft computation of the histogram of edge local orientations (s-helo). In Image Processing (ICIP), 2014 IEEE International Conference on, pages 2998–3002. IEEE, 2014.
  • [29] J. M. Saavedra, J. M. Barrios, and S. Orand. Sketch based image retrieval using learned keyshapes (lks). In BMVC, volume 1, page 7, 2015.
  • [30] P. Sangkloy, N. Burnell, C. Ham, and J. Hays. The sketchy database: learning to retrieve badly drawn bunnies. ACM Transactions on Graphics (TOG), 35(4):119, 2016.
  • [31] F. Schroff, D. Kalenichenko, and J. Philbin. Facenet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 815–823, 2015.
  • [32] Y. Shen, L. Liu, F. Shen, and L. Shao. Zero-shot sketch-image hashing. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3598–3607, 2018.
  • [33] J. Snell, K. Swersky, and R. Zemel. Prototypical networks for few-shot learning. In Advances in Neural Information Processing Systems, pages 4077–4087, 2017.
  • [34] J. Song, Q. Yu, Y.-Z. Song, T. Xiang, and T. M. Hospedales. Deep spatial-semantic attention for fine-grained sketch-based image retrieval. In ICCV, pages 5552–5561, 2017.
  • [35] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1–9, 2015.
  • [36] R. R. Varior, M. Haloi, and G. Wang. Gated siamese convolutional neural network architecture for human re-identification. In ECCV, 2016.
  • [37] J. Vía, I. Santamaría, and J. Pérez. Canonical correlation analysis (cca) algorithms for multiple data sets: Application to blind simo equalization. In Signal Processing Conference, 2005 13th European, pages 1–4. IEEE, 2005.
  • [38] F. Wang, L. Kang, and Y. Li. Sketch-based 3d shape retrieval using convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1875–1883, 2015.
  • [39] H. Wang, Y. Wang, Z. Zhou, X. Ji, Z. Li, D. Gong, J. Zhou, and W. Liu. Cosface: Large margin cosine loss for deep face recognition. arXiv preprint arXiv:1801.09414, 2018.
  • [40] J. Wang, Y. Song, T. Leung, C. Rosenberg, J. Wang, J. Philbin, B. Chen, and Y. Wu. Learning fine-grained image similarity with deep ranking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1386–1393, 2014.
  • [41] L. Wang, Y. Li, and S. Lazebnik. Learning deep structure-preserving image-text embeddings. In CVPR, 2016.
  • [42] Y. Wen, K. Zhang, Z. Li, and Y. Qiao. A discriminative feature learning approach for deep face recognition. In European Conference on Computer Vision, pages 499–515. Springer, 2016.
  • [43] S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He. Aggregated residual transformations for deep neural networks. In Computer Vision and Pattern Recognition (CVPR), 2017 IEEE Conference on, pages 5987–5995. IEEE, 2017.
  • [44] W. Xie, Y. Peng, and J. Xiao. Cross-view feature learning for scalable social image analysis. In AAAI, pages 201–207, 2014.
  • [45] Y. Yang, Y. Luo, W. Chen, F. Shen, J. Shao, and H. T. Shen. Zero-shot hashing via transferring supervised knowledge. In Proceedings of the 2016 ACM on Multimedia Conference, pages 1286–1295. ACM, 2016.
  • [46] S. K. Yelamarthi, S. K. Reddy, A. Mishra, and A. Mittal. A zero-shot framework for sketch based image retrieval. In European Conference on Computer Vision, pages 316–333. Springer, Cham, 2018.
  • [47] Q. Yu, F. Liu, Y.-Z. Song, T. Xiang, T. M. Hospedales, and C.-C. Loy. Sketch me that shoe. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 799–807, 2016.
  • [48] Q. Yu, Y. Yang, F. Liu, Y.-Z. Song, T. Xiang, and T. M. Hospedales. Sketch-a-net: A deep neural network that beats humans. International journal of computer vision, 122(3):411–425, 2017.
  • [49] D. Zhang and W.-J. Li. Large-scale supervised multimodal hashing with semantic correlation maximization. In AAAI, volume 1, page 7, 2014.
  • [50] H. Zhang, S. Liu, C. Zhang, W. Ren, R. Wang, and X. Cao. Sketchnet: Sketch classification with web images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1105–1113, 2016.
  • [51] J. Zhang, F. Shen, L. Liu, F. Zhu, M. Yu, L. Shao, H. T. Shen, and L. Van Gool. Generative domain-migration hashing for sketch-to-image retrieval. In Proceedings of the European Conference on Computer Vision (ECCV), pages 297–314, 2018.
  • [52] Z. Zhang and V. Saligrama. Zero-shot learning via semantic similarity embedding. In Proceedings of the IEEE international conference on computer vision, pages 4166–4174, 2015.
  • [53] Z. Zhang and V. Saligrama. Zero-shot learning via joint latent similarity embedding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6034–6042, 2016.

6 Appendix: Theoretical Analysis of EMS Loss

In this section, we will (1) give a formal definition of maximum intra-class distance and minimum inter-class distance; (2) show that for margin in EMS loss, is sufficient and necessary to ensure that the maximum intra-class distance is smaller than minimum inter-class distance, regardless of the number of categories.

6.1 Definition

Since we treat both sketch and photo as instances, we define the merged dataset as:

where and are mappings that map photos/sketches into a feature space, and represent photo, sketch and category respectively. They are described in detail in Sec. 3.1. For convenience, we also define .

Maximum Intra-class Distance and Minimum Inter-class Distance

For category , the maximum intra-class distance can be defined by

and the minimum inter-class distance:

Here we give a formulation of our objective, which is the maximum intra-class distance being smaller than minimum inter-class distance, by proposition :


Solve Problem with EMS

Instead of optimizing the distance among instances directly as indicated by Eq. 5, the proposed EMS loss uses prototypes to characterize the distribution of instances in feature space. If this EMS loss is well optimized, instances will be closer to their corresponding prototype than other prototypes in feature space. This relationship can be described as


For convenience, we denote as a region where

Also, we denote as a region where

It is easy to prove that . Note that if Eq. 6 holds,

and thus we can derive a sufficient condition for (Eq. 5):


We denote Eq. 7 as proposition .
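The objective stated in Sec. 6.1, that the maximum intra-class distance stays below the minimum inter-class distance, can be checked numerically on a set of embeddings. The sketch below assumes the condition is evaluated per category with Euclidean distance over the merged set of sketch and photo features; the function name `margin_objective_holds` is illustrative, and at least two categories are required.

```python
import numpy as np

def margin_objective_holds(features, labels):
    """True if, for every category, the largest intra-class distance is
    strictly smaller than the smallest distance to another category."""
    dist = np.linalg.norm(features[:, None, :] - features[None, :, :], axis=-1)
    for c in np.unique(labels):
        in_c = labels == c
        max_intra = dist[np.ix_(in_c, in_c)].max()
        min_inter = dist[np.ix_(in_c, ~in_c)].min()  # needs at least two categories
        if max_intra >= min_inter:
            return False
    return True
```

When this check passes, every instance ranks all members of its own category ahead of any member of another category, which is the situation retrieval benefits from.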

6.2 Finding Boundaries for

The following derivation is based on the assumption that our EMS loss can be well optimized, i.e. Eq. 6 holds , and the assumption that . Now the question is: what is the range of that is sufficient and necessary for ? In the rest of this section, we first calculate the closed form of and then prove that if , then . To this end, we only need to find the lower bound of , with regard to the number of categories: . Next, we prove . Finally, we prove is sufficient and necessary for .

Lemma 1

If , is an n-ball (a ball in n-dimensional space) with center and radius


If ,
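A short sketch of why such a region is a ball. It assumes the region for a pair of prototypes $c_i$, $c_j$ is $\{x : m\lVert x - c_i\rVert \le \lVert x - c_j\rVert\}$ with margin $m > 1$; this form and the symbols $m$, $c_i$, $c_j$ are assumed notation for illustration. Squaring both sides, collecting terms, and completing the square gives:

```latex
\begin{align*}
m^2\lVert x - c_i\rVert^2 &\le \lVert x - c_j\rVert^2 \\
(m^2-1)\lVert x\rVert^2 - 2\,x^\top\!\left(m^2 c_i - c_j\right) &\le \lVert c_j\rVert^2 - m^2\lVert c_i\rVert^2 \\
\left\lVert x - \frac{m^2 c_i - c_j}{m^2-1}\right\rVert^2 &\le \frac{m^2\,\lVert c_i - c_j\rVert^2}{(m^2-1)^2}
\end{align*}
```

Under this assumption the region is a ball with center $(m^2 c_i - c_j)/(m^2-1)$ and radius $m\lVert c_i - c_j\rVert/(m^2-1)$. When $m = 1$ the quadratic term cancels in the second line, and the same inequality describes the half-space bounded by the perpendicular bisector of the two prototypes.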

Lemma 2

If , then


By slightly expanding to , region becomes , where

So we can conclude that . Thus


Now we rewrite (Eq. 7) as


By Eq. 8,

which means if Eq. 9 holds for , it also holds for .

Lemma 3

is sufficient and necessary for .


We can write . Now , which are two n-balls with the same radius and different centers. The maximum intra-class distance is the diameter of each n-ball:

The minimum inter-class distance is the distance between two centers minus the diameter:

Letting , we obtain or . We discard the latter solution because the setting is meaningful only when . So in the binary-class case, is sufficient and necessary for .
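The binary-class computation can be worked through as follows. This is a sketch under assumptions: each class region is taken to be the ball $\{x : m\lVert x - c_i\rVert \le \lVert x - c_j\rVert\}$ with $m > 1$, which has center $o_{ij} = (m^2 c_i - c_j)/(m^2-1)$ and radius $r = m\,d/(m^2-1)$, where $d = \lVert c_1 - c_2\rVert$. The symbols $d_{\text{intra}}$ and $d_{\text{inter}}$ are introduced here for illustration.

```latex
\begin{align*}
d_{\text{intra}} &= 2r = \frac{2m\,d}{m^2-1}, \\
d_{\text{inter}} &= \lVert o_{12} - o_{21}\rVert - 2r
  = \frac{(m^2+1)\,d}{m^2-1} - \frac{2m\,d}{m^2-1}
  = \frac{(m-1)\,d}{m+1}, \\
d_{\text{intra}} < d_{\text{inter}}
  &\iff 2m < (m-1)^2
  \iff m^2 - 4m + 1 > 0
  \iff m > 2+\sqrt{3} \ \text{ or } \ m < 2-\sqrt{3}.
\end{align*}
```

Since $2-\sqrt{3} < 1$, only the first branch is compatible with $m > 1$, so under this assumed form the threshold would be $m > 2+\sqrt{3} \approx 3.73$, independent of the distance $d$ between the two prototypes.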

Lemma 4

is necessary for .


Consider an extreme condition, where two prototypes are far from the other prototypes. We notice that

Since we have no constraints on the location of the prototypes, this condition can always hold, regardless of the value of . When all the remaining prototypes satisfy this condition for both , we have and , which is the same as in the binary case. So becomes necessary to ensure the correctness of , and thus Lemma 4 is true.

Lemma 5

is sufficient for .


According to (Eq. 9), if we want to prove that proposition (4) holds, we have to show that every distinct pair satisfies . We remove a category , where and , from to form such that . Supposing satisfies Eq. 9, we have


When is not changed and prototypes are not moved,


Thus, is satisfied for any pair where , even if we directly adopt when . So we can conclude that is sufficient for . By Lemma 3, , we can conclude that is sufficient for .