Instance-level Sketch-based Retrieval by Deep Triplet Classification Siamese Network

11/28/2018 ∙ by Peng Lu, et al. ∙ Queen Mary University of London, Fudan University

Sketches are an effective communicative tool for expressing the abstract and intuitive meaning of objects, and recognizing free-hand sketch drawings is extremely useful in many real-world applications. While content-based sketch recognition has been studied for several decades, instance-level Sketch-Based Image Retrieval (SBIR) has attracted significant research attention only recently. Existing datasets, such as QMUL-Chair and QMUL-Shoe, focus on the retrieval of chairs and shoes. However, previous instance-level SBIR works have several key limitations: the state-of-the-art methods rely heavily on the pre-training process, the quality of edge maps, a multi-crop testing strategy, and the augmentation of sketch images. To solve instance-level SBIR efficiently, we propose a new Deep Triplet Classification Siamese Network (DeepTCNet), which employs DenseNet-169 as the basic feature extractor and is optimized by a triplet loss and a classification loss. Critically, the proposed DeepTCNet overcomes the limitations of previous works. Extensive experiments on five benchmark sketch datasets validate the effectiveness of the proposed model. Additionally, to study sketch-based hairstyle retrieval, this paper contributes a new instance-level photo-sketch dataset, the Hairstyle Photo-Sketch dataset, which is composed of 3600 sketches and photos, and 2400 sketch-photo pairs.




1 Introduction

With the development of touch-screen devices such as the iPad and Surface, everyone can easily draw simple sketches. Compared with text, a sketch contains richer yet more abstract information. Sketch-Based Image Retrieval (SBIR) has been studied for nearly three decades (Kato et al., 1992). Most previous research efforts address category-level or fine-grained SBIR tasks (Cao et al., 2010; Hu and Collomosse, 2013b; Eitz et al., 2010, 2011a; F. et al., 2015). Recently, pioneering works, including TripletSN (Yu et al., 2016) and DSSA (Song et al., 2017), have investigated the problem of instance-level SBIR. Such a task is extremely useful in many real-world applications. For example, one can search for a particular photo of interest with a roughly drawn sketch.

Instance-level SBIR has been studied in TripletSN (Yu et al., 2016) and DSSA (Song et al., 2017), which jointly learn the sketch and photo branches. In both works, the authors first converted photos to edge maps using Edgebox (Zitnick and Dollár, 2014), which greatly reduces the domain gap between sketches and raw photos. They then extracted features from both sketches and edge maps with a sufficiently pre-trained Siamese network. The triplet loss (Yu et al., 2016) and the HOLEF loss (Song et al., 2017) were used to update the parameters of the network and to learn a fine-grained discriminative embedding in the high-level feature space. After training, the authors compared the distances between query sketches and candidate edge maps in feature space to retrieve the most similar edge map for each sketch. However, to learn the feature representations of edge maps and sketches efficiently, their models (Yu et al., 2016; Song et al., 2017) have to be pre-trained on many datasets, such as TU-Berlin (Eitz et al., 2012) and the edge maps of ImageNet-1K. The quality of the edge maps thus plays a key role in training these models (Yu et al., 2016; Song et al., 2017). Additionally, previous works have to augment the sketch images in order to gather enough training sketches.

To break these limitations, we propose a new architecture, the Deep Triplet Classification Siamese Network (DeepTCNet). Our model uses DenseNet-169 as the feature extractor in the SiameseNet style, and the whole network is organized in an end-to-end manner. We introduce two types of loss functions, namely the triplet loss (Schroff et al., 2015) and the classification loss (Liu et al., 2017b), to optimize the network. Critically, our model takes the images as input, rather than the edge maps used in previous works (Yu et al., 2016; Song et al., 2017). To address the matching problem between sketches and photo images, the triplet loss learns to pull the sketch instances closer to the positive photo images and push them away from the negative photo images. Furthermore, since sketches and photos come from two different visual domains, we use classification losses to bridge the gap between the domains. In particular, we introduce an auxiliary classification task that pulls paired sketch and photo images closer to each other in the embedding space learned by our DeepTCNet. More specifically, we employ three types of classification losses, i.e., softmax loss, spherical loss, and center loss.

On the other hand, most previous works are evaluated on two instance-level SBIR datasets, QMUL-Chair and QMUL-Shoe, which focus on the sketch-based retrieval of chairs and shoes. In contrast, we are interested in hairstyle photo-sketch matching, which has important business potential, for example in hairstyle recommendation. Hairstyle shapes the outline of the face and conveys the character and attitude of a person, and the sketch is introduced here as an efficient tool to help understand hairstyles. For instance, a customer in a barbershop can search for a hairstyle photo of interest with a roughly drawn sketch. In such a case, category-level hairstyle SBIR (Yin et al., 2017), which classifies hairstyles into dozens of classes, is not meaningful: one cannot ignore the huge variance within each hairstyle class. Variations in the direction, density, and outline of the hair, and in how the hairstyle is aligned with the face, intrinsically make it very difficult to model hairstyles systematically and precisely.

To this end, we present, for the first time, a novel instance-level SBIR dataset, the Hairstyle Photo-Sketch dataset (HPSD). The HPSD is derived from the Hairstyle30K dataset (Yin et al., 2017), which has been released and used as the competition dataset for zero-shot recognition in one AI grand challenge. In particular, 1200 photos are uniformly picked from 40 representative categories of Hairstyle30K, and the photos are chosen for high diversity. Given one photo, two sketches of different complexity are drawn by moderately trained drawers. In total, we thus obtain 3600 images in this dataset. The HPSD can serve as a new testbed for instance-level SBIR tasks. The detailed steps of constructing the HPSD are discussed in Sec. 3. Compared with the existing QMUL-Chair and QMUL-Shoe datasets (Yu et al., 2016; Sangkloy et al., 2016), our dataset is more challenging both in its scale and in that faces and hairstyles contain significant texture information. We hope that this dataset will greatly benefit research into sketch-based hairstyle modelling.

Contributions. We make several contributions in this paper.

  1. We propose a novel instance-level SBIR dataset Hairstyle Photo-Sketch dataset (HPSD) which focuses on instance-level hairstyle SBIR task, and can thus be taken as the testbed of new hairstyle retrieval algorithms.

  2. We propose a novel Deep Triplet Classification Siamese Network to efficiently solve instance-level SBIR tasks. Our DeepTCNet beats the state-of-the-art algorithms by a large margin. In particular, our model has several novel points:

    1. We introduce the classification loss as an auxiliary task to efficiently learn the whole network;

    2. Our network can directly use RGB photo images as input, and thus saves the huge pre-training cost required by previous works (Yu et al., 2016; Sangkloy et al., 2016);

    3. Our network does not rely on the quality of edge maps, unlike previous works (Yu et al., 2016; Sangkloy et al., 2016).

2 Related Work

The task of Sketch-Based Image Retrieval (SBIR) aims at using sketches of visual objects drawn by amateurs to find the best matching examples in an image collection. SBIR has been studied for nearly three decades (Kato et al., 1992). Besides retrieving images from sketch queries, the early SBIR approaches, especially in the computer graphics community, also enabled image synthesis, as in Sketch2Photo (Chen et al., 2009) and PhotoSketcher (Eitz et al., 2011b). These classical approaches have been reviewed in (Smeulders et al., 2000).

Features in SBIR. With the proliferation of touch and pen-based devices, a growing number of studies in the multimedia community have addressed the SBIR problem. Before the development of deep neural networks, these works mostly focused on the design of sophisticated features (Saavedra et al., 2015; Cao et al., 2011, 2013). Such hand-crafted features are inspired by early object descriptors, like BoW (Hu et al., 2011; Mathias et al., 2011), HOG, and Gradient Field HOG (Hu and Collomosse, 2013b). Furthermore, traditional ranking methods such as rank correlation (Eitz et al., 2011a) and rankSVM (Yu et al., 2016) are also used in some SBIR methods. Recently, deep Convolutional Neural Networks (CNNs) (Krizhevsky et al., 2012) and their variants have been employed to learn sketch features (Yu et al., 2017; Seddati et al., 2015; Song et al., 2017). Hash-based methods (Liu et al., 2017a; Zhao et al., 2015; Zhu et al., 2016) have also been explored as a way to obtain features that are efficient to search.

Learning end-to-end architectures in SBIR. Siamese networks (F. et al., 2015; Yu et al., 2016; Song et al., 2017) have been used for SBIR in an end-to-end way. In Wang et al. (F. et al., 2015), a variant of the Siamese network is used for both the photo branch and the sketch branch. Triplet Siamese networks (Yu et al., 2016; Song et al., 2017), which use three branches for sketches, positive images, and negative images, have also been shown to be very effective for SBIR tasks. However, to train their models well, Yu et al. (Yu et al., 2016) and Song et al. (Song et al., 2017) used well-designed triplets contributed by human annotators. The model proposed in (Song et al., 2017) adds a higher-order energy function to the original triplet loss, which captures the inter-dimension relationships among high-level features. Hard negative mining has also been investigated and used to find difficult triplet examples and boost model performance (Hermans et al., 2017; Felzenszwalb et al., 2010; Schroff et al., 2015).

Cross Domain Matching. SBIR can be formulated as a category-level or an instance-level retrieval task. Most previous works on SBIR address category-level SBIR (Hu and Collomosse, 2013b; Hu et al., 2011; Mathias et al., 2011, 2010; F. et al., 2015) or fine-grained category-level SBIR (Li et al., 2016, 2014b; Sangkloy et al., 2016), which aim at finding the image category corresponding to the sketch query. Instance-level SBIR was first introduced by Yu et al. (Yu et al., 2016). In general, SBIR learns to align two different domains, i.e., sketches and images, and can thus be taken as a special case of cross domain matching. The key to solving the cross domain matching problem is to obtain discriminative features for the two domains. The Siamese neural network and its variants have also been widely used in other cross domain matching tasks, such as person re-identification (Yi et al., 2014; Li et al., 2014a; Ahmed et al., 2015; Ding et al., 2015), face verification (Schroff et al., 2015), and image captioning (Xu et al., 2015; Karpathy and Fei-Fei, 2015; Vinyals et al., 2015).

Figure 1: Examples of face photo-sketch pairs in our dataset. Two types of sketches, i.e., simple and complex sketches, are drawn for each face photo. The complex sketches are used in our experiments.

Metric Learning. Many works in face verification and person re-identification capture fine-grained features by metric learning (Schroff et al., 2015; Liu et al., 2017b; Wen et al., 2016; Deng et al., 2017; Wang et al., 2018; Zhang et al., 2017; Yi et al., 2014; Paisitkriangkrai et al., 2015; Xiong et al., 2014; Hirzer et al., 2012a). Schroff et al. (Schroff et al., 2015) proposed the triplet loss for the face recognition task and achieved the best performance at that time. In (Wen et al., 2016; Deng et al., 2017), a center loss or a marginal loss is added to measure the intra-class loss. Some researchers found that using only the Euclidean loss may not be good enough to learn the most discriminative features, so they designed softmax losses based on an angular margin (Liu et al., 2017b; Wang et al., 2018). This approach learns discriminative features in a manifold space, which is better for fine-grained tasks. Range Loss (Zhang et al., 2017) was designed to solve the long-tail problem in face verification. Metrics learned from relaxed pairs have been used for person re-identification (Hirzer et al., 2012b). Nevertheless, the performance of a model depends on the triplet loss, which in turn is computed from the triplet data, so several recent studies examine how to improve the triplet loss (Schroff et al., 2015; Cheng et al., 2016b; Hermans et al., 2017). An improved triplet loss based on hard negative mining was proposed in (Hermans et al., 2017). Yi et al. (Yi et al., 2014) designed a loss function based on the similarity matrix between features. There is also a kernel-based metric learning method for person re-identification (Xiong et al., 2014), as well as a method that combines several metrics to learn a better model (Paisitkriangkrai et al., 2015).


Figure 2: Examples of hairstyles of different cultures used in our dataset.

Datasets of SBIR. Several previous benchmark datasets for SBIR are designed for category-level retrieval tasks, including the TU-Berlin dataset (Eitz et al., 2012) and the Flickr15k dataset (Hu and Collomosse, 2013b). Specifically, the TU-Berlin dataset has 20000 sketches uniformly distributed over 250 categories, and for SBIR tasks it is generally combined with photo datasets that share the same categories, such as PASCAL VOC (Everingham et al., 2010). For example, the dataset proposed in (Li et al., 2014c) combines 1120 sketches from TU-Berlin and 7267 images from PASCAL over 14 selected classes. The Flickr15k dataset includes 14660 images of 33 categories, with 10 sketches drawn for each category given the images of the corresponding class. Recently, fine-grained sketch datasets have also been introduced, such as the QMUL-Shoe and QMUL-Chair datasets (Yu et al., 2016) and the Sketchy dataset (Sangkloy et al., 2016). In particular, the two earliest instance-level sketch-photo datasets proposed in (Yu et al., 2016), QMUL-Shoe and QMUL-Chair, contain 419 sketch-photo pairs of shoes and 297 pairs of chairs in total, with 32000 densely annotated triplet rankings; each sketch is drawn by an amateur who is shown the corresponding photo. However, the small scale of the data reduces the reliability of evaluating SBIR models on these datasets. The Sketchy dataset (Sangkloy et al., 2016) is composed of photographic objects from 125 categories and collects 75471 free-hand human sketches of 12500 objects. In contrast to this dataset, our hairstyle sketch dataset targets instance-level SBIR. Very recently, Google released the largest doodling dataset, collected by launching an online game.

3 Hairstyle Photo-Sketch dataset

Figure 3: Overview of the whole network structure.

Our Hairstyle Photo-Sketch dataset is designed to study instance-level SBIR, as well as related tasks such as hairstyle recommendation or editing. This dataset is a nontrivial extension of the existing Hairstyle30K dataset (Yin et al., 2017). In total, there are 3600 sketches and photos, and 2400 sketch-photo pairs. In particular, two types of sketches, namely simple and complex sketches, have been collected for each photo. Fig. 1 shows some examples.

3.1 Uniqueness

Face photo-sketch matching has also been investigated in previous works, with corresponding datasets including CUFS (Wang and Tang, 2009), VIPSL (Gao et al., 2012), and IIIT-D (Bhatt et al., 2012). The sketches in these datasets are drawn by artists, and the texture information and the regions between bright and shadowed areas are precisely expressed in each image.

Compared with these datasets, we highlight several differences. Firstly, we aim at understanding sketch images contributed by ordinary people who are not artists. For example, our contributors, like the annotators of the Sketchy dataset (Sangkloy et al., 2016), use only very simple strokes to outline the main object or its features in an image. Secondly, and more interestingly, since we aim at the instance-level SBIR task across different hairstyles, the characteristics of the human faces in the images are not the determining factor in retrieving the most similar sketch image. This poses new challenges to existing face-based deep architectures (Liu et al., 2017b; Sun et al., 2015; Vaquero et al., 2009; Wen et al., 2016). Finally, as a research topic in its own right, the hairstyle can convey a person's characteristics (Liu et al., 2014), yet the existing face photo-sketch datasets have very limited hairstyle diversity. No previous dataset covers hairstyle sketch modelling as comprehensively as our Hairstyle Photo-Sketch dataset. Additionally, our hairstyle sketch dataset can potentially be very useful for other tasks, such as fine-grained hairstyle classification and stroke classification; even hairstyle synthesis and modification can benefit from it.

3.2 Data collection

Hairstyle photos. The photos of the Hairstyle Photo-Sketch dataset are reused from the Hairstyle30K dataset (Yin et al., 2017), which is a fine-grained dataset designed for hairstyle-related tasks such as hairstyle classification, synthesis, and editing. Among the 64 types of hairstyles in Hairstyle30K, we select the 40 most commonly used fine-grained hairstyle classes, e.g., Afro, Bald, Bob hair, and so on. These selected categories are very representative and cover the most popular hairstyles in various countries such as the United Kingdom, China, and Japan. Illustrative images are shown in Fig. 2. From each hairstyle category, we select the 30 most representative photos, which show the hairstyles of different persons with very different viewpoints and poses. In total, 1200 photos are collected from these 40 fine-grained hairstyle classes.

Collecting sketches. We further collect hairstyle sketches for the selected photos. To avoid the variance in sketches caused by different artists (Berger et al., 2013), we invite a single drawer to complete all the sketches. Specifically, we showed each real hairstyle image to the drawer and asked the drawer to sketch the photo using a Wacom tablet. The drawer has moderate art training and can thus represent a professional hairstyle designer who is interested in sketching and changing hairstyles. For each photo, we require both a simple and a complex sketch version. The collected hairstyle sketches are shown in Fig. 1.

4 Methodology and Network Overview

In this section, we systematically develop and introduce each component of our framework, as shown in Fig. 3. Specifically, we give the problem setup and overview in Sec. 4.1. In Sec. 4.2, we discuss our key difference from existing works, which use edge maps as the input to SBIR. The details of our main framework structure, as well as the loss functions, are discussed in Sec. 5.

4.1 Overview

We first give the problem setup and an overview of our framework. Given a query sketch $s$ and a candidate collection of $N$ photos $\{p_i\}_{i=1}^{N}$, the goal of instance-level SBIR is to find the photo in the candidate set that best matches $s$. Naively, one can address this task by computing the similarity between the features of $s$ and of each photo $p_i$ and ranking the photos accordingly, provided that the computed features are discriminative and representative enough. To that end, SBIR algorithms have to solve the following two key questions efficiently:

  1. How can the framework effectively learn the features of sketches and photos so as to bridge the gap between the sketch and photo domains?

  2. How can fine-grained photo/sketch features be learned to capture the subtle differences of each photo/sketch instance?

Question (1) is the most common challenge for any SBIR task, while Question (2) is more essential to our task, since it is what makes instance-level SBIR possible.
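The naive retrieval step described above can be illustrated with a minimal Python sketch. It assumes that feature vectors have already been extracted; the function names and the toy 2-D features are hypothetical and serve only to show the ranking by Euclidean distance.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def rank_photos(sketch_feat, photo_feats):
    """Return photo indices ordered from the closest to the farthest photo."""
    return sorted(range(len(photo_feats)),
                  key=lambda i: euclidean(sketch_feat, photo_feats[i]))


# Toy 2-D features (hypothetical values, for illustration only).
query = [0.9, 0.1]
gallery = [[0.0, 1.0], [1.0, 0.0], [0.5, 0.5]]
ranking = rank_photos(query, gallery)
print("best match:", ranking[0])
```

The first index in the ranking is the retrieved photo; whether it is the correct instance depends entirely on how discriminative the learned features are, which is what the two questions above are about.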

To address these challenges, this paper presents a Deep Triplet Classification Siamese Network (DeepTCNet) as illustrated in Fig 3. Our DeepTCNet has three components. The first component includes three branches of dense blocks to process and extract the feature maps of the initial inputs of sketches and photos. The second and third components are the Euclidean Triplet loss and one classification loss layer to integrate these three branches into an end-to-end framework.

Our DeepTCNet is shown in Fig 3. The input of the network is a triplet of sketch anchor, positive photo image and negative photo image. The whole network is trained in an end-to-end manner. Given a query sketch image, the DeepTCNet will output the best matched face photo. Each component will be detailed in the next subsections.

4.2 Input Images of DeepTCNet

The input images of DeepTCNet are the RGB photo images and the expanded sketch images, where expanding refers to duplicating each sketch image into 3 channels. We use the DenseNet-169 feature extractor pre-trained on the ImageNet-1K dataset. In our experiments, we show that the proposed architecture can efficiently learn to model these input images.
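As a small illustration of the sketch expansion, the following Python snippet duplicates a single-channel image into three identical channels. It is only a sketch under the assumption that an image is stored as a nested list in channel-first order; the function name is hypothetical.

```python
def expand_to_three_channels(gray):
    """Duplicate an H x W single-channel image into a 3 x H x W image."""
    # Each channel gets its own copy of every row, so later in-place edits
    # to one channel do not leak into the others.
    return [[row[:] for row in gray] for _ in range(3)]


# A 2 x 2 toy sketch (hypothetical pixel values).
sketch = [[0, 255], [128, 64]]
expanded = expand_to_three_channels(sketch)
print(len(expanded), len(expanded[0]), len(expanded[0][0]))
```

After this step a sketch has the same shape as an RGB photo, so both can be fed to the same ImageNet-pre-trained feature extractor.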

We highlight that such input images are different from those in (Yu et al., 2016; Song et al., 2017), where the input images of the Siamese network are edge maps rather than RGB photos. Intuitively, it may be reasonable to first compute the edge maps of the input images in order to bridge the gap between the photo and sketch domains. However, directly using edge maps introduces additional issues. (1) It is very difficult to pre-train reasonable feature extractors for edge-map and sketch inputs. In fact, to train a reasonable network, Yu et al. (Yu et al., 2016) and Song et al. (Song et al., 2017) had to pre-train their models on ImageNet-1K, the edge maps of the ImageNet-1K images, the TU-Berlin dataset, and so on. In other words, they pre-trained their models on millions of images and tens of thousands of sketches only to recognize several hundred photo-sketch pairs (the QMUL-Shoe and QMUL-Chair datasets). In contrast, our models directly use the RGB images as the input to DenseNet-169, which is pre-trained only on the ImageNet-1K dataset. (2) The conversion from RGB photos to edge maps loses some information (e.g., texture) that may be important for the neural network to extract features. To sum up, our choice of input images makes training more efficient and does not lose the information contained in the RGB images.

5 Loss Functions of DeepTCNet

Formally, we define a triplet as $(s, p^{+}, p^{-})$, which consists of a query sketch $s$, a positive photo $p^{+}$, and a negative photo $p^{-}$. As in Fig. 3, we use four dense blocks (i.e., DenseNet-169 (Huang et al., 2017)) as the basic feature extractor for each input branch. The corresponding convolutional filters of the dense blocks share their weights in the SiameseNet style. We denote by $f_{\Theta}(\cdot)$ the feature of each branch, where $\Theta$ indicates the parameter set of the Siamese dense blocks.

In general, the loss functions are very important for training the deep network efficiently. In this paper, two types of loss functions, namely the triplet loss and the classification loss, are explored for training our DeepTCNet. Both help to optimize the network to achieve good instance-level SBIR performance. In our model, these two types of loss functions are combined into

$$\mathcal{L} = \alpha\,\mathcal{L}_{tri} + \beta\,\mathcal{L}_{cls} + \lambda\,R(\Theta),$$

where $\alpha$ and $\beta$ are the coordinating weights for the two loss terms and are set empirically. $\mathcal{L}_{tri}$ and $\mathcal{L}_{cls}$ represent the triplet loss and the classification loss, respectively. The classification loss can be the softmax loss, the center loss, or the Spherical loss, which are discussed in Sec. 5.2. $R(\Theta)$ indicates the penalty term on the network parameters $\Theta$; here we use the regularization term with the weight $\lambda$.

Intuitively, as a classical loss function for retrieval tasks (Gong et al., 2013), the triplet loss pulls the sketch instances closer to the positive photo images and pushes them away from the negative photo images. On the other hand, although the sketch and photo images come from different modalities, our DeepTCNet uses the same CNN blocks to extract the features from both domains. The classification loss is therefore introduced as an auxiliary task that aims at bridging the gap between the domains. In particular, the classification loss pulls the features of a sketch and the positive image from the same pair closer to each other.
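The weighted combination of the two loss terms and the penalty term can be written as a one-line function. The following Python sketch is illustrative only: the squared L2 form of the penalty, the argument names, and all numeric values are assumptions, not settings taken from this paper.

```python
def total_loss(l_tri, l_cls, params, w_tri, w_cls, w_reg):
    """Weighted sum of triplet loss, classification loss, and a penalty term."""
    # Assumption: the penalty is the squared L2 norm of the parameters.
    penalty = sum(p * p for p in params)
    return w_tri * l_tri + w_cls * l_cls + w_reg * penalty


# Hypothetical loss values, parameters, and weights.
value = total_loss(l_tri=1.0, l_cls=2.0, params=[1.0, -2.0],
                   w_tri=1.0, w_cls=0.5, w_reg=0.1)
print(value)
```

Raising the weight of the classification term shifts the optimization toward domain alignment, while raising the weight of the triplet term emphasizes the ranking of photos around each sketch.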

5.1 Triplet Loss

The triplet loss is widely used in retrieval tasks, such as face verification (Taigman et al., 2014), person re-identification (Cheng et al., 2016a; Hermans et al., 2017; Liu et al., 2016a, b), and so on. In principle, it aims at learning discriminative image features, which are important for retrieval, and particularly for the fine-grained, instance-level retrieval task in our scenario. This loss learns a correct ordering between each query sketch and its positive and negative photo images in the embedding space.

In our task, the triplet loss is trained on a series of triplets $\{(s_i, p_i^{+}, p_i^{-})\}$, where $p_i^{+}$ and $p_i^{-}$ represent the positive and negative photos corresponding to the query sketch $s_i$. The triplet loss learns to place $s_i$ closer to $p_i^{+}$ than to $p_i^{-}$. This design allows the triplet loss to be applied in many areas, such as image retrieval (Allan and Verbeek, 2009; Huang et al., 2015), person re-identification (Cheng et al., 2016b), etc. The triplet loss can thus reduce the intra-class variations and enlarge the inter-class variations. Specifically, the loss is defined as

$$\mathcal{L}_{tri} = \sum_{i} \max\Big(0,\; \Delta + D\big(f_{\Theta}(s_i), f_{\Theta}(p_i^{+})\big) - D\big(f_{\Theta}(s_i), f_{\Theta}(p_i^{-})\big)\Big),$$

where $f_{\Theta}(\cdot)$ is the feature extracted by the Siamese branches, $D(\cdot,\cdot)$ is the Euclidean distance function, and $\Delta$ is the margin between the query-positive and query-negative distances.
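To make the behaviour of the triplet loss concrete, the following Python sketch computes it for a single triplet of feature vectors. It assumes the usual hinge form with non-squared Euclidean distances; the function names and the margin value used in the example are hypothetical and not the setting of this paper.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_loss(sketch, positive, negative, margin):
    """Hinge loss on the gap between query-positive and query-negative distances."""
    d_pos = euclidean(sketch, positive)
    d_neg = euclidean(sketch, negative)
    # The loss is zero once the negative is farther than the positive by `margin`.
    return max(0.0, margin + d_pos - d_neg)


# Hypothetical 2-D features; margin=0.5 is an illustrative value.
well_ordered = triplet_loss([0.0, 0.0], [0.0, 1.0], [3.0, 4.0], margin=0.5)
mis_ordered = triplet_loss([0.0, 0.0], [3.0, 4.0], [0.0, 1.0], margin=0.5)
print(well_ordered, mis_ordered)
```

A triplet that is already ordered correctly with enough margin contributes nothing to the gradient, which is why the choice of negatives matters for training.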

5.2 Classification Loss

The triplet loss can efficiently constrain the sketch image to be closer to the positive photo than to the negative photos. However, the standard triplet loss in Eq. (3) is not optimized for bridging the gap between the sketch and photo domains. Specifically, as shown in Fig. 3, the same CNN blocks are used to extract features from both sketch and photo images, so the extracted features of paired sketches and photos should be close to each other. Moreover, as raised by Question (2) in Sec. 4.1, and quite unlike classical SBIR tasks, instance-level SBIR needs to learn discriminative fine-grained features of different photos.

To this end, the classification loss is introduced in our DeepTCNet as an auxiliary task to help learn better features from photos and sketches. In general, the classification loss is used to enforce the extracted features of a paired sketch and photo to be close to each other. In particular, the following three types of classification losses are integrated into our DeepTCNet model:

$$\mathcal{L}_{cls} = \omega_1\,\mathcal{L}_{soft} + \omega_2\,\mathcal{L}_{sph} + \omega_3\,\mathcal{L}_{cen},$$

where $\omega_1$, $\omega_2$, and $\omega_3$ are the weight parameters. To help the network learn more discriminative features, our classification loss combines three types of losses: (1) the softmax loss penalizes the learned features by Euclidean distance, which, however, has been shown to be not so robust for fine-grained tasks (Liu et al., 2017b); (2) the Spherical loss further constrains the learned features by angular / spherical distance; (3) additionally, the center loss is added to minimize the intra-class variations when optimizing the features. Detailed experimental comparisons of the different classification losses are given in Sec. 6.6.

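Softmax Loss. We employ the standard softmax classification loss in the form of

$$\mathcal{L}_{soft} = -\frac{1}{N}\sum_{i=1}^{N} \log \frac{e^{z_{i,y_i}}}{\sum_{j} e^{z_{i,j}}},$$

where $z_{i,j}$ is the $j$-th element of the prediction score $z_i$ of the $i$-th sample, $y_i$ is its label, and $N$ is the number of samples.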
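For reference, the standard softmax classification loss of a single sample can be computed as in the following Python sketch. The log-sum-exp shift is an implementation detail added here for numerical stability; the function name and the example scores are hypothetical.

```python
import math


def softmax_cross_entropy(scores, label):
    """Negative log-probability of `label` under a softmax over `scores`."""
    # Subtracting the maximum score avoids overflow in exp().
    top = max(scores)
    log_sum = top + math.log(sum(math.exp(s - top) for s in scores))
    return log_sum - scores[label]


# Two equally scored classes: the loss is log(2).
print(softmax_cross_entropy([0.0, 0.0], label=0))
```

The loss approaches zero when the score of the true class dominates all other scores and grows roughly linearly with the score gap when a wrong class dominates.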

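Spherical Loss. This loss optimizes the angular softmax as an auxiliary task to help our instance-level SBIR. Specifically, we use the output features of the sketch and positive-photo branches and denote such a feature by $x_i$. We take each matched pair of sketch and positive photo as the same class; thus we annotate the same label $y_i$ for both features of a pair.

We use a fully connected layer (with the weight matrix $W$) to implement the spherical loss function on the outputs of the sketch and positive-photo branches, as shown in Fig. 3. Thus we can rewrite the softmax loss as

$$\mathcal{L}_{soft} = -\frac{1}{N}\sum_{i=1}^{N} \log \frac{e^{\|W_{y_i}\|\,\|x_i\|\cos(\theta_{y_i,i}) + b_{y_i}}}{\sum_{j} e^{\|W_j\|\,\|x_i\|\cos(\theta_{j,i}) + b_j}},$$

where $\theta_{j,i}$ indicates the angle between the vectors $W_j$ and $x_i$. If we normalize $\|W_j\| = 1$, set all biases $b_j = 0$, and introduce an angle margin for the loss, we have the Spherical loss function

$$\mathcal{L}_{sph} = -\frac{1}{N}\sum_{i=1}^{N} \log \frac{e^{\|x_i\|\cos(m\,\theta_{y_i,i})}}{e^{\|x_i\|\cos(m\,\theta_{y_i,i})} + \sum_{j \neq y_i} e^{\|x_i\|\cos(\theta_{j,i})}},$$

where $\theta_{y_i,i}$ should be in the range $[0, \pi/m]$. The decision boundary is $\cos(m\,\theta_1) = \cos(\theta_2)$ for the binary-class case, and $m$ is the margin constant. To remove the restriction on the range of $\theta_{y_i,i}$ and make the function optimizable, we can extend $\cos(m\,\theta_{y_i,i})$ by generalizing it to a monotonically decreasing angle function $\psi(\theta_{y_i,i})$. Therefore, the spherical loss should be

$$\mathcal{L}_{sph} = -\frac{1}{N}\sum_{i=1}^{N} \log \frac{e^{\|x_i\|\,\psi(\theta_{y_i,i})}}{e^{\|x_i\|\,\psi(\theta_{y_i,i})} + \sum_{j \neq y_i} e^{\|x_i\|\cos(\theta_{j,i})}},$$

where $\psi(\theta_{y_i,i}) = (-1)^{k}\cos(m\,\theta_{y_i,i}) - 2k$, $\theta_{y_i,i} \in \left[\frac{k\pi}{m}, \frac{(k+1)\pi}{m}\right]$, and $k \in [0, m-1]$.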
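The angular-margin idea can be sketched in a few lines of Python. The code below follows the commonly used formulation of the angular softmax loss (Liu et al., 2017b), with a piecewise angle function that decreases monotonically on $[0, \pi]$; it is an illustrative single-sample version, and the function names, toy vectors, and the margin value 4 are assumptions rather than the settings of this paper.

```python
import math


def psi(theta, m):
    """Monotonically decreasing extension of cos(m * theta) over [0, pi]."""
    # k indexes the interval [k*pi/m, (k+1)*pi/m] that contains theta.
    k = min(int(theta * m / math.pi), m - 1)
    return (-1) ** k * math.cos(m * theta) - 2 * k


def angular_softmax_loss(feature, weights, label, m):
    """Single-sample angular softmax loss with normalized class weights."""
    norm_x = math.sqrt(sum(v * v for v in feature))
    logits = []
    for j, w in enumerate(weights):
        norm_w = math.sqrt(sum(v * v for v in w))
        cos_t = sum(a * b for a, b in zip(w, feature)) / (norm_w * norm_x)
        cos_t = max(-1.0, min(1.0, cos_t))  # guard against rounding errors
        if j == label:
            # Only the true class receives the angular margin.
            logits.append(norm_x * psi(math.acos(cos_t), m))
        else:
            logits.append(norm_x * cos_t)
    top = max(logits)
    log_sum = top + math.log(sum(math.exp(v - top) for v in logits))
    return log_sum - logits[label]


# Toy feature and two class weight vectors (hypothetical values).
x = [1.0, 0.5]
W = [[1.0, 0.0], [0.0, 1.0]]
print(angular_softmax_loss(x, W, label=0, m=1),
      angular_softmax_loss(x, W, label=0, m=4))
```

With a margin of 1 the function reduces to a softmax over cosine logits; a larger margin makes the same sample harder to classify, which forces paired sketch and photo features into a tighter angular cluster.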

Note that the Spherical loss function is employed in (Liu et al., 2017b) to learn an embedding function for face recognition. In contrast, it is introduced here to optimize a triplet network for the first time. In particular, this work focuses on SBIR rather than classification, and the Spherical loss serves as an efficient auxiliary task that helps to learn the Siamese network better.

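Center Loss. The center loss aims at minimizing the intra-class variations. It is formulated as

$$\mathcal{L}_{cen} = \frac{1}{2}\sum_{i=1}^{N} \|x_i - c_{y_i}\|_2^2,$$

where $x_i$ is the deep feature of the $i$-th sample and $c_{y_i}$ is the center of the deep features of the $y_i$-th class. In practice, it is difficult to compute the centers over all training data of a class, so we make two modifications in order to use this loss efficiently when training the CNN.

  1. Rather than using the centers of all training data, we use the center of each mini-batch;

  2. To avoid large perturbations caused by wrong data, we add a hyperparameter $\eta$ to control the update of the centers, as in the update equations below,

    $$\Delta c_j = \frac{\sum_{i=1}^{N} \delta(y_i = j)\,(c_j - x_i)}{1 + \sum_{i=1}^{N} \delta(y_i = j)}, \qquad c_j \leftarrow c_j - \eta\,\Delta c_j,$$

    where $\delta(\cdot) = 1$ when the condition is satisfied and $\delta(\cdot) = 0$ otherwise. In this way, we can use the center loss for training better discriminative features.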
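The mini-batch center loss and the damped center update can be illustrated as follows. This Python sketch follows the usual formulation of center loss (Wen et al., 2016); storing the centers in a dictionary, the function names, and the update rate of 0.5 are assumptions made for illustration.

```python
def center_loss(features, labels, centers):
    """Half the summed squared distance of each feature to its class center."""
    total = 0.0
    for x, y in zip(features, labels):
        total += sum((xi - ci) ** 2 for xi, ci in zip(x, centers[y]))
    return 0.5 * total


def update_centers(features, labels, centers, rate):
    """Move each class center toward the mini-batch features of that class."""
    updated = {}
    for j, c in centers.items():
        members = [x for x, y in zip(features, labels) if y == j]
        new_c = []
        for d in range(len(c)):
            # The "1 +" in the denominator keeps the step finite when a
            # class has no sample in the current mini-batch.
            delta = sum(c[d] - x[d] for x in members) / (1 + len(members))
            new_c.append(c[d] - rate * delta)
        updated[j] = new_c
    return updated


# One mini-batch with a single sample of class 0 (hypothetical values).
centers = {0: [0.0, 0.0], 1: [1.0, 1.0]}
batch, batch_labels = [[2.0, 0.0]], [0]
print(center_loss(batch, batch_labels, centers))
print(update_centers(batch, batch_labels, centers, rate=0.5))
```

Centers of classes that are absent from the mini-batch stay where they are, and a small update rate limits the influence of mislabeled or outlying samples.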

6 Experiments

6.1 Dataset

Our model is proposed for instance-level SBIR tasks, which enable ordinary users to search for images with rough sketches. We list and compare the benchmark datasets here.

QMUL-Shoe and QMUL-Chair datasets (Yu et al., 2016) contain 419 shoe and 297 chair photo-sketch pairs, respectively. Following the standard splits in (Yu et al., 2016), we use 304 and 200 pairs for training, and the remaining pairs for validation. Human triplet annotations are provided in both datasets. Photos and sketches are both resized to the same fixed size.

QMUL-Shoe v2 dataset is a larger version of the QMUL-Shoe dataset which contains 2000 photos and 6,730 sketches, where each photo corresponds to about three sketches. We use 1800 photos and their corresponding sketches as the training set and the rest for testing. Human-annotated triplets are not provided in this dataset, so our training triplets are generated automatically: for each anchor sketch, its source photo is used as the positive instance, and the negative instance is randomly sampled from all other photos.

Sketchy database (Sangkloy et al., 2016) is one of the largest sketch-photo databases. It includes 74,425 sketches and 12,500 photos. The instances are uniformly distributed over 125 categories. We randomly sample 90% of the instances for training and use the rest for testing. To generate triplets, we use the source photo of each sketch as its positive instance. As in TripletSN and DSSA (Yu et al., 2016; Song et al., 2017), the negative instance is sampled from photos in the same category as the sketch with a pre-defined probability, and from photos in other categories otherwise.

Hairstyle Photo-Sketch Dataset (HPSD) is the dataset proposed in this paper. It has 1200 photo-sketch pairs, where the photos are evenly distributed over 40 classes. Each photo corresponds to both a simple and a complex sketch. Unless otherwise specified, the simple sketches are used for testing. Note that in this dataset the photos differ markedly in facial features, head poses, and hair outlines, even within the same hairstyle category. In our HPSD, 1000 photo-sketch pairs are used for training and the remaining 200 pairs for testing. Since we do not have human triplet annotations as in (Yu et al., 2016), we generate the triplets randomly. Specifically, given a query sketch, we take its corresponding ground-truth photo as the positive instance, and randomly sample the negative instances as 5 photos from the same hairstyle category as the positive instance and 45 photos from the other hairstyle classes. Thus 50 triplets in total are generated for each query sketch.
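The random triplet generation described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name, the dictionary layout and the toy gallery (1000 photos split evenly over 40 classes) are assumptions made only for the example.

```python
import random


def generate_triplets(sketch_id, positive_photo, photo_classes,
                      n_same=5, n_other=45, rng=None):
    """Build (sketch, positive, negative) triplets for one query sketch.

    photo_classes maps photo id -> hairstyle class. Negatives are drawn
    without replacement: n_same from the positive's class (excluding the
    positive itself) and n_other from all other classes.
    """
    rng = rng or random.Random()
    pos_class = photo_classes[positive_photo]
    same = [p for p, c in photo_classes.items()
            if c == pos_class and p != positive_photo]
    other = [p for p, c in photo_classes.items() if c != pos_class]
    negatives = rng.sample(same, n_same) + rng.sample(other, n_other)
    return [(sketch_id, positive_photo, neg) for neg in negatives]


# Hypothetical gallery: photo ids 0..999, 25 photos per class, 40 classes.
photo_classes = {photo_id: photo_id // 25 for photo_id in range(1000)}
triplets = generate_triplets("sketch_0", 0, photo_classes, rng=random.Random(0))
```

Sampling a few negatives from the same class forces the network to separate visually similar hairstyles, while the out-of-class negatives keep the embedding globally consistent.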

6.2 Settings

We implement our model in PyTorch. Our Triplet Siamese Network uses DenseNet-169 (Huang et al., 2017) as the feature extractor in each branch. The DenseNet-169 is pre-trained on the ImageNet-1K dataset. We replace the final classifier layer with a fully connected layer whose output size equals the feature size. Our models, dataset, and code will be released upon acceptance.

At test time, we simply use the negative distance between the query feature and each photo feature as the ranking score, and choose the photo with the highest ranking score as the retrieval result. The model is optimized by the Adam algorithm with an initial learning rate of 0.0002. Batch normalization is only used inside DenseNet-169, and we do not use dropout. All input images are randomly cropped to a fixed size before being fed to the network. On our Hairstyle Photo-Sketch dataset, the model converges within 10 epochs, which takes about 3 hours in total on an NVIDIA 1080Ti GPU.
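The test-time ranking and the Top-K accuracy used in the tables can be sketched as below. The sketch assumes Euclidean distance between feature vectors, which is an assumption for illustration; the function names and the toy 2-D features are likewise hypothetical.

```python
import math


def rank_photos(query_feat, photo_feats):
    """Return photo indices sorted by ranking score (negative distance), best first."""
    scores = [-math.dist(query_feat, feat) for feat in photo_feats]
    return sorted(range(len(photo_feats)), key=lambda i: scores[i], reverse=True)


def top_k_accuracy(query_feats, photo_feats, ground_truth, k=1):
    """Percentage of queries whose true photo appears among the k best-ranked photos."""
    hits = 0
    for query, true_index in zip(query_feats, ground_truth):
        if true_index in rank_photos(query, photo_feats)[:k]:
            hits += 1
    return 100.0 * hits / len(query_feats)


# Toy features: three gallery photos and three query sketches.
photos = [[0.0, 0.0], [1.0, 0.0], [5.0, 5.0]]
queries = [[0.1, 0.0], [0.9, 0.2], [0.4, 0.0]]
ground_truth = [0, 1, 1]  # index of the matching photo for each query

top1 = top_k_accuracy(queries, photos, ground_truth, k=1)
top2 = top_k_accuracy(queries, photos, ground_truth, k=2)
```

Here the third query is closer to photo 0 than to its true photo 1, so it counts as a miss at Top-1 but as a hit at Top-2.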

6.3 Baselines

We compare against several competitors in the experiments. All of these baselines were proposed for instance-level SBIR.

TripletSN. In (Yu et al., 2016), a Siamese network is trained with the triplet loss for SBIR. The network extracts features from triplets, each consisting of a query sketch, a positive-instance edge map and a negative-instance edge map. With the triplet loss, the model learns representations that capture the distance between queries and instances, and thus facilitate image retrieval.

DSSA. The model proposed in (Song et al., 2017) adopts an attention mechanism with a coarse-fine fusion (CFF) block, and replaces the conventional triplet loss, which uses a first-order energy function, with a higher-order learnable energy function (HOLEF). These modifications greatly improve the performance on SBIR tasks. According to the ablation study in DSSA, the main improvement comes from CFF. In our experiments, DSSA without the HOLEF loss also achieves accuracy competitive with the complete DSSA, whereas HOLEF would enormously increase the space complexity of the test procedure. We therefore only use CFF in our DSSA experiments.

6.4 Limitations in Previous Works

In this section, we summarize the limitations of the previous works: TripletSN (Yu et al., 2016) and DSSA (Song et al., 2017).

Dataset Pre-training Training QMUL-Shoe (%) QMUL-Chair (%)
Top-1 Top-10 Top-1 Top-10
TripletSN 33.91 78.26 51.55 86.60
52.17 91.30 78.35 97.94
37.39 76.52 45.36 95.88
20.67 67.83 46.39 86.60
DSSA 40.87 86.09 72.16 92.78
59.13 94.78 82.47 98.97
37.39 80.00 61.74 96.91
21.74 66.96 40.21 85.57
DeepTCNet 1.74 12.17 8.25 24.74
63.48 95.65 96.91 100.00
Table 1: Performance of models with/without pre-training and training. The pre-training refers to the heavy pre-training process used in (Song et al., 2017; Yu et al., 2016). The training means that using the training data of each dataset to train the corresponding model. Note that our DeepTCNet does not use the pre-training strategy as mentioned in (Yu et al., 2016).

6.4.1 Heavily Relying on Complex Pre-training Process

In the line of work of TripletSN (Yu et al., 2016) and DSSA (Song et al., 2017), the good performance on instance-level SBIR heavily relies on the pre-training process, including (1) pre-training on the edge maps of ImageNet-1K, (2) pre-training on TU-Berlin (Eitz et al., 2012), and (3) pre-training on a combination of the ImageNet-1K and TU-Berlin datasets for a category-level retrieval task.

The sheer scale of the data and the computational cost make the pre-training of the previous works (Yu et al., 2016; Song et al., 2017) expensive and complex. For example, to pre-train on the edge maps of ImageNet-1K, millions of ImageNet-1K images have to be converted into edge maps. In contrast, the QMUL-Shoe and QMUL-Chair datasets together have only several thousand training and testing images. Furthermore, we notice that in practice the pre-training process of the previous works (Yu et al., 2016; Song et al., 2017) is already a complete pipeline for a category-level SBIR model, and it can even reach very competitive performance on the instance-level SBIR tasks considered here, as shown in Tab. 1. Critically, on the QMUL-Shoe dataset, DSSA with pre-training only (i.e., pre-trained but not trained on the target dataset) beats DSSA with training only (i.e., trained on the target dataset without pre-training).

Table 1 also reveals that the pre-training process is an important component of (Yu et al., 2016; Song et al., 2017). Without pre-training, the performance of the DSSA and TripletSN models degrades significantly. In contrast, our model does not need such a heavy pre-training process, and achieves comparable or even higher accuracy on both datasets.

6.4.2 Heavily Relying on the Quality of Edge Maps

We found that the quality of the edge maps is very important to the results of TripletSN and DSSA. This is reasonable, since the edge maps are employed to bridge the gap between sketch and photo images in both methods. In both methods, the edge maps of photos are extracted by EdgeBox (Eitz et al., 2012). We therefore generate edge maps with several methods and compare the resulting performance of TripletSN and DSSA. Specifically, we compare the edge maps generated by (1) the Canny edge detector (Canny, 1986); (2) XDog (Winnemöller et al., 2012); (3) EdgeBox, as used in (Yu et al., 2016; Song et al., 2017).

Some illustrative examples of edge maps are shown in Fig. 4. The performance of TripletSN and DSSA with the four types of edge maps is compared in Tab. 2. Both methods are very sensitive to the quality of the edge maps. In contrast, our DeepTCNet employs an end-to-end architecture which does not need to convert the photo images into edge maps.

Figure 4: Illustrative examples of edge maps extracted by different algorithms. The EdgeBox examples include both the results reported in (Yu et al., 2016) and our implementation using the same setting as (Yu et al., 2016).

| Dataset | Method | Canny (%) | XDog (%) | EdgeBox (%) | EdgeBox (%) |
|---|---|---|---|---|---|
| QMUL-Shoe | TripletSN | 32.17 / 75.65 | 32.17 / 76.52 | 33.91 / 77.39 | 52.17 / 91.30 |
| QMUL-Shoe | DSSA | 43.48 / 88.70 | 42.61 / 86.96 | 44.35 / 82.61 | 59.13 / 94.78 |
| QMUL-Chair | TripletSN | 81.44 / 100.00 | 65.98 / 95.88 | 78.35 / 98.97 | 78.35 / 97.94 |
| QMUL-Chair | DSSA | 84.54 / 98.97 | 70.10 / 96.91 | 82.47 / 96.91 | 82.47 / 98.97 |

Table 2: Performance (Top-1 / Top-10) of TripletSN and DSSA using different types of edge maps. The two EdgeBox columns distinguish the results reported in (Yu et al., 2016) from our implementation using the same setting as (Yu et al., 2016).

6.4.3 Multi-Cropping Testing Strategy

| Dataset | Method | Vanilla (%) | Multi-crop (%) | Improvement (%) |
|---|---|---|---|---|
| QMUL-Shoe | TripletSN | 43.48 / 87.83 | 52.17 / 91.30 | 8.69 / 3.47 |
| QMUL-Shoe | DSSA | 55.65 / 93.04 | 59.13 / 94.78 | 3.48 / 1.74 |
| QMUL-Shoe | DeepTCNet | 62.61 / 96.52 | 63.48 / 95.65 | 0.87 / -0.87 |
| QMUL-Chair | TripletSN | 69.07 / 97.94 | 78.35 / 97.94 | 9.28 / 0 |
| QMUL-Chair | DSSA | 76.92 / 96.91 | 82.47 / 98.97 | 5.55 / 2.06 |
| QMUL-Chair | DeepTCNet | 95.88 / 100.00 | 96.91 / 100.00 | 1.03 / 0 |

Table 3: Top-1 / Top-10 retrieval accuracies of each model with the vanilla and multi-crop testing strategies.

The multi-cropping testing strategy is used in both TripletSN (Yu et al., 2016) and DSSA (Song et al., 2017). (Note that this “multi-cropping” strategy is not explicitly explained in their papers; we found it in the released code.) Specifically, each testing photo/sketch pair is expanded into multiple (10 in (Yu et al., 2016; Song et al., 2017)) cropped testing pairs by cropping and horizontally flipping (Simonyan and Zisserman, 2015) both the photo and the sketch. Features are extracted from each cropped photo and sketch to compute the distance of each cropped photo/sketch pair. The final distance of the testing photo/sketch pair is the average over all cropped pairs. The “multi-cropping” process is visualized in Fig. 5.

Figure 5: Visualization of multi-crop testing.

The multi-cropping strategy significantly increases the computational burden at test time, especially on large-scale sketch-photo datasets such as Sketchy. In Tab. 3, we compare it against the vanilla testing strategy, in which features are extracted from only one sketch / photo / edge map image. We report the Top-1 / Top-10 accuracy on both QMUL-Shoe and QMUL-Chair, the benchmark datasets used in TripletSN (Yu et al., 2016) and DSSA (Song et al., 2017). We found that TripletSN and DSSA benefit from this testing strategy: the Top-1 accuracy of TripletSN increases by over 8 points on both datasets when using multi-crop rather than vanilla testing, e.g., by 8.69 points on the QMUL-Shoe dataset. In contrast, multi-crop changes the accuracy of DeepTCNet by about 1 point at most, which shows the robustness of our model to different testing strategies.
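The difference between vanilla and multi-crop testing can be sketched as follows. The layout of the 10 crops (four corners plus center, each with and without a horizontal flip) is the common 10-crop convention and is assumed here, as are the image and crop sizes, the use of Euclidean distance, and all function names; the released code of the baselines may differ.

```python
import math


def ten_crop_boxes(width, height, crop):
    """Return 10 (left, top, flipped) crop specs: 4 corners + center, plain and flipped."""
    lefts = [0, width - crop]
    tops = [0, height - crop]
    positions = [(left, top) for top in tops for left in lefts]
    positions.append(((width - crop) // 2, (height - crop) // 2))
    return [(left, top, flipped)
            for flipped in (False, True)
            for (left, top) in positions]


def multi_crop_distance(sketch_crop_feats, photo_crop_feats):
    """Average the distances of corresponding cropped sketch/photo feature pairs."""
    dists = [math.dist(s, p) for s, p in zip(sketch_crop_feats, photo_crop_feats)]
    return sum(dists) / len(dists)


# Hypothetical sizes: a 256x256 image and 225x225 crops.
boxes = ten_crop_boxes(256, 256, 225)

# Toy features for two crops of one sketch/photo pair.
distance = multi_crop_distance([[0.0, 0.0], [0.0, 0.0]],
                               [[3.0, 4.0], [0.0, 0.0]])
```

With 10 crops per image, every query and every gallery photo needs 10 forward passes instead of one, which is where the extra test-time cost of multi-crop comes from.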

Method QMUL-Chair (%) QMUL-Shoe (%) QMUL-Shoe v2 (%) Sketchy(%) HPSD(s) (%) HPSD(c) (%)
HOG+BoW + rankSVM 28.87 17.39 0.29
Dense HOG+rankSVM 52.57 24.35 11.63
ISN Deep + rankSVM 45.36 20.87 7.21 12.00 12.00
ICSL (Xu et al., 2016) 36.40 34.78
Deep Shape Matching (Radenović et al., 2018) 81.40 54.80
LDSA (Muhammad et al., 2018) 21.17
USPG (Li et al., 2018) 26.88
Sketchy (Sangkloy et al., 2016) 37.10
Triplet SN (Yu et al., 2016) 72.16 52.17 30.93 21.63 41.50 41.50
DSSA (Song et al., 2017) 81.44 61.74 33.63 45.00 45.50
DeepTCNet 96.91 63.48 40.02 40.81 64.00 68.50
Table 4: Results of instance-level SBIR on five benchmark datasets. The numbers are top-1 retrieval accuracies. Some baseline results are those reported in (Yu et al., 2016; Song et al., 2017).

6.5 Main Results of DeepTCNet

Additional Competitors. Besides the Triplet SN and DSSA methods, we additionally compare against three naive baselines, as in (Song et al., 2017).

  1. HOG+BoW+rankSVM: HOG features are widely used in classical SBIR (Berger et al., 2013; Hu and Collomosse, 2013a). We adopt this baseline by first extracting HOG features from the images and generating a 500-d BoW descriptor. A rankSVM trained with the triplet annotations is employed as the retrieval model.

  2. Dense HOG+rankSVM: 200704-d dense HOG features are extracted from the images, and the triplet annotations are used to train the rankSVM model for the SBIR task. The dense HOG features are expected to be more informative than the HOG features.

  3. ISN Deep + rankSVM: The improved Sketch-a-Net (ISN) (Yu et al., 2015) was designed for sketch recognition. Here, the edge map is first computed for the photos. The Sketch-a-Net, pre-trained on TU-Berlin and ImageNet-1K edge maps, is then used to extract features of both photos and sketches. As in (Song et al., 2017), the features of the fc6 layer are used as the representation. A rankSVM is subsequently trained using these features and the triplet annotations to predict the ranking order of edge maps for a given query sketch.

Besides, we also compare our model with ICSL (Xu et al., 2016), Deep Shape Matching (Radenović et al., 2018), LDSA (Muhammad et al., 2018), USPG (Li et al., 2018), and Sketchy (Sangkloy et al., 2016).

Our results. We report the results of instance-level SBIR tasks on the benchmark datasets in Tab. 4. On all datasets, our network achieves the best performance. This validates the effectiveness of our models.

Our model outperforms the second-best methods by over 10 percentage points on the Hairstyle Photo-Sketch and QMUL-Chair datasets. On HPSD, we report the performance with both the simple sketches, i.e., HPSD (s), and the complex sketches, i.e., HPSD (c). With HPSD (c), our model is better trained and reaches higher retrieval accuracy. This is reasonable, since the complex sketches carry much more information to help train the network.

On both the HPSD and Sketchy datasets, our model performs much better than edge-map-based methods such as TripletSN (Yu et al., 2016) and DSSA (Song et al., 2017), because the photos in these two datasets contain rich background and texture information. Additionally, Sangkloy et al. (Sangkloy et al., 2016) also used raw photos and achieved relatively high accuracy on the Sketchy dataset.

We also analyze the results on the other three datasets, i.e., QMUL-Chair, QMUL-Shoe and QMUL-Shoe v2. Unlike HPSD and Sketchy, these three datasets contain simple objects, e.g., shoes and chairs, so the extracted edge maps are clear enough to help train the edge-map-based networks. Nevertheless, our model still outperforms the other baselines on these datasets. This further shows the capability of our DeepTCNet to extract discriminative and representative features for instance-level SBIR tasks, and demonstrates the efficacy of introducing the classification loss to bridge the gap between domains.

Qualitative results. Given a query sketch image, we show the top-10 best matched photos in Fig. 6. Interestingly, since our model is optimized on exact sketch-photo pairs (the instance-level SBIR task), there is little constraint on the ranking of the remaining results beyond top-1. As shown in Fig. 6, quite different photos may appear in the top-10 retrieval results.

Figure 6: Top-10 retrieval results for several query sketches used as anchors.

6.6 Ablation Study

Combination of different losses. We report the performance of various combinations of losses in Tab. 5. In each combination, we keep the same DeepTCNet architecture and only change the loss functions used to optimize the network. This ablation study helps us better understand the role of each loss in our DeepTCNet. (1) A natural question is how the triplet loss compares with the classification losses; this is also reflected in Tab. 5. Intuitively, the triplet loss is of central importance, since we are targeting a retrieval task. Nevertheless, even when using the softmax loss only (i.e., directly treating instance-level SBIR as a classification task), the results are comparable to those obtained with the triplet loss only. (2) The combination of the triplet and Spherical losses (i.e., Triplet+Spherical) achieves the best performance on three datasets: QMUL-Chair, Sketchy, and HPSD(s). However, this combination is not the best on QMUL-Shoe and QMUL-Shoe v2. This suggests that the Spherical loss is more important than the softmax and center losses in helping to bridge the gap between domains, while the other two losses are helpful on the datasets where the Spherical loss performs slightly worse.

| Losses | QMUL-Shoe (%) | QMUL-Chair (%) | QMUL-Shoe v2 (%) | Sketchy (%) | HPSD(s) (%) |
|---|---|---|---|---|---|
| Triplet | 26.96 | 81.44 | 29.43 | 0.38 | 49.00 |
| Centre | 21.74 | 61.86 | 6.61 | 0.11 | 18.50 |
| Sphere | 23.48 | 75.26 | 1.20 | 10.45 | 44.00 |
| Softmax | 26.09 | 82.47 | 23.12 | 17.28 | 36.00 |
| Triplet+Centre | 59.13 | 92.78 | 34.89 | 12.87 | 56.00 |
| Triplet+Spherical | 57.39 | 96.91 | 38.74 | 40.81 | 65.00 |
| Triplet+Softmax | 59.13 | 91.75 | 37.84 | 35.19 | 63.00 |
| DeepTCNet | 63.48 | 95.88 | 40.02 | 36.32 | 64.00 |

Table 5: Ablation study of combining different losses. The deep architecture of DeepTCNet is kept the same for all variants. We only use different combinations of loss functions.

Edge map vs. RGB Image. We also compare model variants with different inputs in Tab. 6. We directly use the edge maps generated by Yu et al. (Yu et al., 2016) to train our model. The results are much lower than those of our model trained on RGB images. This is reasonable, since our DenseNet-169 is pre-trained on ImageNet-1K images rather than on edge maps as in (Yu et al., 2016; Song et al., 2017). To fully exploit the information in the edge maps, it would be necessary to pre-train the feature extractor on TU-Berlin, ImageNet-1K images, and the edge maps of ImageNet-1K images, as done in (Yu et al., 2016). One important merit of our model is that it can skip this pre-training procedure, which indirectly reflects the effectiveness of the proposed model.

| Method | QMUL-Chair (%) | QMUL-Shoe (%) | HPSD (%) |
|---|---|---|---|
| DSSA (Song et al., 2017) | 81.44 | 61.74 | 42.50 |
| DeepTCNet (Edge Map) | 57.73 | 31.30 | 35.50 |
| DeepTCNet (RGB image) | 85.57 | 56.52 | 54.50 |

Table 6: Results of our DeepTCNet with different input images.

Triplet Selection. We also study how triplet selection affects the performance. To gain insight into this problem, we conduct further experiments on the QMUL-Chair dataset, which has triplet annotations contributed by humans (Yu et al., 2016). Such human annotations are very expensive in practice. In contrast, a naive and straightforward way of selecting triplets is random selection. Specifically, given a query sketch, we take its corresponding photo as the positive image and randomly sample negative photos from the others. In this way, we randomly generate 10, 20, or 50 triplets for each query sketch, and use the sampled triplets to train the corresponding models. The results are summarized in Tab. 7. Each experiment is repeated 5 times, and the averaged results are reported for R-10, R-20, and R-50. Tab. 7 shows that the human-labelled triplets indeed benefit the performance of our model. However, manually choosing appropriate triplets for training remains a non-trivial, difficult and time-consuming task for human annotators.

| Method | R-10 | R-20 | R-50 | H-L |
|---|---|---|---|---|
| TripletSN | 71.13 | 77.32 | 78.35 | 78.35 |
| DSSA | 78.35 | 84.54 | 79.38 | 82.47 |
| DeepTCNet | 87.63 | 89.69 | 86.60 | 96.91 |

Table 7: Results of different triplet sampling methods on QMUL-Chair. R-10, R-20, and R-50 indicate randomly generating 10, 20, and 50 triplets for each query sketch. H-L denotes the triplets contributed by human annotation.

6.7 Visualization

To investigate the distribution of the sketch and photo representations in the high-dimensional feature space, we use multidimensional scaling (MDS) to project them to 2-D space, since this dimension reduction method preserves distances well. Using the trained DenseNet-169, we extract the high-level representations of the 200 test photo-sketch pairs. We plot the 2-D MDS representations together with the raw sketches and photos in Fig. 7. Clearly, there are no evident clusters in Fig. 7, and yet each matched sketch and photo are close to each other. This means that our model successfully learns instance-level features of the hairstyle photo-sketch pairs.

Figure 7: Visualization of testing pairs on HPSD. Blue dots represent queries and red dots represent instances. Each pair of blue and red stars represents a correctly matched photo-sketch pair. The green star marks an incorrect retrieval result for the query sketch represented by the blue star.

7 Conclusion

In this paper, we propose a new Hairstyle Photo-Sketch dataset with sketches at two complexity levels, which can be used for instance-level Sketch-Based Image Retrieval. Such a task is more challenging than category-level or fine-grained SBIR tasks, and thus needs fine-grained feature maps to bridge the domain gap between sketches and photos. To this end, we introduce a new model called Deep Triplet Classification Siamese Network (DeepTCNet), which uses DenseNet-169 as the feature extractor and is trained with a combination of two loss functions, namely the triplet loss and the classification loss. We also conduct extensive experimental evaluation on five instance-level SBIR datasets. We show that our proposed model achieves better performance and avoids the heavy pre-training process that is necessary in previous methods (Yu et al., 2016; Song et al., 2017).

8 Acknowledgement

We thank Miss Yiqing Ma for collecting the hairstyle photo and sketch images.


References

  • Ahmed et al. (2015) Ahmed E, Jones M, Marks TK (2015) An improved deep learning architecture for person re-identification. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 3908–3916

  • Allan and Verbeek (2009) Allan M, Verbeek J (2009) Ranking user-annotated images for multiple query terms. In: British Machine Vision Conference
  • Berger et al. (2013) Berger I, Shamir A, Mahler M, Carter E, Hodgins J (2013) Style and abstraction in portrait sketching. ACM Siggraph
  • Bhatt et al. (2012) Bhatt HS, Bharadwaj S, Singh R, Vatsa M (2012) Memetic approach for matching sketches with digital face images. Tech. rep.
  • Canny (1986) Canny J (1986) A computational approach to edge detection. IEEE Transactions on pattern analysis and machine intelligence (6):679–698
  • Cao et al. (2013) Cao X, Zhang H, Liu S, Guo X, Lin L (2013) Sym-fish: A symmetry-aware flip invariant sketch histogram shape descriptor. In: Proceedings of the IEEE International Conference on Computer Vision, pp 313–320
  • Cao et al. (2010) Cao Y, Wang H, Wang C, Li Z, Zhang L, Zhang L (2010) Mindfinder: interactive sketch-based image search on millions of images. In: Proceedings of the 18th ACM international conference on Multimedia, ACM, pp 1605–1608
  • Cao et al. (2011) Cao Y, Wang C, Zhang L, Zhang L (2011) Edgel index for large-scale sketch-based image search
  • Chen et al. (2009) Chen T, Cheng MM, Tan P, Shamir A, Hu SM (2009) Sketch2photo: internet image montage. In: ACM Siggraph Asia
  • Cheng et al. (2016a) Cheng D, Gong Y, Zhou S, Wang J, Zheng N (2016a) Person re-identification by multi-channel parts-based cnn with improved triplet loss function. In: CVPR
  • Cheng et al. (2016b) Cheng D, Gong Y, Zhou S, Wang J, Zheng N (2016b) Person re-identification by multi-channel parts-based cnn with improved triplet loss function. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 1335–1344
  • Deng et al. (2017) Deng J, Zhou Y, Zafeiriou S (2017) Marginal loss for deep face recognition. In: Proceedings of IEEE International Conference on Computer Vision and Pattern Recognition (CVPRW), Faces “in-the-wild” Workshop/Challenge, vol 4
  • Ding et al. (2015) Ding S, Lin L, Wang G, Chao H (2015) Deep feature learning with relative distance comparison for person re-identification. Pattern Recognition 48(10):2993–3003
  • Eitz et al. (2010) Eitz M, Hildebrand K, Boubekeur T, Alexa M (2010) An evaluation of descriptors for large-scale image retrieval from sketched feature lines. Computers & Graphics 34(5):482–498
  • Eitz et al. (2011a) Eitz M, Hildebrand K, Boubekeur T, Alexa M (2011a) Sketch-based image retrieval: Benchmark and bag-of-features descriptors. IEEE transactions on visualization and computer graphics 17(11):1624–1636
  • Eitz et al. (2011b) Eitz M, Richter R, Hildebrand K, Boubekeur T, Alexa M (2011b) Photosketcher: Interactive sketch-based image synthesis. IEEE Computer Graphics and Applications
  • Eitz et al. (2012) Eitz M, Hays J, Alexa M (2012) How do humans sketch objects? ACM Trans Graph (Proc SIGGRAPH) 31(4):44:1–44:10
  • Everingham et al. (2010) Everingham M, Van Gool L, Williams CKI, Winn J, Zisserman A (2010) The pascal visual object classes (voc) challenge. International Journal of Computer Vision 88(2):303–338
  • F. et al. (2015) F W, L K, Y L (2015) Sketch-based 3d shape retrieval using convolutional neural networks. In: CVPR
  • Felzenszwalb et al. (2010) Felzenszwalb PF, Girshick RB, McAllester D, Ramanan D (2010) Object detection with discriminatively trained part-based models. IEEE transactions on pattern analysis and machine intelligence 32(9):1627–1645
  • Gao et al. (2012) Gao X, Wang N, Tao D, Li X (2012) Face sketch–photo synthesis and retrieval using sparse representation. IEEE Transactions on circuits and systems for video technology 22(8):1213–1226
  • Gong et al. (2013) Gong Y, Ke Q, Isard M, Lazebnik S (2013) A multi-view embedding space for modeling internet images, tags, and their semantics. International Journal of Computer Vision
  • Hermans et al. (2017) Hermans A, Beyer L, Leibe B (2017) In defense of the triplet loss for person re-identification. arXiv preprint arXiv:1703.07737
  • Hirzer et al. (2012a) Hirzer M, Roth PM, Kostinger M, Bischof H (2012a) Relaxed pairwise learned metric for person re-identification. In: ECCV
  • Hirzer et al. (2012b) Hirzer M, Roth PM, Köstinger M, Bischof H (2012b) Relaxed pairwise learned metric for person re-identification. In: European Conference on Computer Vision, Springer, pp 780–793
  • Hu and Collomosse (2013a) Hu R, Collomosse J (2013a) A performance evaluation of gradient field hog descriptor for sketch based image retrieval. CVIU
  • Hu and Collomosse (2013b) Hu R, Collomosse J (2013b) A performance evaluation of gradient field hog descriptor for sketch based image retrieval. CVIU
  • Hu et al. (2011) Hu R, Wang T, Collomosse J (2011) A bag-of-regions approach to sketch based image retrieval. In: ICIP
  • Huang et al. (2017) Huang G, Liu Z, Weinberger KQ, van der Maaten L (2017) Densely connected convolutional networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition, vol 1, p 3
  • Huang et al. (2015) Huang J, Feris RS, Chen Q, Yan S (2015) Cross-domain image retrieval with a dual attribute-aware ranking network. In: ICCV
  • Karpathy and Fei-Fei (2015) Karpathy A, Fei-Fei L (2015) Deep visual-semantic alignments for generating image descriptions. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 3128–3137
  • Kato et al. (1992) Kato T, Kurita T, Otsu N, Hirata K (1992) A sketch retrieval method for full color image database-query by visual example. In: Vol.I. Conference A: Computer Vision and Applications, Proceedings., 11th IAPR International Conference on
  • Krizhevsky et al. (2012) Krizhevsky A, Sutskever I, Hinton GE (2012) Imagenet classification with deep convolutional neural networks. In: NIPS
  • Li et al. (2016) Li K, Pang K, Song YZ, Hospedales T, Zhang H, Hu Y (2016) Fine-grained sketch-based image retrieval: The role of part-aware attributes. In: WACV
  • Li et al. (2018) Li K, Pang K, Song J, Song YZ, Xiang T, Hospedales TM, Zhang H (2018) Universal sketch perceptual grouping. In: arXiv
  • Li et al. (2014a) Li W, Zhao R, Xiao T, Wang X (2014a) Deepreid: Deep filter pairing neural network for person re-identification. In: CVPR
  • Li et al. (2014b) Li Y, Hospedales T, Song YZ, Gong S (2014b) Fine-grained sketch-based image retrieval by matching deformable part models. In: BMVC
  • Li et al. (2014c) Li Y, Hospedales TM, Song YZ, Gong S (2014c) Fine-grained sketch-based image retrieval by matching deformable part models
  • Liu et al. (2016a) Liu J, Zha ZJ, Tian Q, Liu D, Yao T, Ling Q, Mei T (2016a) Multi-scale triplet cnn for person re-identification. In: Proceedings of the 2016 ACM on Multimedia Conference, ACM, pp 192–196
  • Liu et al. (2016b) Liu J, Zha ZJ, Tian Q, Liu D, Yao T, Ling Q, Mei T (2016b) Multi-scale triplet cnn for person re-identification. In: ACM Multimedia
  • Liu et al. (2014) Liu L, Xing J, Liu S, Xu H, Zhou X, Yan S (2014) Wow! you are so beautiful today! ACM TMCCA
  • Liu et al. (2017a) Liu L, Shen F, Shen Y, Liu X, Shao L (2017a) Deep sketch hashing: Fast free-hand sketch-based image retrieval. In: Proc. CVPR, pp 2862–2871
  • Liu et al. (2017b) Liu W, Wen Y, Yu Z, Li M, Raj B, Song L (2017b) Sphereface: Deep hypersphere embedding for face recognition. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), vol 1
  • Mathias et al. (2010) Mathias E, Kristian H, Tamy B, Marc A (2010) An evaluation of descriptors for large-scale image retrieval from sketched feature lines. Computers&Graphics
  • Mathias et al. (2011) Mathias E, Kristian H, Tamy B, Marc A (2011) Sketch-based image retrieval: Benchmark and bag-of-features descriptors. TVCG
  • Muhammad et al. (2018) Muhammad UR, Yang Y, Song YZ, Xiang T, Hospedales TM, et al. (2018) Learning deep sketch abstraction. arXiv preprint arXiv:1804.04804
  • Paisitkriangkrai et al. (2015) Paisitkriangkrai S, Shen C, Van Den Hengel A (2015) Learning to rank in person re-identification with metric ensembles. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 1846–1855
  • Radenović et al. (2018) Radenović F, Tolias G, Chum O (2018) Deep shape matching. In: arXiv
  • Saavedra et al. (2015) Saavedra JM, Barrios JM, Orand S (2015) Sketch based image retrieval using learned keyshapes (lks). In: BMVC, vol 1, p 7
  • Sangkloy et al. (2016) Sangkloy P, Burnell N, Ham C, Hays J (2016) The sketchy database: Learning to retrieve badly drawn bunnies. ACM Transactions on Graphics (proceedings of SIGGRAPH)
  • Schroff et al. (2015) Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: CVPR
  • Seddati et al. (2015) Seddati O, Dupont S, Mahmoudi S (2015) Deepsketch: deep convolutional neural networks for sketch recognition and similarity search. In: Content-Based Multimedia Indexing (CBMI), 2015 13th International Workshop on, IEEE, pp 1–6
  • Simonyan and Zisserman (2015) Simonyan K, Zisserman A (2015) Very deep convolutional networks for large-scale image recognition. In: ICLR
  • SMEULDERS et al. (2000) SMEULDERS A, WORRING M, SANTINI S, GUPTA A, JAIN R (2000) Content-based image retrieval at the end of the early years. IEEE TPAMI
  • Song et al. (2017) Song J, Qian Y, Song YZ, Xiang T, Hospedales T (2017) Deep spatial-semantic attention for fine-grained sketch-based image retrieval. In: ICCV
  • Sun et al. (2015) Sun Y, Liang D, Wang X, Tang X (2015) Deepid3: Face recognition with very deep neural networks. arXiv
  • Taigman et al. (2014) Taigman Y, Yang M, Ranzato M, Wolf L (2014) Deepface: Closing the gap to human-level performance in face verification. In: CVPR
  • Vaquero et al. (2009) Vaquero D, Feris R, Tran D, Brown L, Hampapur A, Turk M (2009) Attribute-based people search in surveillance environments. In: IEEE WACV
  • Vinyals et al. (2015) Vinyals O, Toshev A, Bengio S, Erhan D (2015) Show and tell: A neural image caption generator. In: Computer Vision and Pattern Recognition (CVPR), 2015 IEEE Conference on, IEEE, pp 3156–3164
  • Wang et al. (2018) Wang H, Wang Y, Zhou Z, Ji X, Li Z, Gong D, Zhou J, Liu W (2018) Cosface: Large margin cosine loss for deep face recognition. arXiv preprint arXiv:1801.09414
  • Wang and Tang (2009) Wang X, Tang X (2009) Face photo-sketch synthesis and recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 31(11):1955–1967
  • Wen et al. (2016) Wen Y, Zhang K, Li Z, Qiao Y (2016) A discriminative feature learning approach for deep face recognition. In: European Conference on Computer Vision, Springer, pp 499–515
  • Winnemöller et al. (2012) Winnemöller H, Kyprianidis JE, Olsen SC (2012) Xdog: an extended difference-of-gaussians compendium including advanced image stylization. Computers & Graphics 36(6):740–753
  • Xiong et al. (2014) Xiong F, Gou M, Camps O, Sznaier M (2014) Person reidentification using kernel-based metric learning methods. In: ECCV
  • Xu et al. (2015) Xu K, Ba J, Kiros R, Cho K, Courville A, Salakhudinov R, Zemel R, Bengio Y (2015) Show, attend and tell: Neural image caption generation with visual attention. In: International Conference on Machine Learning, pp 2048–2057
  • Xu et al. (2016) Xu P, Yin Q, Qi Y, Song YZ, Ma Z, Wang L, Guo J (2016) Instance-level coupled subspace learning for fine-grained sketch-based image retrieval. In: European Conference on Computer Vision, Springer, pp 19–34
  • Yi et al. (2014) Yi D, Lei Z, Liao S, Li SZ (2014) Deep metric learning for person re-identification. In: Pattern Recognition (ICPR), 2014 22nd International Conference on, IEEE, pp 34–39
  • Yin et al. (2017) Yin W, Fu Y, Ma Y, Jiang YG, Xiang T, Xue X (2017) Learning to generate and edit hairstyles. In: ACM MM
  • Yu et al. (2015) Yu Q, Yang Y, Song YZ, Xiang T, Hospedales T (2015) Sketch-a-net that beats humans. In: BMVC
  • Yu et al. (2016) Yu Q, Liu F, Song YZ, Xiang T, Hospedales TM, Loy CC (2016) Sketch me that shoe. In: CVPR
  • Yu et al. (2017) Yu Q, Yang Y, Liu F, Song YZ, Xiang T, Hospedales TM (2017) Sketch-a-net: a deep neural network that beats humans. IJCV
  • Zhang et al. (2017) Zhang X, Fang Z, Wen Y, Li Z, Qiao Y (2017) Range loss for deep face recognition with long-tailed training data. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 5409–5418
  • Zhao et al. (2015) Zhao F, Huang Y, Wang L, Tan T (2015) Deep semantic ranking based hashing for multi-label image retrieval. In: Computer Vision and Pattern Recognition (CVPR), 2015 IEEE Conference on, IEEE, pp 1556–1564
  • Zhu et al. (2016) Zhu H, Long M, Wang J, Cao Y (2016) Deep hashing network for efficient similarity retrieval. In: AAAI, pp 2415–2421
  • Zitnick and Dollár (2014) Zitnick CL, Dollár P (2014) Edge boxes: Locating object proposals from edges. In: European conference on computer vision, Springer, pp 391–405