Stacked Semantic-Guided Network for Zero-Shot Sketch-Based Image Retrieval

04/03/2019 ∙ by Hao Wang, et al. ∙ The University of Sydney, Columbia University, Xidian University

Zero-shot sketch-based image retrieval (ZS-SBIR) is a task of cross-domain image retrieval from a natural image gallery with free-hand sketch under a zero-shot scenario. Previous works mostly focus on a generative approach that takes a highly abstract and sparse sketch as input and then synthesizes the corresponding natural image. However, the intrinsic visual sparsity and large intra-class variance of the sketch make the learning of the conditional decoder more difficult and hence achieve unsatisfactory retrieval performance. In this paper, we propose a novel stacked semantic-guided network to address the unique characteristics of sketches in ZS-SBIR. Specifically, we devise multi-layer feature fusion networks that incorporate different intermediate feature representation information in a deep neural network to alleviate the intrinsic sparsity of sketches. In order to improve visual knowledge transfer from seen to unseen classes, we elaborate a coarse-to-fine conditional decoder that generates coarse-grained category-specific corresponding features first (taking auxiliary semantic information as conditional input) and then generates fine-grained instance-specific corresponding features (taking sketch representation as conditional input). Furthermore, regression loss and classification loss are utilized to preserve the semantic and discriminative information of the synthesized features respectively. Extensive experiments on the large-scale Sketchy dataset and TU-Berlin dataset demonstrate that our proposed approach outperforms state-of-the-art methods by more than 20% in retrieval performance.




1 Introduction

With the rapid development of mobile devices and the explosive growth of multimedia data on the Web, the patterns of image retrieval from large-scale datasets have greatly evolved. For example, image retrieval with free-hand sketches drawn on tablets, phones and even watches, i.e., Sketch-Based Image Retrieval (SBIR), has become increasingly common. The intuitive nature and simplicity of SBIR make it widely accepted among users for expressing the information they want to search for. Thus efforts on SBIR [9], [14], [18], [34], [52], [29], [35], [3], [13], [24], [20], [43], [4], [5], [19], [49], [32], [37] have attracted widespread attention in the research community in recent years. Nevertheless, these works share a serious limitation: they assume that all categories are known at the training stage. In a realistic scenario there is no guarantee that the training data is large enough to cover all categories, so retrieval performance declines sharply when test categories do not appear at the training stage. To ensure that the learned model generalizes well to emerging classes that are absent from the training data, ZS-SBIR [39], [48] is introduced to retrieve corresponding images using sketches whose categories never appear at the training stage.

Figure 1: The unique characteristics of sketches. When compared with natural images, free-hand sketches have large intra-class variance and possess intrinsic visual sparsity, which leads to a large domain gap and consequently decreases retrieval performance.

The task of ZS-SBIR is extremely challenging due to the unique characteristics of sketches. More specifically, free-hand sketches (1) possess intrinsic visual sparsity, since they consist of an ordered set of sparse strokes, and (2) have large intra-class variance, as different people tend to understand and draw sketches with various levels of abstraction, e.g., sketches of cats may depict part of the animal (especially the head) or the whole body. Details are shown in Figure 1. These problems result in a large domain gap between sketches and natural images. For ZS-SBIR, it is critical to transfer visual knowledge from sketches of seen classes to unseen classes, which makes it a more difficult case of SBIR with an extra restrictive setup. Hence, the unique characteristics of sketches, particularly their large intra-class variance (i.e., variation that can be seen as noise and significantly harms the transfer of visual knowledge), make ZS-SBIR extremely challenging.

In order to study and take advantage of existing relevant methods to the greatest extent, some SBIR methods are also included in this paper. Conventional SBIR methods extract hand-crafted features of sketches (e.g., HOG [8], gradient field HOG [17], [18], SIFT [27], Histogram of Edge Local Orientations (HELO) [34], [36], and Learned KeyShapes (LKS) [35]), but achieve unsatisfactory retrieval performance due to shallow and inefficient feature representation. In recent years, deep neural networks have attracted extensive attention among a wide range of applications, especially image retrieval [10], [44], [47], [23]. Various works on SBIR, such as Sketch-a-Net [50], [49] and Deep Sketch Hashing (DSH) [25], utilize deep-learned discriminative feature representations and significantly outperform those methods with hand-crafted features. For zero-shot learning [22], [46], traditional methods learn linear compatibility (e.g., Attribute Label Embedding (ALE) [1], Embarrassingly Simple ZSL (ESZSL) [33], Semantic AutoEncoder (SAE) [21]) and nonlinear compatibility (e.g., Latent Embeddings Model (LATEM) [45], Cross Modal Transfer (CMT) [41]) between image and semantic embedding, which models only the simple correspondence between them. Moreover, generative methods (e.g., Generative Moment Matching Network (GMMN) [2], Semantics-Preserving Adversarial Embedding Network (SP-AEN) [6]) with deeper architectures learn more complex dependencies between image and semantic embedding, facilitating a leap forward in zero-shot learning. These developments of SBIR and zero-shot learning have created the possibility of addressing ZS-SBIR. For example, Shen et al. [39] focus on addressing the problem with deep binary hashing learning in a zero-shot setting, while Yelamarthi et al. [48] pay attention to generating natural image features with corresponding sketch features as input. Despite achieving good performance, they treat sketches as ordinary natural images when extracting features and do not explore the unique characteristics of the sketch. That means existing ZS-SBIR methods still suffer from the intrinsic visual sparsity and large intra-class variance of sketches, and consequently have difficulty in transferring visual knowledge of sketches from seen classes to unseen classes.

In this paper, we propose a novel stacked semantic-guided network to tackle the unique characteristics of sketches for ZS-SBIR. Specifically, to alleviate the intrinsic visual sparsity of sketches, we devise a multi-layer feature fusion scheme that incorporates intermediate features from different levels via global average pooling, enriching feature representation capacity without introducing extra training parameters. Furthermore, to address the large intra-class variance of sketches, we elaborate a coarse-to-fine conditional decoder, which generates coarse-grained category-specific image features first, followed by fine-grained instance-specific image features. This progressive generation improves the ability to transfer from seen to unseen classes, because the auxiliary semantic information adopted in the progressive generation implicitly and stably captures category-level relationships better than sketch features do. We additionally enforce the generated features to maintain semantic and discriminative information by regularizing them with simple yet effective regression loss and classification loss, respectively.

The main contributions of this work are as follows:

  • By taking advantage of the different intermediate information contained in deep neural networks, our method can enhance the representation of sketches and significantly alleviate their intrinsic visual sparsity.

  • With the assistance of category-level semantic information, our method greatly boosts retrieval performance via a stacked semantic-guided generation, which effectively addresses the large intra-class variance of sketches.

  • By utilizing simple yet effective regression loss and classification loss for the features generated from a conditional decoder, our method effectively preserves the semantic and discriminative information of the generated features respectively.

  • Experimental results on two popular large-scale datasets show that our approach significantly outperforms state-of-the-art methods by more than 20% in terms of retrieval performance.

The rest of this paper is organized as follows. We present a brief review of relevant literature in Section 2. We illustrate our novel stacked semantic-guided network for ZS-SBIR in Section 3. Comprehensive experiments and ablation studies are demonstrated to verify the effectiveness of the proposed method in Sections 4 and 5, respectively. Finally, we present further visualizations in Section 6 and our conclusion in Section 7.

2 Related Work

In this section, we briefly review the relevant literature on zero-shot sketch-based image retrieval, sketch-based image retrieval and zero-shot learning.

Zero-Shot Sketch-Based Image Retrieval: To the best of our knowledge, there are only two prior works [39], [48] on ZS-SBIR. Shen et al. [39] proposed a Zero-shot Sketch-Image Hashing (ZSIH) model consisting of sketch binary encoders, image binary encoders, and a multi-modal learning network to mitigate heterogeneity between sketches and natural images. Yelamarthi et al. [48] attempted to generate natural image features of corresponding sketches using Conditional Variational Autoencoders (CVAE) and then conducted retrieval using generated natural image features. However, both of these works fail to consider the intrinsic visual sparsity and large intra-class variance in sketches.

Sketch-Based Image Retrieval: Related SBIR approaches can be roughly categorized into two types: methods based on hand-crafted features and those based on deep-learned features. In the first category, Hu et al. [18] proposed the gradient field HOG descriptor, which is an adaptive form of HOG descriptor suitable for sketches. Histogram of Edge Local Orientations (HELO) [34] utilized soft computation of local orientations and took spatial information into account for SBIR. Saavedra et al. [35] proposed Learned KeyShapes (LKS) based on mid-level pattern detection. In the second category, the convolutional neural network was first applied for sketch recognition in Sketch-a-Net [50]. Qi et al. [32] used a siamese network to pull features closer for corresponding sketch-image pairs and push them away if they belonged to different categories. Furthermore, Yu et al. [49] exploited triplet loss to take full advantage of the relationships between similar and dissimilar samples for fine-grained instance-level SBIR. Deep Sketch Hashing (DSH) [25] proposed a semi-heterogeneous deep architecture with the help of auxiliary sketch-tokens. However, existing methods are not specifically designed for ZS-SBIR, meaning that the visual knowledge transfer from seen to unseen classes under a zero-shot setting is neglected.

Figure 2: The architecture of our proposed model. It consists of five parts: multi-layer feature fusion networks, conditional feature encoder, coarse-to-fine conditional decoder, semantic-preserving network and discriminability-preserving network. At the training stage, the multi-layer fused features of the sketch, natural image and semantic information are concatenated first. Synthesized natural image features are then obtained through the conditional feature encoder and coarse-to-fine conditional decoder. The whole model is optimized with KL divergence loss, reconstruction loss, regression loss and classification loss. At the test stage, noise sampled from a standard Gaussian distribution is fed into the decoder to generate corresponding natural image features for the retrieval task.

Zero-Shot Learning: Due to the difficulty associated with collecting and annotating samples for training supervised methods, zero-shot learning has attracted ever-increasing attention among the research community. Existing zero-shot approaches can be divided into two categories: embedding-based and generative-based methods. In the first category, Attribute Label Embedding (ALE) [1] measured the bilinear compatibility between image and label embedding, while Embarrassingly Simple ZSL (ESZSL) [33] and Semantic AutoEncoder (SAE) [21] explicitly regularized the projection between the image embedding space and the class embedding space. Furthermore, Latent Embeddings Model (LATEM) [45] and Cross Modal Transfer (CMT) [41] utilized a non-linear component to exploit more complex correspondence between two embedding spaces. In terms of the generative-based methods, Bucher et al. [2] proposed a conditional Generative Moment Matching Network (GMMN) to generate features of unseen classes. Chen et al. [6] proposed Semantics-Preserving Adversarial Embedding Network (SP-AEN) to preserve semantic information during image feature synthesizing. Although our work adopts conditional variational autoencoders with semantic preservation similar to [48], we further elaborate a novel coarse-to-fine conditional decoder with multi-layer feature fusion and discriminability-preserving schemes to address ZS-SBIR.

3 Methodology

3.1 Problem Definition

In this paper, we focus on free-hand sketch-based image retrieval under a zero-shot setting, where sketches and natural images of seen classes are presented only during the training stage. After optimization of our proposed model, corresponding natural images are expected to be retrieved using sketches, the categories of which have never appeared at training stage.

We first describe a formal definition of the ZS-SBIR task. Given a dataset $\mathcal{D} = \{(x_{skt}, x_{img}, x_{sem}, c)\}$ comprising quadruplets of sketch, natural image, semantic information, and class label $c$, we split all categories $\mathcal{C}$ into $\mathcal{C}_{tr}$ and $\mathcal{C}_{te}$ according to whether the category appears in the 1000 classes of ImageNet [11]. Correspondingly, we obtain $\mathcal{D}_{tr}$ and $\mathcal{D}_{te}$; there is no intersection in terms of sketch, natural image and semantic information data between them, i.e., $\mathcal{D}_{tr} \cap \mathcal{D}_{te} = \emptyset$. The model needs to learn the generation from the sketch to the corresponding natural image on $\mathcal{D}_{tr}$. At the test stage, given an $x_{skt}$ taken from sketches of $\mathcal{D}_{te}$, the objective of ZS-SBIR is to retrieve corresponding natural images from the test image retrieval gallery.
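The category split described above can be sketched in a few lines. This is only an illustration: the function name `zero_shot_split` and the toy category lists are hypothetical, and matching categories to ImageNet by exact name is an assumption, since the text does not say how the correspondence is established.

```python
def zero_shot_split(categories, imagenet_classes):
    """Split categories into seen (present in ImageNet) and unseen (absent).

    Assumption: category membership is decided by exact name match.
    """
    imagenet = set(imagenet_classes)
    seen = sorted(c for c in categories if c in imagenet)
    unseen = sorted(c for c in categories if c not in imagenet)
    return seen, unseen


# Toy example with made-up category lists.
seen, unseen = zero_shot_split(
    ["cat", "door", "zebra", "windmill"],
    ["cat", "zebra"],
)
```

Keeping the unseen classes outside ImageNet means a feature extractor pre-trained on ImageNet has not been supervised on any test category.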

The architecture of the proposed model for ZS-SBIR is illustrated in Figure 2, which consists of five parts: multi-layer feature fusion networks, conditional feature encoder, coarse-to-fine conditional decoder, semantic-preserving network, and discriminability-preserving network. Note that conditional feature encoder and coarse-to-fine conditional decoder can be combined as conditional variational autoencoders. More specifically, multi-layer feature fusion networks aim to incorporate different intermediate features of deep neural networks to enhance feature representation. The conditional feature encoder approximates data distribution to an assumed distribution, such as a unit Gaussian distribution in this paper. The coarse-to-fine conditional decoder synthesizes corresponding natural image features in a progressive manner with the aid of auxiliary semantic information and sketch features. The semantic-preserving network enforces synthesized features to maintain semantic relationships among the training categories, while the discriminability-preserving network regularizes the synthesized features to be discriminative within each training category. The semantic-preserving and discriminability-preserving networks are complementary and both beneficial to the retrieval task. Detailed configurations of the architecture are listed in Table 1.

| Component | Type | Configuration |
|---|---|---|
| Encoder | FC1 | M + 1324 units → 4096 units, ReLU, BN |
| | Dropout | 0.3 drop rate |
| | FC2 | 4096 units → 2048 units, ReLU, BN |
| | FC3_1 | 2048 units → 1024 units, Tanh |
| | FC3_2 | 2048 units → 1024 units, Tanh |
| Decoder | FC1 | 1324 units → 4096 units, ReLU |
| | FC2 | 5120 units → M units, ReLU |
| Regressor | FC1 | M units → 2048 units, ReLU |
| | FC2 | 2048 units → 300 units, ReLU |
| Classifier | FC1 | M units → C units, SoftMax |

Table 1: The detailed configuration of our proposed model on the Sketchy dataset. Note that only the dimension of the image feature (M) and the number of training classes (C) differ between the two datasets. In this paper, M and C are 5120 and 104 on the Sketchy dataset and 4608 and 194 on the TU-Berlin dataset, respectively.

3.2 Multi-Layer Feature Fusion Networks

Sketches are highly abstract and possess intrinsic visual sparsity compared with natural images. To alleviate this problem, one previous work [42] learned more salient features by utilizing the attention mechanism and skip connection. Dense distance-field representation [7] is calculated to replace the sparse sketch. Inspired by the visualization in [51], where various layers capture semantic concepts at different levels, we devise multi-layer feature fusion networks for the natural image and sketch to enrich feature representation capacity without adding any additional parameters to be learned.

More specifically, we denote the feature map of an input $x$ in the $i$-th convolutional block of the image classification model pre-trained on ImageNet [11] (e.g., VGG16 in this paper) as $f_i(x)$. The final fused representation of multi-layer features can therefore be formulated as

$$F(x) = g(f_{i_1}(x)) \oplus \cdots \oplus g(f_{i_n}(x)) \oplus f_{fc}(x), \quad (1)$$

where $g(\cdot)$ means global spatial average pooling on the feature map, i.e., $g(f)_c = \frac{1}{HW} \sum_{h=1}^{H} \sum_{w=1}^{W} f_{c,h,w}$ for a feature map with $H \times W$ spatial positions, and $\oplus$ denotes concatenation. Here $f_{fc}(x)$ means the feature of the last fully connected layer. Several features after global average pooling and $f_{fc}(x)$ are concatenated to obtain $F(x)$. Thus, the final enhanced feature representation consists of high-level semantic information, middle-level part information, and low-level detail information. Meanwhile, global average pooling requires no extra parameters, which guarantees that the multi-layer feature fusion will be simple yet efficient.

To make the final enhanced feature representation more compact, we adopt the Principal Component Analysis (PCA) [16] algorithm to reduce the dimensions of the fused feature. This step not only makes computation more efficient, but also removes redundant zero elements in features. Both of these results are beneficial to the stable learning of a decoder with more robust representation as conditional input.
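A minimal NumPy sketch of the fusion and compression steps described above follows. It is illustrative only: the random arrays stand in for real VGG16 activations, the choice of two 512-channel convolutional maps plus a 4096-dimensional fully connected feature is an assumption (it merely matches the 5120 dimensions listed in Table 1), and `pca_reduce` is a hypothetical helper that performs PCA via SVD rather than any particular library routine.

```python
import numpy as np


def global_average_pool(feature_map):
    """Average a (channels, height, width) map over its spatial positions."""
    return feature_map.mean(axis=(1, 2))


def fuse_features(conv_maps, fc_feature):
    """Concatenate pooled convolutional features with the FC feature."""
    pooled = [global_average_pool(m) for m in conv_maps]
    return np.concatenate(pooled + [fc_feature])


def pca_reduce(features, n_components):
    """Project row-wise features onto their top principal components."""
    centered = features - features.mean(axis=0, keepdims=True)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T


rng = np.random.default_rng(0)
# Stand-in activations for 8 inputs (assumed shapes, not real network outputs).
batch = np.stack([
    fuse_features(
        [rng.random((512, 14, 14)), rng.random((512, 7, 7))],
        rng.random(4096),
    )
    for _ in range(8)
])
reduced = pca_reduce(batch, n_components=4)
```

Because pooling and concatenation have no weights, the only fitted quantity in this pipeline is the PCA projection.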

3.3 Conditional Feature Encoder

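Generative approaches usually have significant advantages relative to discriminative ones under a zero-shot setting, as they can transform a zero-shot problem into a supervised classification [26] or nearest-neighbor search problem by synthesizing corresponding features from conditional information and random noise. We therefore implement Conditional Variational Autoencoders (CVAE) for ZS-SBIR, as they converge rapidly and can be trained stably. Specifically, the variational distribution $q_\phi(z \mid x_{img}, x_{skt}, x_{sem})$ is approximated via an encoder network parameterized by $\phi$. The conditional distribution $p_\theta(x_{img} \mid z, x_{skt}, x_{sem})$ is modeled by a decoder network parameterized by $\theta$. We can formulate the variational lower bound as

$$\mathcal{L}(\phi, \theta; x_{img}, x_{skt}, x_{sem}) = -\mathrm{KL}\big(q_\phi(z \mid x_{img}, x_{skt}, x_{sem}) \,\|\, p(z)\big) + \mathbb{E}_{q_\phi(z \mid x_{img}, x_{skt}, x_{sem})}\big[\log p_\theta(x_{img} \mid z, x_{skt}, x_{sem})\big], \quad (2)$$

where $x_{img}$, $x_{skt}$ and $x_{sem}$ stand for natural image, sketch and semantic information respectively, $z$ is the latent variable, and $p(z)$ is the assumed prior distribution.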
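For a Gaussian encoder and a unit Gaussian prior, the KL term of a variational autoencoder has a well-known closed form, and sampling is usually done with the reparameterization trick. The sketch below shows both in NumPy. It assumes that the two encoder heads (FC3_1 and FC3_2 in Table 1) output a mean and a log-variance; the text does not state this, so treat it as an assumption.

```python
import numpy as np


def reparameterize(mu, log_var, rng):
    """Sample z = mu + sigma * eps with eps drawn from N(0, I)."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps


def kl_to_unit_gaussian(mu, log_var):
    """KL(N(mu, diag(exp(log_var))) || N(0, I)), averaged over the batch."""
    per_sample = -0.5 * np.sum(1.0 + log_var - mu ** 2 - np.exp(log_var), axis=1)
    return float(per_sample.mean())


rng = np.random.default_rng(0)
mu = np.zeros((2, 1024))       # assumed output of one encoder head
log_var = np.zeros((2, 1024))  # assumed output of the other encoder head
z = reparameterize(mu, log_var, rng)
kl_zero = kl_to_unit_gaussian(mu, log_var)
kl_shifted = kl_to_unit_gaussian(np.ones((2, 1024)), log_var)
```

When the encoder output already equals the prior, the KL term vanishes; shifting every mean by one adds 0.5 per latent dimension.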

3.4 Coarse-to-Fine Conditional Decoder

In this paper, we focus on corresponding natural image generation with sketches as conditional input. As we all know, it is crucial to transfer visual knowledge from seen classes to unseen classes under a zero-shot setting. However, the large intra-class variance of sketches makes the learning of the conditional decoder more difficult and hence results in unsatisfactory performance. To tackle this issue, we elaborate a coarse-to-fine conditional decoder to generate coarse-grained category-specific natural image features first and then generate fine-grained instance-specific features. The progressive generation makes the learning of the conditional decoder efficient, with a consequent significant improvement on retrieval performance.

For coarse-grained corresponding natural image feature generation, we adopt category-level semantic information such as word vectors [28], which implicitly models the category relationships between data points from different classes. The word vector for each category is unique (i.e., there is zero intra-class variance) and hence encourages the conditional decoder to be learned faster and more stably. Meanwhile, the implicit semantic similarity relationships between categories are beneficial to the transfer of visual knowledge from seen classes to unseen classes. Let $(x_{skt}^{(i)}, x_{img}^{(i)}, x_{sem}^{(i)})$ be the $i$-th sample triplet in the training set $\mathcal{D}_{tr}$; we can thus formulate coarse-grained generation as

$$\hat{x}_{c}^{(i)} = G_c\big(z \oplus x_{sem}^{(i)}\big), \quad (3)$$

where $G_c$ is the coarse-grained decoder and $\oplus$ is the feature concatenation operation. Here $z$, $x_{sem}^{(i)}$ and $\hat{x}_{c}^{(i)}$ are the random Gaussian noise, word vector and synthesized coarse-grained feature respectively.

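For fine-grained corresponding natural image feature generation, we adopt sketch features as conditional input, since they contain more detailed information than word vectors for the target synthesized image features. The generation in this step further increases the diversity of the generated image features, leading to more precise retrieval results. We formulate fine-grained generation as

$$\hat{x}_{f}^{(i)} = G_f\big(\hat{x}_{c}^{(i)} \oplus x_{skt}^{(i)}\big), \quad (4)$$

where $G_f$ and $\hat{x}_{f}^{(i)}$ are the fine-grained decoder and synthesized fine-grained image features, $\hat{x}_{c}^{(i)}$ is the synthesized coarse-grained feature, $x_{skt}^{(i)}$ is the sketch feature, and $\oplus$ is the feature concatenation operation.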

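To simplify description, we define coarse-to-fine conditional generation entirely as

$$\hat{x}_{img}^{(i)} = G_\theta\big(z, x_{sem}^{(i)}, x_{skt}^{(i)}\big), \quad (5)$$

where $G_\theta$ is the proposed coarse-to-fine conditional decoder parameterized by $\theta$, and $\hat{x}_{img}^{(i)}$ is the synthesized image feature.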
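The two-stage generation can be sketched as a pair of fully connected layers, the first conditioned on the word vector and the second on the sketch feature. The NumPy sketch below is a simplified illustration, not the authors' implementation: biases are omitted, weights are random, and the demo uses small made-up dimensions. Table 1 lists 1324 → 4096 and 5120 → M for the two decoder layers.

```python
import numpy as np


def relu(x):
    return np.maximum(x, 0.0)


class CoarseToFineDecoder:
    """Two stacked layers: noise + word vector, then coarse feature + sketch."""

    def __init__(self, noise_dim, sem_dim, sketch_dim, hidden_dim, image_dim, rng):
        # Random weights for illustration only; a real model would learn them.
        self.w_coarse = 0.01 * rng.standard_normal((noise_dim + sem_dim, hidden_dim))
        self.w_fine = 0.01 * rng.standard_normal((hidden_dim + sketch_dim, image_dim))

    def __call__(self, noise, word_vec, sketch_feat):
        # Stage 1: category-specific feature from noise and the word vector.
        coarse = relu(np.concatenate([noise, word_vec], axis=1) @ self.w_coarse)
        # Stage 2: instance-specific feature from the coarse feature and sketch.
        fine = relu(np.concatenate([coarse, sketch_feat], axis=1) @ self.w_fine)
        return coarse, fine


rng = np.random.default_rng(0)
decoder = CoarseToFineDecoder(8, 4, 8, 16, 20, rng)  # toy sizes, not Table 1
noise = rng.standard_normal((2, 8))
word_vec = rng.standard_normal((2, 4))
sketch_feat = rng.standard_normal((2, 8))
coarse, fine = decoder(noise, word_vec, sketch_feat)
```

Feeding the word vector first means the high-variance sketch feature only refines a representation that is already anchored to the category.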

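Algorithm 1 Optimization Algorithm of Proposed Method
Input: training dataset, maximum number of training iterations, batch size, and four hyperparameters set to 1.0, 1.0, 1.0 and 0.001
1:  Initialize parameters
2:  repeat
3:     Sample a mini-batch of data
4:     Forward the model to generate the synthesized outputs
5:     Calculate the KL divergence, reconstruction, regression and classification losses with Eq. (2), Eq. (8), Eq. (6), Eq. (7) respectively
6:     Calculate the full objective with Eq. (9)
7-10:  Update each of the four groups of network parameters in turn
11:  until max training iteration is reached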

3.5 Semantic-Preserving Network

In traditional CVAE, the quality of synthesized corresponding image features is measured only by Mean Squared Error (MSE) loss, which takes only the deviation at each corresponding position into consideration. To enforce the synthesized features to maintain semantic relationships among the training categories, we apply a semantic-preserving network, which acts as a regressor, to constrain the synthesized feature representations by projecting them back to the semantic features. The regression loss can be expressed as

$$\mathcal{L}_{reg} = \big\| R_\psi(\hat{x}_{img}) - x_{sem} \big\|_2^2, \quad (6)$$

where $R_\psi$ is the regressor parameterized by $\psi$, $\hat{x}_{img}$ is the synthesized image feature and $x_{sem}$ is the semantic feature.
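As a rough illustration of the regression loss, the sketch below compares the regressor's output with the target word vectors using a squared Euclidean distance averaged over the batch. The averaging over the batch and the function name `regression_loss` are assumptions; the regressor network itself is left out and its output is passed in directly.

```python
import numpy as np


def regression_loss(predicted_semantics, word_vectors):
    """Mean over the batch of the squared L2 distance to the word vectors."""
    diff = predicted_semantics - word_vectors
    return float(np.mean(np.sum(diff ** 2, axis=1)))


# Toy 3-dimensional "word vectors" (real ones are 300-dimensional).
targets = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
perfect = regression_loss(targets, targets)
shifted = regression_loss(targets + 1.0, targets)
```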

3.6 Discriminability-Preserving Network

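To preserve the discriminability of the synthesized features within each training category, we incorporate classification loss to regularize the conditional generation of natural image features. We first pre-train the classification model on the training dataset $\mathcal{D}_{tr}$. Then, classification loss for the synthesized features can be expressed as

$$\mathcal{L}_{cls} = -\frac{1}{N} \sum_{i=1}^{N} \log \sigma\big(C_\omega(\hat{x}_{img}^{(i)})\big)_{c^{(i)}}, \quad (7)$$

where $N$ is the number of synthesized samples, $\omega$ is the parameters of the pre-trained model $C_\omega$, $\sigma$ is the softmax operation, and $c^{(i)}$ is the class label of the $i$-th sample.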
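A classification loss of this kind is commonly implemented as softmax cross-entropy over the classifier's outputs. The NumPy sketch below assumes that form; the classifier is left out and its raw scores (`logits`, a hypothetical name) are passed in directly.

```python
import numpy as np


def softmax(logits):
    """Row-wise softmax with max subtraction for numerical stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)


def classification_loss(logits, labels):
    """Average negative log-probability assigned to the true class."""
    probs = softmax(logits)
    picked = probs[np.arange(len(labels)), labels]
    return float(-np.mean(np.log(picked)))


# Uniform scores over 4 classes give a loss of log(4).
uniform_loss = classification_loss(np.zeros((2, 4)), np.array([0, 3]))
# Confident, correct scores give a loss close to zero.
confident_loss = classification_loss(
    np.array([[10.0, 0.0], [0.0, 10.0]]), np.array([0, 1])
)
```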

3.7 Training and Inference

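In addition to KL divergence loss $\mathcal{L}_{KL}$, regression loss $\mathcal{L}_{reg}$ and classification loss $\mathcal{L}_{cls}$, there exists another loss, i.e., reconstruction loss $\mathcal{L}_{rec}$. The reconstruction loss can be formulated as

$$\mathcal{L}_{rec} = \big\| x_{img} - \hat{x}_{img} \big\|_2^2, \quad (8)$$

and is utilized to measure the quality of the synthesized feature $\hat{x}_{img}$ via reconstruction of the real image feature $x_{img}$.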

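The full objective function consists of four parts and can be formulated as

$$\mathcal{L} = \lambda_{KL} \mathcal{L}_{KL} + \lambda_{rec} \mathcal{L}_{rec} + \lambda_{reg} \mathcal{L}_{reg} + \lambda_{cls} \mathcal{L}_{cls}. \quad (9)$$

The first and second terms can be combined as the traditional CVAE loss and the last two terms are utilized to regularize the generation procedure. $\lambda_{KL}$, $\lambda_{rec}$, $\lambda_{reg}$ and $\lambda_{cls}$ are parameters for balancing the overall performance. We optimize the proposed model by using the standard Adam optimizer in the PyTorch package [30]; the detailed optimization is illustrated in Algorithm 1.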

At the test stage, the conditional decoder is used to synthesize a number of corresponding natural image features conditioned on semantic information and test sketch features. The synthesized features are then averaged to form the final synthesized natural image feature. We then retrieve the top $K$ (e.g., $K$ = 200) images from the image retrieval gallery by ranking them according to their distance to this averaged feature. Finally, the retrieval performance is calculated over these retrieved images.
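The test-time procedure can be sketched as averaging the synthesized features and ranking the gallery by distance to that average. The example below uses Euclidean distance, which is an assumption because the distance type is not recoverable from the text; `retrieve_top_k` and the toy arrays are hypothetical.

```python
import numpy as np


def retrieve_top_k(synthesized, gallery, k):
    """Return indices of the k gallery features closest to the averaged query."""
    query = synthesized.mean(axis=0)
    # Assumption: Euclidean distance; another metric could be swapped in here.
    distances = np.linalg.norm(gallery - query, axis=1)
    return np.argsort(distances, kind="stable")[:k]


# Two synthesized features for one sketch, and a gallery of three images.
synthesized = np.array([[0.9, 1.1], [1.1, 0.9]])
gallery = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]])
top_two = retrieve_top_k(synthesized, gallery, k=2)
```

Averaging several samples reduces the effect of the random noise fed to the decoder on the final query feature.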

4 Experiments

4.1 Dataset

For ZS-SBIR, there are two widely used large-scale datasets, i.e., Sketchy [37] and TU-Berlin [12]. Details of the dataset statistics in terms of numbers of images, numbers of sketches and zero-shot splits are summarized in Table 2. To evaluate the effectiveness of each method, we follow the sketch-based image retrieval evaluation criteria in [48], [25], where sketch queries and image retrieval candidates belonging to the same category are regarded as relevant. Given a query sketch and a list of $k$ ranked retrieval results, the AP for this query is defined as

$$\mathrm{AP}@k = \frac{1}{k} \sum_{i=1}^{k} \delta(i), \quad (10)$$

where $\delta(i) = 1$ when the $i$-th retrieved candidate corresponds to the query, and $\delta(i) = 0$ otherwise. mAP for this query takes ranking information into consideration and can be formulated as

$$\mathrm{mAP}@k = \frac{1}{\sum_{i=1}^{k} \delta(i)} \sum_{i=1}^{k} \delta(i) \cdot \mathrm{AP}@i. \quad (11)$$

| Dataset Statistics | Sketchy [37] | TU-Berlin [12] |
|---|---|---|
| Train classes | 104 | 194 |
| Test classes | 21 | 56 |
| Train sketches | 62,785 | 15,520 |
| Test sketches | 12,694 | 4,480 |
| Train images | 10,400 | 138,839 |
| Images to be retrieved | 10,453 | 65,231 |

Table 2: Statistics for Sketchy [37] and TU-Berlin [12]. The first row shows the number of seen classes, the second row presents the number of unseen classes, the third and fourth rows display the number of sketch samples available for training and test respectively, the fifth row shows the number of images during the training stage, and the sixth row presents the number of images to be retrieved.

Sketchy [37] is a large-scale sketch dataset that originally comprised 75,479 sketches and 12,500 images from 125 categories. Liu et al. [25] extended the image retrieval gallery by collecting an extra 60,502 images from ImageNet [11], so that the total number of images in extended Sketchy is 73,002. Following the standard zero-shot setting in [48], we partition the total 125 categories into 104 training categories as seen classes and 21 test categories as unseen classes according to whether the category appears in the 1,000 classes of ImageNet. This partition avoids violating the zero-shot assumption when utilizing models that are pre-trained on ImageNet.

TU-Berlin [12] originally consisted of 20,000 unique free-hand sketches evenly distributed over 250 object categories for sketch recognition. To perform sketch-based image retrieval, we adopt the extended version of TU-Berlin, which contains a total of 204,070 natural images. Since no standard split exists for ZS-SBIR, we follow the criteria in [48] and first manually split the total 250 categories into 165 seen classes for training and 85 unseen classes. To ensure that each unseen class contains at least 400 images for retrieval evaluation, we move some categories from the unseen classes to the seen classes, finally yielding 194 seen classes for training and 56 unseen classes for test. Compared with the Sketchy dataset, TU-Berlin is fairly challenging as it contains a large number of unseen classes and a large proportion of fine-grained categories.

| Type | Evaluation Methods | Sketchy Precision@200 | Sketchy mAP@200 | TU-Berlin Precision@200 | TU-Berlin mAP@200 |
|---|---|---|---|---|---|
| SBIR methods | Cosine Similarity | 0.094 | 0.045 | 0.050 | 0.031 |
| | Siamese-1 [15] | 0.293 | 0.189 | 0.127 | 0.061 |
| | Siamese-2 [32] | 0.305 | 0.200 | 0.133 | 0.067 |
| | Coarse-Grained Triplet [38] | 0.278 | 0.176 | 0.128 | 0.057 |
| | Fine-Grained Triplet [38] | 0.284 | 0.183 | 0.086 | 0.050 |
| Zero-Shot methods | Direct Regression | 0.298 | 0.197 | 0.117 | 0.062 |
| | ESZSL [33] | 0.305 | 0.202 | 0.131 | 0.072 |
| | SAE [21] | 0.314 | 0.204 | 0.152 | 0.084 |
| | CVAE [48] | 0.333 | 0.225 | 0.165 | 0.104 |
| Proposed | MLFF | 0.398 | 0.290 | 0.200 | 0.132 |
| | MLFF+C2F | 0.585 | 0.473 | 0.416 | 0.314 |
| | MLFF+C2F+DP | 0.588 | 0.479 | 0.422 | 0.320 |

Table 3: The ZS-SBIR performance compared with existing SBIR and zero-shot approaches. In this table, multi-layer feature fusion, coarse-to-fine generation with multi-layer feature fusion, and coarse-to-fine generation with multi-layer feature fusion and discriminability preserving are denoted as MLFF, MLFF+C2F and MLFF+C2F+DP respectively. Note that we reimplement all state-of-the-art methods for fair comparison. MLFF already contains semantic preserving so that it can be compared fairly with CVAE.

4.2 Implementation Details

All experiments in this paper are implemented with the popular deep learning toolbox PyTorch [30]. Single-channel sketch images are transformed into three-channel images by copying them three times. For feature extraction, we adopt the VGG16 [40] model pre-trained on the ImageNet dataset as feature extractor for both sketches and images. To conduct multi-layer feature fusion experiments, we extract features from several layers of the network to represent different intermediate information. As category-level semantic representation, the class embedding of each sample is obtained by extracting a 300-dimensional feature from a pre-trained word vector model [28] with a given category name. When the category name is not included in the word vector dictionary, we split the category name into words and average their word vectors. For the conditional feature encoder network, there are two fully connected layers with ReLU activation and Batch Normalization, and one Dropout layer between them. For the coarse-to-fine conditional decoder network, there are two fully connected layers where the first layer is conditioned on class embedding and the second is conditioned on sketch features. The regressor consists of two fully connected layers designed to project the generated feature back into conditional input. The classifier contains only one fully connected layer to regularize the synthesized feature.
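The fallback for category names missing from the word vector dictionary can be sketched as below. The dictionary is a toy stand-in for a pre-trained 300-dimensional model, and splitting on spaces and hyphens is an assumption about how category names are tokenized.

```python
import numpy as np


def class_embedding(category_name, word_vectors):
    """Look up a category vector, averaging per-word vectors when needed."""
    if category_name in word_vectors:
        return word_vectors[category_name]
    # Assumption: multi-word names are separated by spaces or hyphens.
    words = category_name.replace("-", " ").split()
    parts = [word_vectors[w] for w in words if w in word_vectors]
    if not parts:
        raise KeyError(f"no word vector available for {category_name!r}")
    return np.mean(parts, axis=0)


# Toy 2-dimensional dictionary (real vectors are 300-dimensional).
toy_vectors = {
    "cat": np.array([1.0, 1.0]),
    "hot": np.array([1.0, 0.0]),
    "dog": np.array([0.0, 1.0]),
}
single = class_embedding("cat", toy_vectors)
averaged = class_embedding("hot-dog", toy_vectors)
```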

The noise is drawn from a unit Gaussian with size 1024. We use an Adam optimizer with the same learning rate and hyperparameter settings for optimization across all datasets. The batch size, maximum number of training epochs and feature dimensions after PCA are 128, 25 and 1024 respectively. Note that paired sketches and images do exist on the Sketchy dataset; however, as there is no explicit paired data on the TU-Berlin dataset, we construct batches of paired sketches and images by random matching within the same category.
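Random matching within a category can be sketched with the standard library alone. The function below pairs each sketch with a randomly chosen image of the same category; the `(identifier, category)` tuple layout and the name `pair_within_category` are hypothetical.

```python
import random
from collections import defaultdict


def pair_within_category(sketches, images, seed=0):
    """Pair every sketch with a random image that shares its category.

    Both inputs are lists of (identifier, category) tuples.
    """
    rng = random.Random(seed)
    images_by_category = defaultdict(list)
    for image_id, category in images:
        images_by_category[category].append(image_id)

    pairs = []
    for sketch_id, category in sketches:
        candidates = images_by_category.get(category)
        if candidates:  # skip sketches whose category has no image
            pairs.append((sketch_id, rng.choice(candidates), category))
    return pairs


sketches = [("s1", "cat"), ("s2", "cat"), ("s3", "tree")]
images = [("i1", "cat"), ("i2", "tree"), ("i3", "cat")]
pairs = pair_within_category(sketches, images)
image_category = dict(images)
```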

4.3 Comparison with Existing Methods

We conduct comprehensive experiments on two popular large-scale datasets: Sketchy and TU-Berlin. Since there are very few prior works [39], [48] on ZS-SBIR, we also compare our experimental results with existing relevant SBIR and zero-shot approaches. The evaluation metrics are average precision and mean average precision (mAP). Following the observation in [48] that fine-tuning a network for sketch recognition yields only subtle improvements, we use the original pre-trained VGG16 model as our feature extractor for all experiments.
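The two ranking metrics can be illustrated with a short, dependency-free sketch. It takes a binary relevance list for one query, ordered by rank. Normalizing average precision by the number of relevant items found in the top k is one common convention and is an assumption here; the function names are hypothetical.

```python
def precision_at_k(relevance, k):
    """Fraction of the top-k retrieved items that are relevant."""
    return sum(relevance[:k]) / k


def average_precision_at_k(relevance, k):
    """Mean of precision values taken at each relevant rank within the top k."""
    hits = 0
    total = 0.0
    for rank, is_relevant in enumerate(relevance[:k], start=1):
        if is_relevant:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


# 1 marks a retrieved image of the same category as the query sketch.
ranked_relevance = [1, 0, 1, 0]
p_at_4 = precision_at_k(ranked_relevance, 4)
ap_at_4 = average_precision_at_k(ranked_relevance, 4)
```

Precision ignores where the relevant items sit inside the top k, whereas average precision rewards rankings that place them early.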

Since we are the first to introduce a standard zero-shot split on the TU-Berlin dataset and we extract features using a different package (i.e., PyTorch), all methods including the state-of-the-art ones are reimplemented in this paper. Comparisons with the SBIR and zero-shot approaches are presented in Table 3. Some observations can be drawn from these comparisons: (1) Simple calculation of cosine similarity leads to poor performance, while direct regression achieves considerably better results; (2) The SBIR methods based on a siamese network [15], [32] or triplet network [38] obtain much better performance than a simple baseline, e.g., cosine similarity. We believe that the learning of semantic relationships improves the ability to transfer from seen classes to unseen classes. SAE [21] produces better performance due to its extra regularization for projection; (3) The best results of existing methods are obtained with CVAE [48], which learns instance-level correspondence by utilizing conditional generation with semantic-preserving constraints; (4) CVAE with multi-layer feature fusion (MLFF) outperforms the former by a large margin. This reflects that fused features after PCA can effectively reduce the intra-class variance of sketches; (5) Our proposed coarse-to-fine conditional decoder plays a key role in ZS-SBIR, and we consequently observe a leap forward in retrieval performance. The auxiliary semantic information (i.e., zero intra-class variance) is more suitable for zero-shot learning than sketch features (i.e., large intra-class variance). Further generation using sketch features as conditional input leads to further performance improvement, as the sketch features contain more detailed information than the high-level semantic embedding; (6) The results of adding the discriminability-preserving network suggest that it is a simple yet effective regularization for our decoder. In conclusion, our proposed model significantly outperforms existing state-of-the-art methods. A visualization can be found in Figure 3.

(a) Retrieval results on Sketchy
(b) Retrieval results on TU-Berlin
Figure 3: Top 16 retrieved images for several sketches on the Sketchy and TU-Berlin datasets using our proposed method and the existing state-of-the-art CVAE method under a zero-shot setting. None of these sketches appeared at the training stage. Retrieved non-relevant results, which do not belong to the category of the corresponding sketch, are marked with a red border. Note that the retrieved false positives share a similar shape with the input sketch.

5 Ablation Study

Some ablation studies are presented in this section to verify the effectiveness of our proposed model. Specifically, we conduct experiments regarding feature type selection of image, multi-layer feature fusion, coarse-to-fine conditional generation and discriminability preserving.

5.1 Feature Type Selection of Image

                 Sketchy                      TU-Berlin
Feature Type     Precision@200   mAP@200      Precision@200   mAP@200
                 0.342           0.238        0.166           0.105
                 0.341           0.239        0.168           0.107
                 0.367           0.263        0.181           0.117
                 0.373           0.268        0.183           0.120
+ PCA (2048)     0.348           0.245        0.179           0.116
+ PCA (1024)     0.355           0.251        0.182           0.117
+ PCA (512)      0.353           0.249        0.181           0.116
+ PCA            0.355           0.251        0.182           0.117
+ PCA            0.362           0.256        0.184           0.118
+ PCA            0.388           0.281        0.197           0.129
+ PCA            0.398           0.290        0.200           0.132
Table 4: Comparing retrieval results on various forms of fused sketch features when features of the natural image are fixed according to previous selection in subsection 5.1. The effects of PCA are also taken into account.

Considering that natural image features are selected for generation, we explore the effect of various image feature types on retrieval performance. To ensure that our comparison of experimental results is fair, we fix the sketch feature type as and change the natural image feature type from to . The results are shown in Figure 4. We observe that the best performance is achieved neither with an image feature type of nor with an image feature type . However, subtle differences exist between different datasets. The best performance is obtained with image feature type of on the Sketchy dataset, while it is obtained with image feature type on the TU-Berlin dataset. We believe that a trade-off exists between difficulty of generation and feature representation capacity for retrieval. More specifically, high-dimensional image features are harder to generate but have more powerful feature representation capacity.

Figure 4: Retrieval performance on both datasets. Given a fixed type of sketch feature (i.e., ), we conduct experiments with various fused features of the natural image. Here is simplified as and similarly for other types.
Figure 5: Comparison of our proposed approach with the methods conditioned on word vector and sketch features.
Figure 6: Comparison of our proposed coarse-to-fine embedding with shared sketch embedding and shared word vector embedding.

5.2 Multi-layer Feature Fusion

Although most sketch-based image retrieval methods only adopt features for representation, as a matter of fact, the information contained in other layers has great potential to improve feature discrimination capacity so as to boost performance. To incorporate more information into sketch and image representation, we devise a novel fusion scheme to combine multiple semantics from local to global descriptions without introducing extra training parameters. Specifically, we extract multiple features from various convolutional layers (e.g., , and of VGG16) and average the extracted feature map along the spatial dimension. Since the feature dimensions are large after fusion, we use PCA to reduce dimensions while still preserving a certain amount of information. We then fix the feature type of the image (e.g., for the Sketchy dataset and for the TU-Berlin dataset) and conduct experiments with different types of sketch features. Relevant results are presented in Table 4. We observe a similar performance trend on both datasets: the best result is achieved when using concatenated features of layer as sketch representation for conditional generation. In addition, the experimental performance of using the compressed sketch feature (i.e., after PCA reduction) is better than that when the original fused feature is used. The feature dimension after PCA is selected as 1024 according to experimental results in Table 4. The proposed multi-layer fusion significantly enhances feature representation and effectively alleviates the intrinsic visual sparsity of sketches.
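The fusion scheme described above (spatial averaging of several convolutional feature maps, concatenation, then PCA) can be sketched with NumPy as follows. This is only an illustration under stated assumptions: the random arrays stand in for real VGG16 activations, the channel counts, spatial sizes and the 16-dimensional PCA target are hypothetical toy values, and the helper names `fuse_layers`, `fit_pca` and `apply_pca` are invented for this example.

```python
import numpy as np

def fuse_layers(feature_maps):
    """Average each (C, H, W) feature map over H and W, then concatenate."""
    pooled = [np.asarray(fm).mean(axis=(1, 2)) for fm in feature_maps]
    return np.concatenate(pooled)

def fit_pca(features, n_components):
    """Return the mean and the top principal directions of an (N, D) matrix."""
    mean = features.mean(axis=0)
    # Rows of vt are principal directions, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(features - mean, full_matrices=False)
    return mean, vt[:n_components]

def apply_pca(features, mean, components):
    """Project centered features onto the retained principal directions."""
    return (features - mean) @ components.T

rng = np.random.default_rng(0)
# Hypothetical stand-ins for the activations of three convolutional layers.
channels = (256, 512, 512)
sizes = (8, 4, 2)
fused = np.stack([
    fuse_layers([rng.standard_normal((c, s, s)) for c, s in zip(channels, sizes)])
    for _ in range(32)
])
mean, components = fit_pca(fused, n_components=16)
reduced = apply_pca(fused, mean, components)
print(fused.shape, reduced.shape)  # (32, 1280) (32, 16)
```

Because pooling and concatenation have no learnable weights, the only fitted quantities are the PCA mean and projection, which is in line with the statement that the fusion adds no extra training parameters.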

5.3 Coarse-to-Fine Conditional Generation

Word vectors have facilitated a significant leap forward in improving our ability to analyse relationships across words, sentences and articles. Compared with the sketch features, which have large intra-class variation, word vectors are more appropriate for conditional generation in generative SBIR approaches. In our experiments, we first introduce the auxiliary semantic information (e.g., category-level class embedding) as conditional input to replace sketch features. Experimental results summarized in Figure 5 confirm the benefit of replacing sketch features with word vectors. Moreover, we also conduct experiments by changing the image feature type from to . Surprisingly, we obtain the same conclusion as in subsection 5.1, namely that the best performance is acquired with image feature type of on the Sketchy dataset and image feature type of on the TU-Berlin dataset.

Figure 7: Comparing our proposed model without and with discriminability-preserving network.
Figure 8: Effect of hyper-parameter in discriminability-preserving network.
(a) CVAE
(b) Ours
(c) CVAE
(d) Ours
Figure 9: Comparison of confusion matrices on the Sketchy and TU-Berlin datasets between the existing state-of-the-art CVAE method and ours. The results are illustrated in row (1) for Sketchy and row (2) for TU-Berlin respectively.

To verify the rationality of our proposed coarse-to-fine generation model, we additionally compare our method with shared embedding [31] during generation. More specifically, we devise two shared embedding methods: shared sketch feature embedding (i.e., sketch features are concatenated to different layers of the decoder as the condition) and shared word vector embedding (i.e., word vectors are concatenated to different layers of the decoder as the condition). The results are summarized in Figure 6 and indicate that the methods with shared embedding marginally outperform the original embedding (i.e., not shared) in subsection 5.2. Meanwhile, our proposed coarse-to-fine generation (i.e., hybrid embedding of sketch features and word vectors) beats both shared sketch feature embedding and shared word vector embedding by a large margin, which strongly supports the rationality of our method. These results show that our proposed coarse-to-fine generation greatly boosts retrieval performance.
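To make the hybrid conditioning concrete, the following NumPy sketch runs a forward pass through a two-stage decoder: the first stage is conditioned on the word vector (coarse, category-level), and the second stage is conditioned on the sketch feature (fine, instance-level). It is a toy illustration only. The class name `CoarseToFineDecoder`, the single linear layer per stage, the absence of biases, the untrained random weights and all dimensions are assumptions for this example and do not reproduce the architecture used in the paper.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

class CoarseToFineDecoder:
    """Hypothetical two-stage conditional decoder (forward pass only)."""

    def __init__(self, latent_dim, word_dim, sketch_dim, hidden_dim, image_dim, seed=0):
        rng = np.random.default_rng(seed)
        # Stage 1 weights: latent code + word vector -> coarse feature.
        self.w_coarse = rng.standard_normal((latent_dim + word_dim, hidden_dim)) * 0.01
        # Stage 2 weights: coarse feature + sketch feature -> fine image feature.
        self.w_fine = rng.standard_normal((hidden_dim + sketch_dim, image_dim)) * 0.01

    def forward(self, z, word_vec, sketch_feat):
        coarse = relu(np.concatenate([z, word_vec], axis=1) @ self.w_coarse)
        fine = np.concatenate([coarse, sketch_feat], axis=1) @ self.w_fine
        return coarse, fine

# Two different sketches of the same category, sharing one latent sample.
decoder = CoarseToFineDecoder(latent_dim=8, word_dim=300, sketch_dim=1024,
                              hidden_dim=64, image_dim=512)
rng = np.random.default_rng(1)
z = np.repeat(rng.standard_normal((1, 8)), 2, axis=0)
word = np.repeat(rng.standard_normal((1, 300)), 2, axis=0)
sketches = rng.standard_normal((2, 1024))
coarse, fine = decoder.forward(z, word, sketches)
print(coarse.shape, fine.shape)  # (2, 64) (2, 512)
```

In this toy setup the two coarse outputs are identical because they depend only on the latent code and the class embedding, while the fine outputs differ because each is further conditioned on its own sketch feature.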

(a) Test image feature
(b) CVAE
(c) Ours
(d) Test image feature
(e) CVAE
(f) Ours
Figure 10: t-SNE visualization of test image features, generated features from the existing state-of-the-art CVAE method, and ours. The visualization results are shown in row (1) for Sketchy and row (2) for TU-Berlin respectively.

5.4 Discriminability Preserving

It is well known that reconstruction is not appropriate for measuring the quality of generated features. We assume that this problem could be alleviated by regularizing the generator to produce features that can be correctly classified by a discriminative classifier pre-trained on real image features. Details are presented in Figure 7. To verify the effectiveness of discriminability regularization, we determine the feature types of sketches and images according to the above experimental analysis. Specifically, we set the feature type of sketches as for both datasets and the feature type of images as and on the Sketchy dataset and TU-Berlin dataset respectively. The hyper-parameter of the discriminability-preserving loss is tuned via a line search from to . We show the performance curve for different loss parameters in Figure 8 and set the parameter as on both datasets. Our proposed discriminability-preserving scheme results in a further improvement in retrieval performance.
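One plausible form of such a regularizer is the cross-entropy of a frozen, pre-trained classifier evaluated on the generated features, added to the remaining generator losses with a weight. The sketch below assumes a linear softmax classifier, an additive combination, and the names `discriminability_loss`, `total_loss` and `lam`; none of these details are confirmed by the text above, and `base_loss` merely stands in for the other loss terms.

```python
import numpy as np

def softmax_cross_entropy(logits, labels):
    """Mean cross-entropy between softmax(logits) and integer class labels."""
    shifted = logits - logits.max(axis=1, keepdims=True)  # for numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return float(-log_probs[np.arange(len(labels)), labels].mean())

def discriminability_loss(generated, labels, clf_weight, clf_bias):
    """Cross-entropy of a frozen linear classifier applied to generated features."""
    return softmax_cross_entropy(generated @ clf_weight + clf_bias, labels)

def total_loss(base_loss, generated, labels, clf_weight, clf_bias, lam):
    """Assumed additive combination: other generator losses + lam * classification loss."""
    return base_loss + lam * discriminability_loss(generated, labels, clf_weight, clf_bias)

# Toy check: features aligned with their class direction incur a smaller penalty.
weight = np.eye(3) * 5.0
bias = np.zeros(3)
labels = np.array([0, 1, 2])
good = np.eye(3)               # each feature points at its own class
bad = np.roll(np.eye(3), 1, 1)  # each feature points at a wrong class
print(discriminability_loss(good, labels, weight, bias)
      < discriminability_loss(bad, labels, weight, bias))  # True
```

A line search over `lam`, as mentioned above, would simply repeat training for several candidate weights and keep the one with the best validation retrieval score.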

6 Visualization

6.1 Confusion Matrix

The evaluation metrics of precision@200 and mAP@200 (mean average precision@200) only indicate instance-level image retrieval results while ignoring category-level image retrieval results. Here we illustrate the confusion matrix on both datasets by measuring whether the retrieved images correspond to their categories. Details are presented in Figure 9. It can be seen that the retrieved images of some categories almost all belong to their corresponding categories, while the retrieved images of other categories hardly ever belong to their corresponding categories. Further research should take these failure cases into consideration and devise a better approach to transferring visual knowledge.
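A category-level confusion matrix of this kind can be built by counting, for each query category, how often each gallery category appears among the retrieved images. The following is a minimal sketch under assumptions: the function name, the input layout (one list of retrieved labels per sketch) and the row normalization are choices made for this example rather than details from the paper.

```python
import numpy as np

def retrieval_confusion_matrix(sketch_labels, retrieved_labels, num_classes):
    """Row i: distribution of retrieved image categories for sketches of category i."""
    counts = np.zeros((num_classes, num_classes))
    for label, retrieved in zip(sketch_labels, retrieved_labels):
        # Count how many retrieved images fall into each category.
        counts[label] += np.bincount(np.asarray(retrieved, dtype=int),
                                     minlength=num_classes)
    row_sums = counts.sum(axis=1, keepdims=True)
    # Rows without any query stay all-zero instead of dividing by zero.
    return np.divide(counts, row_sums, out=np.zeros_like(counts), where=row_sums > 0)

matrix = retrieval_confusion_matrix(
    sketch_labels=[0, 1],
    retrieved_labels=[[0, 0, 1], [0, 0, 0]],
    num_classes=3,
)
print(matrix.shape)  # (3, 3)
```

A strong diagonal entry means that sketches of that category mostly retrieve images of the same category, whereas a near-zero diagonal entry corresponds to the failure cases discussed above.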

6.2 Feature Visualization

To help us understand the kinds of features generated by the model, we visualize features of test images, generated features from CVAE and our proposed model respectively. To simplify the visualization, we randomly take 11 categories from each dataset: giraffe, wheelchair, pear, cabin, songbird, dolphin, raccoon, window, rhinoceros, bat and saw on Sketchy and arm, armchair, ashtray, baseball bat, bell, book, bulldozer, bush, calculator, carrot and chandelier on TU-Berlin. As shown in Figure 10, our proposed method generates more compact features, which can boost retrieval performance by a large margin.

7 Conclusion and Future Work

In this paper, we propose a novel stacked semantic-guided network to address the unique characteristics of sketches. Firstly, we consider that features of intermediate layers have great potential for improving feature representation capacity, and therefore devise a multi-layer feature fusion scheme without adding extra training parameters. Secondly, our proposed approach learns generation from sketch feature to image feature in a progressive way. This strategy greatly boosts the retrieval performance under a zero-shot setting. To encourage the generated image features to contain more semantic and discriminative information, we additionally adopt a regressor and a classifier to regularize the decoder respectively. Extensive experiments demonstrate that the proposed method yields state-of-the-art performance on two widely used large-scale datasets. In this paper, our proposed model takes fixed features (i.e., deep-learned features of VGG16) as input, which could be sub-optimal. Thus, an end-to-end joint feature extraction and image retrieval approach would further boost the performance. Category-level semantic information is introduced to assist in knowledge transfer between seen classes and unseen classes. However, the semantic information may be unavailable during the test stage. In future work, we will adopt category-level word vectors as auxiliary semantic information only at the training stage and address the ZS-SBIR problem in an end-to-end joint feature extraction and conditional feature generation framework.


  • [1] Z. Akata, F. Perronnin, Z. Harchaoui, and C. Schmid. Label-embedding for image classification. IEEE Trans. Pattern Anal. Mach. Intell., 38(7):1425–1438, 2016.
  • [2] M. Bucher, S. Herbin, and F. Jurie. Generating visual representations for zero-shot classification. In ICCV Workshop on TASK-CV: Transferring and Adapting Source Knowledge in Computer Vision, pages 2666–2673, 2017.
  • [3] X. Cao, H. Zhang, S. Liu, X. Guo, and L. Lin. Sym-fish: A symmetry-aware flip invariant sketch histogram shape descriptor. In ICCV, pages 313–320, 2013.
  • [4] Y. Cao, C. Wang, L. Zhang, and L. Zhang. Edgel index for large-scale sketch-based image search. In CVPR, pages 761–768, 2011.
  • [5] Y. Cao, H. Wang, C. Wang, Z. Li, L. Zhang, and L. Zhang. Mindfinder: interactive sketch-based image search on millions of images. In ACM Multimedia, pages 1605–1608, 2010.
  • [6] L. Chen, H. Zhang, J. Xiao, W. Liu, and S.-F. Chang. Zero-shot visual recognition using semantics-preserving adversarial embedding network. In CVPR, pages 1043–1052, 2018.
  • [7] W. Chen and J. Hays. Sketchygan: Towards diverse and realistic sketch to image synthesis. In CVPR, pages 9416–9425, 2018.
  • [8] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In CVPR, volume 1, pages 886–893, 2005.
  • [9] A. Del Bimbo and P. Pala. Visual image retrieval by elastic matching of user sketches. IEEE Trans. Pattern Anal. Mach. Intell., 19(2):121–132, 1997.
  • [10] C. Deng, Z. Chen, X. Liu, X. Gao, and D. Tao. Triplet-based deep hashing network for cross-modal retrieval. IEEE Trans. Image Process., 27(8):3893–3903, 2018.
  • [11] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale hierarchical image database. In CVPR, pages 248–255, 2009.
  • [12] M. Eitz, J. Hays, and M. Alexa. How do humans sketch objects? ACM Trans. Graph., 31(4):44–53, 2012.
  • [13] M. Eitz, K. Hildebrand, T. Boubekeur, and M. Alexa. An evaluation of descriptors for large-scale image retrieval from sketched feature lines. Comput. Graph., 34(5):482–498, 2010.
  • [14] M. Eitz, K. Hildebrand, T. Boubekeur, and M. Alexa. Sketch-based image retrieval: Benchmark and bag-of-features descriptors. IEEE Trans. Vis. Comput. Graphics, 17(11):1624–1636, 2011.
  • [15] R. Hadsell, S. Chopra, and Y. LeCun. Dimensionality reduction by learning an invariant mapping. In CVPR, pages 1735–1742, 2006.
  • [16] H. Hotelling. Analysis of a complex of statistical variables into principal components. J. Educ. Psychol., 24(6):417–441, 1933.
  • [17] R. Hu, M. Barnard, and J. Collomosse. Gradient field descriptor for sketch-based retrieval and localization. In ICIP, pages 1025–1028, 2010.
  • [18] R. Hu and J. Collomosse. A performance evaluation of gradient field hog descriptor for sketch based image retrieval. Comput. Vis. Image Understand., 117(7):790–806, 2013.
  • [19] R. Hu, T. Wang, and J. Collomosse. A bag-of-regions approach to sketch-based image retrieval. In ICIP, pages 3661–3664, 2011.
  • [20] S. James, M. J. Fonseca, and J. Collomosse. Reenact: Sketch-based choreographic design from archival dance footage. In ACM ICMR, pages 313–320, 2014.
  • [21] E. Kodirov, T. Xiang, and S. Gong. Semantic autoencoder for zero-shot learning. arXiv:1704.08345, 2017.
  • [22] C. H. Lampert, H. Nickisch, and S. Harmeling. Attribute-based classification for zero-shot visual object categorization. IEEE Trans. Pattern Anal. Mach. Intell., 36(3):453–465, 2014.
  • [23] C. Li, C. Deng, N. Li, W. Liu, X. Gao, and D. Tao. Self-supervised adversarial hashing networks for cross-modal retrieval. In CVPR, pages 4242–4251, 2018.
  • [24] K. Li, K. Pang, Y.-Z. Song, T. Hospedales, H. Zhang, and Y. Hu. Fine-grained sketch-based image retrieval: The role of part-aware attributes. In WACV, pages 1–9, 2016.
  • [25] L. Liu, F. Shen, Y. Shen, X. Liu, and L. Shao. Deep sketch hashing: Fast free-hand sketch-based image retrieval. In CVPR, pages 2862–2871, 2017.
  • [26] Y. Long, L. Liu, L. Shao, F. Shen, G. Ding, and J. Han. From zero-shot learning to conventional supervised classification: Unseen visual data synthesis. In CVPR, pages 6165–6174, 2017.
  • [27] D. G. Lowe. Object recognition from local scale-invariant features. In ICCV, volume 2, pages 1150–1157, 1999.
  • [28] T. Mikolov, K. Chen, G. Corrado, and J. Dean. Efficient estimation of word representations in vector space. arXiv:1301.3781, 2013.
  • [29] S. Parui and A. Mittal. Similarity-invariant sketch-based image retrieval in large databases. In ECCV, pages 398–414, 2014.
  • [30] A. Paszke, S. Gross, S. Chintala, and G. Chanan. Pytorch, 2017.
  • [31] E. Perez, F. Strub, H. De Vries, V. Dumoulin, and A. Courville. Film: Visual reasoning with a general conditioning layer. arXiv:1709.07871, 2017.
  • [32] Y. Qi, Y.-Z. Song, H. Zhang, and J. Liu. Sketch-based image retrieval via siamese convolutional neural network. In ICIP, pages 2460–2464, 2016.
  • [33] B. Romera-Paredes and P. Torr. An embarrassingly simple approach to zero-shot learning. In ICML, pages 2152–2161, 2015.
  • [34] J. M. Saavedra. Sketch-based image retrieval using a soft computation of the histogram of edge local orientations (S-HELO). In ICIP, pages 2998–3002, 2014.
  • [35] J. M. Saavedra, J. M. Barrios, and S. Orand. Sketch-based image retrieval using learned keyshapes (LKS). In BMVC, volume 1, pages 1–11, 2015.
  • [36] J. M. Saavedra and B. Bustos. An improved histogram of edge local orientations for sketch-based image retrieval. In Joint Pattern Recognition Symposium, pages 432–441, 2010.
  • [37] P. Sangkloy, N. Burnell, C. Ham, and J. Hays. The sketchy database: learning to retrieve badly drawn bunnies. ACM Trans. Graph., 35(4):119–130, 2016.
  • [38] F. Schroff, D. Kalenichenko, and J. Philbin. Facenet: A unified embedding for face recognition and clustering. In CVPR, pages 815–823, 2015.
  • [39] Y. Shen, L. Liu, F. Shen, and L. Shao. Zero-shot sketch-image hashing. In CVPR, pages 3598–3607, 2018.
  • [40] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv:1409.1556, 2014.
  • [41] R. Socher, M. Ganjoo, C. D. Manning, and A. Ng. Zero-shot learning through cross-modal transfer. In NeurIPS, pages 935–943, 2013.
  • [42] J. Song, Q. Yu, Y.-Z. Song, T. Xiang, and T. M. Hospedales. Deep spatial-semantic attention for fine-grained sketch-based image retrieval. In ICCV, pages 5552–5561, 2017.
  • [43] F. Wang, L. Kang, and Y. Li. Sketch-based 3d shape retrieval using convolutional neural networks. In CVPR, pages 1875–1883, 2015.
  • [44] L. Wang, Y. Li, J. Huang, and S. Lazebnik. Learning two-branch neural networks for image-text matching tasks. IEEE Trans. Pattern Anal. Mach. Intell., 2018.
  • [45] Y. Xian, Z. Akata, G. Sharma, Q. Nguyen, M. Hein, and B. Schiele. Latent embeddings for zero-shot classification. In CVPR, pages 69–77, 2016.
  • [46] Y. Xian, C. H. Lampert, B. Schiele, and Z. Akata. Zero-shot learning-a comprehensive evaluation of the good, the bad and the ugly. IEEE Trans. Pattern Anal. Mach. Intell., 2018.
  • [47] E. Yang, C. Deng, C. Li, W. Liu, J. Li, and D. Tao. Shared predictive cross-modal deep quantization. IEEE Trans. Neural Netw. Learn. Syst., (99):1–12, 2018.
  • [48] S. K. Yelamarthi, S. K. Reddy, A. Mishra, and A. Mittal. A zero-shot framework for sketch based image retrieval. In ECCV, pages 316–333, 2018.
  • [49] Q. Yu, F. Liu, Y.-Z. Song, T. Xiang, T. M. Hospedales, and C.-C. Loy. Sketch me that shoe. In CVPR, pages 799–807, 2016.
  • [50] Q. Yu, Y. Yang, Y.-Z. Song, T. Xiang, and T. Hospedales. Sketch-a-net that beats humans. arXiv:1501.07873, 2015.
  • [51] M. D. Zeiler and R. Fergus. Visualizing and understanding convolutional networks. In ECCV, pages 818–833, 2014.
  • [52] R. Zhou, L. Chen, and L. Zhang. Sketch-based image retrieval on a large scale database. In ACM Multimedia, pages 973–976, 2012.