Deep Semantic Parsing of Freehand Sketches with Homogeneous Transformation, Soft-Weighted Loss, and Staged Learning

10/14/2019 · Ying Zheng et al.

In this paper, we propose a novel deep framework for part-level semantic parsing of freehand sketches, which makes three main contributions that are experimentally shown to have substantial practical merit. First, we introduce a new idea named homogeneous transformation to address the problem of domain adaptation. For the task of sketch parsing, there are no labeled freehand sketches available that can be directly used for model training. An alternative solution is to learn from the existing parsing data of real images, but domain adaptation then becomes an inevitable problem. Unlike existing methods that utilize the edge maps of real images to approximate freehand sketches, the proposed homogeneous transformation method transforms the data from the two different domains into a homogeneous space to minimize the semantic gap. Second, we design a soft-weighted loss function as guidance for the training process, which attends to both the ambiguous label boundary and class imbalance. Third, we present a staged learning strategy to improve the parsing performance of the trained model, which takes advantage of the shared information and the specific characteristics of different sketch categories. Extensive experimental results demonstrate the effectiveness of these methods. Specifically, to evaluate the generalization ability of our homogeneous transformation method, additional experiments on the task of sketch-based image retrieval are conducted on the QMUL FG-SBIR dataset. By integrating the proposed three methods into a unified framework, our final deep semantic sketch parsing (DeepSSP) model achieves the state-of-the-art on the public SketchParse dataset.


I Introduction

Freehand sketch analysis is an important research topic in the multimedia community, especially for applications of content-based retrieval [3] [8] and cross-media computing [31] [19]. As the freehand sketch has a strong ability to represent objects and scenes abstractly, it has gained great interest from researchers over the past decade. Most sketch-related works focus on the tasks of sketch-based image [54] [62] [49] and 3D retrieval [43] [52] [56], sketch parsing [47] [42] [41] and recognition [23] [61] [60], and conversion between real images and sketches [5] [6] [34]. In this paper, we aim to explore the problem of part-level freehand sketch parsing. An in-depth understanding and solution of this problem can facilitate the development of sketch-related applications, such as sketch captioning, drawing assessment, and sketch-based image retrieval.

Existing works on freehand sketch parsing mainly focus on stroke-level labeling, which groups strokes or line segments into semantically meaningful object parts [42] [22]. This kind of labeling is largely different from the semantic parsing of real images, as the former only needs to assign a semantic label to each pixel on the sketch strokes, while the latter requires complete labeling of every pixel in the real image. For this reason, many existing methods for real image parsing cannot be directly applied to stroke-level labeling. Part-level parsing [41] also takes freehand sketches as input but conducts complete pixel-level labeling like real image parsing, so it can be considered an intermediate form between stroke-level labeling and real image parsing. Benefiting from this, it becomes possible to solve the sketch parsing problem using the powerful deep architectures designed for real image parsing.

Fig. 1: Comparative results of the baseline and our methods for the freehand sketch parsing. HT: homogeneous transformation, SWL: soft-weighted loss, SL: staged learning. It shows that incorporating these methods generates the best parsing result.

There are three main challenges in the task of part-level semantic sketch parsing: (1) semantic gap between domains of the image and sketch, (2) ambiguous label boundary and class imbalance, (3) information sharing across different sketch categories. Next, we discuss these challenges and present our methods to overcome them in the proposed deep semantic sketch parsing (DeepSSP) framework.

The semantic gap emerges when dealing with different data domains or models. Sarvadevabhatla et al. [41] first propose the task of part-level semantic sketch parsing and release the SketchParse dataset for evaluation. However, there is still no labeled data available for end-to-end training of parsing models. Considering that several public datasets exist in the area of part-level real image parsing, it is reasonable to utilize such data to train the sketch parsing model. However, as the training data comes from the domain of real images, the semantic gap between the image and sketch domains is inevitable. To solve this problem, several existing works directly take the edge maps of real images as an approximation of freehand sketches [41] [46]. Different from these methods, we propose to transform the edge map of the real image and the freehand sketch into a homogeneous space, in which the two kinds of data share the same property. In particular, we define the “stroke thickness” as the property of the homogeneous space and convert all edge maps and sketches into 1-pixel thickness. The homogeneous transformation is very simple, but it allows us to effectively train deep networks for sketch parsing.

The second challenge comes from the inherent nature of sketch data and consists of two aspects: ambiguous label boundary and class imbalance. The former is caused by the high abstraction of freehand sketches. A sketch describes an object with only a small number of stroke lines and lacks cues of color and texture, which makes the label boundary of adjacent parts ambiguous. The class imbalance refers to the variation in instance numbers across classes, a common problem in many fields of computer vision such as image classification [17] and object detection [44]. For sketch parsing, the quantities of pixels belonging to each semantic part class are extremely diverse. Taking the category “horse” as an example, the number of pixels belonging to the part class “torso” is hundreds of times that of the class “tail”. To tackle these two problems, we propose a soft-weighted loss function that acts as more effective supervision for training the deep parsing network.

The third challenge can be phrased as the question “how to make the best use of the information shared among different sketch categories to learn a better sketch parsing model?” An alternative solution for sketch parsing is to train a category-specific model for each category due to the label discrepancy. However, this solution largely limits the generalization capability of the model and produces too many independent models, which makes training and testing inconvenient. To overcome this challenge, we present a staged learning strategy to make better use of the information shared across categories. At the first stage, the training data of all categories are used to learn the parameters of the shared layers under a super branch architecture. At the next stage, we freeze the shared layers and change the super branches to several category-specific branches. Then, we utilize the training data of each category to train the layers in the corresponding branch. This strategy considers the shared information and the category-specific characteristics at the respective stages, which effectively improves the parsing performance of the trained model.

Extensive experimental results on the SketchParse dataset demonstrate the effectiveness of our three methods for freehand sketch parsing. In particular, as a general method for domain adaptation between the real image and the sketch, our homogeneous transformation is also shown to be very effective in improving the performance of deep models on the task of fine-grained sketch-based image retrieval (FG-SBIR). Furthermore, we present an erasing-based augmentation method to enhance the training data. After incorporating the proposed methods into the deep semantic sketch parsing (DeepSSP) framework, we achieve state-of-the-art performance on the SketchParse dataset. To illustrate the contributions of the proposed methods, we show comparative results in Fig. 1.

The contributions of this paper are summarized as follows:

  1. We introduce the homogeneous transformation to solve the problem of domain adaptation that exists in several sketch-related fields, such as sketch parsing and sketch-based image retrieval.

  2. We propose the soft-weighted loss function for better model training with consideration of the ambiguous label boundary and class imbalance.

  3. We present the staged learning strategy to further enhance the parsing ability of the trained model for each sketch category.

  4. Extensive experimental results demonstrate the practical value of our methods and our final DeepSSP model achieves a new state-of-the-art on the public SketchParse dataset.

The remaining sections are organized as follows. We first briefly review the related work in the fields of semantic image/sketch parsing and domain adaptation in Section II. We give detailed descriptions of the proposed homogeneous transformation, soft-weighted loss, and staged learning in Section III. Experimental results, comprehensive analysis, implementation details, and discussions are provided in Section IV. Finally, we articulate our conclusions in Section V.

II Related Work

In this section, we first briefly review two branches of work in the field of semantic image parsing: object-level segmentation and part-level parsing. Then we move on to representative works on semantic sketch parsing, including stroke-level labeling and part-level parsing. As both the real image and the freehand sketch domains are involved in this paper, we also introduce related work on domain adaptation.

II-A Semantic Image Parsing

Object-level segmentation. With the advance of deep convolutional neural networks, the field of semantic segmentation has made great achievements. The first work exploring the capabilities of existing networks for semantic image segmentation was proposed by Long et al. [30]. They combine the well-known CNN models for image classification (e.g., AlexNet [21], VGG [45], and GoogleNet [48]) with fully convolutional networks (FCN) to make dense predictions for every pixel. Following the success of FCN, many researchers developed new network structures or filters to improve semantic segmentation, such as DeconvNet [32], U-Net [36], DeepLab [4], PSPNet [63], and SegNet [1]. These methods label each pixel of the real image with the class of the object or region it belongs to, but do not distinguish instances of the same class. To output finer results, deep frameworks like FCIS [24] and Mask R-CNN [14] are designed to separate different instances with the same class label. Kang et al. [20] propose to utilize the depth map in a depth-adaptive deep neural network for semantic segmentation. We refer readers with more interest to two comprehensive review papers on semantic segmentation [50] [12].

Part-level parsing. Compared to object-level segmentation, part-level parsing focuses on decomposing segmented objects into semantic components. Wang et al. [53] propose to jointly solve the problems of object segmentation and part parsing by using two-stream fully convolutional networks (FCN) and deep learned potentials. Liang et al. [26] design a deep local-global long short-term memory (LG-LSTM) architecture for part-level semantic parsing, which learns features in an end-to-end manner instead of employing separate post-processing steps. To generate high-resolution predictions, Lin et al. [28] present a generic multi-path refinement network (RefineNet) that exploits features at multiple levels. Beyond these works on general objects, some methods are specifically designed for human parsing [58] [9] [25]. Liang et al. [27] integrate different kinds of context, such as cross-layer context and cross-super-pixel neighborhood context, into a contextualized convolutional neural network (Co-CNN). Considering the human body configuration, Gong et al. [13] propose a self-supervised structure-sensitive learning method and release a new human parsing dataset named “Look into Person (LIP)”. The biggest difference between these methods and ours is that the starting point of this paper lies in freehand sketch parsing, which faces several unique challenges as mentioned in Section I.

II-B Semantic Sketch Parsing

Stroke-level labeling. Most existing works in the field of semantic sketch parsing focus on the task of stroke-level labeling, where the goal is to infer labels for every stroke or line segment of the freehand sketch. Depending on the target, this task can be divided into two types: scene segmentation and object labeling. The former takes a scene sketch as input and segments all strokes of the scene into different semantic objects [47] [66]. The latter labels the strokes of an individual object sketch with classes that correspond to different semantic object parts [22] [18] [35]. Fig. 2 (a) and (b) illustrate examples of these two types of stroke-level labeling. However, as the goal and output of this paper differ from these works, it remains unclear how to transfer such methods to the problem of part-level parsing for freehand sketches.

Part-level parsing. Unlike stroke-level labeling, the goal of part-level parsing is to predict class labels for every pixel instead of only the strokes, as shown in Fig. 2 (c). From the view of the output, it is similar to the part-level parsing of real images. Sarvadevabhatla et al. [41] first propose this task and collect the SketchParse dataset for the evaluation of parsing models. They present a two-level fully convolutional network and incorporate pose prediction as an auxiliary task to provide supplementary information. To reduce the domain gap between real images and freehand sketches, they translate the real image into a sketch-like form based on its edge map. However, this is only a simple expedient that leaves much room for improvement. In this paper, we propose a homogeneous transformation method that is experimentally shown to be very effective for this problem. Furthermore, we present a soft-weighted loss function and a staged learning strategy to further improve the parsing performance.

II-C Domain Adaptation

When a model trained on source data from a specific domain is applied to target data from a different domain, the distribution variation between the two domains usually degrades the performance at testing time [33] [64]. Domain adaptation is a promising solution to this problem and has been recognized as an essential requirement. Many domain adaptation methods have proven successful in various fields of computer vision, such as image classification [2] [39], object detection [57] [38], and semantic image parsing [51] [16]. Saenko et al. [39] propose to adapt visual category models to new domains for image recognition by learning a transformation of the feature distribution. Instead of learning features that are invariant to the domain shift, Rozantsev et al. [38] state that explicitly modeling the shift between two domains is more effective. Unlike most of these works, in which the modality of the images does not actually change, the task of this paper faces a modality-level variation (real image vs. freehand sketch), which makes it even more challenging. The challenge also exists in other sketch-related fields (e.g., sketch-based image retrieval), in which existing methods usually take the edge map of the real image as a similar data form to the freehand sketch [46] [59] [40]. Different from these methods, we propose a homogeneous transformation method that transforms the data of the two different domains into a homogeneous space to minimize the semantic gap.

Fig. 2: Illustration of different tasks of sketch parsing. (a) shows the output of scene segmentation taken from [66], (b) is the result of object labeling taken from [42], (c) presents the result of our method. (a) and (b) are stroke-level labeling, while (c) is part-level parsing. Different colors refer to specific semantic classes.

III Methods

We present a novel deep semantic sketch parsing (DeepSSP) framework for part-level dense prediction of freehand sketches, which incorporates three new methods that attack the problem from different angles. In this section, we detail the proposed homogeneous transformation method, soft-weighted loss function, and staged learning strategy.

III-A Homogeneous Transformation

Before introducing the proposed homogeneous transformation method, we first take a glance at the problem of domain adaptation. Given data from one domain A for model training, the target is to make predictions in another domain B. As domain A has a different distribution from B, it is difficult for the trained model to obtain satisfactory prediction performance. To solve this problem, methods for domain adaptation are undoubtedly necessary. In the sketch-related fields, these two domains generally refer to the real image and the freehand sketch. One of the frequently used methods to reduce the domain gap is to convert the real images (domain A) into edge maps (domain C). The data from the new domain C look more like freehand sketches (domain B) than the original real images, making it easier to train a better model. However, there still remains an obvious difference between domains C and B.

Fig. 3: Illustration of the proposed homogeneous transformation. A and B are two different domains. The domain C converted from domain A is similar to domain B. H is the homogeneous space, in which the examples translated from domains C and B represent the same property.

To take one step further, we make a new definition named “homogeneous space” (H), in which the data represent the same property regardless of the source domain. As shown in Fig. 3, space H is transformed from domains C and B. This process is called the homogeneous transformation. It is possible to directly translate the data of domain A into space H. However, considering that domain C is closer to domain B than domain A, we choose C as the source domain instead of A. When both training and prediction are conducted in space H, the trained model can be expected to achieve higher performance.

There are two important factors in the homogeneous transformation. The first is the selection of the property shared in the homogeneous space. The second is that the transformation should minimize the variation of appearance related to the label; otherwise, if the appearance changes remarkably, it may become inconsistent with the given label. In this paper, we choose the “stroke thickness” as the shared property and convert the strokes of all edge maps and freehand sketches into 1-pixel thickness. As we only change the thickness of the strokes, the appearance of the generated examples is guaranteed to remain consistent after the transformation.

In practice, we first translate an image from domain C or B into a binary image by performing a simple threshold operation; in the experiments, the threshold is set to a fixed value. Then we adopt a morph-based skeletonization method to extract the centerline of all binary strokes. This method removes pixels on the boundaries of the binary image without allowing it to break apart. The remaining pixels make up the centerline, which has 1-pixel thickness. After this operation, all images from the source and target domains are transformed into the homogeneous space, in which the strokes of all examples share the same property. Finally, the training and test images are replaced with their corresponding examples in the homogeneous space. As the proposed homogeneous transformation is not restricted to the task of part-level semantic sketch parsing, it can be taken as a general method for sketch-related applications, such as sketch-based image retrieval.
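
As a concrete illustration, the transformation can be implemented with a global threshold followed by morphological skeletonization, for example with scikit-image. The sketch below is ours rather than the authors' code: the function name, the assumption that strokes are darker than the background, and the unspecified threshold value are placeholders.

```python
import numpy as np
from skimage.morphology import skeletonize

def homogeneous_transform(gray, threshold):
    """Map an edge map or a freehand sketch (grayscale, dark strokes on a
    light background) into the homogeneous space of 1-pixel-thick strokes."""
    binary = gray < threshold            # simple threshold: stroke pixels become True
    centerline = skeletonize(binary)     # morph-based skeletonization to 1-pixel thickness
    out = np.full(gray.shape, 255, dtype=np.uint8)
    out[centerline] = 0                  # keep only the centerline as black strokes
    return out
```

Both the edge maps used for training and the freehand sketches used for testing would be passed through the same function, so that all strokes share the 1-pixel-thickness property.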

III-B Soft-Weighted Loss

The soft-weighted loss is designed for the part-level sketch parsing scenario, in which both ambiguous label boundaries and class imbalance between different semantic parts arise during training. Before introducing the soft-weighted loss, we first start from the definition of the standard cross entropy (CE) loss for each pixel,

L_{CE}(x, c) = -\log\left(\frac{\exp(x_c)}{\sum_{j=1}^{C}\exp(x_j)}\right), \qquad (1)

where x is the input that contains the predicted scores for each class, c is the ground-truth class label, and C refers to the number of classes. The final CE loss for each prediction of the part-level semantic sketch parsing is computed by

L = \frac{1}{HW}\sum_{h=1}^{H}\sum_{w=1}^{W} L_{CE}\big(x^{(h,w)}, c^{(h,w)}\big), \qquad (2)

which averages the losses at all positions of the prediction with the resolution of H \times W.
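
For reference, Eqs. (1)-(2) correspond to the usual per-pixel cross entropy averaged over the prediction map; a minimal PyTorch sketch with assumed tensor shapes:

```python
import torch.nn.functional as F

def pixelwise_ce(logits, target):
    """Eqs. (1)-(2): logits of shape (N, C, H, W), target of shape (N, H, W).
    F.cross_entropy applies Eq. (1) at every position and, with the default
    'mean' reduction, averages over all H x W positions as in Eq. (2)."""
    return F.cross_entropy(logits, target)
```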

We propose a soft-weighted loss to address the problems of ambiguous label boundary and class imbalance, which reshapes the standard CE loss into the following formulation,

L_{SW}(x, c) = w_c \sum_{k=1}^{C} s_k \, L_{CE}(x, k), \qquad (3)

where the soft parameter s_k is used to handle the situation of ambiguous label boundary and the weighted parameter w_c re-weights the losses of different classes. By substituting Eq. (1) into Eq. (3), the soft-weighted loss can be written as

L_{SW}(x, c) = -w_c \sum_{k=1}^{C} s_k \log\left(\frac{\exp(x_k)}{\sum_{j=1}^{C}\exp(x_j)}\right). \qquad (4)

Next, we present the details of these two parameters (s_k, w_c) and show their specific effects on the task of part-level semantic sketch parsing.

As a high abstraction of objects or scenes, the freehand sketch lacks many cues (e.g., texture and color) compared to the real image, which frequently makes the label boundary of adjacent parts ambiguous. For example, the labels distributed over the boundary of the part classes “head” and “torso” are not completely certain; both “head” and “torso” can be seen as soft labels for these pixels. It should be more acceptable to assign the soft label “torso” to a boundary pixel labeled with “head” than other labels like “tail” or “leg”. Therefore, we introduce the soft parameter s_k to give some tolerance to predictions that output the soft labels for boundary pixels instead of the ground-truth class.

The soft parameter for class k is computed by

s_k = \frac{n_k}{\sum_{j} n_j}, \qquad (5)

in which n_k counts the number of pixels belonging to class k among the pixels adjacent to the current pixel, so that s_k is equivalent to the percentage of class k among these adjacent pixels. For a better understanding, we present an illustration of the computation of the soft parameters for a boundary pixel in Fig. 4. In practice, we only take foreground classes into consideration, so the background class receives no soft weight. For pixels belonging to class c that are not adjacent to other parts, s_c = 1 while s_k = 0 for the other cases (k \neq c), which makes the soft-weighted loss evolve into

L_{SW}(x, c) = -w_c \log\left(\frac{\exp(x_c)}{\sum_{j=1}^{C}\exp(x_j)}\right). \qquad (6)

We can see that the soft-weighted loss focuses on adjusting the loss for boundary pixels and preserves the loss for pixels with a clear label. It makes the parsing model concentrate on reducing clear errors while avoiding the disturbance brought by the ambiguous label boundary.

Fig. 4: Illustration of the computation of the soft parameters for a boundary pixel with label “4”. The top matrix shows the class labels of the adjacent pixels, n counts the number of pixels of each class, and s consists of the soft parameters of all classes.
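
To make the computation concrete, the following sketch builds a per-pixel soft label distribution from a ground-truth label map. The 3x3 neighborhood used to define the “adjacent pixels” and the treatment of background and non-boundary pixels as one-hot targets are assumptions made for illustration; they follow Eqs. (5)-(6) but are not fully specified above.

```python
import numpy as np

def soft_label_map(labels, num_classes, bg_class=0):
    """Per-pixel soft parameters s (Eq. (5)) from an integer label map."""
    H, W = labels.shape
    s = np.zeros((num_classes, H, W), dtype=np.float32)
    padded = np.pad(labels, 1, mode='edge')
    for y in range(H):
        for x in range(W):
            c = labels[y, x]
            window = padded[y:y + 3, x:x + 3]           # assumed 3x3 neighborhood
            counts = np.bincount(window.ravel(), minlength=num_classes).astype(np.float32)
            counts[bg_class] = 0.0                      # only foreground classes are counted
            if c != bg_class and (counts > 0).sum() > 1:
                s[:, y, x] = counts / counts.sum()      # percentage of each class among neighbors
            else:
                s[c, y, x] = 1.0                        # clear label: one-hot target, Eq. (6)
    return s
```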

Class imbalance is a common problem in the field of computer vision [29] [37]. For the task of semantic sketch parsing, there is a great difference in the pixel distribution between different part classes. As a consequence, classes with many pixels dominate the training loss, which brings a negative impact on model training. To alleviate this issue, we apply the weighted parameter w_k to re-weight the losses from different classes. The parameter for class k is defined as

w_k = \frac{\bar{n}_{med}}{\bar{n}_k}, \qquad (7)

where \bar{n}_{med} is the median of \{\bar{n}_1, \ldots, \bar{n}_C\}, and \bar{n}_k is computed as follows,

\bar{n}_k = \frac{N_k}{M_k}, \qquad (8)

in which N_k is the total number of pixels belonging to class k and M_k refers to the number of images that include class k. The \bar{n}_k can be seen as the average number of pixels of each class per image on the training set. Eq. (7) guarantees that a class with few pixels has a higher weight than classes with more pixels. Finally, the soft-weighted cross entropy loss is formulated as

L_{SWCE} = \frac{1}{HW}\sum_{h=1}^{H}\sum_{w=1}^{W} L_{SW}\big(x^{(h,w)}, c^{(h,w)}\big), \qquad (9)

where the weighted parameter w_{c^{(h,w)}} inside L_{SW} acts as the weight of each pixel.
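
A minimal PyTorch sketch of Eqs. (7)-(9) is given below; it combines the class weights with the soft label map from the previous sketch, and the function names are ours rather than the authors' implementation.

```python
import numpy as np
import torch
import torch.nn.functional as F

def class_weights(total_pixels, image_counts):
    """Eqs. (7)-(8): n_bar_k = N_k / M_k, w_k = median(n_bar) / n_bar_k."""
    n_bar = np.asarray(total_pixels, dtype=np.float64) / np.asarray(image_counts, dtype=np.float64)
    return torch.as_tensor(np.median(n_bar) / n_bar, dtype=torch.float32)

def soft_weighted_ce(logits, labels, soft, weights):
    """Eq. (9) for a single prediction.
    logits: (C, H, W) scores, labels: (H, W) long ground truth,
    soft: (C, H, W) tensor from soft_label_map, weights: (C,) from class_weights."""
    log_p = F.log_softmax(logits, dim=0)      # per-pixel log-probabilities over classes
    per_pixel = -(soft * log_p).sum(dim=0)    # Eq. (4) without the class weight
    per_pixel = weights[labels] * per_pixel   # w_c of the ground-truth class acts as the pixel weight
    return per_pixel.mean()                   # average over the H x W positions
```

In training, soft would be precomputed from each annotation (e.g., torch.from_numpy(soft_label_map(gt, C))); the background-sensitive variant evaluated later in Table III can be obtained by enlarging the background entry of weights.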

(a) Stage 1
(b) Stage 2
Fig. 5: Our staged learning strategy for the training of sketch parsing model. (a) Stage 1: the network with 5 super branches is used to learn the parameters of shared layers. (b) Stage 2: we freeze these shared layers and train unshared layers of each specific category under a full branch architecture. The shared layers of the deep model are shown as the left blue box at each stage, and the branches for super or specific categories are presented in right boxes with different colors.

III-C Staged Learning

Given the category of a freehand sketch, an intuitive way to perform semantic sketch parsing is to train a network independently for each category. This is straightforward but neglects the information shared across different categories, and the performance is greatly limited when only a small number of training examples are available. This problem can be alleviated via a half-shared deep architecture [41], which splits the semantic parsing model into two parts, as shown in Fig. 5 (a). The front part consists of several shared layers, while the remaining layers are heterogeneous across 5 super branches. In each super branch, sub-categories such as cow and horse have similar semantic part classes. However, this network does not consider the differences between sub-categories under the same super category. A sketch of such a half-shared network is given below.
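
The half-shared network can be sketched as a shared trunk followed by one head per (super-)branch; the class and parameter names below are illustrative assumptions, not the authors' implementation.

```python
import torch.nn as nn

class HalfSharedParser(nn.Module):
    """Shared layers (e.g., ResNet-101 up to 'res5b') followed by one parsing
    head per super branch, or per category at the second stage."""
    def __init__(self, shared, make_head, branch_classes):
        super().__init__()
        self.shared = shared
        self.branches = nn.ModuleDict({
            name: make_head(num_parts)          # a copy of the remaining layers for each branch
            for name, num_parts in branch_classes.items()
        })

    def forward(self, x, branch):
        feat = self.shared(x)                   # features from the shared front part
        return self.branches[branch](feat)      # route to the branch of the input's category
```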

                    cow    horse  cat    dog    sheep  bus    car    bicycle  motorbike  airplane  bird   average
Original            64.39  66.23  65.01  66.05  64.63  63.55  57.97  50.60    50.20      53.13     43.25  58.37
HT (train only)     61.64  64.45  61.36  61.97  63.16  61.76  60.84  53.96    50.07      54.28     44.16  57.76
HT (train + test)   64.70  66.72  63.88  67.16  66.63  65.40  65.94  57.87    50.88      55.71     45.70  60.72
  • “Original” means using the original training data. “HT (train only)” trains the model with the training data after the homogeneous transformation but predicts on the test data without HT. “HT (train + test)” refers to both training and evaluation being conducted on the data with HT. For a pure evaluation of these methods, all models are trained without any data augmentation.

TABLE I: Comparative results of using different training or test data on the SketchParse dataset.

In consideration of both the information sharing and the specific characteristics of each sketch category, we propose a staged learning strategy to further improve the parsing performance of the trained model. As shown in Fig. 5, the strategy consists of two training stages and is independent of the backbone network. At stage 1, we use training examples from all sketch categories to learn the parameters of the shared layers under the half-shared deep architecture. For every iteration, the data flow forward from the shared layers to their corresponding branch layers. At the next stage, we freeze all shared layers and replace each super branch with several sub-branches, as shown in Fig. 5 (b). Then we only need to fine-tune the layers of the corresponding branch for each sketch category. Experimental results demonstrate the superior performance of our strategy compared to completely independent training and the super branch architecture.
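
A minimal sketch of the two training stages follows; the data routing, the rebuilding of the super branches into per-category sub-branches before stage 2, and the optimizer construction are simplified assumptions.

```python
import itertools

def train_stage1(model, loader, loss_fn, optimizer, num_iters=20000):
    """Stage 1: learn the shared layers jointly, routing each example to its super branch."""
    model.train()
    for step, (sketch, target, super_branch) in enumerate(itertools.cycle(loader)):
        if step >= num_iters:
            break
        loss = loss_fn(model(sketch, super_branch), target)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

def train_stage2(model, branch_loaders, loss_fn, make_optimizer, num_iters=2000):
    """Stage 2: freeze the shared layers and fine-tune each category-specific branch.
    model.branches is assumed to have been rebuilt with one sub-branch per category."""
    for p in model.shared.parameters():
        p.requires_grad_(False)                      # the shared layers stay fixed
    for name, loader in branch_loaders.items():      # one loader per sketch category
        optimizer = make_optimizer(model.branches[name].parameters())
        for step, (sketch, target) in enumerate(itertools.cycle(loader)):
            if step >= num_iters:
                break
            loss = loss_fn(model(sketch, name), target)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```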

IV Experiments

In this section, we first introduce datasets used for the training and evaluation. Then, we give details of the experimental implementation and propose a novel erasing-based method for data augmentation. To provide a more comprehensive understanding of the proposed method, we evaluate the contributions of each component and present an ablation study via extensive experiments. Furthermore, extra experiments are conducted on the task of fine-grained sketch-based image retrieval to demonstrate the practical value of the homogeneous transformation method. Finally, we present a comparison against other methods and make discussions of some qualitative results.

IV-A Datasets

Following the work of Sarvadevabhatla et al. [41], we use data from real image datasets for network training and evaluate the performance of the trained models on the SketchParse dataset. Specifically, the training set consists of 1532 paired real images and corresponding part-level annotations, distributed across 11 categories (i.e., airplane, bicycle, bird, bus, car, cat, cow, dog, horse, motorbike, and sheep). These images and annotations are selected from two public datasets, i.e., Pascal-Part [7] and Core [11].

The evaluation is conducted on the SketchParse dataset [41], which takes 48 freehand sketches for each category from each of the Sketchy [40] and TU-Berlin [10] datasets. As the category “bus” only exists in the TU-Berlin dataset, there are 1008 (10 × 2 × 48 + 48) freehand sketches in the SketchParse dataset in total. All sketches are labeled with part-level dense annotations. The average IOU score is adopted to evaluate the parsing performance of the trained models.
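
The evaluation metric can be computed per sketch as the intersection-over-union averaged over part classes; the exact averaging protocol of the benchmark is assumed here.

```python
import numpy as np

def average_iou(pred, gt, num_classes):
    """Mean intersection-over-union across the classes of one label map."""
    ious = []
    for k in range(num_classes):
        inter = np.logical_and(pred == k, gt == k).sum()
        union = np.logical_or(pred == k, gt == k).sum()
        if union > 0:                     # skip classes absent from both maps
            ious.append(inter / union)
    return float(np.mean(ious))
```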

IV-B Implementation Details

We take DeepLab v2 [4] as the backbone network, which is a widely used architecture for semantic parsing. In the experiments, the DeepLab model is derived from a multi-scale version of ResNet-101 [15]. Similar to the work of [41], we split the deep model into two parts at the position of “res5b”. As shown in Fig. 5, the front part is used as the shared layers across categories and the remaining layers are copied into different branches. A base initial learning rate is used for all layers except the final convolutional layers of each branch, whose learning rate is set separately. The learning rate is decayed under the polynomial policy. Limited by GPU memory, the mini-batch size is set to 1. We adopt stochastic gradient descent (SGD) with a momentum of 0.9 as the optimizer. Furthermore, we apply 20000 iterations to learn the parameters of the shared layers at stage 1 and 2000 iterations to fine-tune the remaining layers of each branch at stage 2. All experiments are conducted on a single NVIDIA GeForce GTX 1080Ti GPU with 11GB of memory.
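
The polynomial decay policy mentioned above is usually implemented as base_lr * (1 - iter / max_iter)^power; a small sketch, where the power value is an assumption since it is not stated here:

```python
def poly_lr(base_lr, cur_iter, max_iter, power=0.9):
    """Polynomial learning-rate decay applied before each SGD iteration."""
    return base_lr * (1.0 - cur_iter / max_iter) ** power

# e.g., at every training step:
# for group in optimizer.param_groups:
#     group['lr'] = poly_lr(group['initial_lr'], step, max_iter=20000)
```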

Data augmentation is an important strategy for improving the performance of deep neural networks [65] [55]. Same as [41], we perform rotations of different degrees (0, ±10, ±20, ±30) and mirroring on the original image, which finally yields 14 augmented images for each sketch. Furthermore, we apply an erasing-based sketch augmentation method to generate two times the amount of training data: for each training image, the method randomly erases a rectangular region, and the generated image shares the ground truth annotation with its source image.
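
The erasing-based augmentation can be sketched as blanking out one randomly placed rectangle of the sketch while keeping its annotation untouched; the region size and fill value below are illustrative parameters, not the paper's settings.

```python
import numpy as np

def random_erase(sketch, erase_h, erase_w, fill=255):
    """Return a copy of the sketch with one randomly placed region erased;
    the ground-truth annotation of the source image is reused unchanged."""
    out = sketch.copy()
    H, W = out.shape[:2]
    y = np.random.randint(0, H - erase_h + 1)
    x = np.random.randint(0, W - erase_w + 1)
    out[y:y + erase_h, x:x + erase_w] = fill   # overwrite the region with the background colour
    return out
```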

   Original    HT   Improvement
vanilla 58.37 60.72 +2.35
mirroring 59.04 61.34 +2.30
rotation 60.46 61.72 +1.26
erasing 59.74 60.86 +1.12
mirroring+rotation 60.74 61.69 +0.95
erasing+mirroring 60.02 61.83 +1.81
erasing+rotation 60.94 61.98 +1.04
erasing+mirroring+rotation 61.29 62.71 +1.42
TABLE II: Comparative results of different data augmentation methods and performance improvements brought by the homogeneous transformation (HT).

IV-C Ablation Study

                    cow    horse  cat    dog    sheep  bus    car    bicycle  motorbike  airplane  bird   average
Base+CE             69.37  71.18  66.91  67.83  69.12  65.73  66.33  58.48    50.79      57.05     48.80  62.71
Base+WCE            69.45  70.53  69.24  71.53  69.70  66.99  70.11  61.08    53.15      57.67     50.41  64.39
Base+SWCE           69.76  71.30  70.19  71.30  71.22  66.98  69.98  61.60    56.18      59.14     52.11  65.33
Base+SWCE (2x bg)   70.16  72.12  69.58  72.03  71.74  67.18  70.15  62.35    54.48      60.45     52.07  65.57
  • “Base”: training with the augmentation methods mentioned above, “CE”: the standard cross entropy loss, “WCE”: the weighted CE loss, “SWCE”: the soft-weighted CE loss, “SWCE (2x bg)”: the soft-weighted CE loss with a higher weight for the background class.

TABLE III: Comparison of different loss functions on the SketchParse dataset.
  cow   horse   cat   dog   sheep   bus   car bicycle motorbike airplane   bird   average
Independent 66.54 70.02 66.63 69.94 68.58 65.95 67.01 63.09 51.89 55.97 52.20 63.30
Full Branch 69.73 70.73 68.54 71.33 69.44 68.43 69.71 62.80 51.20 59.41 51.83 64.64
Super Branch 70.16 72.12 69.58 72.03 71.74 67.18 70.15 62.35 54.48 60.45 52.07 65.57
Staged Learning 70.42 72.81 69.94 72.57 72.02 67.88 70.88 63.30 55.69 59.96 54.38 66.25
  • “Independent”: training independently for each category, “Full Branch”: the half-shared network with one branch for each category, “Super Branch”: the super branch architecture, “Staged Learning”: our staged learning strategy.

TABLE IV: Comparison of different architectures on the SketchParse dataset.

In the experiments, we apply the homogeneous transformation (HT) method to the training and test datasets, so that both model training and evaluation are conducted on the transformed data. We select the super branch architecture as the base network and present the results in Table I. Compared to training on the original dataset, the model with our HT method achieves better performance on 10 of the 11 sketch categories and obtains a 2.35% higher average IOU score. The results demonstrate the effectiveness of the proposed HT method on the task of part-level semantic sketch parsing. Furthermore, we show the results of performing the HT method only on the training data while leaving the test set untouched, shown as “HT (train only)” in Table I. In this case, there still remains a big gap between the training and test data, which yields the worst performance. Therefore, it is important to perform the HT method on the two domains simultaneously.

We compare the performance of different augmentation methods in Table II. The rotation-based augmentation shows the best performance among the three methods when used in isolation. By combining them, the average IOU score gains 2.92% over the vanilla version (from 58.37% to 61.29%); we therefore adopt this combination of augmentation methods for model training. We also report the performance improvements brought by the homogeneous transformation in combination with these augmentation settings. As shown in Table II, the HT method consistently outperforms training on the original data, which demonstrates the stability and effectiveness of the proposed HT method.

Table III shows comparative results of different loss functions on the SketchParse dataset. All models are trained with the combination of the three augmentation methods mentioned above, noted as “Base” in the table. Compared to the standard cross entropy (CE) loss and its weighted version (WCE), the model trained with the proposed soft-weighted CE loss (SWCE) achieves better performance. The results show the superiority of the soft-weighted CE loss on the task of part-level semantic sketch parsing. As the pixels belonging to the background class are mostly separated from the other classes by the sketch boundary, we further enlarge the weighted parameter of the background class to make the network more sensitive to the boundary between the foreground classes and the background class. As shown in Table III, this variant with 2 times the background weight obtains slightly higher performance than SWCE.

As mentioned in Sec. III-C, there are different deep architectures for sketch parsing. We present the comparison of our staged learning strategy against the other settings in Table IV. Training independently for each sketch category yields the worst performance. Taking advantage of the half-shared network, the models trained under the full and super branch architectures obtain higher average IOU scores, which shows the importance of information sharing. With the proposed staged learning strategy, the parsing model achieves the best performance among them. Furthermore, we evaluate the contributions of each component of our final deep semantic sketch parsing (DeepSSP) model. The baseline model is trained with the augmentation methods of rotation and mirroring, the standard cross entropy loss, and the super branch architecture. As shown in Fig. 6, these components improve the performance by varying magnitudes, which proves the practical value of our methods.

Fig. 6: Contributions of each component of our final deep semantic sketch parsing (DeepSSP) model. The baseline refers to the model trained with the augmentation methods of rotation and mirroring, the standard cross entropy loss, and the super branch architecture.
   K                    1      2      3      4      5      6      7      8      9      10
   Shoe     DSSA [46]   58.26  68.70  74.78  79.13  82.61  85.22  85.22  88.70  90.43  92.17
            DSSA+HT     66.09  73.91  79.13  85.22  88.70  91.30  92.17  93.04  93.04  93.04
   Chair    DSSA [46]   79.38  85.57  86.60  89.69  92.78  93.81  95.88  95.88  95.88  95.88
            DSSA+HT     85.57  90.72  91.75  93.81  94.85  94.85  95.88  95.88  95.88  95.88
   Handbag  DSSA [46]   48.21  58.33  66.07  69.05  73.21  76.79  79.17  80.95  82.74  83.33
            DSSA+HT     50.60  63.10  70.24  73.21  75.60  77.38  78.57  79.76  81.55  83.33
TABLE V: Comparison of the top K accuracy (acc.@K) between DSSA [46] and DSSA with the proposed HT method (DSSA+HT).
  cow   horse   cat   dog   sheep   bus   car bicycle motorbike airplane   bird   average
Baseline 66.01 67.77 66.37 67.41 67.37 65.80 63.15 59.15 50.43 52.95 44.57 60.74
MM'17 [41] 68.78 69.35 69.60 71.18 70.81 68.00 67.35 62.66 55.04 57.34 50.89 64.45
Our method 70.42 72.81 69.94 72.57 72.02 67.88 70.88 63.30 55.69 59.96 54.38 66.25
TABLE VI: Comparison of different methods on the SketchParse dataset.

IV-D Homogeneous Transformation for SBIR

Domain adaptation is also a common problem in the field of sketch-based image retrieval (SBIR). To demonstrate the practical value of the proposed homogeneous transformation (HT) method, we integrate it into the training pipelines of existing SBIR networks and evaluate their performance on the QMUL FG-SBIR dataset [46] [59].

The QMUL FG-SBIR dataset is constructed for the task of fine-grained instance-level SBIR. It includes three sub-datasets: shoe, chair, and handbag, in which there are 419, 297, and 568 sketch-photo pairs, respectively. The standard split of training and testing is provided by the authors and also adopted in our experiments.

   Ori    HT   Improvement
Shoe  Triplet SN [59] 52.17 58.26 +6.09
 DSSA [46] 58.26 66.09 +7.83
Chair  Triplet SN [59] 72.16 82.47 +10.31
 DSSA [46] 79.38 85.57 +6.19
Handbag  Triplet SN [59] 39.88 42.86 +2.98
 DSSA [46] 48.21 50.60 +2.39
  • “Ori” means training on the original dataset without HT.

TABLE VII: Comparative results against baselines on the QMUL FG-SBIR dataset (acc.@1).

We select two cutting-edge methods, Triplet SN [59] and DSSA [46], as our baseline models. Considering that both methods take triplets as the input to their networks, we apply the proposed HT method to create new training sketches as the anchor samples, which preserves the same number of triplets for model training. Following the works of Triplet SN [59] and DSSA [46], we use the same experimental settings and take the top K accuracy (acc.@K) as the evaluation metric. The comparative results against the baselines on the QMUL FG-SBIR dataset (acc.@1) are shown in Table VII. There are significant performance improvements for both baseline networks when integrated with the proposed HT method. Furthermore, we show the top K accuracies (acc.@K) of DSSA [46] with and without the proposed HT method in Table V. The models trained with our HT method mostly perform better than the baselines. These experimental results demonstrate the effectiveness of the proposed HT method for fine-grained instance-level SBIR.
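
In this setting the HT method only changes the data preparation: the sketches fed to the triplet networks are first passed through the transformation. A minimal sketch of that step, reusing the homogeneous_transform function defined in Section III-A (the threshold argument remains a placeholder):

```python
def prepare_triplets(triplets, threshold):
    """Apply HT to the anchor sketches of (sketch, positive photo, negative photo)
    triplets; the photos and the number of triplets stay unchanged.
    homogeneous_transform is the sketch given in Section III-A."""
    return [(homogeneous_transform(anchor, threshold), positive, negative)
            for anchor, positive, negative in triplets]
```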

Fig. 7: Parsing results of different methods. From left to right: input freehand sketches, outputs of three methods (i.e., the baseline, MM'17, and our model), and ground truth annotations.

IV-E Comparative Results

Table VI shows the comparison of our method against the baseline and the work of Sarvadevabhatla et al. [41] (noted as MM'17). Our method obtains 5.51% higher performance than the baseline and beats MM'17 on 10 of the 11 sketch categories. In particular, for the categories “bus” and “bird” we achieve 3.53% and 3.49% improvements, respectively. By integrating the proposed methods, our final model sets a new state-of-the-art for part-level semantic sketch parsing. Some examples of the parsing results are shown in Fig. 7. Compared to the other methods, our predictions are more accurate for parts like the wheels of the bus and the headlights of the car, which shows the superiority of our methods.

Fig. 8: Illustrations of results output by our final DeepSSP model.

IV-F Discussion

Fig. 8 shows some representative results output by our final DeepSSP model. The parsing results are mostly satisfactory, which demonstrates the effectiveness of our method for part-level semantic sketch parsing. Furthermore, some failure cases are shown in Fig. 9. The first figure shows our result for the category “airplane”, which gives a wrong prediction at the position of the cockpit window. The failure in the second result, from the category “bicycle”, lies in the bicycle frame, which is normally labeled as a hollow part; however, the bicycle frame sometimes has a filled annotation, as shown in Fig. 8. These two kinds of failure are mostly caused by ambiguous part-level annotations. The third case, from the category “bird”, mixes up the positions of the part classes “head” and “tail”. If extra information on the spatial relationship between part classes were provided, a better parsing result could be obtained. The last one, from the category “cow”, misclassifies the background pixels between the legs. A finer dense prediction is required to solve this problem, which is left to future work.

Fig. 9: Some failure cases of our method, taken from categories of airplane, bicycle, bird, and cow.

V Conclusion

Our novel DeepSSP framework re-purposes networks designed for real image segmentation for part-level semantic parsing of freehand sketches by integrating the homogeneous transformation, soft-weighted loss, and staged learning. We propose the homogeneous transformation to address the semantic gap between the domains of the real image and the freehand sketch. To handle the ambiguous label boundary and class imbalance, we reshape the standard cross entropy loss into the soft-weighted loss, which provides better guidance for model training. Furthermore, we present a staged learning strategy that takes advantage of the information shared across categories and the specific characteristics of each sketch category. Extensive experimental results prove the practical value of our methods and show that our final DeepSSP achieves the state-of-the-art on the public SketchParse dataset.

References

  • [1] V. Badrinarayanan, A. Kendall, and R. Cipolla (2017) SegNet: a deep convolutional encoder-decoder architecture for image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (12), pp. 2481–2495. Cited by: §II-A.
  • [2] A. Bergamo and L. Torresani (2010) Exploiting weakly-labeled web images to improve object classification: a domain adaptation approach. In Advances in Neural Information Processing Systems, pp. 181–189. Cited by: §II-C.
  • [3] S. D. Bhattacharjee, J. Yuan, Y. Huang, J. Meng, and L. Duan (2018) Query adaptive multi-view object instance search and localization using sketches. IEEE Transactions on Multimedia 20 (10), pp. 2761–2773. Cited by: §I.
  • [4] L. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille (2018) Deeplab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE Transactions on Pattern Analysis and Machine Intelligence 40 (4), pp. 834–848. Cited by: §II-A, §IV-B.
  • [5] T. Chen, M. Cheng, P. Tan, A. Shamir, and S. Hu (2009) Sketch2photo: internet image montage. In ACM Transactions on Graphics, Vol. 28, pp. 124. Cited by: §I.
  • [6] T. Chen, P. Tan, L. Ma, M. Cheng, A. Shamir, and S. Hu (2013) Poseshop: human image database construction and personalized content synthesis. IEEE Transactions on Visualization and Computer Graphics 19 (5), pp. 824–837. Cited by: §I.
  • [7] X. Chen, R. Mottaghi, X. Liu, S. Fidler, R. Urtasun, and A. Yuille (2014) Detect what you can: detecting and representing objects using holistic models and body parts. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 1971–1978. Cited by: §IV-A.
  • [8] J. Choi, H. Cho, J. Song, and M. Y. Sang (2019) SketchHelper: real-time stroke guidance for freehand sketch retrieval. IEEE Transactions on Multimedia. Cited by: §I.
  • [9] J. Dong, Q. Chen, X. Shen, J. Yang, and S. Yan (2014) Towards unified human parsing and pose estimation. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 843–850. Cited by: §II-A.
  • [10] M. Eitz, J. Hays, and M. Alexa (2012) How do humans sketch objects?. ACM Transactions on Graphics 31 (4), pp. 44:1–44:10. Cited by: §IV-A.
  • [11] A. Farhadi, I. Endres, and D. Hoiem (2010) Attribute-centric recognition for cross-category generalization. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 2352–2359. Cited by: §IV-A.
  • [12] A. Garcia-Garcia, S. Orts-Escolano, S. Oprea, V. Villena-Martinez, and J. Garcia-Rodriguez (2017) A review on deep learning techniques applied to semantic segmentation. arXiv:1704.06857. Cited by: §II-A.
  • [13] K. Gong, X. Liang, D. Zhang, X. Shen, and L. Lin (2017) Look into person: self-supervised structure-sensitive learning and a new benchmark for human parsing. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 6757–6765. Cited by: §II-A.
  • [14] K. He, G. Gkioxari, P. Dollár, and R. Girshick (2017) Mask r-cnn. In IEEE International Conference on Computer Vision, pp. 2980–2988. Cited by: §II-A.
  • [15] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778. Cited by: §IV-B.
  • [16] S. Hong, J. Oh, H. Lee, and B. Han (2016) Learning transferrable knowledge for semantic segmentation with deep convolutional neural network. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 3204–3212. Cited by: §II-C.
  • [17] C. Huang, Y. Li, C. Change Loy, and X. Tang (2016) Learning deep representation for imbalanced classification. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 5375–5384. Cited by: §I.
  • [18] Z. Huang, H. Fu, and R. W. Lau (2014) Data-driven segmentation and labeling of freehand sketches. ACM Transactions on Graphics 33 (6), pp. 175. Cited by: §II-B.
  • [19] R. Ji, F. Chen, L. Cao, and Y. Gao (2018) Cross-modality microblog sentiment prediction via bi-layer multimodal hypergraph learning. IEEE Transactions on Multimedia. Cited by: §I.
  • [20] B. Kang, Y. Lee, and T. Q. Nguyen (2018) Depth-adaptive deep neural network for semantic segmentation. IEEE Transactions on Multimedia 20 (9), pp. 2478–2490. Cited by: §II-A.
  • [21] A. Krizhevsky, I. Sutskever, and G. E. Hinton (2012) Imagenet classification with deep convolutional neural networks. In Neural Information Processing Systems, pp. 1097–1105. Cited by: §II-A.
  • [22] K. Li, K. Pang, J. Song, Y. Song, T. Xiang, T. M. Hospedales, H. Zhang, et al. (2018) Universal sketch perceptual grouping. In European Conference on Computer Vision, pp. 582–597. Cited by: §I, §II-B.
  • [23] Y. Li, T. M. Hospedales, Y. Song, and S. Gong (2015) Free-hand sketch recognition by multi-kernel feature learning. Computer Vision and Image Understanding 137, pp. 1–11. Cited by: §I.
  • [24] Y. Li, H. Qi, J. Dai, X. Ji, and Y. Wei (2017) Fully convolutional instance-aware semantic segmentation. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 2359–2367. Cited by: §II-A.
  • [25] X. Liang, L. Lin, W. Yang, P. Luo, J. Huang, and S. Yan (2016) Clothes co-parsing via joint image segmentation and labeling with application to clothing retrieval. IEEE Transactions on Multimedia 18 (6), pp. 1175–1186. Cited by: §II-A.
  • [26] X. Liang, X. Shen, D. Xiang, J. Feng, L. Lin, and S. Yan (2016) Semantic object parsing with local-global long short-term memory. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 3185–3193. Cited by: §II-A.
  • [27] X. Liang, C. Xu, X. Shen, J. Yang, S. Liu, J. Tang, L. Lin, and S. Yan (2015) Human parsing with contextualized convolutional neural network. In IEEE International Conference on Computer Vision, pp. 1386–1394. Cited by: §II-A.
  • [28] G. Lin, A. Milan, C. Shen, and I. Reid (2017) RefineNet: multi-path refinement networks for high-resolution semantic segmentation. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 5168–5177. Cited by: §II-A.
  • [29] T. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár (2017) Focal loss for dense object detection. In IEEE International Conference on Computer Vision, pp. 2999–3007. Cited by: §III-B.
  • [30] J. Long, E. Shelhamer, and T. Darrell (2015) Fully convolutional networks for semantic segmentation. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 3431–3440. Cited by: §II-A.
  • [31] C. Ma, Y. Liu, H. Wang, D. Teng, and G. Dai (2012) Sketch-based annotation and visualization in video authoring. IEEE Transactions on Multimedia 14 (4), pp. 1153–1165. Cited by: §I.
  • [32] H. Noh, S. Hong, and B. Han (2015) Learning deconvolution network for semantic segmentation. In IEEE International Conference on Computer Vision, pp. 1520–1528. Cited by: §II-A.
  • [33] V. M. Patel, R. Gopalan, R. Li, and R. Chellappa (2015) Visual domain adaptation: a survey of recent advances. IEEE Signal Processing Magazine 32 (3), pp. 53–69. Cited by: §II-C.
  • [34] Y. Qi, J. Guo, Y. Song, T. Xiang, H. Zhang, and Z. Tan (2015) Im2sketch: sketch generation by unconflicted perceptual grouping. Neurocomputing 165, pp. 338–349. Cited by: §I.
  • [35] Y. Qi, Y. Song, T. Xiang, H. Zhang, T. Hospedales, Y. Li, and J. Guo (2015) Making better use of edges via perceptual grouping. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 1856–1865. Cited by: §II-B.
  • [36] O. Ronneberger, P. Fischer, and T. Brox (2015) U-net: convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 234–241. Cited by: §II-A.
  • [37] S. Rota Bulo, G. Neuhold, and P. Kontschieder (2017) Loss max-pooling for semantic image segmentation. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 2126–2135. Cited by: §III-B.
  • [38] A. Rozantsev, M. Salzmann, and P. Fua (2018) Beyond sharing weights for deep domain adaptation. IEEE Transactions on Pattern Analysis and Machine Intelligence. Cited by: §II-C.
  • [39] K. Saenko, B. Kulis, M. Fritz, and T. Darrell (2010) Adapting visual category models to new domains. In European Conference on Computer Vision, pp. 213–226. Cited by: §II-C.
  • [40] P. Sangkloy, N. Burnell, C. Ham, and J. Hays (2016) The sketchy database: learning to retrieve badly drawn bunnies. ACM Transactions on Graphics 35 (4), pp. 119:1–119:12. Cited by: §II-C, §IV-A.
  • [41] R. K. Sarvadevabhatla, I. Dwivedi, A. Biswas, S. Manocha, et al. (2017) SketchParse: towards rich descriptions for poorly drawn sketches using multi-task hierarchical deep networks. In ACM International Conference on Multimedia, pp. 10–18. Cited by: §I, §I, §I, §II-B, §III-C, §IV-A, §IV-A, §IV-B, §IV-B, §IV-E, TABLE VI.
  • [42] R. G. Schneider and T. Tuytelaars (2016) Example-based sketch segmentation and labeling using crfs. ACM Transactions on Graphics 35 (5), pp. 151:1–151:9. Cited by: §I, §I, Fig. 2.
  • [43] T. Shao, W. Xu, K. Yin, J. Wang, K. Zhou, and B. Guo (2011) Discriminative sketch-based 3d model retrieval via robust shape matching. In Computer Graphics Forum, Vol. 30, pp. 2011–2020. Cited by: §I.
  • [44] A. Shrivastava, A. Gupta, and R. Girshick (2016) Training region-based object detectors with online hard example mining. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 761–769. Cited by: §I.
  • [45] K. Simonyan and A. Zisserman (2014) Very deep convolutional networks for large-scale image recognition. arXiv:1409.1556. Cited by: §II-A.
  • [46] J. Song, Y. Qian, Y. Song, T. Xiang, and T. Hospedales (2017) Deep spatial-semantic attention for fine-grained sketch-based image retrieval. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 5551–5560. Cited by: §I, §II-C, §IV-D, §IV-D, TABLE V, TABLE VII.
  • [47] Z. Sun, C. Wang, L. Zhang, and L. Zhang (2012) Free hand-drawn sketch segmentation. European Conference on Computer Vision, pp. 626–639. Cited by: §I, §II-B.
  • [48] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich (2015) Going deeper with convolutions. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–9. Cited by: §II-A.
  • [49] G. Tolias and O. Chum (2017) Asymmetric feature maps with application to sketch based retrieval. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 6185–6193. Cited by: §I.
  • [50] R. Unnikrishnan, C. Pantofaru, and M. Hebert (2007) Toward objective evaluation of image segmentation algorithms. IEEE Transactions on Pattern Analysis and Machine Intelligence (6), pp. 929–944. Cited by: §II-A.
  • [51] A. Van Opbroek, M. A. Ikram, M. W. Vernooij, and M. De Bruijne (2015) Transfer learning improves supervised image segmentation across imaging protocols. IEEE Transactions on Medical Imaging 34 (5), pp. 1018–1030. Cited by: §II-C.
  • [52] F. Wang, L. Kang, and Y. Li (2015) Sketch-based 3d shape retrieval using convolutional neural networks. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 1875–1883. Cited by: §I.
  • [53] P. Wang, X. Shen, Z. Lin, S. Cohen, B. Price, and A. L. Yuille (2015) Joint object and part segmentation using deep learned potentials. In IEEE International Conference on Computer Vision, pp. 1573–1581. Cited by: §II-A.
  • [54] S. Wang, J. Zhang, T. X. Han, and Z. Miao (2015) Sketch-based image retrieval through hypothesis-driven object boundary selection with hlr descriptor. IEEE Transactions on Multimedia 17 (7), pp. 1045–1057. Cited by: §I.
  • [55] X. Wang, A. Shrivastava, and A. Gupta (2017) A-fast-rcnn: hard positive generation via adversary for object detection. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 3039–3048. Cited by: §IV-B.
  • [56] K. Xu, K. Chen, H. Fu, W. Sun, and S. Hu (2013) Sketch2Scene: sketch-based co-retrieval and co-placement of 3d models. ACM Transactions on Graphics 32 (4), pp. 123. Cited by: §I.
  • [57] Z. Xu, S. Huang, Y. Zhang, and D. Tao (2018) Webly-supervised fine-grained visual categorization via deep domain adaptation. IEEE Transactions on Pattern Analysis and Machine Intelligence 40 (5), pp. 1100–1113. Cited by: §II-C.
  • [58] K. Yamaguchi, M. H. Kiapour, L. E. Ortiz, and T. L. Berg (2012) Parsing clothing in fashion photographs. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 3570–3577. Cited by: §II-A.
  • [59] Q. Yu, F. Liu, Y. Song, T. Xiang, T. M. Hospedales, and C. Loy (2016) Sketch me that shoe. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 799–807. Cited by: §II-C, §IV-D, §IV-D, TABLE VII.
  • [60] Q. Yu, Y. Yang, F. Liu, Y. Song, T. Xiang, and T. M. Hospedales (2017) Sketch-a-net: a deep neural network that beats humans. International Journal of Computer Vision 122 (3), pp. 411–425. Cited by: §I.
  • [61] H. Zhang, S. Liu, C. Zhang, W. Ren, R. Wang, and X. Cao (2016) Sketchnet: sketch classification with web images. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 1105–1113. Cited by: §I.
  • [62] Y. Zhang, X. Qian, X. Tan, J. Han, and Y. Tang (2016) Sketch-based image retrieval by salient contour reinforcement. IEEE Transactions on Multimedia 18 (8), pp. 1604–1615. Cited by: §I.
  • [63] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia (2017) Pyramid scene parsing network. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 2881–2890. Cited by: §II-A.
  • [64] S. Zhao, H. Yao, Y. Gao, R. Ji, and G. Ding (2017) Continuous probability distribution prediction of image emotions via multitask shared sparse regression. IEEE Transactions on Multimedia 19 (3), pp. 632–645. Cited by: §II-C.
  • [65] Z. Zhong, L. Zheng, G. Kang, S. Li, and Y. Yang (2017) Random erasing data augmentation. arXiv:1708.04896. Cited by: §IV-B.
  • [66] C. Zou, Q. Yu, R. Du, H. Mo, Y. Song, T. Xiang, C. Gao, B. Chen, and H. Zhang (2018) SketchyScene: richly-annotated scene sketches. In European Conference on Computer Vision, pp. 438–454. Cited by: Fig. 2, §II-B.