Freehand sketch analysis is an important research topic in the multimedia community, especially for applications of content-based retrieval and cross-media computing. As the freehand sketch has a strong ability to represent objects and scenes abstractly, it has gained great interest from researchers over the past decade. Most sketch-related works focus on the tasks of sketch-based image and 3D retrieval, sketch parsing and recognition, and conversion between real images and sketches. In this paper, we aim to explore the problem of part-level freehand sketch parsing. An in-depth understanding and solution of this problem can facilitate the development of sketch-related applications, such as sketch captioning, drawing assessment, and sketch-based image retrieval.
The existing works on freehand sketch parsing mainly focus on stroke-level labeling, which groups strokes or line segments into semantically meaningful object parts. This kind of labeling is largely different from the semantic parsing of real images: the former only needs to assign a semantic label to each pixel on the sketch strokes, while the latter requires complete labeling of every pixel in the real image. For this reason, many existing methods for real image parsing cannot be directly applied to stroke-level labeling. Part-level parsing also takes freehand sketches as input but conducts complete pixel-level labeling like real image parsing, and can thus be considered an intermediate form between stroke-level labeling and real image parsing. Benefiting from this, it becomes possible to solve the sketch parsing problem with the preeminent deep architectures designed for real image parsing.
There are three main challenges in the task of part-level semantic sketch parsing: (1) semantic gap between domains of the image and sketch, (2) ambiguous label boundary and class imbalance, (3) information sharing across different sketch categories. Next, we discuss these challenges and present our methods to overcome them in the proposed deep semantic sketch parsing (DeepSSP) framework.
The semantic gap emerges when dealing with different data domains or modalities. Sarvadevabhatla et al. first propose the task of part-level semantic sketch parsing and release the SketchParse dataset for evaluation. However, there is still no labeled sketch data available for end-to-end training of parsing models. Considering that several public datasets exist in the area of part-level real image parsing, it is quite reasonable to utilize such data to train the sketch parsing model. However, as the training data come from the domain of real images, the challenge of the semantic gap between the image and sketch domains is inevitable. To solve this problem, several existing works directly take the edge map of the real image as an approximation of the sketch. Different from these methods, we propose to transform the edge map of the real image and the freehand sketch into a homogeneous space, in which the two kinds of data share the same property. In particular, we define the “stroke thickness” as the property of the homogeneous space and convert all edge maps and sketches to 1-pixel thickness. The homogeneous transformation is very simple, but it allows us to effectively train deep networks for sketch parsing.
The second challenge comes from the inherent nature of sketch data and consists of two aspects: the ambiguous label boundary and class imbalance. The former is caused by the high abstraction of freehand sketches: a sketch needs only a small number of stroke lines to describe an object and lacks cues of color and texture, which makes the label boundary of adjacent parts ambiguous. Class imbalance refers to the variation in instance numbers across classes, which is a common problem in many fields of computer vision, such as image classification and object detection. For sketch parsing, the numbers of pixels belonging to each semantic part class are extremely diverse. Taking the category “horse” as an example, the number of pixels belonging to the part class “torso” is hundreds of times that of the class “tail”. To tackle these two problems, we propose a soft-weighted loss function that provides more effective supervision for training the deep parsing network.
The third challenge can be phrased as the question: “how can we make the best use of the information shared among different sketch categories to learn a better sketch parsing model?” An alternative solution for sketch parsing is to train a category-specific model for each category due to the label discrepancy. However, this solution largely limits the generalization capability and produces many independent models, which makes training and testing inconvenient. To overcome this challenge, we present a staged learning strategy to make better use of the information shared across categories. At the first stage, the training data of all categories are used to learn the parameters of the shared layers under a super branch architecture. At the next stage, we freeze the shared layers and change the super branch into several category-specific branches. Then, we utilize the training data of each category to train the layers in the corresponding branch. This strategy accounts for the shared information and the category-specific characteristics at different stages, which effectively improves the parsing performance of the trained model.
Extensive experimental results on SketchParse dataset demonstrate the effectiveness of our three methods for freehand sketch parsing. In particular, as a general method for domain adaptation between the real image and sketch, we further demonstrate that our homogeneous transformation is also very effective in improving the performance of deep models on the task of fine-grained sketch-based image retrieval (FG-SBIR). Furthermore, we present an erasing-based augmentation method to enhance the training data. After incorporating the proposed methods into the deep semantic sketch parsing (DeepSSP) framework, we achieve the state-of-the-art performance on the SketchParse dataset. To illustrate the contributions of the proposed methods, we show the comparative results in Fig. 1.
The contributions of this paper are summarized as follows:
We introduce the homogeneous transformation to solve the domain adaptation problem that exists in several sketch-related fields, such as sketch parsing and sketch-based image retrieval.
We propose the soft-weighted loss function for better model training with consideration of the ambiguous label boundary and class imbalance.
We present the staged learning strategy to further enhance the parsing ability of the trained model for each sketch category.
Extensive experimental results demonstrate the practical value of our methods and our final DeepSSP model achieves a new state-of-the-art on the public SketchParse dataset.
The remaining sections are organized as follows. We first briefly review the related work in fields of the semantic image/sketch parsing and domain adaptation in Section II. We give detailed descriptions of the proposed homogeneous transformation, soft-weighted loss and staged learning in Section III. Experimental results, comprehensive analysis, implementation details, and discussions are provided in Section IV. Finally, we articulate our conclusions in Section V.
II Related Work
In this section, we first briefly review two branches of works in the field of semantic image parsing: object-level segmentation and part-level parsing. Then we move on to the representative works of semantic sketch parsing, including the stroke-level labeling and part-level parsing. As two domains of the real image and freehand sketch are involved in this paper, we also introduce some related work of domain adaptation.
II-A Semantic Image Parsing
With the advance of deep convolutional neural networks, the field of semantic segmentation has made great achievements. The first work exploring the capabilities of existing networks for semantic image segmentation was proposed by Long et al. They combine well-known CNN models for image classification (e.g., AlexNet, VGG, and GoogLeNet) with fully convolutional networks (FCN) to make dense predictions for every pixel. Following the success of FCN, many researchers have developed new network structures or filters to improve segmentation performance, such as DeconvNet, U-Net, DeepLab, PSPNet, and SegNet. These methods label each pixel of the real image with the class of the object or region it belongs to, but do not distinguish instances of the same class. To output finer results, deep frameworks like FCIS and Mask R-CNN are designed to separate different instances with the same class label. Kang et al. propose to utilize the depth map in a depth-adaptive deep neural network for semantic segmentation. We refer readers with further interest to two comprehensive surveys of semantic segmentation.
Part-level parsing. Compared to object-level segmentation, part-level parsing focuses on decomposing segmented objects into semantic components. Wang et al. propose to jointly solve object segmentation and part parsing by using two-stream fully convolutional networks (FCN) and deep learned potentials. Liang et al. design a deep local-global long short-term memory (LG-LSTM) architecture for part-level semantic parsing, which learns features in an end-to-end manner instead of employing separate post-processing steps. To generate high-resolution predictions, Lin et al. present a generic multi-path refinement network (RefineNet) that exploits features at multiple levels. Beyond these works on general objects, some methods are specifically designed for human parsing. Liang et al. integrate different kinds of context, such as the cross-layer context and cross-super-pixel neighborhood context, into a contextualized convolutional neural network (Co-CNN). Considering the human body configuration, Gong et al. propose a self-supervised structure-sensitive learning method and release a new human parsing dataset named “Look into Person (LIP)”. The biggest difference between these methods and ours is that the starting point of this paper lies in freehand sketch parsing, which faces several unique challenges, as mentioned in Section I.
II-B Semantic Sketch Parsing
Stroke-level labeling. Most of the existing works in the field of semantic sketch parsing focus on the task of stroke-level labeling, where the goal is to infer a label for every stroke or line segment of the freehand sketch. According to the target, this task can be divided into two types: scene segmentation and object labeling. The former takes a scene sketch as the input and segments all strokes of the scene into different semantic objects. The latter labels the strokes of an individual object sketch with classes that correspond to different semantic object parts. Fig. 2 (a) and (b) illustrate examples of these two types of stroke-level labeling. However, as the goal and output of this paper differ from those of these works, it remains unknown how to transfer such methods to the problem of part-level parsing for freehand sketches.
Part-level parsing. Unlike stroke-level labeling, the goal of part-level parsing is to predict class labels for every pixel instead of only the strokes, as shown in Fig. 2 (c). From the view of the output, it is similar to the part-level parsing of real images. Sarvadevabhatla et al. first propose this task and collect the SketchParse dataset for the evaluation of parsing models. They present a two-level fully convolutional network and incorporate pose prediction as an auxiliary task to provide supplementary information. To reduce the domain gap between real images and freehand sketches, they translate real images into a sketch-like form based on the edge map. However, this is a simple expedient that leaves much room for improvement. In this paper, we propose a homogeneous transformation method that is experimentally proven very effective for this problem. Furthermore, we present a soft-weighted loss function and a staged learning strategy to further improve the parsing performance.
II-C Domain Adaptation
When a model trained on source data from a specific domain is applied to target data from a different domain, the distribution variation between the two domains usually degrades the performance at testing time. To solve this problem, domain adaptation is a promising solution and has been recognized as an essential requirement. Many domain adaptation methods have proven successful in various fields of computer vision, such as image classification, object detection, and semantic image parsing. Saenko et al. propose to adapt visual category models to new domains for image recognition by learning a transformation in the feature distribution. Instead of learning features that are invariant to the domain shift, Rozantsev et al. state that explicitly modeling the shift between the two domains should be more effective. Unlike most of these works, in which the modality of the data does not actually change, the task in this paper faces a modality-level variation (real image vs. freehand sketch), which makes it even more challenging. This challenge also exists in other sketch-related fields (e.g., sketch-based image retrieval), in which existing methods usually take the edge map of the real image as a data form similar to the freehand sketch. Different from these methods, we propose a homogeneous transformation method that transforms the data of the two different domains into a homogeneous space to minimize the semantic gap.
We present a novel deep semantic sketch parsing (DeepSSP) framework for the part-level dense predictions of freehand sketches, which incorporates three new methods that solve the problems from different angles. In this section, we show the details of the proposed homogeneous transformation method, soft-weighted loss function, and staged learning strategy.
III-A Homogeneous Transformation
Before introducing the proposed homogeneous transformation method, we first take a glance at the problem of domain adaptation. Given data from a source domain for model training, the target is to make predictions on data from a different target domain. As the target domain has a different distribution from the source, it is difficult for the trained model to obtain satisfactory prediction performance. To solve this problem, methods for domain adaptation are necessary. In sketch-related fields, the two domains generally refer to the freehand sketch and the real image. One frequently used method to reduce the domain gap is to convert real images into edge maps. The data in this new edge-map domain look more like freehand sketches than the original real images, making it easier to train a better model. However, there still remains an obvious difference between the edge-map and sketch domains.
To take one step further, we define a “homogeneous space”, in which the data represent the same property regardless of their source domain. As shown in Fig. 3, the homogeneous space is transformed from the edge-map and sketch domains; we call this process the homogeneous transformation. It is possible to directly translate real images into the homogeneous space. However, considering that the edge-map domain is closer to the sketch domain than the real-image domain is, we choose the edge map as the source domain instead of the real image. When both training and prediction are conducted in the homogeneous space, the trained model can be expected to achieve higher performance.
There are two important factors in the homogeneous transformation. The first is the selection of the property shared in the homogeneous space. The second is that the transformation should minimize label-related appearance variation; otherwise, if the appearance shows a remarkable change, it may become inconsistent with the given label. In this paper, we choose the “stroke thickness” as the shared property and convert the strokes of all edge maps and freehand sketches to 1-pixel thickness. As we only change the thickness of the strokes, the appearances of the generated examples are guaranteed to remain consistent after the transformation.
In practice, we first translate an image from the edge-map or sketch domain into a binary image by a simple thresholding operation; the threshold is set empirically in the experiments. Then we adopt a morph-based skeletonization method to extract the centerline of all binary strokes. This method removes pixels on the boundaries of the binary image without allowing strokes to break apart. The remaining pixels make up the centerline, which has 1-pixel thickness. After this operation, all images from the source and target domains are transformed into the homogeneous space, in which all strokes share the same property. Finally, the training and test images are replaced with their corresponding examples in the homogeneous space. As the proposed homogeneous transformation is not restricted to part-level semantic sketch parsing, it can serve as a general method for sketch-related applications, such as sketch-based image retrieval.
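The two steps above can be sketched in NumPy. The code below is an illustrative implementation only: the threshold value and the particular thinning variant (a Zhang-Suen-style pass, one common morph-based skeletonization) are our assumptions, since the paper does not name them.

```python
import numpy as np

def binarize(img, thresh=128):
    """Threshold a grayscale image (dark strokes on a white background)
    into a binary stroke mask. The threshold value is an assumption."""
    return (img < thresh).astype(np.uint8)

def zhang_suen_thin(mask):
    """Morph-based skeletonization (Zhang-Suen thinning): iteratively
    removes boundary pixels without breaking connectivity, leaving a
    roughly 1-pixel-thick centerline."""
    img = np.pad(mask.astype(np.uint8), 1)  # border guard
    changed = True
    while changed:
        changed = False
        for step in (0, 1):
            to_del = []
            ys, xs = np.nonzero(img)
            for y, x in zip(ys, xs):
                # 8-neighbours clockwise, starting north (P2..P9)
                p = [img[y-1, x], img[y-1, x+1], img[y, x+1], img[y+1, x+1],
                     img[y+1, x], img[y+1, x-1], img[y, x-1], img[y-1, x-1]]
                b = sum(p)  # number of stroke neighbours
                # number of 0->1 transitions around the pixel
                a = sum(p[i] == 0 and p[(i + 1) % 8] == 1 for i in range(8))
                if not (2 <= b <= 6 and a == 1):
                    continue
                if step == 0:
                    cond = p[0]*p[2]*p[4] == 0 and p[2]*p[4]*p[6] == 0
                else:
                    cond = p[0]*p[2]*p[6] == 0 and p[0]*p[4]*p[6] == 0
                if cond:
                    to_del.append((y, x))
            for y, x in to_del:  # delete simultaneously per sub-iteration
                img[y, x] = 0
            if to_del:
                changed = True
    return img[1:-1, 1:-1]
```

Applying `zhang_suen_thin(binarize(img))` to both edge maps and sketches maps them into the shared 1-pixel-thickness space.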
III-B Soft-Weighted Loss
The soft-weighted loss is designed for the part-level sketch parsing scenario, in which there are ambiguous label boundaries and class imbalance between different semantic parts during training. Before introducing the soft-weighted loss, we first start from the definition of the standard cross entropy (CE) loss for each pixel,

CE(x, y) = −log( exp(x_y) / Σ_{c=1}^{C} exp(x_c) ),   (1)

where x is the input that contains the predicted scores for each class, y is the ground-truth class label, and C refers to the number of classes. The final CE loss for each prediction of the part-level semantic sketch parsing is computed by

L_CE = (1 / HW) Σ_{i=1}^{H} Σ_{j=1}^{W} CE(x_{i,j}, y_{i,j}),   (2)

which averages the losses at all positions of the prediction with the resolution of H × W.
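For concreteness, the per-pixel cross entropy and its spatial average can be written as a short NumPy sketch (the variable names are ours, not from the paper):

```python
import numpy as np

def softmax_ce(scores, label):
    """CE for one pixel: `scores` is a length-C vector of class scores,
    `label` is the ground-truth class index."""
    z = scores - scores.max()               # numerical stability
    log_probs = z - np.log(np.exp(z).sum())  # log-softmax
    return -log_probs[label]

def ce_loss(score_map, label_map):
    """Average CE over an (H, W, C) score map and an (H, W) label map."""
    h, w, _ = score_map.shape
    total = 0.0
    for i in range(h):
        for j in range(w):
            total += softmax_ce(score_map[i, j], label_map[i, j])
    return total / (h * w)
```

With uniform scores over C classes, every pixel contributes log(C), so the average equals log(C) as expected.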
We propose a soft-weighted loss to address the problems of the ambiguous label boundary and class imbalance, which reshapes the standard CE loss into the following formulation,

L_SW(x, y) = s · w_y · CE(x, y),   (3)

where s is a soft parameter used to handle the situation of the ambiguous label boundary, and w_y is a weighted parameter that re-weights the losses of different classes. By substituting Eq. (1) into Eq. (3), the soft-weighted loss can be written as,

L_SW(x, y) = −s · w_y · log( exp(x_y) / Σ_{c=1}^{C} exp(x_c) ).   (4)

Next, we present the details of these two parameters (s, w) and show their specific effects on the task of part-level semantic sketch parsing.
As a high abstraction of objects or scenes, the freehand sketch lacks lots of cues (e.g., texture and color) when compared to the real image, which frequently makes the label boundary of adjacent parts ambiguous. For example, the labels distributed over the boundary of the part class “head” and “torso” are not completely certain. The class “head” and “torso” can be seen as the soft labels for these pixels. It should be more acceptable to assign the soft label “torso” to the pixel labeled with “head” on the boundary than other labels like the “tail” and “leg”. Therefore, we introduce the soft parameter to give some tolerance to predictions that output the soft labels to boundary pixels instead of the ground truth classes.
The soft parameter for class c is computed by

s_c = N_c / Σ_{k=1}^{C−1} N_k,   (5)

in which N_c counts the number of pixels belonging to class c among the pixels adjacent to the boundary pixel, so s_c is equivalent to the percentage of class c among these adjacent pixels. For a better understanding, we present an illustration of the computation of soft parameters for a boundary pixel in Fig. 4. In practice, we only take foreground classes into consideration and set s_0 = 1, where the background class is indexed with 0. For pixels belonging to class c that are not adjacent to other parts, s_c = 1, while s_c < 1 for the other cases, which makes the soft-weighted loss evolve into

L_SW(x, y) = w_y · CE(x, y) for pixels with a clear label (s = 1), and s · w_y · CE(x, y) for boundary pixels (s < 1).   (6)
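The soft parameters can be computed from the ground-truth label map alone. The sketch below is a plausible implementation; the neighbourhood radius is an assumption, since the paper does not state how large the adjacent region is.

```python
import numpy as np

def soft_parameters(label_map, radius=1, background=0):
    """Per-pixel soft parameter: for each foreground pixel, the fraction
    of foreground pixels within `radius` that share its own label.
    Pixels not adjacent to other parts get 1.0."""
    h, w = label_map.shape
    s = np.ones((h, w), dtype=np.float64)
    for i in range(h):
        for j in range(w):
            c = label_map[i, j]
            if c == background:
                continue  # background keeps s = 1
            y0, y1 = max(0, i - radius), min(h, i + radius + 1)
            x0, x1 = max(0, j - radius), min(w, j + radius + 1)
            win = label_map[y0:y1, x0:x1]
            fg = win[win != background]        # foreground classes only
            if fg.size > 0:
                s[i, j] = (fg == c).sum() / fg.size
    return s
```

On a map where two parts meet, interior pixels keep weight 1 while pixels touching the other part get a fractional weight.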
We can see that the soft-weighted loss focuses on adjusting the loss for boundary pixels and preserves the loss for pixels with a clear label. It makes the parsing model more concentrated on reducing losses with clear errors while avoiding the disturbance brought by the ambiguous label boundary.
Class imbalance is a common problem in the field of computer vision. For the task of semantic sketch parsing, there is a great difference in the pixel distribution between different part classes. As a consequence, classes with many pixels dominate the training loss, which has a negative impact on model training. To alleviate this issue, we apply the weighted parameter w to re-weight the losses from different classes. The parameter for class c is defined as,

w_c = median(f) / f_c,   (7)

where median(f) is the median of {f_1, …, f_C}, and f_c is computed as follows,

f_c = T_c / M_c,   (8)

in which T_c is the total number of pixels belonging to class c, and M_c refers to the number of images that include class c. The value f_c can be seen as the average number of pixels of each class per image on the training set. Eq. (7) guarantees that a class with few pixels has a higher weight than classes with more pixels. Finally, the soft-weighted cross entropy loss is formulated as,

L_SWCE = (1 / HW) Σ_{i=1}^{H} Σ_{j=1}^{W} s_{i,j} · w_{y_{i,j}} · CE(x_{i,j}, y_{i,j}),   (9)

where s_{i,j} · w_{y_{i,j}} means the weight of each pixel.
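The class weights can be precomputed once over the training set. The following is a NumPy sketch of this median-frequency-style weighting under our reading of the definitions above:

```python
import numpy as np

def class_weights(label_maps, num_classes):
    """Class weights for the soft-weighted loss: f_c is the average
    number of pixels of class c per image that contains c, and
    w_c = median(f) / f_c, so rare parts (e.g. "tail") get larger
    weights than dominant ones (e.g. "torso")."""
    totals = np.zeros(num_classes)       # T_c: total pixels of class c
    images_with = np.zeros(num_classes)  # M_c: images containing class c
    for lm in label_maps:
        counts = np.bincount(lm.ravel(), minlength=num_classes)[:num_classes]
        totals += counts
        images_with += counts > 0
    f = np.full(num_classes, np.nan)
    present = images_with > 0
    f[present] = totals[present] / images_with[present]
    w = np.zeros(num_classes)            # absent classes get weight 0
    w[present] = np.nanmedian(f) / f[present]
    return w
```

Classes with fewer pixels per image end up with weights above 1, and dominant classes below 1, balancing their contributions to the loss.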
III-C Staged Learning
Given the category of a freehand sketch, an intuitive way for semantic sketch parsing is to train a network independently for each category. This is straightforward for training but neglects the information shared across different categories, and the performance is greatly limited when only a small number of training examples are available. This problem can be alleviated via a half-shared deep architecture, which splits the semantic parsing model into two parts, as shown in Fig. 5 (a). The front part consists of several shared layers, while the remaining layers are heterogeneous across 5 super branches. In each super branch, the sub-categories, such as cow and horse, have similar semantic part classes. However, this network does not consider the differences between sub-categories under the same super category.
“Original” means using the original training data. “HT (train)” trains the model with the training data after the homogeneous transformation but predicts on the test data without HT. “HT” means that both the training and the evaluation are conducted on the data with HT. For a pure evaluation of these methods, all models are trained without any data augmentation.
In consideration of both the information sharing and the specific characteristics of each sketch category, we propose a staged learning strategy to further improve the parsing performance of the trained model. As shown in Fig. 5, the strategy consists of two training stages that are independent of the backbone network. At stage 1, we use training examples from all sketch categories to learn the parameters of the shared layers under the half-shared deep architecture. For every iteration, the data flow forward from the shared layers to their corresponding branch layers. At the next stage, we freeze all shared layers and replace each super branch with several sub-branches, as shown in Fig. 5 (b). Then we only need to fine-tune the layers of the corresponding branch for each sketch category. Experimental results demonstrate the superior performance of our strategy compared to completely independent training and the super branch architecture.
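The two-stage schedule can be summarized by a small bookkeeping helper that returns which parameter groups are trainable at each stage. The group names below are illustrative, not taken from the paper's implementation:

```python
def trainable_groups(stage, categories, super_branches):
    """Parameter groups updated at each stage of the staged learning
    strategy (an illustrative sketch)."""
    if stage == 1:
        # Stage 1: shared layers plus the super branches are trained
        # jointly on data from all categories.
        return ["shared"] + ["super/" + b for b in super_branches]
    # Stage 2: shared layers are frozen; each category-specific
    # sub-branch is fine-tuned on its own category's data only.
    return ["branch/" + c for c in categories]
```

In a deep learning framework, stage 2 would correspond to setting the shared layers' parameters to be non-trainable and optimizing each sub-branch separately.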
In this section, we first introduce datasets used for the training and evaluation. Then, we give details of the experimental implementation and propose a novel erasing-based method for data augmentation. To provide a more comprehensive understanding of the proposed method, we evaluate the contributions of each component and present an ablation study via extensive experiments. Furthermore, extra experiments are conducted on the task of fine-grained sketch-based image retrieval to demonstrate the practical value of the homogeneous transformation method. Finally, we present a comparison against other methods and make discussions of some qualitative results.
Following the work of Sarvadevabhatla et al., we use data from real image datasets for network training and evaluate the performance of the trained models on the SketchParse dataset. Specifically, the training set consists of 1532 paired real images and corresponding part-level annotations, distributed across 11 categories (i.e., airplane, bicycle, bird, bus, car, cat, cow, dog, horse, motorbike, and sheep). These images and annotations are selected from two public datasets, i.e., Pascal-Part and Core.
The evaluation is conducted on the SketchParse dataset, which takes 48 freehand sketches per category from each of the Sketchy and TU-Berlin datasets. As the category “bus” only exists in the TU-Berlin dataset, there are 1008 (96 × 10 + 48) freehand sketches in total in the SketchParse dataset. All sketches are labeled with part-level dense annotations. The average IOU score is adopted to evaluate the parsing performance of the trained models.
IV-B Implementation Details
We take DeepLab v2 as the backbone network, which is a widely used architecture for semantic parsing. In the experiments, the DeepLab model is derived from a multi-scale version of ResNet-101. Similar to prior work, we split the deep model into two parts at the position of “res5b”. As shown in Fig. 5, the front part is used as the shared layers across categories and the remaining layers are copied into different branches. The final convolutional layers in each branch are assigned a separate initial learning rate from the other layers. The learning rate is decayed under the polynomial policy. Limited by GPU memory, the mini-batch size is set to 1. We adopt stochastic gradient descent (SGD) with a momentum of 0.9 as the optimizer. Furthermore, we apply 20000 iterations to learn the parameters of the shared layers at stage 1 and 2000 iterations to fine-tune the remaining layers of each branch at stage 2. All experiments are conducted on a single NVIDIA GeForce GTX 1080Ti GPU with 11GB memory.
Data augmentation is an important strategy for improving the performance of deep neural networks. Following prior work, we perform rotations of different degrees (0, 10, 20, 30) and mirroring on the original image, which yields 14 augmented images for each sketch. Furthermore, we apply an erasing-based sketch augmentation method to double the training data: for each training image, the method randomly erases a region of fixed size, and the generated image shares the ground-truth annotation with its source image.
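A minimal version of the erasing augmentation might look like the following. The square region shape and the white fill value are our assumptions, since the paper does not give the exact region dimensions:

```python
import numpy as np

def random_erase(img, size, rng):
    """Erasing-based augmentation: blank out a random square region of a
    sketch image (filled with the white background value 255). The region
    size and fill value are assumptions, not from the paper."""
    h, w = img.shape[:2]
    out = img.copy()  # keep the source image intact
    y = rng.integers(0, h - size + 1)
    x = rng.integers(0, w - size + 1)
    out[y:y + size, x:x + size] = 255
    return out
```

Each call produces one augmented copy; the part-level annotation is reused unchanged, as described above.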
IV-C Ablation Study
“Base”: training with the augmentation methods mentioned above. The compared losses are the standard cross entropy (CE) loss, the weighted CE loss, the soft-weighted CE loss, and the soft-weighted CE loss with a higher weight for the background class.
“Independent”: training independently for each category, “Full Branch”: the half-shared network with one branch for each category, “Super Branch”: the super branch architecture, “Staged Learning”: our staged learning strategy.
In the experiments, we apply the homogeneous transformation (HT) method to the training and test datasets, so that both model training and evaluation are conducted on the transformed data. We select the super branch architecture as the base network and present the results in Table I. Compared to training on the original dataset, the model with our HT method achieves better performance on 10 of the 11 sketch categories and obtains a 2.35% higher IOU score on average. The results demonstrate the effectiveness of the proposed HT method on the task of part-level semantic sketch parsing. Furthermore, we show the results of performing the HT method only on the training data while leaving the test set unchanged, shown as “HT (train)” in Table I. In this case, there is still a big gap between the training and test datasets, which leads to the worst performance. Therefore, it is important to perform the HT method on both domains simultaneously.
We compare the performance of different augmentation methods in Table II. It can be seen that the rotation-based augmentation shows the best performance among the three methods when used in isolation. By combining them, the average IOU score achieves a 2.92% gain compared to the vanilla version (from 58.37% to 61.29%). Therefore, we adopt this combination of augmentation methods for model training. We also present the performance improvements brought by the homogeneous transformation in combination with these augmentation settings. As shown in Table II, the HT method invariably outperforms training with the original dataset, which demonstrates the stability and effectiveness of the proposed HT method.
Table III shows comparative results of different loss functions on the SketchParse dataset. All models are trained with the combination of the three augmentation methods mentioned above, noted as “Base” in the table. Compared to the standard cross entropy (CE) loss and its weighted version, the model trained with the proposed soft-weighted CE loss achieves better performance. The results show the superiority of the soft-weighted CE loss for the task of part-level semantic sketch parsing. As the pixels belonging to the background class are mostly separated from the other classes by the sketch strokes, we double the weighted parameter of the background class to make the network more sensitive to the boundary between the foreground classes and the background. As shown in Table III, this new loss with the doubled background weight obtains slightly higher performance.
As mentioned in Sec. III-C, there are different kinds of deep architectures for sketch parsing. We present the comparison of our staged learning strategy against other methods in Table IV. It can be seen that training independently for each sketch category yields the worst performance. Taking advantage of the half-shared network, the models trained under the super and full branch architectures obtain higher average IOU scores, which shows the importance of information sharing. With the proposed staged learning strategy, the parsing model achieves the best performance among them. Furthermore, we evaluate the contributions of each component of our final deep semantic sketch parsing (DeepSSP) model. The baseline model is trained with the augmentation methods of rotation and mirroring, the standard cross entropy loss, and the super branch architecture. As shown in Fig. 6, these components improve the performance by varying magnitudes, which proves the practical value of our methods.
Table VI (excerpt): MM 17’ achieves per-category IOU scores of 68.78, 69.35, 69.60, 71.18, 70.81, 68.00, 67.35, 62.66, 55.04, 57.34, and 50.89, with an average of 64.45.
IV-D Homogeneous Transformation for SBIR
Domain adaptation is also a common problem in the field of sketch-based image retrieval (SBIR). To demonstrate the practical value of the proposed homogeneous transformation (HT) method, we integrate it into the training pipeline of existing SBIR networks and evaluate their performance on the QMUL FG-SBIR dataset.
The QMUL FG-SBIR dataset is constructed for the task of fine-grained instance-level SBIR. It includes three sub-datasets: shoe, chair, and handbag, in which there are 419, 297, and 568 sketch-photo pairs, respectively. The standard split of training and testing is provided by the authors and also adopted in our experiments.
|Dataset||Method||Ori||HT||Gain|
|Shoe||Triplet SN||52.17||58.26||+6.09|
|Chair||Triplet SN||72.16||82.47||+10.31|
|Handbag||Triplet SN||39.88||42.86||+2.98|
“Ori” means training on the original dataset without HT; scores are acc.@1 (%).
We select two cutting-edge methods, Triplet SN and DSSA, as our baseline models. Considering that both methods take triplets as the input of their networks, we apply the proposed HT method to create new training sketches as the anchor samples, which preserves the same number of triplets for model training. Following the works of Triplet SN and DSSA, we use the same experimental settings and take the top-K accuracy (acc.@K) as the evaluation metric. The comparative results against the baselines on the QMUL FG-SBIR dataset (acc.@1) are shown in Table VII. We can see significant performance improvements for both baseline networks when integrated with the proposed HT method. Furthermore, we show the top-K accuracies of DSSA with and without the proposed HT method in Table V. It can be observed that the models trained with our HT method mostly perform better than the baselines. These experimental results demonstrate the effectiveness of the proposed HT method for fine-grained instance-level SBIR.
IV-E Comparative Results
Table VI compares our method against the baseline and the work of Sarvadevabhatla et al. (denoted as MM 17’). Our method achieves 5.51% higher performance than the baseline and beats MM 17’ in 10 of 11 sketch categories. In particular, we obtain 3.53% and 3.49% performance improvements on the categories “bus” and “bird”. By integrating the proposed methods, our final model sets a new state-of-the-art on the task of part-level semantic sketch parsing. Some examples of the parsing results are shown in Fig. 7. Compared to other methods, our predictions are more accurate in parts such as the wheel of the bus and the headlight of the car, which shows the superiority of our methods.
Fig. 8 shows some representative results produced by our final DeepSSP model. Our parsing results are mostly satisfactory, which demonstrates the effectiveness of our method for the task of part-level semantic sketch parsing. Furthermore, some failure cases are shown in Fig. 9. The first case, from the category “airplane”, contains a wrong prediction at the position of the cockpit window. The failure in the second case, from the category “bicycle”, lies in the bicycle frame, which is normally labeled as a hollow part; however, the frame sometimes has a filled annotation, as shown in Fig. 8. These two kinds of failures are mostly caused by ambiguous part-level annotations. The third case, from the category “bird”, mixes up the positions of the part classes “head” and “tail”. If extra information on the spatial relationship between part classes were provided, the model could output a better parsing result. The last case, from the category “cow”, misclassifies the background pixels between the legs. A finer dense prediction is required to solve this problem, which we leave to future work.
Our novel DeepSSP framework re-purposes networks designed for real image segmentation to the task of part-level semantic freehand sketch parsing by integrating the homogeneous transformation, soft-weighted loss, and staged learning. We propose the homogeneous transformation to bridge the semantic gap between the domains of real images and freehand sketches. To avoid the dilemma of ambiguous label boundaries and class imbalance, we reshape the standard cross-entropy loss into the soft-weighted loss, which provides better guidance for model training. Furthermore, we present a staged learning strategy that takes advantage of both the information shared across categories and the specific characteristics of each sketch class. Extensive experimental results prove the practical value of our method and show that our final DeepSSP achieves state-of-the-art performance on the public SketchParse dataset.
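The soft-weighted loss is formulated earlier in the paper; to illustrate the general idea of reweighting cross entropy against class imbalance (the snippet below uses a generic inverse-frequency weighting, not our exact formulation), rare part classes can be given proportionally larger per-pixel weights:

```python
import numpy as np

def weighted_cross_entropy(probs, labels, n_classes):
    """Illustrative class-reweighted cross entropy (not the paper's exact
    soft-weighted loss).

    probs:  (num_pixels, n_classes) predicted class probabilities
    labels: (num_pixels,) ground-truth class indices
    Rare classes in `labels` receive larger inverse-frequency weights,
    so errors on small parts are penalized more heavily.
    """
    counts = np.bincount(labels, minlength=n_classes).astype(float)
    weights = counts.sum() / (n_classes * np.maximum(counts, 1.0))
    per_pixel = -np.log(probs[np.arange(len(labels)), labels] + 1e-12)
    return (weights[labels] * per_pixel).mean()
```

Under this weighting, a misclassified pixel of a rare part class contributes more to the loss than a misclassified pixel of a dominant class, which counteracts the tendency of an unweighted loss to ignore small parts.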
-  (2017) SegNet: a deep convolutional encoder-decoder architecture for image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (12), pp. 2481–2495. Cited by: §II-A.
-  (2010) Exploiting weakly-labeled web images to improve object classification: a domain adaptation approach. In Advances in Neural Information Processing Systems, pp. 181–189. Cited by: §II-C.
-  (2018) Query adaptive multi-view object instance search and localization using sketches. IEEE Transactions on Multimedia 20 (10), pp. 2761–2773. Cited by: §I.
-  (2018) Deeplab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE Transactions on Pattern Analysis and Machine Intelligence 40 (4), pp. 834–848. Cited by: §II-A, §IV-B.
-  (2009) Sketch2photo: internet image montage. In ACM Transactions on Graphics, Vol. 28, pp. 124. Cited by: §I.
-  (2013) Poseshop: human image database construction and personalized content synthesis. IEEE Transactions on Visualization and Computer Graphics 19 (5), pp. 824–837. Cited by: §I.
-  (2014) Detect what you can: detecting and representing objects using holistic models and body parts. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 1971–1978. Cited by: §IV-A.
-  (2019) SketchHelper: real-time stroke guidance for freehand sketch retrieval. IEEE Transactions on Multimedia. Cited by: §I.
-  (2014) Towards unified human parsing and pose estimation. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 843–850. Cited by: §II-A.
-  (2012) How do humans sketch objects?. ACM Transactions on Graphics 31 (4), pp. 44:1–44:10. Cited by: §IV-A.
-  (2010) Attribute-centric recognition for cross-category generalization. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 2352–2359. Cited by: §IV-A.
-  (2017) A review on deep learning techniques applied to semantic segmentation. arXiv:1704.06857. Cited by: §II-A.
-  (2017) Look into person: self-supervised structure-sensitive learning and a new benchmark for human parsing. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 6757–6765. Cited by: §II-A.
-  (2017) Mask r-cnn. In IEEE International Conference on Computer Vision, pp. 2980–2988. Cited by: §II-A.
-  (2016) Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778. Cited by: §IV-B.
-  (2016) Learning transferrable knowledge for semantic segmentation with deep convolutional neural network. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 3204–3212. Cited by: §II-C.
-  (2016) Learning deep representation for imbalanced classification. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 5375–5384. Cited by: §I.
-  (2014) Data-driven segmentation and labeling of freehand sketches. ACM Transactions on Graphics 33 (6), pp. 175. Cited by: §II-B.
-  (2018) Cross-modality microblog sentiment prediction via bi-layer multimodal hypergraph learning. IEEE Transactions on Multimedia. Cited by: §I.
-  (2018) Depth-adaptive deep neural network for semantic segmentation. IEEE Transactions on Multimedia 20 (9), pp. 2478–2490. Cited by: §II-A.
-  (2012) Imagenet classification with deep convolutional neural networks. In Neural Information Processing Systems, pp. 1097–1105. Cited by: §II-A.
-  (2018) Universal sketch perceptual grouping. In European Conference on Computer Vision, pp. 582–597. Cited by: §I, §II-B.
-  (2015) Free-hand sketch recognition by multi-kernel feature learning. Computer Vision and Image Understanding 137, pp. 1–11. Cited by: §I.
-  (2017) Fully convolutional instance-aware semantic segmentation. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 2359–2367. Cited by: §II-A.
-  (2016) Clothes co-parsing via joint image segmentation and labeling with application to clothing retrieval. IEEE Transactions on Multimedia 18 (6), pp. 1175–1186. Cited by: §II-A.
-  (2016) Semantic object parsing with local-global long short-term memory. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 3185–3193. Cited by: §II-A.
-  (2015) Human parsing with contextualized convolutional neural network. In IEEE International Conference on Computer Vision, pp. 1386–1394. Cited by: §II-A.
-  (2017) RefineNet: multi-path refinement networks for high-resolution semantic segmentation. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 5168–5177. Cited by: §II-A.
-  (2017) Focal loss for dense object detection. In IEEE International Conference on Computer Vision, pp. 2999–3007. Cited by: §III-B.
-  (2015) Fully convolutional networks for semantic segmentation. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 3431–3440. Cited by: §II-A.
-  (2012) Sketch-based annotation and visualization in video authoring. IEEE Transactions on Multimedia 14 (4), pp. 1153–1165. Cited by: §I.
-  (2015) Learning deconvolution network for semantic segmentation. In IEEE International Conference on Computer Vision, pp. 1520–1528. Cited by: §II-A.
-  (2015) Visual domain adaptation: a survey of recent advances. IEEE Signal Processing Magazine 32 (3), pp. 53–69. Cited by: §II-C.
-  (2015) Im2sketch: sketch generation by unconflicted perceptual grouping. Neurocomputing 165, pp. 338–349. Cited by: §I.
-  (2015) Making better use of edges via perceptual grouping. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 1856–1865. Cited by: §II-B.
-  (2015) U-net: convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 234–241. Cited by: §II-A.
-  (2017) Loss max-pooling for semantic image segmentation. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 2126–2135. Cited by: §III-B.
-  (2018) Beyond sharing weights for deep domain adaptation. IEEE Transactions on Pattern Analysis and Machine Intelligence. Cited by: §II-C.
-  (2010) Adapting visual category models to new domains. In European Conference on Computer Vision, pp. 213–226. Cited by: §II-C.
-  (2016) The sketchy database: learning to retrieve badly drawn bunnies. ACM Transactions on Graphics 35 (4), pp. 119:1–119:12. Cited by: §II-C, §IV-A.
-  (2017) SketchParse: towards rich descriptions for poorly drawn sketches using multi-task hierarchical deep networks. In ACM International Conference on Multimedia, pp. 10–18. Cited by: §I, §I, §I, §II-B, §III-C, §IV-A, §IV-A, §IV-B, §IV-B, §IV-E, TABLE VI.
-  (2016) Example-based sketch segmentation and labeling using crfs. ACM Transactions on Graphics 35 (5), pp. 151:1–151:9. Cited by: §I, §I, Fig. 2.
-  (2011) Discriminative sketch-based 3d model retrieval via robust shape matching. In Computer Graphics Forum, Vol. 30, pp. 2011–2020. Cited by: §I.
-  (2016) Training region-based object detectors with online hard example mining. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 761–769. Cited by: §I.
-  (2014) Very deep convolutional networks for large-scale image recognition. arXiv:1409.1556. Cited by: §II-A.
-  (2017) Deep spatial-semantic attention for fine-grained sketch-based image retrieval. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 5551–5560. Cited by: §I, §II-C, §IV-D, §IV-D, TABLE V, TABLE VII.
-  (2012) Free hand-drawn sketch segmentation. European Conference on Computer Vision, pp. 626–639. Cited by: §I, §II-B.
-  (2015) Going deeper with convolutions. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–9. Cited by: §II-A.
-  (2017) Asymmetric feature maps with application to sketch based retrieval. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 6185–6193. Cited by: §I.
-  (2007) Toward objective evaluation of image segmentation algorithms. IEEE Transactions on Pattern Analysis and Machine Intelligence (6), pp. 929–944. Cited by: §II-A.
-  (2015) Transfer learning improves supervised image segmentation across imaging protocols. IEEE Transactions on Medical Imaging 34 (5), pp. 1018–1030. Cited by: §II-C.
-  (2015) Sketch-based 3d shape retrieval using convolutional neural networks. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 1875–1883. Cited by: §I.
-  (2015) Joint object and part segmentation using deep learned potentials. In IEEE International Conference on Computer Vision, pp. 1573–1581. Cited by: §II-A.
-  (2015) Sketch-based image retrieval through hypothesis-driven object boundary selection with hlr descriptor. IEEE Transactions on Multimedia 17 (7), pp. 1045–1057. Cited by: §I.
-  (2017) A-fast-rcnn: hard positive generation via adversary for object detection. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 3039–3048. Cited by: §IV-B.
-  (2013) Sketch2Scene: sketch-based co-retrieval and co-placement of 3d models. ACM Transactions on Graphics 32 (4), pp. 123. Cited by: §I.
-  (2018) Webly-supervised fine-grained visual categorization via deep domain adaptation. IEEE Transactions on Pattern Analysis and Machine Intelligence 40 (5), pp. 1100–1113. Cited by: §II-C.
-  (2012) Parsing clothing in fashion photographs. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 3570–3577. Cited by: §II-A.
-  (2016) Sketch me that shoe. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 799–807. Cited by: §II-C, §IV-D, §IV-D, TABLE VII.
-  (2017) Sketch-a-net: a deep neural network that beats humans. International Journal of Computer Vision 122 (3), pp. 411–425. Cited by: §I.
-  (2016) Sketchnet: sketch classification with web images. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 1105–1113. Cited by: §I.
-  (2016) Sketch-based image retrieval by salient contour reinforcement. IEEE Transactions on Multimedia 18 (8), pp. 1604–1615. Cited by: §I.
-  (2017) Pyramid scene parsing network. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 2881–2890. Cited by: §II-A.
-  (2017) Continuous probability distribution prediction of image emotions via multitask shared sparse regression. IEEE Transactions on Multimedia 19 (3), pp. 632–645. Cited by: §II-C.
-  (2017) Random erasing data augmentation. arXiv:1708.04896. Cited by: §IV-B.
-  (2018) SketchyScene: richly-annotated scene sketches. In European Conference on Computer Vision, pp. 438–454. Cited by: Fig. 2, §II-B.