Sketching is a ubiquitous means of communication. It consists of fingertip-drawn strokes and gestures forming visual images, and it expresses ideas across cultures and language barriers. With the wide adoption of touch-screen technology, sketching has found valuable applications, including image retrieval (sketch-based image retrieval, fine-grained retrieval), 3D modeling, and shape retrieval. The above technologies focus on category labeling, computing the similarity between an input sketch and existing classes. In this paper, we parse sketches based on the correspondence between strokes and the different parts of an object, and furthermore aim to understand the meaning of parts and poses in sketches.
The high-level ambiguity of sketches makes segmentation a hard task. Unlike photos, which have rich color, detailed objects, and high contrast between foreground and background, sketches consist of sparse lines and large blank areas, usually in black and white. Moreover, different people have different drawing styles and depict objects in their own ways, so sketches of the same object can present varied appearances. Some people use subtle shading with multiple lines to make objects appear three-dimensional; others apply fewer strokes, leaving sketches with "open" boundary parts. Nib shake and stroke distortion during drawing also produce noticeable differences in detail. These distortions pose a challenge for segmentation methods.
To handle the above issues, we propose a Sketch-target deep FCN Segmentation Network (SFSegNet) for semantic component labeling of free-hand human sketches. The architecture adopts two major constructions, shown in Fig. 1:
A sketch-targeted deep fully convolutional network obtained by fine-tuning (Section III-A). Although the proposed model is still essentially an FCN, there are a number of crucial differences. First, we adopt the state-of-the-art classification network ResNet34 for the segmentation task. Second, given the sparsity of lines in sketches, we use a reweighting strategy to avoid the stroke-blank (foreground-background) class imbalance.
As for the dataset, we introduce an instance-level sketch segmentation dataset extended from Huang's benchmark, consisting of 10,000 annotated sketches collected from both experts and non-experts, who depicted objects in ten categories after observing photos or simply from imagination. To evaluate the performance of the proposed network architecture, we compare it with the state-of-the-art image segmentation methods FCN, LinkNet-34, and U-Net, and the sketch segmentation methods of Huang et al. and CRF-based labeling, on both the introduced dataset and Huang's benchmark.
Section 2 reviews related work on image segmentation and recent approaches to sketch segmentation. Section 3 introduces the architecture of the proposed SFSegNet and explains the effect of the in-network module combinations. Section 4 describes the experimental framework, including our dataset, and presents the results we achieved. Finally, Section 5 concludes the paper.
II Related Work
In this section, we discuss prior work on segmentation approaches for photos, scenes, and sketches. All of these approaches assign per-pixel predictions of object categories to a given image.
II-A Image Segmentation
Recent state-of-the-art methods for semantic segmentation build on the rapid development of Convolutional Neural Networks (CNNs), typically on the Fully Convolutional Network (FCN) framework. An FCN transforms a classification CNN, e.g., AlexNet, VGG, or GoogLeNet, into a pixel-wise predictor with multiscale upsampling to tackle the semantic segmentation task.
To address the resolution loss associated with downsampling, the dilated convolution strategy was proposed. This strategy combines multiscale convolution results to produce dense predictions from pretrained networks, but it lacks global scene-category clues. Inspired by dilated convolution, PSPNet adopts spatial pyramid pooling, which pools features at multiple scales and concatenates them after the convolution layers. DeepLab presents Atrous Spatial Pyramid Pooling, which adopts dilated convolutions with large rates. These approaches embed difficult scene context features and reduce model complexity via the hole algorithm.
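As a minimal illustration (not taken from the paper), a dilated convolution enlarges the receptive field while keeping the output resolution, unlike striding or pooling; the channel sizes below are arbitrary.

```python
import torch
import torch.nn as nn

# Dilated (atrous) convolution: with dilation d, a 3x3 kernel covers a
# (2d+1) x (2d+1) window, enlarging the receptive field without pooling.
x = torch.randn(1, 16, 32, 32)
conv_dense = nn.Conv2d(16, 16, kernel_size=3, padding=1)               # dilation 1
conv_atrous = nn.Conv2d(16, 16, kernel_size=3, padding=2, dilation=2)  # dilation 2

# Both preserve the spatial resolution of the input:
assert conv_dense(x).shape == conv_atrous(x).shape == x.shape
```

Stacking such layers with growing dilation rates yields exponentially growing receptive fields at constant resolution, which is what makes dense prediction from a pretrained classifier feasible.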
Furthermore, at the network-architecture level, U-Net was proposed to propagate context information through a large number of feature channels in its upsampling path. U-Net uses skip connections to combine low-level feature maps with higher-level ones, enabling precise pixel-level localization.
Aiming at real-time applications, LinkNet uses light encoders to achieve fast segmentation. This method links each encoder with a decoder and bypasses the input of each encoder layer to the output of its corresponding decoder, in order to recover spatial information lost during downsampling. Shvets et al. improve LinkNet, in a variant named LinkNet-34, by using a pre-trained ResNet-type encoder to gain high-efficiency performance. To obtain precise segmentation boundaries, researchers have also cascaded neural networks with post-processing steps such as the Conditional Random Field (CRF).
II-B Sketch Segmentation and Labeling
Several approaches to sketch segmentation have been suggested in the last decade; they can roughly be classified into stroke-based and object-based methods.
Stroke-based methods focus on individual strokes and partition a sketch by classifying which basic geometric primitives the strokes belong to, such as straight lines, circles, and arcs. Sezgin et al. extract basic stroke information about drawing direction and speed, assuming that speed extrema combined with high curvature typically correspond to segmentation points. Kim et al. use curvature as the main criterion during segmentation; their approach was intended primarily for closed curves. Pu et al. use Radial Basis Functions (RBFs) to leverage the direction and curvature of strokes.
Instead of targeting the stroke level, Sun et al. first considered the sketch segmentation problem at the object level. Their solution is based on both low-level perception and high-level knowledge: it calculates the distance between strokes to measure proximity and performs recognition against a large-scale clip-art database. Their method has several limitations, including heavy dependence on the drawing sequence and on normative lines. Huang et al. propose a data-driven approach that matches parts of 3D models to parts of sketches in a given category and performs a global optimization. To fit an input sketch to a particular 3D model, their technique needs a part-labeled 3D model repository and sketch-based shape retrieval to estimate the viewpoint and category of the input sketch. Similarly performing classification before segmentation, Schneider et al.
adopt Fisher Vectors (FVs) for sketch classification and segment sketches at points of high curvature. To account for the relations between segments, they fit the segmentation results into a CRF model that encodes these relations. Unlike their method, our model processes sketch images without needing any classification or drawing sequence, which makes data gathering convenient. Recently, Wu et al. presented a Recurrent Neural Network (RNN)-based model named SketchSegNet that translates sequences of strokes into semantic part labels. Using Sketch-RNN, they generated a 57K annotated sketch dataset from a subset of QuickDraw built by Google; the subset consists of 7 classes with about 60 sketches each. However, going from 420 human-drawn sources to 57,000 machine-generated sketches, it is hard to preserve variance with this data augmentation method. Moreover, in their dataset each class has its own ground truth, similar to Huang's. Our goal is an extensively applicable model that segments strokes semantically under one ground truth with low training cost.
In this section, we introduce the proposed network architecture of SFSegNet. The pipeline is shown in Fig. 1; it consists of the multiscale convolution, pooling, and upsampling architecture (Section III-A) and the affine transform encoder (Section III-B), trained with the reweighting strategy (Section III-C). Because of the characteristics of freehand sketches, such as sparsity, lack of ordering, and duplicated strokes, raw sketches cannot be fed to SFSegNet directly; an algorithm for sketch preprocessing is therefore described in Section III-D.
III-A Architecture of SFSegNet
The backbone of our network is ResNet34, which consists of 34 sequential layers and achieves state-of-the-art image classification performance. Each stage combines features across the hierarchy, from coarse and high-level to fine and low-level, and gathers the necessary information. After several upsampling steps, the final output of the network is a probability map of size $C \times H \times W$, indicating for each pixel the probability of belonging to each part. Here $C$ is the number of predefined parts in the segmentation, in other words the number of pixel classes, and $H$ and $W$ are the shape of the input sketch, namely its height and width.
In our network, we first decapitate ResNet34 by discarding its final average pooling layer and divide it into three stages. To each stage we append a 2-D convolution layer with $C$ channels to predict scores for each of the sketch part classes (including the white background). Each score map is followed by a deconvolution layer that bilinearly upsamples the coarse result toward a pixel-dense prediction. During training, input sketches are resized to a fixed shape in RGB color format, so a sketch yields features at three successively lower resolutions, one per stage; see Table I for details. Next, we fuse these stage results to obtain a more precise dense prediction. We append a 2x upsampling layer to the stage-3 output and sum it with the prediction computed by stage 2; we apply the same upsampling strategy to this fused result and combine it with the stage-1 output. We continue in this fashion by applying a 4x upsampling to the sum of the fused predictions. Finally, we transform the dense prediction into the sketch segmentation result.
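The stage-wise score fusion can be sketched as follows. To stay self-contained, toy convolution layers stand in for the three ResNet34 stages, so each stage halves the resolution and the final upsampling is 2x rather than the paper's 4x; the channel widths are illustrative, and only the class count of 25 comes from the dataset description.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class StageFusionHead(nn.Module):
    """FCN-style fusion of three backbone stages into a dense C-class score map.

    The paper's backbone is ResNet34 split into three stages; here toy conv
    stages (each halving the resolution) stand in so the fusion logic runs
    on its own.
    """
    def __init__(self, num_classes=25):
        super().__init__()
        # stand-in stages (hypothetical channel widths)
        self.stage1 = nn.Sequential(nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU())
        self.stage2 = nn.Sequential(nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU())
        self.stage3 = nn.Sequential(nn.Conv2d(128, 256, 3, stride=2, padding=1), nn.ReLU())
        # per-stage 1x1 score heads predicting num_classes channels
        self.score1 = nn.Conv2d(64, num_classes, 1)
        self.score2 = nn.Conv2d(128, num_classes, 1)
        self.score3 = nn.Conv2d(256, num_classes, 1)

    def forward(self, x):
        f1 = self.stage1(x)
        f2 = self.stage2(f1)
        f3 = self.stage3(f2)
        # 2x upsample stage-3 scores, sum with stage-2 scores
        s = self.score2(f2) + F.interpolate(self.score3(f3), scale_factor=2,
                                            mode="bilinear", align_corners=False)
        # 2x upsample again, sum with stage-1 scores
        s = self.score1(f1) + F.interpolate(s, scale_factor=2,
                                            mode="bilinear", align_corners=False)
        # upsample back to the input resolution (2x here, 4x in the paper)
        return F.interpolate(s, scale_factor=2, mode="bilinear", align_corners=False)
```

Applied to a 256x256 RGB input, the head returns a 25-channel score map at the same resolution; taking an argmax over the channel dimension gives the per-pixel part labels.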
[Table I: ResNet34 stage configuration, listing stage name, output size, and layers; the network begins with a 7x7, 64-channel, stride-2 convolution followed by 3x3 max pooling with stride 2.]
III-B Affine Transform Encoder
Reproducing the same sketch exactly is known to be difficult: stroke jitter causes slight differences between depictions of the same object, which negatively affects sketch segmentation. At the stroke level, diverse stroke trends produce varied local feature representations; at the part level, rotation of components increases global feature differences. As shown in Fig. 2, the topic of these sketches is "bicycle", with 6 parts. Focusing on the part label "body", strokes within the small receptive field of a convolution can be affine-transformed to gain spatial invariance and obtain better segmentation results. Moreover, for high-resolution sketches, a receptive field mostly contains only one part of a same-category stroke. One receptive field thus corresponds to one stroke, which preserves more structural information.
Following this idea, we employ an affine transform encoder that generates a transformation matrix to align the output feature maps extracted at resolutions from the local level to the global level. The encoder is a mini Spatial Transformer Network (STN) that enables the network to correct hand-drawn deviation. Our affine transform encoder has only one convolution layer in its localization network, unlike the original STN, which has two; the affine transformation applied to the sampled output feature map still works the same way.
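A minimal version of such a mini-STN can be sketched as follows; only the one-conv localization network comes from the text, while the channel sizes and pooled grid size are assumptions. Initializing the regressor to the identity transform is a common STN practice, not a detail from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AffineTransformEncoder(nn.Module):
    """Mini STN: a one-conv localization net regresses a 2x3 affine matrix
    that resamples the feature map, canonicalizing stroke jitter.
    Channel sizes are illustrative, not the paper's exact values."""
    def __init__(self, channels):
        super().__init__()
        self.loc = nn.Sequential(
            nn.Conv2d(channels, 8, kernel_size=7, padding=3),  # single conv layer
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(4),  # fixed-size summary regardless of input size
        )
        self.fc = nn.Linear(8 * 4 * 4, 6)
        # initialize to the identity transform so training starts benign
        self.fc.weight.data.zero_()
        self.fc.bias.data.copy_(torch.tensor([1, 0, 0, 0, 1, 0], dtype=torch.float))

    def forward(self, x):
        theta = self.fc(self.loc(x).flatten(1)).view(-1, 2, 3)
        grid = F.affine_grid(theta, x.size(), align_corners=False)
        return F.grid_sample(x, grid, align_corners=False)
```

With the identity initialization the encoder initially passes features through unchanged, and gradients then push the predicted matrix toward transforms that undo rotations and distortions of the strokes.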
III-C Reweighting Strategy

The three stages extracted from ResNet34 are pre-trained on ImageNet before fine-tuning on the sketch dataset. We use the cross-entropy function as the loss for training the deep model. Given $p$ as a discrete probability distribution over classes and $y$ as the one-hot encoding of the correct class of the input, the cross-entropy loss is defined as:

$L = -\sum_{i=1}^{C} y_i \log(p_i)$,

where $C$ is the number of classes. Sketches, however, have particular characteristics: a sketch contains a few simple curves and mostly blank space, and strokes occupy less than 1% of the area according to statistics on our dataset. In this paper, we treat the blank area, namely the background of the sketch, as one of the components to be segmented. About 99% of the pixels, all of the same color (R:255, G:255, B:255), would be classified as "background", and the rest into about 3 to 4 categories. This imbalance makes the segmentation model likely to classify all pixels as "background", producing an almost all-white result. We therefore have a strong reason to reweight the "background" class before training. The reweighted loss can be described as:

$L = -\sum_{i=1}^{C} w_i \, y_i \log(p_i)$.
During training, we set the weight $w_i$ of "background" to 0 and of all other classes to 1, ignoring blank pixels in the loss computation. The next section shows the results achieved with this reweighting strategy.
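In PyTorch, this reweighting is directly supported by the `weight` argument of `nn.CrossEntropyLoss`; the class count of 25 follows the dataset description, while treating "background" as class index 0 is our assumption.

```python
import torch
import torch.nn as nn

# Reweighted cross-entropy: the background class (assumed index 0 here) gets
# weight 0, so the ~99% blank pixels never dominate the gradient.
num_classes = 25
weights = torch.ones(num_classes)
weights[0] = 0.0  # "background" contributes nothing to the loss
criterion = nn.CrossEntropyLoss(weight=weights)

logits = torch.randn(4, num_classes, 64, 64)          # N x C x H x W score map
labels = torch.randint(0, num_classes, (4, 64, 64))   # per-pixel ground truth
loss = criterion(logits, labels)
```

With mean reduction, a zero weight is equivalent to passing `ignore_index=0`: both average the loss over non-background pixels only.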
III-D Sketch Preprocessing

To arrange the sketch as a normalized input for training, we centralize and recolor the raw data. We first use a bounding box to enclose the sketch and resize it to a random scale, then pad the resized sketch to a fixed size, centered. To avoid artifacts from the interpolation algorithm used during scaling, we erode the strokes to a width of 1 pixel and recolor each pixel with its correct label.
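The cropping and centering steps can be sketched in NumPy as follows; the array convention (background 0, part labels greater than 0) and the function names are illustrative, and the random resizing and 1-pixel erosion steps are omitted.

```python
import numpy as np

def crop_to_bbox(sketch):
    """Crop to the tight bounding box of non-background (nonzero) pixels."""
    ys, xs = np.nonzero(sketch)
    return sketch[ys.min():ys.max() + 1, xs.min():xs.max() + 1]

def center_pad(sketch, out_size):
    """Pad an H x W label map to out_size x out_size, centered.
    Assumes background is 0 and H, W <= out_size."""
    h, w = sketch.shape
    canvas = np.zeros((out_size, out_size), dtype=sketch.dtype)
    top, left = (out_size - h) // 2, (out_size - w) // 2
    canvas[top:top + h, left:left + w] = sketch
    return canvas
```

For example, a sketch cropped to its bounding box and padded to the training resolution keeps every labeled stroke pixel while normalizing its position on the canvas.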
We evaluate our network on two datasets. The results on both show that the proposed method is well suited to sketch-targeted segmentation; detailed implementations are described as follows.
We first introduce the component-labeled sketch dataset built by Huang et al., which contains 10 classes and 300 sketches drawn by 3 users (i.e., 10 per class from each user). Their examples were created after a quick glance at a natural image and are thus much more realistic than typical sketches, which tend to be imaginary (see Fig. 3). Huang's dataset aims for each sketch to have as many labeled parts as possible, but each category has an independent set of ground truths. The scarcity of annotated sketches for training and the separate ground truths per category make Huang's dataset unsuitable for deep learning.
Following Huang et al., we build a large-scale dataset consisting of 10,000 sketches and 25 components (including background) under one ground truth for all sketches. We chose 10 familiar classes that are easy to imagine: Airplane (6 components), Bicycle (5 components), Candelabra (4 components), Chair (3 components), Fourleg (4 components), Human (4 components), Lamp (3 components), Rifle (4 components), Table (3 components), and Vase (4 components). Labeled examples and the components' tags in RGB space are shown in Fig. 5. For each class, there are 1,000 sketches drawn by 10 volunteers, half of whom are experienced artists. We asked all volunteers to draw each sketch on a digital tablet within 1 minute. The content of each sketch was thought up immediately upon receiving a topic, to stay close to natural representations. Although a ceiling on the number of components was set, volunteers could decide how many components to include in a sketch.
IV-B Implementation Details
Our model is implemented in PyTorch on a PC with a single NVIDIA 1080 Ti, an i5-7400 3 GHz CPU, and 16 GB RAM. We divide our dataset into two subsets: 75% for training and 25% for testing. Decoder weights are randomly initialized, and encoder weights are initialized from ResNet34 pre-trained on ImageNet. The initial learning rate is set to 0.001 and the mini-batch size to 5. During training, we use stochastic gradient descent with a momentum of 0.9 and a polynomial decay policy. For the deep learning baselines, FCN, LinkNet-34, and U-Net, we adopt their default training parameters and also apply the reweighting strategy described in Section III-C. All models are trained within 50 iterations.
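Under these settings, the optimizer can be configured as below. Interpreting the polynomial decay as a learning-rate schedule, the `power` value, and the stand-in module are our assumptions, not the paper's exact code.

```python
import torch

# Stand-in module (the real model is SFSegNet); optimizer settings follow
# the text: SGD, initial lr 0.001, momentum 0.9.
model = torch.nn.Conv2d(3, 25, kernel_size=1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9)

# "Poly" schedule: lr multiplier (1 - iter/max_iters)^power. power = 0.9 is
# a common choice and an assumption here, not stated in the paper.
max_iters, power = 50, 0.9
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lambda it: max(0.0, 1.0 - it / max_iters) ** power)

for it in range(10):   # training-loop sketch; forward/backward omitted
    optimizer.step()
    scheduler.step()
```

After each scheduler step, the learning rate shrinks smoothly toward zero at `max_iters`, which matches the 50-iteration training budget reported above.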
Unlike in image segmentation, pixels in a sketch are pre-classified into two categories, strokes and background, so it is inappropriate to use image segmentation evaluation metrics such as IoU (Intersection over Union) or AP (Average Precision). To evaluate segmentation performance for sketches, we adopt two accuracy metrics following Huang et al.: 1) pixel-based accuracy (P-metric), the number of pixels with correct labels divided by the total number of pixels; 2) component-based accuracy (C-metric), the ratio of the number of components with correct labels to the total number of components, where a component is correctly labeled if at least 75% of its pixels are correct.
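The two metrics can be sketched in NumPy as follows; excluding background pixels from the P-metric (consistent with the reweighting strategy) and representing components as boolean pixel masks are our assumptions about the exact protocol.

```python
import numpy as np

def p_metric(pred, gt, background=0):
    """Pixel accuracy over stroke pixels (background excluded, by assumption)."""
    mask = gt != background
    return (pred[mask] == gt[mask]).mean()

def c_metric(pred, gt, components, threshold=0.75):
    """Fraction of components whose pixels are >= 75% correctly labeled.
    `components` is a list of boolean masks, one per component."""
    correct = 0
    for comp_mask in components:
        if (pred[comp_mask] == gt[comp_mask]).mean() >= threshold:
            correct += 1
    return correct / len(components)
```

On a toy 2x2 example with one mislabeled pixel, the P-metric is 0.75, while the C-metric counts the fully correct component but not the half-correct one.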
IV-D Results and Discussion
Experiments on Huang's Dataset. We train our network on our dataset and test it on Huang's. Note that Huang's dataset has ten sets of ground truths, one per category, differing from our configuration (one ground truth for all categories); we remove and combine components to apply the same settings. Because the number of components is inconsistent, evaluating on the C-metric would be unfair, so we evaluate on the P-metric only. In addition, some components were annotated by mistake; we relabel them to ensure all sketches are labeled correctly. The same preprocessing is applied before testing. These details are explained in the appendix.
Table II shows that our method outperforms Huang's method and performs similarly to the CRF model while requiring far less test time (processing 1 to 2 sketches per second). However, on certain categories, SFSegNet is about 20% below the CRF. The most likely cause is that the CRF has categorization information for each sketch; without this prior knowledge, classifying strokes by relying on grouping information alone is difficult. For example, some instances in the class "Candelabra" have more than one candle, placed far apart from each other, so local feature representations capture little correlation between them. As shown in Fig. 3, the lamp-like handle leads the network to recognize the object as a "lamp", even though a candelabra could not contain a lamp.
Experiments on Our Dataset. We report the comparative performance of SFSegNet against FCN, LinkNet-34, and U-Net as baselines, chosen for their successful application in semantic segmentation, and also discuss the effect of the affine transform encoder. To prevent the background label from biasing the other labels during training, all models use the reweighted loss. Segmentation results are shown in Fig. 5.
IV-D1 Reweighting Strategy
Quantitative results, including segmentation accuracy on the P-metric and the loss during training, are shown in Fig. 6. We observe from Fig. 6(b) that when the reweighting strategy is deactivated, the loss of each network decreases, yet the accuracy rises only slightly and eventually plateaus at a low level: suffering from the heavy class imbalance, the model predicts all pixels as one component, which produces meaningless results. Fig. 6(a) indicates that the reweighting strategy solves this problem and speeds up fitting; each network reaches high accuracy within about 10 epochs.
The quantitative results of the proposed network and its competitors are presented in Tables III and IV, which report the average labeling accuracy for each class. Our model performs best on each metric. On the P-metric, the average accuracy is 2.9% higher than FCN-8s, 3.5% higher than FCN-16s, 3.5% higher than FCN-32s, 1.5% higher than LinkNet, and 8.0% higher than U-Net. On the C-metric, the average accuracy is 2.9% higher than FCN-8s, 3.8% higher than FCN-16s, 4.6% higher than FCN-32s, 12.1% higher than LinkNet, and 1.0% higher than U-Net.
IV-D3 Affine Transform Encoder
We remove all affine transform encoders from SFSegNet to measure their effect; Table V shows the comparison. Even without the encoders, the three stages of SFSegNet learn stroke structural features from the layer hierarchy and achieve good segmentation results, better than the baselines. However, we note that some components composed of straight strokes were labeled with more than two categories. Most likely, shaky strokes add noise to the convolutional features, causing the probability map to predict more than one part for a single component. With spatial invariance during convolution, stroke features are canonicalized and our model achieves better segmentation results.
In this paper, we proposed a sketch-targeted deep network named SFSegNet. We observed the class imbalance between blank labels and component labels; by using a reweighting strategy during training, background pixels are ignored and part-wise structural information is well preserved. To counteract the disturbance caused by shaky strokes, we apply an affine transform encoder that provides spatial invariance during convolution and yields more robust features. Essentially, it learns from the structural information of the drawn strokes, allowing the fully convolutional decoder to produce better segmentation results. Experimental results validate the effectiveness of the proposed method.
-  A. Chaurasia and E. Culurciello, “Linknet: Exploiting encoder representations for efficient semantic segmentation,” in Visual Communications and Image Processing (VCIP), 2017 IEEE. IEEE, 2017, pp. 1–4.
-  L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille, “Semantic image segmentation with deep convolutional nets and fully connected crfs,” arXiv preprint arXiv:1412.7062, 2014.
-  ——, “Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs,” IEEE transactions on pattern analysis and machine intelligence, vol. 40, no. 4, pp. 834–848, 2018.
-  F. Cordier, K. Singh, E. Etem, M.-P. Cani, and Y. Gingold, “Sketch-based modeling,” in Proceedings of the 37th Annual Conference of the European Association for Computer Graphics: Tutorials. Eurographics Association, 2016, p. 7.
-  M. Eitz, K. Hildebrand, T. Boubekeur, and M. Alexa, “Sketch-based image retrieval: Benchmark and bag-of-features descriptors,” IEEE transactions on visualization and computer graphics, vol. 17, no. 11, pp. 1624–1636, 2011.
-  D. Ha and D. Eck, “A neural representation of sketch drawings,” arXiv preprint arXiv:1704.03477, 2017.
-  K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
-  Z. Huang, H. Fu, and R. W. Lau, “Data-driven segmentation and labeling of freehand sketches,” ACM Transactions on Graphics (TOG), vol. 33, no. 6, p. 175, 2014.
-  M. Jaderberg, K. Simonyan, A. Zisserman et al., “Spatial transformer networks,” in Advances in neural information processing systems, 2015, pp. 2017–2025.
-  D. H. Kim and M.-J. Kim, “A curvature estimation for pen input segmentation in sketch-based modeling,” Computer-Aided Design, vol. 38, no. 3, pp. 238–248, 2006.
-  A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in neural information processing systems, 2012, pp. 1097–1105.
-  H. Li, H. Wu, X. He, S. Lin, R. Wang, and X. Luo, “Multi-view pairwise relationship learning for sketch based 3d shape retrieval,” in Multimedia and Expo (ICME), 2017 IEEE International Conference on. IEEE, 2017, pp. 1434–1439.
-  J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 3431–3440.
-  J. Pu and D. Gur, “Automated freehand sketch segmentation using radial basis functions,” Computer-Aided Design, vol. 41, no. 12, pp. 857–864, 2009.
-  X. Qian, X. Tan, Y. Zhang, R. Hong, and M. Wang, “Enhancing sketch-based image retrieval by re-ranking and relevance feedback,” IEEE Transactions on Image Processing, vol. 25, no. 1, pp. 195–208, 2016.
-  O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” in International Conference on Medical image computing and computer-assisted intervention. Springer, 2015, pp. 234–241.
-  J. Sánchez, F. Perronnin, T. Mensink, and J. Verbeek, “Image classification with the fisher vector: Theory and practice,” International journal of computer vision, vol. 105, no. 3, pp. 222–245, 2013.
-  R. G. Schneider and T. Tuytelaars, “Example-based sketch segmentation and labeling using crfs,” ACM Transactions on Graphics (TOG), vol. 35, no. 5, p. 151, 2016.
-  T. M. Sezgin, T. Stahovich, and R. Davis, “Sketch based interfaces: early processing for sketch understanding,” in Proceedings of the 2001 workshop on Perceptive user interfaces. ACM, 2001, pp. 1–8.
-  A. A. Shvets, A. Rakhlin, A. A. Kalinin, and V. I. Iglovikov, “Automatic instrument segmentation in robot-assisted surgery using deep learning,” in 2018 17th IEEE International Conference on Machine Learning and Applications (ICMLA). IEEE, 2018, pp. 624–628.
-  K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
-  J. Song, Q. Yu, Y.-Z. Song, T. Xiang, and T. M. Hospedales, “Deep spatial-semantic attention for fine-grained sketch-based image retrieval.” in ICCV, 2017, pp. 5552–5561.
-  Z. Sun, C. Wang, L. Zhang, and L. Zhang, “Free hand-drawn sketch segmentation,” in European Conference on Computer Vision. Springer, 2012, pp. 626–639.
-  C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 1–9.
-  F. Wang, S. Lin, H. Li, H. Wu, J. Jiang, R. Wang, and X. Luo, “Multi-column point-cnn for sketch segmentation,” arXiv preprint arXiv:1812.11029, 2018.
-  F. Wang, S. Lin, H. Wu, R. Wang, and X. Luo, “Data-driven method for sketch-based 3d shape retrieval based on user similar draw-style recommendation,” in SIGGRAPH ASIA 2016 Posters. ACM, 2016, p. 34.
-  X. Wu, Y. Qi, J. Liu, and J. Yang, “Sketchsegnet: A rnn model for labeling sketch strokes,” in 2018 IEEE 28th International Workshop on Machine Learning for Signal Processing (MLSP). IEEE, 2018, pp. 1–6.
-  F. Yu and V. Koltun, “Multi-scale context aggregation by dilated convolutions,” arXiv preprint arXiv:1511.07122, 2015.
-  H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia, “Pyramid scene parsing network,” in IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 2881–2890.