SketchGCN: Semantic Sketch Segmentation with Graph Convolutional Networks

03/02/2020 ∙ by Lumin Yang, et al. ∙ Zhejiang University ∙ City University of Hong Kong

We introduce SketchGCN, a graph convolutional neural network for semantic segmentation and labeling of free-hand sketches. We treat an input sketch as a 2D point set and encode the stroke structure information into graph node/edge representations. To predict the per-point labels, our SketchGCN uses graph convolution and a global-local branching network architecture to extract both intra-stroke and inter-stroke features. SketchGCN significantly improves the accuracy of the state-of-the-art methods for semantic sketch segmentation (by 11.4% in the pixel metric and 18.2% in the component metric on the large-scale, challenging SPG dataset) and has orders of magnitude fewer parameters than both image-based and sequence-based methods.


1 Introduction

Freehand sketching is becoming a common means of interaction between humans and machines, driven by the continuous evolution of digital touch devices (e.g., smartphones, tablets) and the various sketch-based interfaces built on them. However, sketch interpretation remains difficult for computers due to the inherent ambiguity and sparsity of user sketches, which are often created with varying abstraction levels, artistic forms, and drawing styles. While many previous works attempt to interpret a whole sketch (e.g., for sketch classification and sketch-based retrieval [6, 7, 39, 27]), part-level sketch analysis is increasingly required in multiple sketch applications, including sketch captioning [28], sketch generation [24, 31], sketch-based 3D modeling [38], and 3D sketch reconstruction [18]. In this article, we focus on semantic segmentation and labeling of sketched objects, an essential task in finer-level sketch analysis.

Early works on sketch segmentation use hand-crafted features with limited ability to handle the large variations of sketches [4, 8]. Many such solutions require the assistance of an interactive system [21, 22]. Later, multiple data-driven approaches [12, 29] have been proposed, but they are often computationally expensive. Recently, deep learning methods have greatly improved both the segmentation accuracy and the efficiency. They can be roughly grouped into two classes: image-based methods [28, 17, 41] and sequence-based methods [16, 26, 36]. Image-based methods treat the task as a semantic image segmentation problem and use convolutional neural networks to solve it. These methods usually ignore the structure information of strokes or use the stroke structure information only in a post-processing step [17]. In contrast, sequence-based methods treat the task as a sequence prediction problem. These methods use relative coordinates and pen actions to encode the structure information. However, sequence-based representations ignore the proximity of points (especially among different strokes), which is significant for sketch analysis according to the Gestalt laws [35].

To address the above issues with the existing solutions, we bring the graph representation [11] to the sketch domain and present a novel method based on graph convolutional networks (GCNs). We treat a sketch as a 2D point set with graphical relationships automatically built from the original stroke structure, which provides richer information than image-based methods. Unlike sequence-based methods that rely on relative coordinates, our method uses the absolute coordinates of points, thus naturally encoding point proximity. In addition, we introduce a novel stroke pooling operation to improve the consistency of labels within individual strokes.

Although a GCN-based method has the above advantages over the other methods, semantically interpreting a sketch from a graph built solely on the stroke structure is still challenging, because the constructed graph is very sparse. Schneider et al. [29] manually add relation edges (e.g., proximity and enclosing relations) to the graph before using a Conditional Random Field (CRF), which, however, is sensitive to input variations. Inspired by the dynamic edges used in 3D point cloud analysis [30, 33, 34], we use a similar technique in our network to extend the basic structure information. Since the dynamic edges may introduce wrong relationships between vertices and thus contaminate the original, correct structure, we propose a two-branch network, with one branch using the original sparse structure and the other using dynamic edges, to balance correctness and sufficiency.

Our main contributions are as follows: (1) We propose the first GCN-based method for semantic segmentation and labeling of sketched objects; (2) Our method significantly improves the accuracy over the state of the art and has orders of magnitude fewer parameters than both image-based methods and sequence-based methods.

2 Related Work

Sketch Grouping. Sketch grouping divides strokes into clusters, with each cluster corresponding to an object part. Qi et al. [23] treat this problem as a graph partition problem and group strokes by graph cut. Later, Qi et al. [25] present a grouper that utilizes multiple Gestalt principles synergistically, with a novel multi-label graph-cut algorithm. [15] and [16] use ordered strokes to represent a sketch and develop a sequence-to-sequence Variational Autoencoder (VAE) model to learn a stroke affinity matrix. Although sketch grouping can be applied across classes, it does not attempt to address the labeling problem.

Semantic Sketch Segmentation. Semantic sketch segmentation methods label sketch data into semantic groups. Huang et al. [12] formulate this problem as a mixed integer programming problem and present a data-driven solution utilizing the segmentation information in a repository of pre-segmented 3D models. Schneider and Tuytelaars [29] classify strokes based on Fisher vectors, build a graph by encoding relations between strokes, and finally use a Conditional Random Field (CRF) to solve for the most suitable label configuration. [36] builds an end-to-end Recurrent Neural Network (RNN) to translate ordered strokes into semantic parts, and its enhanced version [26] enables stroke segmentation across multiple categories. [28] introduces a two-level Convolutional Neural Network (CNN) to parse an image of sketched objects into semantic regions. [17, 41] take sketches as sketch images and adapt existing CNN-based semantic image segmentation methods for sketch segmentation.

The aforementioned deep-learning-based methods either treat input sketches as ordered point sequences [36, 26, 16] or directly as images [28, 17]. Taking sketches as images unavoidably ignores the stroke structures, while treating sketches as ordered point sequences neglects the inter-stroke proximity information, which is crucial for sketch analysis according to the Gestalt laws [35]. In contrast, our method uses a graph representation to fully exploit the stroke structure and the stroke proximity, with carefully designed convolutional operations to extract both the intra-stroke and the inter-stroke features.

Graph Convolutional Networks. Graph Convolutional Networks (GCNs) have been used in many applications, for example, for processing social networks [32], in recommendation engines [20, 40], and in natural language processing [1]. GCNs are also suitable for processing 2D and 3D point cloud data. Sketches are composed of strokes with ordered point sequences, making it possible to construct graphs based on strokes and to use GCNs for sketch segmentation. As far as we know, we are the first to apply GCNs to semantic sketch segmentation and labeling.

Graph structures in most GCNs are static. Recent studies on dynamic graph convolution show that changeable edges may perform better. For instance, the filter weights in [30] are dynamically generated for each specific input. EdgeConv in [34] dynamically computes node neighbors and constructs a new graph structure in each layer. Valsesia et al. [33] also construct node neighbors with the k-nearest neighbors (KNN) algorithm, in order to learn to generate point clouds. Since the original graphs built from the stroke structure are very sparse, it is difficult to learn effective point-level features from them alone. To better capture global and local features, we adopt a two-branch network and use both static and dynamic graph convolutions. Lately, Li et al. [14] leverage residual connections, dense connections, and dilated convolutions to alleviate the vanishing gradient and over-smoothing problems in GCNs [19, 13, 37]. Our method also exploits similar ideas when building our multi-layer GCNs.

Figure 1: The architecture of our SketchGCN. An input sketch is first converted into a graph based on the stroke structure (the graph is simplified for illustration purposes). The graph node features and the connectivities are fed into the network, passing through two graph convolutional branches to extract the inter-stroke features (top, the global branch) and the intra-stroke features (bottom, the local branch). The extracted inter-stroke features are further fed into a mix pooling block to extract the global features, which are subsequently concatenated with the local features and fed into a Multi-Layer Perceptron (MLP) to get the final results.

3 Overview

Fig. 1 shows the pipeline of our network. Given an input sketch, we first construct a graph from the basic stroke structure and use the absolute coordinate information as the features of the graph nodes (Section 4.1). Then the graph and the node features are fed into two branches (Section 4.2): a local branch consisting of several static graph convolutional units, and a global branch consisting of dynamic graph convolutional units and a mix pooling block (Section 4.3), which includes a max pooling operation and a stroke pooling operation. The learned features of the two branches are concatenated and fed into a Multi-Layer Perceptron (MLP) to get the final segmentation and labeling.

The two-branch structure is tailored to the unique structure of sketches, learning both the intra-stroke features and the inter-stroke features. In the local branch, the information only flows inside individual strokes, since different strokes are not connected in the input graph. In contrast, in the global branch, we add extra connections to the nodes found by a dilated KNN function. We use two pooling operations to aggregate both the sketch-level information and the stroke-level information, providing hierarchical global features. The stroke pooling operation, which performs stroke-level aggregation, is shown to significantly benefit the task in our experiments (Section 5.3).
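To make this data flow concrete, the following PyTorch sketch shows only how the outputs of the two branches and the mix pooling block are combined before the final MLP; the module names and feature sizes are assumptions, and the two graph convolutional branches are replaced by simple per-point linear stand-ins (the real units are detailed in Section 4).

import torch
import torch.nn as nn

class TwoBranchSkeleton(nn.Module):
    # A sketch of the feature routing only; the real branches are stacks of graph conv units.
    def __init__(self, feat_dim=64, num_classes=4):
        super().__init__()
        self.local_branch = nn.Linear(2, feat_dim)       # stand-in for the SConv stack
        self.global_branch = nn.Linear(2, feat_dim)      # stand-in for the DConv stack
        self.pre_max = nn.Linear(feat_dim, feat_dim)     # transform before max pooling
        self.pre_stroke = nn.Linear(feat_dim, feat_dim)  # transform before stroke pooling
        self.head = nn.Sequential(nn.Linear(3 * feat_dim, 128), nn.ReLU(),
                                  nn.Linear(128, num_classes))  # final MLP

    def forward(self, points, stroke_ids):
        # points: (n, 2) absolute coordinates; stroke_ids: (n,) stroke index (0..m-1) per point
        f_local = self.local_branch(points)                        # intra-stroke features
        f_global = self.global_branch(points)                      # inter-stroke features
        f_sketch = self.pre_max(f_global).max(dim=0).values        # sketch-level feature
        h = self.pre_stroke(f_global)
        per_stroke = torch.stack([h[stroke_ids == s].max(dim=0).values
                                  for s in range(int(stroke_ids.max()) + 1)])
        f = torch.cat([f_local, per_stroke[stroke_ids],
                       f_sketch.expand(points.size(0), -1)], dim=-1)
        return self.head(f)                                        # per-point label logits

logits = TwoBranchSkeleton()(torch.rand(8, 2), torch.tensor([0, 0, 0, 1, 1, 2, 2, 2]))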

4 Methodology

In this section, we first explain our graph-based sketch representation, which serves as the input to the network. We then introduce the two graph convolutional units used in the two branches, followed by a description of our novel stroke pooling operation in the mix pooling block of the global branch.

4.1 Input Representation

Many existing sequence-based methods [16, 26, 36] use the relative coordinates to represent an input sketch. Instead, we use the absolute coordinates of sketch points, which are more suitable for our graph-based network structure.

Specifically, we represent a single sketch as an $n$-point set $P = \{p_1, \dots, p_n\}$, where $p_i = (x_i, y_i)$, and $x_i$ and $y_i$ are the 2D absolute coordinates of point $p_i$. A graph is built using the basic stroke structure information, leading to a sparse graph $\mathcal{G}_0 = (\mathcal{V}, \mathcal{E}_0)$, where $\mathcal{V} = P$ and $\mathcal{E}_0$ includes the edges that connect adjacent points on each single stroke. We use the same input for both the local and global branches of the network.
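As a concrete illustration, the following sketch builds the node features (absolute coordinates) and the sparse intra-stroke edge set from a sketch given as a list of per-stroke point arrays; the function name and the tensor layout are assumptions for illustration, not the released implementation.

import numpy as np

def build_sketch_graph(strokes):
    # strokes: list of (k_i, 2) arrays of absolute point coordinates, one array per stroke.
    # Returns node features (n, 2), undirected intra-stroke edges (2, m), and stroke ids (n,).
    points, edges, stroke_ids = [], [], []
    offset = 0
    for sid, stroke in enumerate(strokes):
        stroke = np.asarray(stroke, dtype=np.float32)
        points.append(stroke)
        stroke_ids.extend([sid] * len(stroke))
        for i in range(len(stroke) - 1):                 # connect adjacent points on the stroke
            edges.append((offset + i, offset + i + 1))
            edges.append((offset + i + 1, offset + i))   # both directions
        offset += len(stroke)
    nodes = np.concatenate(points, axis=0)               # (n, 2) absolute coordinates
    edge_index = np.asarray(edges, dtype=np.int64).T     # (2, m), the sparse edge set E_0
    return nodes, edge_index, np.asarray(stroke_ids, dtype=np.int64)

# example: a sketch with two strokes of 3 and 2 points
nodes, edge_index, stroke_ids = build_sketch_graph(
    [np.array([[0, 0], [1, 0], [2, 0]]), np.array([[0, 1], [1, 1]])])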

4.2 Graph Convolutional Units

We use two types of graph convolutional units in our network: a static graph convolutional unit (SConv for short) used in the local branch and a dynamic graph convolutional unit (DConv) used in the global branch. Both units use the same graph convolution operation. The main difference between them is that the SConv unit does not update the graph connectivity across layers, while the DConv unit updates the graph connectivity layer by layer using KNN. Both units use residual connections, which make training more stable than plain connections [10, 14]. After several SConv and DConv units, we obtain the local features and the global features from the local branch and the global branch, respectively.

Graph Convolution Operation. We use the same graph convolution operation as in [34] and briefly explain it here for the convenience of reading. Consider the graph at the $l$-th layer, $\mathcal{G}^l = (\mathcal{V}, \mathcal{E}^l)$, where $\mathcal{V}$ and $\mathcal{E}^l$ are the respective vertices and edges of $\mathcal{G}^l$, and let $F^l = \{f_i^l\}$ be the set of node features, each defined at a vertex $i \in \mathcal{V}$ at the $l$-th layer.

The node feature of vertex $i$ in the $(l+1)$-th layer is updated by

$f_i^{l+1} = \max_{j:(i,j) \in \mathcal{E}^l} h_{\Theta^l}(f_i^l, f_j^l)$,   (1)

where $\Theta^l$ denotes the learnable weights of the feature update operation $h_{\Theta^l}$, and the operation is defined as

$h_{\Theta}(f_i, f_j) = \mathrm{MLP}_{\Theta}(f_i \oplus f_j)$,   (2)

with $\oplus$ denoting feature concatenation and the max in Eq. (1) taken channel-wise over the neighbors of $i$.
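For reference, a minimal PyTorch version of such a unit might look as follows; it follows the EdgeConv-style formulation of [34] (an edge MLP on concatenated endpoint features, max-aggregation over neighbors) plus a residual connection, but the MLP sizes and aggregation details are assumptions, and it relies on torch.Tensor.scatter_reduce (PyTorch 1.12+).

import torch
import torch.nn as nn

class GraphConvUnit(nn.Module):
    # One graph convolutional unit: edge MLP on concatenated endpoint features (Eq. 2),
    # max-aggregation over neighbors (Eq. 1), and a residual connection. A sketch only.
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.edge_mlp = nn.Sequential(nn.Linear(2 * in_dim, out_dim), nn.ReLU())
        self.shortcut = nn.Linear(in_dim, out_dim) if in_dim != out_dim else nn.Identity()

    def forward(self, x, edge_index):
        # x: (n, in_dim) node features; edge_index: (2, m), rows are (target i, source j)
        i, j = edge_index
        e = self.edge_mlp(torch.cat([x[i], x[j]], dim=-1))               # edge features
        out = x.new_full((x.size(0), e.size(1)), float('-inf'))
        out = out.scatter_reduce(0, i.unsqueeze(-1).expand_as(e), e, reduce='amax')
        out = torch.where(torch.isinf(out), torch.zeros_like(out), out)  # isolated nodes -> 0
        return out + self.shortcut(x)                                    # residual connection

# example usage on a tiny graph
x = torch.rand(5, 2)
edge_index = torch.tensor([[0, 1, 1, 2, 3, 4], [1, 0, 2, 1, 4, 3]])
features = GraphConvUnit(2, 64)(x, edge_index)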

Graph Updating Strategy. The SConv units only use the input graph and do not update the graph structure across layers; in other words, $\mathcal{G}^l = \mathcal{G}_0$ for every layer of the local branch. In contrast, the global branch dynamically changes the graph by adding a different edge set $\mathcal{E}_d^l$ to the input graph in different layers, leading to non-local diffusion across the strokes. More specifically, the graph used in the $l$-th layer of the global branch is defined as

$\mathcal{G}^l = (\mathcal{V}, \mathcal{E}_0 \cup \mathcal{E}_d^l)$.   (3)

To enlarge the receptive fields, $\mathcal{E}_d^l$ is designed to gather dilated aggregations of the information, inspired by Li et al. [14]:

$\mathcal{E}_d^l = \{(i, j) \mid i \in \mathcal{V},\ j \in \mathcal{N}_d^l(i)\}$,   (4)

where $\mathcal{N}_d^l(i)$ is the set of $d$-dilated nearest neighbors of vertex $i$, computed in the feature space of the $l$-th layer. We use the same stochastic strategy at training time as in [14].
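The dilated dynamic edges could be computed as in the sketch below: take the k·d nearest neighbors of each node in feature space and keep every d-th one. The value of k and the dilation schedule are assumptions here, and the stochastic sampling used at training time in [14] is omitted.

import torch

def dilated_knn_edges(features, k=8, dilation=1):
    # features: (n, c) node features; returns (2, n*k) dynamic edges (i, j).
    # Assumes n > k * dilation so that enough neighbors exist.
    n = features.size(0)
    dist = torch.cdist(features, features)                 # (n, n) pairwise distances
    dist.fill_diagonal_(float('inf'))                      # exclude self-loops
    idx = dist.topk(k * dilation, largest=False).indices   # k*d nearest neighbors per node
    idx = idx[:, ::dilation]                               # keep every d-th neighbor -> k per node
    dst = torch.arange(n).repeat_interleave(k)             # center node i
    src = idx.reshape(-1)                                  # dilated neighbor j
    return torch.stack([dst, src], dim=0)

# at layer l of the global branch: E^l = E_0 together with dilated_knn_edges(f^l, k, dilation_l)
extra_edges = dilated_knn_edges(torch.rand(32, 64), k=4, dilation=2)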

4.3 Mix Pooling Block

The mix pooling block is designed to learn both the sketch-level features, via the max pooling operation, and the stroke-level features, via the stroke pooling operation. Before applying the two pooling operations, we transform the global features by two different multi-layer perceptrons with learnable weights $\Theta_1$ and $\Theta_2$, respectively.

We use the max pooling operation to aggregate the sketch-level features, similar to many existing methods for 3D point cloud analysis [2, 30]:

$f_{sketch} = \max_{i \in \mathcal{V}} \mathrm{MLP}_{\Theta_1}(g_i)$,   (5)

where $g_i$ is the global-branch feature of point $i$.

For the stroke-level features, we propose a new pooling operation, named stroke pooling, to aggregate features over every single stroke:

$f_{s_k} = \max_{i \in s_k} \mathrm{MLP}_{\Theta_2}(g_i)$,   (6)

where $s_k$, $k = 1, \dots, m$, is the $k$-th stroke in a sketch and $m$ is the number of strokes in the sketch. Note that stroke pooling produces the same $f_{s_k}$ for all points within the same stroke.

The whole feature used in the final MLP layers is a concatenation of the output of the mix pooling block (i.e., the stroke-level feature $f_{s_k}$ and the sketch-level feature $f_{sketch}$) and the output of the local branch (i.e., the local feature $l_i$):

$f_i = l_i \oplus f_{s_{k(i)}} \oplus f_{sketch}$,   (7)

where $k(i)$ is the index of the stroke that point $i$ belongs to.
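A minimal sketch of the mix pooling block, written directly against Eqs. (5)-(7) and assuming max is also used as the per-stroke aggregator, is given below; mlp1 and mlp2 stand for the two learnable transforms, and the feature sizes are placeholders.

import torch
import torch.nn as nn

def mix_pooling(g, stroke_ids, mlp1, mlp2):
    # g: (n, c) global-branch features; stroke_ids: (n,) stroke index (0..m-1) per point.
    # Returns, per point, the concatenated sketch-level and stroke-level features.
    f_sketch = mlp1(g).max(dim=0).values                             # Eq. (5): max pooling
    h = mlp2(g)
    per_stroke = torch.stack([h[stroke_ids == s].max(dim=0).values
                              for s in range(int(stroke_ids.max()) + 1)])  # Eq. (6)
    return torch.cat([per_stroke[stroke_ids],                        # same value within a stroke
                      f_sketch.expand(g.size(0), -1)], dim=-1)

# example (placeholder sizes); Eq. (7) is then torch.cat([f_local, pooled], dim=-1)
mlp1 = nn.Sequential(nn.Linear(64, 64), nn.ReLU())
mlp2 = nn.Sequential(nn.Linear(64, 64), nn.ReLU())
pooled = mix_pooling(torch.rand(8, 64), torch.tensor([0, 0, 0, 1, 1, 2, 2, 2]), mlp1, mlp2)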
Figure 2: Qualitative results of semantic segmentation on the SPG dataset (rows: airplane, ant, crab, face, pig, cactus; columns: SPGSeg [16], FastSeg+GC [17], Ours, Human). More visual results can be found in the supplementary document.

5 Experiments

We have evaluated our SketchGCN on the following four existing sketch datasets: the SPG dataset [16], the SketchSeg-150K dataset [36], the Huang14 dataset [12], and the TU-Berlin dataset [6]. The SPG and SketchSeg-150K datasets are both built upon QuickDraw [9], a vector drawing dataset collected from an online game in which players are required to draw objects in less than 20 seconds. SPG has 25 categories with 800 sketches per category, and we use the same 20 categories as in [16]. Compared to the SPG dataset, SketchSeg-150K is relatively simpler, with fewer semantic labels per category (2-4 labels per category). With data augmentation by a sketch generative model [9], SketchSeg-150K has about 150,000 sketches over 20 categories. The Huang14 dataset [12] and the TU-Berlin dataset [6] are earlier, smaller datasets, which consist of 30 sketches and 80 sketches per category, respectively. We use the same evaluation metrics as previous works, i.e., the pixel metric and the component metric [12, 16, 36, 17].

5.1 Implementation Details

As shown in Fig. 1, our network uses graph convolution units in both the local branch and the global branch. Each graph convolution unit computes edge features from connected point pairs by first concatenating the point features and then applying a multi-layer perceptron, and it then updates the point features by aggregating the neighboring edge features. In the global branch, we additionally use dynamic edges found by a dilated KNN function with a fixed number of nearest neighbors and a dilation rate that increases layer by layer. In the mix pooling block, we apply a multi-layer perceptron before each pooling operation. After pooling and repeating, the global features together with the local features are fed into a multi-layer perceptron to get the final prediction.

We scale all sketches to a fixed canvas size. We then simplify each sketch using the Ramer-Douglas-Peucker algorithm [5] and resample it to a fixed number of points, $n$. During the training phase, we use the cross-entropy loss and the Adam optimizer with a fixed base learning rate and batch size. For SPG and SketchSeg-150K, we train the network for 100 epochs and decay the learning rate by 0.5 every 50 epochs. For Huang14 and TU-Berlin, we train the network for 30 epochs and decay the learning rate by 0.5 every 10 epochs. The model is implemented with PyTorch and trained on an NVIDIA GTX 1080Ti GPU. On average, an epoch takes 15s on the SPG dataset, and it takes 7ms to process a test sketch. We use a larger $n$ for the SPG, Huang14, and TU-Berlin datasets and a smaller $n$ for the SketchSeg-150K dataset, since the original sketches in the latter contain fewer points. At test time, an input sketch is first resampled to $n$ points to pass through the network, and the predicted result is then mapped back to the original sketch using a nearest-neighbor scheme.
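The resampling and label-transfer steps could look like the sketch below (per-stroke uniform arc-length resampling and a nearest-neighbor lookup to map predicted labels back to the original points); the paper's exact resampling scheme and how the point budget is split across strokes are not spelled out here, so this is illustrative only.

import numpy as np

def resample_polyline(points, n):
    # Uniformly resample a polyline of shape (k, 2) to n points by arc length.
    points = np.asarray(points, dtype=np.float32)
    seg = np.linalg.norm(np.diff(points, axis=0), axis=1)
    t = np.concatenate([[0.0], np.cumsum(seg)])
    if t[-1] == 0:                                   # degenerate stroke (a single position)
        return np.repeat(points[:1], n, axis=0)
    s = np.linspace(0.0, t[-1], n)
    return np.stack([np.interp(s, t, points[:, 0]),
                     np.interp(s, t, points[:, 1])], axis=1)

def map_labels_back(original_points, resampled_points, resampled_labels):
    # Give each original point the label predicted for its nearest resampled point.
    d = np.linalg.norm(original_points[:, None, :] - resampled_points[None, :, :], axis=-1)
    return np.asarray(resampled_labels)[d.argmin(axis=1)]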

Category SPGSeg [16] DeepLab [3] FastSeg+GC [17] Ours
P_metric C_metric P_metric C_metric P_metric C_metric P_metric C_metric
airplane 82.9 70.9 70.7 46.2 85.3 75.2 97.2 93.6
alarm_clock 84.8 81.0 82.5 74.3 84.6 72.3 97.7 94.9
ambulance 80.7 68.1 72.5 54.2 85.8 75.3 94.3 91.0
ant 66.4 56.6 61.3 32.1 68.9 66.4 93.5 92.5
apple 89.9 71.8 87.3 60.2 91.4 82.3 98.4 95.6
backpack 75.2 63.7 64.3 28.4 73.3 59.8 93.8 86.9
basket 84.8 83.2 79.5 69.5 86.6 82.2 98.2 97.5
butterfly 89.0 83.6 85.6 69.8 92.7 79.3 98.6 97.0
cactus 77.5 72.3 67.2 30.8 73.3 68.6 97.4 96.5
calculator 91.1 89.9 92.5 92.1 97.4 93.0 99.4 98.5
campfire 92.3 91.4 82.9 83.3 95.6 92.9 97.1 96.2
candle 88.3 71.8 91.5 76.9 90.8 80.1 99.5 98.4
coffee cup 92.0 87.2 86.2 81.8 90.9 87.0 99.4 98.4
crab 77.9 70.5 73.9 49.3 75.9 55.4 96.6 94.2
duck 86.9 75.4 85.9 76.0 88.9 75.1 98.1 96.7
face 88.0 80.1 87.4 78.4 88.1 80.4 98.7 97.3
ice cream 85.4 79.3 80.7 70.3 87.5 80.1 96.0 95.0
pig 81.9 75.4 82.1 77.9 81.1 73.9 98.8 98.2
pineapple 89.8 90.2 85.4 79.5 91.9 82.3 99.3 96.4
suitcase 92.7 90.7 90.2 90.1 94.8 86.7 99.2 97.6
Average 84.9 77.7 80.5 66.1 86.2 77.4 97.6 95.6
Table 1: Quantitative comparison on the SPG dataset [16]. P_metric and C_metric stand for the pixel and component metrics, respectively.
Category FastSeg+GC [17] SegNet+ [36] Ours
P_metric C_metric P_metric C_metric P_metric C_metric
angel 0.98 0.96 0.89 0.86 0.99 0.98
bird 0.82 0.70 0.98 0.97 0.99 0.99
bowtie 1.00 1.00 0.99 1.00 1.00 1.00
butterfly 0.98 0.96 0.95 0.95 1.00 1.00
candle 0.96 0.78 0.95 0.95 0.98 0.97
cup 0.91 0.92 0.77 0.74 0.98 0.98
door 1.00 1.00 0.99 0.99 1.00 1.00
dumbbell 0.98 0.98 0.99 0.99 1.00 1.00
envelope 1.00 1.00 1.00 0.99 1.00 1.00
face 0.98 0.95 0.94 0.91 0.94 0.92
ice 1.00 1.00 0.72 0.69 1.00 1.00
lamp 0.78 0.78 0.95 0.94 0.96 0.96
lighter 0.99 0.96 0.99 0.98 1.00 1.00
marker 0.90 0.80 0.61 0.55 0.97 0.98
mushroom 0.98 0.94 0.70 0.66 0.99 0.94
pear 0.97 0.94 0.99 0.98 1.00 1.00
plane 1.00 0.99 0.86 0.85 1.00 1.00
spoon 0.80 0.79 0.85 0.81 0.90 0.90
traffic 0.89 0.93 0.96 0.96 0.95 0.95
van 0.99 0.99 0.87 0.84 0.99 0.99
Average 0.95 0.92 0.90 0.88 0.98 0.98
Table 2: Quantitative comparison on the SketchSeg-150K dataset [36].
Category FastSeg [17] FastSeg+GC [17] DeepLab [3] MIP-Auto [12] Ours Ours+GC
P_metric C_metric P_metric C_metric P_metric C_metric P_metric C_metric P_metric C_metric P_metric C_metric
airplane 81.1 65.4 85.5 75.5 45.0 30.4 74.0 55.8 80.0 66.6 82.9 75.4
bicycle 82.9 67.9 85.4 76.7 64.9 46.0 72.6 58.3 82.0 69.1 83.5 76.0
candelabra 74.7 59.2 77.3 68.0 58.6 44.1 59.0 47.1 78.6 66.7 81.4 74.3
chair 70.0 60.5 73.9 69.3 56.3 44.5 52.6 42.4 76.3 66.8 76.5 72.2
fourleg 79.6 66.5 83.9 75.8 64.6 49.1 77.9 64.4 80.2 67.8 82.0 74.9
human 74.8 61.9 79.2 71.9 67.6 55.5 62.5 47.2 75.5 66.3 76.5 71.0
lamp 85.7 78.1 86.5 80.9 68.3 64.8 82.5 77.6 87.1 79.2 89.8 86.5
rifle 68.5 56.3 71.4 67.3 63.8 50.2 66.9 51.5 77.9 67.4 79.3 73.3
table 77.6 67.3 79.0 73.1 64.6 51.9 67.9 56.7 78.6 68.0 81.0 76.7
vase 81.1 71.9 83.8 79.3 73.4 63.6 63.1 51.8 78.4 71.0 80.2 79.6
Average 77.6 65.5 80.6 73.8 62.7 50.0 67.9 55.3 79.5 69.2 81.3 76.0
Table 3: Quantitative comparison on the Huang14 dataset [12]. GC is short for graph cut refinement [17].
Category FastSeg [17] FastSeg+GC [17] Ours Ours+GC
P_metric C_metric P_metric C_metric P_metric C_metric P_metric C_metric
airplane 77.6 63.5 82.1 72.4 88.0 74.1 87.7 77.0
chair 91.7 89.5 95.7 93.5 95.5 92.5 95.9 93.7
guitar 78.5 67.0 81.4 78.8 91.6 86.0 92.5 87.7
motorbike 66.0 47.5 70.8 61.1 75.2 66.2 76.5 70.6
table 92.0 87.3 94.0 91.1 96.1 92.4 96.2 92.8
Average 81.2 71.0 84.8 79.4 89.3 82.2 89.8 84.4
Table 4: Quantitative comparison on subsets of the TU-Berlin dataset [6].
Category Baseline B. + L. B. + SP. B. + L. + SP.
P_metric C_metric P_metric C_metric P_metric C_metric P_metric C_metric
airplane 88.09 81.94 90.75 82.10 95.97 91.68 97.22 93.57
alarm_clock 88.99 81.73 93.85 87.82 96.09 92.38 97.71 94.91
ambulance 81.97 76.65 87.37 80.42 88.65 82.92 94.26 91.03
ant 81.50 73.56 86.10 82.84 90.65 89.13 93.47 92.51
apple 92.79 77.55 94.25 83.37 97.26 91.98 98.37 95.55
backpack 77.40 62.37 80.54 66.65 87.79 79.31 93.77 86.94
basket 86.16 82.61 92.42 90.38 96.14 96.72 98.18 97.54
butterfly 93.09 88.13 95.61 90.77 99.05 96.78 98.63 96.97
cactus 88.93 79.11 91.85 81.90 97.24 95.51 97.38 96.47
calculator 97.09 96.04 97.03 95.06 99.03 98.29 99.38 98.53
campfire 86.02 83.46 88.74 85.65 96.06 94.76 97.08 96.16
candle 96.96 91.42 97.25 94.11 99.00 98.04 99.53 98.38
coffee_cup 94.93 94.19 97.56 96.28 99.01 97.95 99.43 98.38
crab 87.21 83.20 90.74 85.20 94.93 92.48 96.64 94.23
duck 91.94 90.25 94.54 90.24 97.82 95.63 98.13 96.68
face 93.79 85.52 94.62 89.78 98.04 95.90 98.67 97.27
ice cream 87.69 83.75 90.72 86.82 94.52 93.28 95.96 94.99
pig 90.65 87.03 94.38 92.64 97.55 96.25 98.81 98.16
pineapple 91.48 88.41 95.50 89.06 98.79 95.02 99.25 96.40
suitcase 96.18 92.61 97.16 95.75 97.96 95.46 99.24 97.58
Average 89.64 83.98 92.55 87.34 96.08 93.47 97.56 95.61
Table 5: Quantitative results of ablation study on the SPG dataset. The baseline only has a global branch and uses max pooling. B. + L. is the baseline with the local branch, and B. + SP. is the baseline with stroke pooling, while B. + L. + SP. is our full model.

5.2 Results

Tables 1 and 2 list the quantitative results of different methods on the SPG and SketchSeg-150K datasets. We use the same data splits as in [16, 36]. Our approach outperforms the others by a large margin: on average 11.4% higher in terms of the pixel metric and 18.2% higher in terms of the component metric on the SPG dataset than FastSeg + GC [17], which performs the best among the existing methods, and on average 3% higher in the pixel metric and 6% higher in the component metric on the SketchSeg-150K dataset. The less significant performance gain on the SketchSeg-150K dataset is mainly because this dataset is labeled coarsely, with many fewer semantic labels per category (2-4 labels per category in SketchSeg-150K versus 3-7 labels per category in SPG), and is thus less challenging for existing methods.

Fig. 2 shows some representative visual comparisons between our method and the methods of [16] and [17]. The sequence-based representation [16] uses point drawing orders and relative coordinates, which ignores the proximity among strokes and leads to unsatisfactory results (Fig. 2, the first column). The image-based method [17] is not aware of the stroke structure and hence relies mainly on the local image structure, which also leads to inferior results.

We also evaluate our model on the Huang14 dataset and a subset of the TU-Berlin dataset. Due to the limited number of sketches per category in these two datasets, we use the same method as [17] to generate the training data, i.e., we render 3D models with labels and extract edge maps to approximate the sketch images. We apply a uniform sampling procedure to generate the sketch vectors from the images (see Fig. 3). Tables 3 and 4 show the respective quantitative results. In the Huang14 dataset, the individual strokes typically contain many spacious small segments (e.g., see Fig. 3, top row). Such small segments potentially increase the structural noise and thus degrade the performance of our DConv units, since DConv connects new edges within the feature space. For a fair comparison, we apply an additional graph cut algorithm to refine our results, as in [17]. The situation improves (see the statistics in Table 4) on the TU-Berlin dataset, where there are not many such spacious small segments, which indicates that our model can learn the stroke structure information. We did not include the results of our method with graph cut when comparing to existing methods on the SPG and SketchSeg-150K datasets, since GC did not bring any obvious improvement there. Note that GC is more helpful in [17] because their image-based method is not aware of the stroke structure, and hence the prediction usually contains many spacious tiny segments within a single stroke (see details in [17]).

Figure 3: To create training data for the Huang14 and TU-Berlin datasets, we construct graphs (middle) from edge-map images (left) rendered from 3D models. The right column shows exemplar freehand sketches from the test data. The synthesized and real sketches differ in both structure and shape. The endpoints of each stroke are marked with solid circles.

Overall, our method + GC gains on average 0.7% in the pixel metric and 2.2% in the component metric on the Huang14 dataset (over FastSeg + GC), and on average 5.0% in the pixel metric and 5.0% in the component metric on the TU-Berlin dataset. On the Huang14 dataset, our results are only slightly better than those of the CNN-based method FastSeg + GC [17] (in some categories even slightly worse, Table 3). This is mainly due to the large domain gap between the synthetic data rendered from 3D models and the real hand-drawn data, as shown in Fig. 3. The large domain gap may result in large structural noise, making it hard for GCN-based methods to fully capture the stroke structures. Our method, however, is still able to achieve state-of-the-art performance.

It is noteworthy that, compared with existing deep network-based models for sketch segmentation, our model has orders of magnitude fewer parameters. For example, the sequence-based method SPGSeg [16] has a parameter size of 23.4MB and the image-based method FastSeg+GC [17] has a parameter size of 40.9MB, while the parameter size of our model is only 434KB, roughly two orders of magnitude smaller. Fewer parameters mean less computation and more potential for deployment in lightweight applications, such as those on mobile devices, which further shows the advantage of our model.
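For reference, such parameter sizes can be measured directly on a PyTorch model, e.g. with a small helper like the one below (the model variable is a placeholder):

import torch.nn as nn

def parameter_size_bytes(model: nn.Module) -> int:
    # Total parameter storage: number of elements times per-element byte size.
    return sum(p.numel() * p.element_size() for p in model.parameters())

# e.g. print(parameter_size_bytes(model) / 1024, "KB")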

5.3 Ablation Study

In this section, we examine the effectiveness of various design choices in SketchGCN. The core designs of our network architecture are the global-local branching and the stroke pooling, and we hence examine the following design alternatives. We use the network without the local branch and without stroke pooling as our baseline; in other words, the baseline only has a global branch and concatenates the features before and after the max pooling operation (similar to [2, 30]) before applying the final multi-layer perceptron. The other two alternatives are the baseline network with the local branch (B.+L.) and the baseline network with stroke pooling (B.+SP.), i.e., the baseline with the full mix pooling block.

We run experiments on the SPG dataset, the first large freehand sketch dataset with fine-grained semantic labels, to verify the effectiveness of the two designs. Table 5 lists the experimental results. Compared to the baseline, B.+L. gains on average 2.91% in the pixel metric and 3.36% in the component metric, while B.+SP. gains on average 6.44% in the pixel metric and 9.49% in the component metric. Our full model, i.e., the baseline with both the local branch and stroke pooling (B.+L.+SP.), achieves the best results.

We have also tried varying the number of GCN units used in our two branches. In our current setting, we use 4 units (of SConv and DConv, respectively) in the global and the local branches. Alternatively, we change this number to 6, 8, and 10 convolutional units. As shown in Table 6, we find that increasing the number of GCN units does not benefit our results. For simplicity, we thus use 4 units in our final model.

Average 4 units 6 units 8 units 10 units
P_metric 97.6% 96.9% 96.9% 96.7%
C_metric 95.6% 94.2% 94.2% 94.0%
Table 6: Evaluation on different numbers of GCN units.

6 Limitations

Fig. 4 shows several results with segmentation errors. The imperfections of our method are mainly caused by the following factors. First, due to the inherent ambiguity of freehand sketches in part position and part shape, our model may assign wrong labels to strokes. For example, in Fig. 4 (a) a branch of the cactus is mistakenly labeled as a thorn, and in (b) the strap on the top of a bag is labeled as a handle. Second, large differences between the training data and the test data may also mislead our model (Fig. 4 (d)): the butterflies in the training data always spread their wings, while the test example in this figure shows a butterfly with folded wings, drawn from a different view angle. We believe that this domain gap is a common issue of current learning-based methods. Nevertheless, the visual results in Fig. 2 and the statistics on the Huang14 and TU-Berlin datasets (Tables 3 and 4) show the generalization ability of our model. Finally, since our graph representation only encodes features such as node position and proximity, our model is not aware of some high-level semantics, such as the fact that "a human face can only have one nose" (see Fig. 4 (c)). We believe this issue can be alleviated by incorporating more semantic features into our graph representation, which we leave for future work.

Figure 4: Exemplar results with imperfect segmentation.

7 Conclusion

In this work, we presented the first graph convolutional network for semantic sketch segmentation and labeling. Our SketchGCN employs static graph convolutional units and dynamic graph convolutional units to respectively extract intra-stroke and inter-stroke features using a two-branch architecture. With a novel stroke pooling operation enabling more consistent intra-stroke labeling, our method achieves higher accuracy than the state-of-the-art methods with significantly fewer parameters on multiple sketch datasets. In our current experiments, we only use absolute positions as graph node features, ignoring information such as stroke order, direction, and spatial relations. In the future, we will exploit such information with more flexible graph structures. Another possibility would be to exploit recurrent modules for learning graph representations. Finally, it could be an intriguing direction to reshape our architecture for scene-level sketch segmentation and sketch recognition tasks.

References

  • [1] J. Bastings, I. Titov, W. Aziz, D. Marcheggiani, and K. Sima’an (2017) Graph convolutional encoders for syntax-aware neural machine translation. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pp. 1957–1967. Cited by: §2.
  • [2] R. Q. Charles, S. Hao, K. Mo, and L. J. Guibas (2017) PointNet: deep learning on point sets for 3D classification and segmentation. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 77–85. Cited by: §4.3, §5.3.
  • [3] L. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille (2018) DeepLab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE Trans. Pattern Anal. Mach. Intell. 40 (4), pp. 834–848. Cited by: Table 1, Table 3.
  • [4] A. Delaye and K. Lee (2015) A flexible framework for online document segmentation by pairwise stroke distance learning.. Pattern Recognition 48 (4), pp. 1197–1210. Cited by: §1.
  • [5] D. H. Douglas and T. K. Peucker (1973) Algorithms for the reduction of the number of points required to represent a line or its caricature. University of Toronto Press. Cited by: §5.1.
  • [6] M. Eitz, J. Hays, and M. Alexa (2012) How do humans sketch objects?. ACM Transactions on Graphics (Proceedings SIGGRAPH) 31 (4), pp. 44:1–44:10. Cited by: §1, Table 4, §5.
  • [7] M. Eitz, R. Richter, T. Boubekeur, K. Hildebrand, and M. Alexa (2012) Sketch-based shape retrieval. ACM Transactions on Graphics (Proceedings SIGGRAPH) 31 (4), pp. 31:1–31:10. Cited by: §1.
  • [8] L. Gennari, L. B. Kara, T. F. Stahovich, and K. Shimada (2005) Combining geometry and domain knowledge to interpret hand-drawn diagrams. Computers & Graphics 29 (4), pp. 547–562. Cited by: §1.
  • [9] D. Ha and D. Eck (2017) A neural representation of sketch drawings. In 6th International Conference on Learning Representations, ICLR 2018, Cited by: §5.
  • [10] K. He, X. Zhang, S. Ren, and S. Jian (2016) Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §4.2.
  • [11] M. Henaff, J. Bruna, and Y. LeCun (2015) Deep convolutional networks on graph-structured data. CoRR abs/1506.05163. External Links: Link, 1506.05163 Cited by: §1.
  • [12] Z. Huang, H. Fu, and R. W. H. Lau (2014) Data-driven segmentation and labeling of freehand sketches. ACM Trans. Graph. 33 (6), pp. 175:1–175:10. Cited by: §1, §2, Table 3, §5.
  • [13] T. N. Kipf and M. Welling (2016) Semi-supervised classification with graph convolutional networks. In International Conference on Learning Representations. Cited by: §2.
  • [14] G. Li, M. Müller, A. Thabet, and B. Ghanem (2019) DeepGCNs: can gcns go as deep as cnns?. In The IEEE International Conference on Computer Vision (ICCV), Cited by: §2, §4.2, §4.2.
  • [15] K. Li, K. Pang, J. Song, Y. Song, T. Xiang, T. M. Hospedales, and H. Zhang (2018) Universal sketch perceptual grouping. In The European Conference on Computer Vision (ECCV), pp. 582–597. Cited by: §2.
  • [16] K. Li, K. Pang, Y. Song, T. Xiang, T. M. Hospedales, and H. Zhang (2019) Toward deep universal sketch perceptual grouper. IEEE Transactions on Image Processing 28 (7), pp. 3219–3231. Cited by: §1, §2, §2, Figure 2, §4.1, §5.2, §5.2, §5.2, Table 1, §5.
  • [17] L. Li, H. Fu, and C. Tai (2019) Fast sketch segmentation and labeling with deep learning. IEEE Computer Graphics and Applications 39 (2), pp. 38–51. Cited by: §1, §2, §2, Figure 2, §5.2, §5.2, §5.2, §5.2, §5.2, Table 1, Table 2, Table 3, Table 4, §5.
  • [18] L. Li, Z. Huang, C. Zou, C. Tai, R. W. H. Lau, H. Zhang, P. Tan, and H. Fu (2016) Model-driven sketch reconstruction with structure-oriented retrieval. In SIGGRAPH ASIA 2016 Technical Briefs, pp. 28:1–28:4. Cited by: §1.
  • [19] Q. Li, Z. Han, and X. Wu (2018) Deeper insights into graph convolutional networks for semi-supervised learning. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, pp. 3538–3545. Cited by: §2.
  • [20] F. Monti, M. Bronstein, and X. Bresson (2017) Geometric matrix completion with recurrent multi-graph neural networks. In Advances in Neural Information Processing Systems 30, pp. 3697–3707. Cited by: §2.
  • [21] G. Noris, D. Sykora, A. Shamir, S. Coros, B. Whited, M. Simmons, A. Hornung, M. H. Gross, and R. W. Sumner (2012-01) Smart scribbles for sketch segmentation. Computer Graphics Forum 31 (8). Cited by: §1.
  • [22] F. Perteneder, M. Bresler, E. M. Grossauer, J. Leong, and M. Haller (2015) CLuster: smart clustering of free-hand sketches on large interactive surfaces. In Proceedings of the 28th Annual ACM Symposium on User Interface Software and Technology, pp. 37–46. Cited by: §1.
  • [23] Y. Qi, J. Guo, Y. Li, H. Zhang, T. Xiang, and Y. Song (2013) Sketching by perceptual grouping. In 2013 IEEE International Conference on Image Processing, pp. 270–274. Cited by: §2.
  • [24] Y. Qi, J. Guo, Y. Song, T. Xiang, H. Zhang, and Z. Tan (2015) Im2Sketch: sketch generation by unconflicted perceptual grouping. Neurocomputing 165, pp. 338–349. Cited by: §1.
  • [25] Y. Qi, Y. Song, T. Xiang, H. Zhang, T. Hospedales, Y. Li, and J. Guo (2015) Making better use of edges via perceptual grouping. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1856–1865. Cited by: §2.
  • [26] Y. Qi and Z. Tan (2019) SketchSegNet+: an end-to-end learning of RNN for multi-class sketch semantic segmentation. IEEE Access 7, pp. 102717–102726. Cited by: §1, §2, §2, §4.1.
  • [27] P. Sangkloy, N. Burnell, C. Ham, and J. Hays (2016) The sketchy database: learning to retrieve badly drawn bunnies. ACM Trans. Graph. 35 (4), pp. 119:1–119:12. Cited by: §1.
  • [28] R. K. Sarvadevabhatla, I. Dwivedi, A. Biswas, S. Manocha, and R. V. Babu (2017) SketchParse : towards rich descriptions for poorly drawn sketches using multi-task hierarchical deep networks. In Proceedings of the 25th ACM International Conference on Multimedia, pp. 10–18. Cited by: §1, §1, §2, §2.
  • [29] R. G. Schneider and T. Tuytelaars (2016) Example-based sketch segmentation and labeling using crfs. ACM Trans. Graph. 35 (5), pp. 151:1–151:9. Cited by: §1, §1, §2.
  • [30] M. Simonovsky and N. Komodakis (2017) Dynamic edge-conditioned filters in convolutional neural networks on graphs. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 29–38. Cited by: §1, §2, §4.3, §5.3.
  • [31] J. Song, K. Pang, Y. Song, T. Xiang, and T. M. Hospedales (2018) Learning to sketch with shortcut cycle consistency. In 2018 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 801–810. Cited by: §1.
  • [32] L. Tang and H. Liu (2009) Relational learning via latent social dimensions. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 817–826. Cited by: §2.
  • [33] D. Valsesia, G. Fracastoro, and E. Magli (2019) Learning localized generative models for 3d point clouds via graph convolution. In International Conference on Learning Representations. Cited by: §1, §2.
  • [34] Y. Wang, Y. Sun, Z. Liu, S. E. Sarma, M. M. Bronstein, and J. M. Solomon (2019) Dynamic graph cnn for learning on point clouds. ACM Trans. Graph. 38 (5), pp. 146:1–146:12. Cited by: §1, §2, §4.2.
  • [35] M. Wertheimer (1938) Laws of organization in perceptual forms. Psychologische Forschung 4, pp. 71–88. Cited by: §1, §2.
  • [36] X. Wu, Y. Qi, J. Liu, and J. Yang (2018) SketchSegNet: a RNN model for labeling sketch strokes. In 2018 IEEE 28th International Workshop on Machine Learning for Signal Processing (MLSP), pp. 1–6. Cited by: §1, §2, §2, §4.1, §5.2, Table 2, §5.
  • [37] Z. Wu, S. Pan, F. Chen, G. Long, C. Zhang, and P. S. Yu (2019) A comprehensive survey on graph neural networks. CoRR abs/1901.00596. External Links: Link, 1901.00596 Cited by: §2.
  • [38] X. Xie, K. Xu, N. Mitra, D. Cohen-Or, W. Gong, Q. Su, and B. Chen (2013-11) Sketch-to-design: context-based part assembly. Computer Graphics Forum 32, pp. 233–245. External Links: Document Cited by: §1.
  • [39] K. Xu, C. Kang, H. Fu, W. Sun, and S. Hu (2013) Sketch2Scene: sketch-based co-retrieval and co-placement of 3D models. ACM Transactions on Graphics (Proceedings SIGGRAPH) 32 (4), pp. 1–15. Cited by: §1.
  • [40] R. Ying, R. He, K. Chen, P. Eksombatchai, W. L. Hamilton, and J. Leskovec (2018) Graph convolutional neural networks for web-scale recommender systems. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 974–983. Cited by: §2.
  • [41] X. Zhu, Y. Xiao, and Y. Zheng (2019) 2D freehand sketch labeling using cnn and crf. Multimedia Tools and Applications, pp. 1–18. Cited by: §1, §2.