Research on freehand human sketches has become increasingly popular in recent years. Owing to their ubiquitous ability to record visual objects [eitz2012humans], sketches form a natural medium for human-computer interaction. Deriving a tailor-made representation for sketches sits at the core of sketch research and has a direct impact on a series of downstream applications such as sketch recognition [yu2015sketch, xu2019multigraph, kipf2016semi], sketch-based image retrieval [cao2010mindfinder, yu2016sketch, pang2017cross, pang2019generalising], sketch-3D reconstruction [lun20173d, jiang2019disentangled, wang20203d, ranjan2018generating, shih20203d], and sketch synthesis [liuunsupervised, song2018learning]. Although such a representation is pivotal, designing an effective one is challenging since sketches are typically abstract and iconic.
Prior works predominantly relied on encoding sketches in a pixelative format (i.e., an image) [eitz2012humans, yu2015sketch, sangkloy2017scribbler]. Although this offers the convenience of effortlessly re-purposing off-the-shelf convolutional neural networks for sketches, the pixelative format lacks an intuitive way to exploit structural information. Structural information is vital for sketch abstraction modeling [eitz2012humans, riaz2018learning], which in turn is essential for downstream tasks that involve structural manipulation, such as sketch generation [ha2017neural, chen2017sketch, karras2019style] and sketch synthesis [liuunsupervised, song2018learning].
RNN-based approaches have consequently emerged as a means to fully explore the sequential nature of sketches [ha2017neural]. The research convention is to use the QuickDraw [quickdraw] vector format, where each sketch is represented as a list of offsets in the x and y directions. Thanks to stroke-level modeling, these approaches offer a degree of flexibility in generation and synthesis tasks, yet they do so by imposing a strong assumption: all sketches come with sequential stroke data. This assumption largely prevents RNN-based methods from being applied to sketches such as those drawn on a piece of paper. A natural question is therefore: is there a way to remove the requirement for vector data while preserving the structural cues that vector data provides?
To answer this question, we propose an alternative sketch representation inspired by the concept of lattice structures – SketchLattice. We define SketchLattice as a set of points sampled from the original 2D sketch image using a lattice graph as shown in Figure 1 (a). Such latticed sketch representation, although seemingly simplistic, is remarkably amenable to structural deformations, thereby providing vital benefits for sketch abstraction modeling and in subsequent downstream sketch generation tasks. Our proposed latticed representation can be easily and effectively encoded using a simple off-the-shelf graph convolutional network (GCN) [kipf2016semi, chen2019multi, yang2018graph], resulting in considerably fewer model parameters ( times lesser) as compared to recent state-of-the-art techniques. This not only makes our proposed sketch representation easily deployable, thus making further progress towards practicality, but also reduces the difficulty in optimization and training to give a competitive performance.
Specifically, each point in SketchLattice is regarded as a graph node. Geometric proximity between nodes serves as the guiding principle for constructing the adjacency matrix that forms the graph links. Intuitively, the proposed GCN-based feature extractor learns the topology of points in a sketch object. Despite being simple, our novel sketch representation is surprisingly effective for sketch generation. In particular, using the proposed latticed representation, we show how to recover a corrupted sketch using our Lattice-GCN-LSTM network, as represented in Figure 1(b). Additionally, we present a novel aspect of sketch representation, where the abstraction level of the generated sketch is controllable, as shown in Figure 1(c), subject to the density of points sampled by the lattice graph. Furthermore, our method is also applicable to the problem of image-to-sketch synthesis by simply dropping a few key points along the edge of a target object, as depicted in Figure 1(d).
Our contributions are summarized as follows: (i) We propose SketchLattice, a novel latticed representation for sketches using an extremely simple formulation, i.e., a set of points sampled from a sketch image using a lattice graph. (ii) Our latticed representation can be easily and effectively encoded using a simple graph model that uses fewer model parameters, thereby making important progress towards efficiency. (iii) We show how the abstraction level of generated sketches is controllable by varying the density of points sampled from an image using our lattice graph.
2 Related Work
Sketch Data Format There are mainly two types of data formats for sketch representation: image-based and sequential. While the former treats a sketch as a conventional 2D image with pixel values (i.e., the pixelative format), the latter considers a sketch as an elaborately designed set of ordered stroke points (i.e., the vector format), represented by offset coordinates along the x and y directions with pen states (touch, lift and end) [ha2017neural]. Traditional sketch feature extractors are usually CNN-based approaches [yu2015sketch, chen2017sketch] that directly take a sketch image in pixelative format as input. However, this format is highly redundant due to the sparsity of line drawings in a sketch image, which necessitates heavy engineering efforts [yu2015sketch]. Additionally, CNN-based approaches cannot effectively capture structural cues since they do not encode the position and orientation of objects, leading to sub-par results in generation models.
In contrast, the sequential representation is sketch-specific [quickdraw], designed according to the human habit of drawing stroke by stroke. Such a sequential, vector-format representation allows sketches to be modeled with RNN-based methods. This has led to impressive results in sketch generation and sketch synthesis using long short-term memory (LSTM) [ha2017neural, su2020sketchhealer]. Although promising, such RNN-based approaches require a vectorized data format at the input, which is a major bottleneck limiting practical usage when vector sketches are unavailable, as with sketches drawn on a piece of paper. We therefore propose a more practical lattice sketch representation, i.e., SketchLattice, that avoids storing stroke orders while still keeping the strong spatial evidence that vector sketches typically provide.
Graphical Sketch Embedding Graph convolutional networks (GCNs) [bruna2013spectral, kipf2016semi] were originally designed to deal with structured data, such as knowledge graphs or social networks, by generalizing neural networks to graphs. In the past few years, exciting developments have explored the capabilities of GCNs for various vision tasks, including image classification [chen2019multi], captioning [yao2018exploring], image understanding [aditya2018image], action recognition [li2019actional], 3D object detection [zarzar2019pointrgcn], and shape analysis [wei2020view]. Yet only recently have a few attempts [yang2020sketchgcn, yang2020s] begun to apply GCNs to sketch embedding. The visual sparsity and spatial structure of sketch strokes are naturally compatible with graphical representations. However, the dominant approach in sketch research assumes access to a vector format where the stroke orders are required, which is a major limitation in real-world cases. In contrast, ours provides a generic approach that explores the geometrical proximity in a latticed representation of a sketch. We also show how our proposed graphical sketch embedding can additionally be used for tasks including sketch generation and image-to-sketch translation.
Overview We describe the Lattice-GCN-LSTM network where the central idea is a novel sketch representation technique that (i) transforms an input 2D sketch image into a set of points , using a lattice graph . Each point in represents the absolute coordinates and in the . We call as the lattice format representation of . (ii) Our novel lattice format could be seamlessly transformed to a graphical form that is encoded into a -dimensional sketch-level embedding vector using a simple off-the-shelf GCN-based model. (iii) We observe how this sketch-level embedding vector could help in downstream tasks such as sketch generation by using existing LSTM-based decoding models. Figure 2 offers a schematic illustration.
3.1 Latticed Sketch
The input to our proposed latticed sketch representation is a sketch image where and represent the width and height of respectively. We extract the latticed sketch from using the lattice graph . Our lattice graph is a grid that consists of uniformly distributed horizontal and vertical lines, arranged in a criss-cross manner. The optimal value of for any given sketch image , could be empirically determined during inference without further training. As shown in Figure 2, we construct by sampling the set of all overlapping points between a black pixel in sketch image representing a stroke region, and the horizontal or vertical lines in . Formally, we define as:
Although extremely simple, this novel latticed sketch representation is very informative since it can express the topology (i.e., overall structure and shape) of the original sketch image , without the need for vector data. Additionally, our latticed sketch representation is very flexible because (i) there is no constraint on the size of the input sketch image and the original aspect ratio is maintained, and (ii) the abstraction level of the generated sketches depends on the sampling density from the lattice graph, controlled by varying the value of . Increasing the value of would result in more detailed sketches, whereas decreasing it would lead to highly abstract sketches. Figures 1(c) and 4 demonstrate how adding more sample points changes the abstraction level of generated sketches.
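The sampling step described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the function name `sample_lattice_points`, the parameter `n_lines`, and the even spacing of lattice lines via `np.linspace` are hypothetical choices, not details taken from the paper.

```python
import numpy as np

def sample_lattice_points(sketch, n_lines):
    """Return (x, y) coordinates where stroke pixels meet the lattice lines.

    sketch  : 2D boolean array, True where a stroke (black) pixel is present.
    n_lines : number of horizontal and of vertical lattice lines (assumed name).
    """
    height, width = sketch.shape
    # Evenly spaced row/column indices for the lattice lines (assumed placement).
    rows = np.linspace(0, height - 1, n_lines).round().astype(int)
    cols = np.linspace(0, width - 1, n_lines).round().astype(int)

    on_lattice = np.zeros_like(sketch, dtype=bool)
    on_lattice[rows, :] = True   # horizontal lines
    on_lattice[:, cols] = True   # vertical lines

    # Keep only stroke pixels that lie on a lattice line.
    ys, xs = np.nonzero(sketch & on_lattice)
    return np.stack([xs, ys], axis=1)


# Toy example: a 9x9 image containing one diagonal stroke.
image = np.eye(9, dtype=bool)
points = sample_lattice_points(image, n_lines=3)
denser_points = sample_lattice_points(image, n_lines=5)
```

In this toy case a denser lattice yields more sampled points, which mirrors the link between sampling density and abstraction level discussed above.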
3.2 Graph Construction
Graph Nodes The SketchLattice can be effectively encoded by a simple graph model which not only consumes fewer model parameters, thereby increasing efficiency, but also allows easier optimization resulting in a model better trained to give a state-of-the-art performance. For each point we calculate representing elements of the set , denoting graph nodes. To ensure that the encoding process is amenable to structural changes, each point is tokenized by a learnable embedding function that maps the absolute point location to a -dimensional vector space. Formally,
is the resulting tokenized vector representation. Maintaining the original aspect ratio, we resize and pad the input sketch image to a size of (256, 256) before applying the lattice graph . Hence, the vocabulary size of the learned embedding function is . Our intuition is that, by using an embedding function that tokenizes each point location to a -dimensional vector , the model would learn to obtain similar embedding features for nearby points. Hence, the resulting representation would be more robust to the frequently observed shape deformations and more amenable to structural changes, which largely benefits sketch abstraction modeling in generation tasks.
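As a rough illustration of point tokenization, the sketch below maps each absolute (x, y) location on a 256 x 256 canvas to a single token index and looks up a vector in an embedding table. The one-token-per-pixel vocabulary, the row-major indexing, the embedding size, and the names `tokenize` and `embed` are all assumptions made for the example; the table here is random rather than learned.

```python
import numpy as np

GRID = 256              # image side after resizing and padding (from the text)
VOCAB = GRID * GRID     # assumption: one token per pixel location
EMBED_DIM = 8           # hypothetical embedding size for this example

rng = np.random.default_rng(0)
# Stand-in for a learnable embedding table; during training its rows
# would be updated by backpropagation.
embedding_table = rng.normal(size=(VOCAB, EMBED_DIM))

def tokenize(points):
    """Map integer (x, y) locations to token indices in [0, VOCAB)."""
    points = np.asarray(points)
    return points[:, 1] * GRID + points[:, 0]

def embed(points):
    """Look up one embedding vector per lattice point."""
    return embedding_table[tokenize(points)]

tokens = tokenize([[0, 0], [255, 255], [10, 3]])
nodes = embed([[0, 0], [255, 255], [10, 3]])
```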
Graph Edges A straightforward yet efficacious approach based on geometric proximity principles is adopted to construct the graph edge links among nodes, based on the corresponding latticed points’ locations . Specifically, we first compute the Euclidean distance between every pair of nodes by . Then we follow either of two options: (i) each node is connected to its nearest neighbor, or (ii) each node is connected to its nearby neighbors that are “close enough”, i.e., , where is a normalized distance in (0,1). is a pre-defined distance threshold whose value is empirically found to be in our case. An adjacency matrix is constructed by setting the link strength to for a pair of linked nodes (), such that a smaller distance would result in a larger score. The entries for all disconnected node pairs are set to .
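The two linking options can be illustrated as follows. This is a minimal sketch, not the authors' implementation: normalizing distances by the image diagonal, the default `threshold` of 0.2, the link strength of one minus the normalized distance, and the exclusion of self-links are assumptions chosen so that closer pairs receive larger scores, as the text requires.

```python
import numpy as np

def build_adjacency(points, mode="nearby", threshold=0.2, image_size=256):
    """Adjacency matrix from geometric proximity.

    The threshold value and the (1 - distance) link strength are assumptions.
    """
    points = np.asarray(points, dtype=float)
    diff = points[:, None, :] - points[None, :, :]
    # Euclidean distance, normalized by the image diagonal so it lies in [0, 1].
    dist = np.linalg.norm(diff, axis=-1) / (np.sqrt(2) * image_size)

    n = len(points)
    not_self = ~np.eye(n, dtype=bool)
    if mode == "nearest":
        # Option (i): link every node to its single nearest neighbor.
        masked = np.where(not_self, dist, np.inf)
        nearest = masked.argmin(axis=1)
        linked = np.zeros((n, n), dtype=bool)
        linked[np.arange(n), nearest] = True
        linked |= linked.T          # keep the graph undirected
    else:
        # Option (ii): link every pair that is "close enough".
        linked = (dist < threshold) & not_self

    # Closer pairs get a larger score; unlinked pairs stay at zero.
    return np.where(linked, 1.0 - dist, 0.0)

pts = [[0, 0], [10, 0], [200, 200]]
A_nearby = build_adjacency(pts, mode="nearby")
A_nearest = build_adjacency(pts, mode="nearest")
```

With these three points, the "nearby" option isolates the distant third point, while the "nearest" option still attaches it to its closest neighbor.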
3.3 Graphical Sketch Encoder
Given the graph nodes and their corresponding adjacency matrix , we employ a simple graph model to compute our final sketch-level latent vector . The resulting vector enables downstream applications, including sketch healing and image-to-sketch translation. We employ a stack of.
For each node in graph encoding layer , a feature propagation step is executed to produce the updated node feature , where each node attends to all its linked neighbors with non-zero link strength, defined in the adjacency matrix . We compute as:
Such a mechanism incorporating spatial awareness not only facilitates message passing among connected nodes, but also adds robustness to missing parts in a lattice sketch while encoding. This greatly benefits downstream tasks such as sketch healing [su2020sketchhealer]. A graph convolution is applied to the resulting rich spatially dependent feature as:
where each encoding layer consists of two multi-layer perceptron (MLP) units, each of which is followed by a rectified linear unit (ReLU). We employ dropout and a residual connection in each encoding layer, as shown in Figure 2. The final feature vectors of nodes from the graph encoding layer are integrated into a single vector which is further fed into a sequence of FC layer, batch normalization, and to compute our sketch-level latent representation .
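To make the data flow of one encoding layer concrete, here is a generic NumPy sketch: neighbor features are aggregated according to link strength, passed through two linear units each followed by ReLU, and added back through a residual connection. Row-normalizing the adjacency matrix, omitting dropout, using single linear maps for the MLP units, and mean pooling over nodes are assumptions for illustration; the exact propagation rule of the paper is not reproduced here.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def encoding_layer(H, A, W1, b1, W2, b2):
    """One graph encoding layer: propagate, transform, add residual.

    H : (n, d) node features; A : (n, n) link strengths.
    """
    # Feature propagation: each node aggregates its linked neighbors,
    # weighted by link strength (row-normalized here, an assumption).
    row_sum = A.sum(axis=1, keepdims=True)
    propagated = (A / np.maximum(row_sum, 1e-8)) @ H
    # Two linear units, each followed by ReLU.
    hidden = relu(propagated @ W1 + b1)
    out = relu(hidden @ W2 + b2)
    # Residual connection (dropout omitted in this inference-style sketch).
    return H + out

rng = np.random.default_rng(0)
n, d = 5, 4
H = rng.normal(size=(n, d))
A = rng.random((n, n))
A = (A + A.T) / 2
np.fill_diagonal(A, 0.0)
params = [rng.normal(size=s) * 0.1 for s in [(d, d), (d,), (d, d), (d,)]]

H_out = encoding_layer(H, A, *params)
# Integrate node features into a single vector (mean pooling is an assumed choice).
graph_vector = H_out.mean(axis=0)
```

Because of the residual path, a layer whose weights are all zero returns its input unchanged, which is one intuition for why removing the residual connection can make optimization harder.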
3.4 Sketch Generation by LSTM decoder
Following [ha2017neural, chen2017sketch, su2020sketchhealer], we design a generative LSTM decoder that generates the sequential sketch strokes in vector format. Accordingly, the sketch-level latent vector is projected into two vectors and , from which we sample a random vector by using the reparameterization trick [kingma2014vae] to introduce stochasticity in the generation process via an IID Gaussian variable :
are learned through backpropagation [ha2017neural]. The latent vector is used as a condition for the LSTM decoder to sequentially predict sketch strokes. Specifically, the output stroke representation from the previous time step, together with latent vector , serves as input to update the LSTM hidden state by:
where represents the concatenation operation. Next, a linear layer is used to predict an output stroke representation for the current time step, i.e., , where and are learnable weight and bias. The final stroke coordinates are derived from
with the help of Gaussian mixture models, to generate the vector sketch format, represented by. We refer readers to [ha2017neural, quickdraw] for more details.
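The reparameterization step and the construction of the decoder input can be illustrated as below. Treating the second projected vector as a log-variance, so that the sample is the mean plus exp(sigma_hat / 2) times Gaussian noise, follows the convention of SketchRNN [ha2017neural] and is an assumption here; the five-element stroke layout (two offsets and three pen states) is likewise assumed from the QuickDraw vector format.

```python
import numpy as np

def reparameterize(mu, sigma_hat, rng):
    """Sample z = mu + exp(sigma_hat / 2) * eps with eps ~ N(0, I).

    Treating sigma_hat as a log-variance is an assumed convention.
    """
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(sigma_hat / 2.0) * eps

rng = np.random.default_rng(0)
mu = np.array([0.5, -1.0, 2.0])
sigma_hat = np.full(3, -50.0)   # very small variance, so z stays near mu
z = reparameterize(mu, sigma_hat, rng)

# Decoder input at one time step: previous stroke concatenated with z.
# Assumed stroke layout: (dx, dy, pen_touch, pen_lift, pen_end).
prev_stroke = np.array([0.0, 0.0, 1.0, 0.0, 0.0])
decoder_input = np.concatenate([prev_stroke, z])
```

Sampling through a deterministic function of the noise keeps the path from the encoder outputs to the sample differentiable, which is what allows the two projected vectors to be learned by backpropagation.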
3.5 Model Training and Deployment
Our proposed graphical sketch encoder and the generative LSTM decoder are trained end-to-end for sketch generation. Note that, although we require vector sketches to train the LSTM decoder for the purpose of vector sketch generation, our model works entirely on image sketch input, rather than vector data, during inference. Following [ha2017neural], the goal is to minimize the negative log-likelihood of the generated probability distribution to explain the training data, which can be defined as:
which seeks to reconstruct the vector sketch representation from the predicted latent vector . Upon training, the decoder generates a vector sketch conditioned on the graphical encoded latent vector , obtained from our lattice sketch given any image sketch, thus being more effective for practical applications.
Being amenable to structural changes and enabling appropriate abstraction modeling for sketch generation are the two key aspects that our proposed SketchLattice representation aims to address. Specifically, we adopt the challenging task of sketch healing to test the robustness of our novel sketch representation to the structural deformations that frequently occur in sketches. Additionally, we observe how our proposed approach could be used to perform the task of image-to-sketch translation.
We implement our model in PyTorch [paszke2017automatic] using a single Nvidia Tesla T4 GPU. Optimization is performed using the Adam optimizer with parameters , and . Value of learning rate is set to along with a decay rate of in every iteration. A gradient clipping strategy is adopted to prevent gradients from exploding during the training of the LSTM decoder. Essentially, we force the gradient value to if the actual value is larger than . The optimal value for the number of graph encoding layers is .
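Value-based gradient clipping of the kind described above can be sketched as follows. The limit of 1.0 is a placeholder rather than the value used in the experiments, and clamping symmetrically on the negative side is an assumption. In PyTorch, `torch.nn.utils.clip_grad_value_` provides this behavior, although the text does not state which utility was actually used.

```python
import numpy as np

def clip_gradient(grad, limit):
    """Clamp every gradient entry to the range [-limit, limit]."""
    return np.clip(grad, -limit, limit)

grad = np.array([-3.0, 0.2, 5.0])
clipped = clip_gradient(grad, limit=1.0)   # limit=1.0 is a placeholder value
```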
4.1 Sketch Healing
The task of sketch healing [su2020sketchhealer] was proposed as a task akin to vector sketch synthesis. Specifically, given a partial sketch drawing, the objective is to recreate a sketch that best resembles the partial sketch.
From full to partial Given the lattice sketch representation from an input sketch image , we randomly drop a fraction of lattice points in with some probability to generate a partial SketchLattice, represented by . Hence, the graph edges linked to the removed node are also disconnected, thereby simultaneously modifying the adjacency matrix . One can think of as the corruption level of an input sketch image.
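The corruption procedure can be illustrated with a short sketch in which each lattice point is dropped independently with a given probability and the matching rows and columns of the adjacency matrix are removed. The function name `drop_points`, the parameter name `p_mask`, and independent per-point dropping are assumptions for the example.

```python
import numpy as np

def drop_points(points, adjacency, p_mask, rng):
    """Drop each lattice point independently with probability p_mask."""
    points = np.asarray(points)
    keep = rng.random(len(points)) >= p_mask
    partial_points = points[keep]
    # Removing a node also removes every edge attached to it.
    partial_adjacency = adjacency[np.ix_(keep, keep)]
    return partial_points, partial_adjacency, keep

rng = np.random.default_rng(0)
points = np.array([[0, 0], [10, 0], [20, 5], [30, 30]])
adjacency = np.ones((4, 4)) - np.eye(4)
partial_points, partial_adjacency, keep = drop_points(points, adjacency, 0.3, rng)
```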
4.1.1 Experimental Settings
Dataset Following the footsteps of [xu2019multigraph, su2020sketchhealer], we use QuickDraw [quickdraw] for evaluation since it is currently the largest doodle sketch dataset. More specifically, a small subset of categories are selected such that it includes (i) both complex and simple drawings, (ii) intra-category objects having high degree of resemblance among each other, and (iii) presence of common life object categories that contain diverse sub-categories such as bus and umbrella. In each category we use training and testing sketches. The selected categories are as follows: airplane, angel, apple, butterfly, bus, cake, fish, spider, The Great Wall, umbrella.
Competitors We compare our proposed Lattice-GCN-LSTM network with the three most popular alternatives for vector sketch generation: SketchRNN (SR) [ha2017neural], SketchPix2seq (Sp2s) [chen2017sketch], and SketchHealer (SH) [su2020sketchhealer]. The input to SketchRNN is a set of offsets in the x and y directions from a vector sketch representation. The key to SketchRNN is a sequence-to-sequence model, which is trained without the KL-divergence term. This is done to keep the comparison fair, since the KL-divergence term has been shown to be beneficial for multi-class scenarios [chen2017sketch]. SketchPix2seq, on the other hand, replaces the encoding module with a CNN-based encoder that accepts a pixelative format, i.e., a sketch image. Such a design is expected to capture better visual information. Note that, although neither SketchRNN nor SketchPix2seq was specifically designed for the task of sketch healing, following [su2020sketchhealer], we employ these techniques since they are procedurally compatible after being re-purposed. The only work specifically addressing the task of sketch healing is SketchHealer [su2020sketchhealer]. Given a vector sketch as input, SketchHealer converts it into a graphical form where each stroke is considered as a node. Next, visual image patches are extracted from each node region. A GCN-based model is applied for encoding a random vector . The decoding procedure of SketchHealer is identical to ours where is fed into a generative LSTM decoder to generate the corresponding vector sketch. Additionally, we retrain a variant of SketchHealer using only visual cues (SH-VC), for which stroke order is unavailable. Hence, geometric proximity principles are used to construct graph edges, similar to ours. This examines the performance of SH [su2020sketchhealer] in the absence of vector sketches.
Evaluation setup We adopt an evaluation setup similar to that of [song2018learning] and [su2020sketchhealer] for quantitative evaluation to understand the effectiveness of our novel latticed sketch representation. First, we evaluate the quality of generated vector sketches (transformed to pixelative format) via sketch recognition accuracy. A pre-trained multi-category classifier with AlexNet architecture is used, which is trained on the training split of QuickDraw categories. Higher recognition accuracy signifies the ability of the network to generate realistic sketches. It also indicates that the network is able to accurately model the underlying data distribution by effectively encoding a sketch into an accurate and informative sketch representation. We use testing sketches from each of the selected categories for a thorough evaluation. Second, we judge the recognizability of the encoded sketch-level latent vector by performing a sketch-to-sketch retrieval task. The objective is, given the encoded representation of a sketch , to retrieve sketches of the same category from a gallery of sketches. A higher retrieval accuracy signifies that the network has a strong sketch healing ability, due to its amenable and robust sketch representation.
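The retrieval metric can be made concrete with a small sketch of Top-1 accuracy: each query latent vector retrieves its nearest gallery vector, and a retrieval counts as correct when the categories match. Using Euclidean distance in the latent space is an assumption, as are the function and variable names; the data below are toy values.

```python
import numpy as np

def top1_retrieval_accuracy(query_z, query_labels, gallery_z, gallery_labels):
    """Fraction of queries whose nearest gallery item shares their category."""
    # Pairwise Euclidean distances between queries and gallery items (assumed metric).
    diff = query_z[:, None, :] - gallery_z[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    nearest = dist.argmin(axis=1)
    return float(np.mean(gallery_labels[nearest] == query_labels))

gallery_z = np.array([[0.0, 0.0], [10.0, 10.0]])
gallery_labels = np.array([0, 1])
query_z = np.array([[1.0, 0.0], [9.0, 9.0], [0.5, 0.5]])
query_labels = np.array([0, 1, 1])

accuracy = top1_retrieval_accuracy(query_z, query_labels, gallery_z, gallery_labels)
```

In this toy example the third query lies closest to a gallery item of the wrong category, so two of the three retrievals are correct.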
Qualitative Results We illustrate some examples produced by our lattice-based sketch generator under different values of in Figure 3. We can observe that (i) our latticed representation is robust to missing parts, such that ours can still generate a novel complete sketch even up to . (ii) The generated sketch is sensitive to the number of sampling points obtained, where more points lead to greater detail in the generated sketch. Taking the example of cake as shown in Figure 3, we observe how the bottom and candle regions are simplified when increasing . (iii) On further raising to , our model can hardly generate a satisfactory sketch, given the severe absence of input sampling points.
We further show that the abstraction level of the generated sketch changes when varying the number and position of input lattice points, as shown in Figure 4. For example, if we only drop points to form the wings of a butterfly, the resulting generated butterfly will be extremely simple. The body and antennae start to appear when we add more points to convey the intention of such details. Similar trends can be found in other classes as well.
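| Method | Vector input | Visual cues | #Params | Mask ratio | Recognition acc. | Top-1 retrieval |
|---|---|---|---|---|---|---|
| SR [ha2017neural] | ✓ | ✗ | 0.67 M | 10% | 25.08% | 50.65% |
| Sp2s [chen2017sketch] | ✗ | ✓ | 1.36 M | 10% | 24.26% | 45.20% |
| SH [su2020sketchhealer] | ✓ | ✓ | 1.10 M | 10% | 50.78% | 85.74% |
| SH-VC [su2020sketchhealer] | ✗ | ✓ | 1.10 M | 10% | - | 58.48% |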
Quantitative Results As discussed in Section 4.1.1, we compare the performance of different models under the two metrics (recognition accuracy and Top-1 retrieval) as shown in Table 1. We can observe from Table 1 that our approach outperforms other baseline methods on recognition accuracy, suggesting that the healed sketches obtained from ours are more likely to be recognized as objects in the correct categories. Importantly, we can also observe that, unlike other competitors, which are very sensitive to the corruption level, ours maintains a stable recognition accuracy even when increases up to . For the task of sketch-to-sketch retrieval, ours achieves the second-best result, behind SketchHealer. However, SketchHealer depends heavily on the stroke order provided by vectorized sketches, as evidenced by the dramatic decrease in retrieval performance (Table 1, SH vs SH-VC). This underlines the importance of our approach in the absence of vector input, which is a common situation in real-world cases. In addition, our network is much lighter than the other competitors, having far fewer parameters ( times fewer than SketchHealer), since our approach avoids the use of expensive CNN-based operations. We further include two classes, circle and clock, as distractors to apple for evaluation. A slightly better result of 77.80% (vs 76.02%) can be observed (12 classes, =10%), suggesting good scalability and robustness of our latticed representation. Figure 5 shows some examples of sketch-to-sketch retrieval.
Visualization of To further visualize the discriminative power of our lattice-based graphical encoder, we randomly select sketches from each class in the test set, and visualize their latent vectors using t-SNE in Figure 6. We observe that intra-category instances tend to cluster together, suggesting a category-level discriminative ability.
Ablation Study A thorough ablation study using sketch recognition accuracy is conducted to verify the effectiveness of our different design choices, such as (i) the size of the lattice graph, (ii) the proximity principle for graph construction, and (iii) the importance of the residual connection used in our graphical sketch encoder. As shown in Table 2, we can observe that (i) increasing the value of , that will simultaneously increase density of sampling lattice points from , directly translates into an improvement in recognition accuracy. From we start to observe saturation of performance, thereby indicating the optimal value for . (ii) For graph construction, linking each node to its nearby neighbors works better than linking only to the nearest one. (iii) Removing the residual connection leads to a significant drop in performance, thereby establishing its importance.
On Complex Sketches To test the robustness and applicability of our proposed latticed representation, we further investigate the performance when dealing with complex data. Specifically, we examine the recognition accuracy of the most complicated sketches (top ) over all categories, based on the number of strokes. From the results shown in Table 3, we can see that ours outperforms other competitors when healing the most complex sketches.
|SR [ha2017neural]||Sp2s [chen2017sketch]||SH [su2020sketchhealer]||Ours|
Human Study To gain more insight into the fidelity of the healed sketches, a human study is additionally conducted. We recruited 10 participants. 50 sketch samples across all 10 classes were randomly selected. Each sample has two corrupted instances associated, at mask ratio and , respectively. For each corrupted sketch, we generate a group of healed sketches using different methods (SketchHealer, SketchRNN, SketchPix2seq, and ours). We show each participant the corrupted sketch and the group of four healed versions in random order. Each participant is then asked to pick the healed sketch that best resembles the corrupted input. Results in Table 4 reveal that, according to the participants, sketches healed by our method resemble the corrupted input the best, at both corruption levels.
|SR [ha2017neural]||Sp2s [chen2017sketch]||SH [su2020sketchhealer]||Ours|
4.2 Image-to-Sketch Synthesis
Our Lattice-GCN-LSTM network can be applied to image-to-sketch translation. Essentially, given an input image, the corresponding edges are extracted using an off-the-shelf edge extractor [zitnick2014edge]. Next, we transform the edge map into a latticed sketch representation using our lattice graph. The resulting SketchLattice could be seamlessly encoded via our graphical encoder. Finally, a vector sketch can be produced using the generative LSTM decoder. Our objective is to generate a sketch that best resembles the ground-truth sketch drawn by humans. Once trained, for any input image, we can obtain some representative lattice points based on the corresponding edges and lattice graph. Then, our generative LSTM decoder can generate a sketch from the sketch-level encoded representation , as stated in section 3.4.
Experimental Settings We use QMUL-shoe-v2 [yu2016sketch], a fine-grained sketch-based image retrieval dataset, to evaluate our image-to-sketch synthesis approach. In total, there are 6648 one-to-one image-to-sketch pairs. The dataset is split into two parts, i.e., 6000 pairs for training and the remaining 648 pairs for testing. We chose the value of for our lattice graph as and adopt the “nearby” strategy for graph construction. LS-SCC [song2018learning], a current state-of-the-art method, is adopted for comparison.
Human Study We conduct a user study that judges two aspects: (i) realism of the generated sketches, i.e., whether a sketch “looks” like it was drawn by a human or not, and (ii) similarity between the produced sketch and its target photo. Specifically, we show triplets of images, i.e., a shoe photo and two corresponding sketches generated by LS-SCC [song2018learning] and ours in random order, to 10 new participants. Each participant is asked to (i) choose which of the two sketches “looks” more like a human drawing (REAL), and (ii) identify the sketch that best resembles the shoe photo (SIM).
Results and Analysis Some qualitative results are shown in Figure 7, where we can see that results from both LS-SCC [song2018learning] and ours are far from satisfactory when compared to the human-drawn sketches, yet our generated sketches depict more detailed features, such as the “heel”, “sole” and the “zipper”. The human study results in Table 5 suggest that the sketches produced by our model are closer to human drawings, while being equally effective at depicting real shoes compared to LS-SCC.
We introduced a novel sketch representation, SketchLattice, that not only removes the bottleneck of requiring vector data, but also preserves the essential structural cues that vector data provides. This results in a sketch representation that is particularly amenable to structural changes and that allows better abstraction modeling. We show that this new representation helps multiple sketch manipulation tasks, such as sketch healing and image-to-sketch synthesis, where it outperforms state-of-the-art alternatives despite using significantly fewer parameters.
This work was supported by the National Natural Science Foundation of China (NSFC) under 61601042.