Sketch-R2CNN: An Attentive Network for Vector Sketch Recognition

by Lei Li, et al.

Freehand sketching is a dynamic process in which points are sequentially sampled and grouped into strokes for sketch acquisition on electronic devices. To recognize a sketched object, most existing methods discard this important temporal ordering and grouping information provided by humans and simply rasterize sketches into binary images for classification. In this paper, we propose a novel single-branch attentive network architecture, RNN-Rasterization-CNN (Sketch-R2CNN for short), to fully leverage the dynamics in sketches for recognition. Sketch-R2CNN takes as input only a vector sketch with grouped sequences of points, and uses an RNN for stroke attention estimation in the vector space and a CNN for 2D feature extraction in the pixel space. To bridge the gap between these two spaces in neural networks, we propose a neural line rasterization module that converts the vector sketch, along with the attention estimated by the RNN, into a bitmap image, which is subsequently consumed by the CNN. The neural line rasterization module is designed in a differentiable way to yield a unified pipeline for end-to-end learning. We perform experiments on existing large-scale sketch recognition benchmarks and show that, by exploiting the sketch dynamics with the attention mechanism, our method is more robust and achieves better performance than the state-of-the-art methods.



1 Introduction

Freehand sketching is an easy and quick means of communication because of its simplicity and expressiveness. While a human has the innate ability to interpret drawing semantics, the vast capacity of expressiveness in sketches poses great perception challenges to machines. For better human-computer interactions, sketch analysis has been an active research topic in the computer vision and graphics fields, spanning a wide spectrum including sketch recognition [3, 44, 47], sketch segmentation [35, 11, 17, 18], sketch-based retrieval [4, 38, 30, 42] and modeling [26], etc. In this paper, we focus on developing a novel learning-based method for freehand sketch recognition.

The goal of sketch classification or recognition is to identify the object category of an input sketch, which is more challenging than image classification due to the lack of rich texture details, inherent ambiguities, and large shape variations in the input. Traditional studies [3, 31, 19] commonly cast sketch recognition as an image classification task by converting sketches into binary images and then extracting local image features. With the quantified feature descriptors, a typical classifier such as a Support Vector Machine (SVM) is trained for object category prediction. Recent years have witnessed the success of deep learning in image classification [14]. Similar neural network designs have also been used to address the recognition problem of sketch images [44, 30]. Although these deep learning-based methods outperform the traditional ones, the unique properties of sketches, as discussed in the following, are often overlooked, leaving room for further improving the performance of sketch recognition.

In general, a sketch has two widely used representations for processing: the raster pixel sketch and the vector sketch. Raster pixel sketches are binary images in which pixels covered by strokes have the value one and the remaining pixels the value zero, resulting in a large portion of void pixels and thus a sparse representation. This representation does not allow the state-of-the-art convolutional neural networks (CNNs) to easily distinguish which strokes are more important or which strokes can be ignored for better recognition [31]. Following the definition in [42], a vector sketch in our work refers to a sequence of strokes containing the points in the drawing order (Fig. 1). A vector sketch can be easily converted into a bitmap image through rasterization but not vice versa. Notably, vector sketches contain rich temporal ordering and grouping (i.e., stroke) information, which has been shown to be useful for learning more descriptive features [42]. However, these information cues are all discarded during the rasterization process for pixel images and are thus inaccessible to subsequent recognition algorithms.

Motivated by the above discussions, to address the incapacity of existing CNN-based methods for stroke importance interpretation, we propose a novel single-branch attentive network architecture, RNN-Rasterization-CNN (Sketch-R2CNN for short), for vector sketch recognition. Sketch-R2CNN takes advantage of both vector and raster representations of sketches during the learning process and is able to focus on adaptively learned important strokes, with an attention mechanism, for better recognition (Fig. 1). It takes only a vector sketch (i.e., grouped sequences of points) as input, and employs a recurrent neural network (RNN) in the first stage for analyzing the temporal ordering and grouping information in the input and producing attention estimations for the stroke points. We then develop a novel neural line rasterization (NLR) module, capable of converting the vector sketch with the computed attentions into an attention map in a differentiable way. Subsequently, Sketch-R2CNN uses a CNN to consume the obtained attention map for guided hierarchical understanding and feature extraction on critical strokes to identify the target object category. Our proposed NLR module is the key to connecting the vector sketch space and the raster sketch space in neural networks and allows gradient information to back-propagate from the CNN to the RNN for end-to-end learning. Experiments on existing large-scale sketch recognition benchmarks [3, 8] show that our method, leveraging more human factors in the input, performs better than the state-of-the-art methods, and our RNN-Rasterization-CNN design consistently improves the performance of CNN-only methods.

In summary, our contributions in this work are: (1) the first single-branch attentive network with an RNN-Rasterization-CNN design for vector sketch recognition; (2) a novel differentiable neural line rasterization module that unifies the vector sketch space and raster sketch space in neural networks, allowing end-to-end learning. We will make our code publicly available.

2 Related Work

Figure 1: Illustration of our single-branch attentive network architecture for vector sketch recognition. (Neural Line Raster stands for our neural line rasterization (NLR) module.)

To recognize sketched objects, traditional methods generally take preprocessed raster sketches as input. To quantify a sketch image, existing studies have tried to adapt several types of local features originally intended for photos (e.g., bag-of-features [3], Fisher Vectors with SIFT features [31], HOG features [19]) to line drawing images. With the extracted features, classifiers (e.g., SVMs) are then trained to recognize unseen sketches [3, 31]. Different learning schemes, such as multiple kernel learning [19] or active learning [43], may be employed for performance improvement. Another line of traditional methods has also attempted to utilize additional cues for recognition, such as prior knowledge for domain-specific sketches [1, 15, 27, 23, 32, 2] or object context for sketched scenes [47, 48]. While progress has been made in sketch recognition, these methods still cannot robustly handle freehand sketches with large shape or style variations, especially those hastily drawn in dozens of seconds [8], and struggle to achieve performance on par with humans on existing benchmarks such as the TU-Berlin benchmark [3].

Recently, deep learning has revolutionized many research fields, including sketch recognition, with state-of-the-art performance. Research efforts [30, 46, 39, 44] have been made to employ deep neural networks, such as AlexNet [14] or GoogLeNet [36], to learn more discriminative image features in the sketch domain to replace hand-engineered ones. Yu et al. [44] proposed Sketch-a-Net, an AlexNet-like architecture specifically adapted for sketch images by using large kernels in convolutions to accommodate the sparsity of stroke pixels. Their method achieved superior classification accuracy (77.95%) on the TU-Berlin benchmark [3], surpassing human performance (73.1%) for the first time. Their method still follows the existing learning process of image classification, i.e., using the raster image representation of sketches as CNN inputs, and thus cannot easily learn the awareness of stroke importance in an end-to-end manner for further improvement. In contrast, our network directly consumes vector sketches as input for learning stroke importance effectively and adaptively by exploiting the temporal ordering and grouping information therein with RNNs.

Vector representation of sketches has been considered for certain tasks such as sketch generation [7, 8, 33] or sketch hashing [42] with deep learning. For example, SketchRNN [8], which has received much attention recently, is built upon RNNs to process vector sketches. It is composed of an RNN encoder followed by an RNN decoder, and is able to model the underlying distribution of points in vector sketches for a specific object category. To learn to hash sketches for retrieval, Xu et al. [42] have demonstrated that an RNN branch, exploiting temporal ordering in vector sketches, can complement the other CNN branch for extracting more descriptive features. They fuse two types of features, produced by the RNN and the CNN respectively, via a late-fusion layer by concatenation. Our work shares a similar spirit with [42], advocating that the temporal and grouping information in vector sketches also offers additional cues for more accurate sketch recognition. In contrast to their two-branch network with simple concatenation, our RNN-Rasterization-CNN design seeks to boost the synergy between the two networks in a single branch during the learning process. To this end, inspired by [12], which proposed an approximate gradient for in-network mesh rendering and rasterization, we design a novel neural line rasterization module, allowing gradients to back-propagate from the CNN (raster sketch space) to the RNN (vector sketch space) for end-to-end learning.

For a sketch, its constituent strokes may contribute differently to its recognition. With a trained SVM, Schneider et al. [31] qualitatively analyzed how stroke importance affects classification scores by iteratively removing each stroke from the corresponding raster sketch image. To automatically capture stroke importance during the learning process, researchers have attempted to adapt an attention mechanism in network design [34]. Attention mechanisms have been widely used in many visual tasks, such as image classification [24, 40, 37, 10], image captioning [41, 22] or Visual Question Answering (VQA) [25]. A simple attention module generally works by computing soft masks over the spatial image grid [37, 41], or even feature channels [10], to obtain a weighted combination of features. Song et al. [34] have incorporated a spatial attention module for raster sketches in their network for fine-grained sketch-based image retrieval. Differently, Riaz Muhammad et al. tackled the sketch abstraction task with reinforcement learning, which aims to develop a stroke removal policy by considering the influence of each stroke on recognizability. As discussed in existing studies [44, 42, 6, 5], CNNs may suffer from the sparsity of inputs (e.g., raster sketches), though they excel at building hierarchical representations of 2D inputs. Instead of struggling to estimate attention from binary images that contain limited information [34], we argue that additional cues, such as the temporal ordering and grouping information in vector sketches, are essential to learn reliable attention for strokes. In our method, we resort to RNNs for computing attention for each point in a vector sketch, and use our NLR module for in-network vector-to-raster conversion. To the best of our knowledge, no existing work has tried to derive an attention map from vector sketches with RNNs for CNN-based sketch recognition.

3 Method

Our network architecture, as illustrated in Fig. 1, is composed of two cascaded sub-networks: an RNN for stroke attention estimation in the vector sketch space and a CNN for 2D feature extraction in the raster sketch space (Sec. 3.2). The key enabler for linking the two sub-networks that operate in completely different spaces is a novel neural line rasterization (NLR) module, which converts a vector sketch with the estimated attention to a raster pixel sketch in a differentiable way (Sec. 3.3). More specifically, during the forward inference pass, given a vector sketch as input, the RNN takes in a point at each time step and computes a corresponding attention value for the point. Our proposed NLR module then rasterizes the vector sketch, together with the estimated per-point attention, into an attention map and computes the corresponding gradients for the backward optimization pass. A subsequent CNN consumes the attention map as input for hierarchical understanding and produces category predictions as the final output.

3.1 Input Representation

The input to our network is a vector sketch, formed by a sequence of strokes, each stroke being represented by a sequence of points. This storage format is widely adopted for sketches in existing crowdsourced datasets [8, 30, 3].

Following [7], we denote a vector sketch as an ordered point sequence $S = \{p_i = (x_i, y_i, s_i)\}_{i=1}^{n}$, where $n$ is the total number of points in all strokes. For each point $p_i$, $x_i$ and $y_i$ are the 2D coordinates, and $s_i$ is a binary stroke state. Specifically, state $s_i = 0$ indicates that the current stroke has not ended and that the stroke connects $p_i$ to $p_{i+1}$; $s_i = 1$ indicates that $p_i$ is the last point of the current stroke and $p_{i+1}$ will be the starting point of another stroke. Our network takes only the vector sketch $S$ as input for end-to-end learning.
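As a concrete illustration of this input format, the following minimal Python sketch flattens a list of strokes into (x, y, s) triples and lists the line segments that stay within a stroke. The helper names `strokes_to_points` and `valid_segments` are hypothetical, and the convention that s = 1 marks the last point of a stroke is an assumption of this example.

```python
from typing import List, Tuple

Point = Tuple[float, float]
Triple = Tuple[float, float, int]


def strokes_to_points(strokes: List[List[Point]]) -> List[Triple]:
    """Flatten strokes into one ordered sequence of (x, y, s) triples.

    Assumed convention: s = 0 while the stroke continues,
    s = 1 on the last point of a stroke.
    """
    points: List[Triple] = []
    for stroke in strokes:
        for j, (x, y) in enumerate(stroke):
            s = 1 if j == len(stroke) - 1 else 0
            points.append((x, y, s))
    return points


def valid_segments(points: List[Triple]) -> List[Tuple[int, int]]:
    """Index pairs (i, i + 1) whose two points belong to the same stroke."""
    return [(i, i + 1) for i in range(len(points) - 1) if points[i][2] == 0]


# Two strokes: an L-shaped stroke with three points and a short horizontal one.
sketch = [[(0.0, 0.0), (1.0, 0.0), (1.0, 1.0)], [(3.0, 3.0), (4.0, 3.0)]]
pts = strokes_to_points(sketch)
segs = valid_segments(pts)
```

Note that the pair (2, 3) is not a valid segment, because point 2 closes the first stroke and point 3 opens the second.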

3.2 Network Architecture

Our network architecture is formed by two sequentially arranged sub-networks, which are linked by a differentiable NLR module. The first sub-network is an RNN, which analyzes the temporal ordering and grouping information in the input. The RNN consumes a vector sketch $S$ and estimates per-point attention as output at each iteration step. Specifically, we use a bidirectional Long Short-Term Memory (LSTM) unit with two layers as the first sub-network. We set the size of the hidden state to 512 and adopt dropout with probability 0.5. For the hidden state $h_i$ at step $i$, after the LSTM cell takes in $p_i$, we pass it through a fully-connected layer followed by a sigmoid function to produce per-point attention, denoted as $a_i$. That is, for each point $p_i$, we obtain a corresponding scalar $a_i$, signifying the importance of the point in the subsequent 2D visual understanding by the CNN. Similar to [8], instead of using absolute coordinates, for each $p_i$ fed into the RNN, we use the offsets from its previous point as its coordinates.
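The two small steps around the LSTM, namely the offset encoding of the input and the sigmoid attention head on the hidden state, can be sketched in plain Python as follows. This is only an illustration under assumptions: the first point is measured from the origin, and `w`, `b` and `h` are hypothetical stand-ins for the fully-connected layer's parameters and one LSTM hidden state; the LSTM itself is not modeled here.

```python
import math


def to_offsets(points):
    """Replace absolute (x, y) by offsets from the previous point.

    Assumption: the first point is measured from the origin (0, 0).
    The stroke state s is passed through unchanged.
    """
    out = []
    prev_x, prev_y = 0.0, 0.0
    for x, y, s in points:
        out.append((x - prev_x, y - prev_y, s))
        prev_x, prev_y = x, y
    return out


def attention_from_hidden(h, w, b):
    """Fully-connected layer with one output followed by a sigmoid."""
    z = sum(wi * hi for wi, hi in zip(w, h)) + b
    return 1.0 / (1.0 + math.exp(-z))


abs_points = [(2.0, 1.0, 0), (5.0, 1.0, 0), (5.0, 4.0, 1), (0.0, 0.0, 1)]
rel_points = to_offsets(abs_points)

# A zero pre-activation maps to an attention of exactly 0.5.
a_mid = attention_from_hidden([0.0, 0.0], [1.0, 1.0], 0.0)
```

Because of the sigmoid, every attention value lies strictly between 0 and 1, so it can be used directly as a pixel intensity.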

Next, we pass the point sequence along with the estimated attention, i.e., $\{(p_i, a_i)\}_{i=1}^{n}$, through our NLR module, as detailed in Sec. 3.3. The output of the module is a raster sketch image $I$, which can also be viewed as an attention map with the intensity of each stroke pixel as the corresponding attention. A deep CNN then takes the image $I$ as input for hierarchical 2D feature extraction. Sketch-a-Net [44] or ResNet50 [9] can be used as the backbone network, which is then connected to a fully-connected layer to produce estimations over all the possible object categories. We use the cross entropy loss for optimizing the whole network.

Our network architecture for sketch recognition differs from the one proposed by Xu et al. [42] for sketch retrieval in several aspects. First, their network has two branches for feature extraction, one branch with an RNN and the other branch with a CNN. During learning, their RNN and CNN individually work on two different sketch spaces with little interaction, except at the last concatenation layer for feature fusion. In contrast, our single-branch design allows more information flow between the RNN and the CNN owing to our NLR module: the RNN can complement the CNN by producing a more informative input, whereas the CNN provides guidance on attention estimation with learned hierarchical representations during back-propagation. In addition, our network uses only vector sketches as input and performs in-network vector-to-raster conversion, while the two-branch late-fusion network [42] requires both vector and raster sketches as input and thus needs a preprocessing stage for rasterization.

3.3 Neural Line Rasterization with Attention

To convert a point sequence with attention, $\{(p_i, a_i)\}_{i=1}^{n}$ (where $a_i$ is the attention of point $p_i$), to a pixel image $I$, the basic operation is to draw each valid line segment $\overline{p_i p_{i+1}}$ (Sec. 3.1) onto the canvas image. As illustrated in Fig. 2, to determine whether or not a pixel $p_k$ is on the target line segment, we simply compute the distance from its center to the line segment and check whether it is smaller than a predefined threshold $\epsilon$ (a fixed value in our experiments). If $p_k$ is a stroke pixel, we compute its attention by linear interpolation [12]; otherwise its attention is set to zero. More specifically, let $p_k'$ be the projection point of $p_k$'s center onto $\overline{p_i p_{i+1}}$. The intensity or attention of $p_k$ is then defined as

$$I_{p_k} = (1 - \alpha_k)\, a_i + \alpha_k\, a_{i+1},$$

where $\alpha_k = \|p_k' - p_i\|_2 / \|p_{i+1} - p_i\|_2$, and $p_k'$, $p_i$ and $p_{i+1}$ are in absolute coordinates. This rasterization process for line segments can be efficiently done in parallel on the GPU with a CUDA kernel. Note that in the implementation we need to record the relevant information, such as the line segment index $i$ and $\alpha_k$ at each pixel $p_k$, for subsequent gradient computation.

Figure 2: Rasterization of line segment $\overline{p_i p_{i+1}}$ and linear interpolation of the attention value for stroke pixel $p_k$.

Through the above process, a vector sketch can be easily converted into a raster image in the forward inference pass. In order to propagate gradients w.r.t. the loss function from the CNN to the RNN in the backward optimization pass, we need to derive gradients for the above rasterization process. Thanks to the simplicity of the linear interpolation used, the gradients can be computed as follows:

$$\frac{\partial I_{p_k}}{\partial a_i} = 1 - \alpha_k, \qquad \frac{\partial I_{p_k}}{\partial a_{i+1}} = \alpha_k.$$

Let $L$ be the loss function and $\delta_{p_k} = \partial L / \partial I_{p_k}$ be the gradient back-propagated into $I$ w.r.t. $p_k$ through the CNN. By the chain rule, we have

$$\frac{\partial L}{\partial a_i} = \sum_{k} \delta_{p_k}\, (1 - \alpha_k),$$

where $k$ iterates over all the stroke pixels covered by the line segment $\overline{p_i p_{i+1}}$. If $p_i$ is adjacent to another line segment $\overline{p_{i-1} p_i}$, we accumulate the gradients.

Our NLR module is simple and easy to implement, but it is crucial to bridge the gap between the vector sketch space and the raster sketch space in neural networks for end-to-end learning. Unlike existing methods [37, 34] that derive attention from feature maps produced by CNNs, with our NLR module, we can take advantage of additional cues (i.e., temporal ordering and grouping information) in vector sketches for better attention map estimation, as shown in experiments (Sec. 4.2). These additional cues, however, are not accessible for the methods with raster inputs.
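The forward rasterization and the backward gradient accumulation described in this section can be sketched on the CPU in plain Python. This is a minimal illustration, not the CUDA kernel used in the paper. Several details are assumptions of the example: pixel centers sit at (column + 0.5, row + 0.5), the projection is clamped to the segment endpoints, the threshold defaults to `eps=1.0`, a pixel covered by several segments keeps the value of the last segment drawn, and points are (x, y, s) triples in which s = 1 marks the last point of a stroke.

```python
import math


def rasterize(points, attn, size, eps=1.0):
    """Forward pass: draw every valid segment and interpolate attention.

    points: (x, y, s) triples in absolute pixel coordinates.
    attn:   per-point attention values.
    Returns the image and, per stroke pixel, its (segment index, alpha).
    """
    image = [[0.0] * size for _ in range(size)]
    record = {}
    for i in range(len(points) - 1):
        (x0, y0, s0), (x1, y1, _) = points[i], points[i + 1]
        if s0 == 1:  # point i closes a stroke, so no segment is drawn
            continue
        dx, dy = x1 - x0, y1 - y0
        len2 = dx * dx + dy * dy
        if len2 == 0.0:
            continue
        for r in range(size):
            for c in range(size):
                cx, cy = c + 0.5, r + 0.5  # assumed pixel center
                # alpha: normalized position of the projection on the segment
                alpha = ((cx - x0) * dx + (cy - y0) * dy) / len2
                alpha = min(1.0, max(0.0, alpha))
                px, py = x0 + alpha * dx, y0 + alpha * dy
                if math.hypot(cx - px, cy - py) < eps:
                    image[r][c] = (1.0 - alpha) * attn[i] + alpha * attn[i + 1]
                    record[(r, c)] = (i, alpha)
    return image, record


def backward(record, grad_image, n_points):
    """Backward pass: accumulate dL/da from the per-pixel gradients dL/dI."""
    grad_attn = [0.0] * n_points
    for (r, c), (i, alpha) in record.items():
        grad_attn[i] += grad_image[r][c] * (1.0 - alpha)
        grad_attn[i + 1] += grad_image[r][c] * alpha
    return grad_attn


# Demo: one horizontal stroke whose attention rises from 0.2 to 0.8.
points = [(1.5, 1.5, 0), (6.5, 1.5, 1)]
attn = [0.2, 0.8]
image, record = rasterize(points, attn, size=8)
ones = [[1.0] * 8 for _ in range(8)]  # dL/dI for the loss L = sum of pixels
grad_attn = backward(record, ones, len(points))
```

In the demo the stroke covers six pixels of row 1, and both endpoints receive the same total gradient because the interpolation weights are symmetric along the segment. Recording the segment index and the interpolation weight per pixel during the forward pass is what makes the backward pass a simple lookup.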

4 Experiments

4.1 Datasets and Settings

We have performed various experiments on two existing large-scale sketch recognition benchmarks, i.e., the TU-Berlin benchmark [3] and the QuickDraw benchmark [8], to validate the performance of our Sketch-R2CNN. These two benchmarks differ in several aspects, such as sketching style, acquisition procedure, and sketch quantity per category. Notably, sketches in the TU-Berlin benchmark tend to be more realistic while the ones in QuickDraw are more iconic and abstract (Fig. 4). The TU-Berlin benchmark [3] contains 250 object categories with 80 sketches per category. Each sketch was created within 30 minutes by a participant from Amazon Mechanical Turk (AMT). The QuickDraw benchmark [8] contains 345 object categories with 75K sketches per category. During acquisition, the participants were given only 20 seconds to sketch an object.

Similar to [8], to simplify sketches in the TU-Berlin benchmark, we applied the Ramer-Douglas-Peucker (RDP) algorithm, resulting in a maximum point sequence length of 448 for the RNN. Following [44], we used three-fold cross validation on this benchmark (i.e., two folds for training, one fold for testing). Sketches in the QuickDraw benchmark have already been preprocessed with the RDP simplification algorithm, and the maximum number of points in a sketch is 321. In each QuickDraw category, the 75K sketches have already been divided into training, validation and testing sets with sizes of 70K, 2.5K and 2.5K, respectively.
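For readers unfamiliar with RDP simplification, a minimal recursive Python version for a single stroke is given below. The tolerance `epsilon` and the helper names are assumptions of this example; the tolerance actually used for the TU-Berlin sketches is not stated in the text.

```python
import math


def perpendicular_distance(p, a, b):
    """Distance from point p to the line through a and b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    norm = math.hypot(dx, dy)
    if norm == 0.0:
        return math.hypot(px - ax, py - ay)
    return abs(dy * (px - ax) - dx * (py - ay)) / norm


def rdp(points, epsilon):
    """Ramer-Douglas-Peucker simplification of one stroke (polyline)."""
    if len(points) < 3:
        return list(points)
    # Find the interior point farthest from the chord first -> last.
    index, dmax = 0, 0.0
    for i in range(1, len(points) - 1):
        d = perpendicular_distance(points[i], points[0], points[-1])
        if d > dmax:
            index, dmax = i, d
    if dmax <= epsilon:
        # Every interior point is close enough to the chord: drop them all.
        return [points[0], points[-1]]
    # Otherwise keep the farthest point and simplify both halves.
    left = rdp(points[: index + 1], epsilon)
    right = rdp(points[index:], epsilon)
    return left[:-1] + right


stroke = [(0.0, 0.0), (1.0, 0.05), (2.0, 0.0), (3.0, 2.0), (4.0, 0.0)]
simplified = rdp(stroke, epsilon=0.1)
```

In this example the nearly collinear point (1.0, 0.05) is removed, while the sharp corner at (3.0, 2.0) is kept. Applying the function per stroke keeps the stroke grouping intact.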

We implemented our Sketch-R2CNN and NLR module with PyTorch. We adopted Adam for stochastic gradient descent updates with a mini-batch size of 48. We used a learning rate of 0.0001 for training on QuickDraw and 0.00005 for training or fine-tuning on TU-Berlin (see Sec. 4.2 for the pre-training and training procedures). Due to the limited training data in the TU-Berlin benchmark, we followed [44] to perform data augmentation, including horizontal reflection, stroke removal and sketch deformation.
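Two of the augmentations named above are simple enough to sketch directly on the stroke representation. The following Python snippet is only an illustration under assumptions: the exact procedures of [44] are not reproduced here, the drop probability and canvas width are hypothetical parameters, and sketch deformation is omitted.

```python
import random


def reflect_horizontal(strokes, width):
    """Mirror every point about the vertical center line of the canvas."""
    return [[(width - x, y) for x, y in stroke] for stroke in strokes]


def remove_strokes(strokes, drop_prob, rng):
    """Drop each stroke independently with probability drop_prob.

    Assumption: at least one stroke is always kept so the sketch is not empty.
    """
    kept = [stroke for stroke in strokes if rng.random() >= drop_prob]
    return kept or strokes[:1]


strokes = [[(10.0, 5.0), (20.0, 5.0)], [(30.0, 8.0), (40.0, 9.0)]]
mirrored = reflect_horizontal(strokes, width=256.0)
thinned = remove_strokes(strokes, drop_prob=0.3, rng=random.Random(0))
```

Operating on strokes rather than on pixels means the augmented sample is still a valid vector sketch and can be fed to the RNN.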

4.2 Results and Discussions

Results on TU-Berlin Benchmark. We have compared our method with a variety of existing methods on the TU-Berlin benchmark. Table 1 includes the results of some methods reported in [44]. These methods can be generally categorized into two groups. The first group follows the conventional pipeline using hand-crafted features + classifier, including the HOG-SVM method [3], structured ensemble matching [20], multi-kernel SVM [19], and the Fisher Vector based method [31]. The second group uses deep learning, including the state-of-the-art network Sketch-a-Net (the earlier version Sketch-a-Net v1 [45] and the later improved version Sketch-a-Net v2 [44]) and those networks that have been evaluated in [44]: LeNet [16], AlexNet-SVM [14] and AlexNet-Sketch [14].

| Model | Accuracy |
| --- | --- |
| Humans [3] | 73.1% |
| HOG-SVM [3] | 56.0% |
| Ensemble [20] | 61.5% |
| MKL-SVM [19] | 65.8% |
| Fisher-Vectors [31] | 68.9% |
| LeNet [16] | 55.2% |
| AlexNet-SVM [14] | 67.1% |
| AlexNet-Sketch [14] | 68.6% |
| Sketch-a-Net v1 [45] | 74.9% |
| Sketch-a-Net v2 [44] | 77.95% |
| Sketch-a-Net v2 (ours) [44] | 77.54% |
| ResNet50 [9] | 82.08% |
| Sketch-R2CNN (Sketch-a-Net v2) | 78.49% |
| Sketch-R2CNN (ResNet50) | 83.25% |

Table 1: Evaluations on the TU-Berlin benchmark. Our method with ResNet50 as the CNN backbone achieves the highest recognition accuracy. Sketch-a-Net v2 (ours) is our PyTorch-based implementation.

We reimplemented Sketch-a-Net v2 with PyTorch since the original model [44], implemented with Caffe, is not compatible with our framework (i.e., the NLR module). We pre-trained Sketch-a-Net v2 on QuickDraw [8] instead of preprocessed edge maps from photos [44] for ease of preparation and reproduction. Our best reproduced recognition accuracy of Sketch-a-Net v2 on the TU-Berlin benchmark is 77.54%, close to the accuracy of 77.95% reported with the original Caffe-based implementation [44]. In addition to Sketch-a-Net v2, we also evaluated ResNet50 [9], a more advanced CNN architecture that has been widely used for various visual tasks such as image classification [9] or object detection [21]. Specifically, before training on raster sketches of the TU-Berlin benchmark, we sequentially pre-trained the ResNet50 on ImageNet [29] and QuickDraw. The ResNet50 achieves a recognition accuracy of 82.08%, significantly outperforming the state-of-the-art approach Sketch-a-Net v2.

Since both Sketch-a-Net v2 and ResNet50 are CNN variants, they can be incorporated into our network architecture (Fig. 1) as the CNN backbone. By inserting one of these CNN alternatives into the proposed architecture, we can study how helpful the attention learned by the RNN can be for vector sketch recognition. The comparison results are summarized in Table 1. Our method with Sketch-a-Net v2, named Sketch-R2CNN (Sketch-a-Net v2) in Table 1, achieves a recognition accuracy of 78.49%, improving on Sketch-a-Net v2 (ours) by 0.95 percentage points. Another variant of our method with ResNet50, named Sketch-R2CNN (ResNet50) in Table 1, achieves an accuracy of 83.25%, improving on the ResNet50-only model by 1.17 percentage points, and surpasses all the existing approaches and human performance.

Alternatives Study on TU-Berlin Benchmark. To validate our proposed architecture, we have studied several network design alternatives on the TU-Berlin benchmark (Table 2). First, as mentioned in Sec. 2, attention modules have been used in existing CNN architectures for image classification [37] and sketch retrieval [34]. To compare against our RNN-based attention module, we modified ResNet50 and inserted the spatial attention module proposed by Song et al. [34] after the residual block [9, 21]. This modified version of ResNet50 still takes binary sketch images as input and tries to compute attention maps from feature maps of previous convolutional layers. This model, named Attentive-ResNet50 in Table 2, achieves a recognition accuracy of 82.42%, slightly higher than the 82.08% of the ResNet50-only model but lower than the 83.25% attained by our method, showing the comparatively higher effectiveness of the additional cues in vector sketches that our method uses for attention estimation. Attention maps produced by our RNN-based attention module and Attentive-ResNet50 are visualized in Fig. 3. Note that our method only predicts attention for stroke pixels and sets non-stroke pixels to have an attention value of zero, while Attentive-ResNet50 computes attention for every pixel of the attention map.

Figure 3: Visualization of attention maps, in grayscale and color coded, produced by our Sketch-R2CNN (ResNet50) and Attentive-ResNet50. Recognition failures are in red and successes are in green. Attention maps of Attentive-ResNet50 are estimated from feature maps of the last layer of the residual block, so their size differs from that of the attention maps produced by our method. (Best viewed in the electronic version.)

To study the influence of the temporal ordering information provided by humans on the RNN's attention estimation, we trained Sketch-R2CNN (ResNet50) with randomized stroke orders. That is, instead of keeping the human drawing order in the vector sketch, the stroke sequence is randomly shuffled. This scheme, named Random-Stroke-Order, achieves a recognition accuracy of 82.78% on the TU-Berlin benchmark, slightly lower than Sketch-R2CNN (ResNet50) but still superior to the ResNet50-only model. This indicates that the temporal information (i.e., stroke order) provided by humans can help the RNN to learn more descriptive sequential features, confirming a similar conclusion drawn from sketch retrieval experiments in [42].
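The stroke-order randomization can be pictured with a short Python sketch. It assumes, as the description suggests, that only the order of whole strokes is shuffled while the point order inside each stroke is preserved; the function name and the seeded generator are choices of this example.

```python
import random


def shuffle_stroke_order(strokes, rng):
    """Return the strokes in random order; points inside a stroke keep their order."""
    shuffled = list(strokes)
    rng.shuffle(shuffled)
    return shuffled


strokes = [
    [(0.0, 0.0), (1.0, 0.0)],
    [(2.0, 2.0), (3.0, 3.0), (4.0, 2.0)],
    [(5.0, 5.0), (6.0, 5.0)],
]
randomized = shuffle_stroke_order(strokes, random.Random(0))
```

The rasterized binary image of `randomized` is identical to that of `strokes`, so any accuracy difference in this ablation comes from the sequence seen by the RNN rather than from the pixels.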

| Model | Accuracy |
| --- | --- |
| Attentive-ResNet50 [34] | 82.42% |
| Random-Stroke-Order | 82.78% |
| Attention-using-Sketching-Order | 81.74% |
| Two-Branch-Late-Fusion [42] | 81.43% |
| Two-Branch-Early-Fusion | 81.84% |
| Sketch-R2CNN (ResNet50) | 83.25% |

Table 2: Alternative design choice studies on the TU-Berlin benchmark.

In addition to our RNN-based encoding method for vector sketches, we also explored a straightforward approach to allow CNNs to gain access to the sketching order information for feature extraction. Specifically, in a preprocessing step, for a sketch in the point sequence representation, we encode its ordering information into an image through rasterization by assigning an intensity value of one to the first point and zero to the last point and linearly interpolating the intensities of the points in-between. Fig. 5 shows some examples of the resulting images. This encoding scheme is based on a hypothesis that users tend to draw more “important” strokes first, and the resulting raster sketches can be considered as temporal-encoding attention maps. We trained a ResNet50 with such hand-crafted attention maps as input, but found that this encoding scheme, with a recognition accuracy of 81.74% (Attention-using-Sketching-Order in Table 2), is not effective and even slightly worse than the baseline with binary image inputs (ResNet50 in Table 1). This indicates that, for CNN-based recognition networks, stroke importance may not always be properly aligned with stroke order under such a straightforward encoding scheme, due to different drawing styles used by different users, and this encoding scheme may even pose challenges to CNNs for learning effective patterns. Thus, instead of “hard-coding” temporal information into images, a more adaptive and robust encoder (e.g., RNN) is needed to accommodate sequential variations in vector sketches.
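The hand-crafted temporal encoding described above amounts to a linear ramp over the point sequence. A minimal Python sketch is shown below; interpolating by point index (rather than, say, by arc length) is an assumption of this example, since the text does not specify the interpolation variable.

```python
def order_intensities(n_points):
    """Intensity 1.0 for the first point, 0.0 for the last, linear in between."""
    if n_points == 1:
        return [1.0]
    return [1.0 - i / (n_points - 1) for i in range(n_points)]


ramp = order_intensities(5)  # [1.0, 0.75, 0.5, 0.25, 0.0]
```

These per-point intensities play the same role as learned attention values during rasterization, except that they are fixed by drawing order and cannot adapt to different drawing styles.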

Next, we discuss arrangements of the RNN and the CNN in the network architecture design. As mentioned before, Xu et al. [42] use a two-branch late-fusion architecture, which fuses the features extracted from a CNN branch and a parallel RNN branch, for sketch retrieval. In contrast, our design combines an RNN encoder and a CNN feature extractor sequentially in a single branch for sketch classification. We therefore set up another experiment to investigate which of these two types of architecture is the better scheme for incorporating the additional temporal ordering and grouping information in vector sketches. Following [42], we built a similar model, named Two-Branch-Late-Fusion in Table 2, which uses the same RNN cell and CNN backbone as Sketch-R2CNN (ResNet50) for fairness and consistency. The training procedure is the same as for Sketch-R2CNN (ResNet50), with the softmax cross entropy loss [42]. Two-Branch-Late-Fusion achieves a recognition accuracy of 81.43% on the TU-Berlin benchmark, which is 1.82 percentage points lower than Sketch-R2CNN (ResNet50). This result reveals that our proposed single-branch architecture can make the CNN, which works as an abstract visual concept extractor, and the RNN, which models human sketching orders, complement each other better than the two-branch architecture. Surprisingly, the recognition accuracy of Two-Branch-Late-Fusion, adapted to the sketch classification task from the original sketch retrieval task, is even slightly inferior to that of the single CNN branch (ResNet50 in Table 1). This is also observed in the results on the QuickDraw benchmark, as presented in the following section. Due to the lack of implementation details in [42], we postulate that the differences in training strategies ([42]: multi-stage training for CNN and RNN; ours: joint training of CNN and RNN), CNN backbones ([42]: AlexNet; ours: ResNet50) and datasets ([42]: pruned QuickDraw dataset; ours: original TU-Berlin and QuickDraw datasets) may affect the learning of the late-fusion layer and cause the performance degradation.

Figure 4: Recognition comparisons between the CNN-only method (ResNet50) and our Sketch-R2CNN (ResNet50 as the CNN backbone). Failures are in red and successes are in green. Attention maps produced by our RNN are shown in the second row and are color coded. Note that our RNN only predicts attention for stroke pixels; non-stroke pixels are set to have an attention value of zero and are not color-coded.

Figure 5: The first row shows color-coded attention maps produced by our Sketch-R2CNN (ResNet50) for specific object categories. Correspondingly, in the second row, we directly encode the sketching order as attention maps, with higher attention values for strokes drawn earlier. Note that non-stroke pixels are set to have an attention value of zero and are not color-coded.

Complementing the above experiments on attention estimation with RNN and on arrangements of RNN and CNN, we extended the design choice exploration to an alternative way of injecting the learned attention from RNN into CNN. In our proposed architecture, the CNN directly takes the attention maps produced by the RNN as input. An alternative architecture is to weight the feature maps of a certain intermediate layer in the CNN (which still takes binary sketch images as input) with the attention map produced by the RNN, which takes vector sketches as input. In our implementation, we inject the attention map produced by RNN, which is of size with stroke width threshold , into the output of the residual block [9, 21] of ResNet50. Following the same training procedures as those in Table 2, this alternative architecture, named Two-Branch-Early-Fusion, achieves a recognition accuracy of on the TU-Berlin benchmark, performing slightly better than Two-Branch-Late-Fusion. However, the recognition accuracy of Two-Branch-Early-Fusion is still slightly inferior to that of the ResNet50-only model. This may be because non-stroke pixels in the attention map from the RNN have an attention value of zero, which, during the injection, makes the convolution features at the corresponding locations vanish, reducing the feature information learned by the preceding convolutional layers from the input.
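The vanishing effect described above can be seen in a toy example. The sketch assumes the injection is a plain element-wise product and that the attention map has already been resized to the spatial resolution of the feature map; both are assumptions for illustration, and `inject_attention` is a hypothetical helper, not the implementation used in the experiments.

```python
def inject_attention(features, attention):
    """Weight every channel of `features` (C x H x W nested lists)
    element-wise by `attention` (H x W nested lists)."""
    return [
        [
            [f * a for f, a in zip(feature_row, attention_row)]
            for feature_row, attention_row in zip(channel, attention)
        ]
        for channel in features
    ]


# Two channels of a 2 x 2 feature map (made-up values).
features = [
    [[0.5, 1.5], [2.0, 3.0]],
    [[1.0, 4.0], [0.25, 2.0]],
]
# Attention is zero at non-stroke locations (top-left and bottom-right here).
attention = [
    [0.0, 0.5],
    [1.0, 0.0],
]

weighted = inject_attention(features, attention)
# Every feature at a zero-attention location is wiped out, regardless of
# what the preceding convolutional layers computed there.
```

Here `weighted` keeps non-zero values only at the two stroke locations, so any response the earlier layers produced in the neighborhood of a stroke, but off the stroke pixels themselves, is discarded.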

Model Accuracy
Sketch-a-Net v2 [44] 74.84%
ResNet50 [9] 82.48%
Two-Branch-Late-Fusion [42] 82.11%
Sketch-R2CNN (Sketch-a-Net v2) 77.29%
Sketch-R2CNN (ResNet50) 84.41%
Table 3: Evaluations on the QuickDraw benchmark.

Results on QuickDraw Benchmark. We further compared the proposed Sketch-R2CNN with Sketch-a-Net v2 [44], the ResNet50-only model, and Two-Branch-Late-Fusion [42] on the QuickDraw benchmark. Note that ResNet50 is pre-trained on ImageNet [29] and serves as the CNN backbone in Sketch-R2CNN and Two-Branch-Late-Fusion. Quantitative results are summarized in Table 3, and the performance of each competing method on the QuickDraw benchmark agrees well with that on the TU-Berlin benchmark. Compared to the competitors, Sketch-R2CNN (ResNet50) achieves the highest recognition accuracy on the QuickDraw benchmark, echoing its performance on the TU-Berlin benchmark. The same holds for the ResNet50-only model, which still achieves better recognition performance than both Sketch-a-Net v2 and Two-Branch-Late-Fusion. Sketch-R2CNN (ResNet50) and Sketch-R2CNN (Sketch-a-Net v2) improve on ResNet50 and Sketch-a-Net v2, respectively, by about 2 percentage points. Although the sketch quality of QuickDraw may not be as good as that of TU-Berlin, thanks to the voluminous data of QuickDraw (24.15M sketches for training, 862.5K sketches for validation or testing), we still see consistent performance improvement of Sketch-R2CNN over CNN-only models, showing the generality of our proposed architecture.
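The improvements over the CNN-only baselines can be read directly off Table 3; the short calculation below just makes the arithmetic explicit. The dictionary keys and the `gain` helper are illustrative names, while the accuracy values are copied from the table.

```python
# Accuracies (%) on the QuickDraw benchmark, copied from Table 3.
accuracy = {
    "Sketch-a-Net v2": 74.84,
    "ResNet50": 82.48,
    "Two-Branch-Late-Fusion": 82.11,
    "Sketch-R2CNN (Sketch-a-Net v2)": 77.29,
    "Sketch-R2CNN (ResNet50)": 84.41,
}


def gain(backbone):
    """Accuracy gain, in percentage points, of Sketch-R2CNN over its CNN-only backbone."""
    return round(accuracy[f"Sketch-R2CNN ({backbone})"] - accuracy[backbone], 2)


gain_resnet = gain("ResNet50")              # 1.93 points
gain_sketchanet = gain("Sketch-a-Net v2")   # 2.45 points
# Late fusion relative to the plain ResNet50 backbone: slightly negative.
late_fusion_delta = round(accuracy["Two-Branch-Late-Fusion"] - accuracy["ResNet50"], 2)
```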

Qualitative Results. Fig. 4 shows some qualitative recognition comparisons between the CNN-only method (ResNet50) and our Sketch-R2CNN (ResNet50). Through visualization, it is observed that the attention maps produced by the RNN in Sketch-R2CNN can help the CNN to focus on more effective stroke parts of the inputs and ignore the interference of irrelevant strokes (e.g., the circle around the crab in Fig. 4) to make better classifications. In contrast, the CNN-only model cannot access the additional ordering and grouping cues existing in vector sketches and thus tends to struggle with sketches that have similar shapes but different category labels. Fig. 5 visualizes the attention maps by our method and the ones encoding sketching order (used in Attention-using-Sketching-Order in Table 2). It is observed that our attention maps estimated by RNN share a certain degree of similarity with the ones using sketching order, but the attention magnitudes by RNN are more adaptively biased.
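To make the sketching-order baseline concrete, the following sketch builds an attention map in which earlier strokes receive higher values and non-stroke pixels stay at zero. The linear decay, the `floor` parameter, and the point-only painting (no line interpolation between points) are assumptions chosen for illustration; the exact encoding behind Attention-using-Sketching-Order is not reproduced here.

```python
def order_attention(num_strokes, floor=0.2):
    """Per-stroke attention decaying linearly from 1.0 (first stroke)
    to `floor` (last stroke). The decay schedule is an assumption."""
    if num_strokes == 1:
        return [1.0]
    step = (1.0 - floor) / (num_strokes - 1)
    return [1.0 - i * step for i in range(num_strokes)]


def paint_attention(strokes, weights, size):
    """Write each stroke's weight at its points in a size x size grid.
    Pixels that no stroke touches keep an attention value of zero."""
    grid = [[0.0] * size for _ in range(size)]
    for points, weight in zip(strokes, weights):
        for x, y in points:
            # Where strokes overlap, keep the larger (earlier-stroke) value.
            grid[y][x] = max(grid[y][x], weight)
    return grid


# Three toy strokes given as integer pixel coordinates, in drawing order.
strokes = [[(0, 0), (1, 0)], [(2, 2)], [(3, 3), (0, 0)]]
weights = order_attention(len(strokes))
order_map = paint_attention(strokes, weights, size=4)
```

A learned attention map differs from this fixed schedule in that the RNN can assign a high value to a late stroke or a low value to an early one, depending on the input.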

Figure 6: Recognition failures of our Sketch-R2CNN (ResNet50).

Limitation. As shown in Fig. 6, in some cases the RNN in Sketch-R2CNN may fail to produce correct attention guidance for the subsequent CNN, leading to recognition failures (e.g., the pumpkin). A possible cause is its inability to extract effective sequential features from inputs whose temporal ordering and grouping cues resemble those of training sketches in other categories. Some sketches with seemingly ambiguous categories (e.g., the toaster) may also pose challenges to our method. Humans would be expected to make similar mistakes in such cases. One possible way to address the ambiguity is to put the sketched objects in context (i.e., scenes) and integrate our method with context-based recognition methods [47, 48].

5 Conclusion

In this work, we have proposed a novel single-branch attentive network architecture named Sketch-R2CNN for vector sketch recognition. Our RNN-Rasterization-CNN design consistently improves the recognition accuracy of CNN-only models by 1-2% on two existing large-scale sketch recognition benchmarks. The key enabler for joining RNN and CNN together is a novel differentiable neural line rasterization module that performs in-network vector-to-raster sketch conversion. Applying Sketch-R2CNN to other tasks like sketch retrieval or sketch synthesis that need descriptive line-drawing features could be interesting to explore in the future.


  • [1] C. Alvarado and R. Davis. SketchREAD: A multi-domain sketch recognition engine. In Proc. ACM UIST. ACM, 2004.
  • [2] R. Arandjelović and T. M. Sezgin. Sketch recognition by fusion of temporal and image-based features. Pattern Recogn., 44(6):1225–1234, 2011.
  • [3] M. Eitz, J. Hays, and M. Alexa. How do humans sketch objects? ACM TOG, 31(4):44:1–44:10, July 2012.
  • [4] M. Eitz, R. Richter, T. Boubekeur, K. Hildebrand, and M. Alexa. Sketch-based shape retrieval. ACM TOG, 31(4):31:1–31:10, July 2012.
  • [5] B. Graham, M. Engelcke, and L. van der Maaten. 3d semantic segmentation with submanifold sparse convolutional networks. In Proc. IEEE CVPR, 2018.
  • [6] B. Graham and L. van der Maaten. Submanifold sparse convolutional networks. CoRR, abs/1706.01307, 2017.
  • [7] A. Graves. Generating sequences with recurrent neural networks. CoRR, abs/1308.0850, 2013.
  • [8] D. Ha and D. Eck. A neural representation of sketch drawings. In Proc. ICLR, 2018.
  • [9] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proc. IEEE CVPR, June 2016.
  • [10] J. Hu, L. Shen, and G. Sun. Squeeze-and-excitation networks. In Proc. IEEE CVPR, 2018.
  • [11] Z. Huang, H. Fu, and R. W. H. Lau. Data-driven segmentation and labeling of freehand sketches. ACM TOG, 33(6):175:1–175:10, Nov. 2014.
  • [12] H. Kato, Y. Ushiku, and T. Harada. Neural 3d mesh renderer. In Proc. IEEE CVPR, 2018.
  • [13] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. CoRR, abs/1412.6980, 2014.
  • [14] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, pages 1097–1105. 2012.
  • [15] J. J. LaViola, Jr. and R. C. Zeleznik. MathPad2: A system for the creation and exploration of mathematical sketches. ACM TOG, 23(3):432–440, Aug. 2004.
  • [16] Y. A. LeCun, L. Bottou, G. B. Orr, and K.-R. Müller. Neural Networks: Tricks of the Trade: Second Edition, pages 9–48. 2012.
  • [17] K. Li, K. Pang, J. Song, Y.-Z. Song, T. Xiang, T. M. Hospedales, and H. Zhang. Universal sketch perceptual grouping. In Proc. ECCV, 2018.
  • [18] L. Li, H. Fu, and C.-L. Tai. Fast sketch segmentation and labeling with deep learning. CoRR, abs/1807.11847, 2018.
  • [19] Y. Li, T. M. Hospedales, Y.-Z. Song, and S. Gong. Free-hand sketch recognition by multi-kernel feature learning. CVIU, 137:1–11, 2015.
  • [20] Y. Li, Y.-Z. Song, and S. Gong. Sketch recognition by ensemble matching of structured features. In Proc. BMVC, 2013.
  • [21] T.-Y. Lin, P. Dollar, R. Girshick, K. He, B. Hariharan, and S. Belongie. Feature pyramid networks for object detection. In Proc. IEEE CVPR, July 2017.
  • [22] J. Lu, C. Xiong, D. Parikh, and R. Socher. Knowing when to look: Adaptive attention via a visual sentinel for image captioning. In Proc. IEEE CVPR, 2017.
  • [23] T. Lu, C.-L. Tai, F. Su, and S. Cai. A new recognition model for electronic architectural drawings. CAD, 37(10):1053–1069, 2005.
  • [24] V. Mnih, N. Heess, A. Graves, and K. Kavukcuoglu. Recurrent models of visual attention. In NIPS, pages 2204–2212. 2014.
  • [25] H. Nam, J.-W. Ha, and J. Kim. Dual attention networks for multimodal reasoning and matching. In Proc. IEEE CVPR, 2017.
  • [26] L. Olsen, F. F. Samavati, M. C. Sousa, and J. A. Jorge. Sketch-based modeling: A survey. Comput. & Graph., 33(1):85–103, 2009.
  • [27] T. Y. Ouyang and R. Davis. ChemInk: A natural real-time recognition system for chemical drawings. In Proc. ACM IUI. ACM, 2011.
  • [28] U. Riaz Muhammad, Y. Yang, Y.-Z. Song, T. Xiang, and T. M. Hospedales. Learning deep sketch abstraction. In Proc. IEEE CVPR, June 2018.
  • [29] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet large scale visual recognition challenge. IJCV, 115(3):211–252, Dec 2015.
  • [30] P. Sangkloy, N. Burnell, C. Ham, and J. Hays. The Sketchy Database: Learning to retrieve badly drawn bunnies. ACM TOG, 35(4):119:1–119:12, July 2016.
  • [31] R. G. Schneider and T. Tuytelaars. Sketch classification and classification-driven analysis using Fisher Vectors. ACM TOG, 33(6):174:1–174:9, Nov. 2014.
  • [32] T. M. Sezgin and R. Davis. Sketch recognition in interspersed drawings using time-based graphical models. Comput. & Graph., 32(5):500–510, 2008.
  • [33] J. Song, K. Pang, Y.-Z. Song, T. Xiang, and T. M. Hospedales. Learning to sketch with shortcut cycle consistency. In Proc. IEEE CVPR, June 2018.
  • [34] J. Song, Q. Yu, Y.-Z. Song, T. Xiang, and T. M. Hospedales. Deep spatial-semantic attention for fine-grained sketch-based image retrieval. In Proc. IEEE ICCV, 2017.
  • [35] Z. Sun, C. Wang, L. Zhang, and L. Zhang. Free hand-drawn sketch segmentation. In Proc. ECCV, pages 626–639. Springer, 2012.
  • [36] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In Proc. IEEE CVPR, 2015.
  • [37] F. Wang, M. Jiang, C. Qian, S. Yang, C. Li, H. Zhang, X. Wang, and X. Tang. Residual attention network for image classification. In Proc. IEEE CVPR, 2017.
  • [38] F. Wang, L. Kang, and Y. Li. Sketch-based 3d shape retrieval using convolutional neural networks. In Proc. IEEE CVPR, 2015.
  • [39] X. Wang, X. Chen, and Z. Zha. SketchPointNet: A compact network for robust sketch recognition. In Proc. ICIP, pages 2994–2998, 2018.
  • [40] T. Xiao, Y. Xu, K. Yang, J. Zhang, Y. Peng, and Z. Zhang. The application of two-level attention models in deep convolutional neural network for fine-grained image classification. In Proc. IEEE CVPR, 2015.
  • [41] K. Xu, J. Ba, R. Kiros, K. Cho, A. C. Courville, R. Salakhutdinov, R. S. Zemel, and Y. Bengio. Show, attend and tell: Neural image caption generation with visual attention. CoRR, abs/1502.03044, 2015.
  • [42] P. Xu, Y. Huang, T. Yuan, K. Pang, Y.-Z. Song, T. Xiang, T. M. Hospedales, Z. Ma, and J. Guo. SketchMate: Deep hashing for million-scale human sketch retrieval. In Proc. IEEE CVPR, June 2018.
  • [43] E. Yanık and T. M. Sezgin. Active learning for sketch recognition. Comput. & Graph., 52:93–105, 2015.
  • [44] Q. Yu, Y. Yang, F. Liu, Y.-Z. Song, T. Xiang, and T. M. Hospedales. Sketch-a-Net: A deep neural network that beats humans. IJCV, 122(3):411–425, May 2017.
  • [45] Q. Yu, Y. Yang, Y.-Z. Song, T. Xiang, and T. M. Hospedales. Sketch-a-net that beats humans. In Proc. BMVC, pages 7.1–7.12, 2015.
  • [46] H. Zhang, S. Liu, C. Zhang, W. Ren, R. Wang, and X. Cao. SketchNet: Sketch classification with web images. In Proc. IEEE CVPR, 2016.
  • [47] J. Zhang, Y. Chen, L. Li, H. Fu, and C.-L. Tai. Context-based sketch classification. In Proc. Expressive, pages 3:1–3:10. ACM, 2018.
  • [48] C. Zou, Q. Yu, R. Du, H. Mo, Y.-Z. Song, T. Xiang, C. Gao, B. Chen, and H. Zhang. SketchyScene: Richly-annotated scene sketches. In Proc. ECCV, September 2018.