Deep Self-Supervised Representation Learning for Free-Hand Sketch

02/03/2020 ∙ by Peng Xu, et al. ∙ University of Surrey ∙ Nanyang Technological University

In this paper, we tackle, for the first time, the problem of self-supervised representation learning for free-hand sketches. This importantly addresses a common problem faced by the sketch community – that annotated supervisory data are difficult to obtain. The problem is very challenging in that sketches are highly abstract and subject to different drawing styles, making existing solutions tailored for photos unsuitable. Key to the success of our self-supervised learning paradigm are our sketch-specific designs: (i) we propose a set of pretext tasks specifically designed for sketches that mimic different drawing styles, and (ii) we further exploit a temporal convolutional network (TCN) in a dual-branch architecture for sketch feature learning, as a means to accommodate the sequential stroke nature of sketches. We demonstrate the superiority of our sketch-specific designs through two sketch-related applications (retrieval and recognition) on a million-scale sketch dataset, and show that the proposed approach outperforms state-of-the-art unsupervised representation learning methods and significantly narrows the performance gap with supervised representation learning.




I Introduction

Deep learning approaches have now delivered practical-level performance on various artificial intelligence tasks [40, 32, 22, 10, 33, 51, 41, 19, 42, 23, 48, 2, 11, 31]. However, most state-of-the-art deep models still rely on massive amounts of annotated supervisory data. Such labor-intensive supervision is so expensive that it has become a bottleneck for the general application of deep learning techniques. As a result, deep unsupervised representation learning [1, 9, 18] has gained considerable attention in recent years.

However, most existing deep-learning-based unsupervised representation methods in computer vision are engineered for photos [9] and videos [15]. Unsupervised learning for sketches, on the other hand, remains relatively under-studied. It is nonetheless an important topic – the lack of annotated data is particularly salient for sketches, since unlike photos, which can be automatically crawled from the internet, sketches have to be drawn one by one.

Sketch-related research has flourished in recent years [47], largely driven by the ubiquitous nature of touchscreen devices. Many problems have been studied to date, including sketch recognition [37, 20, 43, 45], sketch hashing [44, 24], sketch-based image retrieval [39, 46, 5], sketch synthesis [3], segmentation [38], scene understanding [49], and abstraction [25], to name a few. However, almost all existing sketch-related deep learning techniques work in supervised settings, relying upon manually labeled sketch datasets [35, 39] collected via crowdsourcing.

Despite decent reported performance, research progress on sketch understanding is largely bottlenecked by the size of annotated datasets (in their thousands). Recent efforts have since been made to create large-scale datasets [12, 5], yet their sizes and category coverage (annotated labels) are still far inferior to their photo counterparts [4]. Furthermore, and perhaps more importantly, sketch datasets are not easily extendable, since sketches have to be manually produced rather than automatically crawled from the internet. In this paper, we attempt to offer a new perspective to alleviate the data scarcity problem – we move away from the common supervised learning paradigm, and for the first time study the novel and challenging problem of self-supervised representation learning for sketches.

Fig. 1: Illustration of the diversity of human sketching. Different people have different drawing styles.

Solving the self-supervised learning problem for sketches is, however, non-trivial. Sketches are distinctively different from photos – a sketch is a temporal sequence of black-and-white strokes, whereas photos are collections of static pixels exhibiting rich color and texture information. Furthermore, sketches are highly abstract and subject to different drawing styles. All these unique characteristics make self-supervised pretext tasks designed for photos fail to perform on sketches. This is mainly because the commonly adopted patch-based approaches [52, 6, 27] are not compatible with sketches – sketches are formed of sparse strokes, as opposed to dense pixel patches (see Figure 2). Our first contribution is therefore a set of sketch-specific pretext tasks that attempt to mimic the various drawing styles found in sketches. As shown in Figure 1, human sketching styles are highly diverse. More specifically, we first define a set of geometric deformations that simulate variations in human sketching (see Figure 5). Based on these deformations, we design a set of binary classification pretext tasks to train a deep model that estimates whether a geometric deformation has been applied to the input sketch. Intuitively, this encourages the model to learn to recognize sketches regardless of drawing style (deformation), in turn forcing it to learn to represent the input. This is akin to the intuition of [9], which asks for rotation invariance, but is otherwise specifically designed for sketches.

Fig. 2: A sketch of an alarm clock with its patches and the shuffled tiles. Sketch patches are too abstract to recognize, due to stroke sparsity.

As our second contribution, we further exploit a temporal convolutional network (TCN) to address the sequential stroke nature of sketches, and propose a dual-branch CNN-TCN architecture serving as the feature extractor for sketch self-supervised representation learning. Current state-of-the-art feature extractors for sketches typically employ an RNN architecture to model the stroke sequence [20, 36]. However, a key insight highlighted by recent research, which we share in this paper, is that local stroke ordering can be noisy, and that invariance in sketching is achieved at the stroke-group (part) level [25]. This means that RNN-based approaches that model at the stroke level might work counter-productively. Using a TCN, however, we are able to use convolution kernels of different sizes to perceive strokes at different granularities (groups), hence producing more discriminative sketch features for self-supervised learning.

Extensive experiments on million-scale sketches demonstrate the superiority of our sketch-specific designs. In particular, we show that, for sketch retrieval and recognition, our self-supervised representation approach not only outperforms existing state-of-the-art unsupervised representation learning techniques (e.g., self-supervised, clustering-based, reconstruction-based, and GAN-based methods), but also significantly narrows the performance gap with supervised representation learning.

Our contributions can be summarized as follows: (i) Motivated by human free-hand drawing styles and habits, we propose a set of sketch-specific self-supervised pretext tasks within a deep learning pipeline, which provide supervisory signals for semantic feature learning. (ii) We exploit, for the first time, a temporal convolutional network (TCN) for sketch feature modeling, and propose a dual-branch CNN-TCN feature extractor architecture for self-supervised sketch representation learning. We also show that our CNN-TCN architecture generalizes well under supervised feature representation settings (e.g., fully supervised sketch recognition), and outperforms existing architectures for sketch feature extraction (i.e., CNN, RNN, CNN-RNN).

The rest of this paper is organized as follows: Section II briefly summarizes related work. Section III describes our proposed sketch-specific self-supervised representation learning method, introducing our proposed sketch-specific self-supervised pretext tasks and feature extractor architecture. Experimental results and discussion are presented in Section IV. We present some conclusions and insights in Section V. Finally, future work is discussed in Section VI.

II Related Work

II-A Self-Supervised Pretext Tasks

Deep-learning-based unsupervised representation learning techniques can be broadly categorized into self-supervised methods [9], auto-encoders [30], generative adversarial networks [7, 34], and clustering [8, 21]. Currently, self-supervised representation learning [9] achieves state-of-the-art performance on computer vision tasks such as classification and segmentation. The key technique in self-supervised approaches is defining pretext tasks that force the model to learn how to represent the features of the input. Existing self-supervised pretext tasks mainly include patch-level predictions [6, 26, 28] and image-level predictions [9]. Sketch is essentially different from photo, so patch-based pretext tasks (e.g., predicting the relative position of image patches [6]) fail to work on sketch, because sketch patches are too abstract to recognize. Colorization-related pretext tasks [52, 16] are also unsuitable for sketch, since sketches are color-free. To the best of our knowledge, ours is the first self-supervised representation learning work on sketches.

II-B Sketch Feature Extractor Architecture

Most prior works model a sketch as a static pixel picture and use a CNN as the feature extractor, neglecting the sequential drawing patterns at the stroke level. [12] proposed the groundbreaking work of using an RNN to model strokes, and was the first to explore the temporal nature of sketch. [44] proposed a dual-branch CNN-RNN architecture, using the CNN to extract abstract visual concepts and the RNN to learn sketching temporal orders. Some tandem architectures have also been proposed, including CNN on top of RNN [20] and RNN on top of CNN [36]. However, all previous sketch feature extractor architectures were proposed in fully-supervised settings. Moreover, as stated in [18], standard network architecture design recipes do not necessarily translate from the fully-supervised setting to the self-supervised setting. Therefore, in this paper, we explore a novel feature extractor architecture specifically purposed for sketch self-supervised learning. The TCN architecture also appropriately addresses the temporal nature of sketches, while accommodating stroke granularity. To the best of our knowledge, this work is the first attempt to model sketch features using a TCN.

III Sketch-Specific Self-Supervised Representation Learning

III-A Problem Formulation

We assume a training dataset of N sketch samples {s_i}_{i=1}^{N}. Each sketch sample s_i consists of a sketch picture p_i and a corresponding sketch stroke sequence q_i. We aim to learn a semantic feature f(s_i) for each sketch sample in a self-supervised manner, where f(·) denotes feature extraction.

III-B Overview

We aim to extract semantic features for free-hand sketches in a self-supervised manner. Inspired by the state-of-the-art self-supervised method [9], we train a deep model to estimate the geometric deformation applied to the original input, hoping that the model thereby learns to capture the features of the input. We thus define a set of K discrete geometric deformations G = {g_k}_{k=1}^{K}. In our self-supervised setting, given a sketch sample s_i, we do not know its class label, but we can generate deformed samples by applying our deformation operators to it as

s_i^k = g_k(s_i), k = 1, ..., K,

where k denotes the label of the deformation. Therefore, given a training sample s_i^k, the output of the deep model can be formulated as a K-way softmax F(s_i^k | θ), where θ denotes the deep neural network parameters. Given the training dataset, our objective is to minimize the cross-entropy loss over the K-way softmax:

min_θ (1/N) Σ_{i=1}^{N} (1/K) Σ_{k=1}^{K} −log F_k(g_k(s_i) | θ),

where F_k(·) indicates the k-th value of the output probability (the softmax output) for the deformed sample s_i^k.
Based on the above analysis, we next need to find geometric transformations defining pretext classification tasks that provide a useful supervision signal, driving the model to capture sketch features. A sketch can be formatted as a picture in pixel space, so the rotation-based self-supervised technique can also be applied to it. However, sketch has several intrinsic traits: (i) sketch is highly abstract; (ii) sketch can be formatted as a stroke sequence. In the following, we propose a set of sketch-specific self-supervised pretext tasks and a novel sketch-specific feature extractor architecture.
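To make the objective concrete, the following NumPy sketch computes the K-way deformation-classification cross-entropy described above. The feature extractor is stubbed out by random features and a linear head – both hypothetical stand-ins for the actual network – since the point is only the shape of the objective.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over the last axis.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def pretext_loss(features, deform_labels, W, b):
    """Cross-entropy of a K-way deformation classifier.

    features      : (N, D) array of extracted sketch features
    deform_labels : (N,) int array, which deformation was applied
    W, b          : parameters of a linear classification head
    """
    logits = features @ W + b                       # (N, K)
    probs = softmax(logits)
    n = features.shape[0]
    # Negative log-likelihood of the true deformation label.
    return -np.log(probs[np.arange(n), deform_labels] + 1e-12).mean()

rng = np.random.default_rng(0)
N, D, K = 8, 16, 5                                  # e.g. 5 deformations: HC/CC/VC/LC/RC
feats = rng.normal(size=(N, D))                     # stand-in for extracted features
labels = rng.integers(0, K, size=N)                 # stand-in deformation labels
W = rng.normal(scale=0.1, size=(D, K))
b = np.zeros(K)
loss = pretext_loss(feats, labels, W, b)
print(float(loss))
```

With random weights the loss sits near log K, as expected for an untrained K-way classifier; training the extractor and head jointly to minimize this loss is what drives representation learning.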

III-C Sketch-Specific Self-Supervised Pretext Tasks

Free-hand sketch is a special form of visual data, sharing some similarities with handwritten characters. Even if one person draws the same object or scene more than once, the resulting sketches will never be completely identical. Moreover, different people have different drawing styles, habits, and abilities. If we ask several people to draw the same cat, some may habitually draw it slim while others may draw it fat. Although there are large variations among the resulting cat sketches under different drawing styles, their basic topological structures are the same. Inspired by handwritten character shape correction [14], we aim to use nonlinear functions to model these flexible and irregular drawing deformations, and thereby define a set of discrete self-supervised pretext tasks for sketch: we train the deep model to judge whether the input has undergone a particular kind of deformation, hoping that the model thereby learns to capture sketch features.

Given a binarized sketch (stroke width of one pixel), its stroke sequence can be denoted as a series of coordinates of the black pixels q = {(x_j, y_j)}_{j=1}^{T}, where (x_j, y_j) is the j-th black pixel (point) and T is the total number of black pixels. We normalize the coordinates of each black pixel such that x_j, y_j ∈ [0, 1]. Intuitively, we can use arbitrary functions as displacement functions, and the horizontal and vertical directions are independent of each other. Therefore, the deformable transformations in the x and y directions can be performed as

x'_j = d_x(x_j), y'_j = d_y(y_j),

where d_x and d_y are the displacement functions in the x and y directions, respectively. As stated in [14], a deformation should satisfy several properties: (i) the displacement functions are nonlinear; (ii) they are continuous and satisfy the boundary conditions d(0) = 0 and d(1) = 1; (iii) they are monotonically increasing, so that the deformation transformation preserves the topological structure of the original sketch; (iv) the deformation preserves the smoothness and connectivity of the original sketch. Based on these constraints, we use a trigonometric function as the displacement function, denoted (6).


The constants in (6) are determined by the boundary conditions, allowing the displacement function to be simplified to a form with a small number of constants, among which the deforming parameter controls the nonlinear mapping intensity. Here we take a concrete example of how this trigonometric function performs a nonlinear deformation on a sketch picture: fixing specific constant values in (6) yields the displacement function (7), whose two deformation curves are plotted in Figure 4. We can observe that displacement function (7) is a nonlinear function mapping the linear domain of the x-axis into a nonlinear domain of the z-axis: it compresses one sub-interval while expanding another. As illustrated in Figure 4, a variety of nonlinear deformation effects can be obtained by selecting different deformation regions and deforming parameters. (The uneven stroke widths in the deformed sketches are caused by interpolation during the deformations.)
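One concrete function satisfying all of the constraints above – a hypothetical stand-in for illustration, not necessarily the paper's exact displacement function (6) – is d(x) = x + (η / 2π)·sin(2πx), which is nonlinear, continuous, satisfies d(0) = 0 and d(1) = 1, and is strictly increasing whenever |η| < 1:

```python
import numpy as np

def displacement(x, eta=0.5):
    """A sinusoidal displacement function on [0, 1] (illustrative choice).

    Nonlinear, continuous, d(0) = 0, d(1) = 1, and monotonically increasing
    for |eta| < 1, since d'(x) = 1 + eta * cos(2*pi*x) > 0; it therefore
    preserves the topology of the original sketch.
    """
    return x + (eta / (2 * np.pi)) * np.sin(2 * np.pi * x)

def deform(points, eta_x=0.5, eta_y=0.0):
    """Apply independent x/y displacements to normalized stroke points (T, 2)."""
    out = points.copy()
    out[:, 0] = displacement(points[:, 0], eta_x)
    out[:, 1] = displacement(points[:, 1], eta_y)
    return out

x = np.linspace(0.0, 1.0, 101)
z = displacement(x, eta=0.5)
# Boundary conditions hold and the mapping is strictly increasing.
print(z[0], z[-1], bool(np.all(np.diff(z) > 0)))
```

For η > 0 this compresses one half of the unit interval and expands the other; varying η (and, in the paper's formulation, the chosen deformation region) yields the different compression styles of Figure 5.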

Fig. 3: Illustration of our proposed sketch-specific self-supervised representation learning framework.
Fig. 4: Illustration of two deformation curves with different deforming parameters.
Fig. 5: Illustration of six sketch samples (alarm clock, school bus, donut, cat, eye, soccer) and their deformed pictures. For each sample, the original sketch is shown in the left column, while the sketches in the 2nd to 6th columns are respectively obtained by horizontal compression (HC), centripetal compression (CC), vertical compression (VC), leftward compression (LC), and rightward compression (RC).

Assume that we aim to apply the nonlinear mapping of a chosen region of (6) to a coordinate-normalized sketch. We set the constants accordingly, as in (8) and (9). Substituting (9) into (6), the deformations are defined by a small set of hyper-parameters that select the deformation region and control the deforming intensity, subject to the boundary and monotonicity constraints above. As shown in Figure 5, we can adjust these hyper-parameters to achieve different deformation effects that simulate different kinds of free-hand drawing habits (limited by page space, only five kinds are shown). In Figure 5, for each sketch sample, the left column is the original picture, and the five columns on the right are respectively transformed by horizontal compression (HC), centripetal compression (CC), vertical compression (VC), leftward compression (LC), and rightward compression (RC). Note that nonlinear-function-based deformation can be applied to both the sketch picture and the sketch stroke sequence, while deformation of the sketch picture is convenient for visualization.

In this paper, we define a set of sketch-specific binary classification tasks, each requiring the deep model to judge whether the input has undergone one particular kind of deformation. These binary classifications serve as our sketch-specific self-supervised pretext tasks, with the aim that the model learns to capture sketch features.

III-D Sketch-Specific Feature Extractor Architecture

Recently, Kolesnikov et al. [18] demonstrated that standard network architecture design recipes do not necessarily translate from the fully-supervised setting to the self-supervised setting. We observe a similar phenomenon in the following experiments: RNN-based feature extractor networks fail to converge under our self-supervised training, while they have achieved satisfactory feature representations in previous supervised settings [44, 12]. It is therefore necessary to explore a novel feature extractor architecture for the sketch self-supervised setting. Most previous sketch-related works use CNNs as the feature extractor. In this work, we propose a dual-branch CNN-TCN architecture for sketch feature representation, utilizing the CNN to extract abstract semantic concepts from static pixel space and the TCN to probe sequentially along sketch strokes via convolution. In particular, for sketch feature extraction, the major advantage of the TCN is that it can probe along sketch strokes at different granularities by varying its convolution kernel sizes (receptive fields), i.e., using small and large kernels to capture the patterns of short and long strokes, respectively.

III-E Sketch-Specific Self-Supervised Representation Learning Framework

Our proposed framework is illustrated in Figure 3, and contains two major components: a rotation-based representation module and a deformation-based representation module. Quaternary classification over rotations (0°, 90°, 180°, 270°) is used as the pretext task to train the rotation-based representation module. Our deformation-based representation module is extensible and can consist of more than one representation sub-module. For each deformation-based representation sub-module, we choose a specific nonlinear deformation and train the sub-module to estimate whether the chosen nonlinear deformation has been applied to the input sketch. That is, the pretext task for each deformation-based representation sub-module is a binary classification. We empirically find that multiple binary-classification-based representation sub-modules work better than a single multi-class-classification-based representation module, mainly because a classifier over diverse deformations is difficult to train on highly abstract sketches. In both the rotation-based representation module and the deformation-based representation sub-modules, a dual-branch CNN-TCN network serves as the feature extractor. In this paper, we focus on developing a general framework for sketch self-supervised learning; over-complicated fusion strategies are therefore not discussed here and will be thoroughly compared in future work. Moreover, CNN and TCN are essentially heterogeneous architectures, so it is impractical to train them synchronously. Therefore, we train our CNNs and TCNs separately. The detailed training and optimization procedure is described in Algorithm 1.

During testing, given a sketch sample, its final feature representation is a weighted fusion of the feature produced by the rotation-based module and the features produced by each of the deformation-based sub-modules, controlled by per-module weighting factors. The output feature of the rotation-based module is itself fused from its CNN and TCN branch features with a branch weighting factor, and the output feature of each deformation-based sub-module is fused from its own CNN and TCN branch features in the same way.
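Assuming the fusions above are weighted sums over L2-normalized branch features (an illustrative reading; the paper's exact fusion equations may differ), the test-time feature assembly can be sketched as:

```python
import numpy as np

def l2norm(v, eps=1e-12):
    # Normalize a feature vector so branches contribute on a comparable scale.
    return v / (np.linalg.norm(v) + eps)

def fuse_branches(f_cnn, f_tcn, gamma=0.5):
    """Fuse the CNN and TCN branch features of one module with weight gamma."""
    return gamma * l2norm(f_cnn) + (1.0 - gamma) * l2norm(f_tcn)

def fuse_modules(f_rot, deform_feats, lam=0.5, mus=None):
    """Combine the rotation-module feature with M deformation sub-module features."""
    if mus is None:
        # Split the remaining weight evenly across sub-modules (assumption).
        mus = [(1.0 - lam) / len(deform_feats)] * len(deform_feats)
    f = lam * f_rot
    for mu, fd in zip(mus, deform_feats):
        f = f + mu * fd
    return f

rng = np.random.default_rng(1)
D = 4096                                        # per-branch output dimensionality
f_rot = fuse_branches(rng.normal(size=D), rng.normal(size=D))
deforms = [fuse_branches(rng.normal(size=D), rng.normal(size=D)) for _ in range(5)]
feature = fuse_modules(f_rot, deforms)
print(feature.shape)
```

The weighting factors (gamma, lam, mus) correspond to the scalar weights in the fusion equations and would be tuned on the validation set.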

1. Train the rotation-based module.
1.1. Train its CNN branch, and obtain the CNN feature extractor.
1.2. Train its TCN branch, and obtain the TCN feature extractor.
2. Train the deformation-based module in the following loop.
for each deformation-based sub-module do
     2.1. Train its CNN branch, and obtain the CNN feature extractor.
     2.2. Train its TCN branch, and obtain the TCN feature extractor.
end for
Output: the trained CNN and TCN feature extractors of the rotation-based module and of every deformation-based sub-module.
Algorithm 1 Learning algorithm for our sketch-specific self-supervised representation learning framework.

IV Experiments

IV-A Experiment Settings

Dataset and Splits

We evaluate our self-supervised representation learning framework on the QuickDraw 3.8M dataset [44], a million-scale subset of the Google QuickDraw dataset [12]. Our self-supervised training and the associated validation are conducted on the training and validation sets of QuickDraw 3.8M. After training, our self-supervised feature representations are tested on two sketch tasks (i.e., sketch retrieval and sketch recognition), for which we extract features on the query and gallery sets of QuickDraw 3.8M. For sketch retrieval, we rank the gallery for each query sketch based on Euclidean distance, hoping that similar sketches are ranked at the top. For sketch recognition, we train a fully-connected layer as a classifier on the gallery set, and calculate the recognition accuracy on the query set.

Evaluation Metric

mAP [44] and classification accuracy (“acc.”) are used as the metrics for sketch retrieval and sketch recognition, respectively. In particular, for sketch retrieval, we report the top-1 accuracy and the mAP over the top 10 of the retrieval ranking list, i.e., “acc.@top1” and “mAP@top10”.
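On ranked label lists, these two retrieval metrics can be computed as follows (a generic sketch; the paper's exact mAP protocol follows [44]):

```python
import numpy as np

def acc_at_top1(ranked_labels, query_labels):
    """Fraction of queries whose top-ranked gallery item shares the query label."""
    hits = [r[0] == q for r, q in zip(ranked_labels, query_labels)]
    return float(np.mean(hits))

def map_at_top10(ranked_labels, query_labels):
    """Mean average precision over the top-10 retrieved items per query."""
    aps = []
    for r, q in zip(ranked_labels, query_labels):
        top = r[:10]
        hits, precisions = 0, []
        for i, lbl in enumerate(top, start=1):
            if lbl == q:
                hits += 1
                precisions.append(hits / i)  # precision at each relevant rank
        aps.append(np.mean(precisions) if precisions else 0.0)
    return float(np.mean(aps))

# Toy example: two queries, with gallery labels already ranked by distance.
ranked = [[3, 1, 3, 2], [0, 0, 5, 0]]
queries = [3, 0]
print(acc_at_top1(ranked, queries), map_at_top10(ranked, queries))  # 1.0 0.875
```

Each query's average precision rewards relevant items appearing early in the ranking, which is why mAP@top10 is a stricter measure than top-1 accuracy alone.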

CNN Implementation Details

The input to our CNNs is a fixed-size image, with each brightness channel tiled equally. Plenty of CNN architectures could be used here; for a fair comparison with our main competitor [9], our CNNs are implemented as AlexNet, with an output dimensionality of 4096.

Fig. 6: Stroke key point illustration of a sketch. Each key point is denoted as a four-bit vector.

TCN Implementation Details

Based on statistical analysis of strokes, researchers have found that most sketches in the QuickDraw dataset have fewer than 100 stroke points [44]. Accordingly, the input array of our TCNs is normalized to 100 points by truncating or padding. Each point is denoted as a four-dimensional vector, in which the first two numbers are the x and y coordinates, and the last two bits describe the pen state. Following the definition in [44], the pen state is “0 1” when the point is the stop point of one stroke; in the remaining cases, the pen state is “1 0”, as shown in Figure 6. (We experimentally found that reducing the two-bit pen state to one bit yields similar results.) We implement a four-layer stacked TCN, where each layer has a series of convolution kernels of different sizes. In particular, in the first layer of our TCN, 2D convolution kernels are used to adapt to the sketch coordinate input; for the 2nd to 4th layers, we use 1D convolution kernels. The implementation details of our TCN are reported in Table I. The output of each TCN layer is produced by ReLU activation and 1D max pooling. The output dimensionality of our TCN is also 4096. During training, one fully-connected layer with batch normalization (BN) [13] and ReLU activation is used as the classifier for our TCN branch.
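A minimal sketch of this input encoding (x, y coordinates plus a two-bit pen state, truncated or padded to 100 points); zero-padding for absent points is an assumption:

```python
import numpy as np

MAX_POINTS = 100  # most QuickDraw sketches have fewer than 100 stroke points [44]

def encode_sketch(strokes):
    """Encode a sketch as a (MAX_POINTS, 4) array.

    strokes: list of strokes, each a list of (x, y) points.
    Each output row is [x, y, p1, p2]: pen state "1 0" within a stroke,
    "0 1" at a stroke's stop point.
    """
    rows = []
    for stroke in strokes:
        for j, (x, y) in enumerate(stroke):
            last = j == len(stroke) - 1
            rows.append([x, y, 0.0 if last else 1.0, 1.0 if last else 0.0])
    arr = np.zeros((MAX_POINTS, 4), dtype=np.float32)        # zero-pad (assumption)
    rows = np.asarray(rows[:MAX_POINTS], dtype=np.float32)   # truncate if too long
    arr[: len(rows)] = rows
    return arr

# Two strokes: one with two points, one single-point stroke.
sketch = [[(0.1, 0.2), (0.3, 0.4)], [(0.5, 0.5)]]
enc = encode_sketch(sketch)
print(enc.shape)
```

The second point of the first stroke and the lone point of the second stroke both carry the “0 1” stop state; all other points carry “1 0”.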

Selection of Deformations

By observing a large number of sketch samples, we found that the most representative human drawing styles mainly include horizontal compression (HC), centripetal compression (CC), vertical compression (VC), leftward compression (LC), and rightward compression (RC). Inspired by this observation, we empirically selected the corresponding deformations for our pretext tasks. We found that this selection leads to promising performance; more deformations will be considered and compared in future work.

Input Shape Operator Channels Kernel Size (K) Stride
Conv2d_Kx4 16 2,4,6,8,10,12,14,16,18,20 1
Conv1d_K 32 2,4,6,8,10,12,14,16,18,20 1
Conv1d_K 64 2,4,6,8,10,12,14,16,18,20 1
Conv1d_K 128 2,4,6,8,10,12,14,16,18,20 1
FC 4096 - -
FC 345 - -
TABLE I: Implementation details of our TCN. “Conv2d_Kx4” and “Conv1d_K” denote 2D convolution with kernel size of Kx4 and 1D convolution with kernel size of K, respectively. “FC” represents fully-connected layer.
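To illustrate the multi-kernel design of Table I, the following NumPy sketch applies 1D “valid” convolutions with several kernel sizes along a point sequence, so small and large kernels respond to short and long stroke patterns respectively. Weights are random, and the pooling/stacking details of the actual network are omitted.

```python
import numpy as np

def conv1d_multi_kernel(seq, kernel_sizes, out_channels, rng):
    """Apply 1D convolutions with several kernel sizes along a point sequence.

    seq: (T, C_in) array. Returns a list of (T_k, out_channels) feature maps,
    one per kernel size ('valid' convolution, stride 1, ReLU).
    """
    T, c_in = seq.shape
    maps = []
    for k in kernel_sizes:
        w = rng.normal(scale=0.1, size=(k, c_in, out_channels))
        t_out = T - k + 1
        out = np.zeros((t_out, out_channels))
        for t in range(t_out):
            # Each output step sees a window of k consecutive stroke points.
            out[t] = np.einsum('kc,kco->o', seq[t:t + k], w)
        maps.append(np.maximum(out, 0.0))  # ReLU activation
    return maps

rng = np.random.default_rng(2)
points = rng.normal(size=(100, 4))  # 100 points, 4 values each (x, y, pen state)
feats = conv1d_multi_kernel(points, kernel_sizes=[2, 10, 20], out_channels=16, rng=rng)
print([f.shape for f in feats])
```

In the real network each layer runs ten kernel sizes (2 through 20) in parallel, and the first layer uses K×4 kernels so the full four-dimensional point vector is consumed at once.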

Other Implementation Details

All our experiments are implemented in PyTorch [29] and run on a single GTX 1080 Ti GPU. The detailed hardware and software configurations of our server are provided in Table II. The SGD and Adam optimizers are used for the CNNs and TCNs, respectively.

CPU two Intel(R) Xeon(R) CPUs (E5-2620 v3 @ 2.40GHz)
RAM 128 GB
HD solid state drive

System Ubuntu 16.04
Python 3.6
PyTorch 0.4.1
TABLE II: Hardware and software details of our experimental environment.


We compare our self-supervised representation approach with several state-of-the-art deep unsupervised representation techniques, including self-supervised (RotNet [9], Jigsaw [27]), clustering-based (Deep Clustering [1]), generative-adversarial-network-based (DCGAN [34]), and auto-encoder-based (Variational Auto-Encoder [17]) approaches. For a fair comparison, we evaluate all competitors with the same backbone network where applicable. Moreover, to examine the viewpoint of [18] that standard network architecture design recipes do not necessarily translate from the fully-supervised setting to the self-supervised setting, we also implement several baselines by replacing our feature extractor with an RNN.

Unsupervised Baselines acc.@top1 mAP@top10
DCGAN [34] 0.1695 0.2239
Auto-Encoder [17] 0.0976 0.1539
Jigsaw [27] 0.0803 0.1270
Deep Clustering [1] 0.1787 0.2396
R+CNN (RotNet)  [9] 0.4706 0.5166
RNN Self-Sup. Baselines We Designed
{pretext task}+{feature extractor}
acc.@top1 mAP@top10
{R}+{RNN} 0.0234 0.0533
{HC}+{RNN} 0.0218 0.0488
{VC}+{RNN} 0.0226 0.0507
{CC}+{RNN} 0.0125 0.0312
{VC&LC}+{RNN} 0.0210 0.0481
{VC&RC}+{RNN} 0.0186 0.0446
Our Full Model
{pretext task}+{feature extractor}
acc.@top1 mAP@top10
0.5024 0.5447
TABLE III: Comparison on the retrieval task with state-of-the-art deep-learning-based unsupervised methods. “R” denotes “rotation”. “&” means that two deformations are applied simultaneously. The best/second-best results per column are indicated in red/blue.
Fig. 7: Attention map visualization (clock, donut, blueberry, soccer ball, eye). Color bar ranging from blue to red denotes activated values 0 to 1. Original sketches are in the top row. Middle and bottom rows are obtained by RotNet and our full model, respectively. Best viewed in color.

IV-B Results and Discussions

Evaluation on Sketch Retrieval

We evaluate our self-supervised learned features on sketch retrieval, comparing with the features obtained by state-of-the-art unsupervised representation methods. The retrieval results “acc.@top1” and “mAP@top10” are reported in Table III, from which the following observations can be made: (i) Except for RotNet [9], all baselines fail to work well on sketch unsupervised representation learning, owing to the unique challenges of sketch. In particular, Jigsaw [27] obtains low retrieval accuracy because sketch patches are too abstract to recognize. (ii) RotNet [9] outperforms the other baselines by a clear margin, showing the effectiveness of image-level self-supervised pretext tasks for abstract sketches. (iii) Our proposed method obtains better retrieval results than all baselines listed in Table III. (iv) Interestingly, the RNN extractor obtains unsatisfactory performance across a number of self-supervised pretext tasks, whereas RNN networks have achieved state-of-the-art performance in supervised settings [44]. This confirms that network design recipes from fully supervised settings cannot be directly transferred to the self-supervised setting, as demonstrated in [18], and further justifies our sketch-specific architecture design for self-supervised feature extraction.

{pretext task}+{feature extractor}
acc.@top1 mAP@top10
{HC}+{CNN} 0.1932 0.2556
{HC}+{TCN} 0.1229 0.1756
{HC}+{CNN,TCN} 0.2388 0.2994
{VC}+{CNN} 0.1800 0.2433
{VC}+{TCN} 0.1468 0.2008
{VC}+{CNN,TCN} 0.2435 0.3025
{CC}+{CNN} 0.2555 0.3159
{CC}+{TCN} 0.1489 0.2048
{CC}+{CNN,TCN} 0.2876 0.3428
{VC&LC}+{CNN} 0.2459 0.3053
{VC&LC}+{TCN} 0.2003 0.2580
{VC&LC}+{CNN,TCN} 0.2574 0.3132
{VC&RC}+{CNN} 0.2265 0.2879
{VC&RC}+{TCN} 0.1870 0.2427
{VC&RC}+{CNN,TCN} 0.2367 0.2931
{HC,VC}+{CNN,TCN} 0.2842 0.3404
{HC,VC,CC}+{CNN,TCN} 0.3060 0.3582
{HC,VC,CC,VC&LC}+{CNN,TCN} 0.3060 0.3685
0.3180 0.3718
TABLE IV: Sketch retrieval ablation study of our proposed self-supervised representation learning framework. “&” means that two deformations are applied simultaneously. The best/second-best results per column are indicated in red/blue.

{pretext task}+{feature extractor}
acc.@top1 mAP@top10
{R}+{CNN} (RotNet) [9] 0.4706 0.5166
{R}+{TCN} 0.3072 0.3639
{R}+{CNN,TCN} 0.4932 0.5360
TABLE V: Sketch retrieval ablation study on the contribution of the dual-branch CNN-TCN to rotation-based self-supervised learning. “R” denotes “rotation”. The best/second-best results per column are indicated in red/blue.

Although RotNet is our strongest baseline, its rotation-based pretext task fails to work well on centrosymmetric sketches, e.g., donut and soccer ball. Intuitively, given a centrosymmetric sketch, the visual variation caused by rotation is limited and difficult to capture, even for the human eye. We visualize attention maps for some centrosymmetric sketches in Figure 7, where the middle and bottom rows are obtained by RotNet and our full model, respectively. Based on the color bar, we observe that, compared with the attention maps in the middle row, ours have larger activated values. This means that our proposed model is more sensitive to the strokes of centrosymmetric sketches.

Moreover, we conduct ablation studies on retrieval to evaluate the contributions of our deformation-based pretext tasks and CNN-TCN architecture, by combining different pretext tasks and feature extractors within our proposed self-supervised framework. From Table IV, we observe that: (i) given a deformation-based pretext task, our dual-branch CNN-TCN brings a performance improvement over CNN-only and TCN-only extraction; (ii) based on our CNN-TCN feature extraction, the more kinds of deformation-based pretext tasks are involved, the better the performance.

To further demonstrate the generality of our CNN-TCN architecture on image-level self-supervised pretext tasks, we also conduct an ablation study to evaluate whether the CNN-TCN architecture can improve a rotation-based self-supervised method, i.e., RotNet. Table V shows that the CNN-TCN extractor improves rotation-based self-supervised learning, outperforming both the single-branch CNN and the single-branch TCN. This also illustrates that the CNN and the TCN produce complementary features in the sketch self-supervised learning setting.
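The complementarity comes from the two branches consuming different views of the same sketch: the CNN sees the rasterized image, the TCN sees the stroke sequence, and their pooled features are concatenated. The toy numpy sketch below conveys only this fusion idea (random filter weights; `cnn_branch` and `tcn_branch` are our illustrative names, not the paper's architecture):

```python
import numpy as np

rng = np.random.default_rng(0)

def cnn_branch(image, n_filters=4):
    """Toy CNN branch: random 3x3 valid convolutions over the rasterized
    sketch, followed by global average pooling."""
    filters = rng.standard_normal((n_filters, 3, 3))
    h, w = image.shape[0] - 2, image.shape[1] - 2
    feats = np.empty(n_filters)
    for k, f in enumerate(filters):
        resp = np.array([[(image[i:i + 3, j:j + 3] * f).sum()
                          for j in range(w)] for i in range(h)])
        feats[k] = resp.mean()
    return feats

def tcn_branch(seq, n_filters=4, dilation=2):
    """Toy TCN branch: dilated causal 1-D convolutions (kernel size 2)
    over one coordinate stream of the stroke sequence, then temporal
    average pooling."""
    filters = rng.standard_normal((n_filters, 2))
    padded = np.concatenate([np.zeros(dilation), seq])   # left zero-pad
    feats = np.empty(n_filters)
    for k, (w0, w1) in enumerate(filters):
        out = w1 * padded[dilation:] + w0 * padded[:-dilation]
        feats[k] = out.mean()
    return feats

image = rng.random((16, 16))        # rasterized sketch (toy stand-in)
xs = rng.random(20)                 # x-coordinates of stroke points (toy)
fused = np.concatenate([cnn_branch(image), tcn_branch(xs)])
print(fused.shape)                  # (8,): one joint sketch descriptor
```

The causal, dilated convolution in `tcn_branch` is what lets a TCN respect the temporal ordering of strokes, which a CNN over the rasterized image discards.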

Evaluation on Sketch Recognition

We also evaluate our self-supervised features on the sketch recognition task. We train our model on the QuickDraw 3.8M training set and extract features for its gallery and query sets. We then use the gallery features and the associated ground-truth labels to train a linear classifier, and compute classification accuracy on the QuickDraw 3.8M query set. The same procedure is applied to our competitors. For a fair comparison, the classifier configuration is kept identical across all classification experiments.
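This linear-probe protocol can be sketched as follows. The paper specifies only that the classifier is linear, so the closed-form ridge-regression classifier below is our stand-in, not necessarily the exact classifier used:

```python
import numpy as np

def linear_probe(train_x, train_y, test_x, test_y, lam=1e-3):
    """Linear-probe evaluation of frozen features: fit a linear classifier
    on one-hot labels via closed-form ridge regression, then report top-1
    accuracy on the held-out split."""
    n_classes = int(train_y.max()) + 1
    Y = np.eye(n_classes)[train_y]                          # one-hot targets
    X = np.hstack([train_x, np.ones((len(train_x), 1))])    # bias column
    W = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ Y)
    Xt = np.hstack([test_x, np.ones((len(test_x), 1))])
    return ((Xt @ W).argmax(axis=1) == test_y).mean()

# Toy check on linearly separable features: accuracy should be high.
rng = np.random.default_rng(0)
feats = rng.standard_normal((400, 4))
labels = (feats[:, 0] > 0).astype(int)
acc = linear_probe(feats[:300], labels[:300], feats[300:], labels[300:])
```

Because the feature extractor stays frozen, the probe's accuracy measures the linear separability of the learned representation rather than the capacity of the classifier.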

The following observations can be made from Table VI: (i) for sketch recognition, our model and its variant outperform the state-of-the-art unsupervised competitors by a large margin ( vs. ), demonstrating the superiority of our sketch-specific design; (ii) when stroke-deformation-based self-supervised signals are added, a further improvement in recognition accuracy is obtained; (iii) the performance gap between the state-of-the-art supervised sketch recognition model Sketch-a-Net [50] and ours is narrow ( vs. ).

Unsupervised Baselines                          acc.
DCGAN [34]                                      0.1057
Auto-Encoder [17]                               0.1856
Jigsaw [27]                                     0.2894
Deep Clustering [1]                             0.0764
R+CNN (RotNet) [9]                              0.5149
Our Method ({pretext task}+{feature extractor})
{R}+{CNN,TCN}                                   0.5473
Supervised Methods                              acc.
Sketch-a-Net [50]                               0.6871
TABLE VI: Comparison on sketch recognition with the state-of-the-art deep learning based unsupervised methods. “R” denotes “rotation”. “&” means that two deformations are applied simultaneously. The best/second-best results are indicated in red/blue.

V Conclusion

In this paper, we propose the novel problem of self-supervised representation learning for free-hand sketches, and contribute the first deep-network-based framework for this challenging problem. In particular, by recognizing the intrinsic traits of sketches, we propose a set of sketch-specific self-supervised pretext tasks and a dual-branch TCN-CNN architecture serving as the feature extractor. We evaluate the learned representations on two tasks, sketch retrieval and sketch recognition. Extensive experiments on million-scale sketches demonstrate that our self-supervised representation method outperforms the state-of-the-art unsupervised competitors and significantly narrows the gap with supervised representation learning on sketches.

We sincerely hope our work can motivate more self-supervised representation learning research in the sketch community.

VI Future Work

As mentioned above, free-hand sketch poses domain-unique technical challenges, since it is essentially different from the natural photo. Designing sketch-specific pretext tasks for sketch-oriented self-supervised deep learning is therefore significant. In future work, we will design sketch-specific pretext tasks from a fine-grained perspective, involving more stroke-level analysis.


  • [1] M. Caron, P. Bojanowski, A. Joulin, and M. Douze (2018) Deep clustering for unsupervised learning of visual features. In ECCV, Cited by: §I, §IV-A, TABLE III, TABLE VI.
  • [2] D. Chang, Y. Ding, J. Xie, A. K. Bhunia, X. Li, Z. Ma, M. Wu, J. Guo, and Y. Song (2020) The devil is in the channels: mutual-channel loss for fine-grained image classification. TIP. Cited by: §I.
  • [3] W. Chen and J. Hays (2018) SketchyGAN: towards diverse and realistic sketch to image synthesis. In CVPR, Cited by: §I.
  • [4] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei (2009) Imagenet: a large-scale hierarchical image database. In CVPR, Cited by: §I.
  • [5] S. Dey, P. Riba, A. Dutta, J. Llados, and Y. Song (2019) Doodle to search: practical zero-shot sketch-based image retrieval. In CVPR, Cited by: §I, §I.
  • [6] C. Doersch, A. Gupta, and A. A. Efros (2015) Unsupervised visual representation learning by context prediction. In ICCV, Cited by: §I, §II-A.
  • [7] J. Donahue, P. Krähenbühl, and T. Darrell (2016) Adversarial feature learning. arXiv preprint arXiv:1605.09782. Cited by: §II-A.
  • [8] A. Dosovitskiy, J. T. Springenberg, M. Riedmiller, and T. Brox (2014) Discriminative unsupervised feature learning with convolutional neural networks. In NIPS, Cited by: §II-A.
  • [9] S. Gidaris, P. Singh, and N. Komodakis (2018) Unsupervised representation learning by predicting image rotations. In ICLR, Cited by: §I, §I, §I, §II-A, §III-B, §IV-A, §IV-A, §IV-B, TABLE III, TABLE V, TABLE VI.
  • [10] C. Guo, C. Li, J. Guo, R. Cong, H. Fu, and P. Han (2018) Hierarchical features driven residual learning for depth map super-resolution. TIP. Cited by: §I.
  • [11] C. Guo, C. Li, J. Guo, C. C. Loy, J. Hou, S. Kwong, and R. Cong (2020) Zero-reference deep curve estimation for low-light image enhancement. arXiv preprint arXiv:2001.06826. Cited by: §I.
  • [12] D. Ha and D. Eck (2018) A neural representation of sketch drawings. In ICLR, Cited by: §I, §II-B, §III-D, §IV-A.
  • [13] S. Ioffe and C. Szegedy (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. In ICML, Cited by: §IV-A.
  • [14] L. Jin, J. Huang, J. Yin, and Q. He (2000) Deformation transformation for handwritten chinese character shape correction. In ICMI, Cited by: §III-C, §III-C.
  • [15] D. Kim, D. Cho, and I. S. Kweon (2019) Self-supervised video representation learning with space-time cubic puzzles. In AAAI, Cited by: §I.
  • [16] D. Kim, D. Cho, D. Yoo, and I. S. Kweon (2018) Learning image representations by completing damaged jigsaw puzzles. In WACV, Cited by: §II-A.
  • [17] D. P. Kingma and M. Welling (2013) Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114. Cited by: §IV-A, TABLE III, TABLE VI.
  • [18] A. Kolesnikov, X. Zhai, and L. Beyer (2019) Revisiting self-supervised visual representation learning. In CVPR, Cited by: §I, §II-B, §III-D, §IV-A, §IV-B.
  • [19] C. Li, C. Guo, J. Guo, P. Han, H. Fu, and R. Cong (2019) PDR-net: perception-inspired single image dehazing network with refinement. TMM. Cited by: §I.
  • [20] L. Li, C. Zou, Y. Zheng, Q. Su, H. Fu, and C. Tai (2018) Sketch-r2cnn: an attentive network for vector sketch recognition. arXiv preprint arXiv:1811.08170. Cited by: §I, §I, §II-B.
  • [21] R. Liao, A. Schwing, R. Zemel, and R. Urtasun (2016) Learning deep parsimonious representations. In NIPS, Cited by: §II-A.
  • [22] K. Liu, W. Liu, C. Gan, M. Tan, and H. Ma (2018) T-C3D: temporal convolutional 3d network for real-time action recognition. In AAAI, Cited by: §I.
  • [23] K. Liu and H. Ma (2019) Exploring background-bias for anomaly detection in surveillance videos. In ACM MM, Cited by: §I.
  • [24] L. Liu, F. Shen, Y. Shen, X. Liu, and L. Shao (2017) Deep sketch hashing: fast free-hand sketch-based image retrieval. In CVPR, Cited by: §I.
  • [25] U. R. Muhammad, Y. Yang, Y. Song, T. Xiang, and T. M. Hospedales (2018) Learning deep sketch abstraction. In CVPR, Cited by: §I, §I.
  • [26] T. N. Mundhenk, D. Ho, and B. Y. Chen (2018) Improvements to context based self-supervised learning.. In CVPR, Cited by: §II-A.
  • [27] M. Noroozi and P. Favaro (2016) Unsupervised learning of visual representations by solving jigsaw puzzles. In ECCV, Cited by: §I, §IV-A, §IV-B, TABLE III, TABLE VI.
  • [28] M. Noroozi, A. Vinjimoor, P. Favaro, and H. Pirsiavash (2018) Boosting self-supervised learning via knowledge transfer. In CVPR, Cited by: §II-A.
  • [29] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, et al. (2019) PyTorch: an imperative style, high-performance deep learning library. In NeurIPS, Cited by: §IV-A.
  • [30] D. Pathak, P. Krahenbuhl, J. Donahue, T. Darrell, and A. A. Efros (2016) Context encoders: feature learning by inpainting. In CVPR, Cited by: §II-A.
  • [31] P. Qin, X. Wang, W. Chen, C. Zhang, W. Xu, and W. Y. Wang (2020) Generative adversarial zero-shot relational learning for knowledge graphs. arXiv preprint arXiv:2001.02332. Cited by: §I.
  • [32] P. Qin, W. Xu, and W. Y. Wang (2018) Dsgan: generative adversarial training for distant supervision relation extraction. arXiv preprint arXiv:1805.09929. Cited by: §I.
  • [33] P. Qin, W. Xu, and W. Y. Wang (2018) Robust distant supervision relation extraction via deep reinforcement learning. arXiv preprint arXiv:1805.09927. Cited by: §I.
  • [34] A. Radford, L. Metz, and S. Chintala (2015) Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434. Cited by: §II-A, §IV-A, TABLE III, TABLE VI.
  • [35] P. Sangkloy, N. Burnell, C. Ham, and J. Hays (2016) The sketchy database: learning to retrieve badly drawn bunnies. ACM TOG. Cited by: §I.
  • [36] R. K. Sarvadevabhatla, J. Kundu, et al. (2016) Enabling my robot to play pictionary: recurrent neural networks for sketch recognition. In MM, Cited by: §I, §II-B.
  • [37] R. G. Schneider and T. Tuytelaars (2014) Sketch classification and classification-driven analysis using fisher vectors. ACM TOG. Cited by: §I.
  • [38] R. G. Schneider and T. Tuytelaars (2016) Example-based sketch segmentation and labeling using crfs. ACM TOG. Cited by: §I.
  • [39] J. Song, Q. Yu, Y. Song, T. Xiang, and T. M. Hospedales (2017) Deep spatial-semantic attention for fine-grained sketch-based image retrieval. In ICCV, Cited by: §I.
  • [40] K. Song, F. Nie, J. Han, and X. Li (2017) Parameter free large margin nearest neighbor for distance metric learning. In AAAI, Cited by: §I.
  • [41] K. Wei, M. Yang, H. Wang, C. Deng, and X. Liu (2019) Adversarial fine-grained composition learning for unseen attribute-object recognition. In ICCV, Cited by: §I.
  • [42] J. Xie, Z. Ma, G. Zhang, J. Xue, Z. Tan, and J. Guo (2019) Soft dropout and its variational bayes approximation. In International Workshop on Machine Learning for Signal Processing, Cited by: §I.
  • [43] Y. Xie, P. Xu, and Z. Ma (2019) Deep zero-shot learning for scene sketch. In ICIP, Cited by: §I.
  • [44] P. Xu, Y. Huang, T. Yuan, K. Pang, Y. Song, T. Xiang, T. M. Hospedales, Z. Ma, and J. Guo (2018) SketchMate: deep hashing for million-scale human sketch retrieval. In CVPR, Cited by: §I, §II-B, §III-D, §IV-A, §IV-A, §IV-A, §IV-B.
  • [45] P. Xu, C. K. Joshi, and X. Bresson (2019) Multi-graph transformer for free-hand sketch recognition. arXiv preprint arXiv:1912.11258. Cited by: §I.
  • [46] P. Xu, Q. Yin, Y. Huang, Y. Song, Z. Ma, L. Wang, T. Xiang, W. B. Kleijn, and J. Guo (2018) Cross-modal subspace learning for fine-grained sketch-based image retrieval. Neurocomputing. Cited by: §I.
  • [47] P. Xu (2020) Deep learning for free-hand sketch: a survey. arXiv preprint arXiv:2001.02600. Cited by: §I.
  • [48] X. Xu, L. Cheong, and Z. Li (2019) Learning for multi-model and multi-type fitting. arXiv preprint arXiv:1901.10254. Cited by: §I.
  • [49] Y. Ye, Y. Lu, and H. Jiang (2016) Human’s scene sketch understanding. In ICMR, Cited by: §I.
  • [50] Q. Yu, Y. Yang, F. Liu, Y. Song, T. Xiang, and T. M. Hospedales (2017) Sketch-a-net: a deep neural network that beats humans. IJCV. Cited by: §IV-B, TABLE VI.
  • [51] C. Zhang, C. Zhu, J. Xiao, X. Xu, and Y. Liu (2018) Image ordinal classification and understanding: grid dropout with masking label. In ICME, Cited by: §I.
  • [52] R. Zhang, P. Isola, and A. A. Efros (2016) Colorful image colorization. In ECCV, Cited by: §I, §II-A.