SketchTransfer: A Challenging New Task for Exploring Detail-Invariance and the Abstractions Learned by Deep Networks

12/25/2019 ∙ by Alex Lamb, et al. ∙ Google, Aalto University, Université de Montréal

Deep networks have achieved excellent results in perceptual tasks, yet their ability to generalize to variations not seen during training has come under increasing scrutiny. In this work we focus on their ability to have invariance towards the presence or absence of details. For example, humans are able to watch cartoons, which are missing many visual details, without being explicitly trained to do so. As another example, 3D rendering software is a relatively recent development, yet people are able to understand such rendered scenes even though they are missing details (consider a film like Toy Story). The failure of machine learning algorithms to do this indicates a significant gap in generalization between human abilities and the abilities of deep networks. We propose a dataset that will make it easier to study the detail-invariance problem concretely. We produce a concrete task for this: SketchTransfer, and we show that state-of-the-art domain transfer algorithms still struggle with this task. The state-of-the-art technique which achieves over 95% on MNIST to SVHN transfer only achieves 59% accuracy on the SketchTransfer task, which is much better than random (11% accuracy) but falls short of the 87% accuracy of a classifier trained directly on labeled sketches. This indicates that this task is approachable with today's best methods but has substantial room for improvement.




1 Introduction

Figure 2: On the left, a person’s sketch of a dollar bill from memory. On the right, the same person’s sketch with access to a reference. This indicates that humans remember, and can perform recognition with, only a small handful of salient aspects of data, and thus have substantial detail invariance. (Image credit: The empty brain [9])

Humans experience a high-resolution world which is rich in details, yet are able to discern small amounts of highly relevant information from the world. Thus humans learn abstractions that store important pieces of information while discarding others. For example, Figure 2 shows how only a subset of the visual details in a dollar bill are remembered by a person asked to produce a sketch without a reference.

The quality of the abstractions learned by humans may be an important factor in the flexibility and adaptability of human perception: we are able to understand objects and our environment even as details change.

Disturbingly, a growing body of evidence shows that while deep networks have visual perception which is competitive with humans (at least in some cases), the abstractions learned by deep networks differ substantially from what humans learn.

An exploration by [37] showed that deep networks correctly classify objects when only high-frequency components of the image are preserved but fail completely when only low-frequency components are preserved (the low-frequency version of an image is a blurred version of that image). This is the opposite of the inductive bias demonstrated by humans.

The adversarial examples literature [33] has found that neural networks’ predictions can be changed by extremely small (adversarially selected) perturbations which are imperceptible to human vision. At the same time, these small perturbations have either a very small effect on humans (when given very little time to process an image) [8] or no effect at all.

The BagNet project [3] explored using bags of local features for ImageNet classification and found that such models could achieve competitive results. Additionally, they found that if one performed style transfer on real images [13] and retained the textures while scrambling the image’s content, standard convolutional ImageNet classifiers retained strong performance. This is evidence that convnets trained on datasets similar to ImageNet may learn primarily from local textures while discarding global structure, indicating a strong lack of detail invariance. On artificial images with the local texture from one image but the global shape and content of another image, ImageNet classifiers nearly always prefer the class associated with the texture rather than the content [14]. This can be partially addressed, and generalization improved, by training on artificially stylized examples from the ImageNet dataset.

We create a dataset, which we call SketchTransfer, where a neural network is evaluated on the quality of its abstractions learned without explicit supervision. More specifically, we constructed a dataset in which a model must be able to classify images of sketches while only having access to labels for real images. As an example of why this task is difficult, consider sketches of dogs and cats. Many of these sketches choose to focus on the face of the animal. Oftentimes the only clear difference between a dog sketch and a cat sketch is the shape of the ears, with dogs having round ears and cats having pointed ears. While the shape of the ears is a feature which is present in real images of cats and dogs, a neural network may not pick up on this highly salient feature.

While ideally a network could perform this task with just real data and without any access to data points with missing details, this may make the task too difficult. To make the task easier, we give the network access to unlabeled sketches with the long-term goal of creating networks that can perform well with as little sketch data as possible or perhaps no sketch data at all (in practice we found the task is challenging even with a substantial amount of sketch data).

An emerging view in machine learning is that a great deal of learning is accomplished in a “self-supervised way” by relying primarily on the structure in unlabeled natural data as a form of supervision. A successful algorithm on the SketchTransfer dataset could provide evidence that detail invariance is also achievable without explicit supervision on data with missing details.

Our new dataset, which we call SketchTransfer, provides the following contributions:

  • A new dataset for transfer learning, which is approachable with today’s state of the art methods but is still very challenging.

  • A demonstration that even well-regularized algorithms fail to generalize across different levels of detail.

  • A brief exposition of a few state-of-the-art approaches to transfer learning and experiments with them on the SketchTransfer dataset.

2 SketchTransfer

The SketchTransfer training dataset consists of two parts: labeled real images and unlabeled sketch images. The test dataset consists of labeled sketch images. To make the task as straightforward as possible, we used the already popular CIFAR-10 dataset as the source of labeled real images.

We used the QuickDraw dataset [23] as the source of sketch images. This dataset consists of 345 classes and was collected by asking volunteers to quickly sketch a given class with a 20-second time limit. For the SketchTransfer dataset, we rendered the QuickDraw images at a fixed resolution of 32x32.
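As a rough illustration of the rendering step, the sketch below rasterizes stroke data onto a 32x32 binary canvas. It assumes each drawing is a list of strokes of the form `[xs, ys]` with coordinates in the range 0 to 255; the function name `render_strokes`, that input format, and the simple line-drawing scheme are all assumptions made for illustration, not a description of how the SketchTransfer images were actually produced.

```python
def render_strokes(strokes, size=32, src_max=255):
    """Rasterize strokes onto a size x size binary canvas (list of rows).

    Assumed input format: each stroke is [xs, ys], with coordinates
    in the range 0..src_max.
    """
    canvas = [[0] * size for _ in range(size)]
    scale = (size - 1) / src_max
    for xs, ys in strokes:
        points = [(x * scale, y * scale) for x, y in zip(xs, ys)]
        if len(points) == 1:
            points = points * 2  # draw a single point as a zero-length segment
        for (x0, y0), (x1, y1) in zip(points, points[1:]):
            # Sample the segment densely enough that no pixel is skipped.
            steps = int(max(abs(x1 - x0), abs(y1 - y0))) + 1
            for s in range(steps + 1):
                t = s / steps
                col = round(x0 + t * (x1 - x0))
                row = round(y0 + t * (y1 - y0))
                canvas[row][col] = 1
    return canvas


# A single horizontal stroke along the top edge of the source canvas.
image = render_strokes([[[0, 255], [0, 0]]])
```

The output is a list of 32 rows of 0/1 pixels; a real pipeline would more likely use an imaging library with anti-aliasing and stroke width.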

The QuickDraw dataset contains many more classes than CIFAR-10, and its classes are slightly different and generally more fine-grained. To solve this we defined a correspondence between the classes in CIFAR-10 and a relevant subset of the QuickDraw classes (shown in Table 1). For example, we map the QuickDraw classes “Car” and “Police Car” onto the CIFAR-10 class automobile.

One small issue is that CIFAR-10 contains a deer class and QuickDraw doesn’t. Thus we elected to make our SketchTransfer test set only have 9 classes, but for the convenience of researchers we keep the CIFAR-10 training set the same (having 10 classes). Our final dataset of sketches consists of 9 classes which correspond directly to CIFAR-10 classes. This sketch dataset has a total of 90000 training images and 22500 test images (10000 and 2500 per class respectively).

Examples from both real and sketch images across all the classes can be seen in Figure 1 and Figure 3. One important property of SketchTransfer is that aside from sharing common classes, there is no explicit pairing between the real images and the sketch images.

We received permission from the Google Creative Lab to use the QuickDraw dataset [23].

Figure 3: Examples from all of the classes in the SketchTransfer dataset, with real images from the class and sketch images from the class. Columns: Plane, Car, Bird, Cat, Dog, Frog, Horse, Ship, Truck.

3 Related Datasets

Datasets of human sketches make it possible for algorithms to learn representations closely aligned with human priors. Prior to QuickDraw [23], the Sketch dataset [7] of 20K hand sketches was used to explore feature extraction techniques. A later work, the Sketchy dataset, provided 70K vector sketches paired with corresponding photo images for various classes to facilitate a larger-scale exploration of human sketches. ShadowDraw [24] used a dataset of 30K raster images combined with extracted vectorized features to construct an interactive system that predicts what a finished drawing looks like based on a set of incomplete brush strokes from the user’s digital canvas. In addition to human sketches, ancient Eastern scripts also have sketch-like properties. For example, [4] considered using an RNN and VAEs to generate sketched Japanese characters. Understanding and recognizing these historical Japanese characters has become an important and widely studied problem in the digital humanities field [5, 6].

In developmental childhood psychology, [26, 25] performed a series of experiments in which they tasked children of various ages with drawing sketches of certain classes, and then studied the properties of the sketches using convolutional neural networks. Their main finding was that visual features become more distinctive in the drawings produced by older children and that this goes beyond differences in visuomotor control. This provides some evidence that, for humans, the deeper knowledge about objects gained from growing older is reflected in sketches.

CIFAR-10 Class    QuickDraw Classes
Airplane          Airplane
Automobile        Car, Police Car
Bird              Bird, Duck, Flamingo, Owl, Parrot, Penguin, Swan
Cat               Cat, Lion, Tiger
Deer              n/a
Dog               Dog
Frog              Frog
Horse             Horse
Ship              Cruise Ship, Sailboat, Speedboat
Truck             Truck, Firetruck

Table 1: Class correspondences used to construct the sketch part of the SketchTransfer dataset.
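The correspondence in Table 1 can be written down as a simple lookup. The sketch below is one way to encode it; the lowercase spellings of the class names and the variable names are assumptions chosen for illustration, and may not match the identifiers used in either dataset's files.

```python
# Table 1 as a mapping from CIFAR-10 class to the QuickDraw classes merged into it.
# Lowercase names are an assumption; Table 1 lists them in title case.
CIFAR10_TO_QUICKDRAW = {
    "airplane": ["airplane"],
    "automobile": ["car", "police car"],
    "bird": ["bird", "duck", "flamingo", "owl", "parrot", "penguin", "swan"],
    "cat": ["cat", "lion", "tiger"],
    "deer": [],  # no QuickDraw counterpart, so absent from the sketch data
    "dog": ["dog"],
    "frog": ["frog"],
    "horse": ["horse"],
    "ship": ["cruise ship", "sailboat", "speedboat"],
    "truck": ["truck", "firetruck"],
}

# Inverse lookup: which CIFAR-10 label a given QuickDraw class receives.
QUICKDRAW_TO_CIFAR10 = {
    quickdraw_name: cifar_name
    for cifar_name, quickdraw_names in CIFAR10_TO_QUICKDRAW.items()
    for quickdraw_name in quickdraw_names
}

# The 9 classes that actually appear in the sketch train and test sets.
SKETCH_CLASSES = [name for name, qd in CIFAR10_TO_QUICKDRAW.items() if qd]
```

Keeping deer in the mapping with an empty list mirrors the choice described in Section 2: the real-image side keeps all 10 CIFAR-10 classes, while the sketch side has 9.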

4 Baselines

We evaluated several techniques on the proposed SketchTransfer dataset. First we considered techniques which only use the labeled source data (real images), including the use of regularization. As these models don’t see the sketches during training, we might expect that it would be difficult for these models to generalize well on sketches. For techniques which take advantage of the unlabeled data, we primarily focus on methods which perform well on existing benchmarks. One such benchmark is SVHN to MNIST, which involves transfer learning between very detailed color images of digits (SVHN) and plain black-and-white digit images (MNIST) [27]. One key idea is to try to make the model’s representations on source datapoints and target domain datapoints more similar, which we discuss in Section 4.3. Another key idea is to enforce a consistency-based objective on the target domain, which is also widely used in semi-supervised learning [35]. The idea with this is to discourage the classifier’s decision boundary from passing through regions where the data in the target domain has high density (and thus encourage connected regions of high density to share the same predicted label). Virtual adversarial training (Section 4.4) and virtual mixup training (Section 4.6) are both motivated by this notion of consistency.

4.1 Training only on Labeled Source Data

As the simplest baseline, we consider simply training a network on the CIFAR-10 dataset and evaluating on sketches.

4.2 Regularized Training on Source Data

One possible approach is to still only use the real labeled images, but use regularization. A possibility is that this will force the model to use the more salient features in the data to predict on real images which will then be usable when predicting on sketches. For example, [34] found that the use of Manifold Mixup improves robustness to several classes of artificial image distortions.


Mixup is a state-of-the-art regularizer for deep networks which involves training with linear interpolations of images and using a corresponding interpolation of the labels (Equation 1). In its original formulation [39], it is only applicable to training with labeled data, although an unsupervised variant has also been explored [1].
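The interpolation behind Mixup can be sketched in a few lines. The example below mixes two flattened inputs and their one-hot labels with a coefficient drawn from a Beta distribution, as in the original formulation; the function names and the default `alpha=1.0` are assumptions for illustration, and the baselines in this paper may use different settings.

```python
import random


def sample_lambda(alpha=1.0, rng=random):
    """Draw a mixing coefficient from Beta(alpha, alpha).

    alpha=1.0 (a uniform draw on [0, 1]) is an assumed default.
    """
    return rng.betavariate(alpha, alpha)


def mixup_pair(x_i, y_i, x_j, y_j, lam):
    """Linearly interpolate two inputs and their one-hot labels with weight lam."""
    x_mix = [lam * a + (1.0 - lam) * b for a, b in zip(x_i, x_j)]
    y_mix = [lam * a + (1.0 - lam) * b for a, b in zip(y_i, y_j)]
    return x_mix, y_mix


# Two toy "images" (flattened) with one-hot labels for a 2-class problem.
x_mix, y_mix = mixup_pair([0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [0.0, 1.0], lam=0.25)
```

The network is then trained to predict the soft label `y_mix` from the blended input `x_mix`.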

4.3 Adversarial Domain Adaptation

Domain adversarial training [12] consists of taking the hidden representations learned by a network on a source domain and a target domain and using an adversarially trained discriminator to make the representations follow the same overall distribution. Since the discriminator is trained on hidden states (for both source and target domains), it does not require labeled data from the target domain.


The adversarial domain adaptation objective (Equation 2) augments the usual classifier objective on source data, and a hyperparameter is used to adjust the weight of the adversarial loss, which aims to make features look similar for examples from the source and target distributions.
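One common way to write such an objective in the domain adversarial literature is shown below. All symbols are assumptions introduced for illustration, since the notation is not defined in the text above: $f_\theta$ is the feature extractor, $h_\theta$ the classifier's predicted class distribution, $D$ the discriminator, $\mathcal{D}_s$ and $\mathcal{D}_t$ the source and target data, and $\lambda_d$ the weight on the adversarial term.

```latex
\begin{align}
\min_{\theta}\; & \mathcal{L}_y(\theta; \mathcal{D}_s) + \lambda_d\, \mathcal{L}_d(\theta; \mathcal{D}_s, \mathcal{D}_t) \\
\mathcal{L}_y(\theta; \mathcal{D}_s) &= -\,\mathbb{E}_{(x, y) \sim \mathcal{D}_s}\left[ y^{\top} \ln h_\theta(x) \right] \\
\mathcal{L}_d(\theta; \mathcal{D}_s, \mathcal{D}_t) &= \sup_{D}\; \mathbb{E}_{x \sim \mathcal{D}_s}\left[ \ln D(f_\theta(x)) \right] + \mathbb{E}_{x \sim \mathcal{D}_t}\left[ \ln\left(1 - D(f_\theta(x))\right) \right]
\end{align}
```

The first term is ordinary cross-entropy on labeled source data; the second is small only when the discriminator cannot tell source features from target features.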

4.4 Virtual Adversarial Training

The virtual adversarial training (VAT) [29] algorithm is based on the consistency regularization principle. The central idea of this algorithm is to use pseudolabels of unlabeled samples to generate adversarial perturbations, and to use these adversarial perturbations instead of random perturbations for consistency regularization. We always used the FGSM attack [15] for computing the virtual adversarial perturbations.


Thus, given a ball of some radius around each of the data points, we encourage the change in the model’s prediction within that ball to be small (Equation 3).
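In the standard formulation of VAT, this idea is usually expressed as a worst-case divergence over the ball. The notation below is an assumption for illustration: $h_\theta(x)$ is the model's predicted class distribution, $r$ a perturbation, $\epsilon$ the radius of the ball, and $\mathcal{D}$ the (unlabeled) data.

```latex
\begin{equation}
\mathcal{L}_v(\theta; \mathcal{D}) = \mathbb{E}_{x \sim \mathcal{D}}\left[ \max_{\|r\| \le \epsilon} D_{\mathrm{KL}}\left( h_\theta(x) \,\|\, h_\theta(x + r) \right) \right]
\end{equation}
```

In practice the inner maximization is approximated rather than solved exactly, for example with a single gradient-based step of the kind used by FGSM.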

4.5 DIRT-T

The Decision-boundary Iterative Refinement Training with a Teacher (or “DIRT-T”) algorithm [32] starts with a pre-trained model and considers an iterative refinement procedure to enforce the cluster assumption on the unlabeled target data (the cluster assumption is the notion that decision boundaries are more likely to pass through regions of low density). Because this is only done on the target domain, it leads to high loss on the source domain, but also allows for the possibility of distribution shift between the source and target domains.

4.6 Virtual Mixup Training (VMT)

Approaches which combine consistency regularization with Mixup training [40, 34, 36] have been shown to achieve state-of-the-art results in the semi-supervised learning paradigm [35, 2]. VMT [27] extends these approaches to the paradigm of unsupervised domain adaptation by combining them with domain adversarial training [12]. More formally, VMT [27] augments the loss function of domain adversarial training and the entropy minimization loss with the Mixup loss.

In the case of the target domain, since the targets are not available, the pseudolabels of the unlabeled samples are used for mixing, similar to [35, 2, 29]. The mixing of samples and pseudolabels is done as follows:


The Mixup loss function on pseudolabels (Equation 4), that can be augmented with other losses, is given as follows:
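A plausible form of the mixing step and of the resulting loss, consistent with the description above, is sketched below. The notation is an assumption for illustration: $h_\theta(x)$ is the model's predicted class distribution (used as the pseudolabel), $x_i$ and $x_j$ are two samples from the data $\mathcal{D}$, and $\lambda \in [0, 1]$ is a randomly drawn mixing coefficient whose distribution is not specified here.

```latex
\begin{align}
\tilde{x} &= \lambda\, x_i + (1 - \lambda)\, x_j \\
\tilde{y} &= \lambda\, h_\theta(x_i) + (1 - \lambda)\, h_\theta(x_j) \\
\mathcal{L}_m(\theta; \mathcal{D}) &= \mathbb{E}_{x_i, x_j \sim \mathcal{D}}\left[ D_{\mathrm{KL}}\left( \tilde{y} \,\|\, h_\theta(\tilde{x}) \right) \right]
\end{align}
```

The loss penalizes the model when its prediction on the blended input departs from the same blend of its predictions on the two original inputs.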


4.7 Rotation Prediction

We can also consider adding self-supervised objectives, with the intuition that solving this objective may require the model to have a more thorough understanding of salient aspects than is required to predict the class. One such self-supervised objective that we consider, due to its simplicity, is randomly rotating input images by either 0 degrees, 90 degrees, 180 degrees, or 270 degrees, and having the model classify the degree of rotation as a 4-way classification task [10]. As this objective does not require labels, we can apply it on both the real images and the sketch images.
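The construction of a training example for this objective can be sketched as follows. The image is represented as a plain list of rows, and the helper names `rot90` and `make_rotation_example` are hypothetical; an actual implementation would operate on batched tensors.

```python
import random


def rot90(image, k):
    """Rotate a square image (list of rows) counterclockwise by k * 90 degrees."""
    for _ in range(k % 4):
        # Transpose, then reverse the row order: one counterclockwise quarter turn.
        image = [list(row) for row in zip(*image)][::-1]
    return image


def make_rotation_example(image, rng=random):
    """Return (rotated_image, label), where label in {0, 1, 2, 3} counts quarter turns."""
    k = rng.randrange(4)
    return rot90(image, k), k


img = [[1, 2],
       [3, 4]]
rotated, label = make_rotation_example(img)
```

The label is the number of quarter turns, so the auxiliary head is a 4-way classifier that can be trained on real images and sketches alike.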

4.8 CyCADA and CycleGAN

The CyCADA project [20] explored the use of the CycleGAN algorithm [42] for transfer learning to new domains. Essentially, the CycleGAN consists of an encoder which maps from the source domain to the target domain and a decoder which maps back to the source domain. A discriminator is used to encourage the encoded domain to follow the same distribution as the target domain. Notably, this can be done without having access to paired examples from the source and target domains.

5 Experiments

First we consider running all the baselines discussed in Section 4 on the full SketchTransfer dataset, with 10000 unlabeled images per class (or zero per class, for the baselines which don’t use the unlabeled sketches). All numbers are accuracies. We ran two identical trials of each experiment with different random seeds and report the mean and standard deviation. The results of running these baseline algorithms are presented in Table 2.

We note that our best result not using labeled sketch data comes from the same combination of techniques which achieved the best result in Virtual Mixup Training [27]. On their tasks involving MNIST to SVHN and SVHN to MNIST transfer, this approach was able to almost match the performance of a classifier trained on labeled examples from the target domain. On the SketchTransfer task, however, there is still a clear gap between the 59% obtained with only unlabeled sketches and the 87% obtained by training on labeled sketches.

Table 2: Performance of different baseline models on SketchTransfer. We ran two trials for each experiment and we report the mean with the standard deviation in parentheses. We also report a supervised learning result for a model trained directly on the labeled sketch training set. Our results indicate that using the unlabeled sketch data improves results substantially but still falls short of a model trained directly on labeled sketches. (Surviving column labels: Inst. Norm, Per Class.)

5.1 Analysis

While we demonstrated in Table 2 that a model trained only on real images performs better than chance on sketches, and this performance is improved by the use of transfer learning with sketches, it gives little direct insight into the nature of this improvement. To address this we considered two different analyses of the learned models. To make these analyses as clean as possible, we only considered our best model trained only on real images, our best model trained on unlabeled sketches, and our best model trained on labeled sketches. The best model was selected according to the test accuracies in Table 2. All of these analyses were performed on the test set.

First we considered a confusion matrix analysis, given in Figure 4. The confusion matrix shows the distribution over predicted classes for each ground truth class. A stronger diagonal indicates higher accuracy. We can see that a model trained only on real images is most accurate on planes, trucks, and ships. The confusion matrix for the model trained with unlabeled sketches shows dramatically better classification on the bird, cat, dog, frog, and horse classes.
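For concreteness, a row-normalized confusion matrix of the kind described here can be computed as in the sketch below. The function names are hypothetical, and labels are assumed to be integer class indices.

```python
def confusion_matrix(y_true, y_pred, num_classes):
    """counts[t][p] = number of examples with true class t predicted as class p."""
    counts = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        counts[t][p] += 1
    return counts


def row_normalize(counts):
    """Turn each row into a distribution over predicted classes for that true class."""
    rows = []
    for row in counts:
        total = sum(row)
        rows.append([c / total if total else 0.0 for c in row])
    return rows


# Toy example with 3 classes.
y_true = [0, 0, 1, 1, 2]
y_pred = [0, 1, 1, 1, 0]
cm = confusion_matrix(y_true, y_pred, num_classes=3)
cm_norm = row_normalize(cm)
```

After normalization, the diagonal entry of each row is the per-class accuracy, which is what a "stronger diagonal" reflects.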

Additionally, we conducted an analysis where we collected the k-nearest neighbors for test images using different distance metrics (Figure 5). First, we considered a simple Euclidean distance in pixel space, and this yielded extremely poor nearest neighbors, with very low semantic and class similarity. Next we used different encoders, running each until reaching an 8x8 spatial representation, and used simple Euclidean distance in this space. With a purely random encoder, this yielded poor results. However, with networks trained on real images, the quality of the nearest neighbors was substantially improved in terms of semantic similarity. We also observed that the network trained using transfer learning on unlabeled sketches had even more semantically relevant nearest neighbors.
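The nearest-neighbor procedure can be sketched as below. The `encode` argument stands in for whichever feature extractor is being compared (identity for pixel space, or a network truncated at the 8x8 representation and flattened); the function names and the toy vectors are assumptions for illustration.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest_neighbors(query, candidates, k, encode=lambda v: v):
    """Return indices of the k candidates closest to the query in encoded space.

    encode defaults to the identity, which corresponds to pixel-space distance.
    """
    q = encode(query)
    order = sorted(
        range(len(candidates)),
        key=lambda i: euclidean(q, encode(candidates[i])),
    )
    return order[:k]


candidates = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0], [0.1, 0.0]]
pixel_neighbors = nearest_neighbors([0.0, 0.0], candidates, k=2)
# A stand-in "encoder" that keeps only the second coordinate.
encoded_neighbors = nearest_neighbors(
    [0.0, 0.0], candidates, k=3, encode=lambda v: [v[1]]
)
```

Swapping the encoder while keeping the distance fixed isolates how much of the neighbor quality comes from the learned representation.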

Figure 4: Confusion matrices indicating predicted and true classes for a model trained only on real images (top), our best transfer learning model (center), and a model trained on labeled sketch images (bottom). A stronger diagonal indicates better performance. The model trained only on real images has the best performance on planes and trucks. The model exploiting unlabeled images performs better on many more classes, but still struggles to separate cats and dogs as well as horses and dogs.

Figure 5: Nearest neighbors on the test set using different distances (original on the left). Panels: Pixel-Based; Randomly Initialized Network; Network Trained only on Real Images; Network Trained with Unlabeled Sketch Images; Network Trained on Labeled Sketch Images. Note that the distance learned after transfer learning is clearly more semantically meaningful, and even a well-regularized model using only the source data learned surprisingly good features for sketches.

6 Discussion

In this work, we propose the problem of whether a machine learning algorithm can develop representations of everyday things that can be mapped to human doodles, while giving the learning algorithm access to unlabelled doodle data. We could also consider an alternative approach where we train an artificial agent to sequentially draw, one stroke at a time, a given pixel image (such as CIFAR-10 samples). By constraining the number of brush strokes our agent is allowed to make, we may be able to simulate the biological constraints of the human anatomy and thus also develop a human doodle-like representation of actual images.

This approach has been explored in Artist Agent [38] and SPIRAL [11] where an agent is trained to paint an image from a data distribution using a paint simulator. Subsequent works [41, 21] combined the SPIRAL algorithm with the sketch-rnn [17] algorithm and enhanced the drawing agent by giving it an internal world model [18] to demonstrate that the agent can learn to paint inside its imagination. While the aforementioned works train an agent with an objective function to produce doodles that are as photorealistic as possible, a recent work [30] trains an agent to optimize instead for the content loss defined by a neural style transfer algorithm [13], opening up the exciting possibility of training agents that can produce truly abstract versions of photos.

However, the doodle representations of everyday objects developed by humans are not merely a product of our constrained biological ability to draw objects sequentially with our hands, stroke by stroke; cultural influence is also at play. The location metadata from the original QuickDraw dataset provided interesting examples of this phenomenon. For instance, it has been observed that most Americans draw circles counterclockwise, while most people in Japan draw them clockwise [19]. Chairs drawn in different countries tend to have different orientations [22]. Snowmen drawn in hotter countries tend to consist of two snowballs, while those drawn in colder countries tend to have three [28]. By giving our algorithm actual unlabelled doodle data produced by humans around the world, we give our machine learning algorithms a chance to learn the true representation of objects developed by all of humanity.

7 Future Work

The idea of SketchTransfer is quite general and we identify a few ways in which it could be extended:

  • While CIFAR-10 does contain realistic images, it is limited in a few ways. First, the images are rather small, being only 32x32. Second, the number of labeled images is relatively small. Third, they generally only contain the object of interest and lack context, both spatial and temporal. Selecting a dataset which is stronger along any of these axes could be a useful improvement to the SketchTransfer task.

  • Sketches are to a large extent an extreme case of images which lack irrelevant details. Various cartoons and shaded illustrations may be a reasonable middle-ground and serve as an easier analogue task to SketchTransfer.

  • We elected to make the task easier by providing access to unlabeled sketch images. Another way to accomplish this might be to provide a very small number of labeled sketch images and use meta-learning to perform few-shot (or one-shot) classification of sketches. The task of classification without using any sketch data during training would be challenging, yet [16] demonstrated strong generalization to changing environments by encouraging independent mechanisms as an inductive bias of the architecture.

8 Conclusion

Human perceptual understanding shows a great deal of invariance to details and emphasis on salient features which today’s deep networks lack. We have introduced SketchTransfer, a new dataset for studying this phenomenon concretely and quantitatively. We applied the current state-of-the-art algorithms on domain transfer to this task (along with ablations). Intriguingly, we found that they only achieve 59% accuracy on the SketchTransfer task, which is well above the 11% accuracy of a random classifier but falls dramatically short of the 87% accuracy of a classifier trained on the labeled sketch dataset. This indicates that this new SketchTransfer task could be a powerful testbed for exploring detail invariance and abstractions learned by deep networks.