Mastering Sketching: Adversarial Augmentation for Structured Prediction

03/27/2017 ∙ by Edgar Simo-Serra, et al. ∙ 0

We present an integral framework for training sketch simplification networks that convert challenging rough sketches into clean line drawings. Our approach augments a simplification network with a discriminator network, training both networks jointly so that the discriminator network discerns whether a line drawing is real training data or the output of the simplification network, which in turn tries to fool it. This approach has two major advantages. First, because the discriminator network learns the structure in line drawings, it encourages the output sketches of the simplification network to be more similar in appearance to the training sketches. Second, we can also train the simplification network with additional unsupervised data, using the discriminator network as a substitute teacher. Thus, by adding only rough sketches without simplified line drawings, or only line drawings without the original rough sketches, we can improve the quality of the sketch simplification. We show how our framework can be used to train models that significantly outperform the state of the art in the sketch simplification task, despite using the same architecture for inference. We additionally present an approach to optimize for a single image, which improves accuracy at the cost of additional computation time. Finally, we show that, using the same framework, it is possible to train the network to perform the inverse problem, i.e., convert simple line sketches into pencil drawings, which is not possible using the standard mean squared error loss. We validate our framework with two user tests, where our approach is preferred to the state of the art in sketch simplification 92.3% of the time and obtains a higher absolute rating on a scale of 1 to 5.




1 Introduction

Annotated Images Sketches “in the wild”
Figure 2: Comparison between the supervised dataset of [Simo-Serra et al. 2016] and rough sketches found in the wild. The difficulty of obtaining high quality and diverse rough sketches and their corresponding simplified sketches greatly limits performance on rough sketches “in the wild” that can be significantly different from the annotated data used for training models. The three images on the left of the Sketches “in the wild” are copyrighted by David Revoy and licensed under CC-by 4.0.

Sketching plays a critical role in the initial stages of artistic work such as product design and animation, allowing an artist to quickly draft and visualize their thoughts. However, the process of digitizing and cleaning up the rough pencil drawings involves a large overhead. This process is called sketch simplification, and involves simplifying multiple overlapped lines into a single line and erasing superfluous lines that are used as references when drawing, as shown in Fig. 1. Due to the large variety of drawing styles and tools, developing a generic approach for sketch simplification is a challenging task. In this work, we propose a novel approach for sketch simplification that is able to outperform current approaches. Furthermore, our proposed framework can also be used to solve the inverse problem, i.e., pencil drawing generation, as shown in Fig. 1.

Recently, an approach to automate the sketch simplification task with a fully convolutional network was proposed by Simo-Serra et al. [2016]. In order to train this network, large amounts of supervised data, consisting of pairs of rough sketches and their corresponding sketch simplifications, were obtained. To collect this data, the tedious process of inverse dataset construction was used. This involves hiring artists to draw a rough sketch, simplify the sketch into a clean line drawing, and finally redraw the rough sketch on top of the line drawing to ensure the training data is aligned. This is not only time-consuming and costly, but the resulting data also differs from true rough sketches drawn without clean line drawings as references, as shown in Fig. 2. Models trained with this data therefore generalize poorly to rough sketches found “in the wild”, which are representative of true sketch simplification use cases. Here, we propose a framework that can incorporate these unlabeled sketches found “in the wild” into the learning process and significantly improve the performance in real use cases.

Our approach combines a fully-convolutional sketch simplification network with a discriminator network that is trained to discern real line drawings from those generated by a network. The simplification network is trained both to simplify sketches and to deceive the discriminator network. In contrast to optimizing with only the standard mean squared error loss, which considers individual pixels and not their neighbors, our proposed loss considers the entire output image. This significantly improves the quality of the sketch simplifications. Furthermore, the same framework allows augmenting the supervised training dataset with unsupervised examples, leading to a hybrid supervised and unsupervised learning framework which we call Adversarial Augmentation. We evaluate our approach on a diverse set of challenging rough sketches, comparing against the state of the art and alternate optimization strategies. We also perform a user study, in which our approach is preferred 92.3% of the time to the leading competing approach.

We also evaluate our framework on the inverse sketch simplification problem, i.e., generating pencil drawings from line drawings. We note that, unlike the sketch simplification problem, converting clean sketches into rough sketches cannot be accomplished by using a straightforward supervised approach such as a mean squared error loss: the model is unable to learn the structure of the output, and instead of producing rough pencil drawings, it produces blurry representations of the input. However, by using our adversarial augmentation framework, we can successfully train a model to convert clean sketches into rough sketches.

Finally, as another usage case of our framework, we also consider the case of augmenting the training data with the test data input. We note that we use the test data in an unsupervised manner: no privileged information is used. By using the discriminator network and training jointly with the test images, we can significantly improve the results at the cost of computational efficiency.

In summary, we present:

  • A unified framework to jointly train from supervised and unsupervised data.

  • Significant improvements over the state of the art in sketch simplification quality.

  • A method to further improve the quality by single image training.

  • The first approach to converting simplified line drawings into rough pencil-drawn-like images.

2 Background

Sketch Simplification.

As sketch simplification is a tedious manual task for most artists, many approaches attempting to automate it have been proposed. A common approach is to assist the user by adjusting the strokes, for instance, by using geometric constraints [Igarashi et al. 1997], fitting Bézier curves [Bae et al. 2008], or merging strokes based on heuristics [Grimm and Joshi 2012]. These approaches require all the strokes and their drawing order as input. Along similar lines, much work has focused on the problem of simplifying vector images [Shesh and Chen 2008, Orbay and Kara 2011, Lindlbauer et al. 2013, Fišer et al. 2015, Liu et al. 2015]. However, approaches that can be used on raster images have been unable to process complex real-world sketches [Noris et al. 2013, Chen et al. 2013]. Most of these approaches rely on a pre-processing stage that converts the image into a graph which is then optimized [Hilaire and Tombre 2006, Noris et al. 2013, Favreau et al. 2016]; however, they generally cannot recover from errors in the pre-processing stage. Simo-Serra et al. [2016] proposed a fully-automatic approach for simplifying sketches directly from raster images of rough sketches by using a fully-convolutional network. However, their approach still needs large amounts of supervised data, consisting of pairs of rough sketches and their corresponding sketch simplifications, tediously created by the process of inverse dataset construction. Their training sketches differ from true rough sketches, as shown in Fig. 2. Models trained with this data therefore generalize poorly to real rough sketches. We build upon their work and propose a framework that can incorporate unlabeled real sketches into the learning process and significantly improve the performance in real use cases. Thus, in contrast to all previous works, we consider challenging real-world scanned data that is significantly more complex than previously attempted.

Generative Adversarial Networks.

In order to train generative models using unsupervised data with back-propagation, Goodfellow et al. [2014] proposed Generative Adversarial Networks (GAN). In the GAN approach, a generative model is paired with a discriminative model and trained jointly. The discriminative model is trained to discern whether an image is real or artificially generated, while the generative model is trained to deceive the discriminative model. By training both jointly, it is possible to train the generative model to create realistic images from random inputs [Radford et al. 2016]. There is also a variant, Conditional GAN (CGAN), that learns a conditional generative model. This can be used to generate images conditioned on class labels [Mirza and Osindero 2014]. In concurrent work, using CGAN for the image-to-image synthesis problem was recently proposed in a preprint [Isola et al. 2016], where the authors use a CGAN loss and apply it to tasks such as image colorization and scene reconstruction from labels. However, CGAN is unable to use unsupervised data, which helps improve performance significantly. In this paper, we compare against a strong CGAN baseline, using the sketch simplification model of [Simo-Serra et al. 2016] with large amounts of data augmentation, and show that our approach can generate significantly better sketch simplifications. The difference in architecture of our approach compared with CGAN is illustrated in Fig. 3.

Pencil Drawing Generation.

To our knowledge, the inverse problem of converting clean sketches to pencil drawings has not been tackled before. Making natural images appear like sketches has been widely studied [Kang et al. 2007, Lu et al. 2012], as natural images have rich gradients which can be exploited for the task. However, using clean sketches that contain very sparse information as inputs is an entirely different problem. In order to produce realistic strokes, most approaches rely on a dataset of examples [Berger et al. 2013], while our approach can directly create novel realistic rough-sketch strokes. As we will show, the discriminative adversarial training proves critical in obtaining realistic pencil-drawn outputs.

2.1 Deep Learning

We base our work on deep multi-layer convolutional neural networks [Fukushima 1988, LeCun et al. 1989], which have seen a surge in usage in the past few years and have been applied to a wide variety of problems. Restricting our attention to those with image input and output, recent examples include super-resolution [Dong et al. 2016], semantic segmentation [Noh et al. 2015], and image colorization [Iizuka et al. 2016]. These networks are all built upon convolutional layers of the form:

$$y^{c}_{u,v} = \sigma\left( b^{c} + \sum_{k} \sum_{i=-M'}^{M'} \sum_{j=-N'}^{N'} w^{c}_{k,\,i+M',\,j+N'} \, x^{k}_{u+i,\,v+j} \right), \tag{1}$$

where for an $M \times N$ convolution kernel with $M' = (M-1)/2$ and $N' = (N-1)/2$, each output channel $c$ and coordinates $(u, v)$, the output value $y^{c}_{u,v}$ is computed as an affine transformation of the input pixels $x^{k}_{u+i,v+j}$ for all input channels $k$ with a shared weight matrix formed by $w^{c}$ and bias value $b^{c}$ that is run through a non-linear activation function $\sigma(\cdot)$. The most widely used non-linear activation function is the Rectified Linear Unit (ReLU), where $\sigma(x) = \max(0, x)$ [Nair and Hinton 2010].
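As a rough illustration of the convolutional layer described above, the following pure-Python sketch computes a single output channel with zero padding and a ReLU activation. The function name `conv2d_relu`, the nested-list data layout, and the zero-padding choice are assumptions made for this example; practical implementations rely on optimized tensor libraries.

```python
def relu(value):
    """Rectified Linear Unit: max(0, value)."""
    return max(0.0, value)


def conv2d_relu(channels, kernels, bias):
    """Compute one output channel of a zero-padded convolution plus ReLU.

    channels[k][i][j] is the input pixel of channel k at row i, column j.
    kernels[k][m][n] is the weight for channel k (odd kernel sizes assumed).
    """
    height, width = len(channels[0]), len(channels[0][0])
    kernel_h, kernel_w = len(kernels[0]), len(kernels[0][0])
    pad_h, pad_w = kernel_h // 2, kernel_w // 2
    output = [[0.0] * width for _ in range(height)]
    for u in range(height):
        for v in range(width):
            total = bias
            for k, channel in enumerate(channels):
                for m in range(kernel_h):
                    for n in range(kernel_w):
                        i, j = u + m - pad_h, v + n - pad_w
                        # Pixels outside the image count as zero (padding).
                        if 0 <= i < height and 0 <= j < width:
                            total += kernels[k][m][n] * channel[i][j]
            output[u][v] = relu(total)
    return output
```

With a kernel that is 1 at its center and 0 elsewhere, the layer reduces to adding the bias and clipping negative values to zero, which is a convenient sanity check.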

These layers are a series of learnable filters with $w$ and $b$ being the learnable parameters. In order to train a network, a dataset consisting of pairs of inputs $x$ and their corresponding ground truth $y^*$ is used in conjunction with a loss function $L(y, y^*)$ that measures the error between the output $y$ of the network and the ground truth $y^*$. This error is used to update the learnable parameters with the backpropagation algorithm [Rumelhart et al. 1986]. In this work we also consider the scenario in which not all data is necessarily in the form of pairs $(x, y^*)$, but can also be in the form of single samples $x$ and $y$ that are not corresponding pairs.

Our work is based on fully-convolutional neural network models that can be applied to images of any resolution. These networks generally follow an encoder-decoder architecture, in which the first layers of the network have an increased stride to lower the resolution of the input layer. At lower resolutions, the subsequent layers are able to process larger regions of the input image: for instance, a convolution applied to an image at half resolution draws on a correspondingly larger area of the original image. Additionally, by processing at lower resolutions, both the memory requirements and computation times are significantly decreased. In this paper, we base our network model on that of [Simo-Serra et al. 2016] and show that we can greatly improve the performance of the resulting model by using a significantly improved learning approach.
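The effect of stride on the region of the input seen by later layers can be checked with the standard receptive-field recurrence. The sketch below is illustrative only: the helper name `receptive_field` and the 3-pixel kernel sizes used in the usage note are assumptions, not values taken from the paper.

```python
def receptive_field(layers):
    """Side length, in input pixels, seen by one output pixel.

    layers is a list of (kernel_size, stride) tuples, first layer first.
    """
    field, jump = 1, 1
    for kernel_size, stride in layers:
        # Each layer widens the field by (kernel_size - 1) steps of the
        # current spacing between neighboring pixels, measured in the input.
        field += (kernel_size - 1) * jump
        jump *= stride
    return field
```

For example, `receptive_field([(3, 1), (3, 1)])` gives 5, whereas `receptive_field([(3, 2), (3, 1)])` gives 7: placing the second convolution after a stride-2 layer lets it cover a larger part of the original image at no extra cost.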

3 Adversarial Augmentation

Figure 3: Overview of our approach, Adversarial Augmentation, compared with different network training approaches. (a) To train a prediction network $S$, supervised training uses a specific loss $L(S(x), y^*)$ that encourages the output $S(x)$ of the network to match the ground truth data $y^*$. (b) The Conditional Generative Adversarial Network (CGAN) introduces an additional discriminator network $D$ that attempts to discern whether an image is from the training data or a prediction by the prediction network $S$, while $S$ is trained to deceive $D$. The discriminator network $D$ takes two inputs $x$ and $y$ and estimates the conditional probability that $y$ comes from the training data given $x$. (c) Our approach fuses the supervised training and the adversarial training. We use a specific loss $L(S(x), y^*)$ to force the output to be coherent with the input, while also using a discriminator network $D$, similar to CGAN, but not conditioned on $x$ (Supervised Adversarial Training). (d) The fact that $D$ only takes $y$ as input, and that $S$ only takes $x$, enables us to use training data $x$ and $y$ that are not related, i.e., in an unsupervised manner, to further train $S$ and $D$ (Unsupervised Data Augmentation).
Figure 4: Overview of our adversarial augmentation framework. We train the simplification network using both supervised pairs of rough sketches and line drawings, and unsupervised rough sketches and line drawings that are not paired. The discriminator network is trained to distinguish real line drawings from those output by the simplification network, while the simplification network is trained to deceive the discriminator network. For the data pairs we additionally use the MSE loss, which forces the simplification network outputs to correspond to the input rough sketches. The two images forming Rough Sketches are copyrighted by David Revoy and licensed under CC-by 4.0.

We present adversarial augmentation, which is the fusion of unsupervised and adversarial training focused on the purpose of augmenting existing networks for structured prediction tasks. An overview of our approach compared with different training approaches can be seen in Fig. 3. Similar to Generative Adversarial Networks (GAN) [Goodfellow et al. 2014], we employ a discriminator network that attempts to distinguish whether an image comes from real data or is the output of another network. Unlike in the case of standard supervised losses such as the Mean Squared Error (MSE), with the discriminator network the output is encouraged to have a global consistency similar to the training images. Furthermore, in contrast to other existing approaches, our proposed approach permits augmenting the training dataset with unlabeled examples, both increasing performance and generalization, while having much more stable training. An overview of our framework can be seen in Fig. 4.

While this work focuses on sketch simplification, the presented approach is general and applicable to other structured prediction problems, such as semantic segmentation or saliency detection.

3.1 The GAN Framework

The purpose of a Generative Adversarial Network (GAN) [Goodfellow et al. 2014] is, given a training set of samples, to estimate a generative model that stochastically generates samples similar to those drawn from the distribution $\rho_y$ represented by the given training set. A trained generative model thus produces, for instance, pictures of random cars similar to a given set of sample pictures. A generative model is described as a neural network model $G : z \mapsto y$ that maps a random vector $z$, drawn from a prior distribution $p_z$, to an output $y$. In the GAN framework, we train two network models: a) the generative model $G$ above, and b) a discriminator model $D : y \mapsto D(y) \in [0, 1]$ that computes the probability that a structured input (e.g., image) $y$ came from the real data, rather than the output of $G$. We jointly optimize $G$ and $D$ with respect to the expectation value:

$$\min_{G} \max_{D} \; \mathbb{E}_{y \sim \rho_y}\left[ \log D(y) \right] + \mathbb{E}_{z \sim p_z}\left[ \log\left(1 - D(G(z))\right) \right], \tag{2}$$

by alternately maximizing the classification log-likelihood of $D$ and then optimizing $G$ to deceive $D$ by minimizing the classification log-likelihood of $D$. By this process, it is possible to train the generative model to create realistic images from random inputs [Radford et al. 2016]. In practice, however, this min-max optimization is unstable and hard to tune to obtain the desired results, which led to some follow-up work [Radford et al. 2016, Salimans et al. 2016]. This model is also limited in that it can only handle relatively low, fixed resolutions.
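To make the min-max objective concrete, the following sketch estimates the GAN value from a batch of discriminator outputs. It is only an illustration: the function name `gan_value` and the idea of passing precomputed probabilities as plain lists are assumptions of this example.

```python
import math


def gan_value(d_real, d_fake):
    """Estimate E[log D(y)] + E[log(1 - D(G(z)))] from samples.

    d_real holds discriminator outputs on real samples, d_fake holds its
    outputs on generated samples; all values must lie strictly in (0, 1).
    """
    real_term = sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term
```

A discriminator that cannot tell the two apart (all outputs 0.5) yields 2 log 0.5, while one that separates them well pushes the value toward 0. The discriminator update increases this quantity and the generator update decreases it.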

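This generative model was later extended to the Conditional Generative Adversarial Network (CGAN) [Mirza and Osindero 2014], which models $G : (x, z) \mapsto y$ that generates $y$ conditioned on some input $x$. Here, the discriminator model $D : (x, y) \mapsto D(x, y)$ also takes $x$ as an additional input to evaluate the conditional probability of $y$ given $x$. Thus, $G$ and $D$ are optimized with the objective:

$$\min_{G} \max_{D} \; \mathbb{E}_{(x, y^*) \sim \rho_{x,y}}\left[ \log D(x, y^*) \right] + \mathbb{E}_{x \sim \rho_x,\, z \sim p_z}\left[ \log\left(1 - D(x, G(x, z))\right) \right], \tag{3}$$

where $p_z$ is the distribution of the random vector $z$. Note that the first expectation value is the average over supervised training data $\rho_{x,y}$ consisting of input-output pairs $(x, y^*)$. Because of this, the CGAN framework is only applicable to supervised data.

The Conditional GAN can be used for the same purpose as ours, namely prediction, rather than sample generation. For that, we just omit the random input $z$, so that $S : x \mapsto y = S(x)$ is now a deterministic prediction given the input $x$. In this case, we thus optimize:

$$\min_{S} \max_{D} \; \mathbb{E}_{(x, y^*) \sim \rho_{x,y}}\left[ \log D(x, y^*) + \log\left(1 - D(x, S(x))\right) \right]. \tag{4}$$
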
3.2 Adversarial Augmentation Framework

In our view, the CGAN framework has one large limitation when used for the structured prediction problem. As we mentioned above, the CGAN objective function (4) can only be trained with a supervised training set, because the crucial discriminator model is conditioned on the input $x$. When supervised data is hard to obtain, this becomes a significant hindrance to performance.

3.2.1 Supervised Adversarial Training

To train the prediction model $S : x \mapsto y$ jointly with the discriminator model $D : y \mapsto D(y)$ that is not conditioned on the input $x$, here we assume that the model $S$ has an associated supervised training loss $L(S(x), y^*)$, where $y^*$ is the ground truth output corresponding to the input $x$. We define the supervised adversarial training as optimizing:

$$\min_{S} \max_{D} \; \mathbb{E}_{(x, y^*) \sim \rho_{x,y}}\left[ \alpha \log D(y^*) + \alpha \log\left(1 - D(S(x))\right) + L(S(x), y^*) \right], \tag{5}$$

where $\alpha$ is a weighting hyper-parameter and the expectation value is over a supervised training set $\rho_{x,y}$ of input-output pairs. This can be seen as a weighted combination of the loss $L(S(x), y^*)$ and a GAN-like adversarial loss, trained on pairs of supervised samples. The difference from GAN is that here we have supervised data, while GAN does not. The difference from CGAN is that the coupling between the input $x$ and the ground truth output $y^*$ is through the conditional discriminator in the case of CGAN (Eq. (4)), while in the case of the supervised adversarial training (Eq. (5)), they are coupled directly through the supervised training loss $L(S(x), y^*)$.

The training consists of jointly maximizing the output of the discriminator network $D$ and minimizing the loss of the model with structured output by $S$. For each training iteration, we alternately optimize $D$ and $S$ until convergence. The hyper-parameter $\alpha$ controls the influence of the adversarial training on the network and is critical for training. Setting $\alpha$ too low gives no advantage over training with the supervised training loss, while setting it too high causes the results to lose coherency with the inputs, i.e., the network generates realistic outputs that do not correlate with the inputs.

3.2.2 Unsupervised Data Augmentation

Our objective function above is motivated by the desire for unsupervised training. In most structured prediction problems, creating supervised training data by annotating the inputs is a time-consuming task. However, in many cases, it is much easier to procure non-matching input and output data, so it is desirable to be able to use them to train our prediction model. Note that in our objective function (5), the first term inside the expectation value only needs the output data $y$, whereas the second only takes the input data $x$. This suggests that we can train using these terms separately with unsupervised data. It turns out that we can indeed use the supervised adversarial objective function to also incorporate the unsupervised data into the training, by separating the first two terms in the expectation value over supervised data in (5) into new expectation values over unsupervised data.

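Suppose that we have large amounts of both input data $\rho_x$ and output data $\rho_y$, in addition to a dataset $\rho_{x,y}$ of fully-annotated pairs. We modify the optimization function to:

$$\min_{S} \max_{D} \; \mathbb{E}_{(x, y^*) \sim \rho_{x,y}}\left[ \alpha \log D(y^*) + \alpha \log\left(1 - D(S(x))\right) + L(S(x), y^*) \right] + \beta \, \mathbb{E}_{y \sim \rho_y}\left[ \log D(y) \right] + \beta \, \mathbb{E}_{x \sim \rho_x}\left[ \log\left(1 - D(S(x))\right) \right], \tag{6}$$

where $\beta$ is a weighting hyper-parameter for the unsupervised data term, and $S$, $D$, $L$, and $\alpha$ are as in Eq. (5).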
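The structure of the combined objective can be sketched as a function of per-sample losses and discriminator outputs. This is a minimal illustration under stated assumptions: the function and argument names are hypothetical, the inputs are precomputed numbers rather than network outputs, and `alpha` and `beta` stand for the two weighting hyper-parameters described in the text.

```python
import math


def mean(values):
    return sum(values) / len(values)


def adversarial_augmentation_objective(sup_losses, d_sup_real, d_sup_fake,
                                       d_unsup_real, d_unsup_fake,
                                       alpha, beta):
    """Estimate the objective with supervised and unsupervised terms.

    sup_losses:   supervised loss for each annotated pair
    d_sup_real:   discriminator outputs on ground-truth outputs of the pairs
    d_sup_fake:   discriminator outputs on predictions for the paired inputs
    d_unsup_real: discriminator outputs on unpaired real outputs
    d_unsup_fake: discriminator outputs on predictions for unpaired inputs
    """
    supervised = mean(sup_losses) + alpha * (
        mean([math.log(p) for p in d_sup_real])
        + mean([math.log(1.0 - p) for p in d_sup_fake]))
    unsupervised = beta * (
        mean([math.log(p) for p in d_unsup_real])
        + mean([math.log(1.0 - p) for p in d_unsup_fake]))
    return supervised + unsupervised
```

Setting `beta` to 0 recovers the purely supervised adversarial objective, which makes it easy to see that the unsupervised terms reuse the same two adversarial terms on data that needs no annotation.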

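Layer Type | Activation Function | Output
input | - | 1 × H × W
convolutional | ReLU | 16 × H/2 × W/2
convolutional | ReLU | 32 × H/4 × W/4
convolutional | ReLU | 64 × H/8 × W/8
convolutional | ReLU | 128 × H/16 × W/16
convolutional | ReLU | 256 × H/32 × W/32
dropout (50%) | - | 256 × H/32 × W/32
convolutional | ReLU | 512 × H/64 × W/64
dropout (50%) | - | 512 × H/64 × W/64
fully connected | Sigmoid | 1

Table 1: Architecture of the discriminator network. All convolutional layers use padding to avoid a decrease in output size and a stride of 2 to halve the resolution of the output. Output sizes are given as channels × height × width for a greyscale input of H × W pixels.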
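The output sizes implied by this architecture follow from two rules stated in the text: every convolutional layer halves the resolution, and the number of filters starts at 16 and doubles at each layer. The sketch below applies those rules; the helper name `discriminator_shapes` and the 256-pixel input size in the usage note are hypothetical choices for illustration, not values from the paper.

```python
def discriminator_shapes(input_size, num_conv_layers=6, first_filters=16):
    """Return (channels, height, width) after each convolutional layer.

    Assumes square inputs, size-preserving padding, and a stride of 2.
    """
    shapes = []
    size, filters = input_size, first_filters
    for _ in range(num_conv_layers):
        size = (size + 1) // 2  # stride 2 halves the resolution (rounding up)
        shapes.append((filters, size, size))
        filters *= 2  # each subsequent layer doubles the number of filters
    return shapes
```

For a hypothetical 256 × 256 input, the first layer yields a 16 × 128 × 128 output and the sixth a 512 × 4 × 4 output, which is then fed to the fully connected layer.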

Optimization is done on both the prediction model $S$ and the discriminator $D$ jointly using supervised data $\rho_{x,y}$ and unsupervised data $\rho_x$ and $\rho_y$. If learning from only $\rho_x$ and $\rho_y$, the model $S$ will not necessarily learn the mapping $x \mapsto y^*$, but only to generate realistic outputs $y$, which is not the objective of structured prediction problems. Thus, using the supervised dataset $\rho_{x,y}$ is still critical for training, as well as the model loss $L(S(x), y^*)$. The supervised data can be seen as an anchor that forces the model to generate outputs coherent with the inputs, while the unsupervised data is used to encourage the model to generate realistic outputs for a wider variety of inputs. See Fig. 4 for a visualization of the approach. As we note above, it is not possible to train CGAN models (4) using unsupervised data, as the discriminator network requires both the input and output of the model as input.

One interesting benefit of being able to use unsupervised data is that, at prediction time, we can use the test input itself to train the network on the fly and improve the result of the prediction. This improves the prediction results by encouraging the prediction network to deceive the discriminator network and thus produce more realistic outputs. This does, however, incur a heavy overhead, as it requires using the entire training framework and optimizing the network.

Rows, top to bottom: Input Image; [Favreau et al. 2016]; LtS [Simo-Serra et al. 2016]; Ours.
Figure 5: Comparison against the state of the art methods of [Favreau et al. 2016] and LtS [Simo-Serra et al. 2016]. For the approach of [Favreau et al. 2016], we had to additionally preprocess the image with a tone curve and tune the default parameters in order to obtain the shown results. Without this manual tweaking, recognizable outputs were not obtained. For both LtS and our approach, we did not preprocess the image and here we just visualize the network output. While [Favreau et al. 2016] manages to capture the global structure somewhat, many different parts of the image are missing due to the complexity of the scene. LtS fails to simplify most regions in the scene and outputs blurry areas for low-confidence regions, which are dominant in these images. Our approach, on the other hand, is able to simplify all images, both preserving detail and obtaining crisp and clean outputs. The images are copyrighted by David Revoy and licensed under CC-by 4.0.

4 Mastering Sketching

We focus on applying our approach to the sketch simplification problem and its inverse. Sketch simplification consists of processing rough sketches such as those drawn by pencil into clean images that are amenable to vectorization. Our approach is also the first approach that can handle the inverse problem, that is, converting clean sketches into pencil drawings.

4.1 Simplification Network

In order to simplify rough sketches, we rely on the same model architecture as [Simo-Serra et al. 2016]. The model consists of a 23-layer fully-convolutional network that has three main building blocks: down-convolutions, convolutions with a stride of 2 to halve the resolution; flat-convolutions, convolutions with a stride of 1 that maintain the resolution; and up-convolutions, convolutions with a stride of 1/2 to double the resolution. In all cases, padding to compensate for the reduction in size caused by the convolution kernel and ReLU activation functions are employed. The general structure of the network follows an hourglass shape, that is, the first seven layers contain three down-convolution layers to decrease the resolution to one eighth. Afterwards, seven flat-convolutions are used to further process the image, and finally, the last nine layers contain three up-convolution layers to restore the resolution to that of the input. There are two exceptional layers: the first layer is a down-convolution layer whose kernel size and padding differ from those of the remaining layers, and the last layer uses a sigmoid activation function to output a greyscale image where all values are in the $[0, 1]$ range. In contrast with [Simo-Serra et al. 2016], which used a weighted MSE loss, we use the MSE loss as the model loss

$$L(S(x), y^*) = \left\| S(x) - y^* \right\|^2, \tag{7}$$

where $\|\cdot\|$ is the Euclidean norm.

Note that the MSE loss itself is not a structured prediction loss, i.e., it is oblivious of any structure between the component pixels. For each output pixel, neighboring output pixels have no effect. However, by additionally using the supervised adversarial loss as done in Eq. (6), the resulting optimization does take into account the structure of the output.
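One way to see that the MSE loss ignores structure is that it does not change when the pixels of both images are shuffled in the same way, even though the shuffle destroys every line in the drawing. The sketch below demonstrates this on flattened images; the helper names and the unnormalized sum of squares are choices made for this example.

```python
def mse_loss(prediction, target):
    """Squared Euclidean distance between two flattened images."""
    return sum((p - t) ** 2 for p, t in zip(prediction, target))


def permute(pixels, order):
    """Rearrange the pixels of a flattened image according to order."""
    return [pixels[index] for index in order]


prediction = [0.1, 0.9, 0.4, 0.0]
target = [0.0, 1.0, 0.0, 0.0]
order = [2, 0, 3, 1]

original = mse_loss(prediction, target)
shuffled = mse_loss(permute(prediction, order), permute(target, order))
```

Here `original` and `shuffled` agree up to floating-point rounding. A discriminator that looks at the whole image has no such invariance, which is why the adversarial term can penalize outputs that are pixel-wise close to the target but structurally implausible.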

4.2 Discriminator Network

The objective of the auxiliary discriminator network is not that of high performance, but to help train the simplification network. If the discriminator network becomes too strong with respect to the simplification network, the gradients used for training generated by the discriminator network tend to vanish, causing the optimization to fail to converge. To avoid this issue, the network is kept “shallow”, uses large amounts of pooling, and employs dropout [Srivastava et al. 2014]. This also allows reducing the overall memory usage and computation time, speeding up the training itself.
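The vanishing-gradient effect can be illustrated numerically. Assuming the discriminator ends in a sigmoid applied to a scalar score (called `logit` here, a name introduced for this example), the derivative of the adversarial term log(1 - D) with respect to that score equals minus the discriminator output, so it shrinks toward zero when the discriminator confidently rejects a generated image.

```python
import math


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def generator_gradient(logit):
    """Derivative of log(1 - sigmoid(logit)) with respect to logit.

    Analytically this equals -sigmoid(logit).
    """
    return -sigmoid(logit)
```

With an undecided discriminator (`logit = 0`) the gradient magnitude is 0.5, whereas at `logit = -10` it falls below 1e-4, leaving the simplification network with almost no training signal from the adversarial term.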

We base our discriminator network on a small CNN with seven layers, the last being fully connected. Similarly to the simplification network, the first layer uses a different convolution kernel size from all subsequent convolutional layers. The first convolutional layer has 16 filters and all subsequent convolutional layers double the number of filters. We additionally incorporate 50% dropout [Srivastava et al. 2014] layers after the last two convolutional layers. All layers use Rectified Linear Units (ReLU) except the final fully-connected layer, which uses the sigmoid activation function to have a single output that corresponds to the probability that the input came from the real data instead of the output of the simplification network. An overview of the architecture can be seen in Table 1.

4.3 Training

Adversarial networks are notoriously hard to train, and this has led to a series of heuristics for training. In particular, for Generative Adversarial Networks (GAN), the balance between the learning of the discriminative and generative components is critical, i.e., if the discriminative component is too strong, the generative component is unable to learn and vice versa. Unlike GAN, which has to rely entirely on the adversarial network for learning, we also have supervised data, which partially simplifies the training.

Training of both networks is done with backpropagation [Rumelhart et al. 1986]. For a more fine-grained control of the training, we balance the supervised training loss and the adversarial loss such that the gradients are roughly the same order of magnitude.

An alternating training scheme is used for both networks: in each iteration, we first update the discriminator network with a mini-batch, and then proceed to update the generative network using the same mini-batch.
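The ordering of the updates can be sketched as a simple loop. The two callables `update_discriminator` and `update_simplifier` are hypothetical placeholders for whatever routine performs one gradient step on each network; only the order of the calls reflects the scheme described above.

```python
def train_alternating(batches, update_discriminator, update_simplifier):
    """Run one discriminator step, then one simplifier step, per mini-batch.

    Returns a log of (network, result) tuples in the order of execution.
    """
    log = []
    for batch in batches:
        # The discriminator is updated first on the current mini-batch.
        log.append(("D", update_discriminator(batch)))
        # The simplification network then reuses the same mini-batch.
        log.append(("S", update_simplifier(batch)))
    return log
```

Reusing the same mini-batch for both steps means the simplification network is always judged by a discriminator that has just seen the very outputs it is about to improve.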

During training, we use batch normalization layers [Ioffe and Szegedy 2015] after all convolutional layers for both the simplification network and the discriminator network, which are then folded into the preceding convolutional layers at test time. Optimization is done using the ADADELTA algorithm [Zeiler 2012], which removes the need for tuning hyper-parameters such as the learning rate by adaptively setting a different learning rate for each weight in the network.
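Folding a batch normalization layer into the preceding convolution is a standard algebraic rewrite: since both operations are affine at test time, their composition is a single affine map. The sketch below shows the rewrite for one output channel; the function and parameter names, including the `eps` stabilizer and its default, are assumptions of this example rather than details given in the paper.

```python
import math


def fold_batch_norm(weights, bias, bn_scale, bn_shift, mean, variance,
                    eps=1e-5):
    """Fold batch normalization into the weights and bias of one channel.

    The original channel computes
        bn_scale * (sum(w * x) + bias - mean) / sqrt(variance + eps) + bn_shift
    and the folded channel computes sum(w_folded * x) + bias_folded.
    """
    factor = bn_scale / math.sqrt(variance + eps)
    folded_weights = [w * factor for w in weights]
    folded_bias = (bias - mean) * factor + bn_shift
    return folded_weights, folded_bias
```

After folding, inference needs no separate normalization step, so the test-time network has the same cost as one trained without batch normalization.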

We follow a data augmentation approach similar to that of [Simo-Serra et al. 2016], namely training with eight additional levels of downsampling while using constant-size image patch crops. All training output images are thresholded so that pixel values below 0.9 are set to 0 (pixels are in the $[0, 1]$ range). All the images are randomly rotated and flipped during training. Furthermore, we sample the image patches with higher probability from larger images, such that patches from an image are four times more likely to appear than patches from an image of half its side length. We subtract the mean of the input images of the supervised dataset from all images. Finally, with a probability of 10%, the ground truth images are used as both input and output with the supervised loss, to teach the model that sketches that are already simplified should not be modified.
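Two of the augmentation steps are easy to sketch in isolation: thresholding the target images and sampling source images with a probability that grows with their size. In the code below, the 0.9 cutoff comes from the text, while the helper names and the choice of weighting each image by its pixel count are assumptions made for illustration.

```python
import random


def threshold_target(pixels, cutoff=0.9):
    """Set target pixel values below the cutoff to 0; keep the rest."""
    return [0.0 if value < cutoff else value for value in pixels]


def sample_image_index(image_sizes, rng):
    """Pick an image index with probability proportional to pixel count.

    image_sizes is a list of (width, height) tuples.
    """
    areas = [width * height for width, height in image_sizes]
    return rng.choices(range(len(areas)), weights=areas, k=1)[0]


rng = random.Random(0)
sizes = [(200, 200), (100, 100)]  # hypothetical image sizes
draws = [sample_image_index(sizes, rng) for _ in range(5000)]
ratio = draws.count(0) / draws.count(1)
```

With these hypothetical sizes the larger image has four times as many pixels, so `ratio` comes out close to 4, in line with the sampling behavior described above.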

(a) Absolute rating.
 | LtS | Ours
absolute | 3.28 | 4.43
vs LtS | - | 92.3%
vs Ours | 7.7% | -
(b) Mean results for all users.
Figure 6: Results of the user study in which we evaluate the state of the art of LtS [Simo-Serra et al. 2016] and our approach on both absolute (1-5 scale) and relative (which result is better?) scales.
Input LtS Ours
Figure 7: Example results included in the user study. The drawing is copyrighted by flickr user “Yama Q” and licensed under CC-by-nc 4.0
Input MSE Loss Adv. Aug. (Artist 1) Adv. Aug. (Artist 2)
Figure 8: Examples of pencil drawing generation with our training framework. We compare three models: one trained with the standard MSE loss, and two models trained with adversarial augmentation using data from two different artists. In the first column, we show the input to all three models, followed by the outputs of each model. The first row shows the entire image, while the bottom row shows the area highlighted in red in the input image zoomed. We can see that the MSE loss only succeeds in blurring the input image, while the two models trained with adversarial augmentation are able to show realistic pencil drawings. We also show how training on data from different artists gives significantly different results. Artist 1 tends to add lots of smudge marks even far away from the lines, while artist 2 uses many overlapping lines to give the shape and form to the drawing.
Figure 9: More examples of pencil drawing generation. The line drawings on the left are automatically converted to the pencil drawings on the right.
Columns (left to right): Input, Supervised-only, Ours.
Figure 10: Visualization of the benefits of using additional unsupervised data for training with our approach. For rough sketches fairly different from those in the training data we can see a clear benefit when using additional unsupervised data. Note that this data, in contrast with supervised data, is simple to obtain. We note that other approaches such as CGAN are unable to use unsupervised data in training. The bottom image is copyrighted by David Revoy and licensed under CC-by 4.0.

5 Experiments

We train our models using the supervised dataset from [Simo-Serra et al. 2016], consisting of 68 pairs of rough sketches with their corresponding clean sketches, in addition to 109 clean sketches and 85 rough sketches taken from Flickr and other sources. Note that the additional clean and rough sketches require no additional annotations, unlike the original supervised dataset. Some examples of the data used for training are shown in Fig. 4. We train for 150,000 iterations. Each batch consists of 16 pairs of image patches sampled from the 68 supervised pairs, 16 image patches sampled from the 109 clean sketches, and 16 image patches sampled from the 85 rough sketches. We initialize the weights of all models by training exclusively in a supervised fashion on the supervised data, and in particular we use the state-of-the-art model of [Simo-Serra et al. 2016]. We note that this pretraining is critical for learning: without it, training is greatly slowed down, and in the case of CGAN it does not converge.
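The batch composition described above can be illustrated with a short sketch. It is a simplification under stated assumptions: the function and key names are hypothetical, sampling is done with replacement, and the items are image identifiers rather than the image patches that would be cropped in practice.

```python
import random

BATCH_PER_SOURCE = 16  # from the text: 16 samples from each of the three sets

def make_batch(supervised_pairs, clean_only, rough_only, rng):
    """Draw one mixed batch: supervised pairs plus unpaired clean and
    unpaired rough samples (sampling with replacement is an assumption)."""
    return {
        "supervised": rng.choices(supervised_pairs, k=BATCH_PER_SOURCE),
        "clean_only": rng.choices(clean_only, k=BATCH_PER_SOURCE),
        "rough_only": rng.choices(rough_only, k=BATCH_PER_SOURCE),
    }

# Placeholder identifiers matching the dataset sizes given in the text.
pairs = [(f"rough_{i}", f"clean_{i}") for i in range(68)]
extra_clean = [f"extra_clean_{i}" for i in range(109)]
extra_rough = [f"extra_rough_{i}" for i in range(85)]

batch = make_batch(pairs, extra_clean, extra_rough, random.Random(1))
```

Keeping the three sources at a fixed count per batch means every update sees both supervised and unsupervised examples, regardless of how many images each set contains.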

5.1 Comparison against the State of the Art

We compare against the state of the art of [Favreau et al. 2016] and Learning to Simplify (LtS) [Simo-Serra et al. 2016] in Fig. 5. We found that the post-processing done in LtS requires per-image tuning for images “in the wild” and opt to show the model outputs directly instead. We can see that in general the approach of [Favreau et al. 2016] fails to preserve most fine details, while preserving unnecessary ones. On the other hand, LtS has low confidence on most fine details, resulting in large blurry sections. Our proposed approach is able to both preserve fine details and avoid blurring.

5.2 User Study

We perform two user studies for a quantitative analysis on additional test data that is not part of our unsupervised training set. For both studies, we process 50 images with both our approach and LtS. In the first study, we show the outputs of both approaches side by side in randomized order to 15 users and ask them to choose the better result. In the second study, we show the input rough sketch together with a simplified sketch and ask users to rate the conversion on a scale of 1 to 5. We directly display the output of both networks, and the order of the images shown is randomized for every user. Evaluation results are shown in Fig. 6, and example images from the study are shown in Fig. 7.

In the absolute evaluation we can see that, while both approaches are scored fairly high, our approach obtains 1.15 points above the state of the art on a scale of 1 to 5. In the relative evaluation, our approach is preferred to the state of the art 92.3% of the time, with 7 of the 15 users preferring our approach 100% of the time, and the lowest scoring user preferring our approach 72% of the time. This highlights the importance of using adversarial augmentation to obtain more realistic sketch outputs, avoiding blurry or ill-defined areas. From the example images, we can see that the LtS model in general tends to blur complicated areas it is not able to fully parse, while our approach always produces well-defined and crisp outputs. Note that both network architectures are exactly the same: only the learning process, and thus the weight values, change. Additional qualitative examples are shown in Fig. 5, where it can be seen that our approach outputs sketch simplifications without blurring.
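The relative-evaluation statistics reported above (overall preference rate, number of users who always preferred one method, and the lowest per-user rate) could be computed along the following lines. The function name and the vote data are made up for illustration only and are not the study's data.

```python
def summarize_preferences(votes_per_user):
    """votes_per_user: one list of booleans per user, True when that user
    preferred 'ours' over the baseline for a given image."""
    per_user = [sum(votes) / len(votes) for votes in votes_per_user]
    total_votes = sum(len(votes) for votes in votes_per_user)
    overall = sum(sum(votes) for votes in votes_per_user) / total_votes
    return {
        "overall": overall,
        "unanimous_users": sum(1 for rate in per_user if rate == 1.0),
        "lowest_user": min(per_user),
    }

# Made-up toy votes for three users and four images each.
toy_votes = [
    [True, True, True, True],
    [True, True, True, False],
    [True, False, True, True],
]
summary = summarize_preferences(toy_votes)
```

Reporting the lowest per-user rate alongside the overall rate shows whether the preference is shared by all participants or driven by only a few of them.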

5.3 Pencil Drawing Generation

We also apply our proposed approach to the inverse problem of sketch simplification, that is, pencil drawing generation. We swap the input and output of the training data used for sketch simplification and train new models. However, unlike sketch simplification, it turns out that it is not possible to obtain realistic results without supervised adversarial training: the output just becomes a blurred version of the input. Only by introducing the adversarial augmentation framework is it possible to learn to produce realistic pencil sketches. We train three models: one with the MSE loss, and two with adversarial augmentation for different artists. The MSE loss and Artist 1 models are trained on 22 image pairs, while the Artist 2 model is trained on 80 image pairs. We do not augment the training data with unsupervised examples, as we only have training pairs for both artists. Results are shown in Fig. 8. We can see how the adversarial augmentation is critical in obtaining realistic outputs and not just a blurred version of the input. Furthermore, by training on different artists, we seem to obtain models that capture each artist's personality and nuances. Additional results are shown in Fig. 9.
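One commonly cited reason why an MSE loss alone produces blurred outputs, offered here as an assumption rather than as the authors' own analysis, is that when several different outputs are plausible for the same input, the prediction minimizing the squared error is their average. The toy example below uses made-up one-dimensional "images" to show that the averaged prediction scores a lower MSE than either sharp target.

```python
def mse(pred, targets):
    """Summed squared pixel error of `pred`, averaged over all targets."""
    per_target = [
        sum((p - t) ** 2 for p, t in zip(pred, target)) for target in targets
    ]
    return sum(per_target) / len(targets)

def mean_target(targets):
    """Pixel-wise average of the targets, which is the MSE minimizer."""
    n = len(targets)
    return [sum(column) / n for column in zip(*targets)]

# Two equally plausible outputs for the same input (toy values).
targets = [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
blurred = mean_target(targets)

error_blurred = mse(blurred, targets)
error_sharp = mse(targets[0], targets)
```

Here `error_blurred` is smaller than `error_sharp`, so a model trained only with MSE is pushed toward the averaged output, whereas a discriminator can penalize outputs that do not look like real drawings.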

5.4 Generalizing with Unsupervised Data

One of the main advantages of our approach is the ability to exploit unsupervised data. This is very beneficial, as acquiring matching pairs of rough sketches and simplified sketches is time-consuming and laborious. Furthermore, it is hard to obtain examples from many different illustrators to teach the model to simplify a wide variety of styles. We train a model using the supervised adversarial loss, i.e., without unsupervised data, and compare it against our full model using unsupervised data in Fig. 10. We can see a clear benefit in images fairly different from those in the training data, indicating better generalization of the model. In contrast to our approach, existing approaches are unable to benefit from a mix of supervised and unsupervised data.

5.5 Single-Image Optimization

Columns (left to right, for each of the two examples): Input, Output, Optimized.
Figure 11: Single image optimization. We show examples of images in which our proposed model does not obtain very good results, and optimize our model for these single images in an unsupervised fashion. This optimization process allows adapting the model to new data without annotations. The images are copyrighted by David Revoy and licensed under CC-by 4.0.

As another extension of our framework, we introduce single-image optimization. Since we are able to directly use unsupervised data, it seems natural to use the test set with the adversarial augmentation framework to optimize the model for the test data. Note that this is done at test time and does not involve any privileged information, as the test set is used in a fully unsupervised manner. We test this approach using a single additional image and optimizing the network for this image. Optimization is done by using the adversarial augmentation from Eq. (6), with the unsupervised data consisting of the single test image. The other hyper-parameters are set to the same values as used for sketch simplification. Results are shown in Fig. 11. We can see how optimizing results on a single test image can provide a further increase in accuracy, particularly when considering very hard images. In particular, in the left image, using the pretrained model leads to a non-recognizable output, as there is very little contrast in the input image. We do note, however, that this procedure leads to inference times a few orders of magnitude slower than using a pretrained network.

5.6 Comparison against Conditional GAN

Columns (left to right): Input, CGAN, Ours.
Figure 12: Comparison of our approach against the Conditional GAN approach. The bottom image is copyrighted by David Revoy and licensed under CC-by 4.0.

We also perform a qualitative comparison against the recent Conditional GAN (CGAN) approach as an alternative learning scheme. As in the other comparisons, the CGAN is pretrained using the model of [Simo-Serra et al. 2016]. The training data is the same as for our model when using only supervised data; the difference lies in the loss. The CGAN model uses a loss based on Eq. (4), while the supervised model uses Eq. (5). The discriminator network of the CGAN model takes both the rough sketch and the simplified sketch as input, while the discriminator in our approach takes only the simplified sketch. We found the CGAN model to be much more unstable during training, several times becoming completely unstable and forcing us to restart the training. This is likely caused by using only the GAN loss, in contrast with our model, which also uses the MSE loss for training stability.

Results are shown in Fig. 12. We can see that the CGAN approach is able to produce crisp, non-blurry lines thanks to the GAN loss. However, it fails at simplifying the input image and adds additional artefacts. This is likely caused by the GAN loss itself, as it is a very unstable loss prone to artefacts. Our approach, on the other hand, uses a different loss that also includes the MSE loss, which adds stability and coherency to the output images.

5.7 Discussion

Columns (left to right): Input, Only Unsupervised, Ours.
Figure 13: Comparison of our approach with and without supervised data. With only unsupervised data, the output loses its coherency with the input and ends up looking like abstract line drawings.

While our approach can make great use of unsupervised data, it still has an important dependency on high quality supervised data, without which it would not be possible to obtain good results. As an extreme case, we train a model without supervised data and show results in Fig. 13. Note that this model uses the initial weights of the LtS model, without which it would not be possible to train it. While the output images do look like line drawings, they have lost any coherency with the input rough sketch.

6 Conclusions

We have presented adversarial augmentation for structured prediction and applied it to the sketch simplification task as well as its inverse problem, i.e., pencil drawing generation. We have shown that by augmenting the standard model loss with a supervised adversarial loss, it is possible to get much more realistic structured outputs. Furthermore, the same framework allows for unsupervised data augmentation, which is essential for structured prediction tasks in which obtaining additional annotated training data is very costly. The proposed approach also opens the door to tasks that are not possible with standard losses, such as generating pencil drawings from clean sketches, and has wide applicability to most structured prediction problems. As adversarial augmentation only applies to training, the resulting models have exactly the same inference properties as the non-augmented versions. As a further extension, we show that the framework can also be used to optimize for a single input in situations where accuracy is valued more than quick computation. This can, for example, be used to personalize the model to different artists using only unsupervised rough and clean training data from each particular artist.

7 Acknowledgements

This work was partially supported by JST CREST and JST ACT-I Grant Number JPMJPR16UD.

References


  • [Bae et al. 2008] Bae, S.-H., Balakrishnan, R., and Singh, K. 2008. Ilovesketch: As-natural-as-possible sketching system for creating 3d curve models. In ACM Symposium on User Interface Software and Technology, 151–160.
  • [Berger et al. 2013] Berger, I., Shamir, A., Mahler, M., Carter, E., and Hodgins, J. 2013. Style and abstraction in portrait sketching. ACM Transactions on Graphics 32, 4, 55.
  • [Chen et al. 2013] Chen, J., Guennebaud, G., Barla, P., and Granier, X. 2013. Non-oriented mls gradient fields. Computer Graphics Forum 32, 8, 98–109.
  • [Dong et al. 2016] Dong, C., Loy, C. C., He, K., and Tang, X. 2016. Image super-resolution using deep convolutional networks. IEEE Transactions on Pattern Analysis and Machine Intelligence 38, 2, 295–307.
  • [Favreau et al. 2016] Favreau, J.-D., Lafarge, F., and Bousseau, A. 2016. Fidelity vs. simplicity: a global approach to line drawing vectorization. ACM Transactions on Graphics (Proceedings of SIGGRAPH) 35, 4.
  • [Fišer et al. 2015] Fišer, J., Asente, P., and Sýkora, D. 2015. Shipshape: A drawing beautification assistant. In Workshop on Sketch-Based Interfaces and Modeling, 49–57.
  • [Fukushima 1988] Fukushima, K. 1988. Neocognitron: A hierarchical neural network capable of visual pattern recognition. Neural Networks 1, 2, 119–130.
  • [Goodfellow et al. 2014] Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. 2014. Generative adversarial nets. In Conference on Neural Information Processing Systems.
  • [Grimm and Joshi 2012] Grimm, C., and Joshi, P. 2012. Just drawit: A 3d sketching system. In International Symposium on Sketch-Based Interfaces and Modeling, 121–130.
  • [Hilaire and Tombre 2006] Hilaire, X., and Tombre, K. 2006. Robust and accurate vectorization of line drawings. IEEE Transactions on Pattern Analysis and Machine Intelligence 28, 6, 890–904.
  • [Igarashi et al. 1997] Igarashi, T., Matsuoka, S., Kawachiya, S., and Tanaka, H. 1997. Interactive beautification: A technique for rapid geometric design. In ACM Symposium on User Interface Software and Technology, 105–114.
  • [Iizuka et al. 2016] Iizuka, S., Simo-Serra, E., and Ishikawa, H. 2016. Let there be Color!: Joint End-to-end Learning of Global and Local Image Priors for Automatic Image Colorization with Simultaneous Classification. ACM Transactions on Graphics (Proceedings of SIGGRAPH) 35, 4.
  • [Ioffe and Szegedy 2015] Ioffe, S., and Szegedy, C. 2015. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning.
  • [Isola et al. 2016] Isola, P., Zhu, J., Zhou, T., and Efros, A. A. 2016. Image-to-image translation with conditional adversarial networks. arXiv preprint arXiv:1611.07004.
  • [Kang et al. 2007] Kang, H., Lee, S., and Chui, C. K. 2007. Coherent line drawing. In International Symposium on Non-Photorealistic Animation and Rendering, 43–50.
  • [LeCun et al. 1989] LeCun, Y., Boser, B., Denker, J. S., Henderson, D., Howard, R. E., Hubbard, W., and Jackel, L. D. 1989. Backpropagation applied to handwritten zip code recognition. Neural computation 1, 4, 541–551.
  • [Lindlbauer et al. 2013] Lindlbauer, D., Haller, M., Hancock, M. S., Scott, S. D., and Stuerzlinger, W. 2013. Perceptual grouping: selection assistance for digital sketching. In International Conference on Interactive Tabletops and Surfaces, 51–60.
  • [Liu et al. 2015] Liu, X., Wong, T.-T., and Heng, P.-A. 2015. Closure-aware sketch simplification. ACM Transactions on Graphics (Proceedings of SIGGRAPH Asia) 34, 6, 168:1–168:10.
  • [Lu et al. 2012] Lu, C., Xu, L., and Jia, J. 2012. Combining sketch and tone for pencil drawing production. In International Symposium on Non-Photorealistic Animation and Rendering, 65–73.
  • [Mirza and Osindero 2014] Mirza, M., and Osindero, S. 2014. Conditional generative adversarial nets. In Conference on Neural Information Processing Systems Deep Learning Workshop.
  • [Nair and Hinton 2010] Nair, V., and Hinton, G. E. 2010. Rectified linear units improve restricted boltzmann machines. In International Conference on Machine Learning, 807–814.
  • [Noh et al. 2015] Noh, H., Hong, S., and Han, B. 2015. Learning deconvolution network for semantic segmentation. In International Conference on Computer Vision.
  • [Noris et al. 2013] Noris, G., Hornung, A., Sumner, R. W., Simmons, M., and Gross, M. 2013. Topology-driven vectorization of clean line drawings. ACM Transactions on Graphics 32, 1, 4:1–4:11.
  • [Orbay and Kara 2011] Orbay, G., and Kara, L. 2011. Beautification of design sketches using trainable stroke clustering and curve fitting. IEEE Transactions on Visualization and Computer Graphics 17, 5, 694–708.
  • [Radford et al. 2016] Radford, A., Metz, L., and Chintala, S. 2016. Unsupervised representation learning with deep convolutional generative adversarial networks. In International Conference on Learning Representations.
  • [Rumelhart et al. 1986] Rumelhart, D., Hinton, G., and Williams, R. 1986. Learning representations by back-propagating errors. In Nature.
  • [Salimans et al. 2016] Salimans, T., Goodfellow, I., Zaremba, W., Cheung, V., Radford, A., and Chen, X. 2016. Improved techniques for training gans. In Conference on Neural Information Processing Systems.
  • [Shesh and Chen 2008] Shesh, A., and Chen, B. 2008. Efficient and dynamic simplification of line drawings. Computer Graphics Forum 27, 2, 537–545.
  • [Simo-Serra et al. 2016] Simo-Serra, E., Iizuka, S., Sasaki, K., and Ishikawa, H. 2016. Learning to Simplify: Fully Convolutional Networks for Rough Sketch Cleanup. ACM Transactions on Graphics (Proceedings of SIGGRAPH) 35, 4.
  • [Srivastava et al. 2014] Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. 2014. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15, 1929–1958.
  • [Zeiler 2012] Zeiler, M. D. 2012. ADADELTA: an adaptive learning rate method. arXiv preprint arXiv:1212.5701.