Image Generation from Freehand Scene Sketches

Chengying Gao et al. ∙ Sun Yat-sen University ∙ Huawei Technologies Co., Ltd. ∙ March 5, 2020

We introduce the first method for automatic image generation from scene-level freehand sketches. Our model allows for controllable image generation by specifying the synthesis goal via freehand sketches. The key contribution is an attribute-vector-bridged generative adversarial network called edgeGAN, which supports generation of image content with high visual quality without using freehand sketches as training data. We build a large-scale composite dataset called SketchyCOCO to comprehensively evaluate our solution. We validate our approach on both object-level and scene-level image generation on SketchyCOCO, and demonstrate its capacity to generate realistic, complex scene-level images from a variety of freehand sketches through quantitative and qualitative results as well as ablation studies.


1 Introduction

In recent years, generative adversarial networks (GANs) [16] have shown significant success in modeling high-dimensional distributions of visual data. In particular, high-fidelity images can be generated by unconditional generative models trained on object-level data (e.g., animal pictures in [4]), class-specific datasets (e.g., indoor scenes [34]), or even a single image with repeated textures [33]. For practical applications, automatic image synthesis that generates images and videos in response to specific requirements could be more useful. This explains the growing number of studies on adversarial networks conditioned on another input signal such as text [38, 21], semantic maps [22, 6, 35, 28], layouts [21, 39], and scene graphs [2, 24]. Compared to these sources, a freehand sketch has a unique strength in expressing the user's idea in an intuitive and flexible way. Specifically, to describe an object or scene, sketches can better convey the user's intention than other sources since they reduce uncertainty by naturally providing more low-level details such as object location, pose, and texture.

Figure 1: The proposed approach allows users to controllably generate realistic scene-level images with many objects from freehand sketches. This is in stark contrast to unconditional GANs, which generate from noise [4], and to conditional GANs that use harder constraints such as semantic maps [2, 29] or edge maps [22]; we instead use the scene sketch as context (a weak constraint). Columns compare BigGAN [4], StackGAN [38], ContextualGAN [27], Pix2pix [22], Sg2im [24], Layout2im [39], ours, and Ashual et al. [2].

In this paper, we apply generative adversarial networks to a new problem: controllably generating realistic images with many objects and relationships from a freehand scene-level sketch, as shown in Figure 1. This problem is extremely challenging for several reasons. First, freehand sketches are characterized by various levels of abstractness: a thousand users will draw the same common object in a thousand different ways, which makes it difficult for existing techniques to model the mapping from a freehand scene sketch to realistic natural images that exactly meet the user's intention. SketchyGAN [9] has shown that it is hard for even a state-of-the-art model to generate high-quality realistic images from sketches depicting single objects. More importantly, freehand scene sketches are often incomplete and contain both foreground and background. For example, users often prefer to sketch the foreground objects they care most about with specific, detailed appearances and expect the result to respect these details exactly, while roughly drawing the background or leaving blank the details they consider unimportant. The algorithm must therefore cope with these differing user requirements.

To make this challenging problem tractable, we decompose complex scene generation into two sequential stages, foreground and background generation, based on the characteristics of scene-level sketching. The first stage focuses on foreground generation, where the generated image content is supposed to exactly meet the user's specific requirements. The second stage is responsible for background generation, where the generated image content may be only loosely aligned with the sketches. Since the appearance of each foreground object has been specified by the user, it is possible to generate realistic and reasonable image content from the individual foreground objects separately. Moreover, the already generated foreground provides additional constraints for background generation, making it easier; i.e., progressive scene generation reduces the complexity of the problem.

To address the data variance caused by the abstract nature of sketches, we propose a new neural network architecture called edgeGAN. It learns a joint embedding of images and various-style edge maps depicting those images into a shared latent space, where vectors represent high-level attribute information (i.e., object pose and appearance) from cross-domain data. With the attribute vectors in the shared latent space acting as a bridge, we transfer the problem of image generation from freehand sketches to image generation from edge maps, thereby bypassing the need to collect freehand sketches as training data and sidestepping the challenge of modeling one-to-many correspondences between an image and the infinitely many freehand sketches that can depict it.

To evaluate our solution, we build a large-scale composite dataset called SketchyCOCO based on MS COCO Stuff [5]. The dataset includes 14K+ pairwise examples of scene-level images and sketches, 20K+ triplet examples of foreground sketches, images, and edge maps covering 14 classes, 27K+ pairwise examples of background sketches and images covering 3 classes, and segmentation ground truth for 14K+ scene sketches.

We compare the proposed edgeGAN to existing sketch-to-image approaches. Both qualitative and quantitative results show that edgeGAN achieves significantly superior performance. For the evaluation of scene-level image generation, we compare our approach with two leading scene-level image generation systems, GauGAN [29] and Ashual et al. [2], which respectively use semantic maps and layouts/scene graphs as input. Qualitative and quantitative results show that our approach produces higher-quality results than Ashual et al. [2] with either layouts or scene graphs as input. Even compared to GauGAN [29], which takes semantic maps, a harder constraint, as input, our approach achieves comparable performance.

We summarize our contributions as follows:

  • We propose the first deep neural network based framework for image generation from scene-level freehand sketches.

  • We contribute a novel generative model called edgeGAN for object-level image generation from freehand sketches. This model can be trained in an end-to-end manner and does not need sketch-image pairwise ground truth for training.

  • We construct a large-scale composite dataset called SketchyCOCO based on MS COCO Stuff [5]. This dataset will greatly facilitate related research.

2 Related Work

Sketch-Based Image Synthesis.

Early sketch-based image synthesis approaches are based on image retrieval. Sketch2Photo [7] and PhotoSketcher [15] synthesize realistic images by compositing objects and backgrounds retrieved from a given sketch. PoseShop [8] composites images of people by letting users add a 2D skeleton to the query so that retrieval becomes more precise. Recently, SketchyGAN [9] and ContextualGAN [27] have demonstrated the value of GAN variants for image generation from freehand sketches. Unlike SketchyGAN [9] and ContextualGAN [27], which mainly address image generation from object-level sketches depicting single objects, our approach focuses on generating images from scene-level sketches.

Conditional Image Generation. Several recent studies have demonstrated the potential of GAN variants for scene-level complex image generation from text [38, 21], scene graphs [24], semantic layout maps [21, 39], and edge maps [22, 6]. Most of these methods use a multi-stage coarse-to-fine strategy to infer the image appearance of all semantic layouts in the input or intermediate results at the same time. We instead use a divide-and-conquer strategy and sequentially generate the foreground and background appearance of the image, exploiting the unique characteristic of freehand scene sketches that objects can be segmented easily.

On object-level image generation, our edgeGAN stands in stark contrast to unconditional and conditional GANs: we use a sketch as context (a weak constraint) instead of generating from noise like DCGAN [30], Wasserstein GAN [1], WGAN-GP [17], and their variants, or from hard conditions such as edge maps [10, 11, 25, 22] or semantic maps [22, 6, 35, 28], while providing more precise control than methods using text [38, 24, 21], layouts [21, 39], or scene graphs [2, 24] as context.

Figure 2: Workflow of the proposed framework.

3 Methods

Our approach consists of three sequential modules: sketch segmentation, foreground generation, and background generation, as illustrated in Fig. 2. Given a scene sketch, all object instances are first located and recognized by the sketch segmentation module. Image content is then generated from each foreground object instance (i.e., each instance belonging to a foreground category) individually, in random order, by the foreground generation module. Finally, taking the background sketches and the output of the foreground generation step as input, the background generation module generates the background in a single pass to produce the output image. We next describe the details of each module.
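The following minimal sketch illustrates this foreground-then-background workflow; segment_sketch, edge_gan_generate, and pix2pix_background are hypothetical stand-ins for the three modules described above, not code released with the paper.

```python
# Minimal sketch of the three-module workflow; the three callables are
# hypothetical placeholders for the segmentation, foreground, and background modules.
from dataclasses import dataclass
from typing import List
import numpy as np

FOREGROUND_CLASSES = {"giraffe", "zebra", "elephant", "cow"}  # illustrative subset

@dataclass
class Instance:
    category: str        # c: instance category
    mask: np.ndarray     # M: binary mask of the drawing
    box: tuple           # B: (x_min, y_min, x_max, y_max)

def generate_scene(scene_sketch: np.ndarray,
                   segment_sketch, edge_gan_generate, pix2pix_background) -> np.ndarray:
    """Foreground-then-background generation from a scene sketch."""
    instances: List[Instance] = segment_sketch(scene_sketch)

    # Stage 1: generate image content for each foreground instance separately.
    canvas = np.zeros((*scene_sketch.shape[:2], 3), dtype=np.uint8)
    for inst in instances:
        if inst.category in FOREGROUND_CLASSES:
            x0, y0, x1, y1 = inst.box
            patch = edge_gan_generate(scene_sketch[y0:y1, x0:x1], inst.category)
            canvas[y0:y1, x0:x1] = patch

    # Stage 2: background generation conditioned on the generated foreground
    # plus the remaining (background) sketch strokes.
    background_sketch = scene_sketch.copy()
    for inst in instances:
        if inst.category in FOREGROUND_CLASSES:
            background_sketch[inst.mask > 0] = 255  # blank out foreground strokes
    return pix2pix_background(canvas, background_sketch)
```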

Figure 3: Structure of the proposed edgeGAN. It contains two generators (G_E for edge maps and G_I for images), three discriminators (D_E, D_I, and a joint discriminator D_J), an edge encoder E, and an image classifier C. EdgeGAN learns a joint embedding of an image and various-style edge maps depicting this image into a shared latent space where vectors can encode high-level attribute information from cross-modality data.

3.1 Sketch Segmentation

Sketch segmentation is an instance segmentation task. For a given scene sketch, each sketch instance depicting an object instance can be denoted as a tuple (c, M, B), where c denotes the instance's category, M is a binary mask indicating the drawing of the instance, and B is the bounding box of the mask in the format (x_min, y_min, x_max, y_max), with (x_min, y_min) the top-left corner and (x_max, y_max) the bottom-right corner. The task of instance segmentation is to accurately recover the tuples of all instances in a scene sketch. We use Mask R-CNN [19] for this task. In Fig. 2, we illustrate the result on a representative scene sketch.
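For concreteness, segmentation inference could look roughly as follows, assuming a torchvision Mask R-CNN fine-tuned on SketchyCOCO scene sketches; the weight file name and class count are placeholders, not artifacts released with the paper.

```python
# Minimal inference sketch for the segmentation step, assuming a Mask R-CNN
# fine-tuned on scene sketches (weights file and class count are placeholders).
import torch
import torchvision

NUM_CLASSES = 18  # 17 SketchyCOCO categories + background (assumption)

model = torchvision.models.detection.maskrcnn_resnet50_fpn(num_classes=NUM_CLASSES)
model.load_state_dict(torch.load("sketch_maskrcnn.pth", map_location="cpu"))
model.eval()

def segment_sketch(sketch_tensor, score_thresh=0.7):
    """sketch_tensor: float tensor of shape (3, H, W) in [0, 1]."""
    with torch.no_grad():
        out = model([sketch_tensor])[0]
    instances = []
    for box, label, score, mask in zip(out["boxes"], out["labels"],
                                       out["scores"], out["masks"]):
        if score < score_thresh:
            continue
        x0, y0, x1, y1 = box.round().int().tolist()
        instances.append({
            "category": int(label),            # c
            "mask": (mask[0] > 0.5).numpy(),   # M: binary mask
            "box": (x0, y0, x1, y1),           # B: (x_min, y_min, x_max, y_max)
        })
    return instances
```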

3.2 Foreground Generation

Overall Architecture of EdgeGAN. EdgeGAN is proposed to generate high-quality image content for each individual foreground sketch segmented by the sketch segmentation module. Apart from realism, edgeGAN is also supposed to provide faithful generation, where the generated image exactly meets the user's specific requirements. Previous methods could not satisfy both requirements simultaneously because of the unique challenge posed by freehand sketches: there can be infinitely many freehand sketches depicting a single image. Directly modeling the mapping between a single image and its corresponding sketches, as SketchyGAN [9] does, is difficult because of the huge size of the mapping space. We therefore address the challenge in another, feasible way: learning a common representation for an object expressed by cross-domain data.

To this end, we design the adversarial learning architecture shown in Fig. 3(a) for edgeGAN. Rather than directly learning to generate images from sketches, edgeGAN transfers the problem of sketch-to-image generation to generating the image from an attribute vector which encodes the expression intent of the freehand sketch. At the training stage, edgeGAN learns a common attribute vector for an object image and its edge maps by feeding the adversarial networks with images and their various-drawing-style edge maps. At the inference stage (see Fig. 3(b)), edgeGAN captures the user's expression intent with an attribute vector and then generates the desired image from it.

Structure of EdgeGAN. As shown in Fig. 3(a), edgeGAN has two channels: one consisting of generator G_E and discriminator D_E for edge-map generation, the other consisting of generator G_I and discriminator D_I for image generation. Both G_E and G_I take the same noise vector, together with a one-hot vector indicating a specific category, as input. Discriminators D_E and D_I attempt to distinguish the generated edge maps and images from the real distributions. Another discriminator, D_J, which takes the outputs of both G_E and G_I as input (the image and edge map are concatenated along the channel dimension), tells whether the generated fake image matches the fake edge map, encouraging the generated image and edge map to depict the same object. The encoder E encourages the encoded attribute information of the edge maps to be close to the noise vector fed to G_E and G_I via a latent reconstruction loss. A classifier C infers the category label of the output of G_I and encourages the generated fake image to be recognized as the desired category via a focal loss [26]. The detailed structure of each module of edgeGAN is illustrated in Fig. 3(c).
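The wiring described above can be sketched as follows. The symbol names follow the description, but the layer configurations, the WGAN-style adversarial term, and the L1 latent loss are simplifications and assumptions, not the exact edgeGAN objective (see the supplementary document for the real formulation).

```python
# Schematic wiring of the generator side of one edgeGAN training step.
# Layer sizes, the adversarial term, and the latent/classification losses are
# simplified placeholders rather than the paper's exact architecture and objective.
import torch
import torch.nn as nn
import torch.nn.functional as F

Z_DIM, N_CLASSES = 128, 14

class G(nn.Module):                      # reused for both G_E (edges) and G_I (images)
    def __init__(self, out_ch):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(Z_DIM + N_CLASSES, 4 * 4 * 256), nn.ReLU(),
            nn.Unflatten(1, (256, 4, 4)),
            nn.ConvTranspose2d(256, 128, 4, 2, 1), nn.ReLU(),
            nn.ConvTranspose2d(128, 64, 4, 2, 1), nn.ReLU(),
            nn.ConvTranspose2d(64, out_ch, 4, 2, 1), nn.Tanh())
    def forward(self, z, y):             # y: one-hot category vector
        return self.net(torch.cat([z, y], dim=1))

G_E, G_I = G(out_ch=1), G(out_ch=3)      # edge-map and image generators
E = nn.Sequential(nn.Conv2d(1, 64, 4, 2, 1), nn.ReLU(),
                  nn.Conv2d(64, 128, 4, 2, 1), nn.ReLU(),
                  nn.Flatten(), nn.LazyLinear(Z_DIM))   # edge encoder

def generator_side_losses(z, y, D_E, D_I, D_J, C):
    """D_E, D_I, D_J, C are the discriminators/classifier (placeholders here)."""
    fake_edge, fake_img = G_E(z, y), G_I(z, y)
    joint = torch.cat([fake_img, fake_edge], dim=1)     # channel concat for D_J
    adv = -(D_E(fake_edge).mean() + D_I(fake_img).mean() + D_J(joint).mean())
    recon = F.l1_loss(E(fake_edge), z)                  # pull encoded edges toward z (assumed L1)
    cls = F.cross_entropy(C(fake_img), y.argmax(1))     # the paper uses a focal loss here
    return adv + recon + cls
```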

We implement the encoder E with the same encoder module as BicycleGAN [40], since the two play similar roles functionally: our encoder encodes the “content” (e.g., pose and texture information) rather than the drawing style into a latent vector, while the encoder in BicycleGAN encodes properties into latent vectors. For the classifier C, we use an architecture similar to the discriminator of SketchyGAN, dropping the adversarial loss and using only the focal loss [26] as the classification loss to predict the category label. The architectures of all generators and discriminators are based on WGAN-GP [17]. The objective function and further training details can be found in the supplementary document.
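For reference, a common multi-class form of the focal loss [26] that such a classifier could use is sketched below; the gamma and alpha values are the usual defaults, not necessarily the settings used for edgeGAN.

```python
# A standard multi-class focal loss formulation (Lin et al. [26]); gamma/alpha
# are the common defaults, not necessarily the paper's choices.
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0, alpha=1.0):
    """logits: (N, num_classes), targets: (N,) class indices."""
    log_probs = F.log_softmax(logits, dim=1)
    ce = F.nll_loss(log_probs, targets, reduction="none")  # -log p_t
    p_t = torch.exp(-ce)                                    # probability of the true class
    return (alpha * (1.0 - p_t) ** gamma * ce).mean()
```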

3.3 Background Generation

Once all of the foreground instances have been synthesized, we train pix2pix [22] to generate the background. The major challenge of this task is that most backgrounds of the scene sketches contain both background instances and blank regions (as shown in Fig. 2), which means some areas belonging to the background are uncertain because of the lack of sketch constraints. By leveraging pix2pix and using the generated foreground instances as constraints, the network can generate a reasonable background that matches the synthesized foreground. Taking Fig. 2 as an example, the region below the zebras in the input contains no background sketches as constraints, yet in the output image this region is filled with grass and ground with a smooth transition. We show more results in Fig. 4.
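A rough sketch of how the conditioning input for this stage could be assembled is shown below; pix2pix_G is a placeholder for the trained background generator, and the pasting scheme is an assumption consistent with the description above.

```python
# Assembling the background-generation input: generated foreground instances
# pasted onto the background sketch (pix2pix_G is a placeholder generator).
import numpy as np

def compose_background_input(background_sketch, foreground_patches):
    """
    background_sketch: (H, W, 3) uint8, white canvas with background strokes.
    foreground_patches: list of ((x0, y0, x1, y1), rgb_patch) with generated content.
    """
    cond = background_sketch.copy()
    for (x0, y0, x1, y1), patch in foreground_patches:
        cond[y0:y1, x0:x1] = patch      # foreground pixels act as hard constraints
    return cond

def generate_background(pix2pix_G, background_sketch, foreground_patches):
    cond = compose_background_input(background_sketch, foreground_patches)
    return pix2pix_G(cond)              # fills the unconstrained regions plausibly
```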

Figure 4: From top to bottom: input sketches, and the images generated by our approach.

4 Dataset Construction

Figure 5: Illustration of five-tuple ground truth data of SketchyCOCO, i.e., (a) {foreground image, foreground sketch, foreground edge maps}, (b) {background image, background sketch}, (c) {scene image, foreground image & background sketch}, (d) {scene image, scene sketch}, and (e) sketch segmentation.

We initialize the construction by collecting object-level freehand sketches covering 3 background classes and 14 foreground classes from the Sketchy dataset [32], the TU-Berlin dataset [12], and the QuickDraw dataset [18]. For each class, we split these sketches into two parts, one for constructing the training set and the remainder for the testing set. We collect 14081 natural images from COCO Stuff [5], each containing at least one of 17 categories of objects, and split them into a training set and a testing set. Using the segmentation masks of these natural images, we insert background sketches, i.e., cloud, grass, and tree sketches, at random positions within the corresponding background regions of these images. In this step, natural images in the training and testing sets only use sketches from the corresponding training and testing sketch sets, respectively.

This step produces the pairs of background sketch-image examples (shown in Fig. 6). After that, for each foreground object in a natural image, we retrieve the most similar sketch with the same class label as that object. This step employs the sketch-image embedding method proposed with the Sketchy database [32]. In addition, to obtain more data for training the object generation model, we collect foreground objects from the full COCO Stuff dataset. After this step and manual selection, we obtain triplet examples of foreground sketches, images, and edge maps, as shown in Fig. 6. Since all background and foreground objects of natural images from COCO Stuff have category and layout information, we also obtain the layout (e.g., object bounding boxes) and segmentation information for the synthesized scene sketches. After constructing both background and foreground sketches, we naturally obtain the five-tuple ground truth data (Fig. 5). Note that in the above steps, natural images in the training and testing sets are only paired with sketches from the training and testing sketch sets, respectively.
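The retrieval step can be approximated by a nearest-neighbor search in a joint sketch-photo embedding space; embed_photo and embed_sketch below are stand-ins for the cross-domain encoders of the Sketchy database [32], not the actual implementation.

```python
# Rough sketch of the retrieval used in dataset construction: pick, for each
# foreground object, the most similar same-class sketch in a joint embedding space.
import numpy as np

def retrieve_sketch(object_crop, class_label, sketch_bank, embed_photo, embed_sketch):
    """
    sketch_bank: dict mapping class_label -> list of candidate freehand sketches.
    Returns the candidate whose embedding is closest to the photo crop's embedding.
    """
    query = embed_photo(object_crop)                         # (d,) feature vector
    candidates = sketch_bank[class_label]
    feats = np.stack([embed_sketch(s) for s in candidates])  # (n, d)
    # cosine similarity between the query and every candidate sketch
    sims = feats @ query / (np.linalg.norm(feats, axis=1) * np.linalg.norm(query) + 1e-8)
    return candidates[int(np.argmax(sims))]
```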

In summary, as shown in Fig. 5, we collect triplets of {foreground sketch, foreground image, foreground edge map} examples covering 14 classes, pairs of {background sketch, background image} examples covering 3 classes, pairs of {foreground image & background sketch, scene image} examples, pairs of {scene sketch, scene image} examples, and the segmentation ground truth for scene sketches. All object-level entities, i.e., images, edge maps, and sketches, are scaled to two resolutions, and all scene-level entities are scaled accordingly.

Figure 6: Representative sketch-image pairwise examples from the 14 foreground and 3 background categories in SketchyCOCO. The data size of each category, split into training/testing, is shown at the top.

5 Experiments

5.1 Object-level Image Generation

Baselines. We compare edgeGAN with the general image-to-image model pix2pix [22] and two existing sketch-to-image models, ContextualGAN [27] and SketchyGAN [9], on the collected 20198 triplet {foreground sketch, foreground image, foreground edge map} examples. Unlike SketchyGAN and pix2pix, which may use both edge maps and freehand sketches as training data as in their papers, edgeGAN and ContextualGAN take only edge maps as input and do not use any freehand sketches for training. For a fair and thorough evaluation, we set up several different training modes for SketchyGAN, pix2pix, and ContextualGAN. We next introduce these modes for each model.

  • edgeGAN: we train a model by using foreground images and only their edge maps from all foreground object categories.

  • ContextualGAN  [27]: we use foreground images and their edge maps to separately train a model for each foreground object category, since the original method cannot use a single model to learn the sketch-to-image correspondence for multiple categories.

  • SketchyGAN [9]: we train the original SketchyGAN in two modes. The first mode, denoted SketchyGAN-E, uses foreground images and only their edge maps for training. Since SketchyGAN may use both edge maps and freehand sketches as training data, as in its paper, we also train SketchyGAN in a second mode using foreground images and their edge maps plus sketches. In this mode, called SketchyGAN-E&S, we follow the same training strategy as in [9], feeding edge maps to the model first and then fine-tuning it with sketches.

  • pix2pix  [22]: we train the original pix2pix architecture in four modes. The first two modes are denoted as pix2pix-E-SEP and pix2pix-S-SEP, in which we separately train 14 models by using edge maps and sketches from the 14 foreground categories, respectively. The other two modes are denoted as pix2pix-E-MIX and pix2pix-S-MIX, in which we train a single model respectively using only edge maps and sketches from all the 14 categories.

Qualitative results. We show representative results of the three comparison methods in Fig. 7. In general, edgeGAN provides much more realistic results than ContextualGAN. In terms of faithfulness (i.e., whether the input sketches depict the generated images), edgeGAN is also superior to ContextualGAN. This can be explained by the fact that edgeGAN supervises image generation with the learnt attribute vector, which captures reliable high-level attribute information from cross-domain data, whereas ContextualGAN relies on a low-level sketch-edge similarity metric that is sensitive to the abstraction level of the input sketch.

In contrast to edgeGAN, which produces realistic images, pix2pix and SketchyGAN seemingly just colorize the input sketches and do not change their original shapes when trained with only edge maps (e.g., see Fig. 7 b1, c1, and c2). This may be because the outputs of both SketchyGAN and pix2pix are strongly constrained by the input (i.e., the one-to-one correspondence provided by the training data). When the input is a freehand sketch from another domain, these two models struggle to produce realistic results since they only see edge maps during training. In contrast, the output of edgeGAN is only weakly constrained by the input sketch, since its generator takes as input the attribute vector learnt from cross-domain data rather than the input sketch itself. EdgeGAN therefore achieves better results than pix2pix and SketchyGAN because it is relatively insensitive to cross-domain input data.

When the training data is augmented or replaced with freehand sketches, both SketchyGAN and pix2pix can produce realistic local patches for some categories but fail to capture sufficient global information, as shown by the distorted output shapes of the examples in Fig. 7 b2, c3, and c4.

Figure 7: From left to right: input sketches, results from edgeGAN, ContextualGAN (a), the two training modes of SketchyGAN, i.e., SketchyGAN-E (b1) and SketchyGAN-E&S (b2), and the four training modes of pix2pix, i.e., pix2pix-E-SEP (c1), pix2pix-E-MIX (c2), pix2pix-S-MIX (c3), and pix2pix-S-SEP (c4).

Quantitative results. We carry out both realism and faithfulness evaluations for the quantitative comparison. We use FID [20] and Accuracy [2] as realism metrics. FID computes the distance between the distributions of generated and real images; a lower FID value means better image quality and diversity. Accuracy is the average classification accuracy of a ResNet-101 classifier fed with generated images and serves as another measure of realism; a higher value means better realism. The Inception Score [31] is not suitable for our task, because several recent studies, including [3], find that it is basically only reliable for models trained on ImageNet. We measure the faithfulness of the generated images by computing how close the edge maps of the generated images are to the input sketches. Specifically, we use the shape similarity based on the Gabor feature distance [14] between the input sketch and the edge map produced by a Canny edge detector from the generated image as the faithfulness metric.
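As an illustration, a faithfulness score in this spirit could be computed as below; the Gabor filter-bank parameters and feature summary are illustrative choices, not the exact settings of [14].

```python
# An approximation of the faithfulness metric: compare Gabor filter responses of
# the input sketch and of the Canny edge map of the generated image.
import cv2
import numpy as np

def gabor_features(gray):
    feats = []
    for theta in np.arange(0, np.pi, np.pi / 4):              # 4 orientations
        kernel = cv2.getGaborKernel((15, 15), 3.0, theta, 8.0, 0.5)
        resp = cv2.filter2D(gray.astype(np.float32), cv2.CV_32F, kernel)
        feats.extend([resp.mean(), resp.std()])               # simple response statistics
    return np.array(feats)

def shape_similarity(input_sketch_gray, generated_image_bgr):
    gray = cv2.cvtColor(generated_image_bgr, cv2.COLOR_BGR2GRAY)
    edges = cv2.Canny(gray, 100, 200)                         # edge map of the output image
    d = gabor_features(input_sketch_gray) - gabor_features(edges)
    return float(np.linalg.norm(d))                           # lower = more faithful
```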

The quantitative results are summarized in Table 1, where we can see that the proposed edgeGAN achieves the best results in terms of the realism metrics. In terms of the faithfulness metric, our method is better than most of the competitors but not as good as pix2pix-E-SEP, pix2pix-E-MIX, and SketchyGAN-E. This is because the results generated by these methods look more like colorizations of the input sketches, whose shapes are almost identical to the input sketch (see Fig. 7 b1, c1, c2), rather than realistic images. The quantitative results largely confirm our observations from the qualitative study.

Model (object-level)            FID     Accuracy  Shape Similarity
Ours                            87.59   0.8866    2.294e+04
ContextualGAN                   225.15  0.3770    2.660e+04
SketchyGAN-E&S                  137.87  0.1270    2.315e+04
SketchyGAN-E                    141.48  0.2773    1.996e+04
pix2pix-E-SEP                   143.08  0.6125    2.136e+04
pix2pix-S-SEP                   196.04  0.4579    2.527e+04
pix2pix-E-MIX                   128.84  0.4990    2.103e+04
pix2pix-S-MIX                   163.28  0.2231    2.569e+04

Model (scene-level)             FID     SSIM   FID(local)
Ours                            164.83  0.288  112.03
GauGAN (semantic sketch)        157.51  0.281  146.73
GauGAN (semantic map)           68.93   0.388  98.75
Ashual et al. [2] (layout)      262.02  0.240  199.24
Ashual et al. [2] (scene graph) 271.49  0.242  202.93

Table 1: Results of the quantitative experiments (object-level, top; scene-level, bottom).
Figure 8: Scene-level comparison. Please see the text in Section 5.2 for the details.

5.2 Scene-level Image Generation

Baselines. There is no existing approach specifically designed for image generation from scene-level freehand sketches. SketchyGAN was originally proposed for object-level image generation from freehand sketches; in principle, it can also be applied to scene-level freehand sketches. pix2pix [22] is a popular general image-to-image model intended for all image translation tasks. We therefore use SketchyGAN [9] and pix2pix [22] as the baseline methods.

Since we have 14081 pairs of {scene sketch, scene image} examples, it is intuitive to directly train the pix2pix and SketchyGAN models to learn the mapping from sketches to images. We therefore conducted the experiments on the entities at the lower resolution. We found that the training of either pix2pix or SketchyGAN was prone to mode collapse, often after 60 epochs (80 epochs for SketchyGAN), even when all 14081 pairs of {scene sketch, scene image} examples from the SketchyCOCO dataset were used. The reason may be that the data variety is too large to be modeled; even 14K pairs are insufficient for successful training. However, with the 14081 pairs of {foreground image & background sketch, scene image} examples, we can still use the same pix2pix model for background generation without any mode collapse, because the generated foreground provides sufficient prior information and constraints for background inference.

Comparison with other systems. We also compare our approach with advanced approaches that generate images using constraints from other modalities.

  • GauGAN [29]: The original GauGAN model takes semantic maps as input. We found that GauGAN can also be used to generate images from semantic sketches, in which both the edges of the sketches and the background regions carry category labels, as shown in column 7 of Fig. 8. In our experiments, we test the public model pre-trained on the COCO Stuff dataset. In addition, we trained a model that takes semantic sketches as input on our collected SketchyCOCO dataset. The results are shown in columns 6 and 8 of Fig. 8.

  • Ashual et al. [2]: the approach proposed by Ashual et al. can use either layouts or scene graphs as input; we therefore compared both modes using their pre-trained model. For fairness, we only test the categories included in the SketchyCOCO dataset. The results are shown in columns 2 and 4 of Fig. 8.

Qualitative results. From Fig. 8, we can see that the images generated from freehand sketches are much more realistic than those generated from scene graphs or layouts by Ashual et al. [2], especially in the foreground object regions. This is because a freehand sketch provides a harder constraint than scene graphs or layouts (it carries more information, including pose and texture). Compared to GauGAN with semantic sketches as input, our approach generally produces more realistic images, especially in the foreground object regions. However, compared to the GauGAN model trained on the COCO Stuff dataset, the results generated by our approach show less color diversity (evidence can be found by comparing the results in the second row of Fig. 8).

In terms of faithfulness, our approach significantly outperforms all the competitors. Even compared to the GauGAN model trained using semantic maps (a harder constraint than sketch data), our approach achieves much better results; evidence can be found in the generated foreground objects (the zebra, cow, and elephants generated by GauGAN have blurred or unreasonable contexts).

In general, our approach produces much better results than both GauGAN and Ashual et al.'s method in terms of the overall visual quality and the faithfulness of the foreground objects, while the overall visual quality of the whole image is comparable to the state-of-the-art system.

Quantitative results. We adopt three metrics to evaluate the faithfulness and realism of the generated scene-level images. Apart from FID, the structural similarity metric (SSIM) [36] is used to quantify how close the generated image is to the ground-truth image; a higher SSIM value means closer. The last metric, FID(local), computes the FID of the foreground object regions in the generated images. From Table 1 we can see that most of the comparison results confirm our observations and conclusions from the qualitative study, except for the comparison with the GauGAN model.
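A minimal sketch of these scene-level metrics is given below, assuming uint8 RGB images and a standard external FID implementation (compute_fid is a placeholder, e.g., for a library such as pytorch-fid); the cropping scheme for FID(local) is an assumption consistent with the description.

```python
# Sketch of the scene-level metrics: SSIM against the ground-truth image and an
# FID(local) computed on cropped foreground regions (compute_fid is a placeholder).
from skimage.metrics import structural_similarity

def scene_ssim(generated_rgb, ground_truth_rgb):
    """Both inputs: (H, W, 3) uint8 arrays of the same size."""
    return structural_similarity(generated_rgb, ground_truth_rgb,
                                 channel_axis=2, data_range=255)

def crop_foregrounds(images, boxes_per_image):
    crops = []
    for img, boxes in zip(images, boxes_per_image):
        for (x0, y0, x1, y1) in boxes:          # foreground bounding boxes
            crops.append(img[y0:y1, x0:x1])
    return crops

def fid_local(gen_images, real_images, boxes_per_image, compute_fid):
    return compute_fid(crop_foregrounds(gen_images, boxes_per_image),
                       crop_foregrounds(real_images, boxes_per_image))
```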

There are several reasons why the GauGAN model trained on semantic maps is superior to ours in terms of all the metrics. Apart from the inherent advantage of semantic maps as a harder constraint, the GauGAN model trained on semantic maps covers all the categories in the COCO Stuff dataset, while our model only sees the 17 categories in the SketchyCOCO dataset. Hence, the categories and number of instances in an image generated by GauGAN match the ground truth, whereas our results contain only a subset of them. Since our approach takes freehand sketches as input, which may be much more accessible than the semantic maps used by GauGAN, we believe our approach can still be a competitive image generation tool compared to GauGAN.

5.3 Ablation Study

We conducted comprehensive experiments to analyze each component of our approach, including: a) whether the encoder has learnt high-level cross-domain attribute information, b) how the joint discriminator works, c) which GAN model suits our approach best, and d) whether multi-scale discriminators can improve the results. Due to limited space, we only present the most important study, i.e., study a), in this section and defer the other studies to the supplementary document.

Figure 9: Results from edges or sketches in different styles. Columns 1 to 4: different freehand sketches. Columns 5 to 9: edges from Canny, FDoG [23], Photocopy (PC), Photo-sketch [13], and XDoG [37].

We test drawings in different styles, including sketches and edge maps generated by various filters, as input. The results are shown in Fig. 9. Our model works for a large variety of line drawing styles, even though some of them are not included in the training dataset. We believe that the attribute vector produced by the encoder extracts the high-level attribute information of the line drawings regardless of their style.

6 Discussion and Limitation

We have studied the controllability and robustness of background generation. As shown in Fig. 4 (a) to (d), we add one more background category at a time on a blank background. The output image changes reasonably according to the newly added background sketches, which means the sketches indeed control generation in the corresponding regions of the image. Moreover, the output images remain reasonable although the input background has large unconstrained blank areas. We have also studied our approach's capability of producing diverse results. As shown in Fig. 4 (d) to (g), we changed the location and size of the foreground objects in the scene sketch while keeping the background unchanged; this leads to significant changes in the generated background. Because the foreground serves as a constraint during background training, the foreground and background blend well; the approach even generates a shadow under the giraffe. SketchyCOCO has relatively little training data in some categories, such as traffic light (Fig. 6), and we plan to further augment the dataset in the future.

7 Conclusion

For the first time, this paper presents a neural-network-based framework targeting the problem of generating scene-level images from freehand sketches. We have built a large-scale composite dataset called SketchyCOCO based on MS COCO Stuff for the evaluation of our solution. Comprehensive experiments demonstrate that the proposed approach can generate realistic and faithful images from a wide range of freehand sketches.

References

  • [1] M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein gan. arXiv:1701.07875, 2017.
  • [2] Oron Ashual and Lior Wolf. Specifying object attributes and relations in interactive scene generation. In ICCV, pages 4561–4569, 2019.
  • [3] Ali Borji. Pros and cons of GAN evaluation measures. CVIU, 179:41–65, 2019.
  • [4] Andrew Brock, Jeff Donahue, and Karen Simonyan. Large scale GAN training for high fidelity natural image synthesis. In ICLR, 2019.
  • [5] Holger Caesar, Jasper Uijlings, and Vittorio Ferrari. Coco-stuff: Thing and stuff classes in context. CoRR, abs/1612.03716, 5:8, 2016.
  • [6] Qifeng Chen and Vladlen Koltun. Photographic image synthesis with cascaded refinement networks. In ICCV, pages 1511–1520, 2017.
  • [7] Tao Chen, Ming-Ming Cheng, Ping Tan, Ariel Shamir, and Shi-Min Hu. Sketch2photo: Internet image montage. In TOG, volume 28, page 124, 2009.
  • [8] Tao Chen, Ping Tan, Li-Qian Ma, Ming-Ming Cheng, Ariel Shamir, and Shi-Min Hu. Poseshop: Human image database construction and personalized content synthesis. TVCG, 19(5):824–837, 2013.
  • [9] Wengling Chen and James Hays. Sketchygan: Towards diverse and realistic sketch to image synthesis. In CVPR, pages 9416–9425, 2018.
  • [10] Zezhou Cheng, Qingxiong Yang, and Bin Sheng. Deep Colorization. In ICCV, pages 415–423, 2015.
  • [11] Aditya Deshpande, Jason Rock, and David Forsyth. Learning Large-Scale Automatic Image Colorization. In ICCV, pages 567–575, 2015.
  • [12] Mathias Eitz, James Hays, and Marc Alexa. How do humans sketch objects? ACM TOG, 31(4):44–1, 2012.
  • [13] Mathias Eitz, Kristian Hildebrand, Tamy Boubekeur, and Marc Alexa. Photosketch: A sketch based image query and compositing system. In SIGGRAPH 2009: talks, page 60. ACM, 2009.
  • [14] Mathias Eitz, Ronald Richter, Tamy Boubekeur, Kristian Hildebrand, and Marc Alexa. Sketch-based shape retrieval. TOG, 31(4):31, 2012.
  • [15] Mathias Eitz, Ronald Richter, Kristian Hildebrand, Tamy Boubekeur, and Marc Alexa. Photosketcher: interactive sketch-based image synthesis. IEEE Computer Graphics and Applications, 31(6):56–66, 2011.
  • [16] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In NIPS, 2014.
  • [17] Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron C Courville. Improved training of wasserstein gans. In NIPS, pages 5767–5777, 2017.
  • [18] David Ha and Douglas Eck. A neural representation of sketch drawings. arXiv:1704.03477, 2017.
  • [19] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask r-cnn. In ICCV, 2017.
  • [20] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. In NIPS, pages 6626–6637, 2017.
  • [21] Seunghoon Hong, Dingdong Yang, Jongwook Choi, and Honglak Lee. Inferring semantic layout for hierarchical text-to-image synthesis. In CVPR, 2018.
  • [22] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros. Image-to-image translation with conditional adversarial networks. In CVPR, 2017.
  • [23] Chenfanfu Jiang, Yixin Zhu, Siyuan Qi, Siyuan Huang, Jenny Lin, Xingwen Guo, Lap-Fai Yu, Demetri Terzopoulos, and Song-Chun Zhu. Configurable, photorealistic image rendering and ground truth synthesis by sampling stochastic grammars representing indoor scenes. arXiv:1704.00112, 2017.
  • [24] Justin Johnson, Agrim Gupta, and Li Fei-Fei. Image generation from scene graphs. arXiv:1804.01622, 2018.
  • [25] Gustav Larsson, Michael Maire, and Gregory Shakhnarovich. Learning representations for automatic colorization. arXiv:1603.06668v3, 2016.
  • [26] Tsung-Yi Lin, Priyal Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. TPAMI, 2018.
  • [27] Yongyi Lu, Shangzhe Wu, Yu-Wing Tai, and Chi-Keung Tang. Image generation from sketch constraint using contextual gan. In ECCV, pages 205–220, 2018.
  • [28] Taesung Park, Ming-Yu Liu, Ting-Chun Wang, and Jun-Yan Zhu. Semantic image synthesis with spatially-adaptive normalization. In CVPR, 2019.
  • [29] Taesung Park, Ming-Yu Liu, Ting-Chun Wang, and Jun-Yan Zhu. Semantic image synthesis with spatially-adaptive normalization. In CVPR, 2019.
  • [30] Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv:1511.06434, 2015.
  • [31] Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training gans. In NIPS, pages 2234–2242, 2016.
  • [32] Patsorn Sangkloy, Nathan Burnell, Cusuh Ham, and James Hays. The sketchy database: learning to retrieve badly drawn bunnies. ACM TOG, 35(4):119, 2016.
  • [33] Tamar Rott Shaham, Tali Dekel, and Tomer Michaeli. Singan: Learning a generative model from a single natural image. In ICCV, pages 4570–4580, 2019.
  • [34] Nathan Silberman, Derek Hoiem, Pushmeet Kohli, and Rob Fergus. Indoor segmentation and support inference from rgbd images. In ECCV, 2012.
  • [35] Mehmet Ozgur Turkoglu, William Thong, Luuk Spreeuwers, and Berkay Kicanaoglu. A layer-based sequential framework for scene generation with gans. arXiv:1902.00671, 2019.
  • [36] Zhou Wang, Alan C Bovik, Hamid R Sheikh, Eero P Simoncelli, et al. Image quality assessment: from error visibility to structural similarity. TIP, 13(4):600–612, 2004.
  • [37] Holger Winnemöller, Jan Eric Kyprianidis, and Sven C Olsen. Xdog: an extended difference-of-gaussians compendium including advanced image stylization. Computers & Graphics, 36(6):740–753, 2012.
  • [38] Han Zhang, Tao Xu, Hongsheng Li, Shaoting Zhang, Xiaolei Huang, Xiaogang Wang, and Dimitris Metaxas. Stackgan: Text to photo-realistic image synthesis with stacked generative adversarial networks. In ICCV, 2017.
  • [39] Bo Zhao, Lili Meng, Weidong Yin, and Leonid Sigal. Image generation from layout. In CVPR, 2019.
  • [40] Jun-Yan Zhu, Richard Zhang, Deepak Pathak, Trevor Darrell, Alexei A Efros, Oliver Wang, and Eli Shechtman. Toward multimodal image-to-image translation. In NIPS, pages 465–476, 2017.
