Sketch-Guided Scenery Image Outpainting

by   Yaxiong Wang, et al.
Xi'an Jiaotong University

The outpainting results produced by existing approaches are often too random to meet users' requirements. In this work, we take image outpainting one step forward by allowing users to obtain personally customized outpainting results using sketches as guidance. To this end, we propose an encoder-decoder based network to conduct sketch-guided outpainting, where two alignment modules are adopted to force the generated content to be realistic and consistent with the provided sketches. First, we apply a holistic alignment module to make the synthesized part similar to the real one from a global view. Second, we reversely produce the sketches from the synthesized part and encourage them to be consistent with the ground-truth ones using a sketch alignment module. In this way, the learned generator is forced to pay more attention to fine details and to be sensitive to the guiding sketches. To our knowledge, this work is the first attempt to explore the challenging yet meaningful task of conditional scenery image outpainting. We conduct extensive experiments on two collected benchmarks to qualitatively and quantitatively validate the effectiveness of our approach compared with other state-of-the-art generative models.



I Introduction

Image outpainting, also known as image extrapolation, aims at predicting unknown regions beyond the boundary according to the currently seen image pixels. Many disparate disciplines have a strong need for high-quality image extension. For example, in virtual reality it is often necessary to simulate different camera extrinsics from the current visual content, which requires a reasonable extension of the original image.

For an input image, traditional outpainting methods usually focus on designing search strategies to find matching regions in a candidate pool [21, 22, 23, 24, 35]. As a consequence, their performance depends heavily on the size of the pool and suffers from the limited search space. Inspired by the success of Generative Adversarial Networks (GANs) [36], researchers have recently proposed to synthesize additional content for the inputs [7, 8, 3]. However, both traditional search-based approaches and GAN-based methods focus only on the authenticity of the new content and its semantic consistency with the original input, and the randomly synthesized outpainting results often fall short of users’ expectations. Furthermore, since no random variables or control information are introduced, existing outpainting systems can only produce one outpainting result for an input image. These limitations prevent current outpainting methods from meeting practical needs. To address these weaknesses, one promising solution is to let users acquire personally customized outpainting results by simply providing guiding sketches based on their preferences, as shown in Fig. 1. To construct such an outpainting system, several challenges need to be tackled:

  • Reasonable pixel filling with spatial consistency. The sketch is a simple and crude clue, which only supplies the desired shape while the corresponding color needs to be adaptively filled by considering the spatial structure information.

  • Reasonable synthesis with sketch consistency. A controllable outpainting system is expected to synthesize an image that exactly matches the guiding sketches. Therefore, the learned generator should be sensitive to the users’ input and force the synthesized part to be consistent with the given sketches.

Fig. 1: Illustration of the sketch-guided scenery image outpainting. Our proposed method can synthesize the desired outpainting results according to the sketches manually drawn by users.

To address the above challenges and facilitate Sketch-Guided Scenery Image Outpainting (SGSIO), we contribute a robust system with three basic modules, i.e. a generator module, a holistic alignment module and a sketch alignment module. To be specific, we introduce two position channels to the generator module as auxiliary inputs, which help the generator perform reasonable pixel filling by building an explicit link between semantic regions and specific positions. In addition, to guarantee that the predicted content exactly matches the users’ expectations, we adopt a conditional skip connection mechanism that prevents the desired sketch from being distorted by emphasizing the desired shape in the decoding step. To further make the synthesized content realistic and consistent with the provided sketches, we augment the system with two alignment modules. In general, the holistic alignment module focuses on synthesizing authentic images from a global perspective, while the sketch alignment module attempts to reconstruct the fine details to further enhance the authenticity. First, the holistic alignment module is employed to generate authentic images by adversarial training, using a global discriminator and a local discriminator in collaboration. In this way, the synthesized part is encouraged to be similar to the real one in its global appearance. Second, we introduce an additional sketch alignment module that helps the network rebuild image details by forcing the generator to reconstruct the high-frequency information in images, so that the generator pays more attention to the detailed sketches and produces reasonable new content with sketch consistency. Several examples generated by the proposed outpainting system are shown in Fig. 1. For the input images, our system allows users to edit the predicted content with free-style sketches based on their preferences. Users may want clouds with a specific shape or wish to control the trend of the mountains; with our proposed system, these expectations can be easily met by feeding manually drawn sketches to guide the synthesis. It can be observed that our system smoothly produces a synthesized part that is consistent with both the given sketches and the surrounding contextual information.

This work makes the first attempt to conduct conditional scenery image outpainting. Our proposed method not only makes it possible for users to control the image extrapolation but also achieves state-of-the-art performance on the NS6K dataset [8]. To evaluate the robustness and test the performance under more complex scenarios, we further build a new dataset called NS8K by removing some similar images from NS6K and adding thousands of images with diverse appearances from world-famous scenic spots. Our method performs best on both datasets. The contributions of this paper can be summarized as follows:

  • We consider a new outpainting task that allows users to control scenery image outpainting with free-form sketches, which is an underexplored task. We hope this work can serve as a solid baseline and ease future research on conditional image outpainting.

  • We develop an outpainting system with one module for generation and two modules for appearance alignment. With the assistance of the proposed modules, our outpainting system pays more attention to fine details and successfully produces sketch-consistent outputs.

  • We contribute a natural image dataset, NS8K. This new dataset contains much more complex and diverse images than the original NS6K.

The remainder of this paper is organized as follows. In Section II, we review the related work. Section III elaborates on the details of our proposed sketch-guided image outpainting network. The experiments are presented in Section IV. Finally, the conclusion and future work are given in Section V.

II Related Work

Sketch-guided image outpainting has attracted little attention in recent years. In this section, we review the existing work in three sub-fields in turn: image-to-image translation, image inpainting and image outpainting.

II-A Image to Image Translation

Image-to-image (I2I) translation attempts to map images from one domain to another [10, 25]. Since the I2I problem was proposed by Isola et al. [10], it has attracted much attention due to its generality across many downstream tasks. Isola et al. make the first attempt and design a general solution, i.e. Pix2Pix, for tackling image-to-image translation [10]. Zhu et al. cyclically synthesize the source map and the target image, and propose a novel network named CycleGAN to improve the quality of the results [16]. These classic models, Pix2Pix and CycleGAN, use simple structures but deliver very promising results, which has motivated more researchers to adopt the I2I framework as their basic architecture. To be specific, Yu et al. employ the I2I architecture and design a multi-mapping style transfer framework [11]. Luan et al. capture the style from a reference image and transfer it to the target one [12]. In [9] and [15], the authors model sketch-to-image synthesis as an image-to-image translation problem, and the developed networks can synthesize realistic images from the input sketches. Lu et al. and Madam et al. employ the I2I framework to restore sharp images from the corresponding blurred ones [13, 14]. Besides style transfer, sketch-to-image synthesis and image deblurring, many tasks such as image inpainting [2, 5], image denoising [39, 38], and image super-resolution [40, 41] all build their solutions on the I2I framework. However, I2I methods only focus on synthesizing authentic images and pay no attention to the semantic and stylistic consistency between the input and the generated part. Consequently, directly extending I2I methods to image outpainting leads to poor performance [8, 7].

Fig. 2: The architecture of the proposed method. Our framework consists of three modules. The generator module takes the image and sketch as inputs and predicts new content beyond the boundary of the input image. The holistic alignment module is responsible for discriminating whether the synthesized part is fake or not from a global view, and the sketch alignment module focuses on forcing the generator to be sensitive to fine details and to recover the high-frequency information to boost the outpainting quality.

II-B Image Inpainting

Image inpainting aims at reconstructing missing areas in corrupted images [4, 1, 19, 20], and has been well explored. Extensive efforts have been dedicated to this field. Yu et al. design a contextual attention module and propose to fill the missing areas using pixels from similar regions [5]. Iizuka et al. employ a fully convolutional neural network and use global and local context discriminators to train an inpainting system [18]. Recently, the focus of image inpainting has been moving from missing regions with regular shapes to irregular ones [2, 5, 17, 4, 6]. Liu et al. propose partial convolution, which progressively predicts the missing pixels from the surrounding content [6]. Yu et al. develop gated convolution to adaptively learn a soft mask, and the designed architecture significantly improves the inpainting results. In [17], Xie et al. learn a feature re-normalization with the designed learnable attention map module, which adapts effectively to irregular holes. Guo et al. propose a full-resolution residual network (FRRN) to fill irregular holes, aiming at recovering more textural details for the damaged areas [4]. Han et al. propose a two-stage image-to-image generation framework that performs compatible and diverse inpainting [42]. In [43], Ren et al. split the inpainting task into two parts, i.e. structure reconstruction and texture generation, and design a two-stage framework to yield texture-detailed results. The classic methodology of image inpainting predicts the damaged pixels from their neighbors based on convolutional operations. These methods succeed on image inpainting; however, they suffer from the lack of surrounding pixels when applied to the outpainting task [7, 8]. Compared with image inpainting, outpainting faces the extra challenge that the missing region is relatively large and far away from the valid pixels.

II-C Image Outpainting

Traditional outpainting methods first search for relevant patches in a pre-defined candidate pool, and the retrieved patches are then stitched with the input image to conduct extrapolation [21, 22, 23, 24]. Zhang et al. formulate outpainting in the shift-map image synthesis framework: the authors search for a guide image and analyze its self-similarity to generate the allowable local transformations, which are then applied to the input image to conduct extrapolation [23]. Wang et al. use library images to determine consistent content for the regions and propose a data-driven approach to extrapolate the image to a given, distinctly larger one [24]. The performance of traditional search-based methods depends on the search results and the number of candidate images, and a poor patch choice causes poor performance. These methods are not flexible enough and are hard to extend to more complex situations. Inspired by the success of generative adversarial networks (GANs), researchers have recently used the GAN framework to synthesize new content beyond the boundary [7, 8, 3]. Yang et al. utilize GANs and a recurrent neural network to iteratively predict new content for the current region. Teterwak et al. design a powerful discriminator that takes the ground-truth image as input to form an improved inception loss, and the developed network can efficiently restore the images. In [44], Zhang et al. study a special outpainting task that aims to generate a set of realistic backgrounds for a given small foreground region. Wang et al. allow users to control the margin of the boundary, and propose to synthesize new content matching the expected resolution [45]. Wu et al. explore the outpainting problem for portrait images and design a two-stage framework to produce realistic portrait images [46]. However, existing outpainting methods only focus on generating realistic images and introduce no extra information to guide the final synthesis. As a consequence, these works all produce random results.

III Methodology

III-A Overview

As shown in Fig. 2, our framework comprises three modules, i.e. the generator module, the holistic alignment module and the sketch alignment module. Our generator module takes an image and its sketch as inputs, and synthesizes the additional right-half content using the right-half sketch as guidance. The holistic alignment module is responsible for predicting a scalar, which coarsely indicates whether the input is generator-produced or real from a global view. In contrast, the sketch alignment module pursues detailed agreement between the sketches inferred from the synthesized part and the ground-truth ones. These three modules are jointly trained with the classic adversarial loss [36] and the proposed sketch alignment loss. During training, our system takes the left-half image, the left-half sketch and the right-half sketch as inputs to reconstruct the entire image. At the testing stage, users can feed an image and free-form guiding sketches to synthesize the desired image with an additional right half, as shown in Fig. 1. In the following, we first introduce each module one by one and then provide the training details.

III-B Generator Module

Following the previous state-of-the-art outpainting method [8], our generator also adopts an encoder-decoder based network. The encoder compresses the inputs into hidden features, and an LSTM [28] encoder collaborating with an LSTM decoder is employed to predict the hidden features of the complete image, which are further fed into the subsequent decoder layers to restore the complete image.

Compared with synthesizing random content, sketch-guided outpainting poses two extra challenges for the generator. First, with limited training samples, the learned outpainting model can hardly cover all the free-style sketches drawn by users, which makes it less robust to novel sketches. Since it is not practical to augment the training set with a huge number of images with various sketches, one promising solution is to force the pixels filled around the given sketches to be consistent with the contextual information and the learned prior knowledge. Intuitively, cloud and sky usually appear in the upper part of a scenery image, while grass appears in the lower part. If the learned outpainting model is well equipped with position-aware prior knowledge, it will fill novel sketches given in the upper part with white or blue rather than green. Motivated by this relation between pixels and positions, we introduce two position channels to help the generator learn position-aware knowledge and robustly predict the filled pixels for novel sketches. Second, we should ensure that the synthesized image exactly matches the guiding sketches, since in our experiments we find that the generator often produces results that are deformed compared with the given sketches. In some extreme cases, the generated content even totally loses the shape information for relatively small guiding sketches, as shown in Fig. 3 and Fig. 9. Consequently, the synthesized results do not match users’ intentions well. To address this issue, we design a conditional skip connection to emphasize the desired shape in the decoding stage, which effectively helps the generator “remember” the information of the expected shape.

Fig. 3: The generator without CSC cannot ensure that the results exactly match the guiding sketches. When the guiding sketch is small, the model may totally ‘forget’ the sketch in the decoding step. Panels: (a) the inputs; (b) results without CSC; (c) results with CSC; (d) the original images.

III-B1 Position Channels

Intuitively, in a scenery image different types of objects tend to appear in specific positions, e.g. clouds should be located in the sky (top part) rather than on the ground (bottom part), while lake, land and rock are more likely to be in the bottom part. Therefore, for scenery image outpainting, the relation between a semantic region and its position is a helpful clue for predicting new content. Moreover, the learned positional prior knowledge plays a key role in helping the network generalize robustly to free-style outpainting. Inspired by these considerations and the success of position maps on position-sensitive tasks [27], we utilize two additional position channels, i.e. the width channel and the height channel, to assist our system in predicting reasonable pixels for the outpainting.

Both channels are defined in terms of the width and height of the images in the training set. The values of the two position channels range over different intervals while maintaining the same step size.
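The two position channels described above could be built roughly as follows. This is only a minimal sketch: the exact normalization is not recoverable from the text, so dividing both the row and the column index by the same constant (which yields different value ranges but an identical step) is an assumption, as are the function and variable names.

```python
def make_position_channels(height, width):
    """Return (width_channel, height_channel) as height x width nested lists.

    Assumption: both channels are scaled by the same constant, max(height, width),
    so they share one step size while covering different intervals.
    """
    scale = float(max(height, width))
    # Width channel: the value depends only on the column index.
    width_channel = [[col / scale for col in range(width)] for _ in range(height)]
    # Height channel: the value depends only on the row index.
    height_channel = [[row / scale for _ in range(width)] for row in range(height)]
    return width_channel, height_channel


# Example for a full 128 x 256 image (left input half plus right outpainted half).
pw, ph = make_position_channels(128, 256)
```

With this choice the width channel spans a wider interval than the height channel for a 128 x 256 image, while neighbouring pixels differ by the same amount in both channels.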

A convolutional neural network can capture some position cues by enlarging its receptive field [30]; however, the captured position information is implicit and not powerful enough to benefit the overall outpainting. Unlike the implicit position cues from a CNN, our position channels model the explicit relation between pixel semantics and position, and provide stronger auxiliary information to help the generator predict reasonable pixels, especially for free-form outpainting.

In practice, the position channels are also split into two parts along the width dimension, i.e. the left part and the right part, to fit the generator architecture. The two left position channels are first encoded by a convolution layer and then concatenated with the fused features from the input image and sketch, which are further encoded to obtain the final left hidden representations. In the following decoding step, the right-half position channels and the right-half sketch are first encoded by a position encoder and a sketch encoder, respectively. Then, the two compressed representations are added element-wise to the sequential features from the LSTM encoder, to serve as the initial state of the following LSTM decoder, whose responsibility is to predict the hidden features of the full image. The subsequent decoder module takes the full hidden features and rebuilds the complete image through a series of convolution and upsampling operations.

III-B2 Conditional Skip Connection

To synthesize an image that exactly matches the guiding sketch, we design a conditional skip connection (CSC) structure, inspired by U-Net [37], to emphasize the desired shape in each decoding step. Consider the outputs of a given layer of the position encoder and of the sketch encoder, together with the right-half features output by the corresponding decoding layer. These three tensors are first concatenated channel-wise and then fed through three convolution layers to obtain new features with the same shape as the right-half decoder features. To make the training more stable, we also introduce a residual connection that adds the new features element-wise to the right-half decoder features, giving the final output of the CSC module. The output of the decoding layer is then updated by replacing its original right-half features with the CSC output.
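The data flow of the conditional skip connection (concatenate, convolve, add residually, write back into the right half) can be illustrated with a small sketch. It is not the authors' implementation: the three stacked convolutions are replaced by a single 1x1 convolution (a per-pixel matrix product) to keep the example self-contained, and the names `conv1x1`, `csc_update` and `weights` are hypothetical.

```python
def conv1x1(pixel, weights):
    """Apply a 1x1 convolution to one pixel: a plain matrix-vector product."""
    return [sum(w * v for w, v in zip(row, pixel)) for row in weights]


def csc_update(decoder_feat, pos_feat, sketch_feat, weights):
    """Update the right half of `decoder_feat` (H x W x C nested lists).

    `pos_feat` and `sketch_feat` are H x (W/2) x C condition features.
    `weights` is a C x 3C matrix standing in for the stacked convolutions.
    """
    half = len(decoder_feat[0]) // 2
    updated = []
    for r, row in enumerate(decoder_feat):
        new_row = list(row[:half])  # the left half is passed through unchanged
        for c in range(half):
            right = row[half + c]
            # Channel-wise concatenation of position, sketch and decoder features.
            concat = pos_feat[r][c] + sketch_feat[r][c] + right
            mixed = conv1x1(concat, weights)
            # Residual addition back onto the original right-half feature.
            new_row.append([m + x for m, x in zip(mixed, right)])
        updated.append(new_row)
    return updated
```

With all-zero weights the function returns the decoder features unchanged, which is the behaviour the residual connection provides at the start of training in this simplified setting.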

The difference between the CSC and U-Net is two-fold. First, the CSC only focuses on the right half of the decoder features, which corresponds to the guiding sketch and the corresponding position region. Second, the two sides of the connection in the CSC are not symmetrical: features from the condition encoding modules contain the guiding information and the auxiliary position features, while features in the decoding layers encode the additional visual features transferred from the inputs. In our experiments, the CSC not only improves free-form outpainting but also speeds up network convergence, as shown in Fig. 10.

III-C Holistic Alignment Module

The holistic alignment module, which is responsible for discriminating whether the input image is fake or real, is introduced to conduct adversarial learning. Following the same strategy as in [5] and [8], our holistic alignment module consists of two discriminators. The local discriminator determines whether the synthesized part is generator-produced or real, while the global one determines whether the entire image is real or not. Both take the concatenation of image and sketch as input and output a scalar through several strided convolutions and a fully connected layer.

The overall architecture employs the Wasserstein GANs [29] framework, and the network parameters are trained by solving the min-max optimization:


where D is the discriminator and G is the generator, and the two objectives are the discriminator loss and the generator loss, respectively. The last term in Eq. 2 is the gradient penalty that enforces the Lipschitz constraint [29], evaluated at a random sample drawn from a probability distribution.
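For orientation, the widely used Wasserstein GAN objective with gradient penalty, which the description above appears to follow, can be written as below. The notation is an assumption rather than the paper's own: $\tilde{x} = G(\cdot)$ is a generated sample, $x$ a real sample, $\hat{x}$ a point sampled uniformly on the straight line between a real and a generated sample, and $\lambda$ the penalty weight. Whether the original equations also condition explicitly on the sketch is not recoverable from the text.

```latex
\begin{aligned}
\mathcal{L}_D &= \mathbb{E}_{\tilde{x}\sim\mathbb{P}_g}\big[D(\tilde{x})\big]
  - \mathbb{E}_{x\sim\mathbb{P}_r}\big[D(x)\big]
  + \lambda\,\mathbb{E}_{\hat{x}\sim\mathbb{P}_{\hat{x}}}
    \Big[\big(\lVert\nabla_{\hat{x}} D(\hat{x})\rVert_2 - 1\big)^2\Big], \\
\mathcal{L}_G &= -\,\mathbb{E}_{\tilde{x}\sim\mathbb{P}_g}\big[D(\tilde{x})\big].
\end{aligned}
```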


Besides the adversarial loss, the reconstruction loss is also implemented as a masked loss, which optimizes for coarse image agreement:


Here the mask, proposed by Yang et al. [8], reduces the weight of the reconstruction loss along the prediction direction; it is applied through an element-wise product, and the target is the ground-truth image.

III-D Sketch Alignment Module

The holistic alignment module only focuses on the overall image and the synthesized part, and pays less attention to the details of the synthesized content. Consequently, an outpainting model trained with only the holistic alignment module can restore most of the low-frequency information but fails to preserve the high-frequency details, leading to blurry boundaries between different semantic regions.

To further enhance the outpainting quality, we augment the system with a sketch alignment module to restore the high-frequency information. To be specific, given the synthesized part, we first leverage an edge detector [26] to reversely produce its edge map, in which the high-frequency information is retained. Then, the input sketches are adopted as ground truth, and a sketch-based alignment loss is applied to force the inferred edge map to be consistent with the ground-truth one. Formally, the loss compares the sketch of the synthesized image, obtained by feeding the rebuilt image through the edge detector, with the sketch map from the ground truth:


The reconstruction defined by Eq. 5 forces the generator to recover the high-frequency details of the ground-truth image, and it is also in line with our sketch-guided setting.
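The idea of the sketch alignment loss (re-extract edges from the generated image and compare them with the guiding sketch) can be sketched as follows. Two things here are assumptions made only for illustration: the toy `simple_edges` detector stands in for the pre-trained HED network used in the paper, and the mean absolute difference stands in for the norm of Eq. 5, whose exact form is not recoverable from the text.

```python
def simple_edges(image, threshold=0.5):
    """Stand-in edge detector on a grayscale image (H x W nested lists).

    Marks a pixel as an edge when it differs strongly from its left or upper
    neighbour. The paper uses HED instead; this only keeps the example runnable.
    """
    h, w = len(image), len(image[0])
    edges = [[0.0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            dx = abs(image[r][c] - image[r][c - 1]) if c > 0 else 0.0
            dy = abs(image[r][c] - image[r - 1][c]) if r > 0 else 0.0
            edges[r][c] = 1.0 if max(dx, dy) > threshold else 0.0
    return edges


def sketch_alignment_loss(generated, target_sketch, edge_fn=simple_edges):
    """Mean absolute difference between the inferred and ground-truth sketches."""
    inferred = edge_fn(generated)
    total = sum(abs(a - b)
                for row_i, row_t in zip(inferred, target_sketch)
                for a, b in zip(row_i, row_t))
    return total / (len(target_sketch) * len(target_sketch[0]))
```

In a real training loop the edge detector would have to be differentiable so that the loss can propagate gradients back to the generator, which a hard threshold like the one above does not allow.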

III-E Training

Fig. 4: Three exemplary results for image restoring according to the original sketches. The transition boundary of Pix2Pix is not smooth enough, which makes the final images look disjointed. NSIO and BDIE achieve acceptable transitions, but the generated parts lack realism. Our results not only achieve a smooth transition from left to right, but also show satisfactory semantic consistency. Panels: (a) Pix2Pix [10]; (b) NSIO [8]; (c) BDIE [7]; (d) Ours; (e) GT.
Fig. 5: Results for sketch-guided outpainting. All the compared methods suffer from sudden transitions around the boundary or unreasonable pixel filling for the desired shape and, as a consequence, fail to produce satisfactory results. In contrast, our model achieves a smooth transition from the original input to the synthesized right half and preserves the semantic consistency of the entire image very well. Panels: (a) the inputs; (b) Pix2Pix [10]; (c) NSIO [8]; (d) BDIE [7]; (e) Ours.

The model is trained with a combination of the adversarial loss and our proposed sketch alignment loss. Our discriminator loss comprises the global discriminator loss and the local discriminator loss, each of which is obtained according to Eq. 2 by feeding the inputs to the corresponding discriminator. In summary, the full discriminator loss reads:


and the generator loss is formulated as:


where the two coefficients are the trade-off weights.

In practice, we find that directly training the network can successfully restore images according to the original sketches, but does not perform well on free-form outpainting. The reason may be overfitting to the sketches. Therefore, a random sketch masking strategy is designed for sketch augmentation. This augmentation is applied only to the right-half sketch, since the outpainting model is conditioned on it. For the current right-half sketch in the training stage, we:

  • keep the sketch unchanged with probability 0.4,

  • mask a randomly selected patch whose scale ranges from 48×48 to 64×128, in the top part or the bottom part with probability 0.2 and 0.4, respectively.

The sketches in the bottom part receive more attention, because we find that sketches in the bottom part are richer and more complex than those in the top part.
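The random sketch masking strategy above can be sketched as follows. The probabilities come from the text; everything else is an assumption made for illustration: "top" and "bottom" are taken to be the upper and lower halves of the sketch, the patch height and width are drawn from [48, 64] and [48, 128], "masking" is taken to mean zeroing the strokes inside the patch, and the function name is hypothetical.

```python
import random


def augment_right_sketch(sketch, rng=random):
    """Randomly mask one patch of a binary right-half sketch (H x W nested lists).

    With probability 0.4 the sketch is returned unchanged, with probability 0.2
    a patch in the top half is masked, and with probability 0.4 a patch in the
    bottom half is masked. The input sketch is never modified in place.
    """
    height, width = len(sketch), len(sketch[0])
    draw = rng.random()
    if draw < 0.4:
        return [row[:] for row in sketch]

    # Assumed patch-size ranges, clipped so the patch fits inside one half.
    patch_h = min(rng.randint(48, 64), height // 2)
    patch_w = min(rng.randint(48, 128), width)
    in_top = draw < 0.6  # the next 0.2 of probability mass selects the top half
    row_lo = 0 if in_top else height // 2
    top = rng.randint(row_lo, row_lo + height // 2 - patch_h)
    left = rng.randint(0, width - patch_w)

    masked = [row[:] for row in sketch]
    for r in range(top, top + patch_h):
        for c in range(left, left + patch_w):
            masked[r][c] = 0
    return masked
```

Passing a seeded `random.Random` instance as `rng` makes the augmentation reproducible, which is convenient when debugging a training run.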

Fig. 6: Several outpainting results for free-form sketches. Our method successfully generates natural images with diverse freestyle sketches.

IV Experimental Results

IV-A Experiment Setup

Layer             Input          Output     Kernel size & strides
G-Conv            128x128x4      64x64x64   4x4, strides=2
Conv              128x128x2      64x64x64   4x4, strides=2
G-Conv            64x64x128      32x32x128  3x3, strides=2
G-Conv            32x32x128      16x16x256  1x1, strides=2
G-Resblock x3     16x16x256      16x16x256  1x1,3x3,1x1, strides=1
G-Conv            16x16x256      8x8x512    3x3, strides=2
G-Resblock x4     8x8x512        8x8x512    1x1,3x3,1x1, strides=1
G-Conv            8x8x512        4x4x1024   3x3, strides=1
G-Resblock x5     4x4x1024       4x4x1024   1x1,3x3,1x1, strides=1
Conv              4x4x1024       4x4x512    3x3, strides=1
LSTM Encoder      4x(4x512)      1x(4x512)  -
Sketch Encoder    128x128x1      1x(4x512)  -
Position Encoder  128x128x2      1x(4x512)  -
Sum               1x(4x512) x 3  1x(4x512)  None
LSTM Decoder      1x(4x512)      4x4x512    -
Concat            (4x4x512)x2    4x8x512    None
G-Resblock x2     4x8x512        4x8x512    1x1,3x3,1x1, strides=1
CSC+G-DeConv      4x8x512        8x16x512   3x3
G-Resblock x3     8x16x512       8x16x512   1x1,3x3,1x1, strides=1
CSC+G-DeConv      8x16x512       16x32x256  3x3
G-Resblock x4     16x32x256      16x32x256  1x1,3x3,1x1, strides=1
CSC+G-DeConv      16x32x256      32x64x128  3x3
G-DeConv          32x64x128      64x128x64  3x3
G-DeConv          64x128x64      128x256x3  3x3
TABLE I: The architecture of our generator, where the G-Conv and G-DeConv indicate the gated convolution and the gated deconvolution [2], respectively. G-Resblock refers to the resblock [48] whose convolution operations are replaced by the gated convolutions.

IV-A1 Datasets

The NS6K dataset [8], which consists of 6,040 natural scenery images, is employed to evaluate our method, and the data split follows Yang’s setting [8]. However, we find that most of the sketches extracted by HED [26] in NS6K are simple, and that many images are similar to each other. To test our method under more practical conditions, we first select 4,040 images from NS6K by filtering out 2,000 similar images, and then collect 4,075 images from Google Images using eighteen scenery keywords as queries, including Alps, Arches Park, La Digue, etc. The collected images and the selected images from NS6K form a more diverse and complex dataset called NS8K, which contains 8,115 natural scenery images. Of these, 1,500 images are used for testing and the rest are used as training data.

Fig. 7: The results for multi-step prediction. Longer results can be synthesized by feeding the current outpainting result into the next prediction step.

Iv-A2 Implement Details

Following Yu et al. [2], the sketch in this work is obtained using the HED edge detector [26]: we first extract the edge map and then set the values above 0.6 to 1 to get the binary sketch. To maintain better consistency, the pre-trained HED detector, whose parameters are frozen during training, is also used for edge map detection in our sketch alignment module. Our network synthesizes an image from an input image and a guiding sketch with shape 128×128. For the generator, the image concatenated with the sketch is first passed through a gated convolution layer [2], while the raw position channels are encoded by a convolution layer. The two feature maps are channel-wise concatenated and fed through several gated convolution layers to get a tensor with shape 4×4×512. The encoder LSTM [28], together with the decoder LSTM, produces another tensor, which is coupled with the left one to rebuild the complete image with shape 128×256 by a series of gated deconvolution layers [2]. The feature dimension of the hidden state of both LSTMs is set to 2048. We choose the gated structure because of its superiority in dealing with binary sketches [7, 2]. The detailed architecture of our generator is given in Table I.

Our network is implemented on the TensorFlow platform [31] and trained on 2 NVIDIA GTX 1080Ti GPUs. The parameters of the generator and the discriminators are jointly updated using the Adam optimizer [32] with batch size 30. The weight of the reconstruction loss is set to 0.998, while the weight of the adversarial loss is set to 0.002. The remaining three hyperparameters are fixed as 1, 0.9 and 10, respectively. Training runs for up to 800 epochs, starting with a learning rate of 0.0001 that is decayed by a factor of 0.1 after 200 epochs.
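The numeric settings above can be summarized in a minimal sketch. The function names are hypothetical, the loss shown covers only the two weighted terms named in this paragraph, and the learning-rate schedule assumes a single decay step at epoch 200, which is one possible reading of the text.

```python
def binarize_edge_map(edge_map, threshold=0.6):
    """Turn HED edge responses in [0, 1] into a binary sketch.

    Values strictly above the threshold become 1, all others 0.
    """
    return [[1 if v > threshold else 0 for v in row] for row in edge_map]


def weighted_generator_loss(rec_loss, adv_loss, w_rec=0.998, w_adv=0.002):
    """Weighted sum of the reconstruction and adversarial terms.

    Other loss terms of the full model are omitted in this sketch.
    """
    return w_rec * rec_loss + w_adv * adv_loss


def learning_rate(epoch, base_lr=1e-4, decay=0.1, decay_epoch=200):
    """Assumed schedule: one decay step once `decay_epoch` epochs have elapsed."""
    return base_lr * decay if epoch >= decay_epoch else base_lr


edge_map = [[0.1, 0.6, 0.61],
            [0.9, 0.0, 0.59]]
sketch = binarize_edge_map(edge_map)
print(sketch)
print(learning_rate(0), learning_rate(200))
```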

Fig. 8: Visual ablation comparison on image rebuilding, with panels (a) Baseline, (b) + PCs, (c) + CSC, (d) + SAL and (e) GT, where PCs, CSC, SAL and GT represent the position channels, conditional skip connection, sketch alignment loss and groundtruth, respectively.

Fig. 9: Visual ablation comparison on free-form outpainting, with panels (a) Inputs, (b) Baseline, (c) + PCs, (d) + CSC and (e) + SAL. The results improve each time a new module is added.

IV-A3 Evaluation Metrics

Three criteria are used to evaluate the proposed method, i.e., the Fréchet Inception Distance (FID) [34], the Inception Score (IS) [33] and the Mean Satisfaction Degree (MSD). To conduct an objective comparison, we feed the original sketches from the test dataset to rebuild the corresponding images, and the FID and IS are computed from the synthesized images and the groundtruth data:

$$\mathrm{IS} = \exp\Big(\mathbb{E}_{x}\,\mathrm{KL}\big(p(y \mid x)\,\|\,p(y)\big)\Big), \qquad \mathrm{FID} = \|\mu_r - \mu_g\|_2^2 + \mathrm{Tr}\big(\Sigma_r + \Sigma_g - 2(\Sigma_r \Sigma_g)^{1/2}\big),$$

where IS is defined based on the KL-divergence between the classification distribution $p(y \mid x)$ of a fake sample $x$ and the mean probability $p(y)$ of each class, while the FID first employs the Inception-V3 network [47] to extract 2048-d features and then computes a statistical distance to evaluate the generation model. Here $\mu_g$ and $\Sigma_g$ are the mean vector and the covariance matrix of the fake features, $\mu_r$ and $\Sigma_r$ are those of the real features, and $\mathrm{Tr}$ is the trace operation.
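To make the Inception Score concrete, the following minimal sketch computes it from per-image class probabilities. It assumes the probabilities have already been produced by a classifier such as Inception-V3; the toy distributions and the function name are illustrative only.

```python
import math


def inception_score(probs, eps=1e-12):
    """IS = exp(mean over images of KL(p(y|x) || p(y))).

    probs: list of per-image class distributions, each summing to 1.
    """
    n = len(probs)
    k = len(probs[0])
    # Marginal class distribution p(y): the mean probability of each class.
    marginal = [sum(p[c] for p in probs) / n for c in range(k)]
    kl_sum = 0.0
    for p in probs:
        kl_sum += sum(
            pc * (math.log(pc + eps) - math.log(marginal[c] + eps))
            for c, pc in enumerate(p)
            if pc > 0
        )
    return math.exp(kl_sum / n)


# Confident and diverse predictions give a high score (here close to 2),
# while identical predictions give the minimum score of 1.
diverse = inception_score([[1.0, 0.0], [0.0, 1.0]])
uniform = inception_score([[0.5, 0.5], [0.5, 0.5]])
print(diverse, uniform)
```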

As for free-form outpainting, we cannot obtain an objective measure since no groundtruth is available. Therefore, we employ a subjective metric, i.e., the Mean Satisfaction Degree (MSD), to evaluate the quality of free-form outpainting. First, we randomly select 300 images from the test dataset and replace their original right-half sketches with manually drawn free-form ones; there are 77 different types of sketches in total. Then, 20 volunteers are invited to label the satisfaction degree of each synthesized sample on three levels: 0 (poor), 1 (ordinary) and 2 (good). The mean value of all labels on the test images is taken as the MSD. Compared with restoring images from the original sketches, using free-form sketches for outpainting is closer to the practical situation, since the right half of the image is usually not available. Therefore, the MSD is more important for performance evaluation.
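The MSD computation described above amounts to averaging all volunteer labels. A minimal sketch follows, with a hypothetical function name and toy ratings from two volunteers on three samples rather than the 20 volunteers and 300 images of the actual study.

```python
def mean_satisfaction_degree(labels):
    """Average all ratings, where labels[v][i] is volunteer v's rating
    (0 = poor, 1 = ordinary, 2 = good) for synthesized sample i."""
    flat = [r for per_volunteer in labels for r in per_volunteer]
    if any(r not in (0, 1, 2) for r in flat):
        raise ValueError("ratings must be 0, 1 or 2")
    return sum(flat) / len(flat)


toy_labels = [
    [2, 1, 0],  # volunteer 1
    [2, 2, 1],  # volunteer 2
]
msd = mean_satisfaction_degree(toy_labels)
print(msd)
```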

IV-B Quantitative Comparisons

Table II and Table III show the results of our method and three competing methods on the NS6K and NS8K datasets, respectively. The comparison methods include two outpainting methods, NSIO [8] and BDIE [7], and a classic image-to-image translation work, Pix2Pix [10]. All comparison methods use the same data augmentation, including random cropping and flipping, and are trained for 1,500 epochs. Pix2Pix [10] is trained with the loss functions in [8]; the right half of the input image is masked and channel-wise concatenated with the sketch to be translated into the original image. For NSIO [8], the left sketch and the right-half sketch are used in the same way as in our method. For BDIE [7], the sketch is directly channel-wise concatenated with the masked input to synthesize the full image.
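The input construction used for the Pix2Pix and BDIE baselines can be sketched as follows. This is an illustration under stated assumptions: the masked region is filled with zeros, the sketch covers the full image width, and images are plain nested lists rather than framework tensors; the function name is hypothetical.

```python
def build_masked_input(image, sketch):
    """Mask the right half of an image and append the sketch as an extra channel.

    image:  H x W x C nested lists of pixel values.
    sketch: H x W nested lists of binary values.
    Returns an H x W x (C + 1) nested list.
    """
    height, width = len(image), len(image[0])
    half = width // 2
    out = []
    for y in range(height):
        row = []
        for x in range(width):
            pixel = image[y][x]
            # Keep the left half, zero out the right half (assumed mask value).
            channels = list(pixel) if x < half else [0.0] * len(pixel)
            row.append(channels + [sketch[y][x]])
        out.append(row)
    return out


toy_image = [[[1.0, 1.0, 1.0] for _ in range(4)] for _ in range(2)]
toy_sketch = [[1, 0, 1, 1] for _ in range(2)]
masked = build_masked_input(toy_image, toy_sketch)
print(masked[0][0], masked[0][3])
```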

| Method | Pix2Pix [10] | NSIO [8] | BDIE [7] | Ours |
|---|---|---|---|---|
| FID | 21.197 | 13.17 | 13.424 | 10.998 |
| IS | 2.783 | 2.887 | 2.899 | 2.920 |
| MSD | 0.472 | 0.544 | 0.777 | 1.027 |

TABLE II: The results of four methods on NS6K under evaluation criteria IS (the higher the better), FID (the lower the better) and MSD (the higher the better).

| Method | Pix2Pix [10] | NSIO [8] | BDIE [7] | Ours |
|---|---|---|---|---|
| FID | 18.327 | 11.153 | 10.891 | 10.390 |
| IS | 3.013 | 3.254 | 3.276 | 3.321 |
| MSD | 0.615 | 0.706 | 0.725 | 1.031 |

TABLE III: The performance of four methods on the NS8K dataset.

As shown in Table II, our method reaches an FID of 10.998 on NS6K, which is clearly better than all competing methods (13.17 at best). On free-form outpainting, the superiority of our method is more obvious: it achieves an MSD of 1.027, while the best competing method reaches only 0.777. As shown in Table III, the competing method BDIE [7] achieves comparable performance on the image rebuilding task on NS8K; for example, our FID is 10.390 while that of BDIE [7] is 10.891. However, using free-form sketches for outpainting is closer to the practical situation, since the right half of the image is usually not available. Although the competing methods achieve acceptable performance when rebuilding images from the original sketches, they perform much worse on free-form outpainting. The MSD of BDIE [7] is only 0.725 on NS8K, and the other two comparison methods are even worse. In contrast, our method achieves an MSD of 1.031, surpassing the comparison methods by a large margin. Taken together, Table II and Table III show that the existing outpainting methods can deal with the original sketches but fail to generalize to the more practical situation, i.e., free-form outpainting. Our proposed method not only achieves the best performance on image rebuilding from the original sketches but also produces more satisfactory results for free-style image outpainting on both datasets, which validates the effectiveness of our approach.

IV-C Qualitative Results

Fig. 4 and Fig. 5 show the qualitative comparison of the four methods on image rebuilding and free-form outpainting, respectively. As shown in Fig. 4, Pix2Pix [10] cannot achieve a smooth transition around the boundary, since no module or loss in its architecture is designed to stitch the boundary. For NSIO [8] and BDIE [7], the boundary between the original image and the synthesized part is relatively smooth; however, the results still suffer from a lack of textural details and from disharmonious pixels. In contrast, our method not only achieves a smooth transition from the left half to the right half but also synthesizes results with more textural details. The superiority of our method is more obvious on free-form outpainting, as shown in Fig. 5. None of the competing methods can ensure semantic consistency and a smooth boundary, and they fail to fill in reasonable pixels for the free-style sketches. Even for a simple sketch such as a single line (the second row in Fig. 5), the comparison methods do not succeed, which reveals the poor generalization of these methods and the difficulty of the free-form outpainting task. Our proposed network, by contrast, successfully predicts reasonable pixels for the free-style sketches using the learned positional prior knowledge, which helps the network synthesize much more natural results.

Fig. 1 and Fig. 6 exhibit several groups of outpainting results with diverse free-form sketches. As shown in these two figures, our method not only successfully conducts outpainting for sketches similar to the training data but also generalizes well to unseen sketches such as circle and heart shapes. Thanks to the learned positional prior knowledge, even when the provided sketch does not correspond to semantically meaningful content, as in the fourth column of Fig. 6, our approach still fills in reasonable pixels for the sketch and produces relatively semantically consistent content. Even though most of the provided sketches are absent from the training set, our model still generates realistic, natural images for diverse manually drawn sketches and preserves the semantic consistency well. Besides, our method can also synthesize longer images by taking the current output as the input for the next prediction; Fig. 7 shows two examples of 9-step prediction.
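The multi-step prediction mentioned above can be sketched as a simple loop in which the newly synthesized right half becomes the input of the next step. The `outpaint_step` callable is a hypothetical stand-in for the trained generator, and the toy step used here only illustrates the data flow.

```python
def multi_step_outpaint(image, sketches, outpaint_step):
    """Extend an image to the right, one step per guiding sketch.

    image:         list of rows, each a list of W pixel values.
    sketches:      one guiding sketch per step.
    outpaint_step: callable (image, sketch) -> rows of width 2W whose
                   right half is the newly synthesized content.
    """
    width = len(image[0])
    result = [list(row) for row in image]
    current = image
    for sketch in sketches:
        full = outpaint_step(current, sketch)
        new_right = [row[width:] for row in full]
        # Append the new content and feed it back as the next input.
        result = [r + n for r, n in zip(result, new_right)]
        current = new_right
    return result


def toy_step(img, sketch):
    """Toy generator: the right half is the left half plus one."""
    return [row + [p + 1 for p in row] for row in img]


panorama = multi_step_outpaint([[0, 0], [0, 0]], [None, None, None], toy_step)
print(panorama[0])
```

With three steps and an input of width 2, the toy panorama has width 8; a 9-step prediction as in Fig. 7 would follow the same pattern with nine sketches.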

IV-D Ablation Study

To validate the effectiveness of each component in our system, we conduct an ablation study on NS6K to verify their respective contributions; the results are reported in Table IV. Our baseline employs only the sketch and the gated convolution [2] to conduct outpainting. As Table IV shows, the model with all four parts achieves the best MSD, and the MSD improves each time a new mechanism is added, which validates the contribution of each component.

| Model | IS | FID | MSD |
|---|---|---|---|
| baseline | 2.889 | 12.87 | 0.587 |
| baseline + RSM | 2.882 | 12.27 | 0.661 |
| baseline + RSM + PCs | 2.873 | 11.526 | 0.747 |
| baseline + RSM + PCs + CSC | 2.883 | 11.451 | 0.879 |
| baseline + RSM + PCs + CSC + SAL | 2.920 | 10.998 | 1.027 |

TABLE IV: The contribution of each part. RSM, PCs, CSC and SAL indicate random sketch masking, position channels, conditional skip connection and sketch alignment loss, respectively.

Fig. 10: The loss tendencies in our training procedure. The curves labeled ‘CSC’ show the training loss with our conditional skip connection, while ‘NCSC’ indicates the loss without the CSC module.

Although the random sketch masking strategy and the position channels cause a slight drop in the Inception Score, both improve the FID and the MSD, so these two strategies play important roles in our system. The FID and IS in Table IV show that the conditional skip connection contributes little to image rebuilding; however, it effectively improves free-style outpainting. Moreover, we find that the CSC also speeds up the convergence of network training. Fig. 10 shows the tendencies of the average discriminator and generator losses in every training epoch. Our network without the CSC requires around 1,500 epochs of training, while with the CSC module the network converges after about 700 epochs, which remarkably speeds up training. Fig. 10 also shows that our CSC module makes the training steadier. Furthermore, when the sketch alignment loss is added, our method achieves the best performance, with FID and MSD scores of 10.998 and 1.027, respectively.

Fig. 8 and Fig. 9 exhibit the visual ablation comparison on image rebuilding and free-form outpainting, respectively. As shown in these two figures, the baseline method produces some abrupt pixels that make the overall image look less authentic, and the synthesized results suffer from a lack of textural details, especially for free-form outpainting. With the position channels, the model can predict reasonable pixels using the learned relation between pixels and their specific positions. The sketch alignment loss forces the generator to restore high-frequency information; consequently, when it is added, the boundaries between different semantic regions become clearer and more details can be observed in Fig. 8 and Fig. 9. As Fig. 8 shows, the contribution of the conditional skip connection is not obvious for image rebuilding, but it is important in free-style outpainting for preventing the desired sketch from being distorted, as shown in the top row of Fig. 9.

V Conclusion and Future Work

We have presented the first solution for the under-explored sketch-guided image outpainting problem, which is a meaningful yet challenging task. The developed framework allows users to guide the outpainting results with free-style sketches. Our encoder compresses the inputs into hidden features, and the decoder integrates the hidden features and the guiding information to build the desired image. Specifically, two position channels are introduced for reasonable pixel filling, and a conditional skip connection is proposed to make the results spatially consistent with the guiding sketch. To restore high-frequency details, we design the sketch alignment loss, which further boosts the outpainting quality. In addition, we contribute a more complex and diverse scenery image dataset, NS8K, for further study of image outpainting. Experiments on two benchmarks demonstrate the effectiveness of our model on sketch-guided image outpainting. Although the proposed method outperforms the compared image outpainting models, the results for free-style sketches still suffer from a lack of textural details. In future work, we will develop a model that recovers more textural details for free-style outpainting.


  • [1] C. Xie, S. Liu, C. Li, M. Cheng, W. Zuo, X. Liu, S. Wen, and E. Ding, Image Inpainting with Learnable Bidirectional Attention Maps, ICCV:8857-8866, 2019.
  • [2] J. Yu, Z. Lin, J. Yang, X. Shen, X. Lu, and T. Huang, Free-Form Image Inpainting with Gated Convolution, ICCV:4470-4479, 2019.
  • [3] M. Sabini, and G. Rusak, Painting Outside the Box: Image Outpainting with GANs, CoRR abs/1808.08483, 2018.
  • [4] Z. Guo, Z. Chen, T. Yu, J. Chen, and S. Liu, Progressive Image Inpainting with Full-Resolution Residual Network, ACM Multimedia:2496-2504, 2019.
  • [5] J. Yu, Z. Lin, J. Yang, X. Shen, X. Lu, and T. Huang, Generative Image Inpainting With Contextual Attention, CVPR:5505-5514, 2018.
  • [6] G. Liu, F. Reda, K. Shih, T. Wang, A. Tao, and B. Catanzaro, Image Inpainting for Irregular Holes Using Partial Convolutions, ECCV:89-105, 2018.
  • [7] P. Teterwak, A. Sarna, D. Krishnan, A. Maschinot, D. Belanger, C. Liu, and W. Freeman, Boundless: Generative Adversarial Networks for Image Extension, ICCV:10520-10529, 2019.
  • [8] Z. Yang, J. Dong, P. Liu, Y. Yang, and S. Yan, Very Long Natural Scenery Image Prediction by Outpainting, ICCV:10560-10569, 2019.
  • [9] W. Chen, and J. Hays, SketchyGAN: Towards Diverse and Realistic Sketch to Image Synthesis, CVPR:9416-9425, 2018.
  • [10] P. Isola, J. Zhu, T. Zhou, and A. Efros, Image-To-Image Translation With Conditional Adversarial Networks, CVPR:5967-5976, 2017.
  • [11] X. Yu, Y. Chen, T. Li, S. Liu and G. Li, Multi-mapping Image-to-Image Translation via Learning Disentanglement, NIPS:2990-2999, 2019.
  • [12] F. Luan, S. Paris, E. Shechtman, and K. Bala, Deep Photo Style Transfer, CVPR:6997-7005, 2017.
  • [13] B. Lu, J. Chen, and R. Chellappa, Unsupervised Domain-Specific Deblurring via Disentangled Representations, CVPR:10225-10234, 2019.
  • [14] M. Nimisha, S. Kumar, and A. Rajagopalan, Unsupervised Class-Specific Deblurring, ECCV:358-374, 2018.
  • [15] Y. Lu, S. Wu, Y. Tai, and C. Tang, Image Generation from Sketch Constraint Using Contextual GAN, ECCV:213-228, 2018.
  • [16] J. Zhu, T. Park, P. Isola, and A. Efros, Unpaired Image-To-Image Translation Using Cycle-Consistent Adversarial Networks, ICCV:2242-2251, 2017.
  • [17] C. Xie, S. Liu, C. Li, M. Cheng, W. Zuo, X. Liu, S. Wen, and E. Ding, Image Inpainting with Learnable Bidirectional Attention Maps, ICCV:8857-8866, 2019.
  • [18] S. Iizuka, E. Simo-Serra, and H. Ishikawa, Globally and Locally Consistent Image Completion, ACM Trans. Graph, vol.36, no.4, pages: 107:1-107:14, 2017.
  • [19] D. Pathak, P. Krahenbuhl, J. Donahue, T. Darrell, and A. Efros, Context Encoders: Feature Learning by Inpainting, CVPR:2536-2544, 2016.
  • [20] C. Yang, X. Lu, Z. Lin, E. Shechtman, O. Wang, and H. Li, High-Resolution Image Inpainting Using Multi-Scale Neural Patch Synthesis, CVPR:4076-4084, 2017.
  • [21] J. Kopf, W. Kienzle, S. Drucker, and S. Kang, Quality Prediction for Image Completion, ACM Trans. Graph, vol.31, no.6, pages: 131:1-131:8, 2012.
  • [22] J. Sivic, B. Kaneva, A. Torralba, S. Avidan, and W. Freeman, Creating and exploring a large photorealistic virtual space, CVPR Workshops, 2008.
  • [23] Y. Zhang, J. Xiao, J. Hays, and P. Tan, FrameBreak: Dramatic Image Extrapolation by Guided Shift-Maps, CVPR:1171-1178, 2013.
  • [24] M. Wang, Y. Lai, Y. Liang, R. Martin, and S. Hu, BiggerPicture: data-driven image extrapolation using graph matching, ACM Trans. Graph, vol.33, no.6, pages: 173:1-173:13, 2014.
  • [25] A. Radford, L. Metz, and S. Chintala, Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks, ICLR, 2016.
  • [26] S. Xie, and Z. Tu, Holistically-Nested Edge Detection, ICCV:1395-1403, 2015.
  • [27] R. Liu, J. Lehman, P. Molino, F. Such, E. Frank, A. Sergeev, and J. Yosinski, An intriguing failing of convolutional neural networks and the CoordConv solution, NIPS:9628-9639, 2018.
  • [28] S. Hochreiter, and J. Schmidhuber, Long Short-Term Memory, Neural Computation, vol.9, no.8, pages: 1735-1780, 1997.
  • [29] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. Courville, Improved Training of Wasserstein GANs, NIPS:5767-5777, 2017.
  • [30] K. Simonyan, and A. Zisserman, Very Deep Convolutional Networks for Large-Scale Image Recognition, ICLR, 2015.
  • [31] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, M. Kudlur, J. Levenberg, R. Monga, S. Moore, D. Murray, B. Steiner, P. Tucker, V. Vasudevan, P. Warden, M. Wicke, Y. Yu, and X. Zheng, TensorFlow: A System for Large-Scale Machine Learning, 12th USENIX Symposium on Operating Systems Design and Implementation, 2016.
  • [32] D. Kingma, and J. Ba, Adam: A Method for Stochastic Optimization, ICLR, 2015.
  • [33] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen, Improved Techniques for Training GANs, NIPS:2226-2234, 2016.
  • [34] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter, GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium, NIPS:6626-6637, 2017.
  • [35] T. Chen, M. Cheng, P. Tan, A. Shamir, and S. Hu, Sketch2Photo: internet image montage, ACM Trans. Graph, vol.28, no.5, 2009.
  • [36] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, Generative adversarial nets, NIPS:2672-2680, 2014.
  • [37] O. Ronneberger, P. Fischer, and T. Brox, U-net: Convolutional networks for biomedical image segmentation, MICCAI:234-241, 2015.
  • [38] J. Chen, J. Chen, H. Chao, and M. Yang, Image Blind Denoising With Generative Adversarial Network Based Noise Modeling, CVPR:3155-3164, 2018.
  • [39] Q. Yang, P. Yan, Y. Zhang, H. Yu, Y. Shi, X. Mou, M. Kalra, Y. Zhang, L. Sun, and G. Wang, Low-Dose CT Image Denoising Using a Generative Adversarial Network With Wasserstein Distance and Perceptual Loss, IEEE Trans. Med. Imaging, vol.37, no.6, pages: 1348-1357, 2018.
  • [40] Y. Zhang, S. Liu, C. Dong, X. Zhang, and Y. Yuan, Multiple Cycle-in-Cycle Generative Adversarial Networks for Unsupervised Image Super-Resolution, IEEE Trans. Image Process., vol.29, pages: 1101-1112, 2020.
  • [41] A. Lucas, S. Tapia, R. Molina, and A. Katsaggelos, Generative Adversarial Networks and Perceptual Losses for Video Super-Resolution, IEEE Trans. Image Process., vol.28, no.7, pages: 3312-3327, 2019.
  • [42] X. Han, Z. Wu, W. Huang, M. Scott, and L. Davis, FiNet: Compatible and Diverse Fashion Image Inpainting, ICCV:4480-4490, 2019.
  • [43] Y. Ren, X. Yu, R. Zhang, T. Li, S. Liu, and G. Li, StructureFlow: Image Inpainting via Structure-Aware Appearance Flow, ICCV:181-190, 2019.
  • [44] L. Zhang, J. Wang, and J. Shi, Multimodal Image Outpainting with Regularized Normalized Diversification, WACV:3422-3431, 2020.
  • [45] Y. Wang, X. Tao, X. Shen, and J. Jia, Wide-Context Semantic Image Extrapolation, CVPR:1399-1408, 2019.
  • [46] X. Wu, R. Li, F. Zhang, J. Liu, J. Wang, A. Shamir, and S. Hu, Deep Portrait Image Completion and Extrapolation, IEEE Trans. Image Process., vol.29, pages: 2344-2355, 2020.
  • [47] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, Rethinking the Inception Architecture for Computer Vision, CVPR:2818-2826, 2016.
  • [48] K. He, X. Zhang, S. Ren, and J. Sun, Deep Residual Learning for Image Recognition, CVPR:770-778, 2016.