Image outpainting, as illustrated in Fig. 1, aims to generate new content beyond the original boundaries of a given image. The generated image should be consistent with the given input in both spatial configuration and semantic content. Although image outpainting can be used in various applications, solutions with promising results are still scarce due to the difficulty of the problem.
The difficulties of image outpainting lie in two aspects. First, it is not easy to keep the generated image consistent with the given input in terms of spatial configuration and semantic content; previous works need local warping to avoid sudden changes between the input image and the generated region, especially around the boundary between the two. Second, it is hard to make the generated image look realistic, since it has less contextual information compared with image inpainting [2, 16].
A few preliminary works [14, 23, 34, 27] address the image outpainting problem, but none of them utilizes ConvNets. These works attempt to "search" for image patch(es) among given candidates and spatially concatenate the best match(es) with the original input. They have three limitations: (1) they need handcrafted features to summarize the image; (2) they need image processing techniques, such as local warping, to avoid sudden visual changes between the input and the generated image; (3) the final performance depends heavily on the size of the candidate pool.
Inspired by the success of deep networks on inpainting problems, we draw on a similar encoder-decoder structure with a global and a local adversarial loss to solve image outpainting. In our architecture, the encoder compresses the given input into a compact feature representation, and the decoder generates a new image from this representation. Moreover, to address the two challenges above, we make several improvements in our architecture.
To make the generated image spatially and semantically consistent with the original input, it is necessary to take full advantage of the information from the encoder and fuse it into the decoder. For this purpose, we design a Skip Horizontal Connection (SHC) to connect the encoder and decoder at each corresponding level. In this way, the decoder can generate a prediction with strong regard to the input. Our experimental results show that the proposed SHC improves the smoothness and realism of the generated image.
Moreover, we propose Recurrent Content Transfer (RCT) to transfer the feature sequence from the encoder to the decoder for generating new content. Compared with the channel-wise fully-connected strategy of previous work, RCT lets our network handle the spatial relationship along the horizontal direction more effectively. Besides, by adjusting the length of the prediction sequence, RCT makes the prediction size easy to control, which is hard with a fully-connected layer.
By integrating the proposed SHC and RCT into our encoder-decoder architecture, our method can generate images that extend beyond the boundary of the given image. As shown in Fig. 2, this is a recursive process: the generation from the last step serves as the input for the current step, which, in principle, can produce smooth and realistic images of very long extent. The generated regions, although spatially far from the given input and thus receiving little contextual information from it, remain of high quality.
To demonstrate the effectiveness of our method, we collect a new scenery dataset consisting of diverse, complicated natural scenes, including mountains with or without snow, valleys, seasides, riverbanks, starry skies, etc. We conduct a series of experiments on this dataset and outperform all competitors [12, 10, 32].
Contributions. Our contributions are summarized in the following aspects:
(1) we design a novel encoder-decoder framework to handle image outpainting, a problem rarely discussed before;
(2) we propose Skip Horizontal Connection and Recurrent Content Transfer and integrate them into our architecture, which not only significantly improves consistency in spatial configuration and semantic content, but also gives our architecture an excellent ability for long-term prediction;
(3) we collect a new outpainting dataset containing complex natural scenes, and validate the effectiveness of our proposed network on it.
2 Related Work
In this section, we briefly review previous work related to this paper in five sub-fields: Convolutional Neural Networks, Generative Adversarial Networks, Image Inpainting, Image Outpainting, and Image-to-Image Translation.
Convolutional Neural Networks (ConvNets) VGGNets and Inception models demonstrate the benefits of deep networks. To train deeper networks, Highway Networks employ a gating mechanism to regulate shortcut connections. ResNet simplifies the shortcut connection and shows the effectiveness of learning deeper networks through identity-based skip connections. Due to the complexity of our task, we employ a group of "bottleneck" ResBlocks to build our network, and utilize residual connections in Skip Horizontal Connection to improve the smoothness of the generated results.
Generative Adversarial Networks (GANs) GANs have achieved success in various problems, including image generation [3, 18], image inpainting, future prediction, and style translation. The key to the success of GANs is the adversarial loss, which forces the generator to capture the true data distribution. Variants of GANs have been derived to improve training; for example, WGAN-GP introduces a gradient penalty and achieves more stable training. We therefore utilize WGAN-GP in this work.
Image Inpainting The classical image inpainting approaches [2, 16] utilize local, non-semantic methods to predict the missing region. However, when the missing region becomes large or the context grows complex, the quality of the final results deteriorates [17, 30, 10, 32]. Compared with image inpainting, image outpainting is more challenging. To the best of our knowledge, there is no other peer-reviewed published work utilizing ConvNets for image outpainting before ours.
Image Outpainting There are a few preliminary published works [14, 23, 34, 28] on image outpainting, but none of them utilized ConvNets. Those works employed image matching strategies to "search" for image patch(es) from the input image or an image library and treat the patch(es) as the prediction region. If the search fails, the final "prediction" will be inconsistent with the given context. Unlike those previous works [14, 23, 34, 28], our approach needs no image matching strategy, relying instead on our carefully designed deep network.
Image-to-Image Translation With the development of ConvNets, recent approaches [12, 5, 21, 35] to image-to-image translation design deep networks to learn a parametric translation function. After the "Pix2Pix" framework, which uses a conditional adversarial network to learn a mapping from input to output images, similar ideas have been applied to related tasks such as translating sketches to photographs and style translation [35, 4]. Although image outpainting resembles image-to-image translation, there is a significant difference between them: in image-to-image translation, the input and output keep the same semantic content but change details or style; in our work, the style is shared between input and output, while the semantic contents differ yet remain consistent.
We first give an overview of the architecture, which is shown in Fig. 3, and then provide details on each component.
3.1 Encoder-Decoder Architecture
| Stage     | Output size | Note                       |
| ResBlock3 | 16×16×256   | stride of first block = 2  |
| ResBlock4 | 8×8×512     | stride of first block = 2  |
| ResBlock5 | 4×4×1024    | stride of first block = 2  |
We design an encoder-decoder architecture for image outpainting. The encoder takes an input image and extracts its latent feature representation; the decoder takes this latent representation and generates a new image of the same size, with consistent content and the same style.
Encoder Our encoder is derived from ResNet-50. The differences are that we replace the max-pooling layers with convolutional layers and remove the layers after conv4_5. Given an input image I of size 128×128, the encoder computes a latent representation of dimension 4×4×1024.
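The encoder's shape bookkeeping can be sketched as follows. This is a simplified stand-in that assumes plain stride-2 convolutions and illustrative channel widths, not the actual ResNet-50 bottleneck blocks; it only shows how five stride-2 stages take a 128×128 image down to a 4×4×1024 latent:

```python
import torch
import torch.nn as nn

# Simplified stand-in for the ResNet-50-derived encoder: five stride-2 stages
# reduce a 128x128 RGB image to a 4x4x1024 latent representation.
# (The real encoder uses bottleneck ResBlocks; plain convs are used here
# only to illustrate the spatial/channel bookkeeping.)
channels = [3, 64, 128, 256, 512, 1024]  # channel widths are assumptions
stages = []
for c_in, c_out in zip(channels[:-1], channels[1:]):
    stages += [nn.Conv2d(c_in, c_out, 3, stride=2, padding=1), nn.ReLU()]
encoder = nn.Sequential(*stages)

latent = encoder(torch.zeros(1, 3, 128, 128))
print(tuple(latent.shape))  # (1, 1024, 4, 4)
```

Each stride-2 stage halves both spatial dimensions: 128 → 64 → 32 → 16 → 8 → 4.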
As pointed out in prior work, it is difficult to propagate information from input image feature maps to predicted feature maps using only convolutional layers, because there is no one-to-one correspondence between the two. In Context Encoders, this propagation is handled by channel-wise fully-connected (FC) layers. One limitation of FC layers is that they can only handle features of fixed size; in our practice, this makes predicted results deteriorate when the input size is large (Fig. 7(b)). Moreover, FC layers occupy a huge number of parameters, which makes training inefficient or impractical. To address these problems, we propose a Recurrent Content Transfer (RCT) layer for information propagation in our network.
Recurrent Content Transfer RCT, shown in Fig. 4, is designed for efficient information propagation between the feature sequences of the input and prediction regions. Specifically, RCT splits the feature maps of the input region into a sequence along the horizontal dimension, and then uses two LSTM layers to transfer this sequence into a new sequence corresponding to the prediction region. The new sequence is then concatenated and reshaped into predicted feature maps. 1×1 convolutional layers are utilized to adjust the channel dimensions of the RCT input and output. Given input feature maps of size 4×4×1024, RCT outputs feature maps with the same dimensions.
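A minimal sketch of this idea is given below. The internal channel width (256), the way the prediction is seeded (with the last input column), and the LSTM step size (one feature column) are assumptions for illustration; the paper does not specify these internals:

```python
import torch
import torch.nn as nn

class RCT(nn.Module):
    """Sketch of Recurrent Content Transfer (internal sizes are assumptions)."""
    def __init__(self, c_in=1024, c_mid=256, h=4, w=4):
        super().__init__()
        self.h, self.w = h, w
        self.reduce = nn.Conv2d(c_in, c_mid, 1)   # 1x1 conv: shrink channels
        self.expand = nn.Conv2d(c_mid, c_in, 1)   # 1x1 conv: restore channels
        d = c_mid * h                             # one column = one sequence step
        self.lstm = nn.LSTM(d, d, num_layers=2, batch_first=True)

    def forward(self, x):
        b = x.size(0)
        z = self.reduce(x)                                   # (b, c_mid, h, w)
        seq = z.permute(0, 3, 1, 2).reshape(b, self.w, -1)   # width as time axis
        _, state = self.lstm(seq)                            # encode input columns
        step = seq[:, -1:]                                   # seed: last input column
        outs = []
        for _ in range(self.w):                              # predict w new columns
            step, state = self.lstm(step, state)
            outs.append(step)
        out = torch.cat(outs, 1).reshape(b, self.w, -1, self.h)
        out = out.permute(0, 2, 3, 1)                        # back to (b, c_mid, h, w)
        return self.expand(out)

out = RCT()(torch.zeros(2, 1024, 4, 4))
print(tuple(out.shape))  # (2, 1024, 4, 4)
```

Because the prediction loop runs for a chosen number of columns, the same mechanism lets the prediction length be varied, which an FC layer cannot do.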
Benefiting from the recurrent structure of RCT, we can control the size of the prediction region by setting the length of the prediction sequence in one-step prediction, and by iterating the model we can generate high-quality images over a very long range (Figs. 11, 12).
Decoder The decoder takes the 4×4×1024-dimensional features encoded from a 128×128 image I and generates an image of size 128×256. The left half of the generated image is the same as the input image I; the right half is predicted by our architecture. Similar to most recent methods, we use five transposed-convolutional layers in the decoder to expand the spatial size and reduce the channel number. However, unlike previous work, before each transposed-convolutional layer we use our Skip Horizontal Connection (SHC) to fuse the corresponding encoder feature into the decoder.
Skip Horizontal Connection Inspired by U-Net, we propose SHC, shown in Fig. 5(a), to share information from the encoder to the decoder at the same level. The difference between SHC and U-Net is that in SHC the spatial size of the encoder feature differs from that of the decoder feature: SHC acts on the left half of the decoder feature, which corresponds to the original input region.
As illustrated in Fig. 5(a), given a feature from the decoder and a feature from the encoder at the same level, SHC computes a new feature as follows. First, we concatenate the left half of the decoder feature with the encoder feature along the channel dimension; then we pass this concatenated feature through three convolutional layers, with 1×1, 3×3, and 1×1 kernels respectively, to obtain an intermediate feature. To make training more stable, we introduce a residual connection that adds this intermediate feature element-wise to the left half of the decoder feature. The result replaces the left half of the decoder feature to form the final output of SHC.¹

¹The SHC before the first transposed-convolutional layer is different: there, we simply concatenate the input of the RCT to the left of the predicted feature map along the width dimension, because the predicted feature does not yet contain any information from the input region.
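The SHC computation described above can be sketched as follows; the channel counts and ReLU activations are assumptions for illustration:

```python
import torch
import torch.nn as nn

class SHC(nn.Module):
    """Sketch of Skip Horizontal Connection (channel sizes are assumptions).

    f_dec: decoder feature (B, C, H, 2W); its left half aligns with the input image.
    f_enc: encoder feature (B, C, H, W) from the same level.
    """
    def __init__(self, c):
        super().__init__()
        self.fuse = nn.Sequential(            # 1x1 -> 3x3 -> 1x1, as in the paper
            nn.Conv2d(2 * c, c, 1), nn.ReLU(),
            nn.Conv2d(c, c, 3, padding=1), nn.ReLU(),
            nn.Conv2d(c, c, 1))

    def forward(self, f_dec, f_enc):
        w = f_dec.size(3) // 2
        left = f_dec[..., :w]                                # input-aligned half
        fused = self.fuse(torch.cat([left, f_enc], dim=1))   # concat on channels
        left = left + fused                                  # residual connection
        return torch.cat([left, f_dec[..., w:]], dim=3)      # replace left half

merged = SHC(16)(torch.zeros(1, 16, 8, 8), torch.zeros(1, 16, 8, 4))
print(tuple(merged.shape))  # (1, 16, 8, 8)
```

Only the left half of the decoder feature is touched, so the predicted (right) half flows through unchanged while being conditioned on encoder information at every level.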
Besides, to balance the insufficient context of small kernels against the high computation cost of large kernels, we combine the advantages of the Residual Block and Inception into a novel block: the Global Residual Block (GRB), shown in Fig. 5(b).
In GRB, a combination of 1×n and n×1 convolutional layers replaces n×n convolutional layers, a residual connection links the input to the output, and dilated convolutional layers are utilized to "support exponential expansion of the receptive field without loss of resolution or coverage". To strengthen the connection between the original and predicted regions, which are aligned along the horizontal direction, we set a larger receptive field on the horizontal dimension in GRB.²

²We only deploy GRB after the first three SHC layers, because we found it fails to achieve good performance when placed too close to the output layer. After each GRB, we deploy several ResBlocks to compensate for the performance loss caused by the Inception-style architecture and the dilated convolutions.
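A sketch of a GRB with the properties described above follows; the concrete kernel length n, the dilation rate, and the activation are assumptions, chosen so the horizontal receptive field is larger than the vertical one:

```python
import torch
import torch.nn as nn

class GRB(nn.Module):
    """Sketch of a Global Residual Block (kernel sizes/dilations are assumptions)."""
    def __init__(self, c, n=7, dil=2):
        super().__init__()
        # 1xn conv with horizontal dilation: wide receptive field along the
        # direction that binds the input and predicted regions together.
        self.hconv = nn.Conv2d(c, c, (1, n),
                               padding=(0, dil * (n - 1) // 2), dilation=(1, dil))
        # n'x1 conv (here 3x1) with a smaller vertical receptive field.
        self.vconv = nn.Conv2d(c, c, (3, 1), padding=(dil, 0), dilation=(dil, 1))
        self.act = nn.ReLU()

    def forward(self, x):
        y = self.act(self.hconv(x))
        y = self.vconv(y)
        return self.act(x + y)          # global residual connection

y = GRB(8)(torch.zeros(1, 8, 16, 32))
print(tuple(y.shape))  # (1, 8, 16, 32)
```

The separable 1×n / n×1 pair keeps the parameter count near that of a single small kernel while the dilation expands coverage, matching the cost/context trade-off the block is designed for.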
3.2 Loss Function
Our loss function consists of two parts: a masked reconstruction loss and an adversarial loss. The reconstruction loss captures the overall structure of the predicted region and its logical coherence with the input image, focusing on low-order information. The adversarial loss [5, 1, 6] makes the prediction look more realistic by capturing high-order information.
Masked Reconstruction Loss We use the L2 distance between the ground-truth image and the predicted image as our reconstruction loss, weighted by a mask that reduces the L2 weights along the prediction direction. Masked reconstruction losses are prevalent in generative image inpainting [17, 10, 32], because the correlation between ground truth and prediction weakens far from the border. Unlike other mask methods, we use a function that decays the weight to zero: in the predicted region, the weight decreases with the distance to the border between the original and predicted regions, reaching zero at the far edge of the one-step prediction.
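Since the exact decay function is not reproduced here, the sketch below assumes a simple linear decay to zero as one plausible instantiation of such a mask:

```python
import numpy as np

def recon_weight(width_pred):
    """Per-column weights for the masked L2 loss on the predicted region.

    The paper decays the weight to zero with the distance to the border
    between the original and predicted regions; a linear decay is assumed
    here purely for illustration.
    """
    d = np.arange(width_pred)          # distance to the input/prediction border
    return 1.0 - d / width_pred        # 1 at the border, -> 0 at the far edge

def masked_l2(pred, gt, w):
    """Weighted mean squared error; pred/gt are (H, W_pred) arrays."""
    return float(np.mean(w[None, :] * (pred - gt) ** 2))

print(recon_weight(4))  # [1.  0.75 0.5  0.25]
```

Columns near the border, where prediction and ground truth are strongly correlated, are penalized fully, while distant columns are left mostly to the adversarial loss.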
The L2 loss minimizes the mean pixel-wise error, which makes the generator produce a rough outline of the predicted region but results in a blurry, averaged image. To alleviate this blurriness, we add an adversarial loss to capture high-frequency details.
Global and Local Adversarial Loss Following the strategy used in prior work, we deploy one global adversarial loss and one local adversarial loss to make the generated images indistinguishable from real images. We choose a modified Wasserstein GAN for both losses due to its training stability; the only difference between the global and the local adversarial loss is their input.
Specifically, by enforcing a soft version of the Lipschitz constraint with a penalty on the gradient norm for random samples $\hat{x}$, the final WGAN-GP objective becomes

$$L = \mathbb{E}_{\tilde{x}\sim\mathbb{P}_g}\big[D(\tilde{x})\big] - \mathbb{E}_{x\sim\mathbb{P}_r}\big[D(x)\big] + \lambda\,\mathbb{E}_{\hat{x}\sim\mathbb{P}_{\hat{x}}}\Big[\big(\|\nabla_{\hat{x}} D(\hat{x})\|_2 - 1\big)^2\Big].$$
Hence the adversarial loss for the discriminator is the objective above, evaluated with the corresponding critic, and the adversarial loss for the generator is $-\mathbb{E}_{\tilde{x}\sim\mathbb{P}_g}[D(\tilde{x})]$.
In the global adversarial loss, the real and generated samples are the ground-truth images and the entire output (the original input on the left concatenated with the predicted region on the right). In the local adversarial loss, they are the right half of the ground-truth images and the right half of the entire output (the predicted region only).
In summary, the entire loss for the global and local discriminators is the sum of the corresponding global and local adversarial losses, and the entire loss for the generator combines the masked reconstruction loss with the global and local adversarial losses through weighting coefficients. In our experiments, we fix the values of these weighting coefficients and the gradient-penalty weight.
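The gradient-penalty term of WGAN-GP described above can be sketched as follows; `critic` stands for either the global or the local discriminator (the toy critic used to exercise it is not the paper's network):

```python
import torch

def gradient_penalty(critic, real, fake, lam=10.0):
    """WGAN-GP gradient penalty on random interpolates between real and fake.

    Penalizes the critic when the gradient norm of its output with respect
    to the interpolated input deviates from 1 (a soft Lipschitz constraint).
    """
    eps = torch.rand(real.size(0), 1, 1, 1)           # one mixing weight per sample
    xhat = (eps * real + (1 - eps) * fake).requires_grad_(True)
    score = critic(xhat).sum()
    grad, = torch.autograd.grad(score, xhat, create_graph=True)
    norms = grad.flatten(1).norm(2, dim=1)            # per-sample gradient norm
    return lam * ((norms - 1) ** 2).mean()
```

In training, this term is added to the discriminator loss; `create_graph=True` keeps the penalty differentiable so the critic's parameters receive gradients through it.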
3.3 Implementation Details
We prepare a new scenery dataset consisting of diverse, complicated natural scenes, including mountains with or without snow, valleys, seasides, riverbanks, starry skies, etc., split into a training set and a testing set. Part of the dataset comes from the SUN dataset, and we collected the rest from the internet. Fig. 8 shows some examples. We conduct a series of comparative experiments to test our model on one-step prediction,³ and we also show the strong representation ability of our architecture on multi-step prediction.

³In our experiments, we perform natural scenery image outpainting only along the horizontal direction because of the limitation of our collected data; theoretically, our network can work along any direction after modifications.
4.1 One-step Prediction
To train our model, we use the Adam optimizer to minimize the loss functions defined in Equation 7 and Equation 6. Before the formal training, we train the generator alone for 1000 iterations. In the formal training, following the training method of prior work, the discriminator updates its parameters several times for each generator update, with extra discriminator updates early in training and at regular intervals thereafter. The learning rate is divided by 10 after a fixed number of epochs.
During training, each image is resized to 144×432, and a 128×256 image is randomly cropped from it or its horizontal flip. During testing, we resize each image to 128×256.
Table 2: ablation on the number of GRB modules, measured by IS and FID.
Comparison with Previous Works We compare with the latest generative methods,⁴ including Pix2Pix, GLC, and Contextual Attention (CA), which were originally designed for image inpainting. The comparison is shown in Fig. 9; our method achieves the best generation quality thanks to our designed architecture.

⁴We make some modifications to their implementations for image outpainting.
We employ the Inception Score (IS) and Fréchet Inception Distance (FID) to measure generation quality objectively, and report them in Table 3. Our method achieves the best FID, though its IS is slightly lower than that of CA. This is because CA employs a contextual attention mechanism, which reuses features from the original region to reconstruct the prediction. However, as shown in Figs. 9 and 10, contextual attention makes predictions worse far from the original input, which leads to a poor FID for CA compared with ours. Contextual attention is effective for small-region prediction (such as inpainting), but is not suitable for long-range outpainting.
Ablation Study First, we conduct ablation studies to demonstrate the necessity of SHC and RCT. The qualitative comparison is shown in Fig. 7, in which we compare our full architecture with models without SHC or RCT. According to the experimental results, SHC successfully mitigates the unsmoothness between the predicted and original regions, and RCT effectively improves the representation ability of the model, making the details of the prediction more delicate. Second, we conduct an ablation study on GRB: as shown in Table 2, performance improves as more GRB modules are used, which demonstrates their effectiveness.
4.2 Multi-Step Prediction
In this section, we use the model trained in Section 4.1 for multi-step prediction experiments. To make multi-step predictions, we use the predicted output from the previous step as the input for the next step; by concatenating the results from each step, we obtain a very long image.
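This recursive procedure can be sketched in a few lines; `predict` stands for any trained one-step model that maps an input strip to an output twice as wide (the toy model below is only for illustration):

```python
import numpy as np

def multi_step_outpaint(predict, image, steps):
    """Iterate a 1-step outpainting model to build an arbitrarily long image.

    `predict` maps an (H, W) input to an (H, 2W) output whose left half
    repeats the input; each step feeds the newly predicted right half
    back in as the next input.
    """
    panorama = image
    current = image
    for _ in range(steps):
        out = predict(current)
        right = out[:, out.shape[1] // 2:]     # keep only the new region
        panorama = np.concatenate([panorama, right], axis=1)
        current = right                        # prediction becomes next input
    return panorama

# Toy demo: a fake 1-step "model" that brightens each new strip by 1.
demo = multi_step_outpaint(lambda im: np.concatenate([im, im + 1], axis=1),
                           np.zeros((2, 4)), steps=3)
print(demo.shape)  # (2, 16)
```

Each iteration appends one prediction width, so n steps extend the image by n times the one-step prediction size.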
We experiment with one-sided prediction over a very long range (Fig. 11) and with prediction on both sides (Fig. 12). Both experiments show the powerful representational capability of our architecture. Thanks to RCT, our model allows long-term prediction with only a small increase in noise.
Besides, we compare our method with previous works, namely Pix2Pix, GLC, and CA, on multi-step prediction. The comparison is shown in Fig. 10. Again, the result consistency of Pix2Pix, GLC, and CA drops dramatically in this setting. FC+SHC achieves better consistency, but still suffers from heavy blur; in particular, sharp edges appear in the predictions far from the original input. By replacing the FC module with RCT, our method achieves the best performance in both consistency and sharpness.
A Hard Case Example. We test our method on difficult cases that are hard for previous works based on image matching. One example is shown in Fig. 13: even when the given input is nearly unobservable due to darkness, our method still generates a highly realistic snow mountain.
5 Conclusion and Future Work
We design a novel end-to-end network to solve the image outpainting problem; to the best of our knowledge, it is the first approach to utilize a deep neural network for this task. With the carefully designed Recurrent Content Transfer, Skip Horizontal Connection, and Global Residual Block, our network can generate high-quality images of extended length. We collect a new natural scenery dataset and conduct a series of experiments on it, in which our proposed method achieves the best performance. Moreover, the proposed method can generate extremely long images by iterating the model, which is unprecedented.
In future work, we would like to explore extrapolating images in both horizontal and vertical directions simultaneously with a single model. We also plan to design a specialized training process for multi-step prediction.
- M. Arjovsky, S. Chintala, and L. Bottou (2017) Wasserstein GAN. arXiv preprint arXiv:1701.07875.
- M. Bertalmio, G. Sapiro, V. Caselles, and C. Ballester (2000) Image inpainting. In Proceedings of the 27th Annual Conference on Computer Graphics and Interactive Techniques (SIGGRAPH), pp. 417–424.
- E. Denton, S. Chintala, A. Szlam, and R. Fergus (2015) Deep generative image models using a Laplacian pyramid of adversarial networks. In Advances in Neural Information Processing Systems, pp. 1486–1494.
- X. Dong, Y. Yan, W. Ouyang, and Y. Yang (2018) Style aggregated network for facial landmark detection. In CVPR, pp. 379–388.
- I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014) Generative adversarial nets. In Advances in Neural Information Processing Systems, pp. 2672–2680.
- I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. Courville (2017) Improved training of Wasserstein GANs. In Advances in Neural Information Processing Systems, pp. 5767–5777.
- K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In CVPR, pp. 770–778.
- M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter (2017) GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In NIPS.
- S. Hochreiter and J. Schmidhuber (1997) Long short-term memory. Neural Computation 9(8), pp. 1735–1780.
- S. Iizuka, E. Simo-Serra, and H. Ishikawa (2017) Globally and locally consistent image completion. ACM Transactions on Graphics (TOG) 36(4), pp. 107.
- S. Ioffe and C. Szegedy (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167.
- P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros (2017) Image-to-image translation with conditional adversarial networks. In CVPR.
- D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980.
- J. Kopf, W. Kienzle, S. Drucker, and S. B. Kang (2012) Quality prediction for image completion. ACM Transactions on Graphics (TOG) 31(6), pp. 131.
- M. Mathieu, C. Couprie, and Y. LeCun (2015) Deep multi-scale video prediction beyond mean square error. arXiv preprint arXiv:1511.05440.
- S. Osher, M. Burger, D. Goldfarb, J. Xu, and W. Yin (2005) An iterative regularization method for total variation-based image restoration. Multiscale Modeling & Simulation 4(2), pp. 460–489.
- D. Pathak, P. Krähenbühl, J. Donahue, T. Darrell, and A. A. Efros (2016) Context encoders: feature learning by inpainting. In CVPR, pp. 2536–2544.
- A. Radford, L. Metz, and S. Chintala (2015) Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434.
- O. Ronneberger, P. Fischer, and T. Brox (2015) U-Net: convolutional networks for biomedical image segmentation. In MICCAI, pp. 234–241.
- T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen (2016) Improved techniques for training GANs. In NIPS.
- P. Sangkloy, J. Lu, C. Fang, F. Yu, and J. Hays (2017) Scribbler: controlling deep image synthesis with sketch and color. In CVPR.
- K. Simonyan and A. Zisserman (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
- J. Sivic, B. Kaneva, A. Torralba, S. Avidan, and W. T. Freeman (2008) Creating and exploring a large photorealistic virtual space. In CVPR Workshops.
- R. K. Srivastava, K. Greff, and J. Schmidhuber (2015) Highway networks. arXiv preprint arXiv:1505.00387.
- C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich (2015) Going deeper with convolutions. In CVPR, pp. 1–9.
- D. Ulyanov, A. Vedaldi, and V. Lempitsky (2016) Instance normalization: the missing ingredient for fast stylization. arXiv preprint arXiv:1607.08022.
- M. Wang, Y.-K. Lai, Y. Liang, R. R. Martin, and S.-M. Hu (2014) BiggerPicture: data-driven image extrapolation using graph matching. ACM Transactions on Graphics (TOG) 33(6), pp. 173:1–173:13.
- J. Xiao, J. Hays, K. A. Ehinger, A. Oliva, and A. Torralba (2010) SUN database: large-scale scene recognition from abbey to zoo. In CVPR, pp. 3485–3492.
- C. Yang, X. Lu, Z. Lin, E. Shechtman, O. Wang, and H. Li (2017) High-resolution image inpainting using multi-scale neural patch synthesis. In CVPR.
- F. Yu and V. Koltun (2015) Multi-scale context aggregation by dilated convolutions. arXiv preprint arXiv:1511.07122.
- J. Yu, Z. Lin, J. Yang, X. Shen, X. Lu, and T. S. Huang (2018) Generative image inpainting with contextual attention. In CVPR, pp. 5505–5514.
- M. D. Zeiler, G. W. Taylor, and R. Fergus (2011) Adaptive deconvolutional networks for mid and high level feature learning. In ICCV, pp. 2018–2025.
- Y. Zhang, J. Xiao, J. Hays, and P. Tan (2013) FrameBreak: dramatic image extrapolation by guided shift-maps. In CVPR, pp. 1171–1178.
- J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros (2017) Unpaired image-to-image translation using cycle-consistent adversarial networks. In ICCV.