Deep Fusion Network for Image Completion

04/17/2019 ∙ by Xin Hong, et al. ∙ Megvii Technology Limited ∙ Institute of Computing Technology, Chinese Academy of Sciences

Deep image completion usually fails to harmonically blend the restored content into the existing image, especially in the boundary area. This paper addresses the problem from a new perspective of creating a smooth transition and proposes a concise Deep Fusion Network (DFNet). First, a fusion block is introduced to generate a flexible alpha composition map for combining the known and unknown regions. The fusion block not only provides a smooth fusion between restored and existing content, but also provides an attention map that makes the network focus more on the unknown pixels. In this way, it builds a bridge for structural and texture information, so that information can be naturally propagated from the known region into the completion. Furthermore, fusion blocks are embedded into several decoder layers of the network. Accompanied by adjustable loss constraints on each layer, more accurate structure information is achieved. We qualitatively and quantitatively compare our method with other state-of-the-art methods on the Places2 and CelebA datasets. The results show the superior performance of DFNet, especially in the aspects of harmonious texture transition, texture detail, and semantic structural consistency. Our source code will be available at: <>







1 Introduction

Image completion, which aims to fill the unknown region of an image, is a fundamental task in computer vision. It can be broadly applied to image editing tasks such as old photo recovery, object removal, and seamless inpainting of damaged images. For most such applications, it is critical to generate perceptually plausible completion results, specifically with a natural transition between the known and unknown regions.

Previous approaches based on deep learning have shown great progress on the image completion task [18, 36, 35, 11, 21, 24, 30, 37, 34, 10, 8, 26]. As mentioned in [2], these methods can be divided into two groups. One group of works focuses on building a contextual attention architecture or applying effective loss functions to generate more realistic content in the missing area. They assume the gaps should be filled with content similar to the background. A typical arrangement is applying Partial Convolutions [18] to concentrate on the unknown region. Other methods regard structural consistency as more important. Context priors such as edges are most frequently used in these methods to ensure structural continuity. For instance, [21] proposed the Edge Connect method, which can recover images with good semantic structural consistency. These approaches are dedicated to inferring the unknown region with visually realistic and semantically related content. However, realizing a smooth transition is more critical than restoring texture-rich images in most scenarios, as shown in Figure 1.

Humans have an incredible ability to detect discontinuous transition regions. Consequently, the filled region must be perceptually plausible in the transition zone, with sufficiently similar texture and consistent structure. In order to achieve a smooth transition, [25] proposed a method to iteratively optimize the pixel gradient in the edge transitional region. Given two images, the fusion quality depends on the consistency of the gradient changes between them, which is similar to the relationship between the restored content and the known region in image completion. This inspires us to build a network to simulate the composition process.

In this work, we design a learnable fusion block to perform pixel-level fusion in the transition region. As shown in Figure 2, we introduce a fusion block that can be embedded into an encoder-decoder structure. Different from previous methods, we develop an extra convolutional block to generate an alpha map, which is similar to the hole mask but has smoother weights, especially in the boundary region. During gradient-descent optimization, the alpha composition map adjusts the balance between the restored content and the ground-truth content to make the transition smoother. Similar ideas have also been used in image matting [5, 22]. However, the purpose of those methods is to extract the smooth coefficients from background and foreground images, while the proposed fusion block combines them together.

In detail, we propose a Deep Fusion Network (DFNet). Firstly, a fusion block is adopted as an adaptable module to combine the restored part of the image with the original image. In addition to providing a smooth transition, the fusion block avoids learning an unnecessary identity mapping for pixels in the known region, and provides an attention map to make the network focus more on the missing pixels. With the fusion block, structural and texture information can be naturally propagated from the known region into the unknown region. Secondly, we embed this module into different decoder layers. We find that by constraining the predictions of different fusion blocks with multi-scale constraints, the deep fusion network outperforms a network with only one fusion block embedded in the final layer. Furthermore, since different layers provide different feature representations, we selectively switch structure and texture losses on and off, to recover structural information in lower layers and refine texture details in higher layers. The whole architecture of DFNet is displayed in Figure 4.

The proposed DFNet is evaluated on two standard benchmarks, Places2 and CelebA. To better verify the proposed method, we define the Boundary Pixels Error to measure the transition performance near the boundary of the unknown region. In addition, ℓ1 error and FID are applied to verify global texture and consistency. Experiments demonstrate the superior performance of DFNet compared with other state-of-the-art methods in both quantitative and qualitative aspects. It achieves better results not only in smooth texture transition but also in structural consistency and texture detail. In conclusion, the main contributions can be summarized as follows:

  • We investigate the image completion problem from the perspective of an improved transition region and propose a fusion block that predicts an alpha composition map to achieve smooth transitions.

  • The fusion block avoids learning an unnecessary identity mapping for the known region and provides an attention mechanism. In this way, structure and texture information can propagate more naturally from the known region into the completion.

  • We propose Deep Fusion Network, a U-Net architecture embedded with multiple fusion blocks to apply multi-scale constraints.

  • A new measurement, Boundary Pixels Error (BPE), is introduced to measure the transition performance near the boundary of the missing hole.

  • The results on Places2 and CelebA show that our method outperforms state-of-the-art methods in both qualitative and quantitative aspects.

2 Related Work

Context Aware Context-aware image completion methods assume the semantic content can be filled based on the overall scene. Context Encoders[24] introduces an encoder-decoder network to restore images from damaged inputs with holes. It applies a discriminator to increase the authenticity of the restored images. Yang et al.[33] take its result as input and then propagate texture information from the known region to fill the missing area. Li et al.[17] and Iizuka et al.[11] extend Context Encoders by defining both global and local discriminators to pay more attention to the missing areas; Iizuka et al. apply Poisson Blending[25] as post-processing. Liu et al.[18] introduce partial convolution layers to avoid capturing too many zeros from the unknown region. These methods depend entirely on the training images to generate semantically relevant structures and textures.

Texture Generation In the field of texture generation, perceptual loss is adopted to fill in visually realistic content for missing regions. Liu et al.[18] apply perceptual loss[6, 14], which uses a VGG[29] network as a feature extractor; the loss is computed on the extracted high-level features to achieve higher-resolution textures in the completion. Other methods usually rely on a GAN[7] loss to obtain better details. For instance, Yu et al.[36] replace the post-processing with a refinement network powered by contextual attention layers.

Figure 2: Illustration of the Fusion Block. A fusion block extracts a raw completion from the feature maps with a learnable function f, predicts an alpha composition map with a second learnable function g, and finally combines the raw completion with the scaled input image by alpha blending. The details of the block can be found in Section 3.1.

Structure constraints To better control the completion behaviour of networks, other works[30, 35, 21] explore providing extra information for inpainting. Song et al.[30] use a DeepLabv3+[3] model to first predict a segmentation map, and then complete the unknown region with the predicted segmentation map as a prior. Yu et al.[35] propose gated convolution, which generalizes partial convolution, and the new structure is compatible with user guides, usually strokes that indicate edges. Like Song et al., the authors of [21] use a two-stage network for completion: it first completes the edges corresponding to the input image and then uses the completed edges to guide the full-color completion. To some extent, these methods can manually control the completion result by replacing the priors with custom ones or supplying extra edge information.

Image Embedding As work closely related to image completion, image embedding and matting have also been studied over the past decades. [25] proposes a method to iteratively optimize the pixel gradient in the edge transitional region. Poisson matting[31] first introduced a Poisson blending method into alpha matting by solving a Poisson equation, which demonstrates the effectiveness of alpha composition. Deep Image Matting[22] also generates an alpha map with an encoder-decoder network. Cho et al.[5] take the matting results of [1] and normalized RGB colors as inputs and learn an end-to-end deep network to predict a new alpha matte. These methods show that deep-learning-based alpha matting is more realistic for image embedding and matting.

Figure 3: Corresponding results in a fusion block.
Figure 4: Overview of our Deep Fusion Network (DFNet). DFNet is based on a U-Net, like the one used in [13, 18]. The difference from the traditional U-Net is that we embed fusion blocks into the last few decoder layers. During training, each fusion block produces a completion result from the corresponding feature maps, with the same resolution as those feature maps, so that different constraints can be applied to each completion result as needed. During testing, only the completion result from the last layer needs to be produced.

3 Deep Fusion Network

Deep Fusion Network is built on a U-Net[27]-like architecture, which is widely used in recent image segmentation[20] and image-to-image translation[13, 32, 4, 16] tasks. The difference between our DFNet and the original U-Net is that we embed fusion blocks into several decoder layers. Fusion blocks help us achieve a smoother transition near the boundary and are the key components of our multi-scale constraints. In this section, we first introduce the fusion block and then discuss our network architecture and loss functions.

3.1 Fusion Block

The task of image completion is to restore the missing area with visually plausible content, given a damaged image x and a binary mask m that marks the location of the unknown region (m = 1 inside the unknown region).

Recent deep-learning-based methods usually predict the whole image ŷ, including even the known region, and use it to calculate the loss during training. However, they take ŷ ⊙ m + x ⊙ (1 − m) (where m is 1 inside the unknown region and ⊙ denotes the Hadamard product) rather than ŷ itself for testing. This composition replaces the known region in ŷ with the corresponding pixels of x. Only a few methods[18] use both ŷ and the composited result to compute the loss.
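This hard composition can be sketched in a few lines of NumPy; the names are illustrative, and the convention here is that the mask is 1 inside the unknown region:

```python
import numpy as np

def hard_compose(y_hat, x, m):
    """Hard composition used at test time: keep the prediction only inside
    the unknown region (m == 1) and copy the known pixels from x verbatim."""
    m = m.astype(y_hat.dtype)
    if m.ndim == y_hat.ndim - 1:      # broadcast an (H, W) mask over channels
        m = m[..., None]
    return y_hat * m + x * (1.0 - m)
```

Because known pixels are copied verbatim and unknown ones come straight from the prediction, any mismatch along the mask boundary becomes a visible seam, which is exactly what the fusion block below is designed to avoid.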

This training strategy has problems. Firstly, the mission of image completion is to complete only the unknown region; it is actually hard to complete the missing hole while keeping a strict identity mapping for the known area. Secondly, the inconsistent use of the prediction and the composited result between training and testing, along with the rigid composition, usually produces visible artifacts around the boundary of the missing area. As shown in the first case of Figure 1, the result of Edge Connect[21] has a clear edge at the boundary of the completion.

To remove the artifacts around the boundary and avoid the network learning an unnecessary identity mapping, we propose the Fusion Block. As shown in Figure 2, a fusion block is fed two inputs: the input image x with its unknown region, and the feature maps F_k from decoder layer k (layer 1 being the last decoder layer of the U-Net). The fusion block first extracts a raw completion Ĉ_k from the feature maps, and then predicts an alpha composition map α_k to combine them. The final result is obtained by:

ŷ_k = α_k ⊙ Ĉ_k + (1 − α_k) ⊙ x_k

We resize x to obtain x_k, matching the resolution of F_k. The raw completion is extracted from the feature maps by a learnable function f:

Ĉ_k = f(F_k)

f transforms the multi-channel feature maps into a 3-channel image with the resolution unchanged, which is exactly the raw completion. In practice, we use a single convolutional layer followed by a sigmoid function to learn f.


The alpha composition map α_k is produced by another learnable function g from the raw completion and the scaled input image:

α_k = g([Ĉ_k, x_k])

α_k has two choices for the number of channels: either a single channel for image-wise alpha composition, or 3 channels for channel-wise alpha composition. In practice, we find channel-wise alpha composition performs better. As for g, we use three convolutional layers with kernel sizes 1, 3 and 1. The first two convolutional layers are each followed by a Batch Normalization[12] layer and a leaky ReLU function, and we apply a sigmoid function to the output of the last convolutional layer.

The fusion block enables the network to avoid learning an unnecessary identity mapping while completing the unknown region with a soft transition near the boundary. We give an example of the corresponding images in a fusion block in Figure 3. Completion performance can be further improved with multi-scale constraints by embedding fusion blocks into the last few decoder layers of the U-Net.
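Putting the pieces together, a fusion block can be sketched in PyTorch as follows. The kernel size of f and the hidden width of g are assumptions not stated in this excerpt; the kernel sizes 1, 3, 1 of g, the BatchNorm/leaky-ReLU placement, and the blending formula follow the description above:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FusionBlock(nn.Module):
    """Sketch of the fusion block in Section 3.1 (layer widths are assumptions)."""

    def __init__(self, in_channels, alpha_channels=3):
        super().__init__()
        # f: feature maps -> 3-channel raw completion (3x3 kernel is an assumption)
        self.raw = nn.Sequential(nn.Conv2d(in_channels, 3, 3, padding=1), nn.Sigmoid())
        # g: [raw completion, resized input] -> alpha map; kernel sizes 1, 3, 1
        self.alpha = nn.Sequential(
            nn.Conv2d(6, 32, 1), nn.BatchNorm2d(32), nn.LeakyReLU(0.2),
            nn.Conv2d(32, 32, 3, padding=1), nn.BatchNorm2d(32), nn.LeakyReLU(0.2),
            nn.Conv2d(32, alpha_channels, 1), nn.Sigmoid(),
        )

    def forward(self, feats, image):
        # resize the input image to the feature-map resolution
        x_k = F.interpolate(image, size=feats.shape[2:], mode='bilinear',
                            align_corners=False)
        c_k = self.raw(feats)                       # raw completion
        a_k = self.alpha(torch.cat([c_k, x_k], 1))  # alpha composition map
        return a_k * c_k + (1.0 - a_k) * x_k        # blended result
```

Setting `alpha_channels=1` gives the image-wise variant; with 3 channels the block performs the channel-wise composition the text reports as working better.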

3.2 Network Architecture

It’s intuitive that when completing an image, constructing structures is easier at lower resolution, while recovering texture is more feasible at higher resolution. We embed fusion blocks into the last few decoder layers of the U-Net and obtain completion results at different resolutions, and can then apply structure and texture constraints to each resolution as desired. The overview of our DFNet is shown in Figure 4. We choose a U-Net[27] like the one used in [13, 18] as our backbone architecture; the difference is that the last few decoder layers are embedded with fusion blocks. Each fusion block outputs a completion result with the same resolution as its input feature maps. According to their resolutions, we can provide different constraints during training; we discuss these constraints in Section 3.3. During testing, only the completion result from the last layer is needed.
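The multi-resolution wiring can be sketched abstractly; the names `decoder_feats` and `fusion_blocks`, and the coarse-to-fine ordering with the full-resolution output last, are assumptions for illustration:

```python
def multi_scale_completions(decoder_feats, fusion_blocks, image):
    """Run one fusion block on each of the last few decoder feature maps.

    decoder_feats: decoder outputs ordered coarse -> fine; fusion_blocks:
    one callable per constrained layer. Returns one completion result per
    resolution; the last entry is the full-resolution final output.
    """
    constrained = decoder_feats[-len(fusion_blocks):]
    return [block(feats, image) for feats, block in zip(constrained, fusion_blocks)]
```

At test time only the last element of the returned list would be used, matching the caption of Figure 4.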

3.3 Loss Functions

The target of image completion is to generate visually plausible results in terms of both structure and texture. Reconstruction loss, the mean absolute error of each pixel between prediction and ground truth, is usually used to guarantee accurate structures in completion results. However, high-resolution texture is beyond the capability of reconstruction loss. Previous works use a GAN loss [7], or a perceptual loss along with a style loss [14], to obtain vivid textures. These losses share the same drawback of producing checkerboard and fish-scale artifacts[18]. Total variation loss is usually used to counter this drawback. Results from [18] show that the artifact can also be reduced noticeably by increasing the weight of the style loss.

Reconstruction Loss Reconstruction loss is defined as the mean absolute error between the completion result ŷ and the target image y:

L_rec = (1 / (C·H·W)) · ‖ŷ − y‖₁

where C is the number of channels, H the height and W the width.
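As a one-line NumPy sketch, for arrays of shape (C, H, W):

```python
import numpy as np

def reconstruction_loss(pred, target):
    """Mean absolute error averaged over all C*H*W elements."""
    return np.abs(pred - target).mean()
```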

Perceptual Loss and Style Loss Perceptual loss and style loss were first used in style transfer[6, 14]. They use a pre-trained VGG network to extract high-level features, and the errors are computed between these features rather than the original images. Let Φ_l(y) be the features of the l-th layer of a VGG network given image y, with size C_l × H_l × W_l. Perceptual loss is defined as the error between these features:

L_perc = Σ_{l∈P} (1 / (C_l·H_l·W_l)) · ‖Φ_l(ŷ) − Φ_l(y)‖₁

where P is the set of selected VGG layers. The Gram matrix G_l is a C_l × C_l matrix whose elements are defined as:

G_l(y)_{i,j} = (1 / (C_l·H_l·W_l)) · Σ_{h,w} Φ_l(y)_{i,h,w} · Φ_l(y)_{j,h,w}

The style loss is then the mean absolute error between the corresponding Gram matrices of the output and target images:

L_style = Σ_{l∈P} ‖G_l(ŷ) − G_l(y)‖₁

Style loss doesn’t consider the position of pixels but captures which high-level features appear together[14], so it is better suited to constraining the overall style of an image.
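A NumPy sketch of the Gram matrix and style loss above, for feature maps of shape (C, H, W) (the normalization by C·H·W follows the definition given here):

```python
import numpy as np

def gram_matrix(feat):
    """Gram matrix of a (C, H, W) feature map, normalised by C*H*W."""
    c, h, w = feat.shape
    f = feat.reshape(c, h * w)
    return f @ f.T / (c * h * w)

def style_loss(feats_pred, feats_true):
    """Mean absolute error between Gram matrices over selected layers.

    feats_pred / feats_true: lists of (C, H, W) arrays from the same layers.
    """
    return sum(np.abs(gram_matrix(p) - gram_matrix(t)).mean()
               for p, t in zip(feats_pred, feats_true))
```

Because the Gram matrix sums over all spatial positions, two images with identical feature statistics but different layouts yield the same style loss, which is why it constrains style rather than structure.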

Total Variation Loss Total variation loss is computed using only the prediction: each pixel contributes its absolute differences with the pixel above and the pixel to the left. This can be implemented easily with a convolution layer with a fixed kernel.
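The neighbour differences can equivalently be written with array slicing instead of the fixed-kernel convolution the text mentions; a minimal NumPy sketch:

```python
import numpy as np

def total_variation_loss(img):
    """Mean absolute difference of each pixel with its top and left neighbours.

    img: array whose last two axes are height and width.
    """
    dh = np.abs(img[..., 1:, :] - img[..., :-1, :]).mean()  # vertical diffs
    dw = np.abs(img[..., :, 1:] - img[..., :, :-1]).mean()  # horizontal diffs
    return dh + dw
```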

Total Loss. We group the loss functions into a Structure Loss and a Texture Loss. The structure loss at layer k is a weighted reconstruction loss:

L_struct^k = λ · L_rec^k

And the texture loss is a combination of three losses:

L_text^k = λ_perc · L_perc^k + λ_style · L_style^k + λ_tv · L_tv^k

Our final loss is the sum of the structure and texture losses over the completion results at different resolutions:

L_total = Σ_{k∈S} L_struct^k + Σ_{k∈T} L_text^k

where S is the set of layers on which we apply the structure loss and T the set for the texture loss. For brevity, we write P_{m,n} for the choice of S and T that takes the last m layers as S and the last n layers as T; for example, P_{2,1} means completion results from the last two layers of the U-Net are used to compute the structure loss and only the last one is used for the texture loss. The corresponding term is dropped if S or T is empty. We discuss the choice of S and T in Section 4.3.
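The per-layer bookkeeping reduces to two sums; a sketch under the assumption that the completion results and correspondingly resized targets are stored per layer index (the dict interface is illustrative, not from the paper):

```python
def total_loss(results, targets, structure_layers, texture_layers,
               structure_fn, texture_fn):
    """Sum structure and texture losses over the chosen layer sets.

    results/targets map a layer index to the completion result and the
    correspondingly resized target; structure_fn/texture_fn are any
    callables of (prediction, target).
    """
    loss = sum(structure_fn(results[k], targets[k]) for k in structure_layers)
    loss += sum(texture_fn(results[k], targets[k]) for k in texture_layers)
    return loss
```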

Figure 5: Effectiveness of the Fusion Block, Multi-scale Constraints, and Loss Ablation. (a) compares the results of the proposed network without a mask, with a fixed mask, and with the fusion block. (b) depicts the results with 1, 3, and 6 fusion blocks respectively. (c) shows the effects of structure loss and texture loss, and demonstrates the effectiveness of combining them.
Figure 6: Evaluation of different choices of the layers that receive structure and texture constraints (see Section 3.3 for the detailed description). We separately compare models under different ranges of mask area. The results have been normalized by subtracting the minimal value of the corresponding comparison; for all three metrics, lower means better performance. In (a), (b) and (c), we compare six choices that gradually add fusion blocks while keeping the two layer sets equal. The results show that performance improves with multi-scale constraints. We further compare four other choices and select the best-performing one as our final model. More analysis can be found in Section 4.3.2 and Section 4.3.3.

4 Experiments

4.1 Experiment Details

We evaluate DFNet on two public datasets: Places2[38] and CelebA[19]. For Places2, we use the original train, test, and val splits. For CelebA, we randomly partition the images into 27K for training and 3K for testing. Images in Places2 and CelebA are resized to fixed resolutions during training and testing. We randomly generate 1000 masks according to the method in [35] and augment these masks during training. To analyze the influence of the unknown-region size, the masks are categorized into five classes: [0, 10%), [10%, 20%), [20%, 30%), [30%, 40%), [40%, 50%).

Models are trained separately on each dataset. Our proposed model is implemented in PyTorch[23] and trained on a single machine with 8 GeForce RTX 2080 Ti GPUs. We use Horovod[28] as our distributed training framework. With a batch size of 6 per GPU, it usually takes about 3 days to train a model. Inference is extremely fast: an image is completed in a single forward pass. As a common configuration, Adam[15] is applied for optimization, and the learning rate is decayed step-wise over 20 epochs.

4.2 Evaluation Metrics

Unlike tasks such as image classification, detection and segmentation, image generation usually has no strict target; the basic rule is to be visually plausible. Image completion additionally requires the completion not only to look real but also to transition naturally from the known region. We therefore apply ℓ1 error and the Fréchet Inception Distance (FID)[9] as evaluation metrics, at the level of pixels and features respectively, to quantitatively analyze the performance of DFNet.

Furthermore, we observe that pixels of the unknown region near the boundary have very small variance, yet these pixels play the most important role in the structure and texture transition. To measure the transition performance of a model, we propose the Boundary Pixels Error (BPE), which considers only the pixel error near the boundary. For the boundary area B, a narrow band of pixels adjacent to the boundary of the unknown region, BPE is the mean absolute error of those pixels between the ground truth y and the prediction ŷ:

BPE = (1 / |B|) · Σ_{p∈B} |y_p − ŷ_p|
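A NumPy sketch of BPE follows. The boundary band is obtained here by iterating a 4-neighbour dilation of the known region into the unknown region; the band width is an assumption, since the excerpt leaves it unspecified:

```python
import numpy as np

def boundary_band(mask, width=4):
    """Unknown pixels (mask == 1) within `width` pixels of the known region,
    found by iterating a 4-neighbour dilation of the known region.
    The band width is an assumption, not a value from the paper."""
    unknown = mask.astype(bool)
    grown = ~unknown                     # start from the known region
    for _ in range(width):
        g = grown.copy()
        g[1:, :] |= grown[:-1, :]
        g[:-1, :] |= grown[1:, :]
        g[:, 1:] |= grown[:, :-1]
        g[:, :-1] |= grown[:, 1:]
        grown = g
    return grown & unknown

def bpe(pred, target, mask, width=4):
    """Boundary Pixels Error: mean absolute error over the boundary band.

    pred/target: (H, W) or (H, W, C) arrays; mask: (H, W), 1 = unknown."""
    band = boundary_band(mask, width)
    return np.abs(pred[band] - target[band]).mean()
```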

Methods      |          ℓ1           |          BPE          |              FID
Mask ratio   | 0.1   0.2   0.3   0.4   0.5 | 0.1   0.2   0.3   0.4   0.5 | 0.1    0.2    0.3    0.4    0.5
DeepFill     | 2.79  6.75  10.63 15.35 28.38 | 1.33  1.81  2.39  2.91  5.13 | 24.04  56.55  98.25  173.90 324.97
PConv        | 1.51  4.22  7.01  10.52 12.83 | 0.17  0.37  0.62  0.87  1.60 | 14.98  41.21  84.60  166.72 217.48
EdgeConnect  | 1.43  3.94  6.41  9.64  11.38 | 0.33  0.69  1.11  1.48  2.55 | 19.24  35.91  68.29  131.16 147.51
DFNet        | 1.40  3.91  6.50  9.89  11.96 | 0.15  0.33  0.55  0.74  1.42 | 12.27  34.64  65.25  127.58 136.22
Table 1: Quantitative comparison with other methods on Places2 (columns give the mask-area ratio).

4.3 Analysis of DFNet architecture

In this section, we investigate the performance of the proposed modules in DFNet. First, we show the effectiveness of the fusion blocks. Then we focus on the effect of multi-scale constraints by gradually adding fusion blocks to DFNet and evaluating it. Finally, we discuss how to apply structure loss and texture loss to the different-resolution completion results to achieve the best results.

4.3.1 Effectiveness of Fusion Block.

We compare our results with predictions from a normal U-Net, and with predictions from a DFNet that directly uses the mask in place of the alpha composition map. We use only one fusion block for a fair comparison, i.e., only the last decoder layer contributes to the structure and texture losses (Section 3.3).

As can be seen in the 1st row of Figure 5, the fusion block leads to the best transition near the boundary. Although most of the semantic information has been restored, there is an obvious color-transition inconsistency in the result of the standard network without mask constraints. This is because global semantic-consistency constraints can only lead to similar texture in the missing area; structural consistency cannot be guaranteed. With the mask constraint, the pixel transition in the filled area becomes more natural, which demonstrates the effect of the proposed method on the propagation of structural and texture information. As mentioned above, the alpha composition map is an attention mechanism that enhances structural consistency. Furthermore, the result of the learned alpha map is even better in the edge transition, eliminating the visible artifacts near the boundary.

The same detailed conclusion can be drawn from Figure 3. With the proposed fusion block, the structure between the known and unknown areas is well preserved, even beyond the mask area: the sharp edge of the roof is retained in the reconstructed image while other, unhelpful parts are discarded.

4.3.2 Multi-scale constraints.

We compare DFNets with different numbers of fusion blocks, from one to six. Formally, the sets of layers receiving structure and texture constraints (Section 3.3) both grow from one layer to six. In this section the two sets are kept equal, so as to analyze only the role of multi-scale fusion.

As can be seen in the 2nd row of Figure 5, the structure of the building becomes clearer and more accurate with more fusion blocks. The shapes of the houses are depicted in the result with 3 fusion blocks, in contrast to the noise in the result with 1 fusion block. Since higher-level encoder layers have a bigger receptive field and more global context, structural information can be reconstructed more easily when more decoder layers are constrained. Nevertheless, although the result with 6 fusion blocks retains this structural information, its texture is less stable than with 3 fusion blocks. We conjecture this is because texture constraints should not be applied to low-resolution completion results. In the next section, we discuss in more detail how to choose the number of constrained layers.

We also give a quantitative comparison in the second row of Figure 6. Results are separated according to the mask-area range for each evaluation metric. As fusion blocks are added, FID gets lower and lower, which means the multi-layer constraints help capture contextual information and make the whole image look more real. The BPE increases slightly as fusion blocks are added, which can be explained as a trade-off between finer texture and smoother transition. However, the global visual effect is more important, and the change in BPE is actually very small.

4.3.3 Loss Ablation and Tuning.

Firstly, the effects of the structure loss and texture loss are shown by training DFNets that each apply only one of them. As seen in the 3rd row of Figure 5, the result without texture loss is blurry but has accurate structural consistency, while the result without structure loss completely destroys the structure: it fails to recover object edges although the textures are finer. This provides strong evidence for the loss design in this paper.

We further discuss the loss design at each layer. Based on the visualization results in Section 4.3.2, we make a comprehensive comparison of loss designs across layers. As shown in Figure 6, the performance exhibits the same trend across different ranges of hole size. In the final architecture, we compute the structure loss on all fusion-block layers and restrict the texture loss to the last, highest-resolution layers. This can be explained as follows: although structural information becomes more abundant in higher layers, applying texture constraints to low-resolution, high-level features leads to texture noise due to the loss of global semantic constraints.

(a) Input image
(b) DeepFill[36]
(c) PConv[18]
(d) Edge Connect[21]
(e) Ours DFNet
(f) Ground Truth
Figure 7: Comparison results on Places2 and CelebA. More results can be found in the supplementary materials.

4.4 Comparisons with Other Methods

We quantitatively and qualitatively compare our DFNet with three recent methods: DeepFill[36], PConv[18] and Edge Connect[21]. Results of DeepFill and Edge Connect are obtained using their publicly released pre-trained models. We could not find an official implementation of PConv, so we implemented one with the same settings described in the original paper.

4.4.1 Quantitative Comparisons.

Table 1 shows the comparison results on Places2[38]. We use three metrics: ℓ1 error, BPE, and the Fréchet Inception Distance (FID)[9]. Our results outperform the others in both boundary transition and overall realism. Our BPE is significantly lower than that of Edge Connect[21] and the other methods, which means completions from our method have a better transitional area near the boundary and also proves the effectiveness of the proposed fusion blocks.

Edge Connect maintains structural consistency well by applying additional edge constraints. However, it does not pay much attention to smooth transition, and constraints on the structure of the whole image cannot by themselves lead to natural restoration, especially in detail. Edge Connect shows lower ℓ1 error than ours when the missing hole is large, but this only indicates that its results are more similar to the original images; completions can legitimately be more diverse when the hole is larger.

PConv uses partial convolution to progressively shrink the missing region, which can be seen as providing a hidden attention map that gradually grows from the boundary area to the full known region. This enhances the learning ability near the boundary and has an effect similar to the proposed DFNet in terms of transition performance. However, this architecture does not handle large holes well, because information cannot be transmitted effectively to the inner area. Comparing PConv and Edge Connect on BPE and FID, PConv has a better transition near the boundary and a comparable FID when the missing hole is small; when the hole becomes larger, Edge Connect produces more realistic results.

4.4.2 Qualitative Comparisons.

Figure 7 shows the comparison on Places2 and CelebA without any post-processing. As shown in the figure, our model performs best in texture consistency near the boundary, and also keeps structural consistency well, even better than Edge Connect. The results across the two datasets show the generalization ability of our method.

One thing should be noticed: as shown in the 1st case of Figure 7, PConv and Edge Connect sometimes fail to complete the missing hole when it covers the border of the image. For PConv, we believe this is a limitation of partial convolution, which cannot transmit information into a very large hole. Edge Connect, meanwhile, tends to produce cloud-like completions in similar situations, for which we could not determine the reason.

5 Conclusion

In this paper, we analyze image completion from a new perspective. We propose the Deep Fusion Network, designing a fusion block that predicts an alpha composition map for combining the completion with the existing content, and embedding it at multiple scales. Experiments on the Places2 and CelebA datasets show that our method achieves state-of-the-art results, especially in the aspects of harmonious texture transition, texture detail and semantic structural consistency.


  • [1] A. Levin, D. Lischinski, and Y. Weiss. A closed-form solution to natural image matting. IEEE Transactions on Pattern Analysis and Machine Intelligence, pages 228–242, 2008.
  • [2] Marcelo Bertalmio, Guillermo Sapiro, Vincent Caselles, and Coloma Ballester. Image inpainting. In Proceedings of the 27th annual conference on Computer graphics and interactive techniques, pages 417–424. ACM Press/Addison-Wesley Publishing Co., 2000.
  • [3] Liang-Chieh Chen, Yukun Zhu, George Papandreou, Florian Schroff, and Hartwig Adam. Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), pages 801–818, 2018.
  • [4] Yunjey Choi, Minje Choi, Munyoung Kim, Jung-Woo Ha, Sunghun Kim, and Jaegul Choo. Stargan: Unified generative adversarial networks for multi-domain image-to-image translation. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
  • [5] D. Cho, Y.-W. Tai, and I. Kweon. Natural image matting using deep convolutional neural networks. In European Conference on Computer Vision, pages 1–8. Springer, 2016.
  • [6] Leon A Gatys, Alexander S Ecker, and Matthias Bethge. A neural algorithm of artistic style. arXiv preprint arXiv:1508.06576, 2015.
  • [7] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in neural information processing systems, pages 2672–2680, 2014.
  • [8] Haoran Zhang, Zhenzhen Hu, Changzhi Luo, Wangmeng Zuo, and Meng Wang. Semantic image inpainting with progressive generative networks. In ACM Multimedia, 2018.
  • [9] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. In Advances in Neural Information Processing Systems, pages 6626–6637, 2017.
  • [10] Huy V. Vo, Ngoc Q. K. Duong, and Patrick Pérez. Structural inpainting. In ACM Multimedia, 2018.
  • [11] Satoshi Iizuka, Edgar Simo-Serra, and Hiroshi Ishikawa. Globally and Locally Consistent Image Completion. ACM Transactions on Graphics (Proc. of SIGGRAPH 2017), 36(4):107:1–107:14, 2017.
  • [12] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Francis Bach and David Blei, editors, Proceedings of the 32nd International Conference on Machine Learning, volume 37 of Proceedings of Machine Learning Research, pages 448–456, Lille, France, 07–09 Jul 2015. PMLR.
  • [13] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A. Efros. Image-to-image translation with conditional adversarial networks. In CVPR, 2017.
  • [14] Justin Johnson, Alexandre Alahi, and Li Fei-Fei. Perceptual losses for real-time style transfer and super-resolution. In European Conference on Computer Vision, 2016.
  • [15] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • [16] Orest Kupyn, Volodymyr Budzan, Mykola Mykhailych, Dmytro Mishkin, and Jiří Matas. DeblurGAN: Blind motion deblurring using conditional adversarial networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
  • [17] Yijun Li, Sifei Liu, Jimei Yang, and Ming-Hsuan Yang. Generative face completion. In IEEE Conference on Computer Vision and Pattern Recognition, 2017.
  • [18] Guilin Liu, Fitsum A. Reda, Kevin J. Shih, Ting-Chun Wang, Andrew Tao, and Bryan Catanzaro. Image inpainting for irregular holes using partial convolutions. In The European Conference on Computer Vision (ECCV), 2018.
  • [19] Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. In Proceedings of International Conference on Computer Vision (ICCV), December 2015.
  • [20] Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015.
  • [21] Kamyar Nazeri, Eric Ng, Tony Joseph, Faisal Qureshi, and Mehran Ebrahimi. EdgeConnect: Generative image inpainting with adversarial edge learning. arXiv preprint arXiv:1901.00212, 2019.
  • [22] Ning Xu, Brian Price, Scott Cohen, and Thomas Huang. Deep image matting. arXiv preprint arXiv:1703.03872, 2017.
  • [23] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in pytorch. In NIPS-W, 2017.
  • [24] Deepak Pathak, Philipp Krähenbühl, Jeff Donahue, Trevor Darrell, and Alexei Efros. Context encoders: Feature learning by inpainting. In Computer Vision and Pattern Recognition (CVPR), 2016.
  • [25] Patrick Pérez, Michel Gangnet, and Andrew Blake. Poisson image editing. ACM Transactions on graphics (TOG), 22(3):313–318, 2003.
  • [26] Raymond A. Yeh, Chen Chen, Teck Yian Lim, Alexander G. Schwing, Mark Hasegawa-Johnson, and Minh N. Do. Semantic image inpainting with deep generative models. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
  • [27] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention, pages 234–241. Springer, 2015.
  • [28] Alexander Sergeev and Mike Del Balso. Horovod: Fast and easy distributed deep learning in TensorFlow. arXiv preprint arXiv:1802.05799, 2018.
  • [29] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
  • [30] Yuhang Song, Chao Yang, Yeji Shen, Peng Wang, Qin Huang, and C.-C. Jay Kuo. SPG-Net: Segmentation prediction and guidance network for image inpainting. arXiv preprint arXiv:1805.03356, 2018.
  • [31] Jian Sun, Jiaya Jia, Chi-Keung Tang, and Heung-Yeung Shum. Poisson matting. In ACM Transactions on Graphics (ToG), volume 23, pages 315–321. ACM, 2004.
  • [32] Ting-Chun Wang, Ming-Yu Liu, Jun-Yan Zhu, Andrew Tao, Jan Kautz, and Bryan Catanzaro. High-resolution image synthesis and semantic manipulation with conditional gans. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018.
  • [33] Chao Yang, Xin Lu, Zhe Lin, Eli Shechtman, Oliver Wang, and Hao Li. High-resolution image inpainting using multi-scale neural patch synthesis. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.
  • [34] Yi Wang, Xin Tao, Xiaojuan Qi, Xiaoyong Shen, and Jiaya Jia. Image inpainting via generative multi-column convolutional neural networks. In Conference and Workshop on Neural Information Processing Systems (NeurIPS), 2018.
  • [35] Jiahui Yu, Zhe Lin, Jimei Yang, Xiaohui Shen, Xin Lu, and Thomas S Huang. Free-form image inpainting with gated convolution. arXiv preprint arXiv:1806.03589, 2018.
  • [36] Jiahui Yu, Zhe Lin, Jimei Yang, Xiaohui Shen, Xin Lu, and Thomas S Huang. Generative image inpainting with contextual attention. arXiv preprint arXiv:1801.07892, 2018.
  • [37] Yuhang Song, Chao Yang, Zhe Lin, Xiaofeng Liu, Qin Huang, Hao Li, and C.-C. Jay Kuo. Contextual-based image inpainting: Infer, match, and translate. In The European Conference on Computer Vision (ECCV), 2018.
  • [38] Bolei Zhou, Agata Lapedriza, Aditya Khosla, Aude Oliva, and Antonio Torralba. Places: A 10 million image database for scene recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017.