Image inpainting aims to restore damaged images or to fill in the missing parts of images with visually plausible content [1, 4]. It can be used to remove distracting objects or retouch undesired regions in photos [5, 6], and it extends to other image generation tasks such as super-resolution [8, 24]. It is also an important step in many graphics algorithms, e.g., generating a clean background plate or reshuffling image contents. Due to the inherent ambiguity and complexity of natural images, image completion is a challenging problem.
In recent years, deep learning has been applied to this difficult task owing to its outstanding performance, and several image generative networks based on Convolutional Neural Networks (CNNs) have been designed for hole filling. Pathak et al. trained an encoder-decoder CNN (Context Encoder), a pioneering model for this task. Iizuka et al. designed global and local context discriminators to distinguish real images from completed ones under the Generative Adversarial Network (GAN) framework. However, compared with the wide variety of networks in image classification and recognition, where models range from squeezing AlexNet for FPGA and embedded deployment to very large networks with several hundred layers, networks for image inpainting remain few. Moreover, some models are borrowed from image segmentation, super-resolution, and image style transfer: Yeh et al. used the DCGAN architecture of Radford et al., and several generative models follow the architecture of Isola et al., which was designed for the image-to-image translation task.
In addition, current image completion faces two main problems. First, existing methods often create boundary artifacts, distorted structures, and blurry textures inconsistent with the surrounding areas, because of insufficient cognitive understanding and the ineffectiveness of convolutional neural networks in modeling long-term correlations between contextual information and the hole regions [20, 34, 10]. For example, when the target is a dog, the completed result may not follow visual cognition: it does not look like a dog, and many details are artifacts. The filter used by traditional CNNs is a generalized linear model (GLM), which implicitly assumes that the features to be extracted are linearly separable, whereas real data often are not. Furthermore, most generative networks give up pooling and are limited to a single small kernel size, so a single type of receptive field cannot fully exploit the network's learning ability and cognitive understanding. Given a dog in an image, human vision recognizes it as a dog no matter where it is, how big it is, or whether it is rotated; visual cognition is invariant. To this end, deep inception learning is adopted to abstract the data with more complex structures over diverse receptive fields and to provide sufficient cognitive understanding. Micro neural networks are built within a layer, and the inpainting network is constructed by stacking multiple such layers, exploiting the nonlinearity of the inception layer for efficient high-level feature extraction.
Another problem is that previous deep learning approaches focus on rectangular regions located around the center of the image [20, 32, 31]. Several works on random inpainting are still limited to regular mask shapes, with the rectangular region being the most common form. Liu et al. created an irregular mask dataset for training inpainting networks, based on the occlusion estimation method of Sundaram et al. However, this is still insufficient for arbitrary completion and free-style inpainting: regular, special, and other mask shapes should be covered in addition to irregular ones. To address this limitation, this paper introduces methods for creating diverse masks so that the model is robust to arbitrary completion. Combined with the network constructed above, free-style inpainting is finally realized.
Our contributions are summarized as follows. First, a novel generative network architecture using inception modules is proposed to enhance feature abstraction; the constructed network significantly improves completion results, and to the best of our knowledge, we are the first to adopt inception learning for image inpainting. Moreover, approaches for generating diverse masks are provided and a corresponding mask dataset is created. A variety of comparative experiments are performed on benchmark datasets. High-quality inpainting results are achieved, and the results demonstrate that our model is robust to arbitrary completion, including regular, irregular, and custom masks, as shown in Figure 1.
2 Related Work
A variety of approaches have been proposed for the image completion task. They fall into two main streams.
2.1 Exemplar-based Inpainting
One category of traditional image completion methods is exemplar-based; such methods were mainstream for a long time. They cast the task as a patch-based optimization problem with an energy function [4, 30, 1, 5, 6, 7]. Matching-based methods explicitly match patches in the unknown region with patches in the known region and copy the known content to complete the unknown region. The obvious limitation is that the synthesized texture comes only from the input image, which is a problem when a convincing completion requires textures not found in the input. The same holds for diffusion-based methods, which solve Partial Differential Equations (PDEs) [3, 16] to propagate colors into the missing regions.
2.2 Learning-based Inpainting
More recently, deep neural networks have been introduced for texture synthesis and image stylization [12, 36], and deep learning methods provide encouraging results in filling large target regions [20, 32, 33, 19, 10]. In particular, Pathak et al. trained an encoder-decoder CNN (Context Encoder) with a combined l2 and adversarial loss to directly predict plausible image structures; it has become a classic work on CNN-based image inpainting. Iizuka et al. adopted dilated convolutions to increase the receptive fields of output neurons in the inpainting network, replacing the channel-wise fully connected layer of the Context Encoder; in addition, global and local discriminators serve as adversarial losses under the GAN framework, and Poisson blending is applied as a post-process. Based on the architecture of the deep convolutional generative adversarial network (DCGAN) of Radford et al., Yeh et al. introduced semantic image inpainting to fill in the missing part conditioned on the known region for images of a specific semantic class. However, existing inpainting networks lack nonlinearity within a convolutional layer, and these approaches are mainly trained on rectangular masks.
Mask samples are essential for training in the image inpainting task. The mainstream practice is to generate a randomly sized damaged area at a random position of the image. The limitation of previous works is that the damaged area is mostly regular, with the rectangular block being the most adopted form [20, 32, 31, 10]. A recent method by Sundaram et al. can generate masks of diverse shapes and stripes from the occlusion mask estimated between two consecutive video frames. Building on this, Liu et al. first proposed partial convolution, where the convolution is masked and re-normalized to utilize only valid pixels, to better handle irregular masks.
2.3 Network in Network
As stated above, the filter used by traditional CNNs is a generalized linear model. One solution is stacking convolution filters to generate higher-level feature representations; deeper models improve abstraction ability, and generally, the deeper the network, the stronger the nonlinearity. The most direct way to improve network performance is thus to increase the depth and width of the network. But this has the following problems: it introduces many parameters, and with limited training data it easily overfits; the larger the network, the greater the computational complexity; and deeper networks are prone to gradient diffusion, making the model difficult to optimize. Lin et al. argue that, apart from stacking convolutional layers, the convolutional layer itself can be specially designed so that the network extracts better features. They first introduced the idea of 'Network in Network' (NIN), instantiating a micro neural network, which is a potent function approximator, inside a convolutional layer.
Afterwards, inception modules were designed based on the 'Network in Network' framework [28, 11, 29]. The main idea of inception is that an optimal local sparse structure in a convolutional neural network can be approximated and covered by a series of readily available dense substructures. Inception maintains the sparseness of the network while exploiting the high computational performance of dense matrices; it helps the network grow deeper and wider without increasing computation dramatically, and performance improves significantly through the ability to extract more features under the same amount of computation. Combined with the design of the Residual Network (ResNet), which helps mitigate the difficulty of training very deep networks, Szegedy et al. proposed Inception-ResNet, a novel framework that combines the advantages of both Inception and ResNet.
Our approach is based on deep inception learning, partial convolution, and random mask generation.
3.1 Deep Inception Learning
Inspired by the above-mentioned 'Network in Network' and 'Inception', deep inception learning is adopted to build our completion network. Inception is a micro network inside a layer, and the NIN structure provides stronger nonlinearity than vanilla convolutions within the same receptive field.
There are several different types of kernels in one inception unit; generally it consists of a small filter, a medium-sized filter, a large filter, and a pooling filter. This increases the width of the network and, at the same time, the adaptability of the network to multi-scale processing. The network within the convolutional layer is able to extract information from every detail of the input, while the large filter covers larger regions of the receiving layers. Pooling is performed to reduce the spatial size and over-fitting. The topology of inception analyzes the relevant statistics of the upper layer and aggregates them into highly related unit groups. All the results are concatenated into a very deep feature map, and this stitching fuses the different features.
There are many different inception modules and different combinations of filters within them. Generally, there are many stacks in an inception structure, and they express information in a dense, compressed form. To express it more sparsely and aggregate only the information in large quantities, a 1×1 convolution is applied before the general convolutions such as 3×3 and 5×5. This helps reduce the feature map thickness: if the output of the previous layer first passes through a 1×1 convolutional layer with fewer channels before the larger convolution, the output shape is unchanged, but the number of convolution parameters is reduced several-fold. At the same time, more features can be extracted by superimposing more convolutions within the same receptive field.
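The saving can be made concrete with a quick parameter count; the channel counts below are hypothetical, chosen only for illustration:

```python
def conv_params(c_in, c_out, k_h, k_w):
    """Weight count of a convolutional layer (biases omitted)."""
    return c_in * c_out * k_h * k_w

# Direct 3x3 convolution mapping 256 channels to 256 channels.
direct = conv_params(256, 256, 3, 3)

# Bottleneck: 1x1 reduction to 64 channels, then 3x3 back to 256.
bottleneck = conv_params(256, 64, 1, 1) + conv_params(64, 256, 3, 3)

print(direct, bottleneck)   # 589824 163840, i.e. a 3.6x reduction
```

The spatial output shape is identical in both cases; only the weight count changes.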
Different from general generative networks, where a single small convolutional filter is the most frequently applied form, deep inception models can embed larger convolutional kernels. To satisfy spatial support, Iizuka et al. adopted dilated convolution, which computes each output pixel from a much larger input area while using the same number of parameters and the same computational power; they argue that, by using dilated convolutions at lower resolutions, the model can effectively "see" a larger area of the input image when computing each output pixel than with standard convolutional layers. A larger receptive field can also be obtained through inception learning, and in this way not only is spatial support satisfied, but different views of the image are obtained and different features are learned. To avoid an explosion of parameters and computation, large convolution kernels are decomposed into cascaded 1×n and n×1 forms, which saves parameters, speeds up computation, and helps avoid over-fitting. Because pooling is embedded inside the inception module, the effective learning of characteristics is preserved while avoiding the blurring of details and loss of resolution caused by using pooling alone.
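A similar count shows what the asymmetric factorization saves per input/output channel pair; the kernel size 7 below is only an example:

```python
def kernel_weights(n):
    """Weights per channel pair: a full n x n kernel holds n*n,
    while the factorized 1 x n followed by n x 1 cascade holds 2*n."""
    return n * n, 2 * n

full, factored = kernel_weights(7)
print(full, factored)   # 49 14: roughly a 3.5x saving
```

The saving grows with n, which is why factorization matters most for the largest kernels in the module.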
Apart from the parallel combination of filters, inception can also be constructed in cascade form: by cascading convolutions, more nonlinear features are acquired. Take a cascade of two convolutions as an example: suppose the first convolution plus activation approximates a nonlinear function f, and the second convolution plus activation approximates g; the composition g(f(·)) is clearly more nonlinear than either f or g alone, so the cascaded form can model nonlinear features better than vanilla convolutions. Figure 2 shows one kind of inception in our model. This structure stacks convolutions of several kernel sizes and pooling, which are commonly used in CNNs; the spatial dimensions of the convolution and pooling outputs are kept the same, and their channels are concatenated. The features learned by the different kernels in the micro network are extracted for display, with visualized results shown in Figure 3. Here we choose the outputs of the first 15 channels of one layer in the decoder stage: for the input image shown in Figure 3(a), the feature maps of the inception are shown in Figure 3(b).
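The parallel-branch-and-concatenate idea above can be sketched in plain NumPy; random channel-mixing maps stand in for the learned convolutions and pooling, and the branch widths are made up:

```python
import numpy as np

def inception_concat(x, branch_channels=(16, 32, 8, 8), seed=0):
    """Toy inception layer: each 'branch' is a random linear map over
    channels (real branches would be same-padded 1x1/3x3/5x5
    convolutions and pooling). All branches preserve H x W, so their
    outputs can be concatenated along the channel axis."""
    c, h, w = x.shape
    rng = np.random.default_rng(seed)
    outs = []
    for c_out in branch_channels:
        weight = rng.standard_normal((c_out, c))      # channel mixing
        outs.append(np.einsum('oc,chw->ohw', weight, x))
    return np.concatenate(outs, axis=0)               # stack channels

x = np.ones((64, 32, 32))
y = inception_concat(x)
print(y.shape)   # (60, 32, 32)? no: 16+32+8+8 = 64 channels -> (64, 32, 32)
```

The key invariant is that every branch keeps the spatial size, so only the channel dimension grows with the number of branches.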
Such a network layer, with an excellent local topology performing multiple convolutions or pooling in parallel, promotes the learning of multiple features, and networks constructed from such layers can better abstract image characteristics. The intuition of effective multi-scale feature representation helps bring sufficient cognitive understanding. Based on the above analysis, image generative networks built on deep inception learning tend to improve cognitive logicality, produce higher-quality inpainting results, and ameliorate the problems of artifacts, content discrepancy, and color inconsistency that exist in previous works due to the lack of high-level context representations and logical cognition.
3.2 Partial Convolution
Liu et al. proposed the concept of partial convolution, which has shown outstanding performance for inpainting images with irregular holes. The convolution is masked, and the outputs are conditioned only on valid pixels. Let W be the weights of the convolution filter and b the corresponding bias; X are the feature values of the current sliding window, and M is the corresponding binary mask. The partial convolution at each position is expressed as:

x' = \begin{cases} W^{\top}(X \odot M)\,\dfrac{\mathrm{sum}(\mathbf{1})}{\mathrm{sum}(M)} + b, & \text{if } \mathrm{sum}(M) > 0 \\ 0, & \text{otherwise} \end{cases}

where ⊙ denotes element-wise multiplication and the factor sum(1)/sum(M) re-scales for the varying number of valid inputs. It can be seen that the output values depend only on the unmasked region, so partial convolution handles irregular masks better than standard convolution. Unlike image classification and object detection, where all pixels of the input image are valid, image inpainting has invalid pixels in the holes or masked regions, whose values are generally set to 0 or 255. If these invalid pixels take part in the convolution, color discrepancy and edge artifacts result, as reported by Liu et al. To utilize the advantages of this recent work, we replace standard convolution with partial convolution in our model.
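As a simplified single-channel reading of the formula, the operation at one sliding-window position can be sketched as follows (an illustrative sketch, not the authors' implementation):

```python
import numpy as np

def partial_conv_window(weights, features, mask, bias=0.0):
    """Partial convolution at one window position: only valid
    (mask == 1) pixels contribute, and the result is re-scaled by
    the ratio of window size to valid-pixel count."""
    valid = mask.sum()
    if valid == 0:
        return 0.0                     # fully masked window -> 0
    scale = mask.size / valid          # sum(1) / sum(M)
    return float((weights * (features * mask)).sum() * scale + bias)

w = np.ones((3, 3)) / 9.0              # averaging kernel
x = np.full((3, 3), 4.0)               # constant-valued window
m = np.zeros((3, 3)); m[0, :] = 1.0    # only the top row is valid
print(partial_conv_window(w, x, m))    # 4.0: the valid mean survives
```

With a standard convolution the six masked zeros would drag the average down to 4/3; the re-normalization restores the mean of the valid pixels.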
3.3 Network Design
We follow the encoder-decoder architecture to design the generative network. Our model has 16 layers in total: 8 in the encoder and 8 in the decoder. The encoder stage learns image features; it is a process of characterizing images. The decoder stage restores and decodes the previously learned features back to real images. For many vision problems, the information provided by the pixels around a pixel must be taken into account; this information falls into two categories, the overall environmental (field) information and the detailed information. The choice of window size brings considerable uncertainty: if the selected size is too large, more pooling layers are required to expose the environmental information, and local details are lost; if it is too small, the field information is inaccurate. Recent works demonstrate that the U-net proposed by Ronneberger et al. performs well on image generative tasks. U-net uses a network structure with down-sampling and up-sampling: down-sampling gradually reveals the environmental information, and the up-sampling process merges the features learned during down-sampling, which include that environmental information, to restore more details. We combine the benefits of this approach in our model; an overview of the image generative architecture is shown in Figure 4.
In our model, the middle layers carry micro networks, i.e., inception modules. Each convolutional layer is followed by batch normalization and an activation: ReLU in the encoding layers, and LeakyReLU with alpha = 0.2 in the decoding stage. All convolutional layers are replaced with partial convolutions. A point worth mentioning is that the feature map size does not vary linearly layer by layer in the architecture; this differs from previous generative networks, where the feature map size changes by a factor of 2 between layers. For instance, the 4th and 5th layers in the encoder have the same feature map size, and the same holds for the bottom 4th and bottom 5th layers in the decoding stage. The network details are given in the supplementary materials.
3.4 Guided Objective Function
Given a ground-truth image I_gt, the generator produces an output I_out. Let M be a binary mask corresponding to the dropped image region, with a value of 0 wherever a pixel was dropped and 1 for valid input pixels. Generally, a normalized L1 distance is used as the reconstruction loss, constructed at the pixel level as follows:

\mathcal{L}_{rec} = \frac{1}{N}\left\| I_{out} - I_{gt} \right\|_{1}

where N is the number of elements of I_gt.
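A minimal NumPy reading of this pixel-level term (variable names assumed, toy arrays for illustration):

```python
import numpy as np

def l1_reconstruction_loss(i_out, i_gt):
    """Normalized L1 reconstruction loss: mean absolute pixel error."""
    return float(np.abs(i_out - i_gt).mean())

i_gt = np.zeros((3, 4, 4))           # toy ground-truth image (C, H, W)
i_out = np.full((3, 4, 4), 0.5)      # toy generator output
print(l1_reconstruction_loss(i_out, i_gt))   # 0.5
```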
Perceptual loss was first proposed by Johnson et al. for real-time style transfer and super-resolution reconstruction. It is based on pre-trained networks and defined as:

\mathcal{L}_{perc} = \sum_{i} \frac{1}{N_i}\left\| \psi_i(I_{out}) - \psi_i(I_{gt}) \right\|_{1}

where \psi_i is the feature map of the i-th selected layer of the pre-trained model, which is usually VGG, and N_i is the number of elements of \psi_i. Perceptual loss was first applied to image inpainting by Liu et al.; their results indicate that it helps infer a more accurate reconstruction of the hole content.
Style loss is widely used in image-to-image translation tasks such as real-time style transfer, and it is also adopted in this paper. By computing a Gram matrix on each feature map, the style loss takes the following form:

\mathcal{L}_{style} = \sum_{i} \left\| G(\psi_i(I_{out})) - G(\psi_i(I_{gt})) \right\|_{1}

where G(\psi) = \psi\,\psi^{\top} is the Gram matrix computed on the reshaped feature map \psi of a selected layer of the pre-trained network.
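The Gram matrix term is simple to compute; a small NumPy sketch follows (the 1/(CHW) normalization here is one common convention and an assumption, not taken from the paper):

```python
import numpy as np

def gram_matrix(features):
    """Gram matrix of a (C, H, W) feature map: pairwise channel
    inner products, normalized by the element count."""
    c, h, w = features.shape
    f = features.reshape(c, h * w)
    return f @ f.T / (c * h * w)

phi = np.random.default_rng(1).standard_normal((8, 5, 5))
g = gram_matrix(phi)
print(g.shape, np.allclose(g, g.T))   # (8, 8) True: Gram matrices are symmetric
```

Because the spatial dimensions are summed out, the Gram matrix captures which channels co-activate rather than where, which is why it measures style rather than content.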
The above two loss functions, based on high-level features extracted from pre-trained networks, are combined with the reconstruction term into the final guided objective function for generating high-quality images: a low-level L1 loss measures pixel distance, while a pre-trained VGG network extracts high-level feature maps for computing the perceptual and style differences.
3.5 Random Mask Dataset
Masks play an important role in arbitrary completion and free-style inpainting. A key point is the generation of random mask images, since deep learning-based methods require mask samples in addition to unlabeled real images. A recent work demonstrates that the occlusion/non-occlusion mask estimated between two consecutive video frames can be used to generate masks of arbitrary shapes and stripes; based on this method, a random mask dataset was created by Liu et al. This paper introduces two algorithms for automatically generating random masks. In the first, pictures with multiple shapes, including rectangles, circles, ellipses, and strings, are drawn randomly, with random size, rotation, and position. In the second, a point is first selected randomly, a damaged region around the point is then expanded randomly, and a morphological dilation is applied to produce the final binary mask. A mask dataset for free-style inpainting is thereby created; samples are shown in Figure 5. Note that holes are black and represented with 0 in the experiments; we invert them for the displayed comparisons. The created random mask dataset and the code will be released on GitHub.
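A minimal sketch of the first mask-drawing strategy (shape types, counts, and size ranges are assumptions; the string shapes, rotation, and the dilation step of the second strategy are omitted for brevity):

```python
import numpy as np

def random_shape_mask(size=64, n_shapes=4, seed=None):
    """Binary mask with randomly placed rectangles and circles:
    holes = 0, valid pixels = 1."""
    rng = np.random.default_rng(seed)
    mask = np.ones((size, size), dtype=np.uint8)
    yy, xx = np.mgrid[:size, :size]
    for _ in range(n_shapes):
        if rng.random() < 0.5:                        # rectangle
            y0, x0 = rng.integers(0, size, 2)
            h, w = rng.integers(4, size // 2, 2)
            mask[y0:y0 + h, x0:x0 + w] = 0
        else:                                         # circle
            cy, cx = rng.integers(0, size, 2)
            r = rng.integers(3, size // 4)
            mask[(yy - cy) ** 2 + (xx - cx) ** 2 <= r ** 2] = 0
    return mask

m = random_shape_mask(seed=0)
print(m.shape, m.min())   # every drawn shape zeroes at least one pixel
```

In practice such masks would be regenerated on the fly each training step so the network never sees the same hole pattern twice.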
4 Experiments and Results
The experiments are conducted with two NVIDIA GeForce GTX-1080 Ti GPUs. The adopted deep learning platform is PyTorch, and we use Adam for optimization with a batch size of 16. Image samples are resized before being fed to the network.
We carry out quantitative comparisons with state-of-the-art algorithms: IM, GL, GntIpt, and Pconv. The implementations of all these methods are based on their released source code, pre-trained models, or the results in their papers.
4.1 Regular Regions Inpainting
Since there is no released code or pre-trained model for Pconv, we directly use the results from its paper for this comparison. Comparisons of the different methods are shown in Figure 6. It can be seen that GL and GntIpt fail to achieve plausible results even with post-processing or a refinement network: both create distorted structures inconsistent with the surrounding areas because of insufficient understanding of the image characteristics and the ineffectiveness of convolutional neural networks in modeling long-term correlations between contextual information and the hole regions. Pconv exhibits many checkerboard artifacts, while ours shows fewer.
4.2 Irregular Regions Inpainting
For this comparison, we used the released code for IM and the available pre-trained models. From the results, we can see that IM is acceptable for filling narrow or small holes. However, its results are blurry when filling large holes, as shown in the last row of Figure 8. The completed contents are not consistent with the scene, and the patches copied from elsewhere are disharmonious with their surroundings because the method is unable to generate novel objects in the image, as shown in Figure 7.
4.3 Free-style Inpainting
The mask images used here are created manually; some figures are drawn by hand on a canvas. Owing to insufficient cognitive understanding, GL and GntIpt create distorted structures and artifacts, as shown in Figure 9. Although GL works well in the first row and the second result of GntIpt is acceptable, they fail on the other instances, and the results do not comply with human visual cognition. Note that, unlike previous work where the test masks follow the same probability distribution as the training masks because both are generated in the same way, the distribution and peculiarities of the mask samples used in free-style inpainting are unknown in advance, and the masks can be created freely. The results demonstrate that our method generates much more natural image completions for all kinds of mask samples. More results are shown in Figure 10.
In this paper, deep inception learning is adopted for cognitive inpainting. Combined with the benefits of several recent approaches, a state-of-the-art generative network is designed. Furthermore, two approaches for generating random mask samples are introduced. We validate our methods on three benchmark datasets; experiments show that superior inpainting results are obtained and that our method is robust to the diverse masks that are vital for free-style inpainting. As image inpainting shares commonality with super-resolution tasks, we plan to extend our model to image super-resolution reconstruction.
-  C. Barnes, E. Shechtman, A. Finkelstein, and D. B. Goldman. Patchmatch: A randomized correspondence algorithm for structural image editing. ACM Transactions on Graphics-TOG, 28(3):24, 2009.
-  Y. Bengio, A. Courville, and P. Vincent. Representation learning: A review and new perspectives. IEEE transactions on pattern analysis and machine intelligence, 35(8):1798–1828, 2013.
-  M. Bertalmio, G. Sapiro, V. Caselles, and C. Ballester. Image inpainting. In Proceedings of the 27th annual conference on Computer graphics and interactive techniques, pages 417–424. ACM Press/Addison-Wesley Publishing Co., 2000.
-  A. Criminisi, P. Pérez, and K. Toyama. Region filling and object removal by exemplar-based image inpainting. IEEE Transactions on image processing, 13(9):1200–1212, 2004.
-  S. Darabi, E. Shechtman, C. Barnes, D. B. Goldman, and P. Sen. Image melding: Combining inconsistent images using patch-based synthesis. ACM Trans. Graph., 31(4):82–1, 2012.
-  K. He and J. Sun. Image completion approaches using the statistics of similar patches. IEEE transactions on pattern analysis and machine intelligence, 36(12):2423–2435, 2014.
-  J.-B. Huang, S. B. Kang, N. Ahuja, and J. Kopf. Image completion using planar structure guidance. ACM Transactions on graphics (TOG), 33(4):129, 2014.
-  J.-B. Huang, A. Singh, and N. Ahuja. Single image super-resolution from transformed self-exemplars. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5197–5206, 2015.
-  F. N. Iandola, S. Han, M. W. Moskewicz, K. Ashraf, W. J. Dally, and K. Keutzer. Squeezenet: Alexnet-level accuracy with 50x fewer parameters and <0.5MB model size. arXiv preprint arXiv:1602.07360, 2016.
-  S. Iizuka, E. Simo-Serra, and H. Ishikawa. Globally and locally consistent image completion. ACM Transactions on Graphics (TOG), 36(4):107, 2017.
-  S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
-  P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros. Image-to-image translation with conditional adversarial networks. arXiv preprint, 2017.
-  J. Johnson, A. Alahi, and L. Fei-Fei. Perceptual losses for real-time style transfer and super-resolution. In European Conference on Computer Vision, pages 694–711. Springer, 2016.
-  T. Karras, T. Aila, S. Laine, and J. Lehtinen. Progressive growing of gans for improved quality, stability, and variation. arXiv preprint arXiv:1710.10196, 2017.
-  D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
-  A. Levin, A. Zomet, and Y. Weiss. Learning how to inpaint from global image statistics. In Proceedings of the IEEE International Conference on Computer Vision, page 305. IEEE, 2003.
-  M. Lin, Q. Chen, and S. Yan. Network in network. arXiv preprint arXiv:1312.4400, 2013.
-  G. Liu, F. A. Reda, K. J. Shih, T.-C. Wang, A. Tao, and B. Catanzaro. Image inpainting for irregular holes using partial convolutions. arXiv preprint arXiv:1804.07723, 2018.
-  P. Liu, X. Qi, P. He, Y. Li, M. R. Lyu, and I. King. Semantically consistent image completion with fine-grained details. arXiv preprint arXiv:1711.09345, 2017.
-  D. Pathak, P. Krahenbuhl, J. Donahue, T. Darrell, and A. A. Efros. Context encoders: Feature learning by inpainting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2536–2544, 2016.
-  A. Radford, L. Metz, and S. Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.
-  O. Ronneberger, P. Fischer, and T. Brox. U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention, pages 234–241. Springer, 2015.
-  O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. Imagenet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.
-  W. Shi, J. Caballero, F. Huszár, J. Totz, A. P. Aitken, R. Bishop, D. Rueckert, and Z. Wang. Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1874–1883, 2016.
-  K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
-  N. Sundaram, T. Brox, and K. Keutzer. Dense point trajectories by gpu-accelerated large displacement optical flow. In European conference on computer vision, pages 438–451. Springer, 2010.
-  C. Szegedy, S. Ioffe, V. Vanhoucke, and A. A. Alemi. Inception-v4, inception-resnet and the impact of residual connections on learning. In AAAI, volume 4, page 12, 2017.
-  C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1–9, 2015.
-  C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2818–2826, 2016.
-  Y. Wexler, E. Shechtman, and M. Irani. Space-time completion of video. IEEE Transactions on pattern analysis and machine intelligence, 29(3), 2007.
-  Z. Yan, X. Li, M. Li, W. Zuo, and S. Shan. Shift-net: Image inpainting via deep feature rearrangement. arXiv preprint arXiv:1801.09392, 2018.
-  C. Yang, X. Lu, Z. Lin, E. Shechtman, O. Wang, and H. Li. High-resolution image inpainting using multi-scale neural patch synthesis. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), volume 1, page 3, 2017.
-  R. A. Yeh, C. Chen, T. Y. Lim, A. G. Schwing, M. Hasegawa-Johnson, and M. N. Do. Semantic image inpainting with deep generative models. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5485–5493, 2017.
-  J. Yu, Z. Lin, J. Yang, X. Shen, X. Lu, and T. S. Huang. Generative image inpainting with contextual attention. arXiv preprint, 2018.
-  B. Zhou, A. Lapedriza, A. Khosla, A. Oliva, and A. Torralba. Places: A 10 million image database for scene recognition. IEEE transactions on pattern analysis and machine intelligence, 40(6):1452–1464, 2018.
-  J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. arXiv preprint arXiv:1703.10593, 2017.