PEPSI++: Fast and Lightweight Network for Image Inpainting

05/22/2019 ∙ by Yong-Goo Shin, et al. ∙ Korea University

Generative adversarial network (GAN)-based image inpainting methods that use a coarse-to-fine network with a contextual attention module (CAM) have shown remarkable performance. However, their two stacked generative networks require many convolution operations and network parameters, which results in low speed. To address this problem, we propose a novel network structure called PEPSI (parallel extended-decoder path for semantic inpainting), which aims at not only reducing hardware costs but also improving inpainting performance. PEPSI consists of a single shared encoding network and parallel decoding networks with coarse and inpainting paths. The coarse path generates a preliminary inpainting result, which is used to train the encoding network to predict features for the CAM. At the same time, the inpainting path produces a higher-quality result from the refined features reconstructed by the CAM. In addition, we propose Diet-PEPSI, which significantly reduces the number of network parameters while maintaining performance. For this purpose, we present a Diet-PEPSI unit (DPU), which effectively aggregates global contextual information with a small number of parameters. Extensive experiments and comparisons with state-of-the-art image inpainting methods demonstrate that both PEPSI and Diet-PEPSI achieve significant improvements in qualitative scores and reduced computation cost.




I Introduction

Image inpainting techniques, which attempt to remove an unwanted object or synthesize missing parts of an image, have attracted widespread interest in the computer vision and graphics communities [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]. Recent studies used the generative adversarial network (GAN) to produce appropriate structures for the missing regions, i.e. hole regions [8, 9, 15]. Among the recent state-of-the-art inpainting methods, the coarse-to-fine network has shown remarkable performance [4, 16]. This network is composed of two stacked generative networks: the coarse and refinement networks. The coarse network roughly fills the hole regions using a simple dilated convolutional network trained with a reconstruction loss. A contextual attention module (CAM) then generates feature patches of the hole regions by borrowing information from distant spatial locations, and the refinement network improves the quality of the roughly completed image. Despite the promising results, the coarse-to-fine network requires high computational resources and consumes considerable memory.

In the previous work [17], we introduced a novel network structure, called PEPSI: parallel extended-decoder path for semantic inpainting, which aims at reducing the number of convolution operations as well as improving the inpainting performance. PEPSI is composed of a single encoding network and a parallel decoding network with coarse and inpainting paths. The coarse path produces a roughly completed result with which the encoding network is trained to predict features for the CAM. At the same time, the inpainting path generates a high-quality inpainting result using the refined features reconstructed by the CAM. To make a single encoding network handle two different tasks, namely feature extraction for both the roughly completed and the high-quality results, we propose a joint learning technique that jointly optimizes the two paths. This learning scheme allows PEPSI to generate high-quality inpainting images without stacked generative networks.

Although PEPSI operates faster than conventional methods, it still needs substantial memory owing to a series of dilated convolutional layers in the encoding network, which hold nearly 67 percent of the network parameters. An intuitive way to save memory is to prune channels in the dilated convolutional layers; however, this often yields inferior results. To address this challenge, this paper presents an extended version of PEPSI, called Diet-PEPSI, which reduces the network parameters almost by half with comparable inpainting performance. In this paper, we introduce a Diet-PEPSI unit (DPU), which provides a large receptive field with a small number of parameters. More specifically, the DPU assembles contextual information via a dilated group convolutional layer [18] and combines that information with the input feature maps using the projection shortcut technique [19]. By replacing the multiple dilated convolutional layers with DPUs, Diet-PEPSI covers the same receptive field with fewer parameters.

Furthermore, we investigate a limitation of the discriminator in traditional GAN-based image inpainting methods [20, 14]. In general, conventional methods employ global and local discriminators trained with a combined loss, consisting of the pixel-wise reconstruction loss and the adversarial loss, which assists the networks in generating a more natural image by minimizing the difference between the reference and the inpainted image. More specifically, the global discriminator takes the whole image as input to assess global consistency, whereas the local one looks only at a small region around the hole in order to judge the quality of the more detailed appearance. However, the local discriminator can only deal with a single rectangular hole region, while holes can appear with arbitrary locations, shapes, and sizes in real-world applications; it is therefore difficult to adopt the local discriminator to train the inpainting network for holes with irregular shapes. To resolve this problem, we propose a region ensemble discriminator (RED) that integrates the global and local discriminators. The RED individually computes a hinge loss on each feature vector of the last layer, each with a different receptive field, to deal with holes of arbitrary shapes.

In summary, this paper has three major contributions. (i) We propose a novel network architecture called PEPSI that achieves superior performance compared to conventional methods while significantly reducing the operation time. (ii) We design the DPU to further reduce hardware costs while maintaining the overall quality of the results, which makes the proposed method more suitable for hardware with limited resources. (iii) A novel discriminator, called RED, is proposed to handle both square and irregular hole regions for real applications. In the remainder of this paper, we first introduce related work and preliminaries in Section II and Section III, respectively. PEPSI and Diet-PEPSI are discussed in Section IV. In Section V, extensive experimental results are presented to demonstrate that the proposed method outperforms conventional methods on various datasets such as CelebA-HQ [21, 22], Place2 [23], and ImageNet [24]. Finally, conclusions are provided in Section VI.

II Related work

Image inpainting techniques can be divided into two groups [4]. The first group includes diffusion-based and patch-based methods. The diffusion-based method typically propagates the local image appearance around the holes to fill them in [1, 2, 4, 5]. This method performs well with small and narrow holes but often fails to fill in complex hole regions such as faces and objects with non-repetitive structures. In contrast to the diffusion-based method, the patch-based technique performs better in filling complicated images with large hole regions [4, 25]. This method samples texture patches from the existing regions of the image, i.e. background regions, and pastes them into the hole region. Barnes et al. [3] proposed a fast approximate nearest-neighbor patch search algorithm, called PatchMatch, which has shown notable results for image editing applications including image inpainting. However, PatchMatch often produces poor results because it fills the holes regardless of the visual semantics or the global structure of an image.

Fig. 1: The illustration of the CAM. The conventional CAM reconstructs foreground patches by measuring the cosine similarities with background patches. In contrast, the modified CAM uses the Euclidean distance to compute similarity scores.

The second group is the generation-based method, which employs a convolutional neural network (CNN) to generate structures for the hole regions [8, 9, 15]. CNN-based image inpainting methods employing an encoder-decoder structure have shown superior performance on inpainting complex hole regions compared with the diffusion- or patch-based methods [8, 15]. However, these methods often generate an image with visual artifacts, such as boundary artifacts and blurry textures that are inconsistent with the surrounding areas. To alleviate this problem, Pathak et al. [10] adopted the GAN [20] to enhance the coherence between background and hole regions. They trained the entire network using a combined loss, consisting of the pixel-wise reconstruction loss and the adversarial loss, which drives the networks to minimize the difference between the reference and inpainted images as well as to produce plausible new content in highly structured images such as faces and scenes. However, this method can only fill square holes at the center of an image.

To inpaint images with a square hole at arbitrary locations, Iizuka et al. [7] proposed an improved network structure which can sample features in wider receptive fields using multiple dilated convolutional layers. In addition, they used two sibling discriminators: a global and a local discriminator. The local discriminator focuses on the inpainted region to assess local texture consistency, while the global discriminator inspects whether the result is coherent over the whole image. Yu et al. [4] extended this work using the coarse-to-fine network and the contextual attention module (CAM). The CAM learns the relation between background and foreground feature patches by computing the cosine similarity. To collect the background features related to the missing region, this method requires features of the missing region encoded from roughly completed images. To this end, two stacked generative networks (coarse and refinement) were used to generate an intermediate, roughly restored image. This method achieved remarkable performance compared with the recent state-of-the-art inpainting methods; however, it requires considerable computational resources due to the two-stage network structure.

Fig. 2: A toy example of the coarse network. (a) The masked input image (b) The original image (c) The result from the coarse-to-fine network (d) The result without the coarse result (e) The result with a low-resolution (LR) coarse path
Method   Square mask        Free-form mask     Time
         PSNR    SSIM       PSNR    SSIM
GC       24.67   0.8949     27.78   0.9252     21.39ms
GC*      23.50   0.8822     26.35   0.9098     14.28ms
GC†      23.71   0.8752     26.22   0.9026     13.32ms
TABLE I: The experimental results with GatedConv (GC) [16] using different coarse paths. * means a model without coarse results and † indicates a model with a simplified coarse path.
Fig. 3: The architecture of PEPSI. The coarse path and inpainting path share their weights to improve each other. The coarse path is trained only with the reconstruction loss, while the inpainting path is trained with both the reconstruction and adversarial losses.

III Preliminaries

III-A Generative adversarial networks

The GAN was first introduced by Goodfellow et al. [20] for image generation. In general, a GAN consists of a generator $G$ and a discriminator $D$, which are trained with competing goals. The generator is trained to produce new images that are indistinguishable from real images, while the discriminator is optimized to differentiate between real and generated images. Formally, $G$ ($D$) tries to minimize (maximize) the loss function, i.e. the adversarial loss, as follows:

$$\min_G \max_D \; \mathbb{E}_{x \sim P_{data}(x)}\left[\log D(x)\right] + \mathbb{E}_{z \sim P_z(z)}\left[\log\left(1 - D(G(z))\right)\right], \tag{1}$$

where $z$ and $x$ denote a random noise vector and a real image sampled from the noise distribution $P_z(z)$ and the real data distribution $P_{data}(x)$, respectively. Recently, the GAN has been applied to several semantic inpainting techniques [4, 10, 7] in order to fill the holes naturally.

III-B Coarse-to-fine network

Yu et al. [4, 16] proposed a two-stage network, called a coarse-to-fine network, which performs two tasks separately. This method first generates an initial coarse prediction using the coarse network, and then refines the result by extracting features from the roughly filled prediction with the refinement network. To refine the coarse prediction effectively, they introduced the CAM, which generates patches of the hole region using features from distant background patches. As depicted in Fig. 1, the CAM first divides the input feature maps into a target foreground and its surrounding background, and extracts patches. The similarity score $s_{(x,y),(x',y')}$ between the foreground patch at $(x, y)$, $f_{x,y}$, and the background patch at $(x', y')$, $b_{x',y'}$, can be obtained by the normalized inner product (cosine similarity) as follows:

$$s_{(x,y),(x',y')} = \left\langle \frac{f_{x,y}}{\lVert f_{x,y} \rVert}, \frac{b_{x',y'}}{\lVert b_{x',y'} \rVert} \right\rangle, \tag{2}$$

$$s^{*}_{(x,y),(x',y')} = \mathrm{softmax}_{x',y'}\left(\lambda\, s_{(x,y),(x',y')}\right), \tag{3}$$

where $\lambda$ is a hyper-parameter for scaled softmax. By using $s^{*}_{(x,y),(x',y')}$ as weights, the CAM reconstructs the features of foreground regions as a weighted sum of background patches, thereby learning the relation between them. Although the CAM effectively learns where to borrow or copy feature information from known background feature patches for generating missing feature patches, it needs the roughly completed image, i.e. the coarse result, to explicitly attend to related features at distant spatial locations. To verify this assumption, we conducted experiments that measure the performance of the coarse-to-fine network with and without the coarse path. As shown in Table I and Fig. 2, the refinement network without the coarse result shows worse results than the full coarse-to-fine network (these results were obtained by training the refinement network using raw masked images as input). This means that, if the coarse feature of the hole region is not encoded well, the CAM produces the missing features using unrelated feature patches, yielding contaminated results as shown in Fig. 2(d). In other words, the coarse-to-fine network must pass through a two-stage encoder-decoder network, which requires massive computational resources. Furthermore, to reduce the operation time of the coarse-to-fine network in another way, we conducted an extra experiment that simplifies the coarse network. In this experiment, we generate the coarse result at low resolution () and feed it to the refinement network after resizing it to the original size. However, as depicted in Fig. 2(e) and Table I, the simplified coarse network exhibits worse performance. These observations indicate that the simplified coarse network can produce the roughly completed image quickly, but this image is not suitable for the refinement network.
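The attention step described above can be illustrated for a single foreground patch. The following is a minimal sketch, not the authors' implementation: patches are flattened into plain Python lists, the function name `cosine_attention` is hypothetical, and the default scale `lam=10.0` is an assumed value standing in for the scaled-softmax hyper-parameter.

```python
import math

def cosine_attention(fg, bg_patches, lam=10.0):
    """Reconstruct one foreground patch as a softmax-weighted sum of background patches.

    fg: flattened foreground patch; bg_patches: list of flattened background patches.
    lam is an assumed value for the scaled-softmax hyper-parameter.
    """
    def norm(v):
        # Fall back to 1.0 so that an all-zero patch does not divide by zero.
        return math.sqrt(sum(x * x for x in v)) or 1.0

    # Cosine similarity between the foreground patch and every background patch.
    scores = [sum(a * b for a, b in zip(fg, bg)) / (norm(fg) * norm(bg))
              for bg in bg_patches]
    # Scaled softmax over the background patches (max subtracted for stability).
    top = max(scores)
    exps = [math.exp(lam * (s - top)) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Weighted sum of background patches gives the reconstructed foreground patch.
    recon = [sum(w * bg[i] for w, bg in zip(weights, bg_patches))
             for i in range(len(fg))]
    return weights, recon

weights, recon = cosine_attention([1.0, 0.0], [[2.0, 0.0], [0.0, 3.0]])
```

In this toy call the first background patch points in the same direction as the foreground patch, so it receives almost all of the attention weight. If the foreground feature were poorly encoded, the weights would concentrate on unrelated patches, which is the failure mode discussed for Fig. 2(d).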

IV Proposed method

IV-A Architecture of PEPSI

As shown in Fig. 3, the PEPSI unifies the stacked networks of the coarse-to-fine network into a single generative network with a single shared encoding network and a parallel decoding network called coarse and inpainting paths. The encoding network aims at jointly learning to extract features from background regions as well as to complete the features of hole regions without the coarse results. As listed in Table II, the encoding network consists of a series of convolutional layers except for the first layer which employs a kernel to fully extract the latent information from the input image. In addition, the encoding network enlarges the receptive field by employing dilated convolutional layers with different dilation rates in the last four convolutional layers.

Type Kernel Dilation Stride Outputs
Convolution 1 32
Convolution 1 64
Convolution 1 64
Convolution 1 128
Convolution 1 128
Convolution 1 256
Dilated 2 256
Dilated 4 256
Dilated 8 256
Dilated 1 256
TABLE II: Detail architecture of encoding network.
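To illustrate why dilated layers enlarge the receptive field, the following sketch computes the receptive field of a stack of convolutions along one axis. It is only an illustration under stated assumptions: the kernel size (3) and stride (1) are assumed values, since those columns are not legible in Table II, and the dilation rates 2, 4, and 8 are the first three listed there. The helper name `receptive_field` is hypothetical.

```python
def receptive_field(layers):
    """Receptive field size along one axis for a stack of convolutions.

    layers: list of (kernel, stride, dilation) tuples, ordered from input to output.
    """
    rf, jump = 1, 1
    for kernel, stride, dilation in layers:
        # Each layer widens the field by (kernel - 1) dilated taps,
        # measured in input pixels via the accumulated stride (jump).
        rf += (kernel - 1) * dilation * jump
        jump *= stride
    return rf

plain = [(3, 1, 1)] * 3                        # three ordinary 3x3 layers (assumed)
dilated = [(3, 1, 2), (3, 1, 4), (3, 1, 8)]    # same depth with dilation 2, 4, 8

print(receptive_field(plain), receptive_field(dilated))
```

With the same number of layers and weights, the dilated stack covers a much wider context, which is why the encoding network relies on it to gather information from distant background regions.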

The parallel decoding network consists of coarse and inpainting paths. The coarse path produces a roughly completed result from the encoded feature map. The inpainting path, on the other hand, takes the encoded features as input and first reconstructs the feature map using the CAM. Then, the reconstructed feature map is decoded to produce a higher-quality inpainting result. Table III shows the detailed architecture of the decoding network. By sharing the weight parameters of the two paths, we regularize the inpainting path of the decoding network. Furthermore, since the two paths use the same encoded feature maps as their input, this joint learning strategy encourages the encoding network to produce features that are valuable for two different image generation tasks. In order to jointly train both paths, we explicitly apply only the reconstruction loss to the coarse path, whereas the inpainting path is trained with the reconstruction loss as well as the adversarial loss. Additional information about the joint learning scheme is given in Section IV-E. Note that we employ only the inpainting path during testing, which substantially reduces the computational complexity.

Similar to conventional image inpainting methods [4, 10, 7, 16], PEPSI uses input pairs of a masked image and a binary mask indicating the background regions. The masked image includes holes whose number, sizes, shapes, and locations are randomly sampled at every iteration. In terms of layer implementations, we use reflection padding for all convolution layers and utilize the exponential linear unit (ELU) as an activation function instead of ReLU, except in the last layer. Also, we utilize [-1, 1] normalized image with pixels as an input image in the network, and generate an output image with the same resolution by clipping the output values into [-1, 1] instead of using functions.

Type Kernel Dilation Stride Outputs
Convolution 1 128
Upsample - - - -
Convolution 1 64
Upsample - - - -
Convolution 1 32
Upsample - - - -
Convolution 1 16
Convolution 1 3
TABLE III: Detailed architecture of the decoding network. The output layer is a convolution layer whose values are clipped to [-1, 1].
Fig. 4: The structure of the Diet-PEPSI unit (DPU). The DPU focuses on integrating the global context information with a small number of parameters.

IV-B Architecture of Diet-PEPSI

Although PEPSI effectively reduces the number of convolution operations, it still needs a number of network parameters similar to that of the coarse-to-fine network. As mentioned in Section IV-A, PEPSI aggregates the contextual information using a series of dilated convolutional layers, which requires numerous network parameters. The intuitive way to reduce hardware costs is to prune the channels of these layers, but this often yields inferior results in practice. Considering the size of the receptive field and the number of parameters simultaneously, we propose a lightweight model of PEPSI called Diet-PEPSI. Instead of the dilated convolutional layer, Diet-PEPSI employs a novel DPU, which aggregates the global contextual information with a small number of parameters.

Inspired by the ShuffleNet unit [18], we design the DPU as shown in Fig. 4. In its residual branch, the contextual information is first assembled via dilated group convolution (DGC), which splits the input features into multiple groups and conducts a dilated convolution for each group individually. In the DGC layer, each group produces output feature maps having the same dimension as the input group. However, since the receptive fields of the DGC output and the input feature maps are substantially different, the identity shortcut may not be the best option for combining the contextual information with the input feature maps. Thus, we apply the projection shortcut technique, which projects both features using a convolutional layer and performs element-wise addition. In the proposed method, we empirically project the DGC output via a standard convolution, while a group convolution followed by a channel shuffling operation is used to embed the input feature maps. While the dilated convolutional layer needs parameters, the DPU requires parameters where , , and indicate the number of input channels, output channels, and groups in DGC, respectively. Therefore, by using the DPU, Diet-PEPSI covers the same receptive field with fewer parameters than PEPSI. The validity of the DPU will be discussed in Section V-B.
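The saving that group convolution brings can be checked with simple arithmetic. The sketch below does not reproduce the DPU's full parameter count; it only counts the weights of a single, possibly grouped, convolution while ignoring biases. The kernel size of 3 is an assumed value, the 256 channels match the outputs listed in Table II, and `conv_params` is a hypothetical helper name.

```python
def conv_params(kernel, c_in, c_out, groups=1):
    """Weight count of a (possibly grouped) 2-D convolution, ignoring biases."""
    assert c_in % groups == 0 and c_out % groups == 0
    # Each output channel only sees c_in / groups input channels.
    return kernel * kernel * (c_in // groups) * c_out

standard = conv_params(3, 256, 256)            # ordinary dilated convolution
grouped = conv_params(3, 256, 256, groups=4)   # dilated group convolution, 4 groups
pointwise = conv_params(1, 256, 256)           # 1x1 projection (assumed size)

print(standard, grouped, pointwise)
```

Dilation does not change the weight count, so splitting the channels into 4 groups divides the cost of the dilated layer by 4 while keeping its receptive field. The extra projection layers of a unit such as the DPU add back some parameters, which is why the overall reduction is smaller than the group factor alone suggests.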

Fig. 5: The overview of the RED. The RED aims to classify hole regions, which may appear in any region of an image and at any size.

IV-C Region Ensemble Discriminator (RED)

Traditional image inpainting networks [4] utilized both global and local discriminators to determine whether or not an image has been completed consistently. However, the local discriminator can only handle a square hole region of fixed size; it is difficult to employ the local discriminator to train the inpainting network for irregular holes. To solve this problem, we propose the RED, inspired by the region ensemble network [27], which detects a target object appearing anywhere in an image by handling multiple feature regions individually. As described in Fig. 5 and Table IV, six strided convolutions with a kernel size of and stride 2 are stacked to capture the features of the whole image. Then, we apply a different fully-connected layer to each pixel-wise block of the last layer to individually determine whether that block is real or fake. Since the RED separately classifies feature blocks that have different receptive fields, it can cover the whole image as well as the local regions. In other words, the RED acts as a global and a local discriminator at the same time.
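The per-block classification can be sketched as follows. This is an illustrative toy, not the authors' implementation: each spatial block of the last feature map is represented as a flat list, every block has its own hypothetical fully-connected weights and bias, and `red_scores` is an assumed name.

```python
def red_scores(feature_blocks, fc_weights, fc_biases):
    """Return one real/fake score per spatial block of the last feature map.

    feature_blocks[i] is the feature vector of block i; block i is scored by its
    own fully-connected layer (fc_weights[i], fc_biases[i]), so no weights are
    shared between blocks.
    """
    return [sum(w * f for w, f in zip(weights, block)) + bias
            for block, weights, bias in zip(feature_blocks, fc_weights, fc_biases)]

blocks = [[1.0, 2.0], [3.0, 4.0]]      # two blocks with two channels each (toy values)
weights = [[1.0, 0.0], [0.0, 1.0]]     # a separate weight vector per block
biases = [0.0, -1.0]
scores = red_scores(blocks, weights, biases)
```

Because every block yields its own score, a hole of any shape influences only the blocks whose receptive fields overlap it, and the loss can be evaluated on each score individually rather than on one image-level verdict.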

IV-D Modified CAM

The conventional CAM [4] measures similarity scores by applying the cosine similarity. However, normalizing the feature patch vector in (2) can distort the semantic feature representation. Thus, we propose a modified CAM which directly measures distance similarity scores using the Euclidean distance. Since the Euclidean distance considers not only the angle between two vectors of feature patches but also their magnitudes, it is more appropriate for reconstructing the feature patch. Since the distance similarity scores, which have an output range of $[0, \infty)$, are hard to apply softmax to, we define the truncated distance similarity score as follows:

$$\tilde{d}_{(x,y),(x',y')} = \tanh\left(-\frac{d_{(x,y),(x',y')} - m\left(d_{(x,y),(x',y')}\right)}{\sigma\left(d_{(x,y),(x',y')}\right)}\right), \tag{4}$$

where $d_{(x,y),(x',y')} = \lVert f_{x,y} - b_{x',y'} \rVert$ is the Euclidean distance between the foreground patch $f_{x,y}$ and the background patch $b_{x',y'}$. Since $\tilde{d}_{(x,y),(x',y')}$ has limited values within $[-1, 1]$, it operates like a threshold which sorts out the distance scores less than the mean value. This means that $\tilde{d}_{(x,y),(x',y')}$ helps to divide the background patches into two groups: those related to the foreground patch and those that are not. Similar to the conventional CAM, the modified one also weights the scores with scaled softmax and finally reconstructs the foreground patch as a weighted sum of background patches. Consequently, the module reconstructs foreground patches from a group of related patch vectors. The superiority of the modified CAM will be explained in Section V-B.
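The truncation step can be sketched in a few lines. The sketch assumes that the truncated score is the hyperbolic tangent of the negated, standardized Euclidean distance (distance minus mean, divided by standard deviation, both taken over the background patches of one foreground patch); this form is an assumption chosen to match the bounded range and the mean-value threshold described above. The function name is hypothetical and patches are flattened into lists.

```python
import math

def truncated_distance_scores(fg, bg_patches):
    """Map Euclidean distances to bounded similarity scores in [-1, 1].

    Assumed form: tanh(-(d - mean(d)) / std(d)), so that patches closer than the
    mean distance get positive scores and farther patches get negative scores.
    """
    dists = [math.sqrt(sum((a - b) ** 2 for a, b in zip(fg, bg)))
             for bg in bg_patches]
    mean = sum(dists) / len(dists)
    # Fall back to 1.0 when all distances are equal to avoid dividing by zero.
    std = math.sqrt(sum((d - mean) ** 2 for d in dists) / len(dists)) or 1.0
    return [math.tanh(-(d - mean) / std) for d in dists]

scores = truncated_distance_scores([0.0, 0.0], [[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])
```

In this toy call the distances are 0, 5, and 10, so the nearest patch receives a positive score, the patch at the mean distance receives zero, and the farthest patch receives a negative score. The bounded scores can then be passed through the same scaled softmax used by the conventional CAM.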

Type Kernel Stride Outputs
Convolution 64
Convolution 128
Convolution 256
Convolution 256
Convolution 256
Convolution 512
FC 1
TABLE IV: Detailed architecture of RED. After each convolution layer, except last one, there is a leaky-ReLU as the activation function. Every layer is normalized by a spectral normalization. The fully-connected layer is applied to every pixel-wise feature block.

IV-E Loss function

In order to train PEPSI and Diet-PEPSI, we jointly optimize two different paths: the inpainting path and the coarse path. For the inpainting path, we employ the GAN [7] optimization framework in (1), which is described in Section III-A. However, this loss function often fails to generate satisfactory results owing to the vanishing gradient problem in the generator. To address this problem, inspired by [28], we employ the hinge version of the adversarial loss instead of (1), which is expressed as

$$L_G = -\mathbb{E}_{x \sim P_{X_i}}\left[D(x)\right], \tag{5}$$

$$L_D = \mathbb{E}_{x \sim P_Y}\left[\max(0, 1 - D(x))\right] + \mathbb{E}_{x \sim P_{X_i}}\left[\max(0, 1 + D(x))\right], \tag{6}$$

where $P_{X_i}$ and $P_Y$ denote the data distributions of inpainting results and input images. We apply the spectral normalization [29] to all layers in the RED in order to further stabilize the training of GANs. Since the goal of image inpainting is not only to complete the hole regions naturally but also to restore the missing part of the original image accurately, we add a strong constraint using the $L_1$ norm to (5) as follows:

$$L_G = \frac{\lambda_i}{N} \sum_{n=1}^{N} \left\lVert X_i^{(n)} - Y^{(n)} \right\rVert_1 - \lambda_{adv}\, \mathbb{E}_{x \sim P_{X_i}}\left[D(x)\right], \tag{7}$$

where $X_i^{(n)}$ and $Y^{(n)}$ represent the $n$-th image pair of the generated image through the inpainting path and its corresponding original image in a mini-batch, respectively, $N$ is the number of image pairs in a mini-batch, and $\lambda_i$ and $\lambda_{adv}$ are hyper-parameters to balance between the two loss terms.
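The loss terms of the inpainting path can be written out numerically as follows. This is a sketch assuming the standard hinge formulation of the adversarial loss; images are flattened into lists, discriminator outputs are given as lists of scores, and the weights `lam_rec` and `lam_adv` as well as all function names are hypothetical placeholders rather than the values or code used in the experiments.

```python
def d_hinge_loss(real_scores, fake_scores):
    """Discriminator hinge loss: push real scores above +1 and fake scores below -1."""
    real = sum(max(0.0, 1.0 - s) for s in real_scores) / len(real_scores)
    fake = sum(max(0.0, 1.0 + s) for s in fake_scores) / len(fake_scores)
    return real + fake

def g_adv_loss(fake_scores):
    """Generator hinge loss: raise the discriminator score of generated images."""
    return -sum(fake_scores) / len(fake_scores)

def l1_loss(generated, original):
    """Mean over the mini-batch of the per-image L1 distance."""
    per_image = [sum(abs(g - o) for g, o in zip(gen, org))
                 for gen, org in zip(generated, original)]
    return sum(per_image) / len(per_image)

def generator_loss(generated, original, fake_scores, lam_rec, lam_adv):
    """Weighted sum of the L1 reconstruction term and the adversarial term."""
    return lam_rec * l1_loss(generated, original) + lam_adv * g_adv_loss(fake_scores)

loss = generator_loss([[1.0, 2.0], [3.0, 4.0]], [[1.0, 1.0], [1.0, 4.0]],
                      fake_scores=[1.0, 3.0], lam_rec=2.0, lam_adv=0.5)
```

The hinge form stops penalizing the discriminator once a sample is classified with sufficient margin, which is one common explanation for its more stable gradients; the reconstruction term keeps the result close to the original image wherever the adversarial term alone would accept any plausible completion.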

On the other hand, the coarse path is designed to accurately complete the missing features for the CAM. Therefore, we simply optimize the coarse path using an $L_1$ loss function defined as follows:

$$L_C = \frac{1}{N} \sum_{n=1}^{N} \left\lVert X_c^{(n)} - Y^{(n)} \right\rVert_1, \tag{8}$$

where $X_c^{(n)}$ and $Y^{(n)}$ are the $n$-th image pair of the generated image via the coarse path and its corresponding original image in a mini-batch, respectively. Finally, we define the total loss function of the generative network of PEPSI as follows:

$$L_{total} = L_G + \lambda_c \left(1 - \frac{k}{k_{max}}\right) L_C, \tag{9}$$

where $\lambda_c$ is a hyper-parameter controlling the contributions from each loss term, and $k$ and $k_{max}$ represent the iteration of the learning procedure and the maximum number of iterations, respectively. In the proposed method, as the training progresses, we gradually decrease the contribution of $L_C$ so that the decoding network focuses on the inpainting path.
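The gradual shift of emphasis from the coarse path to the inpainting path can be sketched as a weight on the coarse loss that shrinks with the iteration count. The linear decay used below is an assumption about the schedule, and `total_loss` and the default `lam_c=1.0` are hypothetical.

```python
def total_loss(loss_g, loss_c, k, k_max, lam_c=1.0):
    """Combine the inpainting-path and coarse-path losses at iteration k.

    The coarse-path weight decays linearly from lam_c at k = 0 to zero at
    k = k_max (assumed schedule).
    """
    coarse_weight = lam_c * (1.0 - k / k_max)
    return loss_g + coarse_weight * loss_c

start = total_loss(2.0, 4.0, k=0, k_max=100)     # coarse path fully weighted
middle = total_loss(2.0, 4.0, k=50, k_max=100)   # coarse path half weighted
end = total_loss(2.0, 4.0, k=100, k_max=100)     # only the inpainting path remains
```

Early in training the coarse loss dominates the signal reaching the shared encoder, which helps it learn to fill in hole features; by the end the gradient comes only from the inpainting path, the one that is kept at test time.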

Fig. 6: Examples of (a) the original image, (b) its square masked image, and (c) the free-form masked image.

V Experiments

V-A Implementation details

Free-Form Mask As shown in Fig. 6(b), traditional methods [4, 7, 10] usually adopt the regular mask (e.g. hole region with rectangular shape) during the training procedure. Thus, the network trained with regular mask often yields visual artifacts such as color discrepancy and blurriness when the hole is irregular in shape. To address this problem, Yu et al. [16] adopt the free-form mask algorithm during the training procedure, which automatically generates multiple random free-form holes as depicted in Fig. 6(c). More specifically, this algorithm first produces the free-form mask by drawing multiple different lines and erasing pixels closer than an arbitrary distance from these lines. For a fair comparison, we adopt the same free-form mask generation algorithm to train networks.
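The line-drawing idea behind the free-form mask can be sketched as follows. This is a simplified illustration rather than the algorithm of [16]: it draws a few random straight segments and marks every pixel within a random radius of each segment as a hole. The function name, the number of lines, and the maximum radius are assumed values, and the mask uses 1 for hole pixels.

```python
import math
import random

def free_form_mask(height, width, num_lines=4, max_radius=6, seed=None):
    """Return a height x width grid in which 1 marks hole pixels and 0 background."""
    rng = random.Random(seed)
    mask = [[0] * width for _ in range(height)]
    for _ in range(num_lines):
        # Random end points and brush radius for one stroke.
        x0, y0 = rng.uniform(0, width - 1), rng.uniform(0, height - 1)
        x1, y1 = rng.uniform(0, width - 1), rng.uniform(0, height - 1)
        radius = rng.uniform(1, max_radius)
        steps = int(max(abs(x1 - x0), abs(y1 - y0))) + 1
        for t in range(steps + 1):
            # Walk along the segment and erase a disc around each sample point.
            cx = x0 + (x1 - x0) * t / steps
            cy = y0 + (y1 - y0) * t / steps
            for y in range(max(0, int(cy - radius)), min(height, int(cy + radius) + 1)):
                for x in range(max(0, int(cx - radius)), min(width, int(cx + radius) + 1)):
                    if math.hypot(x - cx, y - cy) <= radius:
                        mask[y][x] = 1
    return mask

mask = free_form_mask(32, 32, seed=0)
hole_ratio = sum(map(sum, mask)) / (32 * 32)
```

Sampling a new mask at every iteration exposes the network to holes of varying number, size, shape, and location, which a fixed rectangular mask cannot provide.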

Fig. 7: Comparison of our method and conventional methods on randomly square masked CelebA-HQ datasets. (a) The ground truth (b) The input image of the network (c) Results of the Context Encoder [10] (d) Results of the Globally-Locally [7] (e) Results of the gated convolution [16] (f) Results of the proposed method (g) Results of the Diet-PEPSI.
Fig. 8: Comparison of our method and conventional methods on free-form masked CelebA-HQ datasets. (a) The ground truth (b) The input image of the network (c) Results of the Context Encoder [10] (d) Results of the Globally-Locally [7] (e) Results of the gated convolution [16] (f) Results of the PEPSI (g) Results of the Diet-PEPSI.

Training Procedure PEPSI and Diet-PEPSI are trained for one million iterations using a batch size of 8 in an end-to-end manner. As all the parameters in PEPSI and Diet-PEPSI are differentiable, we performed the optimization with the Adam optimizer [30], which is a stochastic optimization method with adaptive estimation of moments. We set the Adam parameters $\beta_1$ and $\beta_2$ to 0.5 and 0.9, respectively. Based on [31], we applied the two-timescale update rule (TTUR) where the learning rates of the discriminator and generator were and , respectively. In addition, we reduced the learning rate to 1/10 after 0.9 million iterations. The hyper-parameters in our model are set to , , and . Our experiments were conducted on an Intel(R) Xeon(R) E3-1245 v5 CPU and a TITAN X (Pascal) GPU, and implemented in TensorFlow v1.8.

Fig. 9: Comparison of our method and conventional methods on Place2 datasets. (a) The ground truth (b) The input image of the network (c) Results of the non-generative method, PatchMatch (d) Results of the GatedConv [16] (e) Results of the PEPSI (f) Results of the Diet-PEPSI.

For our experiments, we use the CelebA-HQ [22, 21], ImageNet [24], and Place2 [23] datasets, which comprise human faces, objects, and various scenes, respectively. In the CelebA-HQ dataset, we randomly sample 27,000 images as a training set and 3,000 images as a test set. We also train the network with all the images in the ImageNet dataset and test it on the Place2 dataset, which measures the performance of the trained model on another dataset and thereby confirms the generalization ability of the proposed method. In addition, to demonstrate the superiority of PEPSI and Diet-PEPSI, we compare their qualitative results, quantitative results, and operation speed with those of the conventional generative methods: CE [10], GL [7], GCA [4], and GatedConv [16].

V-B Performance Evaluation

Qualitative Comparison We compare the qualitative performance of the proposed method with the conventional approaches using images with the free-form mask as well as the square mask. The conventional methods are implemented by following the training procedure in each paper. As shown in Fig. 7 and 8, CE [10] and GL [7] show obvious visual artifacts, including blurred or distorted content in the masked region, especially in the case of the free-form mask. Although GatedConv [16] performs well, its results lack relevance between the hole and background regions, for example in the symmetry of the eyes. Compared with the conventional methods, PEPSI shows visually appealing results and high relevance between hole and background regions. In addition, we produce the output image with Diet-PEPSI by setting the number of groups to 4. As shown in Fig. 7(g) and Fig. 8(g), the results of Diet-PEPSI are comparable to those of PEPSI while saving a significant number of parameters.

Moreover, we show the real-world applicability of PEPSI by testing on the challenging ImageNet and Place2 datasets. We compare the proposed method with GatedConv and the widely available non-generative method, PatchMatch [3], on the Place2 dataset with image resolution. As depicted in Fig. 9, PatchMatch shows visually poor performance, especially at the edges of images, because it fills the hole region without understanding the global context of the scene. GatedConv generates more realistic results without color discrepancy or edge distortion compared to PatchMatch. However, it often produces images with wrong textures, as shown in the first and third rows of Fig. 9. In contrast, PEPSI and Diet-PEPSI generate the most natural images, without artifacts or distortion, across various contents and complex scenes.

Fig. 10: Illustration of units that reduce the number of parameters while aggregating the global contextual information. (a) Dilated convolution layer with channel pruning, (b) Dilated convolution layer with group convolution, (c) Modified structure of the ShuffleNet unit [18], (d) Proposed Diet-PEPSI unit.
Method            Square mask               Free-form mask            Time (ms)   Number of
                  PSNR            SSIM      PSNR            SSIM                  Parameters
                  Local   Global            Local   Global
CE [10]           17.7    23.7    0.872     9.7     16.3    0.794     5.8         5.1M
GL [7]            19.4    25.0    0.896     15.1    21.5    0.843     39.4        5.8M
GCA [4]           19.0    24.9    0.898     12.4    18.9    0.798     22.5        2.9M
GatedConv [16]    18.7    24.7    0.895     21.2    27.8    0.925     21.4        4.1M
PEPSI             19.5    25.6    0.901     22.0    28.6    0.929     9.2         3.5M
PEPSI*            19.2    25.2    0.894     21.6    28.2    0.923
Diet-PEPSI ()     19.4    25.4    0.898     21.9    28.5    0.928     10.7        2.1M
Diet-PEPSI ()     19.3    25.3    0.897     21.9    28.5    0.928     12.0        1.8M

TABLE V: Results of global and local PSNR, SSIM and operation time with both of square and free-formed masks on CelebA-HQ dataset. * means a model without coarse results.
Mask        Method            PSNR              SSIM
                              Local   Global
Square      GatedConv [16]    14.2    20.3      0.818
            PEPSI             15.2    21.2      0.832
            Diet-PEPSI        15.4    21.5      0.839
Free-form   GatedConv [16]    17.4    24.0      0.875
            PEPSI             18.2    24.8      0.882
            Diet-PEPSI        18.6    25.2      0.889

TABLE VI: Results of global and local PSNR and SSIM on the Places2 dataset.

Quantitative Comparison We evaluate the performance of the proposed and conventional methods by measuring the peak signal-to-noise ratio (PSNR) of the local and global regions, i.e., the hole region and the whole image, respectively, and the structural similarity (SSIM) [32]. Table V provides comprehensive performance benchmarks of the proposed methods and the conventional ones [4, 7, 10, 16] on the CelebA-HQ dataset [21]. As shown in Table V, CE [10], GL [7], and GCA [4] effectively fill hole regions with a square shape but show inferior performance when filling hole regions with an irregular shape. Since these methods mainly aim at filling rectangular holes, they do not generalize to free-form masks. GL [7] shows performance comparable to that of PEPSI only with the square mask, since it uses an image blending technique for post-processing; however, it yields blurred results, as shown in Fig. 7, and needs more computation time than the other methods. GatedConv [16] performs well on both square and free-form holes but also needs more computation time than PEPSI. Compared with the conventional methods, PEPSI can fill any hole shape while significantly reducing the operation time. In addition, Diet-PEPSI retains the ability of PEPSI while significantly reducing the network parameters. Although Diet-PEPSI needs slightly more computation time owing to the DPU in our implementation, it still results in faster operation as well as superior performance compared to the conventional methods.
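The distinction between local and global PSNR can be made concrete with a minimal sketch. It assumes 8-bit grayscale images stored as lists of rows and a binary hole mask; the function name `psnr`, the `mask` argument, and the toy 4x4 image are illustrative and are not taken from the paper's implementation.

```python
import math


def psnr(reference, estimate, mask=None, peak=255.0):
    """PSNR in dB between two equally sized 2-D images (lists of rows).

    With a mask, only pixels where mask[y][x] is truthy are counted,
    giving the "local" (hole-region) PSNR. With mask=None every pixel
    is counted, giving the "global" (whole-image) PSNR.
    """
    squared_error = 0.0
    count = 0
    for y, row in enumerate(reference):
        for x, ref_value in enumerate(row):
            if mask is None or mask[y][x]:
                diff = ref_value - estimate[y][x]
                squared_error += diff * diff
                count += 1
    if count == 0:
        raise ValueError("mask selects no pixels")
    mse = squared_error / count
    if mse == 0.0:
        return math.inf
    return 10.0 * math.log10(peak * peak / mse)


# Toy 4x4 image with a 2x2 hole in the centre (hypothetical data).
reference = [[100] * 4 for _ in range(4)]
hole = [[1 if 1 <= y <= 2 and 1 <= x <= 2 else 0 for x in range(4)]
        for y in range(4)]
estimate = [row[:] for row in reference]
for y in range(4):
    for x in range(4):
        if hole[y][x]:
            estimate[y][x] = 110  # inpainted pixels are off by 10 levels

local_psnr = psnr(reference, estimate, mask=hole)   # hole region only
global_psnr = psnr(reference, estimate)             # whole image
```

In this toy case the pixels outside the hole are error-free, so the same total error is averaged over four times as many pixels and the global PSNR exceeds the local PSNR by 10·log10(4) ≈ 6.02 dB. This is one reason a global score is typically higher than the corresponding local score when errors are concentrated in the hole.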

For further study, we conduct an extra experiment in which PEPSI is trained without coarse path learning. PEPSI exhibits better performance than the variant trained without coarse results in terms of all the quantitative metrics, which indicates that the coarse path drives the encoding network to properly produce the missing features for the CAM. In other words, the single-stage network structure of PEPSI can overcome the limitation of the two-stage coarse-to-fine network through a joint learning scheme.

To demonstrate the generalization ability of PEPSI, we conduct another experiment using the challenging datasets ImageNet [24] and Places2 [23]. Table VI shows the experimental results of the test using the input image with the resolution of . We compare the performance of the proposed method with that of GatedConv [16], which exhibits superior performance compared to the other conventional methods on the CelebA-HQ dataset. As shown in Table VI, PEPSI and Diet-PEPSI achieve better performance than GatedConv on the Places2 dataset, which indicates that the proposed method can consistently generate high-quality results on various contents and complex images.

Unit      Square mask        Free-form mask
          PSNR     SSIM      PSNR     SSIM
Pruning   25.21    0.8961    28.28    0.9270
DGC       25.33    0.8974    28.38    0.9270
MSU       25.22    0.8956    28.41    0.9273
DPU       25.32    0.8973    28.46    0.9276
TABLE VII: The experimental results using different lightweight units.

DPU analysis To demonstrate the ability of the DPU, we conduct additional experiments that reduce the network parameters with different techniques. Fig. 10 shows the models used in our experiments. Fig. 10(a) and (b) illustrate the convolution layer with pruned channels and the DGC, respectively, which are intuitive approaches to decreasing the number of parameters. Fig. 10(c) and (d) show a modified ShuffleNet unit (MSU) [18] and the proposed DPU, respectively. For a fair comparison, we adjust the number of pruned channels and the number of groups so that the models use a similar number of parameters. In our experiments, we set the channels of the pruned convolution layers to 113 and the number of groups of the DGC to four. The number of groups in the MSU and DPU is set to eight. As shown in Table VII, the pruning strategy shows inferior results with both square and free-form masks. The DGC performs well with the square mask but weakly with the free-form mask, whereas the MSU exhibits inferior performance with the square mask. Compared with these models, the DPU performs well with both the square and free-form masks; therefore, we apply this model to our Diet-PEPSI instead of the dilated convolution layer.
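The parameter saving from group convolution mentioned above follows from simple arithmetic, sketched below. The helper `conv_weight_count` is hypothetical, and the 256-channel, 3x3 layer is an assumed example rather than the layer size used in the paper; biases are ignored.

```python
def conv_weight_count(in_channels, out_channels, kernel_size, groups=1):
    """Number of weights in a 2-D convolution layer (bias excluded).

    With `groups` groups, each output channel only sees
    in_channels / groups input channels, so the weight count is the
    standard count divided by `groups`. Dilation changes the receptive
    field but not the number of weights.
    """
    if in_channels % groups or out_channels % groups:
        raise ValueError("channel counts must be divisible by groups")
    return (in_channels // groups) * out_channels * kernel_size * kernel_size


# Assumed example layer: 256 -> 256 channels, 3x3 kernel.
standard = conv_weight_count(256, 256, 3)             # 589,824 weights
four_groups = conv_weight_count(256, 256, 3, groups=4)   # 147,456 weights
eight_groups = conv_weight_count(256, 256, 3, groups=8)  # 73,728 weights
```

In this sketch, four groups cut the weight count to one quarter and eight groups to one eighth of the standard layer. The price is that channels in different groups no longer exchange information within the layer, which is the kind of trade-off the units in Fig. 10 address in different ways.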

Discriminator   Square mask       Free-form mask
                PSNR     SSIM     PSNR     SSIM
SNM-Disc [16]   25.68    0.901    28.71    0.932
RED             25.57    0.901    28.59    0.929

TABLE VIII: The experimental results using different discriminators.
Fig. 11: Comparison of RED and SNM-Disc on the CelebA-HQ dataset: (a) input image, (b) results of PEPSI trained with RED, and (c) results of PEPSI trained with SNM-Disc [16].

RED analysis In this paragraph, we demonstrate the superiority of RED by comparing its performance with that of the SNM-discriminator, which was newly introduced in GatedConv [16]. For a fair comparison, we employ each discriminator on PEPSI with the modified CAM as the base network. As shown in Table VIII, the SNM-discriminator exhibits slightly better performance than RED in terms of PSNR and SSIM. However, as illustrated in Fig. 11, we found that the SNM-discriminator could not generate visually plausible images despite its high PSNR value; PEPSI trained with the SNM-discriminator produces results with visual artifacts, such as blurred or distorted content in the masked region. These observations indicate that the SNM-discriminator could not effectively compete with the generative network, which makes the generator mainly focus on minimizing the loss in the objective function of PEPSI. Therefore, even though PEPSI trained with the SNM-discriminator has fine quantitative performance, it is not suitable for inpainting in practice.
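The observation that a high PSNR does not guarantee a visually plausible result can be illustrated with a toy one-dimensional signal. This is a generic property of error-based metrics, not a reproduction of the paper's experiment; the signals and the helper `psnr_1d` are hypothetical.

```python
import math


def psnr_1d(reference, estimate, peak=255.0):
    """PSNR in dB between two equally long 1-D signals."""
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    return math.inf if mse == 0 else 10.0 * math.log10(peak * peak / mse)


texture = [0.0, 255.0] * 4   # sharp ground-truth stripes
blurred = [127.5] * 8        # flat fill with no texture at all
shifted = [255.0, 0.0] * 4   # equally sharp stripes, one pixel out of phase

psnr_blurred = psnr_1d(texture, blurred)  # about 6.02 dB
psnr_shifted = psnr_1d(texture, shifted)  # 0 dB
```

The flat fill scores about 6 dB higher than the shifted stripes even though only the latter contains any texture. A generator that mainly minimizes a pixel-wise loss is therefore rewarded for blurred output, which is consistent with the behavior described above.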

On the other hand, we consider the reason why RED can effectively drive the generator to produce visually pleasing inpainting results. RED follows the inspiration of the Region Ensemble Network [27], which classifies objects in any region of the image. As a result, in adversarial learning, the generator tries to make every region of the image indistinguishable from real images. This procedure improves the generator on free-form masks, including irregular holes. Thus, we expect that RED can be applied to various image inpainting networks to generate visually plausible images.

Modified CAM analysis To demonstrate the validity of the modified CAM, we perform toy experiments comparing the cosine similarity and the truncated distance similarity. We reconstruct the hole region as a weighted sum of existing image patches, where the weights are obtained from either the cosine similarity scores or the truncated distance similarity scores. As depicted in Fig. 12, reconstruction with the truncated distance similarity collects more similar patches than reconstruction with the cosine similarity. Furthermore, we compare the results of PEPSI with the conventional and modified CAMs to confirm the improvement brought by the modified CAM. As shown in Table IX, the modified CAM improves the performance over the conventional CAM, which means that the modified CAM is more suitable for expressing the relationship between the background and hole regions.
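The weighted-sum reconstruction described above can be sketched on tiny feature vectors. The sketch assumes that the truncated distance similarity takes the form tanh(-(d - mean(d)) / std(d)) over the Euclidean distances d and that the weights come from a scaled softmax; both the exact form and the scale value of 10 are assumptions made for illustration, as are all function names and the toy patches.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def truncated_distance_scores(distances):
    # Assumed form: tanh of the negated, standardised distance, so that
    # smaller distances give larger scores bounded in (-1, 1).
    mean = sum(distances) / len(distances)
    var = sum((d - mean) ** 2 for d in distances) / len(distances)
    std = math.sqrt(var) or 1.0
    return [math.tanh(-(d - mean) / std) for d in distances]


def softmax(scores, scale=10.0):
    top = max(scores)
    exps = [math.exp(scale * (s - top)) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def reconstruct(weights, patches):
    # Weighted sum of background patches, one value per feature dimension.
    return [sum(w * p[i] for w, p in zip(weights, patches))
            for i in range(len(patches[0]))]


query = [1.0, 1.0]                                    # feature to fill in
patches = [[10.0, 10.0], [1.0, 1.2], [-1.0, 0.5]]      # background features

cos_scores = [cosine_similarity(query, p) for p in patches]
dist_scores = truncated_distance_scores(
    [euclidean_distance(query, p) for p in patches])

best_by_cosine = max(range(len(patches)), key=cos_scores.__getitem__)
best_by_distance = max(range(len(patches)), key=dist_scores.__getitem__)

recon_cosine = reconstruct(softmax(cos_scores), patches)
recon_distance = reconstruct(softmax(dist_scores), patches)
error_cosine = euclidean_distance(query, recon_cosine)
error_distance = euclidean_distance(query, recon_distance)
```

Cosine similarity ignores magnitude, so it ranks the patch [10, 10] highest because it points in the same direction as the query, whereas the distance-based score prefers the nearby patch [1, 1.2]. In this toy setting the distance-weighted reconstruction ends up much closer to the query than the cosine-weighted one.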

Fig. 12: A comparison of the image reconstruction between the cosine similarity and the truncated distance similarity: (a) original image, (b) masked image, (c) image reconstructed using the cosine similarity, and (d) image reconstructed using the truncated distance similarity.
Similarity           Square mask        Free-form mask
                     PSNR     SSIM      PSNR     SSIM
Cosine similarity    25.16    0.8950    27.95    0.9218
Euclidean distance   25.57    0.9007    28.59    0.9293
TABLE IX: Comparison of the performance between the cosine similarity and the Euclidean distance applied to PEPSI.

VI Conclusion

In this paper, a novel image inpainting model, called PEPSI, has been proposed. As shown in the experimental results, PEPSI not only achieves superior performance compared with conventional techniques but also significantly reduces the hardware costs via a parallel decoding path and an effective joint learning scheme. Furthermore, we have introduced Diet-PEPSI, which preserves the performance of PEPSI while reducing the network parameters almost by half, which facilitates hardware implementation. Both networks are trained with the proposed RED and show visually plausible results for square holes as well as holes with an irregular shape. Therefore, it is expected that the proposed method can be widely employed in various applications, including image generation, style transfer, and image editing.


  • [1] M. Bertalmio, G. Sapiro, V. Caselles, and C. Ballester, “Image inpainting,” in Proceedings of the 27th annual conference on Computer graphics and interactive techniques.   ACM Press/Addison-Wesley Publishing Co., 2000, pp. 417–424.
  • [2] A. A. Efros and W. T. Freeman, “Image quilting for texture synthesis and transfer,” in Proceedings of the 28th annual conference on Computer graphics and interactive techniques.   ACM, 2001, pp. 341–346.
  • [3] C. Barnes, E. Shechtman, A. Finkelstein, and D. B. Goldman, “Patchmatch: A randomized correspondence algorithm for structural image editing,” ACM Transactions on Graphics (ToG), vol. 28, no. 3, p. 24, 2009.
  • [4] J. Yu, Z. Lin, J. Yang, X. Shen, X. Lu, and T. S. Huang, “Generative image inpainting with contextual attention,” arXiv preprint, 2018.
  • [5] H. Noori, S. Saryazdi, and H. Nezamabadi-Pour, “A convolution based image inpainting,” in 1st International Conference on Communication and Engineering, 2010.
  • [6] H. Li, G. Li, L. Lin, and Y. Yu, “Context-aware semantic inpainting,” arXiv preprint arXiv:1712.07778, 2017.
  • [7] S. Iizuka, E. Simo-Serra, and H. Ishikawa, “Globally and locally consistent image completion,” ACM Transactions on Graphics (TOG), vol. 36, no. 4, p. 107, 2017.
  • [8] R. Köhler, C. Schuler, B. Schölkopf, and S. Harmeling, “Mask-specific inpainting with deep neural networks,” in German Conference on Pattern Recognition.   Springer, 2014, pp. 523–534.
  • [9] C. Li and M. Wand, “Combining markov random fields and convolutional neural networks for image synthesis,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2479–2486.
  • [10] D. Pathak, P. Krahenbuhl, J. Donahue, T. Darrell, and A. A. Efros, “Context encoders: Feature learning by inpainting,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2536–2544.
  • [11] C. Yang, X. Lu, Z. Lin, E. Shechtman, O. Wang, and H. Li, “High-resolution image inpainting using multi-scale neural patch synthesis,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), vol. 1, no. 2, 2017, p. 3.
  • [12] A. Fawzi, H. Samulowitz, D. Turaga, and P. Frossard, “Image inpainting through neural networks hallucinations,” in Image, Video, and Multidimensional Signal Processing Workshop (IVMSP), 2016 IEEE 12th.   IEEE, 2016, pp. 1–5.
  • [13] N. Cai, Z. Su, Z. Lin, H. Wang, Z. Yang, and B. W.-K. Ling, “Blind inpainting using the fully convolutional neural network,” The Visual Computer, vol. 33, no. 2, pp. 249–261, 2017.
  • [14] R. A. Yeh, C. Chen, T.-Y. Lim, A. G. Schwing, M. Hasegawa-Johnson, and M. N. Do, “Semantic image inpainting with deep generative models.” in CVPR, vol. 2, no. 3, 2017, p. 4.
  • [15] L. Xu, J. S. Ren, C. Liu, and J. Jia, “Deep convolutional neural network for image deconvolution,” in Advances in Neural Information Processing Systems, 2014, pp. 1790–1798.
  • [16] J. Yu, Z. Lin, J. Yang, X. Shen, X. Lu, and T. S. Huang, “Free-form image inpainting with gated convolution,” arXiv preprint arXiv:1806.03589, 2018.
  • [17] M.-C. Sagong, Y.-G. Shin, S.-W. Kim, S. Park, and S.-J. Ko, “Pepsi: Fast image inpainting with parallel decoding network,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, to be published.
  • [18] X. Zhang, X. Zhou, M. Lin, and J. Sun, “Shufflenet: An extremely efficient convolutional neural network for mobile devices,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 6848–6856.
  • [19] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
  • [20] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” in Advances in neural information processing systems, 2014, pp. 2672–2680.
  • [21] T. Karras, T. Aila, S. Laine, and J. Lehtinen, “Progressive growing of gans for improved quality, stability, and variation,” arXiv preprint arXiv:1710.10196, 2017.
  • [22] Z. Liu, P. Luo, X. Wang, and X. Tang, “Deep learning face attributes in the wild,” in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 3730–3738.
  • [23] B. Zhou, A. Lapedriza, A. Khosla, A. Oliva, and A. Torralba, “Places: A 10 million image database for scene recognition,” IEEE transactions on pattern analysis and machine intelligence, vol. 40, no. 6, pp. 1452–1464, 2018.
  • [24] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in neural information processing systems, 2012, pp. 1097–1105.
  • [25] D. Simakov, Y. Caspi, E. Shechtman, and M. Irani, “Summarizing visual data using bidirectional similarity,” in Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on.   IEEE, 2008, pp. 1–8.
  • [26] D.-A. Clevert, T. Unterthiner, and S. Hochreiter, “Fast and accurate deep network learning by exponential linear units (elus),” arXiv preprint arXiv:1511.07289, 2015.
  • [27] H. Guo, G. Wang, X. Chen, C. Zhang, F. Qiao, and H. Yang, “Region ensemble network: Improving convolutional network for hand pose estimation,” in Image Processing (ICIP), 2017 IEEE International Conference on.   IEEE, 2017, pp. 4512–4516.
  • [28] H. Zhang, I. Goodfellow, D. Metaxas, and A. Odena, “Self-attention generative adversarial networks,” arXiv preprint arXiv:1805.08318, 2018.
  • [29] T. Miyato, T. Kataoka, M. Koyama, and Y. Yoshida, “Spectral normalization for generative adversarial networks,” arXiv preprint arXiv:1802.05957, 2018.
  • [30] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
  • [31] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter, “Gans trained by a two time-scale update rule converge to a local nash equilibrium,” in Advances in Neural Information Processing Systems, 2017, pp. 6626–6637.
  • [32] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, “Image quality assessment: from error visibility to structural similarity,” IEEE transactions on image processing, vol. 13, no. 4, pp. 600–612, 2004.