Image inpainting is the process of reconstructing lost or deteriorated parts of images. The reconstruction quality of both the high-frequency and low-frequency information in an image dramatically affects the quality of image restoration. The low-frequency information of an image is usually smooth, such as color gradients, whereas the high-frequency information typically includes edges, texture, and other details. Recently, many studies on image inpainting have focused on deep convolutional neural networks [7, 8, 9, 10] and have achieved remarkable progress. However, these methods ignore the reconstruction of both the high-frequency and low-frequency information in the image. They process the missing regions in the same way as the non-missing regions using common convolution, even though the missing regions contain invalid data. Hence, some invalid information is introduced into the repaired areas during inpainting, which leads to generated images with unnatural color transitions. Moreover, the generated missing regions are usually over-smoothed and blurry due to the lack of constraints on the reconstruction of high-frequency information.
It has been shown that edge information dramatically affects the ability to reconstruct the high-frequency information and generate fine details of an image. Nazeri et al. proposed an image inpainting model based on a deep convolutional neural network. It repairs the missing regions according to edge information, which is produced by an edge generation model. However, this approach ignores the reconstruction of low-frequency information. Compared with the boundary area, it often fails to generate accurate content for the center area of the missing regions, which usually leads to complete images with weak color consistency and unpleasant visual artifacts.
Inspired by Liu et al., we propose a mask shrinking strategy to deal with the issue mentioned above. This strategy is applied to the image completion model. It includes a special convolution (SConv) as well as a mask updating mechanism. The SConv fills the missing regions using only the valid contents of the damaged image, which means the invalid data of the missing areas are discarded during inpainting. Therefore, the missing regions are completed with more visually realistic content, and the color transition of the repaired image is more natural. A mask updating mechanism is used after SConv, which records the areas still to be restored by shrinking the masked regions.
In summary, the proposed approach restores the damaged regions using edge information together with the mask shrinking strategy: the edge information is beneficial to the reconstruction of high-frequency information, while the mask shrinking strategy is good for rebuilding the low-frequency information. We evaluate our model on the public dataset Places2 and explain the details of the experiments in Section 5. The results show that our approach achieves higher-quality inpainting effects. The contributions of this paper are as follows:
An adversarial learning approach is proposed to repair both high-frequency information and low-frequency information simultaneously in the image.
The complete edge information generated by the edge generation model is good for repairing the high-frequency information of missing regions.
The mask shrinking strategy is conducive to restoring the low-frequency information of missing regions.
2 Related Work
Image inpainting is the process of reconstructing the missing regions of an image. Traditional image inpainting approaches are mainly based on mathematical and physical methods [t2, t3, t5]. Among these studies, Kokaram et al. [t2] interpolated the defects in a movie from adjacent frames using motion estimation and autoregressive models. Bertalmio et al. [t5] introduced a static image restoration algorithm that automatically fills the missing regions with the information surrounding them. However, these approaches usually produce over-smoothed images and fail to recover fine details.
Recently, deep learning has been gradually adopted for image inpainting and has achieved remarkable progress. Pathak et al. repaired images using an encoder-decoder structure; however, it fails to recover the fine details of missing regions because of the information bottleneck in the channel-wise fully connected layer. Yang et al. proposed a post-processing approach for context encoders that propagates the texture information from valid regions. Liu et al. proposed a "partial convolution" to repair irregular mask holes in images, but this method does not consider the high-frequency information. Nazeri et al. introduced edge information to constrain the high-frequency information, but they neglect the low-frequency information of images. All of the above methods use convolutional neural networks that do not fully consider both the high-frequency and low-frequency information.
3.1 Network structure
As shown in Figure 1, the proposed approach contains two models: the edge generation model and the image completion model. Both are GAN-based models, which means each of them consists of a generator and a discriminator. The edge generation model generates the complete edge information for the damaged image: its generator produces a fake edge map, whereas its discriminator aims to distinguish the fake edge map from the ground truth.
The image completion model repairs the damaged image into a complete image, conditioned on the complete edge information produced by the edge generation model. Its generator produces the fake inpainting result, whereas its discriminator distinguishes the fake inpainting result from the ground truth. The mask shrinking strategy is applied in each layer of the completion generator to reconstruct delicate low-frequency information.
We introduce the edge generation model in Section 3.2, illustrate the mask shrinking strategy in Section 3.3, and describe the image completion model in Section 3.4.
3.2 Edge generation model
Because of the good performance of the edge generation model of Nazeri et al., our edge generation model uses the same structure. The generator consists of encoders, residual blocks, and decoders. The encoders down-sample the images twice, followed by eight residual blocks, and the decoders then up-sample the feature maps twice back to the original image size. Spectral normalization is applied after the convolution in each layer, and dilated convolutions with a dilation factor of two are used in the residual blocks. The discriminator uses a PatchGAN [14, 15]. Instance normalization is used across all layers of the network.
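The dilated convolutions in the residual blocks enlarge the receptive field without adding parameters. As a minimal illustration (our own sketch, not the paper's code), a one-dimensional dilated kernel of size 3 with dilation 2 covers 5 input samples:

```python
import numpy as np

def dilated_conv1d(x, w, dilation=2):
    """'Valid' 1-D correlation with a dilated kernel: the kernel taps are
    spaced `dilation` samples apart, so a size-3 kernel with dilation 2
    spans 5 input samples."""
    k = len(w)
    span = (k - 1) * dilation + 1
    out = np.empty(len(x) - span + 1)
    for i in range(len(out)):
        out[i] = sum(w[j] * x[i + j * dilation] for j in range(k))
    return out

x = np.arange(8, dtype=float)                      # [0, 1, ..., 7]
y = dilated_conv1d(x, np.array([1.0, 1.0, 1.0]))   # each output sums 3 taps, 2 apart
```

With a size-3 kernel of ones and dilation 2, each output is x[i] + x[i+2] + x[i+4], so the 8-sample input yields 4 outputs.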
3.3 Mask shrinking strategy
The mask shrinking strategy is used in each layer of the image completion generator to reconstruct the low-frequency information. It includes a special convolution (SConv) followed by a mask updating mechanism. SConv performs the convolution on the damaged image, and the mask updating mechanism updates the masked regions of the mask image. Different from the common convolution process in Figure 2(a), the SConv process in Figure 2(b) penalizes the masked area. This means that SConv repairs the missing regions using only the valid information in the damaged image. SConv can be expressed as follows:

x' = W^T (X ⊙ M) · sum(1)/sum(M) + b,  if sum(M) > 0;   x' = 0, otherwise,

where X denotes the input pixel values of the current layer that fall within the current convolution window, and M is the corresponding binary mask. In addition, ⊙ denotes element-wise multiplication, and W and b are the weights and bias of the convolution window, respectively. The term sum(1)/sum(M) rescales the output to compensate for the varying number of valid pixel values.
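The per-window SConv computation can be sketched in NumPy as follows (a minimal sketch; the sliding-window extraction and multi-channel bookkeeping of the full model are omitted, and all names are ours):

```python
import numpy as np

def sconv_window(X, M, W, b):
    """Special convolution (SConv) on one convolution window.

    X: input values in the window, M: binary mask (1 = valid),
    W: convolution weights, b: bias. Invalid pixels are discarded,
    and the result is rescaled by sum(1)/sum(M)."""
    valid = M.sum()
    if valid == 0:
        # No valid information in this window: the output stays masked.
        return 0.0
    scale = M.size / valid                 # sum(1) / sum(M)
    return float((W * (X * M)).sum() * scale + b)

# A 3x3 window in which only the left column holds valid data.
X = np.arange(9, dtype=float).reshape(3, 3)
M = np.zeros((3, 3)); M[:, 0] = 1.0
W = np.ones((3, 3)) / 9.0
y = sconv_window(X, M, W, b=0.0)
```

Only the three valid pixels contribute, and the 9/3 rescaling keeps the output magnitude comparable to a fully valid window.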
The mask updating mechanism shrinks the masked regions of the mask image, which tracks the repairing status of the damaged image in every layer of the completion generator and guides the repairing of the next layer. It can be considered another special convolution that avoids introducing noise, in which all values of the convolution filter are 1. A position is marked as valid after the mask updating mechanism when at least one valid pixel exists in the current convolution window. It is expressed as follows:

m' = 1, if sum(M) > 0;   m' = 0, otherwise,

where m' is the mask value after the mask updating mechanism: 1 denotes a valid pixel, and 0 denotes an invalid pixel.
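The shrinking behavior can be illustrated with a small NumPy sketch (our own illustration; kernel size and padding are assumptions):

```python
import numpy as np

def update_mask(M, k=3):
    """Mask updating mechanism: a position becomes valid (1) if its
    k x k window contains at least one valid pixel, so each update
    shrinks the masked region by roughly one ring of pixels."""
    H, W = M.shape
    pad = k // 2
    P = np.pad(M, pad)                     # zero padding: outside counts as invalid
    out = np.zeros_like(M)
    for i in range(H):
        for j in range(W):
            out[i, j] = 1.0 if P[i:i + k, j:j + k].sum() > 0 else 0.0
    return out

# A 6x6 mask with a 4x4 hole (0 = invalid): one update shrinks the hole.
M = np.ones((6, 6)); M[1:5, 1:5] = 0.0
M1 = update_mask(M)
```

After one update only the innermost 2x2 of the hole remains invalid; after a second update the mask is entirely valid, mirroring how the strategy tracks repair progress layer by layer.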
3.4 Image completion model
As shown in Figure 3, the generator includes encoders, residual blocks, and decoders. The encoders down-sample the original images four times with modules of the form convolution-BatchNorm-ReLU, followed by six residual blocks. The decoders then up-sample the feature maps back to the original image size with four modules of the form deconvolution-BatchNorm-LeakyReLU. In the residual blocks, spectral normalization is applied after the convolution in each layer. Each convolution in the generator is replaced by the mask shrinking strategy. Skip links are used to concatenate the feature maps and masks in the encoders with the feature maps and masks of the same size in the decoders, creating the feature-map and mask inputs for the next decoder layer. Similar to the edge generation model, the discriminator uses a PatchGAN [14, 15] to distinguish the generated images from the ground truth.
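The skip links amount to channel-wise concatenation of same-size encoder and decoder tensors, for the feature maps and the masks alike. A minimal shape sketch (our own naming, channel-first layout assumed):

```python
import numpy as np

def skip_concat(dec_feat, dec_mask, enc_feat, enc_mask):
    """Concatenate same-size encoder and decoder feature maps and masks
    along the channel axis (layout C, H, W) to form the input of the
    next decoder layer."""
    assert dec_feat.shape[1:] == enc_feat.shape[1:], "spatial sizes must match"
    feat = np.concatenate([dec_feat, enc_feat], axis=0)
    mask = np.concatenate([dec_mask, enc_mask], axis=0)
    return feat, mask

dec_feat = np.zeros((64, 32, 32)); dec_mask = np.ones((64, 32, 32))
enc_feat = np.zeros((64, 32, 32)); enc_mask = np.ones((64, 32, 32))
feat, mask = skip_concat(dec_feat, dec_mask, enc_feat, enc_mask)
```

The next decoder layer thus sees twice the channels, combining the decoder's reconstruction with the encoder's higher-resolution detail and its repair-status mask.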
In the image completion network, the inputs are the damaged image and the binary mask M (1 for the masked regions), conditioned on the complete edge information generated by the edge generation model. The output of the image completion network is a complete image with the same size as the input image.
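Under the convention that M = 1 marks the masked regions, a common way to form the damaged input is to zero out those pixels (a sketch under that assumption; variable names are ours):

```python
import numpy as np

# Ground-truth image (C, H, W) and binary mask (1 = masked / missing region).
I_gt = np.random.rand(3, 8, 8)
M = np.zeros((8, 8)); M[2:6, 2:6] = 1.0

# Damaged image: valid pixels are kept, masked pixels are zeroed out.
I_damaged = I_gt * (1.0 - M)
```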
The loss function consists of an ℓ1 loss, an adversarial loss, a perceptual loss, and a style loss. The adversarial loss is

L_adv = E[log D(I_gt)] + E[log(1 − D(I_pred))],

where I_gt denotes the ground-truth image, I_pred denotes the generated complete image, and D is the discriminator of the image completion model.
The perceptual loss constrains the similarity to the ground truth by penalizing the ℓ1 distance between activation maps. It is defined as

L_perc = E[ Σ_i (1/N_i) ||φ_i(I_gt) − φ_i(I_pred)||_1 ],

where φ_i is the activation map of the i-th layer of the network and N_i is the number of elements in φ_i.
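Given the activation maps, the perceptual loss reduces to element-averaged ℓ1 distances summed over layers. A minimal NumPy sketch (toy activations stand in for the features of a pretrained network):

```python
import numpy as np

def perceptual_loss(acts_gt, acts_pred):
    """L1 distance between corresponding activation maps, normalized by
    the number of elements per layer (1/N_i) and summed over layers."""
    loss = 0.0
    for a_gt, a_pred in zip(acts_gt, acts_pred):
        loss += np.abs(a_gt - a_pred).mean()   # (1/N_i) * ||.||_1
    return loss

# Toy activation maps for two layers.
acts_gt = [np.ones((4, 4)), np.zeros((2, 2))]
acts_pred = [np.zeros((4, 4)), np.zeros((2, 2))]
loss = perceptual_loss(acts_gt, acts_pred)
```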
Assuming that the size of the feature maps of layer j is C_j × H_j × W_j, the style loss is expressed as follows:

L_style = E_j[ ||G_j^φ(I_pred) − G_j^φ(I_gt)||_1 ],

where G_j^φ is a C_j × C_j Gram matrix constructed from the activation maps φ_j. In summary, the objective function is

L_total = λ_ℓ1 L_ℓ1 + λ_adv L_adv + λ_perc L_perc + λ_style L_style.
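The Gram-matrix construction and the resulting style loss can be sketched in NumPy as follows (our own sketch; the C_j·H_j·W_j normalization is assumed):

```python
import numpy as np

def gram(act):
    """Gram matrix of an activation map of shape (C, H, W),
    normalized by C * H * W."""
    C, H, W = act.shape
    F = act.reshape(C, H * W)
    return F @ F.T / (C * H * W)

def style_loss(acts_gt, acts_pred):
    """L1 distance between Gram matrices, averaged over layers."""
    diffs = [np.abs(gram(a) - gram(b)).sum()
             for a, b in zip(acts_gt, acts_pred)]
    return float(np.mean(diffs))

act = np.ones((2, 2, 2))   # F F^T / (C*H*W) = 4/8 = 0.5 in every entry
G = gram(act)
```

The paper combines this with the other terms using the weights of 1, 0.1, 0.1, and 250 for the ℓ1, adversarial, perceptual, and style losses, respectively.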
We set the parameters in the loss function as follows: λ_ℓ1 = 1, λ_adv = λ_perc = 0.1, and λ_style = 250.
Our datasets include an image dataset, a mask dataset, and an edge-map dataset. For the image dataset, we choose 5,000 images from Places2, of which 4,000 are used for training and 1,000 for testing. The mask dataset is from Liu et al.: we employ 20,000 irregular mask images, of which 19,000 are used in the training phase and the remaining 1,000 in the testing phase. The edge ground truth is generated by the Canny edge detector. We set the sensitivity of the Canny edge detector, which is controlled by the standard deviation of the Gaussian smoothing filter, to 2.
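The role of the σ = 2 smoothing can be illustrated with a simplified gradient-based edge detector (a stand-in for Canny using only Gaussian smoothing and Sobel gradients; the fixed threshold is our assumption, and the hysteresis step of real Canny is omitted):

```python
import numpy as np
from scipy import ndimage

def simple_edges(img, sigma=2.0):
    """Simplified edge map: Gaussian smoothing (std = sigma) followed by
    Sobel gradient magnitude and a fixed threshold. A larger sigma makes
    the detector less sensitive, as with the Canny sensitivity setting."""
    smooth = ndimage.gaussian_filter(img, sigma)
    gx = ndimage.sobel(smooth, axis=1)
    gy = ndimage.sobel(smooth, axis=0)
    mag = np.hypot(gx, gy)
    return mag >= 0.5 * mag.max()

# A vertical step edge: the detected edge pixels cluster around the step.
img = np.zeros((16, 16)); img[:, 8:] = 1.0
edges = simple_edges(img)
```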
The metrics include the peak signal-to-noise ratio (PSNR), the structural similarity index (SSIM), and the Frechet inception distance (FID) [FID]. The PSNR and SSIM are two indicators for evaluating the similarity of images and are widely used in the field of image processing. The FID is commonly used to evaluate the results of a generative adversarial network (GAN). It is expressed as

FID = ||μ_r − μ_g||² + Tr(Σ_r + Σ_g − 2 (Σ_r Σ_g)^{1/2}),

where (μ_r, Σ_r) and (μ_g, Σ_g) are the mean and covariance of the Inception features of the real and generated images, respectively.
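Given the feature statistics, the FID formula above is a few lines of NumPy/SciPy (a sketch; in practice μ and Σ come from Inception-v3 activations):

```python
import numpy as np
from scipy import linalg

def fid(mu_r, sigma_r, mu_g, sigma_g):
    """Frechet inception distance between the Gaussians fitted to the
    Inception features of real (r) and generated (g) images."""
    covmean = linalg.sqrtm(sigma_r @ sigma_g)
    if np.iscomplexobj(covmean):
        covmean = covmean.real   # discard tiny imaginary parts from sqrtm
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(sigma_r + sigma_g - 2.0 * covmean))

# Identical statistics give an FID of (numerically) zero.
mu = np.zeros(4); sigma = np.eye(4)
score = fid(mu, sigma, mu, sigma)
```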
A lower FID reflects a higher quality of the images generated by the GAN.
This section presents a qualitative and quantitative comparison of the proposed approach with current state-of-the-art methods. The results show that the proposed approach obtains the best effects both qualitatively and quantitatively. In the following, we distinguish structures by the convolution they use: the common convolution or the convolution with the mask shrinking strategy. A structure label lists, in order, the convolution type used for down-sampling, for the residual blocks, and for up-sampling.
5.1 Selection of best network structure
To determine the optimal network structure of the image completion model, we compare the performance of different structures. The results are listed in Table 1. The structure that obtains the best score on all metrics in Table 1 is used as the default structure of our model in the following experiments.
5.2 Comparison with baselines
Table 2 compares the results of the proposed model with those of the current state-of-the-art approaches. Our approach obtains the best results with a PSNR of 32.74, SSIM of 0.717, and FID of 8.12. Hence, our approach that combines the edge information with the mask shrinking strategy is more effective than other image inpainting approaches.
Figure 4 displays the inpainting results of the different approaches. It is clear that the images generated by our approach are closer to the ground truth than those generated by the other methods.
5.3 Effect of edge information
We evaluate the impact of the edge information; the results are listed in Table 3, which compares the results obtained with and without edge information. The results with edge information are more competitive than those without it, indicating that incorporating the edge information improves the quality of the repaired image. Hence, the edge information has a strong impact on inpainting quality.
5.4 Effect of mask shrinking strategy
To verify the validity of the mask shrinking strategy, we compare the performance of three different structures. As shown in Table 4, the structure that applies the mask shrinking strategy obtains the best results.
5.5 Effect of skip links
Skip links connect the down-sampling and up-sampling paths in our network; Table 5 evaluates their effect by comparing the network with and without skip links. The results show that the overall repair performance is improved by using skip links.
This paper proposes an image inpainting approach based on edge information and mask shrinking. It consists of two GAN-based models: an edge generation model and an image completion model. The edge generation model repairs the edge information of the damaged image. The edge information is then used in the image completion model, which repairs the damaged image into a complete image. In the image completion model, we introduce a mask shrinking strategy that repairs images with a special convolution (SConv) and tracks the repairing process with a mask updating mechanism. Experiments on the public dataset Places2 and a comparison with current state-of-the-art methods demonstrate that the proposed approach achieves the best performance.
This work was funded by the National Natural Science Foundation of China (Grant Nos. 61762069 and 61773224), the Natural Science Foundation of the Inner Mongolia Autonomous Region (Grant Nos. 2019ZD14 and 2017BS0601), and the Science and Technology Program of the Inner Mongolia Autonomous Region (2019).