Image inpainting refers to generating alternative structures and textures for the missing regions of corrupted input images so that the results are visually realistic. It has a wide range of applications. For example, users can remove unwanted objects or edit the contents of images by using inpainting techniques. A major challenge of image inpainting tasks is to generate correct structures and realistic textures.
Some early patch-based works attempt to fill missing holes with image patches from existing regions [barnes2009patchmatch, hays2007scene]. By searching for nearest neighbors and copying relevant patches, these methods can synthesize vivid textures for background inpainting tasks. However, since these methods cannot capture high-level semantics, it is hard for them to generate realistic structures for images with non-repetitive patterns (e.g., faces).
With the advent of deep neural network techniques, some recent works [pathakCVPR16context, IizukaSIGGRAPH2017, yu2018free, yu2018generative, liu2018image] model the inpainting task as a conditional generation problem and learn mapping functions between the corrupted input images and the ground truth images. These methods are able to learn meaningful semantics, so they can generate coherent structures for missing holes. However, since these methods do not effectively separate the structure and texture information, they often suffer from either over-smoothed boundaries or texture artifacts.
To solve this problem, some two-stage networks [yu2018generative, song2018spg, nazeri2019edgeconnect] have been proposed. These methods recover missing structures in the first stage and generate the final results using the reconstructed information in the second stage. The method proposed in [yu2018generative] uses ground truth images as the labels of structure recovery. However, ground truth images contain high-frequency textures, and these irrelevant details may mislead the structure reconstruction.
SPG-Net [song2018spg] predicts the semantic segmentation labels of the missing areas as structural information. However, regions with similar semantic labels may have different textures (e.g., the windows and walls of the same building), which creates difficulties for the final recovery. Using edge images as the structural guidance, EdgeConnect [nazeri2019edgeconnect] achieves good results even for some highly structured scenes. However, the distribution of edge images differs greatly from the distribution of the target images. In other words, the edge extractor discards too much useful information, such as image color, making it difficult to generate vivid textures.
In this paper, we propose a novel two-stage network, StructureFlow, for image inpainting. Our network consists of a structure reconstructor and a texture generator. To recover meaningful structures, we employ edge-preserved smooth images to represent the global structures of image scenes. Edge-preserved smooth methods [l0smoothing2011, xu2012structure] aim to remove high-frequency textures while retaining sharp edges and low-frequency structures. By using these images as the guidance of the structure reconstructor, the network is able to focus on recovering global structures without being disturbed by irrelevant texture information. After reconstructing the missing structures, the texture generator is used to synthesize high-frequency details. Since image neighborhoods with similar structures are highly correlated, the uncorrupted regions can be used to generate textures for missing regions. However, it is hard for convolutional neural networks to model long-term correlations [yu2018generative]. In order to establish a clear relationship between different regions, we propose to use appearance flow [zhou2016view] to sample features from regions with similar structures, as shown in Figure 1. Since appearance flow easily gets stuck in bad local minima in the inpainting task [yu2018generative], we make two modifications in this work to ensure the convergence of the training process. First, Gaussian sampling is employed instead of Bilinear sampling to expand the receptive field of the sampling operation. Second, we introduce a new loss function, called the sampling correctness loss, to determine whether the correct regions are sampled.
Both subjective and objective experiments compared with several state-of-the-art methods show that our method can achieve competitive results. Furthermore, we perform ablation studies to verify our hypothesis and modifications. The main contributions of our paper can be summarized as:
We propose a structure reconstructor to generate edge-preserved smooth images as the global structure information.
We introduce appearance flow to establish long-term correlations between missing regions and existing regions for vivid texture generation.
To ease the optimization of appearance flow, we propose to use Gaussian sampling instead of Bilinear sampling and introduce a novel sampling correctness loss.
Experiments on multiple public datasets show that our method is able to achieve competitive results.
2 Related Work
2.1 Image Inpainting
Existing image inpainting works can be roughly divided into two categories: methods using diffusion-based or patch-based techniques and methods using deep neural networks.
Diffusion-based methods [bertalmio2000image, efros2001image] synthesize textures by propagating the appearance of neighboring regions into the target holes. However, these methods can only deal with small holes in background inpainting tasks, and they may fail to generate meaningful structures. Unlike the diffusion-based methods, which use only the neighboring pixels of missing holes, patch-based methods can take advantage of remote information to recover the lost areas. Patch-based methods [barnes2009patchmatch, hays2007scene, darabi2012image] fill target regions by searching and copying similar image patches from the uncorrupted regions of the source images. These methods can generate photo-realistic textures for relatively large missing holes. In order to find suitable image patches, bidirectional similarity [bidirectional] is proposed to capture more visual information and introduce fewer visual artifacts when calculating the patch similarity. To reduce the computational cost, PatchMatch [barnes2009patchmatch] designs a fast nearest-neighbor searching algorithm that uses natural coherence in the imagery as prior information. However, these patch-based methods assume that the non-hole regions have semantic contents similar to those of the missing regions, which may not be true in some tasks such as face image inpainting. Therefore, they may work well on some images with repetitive structures but cannot generate reasonable results for images with unique structures.
Recently, many deep learning based methods have been proposed to model the inpainting task as a conditional generation problem. A significant advantage of these methods is that they are able to extract meaningful semantics from the corrupted images and generate new content for images. Context Encoder [pathakCVPR16context] is one of the early attempts. It uses an encoder-decoder architecture to first extract features and then reconstruct the outputs. However, this network struggles to maintain global consistency and often generates results with visual artifacts. Iizuka et al. [IizukaSIGGRAPH2017] solve this problem by using both local and global discriminators, which are responsible for generating realistic alternative contents for missing holes and maintaining the coherency of the completed images, respectively. Yu et al. [yu2018generative] find that convolutional neural networks are ineffective in building long-term correlations. To solve this problem, they propose contextual attention to borrow features from remote regions. Liu et al. [liu2018image] believe that the substitute pixels in the masked holes of the inputs introduce artifacts into the final results. Therefore, they propose partial convolutions to force the network to use valid (uncorrupted) pixels only. Gated convolution [yu2018free] further generalizes this idea by making the feature selection mechanism learnable for each location across all layers. EdgeConnect [nazeri2019edgeconnect] has a similar motivation to our paper: generating reasonable structures by using additional prior information. EdgeConnect first recovers edge maps and then fills in the fine details of the missing regions. However, due to the limited representation ability of edge maps, this method may generate wrong details at the boundaries of objects.
2.2 Optical Flow and Appearance Flow
Optical flow is used to describe the motion of objects, surfaces, and edges between consecutive video frames. It has been widely used in video frame synthesis [liu2017voxelflow, wang2018vid2vid] and action recognition [simonyan2014two, wang2016temporal]. Many methods [horn1981determining, sun2008learning] have been proposed to accurately estimate optical flow between consecutive frames. Recently, some methods [dosovitskiy2015flownet, ilg2017flownet] solve this problem by training deep neural networks. However, these techniques require sufficient ground truth optical flow fields, which are extremely difficult to obtain. Therefore, some synthetic optical flow datasets [dosovitskiy2015flownet] are created for training. Some other methods [liu2017video, meister2018unflow] solve this problem by training the network in an unsupervised manner. However, many existing unsupervised optical flow estimation methods struggle to capture large motions, and some papers [liu2017video, ranjan2017optical] use multi-scale approaches to improve the results. We believe this difficulty is due to the limited receptive field of Bilinear sampling. In this paper, we use Gaussian sampling as an improvement.
Appearance flow, proposed by [zhou2016view], is used to generate target scenes (objects) from source scenes (objects) using a flow-based method. It calculates the correlations between sources and targets to predict 2-D coordinate vectors (appearance flow fields). This idea can be used in image inpainting tasks. To generate realistic alternative contents for missing holes, one can reasonably "flow" pixels (features) from source regions to missing regions. In this paper, we improve the appearance flow in [zhou2016view] to make it suitable for image inpainting tasks.
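To make the idea of "flowing" pixels concrete, the following NumPy sketch warps a source feature map with a dense flow field using plain Bilinear sampling. It is only an illustration under our own assumptions: the function name `bilinear_warp`, the (dx, dy) offset convention, and the border clamping are choices made for this example and are not taken from [zhou2016view].

```python
import numpy as np


def bilinear_warp(source, flow):
    """Warp `source` (H, W, C) with a per-pixel flow field (H, W, 2).

    flow[y, x] = (dx, dy) is the offset from target pixel (x, y) to the
    source location it copies from. Illustrative helper only.
    """
    h, w, _ = source.shape
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    # Absolute sampling coordinates, clamped to the image border.
    sx = np.clip(xs + flow[..., 0], 0, w - 1)
    sy = np.clip(ys + flow[..., 1], 0, h - 1)
    # The four integer neighbours of every sampling location.
    x0 = np.floor(sx).astype(int)
    y0 = np.floor(sy).astype(int)
    x1 = np.minimum(x0 + 1, w - 1)
    y1 = np.minimum(y0 + 1, h - 1)
    # Fractional parts act as interpolation weights.
    wx = (sx - x0)[..., None]
    wy = (sy - y0)[..., None]
    top = (1 - wx) * source[y0, x0] + wx * source[y0, x1]
    bottom = (1 - wx) * source[y1, x0] + wx * source[y1, x1]
    return (1 - wy) * top + wy * bottom
```

Each output value depends on only four source pixels, which is the limited receptive field that motivates replacing Bilinear sampling with a wider kernel.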
3 Our Approach
The framework of our StructureFlow inpainting network is shown in Figure 2. Our model consists of two parts: the structure reconstructor $G_s$ and the texture generator $G_t$. The structure reconstructor $G_s$ is used to predict missing structures, thereby generating the global structure image $\hat{S}$. The texture generator $G_t$ draws details according to the reconstructed structures $\hat{S}$ and outputs the final results $\hat{I}$.
3.1 Structure Reconstructor
A major challenge of image inpainting tasks is to generate meaningful structures for missing regions. Therefore, we first design a structure reconstructor $G_s$ to recover global structures of the input images. The edge-preserved smooth methods [l0smoothing2011, xu2012structure] aim to remove high-frequency textures while retaining the sharp edges and low-frequency structures. Their results can well represent global structures. Let $I_{gt}$ be the ground-truth image and $S_{gt}$ be the edge-preserved smooth result of $I_{gt}$. The processing of our structure reconstructor $G_s$ can be written as
$$\hat{S} = G_s(I_{in}, S_{in}, M)$$
where $M$ is the mask of the input image $I_{in} = I_{gt} \odot (1 - M)$. It is a binary matrix where 1 represents the missing region and 0 represents the background. $S_{in} = S_{gt} \odot (1 - M)$ is the structures of $I_{in}$. Here, $\odot$ denotes the element-wise product. $\hat{S}$ is the predicted structures. The reconstruction loss of $G_s$ is defined as the $\ell_1$ distance between the predicted structures $\hat{S}$ and the ground-truth structures $S_{gt}$.
$$\mathcal{L}^{s}_{\ell_1} = \left\| \hat{S} - S_{gt} \right\|_1$$
Meanwhile, to mimic the distributions of the target structures $S_{gt}$, we apply the generative adversarial framework [goodfellow2014generative] to our structure reconstructor. The adversarial loss of $G_s$ can be written as
$$\mathcal{L}^{s}_{adv} = \mathbb{E}\left[\log\left(1 - D_s\left(G_s(I_{in}, S_{in}, M)\right)\right)\right] + \mathbb{E}\left[\log D_s(S_{gt})\right]$$
where $D_s$ is the discriminator of the structure reconstructor. We jointly train the generator $G_s$ and discriminator $D_s$ using the following optimization.
$$\min_{G_s} \max_{D_s} \mathcal{L}^{s}(G_s, D_s) = \lambda^{s}_{\ell_1} \mathcal{L}^{s}_{\ell_1} + \lambda^{s}_{adv} \mathcal{L}^{s}_{adv}$$
where $\lambda^{s}_{\ell_1}$ and $\lambda^{s}_{adv}$ are regularization parameters, which are kept fixed in all experiments.
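As a rough illustration of the structure-reconstruction objective described above, the NumPy sketch below masks an image and its smooth structure map with a binary mask (1 = missing) and combines a reconstruction term with an externally supplied adversarial term. The function names and the weights `w_l1` and `w_adv` are placeholders chosen for this sketch, not the values used in our experiments, and the mean absolute error is used as a size-normalized stand-in for the $\ell_1$ distance.

```python
import numpy as np


def mask_inputs(image, structure, mask):
    """Zero out the hole (mask == 1) in an image and its smooth structure.

    image, structure: (H, W, C) arrays; mask: (H, W) array of 0s and 1s.
    """
    keep = 1.0 - mask[..., None]  # 1 on known pixels, 0 inside the hole
    return image * keep, structure * keep


def l1_loss(pred, target):
    """Mean absolute error, i.e. the l1 distance divided by the element count."""
    return float(np.mean(np.abs(pred - target)))


def structure_objective(pred, target, adv_term, w_l1, w_adv):
    """Weighted sum of the reconstruction term and an adversarial term.

    `adv_term` is assumed to be computed elsewhere by a discriminator;
    `w_l1` and `w_adv` are hypothetical weights for illustration.
    """
    return w_l1 * l1_loss(pred, target) + w_adv * adv_term
```

In a training loop, `pred` would be the output of the structure reconstructor on the masked inputs, and `target` the edge-preserved smooth version of the ground truth.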
3.2 Texture Generator
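After obtaining the reconstructed structure image $\hat{S}$, our texture generator $G_t$ is employed to yield vivid textures. The processing of the texture generator $G_t$ can be written as
$$\hat{I} = G_t(I_{in}, \hat{S}, M)$$
where $I_{in}$ is the corrupted input image, $M$ is its mask, and $\hat{I}$ denotes the final output result. We use the $\ell_1$ loss between $\hat{I}$ and the ground-truth image $I_{gt}$ to calculate the reconstruction error.
$$\mathcal{L}^{t}_{\ell_1} = \left\| \hat{I} - I_{gt} \right\|_1$$
To generate realistic results, we employ an adversarial loss in our texture generator, with $D_t$ denoting its discriminator.
$$\mathcal{L}^{t}_{adv} = \mathbb{E}\left[\log\left(1 - D_t\left(G_t(I_{in}, \hat{S}, M)\right)\right)\right] + \mathbb{E}\left[\log D_t(I_{gt})\right]$$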
Since image regions with similar structures are highly related, it is possible to extract these correlations using the reconstructed structures $\hat{S}$ and exploit them for texture generation. However, convolutional neural networks are not effective at capturing long-term dependencies [yu2018generative]. In order to establish a clear relationship between different regions, we introduce the appearance flow to our texture generator $G_t$. As shown in Figure 2, the appearance flow is used to warp the extracted features of the inputs. Thus, features containing vivid texture information can "flow" to the corrupted regions. However, training the appearance flow in an unsupervised manner is a difficult task [liu2017video, ranjan2017optical]. The networks may struggle to capture large motions and get stuck in bad local minima. To tackle this problem, we first propose to use Gaussian sampling instead of Bilinear sampling to expand the receptive field. Then, we propose a sampling correctness loss to constrain the possible convergence results.
The sampling process calculates the gradients according to the input pixels (features). If the receptive field of the sampling operation is limited, only a few pixels can participate in the operation. Since the adjacent pixels (features) are often highly correlated, a large receptive field is required to obtain correct and stable gradients. Therefore, Bilinear sampling with a very limited receptive field may not be suitable for tasks that require establishing long-term correlations. To expand the receptive field, we use Gaussian sampling instead of Bilinear sampling in the appearance flow operation. The process of the Gaussian sampling operation with kernel size $n$ can be written as
$$F_o = \sum_{i=1}^{n} \sum_{j=1}^{n} \frac{a_{i,j}}{\sum_{i=1}^{n} \sum_{j=1}^{n} a_{i,j}} F_{i,j}$$
where $F_{i,j}$ is the features around the sample center and $F_o$ is the output feature. The weights $a_{i,j}$ are calculated as
$$a_{i,j} = \exp\left(-\frac{\Delta h^2 + \Delta v^2}{2\sigma^2}\right)$$
where $\Delta h$ and $\Delta v$ are the horizontal and vertical distances between the sampling center and feature $F_{i,j}$, respectively. Parameter $\sigma^2$ denotes the variance of the Gaussian sampling kernel.
The proposed sampling correctness loss is used to constrain the appearance flow fields. It determines whether the current sampled regions are "good" choices. We use the pre-trained VGG19 to calculate this loss. Specifically, we first calculate the VGG features of the input corrupted image $I_{in}$ and the ground truth image $I_{gt}$. Let $V^{in}$ and $V^{gt}$ be the features generated by a specific layer of VGG19. Symbol $\Omega$ denotes a coordinate set containing the coordinates of missing areas, and $N$ is the number of elements in set $\Omega$. Then, our sampling correctness loss calculates the relative cosine similarity between the ground truth features and the sampled features
$$\mathcal{L}^{t}_{c} = \frac{1}{N} \sum_{(x,y) \in \Omega} \exp\left(-\frac{\mu\left(V^{s}_{x,y}, V^{gt}_{x,y}\right)}{\mu^{max}_{x,y}}\right)$$
where $V^{s}$ is the sampled feature calculated by our Gaussian sampling and $\mu(\cdot)$ denotes the cosine similarity. $\mu^{max}_{x,y}$ is a normalization term. For each feature $V^{gt}_{x,y}$ where $(x,y) \in \Omega$, we find the most similar feature from $V^{in}$ and calculate their cosine similarity as $\mu^{max}_{x,y}$.
$$\mu^{max}_{x,y} = \max_{(x',y') \in \Psi} \mu\left(V^{in}_{x',y'}, V^{gt}_{x,y}\right)$$
where $\Psi$ denotes a coordinate set containing all coordinates in $V^{in}$. Our texture generator is trained using the following optimization
$$\min_{G_t} \max_{D_t} \mathcal{L}^{t}(G_t, D_t) = \lambda^{t}_{\ell_1} \mathcal{L}^{t}_{\ell_1} + \lambda^{t}_{c} \mathcal{L}^{t}_{c} + \lambda^{t}_{adv} \mathcal{L}^{t}_{adv}$$
where $\lambda^{t}_{\ell_1}$, $\lambda^{t}_{c}$, and $\lambda^{t}_{adv}$ are the hyperparameters, which are kept fixed in our experiments.
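The two ingredients of this section, Gaussian sampling and the sampling correctness loss, can be sketched in NumPy as follows. This is a minimal, non-differentiable illustration under our own assumptions, not the released implementation: the function names, the default window size `n` and spread `sigma`, and the explicit loops are chosen for readability. The features are assumed to be non-negative (for example, post-ReLU VGG activations), so that the normalization term is positive, and the sampling center is assumed to lie inside the feature map.

```python
import numpy as np


def gaussian_sample(features, cx, cy, n=3, sigma=1.0):
    """Sample one feature vector at the fractional location (cx, cy).

    features: (H, W, C). An n x n window around the nearest grid point is
    blended with normalised weights exp(-(dh^2 + dv^2) / (2 sigma^2)).
    """
    h, w, _ = features.shape
    x0, y0 = int(round(cx)), int(round(cy))
    half = n // 2
    acc = np.zeros(features.shape[-1])
    total = 0.0
    for j in range(y0 - half, y0 - half + n):
        for i in range(x0 - half, x0 - half + n):
            if 0 <= i < w and 0 <= j < h:  # skip positions outside the map
                a = np.exp(-((i - cx) ** 2 + (j - cy) ** 2) / (2 * sigma ** 2))
                acc += a * features[j, i]
                total += a
    return acc / total


def cosine(a, b, eps=1e-8):
    """Cosine similarity between two feature vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + eps))


def sampling_correctness_loss(v_sampled, v_gt, v_in, hole_coords, eps=1e-8):
    """Mean of exp(-mu(sampled, gt) / mu_max) over the hole coordinates.

    v_sampled, v_gt, v_in: (H, W, C) feature maps; hole_coords: iterable of
    (x, y) positions inside the missing region.
    """
    c = v_in.shape[-1]
    flat_in = v_in.reshape(-1, c)
    in_unit = flat_in / (np.linalg.norm(flat_in, axis=1, keepdims=True) + eps)
    losses = []
    for x, y in hole_coords:
        gt = v_gt[y, x]
        gt_unit = gt / (np.linalg.norm(gt) + eps)
        mu_max = float(np.max(in_unit @ gt_unit))  # best match anywhere in v_in
        mu = cosine(v_sampled[y, x], gt, eps)
        losses.append(np.exp(-mu / (mu_max + eps)))
    return float(np.mean(losses))
```

With this formulation, a sampled feature that matches the ground truth as well as the best available input feature yields a per-location loss of about exp(-1), while an orthogonal sample yields 1. In a trainable model the sampling centers would come from the predicted flow field and the operation would be vectorized and differentiable.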
4.1 Implementation Details
Basically, autoencoder structures are employed to design our generators $G_s$ and $G_t$. Several residual blocks [he2016deep] are added to further process the features. For the appearance flow, we concatenate the warped features with the features obtained by convolutional blocks. The architecture of our discriminators is similar to that of BicycleGAN [zhu2017toward]. We use two PatchGANs [isola2017image] at different scales to predict real versus fake for overlapping image patches of different sizes. To alleviate the notorious training instability of generative adversarial networks, spectral normalization [miyato2018spectral] is used in our network.
We train our model on three public datasets including Places2 [zhou2018places], Celeba [liu2015deep], and Paris StreetView [doersch2012makes]. The most challenging dataset, Places2, contains more than 10 million images comprising unique scene categories. Celeba and Paris StreetView contain highly structured face and building images, respectively. We use the irregular mask dataset provided by [liu2018image]. The mask images are classified based on their hole sizes relative to the entire image.
We employ the edge-preserved smooth method RTV [xu2012structure] to obtain the training labels of the structure reconstructor $G_s$. In the RTV smooth method, parameter $\sigma$ is used to control the spatial scale of smooth windows, thereby controlling the maximum size of texture elements. In Section 4.3, we explore the impact of $\sigma$ on the final results and use the value that empirically gives the best results.
We train our model in stages. First, the structure reconstructor $G_s$ and the texture generator $G_t$ are trained separately using the edge-preserved image $S_{gt}$. Then, we continue to fine-tune $G_t$ using the reconstructed structures $\hat{S}$. The network is trained with a batch size of 12 using the Adam optimizer [kingma2014adam].
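For illustration, grouping masks by relative hole size can be sketched as below. The helper names and the default bucket edges are hypothetical placeholders for this example and are not taken from the mask dataset of [liu2018image].

```python
import numpy as np


def hole_ratio(mask):
    """Fraction of pixels marked as missing (mask == 1)."""
    return float(np.mean(mask > 0.5))


def ratio_bucket(mask, edges=(0.2, 0.4)):
    """Index of the hole-ratio bucket; `edges` are illustrative placeholders."""
    return int(np.searchsorted(edges, hole_ratio(mask), side="right"))
```

A mask whose hole covers 30% of the image falls into bucket 1 with the placeholder edges above.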
4.2 Comparisons
We subjectively and objectively compare our approach with several state-of-the-art methods including Contextual Attention (CA) [yu2018generative], Partial Convolution (PConv) [liu2018image] and EdgeConnect [nazeri2019edgeconnect].
Objective comparisons. Image inpainting tasks lack specialized quantitative evaluation metrics. In order to compare the results as accurately as possible, we employ two types of metrics: distortion measurement metrics and perceptual quality measurement metrics. Structural similarity index (SSIM) and peak signal-to-noise ratio (PSNR) assume that the ideal recovered results are exactly the same as the target images. They are used to measure the distortions of the results. Fréchet Inception Distance (FID) [heusel2017gans] calculates the Wasserstein-2 distance between the feature distributions of real and generated images, each modeled as a Gaussian. Therefore, it can indicate the perceptual quality of the results. In this paper, we use the pre-trained Inception-V3 model to extract features of real and inpainted images when calculating FID scores. The final evaluation results over Places2 are reported in Table 1. We calculate the statistics over random images in the test set. It can be seen that our model achieves competitive results compared with other models.
Subjective comparisons. We conduct a human subjective study on Amazon Mechanical Turk (MTurk). We ask volunteers to choose the more realistic image from image pairs of real and generated images. For each dataset, we randomly select images and assign them random mask ratios for the evaluation. Each image is compared several times by different volunteers. The evaluation results are shown in Table 2. Our model achieves better results than the competitors in the highly structured scenes, such as the face dataset Celeba and the street view dataset Paris. This indicates that our model can generate meaningful structures for missing regions. We also achieve competitive results on the Places2 dataset.
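As a hedged sketch of two of the metrics above, the following NumPy code computes PSNR and the Fréchet distance between Gaussians fitted to two sets of feature vectors. It is not our evaluation script: the function names are illustrative, the features are assumed to be precomputed (for FID they would come from Inception-V3) with at least two dimensions, and the trace of the matrix square root is obtained from eigenvalues rather than from a dedicated matrix square root routine.

```python
import numpy as np


def psnr(pred, target, max_val=255.0):
    """Peak signal-to-noise ratio in dB for images with peak value `max_val`."""
    diff = pred.astype(np.float64) - target.astype(np.float64)
    mse = np.mean(diff ** 2)
    if mse == 0:
        return float("inf")
    return float(10.0 * np.log10(max_val ** 2 / mse))


def frechet_distance(feats_a, feats_b):
    """Frechet distance between Gaussians fitted to two feature sets (N, D)."""
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    # Tr(sqrt(cov_a cov_b)) equals the sum of the square roots of the
    # eigenvalues of cov_a @ cov_b, which are real and non-negative for
    # positive semi-definite covariances (up to numerical noise).
    eigvals = np.linalg.eigvals(cov_a @ cov_b)
    tr_sqrt = np.sum(np.sqrt(np.clip(eigvals.real, 0.0, None)))
    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a) + np.trace(cov_b) - 2.0 * tr_sqrt)
```

Higher PSNR indicates lower distortion, whereas a lower Fréchet distance indicates that the generated feature statistics are closer to those of real images.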
Figure 3 shows some example results of different models. It can be seen that the results of CA suffer from artifacts, which means that this method may struggle to balance the generation of textures and structures. EdgeConnect is able to recover correct global structures. However, it may generate wrong details at the edges of objects. Our method can generate meaningful structures as well as vivid textures. We also provide the reconstructed structures of EdgeConnect and our model in Figure 4. We find that the edge maps lose too much useful information, such as image color, when recovering the global structures. Therefore, EdgeConnect may fill in incorrect details for some missing areas. Meanwhile, edges of different objects may be mixed together in edge maps, which makes it difficult to generate textures. In contrast, our edge-preserved smooth images can well represent the structures of images. Therefore, our model can well balance structure reconstruction and texture generation. Photo-realistic results are obtained even for some highly structured images with large hole ratios.
4.3 Ablation Studies
In this section, we analyze how each component of our StructureFlow contributes to the final performance from two perspectives: structures and appearance flow.
Structure Ablation. In this paper, we assume that the structure information is important for image inpainting tasks. Therefore, we first reconstruct structures and use them as prior information to generate the final results. To verify this assumption, we remove our structure reconstructor and train an inpainting model with only the texture generator. The corrupted images along with their masks are fed directly into the model. Please note that we also keep the appearance flow in the network for fair comparisons. The results are shown in Table 3. It can be seen that our structure reconstructor brings a stable performance gain to the model.
Then we turn our attention to the edge-preserved smooth images. We believe the edge-preserved smooth images are able to represent the structures since the smooth operations remove high-frequency textures. To verify this, we train a model using ground truth images as the labels of the structure reconstructor. The results can be found in Table 3. Compared with StructureFlow, using images containing high-frequency textures as structures leads to performance degradation.
However, it is difficult to accurately distinguish the textures and the structures of an image. What is the appropriate degree of smoothing? We find there exists a trade-off between the structure reconstructor and the texture generator. If very few textures are removed, the structure reconstruction will be more difficult, since it needs to recover more information, but the texture generation will be easier. Conversely, if too many textures are removed, the structure reconstruction becomes easier while the texture generation becomes harder. Therefore, we need to balance the difficulties of these two tasks to achieve better results. We use the parameter $\sigma$ of the RTV [xu2012structure] smooth method to control the maximum size of texture elements in $S_{gt}$. Smoother results are obtained with larger $\sigma$ values. We train our StructureFlow using smooth images generated with several values of $\sigma$. The evaluation results over dataset Paris are shown in Table 4. It can be seen that the best results are obtained at an intermediate value: both too small and too large values lead to model performance degradation. An example can be found in Figure 5. When $\sigma$ is too small, the structure reconstructor fails to generate reasonable structures, as it is disturbed by irrelevant texture information. The texture generator fails to yield realistic images when trained with too large a $\sigma$, since some useful structural information is removed.
Flow Ablation. In this ablation study, we first evaluate the performance gain brought by our appearance flow. Then, we illustrate the effectiveness of Gaussian sampling and the sampling correctness loss. To verify the validity of our appearance flow, we train a model without using the appearance flow blocks in the texture generator. The evaluation results can be found in Table 3. It can be seen that our StructureFlow has better performance than the model trained without the appearance flow operation, which means that our appearance flow can help with the texture generation and improve model performance. Next, we test our Gaussian sampling and the sampling correctness loss. Two models are trained for this ablation study: a model trained using Bilinear sampling in the warp operation of appearance flow and a model trained without using the sampling correctness loss. Figure 6 shows the appearance flow fields obtained by these models. It can be seen that the model trained without using the sampling correctness loss is unable to sample correct features for large missing regions. Bilinear sampling also fails to capture long-term correlations. Our StructureFlow obtains a reasonable flow field and generates realistic textures for missing regions.
4.4 Use Cases
Our method can be used for some image editing applications. Figure 7 provides some usage examples. Users can remove the unwanted objects by interactively drawing masks in the inputs. Our model is able to generate realistic alternative contents for the missing regions. In addition, by directly editing the structure images, users can copy or add new objects and contents to images.
In this paper, we propose an effective structure-aware framework for recovering corrupted images with meaningful structures and vivid textures. Our method divides the inpainting task into two subtasks: structure reconstruction and texture generation. We demonstrate that edge-preserved smooth images can well represent the global structure information and play an important role in inpainting tasks. As for texture generation, we use appearance flow to sample features from related regions. We verify that our flow operation brings a stable performance gain to the final results. Our method obtains competitive results compared with several state-of-the-art methods. Our source code is available at: https://github.com/RenYurui/StructureFlow.
Acknowledgements. This work was supported by National Engineering Laboratory for Video Technology-Shenzhen Division, Shenzhen Municipal Science and Technology Program (JCYJ20170818141146428), and Shenzhen Key Laboratory for Intelligent Multimedia and Virtual Reality (ZDSYS201703031405467). In addition, we thank the anonymous reviewers for their valuable comments.