RGB-D Image Inpainting Using Generative Adversarial Network with a Late Fusion Approach

10/14/2021 ∙ by Ryo Fujii, et al. ∙ Keio University

Diminished reality is a technology that aims to remove objects from video images and fill in the missing region with plausible pixels. Most conventional methods utilize multiple cameras that capture the same scene from different viewpoints to allow regions to be removed and restored. In this paper, we propose an RGB-D image inpainting method using a generative adversarial network, which does not require multiple cameras. Recently, RGB image inpainting methods have achieved outstanding results by employing generative adversarial networks. However, RGB inpainting methods aim to restore only the texture of the missing region and, therefore, do not recover geometric information (i.e., the 3D structure of the scene). We extend conventional image inpainting to RGB-D image inpainting to jointly restore the texture and geometry of missing regions from a pair of RGB and depth images. Inspired by other tasks that use RGB and depth images (e.g., semantic segmentation and object detection), we propose a late fusion approach that exploits the complementary advantages of RGB and depth information. The experimental results verify the effectiveness of our proposed method.




1 Introduction

Diminished Reality (DR), which allows removing objects from images and filling in the missing regions with plausible textures, plays an important role in many mixed and augmented reality applications. Previous methods for DR can be divided into two groups. The first group takes advantage of multi-view observations to obtain actual background pixels. Pre-observation [4], active self-observation [7], and real-time observation with multiple cameras [16] are representative of this group of methods.

They can provide accurate restoration since multi-view-based methods utilize pixels directly observed from other views. Mori et al. [13] presented 3D PixMix, which addresses non-planar scenes by using both color and depth information in the inpainting process. However, if the background of the target object is also occluded in the other views, a multi-view-based method does not work. In the second group, inpainting-based methods can handle this problem, because they use pixels in the image itself to replace the pixels that have been removed. Therefore, they require neither multiple cameras nor pre-recorded observations.

Image inpainting is the task of synthesizing alternative content in missing regions. It can be used for many applications, such as photo editing, image-based rendering, and computational photography. Inpainting of an RGB image can restore the texture of a missing region, but it cannot restore its geometric structure. In this paper, we aim to restore both the texture and the geometry of the missing region.

Traditional image inpainting works make use of low-level features from the surrounding image regions. They work well on background inpainting tasks. However, they are unable to create novel image content not found in the source image. They often fail to restore complex missing regions and objects with non-repetitive structures (e.g., faces). Moreover, they cannot consider high-level semantics.

Thanks to the rapid development of deep convolutional neural networks (CNNs) and generative adversarial networks (GANs) [8], convolutional encoder-decoder architectures jointly trained with adversarial networks have been used for inpainting tasks. Iizuka et al. [10] improved the consistency of image inpainting results by introducing a global and local discriminator. In this paper, we employ this global and local discriminator.

While GAN-based methods produce promising, visually realistic images, it is quite difficult to stabilize GAN training. Many methods have been proposed on this subject, but stable training of GANs remains an open problem. Arjovsky et al. [1] proposed WGAN to handle this problem and prevent mode collapse. Gulrajani et al. [9] improved WGAN by proposing a gradient penalty as an alternative to weight clipping, seeking more stable training (WGAN-GP). WGAN-GP has been used for image generation tasks. On this type of task, WGAN-GP has been reported to outperform existing GAN losses and to work well when combined with a reconstruction loss. Yu et al. [18] proposed utilizing the WGAN-GP loss for the outputs of both the global and local discriminators. We employ WGAN-GP to make training stable.

One simple solution that restores both the texture and geometry of missing regions is to train two networks independently: one restores textures from an RGB image and the other restores geometry from a depth image. We call this the no fusion approach. Another possible solution is to train one network whose input is a four-channel RGB-D image. We call this the early fusion approach. Inspired by a recent object recognition method [17], we aim to construct an inpainting network that exploits the complementary relationship between RGB and depth information for RGB-D inpainting. Wang et al. [17] showed that the late fusion approach, which combines features extracted separately from RGB and depth images, improves the classification accuracy of objects in the images. Therefore, we also employ the late fusion approach using RGB and depth information.

We propose a method to jointly restore texture and geometry information in an image. Our method is based on a GAN that takes a pair of RGB and depth images as input. We employ the late fusion approach to fuse RGB and depth information. The experimental results show that our method successfully restores the missing regions of both RGB and depth images.

Our contributions are as follows:

  • We propose a deep learning architecture that jointly restores the texture and geometry of scenes from RGB and depth images.

  • We employ the late fusion approach to fuse RGB and depth information and show that it performs better than the early fusion and no fusion approaches.

The rest of this paper is organized as follows. We first provide preliminaries of our work in Section 2. Section 3 reviews related work on image inpainting. The proposed RGB-D inpainting method is presented in Section 4. Section 5 describes the experimental setting and presents the evaluation results and performance comparisons. This paper is concluded in Section 6. Finally, we discuss future work in Section 7.

2 Preliminaries

2.1 Generative adversarial networks

The concept of generative adversarial networks was introduced by Goodfellow et al. [8]. Two networks, a discriminator ($D$) and a generator ($G$), are jointly trained in the GAN learning process. The generator network learns to map a source of noise to the data space. First, the generator samples input variables $z$ from a simple noise distribution $p_z$, such as the uniform distribution or a spherical Gaussian distribution, then maps the input variables to the data space as $G(z)$. On the other hand, the discriminator network aims to distinguish a generated sample $G(z)$ from a true data sample $x$. This relationship can be considered as a minimax two-player game in which $D$ and $G$ compete. The generator and discriminator can be trained jointly by solving the following minimax problem:

$$\min_G \max_D \; \mathbb{E}_{x \sim p_{data}}[\log D(x)] + \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))]$$
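As a small numerical illustration of this two-player objective, the sketch below evaluates the empirical value (mean log-probability on real samples plus mean log of one minus the probability on generated samples) for given discriminator outputs. The function name `gan_value` and the sample probabilities are hypothetical and chosen only for illustration; this is not the training code of the proposed method.

```python
import math

def gan_value(d_real, d_fake):
    """Empirical GAN value: mean log D(x) + mean log(1 - D(G(z))).

    d_real: discriminator outputs in (0, 1) on true samples.
    d_fake: discriminator outputs in (0, 1) on generated samples.
    """
    real_term = sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term

# A discriminator that cannot tell real from generated outputs 0.5 everywhere.
undecided = gan_value([0.5, 0.5], [0.5, 0.5])
# A confident, correct discriminator pushes the value towards its upper bound 0.
confident = gan_value([0.9, 0.95], [0.1, 0.05])
```

The discriminator tries to increase this value, while the generator tries to decrease it by making generated samples harder to reject.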


2.2 Wasserstein GANs

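Minimizing the objective function of GAN is equivalent to minimizing the Jensen-Shannon divergence between the data and model distributions. GANs are known for their ability to generate high-quality samples; however, training the original version of GANs suffers from many problems (e.g., mode collapse and vanishing gradients). To address these problems, Arjovsky et al. [1] proposed using the earth-mover (also called Wasserstein-1) distance $W(\mathbb{P}_r, \mathbb{P}_g)$ for comparing the generated and real data distributions as follows:

$$W(\mathbb{P}_r, \mathbb{P}_g) = \inf_{\gamma \in \Pi(\mathbb{P}_r, \mathbb{P}_g)} \mathbb{E}_{(x, y) \sim \gamma}\left[\lVert x - y \rVert\right]$$

where $\Pi(\mathbb{P}_r, \mathbb{P}_g)$ denotes the set of all joint distributions $\gamma(x, y)$ whose marginals are, respectively, $\mathbb{P}_r$ and $\mathbb{P}_g$.

Its objective function is constructed by applying the Kantorovich-Rubinstein duality:

$$\min_G \max_{D \in \mathcal{D}} \; \mathbb{E}_{x \sim \mathbb{P}_r}[D(x)] - \mathbb{E}_{\tilde{x} \sim \mathbb{P}_g}[D(\tilde{x})]$$

where $\mathcal{D}$ is the set of 1-Lipschitz functions, and $\mathbb{P}_g$ is the model distribution, defined by $\tilde{x} = G(z)$, $z \sim p(z)$.

To enforce the Lipschitz constraint in WGAN, Arjovsky et al. clipped the critic weights to a compact range $[-c, c]$. However, recent works suggest that weight clipping reduces the capacity of the discriminator model. Gulrajani et al. [9] proposed an improved version of WGAN that adds a gradient penalty term:

$$\lambda \, \mathbb{E}_{\hat{x} \sim \mathbb{P}_{\hat{x}}}\left[\left(\lVert \nabla_{\hat{x}} D(\hat{x}) \rVert_2 - 1\right)^2\right]$$

where $\hat{x}$ is sampled from straight lines between pairs of points sampled from the data distribution $\mathbb{P}_r$ and the generator distribution $\mathbb{P}_g$.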
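To make the gradient penalty concrete, the following minimal sketch computes it for a toy critic. It is an illustration under stated assumptions, not the implementation used in this paper: the helper names are hypothetical, the gradient is estimated by central differences rather than by automatic differentiation (as a deep learning framework would do), and the weight `lam=10.0` is an assumed default, not a value reported here.

```python
import math
import random

def numerical_gradient(critic, x, h=1e-5):
    """Central-difference estimate of the gradient of critic at point x."""
    grad = []
    for i in range(len(x)):
        up = list(x)
        down = list(x)
        up[i] += h
        down[i] -= h
        grad.append((critic(up) - critic(down)) / (2.0 * h))
    return grad

def gradient_penalty(critic, real_batch, fake_batch, lam=10.0, rng=random):
    """lam * mean((||grad D(x_hat)||_2 - 1)^2) over interpolated points x_hat."""
    total = 0.0
    for x, x_tilde in zip(real_batch, fake_batch):
        t = rng.random()  # interpolation weight t ~ U[0, 1]
        # x_hat lies on the straight line between a real and a generated sample.
        x_hat = [t * a + (1.0 - t) * b for a, b in zip(x, x_tilde)]
        grad = numerical_gradient(critic, x_hat)
        norm = math.sqrt(sum(g * g for g in grad))
        total += (norm - 1.0) ** 2
    return lam * total / len(real_batch)

def linear_critic(x):
    """Toy critic D(x) = 3*x0 + 4*x1; its gradient norm is 5 everywhere."""
    return 3.0 * x[0] + 4.0 * x[1]

penalty = gradient_penalty(linear_critic,
                           [[0.0, 1.0], [1.0, 1.0]],
                           [[0.5, 0.2], [0.1, 0.9]])
```

For this toy critic the penalty is large because the gradient norm is far from 1; a critic whose gradient norm equals 1 along the interpolation lines incurs no penalty.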

3 Related Work

3.1 Image Inpainting

Two broad approaches to image inpainting exist. Traditional diffusion-based and patch-based methods belong to the first approach. The diffusion-based method [2] propagates pixel information from around the target missing region in an image. Diffusion-based methods can only fill plain and small or narrow holes. In contrast, the patch-based method, which searches for the closest matching patch and pastes it into the missing region, works well on more complicated images. PatchMatch [3], a representative patch-based method, has shown compelling results in practical image editing applications. However, as these methods rely heavily on low-level features (e.g., the sum of squared differences of patch pixel values) and do not consider the global structure, they often produce semantically inconsistent inpainting results. Moreover, they are unable to generate novel objects not found in the existing image.

The second approach is learning-based methods, which train CNNs to predict pixels for missing regions. Initially, CNN-based image inpainting approaches could only deal with very small and thin holes and often generated images with artifacts, resulting in blurry and distorted images. To handle these problems, Pathak et al. [14] introduced the Context Encoder (CE), which was the first to be trained with both a Mean Square Error (MSE) loss and a generative adversarial loss [8] as the objective function. This method allows a larger mask (a 64 × 64 mask in a 128 × 128 image) to be restored. Iizuka et al. [10] improved this work by introducing global and local discriminators. The global discriminator judges whether the restored section is consistent with the whole image, and the local discriminator focuses on the inpainted region to assess local texture coherency. Furthermore, Iizuka et al. employed dilated convolutions, which enlarge the area of the input that each layer can see (the receptive field). Yu et al. [18] proposed an end-to-end image inpainting model consisting of two networks: a coarse network and a refinement network, which ensures the color and texture coherency of generated regions with their surroundings. They also introduced a contextual attention module that allows networks to use information from distant spatial locations, and applied a modified version of the WGAN-GP loss [9] instead of the existing GAN loss to ensure global and local consistency and make training stable. We also employ WGAN-GP in the proposed method.

3.2 Learning-based RGB-D inpainting

Only one prior work [5] has discussed RGB-D image restoration. It focused on predicting a depth map and a foreground separation mask for hallucinating plausible colors and depths in the occluded area. That work uses independent completion networks, a discriminator each for RGB and depth images, and a pair discriminator for the RGB-D image. We employ the idea of the pair discriminator in our proposed method.

Figure 1: Illustration of the RGB-D image inpainting network. The network takes the input of RGB and depth images with missing regions, and it jointly outputs restored RGB and depth images.

4 Approach

Our proposed method predicts both missing RGB and depth pixels jointly. This approach restores not only the textures but also the geometries of a scene. We propose using a GAN-based late fusion architecture so that the RGB and depth features serve as complementary information for each other. Our network consists of two parts, a completion network and a discriminator network.

4.1 The Network Architecture

4.1.1 RGB-D Inpainting Network

Our inpainting network is depicted in Figure 1. It is based on the architecture proposed by Iizuka et al. [10] and consists of an RGB encoder-decoder, a depth encoder-decoder, and a fusion network. The inpainting network is applied to images with a rectangular region of missing pixels. The hole region is indicated by filling it with white pixels and by a binary mask. The hole size and location are randomly selected during training. RGB and depth features, which are extracted by the individual encoders, are used as the input of the fusion network. We adopt dilated convolution layers with different dilation rates in the four convolution layers of the fusion part to increase the receptive field.
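The effect of dilation on the receptive field can be checked with a short calculation. The sketch below is illustrative only: the 3 × 3 kernels and the dilation rates 2, 4, 8, 16 are assumptions borrowed from the configuration of Iizuka et al. [10], since the rates used in the fusion network are not listed here, and the function name is hypothetical.

```python
def receptive_field(layers):
    """Receptive field (in pixels, along one axis) of stacked stride-1 convolutions.

    layers: list of (kernel_size, dilation) pairs, applied in order.
    Each stride-1 layer widens the field by (kernel_size - 1) * dilation.
    """
    field = 1
    for kernel_size, dilation in layers:
        field += (kernel_size - 1) * dilation
    return field

# Four 3x3 layers without dilation versus with the ASSUMED rates 2, 4, 8, 16.
plain = receptive_field([(3, 1)] * 4)
dilated = receptive_field([(3, 2), (3, 4), (3, 8), (3, 16)])
```

Under these assumptions, four plain layers see 9 pixels along each axis while the dilated stack sees 61, with the same number of weights per layer.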

4.1.2 Discriminator Network

We employ the discriminator network architecture proposed by Iizuka et al. [10]. It consists of two networks: a local discriminator and a global discriminator. The global discriminator judges scene consistency, and the local discriminator assesses the quality of the small completed area. Following the method of Dhamo et al. [5], to encourage inter-domain consistency between RGB and depth, we feed the discriminators four-channel RGB-D data.

4.2 Loss Function

We combine a content loss and a generative adversarial loss for training. Instead of the standard GAN loss, we adopt WGAN-GP, which makes training stable. We use the $\ell_1$ reconstruction losses of the RGB and depth images as the content loss. WGAN-GP works well when combined with an $\ell_1$ reconstruction loss, as the Wasserstein-1 distance in WGAN is based on the $\ell_1$ distance.


where and denote RGB and depth reconstruction loss, respectively. We set as .

Given a raw RGB image and a raw depth image, we choose a random size and location for the binary image mask. We make holes by pixel-wise multiplication of each image and the mask. The RGB encoder and the depth encoder each take the concatenation of the corresponding image and the binary mask. Therefore, the input of the RGB encoder has four channels (the R, G, B color channels and the binary mask), and the input of the depth encoder has two channels (the depth channel D and the binary mask). We use normalized image pixels as the network input and generate output images with the same resolution as the input. The details of the training procedure are shown in Algorithm 1.
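The input construction described above can be sketched in a few lines. This is a minimal illustration with plain Python lists, not the actual data pipeline: the function names, the mask convention (1 marks a missing pixel), pixel values in [0, 1] with white equal to 1.0, and the hole-size bounds of 25% to 50% of each side are all assumptions made for the example.

```python
import random

def random_rect_mask(height, width, min_frac, max_frac, rng):
    """Binary mask (1 = missing pixel) with one randomly placed rectangle."""
    hole_h = max(1, int(height * rng.uniform(min_frac, max_frac)))
    hole_w = max(1, int(width * rng.uniform(min_frac, max_frac)))
    top = rng.randint(0, height - hole_h)
    left = rng.randint(0, width - hole_w)
    return [[1.0 if top <= r < top + hole_h and left <= c < left + hole_w else 0.0
             for c in range(width)] for r in range(height)]

def punch_hole(channel, mask, fill=1.0):
    """Keep known pixels and write `fill` (white) into the hole."""
    return [[px * (1.0 - m) + fill * m for px, m in zip(row, mask_row)]
            for row, mask_row in zip(channel, mask)]

def build_inputs(rgb, depth, mask):
    """Return the 4-channel RGB-encoder input and the 2-channel depth-encoder input."""
    rgb_in = [punch_hole(ch, mask) for ch in rgb] + [mask]
    depth_in = [punch_hole(depth, mask), mask]
    return rgb_in, depth_in

rng = random.Random(0)
h, w = 8, 8
rgb = [[[0.5] * w for _ in range(h)] for _ in range(3)]  # three color channels
depth = [[0.25] * w for _ in range(h)]                   # one depth channel
mask = random_rect_mask(h, w, 0.25, 0.5, rng)
rgb_in, depth_in = build_inputs(rgb, depth, mask)
```

Both encoders receive the same mask, so the RGB and depth holes always coincide.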

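  while G has not converged do
     for each critic iteration do
        Sample a batch of RGB images and depth images from training data;
        Generate random masks for the RGB and depth images;
        Construct the masked inputs from the images and masks;
        Get predictions from the inpainting network;
        Sample interpolation weights and interpolated points between the real and predicted images;
        Update the two critics with the real, predicted, and interpolated images;
     end for
     Sample a batch of RGB images and depth images from training data;
     Generate random masks for the RGB and depth images;
     Update the inpainting network with the RGB and depth reconstruction losses and the two adversarial losses;
  end while
Algorithm 1 The training algorithm of our proposed method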
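The alternating structure of Algorithm 1 can be sketched as a loop over injected callables. This is a schematic under assumptions, not the authors' training script: all function and parameter names are hypothetical, the convergence test is replaced by a fixed number of steps, and the value `n_critic=5` in the dry run is an assumed setting.

```python
def train(num_generator_steps, n_critic, sample_batch, make_masks,
          inpaint, update_critics, update_generator):
    """Alternate n_critic critic updates with one inpainting-network update."""
    for _ in range(num_generator_steps):
        for _ in range(n_critic):
            rgb, depth = sample_batch()
            masks = make_masks(rgb)
            pred_rgb, pred_depth = inpaint(rgb, depth, masks)
            # Both critics (global and local) are updated inside this callable.
            update_critics((rgb, depth), (pred_rgb, pred_depth), masks)
        rgb, depth = sample_batch()
        masks = make_masks(rgb)
        # Reconstruction losses plus the two adversarial losses.
        update_generator(rgb, depth, masks)

# Dry run with stand-in callables that only count how often they are called.
calls = {"critic": 0, "generator": 0}

def count_critic(real, fake, masks):
    calls["critic"] += 1

def count_generator(rgb, depth, masks):
    calls["generator"] += 1

train(num_generator_steps=3, n_critic=5,
      sample_batch=lambda: ([0.0], [0.0]),
      make_masks=lambda batch: [1.0],
      inpaint=lambda rgb, depth, masks: (rgb, depth),
      update_critics=count_critic,
      update_generator=count_generator)
```

In a real setup the stand-ins would be replaced by the data loader, the mask generator, the inpainting network, and the optimizer steps for the critics and the generator.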

5 Experiments

5.1 Dataset

We evaluate our network with the SceneNet RGB-D dataset [12], which consists of five million rendered RGB-D images from over trajectories in synthetic layouts. The poses of objects are randomly arranged and physically simulated with random lighting, camera trajectories, and textures. We train our model using about two million images taken from the SceneNet RGB-D dataset. Following the image inpainting task, we use images of size . We also conduct tests on images from the test data.

5.2 Implementation details

The input of the completion network is an RGB image, a depth image, and a binary mask channel. The outputs are inpainted RGB and depth images. The models are implemented in PyTorch 1.2.0.

5.3 Baseline methods

To evaluate our late fusion approach, we also trained two baseline models using the no fusion and early fusion approaches. In the no fusion approach, RGB and depth images are completed by two individual networks: one network uses the RGB image as input, and the other uses the depth image. While the late fusion network has separate encoder-decoder architectures for RGB and depth, the early fusion approach has a single encoder-decoder architecture for RGB-D data. For both the no fusion and early fusion approaches, the other configurations are the same as those of late fusion.

5.4 Training Procedure

The sizes of holes are between and of the size of the image. We train the inpainting network for iterations using a batch size of images on an NVIDIA Titan GV 100 GPU. The total training time is roughly one week, and the inference is in real-time. We train our model with Adam optimizer [11] with the learning rate of .

Figure 2: The results of the proposed inpainting network. Rows: input image, output image, ground truth. Missing regions in the images of the first row are colored in white. The pixels with large depth values are colored in red and the pixels with small depth values are colored in blue.
Figure 3: Qualitative comparisons of our proposed method with two baselines. Panels: ground truth, input image, late fusion (ours), early fusion (baseline), no fusion (baseline).
Figure 4: The result of the proposed method with the input of the object mask. Panels: input image, output image, ground truth.

5.5 Performance Evaluation

5.5.1 Qualitative Comparison

In Figure 2, we show the inpainting results for three images from the dataset. We confirmed that our network restores the missing region of both RGB and depth images, and that the late fusion approach forms clear edges in each restored region.

We compare the qualitative performance of our late fusion approach with the early fusion and no fusion approaches. As shown in Figure 3, compared to the baseline methods, late fusion successfully merges the depth cue into the RGB results to produce sharper edges. In contrast, early fusion sometimes produces discontinuous results. The no fusion approach shows obvious visual artifacts, including blur and distortion in the masked region of both the RGB and depth images.

Moreover, we use an RGB-D image in which an object is masked as the input of our proposed network. As depicted in Figure 4, our method generates a plausible inpainted depth image, while the RGB result includes some blur. This result shows the possibility of applying our method to diminished reality (DR) applications.

5.5.2 Quantitative Comparison

5.5.3 RGB inpainting result

Table 1 shows the quantitative results of the RGB restoration of our proposed method. We present a comparison with baseline methods to verify the effectiveness of our proposed method. Since many plausible solutions that differ from the original image content exist, the image inpainting task does not have perfect metrics. Nevertheless, we evaluate our proposed method by measuring the pixel-wise loss, peak signal-to-noise ratio (PSNR), and structural similarity (SSIM), following the RGB inpainting task [10]. Our proposed method slightly outperforms the two baseline methods.

Early Fusion
No Fusion
Late Fusion (Ours)
Table 1: RGB quantitative result.
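For readers who want to reproduce this kind of RGB evaluation, the sketch below computes a mean absolute error and PSNR on flattened pixel lists. It is a minimal illustration with hypothetical function names; it assumes pixel values in [0, 1] and that the pixel-wise loss is a mean absolute error, and it omits SSIM, which requires windowed local statistics.

```python
import math

def mean_abs_error(pred, target):
    """Mean absolute error between two equally sized pixel lists."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)

def psnr(pred, target, max_val=1.0):
    """Peak signal-to-noise ratio in dB: 10 * log10(max_val^2 / MSE)."""
    mse = sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)
    if mse == 0.0:
        return float("inf")  # identical images
    return 10.0 * math.log10(max_val ** 2 / mse)

pred = [0.5, 0.5, 1.0, 0.0]
target = [0.0, 1.0, 1.0, 0.0]
l1 = mean_abs_error(pred, target)  # 0.25
score = psnr(pred, target)         # MSE = 0.125, so about 9.03 dB
```

Lower error and higher PSNR indicate a restoration closer to the ground truth.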

5.5.4 Depth inpainting result

Table 2 shows the quantitative evaluation of the depth inpainting result. We compute the absolute relative error (Abs rel), squared relative error (Sq rel), root mean square error (RMSE), and logged root mean square error (RMSE log), following the monocular depth estimation task [6]. From the table, our proposed method significantly improves the accuracy of the depth inpainting.

Method Abs Rel Sq Rel RMSE RMSE log
Early Fusion
No Fusion
Late Fusion (Ours)
Table 2: Depth quantitative result.
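The four depth metrics named above can be written compactly as follows. This is a sketch using the standard definitions from the monocular depth estimation literature [6] on paired positive depth values; the function name and the toy values are hypothetical, and whether the metrics are averaged over the whole image or only over the hole region is left open here.

```python
import math

def depth_metrics(pred, gt):
    """Abs rel, Sq rel, RMSE and RMSE log over paired positive depth values."""
    n = len(gt)
    abs_rel = sum(abs(p - g) / g for p, g in zip(pred, gt)) / n
    sq_rel = sum((p - g) ** 2 / g for p, g in zip(pred, gt)) / n
    rmse = math.sqrt(sum((p - g) ** 2 for p, g in zip(pred, gt)) / n)
    rmse_log = math.sqrt(
        sum((math.log(p) - math.log(g)) ** 2 for p, g in zip(pred, gt)) / n)
    return {"abs_rel": abs_rel, "sq_rel": sq_rel,
            "rmse": rmse, "rmse_log": rmse_log}

# Toy example: one pixel predicted at twice its true depth, one exact.
metrics = depth_metrics(pred=[2.0, 4.0], gt=[1.0, 4.0])
```

All four metrics are errors, so lower values are better.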

6 Conclusion

In this work, we proposed a GAN-based RGB-D encoder-decoder inpainting network and evaluated late fusion on a synthetic dataset. Our network jointly restored the missing region of RGB and depth images. We discussed how the fusion method lets the RGB and depth inpainting results enhance each other in a complementary way. We showed that the late fusion approach outperforms the no fusion and early fusion approaches.

7 Future Work

First, we will compare our proposed method with [5], which also uses NYU Depth v2 (real) for evaluation, in addition to SceneNet [12] (synthetic). Second, we will explore the fusion method. In this paper, we showed the advantage of the late fusion approach, whose fusion scheme is very simple. Fusion schemes for RGB-D image inpainting have not been proposed, but they have been explored for other tasks. Zeng et al. [15] focused on surface normal estimation from RGB-D data using a hierarchical network with adaptive feature re-weighting, which is proposed to fuse color and depth features at multiple scales. In their work, Zeng et al. stated that single-scale fusion (as in our method) is inefficient for fusing RGB and depth when they contain different types of noise.

Lastly, we will also study the validity of the discriminator design. In this paper, we focus on the architecture of the generator and do not pay much attention to the discriminator model. Discriminators for depth images have not been studied thoroughly. Following the monocular depth estimation task, we will propose a discriminator that judges inpainted depth at multiple scales.


  • [1] M. Arjovsky, S. Chintala, and L. Bottou (2017) Wasserstein generative adversarial networks. In ICML, pp. 214–223.
  • [2] C. Ballester, M. Bertalmio, and V. Caselles (2001) Filling-in by joint interpolation of vector fields and gray levels. IEEE Transactions on Image Processing 10, pp. 1200–1211.
  • [3] C. Barnes, E. Shechtman, A. Finkelstein, and D. B. Goldman (2009) PatchMatch: a randomized correspondence algorithm for structural image editing. ACM SIGGRAPH.
  • [4] F. I. Cosco, C. Garre, F. Bruno, M. Muzzupappa, and M. A. Otaduy (2009) Augmented touch without visual obtrusion. In ISMAR, pp. 99–102.
  • [5] H. Dhamo, K. Tateno, I. Laina, N. Navab, and F. Tombari (2019) Peeking behind objects: layered depth prediction from a single image. Pattern Recognition Letters 125.
  • [6] D. Eigen, C. Puhrsch, and R. Fergus (2014) Depth map prediction from a single image using a multi-scale deep network. In NIPS.
  • [7] A. Flores and S. Belongie (2010) Removing pedestrians from google street view images. In CVPR - Workshops, pp. 53–58.
  • [8] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014) Generative adversarial nets. In NIPS, pp. 2672–2680.
  • [9] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. C. Courville (2017) Improved training of wasserstein gans. In NIPS, pp. 5767–5777.
  • [10] S. Iizuka, E. Simo-Serra, and H. Ishikawa (2017) Globally and Locally Consistent Image Completion. ACM SIGGRAPH 36, pp. 107:1–107:14.
  • [11] D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization.
  • [12] J. McCormac, A. Handa, S. Leutenegger, and A. J. Davison (2017) SceneNet rgb-d: can 5m synthetic images beat generic imagenet pre-training on indoor segmentation? In ICCV, pp. 2697–2706.
  • [13] S. Mori, J. Herling, W. Broll, N. Kawai, H. Saito, D. Schmalstieg, and D. Kalkofen (2018) 3D pixmix: image-inpainting in 3d environments. In ISMAR, pp. 1–2.
  • [14] D. Pathak, P. Krähenbühl, J. Donahue, T. Darrell, and A. Efros (2016) Context encoders: feature learning by inpainting. In CVPR, Vol. 28.
  • [15] J. Qiu, Z. Cui, Y. Zhang, X. Zhang, S. Liu, B. Zeng, and M. Pollefeys (2019) DeepLiDAR: deep surface normal guided depth prediction for outdoor scene from sparse lidar data and single color image. In CVPR.
  • [16] Z. Siavash, E. Julien, G. Yakup, and N. Nassir (2003) Multiview paraperspective projection model for diminished reality. In ISMAR, pp. 217–226.
  • [17] A. Wang, J. Lu, J. Cai, T. Cham, and G. Wang (2015) Large-margin multi-modal deep learning for rgb-d object recognition. IEEE Transactions on Multimedia 17, pp. 1887–1898.
  • [18] J. Yu, Z. L. Lin, J. Yang, X. Shen, X. Lu, and T. S. Huang (2018) Generative image inpainting with contextual attention. CVPR.