DeOccNet: Learning to See Through Foreground Occlusions in Light Fields

12/10/2019, by Yingqian Wang et al.

Background objects occluded in some views of a light field (LF) camera can be seen by other views. Consequently, occluded surfaces can be reconstructed from LF images. In this paper, we handle the LF de-occlusion (LF-DeOcc) problem using a deep encoder-decoder network (namely, DeOccNet). In our method, sub-aperture images (SAIs) are first fed to the encoder to incorporate both spatial and angular information. The encoded representations are then used by the decoder to render an occlusion-free center-view SAI. To the best of our knowledge, DeOccNet is the first deep learning-based LF-DeOcc method. To handle the insufficiency of training data, we propose an LF synthesis approach to embed selected occlusion masks into existing LF images. Besides, several synthetic and real-world LFs are developed for performance evaluation. Experimental results show that, after training on the generated data, our DeOccNet can effectively remove foreground occlusions and achieves superior performance compared to other state-of-the-art methods. Source code is available at: https://github.com/YingqianWang/DeOccNet.


1 Introduction

Seeing through foreground occlusions is beneficial to many computer vision applications such as detection and tracking in surveillance [5, 13, 32, 33]. However, due to foreground occlusions, some rays cannot reach the sensor of a traditional single-view camera (e.g., a digital single-lens reflex camera). Therefore, objects behind occlusions cannot be fully observed or reliably reconstructed. In recent years, camera arrays [26, 25, 18, 19, 8] have undergone rapid development since they can record light fields (LFs) and provide a large number of viewpoints with rich angular information. The complementary information among different viewpoints is beneficial for the reconstruction of occluded surfaces, since background objects occluded in some views can be seen by other views.

Figure 1: An illustration of LF-DeOcc using our rendered scene Syn01. (a) Configuration of the scene; yellow boxes with blocks represent camera arrays. (b) Occluded center-view SAI. (c) Result of our DeOccNet. (d) Occlusion-free groundtruth.

As illustrated in Fig. 1, light field de-occlusion (LF-DeOcc) aims at removing foreground occlusions using sub-aperture images (SAIs) captured by a camera array. (In the area of LF-DeOcc, images captured by camera arrays are widely used due to their wide baselines. Therefore, we follow existing work [18, 12, 34, 30, 11] and use camera arrays for LF-DeOcc.) The pioneering work on LF-DeOcc was proposed by Vaish et al. [18] using a refocusing method. However, this method cannot recover a clean surface of occluded objects since rays from occlusions and from the background are mixed. In fact, it is important but challenging to correctly select pixels belonging only to occluded objects. To this end, existing methods [12, 34, 30, 11] generally build different models to handle the LF-DeOcc problem. Due to the highly complex structures of real-world scenes, these methods, which rely on handcrafted feature extraction and stereo matching techniques, cannot achieve satisfactory performance. In recent years, deep learning has been successfully used in different LF tasks such as depth estimation [15, 14], image super-resolution [38, 23, 35], view synthesis [28, 22, 27], and LF intrinsics [2, 1]. These networks have achieved state-of-the-art performance in numerous areas. However, to the best of our knowledge, deep learning has not been used for LF-DeOcc due to several issues. In this paper, we design a novel and effective paradigm, and propose the first deep learning network (i.e., DeOccNet) to handle the LF-DeOcc problem. Specifically, we summarize three major challenges in deep learning-based LF-DeOcc, and provide solutions to these challenges using our proposed paradigm.

The first challenge is that, compared to LF depth estimation networks [15, 14] and LF super-resolution networks [38, 23, 35], LF-DeOcc networks should use as much information from occluded surfaces as possible, while maintaining a larger receptive field to cover occlusions of different types and scales. We address this challenge by employing an encoder-decoder network to encode LF structures. We concatenate all SAIs along the channel dimension to fully use the information of occluded surfaces. Besides, we use a residual atrous spatial pyramid pooling (ASPP) module to extract multi-scale features and enlarge the receptive field.

The second challenge is that, compared to single image inpainting networks [10, 31, 37, 9], LF-DeOcc networks have to learn the scene structure to automatically recognize, label, and remove foreground occlusions. We address this challenge by setting the occlusion-free center-view SAI as groundtruth, and training our DeOccNet in an end-to-end manner. In this way, our network can distinguish occlusions from the background through disparity discrepancy, and automatically remove foreground occlusions.

The third challenge is that LF-DeOcc networks face an insufficiency of training data, since large-scale LF datasets with removable foreground occlusions are unavailable. Moreover, test scenes are also insufficient for performance evaluation. We address this challenge by proposing a data synthesis approach to embed different occlusion masks into existing LF images. Using this approach, more than 1000 LFs are generated to train our network. In addition, we develop several synthetic and real-world LFs for performance evaluation.

Experimental results have demonstrated the effectiveness of our paradigm. Our DeOccNet achieves superior performance on both synthetic and real-world scenes as compared to other state-of-the-art methods.

2 Related Works

2.1 Single image inpainting

Single image inpainting methods aim at filling holes in an image using both neighborhood information and global priors. The major challenge of single image inpainting lies in synthesizing visually realistic and semantically plausible pixels for missing regions. Recent deep learning-based methods [10, 31, 37, 9] have achieved promising results for inpainting large missing regions in an image. Specifically, Yu et al. [37] proposed a deep generative model-based inpainting method to synthesize novel image structures and textures. Liu et al. [9] used partial convolutions for inpainting with irregular holes and achieved state-of-the-art performance.

Compared to single image inpainting, LF-DeOcc can use the complementary information provided by SAIs to generate improved results. The difference between single image inpainting and LF-DeOcc is significant. In single image inpainting, holes or masks are always pre-defined. In contrast, LF-DeOcc requires automatic extraction of foreground occlusions by analyzing scene structures; that is, occlusions are closer to the cameras than background objects and thus have larger disparities. Due to the complex structures of real-world scenes, it is highly challenging for algorithms to correctly select rays originating from occluded objects.

Figure 2: An overview of our DeOccNet. (a) The overall architecture. (b) The structure of the residual ASPP module.

2.2 Light field de-occlusion

LF-DeOcc is an active research topic and has been investigated for decades [18, 17, 12, 34, 30, 11]. Vaish et al. [18] proposed a refocusing method that warps each SAI by a specific value and then averages the warped SAIs along the angular dimension. Due to the large equivalent aperture of camera arrays, when the background is refocused on, occlusions are extremely blurred and a see-through effect can be achieved. However, the resulting images are always blurred due to the indiscriminate use of rays from both occlusions and the background. Vaish et al. further proposed an improved version using both median cost and entropy cost [17]. Since these methods [18, 17] do not explicitly exploit scene structures, their performance is limited for scenes with heavy occlusions.

To solve this problem, Pei et al. proposed a pixel-labeling method [12] to remove occlusions. Specifically, pixels corresponding to occlusions are labeled by stereo matching and masked out during the refocusing process, resulting in a clean image. However, this method can only generate images refocused at a specific depth, leaving objects in other depth ranges suffering from various degrees of blur. Subsequently, they used an image-matting approach to perform all-in-focus synthetic aperture imaging [11]. Besides, Yang et al. used visible layers [34] to address the all-in-focus imaging issue, and Xiao et al. [30] used k-means clustering to classify pixels of occlusions and background. All these methods [12, 34, 30, 11] use handcrafted feature extraction and stereo matching techniques, and cannot achieve satisfactory performance in scenes with complex structures and heavy occlusions.

2.3 Deep learning in light field

Deep neural networks have been widely used in various LF tasks such as image super-resolution [36, 23, 35, 38], view synthesis [28, 22, 27], and depth estimation [15, 14]. Compared to these tasks, networks for LF-DeOcc should have a larger receptive field and use more information of occluded surfaces. Currently, no work on deep learning-based LF-DeOcc is available in the literature. It is worth noting that the works in [2, 1] are similar to ours. Specifically, a fully convolutional auto-encoder is proposed in [2] to separate the diffuse and specular components of an LF. Both [2] and our work require high-level features and global priors of scene structures. Consequently, we build our network upon an encoder-decoder architecture to encode LF structures. Note that there are two significant differences between [2] and our network. First, only horizontal and vertical SAIs (i.e., the 9 cross views of a 5×5 LF) are used in [2]. In contrast, all SAIs are used in our network to fully exploit the information of occluded objects. Second, our DeOccNet uses a residual ASPP module to enlarge the receptive field, and uses multiple skip layers to obtain a holistic understanding of the scene while preserving fine details.

3 The Proposed Method

3.1 Network architecture

The task of our DeOccNet is to replace pixels of occlusions with pixels from the background. To achieve this, our network is required to find correspondences and incorporate complementary information from SAIs. Note that foreground occlusions generally have shallow depths and large disparities. That is, pixels of occlusions have very large position variations among SAIs. Therefore, multi-scale features with large receptive fields are required for our network. In this paper, we first use a residual ASPP module for hierarchical feature extraction, and then use an auto-encoder to incorporate both spatial and angular information. The architecture of our DeOccNet is shown in Fig. 2. Different from existing LF networks [2, 15, 38], where only part of the SAIs are stacked as inputs, we stack all SAIs along the channel dimension (i.e., all RGB channels of all SAIs) to use as much information as possible, since LF-DeOcc highly depends on the information of occluded objects. Consequently, our DeOccNet takes the stacked SAIs as its input, and finally generates an occlusion-free center-view SAI.
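To make the input construction concrete, the following minimal sketch (in PyTorch, assuming a hypothetical 5×5 RGB LF with 128×128 views; all sizes are illustrative only) shows how the SAIs can be stacked along the channel dimension before being fed to the network.

```python
import torch

def stack_sais(lf):
    """Stack all sub-aperture images along the channel dimension.

    lf: tensor of shape [U, V, 3, H, W] holding the U x V RGB SAIs.
    Returns a tensor of shape [U*V*3, H, W] that can be fed to the first
    convolution layer of the encoder-decoder network.
    """
    u, v, c, h, w = lf.shape
    return lf.reshape(u * v * c, h, w)

# Example: a hypothetical 5x5 light field with 128x128 RGB views.
lf = torch.rand(5, 5, 3, 128, 128)
net_input = stack_sais(lf).unsqueeze(0)   # add a batch dimension -> [1, 75, 128, 128]
```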

Residual ASPP module. In our network, the input volume is first processed by a convolution layer to generate features with a fixed depth. Then, a residual ASPP module is used to generate hierarchical features. As shown in Fig. 2(b), similar to the module used in [20], we first combine six dilated convolution layers with different dilation rates to form an ASPP group, and then cascade three ASPP groups in a residual manner to achieve high learning efficiency. The residual ASPP module can enlarge the receptive field and extract multi-scale information around occlusions. It is demonstrated in the ablation study (see Table 1) that our residual ASPP module is beneficial to the overall LF-DeOcc performance.
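The sketch below illustrates the idea of the residual ASPP module described above. The dilation rates, the fusion of the six branches by concatenation and a 1×1 convolution, and the exact residual wiring are assumptions made for illustration, since they are not fully specified in this text.

```python
import torch
import torch.nn as nn

class ASPPGroup(nn.Module):
    """Six parallel dilated convolutions fused back to the input depth.

    The dilation rates and the fusion scheme are illustrative assumptions.
    """
    def __init__(self, channels, rates=(1, 2, 4, 8, 16, 32)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv2d(channels, channels, 3, padding=r, dilation=r)
            for r in rates
        ])
        self.fuse = nn.Conv2d(channels * len(rates), channels, 1)

    def forward(self, x):
        return self.fuse(torch.cat([b(x) for b in self.branches], dim=1))

class ResidualASPP(nn.Module):
    """Three ASPP groups cascaded in a residual manner."""
    def __init__(self, channels):
        super().__init__()
        self.groups = nn.ModuleList([ASPPGroup(channels) for _ in range(3)])

    def forward(self, x):
        for g in self.groups:
            x = x + g(x)   # residual connection around each ASPP group
        return x
```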

Figure 3: The structure of the encoder and decoder blocks. Note that the encoder and decoder blocks have mirrored structures; that is, a strided convolution is used in the third unit of each encoder block, while a de-convolution is used in the first unit of each decoder block.

Figure 4: An illustration of our Mask Embedding approach. (a) Masks used in our approach; cropping and scaling are performed for better visualization. (b) The pipeline of our Mask Embedding approach, illustrated with an example LF.

Encoder pathway. Features generated by the residual ASPP module are then transferred to the encoder pathway, where encoder blocks are cascaded to incorporate both spatial and angular information. Specifically, as shown in Fig. 3, each encoder block contains three cascaded units. In each unit, the batch-normalized features are fed to two separate paths to achieve local residual learning. The first path consists of a convolution and a Leaky ReLU, while the second path either keeps the input unchanged or passes it through a strided convolution or de-convolution. Therefore, features produced by the two paths have the same resolution, and are added to produce the output of the unit. For each encoder block, the first two units keep the depth and resolution unchanged, while the third unit applies a strided convolution to halve the resolution and double the feature depth. Consequently, the feature generated at the bottleneck of our DeOccNet has a much lower spatial resolution and a much larger channel depth than the input feature.
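A minimal PyTorch sketch of one encoder block, following the unit structure described above, is given below. The kernel sizes, the leaky factor, and the 1×1 projection used on the second path of the downsampling unit are assumptions.

```python
import torch.nn as nn

class EncoderUnit(nn.Module):
    """One residual unit of an encoder block (cf. Fig. 3).

    If `downsample` is True, both paths use stride 2 so the unit halves
    the resolution and doubles the feature depth; otherwise the second
    path is an identity.
    """
    def __init__(self, in_ch, downsample=False):
        super().__init__()
        out_ch = in_ch * 2 if downsample else in_ch
        stride = 2 if downsample else 1
        self.bn = nn.BatchNorm2d(in_ch)
        self.conv_path = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1),
            nn.LeakyReLU(0.1, inplace=True),
        )
        self.second_path = (nn.Conv2d(in_ch, out_ch, 1, stride=stride)
                            if downsample else nn.Identity())

    def forward(self, x):
        x = self.bn(x)
        return self.conv_path(x) + self.second_path(x)

class EncoderBlock(nn.Module):
    """Three cascaded units; only the third unit downsamples."""
    def __init__(self, in_ch):
        super().__init__()
        self.units = nn.Sequential(
            EncoderUnit(in_ch), EncoderUnit(in_ch),
            EncoderUnit(in_ch, downsample=True),
        )

    def forward(self, x):
        return self.units(x)
```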

Decoder pathway. After passing through the bottleneck, features are decoded through the decoder pathway. Note that the decoder blocks have mirrored structures with respect to the encoder blocks. That is, each decoder block is also chained by three residual units, and its first unit uses a de-convolution to exactly revert the encoder block at the corresponding level. Moreover, to preserve fine details in the final output image, features of different resolutions on the encoder pathway are concatenated with their counterparts on the decoder pathway via skip connections. Consequently, the decoder can be guided to gradually add details onto the restored image. Since the feature depth is doubled by concatenation, an additional convolution layer is employed before the last three decoder blocks to halve the feature depth. Similarly, a convolution layer is applied to the output feature to reduce its channel depth to that of the output RGB image.
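The decoder side can be sketched in a simplified way as below: a de-convolution doubles the resolution and halves the depth, and a 1×1 convolution after the skip concatenation halves the doubled feature depth, as described above. The kernel sizes, the placement of that reduction layer, and the simplified (non-residual) refinement units are assumptions.

```python
import torch
import torch.nn as nn

class DecoderBlock(nn.Module):
    """Simplified mirror of an encoder block with an optional skip input."""
    def __init__(self, in_ch, use_skip=True):
        super().__init__()
        out_ch = in_ch // 2
        self.use_skip = use_skip
        self.up = nn.Sequential(
            nn.ConvTranspose2d(in_ch, out_ch, 4, stride=2, padding=1),
            nn.LeakyReLU(0.1, inplace=True),
        )
        # Concatenating the encoder feature doubles the depth, so a 1x1
        # convolution halves it again (see the text above).
        self.reduce = nn.Conv2d(out_ch * 2, out_ch, 1) if use_skip else nn.Identity()
        self.refine = nn.Sequential(
            nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.LeakyReLU(0.1, inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.LeakyReLU(0.1, inplace=True),
        )

    def forward(self, x, skip=None):
        x = self.up(x)
        if self.use_skip and skip is not None:
            x = self.reduce(torch.cat([x, skip], dim=1))
        return self.refine(x)
```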

3.2 Mask embedding for training data synthesis

It is important to provide sufficient data to train our DeOccNet. Although LFs with removable occlusions can be acquired by capturing real-world scenes with and without foreground occlusions, or by rendering synthetic scenes using software such as 3dsMax (https://www.autodesk.eu/products/3ds-max/overview) and Blender (https://www.blender.org/), these approaches are significantly labour-intensive or even infeasible. Consequently, it is important to design an efficient approach to generate a large amount of data for network training. In this paper, we propose Mask Embedding, a training data synthesis approach that synthesizes LFs with removable foreground occlusions. An illustration of our Mask Embedding approach is shown in Fig. 4.

As shown in Fig. 4(a), we manually collected mask images from the Internet using tags such as salix leaves, grids, fences, and paper cuts. All these masks are common foreground occlusions in daily life. Meanwhile, we collected LFs from the Stanford LF dataset [16], the Old HCI 4D LF dataset [24], the New HCI 4D LF benchmark [4], and the MIT Synthetic LF Archive [7]. Note that, to improve the generalization capability of our network, the RGB channels of both LFs and mask images are randomly shuffled, and the masks are randomly selected to be embedded into the shuffled LFs.

The pipeline of our Mask Embedding approach is illustrated in Fig. 4(b). We randomly pick a mask from the mask set and then embed it into each SAI according to the LF configuration. Specifically, a disparity corresponding to a shallow depth is randomly set and allocated to the selected mask. Then, the mask is warped according to the angular coordinate of the target view and the allocated disparity, where bilinear interpolation is used when masks do not fall onto integer coordinates. Next, the warped masks are added to each SAI, resulting in LFs with foreground occlusions. Finally, refocusing is performed on each generated LF to check the LF configuration. A minimal sketch of this per-view warping step is given below.
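As a concrete illustration of this pipeline, the sketch below embeds a single front-parallel mask into every SAI of an occlusion-free LF. The array layout, the alpha compositing of the mask texture, and the use of SciPy's sub-pixel shift (with linear interpolation) are assumptions made for illustration.

```python
import numpy as np
from scipy.ndimage import shift as subpixel_shift

def embed_mask(lf, mask_rgb, mask_alpha, disparity):
    """Embed a front-parallel occlusion mask into every SAI of an LF.

    lf:         array [U, V, H, W, 3], the original (occlusion-free) LF.
    mask_rgb:   array [H, W, 3], texture of the occluder.
    mask_alpha: array [H, W], 1 where the occluder is opaque, 0 elsewhere.
    disparity:  scalar disparity allocated to the mask (a shallow depth,
                i.e., larger than any background disparity).
    """
    u_num, v_num = lf.shape[:2]
    u_c, v_c = (u_num - 1) / 2, (v_num - 1) / 2
    occluded = lf.copy()
    for u in range(u_num):
        for v in range(v_num):
            # Warp the mask according to the angular coordinate of this view
            # and the allocated disparity (sub-pixel shifts are interpolated).
            dy, dx = (u - u_c) * disparity, (v - v_c) * disparity
            a = subpixel_shift(mask_alpha, (dy, dx), order=1)[..., None]
            rgb = subpixel_shift(mask_rgb, (dy, dx, 0), order=1)
            occluded[u, v] = a * rgb + (1 - a) * lf[u, v]
    return occluded
```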

Although the Mask Embedding approach can easily generate a large number of LFs for training, the generated LFs only have occlusions at a single depth, which is significantly different from real-world scenarios. To address this issue, we use the generated LFs as original LFs and repeat this process twice to synthesize LFs with occlusions at two and three depth layers. Using the proposed approach, we synthesize more than 1000 LFs in total to train our models. Although LFs synthesized by our approach only have fence-like and front-parallel occlusions, experimental results show that our network trained on the synthetic data can generalize well to real-world cases (e.g., the CD scene in Fig. 6). That is, our network can successfully learn the scene structure through disparity discrepancy using LFs synthesized by our Mask Embedding approach.

3.3 Training details

We trained two models with different angular resolutions of input SAIs: one for the Stanford CD scene [16] and one for our self-developed scenes. During the training phase, we used the LFs synthesized by the Mask Embedding approach as training data. All synthesized LFs were used for one model, whereas, due to the angular resolution limitation of existing LF datasets, only LFs generated from the Stanford LF dataset [16] were used as training data for the other.

It is worth noting that foreground and background in a scene are relative concepts. That is, some occlusions can also be considered as background objects in multi-occlusion situations, as shown in Fig. 5. Consequently, both training and test scenes should be rectified to specific depths for LF-DeOcc. In this paper, we perform rectification by cropping each SAI accordingly, so that occlusions have positive disparity values while backgrounds have negative disparity values. In this way, our DeOccNet can effectively achieve LF-DeOcc by simply removing objects with positive disparity values. Finally, we cropped the occluded SAIs into square patches with a fixed stride, and performed upsampling for data augmentation; the occlusion-free center-view SAI was cropped and upsampled accordingly to generate groundtruths. A simple patch-cropping routine is sketched below.
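For reference, a straightforward way to crop an SAI into overlapping square patches is shown here; the patch size and stride are treated as parameters because their exact values are not given in this text.

```python
def crop_patches(img, patch_size, stride):
    """Crop an H x W x C image into overlapping square patches."""
    h, w = img.shape[:2]
    patches = []
    for y in range(0, h - patch_size + 1, stride):
        for x in range(0, w - patch_size + 1, stride):
            patches.append(img[y:y + patch_size, x:x + patch_size])
    return patches
```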

Our DeOccNet was implemented in PyTorch on a PC with an Nvidia RTX 2080Ti GPU. All models were trained using an MSE loss and optimized using the Adam method [6]. The initial learning rate was reduced partway through training, and the training took several days before being stopped.
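A training step consistent with this description might look like the minimal sketch below, assuming `deoccnet` maps the stacked SAIs to an occlusion-free center-view SAI and `loader` yields (stacked SAIs, groundtruth center view) pairs; the hyper-parameter values are placeholders rather than the settings used in the paper.

```python
import torch
import torch.nn as nn

def train(deoccnet, loader, epochs=100, lr=1e-3, device="cuda"):
    """Minimal MSE + Adam training loop for DeOccNet-style models."""
    deoccnet = deoccnet.to(device)
    optimizer = torch.optim.Adam(deoccnet.parameters(), lr=lr)
    criterion = nn.MSELoss()
    for epoch in range(epochs):
        for stacked_sais, center_gt in loader:
            stacked_sais = stacked_sais.to(device)
            center_gt = center_gt.to(device)
            pred = deoccnet(stacked_sais)          # occlusion-free center view
            loss = criterion(pred, center_gt)      # MSE against the groundtruth
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```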

Figure 5: Multi-occlusion situation in LF-DeOcc. (a) Occluded center-view SAI. (b) Result of our DeOccNet with inputs rectified at a shallow depth (the blue dotted line in (a)); the front-most tree is considered as a foreground occlusion. (c) Result of our DeOccNet with inputs rectified at a deep depth (the red dotted line in (a)); the three front-most trees are considered as foreground occlusions. Consequently, different results can be generated by our DeOccNet with the same inputs rectified at different depth values.
(a) Occluded (b) Refocus [18] (c) Median [17] (d) Pei [11]
(e) Liu [9] (f) DeOccNet (75 center views) (g) DeOccNet (h) Groundtruth
Figure 6: Qualitative results achieved on the CD scene [16]. (a) Occluded center-view SAI. (b)-(e) Comparative results achieved by different methods. (f) Result achieved by our DeOccNet using identical center-view SAIs as its inputs (discussed in Section 4.4). (g) Our result. (h) Occlusion-free center-view SAI.

4 Experiments

In this section, we first introduce the test scenes used in our experiments, and then compare our method to several state-of-the-art methods. Finally, we present an ablation study and further analyses.

4.1 Test scenes

Real-world scenes. We followed [11, 17, 34] and tested our method on the publicly available CD scene [16]. The original CD scene consists of 105 views distributed on a grid, and we selected a set of central views for performance evaluation. A groundtruth image is provided by a second capture with the occlusions removed. Besides, we captured several real-world scenes using a moving Leica Q camera (with its fixed 28 mm lens) mounted on a gantry. It is argued in [29, 21] that such a scanning scheme is equivalent to a single shot by a camera array for static scenes. We shifted the camera to a set of positions on a regular grid with centimeter-scale baselines, and the captured images were calibrated using the method in [39].

Synthetic scenes. Since the number of real-world test scenes is very small, we rendered synthetic scenes with removable foreground occlusions for further evaluation. All elements in our synthetic scenes were collected from the Internet, and parameters (e.g., lighting and depth range) were tuned to better reflect real scenes. The angular resolution was the same for all scenes, while the baselines and occlusion ranges varied across scenes. Occlusion-free center-view SAIs were also rendered for quantitative evaluation.

(a) Occluded (Bike01) (b) Refocus [18] (c) Liu [9] (d) Pei [11] (e) Ours
(f) Occluded (Bike02) (g) Refocus [18] (h) Liu [9] (i) Pei [11] (j) Ours
(k) Occluded (Handrail) (l) Refocus [18] (m) Liu [9] (n) Pei [11] (o) Ours
Figure 7: Qualitative results achieved on our self-developed real-world scenes.
(a) Overview (Syn02) (b) Liu [9] (c) Pei [11] (d) Ours (e) Groundtruth
(f) Overview (Syn03) (g) Liu [9] (h) Pei [11] (i) Ours (j) Groundtruth
(k) Overview (Syn04) (l) Liu [9] (m) Pei [11] (n) Ours (o) Groundtruth
Figure 8: Qualitative results achieved on our synthetic scenes. Sub-figures in the leftmost column show the configurations of the different scenes; yellow boxes with blocks represent camera arrays.

4.2 Comparison to the state-of-the-arts

We compared our DeOccNet to the state-of-the-art LF-DeOcc method [11]. We also used the traditional refocusing method [18] and its improved version [17] as baselines. Moreover, to investigate the benefits of the complementary information introduced by additional perspectives, we compared our method to the state-of-the-art image inpainting method [9]. Note that the inpainting method [9] cannot automatically recognize occlusions in an image; therefore, we manually labeled the occlusions in each center-view image. Since the codes of [11, 18, 17] are unavailable, we used our own implementations with their default parameter settings. Following the state-of-the-art image inpainting methods [37, 9], the mean error, peak signal-to-noise ratio (PSNR), and structural similarity (SSIM) are used as quantitative evaluation metrics in this paper. Readers are referred to [37, 9] for more details about these metrics. Qualitative results on the real-world and synthetic datasets are shown in Figs. 6, 7, and 8, and quantitative results are listed in Table 1.
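For completeness, the three metrics can be computed as in the sketch below. Interpreting the mean error as a mean absolute pixel error and the specific SSIM settings are assumptions; the evaluation protocols in [37, 9] should be followed for exact comparability.

```python
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def evaluate(pred, gt):
    """Evaluate a de-occluded center-view SAI against the groundtruth.

    pred, gt: float arrays in [0, 1] of shape [H, W, 3].
    """
    mean_err = np.mean(np.abs(pred - gt))
    psnr = peak_signal_noise_ratio(gt, pred, data_range=1.0)
    ssim = structural_similarity(gt, pred, channel_axis=-1, data_range=1.0)
    return mean_err, psnr, ssim
```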

It can be seen from Fig. 6(a) that occlusions in the CD scene occupy a large portion of the image. Consequently, even though many views are provided, it is still highly challenging to reconstruct the occluded objects. The method in [18] removes occlusions by warping and averaging SAIs. As shown in Fig. 6(b), although occlusions are extremely blurred at the focused depth, the occluded object is also unclear and its contrast is low. Besides, since the results produced by [18] only focus on a limited depth range, objects at other depths are highly blurred and cannot be recognized. In contrast, Vaish et al. [17] use a median cost to achieve all-in-focus synthetic aperture imaging. However, the performance achieved by [17] is still very limited.

Pei et al. [11] achieve a better performance than the method in [17]. That is because only rays belonging to occluded objects are used in [11]. However, due to the various textures and shapes of occlusions and background, it is difficult to exactly select rays only from the background, and the resulting images can be deteriorated by incorrect classification of occlusion and background. More specifically, if pixels from occlusions are misused, the resulting images are blurred (e.g., Fig. 6(d)). In contrast, if background pixels are incorrectly considered as occlusions, the resulting images have blank areas (e.g., Figs. 7(d) and 7(i)) since all information in these areas is removed.

The single image inpainting method [9] tries to hallucinate the missing part and render reasonable results using knowledge learned from a large number of daily scenes (e.g., the ImageNet database [3]). It can generate promising results on scenes with simple structures and small occlusions (e.g., Fig. 8(g)). However, since no additional information is used in the inpainting process, this method cannot handle scenes with rich textures/details or heavy occlusions (e.g., Figs. 7(c) and 8(l)).

Compared to existing methods, our method achieves superior performance, especially on scenes with heavy occlusions (e.g., scenes CD and Syn04). That is, DeOccNet successfully learns to discriminate occlusions from backgrounds, and incorporates information of occluded objects from different viewpoints. It can also be seen from Table 1 that our method achieves the best quantitative results on scenes CD, Syn01, and Syn02. Note that our method is slightly inferior to [9] on scene Syn03. That is because method [9] uses manually labeled groundtruth occlusions while our method does not rely on any groundtruth information. Since scene Syn03 has relatively small occlusions and simple textures, method [9] works well using learned spatial priors and labeled groundtruth occlusions.

Method                     CD     Syn01  Syn02  Syn03  Syn04  Average
Mean error (Pei [11])      0.233  0.188  0.240  0.204  0.156  0.204
Mean error (Liu [9])       0.196  0.264  0.180  0.078  0.187  0.181
Mean error (ours_noASPP)   0.296  0.200  0.271  0.220  0.339  0.265
Mean error (ours_3skips)   0.222  0.202  0.346  0.319  0.207  0.259
Mean error (ours)          0.185  0.138  0.165  0.163  0.242  0.178
PSNR (Pei [11])            16.75  19.78  18.10  19.60  21.57  19.16
PSNR (Liu [9])             15.95  18.19  18.99  25.58  20.44  19.83
PSNR (ours_noASPP)         18.80  20.37  17.69  20.78  17.79  19.09
PSNR (ours_3skips)         19.95  20.41  16.25  17.74  21.20  19.11
PSNR (ours)                21.27  24.68  21.74  23.98  20.56  22.45
SSIM (Pei [11])            0.508  0.636  0.568  0.656  0.569  0.587
SSIM (Liu [9])             0.647  0.595  0.682  0.848  0.485  0.651
SSIM (ours_noASPP)         0.625  0.586  0.617  0.809  0.523  0.632
SSIM (ours_3skips)         0.621  0.613  0.560  0.742  0.530  0.613
SSIM (ours)                0.694  0.699  0.734  0.858  0.650  0.727
Table 1: Quantitative results achieved by different methods and different design choices of DeOccNet. Note that, for the mean error, lower scores indicate better performance, while for PSNR and SSIM, higher scores indicate better performance.

4.3 Ablation study

We conducted ablation study to investigate the improvement introduced by the residual ASPP module and skip connections. Quantitative results are presented in Table 1.

First, we removed the residual ASPP module and retrained DeOccNet from scratch using the same training data. We can observe from Table 1 that the residual ASPP module significantly contributes to the overall LF-DeOcc performance. Specifically, it introduces more than 3 dB improvement in average PSNR (22.45 vs. 19.09) and nearly 0.1 improvement in average SSIM (0.727 vs. 0.632). That is because the residual ASPP module helps the network obtain a large receptive field to cover foreground occlusions of various shapes and scales.

Then, we investigated the benefits introduced by skip connections. Note that, during training, we found that the network cannot achieve reasonable convergence without any skip connections. Therefore, we only removed the outermost skip connection in the ablation study, and inferred the contribution of the skip layers from the experimental results. As shown in Table 1, the outermost skip layer introduces more than 3 dB improvement in average PSNR (22.45 vs. 19.11) and over 0.1 improvement in average SSIM (0.727 vs. 0.613). That is because high-frequency details tend to be lost by strided convolutions, and it is difficult to recover these details from low-resolution features. Therefore, skip connections are necessary to provide a short path for low-level and high-frequency information.

4.4 Further discussion

It is worth noting that the LFs generated by our Mask Embedding approach were used as the only training data of our DeOccNet, and there is no intersection between the training and test scenes. In fact, the styles of occlusion and background in the training and test scenes differ significantly. That is, all occlusions in our training data are fence-like and front-parallel masks. Although we ran the Mask Embedding approach three times to simulate occlusions at multiple depths, we cannot generate occlusions within a continuous range of depths as in the CD and Bike scenes. However, as shown in Figs. 6(g) and 7(e), our network generalizes well to these cases and can handle slanted occlusions. That is because both training and test scenes share the same LF structure: occlusions have positive disparity values while backgrounds have negative disparity values.

To investigate the intrinsic mechanism of our network in dealing with foreground occlusions, we stacked identical center-view SAIs of the CD scene and fed them into our DeOccNet. It can be seen from Fig. 6(f) that our DeOccNet cannot work with replicated inputs carrying identical information. That is, the mechanism of our DeOccNet is significantly different from that of the single image inpainting network [9]. Specifically, rather than using spatial information within one perspective as in [9], our network uses complementary information from different viewpoints. This introduces significant performance improvements in scenes with heavy but pierced occlusions (e.g., the basket in scene Bike01), because useful information can be introduced by different perspectives. However, focusing only on complementary information from different views makes our network perform unsatisfactorily on scenes with solid blocked regions (e.g., the bottom-right corner in scene Bike02), since the background is occluded in all perspectives and spatial information is needed for LF-DeOcc. In the future, we will incorporate perspective information with neighborhood priors for LF-DeOcc, which is likely to introduce further performance improvements.

5 Conclusion

In this paper, we propose DeOccNet, the first deep learning-based method for LF-DeOcc. We embed masks into existing LFs to generate a large training dataset. Experiments on both synthetic and real-world scenes show that our DeOccNet can automatically remove foreground occlusions through disparity discrepancy, and achieves superior performance compared to existing methods.

6 Acknowledgement

This work was partially supported by the National Natural Science Foundation of China (Nos. 61401474, 61602499, 61972435), and Fundamental Research Funds for the Central Universities (No. 18lgzd06).

References

  • [1] A. Alperovich, O. Johannsen, and B. Goldluecke (2018) Intrinsic light field decomposition and disparity estimation with deep encoder-decoder network. In 2018 26th European Signal Processing Conference (EUSIPCO), pp. 2165–2169. Cited by: §1, §2.3.
  • [2] A. Alperovich, O. Johannsen, M. Strecke, and B. Goldluecke (2018) Light field intrinsics with a deep encoder-decoder network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 9145–9154. Cited by: §1, §2.3, §3.1.
  • [3] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei (2009) Imagenet: a large-scale hierarchical image database. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 248–255. Cited by: §4.2.
  • [4] K. Honauer, O. Johannsen, D. Kondermann, and B. Goldluecke (2016) A dataset and evaluation methodology for depth estimation on 4d light fields. In Asian Conference on Computer Vision (ACCV), pp. 19–34. Cited by: §3.2.
  • [5] N. Joshi, S. Avidan, W. Matusik, and D. J. Kriegman (2007) Synthetic aperture tracking: tracking through occlusions. In International Conference on Computer Vision (ICCV), pp. 1–8. Cited by: §1.
  • [6] D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. International Conference on Learning Representations (ICLR). Cited by: §3.3.
  • [7] D. Lanman, G. Wetzstein, M. Hirsch, W. Heidrich, and R. Raskar (2011) Polarization fields: dynamic light field display using multi-layer lcds. In ACM Transactions on Graphics, Vol. 30, pp. 186. Cited by: §3.2.
  • [8] X. Lin, J. Wu, G. Zheng, and Q. Dai (2015) Camera array based light field microscopy. Biomedical optics express 6 (9), pp. 3179–3189. Cited by: §1.
  • [9] G. Liu, F. A. Reda, K. J. Shih, T. Wang, A. Tao, and B. Catanzaro (2018) Image inpainting for irregular holes using partial convolutions. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 85–100. Cited by: §1, §2.1, Figure 6, Figure 7, Figure 8, §4.2, §4.2, §4.2, §4.4, Table 1.
  • [10] D. Pathak, P. Krahenbuhl, J. Donahue, T. Darrell, and A. A. Efros (2016) Context encoders: feature learning by inpainting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2536–2544. Cited by: §1, §2.1.
  • [11] Z. Pei, X. Chen, and Y. Yang (2018) All-in-focus synthetic aperture imaging using image matting. IEEE Transactions on Circuits and Systems for Video Technology 28 (2), pp. 288–301. Cited by: §1, §2.2, §2.2, Figure 6, Figure 7, Figure 8, §4.1, §4.2, §4.2, Table 1, footnote 1.
  • [12] Z. Pei, Y. Zhang, X. Chen, and Y. Yang (2013) Synthetic aperture imaging using pixel labeling via energy minimization. Pattern Recognition 46 (1), pp. 174–187. Cited by: §1, §2.2, §2.2, footnote 1.
  • [13] Z. Pei, Y. Zhang, T. Yang, X. Zhang, and Y. Yang (2012) A novel multi-object detection method in complex scene using synthetic aperture imaging. Pattern Recognition 45 (4), pp. 1637–1658. Cited by: §1.
  • [14] J. Peng, Z. Xiong, D. Liu, and X. Chen (2018) Unsupervised depth estimation from light field using a convolutional neural network. In 2018 International Conference on 3D Vision (3DV), pp. 295–303. Cited by: §1, §1, §2.3.
  • [15] C. Shin, H. Jeon, Y. Yoon, I. So Kweon, and S. Joo Kim (2018) Epinet: a fully-convolutional neural network using epipolar geometry for depth from light field images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4748–4757. Cited by: §1, §1, §2.3, §3.1.
  • [16] V. Vaish and A. Adams (2008) The (new) stanford light field archive. Computer Graphics Laboratory, Stanford University 6 (7). Cited by: Figure 6, §3.2, §3.3, §4.1.
  • [17] V. Vaish, M. Levoy, R. Szeliski, C. L. Zitnick, and S. B. Kang (2006) Reconstructing occluded surfaces using synthetic apertures: stereo, focus and robust measures. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Vol. 2, pp. 2331–2338. Cited by: §2.2, Figure 6, §4.1, §4.2, §4.2, §4.2.
  • [18] V. Vaish, B. Wilburn, N. Joshi, and M. Levoy (2004) Using plane+ parallax for calibrating dense camera arrays. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Vol. 1, pp. I–I. Cited by: §1, §1, §2.2, Figure 6, Figure 7, §4.2, §4.2, footnote 1.
  • [19] K. Venkataraman, D. Lelescu, J. Duparré, A. McMahon, G. Molina, P. Chatterjee, R. Mullis, and S. Nayar (2013) Picam: an ultra-thin high performance monolithic camera array. ACM Transactions on Graphics 32 (6), pp. 166. Cited by: §1.
  • [20] L. Wang, Y. Wang, Z. Liang, Z. Lin, J. Yang, W. An, and Y. Guo (2019) Learning parallax attention for stereo image super-resolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §3.1.
  • [21] Y. Wang, J. Yang, Y. Guo, C. Xiao, and W. An (2018) Selective light field refocusing for camera arrays using bokeh rendering and superresolution. IEEE Signal Processing Letters 26 (1), pp. 204–208. Cited by: §4.1.
  • [22] Y. Wang, F. Liu, Z. Wang, G. Hou, Z. Sun, and T. Tan (2018) End-to-end view synthesis for light field imaging with pseudo 4dcnn. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 333–348. Cited by: §1, §2.3.
  • [23] Y. Wang, F. Liu, K. Zhang, G. Hou, Z. Sun, and T. Tan (2018) LFNet: a novel bidirectional recurrent convolutional neural network for light-field image super-resolution. IEEE Transactions on Image Processing 27 (9), pp. 4274–4286. Cited by: §1, §1, §2.3.
  • [24] S. Wanner, S. Meister, and B. Goldluecke (2013) Datasets and benchmarks for densely sampled 4d light fields.. In Vision, Modelling and Visualization (VMV), Vol. 13, pp. 225–226. Cited by: §3.2.
  • [25] B. Wilburn, N. Joshi, V. Vaish, M. Levoy, and M. Horowitz (2004) High-speed videography using a dense camera array. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Vol. 2, pp. II–II. Cited by: §1.
  • [26] B. Wilburn, N. Joshi, V. Vaish, E. Talvala, E. Antunez, A. Barth, A. Adams, M. Horowitz, and M. Levoy (2005) High performance imaging using large camera arrays. In ACM Transactions on Graphics, Vol. 24, pp. 765–776. Cited by: §1.
  • [27] G. Wu, Y. Liu, Q. Dai, and T. Chai (2019) Learning sheared epi structure for light field reconstruction. IEEE Transactions on Image Processing. Cited by: §1, §2.3.
  • [28] G. Wu, Y. Liu, L. Fang, Q. Dai, and T. Chai (2018) Light field reconstruction using convolutional network on epi and extended applications. IEEE Transactions on Pattern Analysis and Machine Intelligence. Cited by: §1, §2.3.
  • [29] G. Wu, B. Masia, A. Jarabo, Y. Zhang, L. Wang, Q. Dai, T. Chai, and Y. Liu (2017) Light field image processing: an overview. IEEE Journal of Selected Topics in Signal Processing 11 (7), pp. 926–954. Cited by: §4.1.
  • [30] Z. Xiao, L. Si, and G. Zhou (2017) Seeing beyond foreground occlusion: a joint framework for sap-based scene depth and appearance reconstruction. IEEE Journal of Selected Topics in Signal Processing 11 (7), pp. 979–991. Cited by: §1, §2.2, §2.2, footnote 1.
  • [31] C. Yang, X. Lu, Z. Lin, E. Shechtman, O. Wang, and H. Li (2017) High-resolution image inpainting using multi-scale neural patch synthesis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 6721–6729. Cited by: §1, §2.1.
  • [32] T. Yang, Y. Zhang, X. Tong, X. Zhang, and R. Yu (2011) Continuously tracking and see-through occlusion based on a new hybrid synthetic aperture imaging model. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3409–3416. Cited by: §1.
  • [33] T. Yang, Y. Zhang, X. Tong, X. Zhang, and R. Yu (2013) A new hybrid synthetic aperture imaging model for tracking and seeing people through occlusion. IEEE Transactions on Circuits and Systems for Video Technology 23 (9), pp. 1461–1475. Cited by: §1.
  • [34] T. Yang, Y. Zhang, J. Yu, J. Li, W. Ma, X. Tong, R. Yu, and L. Ran (2014) All-in-focus synthetic aperture imaging. In European Conference on Computer Vision (ECCV), pp. 1–15. Cited by: §1, §2.2, §2.2, §4.1, footnote 1.
  • [35] H. W. F. Yeung, J. Hou, X. Chen, J. Chen, Z. Chen, and Y. Y. Chung (2019) Light field spatial super-resolution using deep efficient spatial-angular separable convolution. IEEE Transactions on Image Processing 28 (5), pp. 2319–2330. Cited by: §1, §1, §2.3.
  • [36] Y. Yoon, H. Jeon, D. Yoo, J. Lee, and I. So Kweon (2015) Learning a deep convolutional network for light-field image super-resolution. In Proceedings of the IEEE International Conference on Computer Vision Workshops (ICCVW), pp. 24–32. Cited by: §2.3.
  • [37] J. Yu, Z. Lin, J. Yang, X. Shen, X. Lu, and T. S. Huang (2018) Generative image inpainting with contextual attention. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5505–5514. Cited by: §1, §2.1, §4.2.
  • [38] S. Zhang, Y. Lin, and H. Sheng (2019) Residual networks for light field image super-resolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 11046–11055. Cited by: §1, §1, §2.3, §3.1.
  • [39] Z. Zhang (2000) A flexible new technique for camera calibration. IEEE Transactions on Pattern Analysis and Machine Intelligence 22. Cited by: §4.1.