Non-local Recurrent Regularization Networks for Multi-view Stereo

10/13/2021 · by Qingshan Xu, et al.

In deep multi-view stereo networks, cost regularization is crucial to achieve accurate depth estimation. Since 3D cost volume filtering is usually memory-consuming, recurrent 2D cost map regularization has recently become popular and has shown great potential in reconstructing 3D models of different scales. However, existing recurrent methods only model the local dependencies in the depth domain, which greatly limits the capability of capturing the global scene context along the depth dimension. To tackle this limitation, we propose a novel non-local recurrent regularization network for multi-view stereo, named NR2-Net. Specifically, we design a depth attention module to capture non-local depth interactions within a sliding depth block. Then, the global scene context between different blocks is modeled in a gated recurrent manner. This way, the long-range dependencies along the depth dimension are captured to facilitate the cost regularization. Moreover, we design a dynamic depth map fusion strategy to improve the algorithm robustness. Our method achieves state-of-the-art reconstruction results on both DTU and Tanks and Temples datasets.


Introduction

Multi-view stereo (MVS) has been a hot research topic in computer vision for decades. It aims to recover the 3D geometry of a scene from a set of images with known camera parameters. Nowadays, multi-view stereo reconstruction is typically decomposed into two separate steps: depth map estimation and fusion Galliani2015Massively; Schonberger2016Pixelwise; Xu2019Multi; Xu2020Learning; Xu2020Planar; Yao2018MVSNet; Gu2019Cas. For these two steps, accurate depth map estimation is often challenging due to a variety of real-world problems, e.g., low-textured areas, thin structures, occlusions and reflective surfaces.

Figure 1: Illustration of different regularization methods. 3D filtering methods require cubic memory complexity while recurrent methods only have quadratic complexity. Our non-local recurrent regularization explicitly establishes interactions between non-adjacent depth values to capture more global context along the depth dimension, which previous recurrent regularization methods cannot achieve.

Recently, many learning-based MVS methods Yao2018MVSNet; Yao2019RMVSNet; Xu2020Learning; Gu2019Cas have been presented that achieve competitive or better results than traditional methods relying on optimization of photometric consistency Furukawa2010Accurate; Schonberger2016Pixelwise. Inspired by plane-sweeping stereo Collins1996Space, most learning-based methods first construct multi-channel cost volumes based on extracted deep image features and then perform deep cost volume processing to predict depth maps. The core of successful learning-based MVS methods is the deep regularization of cost volumes, which encodes local and global scene information. This regularization converts the cost volume into a latent probability volume that directly influences the final depth inference.

At present, the deep regularization of cost volumes can be categorized into two types: 3D cost volume filtering and recurrent 2D cost map regularization. The former regularizes the entire 3D cost volume in one go, so 3D convolutions can be seamlessly utilized to incorporate scene context Yao2018MVSNet; Xu2020Learning; Luo2019P. In general, this kind of method is memory-consuming since its memory requirement is cubic in the model resolution (cf. Fig. 1(b)). Although many follow-up works Chen2019Point; Gu2019Cas; Cheng2020UCSNet; Yang2020CVPMVSNet have been proposed to alleviate this problem, the scalability of 3D cost volume filtering remains limited for high-resolution and large-scale 3D reconstruction. As another line of research, the latter Yao2019RMVSNet; Yan2020Dense sequentially regularizes 2D cost maps along the depth direction, reducing the memory requirement from cubic to quadratic in the model resolution. Moreover, such methods can adaptively sample sufficient depth hypotheses for scenes of different scales. To aggregate spatial as well as temporal context information in the depth direction, recurrent networks, e.g., gated recurrent units (GRU) and long short-term memory (LSTM), are adopted in such methods. However, these methods only explicitly consider the information interaction between adjacent depth values, so the long-range dependencies along the depth dimension cannot be fully captured (cf. Fig. 1(c)).

To tackle the above limitation of recurrent 2D cost map regularization, we propose a novel non-local recurrent regularization network for multi-view stereo, named NR2-Net. Built upon regular recurrent neural networks, we divide the depth sampling space into different blocks to model non-local interactions between non-adjacent depth planes. Within each block, we design a depth attention module to distill latent high-level features. This module samples cost map features at every other depth plane to enable large receptive fields, analogous to dilated convolution Chen2018DeepLab. Based on these cost map features, the latent high-level features are distilled via an attention mechanism. The high-level features of different blocks then interact in a gated recurrent manner to capture global scene context, which is used to regularize the bottom-level cost map features in the next block. In this way, the long-range dependencies along the depth dimension are modeled and the global scene context is perceived to guide long-range cost map regularization (cf. Fig. 1(d)). Finally, to fuse depth maps into point clouds, existing learning-based methods usually predefine a constant depth probability threshold to filter out unreliable depth estimates. However, due to the uncertainty of depth prediction networks Kendall2017Uncertainties, this constant threshold discards many credible depth values, resulting in point clouds of low completeness. We therefore adaptively associate the depth probability threshold with depth consistency to generate more complete point clouds.

Our contributions can be summarized as follows: 1) We present a novel non-local recurrent regularization framework for multi-view stereo, which allows the network to perceive more global context information in the depth direction to assist cost volume regularization. 2) We propose a depth attention module to distill latent high-level features along the depth dimension, which helps to model non-local interactions between non-adjacent depth hypotheses in a gated recurrent manner. 3) We develop a dynamic depth map fusion strategy to reconstruct point clouds. This strategy jointly considers the depth probability and depth consistency in a dynamic way, which is robust across different scenes. Our method, NR2-Net, achieves state-of-the-art performance on both the DTU and Tanks and Temples datasets.

Related Work

Traditional multi-view stereo. In order to estimate depth maps for all input images, multi-view stereo needs to search for correspondences across different images. Plane-sweeping methods Collins1996Space; Gallup2007Real sample depth plane hypotheses in the 3D scene. They then employ hand-crafted similarity metrics, e.g., the sum of absolute differences (SAD) and normalized cross correlation (NCC), to construct cost volumes and extract the final depth maps via the winner-take-all strategy. Since these similarity metrics are ambiguous in some challenging areas, e.g., low-textured areas and reflective surfaces, some methods adopt engineered regularization techniques, such as graph cuts Kolmogorov2002Multi and cost filtering Hosni2013Fast, to alleviate this problem. On the other hand, PatchMatch MVS methods Zheng2014PatchMatch; Galliani2015Massively; Schonberger2016Pixelwise; Xu2019Multi; Xu2020Planar adopt the sampling and propagation strategy Barnes2009PatchMatch to efficiently search continuous depth plane hypotheses over the whole depth interval. These methods impose implicit smoothness constraints through plane propagation in the 3D scene. Although they have greatly improved the performance of traditional methods, handling the ambiguity in challenging areas remains an open problem.

Figure 2: The NR2-Net architecture. Our network first extracts high-resolution deep features for all input images. Then cost maps at different depth planes are computed by differentiable homography warping and regularized by convolutional LSTM cells. To perform non-local recurrent cost map regularization, the depth planes are divided into different blocks. The non-local interactions within each block are modeled by a depth attention module to distill latent high-level features. These high-level features then interact in a gated recurrent manner to capture global context information, which helps regularize the hidden states of the LSTM cells in the next depth block.

Learning-based multi-view stereo. Recently, some works have leveraged deep neural networks to learn deep similarity features and deep regularization for multi-view stereo Hartmann2017Learned; Ji2017Surfacenet; Kar2017Learning. MVSNet Yao2018MVSNet and DeepMVS Huang2018DeepMVS propose to construct cost volumes to learn a depth map for each input image. This makes learning-based MVS methods more scalable and applicable to scene reconstruction. To infer depth maps, MVSNet and many follow-up works Xue2019MVSCRF; Luo2019P; Xu2020Learning utilize multi-scale 3D convolutions to regularize cost volumes. However, due to the limitation of cost volume resolution, these methods still cannot handle high-resolution images. To predict high-resolution depth maps, cascade methods Cheng2020UCSNet; Gu2019Cas; Yang2020CVPMVSNet employ a coarse-to-fine framework to infer high-resolution estimates via thin cost volumes. However, these cascade methods rely heavily on coarse-scale estimation, which often lacks important detail and may therefore blur the high-resolution estimation. To achieve high-resolution prediction, R-MVSNet Yao2019RMVSNet recurrently regularizes cost maps through GRU cells, dramatically reducing memory consumption. However, generic GRU cells do not allow the regularization to consider enough context information. HC-RMVSNet Yan2020Dense combines the merits of LSTMs and 2D U-Nets to present a 2D U-Net architecture with convolutional LSTM cells. This architecture considers more context information in the image domain and can directly operate on high-resolution image features. In contrast to our work, existing recurrent methods only explicitly consider interactions between adjacent depth planes and thus lack global context along the depth dimension.

Method

The network architecture of our method, NR2-Net, is depicted in Fig. 2. The input to our network is a reference image $I_0$ and $N-1$ source images $\{I_i\}_{i=1}^{N-1}$, where $N$ is the total number of input images. Their camera intrinsic parameters and relative extrinsic parameters are known. Our NR2-Net first extracts high-resolution deep features for all input images. The cost maps at different depth planes are computed by homography warping. Our non-local recurrent regularization framework is built upon convolutional LSTM cells to regularize the 2D cost maps. To achieve non-local interactions between non-adjacent depth planes, we divide the whole depth interval into different blocks. Within each block, a depth attention module models the interactions among depth planes to distill latent high-level features. Global context information is captured by further interacting the high-level features across blocks in a gated recurrent manner. This global context is in turn used to regularize the hidden states of the LSTM cells in the next block, helping each LSTM cell perceive more global context in the depth direction. Finally, we convert these hidden states into regularized cost maps to predict the depth map of the reference image.

Cost Map Construction

Similar to previous works, we use the plane-sweeping algorithm Collins1996Space to construct cost maps at different depth planes. Specifically, after extracting high-resolution deep image features for all input images with stacked dilated convolutions Yan2020Dense, we compute the warped feature maps $\tilde{\mathbf{F}}_{i,d}$ of the source images at depth value $d$ by differentiable homography warping. The 2D cost map $C_d$ at depth value $d$ is then calculated by:

$C_d = \dfrac{\sum_{i=1}^{N-1} w_i \odot (\tilde{\mathbf{F}}_{i,d} - \mathbf{F}_0)^2}{\sum_{i=1}^{N-1} w_i}$,  (1)

where $\mathbf{F}_0$ denotes the reference image features, $w_i$ is the view weight of the $i$-th source image and '$\odot$' represents element-wise multiplication. Herein, $w_i$ is computed by applying one convolution layer followed by group normalization and rectified linear units, a residual block He2016Deep and one convolution layer followed by a sigmoid function on the feature difference $(\tilde{\mathbf{F}}_{i,d} - \mathbf{F}_0)^2$. In this way, a cost volume can be obtained by concatenating the 2D cost maps of all $D$ depth planes in the depth direction, where $D$ is the total number of depth planes.
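To make the construction concrete, the following PyTorch sketch shows how the view-weighted aggregation of Eq. (1) could be implemented, assuming the source features have already been warped to the current depth plane. Class names, channel sizes and the exact weight-network layout are our own illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn as nn

class ViewWeightNet(nn.Module):
    """Maps a per-view feature difference to a pixel-wise weight in (0, 1).

    Layer layout (conv + GN + ReLU, residual body, conv + sigmoid) follows the
    description above; channel sizes are illustrative assumptions.
    """
    def __init__(self, channels=32):
        super().__init__()
        self.head = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.GroupNorm(4, channels),
            nn.ReLU(inplace=True),
        )
        self.res = nn.Sequential(  # simple residual body
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.GroupNorm(4, channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
        )
        self.out = nn.Sequential(nn.Conv2d(channels, 1, 3, padding=1), nn.Sigmoid())

    def forward(self, diff):                 # diff: [B, C, H, W]
        x = self.head(diff)
        x = torch.relu(x + self.res(x))
        return self.out(x)                   # [B, 1, H, W]

def cost_map(feat_ref, feats_src_warped, weight_net):
    """Weighted aggregation of per-view matching costs at one depth plane.

    feat_ref:          [B, C, H, W] reference features F_0
    feats_src_warped:  list of [B, C, H, W] source features warped to this depth
    """
    num, den = 0.0, 1e-6
    for feat_src in feats_src_warped:
        diff = (feat_src - feat_ref) ** 2    # per-view matching cost
        w = weight_net(diff)                 # pixel-wise view weight w_i
        num = num + w * diff                 # element-wise multiplication
        den = den + w
    return num / den                         # [B, C, H, W] cost map C_d
```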

Non-local Recurrent Regularization

As the core of learning-based MVS methods, cost volume regularization is important for aggregating context information in both the spatial and depth domains. Existing recurrent 2D cost map regularization methods cast the whole cost volume as a series of 2D cost maps along the depth direction and sequentially regularize these cost maps via stacked recurrent neural networks. Although such methods are memory-friendly, they only explicitly consider the local interactions between adjacent depth values.

Inspired by non-local methods Wang2018Non; Fu2019Non, we design a non-local recurrent regularization network to model long-range dependencies along the depth dimension, which helps to regularize cost volumes by aggregating more context information. As illustrated in Fig. 2, we divide the whole cost volume into $K$ blocks, each of which contains $M$ cost maps. Each cost map is regularized by stacked convolutional LSTM cells to incorporate context information in both the spatial and depth domains. As depicted in Fig. 3(a), our stacked convolutional LSTM cells are built upon a 2D U-Net architecture to aggregate multi-scale context information in the spatial domain. Moreover, these LSTM cells comprise one non-local LSTM cell and four vanilla LSTM cells; the non-local LSTM cell incorporates more context information in the depth direction.

Next we elaborate on the mechanism of our non-local LSTM cell. As illustrated in Fig. 3(b), besides taking the current cost map $C_d$, the previous regularized cost map $H_{d-1}$ and the previous cell state map $S_{d-1}$ as input, like a vanilla LSTM cell, the non-local LSTM cell also incorporates the cell state map of the previous block, $\mathcal{S}_{k-1}$, to consider long-range dependencies in the depth direction. How $\mathcal{S}_{k-1}$ is obtained through the depth attention module is detailed in the next section. With these inputs, the forget gate map $F_d$, candidate state map $\tilde{C}_d$, input gate map $I_d$ and output gate map $O_d$ are respectively modeled as:

$F_d = \sigma(W_f * [C_d, H_{d-1}] + b_f)$,  (2)
$\tilde{C}_d = \tanh(W_c * [C_d, H_{d-1}] + b_c)$,  (3)
$I_d = \sigma(W_i * [C_d, H_{d-1}] + b_i)$,  (4)
$O_d = \sigma(W_o * [C_d, H_{d-1}] + b_o)$,  (5)

where '$*$' denotes convolution, '$[\cdot,\cdot]$' denotes concatenation, each $W$ is a transformation matrix, each $b$ is a bias term and $\sigma$ is the sigmoid function. Then, the current cell state map $S_d$ and the current regularized cost map $H_d$ are computed as

$S_d = F_d \odot S_{d-1} + I_d \odot \tilde{C}_d + G_d \odot \mathcal{S}_{k-1}$,  (6)
$H_d = O_d \odot \tanh(S_d)$,  (7)

where $G_d$ is the depth attention gate map given by

$G_d = \sigma(W_g * [C_d, H_{d-1}] + b_g)$.  (8)

Note that the last term on the right side of Eq. (6) reflects the non-local interactions in the depth direction; this is what distinguishes our non-local LSTM cell from the vanilla LSTM cell. Finally, the regularized cost maps $\{H_d\}_{d=1}^{D}$ pass through a softmax layer to produce the probability volume $P$. In this way, our regularization not only incorporates multi-scale context information in the spatial domain, but also aggregates long-range dependencies in the depth direction.
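The following sketch summarizes the non-local LSTM cell update of Eqs. (2)-(8) in PyTorch. Class and argument names are our own illustrative assumptions, and a single convolution produces all five gate maps for brevity; this is a sketch of the update rule, not the authors' code.

```python
import torch
import torch.nn as nn

class NonLocalLSTMCell(nn.Module):
    """ConvLSTM cell with an extra depth-attention gate that injects the
    cell state of the previous depth block (last term of Eq. 6)."""
    def __init__(self, in_ch, hid_ch):
        super().__init__()
        # One convolution produces all five gate maps: f, c~, i, o, g.
        self.gates = nn.Conv2d(in_ch + hid_ch, 5 * hid_ch, 3, padding=1)

    def forward(self, cost, h_prev, s_prev, s_block_prev):
        # cost: current cost map C_d; h_prev/s_prev: previous hidden/cell state;
        # s_block_prev: cell state of the previous depth block (non-local term).
        x = torch.cat([cost, h_prev], dim=1)
        f, c_hat, i, o, g = torch.chunk(self.gates(x), 5, dim=1)
        f, i, o, g = (torch.sigmoid(t) for t in (f, i, o, g))
        c_hat = torch.tanh(c_hat)
        s = f * s_prev + i * c_hat + g * s_block_prev  # Eq. (6)
        h = o * torch.tanh(s)                          # Eq. (7)
        return h, s
```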

Figure 3: Illustrations of (a) structure of 2D U-Net with convolutional LSTM and (b) structure of non-local LSTM cell.

Depth Attention Module

As described in the previous section, the depth attention module plays an important role in distilling latent high-level features within each block: it captures discriminative depth context information for each block. The high-level features of different blocks then interact in a gated recurrent manner to capture global depth context information. Finally, this information imposes explicit long-range regularization constraints on non-adjacent depth values, as described above.

Figure 4: Structure of the depth attention module.

The structure of the depth attention module is illustrated in Fig. 4. For each block, we only use every other depth value, i.e., $M/2$ depth planes, to sample the raw cost features and the regularized cost features of spatial size $H \times W$, where $H$ and $W$ denote the image height and width respectively. This enables larger receptive fields in the depth direction. We first concatenate these two kinds of cost features into combined cost features. Such a dense connection helps to explore their potential interactions and learn non-local information Huang2017Densely. Then we use max-pooling and average-pooling along the depth dimension to aggregate discriminative depth information, generating two different depth context features. As pointed out in Woo2018Cbam, max-pooling gathers distinctive clues while average-pooling captures the extent of the depth space; using both highlights the informative depth values as much as possible. These two features are further concatenated and forwarded through one convolution layer and a residual block to produce the attention features $A_k$.

To further model the global context information in the depth direction, the attention features of different blocks interact through another recurrent unit. Specifically, we reshape the raw cost features of the $k$-th block into a single feature map $\hat{C}_k$ by folding the depth dimension into the channel dimension. The input gate map $\mathcal{I}_k$ and the forget gate map $\mathcal{F}_k$ are modeled as

$\mathcal{I}_k = \sigma(W_{\mathcal{I}} * [\hat{C}_k, A_k] + b_{\mathcal{I}})$,  (9)
$\mathcal{F}_k = \sigma(W_{\mathcal{F}} * [\hat{C}_k, A_k] + b_{\mathcal{F}})$,  (10)

where each $W$ is a transformation matrix and each $b$ is a bias term. Finally, we update the cell state map of the current block as

$\mathcal{S}_k = \mathcal{F}_k \odot \mathcal{S}_{k-1} + \mathcal{I}_k \odot A_k$.  (11)

By updating the cell states of different blocks in a gated recurrent manner, our method is able to capture global context information in the depth direction. These block cell states are in turn used to update the cell states of the LSTM cells, as in Eq. (6).
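A compact sketch of the depth attention module follows. The pooling and gating steps mirror Eqs. (9)-(11), while the exact layer layout, the handling of the sampled raw and regularized cost features, and all names are simplifying assumptions on our part.

```python
import torch
import torch.nn as nn

class DepthAttention(nn.Module):
    """Distills block-level features and updates the block cell state in a
    gated recurrent manner (sketch of Eqs. 9-11; layouts are assumptions)."""
    def __init__(self, ch):
        super().__init__()
        self.att = nn.Sequential(            # pooled features -> attention features A_k
            nn.Conv2d(2 * ch, ch, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1),
        )
        self.gates = nn.Conv2d(2 * ch, 2 * ch, 3, padding=1)  # input/forget gates

    def forward(self, feats, s_block_prev):
        # feats: [B, M//2, ch, H, W] cost features sampled at every other plane
        f_max = feats.max(dim=1).values      # max-pooling along depth
        f_avg = feats.mean(dim=1)            # average-pooling along depth
        a = self.att(torch.cat([f_max, f_avg], dim=1))         # attention features A_k
        i, f = torch.chunk(self.gates(torch.cat([f_avg, a], dim=1)), 2, dim=1)
        i, f = torch.sigmoid(i), torch.sigmoid(f)              # Eqs. (9), (10)
        return f * s_block_prev + i * a                        # Eq. (11): new block state
```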

Loss Function

Following previous practices Yao2019RMVSNet; Yan2020Dense, we cast the depth inference task as a multi-class classification problem. We use the following cross-entropy loss to train our network:

$\mathcal{L} = -\sum_{p \in \Omega} \sum_{d=1}^{D} g_d(p) \log P_d(p)$,  (12)

where $\Omega$ denotes the set of valid ground truth pixels, $g_d(p)$ is the $d$-th entry of the one-hot vector generated from the ground truth depth map at pixel $p$, and $P_d(p)$ is the predicted probability of depth plane $d$ at pixel $p$. Note that we do not need to store the whole classification probability volume during testing. Moreover, since the cell states of the blocks are also updated sequentially, our depth attention module does not occupy much memory.
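As a minimal sketch, the classification loss of Eq. (12) can be written as follows; the tensor layout is an assumption on our side.

```python
import torch
import torch.nn.functional as F

def mvs_classification_loss(prob_volume, gt_index, valid_mask):
    """Cross-entropy over depth hypotheses (Eq. 12) at valid ground-truth pixels.

    prob_volume: [B, D, H, W] softmax probabilities over D depth planes
    gt_index:    [B, H, W] index of the ground-truth depth plane
    valid_mask:  [B, H, W] boolean mask of pixels with ground truth
    """
    log_prob = torch.log(prob_volume.clamp(min=1e-8))
    nll = F.nll_loss(log_prob, gt_index, reduction="none")  # [B, H, W]
    return nll[valid_mask].mean()
```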

Dynamic Depth Map Fusion

After estimating depth maps for all input images, it is necessary to filter out wrong depth estimates in each depth map and then fuse the depth maps into a consistent point cloud representation. There are two critical factors for filtering wrong depth predictions: depth probability and depth consistency. The depth probability $P(p)$ is usually measured by the probability of the selected depth at pixel $p$. The depth consistency is usually measured by geometric constraints: reprojection errors and relative depth errors Yao2018MVSNet. For a pixel $p$ in the reference image, we denote its reprojection error w.r.t. the source image $I_i$ by $\epsilon_i(p)$ and its relative depth error w.r.t. $I_i$ by $\delta_i(p)$. Then, its consistency view set is defined as:

$S(p) = \{\, I_i \mid \epsilon_i(p) < \tau_1,\ \delta_i(p) < \tau_2 \,\}$,  (13)

where $\tau_1$ and $\tau_2$ are the corresponding thresholds. If $|S(p)| \geq N_c$ and $P(p) \geq \tau_p$, the depth estimate of $p$ is considered reliable, where $N_c$ is the threshold on the number of consistent views and $\tau_p$ is the probability threshold. Previous methods Yao2018MVSNet; Yao2019RMVSNet; Gu2019Cas usually set fixed threshold parameters to remove unreliable depth estimates. However, these intuitively preset parameters make their fusion methods less robust across different scenes. Thus, HC-RMVSNet Yan2020Dense proposes a dynamic consistency checking algorithm to measure depth consistency. Its dynamic consistency view set is defined as:

$S_{dyn}(p, n) = \{\, I_i \mid \epsilon_i(p) < n \cdot \tau_1,\ \delta_i(p) < n \cdot \tau_2 \,\}$.  (14)

The above definition allows the thresholds of the geometric constraints to be adaptively adjusted according to the consistent view number $n$. This means that an estimated depth value is deemed accurate and reliable when it satisfies a strict depth consistency in a small number of views, or a relaxed depth consistency in the majority of views. However, this method still adopts a fixed depth probability threshold to filter depths at different levels of depth consistency. Due to the uncertainty of depth prediction networks Kendall2017Uncertainties, such a fixed threshold sometimes filters out credible depth values that meet the strict depth consistency, making the reconstructed point clouds less complete. Thus, the depth probability threshold should also be adjusted according to the level of depth consistency. Based on these observations and our experiments, we define the dynamic probability threshold as:

$\tau_p(n) = \min(n \cdot \tau_0,\ 1)$,  (15)

where $\tau_0$ is a base probability threshold. By traversing different values of $n$, the estimated depth of $p$ is deemed accurate and reliable if $|S_{dyn}(p, n)| \geq n$ and $P(p) \geq \tau_p(n)$ for some $n$. That is, when the reprojection error and relative depth error constraints are strict, both the required number of consistent views and the depth probability threshold are relaxed when judging whether the depth prediction is reliable. Finally, the reliable depth estimates are projected into 3D space to generate 3D point clouds.
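The sketch below illustrates this dynamic filtering logic for a single pixel. The threshold constants and the linear schedules for the geometric and probability thresholds are illustrative assumptions chosen to match the behavior described above, not the paper's exact values.

```python
import numpy as np

def is_reliable(reproj_err, rel_depth_err, prob,
                tau1=1.0, tau2=0.01, tau0=0.15, n_max=10):
    """Dynamic depth filtering for one pixel (sketch; constants are assumptions).

    reproj_err, rel_depth_err: [V] per-source-view errors for the pixel
    prob: scalar depth probability of the pixel
    """
    for n in range(1, n_max + 1):
        # Geometric thresholds relax as n grows (Eq. 14) ...
        consistent = np.sum((reproj_err < n * tau1) & (rel_depth_err < n * tau2))
        # ... while the probability threshold tightens with n (Eq. 15).
        if consistent >= n and prob >= min(n * tau0, 1.0):
            return True
    return False
```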

Method Acc. Comp. Overall
Geometric Furu 0.613 0.941 0.777
Tola 0.342 1.190 0.766
Gipuma 0.283 0.873 0.578
COLMAP 0.400 0.664 0.532
Learning-based SurfaceNet 0.450 1.040 0.745
MVSNet 0.396 0.527 0.462
R-MVSNet 0.383 0.452 0.417
CasMVSNet 0.325 0.385 0.355
CVP-MVSNet 0.296 0.406 0.351
UCSNet 0.338 0.349 0.344
AttMVS 0.391 0.345 0.368
HC-RMVSNet 0.395 0.378 0.386
Ours 0.370 0.332 0.351
Table 1: Quantitative results on the DTU evaluation set using the distance metric [mm] (lower is better). Our method achieves the best completeness and the second best overall score. The best and second best results are in bold and underlined, respectively.
Figure 5: Qualitative results of reconstructed point clouds for scan 9 from the DTU evaluation set.

Experiments

We first describe datasets and implementation details. Then, we show benchmark results on the DTU dataset Aanes2016Large as well as the Tanks and Temples dataset Knapitsch2017TTB. Finally, we conduct ablation studies to analyze our proposed core components.

Method Acc. Comp. F-score
Geo. COLMAP 43.16 44.48 42.14
ACMM 49.19 70.85 57.27
ACMP 49.06 73.58 58.41
Learning-based MVSNet 40.23 49.70 43.48
R-MVSNet 43.74 57.90 48.40
CVP-MVSNet 51.41 60.19 54.03
UCSNet 46.66 70.34 54.83
CasMVSNet 53.71 63.88 56.84
HC-RMVSNet 49.88 74.08 59.20
AttMVS 61.89 58.93 60.05
Ours 55.72 70.28 60.49
Table 2: Quantitative results on the Tanks and Temples Intermediate set. Our method achieves the best F-score. The best and second best results are in bold and underlined, respectively.
Figure 6: Qualitative results of reconstructed point clouds on Tanks and Temples dataset.

Datasets and Evaluation Metrics

DTU dataset. This dataset is captured at 49 or 64 views per scene under a controlled indoor environment and contains 124 scenes with different lighting conditions. The ground truth camera parameters and 3D point clouds are both provided. Following Ji2017Surfacenet; Yao2018MVSNet, we divide this dataset into training, validation and evaluation sets.

Tanks and Temples dataset. This dataset contains both indoor and outdoor scenes, captured in realistic environments. Moreover, unlike the object-centric scenes in DTU, these scenes are larger and more complex. The dataset is divided into an Intermediate set and an Advanced set; the latter is more challenging due to its scale, textureless regions and complicated visibility.

Evaluation metrics. To evaluate the performance of our method on the different datasets, the distance metric and the percentage metric are used for DTU and Tanks and Temples, respectively. For the distance metric, the overall score, defined as the mean of the accuracy and completeness measures, summarizes the quality of the reconstructed point clouds. For the percentage metric, the F-score, defined as the harmonic mean of accuracy and completeness, is adopted.
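Concretely, writing acc. and comp. for the accuracy and completeness measures, the two summary scores are:

$\text{overall} = \dfrac{\text{acc.} + \text{comp.}}{2}, \qquad F\text{-score} = \dfrac{2 \cdot \text{acc.} \cdot \text{comp.}}{\text{acc.} + \text{comp.}}$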

Implementation Details

Training. We train our network with ground truth depth maps on the DTU training set; the ground truth depth maps are generated through Poisson surface reconstruction Kazhdan2013SPS. Due to the limitation of GPU memory, the input images are downsampled for training, and a fixed number of input images is used per training sample. The depth hypotheses are sampled over the scene depth range in an inverse depth manner and grouped into blocks of fixed size. Our network is implemented in PyTorch NIPS2019PyTorch and trained end-to-end with the Adam optimizer Kingma2015Adam on two NVIDIA RTX 2080Ti GPU cards. Note that, to prevent the estimated depth maps from being biased by the recurrent regularization order, we randomly use a forward or backward pass over the depth planes for each training sample.
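For reference, inverse-depth sampling of the hypothesis planes can be sketched as below; the function and argument names are our own.

```python
import torch

def inverse_depth_samples(d_min, d_max, num_planes):
    """Sample depth hypotheses uniformly in inverse depth (a minimal sketch).

    Uniform steps in 1/d place more hypotheses near the camera, which suits
    large depth ranges better than uniform steps in d.
    """
    inv = torch.linspace(1.0 / d_min, 1.0 / d_max, num_planes)
    return 1.0 / inv  # depths ordered from d_min to d_max
```

For example, `inverse_depth_samples(0.5, 100.0, 512)` spaces 512 hypotheses densely near 0.5 and sparsely toward 100.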

Evaluation. During evaluation, we only use the forward pass to infer depth maps. We sample the depth planes in an inverse depth manner Yao2019RMVSNet; Xu2020Learning and set the input image number to 7. For both datasets, high-resolution input images are used; for the Tanks and Temples dataset, we use the camera parameters provided by R-MVSNet Yao2019RMVSNet. Following previous practices Yao2019RMVSNet; Yan2020Dense, we first take the maximum classification probability as the depth probability of each estimated depth. Then, we apply the proposed dynamic depth map fusion method to generate the final point clouds.

Benchmarking Results

In this section, we directly use our model trained on the DTU training set without any fine-tuning for benchmarking evaluations. We compare our method with other state-of-the-art MVS methods, including geometric and learning-based methods. The geometric methods include Furu Furukawa2010Accurate, Tola Tola2012Efficient, Gipuma Galliani2015Massively, COLMAP Schonberger2016Pixelwise, ACMM Xu2019Multi and ACMP Xu2020Planar. For the learning-based methods, SurfaceNet Ji2017Surfacenet, MVSNet Yao2018MVSNet, R-MVSNet Yao2019RMVSNet, CasMVSNet Gu2019Cas, CVP-MVSNet Yang2020CVPMVSNet, UCSNet Cheng2020UCSNet, AttMVS Luo2020Attention and HC-RMVSNet Yan2020Dense are compared.

Results on DTU. We evaluate our method on the DTU evaluation set. The comparison results are shown in Table 1. Our method achieves the best completeness and a competitive overall score among the compared methods. In particular, our method outperforms the previous recurrent regularization methods, including R-MVSNet and HC-RMVSNet. Moreover, our method is very competitive with the 3D filtering methods, e.g., CVP-MVSNet and UCSNet. The qualitative comparisons in Fig. 5 show that our reconstructed point clouds are more complete than those of UCSNet and HC-RMVSNet.

Results on Tanks and Temples. We evaluate the generalization ability of NR2-Net on the Tanks and Temples dataset. Table 2 summarizes the evaluation results. As can be seen, our method yields the best mean F-score, 60.49, among all published methods, demonstrating that our method generalizes better than the others. In addition, our method performs much better on this dataset than the MVS methods based on 3D cost volume filtering. We believe this is because the scenes in Tanks and Temples are larger than those in DTU, while the 3D cost volume filtering methods can only sample a limited number of depth planes due to GPU memory limitations. In contrast, our method can sample sufficient depth values and better captures the long-range dependencies in the depth direction, making it more practical. Fig. 6 shows qualitative results of our reconstructed point clouds.

Model NLR DF Acc. Comp. Overall
Baseline – – 0.350 0.473 0.411
Model-A – ✓ 0.378 0.405 0.391
Model-B ✓ – 0.344 0.364 0.354
Full ✓ ✓ 0.370 0.332 0.351
Table 3: Ablation study of different components in our NR2-Net. NLR and DF mean non-local recurrent regularization and dynamic depth map fusion respectively.
Figure 7: Visual comparison of the predicted probability distribution at one pixel between the Baseline (a) and the Full model (b). The cell states of the depth blocks indicate that the circled region belongs to the background and constrain the probability distribution of the cost map regularization to produce correct estimates.

Method Input Size Output Size Mem. Time
MVSNet 800×576 200×144 8.9 0.78
R-MVSNet 800×576 200×144 1.2 0.89
DHC-RMVSNet 200×144 200×144 0.9 2.67
Ours 200×144 200×144 1.0 1.90
CVP-MVSNet 800×576 800×576 2.2 0.49
DHC-RMVSNet 800×576 800×576 2.4 20.94
Ours 800×576 800×576 3.4 16.65
Table 4: Evaluation of GPU memory [GB] and runtime [s] for different methods.

Ablation Study

Effect of different components. In this section, we conduct an ablation study on the DTU evaluation set to verify the effectiveness of the non-local recurrent regularization and the dynamic depth map fusion modules in the proposed NR2-Net. The baseline model is created by removing the non-local recurrent regularization module and replacing the dynamic depth map fusion with dynamic consistency checking, i.e., the dynamic probability threshold of Eq. (15) degenerates to a constant threshold as suggested in Yan2020Dense. The baseline model is trained with the same procedure described in the previous section. Based on the baseline model, we add the non-local recurrent regularization module and the dynamic depth map fusion step by step to construct Model-A, Model-B and the Full model. The comparison results are listed in Table 3 and clearly demonstrate the effectiveness of our two core modules. In particular, by comparing Baseline with Model-B, and Model-A with the Full model, we observe that our non-local regularization improves accuracy and greatly boosts completeness, leading to a significant overall improvement. By comparing Baseline with Model-A, and Model-B with the Full model, we see that our dynamic fusion produces more complete point clouds, resulting in better overall performance. This also demonstrates that our fusion generalizes across different depth inference networks.

To further demonstrate the effect of the proposed non-local recurrent regularization, we visualize the cell states of different depth blocks. Since it is hard to give a comprehensive visualization of these cell states, we instead show the probability distribution derived from them to see whether they indicate the global scene context. To this end, we average the features of each block cell state along the channel dimension and apply a softmax operation along the depth dimension over all depth blocks to obtain a probability distribution. Then, this distribution is aligned with the probability distribution over all depth planes by copying the value of each block to the depth planes it covers. The visualization results are shown in Fig. 7. The circled region in Fig. 7 is challenging: it is close to the central object in the spatial domain but actually belongs to the background. Without non-local recurrent regularization, the baseline model easily propagates the depth information of the central object into the circled region, resulting in incorrect estimates for this region (cf. Fig. 7(a)). With our non-local recurrent regularization, the cell states of the depth blocks indicate that the circled region belongs to the background and constrain the regularization of the LSTM cells (cf. Fig. 7(b)). This demonstrates, to some extent, that the cell states of the depth blocks capture the global scene context to help cost map regularization.
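The visualization procedure just described can be sketched as follows; shapes and names are illustrative assumptions.

```python
import torch

def block_state_probability(block_states, block_size):
    """Turn per-block cell states into a per-plane probability curve for
    visualization (sketch of the procedure described above).

    block_states: [K, C, H, W] cell states of the K depth blocks
    """
    scores = block_states.mean(dim=1)                  # average over channels: [K, H, W]
    probs = torch.softmax(scores, dim=0)               # softmax over blocks
    return probs.repeat_interleave(block_size, dim=0)  # copy to covered planes
```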

Runtime and GPU memory. We evaluate the runtime and GPU memory usage of our method for generating each depth map on the DTU evaluation set. Table 4 compares the results with other methods. For a fair comparison with MVSNet and R-MVSNet, we run the different methods with the same number of depth samples. We see that 3D filtering methods occupy a lot of memory and cannot handle high-resolution input. Recurrent regularization methods greatly reduce memory consumption at the cost of runtime and can handle high-resolution input. Although cascade 3D filtering methods such as CVP-MVSNet employ the coarse-to-fine strategy to accommodate high-resolution input, they can only sample a limited number of depth planes at each stage. This prevents them from perceiving full-space global context in the depth direction, hindering their performance on large-scale scenes in practice, e.g., the Tanks and Temples dataset.

Conclusion

We presented NR2-Net, a novel non-local recurrent regularization network for multi-view stereo. A depth attention module is proposed to model non-local interactions within depth blocks, and these interactions are updated in a gated recurrent manner to model global scene context. In this way, long-range dependencies along the depth direction are captured and in turn utilized to regularize cost maps. In addition, we designed a dynamic depth map fusion strategy that adaptively takes both depth probability and depth consistency into account, further improving the robustness of point cloud reconstruction. Ablation studies demonstrate the effectiveness of the proposed core components. Experimental results show that our method achieves competitive results on the indoor DTU dataset and excellent performance on the large-scale Tanks and Temples dataset.
