## 1 Introduction

The objective of a Multi-View Stereo (MVS) system is the estimation of a dense depth map for a reference image, given one or multiple source images and all the camera poses. This involves computing dense matching costs between the reference image and one or more source images. In recent years, learning-based methods have shown promising results using learned input representations in the form of feature maps and learned similarity measures [mvsnet, casmvs, rmvsnet] for computing the matching costs. However, most learning-based methods discretize the depth space and compute the matching cost at each selected depth for each reference image pixel. The result is a cost volume whose size increases quadratically with respect to the image resolution for a given number of discretization steps. As a result, cost volume methods are subject to computational and memory bottlenecks. Newly proposed cascaded cost volume approaches [casmvs] mitigate these disadvantages, but they discretize the depth space using a pre-determined heuristic, which typically needs to be adapted for different datasets.
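To make the memory bottleneck concrete, the following back-of-the-envelope sketch estimates the size of a dense cost volume. The function name, the float32 storage, and the example values (a 640x512 base resolution, 192 depth steps, 32 channels) are illustrative assumptions, not figures taken from any specific method.

```python
def cost_volume_bytes(height, width, depth_steps, channels=1, bytes_per_value=4):
    """Memory of a dense cost volume: one value per pixel, depth step and channel."""
    return height * width * depth_steps * channels * bytes_per_value


if __name__ == "__main__":
    # Doubling both image sides quadruples the memory for a fixed number of depth steps.
    for scale in (1, 2, 4):
        h, w = 512 * scale, 640 * scale
        gib = cost_volume_bytes(h, w, depth_steps=192, channels=32) / 2**30
        print(f"{w}x{h}: {gib:.1f} GiB")
```

Under these assumptions the volume grows with the number of pixels, which is why high resolution inputs quickly exceed the memory of a single GPU.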

We propose a novel learning-based MVS method that explores the continuous depth space iteratively, without relying on an explicit cost volume. At each iteration, our method computes a pixelwise binary decision mask that estimates whether a given pixel's actual depth is in front of or behind its current depth hypothesis. The binary decision allows a new depth hypothesis to be computed at each pixel, and the hypothesis is refined further at the next iteration. Our work is inspired by the two-view stereo method in [bi3d_stereo]. However, [bi3d_stereo] assumes the same depth hypothesis for each pixel and estimates the depth by means of binary masks computed for a predefined set of depth values, thus building a cost volume. Unlike [bi3d_stereo], we do not construct a cost volume; instead, we propose a novel iterative architecture capable of estimating a binary decision mask for arbitrary depth hypotheses at individual pixels. Moreover, our method targets the MVS scenario and takes advantage of the multiple source images available. In particular, at each iteration, a binary decision mask is generated for each source image and the resulting new depth map hypotheses are fused with a learned weighting scheme inspired by [vismvsnet]. However, unlike [vismvsnet], we do not fuse cost volumes. Instead, we employ the weights to fuse 2D maps within an iterative depth estimation scheme.

We describe the proposed algorithm in detail in Section 3. As our core contributions, we 1.) design a network architecture to estimate pixelwise depth dependent binary decision masks in the MVS setting, 2.) introduce a pixelwise depth inference algorithm based on the prediction from the previous network, 3.) implement a learning-based fusion strategy, inspired by [vismvsnet], for the depth maps predicted from different source images, 4.) verify our results on the popular benchmarks DTU [dtu], Tanks and Temples [tanksandtemples] and ETH3D [eth3d].

## 2 Related work

In this section, we discuss the previously published related work in the MVS field and compare it with our proposed method. Traditional MVS methods rely on hand-crafted similarity measures such as normalized cross-correlation [colmap_mvs]. The depth hypotheses space is typically explored via a plane-sweeping cost volume [plane_sweep_mvs] or the PatchMatch algorithm [colmap_mvs, gipuma, acmm, pcf_mvs]. These methods also introduce techniques for pixelwise source view selection, in order to suppress the influence of matching results from occluded source images [colmap_mvs, acmm]. The main limitation of these methods is represented by their hand-crafted similarity measures. On the other hand, the PatchMatch exploration strategy for the depth hypothesis space and the employed pixelwise view-selection techniques elevate them above learning-based methods.

In recent years, deep-learning-based methods for MVS have received significant attention from the research community. Earlier works focus on learning a feature representation and combine this with a learned similarity measure in the form of a 3D convolutional neural network [deepmvs, mvsnet]. In order to reduce the computational cost of the regularization, recurrent [rmvsnet, dhcrmvsnet] and cascaded [casmvs, ucsnet] approaches were utilized. The addition of CRF-based cost volume regularization was also explored in several works [bp_mvsnet, mvscrf]. Other methods rely on a voxel-based representation of the input [surfacenet, surfacenet_plus] or refine an initial coarse estimate of a point cloud [pointmvsnet]. The attention [attention_neurips] mechanism has been incorporated by several works [attmvsnet, lanet] as well. More recent methods try to combine the benefits of traditional methods with the advantages of learned representation and similarity measures, *e.g.*, by avoiding cost volumes and rather resorting to PatchMatch-based depth exploration strategies [patchmatchnet] or by leveraging pixelwise source view selection [vismvsnet].

Similarly, our proposed method does not employ a cost volume for depth estimation, but still incorporates beneficial concepts from learning-based methods. In fact, it benefits from learned input representations and similarity measures, like learning-based approaches, and it implements an efficient strategy for the exploration of the depth hypothesis space, similarly to traditional methods. Moreover, it employs a learning-based strategy to fuse estimates from different source images resembling the pixelwise view selection of traditional methods.

## 3 Method

In this section we present our MVS method, named IB-MVS due to its Iterative approach and its reliance on Binary decisions. We first provide a system overview of IB-MVS, then elaborate on the details of its novel depth inference algorithm and its network architectures.

### 3.1 System overview

The goal of our system is to estimate a dense depth map of a reference image given source images with . Hereafter denotes a pixel location. Our MVS method is iterative and traverses the continuous space of feasible depth values at each pixel in a binary decision fashion. Inspired by [bi3d_stereo], at each iteration , and for each source image , our method predicts a binary mask with the property:

(1) |

where is the current depth map hypothesis for the reference image and its ground truth depth map. The pixelwise mask is predicted using a convolutional neural network named Decision Network (D-Net). In practice, we predict soft binary decision masks, hence takes values in . Each entry indicates whether is in front of or behind the current depth hypothesis . Our method offsets the current depth hypothesis and produces a new hypothesis compliant with . The new depth map hypotheses , one for each source image, are then fused using learned pixelwise weights from a network named Weight Network (W-Net), in order to produce the next iteration depth map hypothesis . We depict IB-MVS in Figure 1 and provide a visual overview of the entire system in Figure 2.

### 3.2 Depth inference algorithm

The depth inference algorithm of IB-MVS assumes a feasible depth range as input and computes the reference image depth map iteratively. The algorithm operates in the inverse depth domain, as this yields improved results in scenes with large depth ranges. To this end, we introduce the inverse depth map hypothesis and define the inverse depth range bounds and .
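As a small illustration of why working in the inverse depth domain helps with large depth ranges, the sketch below converts a feasible depth range into inverse depth bounds and samples it uniformly. It assumes the bounds are the reciprocals of the depth bounds; the function names and the example range are hypothetical.

```python
def inverse_depth_range(d_min, d_max):
    """Return (inv_min, inv_max) for a feasible depth range [d_min, d_max]."""
    if not 0.0 < d_min < d_max:
        raise ValueError("expected 0 < d_min < d_max")
    # The farthest depth maps to the smallest inverse depth and vice versa.
    return 1.0 / d_max, 1.0 / d_min


def uniform_inverse_samples(d_min, d_max, count):
    """Depths whose inverse depths are evenly spaced over the range (count >= 2)."""
    inv_min, inv_max = inverse_depth_range(d_min, d_max)
    step = (inv_max - inv_min) / (count - 1)
    # Walk from the nearest depth (largest inverse depth) to the farthest.
    return [1.0 / (inv_max - i * step) for i in range(count)]


if __name__ == "__main__":
    print(uniform_inverse_samples(1.0, 100.0, 5))
```

Evenly spaced inverse depths place most samples close to the camera and progressively fewer far away, so a single step size covers both near and distant structures.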

The algorithm starts at iteration by setting at each pixel. It then uses D-Net with as inverse depth hypothesis in order to compute the first set of binary decision masks , one for each source image. The next pixelwise depth map hypotheses at iteration is then calculated as follows:

(2) |

where and is referred to as the step size. The update in Eq. (2) is guided by the binary decision mask , estimated from the current inverse depth hypothesis . For , the sought ground truth depth is in front of the current hypothesis, therefore we step backwards. For , we step forward instead. Since takes values in , the step size performs a smooth update of the current depth hypothesis. The step size decreases at each iteration, which represents the halving of the search space. However, it is noteworthy that the magnitude of the hypothesis update is adaptive (in both directions) thanks to its dependence on , as modeled in Eq. (2). The update is sketched in Figure 1, where for the sake of simplicity one source image is assumed, hence omitting subscript . After computing the new inverse depth map hypothesis for each source image , the fused inverse depth map hypothesis is calculated as:

(3) |

where is the sum of the weights estimated by W-Net for each source image. In [vismvsnet] this normalization was argued to be more beneficial than thresholding. We perform iterations and set the final depth map estimate .
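The procedure described in this section can be sketched for a single pixel as follows. This is a minimal illustration, not the authors' implementation: the callables `decide` and `weigh` stand in for D-Net and W-Net, and the starting point (middle of the range), the first step size (a quarter of the range) and the sign convention of the decision are assumptions made for the example.

```python
def infer_inverse_depth(decide, weigh, inv_min, inv_max, num_sources, iterations=8):
    """Iteratively refine one pixel's inverse depth from soft binary decisions.

    decide(n, hyp) -> value in [0, 1]. ASSUMED convention: values above 0.5
    mean the true inverse depth is larger than the hypothesis `hyp`.
    weigh(n, decision) -> positive fusion weight for source view n.
    """
    hyp = 0.5 * (inv_min + inv_max)      # assumed start: middle of the range
    step = 0.25 * (inv_max - inv_min)    # assumed first step: quarter of the range
    for _ in range(iterations):
        proposals, weights = [], []
        for n in range(num_sources):
            b = decide(n, hyp)
            # A soft decision scales the move: b=1 -> +step, b=0 -> -step, b=0.5 -> stay.
            proposals.append(hyp + (2.0 * b - 1.0) * step)
            weights.append(weigh(n, b))
        # Normalized weighted average of the per-source proposals.
        total = sum(weights)
        hyp = sum(w * p for w, p in zip(weights, proposals)) / total
        step *= 0.5                      # halve the search interval
    return hyp


if __name__ == "__main__":
    target = 0.8123                      # hypothetical ground-truth inverse depth
    oracle = lambda n, hyp: 1.0 if target > hyp else 0.0
    uniform = lambda n, b: 1.0
    estimate = infer_inverse_depth(oracle, uniform, 0.5, 1.5, num_sources=3)
    print(abs(estimate - target))
```

With hard, always-correct decisions the loop reduces to bisection, so under these assumptions the error after J iterations is at most the range divided by 2^(J+1). Soft decisions shorten individual moves, which is what makes the update magnitude adaptive.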

### 3.3 Binary decision network

The decision network D-Net exhibits a U-Net [unet]-like encoder-decoder structure and it is depicted in Figure 3. At the top level, the decoder is fed both with the feature maps and , obtained by applying a Feature Pyramid Network [casmvs] to the reference and source images and , respectively, as well as with a depth map hypothesis . At each level of the encoder, deformable convolutions [deform_convsv2] are used to convolve the source image feature maps along locations on the epipolar line determined by the depth hypothesis. Specifically, we deform a kernel such that each of the sampling locations of the kernel corresponds to a location on the epipolar line. We use kernel size and can thus center the deformed kernel on the epipolar line at the location predicted by the current depth hypothesis and distribute the samples on either side spaced with a unit vector in pixel coordinates. The resulting sampled features from the source feature map are then concatenated with the reference feature map, as suggested in [bi3d_stereo]. This procedure is repeated at each resolution level; the resulting feature maps are further processed with standard convolutional layers and passed both to the next lower resolution level of the encoder and to the decoder, as depicted in Figure 3. At the decoder side, the feature maps are upsampled and further concatenated with the features from the encoder at the next higher resolution level. The output binary mask is generated using a sigmoid activation function. We include a detailed specification of the convolutional hyper-parameters in Section A of the supplementary material.

In practice, in order to predict the binary decision mask , we employ three D-Nets on full, half and quarter resolution inputs. In particular, as depicted in Figure 3, the output features from the previous scale D-Net are employed on the next scale. We observed that employing three resolution levels leads to higher quality binary decision masks. In fact, this choice increases the overall architecture receptive field and permits a coarse to fine refinement of the estimated masks, as each level employs the previous level output.
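The placement of the deformed kernel samples along the epipolar line can be illustrated with plain pinhole geometry. The sketch below is an assumption-laden illustration rather than the paper's implementation: it uses shared intrinsics for both cameras, five samples (matching the k=5 listed for the deformable convolutions in the hyper-parameter table), and a finite difference in depth to estimate the line direction. All function and parameter names are hypothetical, and the deformable convolution itself is not shown.

```python
import math


def project_to_source(u, v, depth, intr, rot, trans):
    """Project reference pixel (u, v) at `depth` into the source image.

    intr = (fx, fy, cx, cy), shared by both cameras in this sketch.
    rot (3x3 nested lists) and trans (3 values) map reference-camera
    coordinates to source-camera coordinates. Points behind the source
    camera are not handled.
    """
    fx, fy, cx, cy = intr
    point = ((u - cx) / fx * depth, (v - cy) / fy * depth, depth)
    x, y, z = (sum(rot[r][c] * point[c] for c in range(3)) + trans[r] for r in range(3))
    return fx * x / z + cx, fy * y / z + cy


def epipolar_samples(u, v, depth, intr, rot, trans, kernel_size=5, rel_eps=1e-3):
    """Sampling locations centred on the projected hypothesis, one pixel apart."""
    cu, cv = project_to_source(u, v, depth, intr, rot, trans)
    # A slightly larger depth gives a second point on the same epipolar line.
    fu, fv = project_to_source(u, v, depth * (1.0 + rel_eps), intr, rot, trans)
    du, dv = fu - cu, fv - cv
    norm = math.hypot(du, dv)
    if norm < 1e-12:
        du, dv = 1.0, 0.0                # degenerate case: fall back to the x axis
    else:
        du, dv = du / norm, dv / norm    # unit vector in pixel coordinates
    half = kernel_size // 2
    return [(cu + i * du, cv + i * dv) for i in range(-half, half + 1)]
```

In a rectified stereo setup with a purely horizontal baseline, the samples fall on the same image row as the reference pixel, spaced one pixel apart around the location implied by the depth hypothesis.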

### 3.4 Fusion weights network

In the proposed architecture, each D-Net is followed by a W-Net, a network whose objective is to assign a confidence, in the form of a weight map, to the predicted binary decision mask. Figure 3 depicts W-Net in gray. Inspired by [vismvsnet], W-Net operates on the pixelwise entropy of the predicted binary mask . This design strategy, coupled with the choice to use the depth hypothesis to sample the source image, rather than as a network input, makes our overall algorithm scale independent. As suggested in [vismvsnet], the final weight at the pixel is predicted as where is the network output. The pixelwise weights are used by our depth inference algorithm during the fusion stage in Eq. (3). The ideally predicted weights are small in those areas where the binary decision mask is not reliable, such as in occluded regions, and large otherwise, such that inverse depth hypotheses from different source images can complement each other. This approach mitigates the negative effect of potentially erroneous estimates from the source views, when fusing them into the new depth hypothesis . This is crucial, as represents the starting point for the estimation of the next iteration binary decision masks.
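The pixelwise entropy that W-Net operates on can be sketched as follows. Only the input quantity is shown; the learned mapping from entropy to weight is not reproduced. The function names and the clamping constant are assumptions for the example.

```python
import math


def binary_entropy(p, eps=1e-7):
    """Entropy (in nats) of a soft binary decision p in [0, 1]."""
    p = min(max(p, eps), 1.0 - eps)      # clamp to avoid log(0)
    return -(p * math.log(p) + (1.0 - p) * math.log(1.0 - p))


def entropy_map(mask):
    """Apply binary_entropy to every entry of a 2D mask given as nested lists."""
    return [[binary_entropy(p) for p in row] for row in mask]


if __name__ == "__main__":
    print(entropy_map([[0.5, 0.9], [0.99, 0.01]]))
```

The entropy peaks where the decision is ambiguous (values near 0.5) and vanishes where it is confident (values near 0 or 1). Because it depends only on the mask and not on the absolute depth values, it is a natural scale-independent cue for a confidence network.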

## 4 Experiments

In this section, we first describe the training procedure adopted for IB-MVS, then compare it to state-of-the-art works on popular MVS benchmarks.

### 4.1 Network training

First, the multi-level D-Net is pre-trained on random uniform inverse depth map hypotheses with for every and , as proposed in [bi3d_stereo]. The ground truth binary decision mask for a given depth map hypothesis is computed from the ground truth depth map using Equation (1). We employ a loss at each level of the multi-level D-Net, with being the full resolution. In particular, is defined as the average of the Binary Cross Entropy (BCE) at the valid pixels :

(4) |

where with equal to if the pixel has a valid ground truth depth, otherwise. The BCE loss in Eq. (4) is defined as follows:

(5) |
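A minimal sketch of this masked loss on nested lists is given below. The label convention in `ground_truth_mask` (1 where the ground-truth value exceeds the hypothesis) is an assumption for illustration; both inputs may be depths or inverse depths as long as they share the same domain. The function names are hypothetical.

```python
import math


def ground_truth_mask(hypothesis, ground_truth):
    """ASSUMED convention: 1 where the ground-truth value exceeds the hypothesis, else 0."""
    return [[1.0 if g > h else 0.0 for h, g in zip(h_row, g_row)]
            for h_row, g_row in zip(hypothesis, ground_truth)]


def masked_bce(pred, target, valid, eps=1e-7):
    """Mean binary cross entropy over pixels whose validity flag is non-zero."""
    total, count = 0.0, 0
    for p_row, t_row, v_row in zip(pred, target, valid):
        for p, t, v in zip(p_row, t_row, v_row):
            if not v:
                continue                 # skip pixels without valid ground truth
            p = min(max(p, eps), 1.0 - eps)
            total += -(t * math.log(p) + (1.0 - t) * math.log(1.0 - p))
            count += 1
    return total / count if count else 0.0
```

Averaging over valid pixels only keeps the loss magnitude comparable between images with different amounts of missing ground truth.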

The overall loss is the weighted sum with .

After the pre-training, we train the multi-level D-Net using the depth inference algorithm described in Section 3.2 with iterations. However, we do not consider the fusion step at this stage and work with a single randomly selected source image over the iterations. The random selection of source images was shown to be beneficial by [patchmatchnet]. We employ the previously introduced loss on each one of the binary decision masks , with , generated alongside the depth inference procedure. We refer to the loss at the iteration as . At iteration , the inverse depth hypothesis is used both by the multi-level D-Net to generate and in Eq. (1) to generate its ground truth. Now the loss can be computed and the next hypothesis generated. The procedure is iterated and the final loss is the sum over the losses . We do not back-propagate across iterations.

Finally, in the third training stage, we train W-Net jointly with D-Net. At this stage we fuse from randomly selected source images using the weights from W-Net with the same approach used to fuse the inverse hypotheses in Section 3.2 to generate . We compute the loss of one iteration, where is randomly chosen in , as the sum of losses on the individual and fused .

We implemented IB-MVS in PyTorch [pytorch] and trained with batch size using ADAM [adam_optimizer].

### 4.2 Evaluation metrics

In our evaluation, we present experimental results on three popular MVS benchmarks, namely DTU [dtu], Tanks and Temples [tanksandtemples] and ETH3D high and low-res [eth3d]. These benchmarks compare the reconstructed point cloud against a dense ground truth and extract completeness and accuracy metrics (recall and precision, respectively, for Tanks and Temples [tanksandtemples]). Completeness and accuracy are aggregated into a single metric: their average for DTU [dtu] and harmonic mean, denoted F-score, for Tanks and Temples [tanksandtemples] and ETH3D [eth3d]. For DTU [dtu], accuracy and completeness are measured in mm, hence lower is better. For ETH3D [eth3d] and Tanks and Temples [tanksandtemples], these metrics are percentages, hence higher is better.

### 4.3 Ablation study

We first investigate the influence of the iterations on the point cloud results for DTU [dtu]. In Table 1, we show that increasing the number of iterations leads to improved results, which is coherent with our iterative depth refinement. The larger performance difference between iterations, compared to and , is explained by the progressively shrinking step size of IB-MVS. Furthermore, we investigate the benefits of using our W-Net in the fusion step. In particular, we compare it to a naive strategy that simply averages the depth map hypotheses from the different source views. The results in Table 1 show that, for the same number of iterations , W-Net leads to a better completeness and an overall better quality (avg. metric) than the naive fusion strategy, while exhibiting a competitive accuracy. The improved accuracy of the naive fusion strategy is obtained at the cost of worse completeness, as the absence of W-Net leads to more inconsistent estimates in occluded regions.
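The difference between the two fusion strategies compared in this ablation can be illustrated with a toy example. The numbers below are hypothetical: two source views agree on an inverse depth near 0.505, while a third, assumed to be occluded, proposes an outlier and receives a small weight.

```python
def fuse_mean(proposals):
    """Naive fusion: plain average of the per-source hypotheses."""
    return sum(proposals) / len(proposals)


def fuse_weighted(proposals, weights):
    """Weighted fusion: average normalized by the sum of the weights."""
    total = sum(weights)
    if total <= 0.0:
        raise ValueError("weights must sum to a positive value")
    return sum(w * p for w, p in zip(weights, proposals)) / total


if __name__ == "__main__":
    proposals = [0.50, 0.51, 0.90]       # third value: outlier from an occluded view
    weights = [1.0, 1.0, 0.05]           # hypothetical confidence weights
    print(fuse_mean(proposals), fuse_weighted(proposals, weights))
```

In this toy case the naive average is pulled towards the outlier, whereas the weighted result stays close to the two consistent views. With equal weights both strategies coincide.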

### 4.4 Benchmark results

We use Tanks and Temples [tanksandtemples] and ETH3D [eth3d] high and low-res in order to evaluate the generalization capabilities of IB-MVS, since we do not train on their respective training sets.

The feasible depth range is inferred from the SfM model using the method in [mvsnet]. The computed depth maps are fused into a single point cloud with the method proposed in [mvsnet]; we denote its parameters representing the number of consistent views and the geometric re-projection error threshold as and , respectively. We denote the used image resolution as , the runtime per image as and the memory consumption as . Finally, we recall that is the number of source views used.
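The role of the two fusion parameters can be sketched with a simple per-pixel test. This is only an illustration of how a view-count threshold and a re-projection error threshold interact: computing the re-projection errors requires the camera geometry and is omitted, and the actual filter of [mvsnet] may apply additional checks. The function and parameter names are hypothetical.

```python
def is_consistent(reprojection_errors, max_error, min_views):
    """Keep a pixel if enough source views re-project it within the error threshold.

    reprojection_errors holds one value (in pixels) per source view, or None
    when the pixel falls outside that view.
    """
    agreeing = sum(1 for e in reprojection_errors if e is not None and e <= max_error)
    return agreeing >= min_views


if __name__ == "__main__":
    errors = [0.3, 0.8, 2.5]
    print(is_consistent(errors, max_error=1.0, min_views=2))
    print(is_consistent(errors, max_error=0.5, min_views=2))
```

Raising the required number of views or lowering the error threshold makes the filter stricter, which generally trades completeness for accuracy.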

For the evaluation on the DTU benchmark, we train the network on DTU for (, , ) epochs and learning rates (, , ) in the stages ( , ) described in Section 4.1, respectively. For the evaluations on Tanks and Temples and ETH3D, we train on DTU at stage and on Blended MVS [blended_mvs] at stages and , for (, , ) epochs using learning rates (, , ). Every training on DTU employs the ground truth depth maps and train-test split of [mvsnet]. Finally, our method is run for iterations on DTU and for iterations on Tanks and Temples [tanksandtemples] and ETH3D [eth3d]. All the experiments were performed using an Nvidia RTX 2080Ti graphics card.

Table 2: Results on the ETH3D [eth3d] low-res and high-res benchmarks. F, acc. and cmp. are percentages (higher is better); RT is the runtime in seconds.

publication | low-res-test F | low-res-test acc. | low-res-test cmp. | low-res-test RT | low-res-train F | low-res-train acc. | low-res-train cmp. | high-res-test F | high-res-test acc. | high-res-test cmp. | high-res-test RT | high-res-train F | high-res-train acc. | high-res-train cmp.
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
DeepC-MVS [deepcmvs] | 62.37 | 65.89 | 59.42 | 5746 | 61.99 | 65.98 | 59.27 | 87.08 | 89.15 | 85.52 | 3155 | 84.81 | 90.37 | 80.30
COLMAP [colmap_mvs] | 52.32 | 61.51 | 45.89 | 3312 | 49.91 | 69.58 | 40.86 | 73.01 | 91.97 | 62.98 | 1658 | 67.66 | 91.85 | 55.13
ACMM [acmm] | 55.01 | 52.37 | 58.27 | 1662 | 55.12 | 54.69 | 57.01 | 80.78 | 90.65 | 74.34 | 1165 | 78.86 | 90.67 | 70.42
PCF-MVS [pcf_mvs] | 57.06 | 56.56 | 58.42 | 9289 | 57.32 | 57.03 | 58.17 | 80.38 | 82.15 | 79.29 | 2272 | 79.42 | 84.11 | 75.73
R-MVSNet [rmvsnet] | 36.87 | 37.45 | 37.16 | 2413 | - | - | - | - | - | - | - | - | - | -
CasMVSNet [casmvs] | 44.49 | 55.44 | 38.80 | - | 49.00 | 62.06 | 41.86 | - | - | - | - | - | - | -
MVSCRF [mvscrf] | 28.32 | 34.84 | 24.97 | - | - | - | - | - | - | - | - | - | - | -
P-MVSNet [pmvsnet] | 44.46 | 54.95 | 38.28 | 627 | - | - | - | - | - | - | - | - | - | -
BP-MVSNet [bp_mvsnet] | 43.22 | 32.65 | 64.34 | - | 50.87 | 49.12 | 55.29 | - | - | - | - | - | - | -
Att-MVS [attmvsnet] | 45.85 | 64.84 | 37.07 | - | - | - | - | - | - | - | - | - | - | -
PVSNet [pvsnet] | 45.78 | 38.39 | 57.76 | 2116 | - | - | - | 72.08 | 66.41 | 80.05 | 830 | 67.48 | - | -
PMNet [patchmatchnet] | - | - | - | - | - | - | - | 73.12 | 69.71 | 77.46 | 493 | 64.21 | 64.81 | 65.43
IB-MVS (ours) | 49.19 | 39.31 | 67.29 | 1487 | 55.84 | 61.06 | 52.66 | 75.85 | 71.64 | 82.18 | 616 | 71.21 | 75.21 | 69.02

Table 3: Results on the DTU [dtu] test set (in mm, lower is better) and on the Tanks and Temples [tanksandtemples] intermediate and advanced subsets (in %, higher is better).

publication | DTU test avg. | DTU test acc. | DTU test cmp. | intermediate F | intermediate preci. | intermediate reca. | advanced F | advanced preci. | advanced reca.
---|---|---|---|---|---|---|---|---|---
DeepC-MVS [deepcmvs] | - | - | - | 59.79 | 59.11 | 61.21 | 34.54 | 40.68 | 31.30
ACMM [acmm] | - | - | - | 57.27 | 49.19 | 70.85 | 34.02 | 35.63 | 34.90
COLMAP [colmap_mvs] | - | - | - | 42.14 | 43.16 | 44.48 | 27.24 | 33.65 | 23.96
PCF-MVS [pcf_mvs] | - | - | - | 55.88 | 49.82 | 65.68 | 35.69 | 37.52 | 35.36
R-MVSNet [rmvsnet] | 0.422 | 0.385 | 0.459 | 48.40 | 43.74 | 57.60 | 24.91 | 31.47 | 22.05
BP-MVSNet [bp_mvsnet] | 0.327 | 0.333 | 0.320 | 57.60 | 51.26 | 68.77 | 31.35 | 29.62 | 35.61
P-MVSNet [pmvsnet] | 0.420 | 0.406 | 0.434 | 55.62 | 49.93 | 63.82 | - | - | -
Att-MVS [attmvsnet] | 0.356 | 0.383 | 0.329 | 60.05 | 61.89 | 58.93 | 31.93 | 40.58 | 27.26
CVP-MVSNet [cvp_mvsnet] | 0.351 | 0.296 | 0.406 | 54.03 | 51.41 | 60.19 | - | - | -
CasMVSNet [casmvs] | 0.355 | 0.325 | 0.385 | 56.84 | 47.62 | 74.01 | 31.12 | 29.68 | 35.24
PatchMatchNet [patchmatchnet] | 0.352 | 0.427 | 0.277 | 53.15 | 43.64 | 69.37 | 32.31 | 27.27 | 41.66
UCS-Net [ucsnet] | 0.344 | 0.338 | 0.349 | 54.83 | 46.66 | 70.34 | - | - | -
LANet [lanet] | 0.335 | 0.320 | 0.349 | 55.70 | 45.62 | 75.68 | - | - | -
D2HC-RMVSNet [dhcrmvsnet] | 0.386 | 0.395 | 0.378 | 59.20 | 49.88 | 74.08 | - | - | -
VisMVSNet [vismvsnet] | 0.365 | 0.369 | 0.361 | 60.03 | 54.44 | 70.48 | - | - | -
IB-MVS (ours) | 0.321 | 0.334 | 0.309 | 56.02 | 47.71 | 72.64 | 31.96 | 27.85 | 41.48

#### ETH3D benchmark [eth3d]

This dataset is the most challenging one for learning-based MVS methods, especially the high-res subset. On the one hand, the high resolution of its images represents a memory bottleneck for learning-based cost-volume methods. On the other, it is characterized by images with wide baselines and with a significantly lower overlap than in Tanks and Temples [tanksandtemples] and DTU [dtu], which can make matching without pixelwise source view selection difficult. For the high-res dataset we set , , , and obtain , . For the low-res dataset we set , , , and obtain , . We provide quantitative results for both the datasets in Table 2. Although this benchmark had been dominated by traditional methods such as ACMM [acmm] and DeepC-MVS [deepcmvs] in the past, the recently published PatchMatchNet [patchmatchnet] was able to achieve competitive results: IB-MVS outperforms [patchmatchnet] on both the training and the test datasets. Learning-based approaches relying on cost-volumes are limited on this benchmark, however our iterative approach allows IB-MVS to infer accurate results, even on high resolution images. Further, since large viewpoint changes are present in this benchmark, the implemented fusion scheme allows IB-MVS to deal with occluded regions.

#### DTU benchmark [dtu]

This dataset contains close-range images of various objects. For this dataset, we set , , , and obtain , . In Table 3 we compare IB-MVS to the recent learning-based state-of-the-art methods. IB-MVS provides very competitive accuracy and completeness values: in particular, it achieves a good trade-off between the two, which results in the best average score. Finally, we achieve a complete reconstruction, even with a strict filtering parameter . This is due to the ability of IB-MVS to yield very precise results without the need to use large cost volumes. This is especially important in the case of the DTU benchmark, where we target highly precise reconstructions of single objects.
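To make the accuracy/completeness trade-off concrete, the two aggregate scores described in Section 4.2 can be sketched as follows. The function names are hypothetical; the formulas are the plain arithmetic and harmonic means.

```python
def dtu_overall(accuracy_mm, completeness_mm):
    """Arithmetic mean of accuracy and completeness (both in mm, lower is better)."""
    return 0.5 * (accuracy_mm + completeness_mm)


def f_score(precision, recall):
    """Harmonic mean of precision and recall (same units, higher is better)."""
    if precision + recall == 0.0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)


if __name__ == "__main__":
    print(dtu_overall(0.334, 0.309))
    print(f_score(50.0, 100.0))
```

Applied to the IB-MVS row of Table 3, `dtu_overall(0.334, 0.309)` evaluates to 0.3215, consistent with the reported average of 0.321 up to rounding. The harmonic mean penalizes imbalance more strongly than the arithmetic mean; note that benchmark tables typically average per-scene scores, so a reported F-score need not equal the harmonic mean of the averaged precision and recall columns.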

#### Tanks and Temples benchmark [tanksandtemples]

The intermediate subset focuses on reconstructing small and large single objects, while the advanced subset consists of large scale indoor and outdoor scenes. In our experiments, we set , , , (interm.), , (adv.) and obtain , . On the advanced subset, IB-MVS is competitive with other state-of-the-art learning-based methods such as CasMVSNet [casmvs] and PatchMatchNet [patchmatchnet], as shown in Table 3. Our iterative exploration of the hypothesis space, along with our pixelwise source view fusion, allows IB-MVS to achieve competitive results among learning based methods on the challenging advanced subset. A possible direction for improving the intermediate set results could be to employ a confidence measure for the final predicted depth map, to filter out inaccurate points ahead of the point cloud fusion, similarly to [deepcmvs]. Another direction could consider the improvement of the core architecture, for instance, by incorporating a regularization stage. In particular, the binary decision mask prediction could be regularized by a differentiable CRF, such as [bp_mvsnet].

Finally, we discuss the runtime performance of IB-MVS on ETH3D, measured as the per-scene runtime of the complete reconstruction, in seconds, including the point cloud fusion step. IB-MVS is faster than traditional methods on both ETH3D high-res and low-res, as shown in Table 2. It is noteworthy that the low-res scenes contain more images. While the learning-based method PM-Net [patchmatchnet] yields a better runtime than IB-MVS, a direct comparison is difficult. In fact, the PatchMatch algorithm utilized within PM-Net [patchmatchnet] operates at half-resolution and the full resolution is obtained via a subsequent up-sampling and refinement. Instead, IB-MVS operates at full resolution and does not perform an additional refinement. Furthermore, IB-MVS achieves a better F-score than PM-Net [patchmatchnet] on ETH3D high-res. On low-res, IB-MVS offers a competitive runtime compared to learning-based methods. In terms of F-score, IB-MVS outperforms the fastest learning-based method P-MVSNet [pmvsnet], which does not participate in the high-res benchmark.

## 5 Conclusion

We presented IB-MVS, a novel learning-based approach for MVS that explores the depth space iteratively in a binary decision fashion. IB-MVS couples the advantages of learning-based methods, such as learned input representations, with an efficient exploration strategy of the hypothesis space. In fact, IB-MVS can handle high resolution images, as it does not require a cost volume. In addition, IB-MVS benefits from a pixelwise source view fusion strategy. Extensive results show that IB-MVS achieves competitive results compared to state-of-the-art methods on popular MVS benchmarks.

Acknowledgement: This work has been supported by the FFG, Contract No. 881844: "ProFuture".

## Supplementary material

## A Network architecture hyper-parameters

We provide the network hyper-parameters of D-Net and W-Net in Table 4 and Table 5, respectively. As specified in the main paper, we use three resolution levels at quarter, half and full resolution. For generating the image features and , we utilize the FPN architecture of [casmvs] and set the number of feature channels for each level . Further, we replace the batch normalization [batch_norm] in the FPN [casmvs] with instance normalization [instance_norm]. We denote the entropy calculated from the output mask according to Section 3.4 of the main paper as . In Tables 4 and 5, we denote 2D convolutions as 2D conv, deformable 2D convolutions [deform_convsv2] as 2D def. conv and transposed 2D convolutions as 2D tran. conv. Further, we denote the leaky ReLU activation function as LReLU, the number of input and output channels with and , the stride with str. and kernel size with k (we add when no bias is used; the padding is set to ).

## B Supplementary qualitative results

In Figure 5, we provide qualitative point cloud results of our method IB-MVS for the DTU [dtu], Tanks and Temples [tanksandtemples] and ETH3D [eth3d] high and low-res datasets. Additionally, in Figure 7, we provide supplementary qualitative IB-MVS depth map results from the DTU [dtu] dataset.

In order to provide further insights into IB-MVS, in Figure 6 we provide a visualization of its intermediate outputs at different iterations . For each source image , at iteration the hypothesis is used to compute the binary decision mask and the weight mask via D-Net and W-Net, respectively; this allows the new reference depth map hypothesis to be computed using Eq. (2) of the main paper. Rows 2-5 of Figure 6 show , and for the 4 source images and different values of . The new reference image depth hypotheses are then fused into a single depth map using the weights . The first row of Figure 6 shows the reference image along with for different values of . We conclude by observing that, as desired, the weight masks in Figure 6 assign a low confidence to those areas of the reference image that are occluded in the source image, as these areas cannot be matched. This can be seen in the weight masks depicted in rows 2 and 3, where the area below the sofa and the left-most region (highlighted in yellow) of the reference image are dark because they are occluded in the respective source images.

Table 4: D-Net hyper-parameters.

| in name | out name | operation |
|---|---|---|
| | Conv1 | 2D conv., k=3, str.=1, act.=LReLU |
| | DConv1 | 2D def. conv., k=5, str.=1, act.=LReLU |
| Conv1, DConv1 | Conc1 | concatenate along channel dim. |
| Conc1 | Conv2 | 2D conv., k=3, str.=1, act.=LReLU |
| Conv2 | Sc1 | 2D conv., k=3, str.=2, act.=LReLU |
| | half | bilinear interp. downscale to half |
| | half | bilinear interp. downscale to half |
| half | Conv3 | 2D conv., k=3, str.=1, act.=LReLU |
| half | DConv2 | 2D def. conv., k=5, str.=1, act.=LReLU |
| Conv3, DConv2 | Conc2 | concatenate along channel dim. |
| Conc2 | Conv4 | 2D conv., k=3, str.=1, act.=LReLU |
| Sc1, Conv4 | Conc3 | for concatenate along channel dim. |
| Conc3 | Conv5 | for 2D conv., k=3, str.=1, act.=LReLU |
| , Sc1, Conv4 | Conc3 | for concatenate along channel dim. |
| Conc3 | ConvPr | for 2D conv., k=3, str.=1, act.=LReLU |
| ConvPr | Conv5 | for 2D conv., k=3, str.=1, act.=LReLU |
| Conv5 | Sc2 | 2D conv., k=3, str.=2, act.=LReLU |
| | quar. | bilinear interp. downscale to quarter |
| | quar. | bilinear interp. downscale to quarter |
| quar. | Conv6 | 2D conv., k=3, str.=1, act.=LReLU |
| quar. | DConv3 | 2D def. conv., k=5, str.=1, act.=LReLU |
| Conv6, DConv3 | Conc4 | concatenate along channel dim. |
| Conc4 | Conv7 | 2D conv., k=3, str.=1, act.=LReLU |
| Sc2, Conv7 | Conc5 | concatenate along channel dim. |
| Conc5 | Conv8 | 2D conv., k=3, str.=1, act.=LReLU |
| Conv8 | Conv9 | 2D conv., k=3, str.=1, act.=LReLU |
| Conv9 | Conv10 | 2D conv., k=3, str.=1, act.=LReLU |
| Conv10 | UConv1 | 2D tran. conv., k=4, str.=2, act.=LReLU |
| Conv5, UConv1 | Conc6 | concatenate along channel dim. |
| Conc6 | Conv11 | 2D conv., k=3, str.=1, act.=LReLU |
| Conv11 | Conv12 | 2D conv., k=3, str.=1, act.=LReLU |
| Conv12 | UConv2 | 2D tran. conv., k=4, str.=2, act.=LReLU |
| Conv2, UConv2 | Conc7 | concatenate along channel dim. |
| Conc7 | | 2D conv., k=3, str.=1, act.=LReLU |
| | | 2D conv., k=3, str.=1, act.=sigmoid |

Table 5: W-Net hyper-parameters.

| in name | out name | operation |
|---|---|---|
| | Conv1 | for 2D conv., k=3, str.=1, act.=LReLU |
| | Conv0 | for 2D conv., k=3, str.=1, act.=LReLU |
| | | for bilinear interp. upscale to double res. |
| | ConvPr | for 2D conv., k=3, str.=1, act.=LReLU |
| Conv0, ConvPr | Conc1 | for concatenate along channel dim. |
| Conc1 | Conv1 | for 2D conv., k=3, str.=1, act.=LReLU |
| Conv1 | Conv2 | 2D conv., k=3, str.=1, act.=LReLU |
| Conv2 | Conv3 | 2D conv., k=3, str.=1, act.=LReLU |
| Conv3 | | 2D conv., k=3, str.=1, act.=LReLU |
| 1 | | 2D conv., k=3, str.=1, act.=identity |
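As a reading aid for the kernel sizes and strides listed in these tables, the sketch below applies the standard output-size formulas for convolutions and transposed convolutions (dilation 1, no output padding). The padding values are assumptions, since the value stated in the text did not survive: (k-1)/2 for the regular and deformable convolutions and 1 for the transposed convolutions. The function names are hypothetical.

```python
def conv_out(size, kernel, stride, padding):
    """Spatial output size of a 2D convolution along one axis (dilation 1)."""
    return (size + 2 * padding - kernel) // stride + 1


def tran_conv_out(size, kernel, stride, padding):
    """Spatial output size of a transposed 2D convolution (no output padding)."""
    return (size - 1) * stride - 2 * padding + kernel


if __name__ == "__main__":
    n = 64
    print(conv_out(n, kernel=3, stride=1, padding=1))       # k=3, str.=1 layers
    print(conv_out(n, kernel=5, stride=1, padding=2))       # k=5 deformable layers
    print(conv_out(n, kernel=3, stride=2, padding=1))       # k=3, str.=2 layers
    print(tran_conv_out(n // 2, kernel=4, stride=2, padding=1))  # k=4, str.=2 transposed
```

Under these padding assumptions the stride-1 layers preserve the resolution, the stride-2 layers halve it, and the transposed layers double it, which matches the encoder-decoder structure with skip concatenations at equal resolution.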