Rethinking Depth Estimation for Multi-View Stereo: A Unified Representation and Focal Loss
Depth estimation is solved as a regression or classification problem in existing learning-based multi-view stereo methods. Although these two representations have recently demonstrated excellent performance, they still have apparent shortcomings: regression methods tend to overfit because the cost volume is learned only indirectly, and classification methods cannot directly infer the exact depth because their prediction is discrete. In this paper, we propose a novel representation, termed Unification, to unify the advantages of regression and classification. It can directly constrain the cost volume like classification methods, but also realize sub-pixel depth prediction like regression methods. To excavate the potential of unification, we design a new loss function named Unified Focal Loss, which is more uniform and reasonable to combat the challenge of sample imbalance. Combining these two unburdened modules, we present a coarse-to-fine framework that we call UniMVSNet. Ranking first on both the DTU and Tanks and Temples benchmarks verifies that our model not only performs the best but also has the best generalization ability.
Multi-view stereo (MVS) is a vital branch of extracting geometry from photographs, which takes stereo correspondence from multiple images as the main cue to reconstruct dense 3D representations. Although traditional methods [seitz2006comparison, barnes2009patchmatch, furukawa2009accurate, schonberger2016pixelwise] have achieved excellent performance after occupying researchers for decades, more and more learning-based approaches [yao2018mvsnet, yao2019recurrent, chen2019point, cheng2020deep, gu2020cascade, yang2020cost] are proposed to promote the effectiveness of MVS due to their more powerful representation capability in low-texture regions, reflections, etc. Concretely, they infer the depth for each view from the 3D cost volume, which is constructed from the warped features according to a set of predefined depth hypotheses. Compared with hand-crafted similarity metrics in traditional methods, the 3D cost volume can capture more discriminative features to achieve more robust matching. Without loss of generality, existing learning-based methods can be divided into two categories: Regression and Classification.
Regression is the most primitive and straightforward implementation of the learning-based MVS method. It is a group of approaches [yao2018mvsnet, luo2019p, cheng2020deep, yang2020cost, gu2020cascade, yu2020fast] that regress the depth from the 3D cost volume through Soft-argmin, which softly weights each depth hypothesis. More specifically, the model is expected to regress a greater weight for a depth hypothesis with a small cost. Theoretically, it can achieve sub-pixel estimation of depth by the weighted summation of discrete depth hypotheses. Nevertheless, the model needs to learn a complex combination of weights under indirect constraints imposed on the weighted depth rather than on the weight combination itself, which is non-trivial and tends to overfit. Many weight combinations over the same set of depth hypotheses sum to the same depth, and this ambiguity also implicitly makes model convergence more difficult.
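The ambiguity described above can be illustrated with a small sketch. The hypotheses and weight vectors below are made-up values chosen for illustration, and `soft_argmin_depth` is a hypothetical helper name, not code from any released implementation.

```python
def soft_argmin_depth(hypotheses, weights):
    """Weighted sum of depth hypotheses (weights are assumed to sum to 1)."""
    return sum(d * w for d, w in zip(hypotheses, weights))

# Illustrative values only.
hypotheses = [1.0, 2.0, 3.0, 4.0]
peaked = [0.0, 0.5, 0.5, 0.0]    # mass concentrated around the true surface
bimodal = [0.5, 0.0, 0.0, 0.5]   # mass split between two distant hypotheses

depth_peaked = soft_argmin_depth(hypotheses, peaked)
depth_bimodal = soft_argmin_depth(hypotheses, bimodal)
print(depth_peaked, depth_bimodal)
```

Both weight vectors yield the same regressed depth, so a loss computed only on that depth cannot tell the well-behaved distribution from the bimodal one.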
Classification is proposed in R-MVSNet [yao2019recurrent] to infer the optimal depth hypothesis. Different from the weight estimation in regression, classification methods [huang2018deepmvs, yao2019recurrent, yan2020dense] predict the probability of each depth hypothesis from the 3D cost volume and take the depth hypothesis with the maximum probability as the final estimation. Obviously, these methods cannot infer the exact depth directly from the model like regression methods. However, classification methods directly constrain the cost volume through the cross-entropy loss executed on the regularized probability volume, which is the essence of ensuring the robustness of MVS. Moreover, the estimated probability distribution can directly reflect the confidence, which is difficult to derive from the weight combination intuitively.
In this paper, we seek to unify the advantages of regression and classification, that is, we hope that the model can accurately predict the depth while maintaining robustness. The depth hypothesis close to the ground-truth carries more potential knowledge, while that of the remaining hypotheses is limited or even harmful because it induces multimodal distributions [zhang2020adaptive]. Motivated by this, we argue that estimating the weights for all depth hypotheses is redundant, and the model only needs to do regression on the optimal depth hypothesis, i.e., the one whose representative depth interval (referring to the upper area until the next larger depth hypothesis) contains the ground-truth depth. To achieve this, we propose a unified representation for depth, termed Unification. As shown in Fig. 1, unlike regression, the loss is executed on the regularized probability volume directly, and different from classification, our method estimates what we call the Unity, whose label is composed of at most one non-zero continuous target, to simultaneously represent the location of the optimal depth hypothesis and its offset to the ground-truth depth. We take proximity (defined as the complement of the offset between the ground-truth and the optimal depth hypothesis) to characterize the non-zero target in the unity label, which is more efficient than purely using the offset. The detailed comparisons are in Supp. Mat.
Moreover, we note that this unified representation faces an undeniable sample imbalance in both category and hardness. While Focal Loss (FL) [lin2017focal] is the common solution proposed in the detection field, which is tailored to the traditional discrete label, the more general form (GFL) is proposed in [gfl2020, zhang2021varifocalnet] to deal with the continuous label. Even though GFL has demonstrated its performance, we hold the belief that it has an obvious limitation in distinguishing hard and easy samples due to ignoring the magnitude of ground-truth. To this end, we put forward a more reasonable and unified form, called Unified Focal Loss (UFL), after thorough analysis to better address these challenges. In this way, the traditional FL can be regarded as a special case of UFL, while GFL is its imperfect expression.
To demonstrate the superiority of our proposed modules, we present a coarse-to-fine framework termed UniMVSNet (or UnifiedMVSNet), named for its unification of depth representation and focal loss, which replaces the traditional representation of recent works [cheng2020deep, yang2020cost, gu2020cascade] with Unification and adopts UFL for optimization. Extensive experiments show that our model surpasses all previous MVS methods and achieves state-of-the-art performance on both DTU [aanaes2016large] and Tanks and Temples [knapitsch2017tanks] benchmarks.
Traditional MVS methods.
Taking the output scene representation as an axis of taxonomy, there are mainly four types of classic MVS methods: volumetric [seitz1999photorealistic, kutulakos2000theory], point cloud based [lhuillier2005quasi, furukawa2009accurate], mesh based [fua1995object] and depth map based [campbell2008using, schonberger2016pixelwise, schonberger2016structure, xu2019multi, galliani2015massively]. Among them, the depth map based method is the most flexible one. Instead of operating in the 3D domain, it degenerates the complex problem of 3D geometry reconstruction to depth map estimation in the 2D domain. Moreover, as the intermediate representation, the estimated depth maps of all individual images can be merged into a consistent point cloud [merrell2007real] or a volumetric reconstruction [newcombe2011kinectfusion], and the mesh can even be further reconstructed.
Learning-based MVS methods.
While the traditional MVS pipeline mainly relies on hand-crafted similarity metrics, recent works apply deep learning for superior performance on MVS. SurfaceNet [ji2017surfacenet] and LSM [lsm2017] are the first proposed volumetric learning-based MVS pipelines to regress surface voxels from 3D space. However, they are restricted by memory consumption, which is the common drawback of the volumetric representation. Most recently, MVSNet [yao2018mvsnet] first realizes an end-to-end pipeline with low memory sensitivity based on 3D cost volumes. This pipeline mainly consists of four steps: image feature extraction by 2D CNN, variance-based cost aggregation by homography warping, cost regularization through 3D CNN, and depth regression. To further excavate the potential capacity of this pipeline, some variants of MVSNet have been proposed, e.g., [yao2019recurrent, gu2020cascade, cheng2020deep, yang2020cost] reduce the memory requirement through an RNN or a coarse-to-fine manner, and [luo2019p, yi2020pyramid] adaptively re-weight the contribution of different pixels in cost aggregation. Meanwhile, all existing methods rely on one of the two complementary representations, classification or regression, to infer depth. In this paper, we propose a novel unified representation to integrate their advantages.
This section will present the main contributions of this paper in detail. We first review the common pipeline of the learning-based MVS approach in Sec. 3.1, then introduce the proposed unified depth representation in Sec. 3.2 and unified focal loss in Sec. 3.3, and finally describe the detailed network architecture of our UniMVSNet in Sec. 3.4.
Most end-to-end learning-based MVS methods are inherited from MVSNet [yao2018mvsnet], which constructs an elegant and effective pipeline to infer the depth $\mathbf{D}$ of the reference image $\mathbf{I}_1$. Given multiple images $\{\mathbf{I}_i\}_{i=1}^{N}$ of a scene taken from $N$ different viewpoints, image features of all images are first extracted through a 2D network with shared weights. As mentioned above, the learning-based method is based on the 3D cost volume, and a depth hypothesis of $M$ layers $\{d_j\}_{j=1}^{M}$ is sampled from the whole known depth range $[d_{min}, d_{max}]$ to achieve this, where $d_{min}$ represents the minimum depth and $d_{max}$ represents the maximum depth. With this hypothesis, feature volumes $\{\mathbf{V}_i\}_{i=1}^{N}$ can be constructed in 3D space via differentiable homography by warping 2D image features of source images to the reference camera frustum. The homography between the feature maps of the $i$-th view and the reference feature maps at depth $d$ is expressed as:
$$\mathbf{H}_i(d) = d\,\mathbf{K}_i \mathbf{T}_i \mathbf{T}_1^{-1} \mathbf{K}_1^{-1} \tag{1}$$
where $\mathbf{K}$ and $\mathbf{T}$ refer to camera intrinsics and extrinsics respectively.
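For a single pixel, the warping step above amounts to back-projecting the reference pixel to a hypothesized depth, moving it into the source camera frame, and projecting it again. The sketch below shows that per-pixel mapping under simplifying assumptions: a pinhole camera given as `(fx, fy, cx, cy)`, and a relative pose (`rotation`, `translation`) from the reference to the source camera supplied directly. The function name and all numbers are hypothetical.

```python
def warp_pixel(pixel, depth, k_ref, k_src, rotation, translation):
    """Map a reference pixel at a hypothesized depth into a source view.

    k_ref / k_src: pinhole intrinsics as (fx, fy, cx, cy).
    rotation (3x3 nested list) and translation (3-vector) take
    reference-camera coordinates to source-camera coordinates.
    """
    u, v = pixel
    fx, fy, cx, cy = k_ref
    # Back-project onto the fronto-parallel plane z = depth.
    point = [(u - cx) / fx * depth, (v - cy) / fy * depth, depth]
    # Rigid transform into the source camera frame.
    moved = [
        sum(rotation[r][c] * point[c] for c in range(3)) + translation[r]
        for r in range(3)
    ]
    # Perspective projection with the source intrinsics.
    fx_s, fy_s, cx_s, cy_s = k_src
    return (fx_s * moved[0] / moved[2] + cx_s,
            fy_s * moved[1] / moved[2] + cy_s)

# Illustrative camera and poses.
k = (500.0, 500.0, 320.0, 240.0)
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
same = warp_pixel((400.0, 300.0), 2.0, k, k, identity, [0.0, 0.0, 0.0])
shifted = warp_pixel((400.0, 300.0), 2.0, k, k, identity, [-0.1, 0.0, 0.0])
print(same, shifted)
```

With an identity pose the pixel maps to itself, and with a purely horizontal baseline the shift equals focal length times baseline divided by depth, which is why each depth hypothesis produces a differently warped feature map.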
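To handle an arbitrary number of input views, the multiple feature volumes $\{\mathbf{V}_i\}_{i=1}^{N}$ (with $\mathbf{V}_1$ the reference) need to be aggregated into one cost volume $\mathbf{C}$. The aggregation strategies fall into two dominant groups: statistical and adaptive. The variance-based mapping is a typical statistical aggregation:
$$\mathbf{C} = \frac{1}{N}\sum_{i=1}^{N} \left(\mathbf{V}_i - \overline{\mathbf{V}}\right)^2 \tag{2}$$
where $\overline{\mathbf{V}}$ denotes the average feature volume. Furthermore, the adaptive aggregation is proposed to re-weight the contribution of different pixels, which can be modeled as:
$$\mathbf{C} = \frac{1}{N-1}\sum_{i=2}^{N} \mathbf{W}_i \odot \left(\mathbf{V}_i - \mathbf{V}_1\right)^2 \tag{3}$$
where $\mathbf{W}$ is the learnable weight generated by an auxiliary network, and $\odot$ denotes element-wise multiplication.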
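The two aggregation styles can be sketched for a single feature element (one voxel, one channel) as follows. This is a minimal illustration: the weights that an auxiliary network would predict are passed in as plain numbers, and the function names are hypothetical.

```python
def variance_cost(feature_values):
    """Statistical aggregation: variance of one feature element across all views."""
    n = len(feature_values)
    mean = sum(feature_values) / n
    return sum((v - mean) ** 2 for v in feature_values) / n

def adaptive_cost(ref_value, src_values, weights):
    """Adaptive aggregation: weighted squared difference to the reference,
    averaged over the source views. `weights` stands in for the output of
    an auxiliary network (assumed given here)."""
    weighted = sum(w * (v - ref_value) ** 2 for v, w in zip(src_values, weights))
    return weighted / len(src_values)

# Illustrative values only.
agree = variance_cost([1.0, 1.0, 1.0])         # views agree: zero cost
disagree = variance_cost([0.0, 2.0])           # views disagree: positive cost
reweighted = adaptive_cost(1.0, [2.0, 3.0], [1.0, 0.5])
print(agree, disagree, reweighted)
```

Down-weighting the second source view in the last call reduces the influence of its larger mismatch, which is the intended effect in unreliable regions.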
The matching cost between the reference view and all source views under each depth hypothesis has been encoded into the cost volume, which needs to be further refined to generate a probability volume $\mathbf{P}$ through a softmax-based regularization network. Concretely, the probability volume is treated as the weight of the depth hypotheses $d_1, \dots, d_M$ in regression methods, and the depth at pixel $x$ is calculated as the sum of the weighted hypotheses:
$$\mathbf{D}^{x} = \sum_{d=d_1}^{d_M} d\,\mathbf{P}_d^{x} \tag{4}$$
and the model is constrained by the L1 loss between $\mathbf{D}$ and the ground-truth depth. In classification methods, $\mathbf{P}$ refers to the probability of the depth hypotheses, and the depth is estimated as the hypothesis whose probability is maximum:
$$\mathbf{D}^{x} = \mathop{\arg\max}_{d \in \{d_1, \dots, d_M\}} \mathbf{P}_d^{x} \tag{5}$$
and the model is trained by the cross-entropy loss between $\mathbf{P}$ and the ground-truth one-hot probability volume.
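The two read-outs and their losses can be compared on a single pixel with the sketch below. The scores, hypotheses and ground-truth depth are illustrative, and placing the one-hot label at the nearest hypothesis is an assumption made for this example.

```python
import math

def softmax(scores):
    """Numerically stable softmax over the depth dimension of one pixel."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def regress_depth(hypotheses, prob):
    """Regression read-out: probability-weighted sum of hypotheses."""
    return sum(d * p for d, p in zip(hypotheses, prob))

def classify_depth(hypotheses, prob):
    """Classification read-out: hypothesis with the maximum probability."""
    best = max(range(len(prob)), key=lambda i: prob[i])
    return hypotheses[best]

# Illustrative single-pixel example.
hypotheses = [425.0, 427.5, 430.0, 432.5]
prob = softmax([0.1, 2.0, 1.5, -1.0])
gt_depth = 428.4

d_reg = regress_depth(hypotheses, prob)
d_cls = classify_depth(hypotheses, prob)
l1_loss = abs(d_reg - gt_depth)                 # supervises the depth only
gt_index = min(range(len(hypotheses)), key=lambda i: abs(hypotheses[i] - gt_depth))
ce_loss = -math.log(prob[gt_index])             # supervises the volume directly
print(d_reg, d_cls, l1_loss, ce_loss)
```

The regression output is continuous but its loss only sees the weighted sum, while the classification output is restricted to the sampled hypotheses but its loss acts on the probabilities themselves.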
In traditional one-stage methods, compared with the original input images, the depth map is downsized either during feature extraction [yao2018mvsnet] or before the input [yi2020pyramid] to save memory, while in the coarse-to-fine methods [gu2020cascade, yang2020cost, cheng2020deep], it is a multi-scale result with incremental resolution generated by repeating the above pipeline over multiple stages. The multi-scale features are produced by an FPN-like [lin2017feature] feature extraction network, and the depth hypothesis with a decreasing depth range is sampled based on the depth map generated in the previous stage.
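The stage-wise narrowing of the hypothesis range can be sketched as below. Centring the samples on the previous estimate, and the specific counts and intervals, are assumptions made for illustration rather than the configuration of any particular method.

```python
def sample_hypotheses(center, num, interval):
    """Uniformly sample `num` depth hypotheses centred on `center`."""
    start = center - interval * (num - 1) / 2
    return [start + i * interval for i in range(num)]

# Coarse stage: wide range, large interval (illustrative numbers).
coarse = sample_hypotheses(500.0, 5, 2.0)
# Finer stage: fewer hypotheses with a smaller interval, centred on the
# depth estimated in the coarse stage (assumed here to be 501.0).
fine = sample_hypotheses(501.0, 3, 1.0)
print(coarse, fine)
```

Each later stage therefore spends its hypotheses on a much smaller depth range, which is how the resolution can grow without the cost volume growing accordingly.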
As aforementioned, the regression method tends to overfit because the cost volume is learned indirectly and the correspondence between depth and the weight combination is ambiguous. The classification method can constrain the cost volume directly, but it cannot predict the exact depth like regression methods because its prediction is discrete. In this paper, we find that they can complement each other, and we unify them successfully through our unified depth representation, as shown in Fig. 2. We recast depth estimation as a multi-label classification task, in which the model needs to classify which hypothesis is the optimal one and regress the proximity for it. In other words, we first adopt classification to narrow the depth range of the final regression, although the two are executed simultaneously in our implementation. Therefore, the model with our Unification representation is able to estimate an accurate depth like regression methods, and it also directly optimizes the cost volume like classification methods. Below, we will introduce how to generate the ground-truth unity from the ground-truth depth (Unity generation), and how to regress the depth from the estimated unity (Unity regression).
Unity generation: As shown in Fig. 1, ground-truth unity is a more general form of one-hot label peaked at the optimal depth hypothesis whose depth interval contains ground-truth depth. The at most one non-zero target is a continuous number and represents the proximity of the optimal hypothesis to ground-truth depth. The detail of unity generation is shown in Algorithm 1, which is one more step of proximity calculation than the one-hot label generation in classification methods.
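The label generation described above can be sketched for a single pixel as follows. This is a hedged reading of the text, not the paper's Algorithm 1: it assumes ascending, uniformly spaced hypotheses, a half-open interval for each hypothesis, and proximity defined as one minus the offset normalized by the interval. The function name is hypothetical.

```python
def generate_unity(gt_depth, hypotheses):
    """Build the unity label for one pixel.

    Assumes `hypotheses` is sorted ascending with a uniform interval.
    At most one entry is non-zero: the proximity of the hypothesis whose
    interval [d, d + interval) contains the ground-truth depth.
    """
    interval = hypotheses[1] - hypotheses[0]
    unity = [0.0] * len(hypotheses)
    for i, d in enumerate(hypotheses):
        if d <= gt_depth < d + interval:
            unity[i] = 1.0 - (gt_depth - d) / interval
            break
    return unity

hypotheses = [1.0, 2.0, 3.0, 4.0]                   # illustrative values
label_inside = generate_unity(2.25, hypotheses)     # falls in [2, 3)
label_exact = generate_unity(3.0, hypotheses)       # coincides with a hypothesis
label_outside = generate_unity(0.5, hypotheses)     # not covered by any interval
print(label_inside, label_exact, label_outside)
```

When the ground truth coincides with a hypothesis the label reduces to the ordinary one-hot vector, and when no interval covers it the label is all zeros.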
Unity regression: Unlike the traditional way of predicting the probability volume by softmax operators, the unification representation estimates it through sigmoid operators. Here, we disassemble the estimated probability volume into the estimated unity along the depth-hypothesis dimension. To regress the depth, we first select the optimal hypothesis with the maximum unity at each pixel, then calculate its offset to the ground-truth depth from the estimated proximity, and finally fuse the two into the estimated depth. The detailed procedure is shown in Algorithm 2.
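The regression step can be sketched for one pixel as below. It is an interpretation of the description above rather than the paper's Algorithm 2, and it assumes uniformly spaced hypotheses and that proximity is one minus the normalized offset; the function name and values are hypothetical.

```python
def regress_from_unity(unity, hypotheses):
    """Recover a continuous depth from per-hypothesis unity estimates.

    Assumes uniformly spaced, ascending hypotheses and sigmoid outputs
    in [0, 1] for `unity`.
    """
    interval = hypotheses[1] - hypotheses[0]
    # 1) Classification: pick the hypothesis with the maximum unity.
    best = max(range(len(unity)), key=lambda i: unity[i])
    # 2) Regression: turn the estimated proximity back into an offset.
    offset = (1.0 - unity[best]) * interval
    # 3) Fuse both into the final depth.
    return hypotheses[best] + offset

hypotheses = [1.0, 2.0, 3.0, 4.0]                 # illustrative values
estimated_unity = [0.05, 0.75, 0.10, 0.00]
depth = regress_from_unity(estimated_unity, hypotheses)
print(depth)
```

Only the winning entry contributes to the depth, so spurious small responses at other hypotheses do not shift the estimate the way they would in a weighted sum.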
Generally, the depth hypothesis of MVS models will be sampled quite densely to ensure the accuracy of the estimated depth, which will cause obvious sample imbalance because there is only one positive sample (the at most one non-zero target) among hundreds of hypotheses. Meanwhile, the model needs to pay more attention to hard samples to prevent overfitting. Focal Loss (FL) [lin2017focal] has been proposed to solve these two problems: it automatically distinguishes hard samples through the estimated unity $u$ and rebalances the samples through the tunable parameters $\alpha$ and $\gamma$. Here, we discuss a certain pixel for convenience. The typical definition of FL is:
$$FL(u, q) = \begin{cases} -\alpha\,(1-u)^{\gamma}\log(u), & q = 1 \\ -(1-\alpha)\,u^{\gamma}\log(1-u), & \text{otherwise} \end{cases} \tag{6}$$
where $q \in \{0, 1\}$ is the discrete target. Therefore, the traditional FL is not suitable for our continuous situation. To enable successful training with our representation, we borrow the main idea from FL. Above all, the binary cross-entropy $-\log(u)$ or $-\log(1-u)$ needs to be extended to its complete form $BCE(\tilde{q}, u)$. Correspondingly, the scaling factor should also be adjusted appropriately. The generalized FL form (GFL) obtained through these two steps is:
$$GFL(u, \tilde{q}) = \begin{cases} \alpha\,|\tilde{q}-u|^{\gamma}\,BCE(\tilde{q}, u), & \tilde{q} > 0 \\ (1-\alpha)\,u^{\gamma}\,BCE(\tilde{q}, u), & \text{otherwise} \end{cases} \tag{7}$$
where $\tilde{q} \in [0, 1]$ is the continuous target. This advanced version is currently adopted by some existing methods under different names, e.g., QFL in [gfl2020] or VFL in [zhang2021varifocalnet]. In this paper, however, we point out that this implementation is not perfect in scaling hard and easy samples, because it ignores the magnitude of the ground-truth.
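The relation between the discrete and the generalized form can be checked with the sketch below. It follows the verbal description above (complete binary cross-entropy plus an absolute-error scaling factor); the default `alpha` and `gamma` are illustrative values, not the settings used in this paper.

```python
import math

def bce(target, pred):
    """Complete binary cross-entropy for a target in [0, 1] and pred in (0, 1)."""
    return -(target * math.log(pred) + (1.0 - target) * math.log(1.0 - pred))

def focal_loss(target, pred, alpha=0.25, gamma=2.0):
    """Focal loss for a discrete target in {0, 1}."""
    if target == 1:
        return -alpha * (1.0 - pred) ** gamma * math.log(pred)
    return -(1.0 - alpha) * pred ** gamma * math.log(1.0 - pred)

def generalized_focal_loss(target, pred, alpha=0.25, gamma=2.0):
    """Generalized form: absolute error as the scaling factor, complete BCE."""
    if target > 0:
        return alpha * abs(target - pred) ** gamma * bce(target, pred)
    return (1.0 - alpha) * pred ** gamma * bce(target, pred)

fl_pos, gfl_pos = focal_loss(1, 0.3), generalized_focal_loss(1.0, 0.3)
fl_neg, gfl_neg = focal_loss(0, 0.3), generalized_focal_loss(0.0, 0.3)
gfl_soft = generalized_focal_loss(0.6, 0.3)    # continuous target, no FL analogue
print(fl_pos, gfl_pos, fl_neg, gfl_neg, gfl_soft)
```

For targets of exactly 0 or 1 the two functions agree, so the generalized form only changes behavior for continuous labels.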
As shown in Fig. 3, the first two samples will be considered the hardest under the absolute error measurement in GFL. However, the absolute error cannot distinguish samples with different targets. Even if the first two samples in Fig. 3 have the same absolute error, this error obviously has a smaller effect on the first sample due to its larger ground-truth. To resolve this ambiguity, we further improve the scaling factor in GFL through the relative error and propose our naive version of Unified Focal Loss (UFL):
$$UFL(u, \tilde{q}) = \begin{cases} \alpha \left(\dfrac{|\tilde{q}-u|}{q^{+}}\right)^{\gamma} BCE(\tilde{q}, u), & \tilde{q} > 0 \\ (1-\alpha) \left(\dfrac{u}{q^{+}}\right)^{\gamma} BCE(\tilde{q}, u), & \text{otherwise} \end{cases} \tag{8}$$
where $q^{+}$ is the positive target. It can be seen from Eq. 8 that FL is a special case of UFL when the positive target is the constant 1.
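A small numerical sketch makes the difference between the two scaling factors concrete. The two samples below are made up so that they share the same absolute error; `gamma` is an illustrative value and the helper names are hypothetical.

```python
def absolute_scale(target, pred, gamma=2.0):
    """Scaling factor based on the absolute error."""
    return abs(target - pred) ** gamma

def relative_scale(target, pred, gamma=2.0):
    """Scaling factor based on the error relative to the positive target."""
    return (abs(target - pred) / target) ** gamma

# Two positive samples with the same absolute error of 0.3.
large_target = (0.9, 0.6)
small_target = (0.4, 0.1)

abs_large = absolute_scale(*large_target)
abs_small = absolute_scale(*small_target)
rel_large = relative_scale(*large_target)
rel_small = relative_scale(*small_target)
print(abs_large, abs_small, rel_large, rel_small)
```

The absolute factor treats both samples as equally hard, whereas the relative factor ranks the sample with the smaller target as harder, since the same error removes a larger share of its target.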
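Moreover, we noticed that the range of the scaling factor is $[0, +\infty)$, which may lead to a special case like the last sample in Fig. 3. Even a small number of such samples will overwhelm the loss and the computed gradients due to their huge scaling factor. In this paper, we solve this problem by introducing a dedicated function to control the range of the scaling factor. Meanwhile, to keep the precious positive learning signals, we adopt an asymmetrical scaling strategy. The complete UFL can be modeled as:
$$UFL(u, \tilde{q}) = \begin{cases} \alpha^{+} \left(S_b^{+}\!\left(\dfrac{|\tilde{q}-u|}{q^{+}}\right)\right)^{\gamma} BCE(\tilde{q}, u), & \tilde{q} > 0 \\ \alpha^{-} \left(S_b^{-}\!\left(\dfrac{u}{q^{+}}\right)\right)^{\gamma} BCE(\tilde{q}, u), & \text{otherwise} \end{cases} \tag{9}$$
where the dedicated function is designed as the sigmoid-like function ($S_b(x) = 1/(1+b^{-x})$) with a base of $b$ in this paper.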
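The effect of bounding the scaling factor can be sketched as follows. This is a simplified illustration under stated assumptions: the bounding function is taken to be 1 / (1 + b^(-x)), the same bounding function is used for both branches without any extra range rescaling, and `alpha_pos`, `alpha_neg`, `gamma` and `base` are hypothetical values rather than the paper's settings.

```python
import math

def sigmoid_like(x, base):
    """Bounded, monotonically increasing function: 1 / (1 + base ** (-x))."""
    return 1.0 / (1.0 + base ** (-x))

def bce(target, pred):
    """Complete binary cross-entropy for a target in [0, 1] and pred in (0, 1)."""
    return -(target * math.log(pred) + (1.0 - target) * math.log(1.0 - pred))

def unified_focal_loss(target, pred, positive_target,
                       alpha_pos=1.0, alpha_neg=0.75, gamma=2.0, base=5.0):
    """Per-hypothesis loss with a bounded relative-error scaling factor.

    `positive_target` is the non-zero target of the pixel that this
    hypothesis belongs to; all default parameters are illustrative.
    """
    if target > 0:
        scale = sigmoid_like(abs(target - pred) / positive_target, base)
        return alpha_pos * scale ** gamma * bce(target, pred)
    scale = sigmoid_like(pred / positive_target, base)
    return alpha_neg * scale ** gamma * bce(target, pred)

# A negative sample at a pixel whose positive target is tiny: the raw
# relative error would be 0.5 / 0.01 = 50, i.e. a factor of 2500 for gamma = 2.
raw_factor = (0.5 / 0.01) ** 2.0
bounded_loss = unified_focal_loss(0.0, 0.5, positive_target=0.01)
positive_loss = unified_focal_loss(0.8, 0.6, positive_target=0.8)
print(raw_factor, bounded_loss, positive_loss)
```

Because the bounded factor never exceeds 1 in this sketch, the extreme negative sample contributes at most its weighted cross-entropy instead of dominating the batch. Pixels with no positive target at all are not handled here.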
It’s straightforward to apply our Unification and UFL to existing learning-based MVS methods. To illustrate the effectiveness and flexibility of the proposed modules, we build the UniMVSNet, whose framework is depicted in Fig. 4, based on the coarse-to-fine strategy. This pipeline abides by the procedure reviewed in Sec. 3.1, except the depth representation and optimization.
Inherited from CasMVSNet [gu2020cascade], we adopt an FPN-like network to extract multi-scale features, and uniformly sample the depth hypothesis with a decreasing interval and a decreasing number. To better handle the unreliable matching in non-Lambertian regions, we adopt an adaptive aggregation method like [yi2020pyramid], which adds only a negligible number of parameters, to aggregate the feature volumes warped by the differentiable homography. Meanwhile, we also apply multi-scale 3D CNNs to regularize the cost volume, and the generated probability volume at each stage is treated as the estimated Unity here, which can be further regressed to an accurate depth as shown in Algorithm 2. It can be seen from Fig. 4 that UniMVSNet directly optimizes the cost volume through UFL, which can effectively avoid the overfitting caused by the indirect learning strategy in regression methods.
Training Loss. As shown in Fig. 4, we apply UFL to all stages and fuse them with different weights. The total loss can be defined as:
$$Loss = \sum_{i=1}^{L} \lambda_i\,UFL_i \tag{10}$$
where $UFL_i$ is the average UFL over all valid pixels at stage $i$, $\lambda_i$ denotes the weight of $UFL_i$, and $L$ is the number of stages.
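The stage-wise fusion can be sketched as below: each stage averages its per-pixel losses over valid pixels, and the stage averages are combined by a weighted sum. The per-pixel losses, masks and weights are illustrative numbers, and the helper names are hypothetical.

```python
def stage_loss(pixel_losses, valid_mask):
    """Average the per-pixel losses over valid pixels only."""
    valid = [loss for loss, ok in zip(pixel_losses, valid_mask) if ok]
    return sum(valid) / len(valid) if valid else 0.0

def total_loss(stage_losses, stage_weights):
    """Weighted sum of the per-stage averages."""
    return sum(w * loss for w, loss in zip(stage_weights, stage_losses))

# Illustrative per-stage losses (coarse to fine) and weights.
masked_average = stage_loss([1.0, 2.0, 3.0], [True, False, True])
fused = total_loss([masked_average, 1.0, 0.5], [0.5, 1.0, 2.0])
print(masked_average, fused)
```

Pixels without ground-truth depth are excluded by the mask, so they influence neither the stage average nor the fused loss.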
|Method||Acc. (mm)||Comp. (mm)||Overall (mm)|
|COLMAP [schonberger2016structure, schonberger2016pixelwise]||0.400||0.664||0.532|
Quantitative results of F-score on Tanks and Temples benchmark. Best results in each category are in bold. “Mean” refers to the mean F-score of all scenes (higher is better). Our model outperforms all previous MVS methods by a significant margin on both the Intermediate and Advanced sets. The snapshot of the most recent ranking results is shown in Supp. Mat.
This section demonstrates the state-of-the-art performance of UniMVSNet with comprehensive experiments and verifies the effectiveness of the proposed Unification and UFL through ablation studies. We first introduce the datasets and implementation and then analyze our results.
Datasets. We evaluate our model on the DTU [aanaes2016large] and Tanks and Temples [knapitsch2017tanks] benchmarks and finetune on the recently published BlendedMVS [yao2020blendedmvs]. (a) DTU is an indoor MVS dataset with 124 different scenes scanned from 49 or 64 views under 7 different lighting conditions with fixed camera trajectories. We adopt the same training, validation, and evaluation split as defined in [yao2018mvsnet]. (b) Tanks and Temples is collected in a more complex realistic environment, and it is divided into the intermediate and advanced sets. While the intermediate set contains 8 scenes with large-scale variations, the advanced set has 6 scenes. (c) BlendedMVS is a large-scale synthetic dataset, which consists of 113 indoor and outdoor scenes and is split into 106 training scenes and 7 validation scenes.
Implementation. Following the common practice, we first train our model on the DTU training set and evaluate it on the DTU evaluation set, and then finetune our model on BlendedMVS before validating the generalization of our approach on Tanks and Temples. The input view selection and data pre-processing strategies are the same as [yao2018mvsnet]. Meanwhile, we utilize the finer DTU ground-truth as [wei2021aa]. In this paper, UniMVSNet is implemented in 3 stages with 1/4, 1/2, and 1 of the original input image resolution respectively. We follow the same configuration (e.g., depth interval) of the model at each stage as [gu2020cascade] in both training and evaluation of DTU. When training on DTU, the number of input images is set to and the image resolution is resized to . To emphasize the contribution of positive signals, we set , and scale the range of to and to . The other tunable parameters in UFL are configured stage-wise. We optimize our model for 16 epochs with the Adam optimizer [kingma2014adam]; the initial learning rate is set to 0.001 and decayed by a factor of 2 after 10, 12, and 14 epochs. During the evaluation of DTU, we also resize the input image size to and set the number of the input images to 5. We report the standard metrics (accuracy, completeness, and overall) proposed by the official evaluation protocol [aanaes2016large]. Before testing on the Tanks and Temples benchmark, we finetune our model on BlendedMVS for 10 epochs. We take 7 images as the input with the original size of . For benchmarking on Tanks and Temples, the number of depth hypotheses in the coarsest stage is changed from 48 to 64, and the corresponding depth interval is set to 3 times the interval of [yao2018mvsnet]. We report the F-score metric to measure both the accuracy and completeness.
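The learning-rate schedule described above can be written as a small step function. The sketch assumes 0-indexed epochs and that "after 10 epochs" means the drop applies from epoch index 10 onward; the function name is hypothetical.

```python
def learning_rate(epoch, base_lr=0.001, milestones=(10, 12, 14), factor=2.0):
    """Learning rate for a 0-indexed epoch.

    The rate is divided by `factor` once for every milestone that has
    already been reached (assumed interpretation of the schedule).
    """
    drops = sum(1 for m in milestones if epoch >= m)
    return base_lr / factor ** drops

schedule = [learning_rate(e) for e in range(16)]
print(schedule)
```

Over the 16 training epochs this gives 0.001 for the first ten epochs, followed by 0.0005, 0.00025 and 0.000125 for two epochs each.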
Similar to previous methods [yao2018mvsnet, yao2019recurrent, gu2020cascade], we introduce photometric and geometric constraints for depth map filtering. The probability threshold and the number of consistent views are set to 0.3 and 3 respectively, which is the same as [yao2019recurrent]. The final 3D point cloud is obtained through the same depth map fusion method as [yao2018mvsnet, yao2019recurrent, gu2020cascade].
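The combined filtering rule can be sketched per pixel as below. Only the two thresholds come from the text; the use of a strict inequality for the probability and of "at least" for the view count, as well as the function name, are assumptions.

```python
def keep_pixel(confidence, consistent_views, prob_threshold=0.3, min_views=3):
    """Keep a depth estimate only if it passes both checks.

    Photometric check: the estimated probability exceeds the threshold.
    Geometric check: enough source views agree with the estimate.
    """
    return confidence > prob_threshold and consistent_views >= min_views

kept = keep_pixel(0.5, 3)
low_confidence = keep_pixel(0.2, 5)
too_few_views = keep_pixel(0.9, 2)
print(kept, low_confidence, too_few_views)
```

A pixel is fused into the point cloud only when both constraints hold, so a confident but geometrically inconsistent estimate is still discarded.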
We compare our method to traditional and recent learning-based MVS methods. The quantitative results on the DTU evaluation set are summarized in Tab. 1, which indicates that our method has made great progress in performance. While Gipuma [galliani2015massively] ranks first in the accuracy metric, our method outperforms all methods on the other two metrics significantly. Depth map estimation and point reconstruction of a reflective and low-textured sample are shown in Fig. 5, which shows that our model is more robust in challenging regions. Figure 6 shows some qualitative results compared with other methods. We can see that our model can generate more complete point clouds with finer details. More results about the robust probability volume are shown in Supp. Mat.
|Baseline (Uni) + GFL||✓||✓||✓||5||0.361||0.289||0.325|
|Baseline (Uni) + UFL||✓||✓||✓||5||0.353||0.287||0.320|
|Baseline (Uni) + UFL + AA||✓||✓||✓||5||0.355||0.279||0.317|
|Baseline (Uni) + UFL + AA + FGT||✓||✓||✓||✓||5||0.352||0.278||0.315|
As the common practice, we verify the generalization ability of our method on Tanks and Temples benchmark using the model trained on DTU and finetuned on BlendedMVS. We adopt a depth map filtering strategy similar to that of DTU, except for the geometric constraint. Here, we follow the dynamic geometric consistency checking strategies proposed in [yan2020dense]. Through this dynamic method, those pixels with fewer consistent views but smaller reprojection errors and those with larger errors but more consistent views will also survive.
The corresponding quantitative results on both intermediate and advanced sets are reported in Tab. 2. Our method achieves state-of-the-art performance among all existing MVS methods and yields first place in most scenes. Notably, our model outperforms the previous best model by 2.68 points and 3.24 points on the intermediate and advanced sets. Such obvious advantages just show that our model not only has the best performance but also exhibits the strongest generalization and robustness. The qualitative point cloud results are visualized in Fig. 7.
As aforementioned, we adopt some extra strategies (e.g., adaptive aggregation and finer ground-truth) that have been adopted by recent methods [wei2021aa, yi2020pyramid] to train our model for a fair comparison with them. However, this may not be fair to those methods inherited only from MVSNet. In this section, we will show through extensive ablation studies that even if these strategies are eliminated, our method still achieves a significant improvement. We use our baseline CasMVSNet [gu2020cascade], whose original representation is regression, as the backbone and change various components, e.g., depth representation, optimization, aggregation, and ground-truth. We adopt 5 input views for all models for a fair comparison.
Benefits of Unification. As shown in Tab. 3, significant progress can be made even if purely replacing the traditional depth representation with our unification. Meanwhile, unification is more robust when the hypothesis range of the finer stage doesn’t cover the ground-truth depth. In this case, the target unity of the unification representation generated by Algorithm 1 is all zero, which is a correct supervision signal anyway, and the traditional representation will generate an incorrect supervision signal to pollute the model training.
Benefits of UFL. Applying Focal Loss to our representation can effectively overcome the sample imbalance problem. It can be seen from Tab. 3 that GFL has a huge benefit to accuracy, albeit with a slight loss of completeness. And our UFL can further improve the accuracy and completeness significantly on the basis of GFL. More ablation results about UFL are shown in Supp. Mat.
As aforementioned, there are several tunable parameters in our Unified Focal Loss that will affect performance. The process of finding a satisfactory parameter configuration is a cumbersome challenge for newcomers. On the other hand, as long as we have a sufficient understanding of Focal Loss [lin2017focal], this process will become handy. Anyway, an adaptive form or a form with fewer parameters will be a more concise and efficient choice.
In this paper, we propose a unified depth representation and a unified focal loss to promote the effectiveness of multi-view stereo. Compared with the traditional regression or classification representation, our unification representation can recover finer 3D scenes, benefiting from direct learning of the cost volume. Compared with the traditional FL or GFL, our UFL is able to capture more fine-grained indicators for rebalancing samples and deal with continuous labels more reasonably. What is more valuable is that these two modules are high-performance modules that do not impose any memory or computational costs. Each plug-and-play module can be easily integrated into existing MVS frameworks and achieve significant performance improvements, and we have shown this through the state-of-the-art performance of UniMVSNet on the DTU and Tanks and Temples benchmarks. In the future, we plan to explore the integration of our modules into stereo matching or monocular depth estimation and look for more concise loss functions.