HFS
Improved surface reconstruction using high-frequency details
Neural rendering can be used to reconstruct implicit representations of shapes without 3D supervision. However, current neural surface reconstruction methods have difficulty learning high-frequency geometry details, so the reconstructed shapes are often over-smoothed. We develop HF-NeuS, a novel method to improve the quality of surface reconstruction in neural rendering. We follow recent work to model surfaces as signed distance functions (SDFs). First, we offer a derivation to analyze the relationship between the SDF, the volume density, the transparency function, and the weighting function used in the volume rendering equation and propose to model transparency as transformed SDF. Second, we observe that attempting to jointly encode high-frequency and low-frequency components in a single SDF leads to unstable optimization. We propose to decompose the SDF into a base function and a displacement function with a coarse-to-fine strategy to gradually increase the high-frequency details. Finally, we design an adaptive optimization strategy that makes the training process focus on improving those regions near the surface where the SDFs have artifacts. Our qualitative and quantitative results show that our method can reconstruct fine-grained surface details and obtain better surface reconstruction quality than the current state of the art. Code available at https://github.com/yiqun-wang/HFS.
3D reconstruction from a set of images is a fundamental challenge in computer vision Hartley and Zisserman (2003). Recently, the seminal framework NeRF Mildenhall et al. (2020) modeled 3D objects as a density function and a view-dependent color for each point in the volume. Both are implicit functions represented by a neural network. The results of this approach are very strong, and NeRF has inspired a large amount of follow-up work, e.g. Liu et al. (2020); Lin et al. (2021); Park et al. (2021); Zhang et al. (2020); Niemeyer and Geiger (2021); Barron et al. (2021). In particular, one direction of work tries to constrain the density field to make it more consistent with a density field stemming from a surface. In the original formulation, almost arbitrary densities can be modeled by the neural network and there is no guarantee that a meaningful surface can be extracted from the density. Two noteworthy recent approaches, NeuS Wang et al. (2021) and VolSDF Yariv et al. (2021), proposed to embed a signed distance field in the volume rendering equation. Instead of modeling the density with a neural network, these approaches model a signed distance function with a neural network. This leads to greatly improved surface reconstruction.
We build on this exciting recent work and seek further improvement in the quality of the reconstructed surfaces. To this end, we propose our method HF-NeuS, consisting of three new building blocks. First, we analyze the relationship between the signed distance function on the one hand and the volume density, transparency, and the weighting function on the other hand. We conclude from our derivation that it is best to model a function that maps signed distances to the transparency, and we propose a class of functions that fulfill the theoretical requirements. Second, we observe that it is difficult to learn high-frequency details directly with a single signed distance function, as shown in Fig. 2. We therefore propose to decompose the signed distance function into a base function and a displacement function, following related work. We adapt this idea to the differentiable NeRF rendering framework and the NeRF training scheme. Third, the functions that translate distance to transparency can be chosen to have a parameter, which we call the scale $s$, that controls the slope of the function (or the deviation of the derivative). The parameter controls how precisely the surface is localized and how strongly colors away from the surface influence the result. In previous work, this parameter is set globally, but it is trainable, so it can change from iteration to iteration. We propose a novel spatially adaptive weighting scheme to influence this parameter, so that the optimization focuses more on problematic regions in the distance field. These three building blocks are the main contributions of the paper. Our results show that HF-NeuS clearly improves surface reconstruction. On the 15-scene DTU benchmark, we improve the Chamfer distance from the current best values of 0.87 (NeuS) and 0.86 (VolSDF) to 0.77 (see Figs. 1 and 4 for a visual comparison). The benchmark as well as the metric were proposed by previous work.
Multi-view 3D reconstruction. 3D reconstruction based on multiple views is a fundamental challenge in the field of 3D vision. Classical 3D reconstruction algorithms usually reconstruct discrete 3D representations. The methods can be roughly categorized into voxel-based methods and point-based methods. Voxel-based methods De Bonet and Viola (1999); Seitz and Dyer (1999); Kutulakos and Seitz (2000); Broadhurst et al. (2001); Izadi et al. (2011); Nießner et al. (2013) first discretize the three-dimensional space uniformly into voxels, and then decide whether the surface occupies a particular voxel. Point-based methods Barnes et al. (2009); Furukawa and Ponce (2009); Schönberger et al. (2016); Schonberger and Frahm (2016); Galliani et al. (2016) usually use the correlation between multiple views to reconstruct depth maps, and fuse multiple depth maps into a point cloud. The point cloud subsequently needs to be reconstructed into a mesh model using explicit algorithms like ball-pivoting Bernardini et al. (1999) and Delaunay triangulation Labatut et al. (2007), or implicit algorithms like Poisson surface reconstruction Kazhdan et al. (2006).
Neural implicit surfaces. Recently, neural implicit representations have received a lot of attention. The corresponding methods aim to reconstruct continuous implicit function representations of shapes directly from 2D images. A required building block is differentiable rendering, which maps the 3D scene representation to a 2D image for a given camera pose. DVR Niemeyer et al. (2020) utilizes surface rendering to model the occupancy function of a 3D shape, using a root search to obtain the location of the surface and predict a 2D image. IDR Yariv et al. (2020) models the signed distance function of the shape and uses a sphere tracing algorithm to render 2D images. A significant milestone in 3D reconstruction was the development of NeRF Mildenhall et al. (2020). It uses volume rendering to map a 3D density field and a 3D directional color field to a 2D image. The proposed representation is flexible enough so that realistic images can be synthesized. To model more complex scenes, NeRF++ Zhang et al. (2020) proposes to model the background scene with an additional neural radiance field, which handles the foreground and background separately and achieves better results for large scenes. However, the density function is not as easy to control as the occupancy function or the signed distance function, and it is difficult to guarantee the smoothness of the generated 3D shape. Subsequently, UNISURF Oechsle et al. (2021) embeds the occupancy function into the volume rendering equation of NeRF. They use a decay strategy to control which region to sample around the surface during training without explicitly modeling volumetric density. Using signed distance functions, VolSDF Yariv et al. (2021) embeds a signed distance function into the density formulation and proposes a sampling strategy that satisfies a derived error bound on the transparency function. NeuS Wang et al. (2021) derives an unbiased density function using logistic sigmoid functions and introduces a learnable parameter to control the slope of the function during rendering and sampling. Concurrent to our work, NeuralPatch Darmon et al. (2022) uses the homography matrix to warp the source patches adjacent to the reference image, constraining colors in the volume to come from nearby patches. However, the calculation of patch warping relies on an accurate surface normal, so it cannot be trained from scratch. Therefore, it is only used as a fine-tuning or post-processing method for other algorithms to optimize the surface. We consider VolSDF and NeuS as the current state of the art and we will compare to these two methods.
High-frequency detail reconstruction. It is generally difficult for neural networks to learn high-frequency information from raw signals. Inspired by the field of natural language processing, positional encoding Mildenhall et al. (2020); Tancik et al. (2020) is used to guide the network to reconstruct high-frequency details. Positional encoding spreads the original signal into different frequency bands using sin and cos functions of different frequencies. Subsequently, SIREN Sitzmann et al. (2020) proposes to use the sin function as the activation function in the network. MipNeRF Barron et al. (2021) presents an integrated positional encoding to control frequency at different scales. Park et al. (2021) proposed a coarse-to-fine learning strategy to gradually increase high-frequency information, which was subsequently used for pose estimation Lin et al. (2021). Hertz et al. (2021) further propose a spatially adaptive progressive coding strategy. For surface reconstruction, implicit displacement fields were proposed for single-view 3D reconstruction Li and Zhang (2021). Based on the supervision of ground truth SDF values of sampled points, the method utilizes separate networks to model the base SDF and implicit displacement fields. Subsequently, Yifan et al. (2022) utilize SIREN networks to learn the base implicit function and the implicit displacement function, respectively, for point cloud reconstruction tasks. In contrast to our proposed algorithm, these methods require 3D supervision. Further, they do not involve the NeRF formulation or volume rendering. In our work, we build on these ideas to develop a new state-of-the-art algorithm for multi-view reconstruction.
As input we consider a set of images and their corresponding intrinsic and extrinsic camera parameters. The goal of HF-NeuS is to reconstruct a representation of the 3D surface as an implicit function. Specifically, we encode surfaces as signed distance fields. We explain our method in three parts. 1) First, we show how to embed the signed distance function into the formulation of volume rendering and discuss how to model the relationship between distance and transparency. 2) Then, we propose to utilize an additional displacement signed distance function to add high-frequency details to the base signed distance function. 3) Finally, we observe that the function that maps signed distances to transparency is controlled by a parameter that determines the slope of the function. We propose a scheme to set this parameter in a spatially varying manner depending on the gradient norm of the distance field, rather than keeping it constant for the complete volume within a single training iteration.
We first review the integral formula for volume rendering and derive a relationship between transparency and the weighting function (the product between density and transparency). Based on this analysis we discuss the criteria for functions that are suitable to map signed distances to transparency and propose a class of functions that fulfill the theoretical requirements.
Given a ray $\mathbf{r}(t) = \mathbf{o} + t\mathbf{d}$, the volume rendering equation is used to calculate the radiance $C(\mathbf{r})$ of the pixel corresponding to the ray. The volume rendering equation is an integral along the ray and involves the following quantities defined for each point in the volume: the volume density $\sigma$ and the (directional) color $\mathbf{c}$. In addition, the volume has compact support and the boundaries of the volume are encoded by $t_n$ and $t_f$.
$$C(\mathbf{r}) = \int_{t_n}^{t_f} T(t)\,\sigma(\mathbf{r}(t))\,\mathbf{c}(\mathbf{r}(t), \mathbf{d})\,dt \tag{1}$$
The transparency $T(t)$ is derived from the volume density as explained below. The function $T(t)$ denotes the accumulated transmittance along the ray from $t_n$ to $t$,
$$T(t) = \exp\left(-\int_{t_n}^{t} \sigma(\mathbf{r}(u))\,du\right) \tag{2}$$
and is a monotonic decreasing function with a starting value of $T(t_n) = 1$. The product $T(t)\,\sigma(\mathbf{r}(t))$ can be regarded as a weighting function in the volume rendering equation as in Eq. (1).
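As a minimal sketch of how the rendering integral and the accumulated transmittance are usually evaluated in practice, the following NumPy snippet applies the standard alpha-compositing quadrature to one ray. The function name `composite`, the constant-density toy ray, and the handling of the last interval are assumptions made for illustration and are not taken from the paper.

```python
import numpy as np

def composite(sigma, color, t):
    """Quadrature of C = integral of T(t) * sigma(t) * c(t) dt along one ray.

    sigma: (K,) densities, color: (K, 3) colors, t: (K,) increasing sample depths.
    """
    # Interval lengths; the last interval reuses the previous spacing (assumption).
    delta = np.diff(t, append=2.0 * t[-1] - t[-2])
    alpha = 1.0 - np.exp(-sigma * delta)            # opacity of each interval
    # T_i = prod_{j<i} (1 - alpha_j): transmittance from the near bound to sample i.
    T = np.concatenate([[1.0], np.cumprod(1.0 - alpha)[:-1]])
    weights = T * alpha                              # discrete weighting function
    return weights @ color, weights, T

# Toy ray (assumed): constant density 2 and constant color over depth [0, 1).
t = np.linspace(0.0, 1.0, 64, endpoint=False)
sigma = np.full_like(t, 2.0)
color = np.tile([1.0, 0.5, 0.0], (t.size, 1))
C, weights, T = composite(sigma, color, t)
```

For a constant density the weights sum to one minus the final transmittance, so the result can be compared against the closed form of the integral.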
In order to involve a signed distance function $f$, we have to define a function $\Psi$ that transforms the signed distance function so that it can be used to compute the density-related terms in the rendering equation. One way is to directly model a density function $\sigma$, as proposed by VolSDF Yariv et al. (2021). Taking this approach, a sampling method is required that keeps the sampling error below a threshold by gradually reducing the scale parameter. Another way is to model the weighting function $T(t)\,\sigma(\mathbf{r}(t))$, as proposed by NeuS. The NeuS paper showcases a complex derivation to get the expression for the density function $\sigma$.
We rethink this problem to obtain a simplified derivation by focusing on transparency instead of the weighting function, which also gives a better understanding of the problem, as follows:
$$\frac{dT(t)}{dt} = \frac{d}{dt}\exp\left(-\int_{t_n}^{t} \sigma(\mathbf{r}(u))\,du\right) = -T(t)\,\sigma(\mathbf{r}(t)) \tag{3}$$
An interesting observation is that the derivative of the transparency function is the negative weighting function. The weighting function has the property of having a maximum on the surface. We take the derivative of the weighting function and set it to 0 to find the extrema (maxima), as follows.
$$\frac{d\big(T(t)\,\sigma(\mathbf{r}(t))\big)}{dt} = -\frac{d^2 T(t)}{dt^2} = 0 \tag{4}$$
Assuming a planar surface and a single ray-plane intersection, we can see that the extremum point, denoted as $t^*$, of the weighting function will also be the extremum point of the derivative of the transparency function $T'(t)$. The point $\mathbf{r}(t^*)$ is expected to be the intersection of the ray and the surface. Therefore, we consider defining the transparency function directly as $T(t) = \Psi(f(\mathbf{r}(t)))$. If the transparency function is designed in such a way that its derivative reaches a minimum on the surface, it follows that the weighting function has a maximum on the surface. Therefore, one can directly model a transparency function under the condition that its derivative has a minimum on the surface. This is conceptually simpler than modeling the weighting function as proposed by NeuS. We compute the derivative of $T(t)$ as follows.
$$\frac{dT(t)}{dt} = \frac{d\,\Psi(f(\mathbf{r}(t)))}{dt} = \Psi'(f(\mathbf{r}(t)))\,\nabla f(\mathbf{r}(t)) \cdot \mathbf{d} = \Psi'(f(\mathbf{r}(t)))\cos\theta \tag{5}$$
where $\cos\theta = \nabla f(\mathbf{r}(t)) \cdot \mathbf{d}$ is the product of the surface normal and the ray direction, which is a constant in case of a planar surface and a single ray-plane intersection. The signed distance function is zero on the surface. Hence $\Psi'$ has an extremum at $0$. This also means $\Psi$ has the steepest slope at the surface of the shape. On the other hand, the signed distance function is positive outside of the object, and negative when entering the interior of the object. We generally assume that $\mathbf{r}(t_n)$ is outside, so that the signed distance starts positive and decays to a negative value along a ray, which is a monotonic decreasing function. According to the characteristics of the transparency $T(t)$, the transparency starts at 1 at $t_n$ and is a monotonic decreasing function to 0 inside the object. This inverse property results in the function $\Psi$ being a monotonic increasing function from 0 to 1. Therefore, we have our design criteria for $\Psi$: $\Psi$ should be a monotonic increasing function from 0 to 1, with the steepest slope at 0.
A very intuitive way to satisfy these criteria is to use a sigmoid function, normalized to have an output in the interval $[0, 1]$. We simply use the logistic sigmoid function proposed by NeuS Wang et al. (2021) for a fair comparison. However, our idea is more general and other sigmoid functions could be used. Our designed transparency function is as follows,
$$T(t) = \Psi_s(f(\mathbf{r}(t))) = \left(1 + e^{-s f(\mathbf{r}(t))}\right)^{-1} \tag{6}$$
where $\Psi_s$ is the logistic sigmoid function with parameter $s$ controlling the slope of the function. Note that the parameter $s$ also determines the standard deviation of the derivative $\Psi_s'$, which is proportional to $1/s$. We will use this fact later when discussing the adaptive version of the framework.
Given the differentiable transparency function $T(t)$, the volume density $\sigma$ can be easily calculated following Eq. 3.
$$\sigma(\mathbf{r}(t)) = -\frac{1}{T(t)}\frac{dT(t)}{dt} \tag{7}$$
For discretization, we substitute Eq. 5 and Eq. 6 into Eq. 7 and take advantage of the property of the derivative of the logistic sigmoid function, $\Psi_s' = s\,\Psi_s(1 - \Psi_s)$. We get the formula for the discretized computation:
$$\sigma(\mathbf{r}(t_i)) = s\,\big(\Psi_s(f(\mathbf{r}(t_i))) - 1\big)\,\nabla f(\mathbf{r}(t_i)) \cdot \mathbf{d} \tag{8}$$
Then the volume rendering integral can be approximated using $\alpha$-composition, where $\alpha_i = 1 - \exp\big(-\sigma_i\,(t_{i+1} - t_i)\big)$. For multiple surface intersections, we follow the same strategy as NeuS Wang et al. (2021), where $\alpha_i = \mathrm{clamp}(\alpha_i, 0, 1)$. Compared with NeuS, we obtain a simpler formula for the density for the discretized computation, reducing the numerical problems caused by division in NeuS. Furthermore, our approach does not need to involve two different sets of sampling points, namely section points and mid-points, which makes it easier to satisfy the unbiased weighting function. Since there is no need to calculate the SDF and the color separately for the two different point sets, the color and the geometry are more consistent compared to NeuS. Compared to VolSDF Yariv et al. (2021), since the transparency function is explicit, our method can use inverse distribution sampling computed with the inverse CDF to satisfy the approximation quality. Thus no complex sampling scheme as in VolSDF is required. A visual comparison is shown in Fig. 3.
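The discretized density can be sketched in a few lines of NumPy. The snippet below assumes the logistic transparency function with scale `s`, a toy planar surface hit head-on by a ray, and a clamp of negative densities to zero as one way of handling rays that leave the shape. The names `density_from_sdf` and `grad_dot_d` are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def density_from_sdf(sdf, grad_dot_d, s):
    """sigma = s * (Psi_s(f) - 1) * (grad f . d), with Psi_s(f) = sigmoid(s * f)."""
    sigma = s * (sigmoid(s * sdf) - 1.0) * grad_dot_d
    return np.maximum(sigma, 0.0)   # assumption: no negative density on exiting rays

# Toy scene (assumed): plane at depth 1, ray pointing straight at it.
t = np.linspace(0.0, 2.0, 513)
sdf = 1.0 - t                        # signed distance along the ray
grad_dot_d = -np.ones_like(t)        # cos(theta) = -1 for a head-on ray
s = 64.0

sigma = density_from_sdf(sdf, grad_dot_d, s)
delta = np.diff(t, append=2.0 * t[-1] - t[-2])
alpha = np.clip(1.0 - np.exp(-sigma * delta), 0.0, 1.0)
T = np.concatenate([[1.0], np.cumprod(1.0 - alpha)[:-1]])
weights = T * alpha
t_peak = t[np.argmax(weights)]       # depth where the weighting function peaks
```

In this toy setting the weighting function peaks close to the depth where the signed distance crosses zero, which is the behavior the derivation asks for.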
In order to enable a multi-scale fitting framework, we propose to model the signed distance function as a combination of a base distance function and a displacement function Yifan et al. (2022); Li and Zhang (2021) along the normal of the base distance function. The implicit displacement function is an additional implicit function. The reason for this design is that it is difficult for a single implicit function to learn low-frequency and high-frequency information at the same time. The implicit displacement function can complement the base implicit function, so that it is easier to learn high-frequency information.
Compared with the task of learning implicit functions from point clouds, reconstructing 3D shapes from multiple images makes it more difficult to learn high-frequency content. We propose to use neural networks to learn frequencies at multiple scales, and to gradually increase the frequency content in a coarse-to-fine manner.
Suppose $f$ is the combined implicit function that represents the surface we want to obtain. The function $f_b$ is the base implicit function that represents the base surface. Following Yifan et al. (2022), the displacement implicit function $f_d$ is used to map the point $\mathbf{x}_b$ on the base surface to the surface point $\mathbf{x}$ along the normal $\mathbf{n}_b$, and vice versa $f_{d'}$ is used to map the surface point $\mathbf{x}$ back to the point $\mathbf{x}_b$ on the base surface along the normal $\mathbf{n}_b$, thus $f_{d'}(\mathbf{x}) = f_d(\mathbf{x}_b)$. Because of the nature of implicit functions, the relationship between the two functions can be expressed as follows,
$$f_b(\mathbf{x}_b) = f\big(\mathbf{x}_b + f_d(\mathbf{x}_b)\,\mathbf{n}_b\big) = 0 \tag{9}$$
where $\mathbf{n}_b = \nabla f_b(\mathbf{x}_b) / \|\nabla f_b(\mathbf{x}_b)\|$ is the normal of $\mathbf{x}_b$ on the base surface. To compute the expression for the implicit function $f$, we substitute $\mathbf{x}_b = \mathbf{x} - f_{d'}(\mathbf{x})\,\mathbf{n}_b$ into Eq. (9) and obtain the expression for the combined implicit function:
$$f(\mathbf{x}) = f_b\big(\mathbf{x} - f_{d'}(\mathbf{x})\,\mathbf{n}_b\big) \tag{10}$$
Therefore, we can use the base implicit function and the displacement implicit function to represent the combined implicit function. However, two challenges arise. First, Eq. 10 is only satisfied if the point $\mathbf{x}$ is on the surface. Second, the normal $\mathbf{n}_b$ at the point $\mathbf{x}_b$ is difficult to estimate when only knowing the position $\mathbf{x}$. We rely on two assumptions to solve the problem. One assumption is that this deformation can be applied to all iso-surfaces, i.e. $f_b(\mathbf{x}_b) = f\big(\mathbf{x}_b + f_d(\mathbf{x}_b)\,\mathbf{n}_b\big) = c$. In this way the equation is assumed to be valid for all points in the volume and not only on the surface. Another assumption is that $\mathbf{x}_b$ and $\mathbf{x}$ are not too far apart, thus $\mathbf{n}_b$ can be replaced with the normal $\mathbf{n}$ at the point $\mathbf{x}$ in Eq. (10). We control the magnitude of the implicit displacement function using a displacement constraint $4\Psi_s'(f_b)$.
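To make the composition of a base function and a displacement function concrete, here is a small NumPy sketch that evaluates a combined implicit function by shifting the query point along the base normal before evaluating the base function. The unit-sphere base shape, the sinusoidal displacement, and the finite-difference normal are all assumptions chosen for illustration. The sketch also omits the displacement constraint.

```python
import numpy as np

def f_base(x):
    """Base SDF: unit sphere (assumed toy shape)."""
    return np.linalg.norm(x, axis=-1) - 1.0

def f_disp(x):
    """Hypothetical high-frequency displacement, small relative to the base shape."""
    return 0.02 * np.sin(20.0 * x[..., 0])

def normal(f, x, eps=1e-4):
    """Unit normal of f at x from central finite differences."""
    offsets = eps * np.eye(3)
    g = np.stack([(f(x + o) - f(x - o)) / (2.0 * eps) for o in offsets], axis=-1)
    return g / np.linalg.norm(g, axis=-1, keepdims=True)

def f_combined(x):
    """f(x) = f_b(x - f_d(x) * n), with n taken from the base SDF at x."""
    n = normal(f_base, x)
    return f_base(x - f_disp(x)[..., None] * n)

# Sample points in a shell around the sphere.
rng = np.random.default_rng(0)
x = rng.normal(size=(128, 3))
x = x / np.linalg.norm(x, axis=-1, keepdims=True) * rng.uniform(0.8, 1.2, size=(128, 1))
values = f_combined(x)
```

For a sphere the normal is radial, so the combined function reduces to the base distance minus the displacement, which gives a quick sanity check of the construction.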
To precisely control the frequency, we use positional encoding to encode the base implicit function and the displacement implicit function separately. We would like to note some differences to Yifan et al. (2022). We use positional encoding instead of Siren Sitzmann et al. (2020), so that the frequency can be explicitly controlled by a coarse-to-fine strategy rather than simply using two Siren networks with two different frequency levels. This is useful when 3D supervision is not given. More details are shown in the supplementary. Positional encoding decomposes the input position into multiple selected frequency bands.
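$$\gamma(\mathbf{x}) = \big[\gamma_0(\mathbf{x}), \gamma_1(\mathbf{x}), \ldots, \gamma_{L-1}(\mathbf{x})\big] \tag{11}$$
where each component $\gamma_j(\mathbf{x})$ consists of a $\sin$ and a $\cos$ function with a different frequency.
$$\gamma_j(\mathbf{x}) = \big[\sin(2^j \pi \mathbf{x}), \cos(2^j \pi \mathbf{x})\big] \tag{12}$$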
Directly learning high-frequency positional encoding makes the network susceptible to noise, because wrongly learned high-frequencies hinder the learning of low frequencies. This problem is less pronounced if 3D supervision is available, however high-frequency information of images is easily introduced into the surface generation as noise. We use the coarse-to-fine strategy proposed by Park et al. Park et al. (2021) to gradually increase the frequency of the positional encoding.
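$$\gamma(\mathbf{x}, \alpha) = \big[\delta_0(\alpha)\gamma_0(\mathbf{x}), \delta_1(\alpha)\gamma_1(\mathbf{x}), \ldots, \delta_{L-1}(\alpha)\gamma_{L-1}(\mathbf{x})\big], \qquad \delta_j(\alpha) = \frac{1 - \cos\big(\mathrm{clamp}(\alpha L - j, 0, 1)\,\pi\big)}{2} \tag{13}$$
where $\alpha \in [0, 1]$ is the parameter that controls the frequency information involved. In each iteration, $\alpha$ is increased by $1/n_{max}$ until it reaches 1, where $n_{max}$ is the maximum number of iterations.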
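The coarse-to-fine weighting of frequency bands can be sketched as follows. This assumes the windowing of Park et al. (2021) with the control parameter normalized to the range from 0 to 1, and `L` frequency bands. The function names are hypothetical.

```python
import numpy as np

def c2f_weights(alpha, L):
    """Per-band weights (1 - cos(clamp(alpha * L - j, 0, 1) * pi)) / 2 for j = 0..L-1."""
    j = np.arange(L)
    return 0.5 * (1.0 - np.cos(np.clip(alpha * L - j, 0.0, 1.0) * np.pi))

def positional_encoding(x, alpha, L):
    """Band j holds [sin(2^j pi x), cos(2^j pi x)], scaled by its coarse-to-fine weight."""
    freqs = (2.0 ** np.arange(L)) * np.pi                                # (L,)
    angles = x[..., None, :] * freqs[:, None]                            # (..., L, D)
    bands = np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)   # (..., L, 2D)
    w = c2f_weights(alpha, L)[:, None]                                   # (L, 1)
    return (w * bands).reshape(*x.shape[:-1], -1)

x = np.random.default_rng(0).uniform(-1.0, 1.0, size=(5, 3))
enc_half = positional_encoding(x, alpha=0.5, L=8)   # only the 4 lowest bands are active
enc_full = positional_encoding(x, alpha=1.0, L=8)   # all bands are active
```

Raising `alpha` a little in every training iteration then switches on higher bands one after another, with a smooth cosine ramp for the band currently being introduced.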
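We utilize two positional encodings with different parameters $\alpha_b$ and $\alpha_d$ for the base and displacement functions. The two parameters are tied to each other, so only one of them is controlled for simplicity. We also use two MLP functions for fitting the base and displacement functions.
$$f(\mathbf{x}) = \mathrm{MLP}_b\Big(\gamma\big(\mathbf{x} - 4\Psi_s'(f_b)\,\mathrm{MLP}_d(\gamma(\mathbf{x}, \alpha_d))\,\mathbf{n},\ \alpha_b\big)\Big) \tag{14}$$
where $\mathbf{n} = \nabla f_b(\mathbf{x}) / \|\nabla f_b(\mathbf{x})\|$ can be computed from the gradient of $\mathrm{MLP}_b$, and $f_b(\mathbf{x}) = \mathrm{MLP}_b(\gamma(\mathbf{x}, \alpha_b))$. The scale $s$ of the displacement constraint should be clamped during training. We show how to control the adaptive $s$ in the supplemental materials.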
We bring this implicit function into Eq. (6) for calculating the transparency so that the radiance (color) of images can be computed by the volume rendering equation.
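To train the network, we employ a loss function that includes the radiance loss $\mathcal{L}_{rad}$ and the Eikonal regularization loss $\mathcal{L}_{reg}$ of the signed distance functions. For the regularization loss, we constrain both the base implicit function and the detailed implicit function.
$$\mathcal{L}_{reg} = \frac{1}{N}\sum_{k=1}^{N}\Big[\big(\|\nabla f_b(\mathbf{x}_k)\|_2 - 1\big)^2 + \big(\|\nabla f(\mathbf{x}_k)\|_2 - 1\big)^2\Big] \tag{15}$$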
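As an illustration of the Eikonal regularization term, the sketch below penalizes the deviation of the gradient norm from 1 at a set of sample points. It uses finite differences in place of automatic differentiation and two toy functions in place of the networks. All names are assumptions for this example.

```python
import numpy as np

def grad_fd(f, x, eps=1e-4):
    """Central finite-difference gradient of a scalar field f at points x (N, D)."""
    offsets = eps * np.eye(x.shape[-1])
    return np.stack([(f(x + o) - f(x - o)) / (2.0 * eps) for o in offsets], axis=-1)

def eikonal_loss(f, x):
    """Mean of (||grad f(x)|| - 1)^2 over the sample points."""
    g = np.linalg.norm(grad_fd(f, x), axis=-1)
    return np.mean((g - 1.0) ** 2)

def sphere(p):
    """A valid SDF: its gradient norm is 1 away from the origin."""
    return np.linalg.norm(p, axis=-1) - 1.0

def stretched(p):
    """Not a valid SDF: its gradient norm is 2 everywhere."""
    return 2.0 * sphere(p)

rng = np.random.default_rng(0)
x = rng.uniform(0.5, 1.5, size=(256, 3))
# Regularizing two functions at once amounts to adding one term per function.
loss_reg = eikonal_loss(sphere, x) + eikonal_loss(stretched, x)
```

In a training setup the same penalty would be applied to both the base function and the combined function, with gradients obtained from the networks.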
In previous subsections, the transparency function is defined as a sigmoid function that is controlled by a scale $s$. This parameter controls the slope of the sigmoid function and it also determines the standard deviation of the derivative, which is proportional to $1/s$. We can also say that it controls the smoothness of the function. When $s$ is large, the value of the sigmoid function drops sharply as the position moves away from the surface. On the contrary, the value decreases smoothly when $s$ is small. However, choosing a single parameter $s$ per iteration gives the same behavior at all spatial locations in the volume.
Since two signed distance functions need to be reconstructed, especially after the high frequency is superimposed, it is easy to have a situation where the Eikonal equation is not satisfied, that is, the norm of the gradient of the SDF is not 1 in some positions. Even with the regularization loss, it is impossible to avoid this problem.
We propose to use the norm of the gradient of the signed distance field to weight the parameter $s$ in a spatially varying manner. We increase $s$ when the norm of the gradient along the ray direction is larger than 1. When the norm of the gradient is greater than 1, the implicit function changes more drastically, which indicates a region that should be improved. Making $s$ larger in certain regions requires the distance function to be more precise and magnifies errors due to an incorrect distance function, especially near the surface. In order to modify the scale $s$ adaptively, we propose the following equation:
$$s \leftarrow s\,\exp\left(\sum_{i=1}^{K} \omega_i\,\|\nabla f_i\| - 1\right) \tag{16}$$
where $\nabla f_i$ is the gradient of the signed distance function at the $i$-th sample, $K$ is the number of sampling points, and $\omega_i$ is the normalized $\Psi_s'(f_i)$ used as the weight, with $\sum_{i=1}^{K} \omega_i = 1$.
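One plausible reading of the adaptive scale can be sketched as follows: the scale along a ray is multiplied by a gain that exceeds 1 when the weighted mean gradient norm exceeds 1. The exponential form of the gain and the use of the normalized sigmoid derivative as weights are assumptions of this sketch. The helper name `adaptive_scale` is hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def adaptive_scale(s, sdf, grad_norm):
    """Return s * exp(sum_i w_i * ||grad f_i|| - 1) for the K samples of one ray.

    The weights w_i are the sigmoid derivative at each sample, normalized to sum to 1.
    """
    psi = sigmoid(s * sdf)
    w = s * psi * (1.0 - psi)        # derivative of the sigmoid, largest near the surface
    w = w / w.sum()
    return s * np.exp(np.sum(w * grad_norm) - 1.0)

sdf = np.linspace(0.5, -0.5, 64)                              # samples crossing the surface
s_same = adaptive_scale(64.0, sdf, np.ones(64))               # Eikonal satisfied
s_up = adaptive_scale(64.0, sdf, np.full(64, 1.5))            # gradient norm too large
```

With unit gradient norms the scale is unchanged, and with larger norms it grows, so rays that cross poorly regularized regions get a sharper transparency function.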
While this method can be used to control the transparency function, it can also be used in the hierarchical sampling stage proposed by standard NeRF. By locally increasing $s$, more samples will be generated near the surface where the signed distance values change more rapidly. This mechanism also helps the optimization to focus on these regions in the volume.
Metric | Method | 24 | 37 | 40 | 55 | 63 | 65 | 69 | 83 | 97 | 105 | 106 | 110 | 114 | 118 | 122 | Mean |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Fidelity | NeRF | 1.90 | 1.60 | 1.85 | 0.58 | 2.28 | 1.27 | 1.47 | 1.67 | 2.05 | 1.07 | 0.88 | 2.53 | 1.06 | 1.15 | 0.96 | 1.49 |
Fidelity | VolSDF | 1.14 | 1.26 | 0.81 | 0.49 | 1.25 | 0.70 | 0.72 | 1.29 | 1.18 | 0.70 | 0.66 | 1.08 | 0.42 | 0.61 | 0.55 | 0.86 |
Fidelity | NeuS | 1.37 | 1.21 | 0.73 | 0.40 | 1.20 | 0.70 | 0.72 | 1.01 | 1.16 | 0.82 | 0.66 | 1.69 | 0.39 | 0.49 | 0.51 | 0.87 |
Fidelity | Ours | 0.76 | 1.32 | 0.70 | 0.39 | 1.06 | 0.63 | 0.63 | 1.15 | 1.12 | 0.80 | 0.52 | 1.22 | 0.33 | 0.49 | 0.50 | 0.77 |
PSNR | NeRF | 26.24 | 25.74 | 26.79 | 27.57 | 31.96 | 31.50 | 29.58 | 32.78 | 28.35 | 32.08 | 33.49 | 31.54 | 31.0 | 35.59 | 35.51 | 30.65 |
PSNR | VolSDF | 26.28 | 25.61 | 26.55 | 26.76 | 31.57 | 31.50 | 29.38 | 33.23 | 28.03 | 32.13 | 33.16 | 31.49 | 30.33 | 34.90 | 34.75 | 30.38 |
PSNR | NeuS | 28.20 | 27.10 | 28.13 | 28.80 | 32.05 | 33.75 | 30.96 | 34.47 | 29.57 | 32.98 | 35.07 | 32.74 | 31.69 | 36.97 | 37.07 | 31.97 |
PSNR | Ours | 29.15 | 27.33 | 28.37 | 28.88 | 32.89 | 33.84 | 31.17 | 34.83 | 30.06 | 33.37 | 35.44 | 33.09 | 32.12 | 37.13 | 37.32 | 32.33 |
Baselines. We compare HF-NeuS to the following three state-of-the-art baselines: (1) NeuS Wang et al. (2021) is the most relevant baseline for our work. We consider it to be the best published method. (2) VolSDF Yariv et al. (2021) is concurrent work to NeuS. We consider it to be the second best published method. Overall it also performs very well. (3) NeRF focuses on image synthesis and is included for completeness. NeRF is not really a surface reconstruction method and does not reconstruct high-quality surfaces, but it performs very well on image-based metrics. We use a density threshold of 25 (as proposed by NeuS Wang et al. (2021)) to extract surfaces from NeRF for the comparisons. For all three methods, we use the default parameters and the number of iterations recommended in their respective papers. We do not include older methods in the comparison, such as UNISURF Oechsle et al. (2021) or IDR Yariv et al. (2020), because NeuS and VolSDF have better results.
Datasets. We conduct experiments on the DTU dataset Jensen et al. (2014). We follow previous work and choose the same 15 models for comparison. DTU is a multi-view stereo dataset. Each scene consists of 49 or 64 views with 1600 × 1200 resolution. We further choose 9 challenging scenes from other datasets: 6 scenes from the NeRF-synthetic dataset Mildenhall et al. (2020) and 3 scenes from BlendedMVS Yao et al. (2020) (CC-4 License). The image resolution of the NeRF-synthetic dataset is 800 × 800 and 100 views are provided for each scene. The dataset contains objects with pronounced details and sharp features, such as the Lego and Microphone scenes. We chose this dataset for the analysis of reconstructions of high-frequency details. The BlendedMVS dataset is similar to the DTU dataset, but with a richer background. This dataset provides an image resolution of 768 × 576. We also select models with high-frequency details or sharp features which are difficult to reconstruct. In all three datasets, ground truth surfaces and camera poses are provided.
Evaluation metrics. To evaluate the quality of the reconstruction, we follow previous work and use the Chamfer distance (lower values are better) and PSNR (higher values are better). For the DTU dataset, we use the official evaluation protocol, which computes the mean of accuracy (distance from the reconstructed surface to the ground truth surface) and completeness (distance from the ground truth surface to the reconstructed surface). For DTU and BlendedMVS, the background is not part of the ground truth surface. Therefore, we remove the background for computing the Chamfer distance, following previous work. The NeRF-synthetic dataset Mildenhall et al. (2020) has no background, so we only remove disconnected parts for all competing methods.
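The mean of accuracy and completeness can be illustrated with a brute-force NumPy sketch on two small point sets. This is only a simplified stand-in: the official DTU protocol involves additional steps (such as masking and resampling) that are not reproduced here, and the function names are hypothetical.

```python
import numpy as np

def nearest_distances(a, b):
    """For each point in a (N, 3), the Euclidean distance to its nearest neighbor in b (M, 3)."""
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)   # (N, M) pairwise distances
    return d.min(axis=1)

def chamfer_distance(recon, gt):
    """Mean of accuracy (recon to gt) and completeness (gt to recon)."""
    accuracy = nearest_distances(recon, gt).mean()
    completeness = nearest_distances(gt, recon).mean()
    return 0.5 * (accuracy + completeness), accuracy, completeness

gt = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
recon = np.array([[0.0, 0.0, 1.0]])
cd, acc, comp = chamfer_distance(recon, gt)
```

The pairwise distance matrix grows with the product of the two point counts, so real evaluations on dense point clouds would typically use a spatial index instead.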
Implementation details. We use MLPs to model the two signed distance functions $f_b$ and $f_d$. Each MLP consists of 8 layers. Related work like NeuS Wang et al. (2021) and IDR Yariv et al. (2020) also uses MLPs with 8 layers. We use Adam with learning rate 5 for the network training using NVIDIA TITAN A100 40GB graphics cards. For adaptive sampling, we first uniformly sample 64 points on the ray, then calculate the SDF and its gradient at these points. We use Eq. 16 to calculate the gain of the parameter $s$, then adaptively update the weight according to the gain and sample an additional 64 points. For the coarse-to-fine strategy, we observe that using at the beginning for surface reconstruction produces smoothed results. We utilize and for both signed distance functions. We set for the parameter of the frequency band of positional encoding. For other parameter settings, please see the supplemental materials.
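The second sampling pass can be sketched with inverse-CDF sampling over the intervals of the first pass. The snippet below is a generic version of this idea, not the authors' sampler. The small constant added to the weights and the function name `sample_pdf` are assumptions.

```python
import numpy as np

def sample_pdf(bin_edges, weights, n_samples, rng):
    """Draw depths by inverting the piecewise-linear CDF built from per-bin weights.

    bin_edges: (K + 1,) increasing depths, weights: (K,) non-negative bin weights.
    """
    pdf = (weights + 1e-5) / np.sum(weights + 1e-5)   # small constant avoids empty bins
    cdf = np.concatenate([[0.0], np.cumsum(pdf)])     # (K + 1,), strictly increasing
    u = rng.uniform(size=n_samples)
    return np.interp(u, cdf, bin_edges)

rng = np.random.default_rng(0)
edges = np.linspace(0.0, 4.0, 5)          # four bins of width 1
w = np.array([0.0, 0.0, 1.0, 0.0])        # all weight in the bin [2, 3]
fine = sample_pdf(edges, w, 1000, rng)
```

When the weights come from a transparency function with a locally increased scale, the resulting CDF is steeper near the surface, so more of the additional samples land there.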
Comparison. In Table 1, we show quantitative comparisons with competing methods on 15 scenes of the DTU dataset Jensen et al. (2014). The values shown in the upper part of the table measure the fidelity of the surface reconstruction, i.e. the Chamfer distance. The numbers indicate that HF-NeuS significantly outperforms NeRF. In most scenes, HF-NeuS is better than VolSDF and NeuS, so that the overall average distance is also improved. In the lower part of the table we show the PSNR values. Our mean PSNR surpasses that of all other methods. We further compare the visual quality achieved by different methods. As shown in Fig. 4, HF-NeuS can reconstruct high-frequency details. For example, the windows have better geometric details, and the feathers of the bird are more distinct.
Most of the scenes in the DTU dataset have smooth surfaces, and high-frequency details are not obvious. We selected 9 challenging models from the NeRF-synthetic dataset Mildenhall et al. (2020) and the BlendedMVS dataset Yao et al. (2020), which have more high-frequency details. For example, the Lego model has uneven repeating bumps, and the power cord of the Mic model has a very thin structure (Fig. 1). The Robot model has richer edge and corner features (Fig. 4, second row). As shown in Table 2, the gap between our surface reconstruction quality and that of all other methods widens. This shows that HF-NeuS is especially advantageous for surface reconstructions with high-frequency information. We can also observe that NeRF is very good in the image-based metric (PSNR) while performing poorly in the surface reconstruction metric (Chamfer distance). This observation is consistent with previous work. Compared with the NeRF-synthetic dataset, the BlendedMVS dataset has a more complex background, which also restricts the performance of NeRF to a certain extent. Besides outperforming the baselines in terms of quantitative error, we also achieve better qualitative results. As shown in Fig. 1, HF-NeuS can more accurately reconstruct the details of each Lego block and even some of the tiny holes that are not reconstructed by any other method. For the Robot scene, HF-NeuS can reconstruct more accurate facial contours and sharper horns. Finally, for the Mic model, HF-NeuS can clearly reconstruct the power cord, while other methods fail to recover this structure.
Metric(10) | Method | Chair | Ficus | Lego | Materials | Mic | Ship | Mean (NeRF-synthetic) | Bread | Dog | Robot | Mean (BlendedMVS) |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Fidelity | NeRF | 2.12 | 5.17 | 3.05 | 1.51 | 4.77 | 3.54 | 3.36 | 0.102 | 0.693 | 2.325 | 1.07 |
Fidelity | VolSDF | 1.26 | 1.54 | 2.83 | 1.35 | 3.62 | 2.92 | 2.37 | 0.074 | 0.354 | 1.453 | 0.63 |
Fidelity | NeuS | 0.74 | 1.21 | 2.35 | 1.30 | 3.89 | 2.33 | 1.97 | 0.068 | 0.173 | 1.036 | 0.43 |
Fidelity | Ours | 0.69 | 1.12 | 0.94 | 1.08 | 0.72 | 2.18 | 1.12 | 0.065 | 0.155 | 0.922 | 0.38 |
PSNR | NeRF | 33.00 | 30.15 | 32.54 | 29.62 | 32.91 | 28.34 | 31.09 | 31.27 | 27.46 | 25.33 | 28.02 |
PSNR | VolSDF | 25.91 | 24.41 | 26.99 | 28.83 | 29.46 | 25.65 | 26.86 | 31.05 | 28.24 | 25.46 | 28.25 |
PSNR | NeuS | 27.95 | 25.79 | 29.85 | 29.36 | 29.89 | 25.46 | 28.05 | 31.32 | 28.71 | 25.87 | 28.63 |
PSNR | Ours | 28.69 | 26.46 | 30.72 | 29.87 | 30.35 | 25.87 | 28.66 | 31.89 | 29.42 | 26.15 | 29.15 |
Datasets | Base (Chamfer) | Base+H (Chamfer) | Base+C2F (Chamfer) | IDF+H (Chamfer) | IDF+C2F (Chamfer) | FULL (Chamfer) | Base (PSNR) | Base+H (PSNR) | Base+C2F (PSNR) | IDF+H (PSNR) | IDF+C2F (PSNR) | FULL (PSNR) |
---|---|---|---|---|---|---|---|---|---|---|---|---|
DTU | 1.08 | 1.20 | 1.07 | 1.25 | 0.89 | 0.78 | 31.77 | 32.73 | 32.56 | 32.69 | 32.13 | 32.49 |
NeRF-Synthetic | 2.51 | 3.61 | 2.95 | 2.83 | 1.35 | 0.91 | 28.39 | 30.52 | 30.63 | 30.12 | 29.88 | 30.31 |
BlendedMVS | 0.43 | fail | 0.63 | 0.47 | 0.41 | 0.38 | 28.63 | fail | 27.35 | 28.20 | 28.95 | 29.15 |
Ablation study. We verify the influence of different modules on the reconstruction results, including the coarse-to-fine module, the implicit displacement function module, and the position-adaptive control module. In Table 3, "Base" refers to the baseline method, which is NeuS. "H" means that we use high-frequency positional encoding. Here we set L=16 to represent high frequencies. "C2F" refers to the coarse-to-fine optimization strategy with high-frequency positional encoding. We set the initial coarse-to-fine parameter to 0.5. "IDF" means that the implicit displacement function is used in the reconstruction. For each dataset, we report the mean over three scenes as the quantitative metric. From the results on the BlendedMVS dataset, we observe that the coarse-to-fine strategy prevents the divergence of network training. On the DTU and NeRF-synthetic datasets, introducing high frequencies directly can easily lead to overfitting. This means that an increase in PSNR does not guarantee an improvement in the fidelity of the surface reconstruction. Although the coarse-to-fine module alleviates this mismatch to some degree, it struggles to improve the performance further. Adding the implicit displacement function component, however, improves the fidelity of the surface reconstruction and the PSNR at the same time. Finally, the adaptive scale helps to improve the reconstruction quality on more complex scenes.
Limitation. As shown in Fig. 5, our method still faces challenges. We show a reference ground truth image, our corresponding reconstructed image, and our reconstructed surface. For the grid of ropes of the ship, some overfitting to the ground-truth radiance is still observed. Specifically, the grid of ropes is visible in the image, but the surface is not reconstructed accurately. Another limitation is that the individual thin ropes are missing. We also visualize a failure case from Table 1, where our error is larger than that of the other methods, in Fig. 14 (DTU Bunny) in the supplementary material. In this case, the lighting of the model varies and the texture is not very pronounced, so it is difficult to reconstruct the details of the belly.
We introduce HF-NeuS, a new method for multi-view surface reconstruction with high-frequency details. We propose a new derivation to explain the relationship between signed distance and transparency and propose a class of functions that can be used. By decomposing the signed distance field into a combination of two independent implicit functions, and using adaptive scale constraints to focus on optimizing the regions where the implicit function distribution is not ideal, a more refined surface can be reconstructed compared to previous work. The experimental results show that the method outperforms the current state of the art in terms of quantitative reconstruction quality and visual inspection. A current limitation is that HF-NeuS needs to optimize an additional implicit function, so it requires more computational resources and creates additional coding complexity. In addition, due to the lack of 3D supervision, we still observe overfitting to the ground truth radiance to some extent. An interesting direction for future work is to explore reconstruction of scenes under different lighting modalities. Finally, we do not expect negative social impacts that will be directly linked to our research. Negative social impacts of surface reconstruction in general are possible though.
We perform an ablation study on the coarse-to-fine parameter and the number of frequency bands L. In Fig. 6, we show the surface reconstruction results of the DTU Buddha model under different frequency parameters. Each model is trained for 300K iterations. The first row shows the surface reconstruction quality under different coarse-to-fine parameters. When the parameter is too small, the surface reconstruction tends to be oversmoothed. When the parameter is too large, many artifacts appear in the reconstruction results. We adopt an initial value of 0.5 for our model. The second row shows the effect of different numbers of frequency bands on the reconstruction results when the coarse-to-fine parameter is fixed. When the number of frequency bands is too small, many details such as the cracks of the Buddha are missing from the reconstruction results. When the number of frequency bands is too large, the details increase, but high-frequency noise is also introduced, which makes the reconstructed model look unnatural. We adopt L=16 as the number of frequency bands of our model.
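To make the interplay between the coarse-to-fine parameter and the number of frequency bands concrete, the following is a minimal sketch of a windowed positional encoding. The cosine-eased window, the function names `frequency_weights` and `encode`, and the use of a single scalar coordinate are assumptions for illustration; the exact schedule used by the method is not reproduced in this section.

```python
import math

def frequency_weights(alpha, num_bands):
    """Weight in [0, 1] for each frequency band k = 0 .. num_bands - 1.

    alpha in [0, 1] is a hypothetical coarse-to-fine parameter: bands with
    index below alpha * num_bands are active, higher bands are masked out.
    """
    weights = []
    for k in range(num_bands):
        t = min(max(alpha * num_bands - k, 0.0), 1.0)  # clamp to [0, 1]
        weights.append((1.0 - math.cos(t * math.pi)) / 2.0)  # cosine easing
    return weights

def encode(x, alpha, num_bands):
    """Windowed positional encoding of one scalar coordinate x."""
    features = []
    for k, w in enumerate(frequency_weights(alpha, num_bands)):
        angle = (2.0 ** k) * math.pi * x
        features.append(w * math.sin(angle))
        features.append(w * math.cos(angle))
    return features

# With 16 bands and the parameter at 0.5, only the 8 lowest bands contribute.
weights = frequency_weights(0.5, 16)
active_bands = sum(1 for w in weights if w > 0.0)
```

Under this assumed window, a small parameter leaves only low frequencies active (a smooth surface), while a parameter of 1 enables all bands from the start, which matches the qualitative behaviour described above.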
In Fig. 7, we show an ablation experiment for adaptive transparency, comparing the reconstruction results with and without the adaptive transparency strategy. We selected the Lego model from the NeRF-synthetic dataset, the skull model (scan65) from the DTU dataset, and the dog model from the BlendedMVS dataset for validation. For the Lego model in the first column, we find that the adaptive transparency strategy deals better with regions that contain holes, which are opened up more cleanly during reconstruction. For the skull model in the second column, there are only a few training images of the right side of the skull, so the cheekbones are difficult to reconstruct accurately, but our method reconstructs these regions better. For the dog model in the third column, the belt is a fine structure that is not easily reconstructed by the network. The adaptive transparency strategy focuses better on these regions and reconstructs their surface details more accurately.
Benefits of modeling transparency. We provide a qualitative comparison to VolSDF and NeuS in Fig. 10. Our transparency model is easy to evaluate, avoids numerical problems due to division, and does not need to sample section points and mid-points separately. "OUR Base-Sigmoid" shows better geometric consistency on the roof of the house than VolSDF and NeuS. We also conduct an experiment with different choices of the transparency function in Fig. 10. "OUR Base-Laplace" means using the CDF of the Laplace distribution as the transparency function. On the negative semi-axis, the derivative of the Laplace CDF divided by the CDF itself is a constant function (the scale parameter), and this affects the quality of the surface reconstruction.
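The remark about the Laplace CDF can be checked numerically. The sketch below compares the ratio of derivative to function value for a Laplace CDF and for a sigmoid on the negative semi-axis. It assumes that the scale parameter enters as a rate (the CDF is 0.5·exp(s·x) for x < 0) and that "divided by the distribution" refers to division by the CDF value; the helper names are illustrative.

```python
import math

def laplace_cdf(x, s):
    """CDF of a zero-mean Laplace distribution with rate s (assumed form)."""
    if x < 0:
        return 0.5 * math.exp(s * x)
    return 1.0 - 0.5 * math.exp(-s * x)

def sigmoid(x, s):
    """Logistic sigmoid with scale s."""
    return 1.0 / (1.0 + math.exp(-s * x))

def log_derivative(func, x, s, h=1e-6):
    """Central-difference estimate of func'(x) / func(x)."""
    slope = (func(x + h, s) - func(x - h, s)) / (2.0 * h)
    return slope / func(x, s)

s = 10.0
xs = [-0.5, -0.2, -0.05]  # points on the negative semi-axis
laplace_ratios = [log_derivative(laplace_cdf, x, s) for x in xs]
sigmoid_ratios = [log_derivative(sigmoid, x, s) for x in xs]
```

For the Laplace CDF the ratio stays at s for every negative x, whereas for the sigmoid it equals s·(1 − sigmoid(s·x)) and therefore varies with the distance to the surface.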
Benefits of using positional encoding. We provide a qualitative comparison to the result of using a Siren network Sitzmann et al. (2020) in Fig. 10. We observed that the IDF using Sirens ("OURS-Siren" in Fig. 10), as used in Yifan et al. (2022), obtains a high PSNR but low geometric fidelity. Although Yifan et al. (2022) also use a coarse-to-fine strategy between two frequency levels, we found that the method still has problems learning high-frequency details because of the high-frequency noise involved at the beginning. Our IDF using positional encoding does not use high-frequency information at the beginning of training, which makes the training more stable. In general, we provide a solution that allows more fine-grained control over frequency. This approach is more stable in the case without 3D supervision.
Ablation of Eikonal regularization. We provide an ablation study of the Eikonal regularization in Fig. 10. We observe that training without regularization of the base SDF ("OUR-w/o Base Reg") results in slightly worse reconstruction quality. Constraining the base SDF therefore helps to improve the quality of the reconstruction.
Training with few images. We conduct an experiment on surface reconstruction with 10% of the training images in Fig. 11. We find that, compared to NeuS, our method keeps the structure of the reconstructed objects complete and better reconstructs parts such as thin stripes from fewer training images. For the shown test image, the PSNR also improves, from 28.31 for NeuS to 31.77 for OURS.
We provide a visualization of an example result in Fig. 12. As can be seen from the figure, the base SDF can reconstruct a smooth model of the Buddha. The displacement function is used to add extra details like cracks in the Buddha model and some small holes in the forehead (Base + IDF).
We show qualitative results for different ablations in Fig. 13. We observed that learning high-frequency details (Base+H) using only image information is difficult, and the network may overfit the image without reconstructing the correct geometry. Learning high-frequency details using IDF (IDF+H) alleviates the noise but still produces large geometric errors. We provide a solution that allows more fine-grained control over frequency. By using a coarse-to-fine positional encoding, the frequency can be explicitly controlled by a coarse-to-fine strategy. This approach is more stable in the case without 3D supervision. Compared with only using the coarse-to-fine strategy (Base+C2F), IDF using C2F with positional encoding (IDF+C2F) further improves the geometric fidelity as well as the image reconstruction quality as measured by PSNR. Using an adaptive strategy (FULL) improves geometric fidelity and PSNR even further.
We provide a heatmap of local errors in Fig. 14 to better highlight where the errors occur. It can be seen that we achieve a larger improvement in the details, such as the roof and the details in the shovel of the excavator.
Given two point clouds densely sampled from a surface, the Chamfer distance is defined as follows.
(17) |
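As an illustration of how such a metric can be evaluated, the following is a minimal sketch of one common symmetric form of the Chamfer distance between two point clouds. The choice of unsquared Euclidean distances and of summing (rather than averaging) the two directional terms is an assumption and may differ from the exact definition used for the reported numbers; the brute-force nearest-neighbour search is only suitable for small clouds.

```python
import math

def nearest_distance(point, cloud):
    """Euclidean distance from point to its nearest neighbour in cloud."""
    return min(math.dist(point, other) for other in cloud)

def chamfer_distance(cloud_a, cloud_b):
    """Symmetric Chamfer distance: mean nearest-neighbour distance from
    cloud_a to cloud_b plus the same quantity from cloud_b to cloud_a."""
    a_to_b = sum(nearest_distance(p, cloud_b) for p in cloud_a) / len(cloud_a)
    b_to_a = sum(nearest_distance(q, cloud_a) for q in cloud_b) / len(cloud_b)
    return a_to_b + b_to_a

# Two toy clouds: the second is the first shifted by one unit along z.
cloud_a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
cloud_b = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]
distance = chamfer_distance(cloud_a, cloud_b)
```

In practice a spatial index such as a k-d tree would replace the quadratic search when the clouds are densely sampled.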
The training time of each scene is around 20 hours for 300k iterations. The inference time for extracting a mesh surface with high resolution (512 grid resolution for marching cubes) is around 60 seconds and rendering an image at the resolution of 1600x1200 takes around 540 seconds.
Without supervision in 3D space, it is difficult to obtain accurate base implicit functions and implicit displacement functions simultaneously for the complete volume. The lack of supervision is more problematic far from the surface, since volume rendering is mainly influenced by the region of space close to the surface. We assume that the displacement function approaches 0 as the point moves away from the surface. In regions far away from the surface, the details are not needed because we do not extract a surface in these regions. Therefore, when the point is very far away, we assume that the base function and the combined function have the same value. This assumption reduces the difficulty of network training and suppresses the effects of high-frequency noise when 3D supervision is not present.
The derivative of the sigmoid function converges to 0 away from the surface when applied to the signed distance function. It can therefore be used to constrain displacement distances. In practice, we use this derivative to constrain the displacement function as follows.
(18) |
We relax the constraints near the surface with a factor of 0.01, and the scale of the displacement constraint is clamped to an upper bound. Because the derivative of the sigmoid is bounded in proportion to its scale, it is small over the whole space when the scale is small at the beginning of training, which can be interpreted as a tight constraint that speeds up the convergence of the base SDF. As the number of iterations increases, the constraint on the displacement function near the surface is gradually relaxed to fit high-frequency details.
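The property that the sigmoid derivative vanishes away from the surface can be illustrated with a short sketch. It evaluates the derivative of sigmoid(s·f) with respect to the signed distance f for a small and a large scale s. The function name and the chosen values are illustrative assumptions, and the sketch does not reproduce how the method combines this term with the displacement function.

```python
import math

def sigmoid_derivative(signed_distance, scale):
    """Derivative of sigmoid(scale * f) with respect to f.

    Equals scale * sig * (1 - sig), which peaks at scale / 4 on the
    surface (f = 0) and decays to 0 as |f| grows.
    """
    sig = 1.0 / (1.0 + math.exp(-scale * signed_distance))
    return scale * sig * (1.0 - sig)

distances = (0.0, 0.1, 0.5)
for scale in (5.0, 50.0):
    values = [sigmoid_derivative(f, scale) for f in distances]
    print(scale, values)
```

A small scale gives a small value everywhere, while a large scale concentrates the response in a thin band around the surface, so a quantity weighted by this derivative is only allowed to be large close to the surface late in training.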
We use hierarchical sampling to determine the sampling points on the ray. We first uniformly sample 64 points, and then sample a new set of 64 points according to the weighting function. The weighting function can be seen as a probability density function, where the probability of sampling is high where the ray intersects the surface and low elsewhere. The scale parameter controls whether the sample points are located very close to the surface or spread out around it. We also weight the scale parameter of rays at different spatial locations according to the gradients. We first calculate the signed distance function and its gradient at the 64 uniformly sampled points, and then calculate the weighting coefficients for the scale.
(19) |
where is the normalized weight for each point on the ray, and is the number of sampling points. We modulate the magnitude of the scale with the coefficient , and use scale to control the probability density for the adaptive sampling. Here we do not clamp . As increases, our coefficient depends more on the gradients close to the surface.
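The two-stage sampling described above can be sketched as follows: a uniform pass over the ray, followed by inverse-CDF sampling that treats per-sample weights as a discrete probability density. The Gaussian-shaped toy weighting function, the ray bounds, and the surface depth of 2.0 are hypothetical stand-ins; in the method, the weights come from the weighting function of the volume rendering equation.

```python
import bisect
import math
import random

def uniform_samples(near, far, count):
    """Evenly spaced depths along the ray, one at the centre of each bin."""
    step = (far - near) / count
    return [near + (i + 0.5) * step for i in range(count)]

def importance_samples(depths, weights, count, rng):
    """Draw depths with probability proportional to the per-bin weights."""
    total = sum(weights)
    cdf, running = [], 0.0
    for w in weights:
        running += w / total
        cdf.append(running)
    step = depths[1] - depths[0]
    samples = []
    for _ in range(count):
        # Pick a bin by inverting the discrete CDF.
        index = min(bisect.bisect_left(cdf, rng.random()), len(depths) - 1)
        # Jitter uniformly inside the selected bin.
        samples.append(depths[index] + (rng.random() - 0.5) * step)
    return sorted(samples)

# Toy weighting function peaked at an assumed surface depth of 2.0.
coarse = uniform_samples(0.0, 4.0, 64)
weights = [math.exp(-((d - 2.0) / 0.1) ** 2) for d in coarse]
fine = importance_samples(coarse, weights, 64, random.Random(0))
```

Narrowing the toy peak plays the role of a larger scale parameter: the second set of samples then clusters more tightly around the surface crossing.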