1.1 Multi-view Shape-from-shading
Over the decades, the reconstruction of dense 3D geometry from images has been tackled in numerous ways. Two of the most popular strategies are reconstruction from multiple views, using the notion of color or feature correspondence, and reconstruction of shaded objects, using the technique of shape-from-shading. The two approaches are in many ways complementary, and each has its strengths and limitations. While the fusion of these complementary concepts in a single reconstruction algorithm bears great promise, to date this challenge has remained unsolved and convincing experimental realizations have remained elusive. In this work, we review existing efforts and propose a novel solution to this challenge.
1.2 Related Work
Multi-view stereo reconstruction.
Multi-view stereo reconstruction (MVS) [Furukawa2015] is among the most powerful techniques to recover 3D geometry from multiple real-world images. The key idea is to exploit the fact that 3D points are likely to be on the (Lambertian) object surface if their projection into various cameras gives rise to a consistent color, patch or feature value. The arising photo-consistency-weighted minimal surface problems can be optimized using techniques such as graph cuts [Vogiatzis2005] or convex relaxation [Kolev2009]. Despite their enormous popularity for real-world reconstruction, multi-view stereo methods have several well-known shortcomings. Firstly, the estimation of dense correspondences is computationally challenging [Tola2010]. Secondly, in the absence of color variations (textureless areas), the color consistency assumption degenerates, leading to a need for regularity or smoothness assumptions. The resulting photo-consistency-weighted minimal surface formulations then degenerate to Euclidean minimal surface problems, which exhibit a shrinking bias that leads to the loss of concavities, indentations and other fine-scale geometric details.
In contrast to matching features or colors across images, photometric techniques [Ackermann2015] such as shape-from-shading (SFS) [Ikeuchi1981, Horn1989a] explicitly model the reflectance of the object surface. As a result, the brightness variations observed in a single image provide an indication about variations in the normal and geometry. SFS is a classical ill-posed problem with well-known ambiguities such as the one shown in Figure 2. From a single greylevel image, both the indentation (red curve) and the protrusion (blue curve) are possible geometric configurations. There exist two main strategies for resolving this ambiguity [Durou2008, Zhang1999]. Variational methods [Horn1986] employ regularization. As a result, they provide an approximate SFS solution which is often over-smoothed. Alternatively, methods based on the exact resolution of a nonlinear PDE [Lions1993] yield the highest level of detail while implicitly enforcing smoothness in the sense of viscosity solutions. Unfortunately, these PDE solutions lack robustness and require a boundary condition. Since most shape-from-shading methods require highly controlled illumination, they often fail when deployed in real-world conditions outside the lab. As shown in Figure 3, existing methods for shape-from-shading under natural illumination [Barron2015, Or-El2015] strongly depend on the use of a regularization mechanism, which limits their accuracy.
Shading-based geometry refinement.
The concave/convex ambiguity mentioned above disappears when using more than one observation (see Figure 2, right side). The natural question is therefore how to combine multi-view reconstruction with the concept of shape-from-shading. This has long been identified as a promising track [Blake1985], and theoretical guarantees on uniqueness exist [Chambolle1994]. Still, there is a lack of practical multi-view shape-from-shading methods. A variational solution was presented in [Jin2008], but it relies on regularization and may thus miss thin structures. Besides, this solution assumes a single, infinitely distant light source and thus cannot be applied under natural illumination. Methods combining stereo and shading information have also been developed [Galliani2016, Kim2016, Langguth2016, Maurer2016, Nehab2005, Samaras2000, Wu2011, Zollhoefer2015]. Yet, they do not fully exploit the potential of shading, because they all consider photometry as a way to refine multi-view 3D-reconstruction, which remains the baseline of the process.
Figure 3 (panels, left to right): input synthetic image and illumination; fixed point [Or-El2015] without regularization; SIRFS [Barron2015] using only one scale; proposed ADMM 3D-reconstruction (single-scale and regularization-free).
In this work, we revisit the challenge of multi-view shape-from-shading. Instead of considering SFS as a post-processing step for fine-scale geometric refinement, we place it at the core of the multi-view 3D-reconstruction process. The key idea is to model the brightness variations of each color channel and each image by means of a partial differential equation, and to subsequently couple these PDE solutions across all images and channels by means of a variational approach. Furthermore, rather than alternately solving for shape-from-shading and consistency across all images (which is known to lead to suboptimal solutions of poor quality), we make use of an efficient ADMM algorithm in order to solve the nonlinearly coupled optimization problem. In numerous experiments, we demonstrate that the proposed variational fusion of shape-from-shading across multiple views gives rise to highly accurate dense reconstructions of real-world objects without the need for dense correspondences. We believe that the proposed extension of SFS to multiple views will help to finally bring SFS strategies from the lab into the real world.
1.4 Problem Statement and Paper Organization
Given a set of input images and the reflectance function , we ultimately wish to estimate depth maps , which are both consistent with the observed images (photometric constraint), and consistent with each other (geometric constraint). The proposed framework is of the variational form, and can be written as follows:
where the photometric energy and the geometric one have to be chosen appropriately in order to ensure that: i) the finest details are being captured; ii) natural illumination can be considered; iii) the solution is not over-smoothed; iv) the depth maps are consistent.
The choice of the photometric energy is first discussed in detail in Section 2. It introduces a new approach to SFS under natural illumination which is both variational and PDE-based. It captures the finest details of a surface by avoiding regularization. Yet, since we also avoid using any boundary condition, 3D-reconstruction remains ambiguous if no initial estimate is available. To tackle this issue, we show in Section 3 that sparse correspondences across multi-view images disambiguate the problem.
2 Variational SFS Under Natural Illumination
This section introduces an algorithm for solving SFS under general lighting, modeled by channel-dependent, second-order spherical harmonics. We make the same assumptions as in [Johnson2011], i.e., the lighting and the albedo of the surface are known. In practice, this means that a calibration object (e.g., a sphere) with known geometry and the same albedo as the surface to reconstruct must be inserted in the scene. These assumptions are usual in the SFS literature. They could be relaxed by simultaneously estimating shape, illumination and reflectance [Barron2015], but we leave this as future work. Instead, we wish to solve SFS without resorting to any prior except differentiability of the depth map.
Our approach relies on the new differential SFS model (5). To solve it in practice, we introduce the variational reformulation (9), which separates the difficulties due to nonlinearity from those due to the non-local nature of the problem. Experimental validation is finally conducted through an application to shading-based depth refinement.
2.1 Image Formation Model and Related Work
Let , be a greylevel () or multi-channel () image of a surface, where represents a “mask” of the object being pictured.
We assume that the surface is Lambertian, so its reflectance is characterized by the albedo . We further consider a second-order spherical harmonics model [Basri2003, Ramamoorthi2001] for the lighting . To deal with the spectral dependencies of reflectance and lighting, we assume both and are channel-dependent. The albedo is thus a function , and the lighting in each channel
is represented as a vector.
Eventually, let be the field of unit-length outward normals to the surface.
With these notations, the image value in each channel writes as follows, :
Our goal is to recover the object shape, given its image, its albedo and the lighting. Each unit-length normal vector has two degrees of freedom, thus it is in general impossible to solve Equation (2) independently in each pixel. In particular, for a greylevel image under directional lighting, Equation (2) is a single scalar equation with two unknowns. This particular situation characterizes the classic SFS problem, which is ill-posed [Horn1989a]. Its resolution has given rise to a number of methods [Durou2008, Zhang1999].
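The image formation model described above (Lambertian reflectance under channel-dependent, second-order spherical harmonics lighting) can be illustrated by a minimal sketch. The ordering of the nine basis functions, the omission of normalization constants, and the names `sh_basis` and `render` are assumptions of this sketch, not notation taken from the paper.

```python
import numpy as np

def sh_basis(normals):
    """Second-order spherical-harmonics basis, one 9-vector per normal.

    normals: array of shape (..., 3) holding unit-length normals.
    Basis ordering and missing normalization constants are assumptions.
    """
    n1, n2, n3 = normals[..., 0], normals[..., 1], normals[..., 2]
    return np.stack([n1, n2, n3, np.ones_like(n1),
                     n1 * n2, n1 * n3, n2 * n3,
                     n1**2 - n2**2, 3.0 * n3**2 - 1.0], axis=-1)

def render(albedo, lighting, normals):
    """Simulate a Lambertian image.

    albedo:   (H, W, C) channel-dependent albedo
    lighting: (C, 9) one second-order lighting vector per channel
    normals:  (H, W, 3) unit-length normals
    returns:  (H, W, C) image
    """
    basis = sh_basis(normals)                            # (H, W, 9)
    shading = np.einsum('hwk,ck->hwc', basis, lighting)  # (H, W, C)
    return albedo * shading
```

With a single channel and a lighting vector whose only non-zero entries are the first three, this reduces to the classic directional-lighting case, in which each pixel provides one scalar equation for the two degrees of freedom of its normal.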
Yet, few SFS methods deal with non-directional lighting. Near-field pointwise lighting has been shown to help resolve the ambiguities [Prados2005], but only partly [Breuss2012]. Besides, to deal with more diffuse lighting such as natural outdoor illumination, spherical harmonics are better suited. First-order harmonics have been considered in [Huang2011], but they capture a smaller fraction of “real” lighting than second-order harmonics [Frolova2004].
In the context of SFS, second-order harmonics have been used in [Barron2015, Johnson2011, Richter2015]. The SFS approach of Johnson and Adelson [Johnson2011] has the same objective as ours, i.e., handling multi-channel images and “natural” illumination, knowing the albedo and the lighting. It is shown that this general illumination model actually limits the ambiguities of SFS, since it is the intermediate case between SFS and color photometric stereo [Hernandez2007]. However, this work relies on regularization terms, which favor over-smoothed surfaces. Barron and Malik solve in [Barron2015] the more challenging problem of shape, illumination and reflectance from shading (SIRFS). By fixing the albedo and the lighting, and removing all the regularization terms, SIRFS can be applied to SFS. However, the proposed method “fails badly” [Barron2015] if a multi-scale strategy is not considered (see Figure 3). Let us also mention for completeness the recent work in [Richter2015], which has similar goals to ours (shape-from-shading under natural illumination), but follows an entirely different track based on discriminative learning, which requires prior training.
Overall, there exists no purely data-driven approach to SFS under natural illumination. The rest of this section aims at filling this gap.
2.2 Differential Model
Since Equation (2) cannot be solved locally, it must be solved globally over the entire image domain. This can be achieved by assuming surface smoothness. However, in order to prevent losing the fine-scale surface details, this assumption should be as minimal as possible. In particular, regularization terms, which have been widely explored in early SFS works [Horn1986, Ikeuchi1981], may over-smooth the solution. Instead of having the normal vectors as unknowns and penalizing their variations, as done for instance in [Johnson2011], we directly estimate the underlying depth map. To this end, we resort to a differential approach building upon PDEs [Lions1993]. This has the advantage of implicitly enforcing differentiability without requiring any regularization term. Let us thus first rewrite (2) as a PDE.
Let the shape be represented as a function , which is the depth map under orthographic projection, and the of the depth map under perspective projection. In both cases, the normal to the surface is given by
where: is the gradient of ; under orthographic projection while, under perspective projection, is the focal length and , with the coordinates of the principal point; and is a coefficient of normalization:
with the following definitions for the fields and :
Various methods have been suggested for solving PDEs akin to (5), in some specific cases. When , and lighting is directional and frontal (i.e., is the only non-zero lighting component), then (5) becomes the eikonal equation, which was first exhibited for SFS in [Bruss1982]. After this inverse problem caught the attention of several mathematicians [Lions1993, Rouy1992], efficient numerical methods for approximating solutions to this well-known equation were suggested, using for instance semi-Lagrangian schemes [Cristiani2007]. Under perspective projection, an eikonal-like equation also arises [Prados2003, Tankus2003]. The case where lighting is depth-dependent (so-called attenuation factor) is also interesting, as it is less ambiguous [Prados2005]. Semi-Lagrangian schemes can also be used for its resolution, see for instance [Breuss2012]. Still, most of these differential methods require a boundary condition, or at least a state constraint, which are rarely available in practice. In addition, there is currently no purely data-driven numerical SFS method which handles second-order lighting and multi-channel images: [Barron2015] is strongly dependent on a multi-scale strategy, and [Johnson2011] is non-differential (per-pixel surface normal estimation) and thus resorts to regularization. The variational approach discussed hereafter solves all these issues at once.
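The link between the depth map and the normal field, on which the differential model relies, can be sketched for the orthographic case as follows. The sign convention (normals pointing towards positive depth), the use of central finite differences, and the name `normals_from_depth` are assumptions of this sketch; the perspective case is not covered.

```python
import numpy as np

def normals_from_depth(z, spacing=1.0):
    """Unit normals of an orthographic depth map z of shape (H, W).

    The normal is proportional to (-dz/dx, -dz/dy, 1); the sign
    convention is an assumption of this sketch.
    """
    dz_dy, dz_dx = np.gradient(z, spacing)   # derivatives along rows, columns
    n = np.stack([-dz_dx, -dz_dy, np.ones_like(z)], axis=-1)
    return n / np.linalg.norm(n, axis=-1, keepdims=True)
```

Because the normal is expressed through the gradient of a single scalar function, plugging it into the image formation model yields one PDE in the depth per channel, and differentiability of the depth map is enforced implicitly.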
Figure 4 (panels): a single input real image with illumination and initial shape; shape-from-shading without any regularization. Same, on three other RGB-D datasets.
2.3 Variational Formulation
The PDEs in (5) are in general incompatible due to noise. Thus, an approximate solution must be sought. For simplicity, we follow here a least-squares approach:
where is the norm over the domain .
If the fields and were not dependent on , then (8) would be a linear least-squares problem. In recent works on shading-based refinement [Or-El2015], it is suggested to proceed iteratively, by freezing these terms at each iteration. Although this “fixed point” strategy looks appealing, Figure 3 shows that it induces artifacts, and thus regularization must be employed [Or-El2015]. Other artifacts also arise in SIRFS [Barron2015], when the multi-scale strategy is not employed.
Instead of eliminating artifacts by regularization, which may induce a loss of geometric details, we rather separate the difficulty induced by the nonlinearity from that induced by the dependency on the gradient. To this end, we introduce an auxiliary variable , and rewrite (8) in the following, equivalent, manner:
Figure 5 (panels, left to right): ground truth; greylevel, first-order lighting (14); greylevel, second-order lighting (15); colored, second-order lighting (16). Each result is reported with its MAE-N and RMSE-I.
where represent Lagrange multipliers, and . ADMM then minimizes (9) by the following iterations:
where is determined automatically [He2000].
We then discretize (11) by finite differences, and solve the discrete optimality conditions by conjugate gradient. With this approach, no explicit boundary condition is required. As for (12), it is solved in each pixel by a Newton method [Coleman1996]. In our experiments, the algorithm stops when the relative residual of the energy in (8) falls below .
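The structure of these iterations (a linear least-squares depth update, a per-pixel nonlinear update of the auxiliary variable, and a multiplier update) can be illustrated on a toy problem. The sketch below is not the paper's model (5): it assumes a 1D eikonal-like residual `theta**2 - d`, a fixed penalty `beta` instead of an automatically determined one, a direct least-squares solve instead of conjugate gradient, and a few Gauss-Newton steps in place of the Newton method. All names are hypothetical.

```python
import numpy as np

def admm_toy_sfs(d, z0, h, beta=1.0, n_outer=100, n_inner=5):
    """Toy ADMM: minimize sum((theta**2 - d)**2) subject to D z = theta.

    d:  (n-1,) squared slopes to explain (toy data, an assumption)
    z0: (n,) initial depth profile
    h:  grid spacing
    """
    n = z0.size
    # Forward-difference operator of shape (n-1, n).
    D = (np.eye(n, k=1) - np.eye(n))[:-1] / h
    z = z0.copy()
    theta = D @ z                  # auxiliary variable (the "gradient")
    u = np.zeros_like(theta)       # scaled Lagrange multiplier
    for _ in range(n_outer):
        # Depth update: linear least squares, D z ~ theta - u.
        z = np.linalg.lstsq(D, theta - u, rcond=None)[0]
        # Auxiliary update: independent nonlinear problem per sample,
        # min (theta**2 - d)**2 + beta/2 * (theta - v)**2.
        v = D @ z + u
        for _ in range(n_inner):
            grad = 4.0 * theta * (theta**2 - d) + beta * (theta - v)
            hess = 8.0 * theta**2 + beta   # Gauss-Newton approximation
            theta = theta - grad / hess
        # Multiplier update.
        u = u + D @ z - theta
    return z, theta
```

In 1D every slope field is integrable, so the coupling constraint is easy to satisfy; in 2D the depth update is what enforces integrability of the estimated gradient. As in the paper, the result depends on the initial estimate: here the sign of the initial slopes selects the solution.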
Since our method estimates a locally optimal solution, initialization matters. There is one situation where a reasonable initial estimate is available. This is when using an RGB-D camera: the depth channel D is noisy, but it may be refined using shading [Chatterjee2015, Choe2017, Han2013, Or-El2016, Or-El2015, Wu2014]. Hence, to qualitatively evaluate our approach, we consider in Figure 4 three real-world RGB-D datasets from [Han2013], estimating lighting from the rough depth map (assuming ). We attain the finest level of surface detail possible, since the surface is not over-smoothed through regularization.
For quantitative evaluation, we use in Figure 5 the well-known “Joyful Yell” dataset, under three lighting scenarios. We first consider greylevel images, with a first-order and then a second-order lighting model, respectively defined by:
In the third experiment, we consider a colored, second-order lighting model defined by:
The importance of initialization is assessed by using two different initial estimates. The accuracy of 3D-reconstruction is evaluated by the mean angular error between the recovered normals and the ground truth ones, and the ability to explain the input image is measured through the RMSE between the data and the image simulated from the 3D-reconstruction.
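The two error measures described above can be computed as in the following sketch. Reporting the angular error in degrees, the optional boolean `mask` argument, and the function names are assumptions of this sketch.

```python
import numpy as np

def mean_angular_error_deg(n_est, n_gt, mask=None):
    """Mean angle (in degrees) between two fields of unit normals (..., 3)."""
    cos = np.clip(np.sum(n_est * n_gt, axis=-1), -1.0, 1.0)
    ang = np.degrees(np.arccos(cos))
    if mask is not None:
        ang = ang[mask]            # keep only pixels inside the object mask
    return ang.mean()

def rmse(img_est, img_obs, mask=None):
    """Root mean square error between a simulated and an observed image."""
    sq = (img_est - img_obs) ** 2
    if mask is not None:
        sq = sq[mask]
    return np.sqrt(sq.mean())
```

The first measure evaluates the geometry against ground truth, while the second only checks that the 3D-reconstruction explains the input image; an ambiguous solution can have a low image error and a high angular error.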
We compared those values against SIRFS [Barron2015], which is the only method for SFS under natural illumination whose code is freely available. For fair comparison, we disabled albedo and lighting estimation in SIRFS, and gave a zero weight to all smoothing terms. To avoid the artifacts shown in Figure 3, SIRFS’s multi-scale strategy was used. Figure 5 shows that SFS under natural illumination can be solved using a purely data-driven strategy, without resorting to either regularization or a multi-scale strategy. Besides, the runtimes of our method and SIRFS are comparable: a few minutes in all cases, for images with non-black pixels.
Still, these experiments show that the proposed method strongly depends on the choice of the initial estimate. We now show how to better constrain the 3D-reconstruction problem through sparse multi-view correspondences.
3 Multi-view Shape-from-shading
Although colored natural illumination partly disambiguates SFS, it does not entirely remove ambiguities [Johnson2011]. Another disambiguation strategy must be considered in the absence of a good initial estimate. We now show that sparse correspondences in a multi-view framework can be employed for this purpose.
To this end, let us now assume that we are given images , along with the corresponding albedo maps and lighting vectors, both assumed to be channel- and image-dependent and denoted by and . The joint resolution of the SFS problems could be achieved by solving variational problems such as (9). However, this would result in inconsistent depth maps: the SFS problems need to be coupled.
3.1 Sparse Multi-view Constraints
We use multi-view consistency to couple the SFS problems, and show that ambiguities are limited when introducing sparse correspondences between the images. We conjecture that all ambiguities disappear if the correspondence set is dense. This conjecture could probably be proved by following [Chambolle1994], but we leave this as future work.
Let us assume that some sparse inter-images pixel correspondences are given (which can be obtained, for instance, by matching SIFT descriptors), and let us write them as the following functions, where and are the masks of the object in images and , :
Assuming perspective projection, a 3D-point in world coordinates is conjugate to a pixel according to
where is the -th depth map (recall that we set to the depth map under perspective projection), is the pixel coordinates the -th principal point, is the -th focal length, and and are the rotation and translation describing the -th pose of the camera (we assume that these poses are calibrated).
The multi-view consistency constraint then writes
which we rewrite as the following nonlinear constraint:
where is a function depending on the depth maps and , whereas the function does not.
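The multi-view consistency constraint states that two corresponding pixels, back-projected with their respective depths and camera poses, must give the same 3D point. A minimal sketch under stated assumptions: a pinhole camera with a single focal length, the pose convention `x_cam = R @ x_world + t`, and plain depth values as input (not the paper's parameterization of the depth). All function names are hypothetical.

```python
import numpy as np

def backproject(pixel, depth, f, principal_point, R, t):
    """Pixel and depth -> 3D point in world coordinates."""
    uv = (np.asarray(pixel, float) - np.asarray(principal_point, float)) / f
    x_cam = depth * np.array([uv[0], uv[1], 1.0])
    return R.T @ (x_cam - t)       # invert x_cam = R @ x_world + t

def project(x_world, f, principal_point, R, t):
    """3D world point -> (pixel, depth) in the given camera."""
    x_cam = R @ x_world + t
    pixel = f * x_cam[:2] / x_cam[2] + np.asarray(principal_point, float)
    return pixel, x_cam[2]

def consistency_residual(pixel_i, depth_i, cam_i, pixel_j, depth_j, cam_j):
    """Difference of the two back-projected 3D points (zero if consistent).

    cam_i and cam_j are tuples (f, principal_point, R, t).
    """
    return (backproject(pixel_i, depth_i, *cam_i)
            - backproject(pixel_j, depth_j, *cam_j))
```

For a given pair of matched pixels, this residual depends only on the two depth values at those pixels, not on the depth gradients, which is why the constraint is sparse and cheap to evaluate.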
3.2 Proposed Variational Paradigm
To disambiguate SFS through multi-views, we suggest to use in the variational model (1), where is the norm over , and is a weighting factor. Since the constraint (20) only depends on the depth values, and not on their gradients, we rather write it in terms of the auxiliary variables of the ADMM algorithm. This is motivated by the fact that the updates of these variables already require per-pixel nonlinear least-squares optimization. Moreover, the depth updates remain linear least-squares ones if the multi-view constraint is written in terms of the auxiliary variables. We thus define new auxiliary variables , and turn (9) into:
Figure 6 (panels, left to right): greylevel, first-order lighting; colored, second-order lighting; fused point cloud.
We experimentally found that the choice of a particular value of the parameter is not important. Obviously, if is set to , then the SFS are uncoupled, and thus ambiguous. Yet, as long as is “high enough”, ambiguities disappear. In our tests, we found that the range provides comparable results, and always used the value .
It is straightforward to modify the previous ADMM algorithm for solving (21). In Figure 6, we show the 3D-reconstructions obtained from synthetic views, in the same lighting scenarios as in the first and third experiments of Figure 5, using the same non-realistic initial estimate. We used pixel correspondences which were randomly picked using the ground-truth geometry. In comparison with the single-view results (see Figure 5), the estimated depth maps are more accurate. Besides, if we fuse both depth maps into a point cloud (using the known camera poses), we observe that both 3D-reconstructions are “consistent”, which indicates that ambiguities are eliminated.
Finally, we present in Figures 1 and 7 the results of our method on two real-world datasets from [Zollhoefer2015]. We chose these datasets because they exhibit a uniform, though unknown, albedo. This albedo can thus be estimated during lighting calibration (since illumination is not provided in these datasets, it was calculated from the 3D-reconstructions provided in [Zollhoefer2015], but these 3D-reconstructions were then not used any further). Sparse correspondences were extracted by matching standard SIFT features [Lowe1999] (the total number of used matches is worth for the images of the “Sokrates” dataset, and for the images of the “Figure” one). These real-world experiments demonstrate that shading-based multi-view 3D-reconstruction constitutes a promising alternative to standard dense multi-view stereo.
Figure 7 (panels, left to right): CMPMVS [Jancosek2011]; ours. Each panel is labeled with its number of input views.
We have shown how to achieve dense multi-view 3D-reconstruction without dense correspondences. A new variational approach to shape-from-shading under general lighting is used as the main tool for densification. It drastically reduces the number of required images, while improving the amount of detail in the 3D-reconstruction. In future work, the new approach may be extended by automatic estimation of the albedo and of the lighting. This would allow coping with a broader variety of surfaces, and would simplify the overall procedure.