1 Introduction
1.1 Multiview Shape-from-shading
Over the decades, the reconstruction of dense 3D geometry from images has been tackled in numerous ways. Two of the most popular strategies are the reconstruction from multiple views using the notion of color or feature correspondence, and the reconstruction of shaded objects using the technique of shape-from-shading. The two approaches are in many ways complementary, and each has its strengths and limitations. While the fusion of these complementary concepts in a single reconstruction algorithm bears great promise, to date this challenge has remained unsolved and convincing experimental realizations have remained elusive. In this work, we review existing efforts and propose a novel solution to this challenge.
1.2 Related Work
Multiview stereo reconstruction.
Multiview stereo reconstruction (MVS) [Furukawa2015] is among the most powerful techniques to recover 3D geometry from multiple real-world images. The key idea is to exploit the fact that 3D points are likely to be on the (Lambertian) object surface if the projection into various cameras gives rise to a consistent color, patch or feature value. The arising photoconsistency-weighted minimal surface problems can be optimized using techniques such as graph cuts [Vogiatzis2005] or convex relaxation [Kolev2009]. Despite their enormous popularity for real-world reconstruction, multiview stereo methods have several well-known shortcomings. Firstly, the estimation of dense correspondences is computationally challenging [Tola2010]. Secondly, in the absence of color variations (textureless areas), the color consistency assumption degenerates, leading to a need for regularity or smoothness assumptions. The resulting photoconsistency-weighted minimal surface formulations then degenerate to Euclidean minimal surface problems, which exhibit a shrinking bias that leads to the loss of concavities, indentations and other fine-scale geometric details.
Shape-from-shading.
In contrast to matching features or colors across images, photometric techniques [Ackermann2015] such as shape-from-shading (SFS) [Ikeuchi1981, Horn1989a] explicitly model the reflectance of the object surface. As a result, the brightness variations observed in a single image provide an indication about variations in the normal and geometry. SFS is a classical ill-posed problem with well-known ambiguities such as the one shown in Figure 2. From a single greylevel image, both the indentation (red curve) and the protrusion (blue curve) are possible geometric configurations. There exist two main strategies for solving this ambiguity [Durou2008, Zhang1999]. Variational methods [Horn1986] employ regularization. As a result, they provide an approximate SFS solution which is often oversmoothed. Alternatively, methods based on the exact resolution of a nonlinear PDE [Lions1993] yield the highest level of detail while implicitly enforcing smoothness in the sense of viscosity solutions. Unfortunately, these PDE solutions lack robustness and they require a boundary condition. Since most shape-from-shading methods require a highly controlled illumination, they often fail when deployed in real-world conditions outside the lab. As shown in Figure 3, existing methods for shape-from-shading under natural illumination [Barron2015, OrEl2015] strongly depend on the use of a regularization mechanism, which limits their accuracy.
Shading-based geometry refinement.
The concave-convex ambiguity mentioned above disappears when using more than one observation (see Figure 2, right side). The natural question is therefore how to combine multiview reconstruction with the concept of shape-from-shading. This has long been identified as a promising track [Blake1985], and theoretical guarantees on uniqueness exist [Chambolle1994]. Still, there is a lack of practical multiview shape-from-shading methods. A variational solution was presented in [Jin2008], but it relies on regularization and may thus miss thin structures. Besides, this solution assumes a single, infinitely distant light source and thus cannot be applied under natural illumination. Methods combining stereo and shading information have also been developed [Galliani2016, Kim2016, Langguth2016, Maurer2016, Nehab2005, Samaras2000, Wu2011, Zollhoefer2015]. Yet, they do not fully exploit the potential of shading, because they all consider photometry as a way to refine multiview 3D-reconstruction, which remains the baseline of the process.
[Figure panels, left to right: Input synthetic image and illumination; Fixed point [OrEl2015] without regularization; SIRFS [Barron2015] using only one scale; Proposed ADMM 3D-reconstruction (single-scale and regularization-free).]
1.3 Contribution
In this work, we revisit the challenge of multiview shape-from-shading. Instead of considering SFS as a post-processing step for fine-scale geometric refinement, we place it at the core of the multiview 3D-reconstruction process. The key idea is to model the brightness variations of each color channel and each image by means of a partial differential equation, and to subsequently couple these PDE solutions across all images and channels by means of a variational approach. Furthermore, rather than alternatingly solving for shape-from-shading and consistency across all images (which is known to lead to suboptimal solutions of poor quality), we make use of an efficient ADMM algorithm in order to solve the nonlinearly coupled optimization problem. In numerous experiments we demonstrate that the proposed variational fusion of shape-from-shading across multiple views gives rise to highly accurate dense reconstructions of real-world objects without the need for dense correspondence. We believe that the proposed extension of SFS to multiple views will help to finally bring SFS strategies from the lab into the real world.
1.4 Problem Statement and Paper Organization
Given a set of input images and the reflectance function , we ultimately wish to estimate depth maps , which are both consistent with the observed images (photometric constraint), and consistent with each other (geometric constraint). The proposed framework is of the variational form, and can be written as follows:
(1) 
where the photometric energy and the geometric one have to be chosen appropriately in order to ensure that: i) the finest details are being captured; ii) natural illumination can be considered; iii) the solution is not oversmoothed; iv) the depth maps are consistent.
The choice of the photometric energy is first discussed in detail in Section 2. It introduces a new approach to SFS under natural illumination which is both variational and PDE-based. It captures the finest details of a surface by avoiding regularization. Yet, since we also avoid using any boundary condition, 3D-reconstruction remains ambiguous if no initial estimate is available. To tackle this issue, we show in Section 3 that sparse correspondences across multiview images disambiguate the problem.
2 Variational SFS Under Natural Illumination
This section introduces an algorithm for solving SFS under general lighting, modeled by channel-dependent, second-order spherical harmonics. We make the same assumptions as in [Johnson2011], i.e., the lighting and the albedo of the surface are known. In practice, this means that a calibration object (e.g., a sphere) with known geometry and the same albedo as the surface to reconstruct must be inserted in the scene. These assumptions are usual in the SFS literature. They could be relaxed by simultaneously estimating shape, illumination and reflectance [Barron2015], but we leave this as future work. Instead, we wish to solve SFS without resorting to any prior except differentiability of the depth map.
Our approach relies on the new differential SFS model (5). To solve it in practice, we introduce the variational reformulation (9), which separates the difficulties due to nonlinearity from those due to the nonlocal nature of the problem. Finally, experimental validation is conducted through an application to shading-based depth refinement.
2.1 Image Formation Model and Related Work
Let , be a greylevel () or multichannel () image of a surface, where represents a “mask” of the object being pictured.
We assume that the surface is Lambertian, so its reflectance is characterized by the albedo . We further consider a second-order spherical harmonics model [Basri2003, Ramamoorthi2001] for the lighting . To deal with the spectral dependencies of reflectance and lighting, we assume both and are channel-dependent. The albedo is thus a function , and the lighting in each channel is represented as a vector . Finally, let be the field of unit-length outward normals to the surface.
With these notations, the image value in each channel writes as follows, :
(2) 
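As an illustration of the image formation model, the following sketch renders a Lambertian image under channel-dependent, second-order spherical harmonics lighting. It is a minimal example under stated assumptions, not the authors' implementation: the ordering of the nine basis terms, the absorption of normalization constants into the lighting vectors, and the names `sh_basis` and `render_lambertian` are all hypothetical choices made for this example.

```python
import numpy as np

def sh_basis(normals):
    """Second-order spherical harmonics basis, one 9-vector per normal.

    normals: array of shape (..., 3) holding unit-length normals.
    ASSUMPTION: the ordering and scaling of the nine terms are choices of
    this sketch; normalization constants are absorbed into the lighting.
    """
    n1, n2, n3 = normals[..., 0], normals[..., 1], normals[..., 2]
    return np.stack([
        n1, n2, n3,                 # first-order (directional) terms
        np.ones_like(n1),           # constant (ambient) term
        n1 * n2, n1 * n3, n2 * n3,  # second-order cross terms
        n1**2 - n2**2,
        3.0 * n3**2 - 1.0,
    ], axis=-1)

def render_lambertian(albedo, lighting, normals):
    """Image value per channel: albedo times (lighting . basis(normal)).

    albedo:   (H, W, C) channel-dependent albedo
    lighting: (C, 9)    one lighting vector per channel
    normals:  (H, W, 3) unit-length normals
    returns:  (H, W, C) simulated image
    """
    basis = sh_basis(normals)                            # (H, W, 9)
    shading = np.einsum('hwk,ck->hwc', basis, lighting)  # (H, W, C)
    return albedo * shading
```

With a purely directional, frontal lighting vector (only the third component nonzero) and normals aligned with the third axis, the shading equals one and the rendered image reduces to the albedo.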
Our goal is to recover the object shape, given its image, its albedo and the lighting. Each unit-length normal vector has two degrees of freedom, thus it is in general impossible to solve Equation (2) independently in each pixel . In particular, if and lighting is directional (, Equation (2) is a single scalar equation with two unknowns. This particular situation characterizes the classic SFS problem, which is ill-posed [Horn1989a]. Its resolution has given rise to a number of methods [Durou2008, Zhang1999].
Yet, few SFS methods deal with nondirectional lighting. Near-field pointwise lighting has been shown to help resolve the ambiguities [Prados2005], but only partly [Breuss2012]. Besides, to deal with more diffuse lighting such as natural outdoor illumination, spherical harmonics are better suited. First-order harmonics have been considered in [Huang2011], but they only capture up to of “real” lighting, while this rate is over using second-order harmonics [Frolova2004].
In the context of SFS, second-order harmonics have been used in [Barron2015, Johnson2011, Richter2015]. The SFS approach of Johnson and Adelson [Johnson2011] has the same objective as ours, i.e., handling multichannel images and “natural” illumination, knowing the albedo and the lighting. It is shown that this general illumination model actually limits the ambiguities of SFS, since it is the intermediate case between SFS and color photometric stereo [Hernandez2007]. However, this work relies on regularization terms, which favor oversmoothed surfaces. Barron and Malik solve in [Barron2015] the more challenging problem of shape, illumination and reflectance from shading (SIRFS). By fixing the albedo and the lighting, and removing all the regularization terms, SIRFS can be applied to SFS. However, the proposed method “fails badly” [Barron2015] if a multiscale strategy is not considered (see Figure 3). Let us also mention for completeness the recent work in [Richter2015], which has goals similar to ours (shape-from-shading under natural illumination), but follows an entirely different track based on discriminative learning, which requires prior training.
Overall, there exists no purely data-driven approach to SFS under natural illumination. The rest of this section aims at filling this gap.
2.2 Differential Model
Since Equation (2) cannot be solved locally, it must be solved globally over the entire domain . This can be achieved by assuming surface smoothness. However, in order to prevent losing the fine-scale surface details, this assumption should be as minimal as possible. In particular, regularization terms, which have been widely explored in early SFS works [Horn1986, Ikeuchi1981], may oversmooth the solution. Instead of having the normal vectors as unknowns and penalizing their variations, as achieved for instance in [Johnson2011], we directly estimate the underlying depth map. To this end, we resort to a differential approach building upon PDEs [Lions1993]. This has the advantage of implicitly enforcing differentiability without requiring any regularization term. Let us thus first rewrite (2) as a PDE.
Let the shape be represented as a function , which is the depth map under orthographic projection, and the of the depth map under perspective projection. In both cases, the normal to the surface is given by
(3) 
where: is the gradient of ; under orthographic projection while, under perspective projection, is the focal length and , with the coordinates of the principal point; and is a coefficient of normalization:
(4) 
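The relation between a depth map and its normal field can be illustrated numerically in the orthographic case. The sketch below is only an example under assumptions: the sign convention of the normal, the use of `np.gradient` for the finite differences, and the name `normals_from_depth` are choices made here, and the perspective case is not covered.

```python
import numpy as np

def normals_from_depth(z, spacing=1.0):
    """Unit normals of the surface (x, y, z(x, y)), orthographic projection.

    ASSUMPTION: normals are taken as [-z_x, -z_y, 1] / d; the opposite sign
    convention is equally common and may be the one used in the text.
    """
    # np.gradient returns derivatives along rows (y) first, then columns (x).
    zy, zx = np.gradient(z, spacing)
    d = np.sqrt(zx**2 + zy**2 + 1.0)  # normalization coefficient
    return np.stack([-zx / d, -zy / d, 1.0 / d], axis=-1)
```

For a planar depth map with slope 2 along x, the function returns the constant normal [-2, 0, 1] / sqrt(5) in every pixel.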
Plugging (3) into (2), we obtain, , the following nonlinear PDE in the depth over :
(5) 
with the following definitions for the fields and :
(6)  
(7) 
Various methods have been suggested for solving PDEs akin to (5), in some specific cases. When , and lighting is directional and frontal (, is the only nonzero lighting component), then (5) becomes the eikonal equation, which was first exhibited for SFS in [Bruss1982]. After this inverse problem caught the attention of several mathematicians [Lions1993, Rouy1992], efficient numerical methods for approximating solutions to this well-known equation were suggested, using for instance semi-Lagrangian schemes [Cristiani2007]. Under perspective projection, an eikonal-like equation also arises [Prados2003, Tankus2003]. The case where lighting is depth-dependent (so-called attenuation factor) is also interesting as it is less ambiguous [Prados2005]. Semi-Lagrangian schemes can also be used for the resolution, see for instance [Breuss2012]. Still, most of these differential methods require a boundary condition, or at least a state constraint, which are rarely available in practice. In addition, there is currently no purely data-driven numerical SFS method which handles second-order lighting and multichannel images: [Barron2015] is strongly dependent on a multiscale strategy, and [Johnson2011] is nondifferential (per-pixel surface normal estimation) and thus resorts to regularization. The variational approach discussed hereafter solves all these issues at once.
[Figure panels: A single input real image with illumination and initial shape; Shape-from-shading without any regularization. Same, on three other RGB-D datasets.]
2.3 Variational Formulation
The PDEs in (5) are in general incompatible due to noise. Thus, an approximate solution must be sought. For simplicity, we follow here a least-squares approach:
(8) 
where is the norm over the domain .
If the fields and were not dependent on , then (8) would be a linear least-squares problem. In recent works on shading-based refinement [OrEl2015], it is suggested to proceed iteratively, by freezing these terms at each iteration. Although this “fixed point” strategy looks appealing, Figure 3 shows that it induces artifacts, and thus regularization must be employed [OrEl2015]. Other artifacts also arise in SIRFS [Barron2015], when the multiscale strategy is not employed.
Instead of eliminating artifacts by regularization, which may induce a loss of geometric details, we rather separate the difficulty induced by the nonlinearity from that induced by the dependency on the gradient. To this end, we introduce an auxiliary variable , and rewrite (8) in the following, equivalent, manner:
(9) 
[Figure panels, by column: Ground truth; Greylevel, first-order lighting (14); Greylevel, second-order lighting (15); Colored, second-order lighting (16). Rows alternate between Ours and SIRFS for two groups of results, each panel reporting MAE-N and RMSE-I.]
We then turn (9) into a sequence of simpler problems through an ADMM algorithm [Boyd2011]. The augmented Lagrangian functional associated to (9) is defined as
(10) 
where represent Lagrange multipliers, and . ADMM then minimizes (9) by the following iterations:
(11)  
(12)  
(13) 
where is determined automatically [He2000].
We then discretize (11) by finite differences, and solve the discrete optimality conditions by conjugate gradient. With this approach, no explicit boundary condition is required. As for (12), it is solved in each pixel by a Newton method [Coleman1996]. In our experiments, the algorithm stops when the relative residual of the energy in (8) falls below .
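To make the linear depth update concrete, the following sketch solves a one-dimensional analogue by conjugate gradient: it fits the finite-difference gradient of a depth profile to a given target field, without any boundary condition. This is a toy example under assumptions, not the authors' solver: the 1D setting, the forward-difference discretization, the pure gradient-fitting form of the objective, and the names `forward_diff`, `forward_diff_T` and `depth_update` are all hypothetical.

```python
import numpy as np

def forward_diff(z):
    """Forward differences D z on a 1D grid (length n -> n - 1)."""
    return z[1:] - z[:-1]

def forward_diff_T(g):
    """Adjoint D^T g of the forward-difference operator (n - 1 -> n)."""
    out = np.zeros(g.size + 1)
    out[:-1] -= g
    out[1:] += g
    return out

def depth_update(target, n_iter=200, tol=1e-20):
    """Minimize ||D z - target||^2 by conjugate gradient.

    The optimality conditions D^T D z = D^T target are singular (constants
    are in the null space). Starting from zero, every iterate keeps a zero
    mean, so the free additive constant is fixed implicitly and no explicit
    boundary condition is needed.
    """
    z = np.zeros(target.size + 1)
    r = forward_diff_T(target)  # residual of the normal equations at z = 0
    p = r.copy()
    rs = r @ r
    for _ in range(n_iter):
        if rs < tol:
            break
        Ap = forward_diff_T(forward_diff(p))
        alpha = rs / (p @ Ap)
        z += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        p = r + (rs_new / rs) * p
        rs = rs_new
    return z
```

When the target field is the exact gradient of some profile, the update recovers that profile up to an additive constant, which reflects the fact that the gradient alone does not determine the absolute depth.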
2.4 Experiments
Since our method estimates a locally optimal solution, initialization matters. There is one situation where a reasonable initial estimate is available, namely when using an RGB-D camera: the depth channel D is noisy, but it may be refined using shading [Chatterjee2015, Choe2017, Han2013, OrEl2016, OrEl2015, Wu2014]. Hence, to qualitatively evaluate our approach, we consider in Figure 4 three real-world RGB-D datasets from [Han2013], estimating lighting from the rough depth map (assuming ). Since the surface is not oversmoothed through regularization, fine surface details are preserved.
For quantitative evaluation, we use in Figure 5 the well-known “Joyful Yell” dataset, using three lighting scenarios. We first consider greylevel images, with a first-order and then a second-order lighting model, respectively defined by:
(14)  
(15) 
In the third experiment, we consider a colored, second-order lighting model defined by:
(16) 
The importance of initialization is assessed by using two different initial estimates. The accuracy of 3D-reconstruction is evaluated by the mean angular error between the recovered normals and the ground truth ones, and the ability to explain the input image is measured through the RMSE between the data and the image simulated from the 3D-reconstruction.
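The two evaluation measures can be computed as in the sketch below. It is a minimal example under assumptions: the angular error is reported in degrees, both normal fields are unit-length, an optional boolean mask restricts the evaluation to object pixels, and the function names are hypothetical.

```python
import numpy as np

def mean_angular_error_deg(n_est, n_gt, mask=None):
    """Mean angle (in degrees) between two fields of unit normals (..., 3)."""
    cos = np.clip(np.sum(n_est * n_gt, axis=-1), -1.0, 1.0)
    angles = np.degrees(np.arccos(cos))
    if mask is not None:
        angles = angles[mask]  # keep object pixels only
    return float(angles.mean())

def rmse(image, simulated, mask=None):
    """Root mean square error between the data and the re-rendered image."""
    err = (image - simulated) ** 2
    if mask is not None:
        err = err[mask]
    return float(np.sqrt(err.mean()))
```

Clipping the cosine guards against values marginally outside [-1, 1] caused by rounding, which would otherwise produce NaN in `arccos`.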
We compared those values against SIRFS [Barron2015], which is the only method for SFS under natural illumination whose code is freely available. For fair comparison, we disabled albedo and lighting estimation in SIRFS, and gave a zero weight to all smoothing terms. To avoid the artifacts shown in Figure 3, SIRFS’s multiscale strategy was used. Figure 5 shows that SFS under natural illumination can be solved using a purely data-driven strategy, without resorting to either regularization or a multiscale strategy. Besides, the runtimes of our method and SIRFS are comparable: a few minutes in all cases, for images with non-black pixels.
Still, these experiments show that the proposed method strongly depends on the choice of the initial estimate. We now show how to better constrain the 3D-reconstruction problem through sparse multiview correspondences.
3 Multiview Shape-from-shading
Although colored natural illumination partly disambiguates SFS, it does not entirely remove ambiguities [Johnson2011]. Another disambiguation strategy must be considered in the absence of a good initial estimate. We now show that sparse correspondences in a multiview framework can be employed for this purpose.
To this end, let us now assume that we are given images , along with the corresponding albedo maps and lighting vectors, both assumed to be channel- and image-dependent and denoted by and . The joint resolution of the SFS problems could be achieved by solving variational problems such as (9). However, this would result in inconsistent depth maps: the SFS problems need to be coupled.
3.1 Sparse Multiview Constraints
We use multiview consistency to couple the SFS problems, and show that ambiguities are limited when introducing sparse correspondences between the images. We conjecture that all ambiguities disappear if the correspondence set is dense. This conjecture could probably be proved by following [Chambolle1994], but we leave this as future work.
Let us assume that some sparse inter-image pixel correspondences are given (which can be obtained, for instance, by matching SIFT descriptors), and let us write them as the following functions, where and are the masks of the object in images and , :
(17) 
Assuming perspective projection, a 3D-point in world coordinates is conjugate to a pixel according to
(18) 
where is the th depth map (recall that we set to the depth map under perspective projection), is the pixel coordinates the th principal point, is the th focal length, and and are the rotation and translation describing the th pose of the camera (we assume that these poses are calibrated).
The multiview consistency constraint then writes
(19) 
which we rewrite as the following nonlinear constraint:
(20) 
where is a function depending on the depth maps and , whereas the function does not.
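The geometric meaning of the multiview consistency constraint can be sketched as follows: two corresponding pixels, back-projected with their respective depths and camera poses, should yield the same 3D point. The code is only an illustration under assumptions: the pinhole convention `x_cam = R @ X_world + t`, the use of the depth itself (rather than any reparametrization of it), and the names `backproject`, `project` and `consistency_residual` are choices made for this example.

```python
import numpy as np

def backproject(pixel, depth, f, principal_point, R, t):
    """World-space 3D point seen at `pixel` with the given depth.

    ASSUMPTION: cameras follow x_cam = R @ X_world + t, with focal length f
    and principal point (u0, v0) expressed in pixels.
    """
    u, v = pixel
    u0, v0 = principal_point
    x_cam = depth * np.array([(u - u0) / f, (v - v0) / f, 1.0])
    return R.T @ (x_cam - t)

def project(X, f, principal_point, R, t):
    """Pixel coordinates and depth of the world point X in a camera."""
    x_cam = R @ X + t
    u0, v0 = principal_point
    pixel = np.array([f * x_cam[0] / x_cam[2] + u0,
                      f * x_cam[1] / x_cam[2] + v0])
    return pixel, x_cam[2]

def consistency_residual(p_i, depth_i, cam_i, p_j, depth_j, cam_j):
    """Difference between the 3D points back-projected from two matched
    pixels; it vanishes when both depths agree with the same surface point.
    Each camera is a tuple (f, principal_point, R, t)."""
    return backproject(p_i, depth_i, *cam_i) - backproject(p_j, depth_j, *cam_j)
```

For a matched pair, the residual depends on both depth values but only on fixed quantities otherwise (pixels, intrinsics and poses), which is why a sparse set of such pairs suffices to tie the depth maps together.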
3.2 Proposed Variational Paradigm
To disambiguate SFS through multiple views, we suggest using in the variational model (1), where is the norm over , and is a weighting factor. Since the constraint (20) only depends on the depth values, and not on their gradients, we write it in terms of the auxiliary variables of the ADMM algorithm. This is motivated by the fact that the updates of these variables already require per-pixel nonlinear least-squares optimization. Moreover, the depth updates remain linear least-squares ones if the multiview constraint is written in terms of the auxiliary variables. We thus define new auxiliary variables , and turn (9) into:
(21) 
[Figure panels: Greylevel, first-order lighting (with MAE-N); Colored, second-order lighting (with MAE-N); Fused point cloud.]
We experimentally found that the choice of a particular value of the parameter is not important. Obviously, if is set to , then the SFS problems are uncoupled, and thus ambiguous. Yet, as long as is “high enough”, ambiguities disappear. In our tests, we found that the range provides comparable results, and always used the value .
It is straightforward to modify the previous ADMM algorithm for solving (21). In Figure 6, we show the 3D-reconstructions obtained from synthetic views, in the same lighting scenarios as in the first and third experiments of Figure 5, using the same non-realistic initial estimate. We used pixel correspondences which were randomly picked using the ground-truth geometry. In comparison with the single-view results (see Figure 5), the estimated depth maps are more accurate. Besides, if we fuse both depth maps into a point cloud (using the known camera poses), we observe that both 3D-reconstructions are “consistent”, which shows that ambiguities are eliminated.
Finally, we present in Figures 1 and 7 the results of our method on two real-world datasets from [Zollhoefer2015]. We chose these datasets because they exhibit a uniform, though unknown, albedo. This albedo can thus be estimated during lighting calibration (since illumination is not provided in these datasets, it was calculated from the 3D-reconstructions provided in [Zollhoefer2015], but these 3D-reconstructions were then not used any further). Sparse correspondences were extracted by matching standard SIFT features [Lowe1999] (the total number of matches used is for the images of the “Sokrates” dataset, and for the images of the “Figure” one). These real-world experiments demonstrate that shading-based multiview 3D-reconstruction constitutes a promising alternative to standard dense multiview stereo.
[Figure panels: CMPMVS [Jancosek2011] ( views); Ours ( views).]
4 Conclusion
We have shown how to achieve dense multiview 3D-reconstruction without dense correspondences. A new variational approach to shape-from-shading under general lighting is used as the main tool for densification. It drastically reduces the number of required images, while improving the amount of detail in the 3D-reconstruction. In future work, the new approach may be extended by automatic estimation of the albedo and of the lighting. This would allow coping with a broader variety of surfaces, and would simplify the overall procedure.