Photometric stereo techniques aim at acquiring both the shape and the reflectance of a scene. To this end, multiple images are acquired under the same viewing angle but varying lighting, and a physics-based image formation model is inverted. However, the classic way to solve this inverse problem requires lighting to be highly controlled, which restricts practical photometric stereo applications to laboratory setups where careful calibration of lighting must be carried out prior to any 3D-reconstruction.
The objective of this research work is to simplify the overall photometric stereo pipeline, by providing an efficient solution to uncalibrated photometric stereo under general lighting, as illustrated in Figure 1.111We plan to make the code publicly available. In comparison with existing efforts in the same direction, the proposed one has the following advantages:
The joint estimation of shape, reflectance and general lighting is formulated as an end-to-end, mathematically transparent variational problem;
A real 3D-surface represented as a depth map is recovered, rather than possibly non-integrable normals;
It is robust, due to the use of Cauchy’s robust M-estimator and Huber-TV albedo regularization;
It is computationally efficient, thanks to a tailored lagged block coordinate descent scheme initialized using a simple balloon-like shape.
After reviewing related works in Section 2, we discuss in Section 3 the spherical harmonics-based image formation model considered in this work. It can be inverted using the variational approach to the joint recovery of depth, albedo and lighting which is discussed in Section 4. A dedicated numerical solution is then introduced in Section 5 and empirically evaluated in Section 6. Section 7 eventually draws the conclusion of this research.
2 Related Work
3D-models of scenes are essential in many applications such as visual inspection [Farooq2005] or computer-aided surgery using augmented reality [Collins2012]. A 3D-model consists of geometric (position, orientation, etc.) and photometric (color, texture, etc.) properties. Given a set of photographies, the aim of 3D scanning is to invert the image formation process in order to recover these geometric and photometric properties of the observed scene. This notion thus includes both those of 3D-reconstruction (geometry) and of reflectance estimation (photometry).
Many approaches to the problem of 3D-reconstruction from photographies have been studied, and they are grouped under the generic naming “shape-from-X”, where X stands for the clue which is being used (shadows [Shafer1983], contours [Brady1984], texture [Witkin1981], template [Bartoli2015], structured light [Geng2011], motion [Moons2008], focus [Nayar1994], silhouettes [Hernandez2004], etc.). Geometric shape-from-X techniques are based on the identification and analysis of feature point or areas in the image. In contrast, photometric techniques build upon the analysis of the quantity of light received by each photosite of the camera’s sensor. Among photometric techniques, shape-from-shading
is probably the most famous one. This technique, developed in the 70s by Hornet al. [Horn1970], consists in 3D-reconstruction from a single image of a shaded scene. It is a classic ill-posed inverse problem whose numerical solving usually requires the surface’s reflectance to be known [Durou2008]. In order both to limit the ambiguities of shape-from-shading and to allow for automatic reflectance estimation, it has been suggested to consider not just one image of the scene, but several ones acquired from the same viewing angle but under varying lighting. This variant, which was introduced in the late 70s by Woodham [Woodham1978], is known as photometric stereo.
Among the various shape-from-X techniques mentioned above, photometric stereo is the only 3D-scanning technique i.e., the only one which is able to achieve both 3D-reconstruction and reflectance estimation. However, early photometric approaches strongly rely on the control of lighting. The latter is usually assumed for simplicity to be directional, although the case of nearby point light sources has recently regained some attention [logothetis2017semi, mecca2014near]. More importantly, lighting is assumed to be calibrated. Indeed, the uncalibrated problem is ill-posed: the underlying normal map can be estimated only up to a linear ambiguity [Hayakawa1994], which reduces to a generalized bas-relief one if integrability is enforced [Belhumeur1999]. To resolve the latter ambiguity, some prior on the scene’s surface or geometry must be introduced, see [Shi2018] for a recent survey. A natural way to enforce integrability consists in following a differential approach to photometric stereo [Chand13, mecca14] i.e., directly estimate the 3D-surface as a depth map instead of first estimating the surface normals and then integrating them. Such a differential approach to photometric stereo can be coupled with variational methods in order to iteratively refine depth, reflectance and lighting in a robust manner [CVPR2017]. In addition to the theoretical interest of enforcing integrability in order to limit ambiguities, differential approaches to photometric stereo have the advantage of bypassing the problem of integrating the estimated normal field, which is by itself a non-trivial problem [Queau_Survey]. Besides, any error in the estimated normal field might propagate during integration, and thus robustness to specularities or shadows must be enforced during normal estimation, see again [Shi2018] for some discussion.
All the research works mentioned in the previous paragraph assume that lighting is induced by a single light source. Nevertheless, many studies rather considered the case of more general illumination conditions, which finds a natural application in outdoor conditions [Sato1995]
. For instance, the apparent motion of the sun within a day induces changes in the illumination direction which, in theory, allow photometric stereo-based 3D-reconstruction. However, this apparent motion is close to being planar, and thus the rank of the set of illumination vectors is equal or close to[Shen2014] (see also [Holdgeoffroy2015] for additional discussion on the stability of single-day photometric stereo). This situation is thus similar to the two-image case, which is known to be ill-posed since the early 90s [Kozera1993b, Onn1990, Yang1992], although it is still an active research area [Kozera2018]. In order to limit the instabilities due to this issue, one possibility is to consider images acquired over many seasons as in [Abrams2012, Ackermann2012]
, or to resort to deep neural networks[Holdgeoffroy2018]. Another one is to consider a non-directional illumination model to represent natural illumination, as for instance in [Jung2019]. Modeling natural illumination is a promising track, since such a model would not be restricted to sunny days, and images acquired under cloudy days are known to yield more accurate 3D-reconstructions [Holdgeoffroy2015].
However, the previous approaches to photometric stereo under natural illumination assume calibrated lighting, where calibration is deduced from time and GPS coordinates or from a calibration target. The case of both general and uncalibrated lighting is much more challenging and has been fewly explored, apart from studies restricted to sparse 3D-reconstructions [Shen2009] or relying on the prior knowledge of a rough geometry [Ackermann2012b, Peng2017, Shi2014]. Uncalibrated photometric stereo under natural illumination has been revisited recently in [Mo2018], using a spatially-varying equivalent directional lighting model. However, results were limited to the recovery of possibly non-integrable surface normals. Instead, the method which we propose in the present paper directly recovers the underlying surface represented as a depth map. Following the seminal work of Basri and Jacobs [Basri2007], it considers the spherical harmonics representation of general lighting in lieu of the equivalent directional approximation, as discussed in the next section.
3 Image Formation Model
In photometric stereo (PS), we are given a number of observations , each representing a multi-channel image (i.e. ) over a masked pixel domain . Assuming that the object being pictured is Lambertian, the surface’s reflectance is represented by the albedo , and the general image formation model is as follows, for all , , and :
Here is the unit sphere in , represents the channel-wise intensity of the incident light, and and are the channel-wise albedos and the unit-length surface normals, respectively, at the surface point conjugate to pixel . The operation in (1) encodes self-shadows. The overall integral collects elementary luminance contributions arising from all incident lighting directions . In the setup of uncalibrated PS, the quantities , , in addition to , are unknown.
Equivalent directional lighting [holdgeoffroy-15] approximates (1) via
where represents the mean lighting over the visible hemisphere at . The field is spatially variant but can be approximated by directional lighting over small local patches. Over each patch, one is thus faced with the ambiguities of directional uncalibrated PS [Hayakawa1994]. State-of-the-art patch-wise methods [Mo2018] first solve this problem over each patch, then connect the patches to form a complete normal field up to rotation, and eventually estimate the rotation which best satisfies the integrability constraint. Errors may however get propagated during the sequence, resulting in a possibly non-integrable normal field.
Instead of such an equivalent directional lighting model, we rather consider a spherical harmonic approximation (SHA) of general lighting [Basri2003, Basri2007]. By defining the half-cosine kernel as
we can view (1) as an analog of a convolution:
Invoking the Funk-Hecke theorem, we obtain the following harmonic expansion analogous to Fourier series:
Here the spherical harmonics form an orthonormal basis on , and and are the expansion coefficients of and with respect to . Since most energy in the expansion (5) concentrates on low-order terms [Basri2003], we obtain the second-order SHA by truncating the series up to the first nine terms (i.e., ):
The first-order SHA refers to the truncation up to the first four terms (i.e., ). It is shown in [Basri2003] that, for distant lighting, at least of the resulting irradiance is captured by the first-order SHA, and by the second-order SHA (cf. Figure 2 for a visualization).
|Model (1)||Model (7) (1st order)||Model (7) (2nd order)|
Here represents the second-order harmonic images, and represents the harmonic lighting vector whose entries have absorbed and constant factors of . A key advantage of the SHA (7) over the equivalent directional lighting model (2) lies in the spatial invariance of the lighting vectors , which yields a less ill-posed inverse problem [Basri2007]. The counterpart is the nonlinear dependency upon the normal components, which we will handle in Section 5 using a tailored numerical solution. In the next section, we build upon the key observations that integrability [Belhumeur1999] and perspective projection [papadhimitri2013new] both largely reduce the ambiguities of uncalibrated PS to derive a variational approach to inverting the SHA (7).
4 Variational Uncalibrated PS
In this section, we shall propose a joint variational model for uncalibrated PS. To this end, let a 3D-framebe attached to the camera, with the optical center, the -axis aligned with the optical axis such that for any 3D point in front of the camera. Further let a 2D-frame be attached to the focal plane which is parallel to the -plane and contains the masked pixel domain . Under perspective projection, the surface geometry is modeled as a map given by
with the depth map and
the calibrated camera’s intrinsics matrix. In the following we denote for convenience .
Assuming that is differentiable, the surface normal at point is the unit vector oriented towards the camera such that , which yields the following parameterization of the normal by the depth:
Note that the dependence of on is linear.
In the first term above, we use Cauchy’s M-estimator to penalize the data-fitting discrepancy:
It is indeed well-known that Cauchy’s estimator, being non-convex, is robust against outliers; see for instance[CVPR2017] in the context of PS. The scaling parameter is used in all experiments.
The second term in (13) represents a Huber total-variation (TV) regularization on each albedo map , with the Huber loss defined by
and being fixed in the experiments. It turns out that the Huber TV imposes desirable smoothness on the albedo maps and in turn improves the joint estimation overall. Eventually, is a weight parameter which balances the data-fitting term and the Huber TV one. Its value was empirically set to (see Section 6 for some discussion).
We remark that in (13), geometry is directly optimized in terms of the depth (rather than indirectly in terms of the normal ). This is typically known as a differential approach to PS in the literature [Chand13, logothetis2017semi, mecca14, mecca2014near, CVPR2017], with the benefit of both implicitly ensuring integrability (thus limiting ambiguities) and avoiding possibly error-prone integration of normals into depths as a post-processing step.
5 Solver and Implementation
To solve the variational problem (13) numerically, we follow a “discretize-then-optimize” approach. There, is replaced by , being the number of pixels inside , which yields discretized vectors . To alleviate notational burden, we sometimes refer to a pixel by its index and sometimes by its position . The spatial gradient is discretized using a forward difference stencil.
We shall apply a lagged block coordinate descent (LBCD) method to find a local minimum of the objective function in (23). Due to the (highly) non-convex nature of (23), initialization of optimization variables has a strong influence on the final solution. In our implementation, we initialize for all and for all . Moreover, during the first eight iterations we freeze the second-order spherical harmonics coefficients i.e., we reconstruct using only first-order spherical harmonic approximation as a warm start.
Based on the observation that most real-world scenes are convex, we initialize the depth as a balloon-like surface, as discussed in the following.
5.1 Depth Initialization
It is readily seen that a trivial constant initialization of the depth yields uniform vertically aligned normals and, hence, zero entries in the initial harmonic images . This would cause non-meaningful updates on albedos and lighting vectors ; cf. Figure 3 for an illustration.
To solve this issue, we specialize the depth initialization which undergoes two phases:
1. Following [Oswald2012], we generate a balloon-like depth map under orthographic projection.
2. We then convert the orthographic depth to a perspective depth via normal integration [Queau_Survey].
Phase 1 is pursued via the following constrained convex optimization:
Intuitively, the model seeks for which has minimal surface area subject to a constant volume . A global minimizer of this model can be efficiently computed by simple projected gradient iterations:
where and with the spectral norm. The volume constant
is a hyperparameter which is empirically chosen, see Section6 for discussion.
Next, we convert the orthographic depth to a perspective depth . Note that complies with the orthographic projection, under which a 3D-point is represented by
and the corresponding surface normal to the surface at conjugate to pixel is given by
Since is invariant to the projection model, Eq. (11) also implies that
where stands for the log-perspective depth. This further implies the formula for :
which can be integrated to obtain (and hence ). The overall pipeline in Phase 2 is summarized as follows:
|Trivial and its result||Our and its result|
5.2 Lagged Block Coordinate Descent
Even with a reasonable initialization, the numerical resolution of Problem (23) remains challenging. Due to the appearances of the spherical harmonic approximation and the Cauchy’s M-estimator , the objective in (23) is highly nonlinear and nonconvex. To tackle these challenges, here we present a lagged block coordinate descent (LBCD) method which performs efficiently in practice.
To derive LBCD, we introduce an auxiliary variable such that . This enables us to rewrite (11) as . Then we formulate the following constrained optimization problem:
where is the residual function defined by:
Upon initialization, the proposed LBCD proceeds as follows. At iteration , we lag one iteration behind, i.e.,
and then sequentially update each of the three blocks (namely , and ). In each resulting subproblem, we solve (lagged) weighted least squares problems as an approximation of the Cauchy loss and/or the Huber loss. This is detailed in the following:
(Update ): We evaluate the residual
and then set up the (lagged) weight factors for both the Cauchy loss and the Huber loss as
The albedos are updated as the solution to the following linear weighted least-squares problem:
which is carried out by conjugate gradient (CG).
(Update ): The lighting subproblem is similar to the one for albedos, except for absence of the Huber TV term. Upon evaluation of the residual and the weight factor , we update by solving the following linear weighted least-squares problem via CG:
(Update ): The depth subproblem requires additional efforts. With and evaluated after the -update, we are faced with the following weighted least squares problem:
where the dependence of on is still nonlinear. Therefore, we further linearize with respect to and arrive at the following update:
where is the Jacobian of the map at . The resulting linearized least-squares problem is again solved by CG. In our experiments, we additionally incorporate backtracking line search in the -update to ensure a monotonic decrease of the energy.
6 Experimental Validation
|Armadillo||Joyful Yell||Lucy||Thai Statue||Exemplary environment maps|
This section is concerned with the evaluation of the proposed nonconvex variational approach to uncalibrated photometric stereo under general lighting.
6.1 Synthetic Experiments
To validate the impact of the initial volume in (16), the tunable hyper-parameter , and the number of input images in (13), we consider 36 challenging synthetic datasets. We use four different depth maps (“Joyful Yell” [Bendansie2015], “Lucy” [Levoy2005data], “Armadillo” [Levoy2005data] and “Thai Statue” [Levoy2005data]) and nine different albedo maps and each of those 36 combinations is rendered as described in (1) using different environment maps222Environment maps are downloaded from http://www.hdrlabs.com/sibl/archive.html, cf. Figure 4. The resulting 25 RGB images per dataset are used as input, along with the intrinsic camera parameters and a binary mask . A quantitative evaluation on the triplet is carried out on four randomly chosen datasets (Armadillo & White albedo, Joyful Yell & Ebsd albedo, Lucy & Hippie albedo, and Thai Statue & Voronoi albedo), comparing the impact of each value of on the resulting mean angular error (MAE) between ground truth and estimated normals.
First, we validate the choice of the input volume using the initially fixed values of and . As the volume depends on the size of the mask, we consider a linear parametrization and evaluate a range of ratios . Figure 5 (left) indicates that the optimal value of is dataset-dependent. For synthetic datasets we always selected this optimal value, yet for real-world data no such evaluation is possible and must be tuned manually. Since the ballooning-based depth initialization can be carried out in real-time (implementation is parallelized in CUDA), the user has an immediate feedback on the initial depth and thus a plausible initial shape is easily drawn. Humans excel at estimating size and shape of objects [Baldwin2016] and real-world experiments will show that a manual choice of can result in appealing geometries.
Next, we evaluate the impact of , cf. Figure 5 (right). As can be seen, the depth estimate seems to deteriorate for too small and too large values of , whereas seems to provide good depth estimates across all albedo maps. Therefore we fix for all our upcoming experimental evaluation.
Unsurprisingly, the MAE is inversely proportional to the number of input images, but runtime increases (linearly) with , cf. Figure 6. We found that represents a good trade-off between runtime and accuracy, and fix for all our further experiments. Our Matlab implementation needs about – minutes on a computer with an Intel processor.
Having fixed the choice of , we can now evaluate our approach against other state-of-the-art methods. We compare our results against those obtained by an uncalibrated photometric stereo approach assuming directional lighting [Favaro2012], and another one assuming general (first-order spherical harmonics) illumination yet relying on an input shape prior (e.g., from an RGB-D sensor) [Peng2017]. As this limiting assumption on the access to a sensor-based depth prior is not always given and to make comparison fair, we input as depth prior to this method the ballooning initialization described in Section 5.1. Furthermore, we compare against another uncalibrated photometric stereo work under natural illumination [Mo2018]333Code associated with [Favaro2012] and [Peng2017] can be found online, and the results obtained by [Mo2018] were provided by the authors., which resorts to the equivalent directional lighting instead of spherical harmonics, cf. Section 3. Table 1 shows the median and mean MAEs over all 36 datasets (a more detailed table can be found in the supplementary material). On these datasets, it can be seen that our method quantitatively outperforms the current state-of-the-art by a factor of –. This gain is also evaluated qualitatively in Figure 7, which shows a selection of two results.
Joyful Yell & Lena
Thai Statue & White
6.2 Real-World Experiments
For real-world data we use the publicly available dataset of [Haefner2018b]. It offers eight challenging real-world datasets of objects with complex geometry and albedo captured under daylight and a freely moving LED, along with intrinsics matrix and masks . Results are presented in Figure 8. Despite relying on a directional lighting model, the approach of [Favaro2012] produces reasonable results on some datasets (Face1, Ovenmitt or Shirt), but it fails on others. As [Peng2017] assumes a reliable prior on depth in order to perform a photometric refinement, this approach is biased towards its initialization and thus, only when the depth prior is very close to the objects’ rough shape (Ovenmitt, Shirt, Tabletcase, Vase) a meaningful geometry is recovered. The approach of [Mo2018] estimates a possibly non-integrable normal field only, and it can be seen that after integration the depth map might not be satisfactory. As our approach optimizes over depth directly, such issues are not apparent and we are able to recover fine-scale geometric details throughout all tests.
We proposed a variational approach to uncalibrated photometric stereo (PS) under general lighting. Assuming a perspective camera setup, our method jointly estimates shape, reflectance and lighting in a robust manner. The possible non-integrability of normals is bypassed by the direct estimation of the underlying depth map, and robustness is ensured by resorting to Cauchy’s M-estimator and Huber-TV albedo regularization. Although the problem is nonconvex and thus numerically challenging and initialization-dependent, we tackled it efficiently through a tailored lagged block coordinate descent algorithm and ballooning-based depth initialization. Over a series of evaluations on synthetic and real data, we demonstrated that our method outperforms existing methods in terms of MAE by a factor of – and provides highly detailed reconstructions even in challenging real-world settings.
In future research, a more automated balloon-like depth initialization is desirable. Exploring the theoretical foundations (uniqueness of a solution) of differential perspective uncalibrated PS under spherical harmonic lighting and analyzing the convergence properties of the proposed numerical scheme constitute two other promising perspectives.
Appendix A Further Details on Synthetic Experiments
To provide further insights on the synthetic experiments (in Section 6.1), we visualize the environment lighting , , used to render each image. Figure 9 shows all environment maps444All environment maps were downloaded from http://www.hdrlabs.com/sibl/archive.html.
The impact of each incident lighting , , is illustrated in Figure 10 showing the Joyful Yell with a White () albedo. Thus, color changes in the images are caused by lighting only, as depicted in model (1) and (7) in the main paper.
Table 2 shows the mean angular error (MAE) of each dataset on the state-of-the-art approaches [Favaro2012, Mo2018, Peng2017] and our proposed methodology. It can be seen that our approach consistently overcomes [Favaro2012, Mo2018, Peng2017] by a factor of –. Only the Pattern albedo seems to bias the resulting depth negatively, yet even in this case our approach estimates the geometry more faithfully than the current state-of-the-art.
Two more qualitative results on synthetic data are shown in Figure 11. While [Favaro2012] gives more meaningful results on Armadillo with Constant albedo, depth deteriorates strongly on Lucy with Hippie albedo. Methods of [Mo2018, Peng2017] both result in rather flattened shapes (cf. Lucy). Most accurate results are achieved using the proposed method where fine geometric details, as well as non flattened depth estimates are shown.
Armadillo & Constant
Lucy & Hippie
Appendix B Further Details on Real-World Results
Supplementary to the real-world experiments (in Section 6.2), Figure 12 shows alternative viewpoints of the real-world results. The estimated albedos, which are mapped onto the surfaces, appear satisfactory.
|Face 1||Face 2||Rucksack||Backpack|