1 Introduction
Acquiring the shape and the reflectance of a scene is a key issue, e.g., for the movie industry, as it allows proper relighting. Currently proposed solutions focus on small objects and rely on multiple priors UnityDelight, or require very controlled environments (ReinhardWardPattanaikDebevec, Chapter 9). Well-established shape acquisition techniques such as multi-view stereo exist for accurate 3D-reconstruction. Nevertheless, they do not aim at recovering the surface reflectance. Hence, the original input images are usually mapped onto the 3D-reconstruction as texture. Since the image gray-level mixes shading information (induced by lighting and geometry) and reflectance (which is characteristic of the surface), relighting based on this approach usually lacks realism. To improve the results, reflectance needs to be separated from shading.
In order to illustrate our purpose more precisely, let us take the example of a Lambertian surface. In a 2D-point (pixel) $p$ conjugate to a 3D-point $x$ of a Lambertian surface, the gray-level is written

(1) $I(p) = \rho(x)\, \mathbf{s}(x) \cdot \mathbf{n}(x)$

In the right-hand side of (1), $\rho(x)$ is the albedo^{1}^{1}1Since the albedo suffices to characterize the reflectance of a Lambertian surface, we will name it “reflectance” as well., $\mathbf{s}(x)$ the lighting vector, and $\mathbf{n}(x)$ the outer unit-length normal to the surface. All these elements a priori depend on $x$ i.e., they are defined locally. Whereas $I(p)$ is always supposed to be given, different situations can occur, according to which quantities are also known, among $\rho$, $\mathbf{s}$ and $\mathbf{n}$.
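As a minimal sketch of the Lambertian model (1) (the function and variable names are ours, chosen for illustration):

```python
import numpy as np

def lambertian_graylevel(rho, s, n):
    """Gray-level of a Lambertian point: I = rho * <s, n>.

    rho: albedo (scalar), s: (3,) lighting vector,
    n: (3,) outer unit-length normal.
    """
    # Clamp at zero: points facing away from the light are in shadow.
    return rho * max(float(np.dot(s, n)), 0.0)

# Fronto-parallel normal under an overhead light of unit intensity:
I = lambertian_graylevel(0.5, np.array([0.0, 0.0, 1.0]),
                         np.array([0.0, 0.0, 1.0]))  # I = 0.5
```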
One equation (1) per pixel is not enough to simultaneously estimate the reflectance $\rho$, the lighting $\mathbf{s}$ and the geometry, represented here by $\mathbf{n}$, because there are many more unknowns than equations. Figure 1 illustrates this source of ill-posedness through the so-called “workshop metaphor” introduced by Adelson and Pentland in Adelson: among three plausible interpretations (b), (c) and (d) of image (a), we are particularly interested in (d), which illustrates the principle of photometric 3D-reconstruction. This class of methods usually assumes that the lighting is known. Still, there remain three scalar unknowns per equation (1): $\rho$ and $\mathbf{n}$, which has two degrees of freedom. Assuming moreover that the reflectance $\rho$
is known, the shape-from-shading technique Horn uses the shading as unique clue to recover the shape from Equation (1), but the problem is still ill-posed. A classical way to make photometric 3D-reconstruction well-posed is to use $m$ images taken using a single camera pose, but under varying known lighting:

(2) $I_i(p) = \rho(x)\, \mathbf{s}_i \cdot \mathbf{n}(x), \quad i \in \{1, \dots, m\}$

In this variant of shape-from-shading called photometric stereo Woodham1980a, the reflectance and the normal can be estimated without any ambiguity, as soon as non-coplanar lighting vectors are used.
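A sketch of this resolution (hypothetical names; calibrated lightings stacked row-wise): solve the linear system for the vector $\rho\mathbf{n}$, then separate albedo and normal by normalization.

```python
import numpy as np

def photometric_stereo(S, I):
    """Calibrated photometric stereo at one pixel.

    S: (m, 3) matrix of non-coplanar lighting vectors (one per row),
    I: (m,) gray-levels. Solves S @ (rho * n) = I in the least-squares
    sense, then splits albedo (length) and unit normal (direction).
    """
    rho_n, *_ = np.linalg.lstsq(S, I, rcond=None)
    rho = np.linalg.norm(rho_n)
    return rho, rho_n / rho

# Synthetic check: render a point with known albedo/normal, then recover them.
S = np.array([[0.0, 0.0, 1.0], [1.0, 0.0, 1.0], [0.0, 1.0, 1.0]])
n_true = np.array([0.0, 0.6, 0.8])
I = 0.7 * S @ n_true
rho, n = photometric_stereo(S, I)  # rho = 0.7, n = n_true
```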
Symmetrically to (2), solving the problem:

(3) $I(p_j) = \rho(x_j)\, \mathbf{s} \cdot \mathbf{n}(x_j), \quad j \in \{1, \dots, n\}$

allows one to estimate the lighting $\mathbf{s}$, as soon as the reflectances $\rho(x_j)$ and non-coplanar normals $\mathbf{n}(x_j)$, $j \in \{1, \dots, n\}$, are known. This can be carried out, for instance, by placing a small calibration pattern with known color and known shape near each 3D-point Queau2017bis.
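The symmetric problem can be sketched the same way (names are ours): each known point contributes one linear equation in the lighting vector.

```python
import numpy as np

def estimate_lighting(rho, N, I):
    """Recover the lighting vector s from known reflectances and normals.

    rho: (n,) albedos, N: (n, 3) unit normals (non-coplanar rows),
    I: (n,) gray-levels. Each equation rho_j * <s, n_j> = I_j is one row
    of the linear system (rho_j * n_j^T) s = I_j, solved in least squares.
    """
    A = rho[:, None] * N
    s, *_ = np.linalg.lstsq(A, I, rcond=None)
    return s

N = np.array([[0.0, 0.0, 1.0], [0.8, 0.0, 0.6], [0.0, 0.8, 0.6]])
rho = np.full(3, 0.5)
s_true = np.array([0.2, 0.1, 1.0])
s = estimate_lighting(rho, N, rho * (N @ s_true))  # recovers s_true
```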
The problem we aim at solving in this paper is slightly different. Suppose we are given a series of $m$ images of a scene taken under a single lighting, but from $m$ different camera poses. According to Lambert’s law, this ensures that a 3D-point looks equally bright in all the images where it is visible. Such invariance is the basic clue of multi-view stereo (MVS), which has become a very popular technique for 3D-reconstruction MVS. Therefore, since an estimate of the surface shape is available, $\mathbf{n}$ is known. Now, we have to index the pixels by the image number $i$. Fortunately, additional data provided by MVS are the correspondences between the different views, taking the form of tuples of pixels which are conjugate to a common 3D-point $x$.
Our problem is written^{2}^{2}2Even if they look very similar, Problems (2), (3) and (4) have completely different peculiarities.:

(4) $I_i(\pi_i(x)) = \rho(x)\, \mathbf{s} \cdot \mathbf{n}(x), \quad i \in \{1, \dots, m\}$

where $\pi_i(x)$ is the projection of $x$ in the $i$-th image, and $\rho(x)$ and $\mathbf{s}$ are unknown. Obviously, this system reduces to Equation (1), since its $m$ equations are the same one: the right-hand side of (4) does not depend on $i$, and neither does the left-hand side since, as already noticed, the lighting does not vary from one image to another, and the surface is Lambertian.
Multi-view data helps estimating the reflectance, because it provides the 3D-shape via MVS. However, even if $\mathbf{n}$ is known, Equation (1) remains ill-posed. This is illustrated, in Figure 1, by the solutions (b) and (c), which correspond to the same image (a) and to a common planar surface. In the absence of any prior, Equation (1) has an infinity of solutions in $(\rho, \mathbf{s})$. In addition, determining the lighting from each of these solutions would give rise to another ambiguity, since $\mathbf{s}$ is not forced to be unit-length, contrarily to $\mathbf{n}$.
Such a double source of ill-posedness probably explains why various methods for reflectance estimation have been designed, introducing a variety of priors in order to disambiguate the problem. Most of them assume that brightness variations induced by reflectance changes are likely to be strong but sparsely distributed, while the lighting is likely to induce smoother changes Land1971. This suggests separating a single image into a piecewise-smooth layer and a more oscillating one. In the computer vision literature, this is often referred to as “intrinsic image decomposition”, while the terminology “cartoon + texture decomposition” is more frequently used by the mathematical imaging community (both these problems will be discussed in Section 2).

Contributions.
In this work, we show the relevance of using multi-view images for reflectance estimation. Indeed, this enables a prior shape estimation using MVS, which essentially reduces the decomposition problem to the joint estimation of a set of reflectance maps, as illustrated in Figure 2. We elaborate on the variational approach to multi-view decomposition into reflectance and shading, which we initially presented in SSVM_2017_Jean. The latter introduced a robust TV framework for the joint estimation of piecewise-smooth reflectance maps and of spherical harmonics lighting, with an additional term ensuring the consistency of the reflectance maps. The present paper extends this approach by developing the theoretical foundations of this variational model. To this end, our parameterization choices are further discussed and the underlying ambiguities are exhibited. The variational model is motivated by a Bayesian rationale, and the proposed numerical scheme is interpreted in terms of a majorization-minimization algorithm. Finally, we conclude that, besides a preliminary measurement of the incoming lighting, varying the lighting along with the viewing angle, in the spirit of photometric stereo, is the only way to estimate the reflectance without resorting to any prior.
Organization of the Paper.
After reviewing related approaches in Section 2, we formalize in Section 3 the problem of multi-view reflectance estimation. Section 4 then introduces a Bayesian-to-variational approach to this problem. A simple numerical strategy for solving the resulting variational problem, which is based on alternating majorization-minimization, is presented in Section 5. Experiments on both synthetic and real-world datasets are then conducted in Section 6, before summarizing our achievements and suggesting future research directions in Section 7.
2 Related Works
Studied since the 1970s Land1971, the problem of decomposing an image (or a set of images) into a piecewise-smooth component and an oscillatory one is a fundamental computer vision problem, which has been addressed in numerous ways.
Cartoon + Texture Decomposition.
Researchers in the field of mathematical imaging have suggested various variational models for this task, using for instance non-smooth regularization and Fourier-based frequency analysis Aujol2006, or TV variational models IpolCartoon. However, such techniques do not use an explicit photometric model to justify the decomposition, whereas photometric analysis, which is another important branch of computer vision, may be a source of inspiration for motivating new variational models.
Photometric Stereo.
As discussed in the introduction, photometric stereo techniques Woodham1980a are able to unambiguously estimate the reflectance and the geometry, by considering several images obtained from the same viewing angle but under calibrated, varying lighting. Photometric stereo has even been extended to the case of uncalibrated, varying lighting Basri2007. In the same spirit as uncalibrated photometric stereo, our goal is to estimate reflectance under unknown lighting. However, the problem is less constrained in our case, since we cannot ensure that the lighting varies. Our hope is that this can be somewhat compensated by the prior knowledge of geometry, and by resorting to appropriate priors. Various priors for reflectance have been discussed in the context of intrinsic image decomposition.
Intrinsic Image Decomposition.
Separating reflectance from shading in a single image is a challenging problem, often referred to as intrinsic image decomposition. Given the ill-posed nature of this problem, prior information on shape, reflectance and/or lighting must be introduced. Most of the existing works are based on the “retinex theory” Land1971, which states that most of the slight brightness variations in an image are due to lighting, while reflectance is piecewise-constant (as, for instance, in a Mondrian image). A variety of clustering-based GMLG12; Shen2011 or sparsity-enhancing methods Gehler2011; NadianGhomsheh2016; Shen2011; Song2017 have been developed based on this theory. Among others, the work of Barron and Malik Barron, which presents interesting results, relies on multiple priors to solve the fundamental ambiguity of shape-from-shading, which we aim at removing in the multi-view context. Some other methods disambiguate the problem by requiring the user to “brush” uniform reflectance parts BPD09; NadianGhomsheh2016, or by resorting to a crowd-sourced database bell14intrinsic. Still, these works require user interactions, which may not be desirable in certain cases.
Multiview 3Dreconstruction.
Instead of introducing possibly unverifiable priors, or relying on user interactions, ambiguities can be reduced by assuming that the geometry of the scene is known. Intrinsic image decomposition has for instance been addressed using an RGB-D camera Chen2013 or, closer to our proposal, multiple views of the same scene under different angles Laffont2013; Laffont2012. In the latter works, the geometry is first extracted from the multi-view images, before the problem of reflectance estimation is addressed. Geometry computation can be achieved using multi-view stereo (MVS). MVS techniques Seitz have seen significant growth over the last decade, an expansion which goes hand in hand with the development of structure-from-motion (SfM) solutions Moulon. Indeed, MVS requires the parameters of the cameras, which are outputs of the SfM algorithm. Nowadays, these mature methods are commonly used in uncontrolled environments, or even with large-scale Internet data RomeInADay. For the sake of completeness, let us also mention that some efforts have recently been devoted to multi-view and photometrically consistent 3D-reconstruction Jinetal08; Kim; Langguth; Robert; Maurer. Similarly to these methods, we will resort to a compact representation of lighting, namely the spherical harmonics model.
Spherical Harmonics Lighting Model.
Let us consider a point $x$ lying on the surface $\mathcal{S}$ of the observed scene, and let $\mathbf{n}(x)$ be the outer unit-length normal vector to $\mathcal{S}$ in $x$. Let $H_x$ be the hemisphere centered in $x$, having as basis plane the tangent plane to $\mathcal{S}$ in $x$. Each light source visible from $x$ can be associated to a point on $H_x$. If we describe by the vector $u$ the corresponding elementary light beam (oriented towards the source), then by definition of the reflectance (or BRDF) of the surface, denoted $r$, the luminance of $x$ in the direction $v$ is given by

(5) $L(x, v) = \int_{u \in H_x} r(x, u, v)\, \mathrm{d}E(x, u)$

where $E$ is the surface illuminance. In general, $r$ depends both on the direction $u$ of the light, and on the viewing direction $v$, relatively to $\mathbf{n}(x)$.

This expression of the luminance is intractable in the general case. However, if we restrict our attention to Lambertian surfaces, the reflectance reduces to the albedo $\rho$, which is independent of any direction, and the luminance does not depend on the viewing direction anymore. If the light sources are further assumed to be distant enough from the object, then the lighting is independent of $x$ i.e., the light beams are the same for the whole (supposedly convex) object, and thus the lighting is completely defined on the unit sphere. Therefore, the integral (5) acts as a convolution on the unit sphere, having as kernel the clamped cosine $\max\{u \cdot \mathbf{n}, 0\}$. Spherical harmonics, which can be considered as the analogue to the Fourier series on the unit sphere, have been shown to be an efficient low-dimensional representation of this convolution Basri; Ramamoorthi. Many vision applications Kim; Wu use second order spherical harmonics, which can capture most of the natural lighting Frolova using only nine coefficients. This yields an approximation of the luminance of the form
(6) $L = \rho\, \mathbf{l} \cdot \mathbf{m}$

where $\rho$ is the albedo (reflectance), $\mathbf{l} \in \mathbb{R}^9$ is a compact lighting representation, and $\mathbf{m} \in \mathbb{R}^9$ stores the local geometric information. The latter is deduced from the normal $\mathbf{n} = [n_1, n_2, n_3]^\top$ according to (one usual convention; the constant factors of the spherical harmonics basis are absorbed into $\mathbf{l}$):

(7) $\mathbf{m} = \left[1,\; n_1,\; n_2,\; n_3,\; n_1 n_2,\; n_1 n_3,\; n_2 n_3,\; n_1^2 - n_2^2,\; 3 n_3^2 - 1\right]^\top$
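One common second-order spherical-harmonics convention for the geometric vector can be computed as follows (an assumption for illustration: conventions differ by per-term scale factors, which can be folded into the lighting vector):

```python
import numpy as np

def sh_geometry_vector(n):
    """Second-order spherical-harmonics geometry vector built from a
    unit normal n = (n1, n2, n3); nine entries, one per SH basis
    function, constant factors omitted (absorbed into the lighting)."""
    n1, n2, n3 = n
    return np.array([
        1.0,                        # order 0 (ambient)
        n1, n2, n3,                 # order 1
        n1 * n2, n1 * n3, n2 * n3,  # order 2
        n1**2 - n2**2,
        3.0 * n3**2 - 1.0,
    ])

m_vec = sh_geometry_vector(np.array([0.0, 0.0, 1.0]))
```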
In (6), the lighting vector $\mathbf{l}$ is the same in all the points of the surface, but the reflectance $\rho$ and the geometric vector $\mathbf{m}$ vary along the surface of the observed scene. Hence we will write (6) as:

(8) $L(x) = \rho(x)\, \mathbf{l} \cdot \mathbf{m}(x)$

Our aim in this paper is to estimate the reflectance $\rho(x)$ in each point $x$, as well as the lighting vector $\mathbf{l}$, given a set of multi-view images and the geometric vectors $\mathbf{m}(x)$. We formalize this problem in the next section.
3 Multiview Reflectance Estimation
In this section, we describe in more detail the problem of reflectance estimation from a set of multi-view images. First, we need to make explicit the relationship between gray-level, reflectance, lighting and geometry.
3.1 Image Formation Model
Let $x$ be a point on the surface of the scene. Assume that it is observed by a gray-level camera with linear response function, and let $I$ be the image, where $p$ is the projection of $x$ onto the image plane. Then, the gray-level in the pixel $p$ conjugate to $x$ is proportional to the luminance of $x$ in the direction of observation $v$:

(9) $I(p) = \gamma\, L(x, v)$

where the coefficient $\gamma$, referred to in the following as the “camera coefficient”, is unknown^{3}^{3}3This coefficient depends on several factors such as the lens aperture, the magnification, the exposure time, etc.. By assuming Lambertian reflectance and the light sources distant enough from the object, Equations (8) and (9) yield:

(10) $I(p) = \frac{\gamma}{\pi}\, \rho(x)\, \mathbf{l} \cdot \mathbf{m}(x)$

Now, let us assume that $m$ images of the surface, $I_1, \dots, I_m$, obtained while moving a single camera, are available, and discuss how to adapt (10).
Case 1: unknown, yet fixed lighting and camera coefficient.

If all the automatic settings of the camera are disabled, then the camera coefficient $\gamma$ is independent from the view. We can thus incorporate this coefficient and the denominator into the lighting vector: $\mathbf{s} = \frac{\gamma}{\pi}\, \mathbf{l}$. Moreover, if the illumination is fixed, the lighting vector is independent from the view. In any point $x$ which is visible in the $i$-th view, Equation (10) becomes:

(11) $I_i(\pi_i(x)) = \rho(x)\, \mathbf{s} \cdot \mathbf{m}(x)$

where we denote by $\pi_i$ the 3D-to-2D projection associated to the $i$-th view. In (11), the unknowns are the reflectance $\rho(x)$ and the lighting vector $\mathbf{s}$. Equations (11), $i \in \{1, \dots, m\}$, constitute a generalization of (4) to more complex illumination scenarios. For the whole scene, this is a problem with $n + 9$ unknowns and up to $mn$ equations, where $n$ is the number of 3D-points which have been estimated by multi-view stereo. However, as for System (4), only $n$ equations are linearly independent, hence the problem of reflectance and lighting estimation is under-constrained.
Case 2: unknown and varying lighting and camera coefficient.

If lighting varies, then we have to make the lighting vector view-dependent. If it is also assumed to vary, the camera coefficient can be integrated into the lighting vector along with the denominator i.e., $\mathbf{s}_i = \frac{\gamma_i}{\pi}\, \mathbf{l}_i$, since the estimation of each $\mathbf{s}_i$ will include that of $\gamma_i$. Equation (10) then becomes:

(12) $I_i(\pi_i(x)) = \rho(x)\, \mathbf{s}_i \cdot \mathbf{m}(x)$

There are even more unknowns ($n + 9m$), but this time the equations are linearly independent, at least as long as the $\mathbf{s}_i$ are not proportional i.e., if not only the camera coefficient or the lighting intensity vary across the views, but also the lighting direction^{4}^{4}4Another case, which we do not study here, is when the lighting and camera coefficient are both varying, yet only lighting is calibrated. This is known as “semi-calibrated” photometric stereo Cho2016.. Typically, $n$ is much larger than $9m$, hence the problem is over-constrained as soon as at least two out of the $m$ lighting vectors are non-collinear. This is a situation similar to uncalibrated photometric stereo Basri2007, but much more favorable: the geometry is known, hence the ambiguities arising in uncalibrated photometric stereo are likely to be reduced. However, contrarily to uncalibrated photometric stereo, lighting is not actively controlled in our case. Lighting variations are likely to happen e.g., in outdoor scenarios, yet they will be limited. The lighting vectors $\mathbf{s}_i$, $i \in \{1, \dots, m\}$, will thus be close to each other: lighting variations will not be sufficient in practice for disambiguation (ill-conditioning).
Since (11) is under-constrained and (12) is ill-conditioned, additional information will have to be introduced either way, so we can restrict our attention to the varying lighting case (12).
So far, we have assumed that gray-level images were available. To extend our study to RGB images, we abusively assume channel separation, and apply the framework independently in each channel $c \in \{R, G, B\}$. We then consider the expression:

(13) $I_i^c(\pi_i(x)) = \rho^c(x)\, \mathbf{s}_i^c \cdot \mathbf{m}(x)$

where $\rho^c$ and $\mathbf{s}_i^c$ denote, respectively, the colored reflectance and the $i$-th colored lighting vector, relatively to the response of the camera in channel $c$. A more complete study of Model (13) is presented in JMIV2017_LEDS.
Since we will apply the same framework independently in each color channel, we consider hereafter the gray-level case only i.e., we consider the image formation model (12) instead of (13). The question which arises now is how to estimate the reflectance from a set of equations such as (12), when the geometry is known but the lighting is unknown.
3.2 Reflectance Estimation on the Surface
We place ourselves at the end of the multi-view 3D-reconstruction pipeline. Thus, the projections $\pi_i$ are known (in practice, they are estimated using SfM techniques), as well as the geometry, represented by a set of 3D-points $x_j$, $j \in \{1, \dots, n\}$, and the corresponding normals, from which the geometric vectors $\mathbf{m}(x_j)$ are easily deduced according to (7).
The unknowns are then the reflectance values $\rho_j = \rho(x_j)$ and the lighting vectors $\mathbf{s}_i$, which are independent from the 3D-point index $j$ due to the distant light assumption. At first glance, one may think that their estimation can be carried out by simultaneously solving (12) in all the 3D-points $x_j$, in a purely data-driven manner, using some fitting function $\phi$:

(14) $\min_{\{\rho_j\}_j, \{\mathbf{s}_i\}_i}\; \sum_{i=1}^{m} \sum_{j=1}^{n} v_{i,j}\; \phi\!\left(\rho_j\, \mathbf{s}_i \cdot \mathbf{m}_j - I_{i,j}\right)$

where we denote $\mathbf{m}_j = \mathbf{m}(x_j)$ and $I_{i,j} = I_i(\pi_i(x_j))$, and $v_{i,j}$ is a visibility boolean such that $v_{i,j} = 1$ if $x_j$ is visible in the $i$-th image, and $v_{i,j} = 0$ otherwise.
Let us consider, for the sake of pedagogy, the simplest case of least-squares fitting ($\phi(\cdot) = (\cdot)^2$) and perfect visibility ($v_{i,j} \equiv 1$). Then, Problem (14) is rewritten in matrix form:

(15) $\min_{\boldsymbol{\rho}, S}\; \left\| M \left( \boldsymbol{\rho} \otimes S \right) - I \right\|_F^2$

where the Kronecker product $\boldsymbol{\rho} \otimes S$ is a matrix of $\mathbb{R}^{9n \times m}$, $\boldsymbol{\rho}$ being a vector of $\mathbb{R}^{n}$ which stores the unknown reflectance values $\rho_j$, and $S$ a matrix of $\mathbb{R}^{9 \times m}$ which stores the unknown lighting vectors $\mathbf{s}_i$, column-wise, $M \in \mathbb{R}^{n \times 9n}$ is a block-diagonal matrix whose $j$-th block is the row vector $\mathbf{m}_j^\top$, matrix $I \in \mathbb{R}^{n \times m}$ stores the gray-levels $I_{i,j}$, and $\| \cdot \|_F$ is the Frobenius norm.
Using the pseudo-inverse $M^\dagger$ of $M$, (15) is rewritten:

(16) $\min_{\boldsymbol{\rho}, S}\; \left\| \boldsymbol{\rho} \otimes S - M^\dagger I \right\|_F^2$

Problem (16) is a nearest Kronecker product problem, which can be solved by singular value decomposition (SVD) (GolubV4, Theorem 12.3.1). However, this matrix factorization approach suffers from three shortcomings:
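The core of this computation is an Eckart–Young-type rank-one approximation, which the SVD provides; a minimal sketch under that simplification (names are ours):

```python
import numpy as np

def best_rank_one(M):
    """Best rank-one approximation u @ v.T of a matrix M, obtained from
    the leading singular triplet. The nearest Kronecker product problem
    reduces to this step after a suitable rearrangement of entries."""
    U, sv, Vt = np.linalg.svd(M, full_matrices=False)
    return U[:, 0] * sv[0], Vt[0]  # factors, defined up to a scale ambiguity

# Noiseless check: a rank-one matrix is recovered exactly.
M = np.outer([1.0, 2.0, 3.0], [0.5, 1.0])
u, v = best_rank_one(M)
```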

It is valid only if all 3D-points are visible under all the viewing angles, which is rather unrealistic. In practice, (15) should be replaced by

(17) $\min_{\boldsymbol{\rho}, S}\; \left\| V \odot \left( M \left( \boldsymbol{\rho} \otimes S \right) - I \right) \right\|_F^2$

where $V$ is a visibility matrix containing the values $v_{i,j}$, and $\odot$ is the Hadamard product. This yields a Kronecker product problem with missing data, which is much more arduous to solve.

It is adapted only to least-squares estimation. Considering a more robust fitting function $\phi$ would prevent a direct SVD solution.

If lighting does not vary ($\mathbf{s}_i \equiv \mathbf{s}$), then it can be verified that (15) is ill-posed. Among its many solutions, the following trivial one can be exhibited:

(18) $\mathbf{s}_i \equiv [1, 0, \dots, 0]^\top$
(19) $\rho_j = \bar{I}_j$

where:

(20) $\bar{I}_j = \frac{1}{m} \sum_{i=1}^{m} I_{i,j}$

is the mean of the gray-levels over the view indices $i$. This trivial solution means that the lighting is assumed to be completely diffuse^{5}^{5}5In the computer graphics community, this is referred to as “ambient lighting”., and that the reflectance is equal to the image gray-level, up to noise only. Obviously, this is not an acceptable interpretation. As discussed in the previous subsection, in real-world scenarios we will be very close to this degenerate case, hence additional regularization will have to be introduced, which makes things even harder.
Overall, the optimization problem which needs to be addressed is not as easy as (16). It is a non-quadratic regularized problem of the form:

(21) $\min_{\{\rho_j\}_j, \{\mathbf{s}_i\}_i}\; \sum_{i=1}^{m} \sum_{j=1}^{n} v_{i,j}\; \phi\!\left(\rho_j\, \mathbf{s}_i \cdot \mathbf{m}_j - I_{i,j}\right) + \sum_{j=1}^{n} \sum_{k \in \mathcal{N}(j)} \psi\!\left(\rho_j - \rho_k\right)$

where $\mathcal{N}(j)$ is the set of neighbors of $x_j$ on surface $\mathcal{S}$, and the regularization function $\psi$ needs to be chosen appropriately to ensure piecewise-smoothness.
However, the sampling of the points $x_j$ on surface $\mathcal{S}$ is usually non-uniform, because the shape of $\mathcal{S}$ is potentially complex. It may thus be difficult to design appropriate fidelity and regularization functions $\phi$ and $\psi$, and to design an appropriate numerical scheme. In addition, some fine brightness variations may be missed if the sampling is not dense enough. Overall, direct estimation of reflectance on the surface looks promising at first sight, but rather tricky in practice. Therefore, we leave this as an interesting future research direction and follow in this paper a simpler approach, which consists in estimating reflectance in the image domain.
3.3 Reflectance Estimation in the Image Domain
Instead of trying to colorize the 3D-points estimated by MVS i.e., of parameterizing the reflectance over the (3D) surface $\mathcal{S}$, we can also formulate the reflectance estimation problem in the (2D) image domain.
Equation (12) is equivalently written, in each pixel $p \in \Omega_i$:

(22) $I_i(p) = \rho_i(p)\, \mathbf{s}_i \cdot \mathbf{m}_i(p)$

where we denote $\rho_i(p) = \rho(x)$ and $\mathbf{m}_i(p) = \mathbf{m}(x)$, with $x$ the 3D-point conjugate to $p$ in view $i$, and $\Omega_i$ the part of the $i$-th image domain covering the surface. Instead of estimating one reflectance value per estimated 3D-point, the reflectance estimation problem is thus turned into the estimation of $m$ “reflectance maps”

(23) $\rho_i : \Omega_i \to \mathbb{R}, \quad i \in \{1, \dots, m\}$
On the one hand, the 2D-parameterization (23) does not enforce the consistency of the reflectance maps. This will have to be explicitly enforced later on. Besides, the surface will not be directly colorized: the estimated reflectance maps have to be back-projected and fused over the surface in a final step.

On the other hand, the question of occlusions (visibility) does not arise, and the domains $\Omega_i$ are subsets of a uniform square 2D-grid. Therefore, it will be much easier to design appropriate fidelity and regularization terms. Besides, there will be as many reflectance estimates as pixels in those sets: with modern HD cameras, this number is much larger than the number of 3D-points estimated by multi-view stereo. Estimation will thus be much denser.
With such a parameterization choice, the regularized problem (21) will be turned into:

(24) $\min_{\{\rho_i\}_i, \{\mathbf{s}_i\}_i}\; \sum_{i=1}^{m} \sum_{p \in \Omega_i} \phi\!\left(\rho_i(p)\, \mathbf{s}_i \cdot \mathbf{m}_i(p) - I_i(p)\right) + \sum_{i=1}^{m} \sum_{p \in \Omega_i} \sum_{q \in \mathcal{N}_i(p)} \psi\!\left(\rho_i(p) - \rho_i(q)\right) + \chi\!\left(\{\rho_i\}_i\right)$

with some function $\chi$ to ensure multi-view consistency, and where $\mathcal{N}_i(p)$ is the set of neighbors of pixel $p$ which lie inside $\Omega_i$. Note that, since $\Omega_i$ is a subset of a square, regular 2D-grid, this neighborhood is much easier to handle than that appearing in (21).

In the next section, we discuss appropriate choices for $\phi$, $\psi$ and $\chi$ in (24), by resorting to a Bayesian rationale.
4 A Bayesian-to-variational Framework for Multi-view Reflectance Estimation
Following Mumford’s Bayesian rationale for the variational formulation Mumford1994, let us now introduce a Bayesian-to-variational framework for estimating reflectance and lighting from multi-view images.
4.1 Bayesian Inference
Our problem consists in estimating the reflectance maps $\rho_i$ and the lighting vectors $\mathbf{s}_i$, given the images $I_i$, $i \in \{1, \dots, m\}$. As we already stated, a maximum likelihood approach is hopeless, because a trivial solution arises. We rather resort to Bayesian inference, estimating $(\{\rho_i\}_i, \{\mathbf{s}_i\}_i)$ as the maximum a posteriori (MAP) of the distribution

(25) $P\!\left(\{\rho_i\}, \{\mathbf{s}_i\} \mid \{I_i\}\right) = \frac{P\!\left(\{I_i\} \mid \{\rho_i\}, \{\mathbf{s}_i\}\right)\, P\!\left(\{\rho_i\}, \{\mathbf{s}_i\}\right)}{P\!\left(\{I_i\}\right)}$

where the denominator is the evidence, which can be discarded since it depends neither on the reflectance nor on the lighting, and the factors in the numerator are the likelihood and the prior, respectively.
Likelihood.
The image formation model (22) is never strictly satisfied in practice, due to noise, cast-shadows and possibly slightly specular surfaces. We assume that such deviations from the model can be represented as independent (with respect to pixels and views) Laplace laws^{6}^{6}6We consider the Laplace law here because: i) since it has higher tails than the Gaussian, it allows for sparse outliers to the Lambertian model such as cast-shadows or specularities; ii) it yields convex optimization problems, unlike other heavy-tailed distributions such as the Cauchy distribution. with zero mean and scale parameter $\alpha$:

(26) $P\!\left(\{I_i\} \mid \{\rho_i\}, \{\mathbf{s}_i\}\right) = \prod_{i=1}^{m} \frac{1}{(2\alpha)^{|\Omega_i|}} \exp\!\left( - \frac{ \left\| \rho_i\, (\mathbf{s}_i \cdot \mathbf{m}_i) - I_i \right\|_{\ell^1(\Omega_i)} }{\alpha} \right)$

where $\| \cdot \|_{\ell^1(\Omega_i)}$ is the $\ell^1$ norm over $\Omega_i$ and $|\Omega_i|$ is the cardinality of $\Omega_i$.
Prior.
Since the reflectance maps are independent from the lighting vectors, the prior can be factorized to $P(\{\rho_i\}, \{\mathbf{s}_i\}) = P(\{\rho_i\})\, P(\{\mathbf{s}_i\})$. Since the lighting vectors are independent from each other, the prior distribution of the lighting vectors factorizes to $P(\{\mathbf{s}_i\}) = \prod_{i=1}^{m} P(\mathbf{s}_i)$. As each lighting vector is unconstrained, we can consider the same uniform distribution for each of them, independently from the view index $i$. This distribution being independent from the unknowns, we can discard the lighting prior from the inference process. Regarding the reflectance maps, we follow the retinex theory Land1971, and consider each of them as piecewise-constant. The natural prior for each such map is thus the Potts model:

(27) $P(\rho_i) = \frac{1}{K} \exp\!\left( - \frac{ \left\| \nabla \rho_i \right\|_{\ell^0(\Omega_i)} }{\beta} \right)$

where $\nabla \rho_i(p)$ represents the gradient of $\rho_i$ at pixel $p$ (approximated, in practice, using first-order forward stencils with a Neumann boundary condition), $K$ is a normalization coefficient and $\beta$ a scale parameter. Note that we use the abusive norm notation $\| \cdot \|_{\ell^0(\Omega_i)}$ to denote:

(28) $\left\| \nabla \rho_i \right\|_{\ell^0(\Omega_i)} = \sum_{p \in \Omega_i} \left| \nabla \rho_i(p) \right|_0$

with $|g|_0 = 0$ if $g = \mathbf{0}$, and $|g|_0 = 1$ otherwise.
The reflectance maps are obviously not independent: the reflectance, which characterizes the surface, should be independent from the view. It follows that the parameters are the same for each Potts model (27), and that the reflectance prior can be taken as the product of $m$ independent distributions with the same parameters $(K, \beta)$:

(29) $P(\{\rho_i\}) = \prod_{i=1}^{m} \frac{1}{K} \exp\!\left( - \frac{ \left\| \nabla \rho_i \right\|_{\ell^0(\Omega_i)} }{\beta} \right)$

but only if the coupling between the reflectance maps is enforced by the following linear constraint:

(30) $\rho_i(p) = \rho_{i'}\!\left( c_{i,i'}(p) \right), \quad \forall (i, i'), \; \forall p \in \Omega_i$

where $c_{i,i'}$ is a “correspondence function”, which is easily created from the (known) projection functions $\pi_i$ and the geometry, and which is defined as follows:

(31) $c_{i,i'}(p) = \pi_{i'}(x)$, with $x$ the 3D-point conjugate to pixel $p$ in view $i$.

Taking the negative logarithm of the posterior (25) and discarding the constant terms, the MAP estimation then comes down to the variational problem:

(32) $\min_{\{\rho_i\}_i, \{\mathbf{s}_i\}_i}\; \sum_{i=1}^{m} \left\| \rho_i\, (\mathbf{s}_i \cdot \mathbf{m}_i) - I_i \right\|_{\ell^1(\Omega_i)} + \lambda \sum_{i=1}^{m} \left\| \nabla \rho_i \right\|_{\ell^0(\Omega_i)} \quad \text{s.t. (30)}$

with $\lambda = \alpha / \beta$.
4.2 Relationship with Cartoon + Texture Decomposition
Applying a logarithm transformation to both sides of (22), we obtain:

(33) $\tilde{I}_i(p) = \tilde{\rho}_i(p) + \widetilde{\left( \mathbf{s}_i \cdot \mathbf{m}_i \right)}(p)$

where the tilde notation is used as a shortcut for the logarithm.
By applying the exact same Bayesian-to-variational rationale, we would end up with the following variational problem:

(34) $\min_{\{\tilde{\rho}_i\}_i, \{\mathbf{s}_i\}_i}\; \sum_{i=1}^{m} \left\| \tilde{\rho}_i + \widetilde{\left( \mathbf{s}_i \cdot \mathbf{m}_i \right)} - \tilde{I}_i \right\|_{\ell^1(\Omega_i)} + \lambda \sum_{i=1}^{m} \left\| \nabla \tilde{\rho}_i \right\|_{\ell^0(\Omega_i)} \quad \text{s.t. (30)}$

The variational problem (34) can be interpreted as a multi-view cartoon + texture decomposition problem, where each log-image $\tilde{I}_i$ is decomposed into a component which is piecewise-smooth (“cartoon”, here the log-reflectance), and a component which contains higher-frequency details (“texture”, here the log-shading). In contrast with conventional methods for such a task, the present one uses an explicit shading model for the texture term.

Note however that such a decomposition is justified only if the log-images $\tilde{I}_i$ are considered. If using the original images $I_i$, our framework should rather be considered as a multi-view cartoon “×” texture decomposition framework.
4.3 Biconvex Relaxation of the Variational Model (32)

Problem (32) is nonconvex (due to the $\ell^0$ regularizers) and nonsmooth (due to the regularizers and to the $\ell^1$ fidelity term). Although some efforts have recently been devoted to the resolution of optimization problems involving $\ell^0$ regularizers Storath2014, we prefer to keep the optimization simple, and approximate these by (convex, but nonsmooth) anisotropic total variation terms:

(35) $\left\| \nabla \rho_i \right\|_{\ell^0(\Omega_i)} \approx \sum_{p \in \Omega_i} \left( \left| \partial_x \rho_i(p) \right| + \left| \partial_y \rho_i(p) \right| \right)$

Besides, the correspondence function $c_{i,i'}$ may be slightly inaccurate in practice, due to errors in the prior geometry estimation obtained via multi-view stereo. Therefore, we turn the linear constraint in (32) into an additional term. Eventually, we replace the non-differentiable absolute values arising from the $\ell^1$ norms by the (differentiable) Moreau envelope of $|\cdot|$ i.e., the Huber loss^{7}^{7}7A small value of $\delta$ is used in the experiments.:

(36) $\phi_\delta(x) = \begin{cases} \dfrac{x^2}{2\delta} & \text{if } |x| \leq \delta, \\[1ex] |x| - \dfrac{\delta}{2} & \text{otherwise.} \end{cases}$
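A direct transcription of the Huber loss (the value of the smoothing parameter is illustrative):

```python
def huber(x, delta=1e-4):
    """Moreau envelope of the absolute value (Huber loss):
    quadratic on [-delta, delta], linear with matching slope beyond."""
    if abs(x) <= delta:
        return x * x / (2.0 * delta)
    return abs(x) - delta / 2.0
```

Both branches agree at $|x| = \delta$ (where the value is $\delta/2$), so the function is continuously differentiable.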
Altogether, this yields the following smooth, biconvex variational problem:

(37) $\min_{\{\rho_i\}_i, \{\mathbf{s}_i\}_i}\; \sum_{i=1}^{m} \sum_{p \in \Omega_i} \phi_\delta\!\left( \rho_i(p)\, \mathbf{s}_i \cdot \mathbf{m}_i(p) - I_i(p) \right) + \lambda \sum_{i=1}^{m} \sum_{p \in \Omega_i} \left( \phi_\delta\!\left( \partial_x \rho_i(p) \right) + \phi_\delta\!\left( \partial_y \rho_i(p) \right) \right) + \mu \sum_{i=1}^{m} \sum_{i' \neq i} \sum_{p \in \Omega_i} \phi_\delta\!\left( \rho_i(p) - \rho_{i'}\!\left( c_{i,i'}(p) \right) \right)$
In Equation (37), the first term ensures photometric consistency (in the sense of the Huber loss function), the second one ensures reflectance smoothness (smoothed anisotropic total variation), and the third term ensures multi-view consistency of the reflectance estimates (again, in the sense of the Huber loss function). At last, $\lambda$ and $\mu$ are tunable hyper-parameters controlling the reflectance smoothness and the multi-view consistency, respectively.

5 Alternating Majorization-minimization for Solving (37)
To solve (37), we propose an alternating majorization-minimization method, which combines alternating and majorization-minimization optimization techniques. As sketched in Figure 3, this algorithm works as follows. Given an estimate $(\{\rho_i^{(k)}\}_i, \{\mathbf{s}_i^{(k)}\}_i)$ of the solution at iteration $k$, the reflectance maps and the lighting vectors are successively updated according to:

(38) $\{\rho_i^{(k+1)}\}_i = \underset{\{\rho_i\}_i}{\arg\min}\; E^{(k)}_{\rho}\!\left( \{\rho_i\}_i \right)$

(39) $\{\mathbf{s}_i^{(k+1)}\}_i = \underset{\{\mathbf{s}_i\}_i}{\arg\min}\; E^{(k)}_{\mathbf{s}}\!\left( \{\mathbf{s}_i\}_i \right)$

where $E^{(k)}_{\rho}$ and $E^{(k)}_{\mathbf{s}}$ are local quadratic majorants of the energy in (37) around, respectively, $\{\rho_i^{(k)}\}_i$ and $\{\mathbf{s}_i^{(k)}\}_i$. Then, the process is repeated until convergence.
To this end, let us first remark that, for any $x_0 \in \mathbb{R}$, the function

(40) $\phi_\delta^{x_0}(x) = \frac{x^2}{2 \max\{ |x_0|, \delta \}} + \phi_\delta(x_0) - \frac{x_0^2}{2 \max\{ |x_0|, \delta \}}$

is such that $\phi_\delta^{x_0}(x_0) = \phi_\delta(x_0)$, and is a proper local quadratic majorant of $\phi_\delta$ around $x_0$. This is easily verified if $|x_0| \leq \delta$, from the definition (36) of $\phi_\delta$. If $|x_0| > \delta$ and $|x| > \delta$, the difference writes:

(41) $\phi_\delta^{x_0}(x) - \phi_\delta(x) = \frac{\left( |x| - |x_0| \right)^2}{2 |x_0|}$

which is positive in any case (the remaining case $|x| \leq \delta$ is handled similarly).
Therefore, the function

(42) $E^{(k)}_{\rho}$, obtained from the energy in (37) by replacing each Huber term $\phi_\delta(\cdot)$ by its quadratic majorant (40) taken at the corresponding residual at iteration $k$,

which amounts to weighting each squared residual by

(43) $w^{(k)} = \frac{1}{2 \max\left\{ \left| r^{(k)} \right|, \delta \right\}}$

where $r^{(k)}$ denotes that residual at iteration $k$, is a local quadratic majorant of the energy in (37) around $\{\rho_i^{(k)}\}_i$ which is suitable for the update (38).

Similarly, the function

(44) $E^{(k)}_{\mathbf{s}}\!\left( \{\mathbf{s}_i\}_i \right) = \sum_{i=1}^{m} \sum_{p \in \Omega_i} w^{(k)}_{i,p} \left( \rho^{(k+1)}_i(p)\, \mathbf{s}_i \cdot \mathbf{m}_i(p) - I_i(p) \right)^2 + \mathrm{cst}$

is a local quadratic majorant of the energy in (37) around $\{\mathbf{s}_i^{(k)}\}_i$ which is suitable for the update (39).
The update (38) then comes down to solving a large sparse linear least-squares problem, which we achieve by applying conjugate gradient iterations to the associated normal equations. Regarding (39), it comes down to solving a series of independent small-scale linear least-squares problems, for instance by resorting to the pseudo-inverse.
We iterate the optimization steps (38) and (39) until convergence or a maximum iteration number is reached, starting from the trivial solution of the non-regularized ($\lambda = \mu = 0$) problem. This non-regularized solution is attained by considering diffuse lighting (see (20)) and using the input images as reflectance maps. In our experiments, we found that a few tens of iterations were always sufficient to reach a stable solution (small relative residual between two consecutive energy values).
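The majorization step amounts to iteratively reweighted least squares; a self-contained sketch on a generic robust linear fit (data and names are illustrative, not the paper's full model):

```python
import numpy as np

def irls(A, y, iters=100, delta=1e-4):
    """Minimize sum_i huber(<a_i, x> - y_i) by majorization-minimization:
    each Huber term is majorized at its current residual r by the quadratic
    r^2 / (2 * max(|r|, delta)) plus a constant, and the resulting weighted
    least-squares problem is solved in closed form at each iteration."""
    x = np.linalg.lstsq(A, y, rcond=None)[0]
    for _ in range(iters):
        w = 1.0 / np.maximum(np.abs(A @ x - y), delta)
        Aw = A * w[:, None]
        x = np.linalg.solve(A.T @ Aw, Aw.T @ y)
    return x

# Robust fit: the gross outlier in the last equation is down-weighted.
A = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, 2.0], [2.0, 1.0]])
y = A @ np.array([2.0, -1.0])
y[-1] += 10.0  # corrupt one observation
x = irls(A, y)  # close to [2, -1]
```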
Proving convergence of our scheme is beyond the scope of this paper, but the proof could certainly be derived from that in JMIV2017_LEDS, where a similar alternating majorization-minimization scheme, called “alternating reweighted least-squares”, is used. Note, however, that the convergence rate seems to be sublinear (see Figure 4), hence possibly faster numerical strategies could be explored in the future.
6 Results
In this section, we evaluate the proposed variational method for multi-view reflectance estimation on a variety of synthetic and real-world datasets. We start with a quantitative comparison of our results with two single-view methods, namely, the cartoon + texture decomposition method from IpolCartoon and the intrinsic image decomposition method from Gehler2011.
6.1 Quantitative Evaluation on a Synthetic Dataset
(Figures 6 and 7 compare, row by row: the input images, the cartoon + texture decomposition IpolCartoon, the intrinsic image decomposition Gehler2011, our results, and the ground truth.)
(Table 1 reports, per test and per color channel, the RMSE of the cartoon + texture decomposition IpolCartoon, of the intrinsic image decomposition Gehler2011, and of our method.)
We first test our reflectance estimation method using synthetic images of an object whose geometry is perfectly known (see Figure 5a). Two scenarios are considered:

In Figure 6, a purely-Lambertian, piecewise-constant reflectance is mapped onto the surface of the object, which is then illuminated by a "skydome", i.e., an almost diffuse lighting. Shading effects are thus rather limited, hence applying to each image an estimation method which does not use an explicit reflectance model, e.g., the cartoon + texture decomposition method from IpolCartoon, should already provide satisfactory results. Since the reflectance is perfectly piecewise constant, applying sparsity-based intrinsic image decomposition methods such as Gehler2011 to each image should also work well.

In Figure 7, a more complicated (non-uniform) reflectance is mapped onto the shirt, the hair is made partly specular, and the diffuse lighting is replaced by a single extended light source, which induces much stronger shading effects. It will thus be much harder to remove shading without an explicit reflectance model (cartoon + texture approach), while the single-view image decomposition approach should not be robust to specularities.
In both cases, the competing methods IpolCartoon and Gehler2011 are applied independently to each of the images. Their estimates are thus not expected to be consistent, which may be problematic if the reflectance maps are to be further mapped onto the surface for, e.g., relighting applications. Our approach, on the contrary, estimates the reflectance maps simultaneously and consistently.
Since the reflectance ground truth is available, we can numerically evaluate these results by computing the root mean square error (RMSE) of each method over the whole set of images. The values are presented in Table 1. To ensure a fair comparison, the reflectance estimated by each method is scaled, in each channel, by a factor common to all the reflectance maps, chosen so as to minimize the RMSE. This should thus highlight inconsistencies between the reflectance maps.
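The common per-channel factor admits a closed form: it is the least-squares fit of the ground truth against the estimate. A minimal sketch of the evaluation metric (the array layout and the function name are our own convention, not from the actual evaluation code):

```python
import numpy as np

def scaled_rmse(estimate, ground_truth):
    """RMSE per channel, after rescaling the estimate by one factor per
    channel, common to all views, that minimizes the RMSE (closed-form
    least-squares fit).  Arrays stacked over views: (n_views, H, W, 3)."""
    errors = []
    for ch in range(estimate.shape[-1]):
        est = estimate[..., ch].ravel()
        gt = ground_truth[..., ch].ravel()
        a = (est @ gt) / (est @ est)      # optimal common scale factor
        errors.append(np.sqrt(np.mean((a * est - gt) ** 2)))
    return errors
```

An estimate that differs from the ground truth by a single global factor per channel thus yields a zero RMSE, whereas inconsistent per-view scales are penalized.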
Based on the qualitative results from Figures 6 and 7, and the quantitative evaluations shown in Table 1, we can make the following three observations:
1) Considering an explicit image formation model improves cartoon + texture decomposition.
Indeed, the cartoon part of the cartoon + texture decomposition is far less uniform than the reflectance estimated by the two other methods. Shading is only blurred, not really removed. This could be improved by increasing the regularization weight, but the price to pay would be a loss of detail in the parts containing thinner structures (such as the shirt in the example of Figure 7).
2) Simultaneously estimating the multiview reflectance maps makes them consistent and improves robustness to specularities.
When each reflectance map is estimated individually, inconsistencies arise; this is obvious for the hair in the third row of Figure 6, and explains the RMSE values in Table 1. In contrast, our results confirm our basic idea, i.e., that reflectance estimation benefits in two ways from the multiview framework: it allows us not only to estimate the 3D-shape, but also to constrain the reflectance of each surface point to be the same in all the pictures where it is visible. In addition, since the location of bright spots due to specularity depends on the viewing angle, they usually occur at a given surface point only under certain viewing angles. Considering multiview data should thus improve robustness to specularities. This is confirmed in Figure 7 by the reflectance estimates in the hair, where the specularities are slightly better removed than with the single-view methods.
3) A sparsitybased prior for the reflectance should be preferred over total variation.
As we use a TV-smoothing term, which favors piecewise-smooth reflectance, the satisfactory results of Figure 6 were predictable. However, some penumbra remains visible around the neck. Since we also know the object geometry, it seems that we could compensate for penumbra. However, this would require the lighting to be known as well, which is not the case in the framework of the targeted use-case, since outdoor lighting is uncontrolled. Moreover, we would have to consider not only the primary lighting, but also the successive bounces of light on the different parts of the scene (these were taken into account by the ray-tracing algorithm when synthesizing the images). In contrast, the sparsity-based approach Gehler2011 is able to eliminate penumbra rather well, without modeling secondary reflections. It also removes shading on the face more appropriately in the example of Figure 7, while degrading the thin structures of the shirt less than total variation does. Hence, the relative simplicity of the numerical solution, which is a consequence of the choice of replacing the Potts prior by a total variation one (see Section 4.3), comes at a price. In future work, it may be important to design a numerical strategy handling the original nonsmooth, nonconvex problem (32).
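To give a concrete feel for the TV prior and for the reweighted least-squares machinery mentioned in Section 5, here is a minimal 1D sketch of TV denoising, minimizing the sum of a quadratic data term and a weighted total variation term by iteratively reweighted least squares. It is an illustration of the general idea under simplifying assumptions (1D signal, dense solve), not our actual solver.

```python
import numpy as np

def tv_denoise_1d(y, lam, n_iter=100, eps=1e-8):
    """Minimize 0.5 * ||x - y||^2 + lam * sum_i |x_{i+1} - x_i| by
    iteratively reweighted least squares: each |d| term is majorized by
    a quadratic with weight lam / |d|, and the resulting tridiagonal
    system (I + D^T W D) x = y is solved at every iteration."""
    x = y.copy()
    n = len(y)
    for _ in range(n_iter):
        d = np.diff(x)
        w = lam / np.maximum(np.abs(d), eps)   # reweighting of each edge
        A = np.eye(n)                          # I + D^T W D, built densely
        for i in range(n - 1):
            A[i, i] += w[i];     A[i + 1, i + 1] += w[i]
            A[i, i + 1] -= w[i]; A[i + 1, i] -= w[i]
        x = np.linalg.solve(A, y)
    return x
```

On a noisy piecewise-constant signal, this flattens each segment while preserving the jumps, which is exactly the behavior observed in Figure 6; a Potts prior would additionally avoid the slight shrinkage of the jumps that TV introduces.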
6.2 Handling Inaccurate Geometry
In the previous experiments, the geometry was perfectly known. In real-world scenarios, errors in the 3D-shape estimated using SfM and MVS are unavoidable. It is therefore necessary to evaluate the ability of our method to handle inaccurate geometry.
For the next experiment, we thus use the surface shown in Figure 5b (zoomed in Figure 5d), which is obtained by smoothing the original 3D-shape of Figure 5a (zoomed in Figure 5c) using a tool from the MeshLab software. The results provided in Figure 8 show that our method seems robust to such small inaccuracies in the object geometry, and is thus relevant for the intended application.
In Figure 9, we qualitatively evaluate our method on the outputs of an SfM/MVS pipeline applied to a real-world dataset, which provides estimates of the camera parameters and a rough geometry of the scene. These experiments confirm that small inaccuracies in the input geometry can be handled. The specularities are also appropriately removed, and the reflectance maps present the expected cartoon-like aspect. However, the reflectance is underestimated on the sides of the nose and around the chin. Indeed, since the lighting is fixed, these areas are self-shadowed in all the images. Two workarounds could be used: strengthening the regularization term (at the risk of losing fine-scale details), or actively controlling the lighting to ensure that no point on the surface is shadowed in all the views. This is further discussed in the next subsection.
6.3 Tuning the Hyperparameters and
In the previous experiments, we arbitrarily chose the values of the parameters and which provided the "best" results. Such a tuning, which may be tedious, must of course be discussed.
In order to highlight the influence of these parameters, let us first ask what would happen with neither regularization nor multiview consistency, i.e., when . In that case, only the photometric term would be optimised, which corresponds to the maximum-likelihood case. If the lighting does not vary, then we are in a degenerate case which may result in estimating diffuse lighting (see Equation (20)) and using the images themselves as reflectance maps. Lighting would thus be "baked into" the reflectance maps, which is precisely what we aim to avoid.
To avoid this effect, the smoothness term must be activated by setting . If we still consider , then the variational problem (37) comes down to independent image restoration problems. These problems are similar to TV denoising problems, except that a physically plausible fidelity term helps remove illumination artifacts, not only through the total variation regularization but also by incorporating prior knowledge of the surface geometry. However, because the photometric term is invariant under the transformation , , each reflectance map is estimated only up to a scale factor, hence the maps will not be consistent, as is the case for the competing single-view methods.
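This scale ambiguity can be checked directly on the Lambertian model (1): multiplying the reflectance by some factor while dividing the lighting vector by the same factor leaves the gray-level unchanged. A toy numerical check (all values arbitrary):

```python
import numpy as np

def lambertian_graylevel(rho, light, normal):
    """Gray-level model (1): albedo times the scalar product of the
    lighting vector and the outer unit-length surface normal."""
    return rho * float(light @ normal)

rho = 0.6                           # arbitrary albedo
light = np.array([0.2, 0.1, 0.9])   # arbitrary lighting vector
normal = np.array([0.0, 0.0, 1.0])  # unit-length normal
kappa = 3.5                         # arbitrary scale factor

# Scaling the albedo by kappa and the lighting by 1/kappa leaves the
# photometric term unchanged: each map is determined only up to scale.
assert np.isclose(lambertian_graylevel(rho, light, normal),
                  lambertian_graylevel(kappa * rho, light / kappa, normal))
```

Without a coupling between views, a different factor can be chosen for each image, which is the source of the inconsistency.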
The latter issue is solved by activating the multiview consistency term, i.e., by setting . In that case, there remains an ambiguity , , but it is now global, i.e., independent of . To resolve this ambiguity, it suffices in practice to set one reflectance value arbitrarily, or to normalize the reflectance values.
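In practice, the remaining global ambiguity can be removed by a single normalization shared by all the maps, for instance (one arbitrary convention among others) rescaling so that the largest reflectance value equals one:

```python
import numpy as np

def fix_global_scale(reflectance_maps, anchor=1.0):
    """Remove the global scale ambiguity: one factor, common to all the
    reflectance maps, rescales them so their maximum equals `anchor`."""
    scale = anchor / max(float(m.max()) for m in reflectance_maps)
    return [scale * m for m in reflectance_maps]
```

Since the same factor is applied to every map, the inter-view consistency enforced by the multiview term is preserved.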
Overall, it is necessary to ensure that both and are strictly positive. The choice of is not really critical. Indeed, the multiview consistency regularizer controlled by arises from relaxing a hard constraint (compare (32) and (37)). Hence, only needs to be chosen "high enough" for the regularizer to approximate this hard constraint fairly well. In all the experiments, we used and did not face any particular problem. Obviously, if the correspondences were not appropriately computed by SfM, then this value should be reduced, but SfM solutions such as Moulon are now mature enough to provide accurate correspondences.
The choice of is much more critical. This is illustrated in Figure 10, which shows the RMSE in each channel, at convergence of our algorithm, as a function of , using images from the same dataset as that of Figure 7. This graph shows that the "optimal" value of is very hard to find: in this example, a high value of would diminish the RMSE in the face and the hair (which are mostly red), because it would make them uniform as expected (see Figure 11, last rows). However, a much lower value of is required to preserve the thin shirt details, which mostly contain green and blue components (see Figure 11, first rows).
There is one situation where this tuning is much easier: when the lighting is not fixed but strongly varying. As discussed in Section 3, the problem of jointly estimating reflectance and lighting is then overdetermined, which theoretically makes the regularization unnecessary. In Figure 12, we show the results obtained in the case where each image is acquired under a different lighting. In that case, the thin structures of the shirt are preserved, while shading on the face is largely reduced, despite the choice of a very low regularization weight . Note that we cannot use the limit case because not all pixels have correspondences in all images: there may thus be a few pixels for which the problem remains underdetermined, and for which diffusion is required. Overall, this experiment shows that, without any prior knowledge of the lighting, the only way to avoid introducing an empirical prior on the reflectance, and thus its tuning, is to actively control the lighting during the acquisition process, i.e., to combine multiview and photometric stereo.
This problem is actively being addressed by the computer vision community Park2017. Interestingly, this line of research focuses on highly accurate geometry estimation rather than on reflectance estimation (no reflectance estimation result is shown). It may therefore be an interesting future research direction to incorporate our reflectance estimation framework into such multiview, multi-lighting approaches: both highly accurate geometry and reflectance could then be expected.
7 Conclusion and Perspectives
We have proposed a variational framework for estimating the reflectance of a scene from a series of multiview images. We advocate a 2D-parameterization of reflectance, turning the problem into that of converting the input images into reflectance maps. Invoking a Bayesian rationale leads to a variational model comprising a norm-based photometric data term, a Potts regularizer and a multiview consistency constraint. For simplicity, the latter two are relaxed into a total variation term and a norm term, respectively. Numerical solving is carried out using an alternating majorization-minimization algorithm. Empirical results on both synthetic and real-world datasets demonstrate the interest of considering multiview images for reflectance estimation, as this allows us to benefit from prior knowledge of the geometry, to improve robustness to specularities and to guarantee consistency of the reflectance estimates.
However, the critical analysis of our results also highlighted some limitations and possible future research directions. For instance, avoiding the relaxation of the nonsmooth, nonconvex regularization seems necessary in order to really ensure that the estimated reflectance maps are piecewise-constant. In addition, the choice of parameterizing reflectance in the image (2D) domain is advocated for reasons of numerical simplicity, yet it seems somewhat more natural to work directly on the surface (this would avoid the multiview consistency constraint). However, this would turn our simple variational framework into a more arduous optimization problem over a manifold.
Finally, we could disambiguate the problem by measuring the incoming light upstream, using, for instance, environment maps. Without such a prior measurement, it seems that the only way to avoid resorting to an arbitrary prior for limiting the arising ambiguities consists in actively controlling the lighting (this would avoid resorting to spatial regularization). Therefore, another extension of our work consists in estimating reflectance from multiview, multi-lighting data, in the spirit of multiview photometric stereo techniques. However, this would require appropriately modifying the SfM/MVS pipeline, which relies on the constant-brightness assumption.