Variational Reflectance Estimation from Multi-view Images

09/25/2017, by Jean Mélou et al.

We tackle the problem of reflectance estimation from a set of multi-view images, assuming known geometry. The approach we put forward turns the input images into reflectance maps, through a robust variational method. The variational model comprises an image-driven fidelity term and a term which enforces consistency of the reflectance estimates with respect to each view. If illumination is fixed across the views, then reflectance estimation remains under-constrained: a regularization term, which ensures piecewise-smoothness of the reflectance, is thus used. Reflectance is parameterized in the image domain, rather than on the surface, which makes the numerical solution much easier, by resorting to an alternating majorization-minimization approach. Experiments on both synthetic and real datasets are carried out to validate the proposed strategy.


1 Introduction

Acquiring the shape and the reflectance of a scene is a key issue, e.g., for the movie industry, as it allows proper relighting. Currently proposed solutions focus on small objects and rely on multiple priors UnityDelight, or require very controlled environments (ReinhardWardPattanaikDebevec, Chapter 9). Well-established shape acquisition techniques such as multi-view stereo exist for accurate 3D-reconstruction. Nevertheless, they do not aim at recovering the surface reflectance. Hence, the original input images are usually mapped onto the 3D-reconstruction as texture. Since the image graylevel mixes shading information (induced by lighting and geometry) and reflectance (which is characteristic of the surface), relighting based on this approach usually lacks realism. To improve the results, reflectance needs to be separated from shading.

In order to illustrate our purpose more precisely, let us take the example of a Lambertian surface. At a 2D-point (pixel) $p$ conjugate to a 3D-point $x$ of a Lambertian surface, the graylevel is written

$I(p) = \rho(p) \, s(p) \cdot n(p)$    (1)

In the right-hand side of (1), $\rho(p)$ is the albedo¹, $s(p)$ the lighting vector, and $n(p)$ the outer unit-length normal to the surface. All these elements a priori depend on $p$, i.e., they are defined locally. Whereas $I(p)$ is always supposed to be given, different situations can occur, according to which of $\rho(p)$, $s(p)$ and $n(p)$ are also known.

¹Since the albedo suffices to characterize the reflectance of a Lambertian surface, we will name it “reflectance” as well.

Figure 1: The “workshop metaphor” (extracted from a paper by Adelson and Pentland Adelson). Image (a) may be interpreted either by: (b) incorporating all the brightness variations inside the reflectance; (c) modulating the lighting of a white planar surface; or (d) designing a uniformly white 3D-shape illuminated by a parallel and uniform light beam. This last interpretation is one of the solutions of the shape-from-shading problem.

One equation (1) per pixel is not enough to simultaneously estimate the reflectance $\rho(p)$, the lighting $s(p)$ and the geometry, represented here by $n(p)$, because there are many more unknowns than equations. Figure 1 illustrates this source of ill-posedness through the so-called “workshop metaphor” introduced by Adelson and Pentland in Adelson: among three plausible interpretations (b), (c) and (d) of image (a), we are particularly interested in (d), which illustrates the principle of photometric 3D-reconstruction. This class of methods usually assumes that the lighting is known. Still, there remain three scalar unknowns per equation (1): $\rho(p)$, and $n(p)$, which has two degrees of freedom. Assuming moreover that the reflectance $\rho(p)$ is known, the shape-from-shading technique Horn uses the shading as the unique clue to recover the shape from Equation (1), but the problem is still ill-posed.

A classical way to make photometric 3D-reconstruction well-posed is to use $n$ images taken from a single camera pose, but under varying, known lighting:

$I^i(p) = \rho(p) \, s^i(p) \cdot n(p), \quad i \in \{1, \dots, n\}$    (2)

In this variant of shape-from-shading called photometric stereo Woodham1980a, the reflectance and the normal can be estimated without any ambiguity, as soon as $n \geq 3$ non-coplanar lighting vectors are used.

Symmetrically to (2), solving the problem:

$I(p_j) = \rho_j \, s \cdot n_j, \quad j \in \{1, \dots, m\}$    (3)

allows one to estimate the lighting $s$, as soon as the reflectance values $\rho_j$ and $m \geq 3$ non-coplanar normals $n_j$, $j \in \{1, \dots, m\}$, are known. This can be carried out, for instance, by placing a small calibration pattern with known color and known shape near each 3D-point $x_j$ Queau2017bis.

Figure 2: Overview of our contribution. From a set of images of a surface acquired from different angles, and a coarse geometry obtained for instance using multi-view stereo, we estimate a shading-free reflectance map per view.

The problem we aim at solving in this paper is slightly different. Suppose we are given a series of $n$ images of a scene taken under a single lighting, but from $n$ camera poses. According to Lambert’s law, this ensures that a 3D-point looks equally bright in all the images where it is visible. Such invariance is the basic clue of multi-view stereo (MVS), which has become a very popular technique for 3D-reconstruction MVS. Therefore, since an estimate of the surface shape is available, $n(x)$ is known. Now, we have to index the pixels by the image number $i \in \{1, \dots, n\}$. Fortunately, additional data provided by MVS are the correspondences between the different views, taking the form of tuples of pixels $p^i_j$ which are conjugate to a common 3D-point $x_j$.

Our problem is written²:

$I^i(p^i_j) = \rho_j \, s \cdot n_j, \quad i \in \{1, \dots, n\}, \; j \in \{1, \dots, m\}$    (4)

where $p^i_j$ is the projection of $x_j$ in the $i$-th image, and $\rho_j$ and $s$ are unknown. Obviously, this system reduces to Equation (1), since its $n$ equations per point are the same one: the right-hand side of (4) does not depend on $i$, any more than the left-hand side since, as already noticed, the lighting does not vary from one image to another, and the surface is Lambertian.

²Even if they look very similar, Problems (2), (3) and (4) have completely different peculiarities.

Multi-view data helps estimating the reflectance, because it provides the 3D-shape via MVS. However, even if $n(p)$ is known, Equation (1) remains ill-posed. This is illustrated, in Figure 1, by the solutions (b) and (c), which correspond to the same image (a) and to a common planar surface. In the absence of any prior, Equation (1) has an infinity of solutions $(\rho(p), s(p))$. In addition, determining $s(p)$ from each of these solutions would give rise to another ambiguity, since $s(p)$ is not forced to be unit-length, contrarily to $n(p)$.

Such a double source of ill-posedness probably explains why various methods for reflectance estimation have been designed, introducing a variety of priors in order to disambiguate the problem. Most of them assume that brightness variations induced by reflectance changes are likely to be strong but sparsely distributed, while the lighting is likely to induce smoother changes Land1971. This suggests separating a single image into a piecewise-smooth layer and a more oscillating one. In the computer vision literature, this is often referred to as “intrinsic image decomposition”, while the terminology “cartoon + texture decomposition” is more frequently used by the mathematical imaging community (both these problems will be discussed in Section 2).

Contributions.

In this work, we show the relevance of using multi-view images for reflectance estimation. Indeed, this enables a prior shape estimation using MVS, which essentially reduces the decomposition problem to the joint estimation of a set of reflectance maps, as illustrated in Figure 2. We elaborate on the variational approach to multi-view decomposition into reflectance and shading, which we initially presented in SSVM_2017_Jean. The latter introduced a robust $\ell^1$–TV framework for the joint estimation of piecewise-smooth reflectance maps and of spherical harmonics lighting, with an additional term ensuring the consistency of the reflectance maps. The present paper extends this approach by developing the theoretical foundations of this variational model. To this end, our parameterization choices are further discussed and the underlying ambiguities are exhibited. The variational model is motivated by a Bayesian rationale, and the proposed numerical scheme is interpreted in terms of a majorization-minimization algorithm. Finally, we conclude that, besides a preliminary measurement of the incoming lighting, varying the lighting along with the viewing angle, in the spirit of photometric stereo, is the only way to estimate the reflectance without resorting to any prior.

Organization of the Paper.

After reviewing related approaches in Section 2, we formalize in Section 3 the problem of multi-view reflectance estimation. Section 4 then introduces a Bayesian-to-variational approach to this problem. A simple numerical strategy for solving the resulting variational problem, which is based on alternating majorization-minimization, is presented in Section 5. Experiments on both synthetic and real-world datasets are then conducted in Section 6, before summarizing our achievements and suggesting future research directions in Section 7.

2 Related Works

Studied since the 1970s Land1971, the problem of decomposing an image (or a set of images) into a piecewise-smooth component and an oscillatory one is a fundamental computer vision problem, which has been addressed in numerous ways.

Cartoon + Texture Decomposition.

Researchers in the field of mathematical imaging have suggested various variational models for this task, using for instance non-smooth regularization and Fourier-based frequency analysis Aujol2006, or TV-based variational models IpolCartoon. However, such techniques do not use an explicit photometric model for justifying the decomposition, whereas photometric analysis, which is another important branch of computer vision, may be a source of inspiration for motivating new variational models.

Photometric Stereo.

As discussed in the introduction, photometric stereo techniques Woodham1980a are able to unambiguously estimate the reflectance and the geometry, by considering several images obtained from the same viewing angle but under calibrated, varying lighting. Photometric stereo has even been extended to the case of uncalibrated, varying lighting Basri2007. In the same spirit as uncalibrated photometric stereo, our goal is to estimate reflectance under unknown lighting. However, the problem is less constrained in our case, since we cannot ensure that the lighting is varying. Our hope is that this can be somewhat compensated for by the prior knowledge of geometry, and by resorting to appropriate priors. Various priors for reflectance have been discussed in the context of intrinsic image decomposition.

Intrinsic Image Decomposition.

Separating reflectance from shading in a single image is a challenging problem, often referred to as intrinsic image decomposition. Given the ill-posed nature of this problem, prior information on shape, reflectance and/or lighting must be introduced. Most of the existing works are based on the “retinex theory” Land1971, which states that most of the slight brightness variations in an image are due to lighting, while reflectance is piecewise-constant (as in, for instance, a Mondrian image). A variety of clustering-based GMLG12; Shen2011 or sparsity-enhancing methods Gehler2011; Nadian-Ghomsheh2016; Shen2011; Song2017 have been developed based on this theory. Among others, the work of Barron and Malik Barron, which presents interesting results, relies on multiple priors to solve the fundamental ambiguity of shape-from-shading, which we aim at removing in the multi-view context. Some other methods disambiguate the problem by requiring the user to “brush” uniform reflectance parts BPD09; Nadian-Ghomsheh2016, or by resorting to a crowdsourced database bell14intrinsic. Still, such approaches require user interaction, which may not be desirable in certain cases.

Multi-view 3D-reconstruction.

Instead of introducing possibly unverifiable priors, or relying on user interactions, ambiguities can be reduced by assuming that the geometry of the scene is known. Intrinsic image decomposition has for instance been addressed using an RGB-D camera Chen2013 or, closer to our proposal, multiple views of the same scene under different angles Laffont2013; Laffont2012. In the latter works, the geometry is first extracted from the multi-view images, before the problem of reflectance estimation is addressed. Geometry computation can be achieved using multi-view stereo (MVS). MVS techniques Seitz have seen significant growth over the last decade, an expansion which goes hand in hand with the development of structure-from-motion (SfM) solutions Moulon. Indeed, MVS requires the parameters of the cameras, which are outputs of the SfM algorithm. Nowadays, these mature methods are commonly used in uncontrolled environments, or even with large-scale Internet data RomeInADay. For the sake of completeness, let us also mention that some efforts have recently been devoted to multi-view and photometrically consistent 3D-reconstruction Jin-et-al-08; Kim; Langguth; Robert; Maurer. Similarly to these methods, we will resort to a compact representation of lighting, namely the spherical harmonics model.

Spherical Harmonics Lighting Model.

Let us consider a point $x$ lying on the surface $\mathcal{S}$ of the observed scene, and let $n(x)$ be the outer unit-length normal vector to $\mathcal{S}$ at $x$. Let $\mathcal{H}(x)$ be the hemisphere centered at $x$, having as basis plane the tangent plane to $\mathcal{S}$ at $x$. Each light source visible from $x$ can be associated to a point on $\mathcal{H}(x)$. If we describe by the vector $\omega$ the corresponding elementary light beam (oriented towards the source), then by definition of the reflectance (or BRDF) of the surface, denoted $r$, the luminance of $x$ in the direction $v$ is given by

$L(x, v) = \displaystyle\int_{\omega \in \mathcal{H}(x)} r(x, \omega, v) \, \mathrm{d}E(x, \omega)$    (5)

where $E(x, \omega)$ is the surface illuminance. In general, $r$ depends both on the direction $\omega$ of the light, and on the viewing direction $v$, relatively to $n(x)$.

This expression of the luminance is intractable in the general case. However, if we restrict our attention to Lambertian surfaces, the reflectance reduces to $\rho(x)/\pi$, where the albedo $\rho(x)$ is independent of any direction, and the luminance does not depend on the viewing direction anymore. If the light sources are further assumed to be distant enough from the object, then the lighting is independent of $x$, i.e., the light beams are the same for the whole (supposedly convex) object, and thus the lighting is completely defined on the unit sphere. Therefore, the integral (5) acts as a convolution on the unit sphere, having as kernel the half-cosine $\max(\omega \cdot n, 0)$. Spherical harmonics, which can be considered as the analogue of the Fourier series on the unit sphere, have been shown to be an efficient low-dimensional representation of this convolution Basri; Ramamoorthi. Many vision applications Kim; Wu use second-order spherical harmonics, which can capture over 99% of the natural lighting Frolova using only nine coefficients. This yields an approximation of the luminance of the form

$L = \dfrac{\rho}{\pi} \, l \cdot m$    (6)

where $\rho$ is the albedo (reflectance), $l \in \mathbb{R}^9$ is a compact lighting representation, and $m \in \mathbb{R}^9$ stores the local geometric information. The latter is deduced from the normal $n = [n_1, n_2, n_3]^\top$ according to:

$m = \left[ n_1, \; n_2, \; n_3, \; 1, \; n_1 n_2, \; n_1 n_3, \; n_2 n_3, \; n_1^2 - n_2^2, \; 3 n_3^2 - 1 \right]^\top$    (7)

(the constant factors of the spherical harmonics basis functions being absorbed in $l$).
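For concreteness, here is a minimal numpy sketch of how the geometric vectors can be assembled from a field of unit normals, under the ordering convention we adopted in (7) (the function name and the implementation are ours, not the paper's):

```python
import numpy as np

def sh_geometric_vector(n):
    """Second-order spherical harmonics geometric vector m(x), from unit normals.

    n: array of shape (..., 3) of unit-length normals.
    Returns an array of shape (..., 9), following the ordering of Equation (7);
    the constant factors of the SH basis are assumed absorbed in the lighting
    vector l."""
    n1, n2, n3 = n[..., 0], n[..., 1], n[..., 2]
    return np.stack([
        n1, n2, n3,                 # first-order terms (the normal itself)
        np.ones_like(n1),           # constant (ambient) term
        n1 * n2, n1 * n3, n2 * n3,  # second-order cross terms
        n1**2 - n2**2,              # second-order difference term
        3 * n3**2 - 1,              # second-order zonal term
    ], axis=-1)
```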

In (6), the lighting vector $l$ is the same at all the points of the surface, but the reflectance $\rho$ and the geometric vector $m$ vary along the surface of the observed scene. Hence we will write (6) as:

$L(x) = \dfrac{\rho(x)}{\pi} \, l \cdot m(x)$    (8)

Our aim in this paper is to estimate the reflectance $\rho(x)$ at each point $x$, as well as the lighting vector $l$, given a set of multi-view images and the geometric vectors $m(x)$. We formalize this problem in the next section.

3 Multi-view Reflectance Estimation

In this section, we describe more precisely the problem of reflectance estimation from a set of multi-view images. First, we need to make explicit the relationship between graylevel, reflectance, lighting and geometry.

3.1 Image Formation Model

Let $x$ be a point on the surface of the scene. Assume that it is observed by a graylevel camera with linear response function, and let $I$ be the image, $p$ being the projection of $x$ onto the image plane. Then, the graylevel at the pixel $p$ conjugate to $x$ is proportional to the luminance of $x$ in the direction of observation $v$:

$I(p) = \gamma \, L(x, v)$    (9)

where the coefficient $\gamma$, referred to in the following as the “camera coefficient”, is unknown³. By assuming Lambertian reflectance and the light sources distant enough from the object, Equations (8) and (9) yield:

$I(p) = \dfrac{\gamma}{\pi} \, \rho(x) \, l \cdot m(x)$    (10)

³This coefficient depends on several factors such as the lens aperture, the magnification, the exposure time, etc.

Now, let us assume that $n$ images of the surface, $I^1, \dots, I^n$, obtained while moving a single camera, are available, and discuss how to adapt (10).

Case 1: unknown, yet fixed lighting and camera coefficient.

If all the automatic settings of the camera are disabled, then the camera coefficient $\gamma$ is independent from the view. We can thus incorporate this coefficient and the denominator $\pi$ into the lighting vector: $s := \frac{\gamma}{\pi} \, l$. Moreover, if the illumination is fixed, the lighting vector $s$ is independent from the view. At any point $x$ which is visible in the $i$-th view, Equation (10) becomes:

$I^i\!\left(p^i(x)\right) = \rho(x) \, s \cdot m(x)$    (11)

where we denote by $p^i$ the 3D-to-2D projection associated to the $i$-th view. In (11), the unknowns are the reflectance $\rho(x)$ and the lighting vector $s \in \mathbb{R}^9$. Equations (11), $i \in \{1, \dots, n\}$, constitute a generalization of (4) to more complex illumination scenarios. For the whole scene, this is a problem with $m + 9$ unknowns and up to $n\,m$ equations, where $m$ is the number of 3D-points which have been estimated by multi-view stereo. However, as for System (4), only $m$ equations are linearly independent, hence the problem of reflectance and lighting estimation is under-constrained.

Case 2: unknown and varying lighting and camera coefficient.

If lighting is varying, then we have to make the lighting vector view-dependent. If it is also assumed to vary, the camera coefficient can be integrated into the lighting vector along with the denominator $\pi$, i.e., $s^i := \frac{\gamma^i}{\pi} \, l^i$, since the estimation of each $s^i$ will include that of $\gamma^i$. Equation (10) then becomes:

$I^i\!\left(p^i(x)\right) = \rho(x) \, s^i \cdot m(x)$    (12)

There are even more unknowns ($m + 9n$), but this time the equations are linearly independent, at least as long as the $s^i$ are not proportional, i.e., if not only the camera coefficient or the lighting intensity vary across the views, but also the lighting direction⁴. Typically, $m$ is much larger than $9n$, hence the problem is over-constrained as soon as at least two out of the $n$ lighting vectors are non-collinear. This is a situation similar to uncalibrated photometric stereo Basri2007, but much more favorable: the geometry is known, hence the ambiguities arising in uncalibrated photometric stereo are likely to be reduced. However, contrarily to uncalibrated photometric stereo, lighting is not actively controlled in our case. Lighting variations are likely to happen, e.g., in outdoor scenarios, yet they will be limited. The lighting vectors $s^i$, $i \in \{1, \dots, n\}$, will thus be close to each other: lighting variations will not be sufficient in practice for disambiguation (ill-conditioning).

⁴Another case, which we do not study here, is when the lighting and camera coefficient are both varying, yet only lighting is calibrated. This is known as “semi-calibrated” photometric stereo Cho2016.

Since (11) is under-constrained and (12) is ill-conditioned, additional information will have to be introduced either way, and we can restrict our attention to the varying-lighting case (12).

So far, we have assumed that graylevel images were available. To extend our study to RGB images, we abusively assume channel separation, and apply the framework independently in each channel $\star \in \{R, G, B\}$. We then consider the expression:

$I^i_\star\!\left(p^i(x)\right) = \rho_\star(x) \, s^i_\star \cdot m(x)$    (13)

where $\rho_\star$ and $s^i_\star$ denote, respectively, the colored reflectance and the $i$-th colored lighting vector, relatively to the response of the camera in channel $\star$. A more complete study of Model (13) is presented in JMIV2017_LEDS.

Since we will apply the same framework independently in each color channel, we consider hereafter the graylevel case only, i.e., we consider the image formation model (12) instead of (13). The question which arises now is how to estimate the reflectance from a set of equations such as (12), when the geometry is known but the lighting is unknown.
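Before turning to the estimation problem, note that the forward model (12) is straightforward to simulate. The following sketch (names ours) renders one view from a reflectance map, the geometric vectors of (7), and a lighting 9-vector; rendering the $n$ views of Case 2 simply amounts to calling it with view-dependent vectors $s^i$:

```python
import numpy as np

def render_view(rho, m, s):
    """Simulate the image formation model (12) for one view.

    rho: reflectance map, shape (H, W);
    m:   geometric vectors of Equation (7) per pixel, shape (H, W, 9);
    s:   lighting vector of the view (camera coefficient absorbed), shape (9,).
    Returns the graylevel image I of shape (H, W)."""
    return rho * (m @ s)  # per-pixel shading s . m(x), scaled by the albedo
```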

3.2 Reflectance Estimation on the Surface

We place ourselves at the end of the multi-view 3D-reconstruction pipeline. Thus, the projections $p^i$ are known (in practice, they are estimated using SfM techniques), as well as the geometry, represented by a set of 3D-points $x_j$, $j \in \{1, \dots, m\}$, and the corresponding normals $n_j$ (obtained for instance using MVS techniques), from which the geometric vectors $m_j := m(x_j)$ are easily deduced according to (7).

The unknowns are then the reflectance values $\rho_j := \rho(x_j)$ and the lighting vectors $s^i$, which are independent from the 3D-point index $j$ due to the distant-light assumption. At first glance, one may think that their estimation can be carried out by simultaneously solving (12) at all the 3D-points $x_j$, in a purely data-driven manner, using some fitting function $\phi$:

$\min_{\{\rho_j\}, \{s^i\}} \; \sum_{i=1}^{n} \sum_{j=1}^{m} v^i_j \, \phi\!\left( \rho_j \, s^i \cdot m_j - I^i_j \right)$    (14)

where we denote $I^i_j := I^i(p^i(x_j))$, and $v^i_j$ is a visibility boolean such that $v^i_j = 1$ if $x_j$ is visible in the $i$-th image, and $v^i_j = 0$ otherwise.

Let us consider, for the sake of pedagogy, the simplest case of least-squares fitting ($\phi(z) = z^2$) and perfect visibility ($v^i_j \equiv 1$). Then, Problem (14) is rewritten in matrix form:

$\min_{\rho, S} \; \left\| M \left( \rho \otimes S \right) - I \right\|_F^2$    (15)

where the Kronecker product $\rho \otimes S$ is a matrix of $\mathbb{R}^{9m \times n}$, $\rho \in \mathbb{R}^m$ being a vector which stores the unknown reflectance values $\rho_j$, and $S \in \mathbb{R}^{9 \times n}$ a matrix which stores the unknown lighting vectors $s^i$, column-wise, $M \in \mathbb{R}^{m \times 9m}$ is a block-diagonal matrix whose $j$-th diagonal block is the row vector $m_j^\top$, matrix $I \in \mathbb{R}^{m \times n}$ stores the graylevels $I^i_j$, and $\|\cdot\|_F$ is the Frobenius norm.

Using the pseudo-inverse $M^\dagger$ of $M$, (15) is rewritten:

$\min_{\rho, S} \; \left\| \rho \otimes S - M^\dagger I \right\|_F^2$    (16)

Problem (16) is a nearest Kronecker product problem, which can be solved by singular value decomposition (SVD) (GolubV4, Theorem 12.3.1).
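To make the SVD step concrete, here is a minimal numpy sketch of (16), under the perfect-visibility assumption (the function name and implementation are ours). After rearranging the data matrix so that each 3D-point contributes one row, the problem becomes a best rank-1 approximation:

```python
import numpy as np

def nearest_kronecker(Z, m, n):
    """Solve min_{rho, S} || rho (x) S - Z ||_F, rho in R^m, S in R^{9 x n}.

    Z: array of shape (9m, n), e.g. Z = np.linalg.pinv(M) @ I as in (16).
    Row j of the rearranged matrix R contains the flattened j-th 9 x n block
    of Z, so the problem reduces to a best rank-1 approximation of R,
    solved by a truncated SVD (GolubV4, Theorem 12.3.1)."""
    R = Z.reshape(m, 9 * n)                   # row j = flattened block Z_j
    U, sv, Vt = np.linalg.svd(R, full_matrices=False)
    rho = np.sqrt(sv[0]) * U[:, 0]            # rank-1 factors; how the singular
    S = np.sqrt(sv[0]) * Vt[0].reshape(9, n)  # value is split between them is arbitrary
    return rho, S                             # np.kron(rho[:, None], S) approximates Z
```

Note that the SVD leaves the relative scale (and sign) of $\rho$ and $S$ undetermined, which echoes the ambiguities discussed below.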

However, this matrix factorization approach suffers from three shortcomings:

  • It is valid only if all 3D-points are visible in all the views, which is rather unrealistic. In practice, (15) should be replaced by

    $\min_{\rho, S} \; \left\| V \odot \left( M \left( \rho \otimes S \right) - I \right) \right\|_F^2$    (17)

    where $V \in \{0, 1\}^{m \times n}$ is a visibility matrix containing the values $v^i_j$, and $\odot$ is the Hadamard product. This yields a Kronecker product problem with missing data, which is much more arduous to solve.

  • It is adapted only to least-squares estimation. Considering a more robust fitting function $\phi$ would prevent a direct SVD solution.

  • If lighting is not varying ($s^i \equiv s$, $\forall i$), then it can be verified that (15) is ill-posed. Among its many solutions, the following trivial one can be exhibited (a numerical illustration is given after this list):

    $\rho_j = \langle I_j \rangle, \quad \forall j \in \{1, \dots, m\}$    (18)
    $s^i = [0, 0, 0, 1, 0, 0, 0, 0, 0]^\top, \quad \forall i \in \{1, \dots, n\}$    (19)

    where:

    $\langle I_j \rangle = \dfrac{1}{n} \sum_{i=1}^{n} I^i_j$    (20)

    i.e., $\langle I_j \rangle$ is the mean of the graylevels of $x_j$ over the view indices $i$ (recall that the fourth component of $m_j$ equals 1). This trivial solution means that the lighting is assumed to be completely diffuse⁵, and that the reflectance is equal to the image graylevel, up to noise only. Obviously, this is not an acceptable interpretation. As discussed in the previous subsection, in real-world scenarios we will be very close to this degenerate case, hence additional regularization will have to be introduced, which makes things even harder.

    ⁵In the computer graphics community, this is referred to as “ambient lighting”.

Overall, the optimization problem which needs to be addressed is not as easy as (16). It is a non-quadratic regularized problem of the form:

$\min_{\{\rho_j\}, \{s^i\}} \; \sum_{i=1}^{n} \sum_{j=1}^{m} v^i_j \, \phi\!\left( \rho_j \, s^i \cdot m_j - I^i_j \right) + \mu \sum_{j=1}^{m} \sum_{k \in \mathcal{N}(j)} \psi\!\left( \rho_j - \rho_k \right)$    (21)

where $\mathcal{N}(j)$ is the set of neighbors of $x_j$ on surface $\mathcal{S}$, and the regularization function $\psi$ needs to be chosen appropriately to ensure piecewise-smoothness.

However, the sampling of the points $x_j$ on surface $\mathcal{S}$ is usually non-uniform, because the shape of $\mathcal{S}$ is potentially complex. It may thus be difficult to design appropriate fidelity and regularization functions $\phi$ and $\psi$, and an appropriate numerical scheme. In addition, some thin brightness variations may be missed if the sampling is not dense enough. Overall, direct estimation of the reflectance on the surface looks promising at first sight, but is rather tricky in practice. Therefore, we leave this as an interesting future research direction, and follow in this paper a simpler approach, which consists in estimating the reflectance in the image domain.

3.3 Reflectance Estimation in the Image Domain

Instead of trying to colorize the 3D-points estimated by MVS, i.e., of parameterizing the reflectance over the (3D) surface $\mathcal{S}$, we can also formulate the reflectance estimation problem in the (2D) image domain.

Equation (12) is equivalently written, at each pixel $p \in \Omega^i$:

$I^i(p) = \rho^i(p) \, s^i \cdot m^i(p)$    (22)

where we denote $\rho^i(p) := \rho(x)$ and $m^i(p) := m(x)$, $x$ being the 3D-point conjugate to pixel $p$ in the $i$-th view. Instead of estimating one reflectance value per estimated 3D-point, the reflectance estimation problem is thus turned into the estimation of $n$ “reflectance maps”

$\rho^i : \Omega^i \subset \mathbb{R}^2 \to \mathbb{R}, \quad i \in \{1, \dots, n\}$    (23)

On the one hand, the 2D-parameterization (23) does not enforce the consistency of the reflectance maps: this will have to be explicitly enforced later on. Besides, the surface will not be directly colorized: the estimated reflectance maps will have to be back-projected and fused over the surface in a final step.

On the other hand, the question of occlusions (visibility) does not arise, and the domains $\Omega^i$ are subsets of a uniform square 2D-grid. Therefore, it will be much easier to design appropriate fidelity and regularization terms. Besides, there will be as many reflectance estimates as pixels in those sets: with modern HD cameras, this number is much larger than the number of 3D-points estimated by multi-view stereo. Estimation will thus be much denser.

With such a parameterization choice, the regularized problem (21) is turned into:

$\min_{\{\rho^i\}, \{s^i\}} \; \sum_{i=1}^{n} \sum_{p \in \Omega^i} \phi\!\left( \rho^i(p) \, s^i \cdot m^i(p) - I^i(p) \right) + \mu \sum_{i=1}^{n} \sum_{p \in \Omega^i} \sum_{q \in \mathcal{N}^i(p)} \psi\!\left( \rho^i(p) - \rho^i(q) \right) + \nu \, \mathcal{C}\!\left( \rho^1, \dots, \rho^n \right)$    (24)

with $\mathcal{C}$ some function ensuring multi-view consistency, and where $\mathcal{N}^i(p)$ is the set of neighbors of pixel $p$ which lie inside $\Omega^i$. Note that, since $\Omega^i$ is a subset of a square, regular 2D-grid, this neighborhood is much easier to handle than that appearing in (21).

In the next section, we discuss appropriate choices for $\phi$, $\psi$ and $\mathcal{C}$ in (24), by resorting to a Bayesian rationale.

4 A Bayesian-to-variational Framework for Multi-view Reflectance Estimation

Following Mumford’s Bayesian rationale for the variational formulation Mumford1994, let us now introduce a Bayesian-to-variational framework for estimating reflectance and lighting from multi-view images.

4.1 Bayesian Inference

Our problem consists in estimating the reflectance maps $\rho^i$ and the lighting vectors $s^i$, given the images $I^i$, $i \in \{1, \dots, n\}$. As we already stated, a maximum likelihood approach is hopeless, because a trivial solution arises. We rather resort to Bayesian inference, estimating $(\{\rho^i\}, \{s^i\})$ as the maximum a posteriori (MAP) of the distribution

$P\!\left( \{\rho^i\}, \{s^i\} \,\middle|\, \{I^i\} \right) = \dfrac{P\!\left( \{I^i\} \,\middle|\, \{\rho^i\}, \{s^i\} \right) \, P\!\left( \{\rho^i\}, \{s^i\} \right)}{P\!\left( \{I^i\} \right)}$    (25)

where the denominator is the evidence, which can be discarded since it depends neither on the reflectance nor on the lighting, and the factors in the numerator are the likelihood and the prior, respectively.

Likelihood.

The image formation model (22) is never strictly satisfied in practice, due to noise, cast-shadows and possibly slightly specular surfaces. We assume that such deviations from the model can be represented as independent (with respect to pixels and views) Laplace laws⁶ with zero mean and scale parameter $\alpha$:

$P\!\left( \{I^i\} \,\middle|\, \{\rho^i\}, \{s^i\} \right) = \prod_{i=1}^{n} \left( \dfrac{1}{2\alpha} \right)^{|\Omega^i|} \exp\!\left\{ -\dfrac{\left\| \rho^i \, (s^i \cdot m^i) - I^i \right\|_{\ell^1(\Omega^i)}}{\alpha} \right\}$    (26)

where $\|\cdot\|_{\ell^1(\Omega^i)}$ is the $\ell^1$-norm over $\Omega^i$ and $|\Omega^i|$ is the cardinality of $\Omega^i$.

⁶We consider the Laplace law here because: i) since it has higher tails than the Gaussian, it allows for sparse outliers to the Lambertian model such as cast-shadows or specularities; ii) it yields convex optimization problems, unlike other heavy-tailed distributions such as Cauchy or Student's t-distributions.

Prior.

Since the reflectance maps $\rho^i$ are independent from the lighting vectors $s^i$, the prior can be factorized to $P(\{\rho^i\}) \, P(\{s^i\})$. Since the lighting vectors are independent from each other, the prior distribution of the lighting vectors factorizes to $\prod_{i=1}^{n} P(s^i)$. As each lighting vector is unconstrained, we can consider the same uniform distribution for each of them, independently from the view index $i$. This distribution being independent from the unknowns, we can discard the lighting prior from the inference process. Regarding the reflectance maps, we follow the retinex theory Land1971, and consider each of them as piecewise-constant. The natural prior for each such map is thus the Potts model:

$P(\rho^i) = K \exp\!\left\{ -\dfrac{\left\| \nabla \rho^i \right\|_{\ell^0(\Omega^i)}}{\beta} \right\}$    (27)

where $\nabla \rho^i(p) \in \mathbb{R}^2$ represents the gradient of $\rho^i$ at pixel $p$ (approximated, in practice, using first-order forward stencils with a Neumann boundary condition), and with $K$ a normalization coefficient and $\beta$ a scale parameter. Note that we use the abusive $\ell^0$-norm notation to denote:

$\left\| \nabla \rho^i \right\|_{\ell^0(\Omega^i)} = \sum_{p \in \Omega^i} \left| \nabla \rho^i(p) \right|_0$    (28)

with $|z|_0 = 0$ if $z = [0, 0]^\top$, and $|z|_0 = 1$ otherwise.

The reflectance maps are obviously not independent: the reflectance, which characterizes the surface, should be independent from the view. It follows that the parameters $(K, \beta)$ are the same for each Potts model (27), and that the reflectance prior can be taken as the product of $n$ independent distributions with the same parameters:

$P(\{\rho^i\}) = \prod_{i=1}^{n} K \exp\!\left\{ -\dfrac{\left\| \nabla \rho^i \right\|_{\ell^0(\Omega^i)}}{\beta} \right\}$    (29)

but only if the coupling between the reflectance maps is enforced by the following linear constraint:

$\rho^i(p) = \rho^j\!\left( c^{i,j}(p) \right), \quad \forall (i, j), \; \forall p \in \Omega^i \text{ such that } c^{i,j}(p) \in \Omega^j$    (30)

where $c^{i,j}$ is a “correspondence function”, which is easily created from the (known) projection functions $p^i$ and the geometry, and which is defined as follows:

$c^{i,j} : p \in \Omega^i \mapsto p^j(x), \quad x \text{ being the 3D-point conjugate to } p \text{ in the } i\text{-th view}$    (31)
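In practice, the correspondence function can be implemented by back-projecting each pixel of view $i$ onto the surface and reprojecting into view $j$. A minimal sketch, assuming pinhole cameras described by 3x4 projection matrices and a per-pixel map of 3D-points delivered by MVS (all names are ours):

```python
import numpy as np

def correspondence(p, X_map, P_j):
    """Sketch of the correspondence function c^{i,j} of Equation (31).

    p:     pixel coordinates (u, v) in view i;
    X_map: map of the 3D-points conjugate to the pixels of view i (from MVS),
           shape (H, W, 3);
    P_j:   3x4 projection matrix of view j (from SfM).
    Returns the pixel of view j conjugate to the same 3D-point."""
    X = X_map[p[1], p[0]]          # 3D-point x conjugate to p in view i
    q = P_j @ np.append(X, 1.0)    # project x into view j (homogeneous coords.)
    return q[:2] / q[2]            # pixel p^j(x) = c^{i,j}(p)
```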

Since maximizing the MAP probability (25) is equivalent to minimizing its negative logarithm, we eventually obtain the following constrained variational problem, which makes explicit the functions $\phi$, $\psi$ and $\mathcal{C}$ in (24):

$\min_{\{\rho^i\}, \{s^i\}} \; \sum_{i=1}^{n} \left\| \rho^i \, (s^i \cdot m^i) - I^i \right\|_{\ell^1(\Omega^i)} + \mu \sum_{i=1}^{n} \left\| \nabla \rho^i \right\|_{\ell^0(\Omega^i)} \quad \text{s.t. (30) holds}$    (32)

where $\mu := \alpha / \beta$, and where we neglect all the normalization coefficients.

4.2 Relationship with Cartoon + Texture Decomposition

Applying a logarithm transformation to both sides of (22), we obtain:

$\tilde{I}^i(p) = \tilde{\rho}^i(p) + \tilde{s}^i(p)$    (33)

where the tilde notation is used as a shortcut for the logarithm, $\tilde{s}^i(p) := \log\left( s^i \cdot m^i(p) \right)$ being the log-shading.

By applying the exact same Bayesian-to-variational rationale, we would end up with the following variational problem:

$\min_{\{\tilde{\rho}^i\}, \{s^i\}} \; \sum_{i=1}^{n} \left\| \tilde{\rho}^i + \tilde{s}^i - \tilde{I}^i \right\|_{\ell^1(\Omega^i)} + \mu \sum_{i=1}^{n} \left\| \nabla \tilde{\rho}^i \right\|_{\ell^0(\Omega^i)} \quad \text{s.t. (30) holds for the } \tilde{\rho}^i$    (34)

The variational problem (34) can be interpreted as a multi-view cartoon + texture decomposition problem, where each log-image is decomposed into a component which is piecewise-smooth (“cartoon”, here the log-reflectance), and a component which contains higher-frequency details (“texture”, here the log-shading). In contrast with conventional methods for such a task, the present one uses an explicit shading model for the texture term.

Note however that such a decomposition is justified only if the log-images are considered. If using the original images $I^i$, our framework should rather be considered as a multi-view cartoon “×” texture decomposition framework.

4.3 Bi-convex Relaxation of the Variational Model (32)

Problem (32) is non-convex (due to the $\ell^0$-regularizers) and non-smooth (due to the $\ell^0$-regularizers and to the $\ell^1$-fidelity term). Although some efforts have recently been devoted to the resolution of optimization problems involving $\ell^0$-regularizers Storath2014, we prefer to keep the optimization simple, and approximate these terms by (convex, but non-smooth) anisotropic total variation terms:

$\left\| \nabla \rho^i \right\|_{\ell^1(\Omega^i)} = \sum_{p \in \Omega^i} \left( \left| \partial_u \rho^i(p) \right| + \left| \partial_v \rho^i(p) \right| \right)$    (35)
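A minimal sketch of the anisotropic total variation (35), with the first-order forward differences and Neumann boundary condition mentioned after (27) (implementation ours):

```python
import numpy as np

def anisotropic_tv(rho):
    """Anisotropic total variation (35) of a reflectance map rho, shape (H, W).

    First-order forward differences with a Neumann boundary condition
    (the gradient is set to zero on the last row/column)."""
    du = np.zeros_like(rho)
    dv = np.zeros_like(rho)
    du[:-1, :] = rho[1:, :] - rho[:-1, :]   # partial derivative along u
    dv[:, :-1] = rho[:, 1:] - rho[:, :-1]   # partial derivative along v
    return np.abs(du).sum() + np.abs(dv).sum()
```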

Besides, the correspondence function $c^{i,j}$ may be slightly inaccurate in practice, due to errors in the prior geometry estimation obtained via multi-view stereo. Therefore, we turn the linear constraint (30) in (32) into an additional term. Eventually, we replace the non-differentiable absolute values arising from the $\ell^1$-norms by the (differentiable) Moreau envelope, i.e., the Huber loss⁷:

$\phi_\delta(x) = \begin{cases} \dfrac{x^2}{2\delta} & \text{if } |x| \leq \delta, \\ |x| - \dfrac{\delta}{2} & \text{otherwise.} \end{cases}$    (36)

⁷We use a small, fixed value of $\delta$ in the experiments.

Altogether, this yields the following smooth, bi-convex variational problem:

$\min_{\{\rho^i\}, \{s^i\}} \; E\!\left( \{\rho^i\}, \{s^i\} \right) := \sum_{i=1}^{n} \sum_{p \in \Omega^i} \phi_\delta\!\left( \rho^i(p) \, s^i \cdot m^i(p) - I^i(p) \right) \;+\; \mu \sum_{i=1}^{n} \sum_{p \in \Omega^i} \left[ \phi_\delta\!\left( \partial_u \rho^i(p) \right) + \phi_\delta\!\left( \partial_v \rho^i(p) \right) \right] \;+\; \nu \sum_{(i,j)} \sum_{p \in \Omega^i} \phi_\delta\!\left( \rho^i(p) - \rho^j\!\left( c^{i,j}(p) \right) \right)$    (37)

In Equation (37), the first term ensures photometric consistency (in the sense of the Huber loss function), the second one ensures reflectance smoothness (smoothed anisotropic total variation), and the third term ensures multi-view consistency of the reflectance estimates (again, in the sense of the Huber loss function). At last, $\mu$ and $\nu$ are tunable hyper-parameters controlling the reflectance smoothness and the multi-view consistency, respectively.

5 Alternating Majorization-minimization for Solving (37)

To solve (37), we propose an alternating majorization-minimization method, which combines alternating and majorization-minimization optimization techniques. As sketched in Figure 3, this algorithm works as follows. Given an estimate $(\{\rho^{i,(k)}\}, \{s^{i,(k)}\})$ of the solution at iteration $(k)$, the reflectance maps and the lighting vectors are successively updated according to:

$\{\rho^{i,(k+1)}\} = \underset{\{\rho^i\}}{\arg\min} \; \widehat{E}_\rho^{(k)}\!\left( \{\rho^i\} \right)$    (38)
$\{s^{i,(k+1)}\} = \underset{\{s^i\}}{\arg\min} \; \widehat{E}_s^{(k)}\!\left( \{s^i\} \right)$    (39)

where $\widehat{E}_\rho^{(k)}$ and $\widehat{E}_s^{(k)}$ are local quadratic majorants of $E(\cdot, \{s^{i,(k)}\})$ and $E(\{\rho^{i,(k+1)}\}, \cdot)$ around, respectively, $\{\rho^{i,(k)}\}$ and $\{s^{i,(k)}\}$. Then, the process is repeated until convergence.

Figure 3: Sketch of the proposed alternating majorization-minimization solution. The partially frozen energies $E(\cdot, \{s^{i,(k)}\})$ and $E(\{\rho^{i,(k+1)}\}, \cdot)$ are locally majorized by the quadratic functions $\widehat{E}_\rho^{(k)}$ (in red) and $\widehat{E}_s^{(k)}$ (in blue). Then, these quadratic majorants are (globally) minimized, and the process is repeated until convergence is reached.

To this end, let us first remark that the function

$f_\delta(x; x_0) = \dfrac{x^2}{2 \max(\delta, |x_0|)} + \dfrac{\max(\delta, |x_0|) - \delta}{2}$    (40)

is such that $f_\delta(x_0; x_0) = \phi_\delta(x_0)$, and is a proper local quadratic majorant of $\phi_\delta$ around $x_0$, i.e., $f_\delta(x; x_0) \geq \phi_\delta(x)$, $\forall x$. This is easily verified if $|x_0| \leq \delta$, from the definition (36) of $\phi_\delta$. If $|x_0| > \delta$, the difference writes:

$f_\delta(x; x_0) - \phi_\delta(x) = \begin{cases} \dfrac{\left( |x| - |x_0| \right)^2}{2 |x_0|} & \text{if } |x| > \delta, \\[2mm] \dfrac{\left( |x_0| - \delta \right) \left( \delta |x_0| - x^2 \right)}{2 \delta |x_0|} & \text{if } |x| \leq \delta, \end{cases}$    (41)

which is positive in any case.
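The Huber loss (36) and its quadratic majorant (40) are easy to implement and check numerically. The following sketch (function names and the chosen value of $\delta$ are ours) verifies both the tightness at $x_0$ and the majorization property:

```python
import numpy as np

def huber(x, delta=1e-3):
    """Huber loss (36): quadratic near zero, linear in the tails."""
    return np.where(np.abs(x) <= delta, x**2 / (2 * delta), np.abs(x) - delta / 2)

def huber_majorant(x, x0, delta=1e-3):
    """Quadratic majorant (40) of the Huber loss around x0:
    f(x; x0) = x^2 / (2 max(delta, |x0|)) + (max(delta, |x0|) - delta) / 2."""
    w = np.maximum(delta, np.abs(x0))
    return x**2 / (2 * w) + (w - delta) / 2

x = np.linspace(-1.0, 1.0, 1001)
for x0 in (0.0, 5e-4, 0.3):
    assert np.all(huber_majorant(x, x0) >= huber(x) - 1e-12)  # majorization
    assert np.isclose(huber_majorant(x0, x0), huber(x0))      # tightness at x0
```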

Therefore, the function

$\widehat{E}_\rho^{(k)}\!\left( \{\rho^i\} \right) = \sum_{i=1}^{n} \sum_{p \in \Omega^i} f_\delta\!\left( \rho^i(p) \, s^{i,(k)} \cdot m^i(p) - I^i(p); \, r^{i,(k)}(p) \right) + \mu \sum_{i=1}^{n} \sum_{p \in \Omega^i} \left[ f_\delta\!\left( \partial_u \rho^i(p); \partial_u \rho^{i,(k)}(p) \right) + f_\delta\!\left( \partial_v \rho^i(p); \partial_v \rho^{i,(k)}(p) \right) \right] + \nu \sum_{(i,j)} \sum_{p \in \Omega^i} f_\delta\!\left( \rho^i(p) - \rho^j\!\left( c^{i,j}(p) \right); \, \rho^{i,(k)}(p) - \rho^{j,(k)}\!\left( c^{i,j}(p) \right) \right)$    (42)

with

$r^{i,(k)}(p) = \rho^{i,(k)}(p) \, s^{i,(k)} \cdot m^i(p) - I^i(p)$    (43)

is a local quadratic majorant of $E(\cdot, \{s^{i,(k)}\})$ around $\{\rho^{i,(k)}\}$ which is suitable for the update (38).

Similarly, the function

$\widehat{E}_s^{(k)}\!\left( \{s^i\} \right) = \sum_{i=1}^{n} \sum_{p \in \Omega^i} f_\delta\!\left( \rho^{i,(k+1)}(p) \, s^i \cdot m^i(p) - I^i(p); \, \rho^{i,(k+1)}(p) \, s^{i,(k)} \cdot m^i(p) - I^i(p) \right) + \text{const.}$    (44)

(the regularization and consistency terms not depending on the lighting) is a local quadratic majorant of $E(\{\rho^{i,(k+1)}\}, \cdot)$ around $\{s^{i,(k)}\}$ which is suitable for the update (39).

The update (38) then comes down to solving a large sparse linear least-squares problem, which we achieve by applying conjugate gradient iterations to the associated normal equations. Regarding (39), it comes down to solving a series of $n$ independent small-scale linear least-squares problems (each lighting vector has only nine unknowns), for instance by resorting to the pseudo-inverse.
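To fix ideas, here is a strongly simplified, single-view sketch of the resulting scheme (all names, default values and simplifications are ours): the multi-view consistency term is dropped ($\nu = 0$), so that each iteration reduces to the reweighted least-squares updates implied by the majorant (40), with the reflectance solved by conjugate gradient and the lighting by a small weighted least-squares problem:

```python
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

def irls_weight(r, delta=1e-3):
    # Weight 1 / max(delta, |r|) implied by the quadratic majorant (40).
    return 1.0 / np.maximum(delta, np.abs(r))

def forward_diff(n):
    # 1D forward-difference matrix with a Neumann boundary condition.
    main = -np.ones(n)
    main[-1] = 0.0
    return sp.diags([main, np.ones(n - 1)], [0, 1], format="csr")

def alternating_mm(I, m, mu=0.1, iters=50, delta=1e-3):
    """One-view sketch (nu = 0) of the alternating scheme (38)-(39).

    I: image of shape (H, W); m: geometric vectors of shape (H, W, 9)."""
    H, W = I.shape
    Du = sp.kron(forward_diff(H), sp.eye(W), format="csr")  # vertical differences
    Dv = sp.kron(sp.eye(H), forward_diff(W), format="csr")  # horizontal differences
    M = m.reshape(-1, 9)
    b = I.flatten()
    rho = b.copy()        # init: input image as reflectance map
    s = np.zeros(9)
    s[3] = 1.0            # init: diffuse lighting, as in (19)
    for _ in range(iters):
        # Reflectance update (38): large sparse reweighted least squares, via CG.
        shading = M @ s
        w_fid = irls_weight(rho * shading - b, delta)
        w_u = irls_weight(Du @ rho, delta)
        w_v = irls_weight(Dv @ rho, delta)
        A = sp.diags(w_fid * shading**2) \
            + mu * (Du.T @ sp.diags(w_u) @ Du + Dv.T @ sp.diags(w_v) @ Dv)
        rho, _ = spla.cg(A, w_fid * shading * b, x0=rho)
        # Lighting update (39): small-scale weighted least squares (9 unknowns).
        w = np.sqrt(irls_weight(rho * (M @ s) - b, delta))
        s = np.linalg.lstsq(w[:, None] * (rho[:, None] * M), w * b, rcond=None)[0]
    return rho.reshape(H, W), s
```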

We iterate the optimization steps (38) and (39) until convergence or until a maximum iteration number is reached, starting from the trivial solution of the non-regularized ($\mu = \nu = 0$) problem. This non-regularized solution is attained by considering diffuse lighting (see (20)) and using the input images as reflectance maps. In our experiments, we found 50 iterations were always sufficient to reach a stable solution (negligible relative residual between two consecutive energy values $E^{(k)}$ and $E^{(k+1)}$).

Proving convergence of our scheme is beyond the scope of this paper, but the proof could certainly be derived from that in JMIV2017_LEDS, where a similar alternating majorization-minimization scheme, called “alternating reweighted least-squares”, is used. Note, however, that the convergence rate seems to be sublinear (see Figure 4), hence possibly faster numerical strategies could be explored in the future.

Figure 4: Top: evolution of the energy $E^{(k)}$ defined in (37), as a function of the iterations $(k)$, for the test presented in Figure 8. Bottom: absolute value of the relative variation between two successive energy values. Our algorithm stops when this value falls below a small threshold, which happens in less than 50 iterations and takes around 3 minutes on a recent i7 processor, with non-optimized Matlab codes.

6 Results

In this section, we evaluate the proposed variational method for multi-view reflectance estimation, on a variety of synthetic and real-world datasets. We start with a quantitative comparison of our results against two single-view methods, namely, the cartoon + texture decomposition method from IpolCartoon and the intrinsic image decomposition method from Gehler2011.

6.1 Quantitative Evaluation on a Synthetic Dataset

  
Figure 5: (a) 3D-shape used in the tests (the well-known “Joyful Yell” 3D-model), which will be imaged under two scenarios (see Figures 6 and 7). (b) Same, after smoothing, thus less accurate. (c)-(d) Zooms of (a) and (b), respectively, near the neck.
Figure 6: First row: three of the synthetic views of the object of Figure 5-a, computed with a purely-Lambertian reflectance taking only four different values (hair, face, shirt and plinth), illuminated by a “skydome”. Second row: estimation of the reflectance using the cartoon + texture decomposition described in IpolCartoon. Third row: estimation of the reflectance using the method proposed in Gehler2011 (with 4 clusters). Fourth row: estimation of the reflectance using the proposed approach. Fifth row: ground truth.

Figure 7: First row: three of the synthetic views of the object of Figure 5-a, computed with a non-uniform shirt reflectance and a uniform, but partly specular, hair reflectance, illuminated by a single extended light source. Second row: estimation of the reflectance using the cartoon + texture decomposition described in IpolCartoon. Third row: estimation of the reflectance using the method proposed in Gehler2011 (with 6 clusters). Fourth row: estimation of the reflectance using the proposed approach. Fifth row: ground truth.
Test 1: purely-Lambertian surface + piecewise-constant reflectance + skydome lighting (see Figure 6)

Channel | Cartoon + texture IpolCartoon | Intrinsic decomposition Gehler2011 | Ours
R       | 0.62                          | 0.26                               | 0.07
G       | 0.23                          | 0.14                               | 0.04
B       | 0.38                          | 0.24                               | 0.07

Test 2: non-uniform shirt reflectance + partly specular hair reflectance + single extended light source (see Figure 7)

Channel | Cartoon + texture IpolCartoon | Intrinsic decomposition Gehler2011 | Ours
R       | 0.60                          | 0.29                               | 0.22
G       | 0.32                          | 0.22                               | 0.13
B       | 0.24                          | 0.21                               | 0.12

Table 1: RMSE on the reflectance estimates (the estimated and ground truth reflectance maps are scaled to [0, 1]), with respect to each channel, over the whole set of images, for our method and two single-view approaches. Our method outperforms the latter on the two considered datasets. See text for details.

We first test our reflectance estimation method using synthetic images of an object whose geometry is perfectly known (see Figure 5-a). Two scenarios are considered:

  • In Figure 6, a purely-Lambertian, piecewise-constant reflectance is mapped onto the surface of the object, which is then illuminated by a “skydome” i.e., an almost diffuse lighting. Shading effects are thus rather limited, hence applying to each image an estimation method which does not use an explicit reflectance model e.g., the cartoon + texture decomposition method from IpolCartoon, should already provide satisfactory results. The reflectance being perfectly piecewise constant, applying sparsity-based intrinsic image decomposition methods such as Gehler2011 to each image should also work well.

  • In Figure 7, a more complicated (non-uniform) reflectance is mapped onto the shirt, the hair is made partly specular, and the diffuse lighting is replaced by a single extended light source, which induces much stronger shading effects. It will thus be much harder to remove shading without an explicit reflectance model (cartoon + texture approach), while the single-view image decomposition approach should be non-robust to specularities.

In both cases, the competing methods IpolCartoon and Gehler2011 are applied independently to each of the $n$ images. The estimates are thus not expected to be consistent, which may be problematic if the reflectance maps are to be further mapped onto the surface for, e.g., relighting applications. On the contrary, our approach simultaneously, and consistently, estimates the reflectance maps.

As we have the reflectance ground truth at our disposal, we can numerically evaluate these results by computing the root mean square error (RMSE) of each method, over the whole set of images. The values are presented in Table 1. In order to compare like with like, the reflectance estimated by each method is scaled, in each channel, by a factor common to the $n$ reflectance maps, so as to minimize the RMSE. This should thus highlight inconsistencies between the reflectance maps.
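For reference, this evaluation protocol can be summarized by the following sketch (ours), which computes, for one channel, the single scale factor shared by all reflectance maps that minimizes the RMSE:

```python
import numpy as np

def scaled_rmse(est, gt):
    """RMSE between estimated and ground-truth reflectance maps of one channel,
    after applying to the estimates a single scale factor common to all views.

    est, gt: arrays of shape (n_views, H, W), scaled to [0, 1]."""
    a = np.sum(est * gt) / np.sum(est**2)     # least-squares optimal common scale
    return np.sqrt(np.mean((a * est - gt)**2))
```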

Figure 8: Same test as in Figure 7, using a coarse version of the 3D-shape (see Figures 5-b and 5-d), with the same values of $\mu$ and $\nu$. Results are qualitatively similar to those shown in Figure 7, obtained with perfect geometry. The RMSE in the RGB channels are, respectively: 0.24, 0.14 and 0.13, which is only slightly higher than the values attained with perfect geometry (see Table 1).

Based on the qualitative results from Figures 6 and 7, and the quantitative evaluations shown in Table 1, we can make the following three observations:

1) Considering an explicit image formation model improves cartoon + texture decomposition.

Actually, the cartoon part from the cartoon + texture decomposition is far less uniform than the reflectance estimated using the two other methods. Shading is only blurred, not really removed. This could be improved by increasing the regularization weight, but the price to pay would be a loss of detail in the parts containing thinner details (such as the shirt, in the example of Figure 7).

2) Simultaneously estimating the multi-view reflectance maps makes them consistent and improves robustness to specularities.

When estimating each reflectance map individually, inconsistencies arise, which is obvious for the hair in the third row of Figure 6, and explains the RMSE values in Table 1. In contrast, our results confirm our basic idea, i.e., that reflectance estimation benefits in two ways from the multi-view framework: it allows us not only to estimate the 3D-shape, but also to constrain the reflectance of each surface point to be the same in all the pictures where it is visible. In addition, since the location of bright spots due to specularity depends on the viewing angle, they usually occur at a given place on the surface only under certain viewing angles. Considering multi-view data should thus improve robustness to specularities. This is confirmed in Figure 7 by the reflectance estimates in the hair, where the specularities are slightly better removed than with single-view methods.

3) A sparsity-based prior for the reflectance should be preferred over total variation.

As we use a TV-smoothing term, which favors piecewise-smooth reflectance, the satisfactory results of Figure 6 were predictable. However, some penumbra remains visible around the neck. Since we also know the object geometry, it seems that we could compensate for penumbra. However, this would require that the lighting be known as well, which is not the case in the targeted use case, since outdoor lighting is uncontrolled. Moreover, we would have to consider not only the primary lighting, but also the successive bounces of light on the different parts of the scene (these were taken into account by the ray-tracing algorithm when synthesizing the images). In contrast, the sparsity-based approach Gehler2011 is able to eliminate penumbra rather well, without modeling secondary reflections. It is also able to remove shading on the face more appropriately in the example of Figure 7, while degrading the thin structures of the shirt less than total variation does. Hence, the relative simplicity of the numerical solution, which is a consequence of the choice of replacing the Potts prior by a total variation one (see Section 4.3), comes at a price. In future work, it may be important to design a numerical strategy handling the original non-smooth, non-convex problem (32).

6.2 Handling Inaccurate Geometry

Figure 9: Test on a real-world dataset. First row: three of the views of the scene. Second row: estimated reflectance maps using the proposed approach. Geometry and camera parameters were estimated using an SfM/MVS pipeline.

In the previous experiments, the geometry was perfectly known. In real-world scenarios, errors in the 3D-shape estimation using SfM and MVS are unavoidable. Therefore, it is necessary to evaluate the ability of our method to handle inaccurate geometry.

Thus, for the next experiment we use the surface shown in Figure 5-b (zoomed in Figure 5-d), which is obtained by smoothing the original 3D-shape of Figure 5-a (zoomed in Figure 5-c) using a tool from the MeshLab software. The results provided in Figure 8 show that our method seems robust to such small inaccuracies in the object geometry, and is thus relevant for the intended application.

In Figure 9, we qualitatively evaluate our method on the outputs of an SfM/MVS pipeline applied to a real-world dataset, which provides estimates of the camera parameters and a rough geometry of the scene. These experiments confirm that small inaccuracies in the geometry input can be handled. The specularities are also appropriately removed, and the reflectance maps present the expected cartoon-like aspect. However, the reflectance is under-estimated on the sides of the nose and around the chin. Indeed, since lighting is fixed, these areas are self-shadowed in all the images. Two workarounds could be used: strengthening the regularization term (and, possibly, losing fine-scale details), or actively controlling the lighting in order to make sure that no point on the surface is shadowed in all the views. This is further discussed in the next subsection.

6.3 Tuning the Hyper-parameters $\mu$ and $\nu$

Figure 10: Quantitative influence of the parameter $\mu$, using images from the same dataset as that of Figure 7, with $\nu$ fixed.

In the previous experiments, we arbitrarily chose the values of the parameters $\mu$ and $\nu$ which provided the “best” results. Of course, such a tuning, which may be tedious, must be discussed.

In order to highlight the influence of these parameters, let us first question what would happen with neither regularization nor multi-view consistency, i.e., when $\mu = \nu = 0$. In that case, only the photometric term would be optimized, which corresponds to the maximum likelihood case. If lighting is not varying, then we are in a degenerate case which may result in estimating diffuse lighting (see Equation (20)) and replacing the reflectance maps by the images. Lighting will thus be “baked into” the reflectance maps, which is precisely what we claim to avoid.

To avoid this effect, the smoothness term must be activated by setting $\mu > 0$. If we still consider $\nu = 0$, then the variational problem (37) comes down to $n$ independent image restoration problems. In fact, these problems are similar to TV denoising problems, except that a physically plausible fidelity term is used: the illumination artifacts are removed not only thanks to the total variation regularization, but also by incorporating prior knowledge of the surface geometry. However, because the photometric term is invariant under the transformation $(\rho^i, s^i) \to (\kappa^i \rho^i, s^i / \kappa^i)$, $\kappa^i > 0$, each reflectance map is estimated only up to a scale factor, hence the maps will not be consistent, as is the case for the competing single-view methods.

The latter issue is solved by activating the multi-view consistency term, i.e., by setting $\nu > 0$. In that case, there is still an ambiguity $(\{\rho^i\}, \{s^i\}) \to (\{\kappa \rho^i\}, \{s^i / \kappa\})$, $\kappa > 0$, but it is now global, i.e., independent from $i$. To solve this ambiguity, it is enough in practice to set one reflectance value arbitrarily, or to normalize the reflectance values.

Overall, it is necessary to ensure that both $\mu$ and $\nu$ are strictly positive. The choice of $\nu$ is not really critical. Indeed, the multi-view consistency regularizer which is controlled by $\nu$ arises from relaxing a hard constraint (compare (32) and (37)). Hence, $\nu$ only needs to be chosen “high enough” so that the regularizer approximates the hard constraint fairly well. In all the experiments, we used the same high value of $\nu$ and did not face any particular problem. Obviously, if the correspondences were not appropriately computed by SfM, then this value should be reduced, but SfM solutions such as Moulon are now mature enough to provide accurate correspondences.

The choice of $\mu$ is much more critical. This is illustrated in Figure 10, which shows the RMSE in each channel, using images from the same dataset as that of Figure 7, at convergence of our algorithm, as a function of $\mu$. This graph shows that the “optimal” value of $\mu$ is very hard to find: in this example, a high value of $\mu$ would diminish the RMSE in the face and the hair (which are mostly red), because this would make them uniform, as expected (see Figure 11, last rows). However, a much lower value of $\mu$ is required in order to preserve the thin shirt details, which mostly contain green and blue components (see Figure 11, first rows).

Figure 11: Qualitative influence of the parameter $\mu$, using images from the same dataset as that of Figure 7, with $\nu$ fixed.
Figure 12: First row: three of the synthetic images computed under varying lighting (which comes here from the right, from the front and from the left, respectively). Second row: estimated reflectance maps using the proposed approach, with a very low value of $\mu$. The thin structures of the shirt are preserved, while shading on the face is largely reduced. These results must be compared with those of the first row in Figure 11, obtained with the same value of $\mu$ but under fixed lighting.

There is one situation where this tuning is much easier: when the lighting is not fixed, but strongly varying. As discussed in Section 3, the problem of jointly estimating reflectance and lighting is then over-determined, which theoretically makes the regularization unnecessary. In Figure 12, we show the results obtained in the case where each image is obtained under a different lighting. In that case, the thin structures of the shirt are preserved, while shading on the face is largely reduced, despite the choice of a very low regularization weight $\mu$. Note that we cannot use the limit case $\mu = 0$, because not all pixels have correspondences in all images: there may thus be a few pixels for which the problem remains under-determined, and for which diffusion is required. Overall, this experiment shows that, without any prior knowledge of the lighting, the only way to avoid introducing an empirical prior on the reflectance, and thus its tuning, is to actively control the lighting during the acquisition process. This means combining multi-view and photometric stereo.

As it happens, this problem is actively being addressed by the computer vision community Park2017. Interestingly, this research focuses on highly accurate geometry estimation, and not so much on reflectance estimation (no reflectance estimation result is shown). Therefore, it may be an interesting future research direction to incorporate our reflectance estimation framework into such multi-view, multi-lighting approaches. Both highly accurate geometry and reflectance could then be expected.

7 Conclusion and Perspectives

We have proposed a variational framework for estimating the reflectance of a scene from a series of multi-view images. We advocate a 2D-parameterization of reflectance, turning the problem into that of converting the input images into reflectance maps. Invoking a Bayesian rationale leads to a variational model comprising an $\ell^1$-norm-based photometric data term, a Potts regularizer and a multi-view consistency constraint. For simplicity, the latter two are relaxed into a total variation term and a smoothed $\ell^1$-norm term, respectively. Numerical solving is carried out using an alternating majorization-minimization algorithm. Empirical results on both synthetic and real-world datasets demonstrate the interest of considering multi-view images for reflectance estimation, as this makes it possible to benefit from prior knowledge of the geometry, to improve robustness to specularities, and to guarantee consistency of the reflectance estimates.

However, the critical analysis of our results also highlighted some limitations and possible future research directions. For instance, avoiding the relaxation of the non-smooth, non-convex regularization seems to be necessary in order to really ensure that the estimated reflectance maps are piecewise-constant. In addition, the choice of parameterizing reflectance in the image (2D) domain is advocated for reasons of numerical simplicity, yet it seems somewhat more natural to work directly on the surface (this would avoid the multi-view consistency constraint). However, this would require turning our simple variational framework into a more arduous optimization problem over a manifold.

Finally, we could disambiguate the problem by measuring the incoming light upstream, using, for instance, environment maps. Without such a prior measurement, it seems that the only way to avoid resorting to an arbitrary prior for limiting the arising ambiguities consists in actively controlling the lighting (this would avoid resorting to spatial regularization). Therefore, another extension of our work consists in estimating reflectance from multi-view, multi-lighting data, in the spirit of multi-view photometric stereo techniques. However, this would require appropriately modifying the SfM/MVS pipeline, which relies on the constant-brightness assumption.

Acknowledgements.
Yvain Quéau and Daniel Cremers were supported by the ERC Consolidator Grant “3D Reloaded”.
