Deep Photometric Stereo on a Sunny Day

by Yannick Hold-Geoffroy, et al.

Photometric Stereo in outdoor illumination remains a challenging, ill-posed problem. Indeed, it has been shown that the scene structure cannot be recovered unambiguously from the photometric cues alone when it is lit by the sun over the course of a single day. In this paper, we present a CNN-based technique for photometric stereo on a single sunny day. We solve the ambiguity issue by combining photometric cues with prior knowledge on material properties, local surface geometry and the natural variations in outdoor lighting. To train the CNN, we create a dataset of realistic synthetic renders with both diffuse and specular materials. Given eight outdoor images taken during a single sunny day, our method robustly estimates the scene surface normals. Our approach does not require precise geolocation to work and significantly outperforms several state-of-the-art methods on images with real lighting. This shows that our CNN can combine efficiently learned priors and photometric cues available during a single sunny day.




1 Introduction

Photometric stereo (PS) is a popular, dense shape reconstruction technique that has matured extensively over nearly 40 years [1] to work with complex materials and lighting conditions [2, 3, 4, 5]. Given the excellent PS results obtained in carefully designed laboratory setups, and the success of multiview 3D reconstruction outdoors [6], recent investigations have turned to the more challenging problem of outdoor PS under uncontrolled, natural illumination.

The biggest challenge in outdoor PS remains the fact that the sun, our main light source, moves along a trajectory that nearly lies on a plane throughout the course of the day, leaving the photometric reconstruction under-constrained [1]. Researchers have investigated different approaches to address this problem, which include collecting data for several months [7, 8], waiting for the best time of the year (around the summer solstice in the Northern Hemisphere [9]) or, more recently, waiting for a day with favorable atmospheric conditions (partly cloudy sky [10, 11]). In general, the practitioner has no control over these elements and is therefore left with two options: (i) a potentially long wait for the ideal conditions to arise; or (ii) the use of a smarter reconstruction technique that does not rely solely on the photometric cue.

This paper investigates the second option above. It uses machine learning to aggregate prior knowledge on the geometry and reflectance of common classes of objects and resolves ambiguities in purely data-driven, single-day outdoor PS. To avoid restricting the algorithm to a small family of objects, we build knowledge on local geometry based on the fact that complex surfaces are made up of simpler, piecewise smooth patches. Our approach is further motivated by the fact that material properties are also locally correlated within small surface patches. Working with a broad class of small surface patches keeps the approach general and also facilitates the collection of data for machine learning.

As we reason in terms of surface patches, a natural choice is to follow a deep learning approach with a novel convolutional neural network (CNN) architecture to automatically learn features from image patches and guide 3D normal estimation. To our knowledge, this is the first CNN to address the problem of under-constrained, single-day outdoor PS.

Our CNN design seeks a balance between fully-calibrated PS (too restrictive) and uncalibrated PS (which introduces further ambiguities). We follow a semi-calibrated outdoor PS approach and only assume that: (1) the object is imaged at roughly predefined times of the day; (2) the sun is unobstructed by clouds at these times; and (3) the camera is orthographic and faces approximately north (or south). The latter assumption could be relaxed using recent advances on sun position estimation in the wild to automatically detect the camera orientation. One could then train our model on multiple camera calibrations and select the right one according to the detected orientation. Of note, unlike previous work on outdoor PS [12], our approach requires neither known sun intensities nor complete geolocation data. Together, our assumptions define an “8-figure” subspace for the position of the sun through the year, known as an analemma, whose shape also varies with geographical location (fig. 1).

Therefore, our outdoor PS network is required to learn priors not only on object materials and local geometry (diffuse, specular, smooth, …), but also priors on lighting (variability in sun intensity and elevation with respect to geolocation and time of year). This knowledge is encoded in the network, which learns to output 3D normals as a non-linear function of input image patches taken throughout a day. This rich prior knowledge allows us to obtain a reconstruction performance that outperforms the current state-of-the-art on real lighting data with only 3 images, which is much lower than the 6-18 typically used by previous work [13, 12]. Our fully trained method will be made publicly available.

We summarize our contributions as follows:

  • a state-of-the-art method for single-day photometric stereo on sunny days that is robust to shadows, specularities and arbitrary but uniform albedo;

  • a pipeline to generate synthetic images with a mix of Lambertian and glossy materials that captures the diversity of real outdoor lighting.

2 Related work

This section focuses on the more relevant work on outdoor PS, for conciseness. For an overview of general PS, the reader is referred to the recent, excellent review in [14].

As shown in Woodham’s seminal work [1], for Lambertian surfaces, calibrated PS computes a (scaled) normal vector in closed form as a simple linear function of the input image pixels; this linear mapping is only well-defined for images obtained under three or more (known) non-coplanar lighting directions. Subsequent work on outdoor PS has struggled to meet this requirement since, over the course of a day, the sun shines from directions that nearly lie on a plane. These co-planar sun directions then yield an ill-posed problem known as two-source PS; despite extensive research using integrability and smoothness constraints [15, 16], results still present strong regularization artifacts on surfaces that are not smooth everywhere. To avoid this problem in outdoor PS, authors initially proposed gathering months of data, watching the sun elevation change over the seasons [7, 8]. More recently, Shen et al. [9] noted that the coplanarity of the daily sun directions actually varies throughout the year, with single-day outdoor PS becoming more ill-posed at high latitudes near the winter solstice, and worldwide near the equinoxes.
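As a minimal illustration of why coplanar lighting is a problem, the sketch below solves the Lambertian calibrated PS system for a single pixel by least squares. The function name `lambertian_ps`, the light directions and the test normal are illustrative assumptions, and shadows are ignored. With three non-coplanar lights the normal is recovered; when all lights lie in one plane, the system drops to rank 2 and the out-of-plane component of the normal is lost.

```python
import numpy as np

def lambertian_ps(intensities, light_dirs):
    """Least-squares scaled normal for one pixel.

    intensities: (T,) observed values; light_dirs: (T, 3) unit light vectors.
    Returns the unit normal, the albedo and the rank of the linear system.
    """
    g, _, rank, _ = np.linalg.lstsq(light_dirs, intensities, rcond=None)
    albedo = np.linalg.norm(g)
    return g / albedo, albedo, rank

# Hypothetical ground truth used only for this demonstration.
true_n = np.array([0.3, 0.4, np.sqrt(0.75)])  # unit length
true_albedo = 0.8

# Case 1: three non-coplanar light directions (well-posed).
L_good = np.array([[0.0, 0.0, 1.0],
                   [0.6, 0.0, 0.8],
                   [0.0, 0.6, 0.8]])
I_good = true_albedo * L_good @ true_n
n_good, rho_good, rank_good = lambertian_ps(I_good, L_good)

# Case 2: five lights that all lie in the x-z plane (y = 0), a crude
# stand-in for a sun path that stays on a plane during the day.
angles = np.deg2rad([-60.0, -30.0, 0.0, 30.0, 60.0])
L_flat = np.stack([np.sin(angles), np.zeros_like(angles), np.cos(angles)], axis=1)
I_flat = true_albedo * L_flat @ true_n
n_flat, rho_flat, rank_flat = lambertian_ps(I_flat, L_flat)

print("non-coplanar:", n_good, "rank", rank_good)
print("coplanar:    ", n_flat, "rank", rank_flat)
```

In the coplanar case the least-squares solver returns the minimum-norm solution, whose y component is zero regardless of the true normal; any value of that component explains the observations equally well.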

To compensate for limited sun motion, other approaches use richer illumination models that account for additional atmospheric factors in the sky. This is done by employing (hemi-)spherical environment maps [17] that are either real sky images [13, 18, 19] or synthesized by parametric sky models [20, 12]. Using a large database of real sky images, Hold-Geoffroy et al. [10] showed that partly cloudy days are in fact better for single-day outdoor PS since clouds obscure and further scatter sun light, causing an out-of-plane shift in the effective direction of illumination. Subsequently [11], they also showed that good cloud coverage conditions for stable solutions may be observed in the sky within very short time intervals of just above one hour.

Despite these developments, state-of-the-art approaches in calibrated [13] and semi-calibrated [12] (based on precise geolocation) outdoor PS are still prone to potentially long waits for ideal conditions to arise in the sky; and verifying the occurrence of such events is still a trial-and-error process. These facts have motivated our goal of developing a smarter approach that uses deep learning techniques to resolve ambiguities in outdoor PS by aggregating prior knowledge on object geometry, material and their interaction with natural outdoor illumination. The approach we propose is the first of its kind; so far, deep PS had only been applied in indoor scenarios with rich and controlled illumination [21, 22, 23, 14], focusing on learning inverse functions for non-Lambertian reflectances.

Finally, under more extreme ambiguity, techniques for shape-from-shading (SfS) [5, 4, 24] attempt to recover 3D normals from a single input image, in which case the shading cue alone is obviously insufficient to uniquely define a solution. Thus, SfS relies strongly on priors of different complexities and deep learning is quickly bringing advances to the field [25, 26, 27]. While this is encouraging, here we seek to improve the accuracy of 3D normal estimation by relying less heavily on priors and more strongly on the photometric cues obtained from multiple images. In addition, we also avoid restricting the approach to a specific type of object and reflectance model (e.g., human faces, Lambertian [26]).

3 Deep outdoor Photometric Stereo network

This section presents our CNN-based approach designed to address ambiguities that arise in single-day outdoor PS reconstruction. It does so by using deep learning to model prior knowledge on object geometry, material properties, as well as their local spatial correlation and interaction with natural sky light. In order to build such a knowledge base, one needs a large number of images depicting various objects lit by outdoor lighting throughout the day, over different geographic locations and days of the year; the surface normal map of each object is also required. Unfortunately, no such large-scale dataset currently exists, so a natural choice is to synthesize realistic data to train our network. We begin by presenting our problem formulation and CNN architecture, followed by the training procedure and data generation.

3.1 Image formation model

Consider an image pixel $p$ that depicts a small area of an object’s surface, with a normal vector $n$ and material reflectance described by a bidirectional reflectance distribution function (BRDF) $f$. When viewed from direction $v$, the RGB color $b_t$ of this pixel at daytime $t$ is modeled as

$$ b_t = \int_{\Omega} f(\omega, v)\, L_t(\omega)\, V(\omega)\, \max(0, \omega \cdot n)\, \mathrm{d}\omega, \qquad (1) $$

where $\omega$ is a direction of incoming light within the hemispherical domain $\Omega$, $L_t(\omega)$ gives the RGB intensities (color) of the incoming light at time $t$, and $V(\omega)$ is a binary visibility encoding self-shadowing. Here $t$ denotes the actual solar time at the target object location.

Our goal is to invert the above rendering equation and recover the surface normal from the observed changes in pixel intensities, which are caused by the changing natural illumination throughout the day. However, as discussed above, a solution based solely on the photometric cue is rarely uniquely defined and stable in outdoor PS due to the limited motion of the sun, which leads to insufficient variability in the incoming light and, thus, in the observed pixel intensities [11].

Therefore, instead of considering a single pixel, we reformulate our goal and aggregate additional RGB image data within a neighborhood depicting a larger surface patch centered at that pixel. We now seek to learn a predictor that maps the stack of input image patches, one per input image, to the normals of the patch. In this paper we use 8 input images, but we experiment with other values in sec. 5.2. This approach is motivated by the fact that complex object geometry is often made up of simpler, small surface patches presenting highly correlated surface normals and material properties. A natural way to obtain this predictor is to train a CNN that learns a nonlinear function of local surface features that are highly correlated with the normal at the center of the patch. We train our network on a large synthetic database of surface patches realistically rendered with a standard GGX shader for the BRDF, with varying diffuse and specular parameters, and using the Hošek-Wilkie physically-based sky model [28] for the spherical illumination function, as described next.

3.2 Illumination model: the solar analemma

We follow a semi-calibrated PS approach that does not require known lighting environments [13] nor complete camera geolocation data [12]. Our method only assumes that: (1) the object images are captured at roughly predefined times of the day; (2) the sun is unobstructed by clouds at these times; and (3) the camera is orthographic and faces approximately North (or South). We later (sec. 5) analyze the robustness of our network with respect to departures from these ideal conditions and discuss how assumption (3) can be relaxed.

Together, these assumptions constrain the sun position to lie within an “8-figure” subspace at each capture time, known as a solar analemma, whose shape also varies with geographical location (fig. 1). For a given time of day, the sun may be positioned at different locations depending upon the selected date and latitude, as prescribed by the analemma. The neural network is thus expected to adapt to this (constrained) variability in sun position and associated intensity. As shown in fig. 1-(a,b), for a given timestamp and latitude, the sun position spans relatively small angular ranges, which still remain quite constrained even when considering geographical locations sampled over the Northern Hemisphere (fig. 1-(c)). Note that a similar plot would be obtained by sampling the Southern Hemisphere with the camera facing south.

Figure 1: Solar analemma: position of the sun in the sky at a specific time of the day and throughout a year over (a) Paris and (b) the Tropic of Cancer. Note how the analemmas spread over a wide range of zenith and azimuth angles over the course of a year. (c) Probability of the sun location in the sky for our training set.
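To give a feel for how much the sun position varies at a fixed time of day, the following sketch computes approximate sun elevation and azimuth from latitude, day of year and solar hour. It is not the sun-position algorithm used in the paper: it relies on Cooper's textbook approximation of the solar declination and ignores the equation of time, so it reproduces the seasonal spread in elevation and azimuth at a fixed true solar time rather than the full figure-eight shape. The function name `sun_direction` and the latitude used for Paris are assumptions made for illustration.

```python
import numpy as np

def sun_direction(latitude_deg, day_of_year, solar_hour):
    """Approximate sun elevation and azimuth, both in degrees.

    Azimuth is measured clockwise from north (90 = east, 180 = south).
    Uses Cooper's declination approximation; illustrative accuracy only.
    """
    lat = np.deg2rad(latitude_deg)
    decl = np.deg2rad(23.45) * np.sin(2.0 * np.pi * (284 + day_of_year) / 365.0)
    hour_angle = np.deg2rad(15.0 * (solar_hour - 12.0))  # negative in the morning

    # Sun unit vector in local east/north/up coordinates.
    east = -np.cos(decl) * np.sin(hour_angle)
    north = np.cos(lat) * np.sin(decl) - np.sin(lat) * np.cos(decl) * np.cos(hour_angle)
    up = np.sin(lat) * np.sin(decl) + np.cos(lat) * np.cos(decl) * np.cos(hour_angle)

    elevation = np.rad2deg(np.arcsin(up))
    azimuth = np.rad2deg(np.arctan2(east, north) % (2.0 * np.pi))
    return elevation, azimuth

# Sun position at 9:00 solar time over a full year, at roughly the latitude of Paris.
days = np.arange(1, 366)
elev_9h, azim_9h = sun_direction(48.85, days, 9.0)
elev_range = elev_9h.max() - elev_9h.min()
print(f"elevation at 9:00 spans {elev_range:.1f} degrees over the year")

# Sanity check at solar noon near the March equinox (day 81 in this approximation).
elev_noon, azim_noon = sun_direction(48.85, 81, 12.0)
```

Sweeping `latitude_deg` over several values and plotting the resulting points gives a rough picture of the kind of sun-position distribution a network would have to cope with.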

The physically-based parametric sky model of Hošek and Wilkie [28] is used to obtain the spherical illumination function in eq. (1). The model represents the spectral sky radiance as a parametric function of the sun position, sky turbidity and ground albedo. Here, turbidity is set to 2, which corresponds to a clear day, and ground albedo to 0.3. Note that we do not model light scattering caused by clouds obscuring the sun and thus assume the sun is fully visible in the sky.

3.3 Network architecture

We now turn to the design of the predictor function introduced above, which is implemented as a Convolutional Neural Network (CNN). An overview of the proposed architecture is shown in fig. 2. The network takes as input image patches extracted from 8 images captured at regular intervals between 9:00 and 16:00 solar time throughout a single sunny day. The first layer is composed of 32 channels of filters with weights shared across the 8 inputs. The resulting feature maps are subsequently concatenated into a single feature tensor. A second convolutional layer is then used, yielding 256 channels, followed by 3 residual blocks as defined in the ResNet-18 architecture [29]. Lastly, 2 fully-connected (FC) layers are used to produce a patch of estimated normals. Note that we experimented with fully-convolutional architectures [23] but found the FC layers to yield better results. The ELU activation function [30] is used at every convolutional and fully-connected layer except the output layer, which uses a different activation function. Batch normalization [31] is applied at every layer except the first one and the output layer, where it could otherwise break low-level feature detectors and output distributions.

Figure 2: Our novel CNN architecture for deep single-day outdoor PS on sunny days. The network operates on patches of the input image, captured at 8 time intervals regularly spaced throughout a single day. The network uses convolutional (blue) and residual (red) layers before estimating the normals using fully-connected layers (green). Two losses are used to train our method, one based on the cosine distance with the ground truth and another to constrain the norm of the output vector.

The estimated normals are represented by the Cartesian components of the surface normals of the input patch. We experimented with both Cartesian and spherical coordinate parameterizations, but found the Cartesian parameterization to be more stable despite its additional degree of freedom. We hypothesize this may be due to the “wrap-around” issue with the azimuth angle.


To process entire images, we crop overlapping tiles from the image with a stride of 8 pixels. Since a pixel can belong to up to 4 patches, the network produces several estimates that are then merged together using a weighted average. We use a Gaussian kernel centered on the middle of the patch as the weighting function to perform the linear interpolation across overlapping patches.
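The tiling and merging step can be sketched as follows. This is only an illustration of the weighted-average idea: the patch size of 16 pixels (chosen so that a stride of 8 gives at most 4 overlapping patches per pixel), the Gaussian standard deviation, and the names `gaussian_window` and `merge_patches` are assumptions, and `predict` stands in for the trained network.

```python
import numpy as np

def gaussian_window(size, sigma):
    """2D Gaussian weights centered on the middle of a size x size patch."""
    center = (size - 1) / 2.0
    y, x = np.mgrid[0:size, 0:size]
    return np.exp(-((x - center) ** 2 + (y - center) ** 2) / (2.0 * sigma ** 2))

def merge_patches(predict, image_stack, patch=16, stride=8, sigma=4.0):
    """Blend per-patch normal estimates into a full normal map.

    image_stack: (T, H, W, 3) array of T input images.
    predict: callable mapping a (T, patch, patch, 3) tile to (patch, patch, 3) normals.
    Assumes (H - patch) and (W - patch) are multiples of stride.
    """
    _, height, width, _ = image_stack.shape
    accum = np.zeros((height, width, 3))
    weight_sum = np.zeros((height, width, 1))
    window = gaussian_window(patch, sigma)[..., None]

    for top in range(0, height - patch + 1, stride):
        for left in range(0, width - patch + 1, stride):
            tile = image_stack[:, top:top + patch, left:left + patch]
            normals = predict(tile)
            accum[top:top + patch, left:left + patch] += window * normals
            weight_sum[top:top + patch, left:left + patch] += window

    merged = accum / np.maximum(weight_sum, 1e-12)
    # Re-normalize, since an average of unit vectors is generally shorter than 1.
    length = np.linalg.norm(merged, axis=-1, keepdims=True)
    return merged / np.maximum(length, 1e-12)

# Stand-in predictor that always answers "facing the camera".
def constant_predictor(tile):
    return np.broadcast_to(np.array([0.0, 0.0, 1.0]), tile.shape[1:])

stack = np.random.default_rng(0).random((8, 32, 32, 3))
normal_map = merge_patches(constant_predictor, stack)
print(normal_map.shape)
```

Weighting by distance to the patch center favors estimates made where the network sees the most surrounding context, which tends to hide seams between tiles.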

3.4 Training

The network learns a function that estimates the patch normals. We define the loss to be minimized between the estimated and ground truth patch normals as the sum of two separate loss functions defined on individual patch normals. The total loss is the sum over all individual normals in the patch (eq. 2). The first term is the cosine distance between the estimated and ground truth normal, computed from their dot product (eq. 3). The second term enforces the unit-length constraint on the recovered normal (eq. 4).


The loss in eq. (2) is minimized via stochastic gradient descent using the Adam optimizer [32] with an initial learning rate, a weight decay, and the recommended values for the remaining optimizer hyperparameters. Mini-batches of 128 samples were used during training, which was regularized via early stopping. Training typically converges in around 250 epochs on our dataset, which is described next.
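The two-term loss described in this section can be sketched in NumPy as below. The exact expressions are not reproduced here, so both terms are assumptions: the cosine term is written as one minus the normalized dot product, and the unit-length term as the squared deviation of the norm from one. The function name `patch_normal_loss` is hypothetical.

```python
import numpy as np

def patch_normal_loss(pred, gt, eps=1e-12):
    """Sum over a patch of a cosine-distance term and a unit-length term.

    pred, gt: arrays of shape (..., 3); gt is expected to hold unit normals.
    Both terms are assumed forms, not the paper's exact equations.
    """
    pred_norm = np.linalg.norm(pred, axis=-1)
    gt_norm = np.linalg.norm(gt, axis=-1)
    cosine = np.sum(pred * gt, axis=-1) / np.maximum(pred_norm * gt_norm, eps)
    cos_term = 1.0 - cosine               # 0 when the directions agree
    unit_term = (pred_norm - 1.0) ** 2    # 0 when the estimate has unit length
    return float(np.sum(cos_term + unit_term))

gt_patch = np.zeros((16, 16, 3))
gt_patch[..., 2] = 1.0                    # a flat patch facing the camera

loss_perfect = patch_normal_loss(gt_patch, gt_patch)
loss_scaled = patch_normal_loss(2.0 * gt_patch, gt_patch)   # right direction, wrong length
loss_flipped = patch_normal_loss(-gt_patch, gt_patch)       # right length, wrong direction
print(loss_perfect, loss_scaled, loss_flipped)
```

Separating direction and length in this way lets a Cartesian output be penalized for drifting off the unit sphere without entangling that penalty with the angular error.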

3.5 Training dataset

To train our predictor function, we rely on a large training dataset of synthetic objects, lit by a physically-based outdoor daylight model. To generate a single 8-image set of inputs, we randomly select a combination of: 1) object shape, 2) material properties, and 3) geo-temporal coordinates for lighting conditions. We now detail how each of these 3 choices is made.

Since the neural network only sees patches of pixels, its receptive field is, by design, not large enough to learn priors on whole object shapes. Therefore, our dataset contains a wide variety of local surface curves. We used the blob dataset from [4] as training models. We also added simple primitives (cube, sphere, icosahedron, cone) to the dataset. A validation set, composed of one of the blob models that was held out from the training set as well as some models from the Stanford 3D Scanning Repository [33] and the owl model used in [10], was also created. All blobs and geometric primitives are randomly rotated about their centroid.

To model a wide range of surface appearance ranging from diffuse to glossy, we employ a linear combination of a Lambertian model and a microfacet model (eq. 5). The Lambertian term depends on the surface color, and the microfacet term is the GGX model [34], which is parameterized by the surface roughness.

The albedo is generated in HSV space, with its components drawn from uniform and triangular distributions. This generates colors that are in general bright and prevents an abundance of strongly saturated colors. The surface roughness is sampled so as to avoid mirror-like surfaces. Finally, the mixing coefficient between the two models is also sampled at random.

Accurately capturing outdoor illumination requires a carefully calibrated setup [35]; as such, only a limited number of real datasets exist (one notable exception being [36]). To light the scene with a wide variety of realistic outdoor lighting conditions, we instead rely on the Hošek-Wilkie physically-based sky model [28] as described in sec. 3.2. We also placed a ground plane of albedo 0.3 outside the field of view of the camera, to generate a light bounce from below the object. 11 random locations in the Northern Hemisphere between the latitudes of the Equator and Moscow were selected. Furthermore, 6 random days throughout the year were chosen in addition to the equinoxes and solstices, for a total of 10 days. This results in 110 pairs of geographical locations and dates, which are used to compute the sun position in the sky throughout the day using [37]. The distribution of the resulting sun positions throughout our training set is shown in fig. 1. For every pair of geographical location and day, 8 timestamps ranging from 9:00 to 16:00 are used to perform the renders. Timestamps are aligned to the solar noon instead of the political time zone of the geographic location. Note that, even though we sample only geographical locations in the Northern Hemisphere, our dataset represents days in the Southern Hemisphere equally well. Indeed, flipping the images left-right, reversing the image order (from 16:00 to 9:00) and pointing the camera southward would generate data identical to our training dataset.

The resulting images are rendered with the Cycles physically-based rendering engine, effectively performing eq. (1) with the BRDF defined in eq. (5). This results in a dataset of 369,440 renders corresponding to 23,090 combinations of geo-temporal coordinates and material properties, which we then split into 21,220 and 1,870 for training and validation, respectively. At the rendering resolution, this amounts to over 10 million input-output pairs of patches to train on. Special care was taken to ensure that no 3D model or material properties were shared between the training and validation datasets. Please consult the supplementary material for example training images obtained with this procedure.

4 Evaluation

Figure 3: (top) An example of the lighting environment maps and renders throughout a day, at 9:19, 10:19, 11:39, 13:04, 14:04 and 15:27. (bottom) Qualitative results (odd rows) and errors in degrees (even rows) for, from left to right, the ground truth, our technique, and the state-of-the-art on single-day photometric stereo in the semi-calibrated [12] and calibrated [13] cases, deep photometric stereo [22] and single image normal estimation [27, 25] (averaged over the day) on our real lighting dataset. More results available in the supplementary material.

In this section, we assess the performance of our method and compare it extensively to the state-of-the-art methods in single-day and regular photometric stereo, as well as some recent single image normal estimation techniques.

4.1 Evaluation dataset

To evaluate and compare the techniques, we rely on a dataset of synthetic objects, lit by real skies. To generate the images, we manually selected 3 sunny days over 2 geographical locations from the Laval HDR sky database [36], which contains unsaturated HDR, omnidirectional photographs of the sky captured with the approach proposed in [35]. We build a virtual 3D scene containing the HDR sky environment map as the sole light source, a 3D object viewed by an orthographic camera, and a 0.3 albedo ground plane placed under the object, outside the field of view of the camera. We used the 3D models from the validation set which the neural network never saw during training. This results in a dataset of 960 renders yielding 60 normal maps to evaluate. Example images obtained with this technique are shown in fig. 3.

4.2 Results and comparisons

We compare our method to several state-of-the-art techniques relying on photometric stereo and/or deep learning to estimate surface normals from images. For PS techniques, we compare to the calibrated technique of Yu et al. [13], which requires knowledge of the full environment map used to light the object. In our work, we use the variant proposed by Hold-Geoffroy et al. [11], without the low-rank matrix completion preprocessing, which was shown to yield slightly improved results over the original formulation. We also compare to the semi-calibrated method of Jung et al. [12], which requires only knowledge of the capture geolocation. For deep learning techniques, we compare to the recent Deep Photometric Stereo Network (DPSN) [22], which operates on one pixel at a time. Since it assumes known point light source lighting, we re-trained this model using the sun position from a geographical location and date representative of our training dataset. In addition, we also compare to single image networks: Eigen and Fergus [25] and MarrNet [27]. Since they rely on a single image, we average their results over all 8 inputs.

The comparative results, shown qualitatively in fig. 3 and quantitatively in fig. 4, clearly demonstrate that our approach significantly outperforms all other techniques. We observe that both single image techniques [27, 25] do not work well and result in very high median errors (see fig. 4). For [25], this is probably due to the fact that it cannot handle the harsh shadows created by outdoor lighting during sunny days, since it was trained with indoor lighting only. In addition, MarrNet [27] outputs a voxel occupancy grid and only produces normals as a byproduct (in its latent stage). As such, this method may not be fully optimized for normal estimation.

The PS techniques yield much better results but still exhibit quite significant errors, since sunny days do not contain sufficient constraints to accurately recover surface normals. The calibrated method of Yu et al. [13] yields a median normal angular estimation error comparable to that of DPSN. Interestingly, the semi-calibrated method of Jung et al. [12] actually yields a lower median error, despite needing less information than the calibrated methods. This could be due to its reliance on a parametric clear sky model to estimate lighting, which closely matches the actual ground truth lighting, and to its reliance on an intensity profile matching algorithm.

Figure 4: Median reconstruction error on our real lighting dataset displayed vertically as “box-percentile plots” [38]; the center horizontal bars indicate the median, while the bottom (top) bars are the 25th (75th) percentiles. Our proposed method (green) provides state-of-the-art performance compared to non-learned methods for single-day PS (blue [12], orange [13]), deep learning methods on calibrated photometric stereo (red [22]) and single image normals reconstruction (purple [27], brown [25]).

It is interesting to note that most PS techniques capture with some degree of success the left/right component of the surface normals (roughly speaking, the red and blue tints in the normal maps). This axis happens to be the same as the sun trajectory through the day when the camera is facing north or south, which results in strong photometric constraints on this axis. On the other hand, the recovery of the up/down axis is much less successful for most techniques, as outdoor photometric cues lack information in this direction through a single sunny day.

In contrast, our method yields a normal map that is, although a bit smoother, qualitatively very similar to the ground truth. Quantitatively, our approach achieves the lowest median error over the evaluation set, with the majority of its errors lying below those of the next-best performing method, Jung et al. [12] (see fig. 4). Even though it is trained on purely synthetic data, our network is able to generalize well to images rendered with real lighting. The difference in performance with respect to DPSN shows the usefulness of dealing with image patches, which allows the network to learn appropriate patch-based shape priors that can be exploited when the photometric cue alone is not sufficient.
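The angular error statistics reported in this section can be computed along the lines of the sketch below, which measures the per-pixel angle between estimated and ground truth normals and summarizes it with the median and quartiles. The function names `angular_error_deg` and `summarize_errors` and the toy normals are illustrative assumptions.

```python
import numpy as np

def angular_error_deg(pred, gt, mask=None):
    """Per-pixel angle in degrees between two normal maps of shape (..., 3)."""
    pred = pred / np.linalg.norm(pred, axis=-1, keepdims=True)
    gt = gt / np.linalg.norm(gt, axis=-1, keepdims=True)
    cosine = np.clip(np.sum(pred * gt, axis=-1), -1.0, 1.0)  # guard against rounding
    errors = np.degrees(np.arccos(cosine))
    return errors[mask] if mask is not None else errors.ravel()

def summarize_errors(errors):
    """Median and quartiles, the statistics shown in a box-percentile plot."""
    q25, median, q75 = np.percentile(errors, [25, 50, 75])
    return {"q25": float(q25), "median": float(median), "q75": float(q75)}

# Toy example: three pixels whose errors are 0, 45 and 90 degrees.
gt_normals = np.array([[0.0, 0.0, 1.0],
                       [0.0, 0.0, 1.0],
                       [0.0, 0.0, 1.0]])
pred_normals = np.array([[0.0, 0.0, 1.0],
                         [0.0, 1.0, 1.0],
                         [1.0, 0.0, 0.0]])
errors = angular_error_deg(pred_normals, gt_normals)
summary = summarize_errors(errors)
print(errors, summary)
```

The optional `mask` argument lets background pixels be excluded, so that only pixels on the object contribute to the statistics.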

5 Analysis

We now analyze our network further and, in particular, explore its robustness to departures from the assumptions that were made in sec. 3.

5.1 Camera calibration error

To constrain the set of possible lighting directions (sec. 3.2), we assumed that the camera is pointing north. We analyzed the impact on reconstruction performance when this assumption is violated by rotating the real environment maps used to render the evaluation dataset (sec. 4.1), and show the results of this experiment in fig. 5. The slight improvement observed for a westward calibration error is due to the timestamps of our real lighting dataset, which are not perfectly aligned with the timestamps expected by the neural network. We observe that the median reconstruction error grows gradually with the camera calibration error, showing that the network has some built-in robustness to these errors.

Figure 5: Median normal estimation error as box-percentile plots (see fig. 4) as a function of the camera deviation from north, in degrees, on our real lighting dataset. Positive values mean the camera is rotated toward the west, negative values toward the east.

5.2 Number of images

We now study the normal estimation performance as a function of the number of inputs to the CNN (see sec. 3.3). Results ranging from a single input image (effectively performing shape-from-shading) to larger sets of input images, all taken uniformly from 9:00 to 16:00, are shown in fig. 6. We observe a rapid improvement in performance from one to three images, which is coherent with photometric stereo theory [1]. Performance continues to improve beyond three images, probably because the added constraints improve robustness to noise and non-diffuse materials. Interestingly, the normal estimation error starts to increase slightly for the largest numbers of input images. This could be due to an increase in the number of parameters to train in our model (the size of the tensor obtained after concatenation grows with the number of inputs, thereby increasing the number of parameters in the second convolutional layer), making the model harder to train.

Figure 6: Median normal estimation error as box-percentile plots [38] (see fig. 4) on our evaluation dataset as a function of the number of input images.

5.3 Feature analysis

We use SmoothGrad [39] to visualize the regions of the image that have a larger impact on normal estimation. Since the network operates on patches and not entire images, we use the same linear blending strategy as in sec. 3.3, and report qualitative results in fig. 7. We observe that the neural network tends to ignore darker regions and focuses on brighter alternatives in other images when available. This result suggests that the network learns to avoid shadowed areas, where, indeed, the photometric cue is not reliable due to low signal-to-noise ratio and to occlusion of the main light source.

Figure 7: Back-propagating the gradient through our network using SmoothGrad [39] on the input images shown in (a) generates a map of the pixels that most affect the normal estimation (b). Notice how the regions in shadow generally have less influence (blue) than regions in direct sunlight (yellow).
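The idea behind SmoothGrad is to average the gradient of the output with respect to the input over several noisy copies of that input. The sketch below shows this on a toy differentiable model with a hand-written gradient; it does not use the paper's network. The names `smoothgrad`, `toy_gradient` and the weight mask are assumptions made for illustration.

```python
import numpy as np

def smoothgrad(grad_fn, x, noise_std=0.1, n_samples=50, seed=0):
    """Average grad_fn over n_samples noisy copies of the input x."""
    rng = np.random.default_rng(seed)
    total = np.zeros(x.shape, dtype=float)
    for _ in range(n_samples):
        noisy = x + rng.normal(0.0, noise_std, size=x.shape)
        total += grad_fn(noisy)
    return total / n_samples

# Toy "model": y = sum(weights * tanh(x)). Pixels with zero weight play the
# role of regions the model ignores (for example, regions in shadow).
weights = np.ones((4, 4))
weights[:, :2] = 0.0

def toy_gradient(x):
    """Analytic gradient of y with respect to x for the toy model above."""
    return weights * (1.0 - np.tanh(x) ** 2)

image = np.linspace(-1.0, 1.0, 16).reshape(4, 4)
saliency = np.abs(smoothgrad(toy_gradient, image))
print(saliency.round(3))
```

With a real network, `grad_fn` would be supplied by the automatic differentiation of the framework in use, and the resulting per-patch maps would be blended across overlapping patches.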

6 Discussion


Figure 8: Limitation of our approach. Our network is trained on spatially uniform BRDFs, so testing it on spatially-varying albedo maps increases the estimation error. (left) Spatially-uniform albedos result in low error, while checkerboard albedo maps with (center) small and (right) large patterns increase the error.

In this paper, we present what we believe to be the first learned single-day photometric stereo method. Two key ideas were used to train our approach: first, local spatial context is important in the single-day case and can be leveraged using convolutional layers; second, a large synthetic dataset with varied surface reflectance generalizes well to real lighting and allows the training of deep learning methods. This results in a method that is robust to shadows, specular highlights and different albedos. We show that our method significantly outperforms previous work on a challenging evaluation dataset of virtual objects lit by real sunny lighting conditions.

Despite offering state-of-the-art performance, our method suffers from some limitations, which open the way for interesting future work. The first limitation is that the camera is assumed to be pointing north. Although the network shows some resilience to errors in camera calibration (see fig. 5), large deviations from the assumed direction are not handled well. One possible way to circumvent this limitation would be to train direction-specific models and select the right one by detecting the camera orientation. Furthermore, while our approach is robust to non-Lambertian reflections, it assumes the scene to have a spatially-uniform BRDF. Fig. 8 shows the behavior of our approach when the object is texture mapped with two spatially-varying albedo maps: checkerboard patterns with small and large squares. Unsurprisingly, the resulting normal maps appear distorted since the constant albedo assumption is broken. One interesting direction for future work here would be to train a network on the ratio between pairs of images (e.g. as in [13]), which effectively cancels out the albedo. Lastly, we chose to focus on sunny days, since this is the most challenging case for outdoor photometric stereo [10, 11]. Training the network on partially-cloudy days (for instance, by increasing the turbidity of the Hošek-Wilkie model) would be one potential way forward.
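The albedo cancellation mentioned above can be checked on a toy example. The sketch below assumes a purely Lambertian surface lit by a single directional light, with no sky illumination, shadows or specularities; it is a simplified illustration rather than the formulation of [13], and all names and values (`render_lambertian`, the normal, the two light directions) are hypothetical.

```python
import numpy as np


def unit(v):
    """Return v scaled to unit length."""
    v = np.asarray(v, dtype=np.float64)
    return v / np.linalg.norm(v)


def render_lambertian(albedo_map, normal, light):
    """Lambertian image of a planar patch: albedo * max(0, n . l)."""
    shading = max(0.0, float(np.dot(normal, light)))
    return albedo_map * shading


# Hypothetical surface normal and two light directions (both lit).
normal = unit([0.2, 0.3, 1.0])
light_a = unit([0.5, 0.2, 1.0])
light_b = unit([-0.4, 0.6, 1.0])

# Checkerboard albedo map alternating between 0.9 and 0.2.
checkerboard = np.where(np.indices((4, 4)).sum(axis=0) % 2 == 0, 0.9, 0.2)

image_a = render_lambertian(checkerboard, normal, light_a)
image_b = render_lambertian(checkerboard, normal, light_b)

# The per-pixel ratio depends only on the normal and the two lights.
ratio = image_a / image_b
```

Although `image_a` and `image_b` both show the checkerboard pattern, `ratio` is constant over the patch and equals (n · l_a) / (n · l_b), so the texture no longer appears in the quantity fed to the estimator.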

7 Acknowledgments

This work was partially supported by the REPARTI Strategic Network and the NSERC Discovery Grant RGPIN-2014-05314. We gratefully acknowledge the support of Nvidia with the donation of the GPUs used for this research.

References

  • [1] Woodham, R.J.: Photometric method for determining surface orientation from multiple images. Optical Engineering 19(1) (1980) 139–144
  • [2] Alldrin, N.G., Zickler, T., Kriegman, D.J.: Photometric stereo with non-parametric and spatially-varying reflectance. In: IEEE Conference on Computer Vision and Pattern Recognition. (2008)

  • [3] Basri, R., Jacobs, D., Kemelmacher, I.: Photometric stereo with general, unknown lighting. International Journal of Computer Vision 72(3) (jun 2007) 239–257
  • [4] Johnson, M.K., Adelson, E.H.: Shape estimation in natural illumination. In: IEEE Conference on Computer Vision and Pattern Recognition. (2011)
  • [5] Oxholm, G., Nishino, K.: Shape and reflectance from natural illumination. In: European Conference on Computer Vision. (2012)
  • [6] Snavely, N., Seitz, S.M., Szeliski, R.: Modeling the world from internet photo collections. International Journal of Computer Vision 80(2) (2008) 189–210
  • [7] Abrams, A., Hawley, C., Pless, R.: Heliometric stereo: Shape from sun position. In: European Conference on Computer Vision. (2012)
  • [8] Ackermann, J., Langguth, F., Fuhrmann, S., Goesele, M.: Photometric stereo for outdoor webcams. In: IEEE Conference on Computer Vision and Pattern Recognition. (2012)
  • [9] Shen, F., Sunkavalli, K., Bonneel, N., Rusinkiewicz, S., Pfister, H., Tong, X.: Time-lapse photometric stereo and applications. Computer Graphics Forum 33(7) (2014) 359–367
  • [10] Hold-Geoffroy, Y., Zhang, J., Gotardo, P.F.U., Lalonde, J.F.: What is a good day for outdoor photometric stereo? In: International Conference on Computational Photography. (2015)
  • [11] Hold-Geoffroy, Y., Zhang, J., Gotardo, P.F.U., Lalonde, J.F.: x-hour outdoor photometric stereo. In: International Conference on 3D Vision. (2015)
  • [12] Jung, J., Lee, J.y., Kweon, I.S.: One-day outdoor photometric stereo via skylight estimation. In: IEEE Conference on Computer Vision and Pattern Recognition. (2015)
  • [13] Yu, L.F., Yeung, S.K., Tai, Y.W., Terzopoulos, D., Chan, T.F.: Outdoor photometric stereo. In: IEEE International Conference on Computational Photography. (2013)
  • [14] Shi, B., Mo, Z., Wu, Z., Duan, D., Yeung, S.K., Tan, P.: A benchmark dataset and evaluation for non-lambertian and uncalibrated photometric stereo. IEEE Transactions on Pattern Analysis and Machine Intelligence (2018) to appear.
  • [15] Onn, R., Bruckstein, A.: Integrability disambiguates surface recovery in two-image photometric stereo. International Journal of Computer Vision 5(1) (1990) 105–113
  • [16] Hernández, C., Vogiatzis, G., Cipolla, R.: Overcoming shadows in 3-source photometric stereo. IEEE Transactions on Pattern Analysis and Machine Intelligence 33(2) (feb 2011) 419–426
  • [17] Debevec, P.: Rendering synthetic objects into real scenes: bridging traditional and image-based graphics with global illumination and high dynamic range photography. In: Proceedings of ACM SIGGRAPH 1998. (1998) 189–198
  • [18] Shi, B., Inose, K., Matsushita, Y., Tan, P., Yeung, S.K., Ikeuchi, K.: Photometric stereo using internet images. In: International Conference on 3D Vision. (2014)
  • [19] Hung, C.H., Wu, T.P., Matsushita, Y., Xu, L., Jia, J., Tang, C.K.: Photometric stereo in the wild. In: IEEE Winter Conference on Applications of Computer Vision. (2015)
  • [20] Inose, K., Shimizu, S., Kawakami, R., Mukaigawa, Y., Ikeuchi, K.: Refining outdoor photometric stereo based on sky model. Information and Media Technologies 8(4) (dec 2013) 1095–1099
  • [21] Yu, Y., Smith, W.A.P.: PVNN: A Neural Network Library for Photometric Vision. In: IEEE International Conference on Computer Vision. (2017)
  • [22] Santo, H., Samejima, M., Sugano, Y., Shi, B., Matsushita, Y.: Deep photometric stereo network. In: Proceedings of the IEEE International Conference on Computer Vision. (2017)
  • [23] Taniai, T., Maehara, T.: Neural photometric stereo reconstruction for general reflectance surfaces. arXiv preprint arXiv:1802.10328 (2018)
  • [24] Barron, J.T., Malik, J.: Shape, illumination, and reflectance from shading. IEEE Transactions on Pattern Analysis and Machine Intelligence 37(8) (August 2015) 1670–1687
  • [25] Eigen, D., Fergus, R.: Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture. In: IEEE International Conference on Computer Vision. (2015)
  • [26] Shu, Z., Yumer, E., Hadap, S., Sunkavalli, K., Shechtman, E., Samaras, D.: Neural face editing with intrinsic image disentangling. In: IEEE Conference on Computer Vision and Pattern Recognition. (2017)
  • [27] Wu, J., Wang, Y., Xue, T., Sun, X., Freeman, W.T., Tenenbaum, J.B.: MarrNet: 3D shape reconstruction via 2.5D sketches. In: Advances in Neural Information Processing Systems. (2017)
  • [28] Hošek, L., Wilkie, A.: An analytic model for full spectral sky-dome radiance. ACM Transactions on Graphics 31(4) (2012) 1–9
  • [29] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: IEEE Conference on Computer Vision and Pattern Recognition. (2016)
  • [30] Clevert, D.A., Unterthiner, T., Hochreiter, S.: Fast and accurate deep network learning by exponential linear units (ELUs). In: International Conference on Learning Representations. (2016)
  • [31] Ioffe, S., Szegedy, C.: Batch normalization: Accelerating deep network training by reducing internal covariate shift. In: International Conference on Machine Learning. (2015)
  • [32] Kingma, D., Ba, J.: Adam: A method for stochastic optimization. In: International Conference on Learning Representations. (2015)
  • [33] Curless, B., Levoy, M.: A volumetric method for building complex models from range images. In: Computer Graphics (SIGGRAPH Proceedings). (1996)
  • [34] Walter, B., Marschner, S., Li, H., Torrance, K.: Microfacet models for refraction through rough surfaces. Eurographics (2007)
  • [35] Stumpfel, J., Jones, A., Wenger, A., Tchou, C., Hawkins, T., Debevec, P.: Direct HDR capture of the sun and sky. In: Proceedings of AFRIGRAPH. (2004)
  • [36] Lalonde, J.F., Asselin, L.P., Becirovski, J., Hold-Geoffroy, Y., Garon, M., Gardner, M.A., Zhang, J.: The Laval HDR sky database. (2016)
  • [37] Bretagnon, P., Francou, G.: Planetary theories in rectangular and spherical variables-vsop 87 solutions. Astronomy and Astrophysics 202 (1988) 309–315
  • [38] Esty, W.W., Banfield, J.D.: The Box-Percentile Plot. Journal Of Statistical Software 8 (2003) 1–14
  • [39] Smilkov, D., Thorat, N., Kim, B., Viégas, F., Wattenberg, M.: SmoothGrad: removing noise by adding noise. arXiv preprint arXiv:1706.03825 (2017)