CNN-PS: CNN-based Photometric Stereo for General Non-Convex Surfaces

by   Satoshi Ikehata, et al.

Most conventional photometric stereo algorithms inversely solve a BRDF-based image formation model. However, the actual imaging process is often far more complex due to global light transport on non-convex surfaces. This paper presents a photometric stereo network that directly learns the relationship between the photometric stereo input and the surface normals of a scene. To handle an unordered, arbitrary number of input images, we merge all the input data into an intermediate representation called the observation map, which has a fixed shape and can therefore be fed into a CNN. To improve both training and prediction, we take into account the rotational pseudo-invariance of the observation map, which is derived from the isotropic constraint. To train the network, we create a synthetic photometric stereo dataset generated by a physics-based renderer, so that global light transport is taken into account. Our experimental results on both synthetic and real datasets show that our method outperforms conventional BRDF-based photometric stereo algorithms, especially when scenes are highly non-convex.




1 Introduction

In 3-D computer vision problems, the input data is often unstructured (i.e., the number of input images varies and the images are unordered). A good example is the multi-view stereo problem, where the scene geometry is recovered from unstructured multi-view images. Due to this unstructuredness, 3-D reconstruction from multiple images has relied less on supervised learning-based algorithms, except for some structured problems such as binocular stereopsis [1] and two-view SfM [2], whose number of input images is always fixed. However, recent advances in deep convolutional neural networks (CNNs) have motivated researchers to address unstructured 3-D computer vision problems with deep neural networks. For instance, a recent work by Kar et al. [3] presented an end-to-end learned system for multi-view stereopsis, while Kim et al. [4] presented learning-based surface reflectance estimation from multiple RGB-D images. Both works merged all the unstructured input into a structured, intermediate representation (i.e., a 3-D feature grid [3] and a 2-D hemispherical image [4]).

This work was supported by JSPS KAKENHI Grant Number JP17H07324.

Photometric stereo is another 3-D computer vision problem whose input is unstructured, where the surface normals of a scene are recovered from appearance variations under different illuminations. Photometric stereo algorithms typically solve an inverse problem of a pointwise image formation model based on the Bidirectional Reflectance Distribution Function (BRDF). While effective, a BRDF-based image formation model generally cannot account for global illumination effects such as shadows and inter-reflections, which often make it difficult to recover non-convex surfaces. Some algorithms attempted robust outlier rejection to suppress the non-Lambertian effects [5, 6, 7, 8]; however, the estimation failed when the non-Lambertian observations were dominant. This limitation is inevitable because multiple interactions between light and a surface are difficult to model in a mathematically tractable form.

To tackle this issue, this paper presents an end-to-end CNN-based photometric stereo algorithm that learns the relationship between surface normals and their appearances without physically modeling the image formation process. For better scalability, our approach is still pixelwise and inherits from conventional robust approaches [5, 6, 7, 8]: we train a network that automatically “neglects” the global illumination effects and estimates the surface normal from the “inliers” in the observation. To achieve this goal, we train our network on as many synthetic input patterns “corrupted” by global effects as possible. Images are rendered with different complex objects under diverse material and illumination conditions.

Our challenge is to apply a deep neural network to the photometric stereo problem, whose input is unstructured. Similar to recent works [3, 4], we merge all the photometric stereo data into an intermediate representation called the observation map, which has a fixed shape and can therefore be fed naturally to a standard CNN. Like many photometric stereo algorithms, our work is primarily concerned with isotropic materials, whose reflections are invariant under rotation about the surface normal. We will show that this isotropy can be exploited in the form of the rotational pseudo-invariance of the observation map, both for augmenting the input data and for reducing the prediction errors. To train the network, we create a synthetic photometric stereo dataset (CyclesPS) by leveraging the physics-based Cycles renderer [9] to simulate complex global light transport. To cover diverse real-world materials, we adopt Disney’s principled BSDF [10], which was proposed for artists to render various scenes by controlling a small number of parameters.

We evaluate our algorithm on the DiLiGenT Photometric Stereo Dataset [11] which is a real benchmark dataset containing images and calibrated lightings. We compare our method against conventional photometric stereo algorithms [12, 13, 14, 15, 5, 6, 16, 7, 17, 18, 8, 19, 20, 21] and show that our end-to-end learning-based algorithm most successfully recovers the non-convex, non-Lambertian surfaces among all the algorithms concerned.

Our contributions are summarized as follows:
(1) We are the first to propose a supervised CNN-based calibrated photometric stereo algorithm that takes unstructured images and lighting information as input.
(2) We present a synthetic photometric stereo dataset (CyclesPS) with a careful injection of global illumination effects such as cast shadows and inter-reflections.
(3) Our extensive evaluation shows that our method performs best on the DiLiGenT benchmark dataset [11] among various conventional algorithms, especially when the surfaces are highly non-convex and non-Lambertian.

Henceforth we rely on the classical assumptions on the photometric stereo problem (i.e., fixed, linear orthographic camera and known directional lighting).

2 Related Work

Diverse appearances of real-world objects can be encoded by a BRDF $\rho$, which relates the observed intensity $I_j$ to the associated surface normal $\mathbf{n} \in \mathbb{R}^3$, the $j$-th incoming lighting direction $\mathbf{l}_j \in \mathbb{R}^3$, its intensity $L_j \in \mathbb{R}$, and the outgoing viewing direction $\mathbf{v} \in \mathbb{R}^3$ via

$$I_j = L_j \, \rho(\mathbf{n}, \mathbf{l}_j, \mathbf{v}) \max(\mathbf{n}^\top \mathbf{l}_j, 0) + \epsilon_j, \tag{1}$$

where $\max(\mathbf{n}^\top \mathbf{l}_j, 0)$ accounts for attached shadows and $\epsilon_j$ is an additive error to the model. Eq. (1) is generally called the image formation model. Most photometric stereo algorithms assume a specific shape of $\rho$ and recover the surface normals of a scene by inversely solving Eq. (1) from a collection of observations under $m$ different lighting conditions ($j \in \{1, \dots, m\}$). All the effects that are not represented by a BRDF (image noise, cast shadows, inter-reflections and so on) are typically put together in $\epsilon_j$. Note that when the BRDF is Lambertian and the additive error is removed, Eq. (1) simplifies to the traditional Lambertian image formation model [12].
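To make the Lambertian special case concrete, the following is a minimal sketch of the classical least-squares solve for a single pixel. It is an illustration under stated assumptions, not the implementation used for the BASELINE in this paper: the function name `lambertian_photometric_stereo`, the synthetic lights, and the albedo value are all hypothetical, and attached shadows (the max term) are ignored so that the model stays linear.

```python
import numpy as np

def lambertian_photometric_stereo(intensities, light_dirs, light_ints=None):
    """Least-squares normal estimate for one pixel under a Lambertian model.

    intensities: (m,) observed values I_j
    light_dirs:  (m, 3) unit lighting directions l_j
    light_ints:  (m,) light intensities L_j (treated as 1 if omitted)
    """
    I = np.asarray(intensities, dtype=float)
    L = np.asarray(light_dirs, dtype=float)
    if light_ints is not None:
        I = I / np.asarray(light_ints, dtype=float)
    # Solve L @ b = I in the least-squares sense, where b = albedo * n.
    b, *_ = np.linalg.lstsq(L, I, rcond=None)
    albedo = np.linalg.norm(b)
    normal = b / albedo if albedo > 0 else b
    return normal, albedo

# Synthetic check (hypothetical data): one Lambertian point, no shadow clamping.
rng = np.random.default_rng(0)
true_n = np.array([0.2, -0.3, 0.9])
true_n /= np.linalg.norm(true_n)
lights = rng.normal(size=(20, 3))
lights[:, 2] = np.abs(lights[:, 2]) + 1.0      # push lights toward the camera side
lights /= np.linalg.norm(lights, axis=1, keepdims=True)
obs = 0.7 * lights @ true_n                    # albedo 0.7, max() term omitted
est_n, est_albedo = lambertian_photometric_stereo(obs, lights)
```

With noise-free linear data the estimate matches the true normal; once shadows, specularities or inter-reflections enter the observations, this solve degrades, which is the failure mode the rest of the paper targets.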

Since Woodham first introduced the Lambertian photometric stereo algorithm, extending it to non-Lambertian scenes has been a problem of significant interest. Photometric stereo approaches for dealing with non-Lambertian effects fall mainly into four classes: (a) robust approaches, (b) reflectance modeling with non-Lambertian BRDFs, (c) example-based reflectance modeling and (d) learning-based approaches.

Many photometric stereo algorithms recover the surface normals of a scene via simple diffuse reflectance modeling (e.g., Lambertian) while treating other effects as outliers. For instance, Wu et al. [5] proposed a rank-minimization based approach that decomposes images into a low-rank Lambertian image and sparse non-Lambertian corruptions. Ikehata et al. extended their method by constraining the rank-3 Lambertian structure [6] (or the general diffuse structure [7]) for better computational stability. Recently, Queau et al. [8] presented a robust variational approach for inaccurate lighting as well as various non-Lambertian corruptions. While effective, a drawback of this approach is that the estimation fails when dense diffuse inliers are not available.

Despite their computational complexity, various algorithms employ parametric or non-parametric models of non-Lambertian BRDFs. In recent years, there has been an emphasis on representing a material with a small number of fundamental BRDFs. Goldman et al. [22] approximated each fundamental BRDF by the Ward model [23], and Alldrin et al. [13] later extended it to a non-parametric representation. Since the high-dimensional ill-posed problem may make the estimation unstable, Shi et al. [18] presented a compact biquadratic representation of isotropic BRDFs. On the other hand, Ikehata et al. [17] introduced the sum-of-lobes isotropic reflectance model [24] to account for all frequencies in isotropic observations. To improve the efficiency of the optimization, Shen et al. [25] presented a kernel regression approach, which can be transformed into an eigendecomposition problem. This approach works well as long as the resulting image formation model is correct and free of model outliers.

A few photometric stereo algorithms are grouped into the example-based approach, which takes advantage of the surface reflectance of objects with known shape, captured under the same illumination environment as the target scene. The earliest example-based approach [26] requires a reference object whose material is exactly the same as that of the target object. Hertzmann et al. [27] eased this restriction to handle uncalibrated scenes and spatially varying materials by assuming that materials can be expressed as a small number of basis materials. Recently, Hui et al. [20] presented an example-based method without a physical reference object by taking advantage of virtual spheres rendered with various materials. While effective, this approach also suffers from model outliers and has the drawback that the lighting configuration of the reference scene must be reproduced at the target scene.

Machine learning techniques have been applied in a few very recent photometric stereo works [21, 19]. Santo et al. [19] presented a supervised learning-based photometric stereo method using a neural network that takes as input a normalized vector where each element corresponds to an observation under a specific illumination. A surface normal is predicted by feeding the vector to one dropout layer followed by six dense layers. While effective, this method has the limitation that the lightings must remain the same between the training and test phases, making it inapplicable to unstructured input. Another work by Taniai and Maehara [21] presented an unsupervised learning framework where surface normals and BRDFs are predicted by a network trained by minimizing the reconstruction loss between observed images and images synthesized with a rendering equation. While their network is invariant to the number and permutation of the images, the rendering equation is still based on a point-wise BRDF and is intolerant to model outliers. Furthermore, they reported a slow running time (i.e., 1 hour to do 1000 SGD iterations for each scene) due to its self-supervised nature.

In summary, there is still a constant struggle in the design of the photometric stereo algorithm among its complexity, efficiency, stability and robustness. Our goal is to solve this dilemma. Our end-to-end learning-based algorithm builds upon the deep CNN trained on synthetic datasets, abandoning the modeling of complicated image formation process. Our network accepts the unstructured input (i.e., our network is invariant to both number and order of input images) and works for various real-world scenes where non-Lambertian reflections are intermingled with global illumination effects.

3 Proposed Method

Our goal is to recover the surface normals of a scene (a) of spatially-varying isotropic materials, (b) with global illumination effects (e.g., shadows and inter-reflections), and (c) illuminated by an unknown number of lights. To achieve this goal, we propose a CNN architecture for the calibrated photometric stereo problem which is invariant to both the number and the order of input images. The tolerance to global illumination effects is learned from synthetic images of non-convex scenes rendered with a physics-based renderer.

3.1 2-D observation map for unstructured photometric stereo input

Figure 1: We project pairs of images and lightings to a fixed-size observation map based on the bijective mapping of a light direction from a hemisphere to the 2-D coordinate system perpendicular to the viewing axis. This figure shows observation maps for (a) a point on a smooth convex surface and (b) a point on a rough non-convex surface. We also projected the true surface normal at the point onto the same coordinate system of the observation map for reference.

We first present the observation map, which is generated by a pixelwise hemispherical projection of observations based on known lighting directions. Since a lighting direction $\mathbf{l} = [l_x\ l_y\ l_z]^\top$ is a vector on a unit hemisphere, there is a bijective mapping from $\mathbf{l} \in \mathbb{R}^3$ to $[l_x\ l_y]^\top \in \mathbb{R}^2$ (s.t. $l_x^2 + l_y^2 + l_z^2 = 1$), obtained by projecting the vector onto the $x$-$y$ coordinate system, which is perpendicular to the viewing direction ($\mathbf{v} = [0\ 0\ 1]^\top$). (Footnote 2: We preliminarily tried the projection onto the spherical coordinate system, but the performance was worse than with the standard x-y coordinate system.) Then we define a $w \times w$ observation map $O$ as

$$O_{\mathrm{int}(w (l_x + 1)/2),\ \mathrm{int}(w (l_y + 1)/2)} = \alpha \, I_j / L_j \quad \forall j \in \{1, \dots, m\}, \tag{2}$$

where $l_x$ and $l_y$ are the components of the $j$-th lighting direction, “int” is an operator that rounds a floating value to an integer and $\alpha$ is a scaling factor that normalizes the data. Once all the observations and lightings are stored in the observation map, we take it as the input of the CNN. Despite its simplicity, this representation has three major benefits. First, its shape is independent of the number and size of input images. Second, the projection of observations is order-independent (i.e., the observation map does not change when swapping the $i$-th and $j$-th images). Third, it is unnecessary to explicitly feed the lighting information into the network.
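The projection described above can be sketched in a few lines of NumPy. This is a minimal sketch under stated assumptions rather than the authors' code: the function name `observation_map`, the default size `w=32`, the use of truncation for the “int” operator, and the normalization by the largest observation are all illustrative choices, since the exact scaling factor is not specified here.

```python
import numpy as np

def observation_map(intensities, light_dirs, light_ints=None, w=32):
    """Project per-pixel observations onto a fixed-size w x w map.

    intensities: (m,) observations of one pixel under m lights
    light_dirs:  (m, 3) unit lighting directions with l_z >= 0
    light_ints:  (m,) light intensities (treated as 1 if omitted)
    """
    I = np.asarray(intensities, dtype=float)
    L = np.asarray(light_dirs, dtype=float)
    if light_ints is not None:
        I = I / np.asarray(light_ints, dtype=float)
    peak = I.max()
    if peak > 0:
        I = I / peak  # assumed normalization: largest entry becomes 1
    # Map l_x, l_y in [-1, 1] to integer cell indices in [0, w - 1].
    cols = np.clip((w * (L[:, 0] + 1.0) / 2.0).astype(int), 0, w - 1)
    rows = np.clip((w * (L[:, 1] + 1.0) / 2.0).astype(int), 0, w - 1)
    omap = np.zeros((w, w))
    omap[rows, cols] = I  # lights falling into the same cell overwrite each other
    return omap

# Order independence: permuting the images leaves the map unchanged.
lights = np.array([[0.0, 0.0, 1.0],
                   [0.6, 0.0, 0.8],
                   [0.0, -0.6, 0.8],
                   [-0.36, 0.48, 0.8]])
vals = np.array([1.0, 0.5, 0.25, 0.0])
perm = [2, 0, 3, 1]
map_a = observation_map(vals, lights)
map_b = observation_map(vals[perm], lights[perm])
```

The map shape depends only on `w`, so the same network input size is obtained for 4 images or for several hundred, and zero entries remain zero, which matters because they may encode shadows.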

Fig. 1 illustrates examples of the observation map for two objects, SPHERE and PAPERBOWL; one is purely convex and the other is highly non-convex. Fig. 1-(a) indicates that the target point could be on a convex surface, since the values of the observation map gradually decrease to zero as the light direction moves away from the true surface normal. The local concentration of large intensity values also indicates a narrow specularity on a smooth surface. On the other hand, the abrupt change of values in Fig. 1-(b) is evidence of cast shadows or inter-reflections on a non-convex surface. Since there is no local concentration of intensity values, the surface is likely to be rough. In this way, an observation map reasonably encodes the geometry, the material and the behavior of light around a surface point.

Figure 2: (a) Isotropy guarantees that the appearance of a surface seen from $\mathbf{v}$ is invariant to the rotation of $\mathbf{l}$ and $\mathbf{n}$ around the view axis. (b) Our network architecture is a variation of DenseNet [28] that outputs a normalized surface normal from an observation map. The number of filters is shown below each layer.

3.2 Rotation pseudo-invariance for the isotropy constraint

An observation map is sparse in a general photometric stereo setup (e.g., assuming a $32 \times 32$ observation map ($w = 32$) and 100 input images, the ratio of non-zero entries in the map is about $100 / 32^2 \approx 10\%$). Missing data is generally considered problematic as CNN input and is often interpolated [4]. However, we empirically found that smoothly interpolating missing entries degrades the performance, since an observation map is often non-smooth and zero values have an important meaning (i.e., shadows). Therefore, we instead try to improve the performance by taking into account the isotropy of the material.

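Many real-world materials exhibit the same appearance when the surface is rotated about the surface normal. This behavior is referred to as isotropy [29, 30]. Isotropic BRDFs are parameterized in terms of three values instead of four [31] as

$$\rho(\mathbf{n}, \mathbf{l}, \mathbf{v}) = f(\mathbf{n}^\top \mathbf{l},\ \mathbf{n}^\top \mathbf{v},\ \mathbf{l}^\top \mathbf{v}), \tag{3}$$

where $f$ is an arbitrary reflectance function. (Footnote 3: Note that there are other parameterizations of an isotropic BRDF [32].) Combining Eq. (3) with Eq. (1), we get the following image formation model:

$$I = L \, f(\mathbf{n}^\top \mathbf{l},\ \mathbf{n}^\top \mathbf{v},\ \mathbf{l}^\top \mathbf{v}) \max(\mathbf{n}^\top \mathbf{l}, 0). \tag{4}$$

Note that the lighting index and the model error are omitted for brevity. Consider the rotation of the surface normal and the lighting direction around the z-axis (i.e., the viewing axis) as $\mathbf{n}' = [(R\,[n_x\ n_y]^\top)^\top\ n_z]^\top$ and $\mathbf{l}' = [(R\,[l_x\ l_y]^\top)^\top\ l_z]^\top$, where $\mathbf{n} = [n_x\ n_y\ n_z]^\top$, $\mathbf{l} = [l_x\ l_y\ l_z]^\top$ and $R \in SO(2)$ is an arbitrary rotation matrix. Then, since $R^\top R$ is the identity and $\mathbf{v} = [0\ 0\ 1]^\top$,

$$\mathbf{n}'^\top \mathbf{l}' = [n_x\ n_y]\, R^\top R\, [l_x\ l_y]^\top + n_z l_z = \mathbf{n}^\top \mathbf{l}, \tag{5}$$

$$\mathbf{n}'^\top \mathbf{v} = n_z = \mathbf{n}^\top \mathbf{v}, \tag{6}$$

$$\mathbf{l}'^\top \mathbf{v} = l_z = \mathbf{l}^\top \mathbf{v}. \tag{7}$$

Feeding them into Eq. (4) gives the following equation:

$$I' = L \, f(\mathbf{n}'^\top \mathbf{l}',\ \mathbf{n}'^\top \mathbf{v},\ \mathbf{l}'^\top \mathbf{v}) \max(\mathbf{n}'^\top \mathbf{l}', 0) = L \, f(\mathbf{n}^\top \mathbf{l},\ \mathbf{n}^\top \mathbf{v},\ \mathbf{l}^\top \mathbf{v}) \max(\mathbf{n}^\top \mathbf{l}, 0) = I. \tag{8}$$

Therefore, the rotation of the lighting and the surface normal around the z-axis does not change the appearance, as illustrated in Fig. 2-(a). Note that this result holds even for indirect illumination in non-convex scenes, provided that all the geometry and the environment illumination are rotated around the viewing axis. This result is important for our CNN-based algorithm. Suppose that a neural network is a mapping function $g$ that maps $x$ (i.e., a set of images and lightings) to $g(x)$ (i.e., a surface normal), and that $r$ is a rotation operator that rotates lightings or normals by the same angle around the z-axis. From Eq. (8), we get $r(g(x)) = g(r(x))$. We call this relationship rotational pseudo-invariance (the standard rotation invariance is $g(x) = g(r(x))$). Note that this rotational pseudo-invariance also applies to the observation map, since the rotation of lightings around the viewing axis results in the rotation of the observation map around the z-axis. (Footnote 4: Strictly speaking, we rotate the lighting directions instead of the observation map itself. Therefore, we do not suffer from the boundary issue of standard rotational data augmentation.)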
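The invariance argued above can be checked numerically. The sketch below is only an illustration: `toy_isotropic_f` is a hypothetical reflectance (a constant term plus a half-vector lobe written purely in terms of the three dot products), and the chosen normal, light and angle are arbitrary.

```python
import numpy as np

def rotate_about_view_axis(vec, theta):
    """Rotate a 3-vector (or the rows of an (m, 3) array) by theta about the z-axis."""
    c, s = np.cos(theta), np.sin(theta)
    R = np.array([[c, -s, 0.0],
                  [s, c, 0.0],
                  [0.0, 0.0, 1.0]])
    return np.asarray(vec, dtype=float) @ R.T

def toy_isotropic_f(nl, nv, lv):
    """Hypothetical isotropic reflectance depending only on n.l, n.v and l.v."""
    # n.h for the half vector h = (l + v) / |l + v|, using |l + v|^2 = 2 + 2 l.v.
    nh = (nl + nv) / np.sqrt(2.0 + 2.0 * lv)
    return 0.5 + 2.0 * max(nh, 0.0) ** 20

def intensity(n, l, v, light_intensity=1.0):
    """Isotropic image formation: L * f(n.l, n.v, l.v) * max(n.l, 0)."""
    nl, nv, lv = n @ l, n @ v, l @ v
    return light_intensity * toy_isotropic_f(nl, nv, lv) * max(nl, 0.0)

v = np.array([0.0, 0.0, 1.0])
n = np.array([0.3, 0.2, 0.9])
n /= np.linalg.norm(n)
l = np.array([-0.4, 0.5, 0.7])
l /= np.linalg.norm(l)

theta = np.radians(37.0)
n_rot = rotate_about_view_axis(n, theta)
l_rot = rotate_about_view_axis(l, theta)

I_before = intensity(n, l, v)
I_after = intensity(n_rot, l_rot, v)  # same value up to floating-point error
```

Because the viewing direction lies on the rotation axis, all three dot products are preserved, so any reflectance of this form yields the same intensity before and after the joint rotation.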

We constrain the network with the rotational pseudo-invariance in a manner similar to how rotation invariance is usually achieved. Within the CNN framework, two approaches are generally adopted to encode rotation invariance. One is applying rotations to the input image [33] and the other is applying rotations to the convolution kernels [34]. We adopt the first strategy due to its simplicity. Concretely, we augment the training set with many rotated versions of the lightings and surface normals, which allows the network to learn the invariance without explicitly enforcing it. In our implementation, we rotate the vectors at regular intervals from 0 to 360 degrees.

Figure 3: The illustration of the prediction module. For each surface point, we generate observation maps taking into account the rotational pseudo-invariance. Each observation map is fed into the network and all the output normals are averaged.

3.3 Architecture details

In this section, we describe the framework of training and prediction. Given images and lightings, we produce observation maps following Eq. (2). Data is augmented to achieve the rotational pseudo-invariance by rotating both lighting and surface normal vectors around the viewing axis. Note that a color image is converted to a gray-scale image. The size of the observation map ($w$) should be chosen carefully. As $w$ increases, the observation map becomes sparser. On the other hand, a smaller observation map has less representability. Considering this trade-off, we empirically found that $w = 32$ is a reasonable choice (among the sizes we tried, it showed the best performance when the number of images is less than one thousand).

A variation of the densely connected convolutional neural network (DenseNet [28]) architecture is used to estimate a surface normal from an observation map. The network architecture is shown in Fig. 2-(b). The network includes two 2-layer dense blocks, each layer of which consists of one activation layer (relu), one convolution layer and a dropout layer (drop), with a concatenation from the previous layers. Between the two dense blocks, there is a transition layer that changes the feature-map size via convolution and pooling. We do not insert batch normalization layers, which were found to degrade the performance in our experiments. After the dense blocks, the network has two dense layers followed by one normalization layer that converts a feature to a unit vector. The network is trained with a simple mean squared loss between the predicted and ground truth surface normals. The loss function is minimized using the Adam solver [35]. We should note that since our input data size is relatively small (i.e., a single $w \times w$ observation map), the choice of the network architecture is not a critical component in our framework. (Footnote 5: We compared the AlexNet, VGG-Net and DenseNet architectures as well as much simpler architectures with only two or three convolutional layers and dense layer(s). Among the architectures we tested, the current architecture was slightly better.)

The prediction module is illustrated in Fig. 3. Given observation maps, we predict surface normals with the trained network. Since it is practically impossible to train a perfectly rotational pseudo-invariant network, the estimated surface normals for differently rotated observation maps are not identical (typically, the difference in angular errors between any two rotations was less than 10%-20% of their average). To further enforce the rotational pseudo-invariance, we again augment the input data by rotating the lighting vectors by several angles and then merge the outputs into one. Suppose the surface normal $\mathbf{n}_{\theta_k}$ is the prediction from the input data rotated by $\theta_k$ around the viewing axis, and $R(\theta)$ denotes the rotation by $\theta$ around that axis; then we simply average the inversely rotated surface normals over $K$ rotations as follows:

$$\bar{\mathbf{n}} = \frac{1}{K} \sum_{k=1}^{K} R(-\theta_k)\, \mathbf{n}_{\theta_k}. \tag{9}$$
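The merging step can be sketched as follows. This is an illustration under assumptions, not the released implementation: `biased_predictor` is a hypothetical stand-in for the trained network (a least-squares fit with a fixed directional bias, so it is deliberately not pseudo-invariant), the angles are assumed to be equally spaced over a full turn, and the final renormalization is an added convenience.

```python
import numpy as np

def rotate_z(vec, theta):
    """Rotate a 3-vector (or the rows of an (m, 3) array) by theta about the z-axis."""
    c, s = np.cos(theta), np.sin(theta)
    R = np.array([[c, -s, 0.0],
                  [s, c, 0.0],
                  [0.0, 0.0, 1.0]])
    return np.asarray(vec, dtype=float) @ R.T

def predict_with_rotational_merging(predict_fn, intensities, light_dirs, K=10):
    """Average the predictions for K rotated inputs after undoing each rotation."""
    merged = np.zeros(3)
    for k in range(K):
        theta = 2.0 * np.pi * k / K
        n_theta = predict_fn(intensities, rotate_z(light_dirs, theta))
        merged += rotate_z(n_theta, -theta)  # inverse rotation before averaging
    merged /= K
    return merged / np.linalg.norm(merged)

def biased_predictor(intensities, light_dirs):
    """Hypothetical stand-in for the network: least-squares fit plus a fixed bias."""
    b, *_ = np.linalg.lstsq(light_dirs, intensities, rcond=None)
    return b + np.array([0.05, 0.0, 0.0])

true_n = np.array([0.2, -0.3, 0.9])
true_n /= np.linalg.norm(true_n)
lights = np.array([[0.0, 0.0, 1.0],
                   [0.6, 0.0, 0.8],
                   [0.0, 0.6, 0.8],
                   [-0.6, 0.0, 0.8],
                   [0.0, -0.6, 0.8]])
obs = 0.7 * lights @ true_n  # noise-free Lambertian observations, albedo 0.7

single = biased_predictor(obs, lights)
single /= np.linalg.norm(single)
merged = predict_with_rotational_merging(biased_predictor, obs, lights, K=10)
```

In this toy setting the bias that is fixed in the input frame cancels out after the inverse rotations are averaged, which mirrors why merging rotated predictions can reduce errors that are not consistent across rotations.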


3.4 Training dataset (CyclesPS dataset)

In this section, we present our CyclesPS training dataset. DiLiGenT [11], the largest real photometric stereo dataset, contains only ten scenes with a fixed lighting configuration. Some works [18, 17, 19] attempted to synthesize images with the MERL BRDF database [29]; however, only one hundred measured BRDFs cannot cover the tremendous variety of real-world materials. Therefore, we decided to create our own training dataset that has diverse materials, geometries and illumination.

For rendering scenes, we collected high-quality 3-D models under royalty-free licenses from the internet. (Footnote 6: References to each 3-D model are included in the supplementary.) We carefully chose fifteen models for training and three models for testing whose surface geometry is sufficiently complex to cover a diverse surface normal distribution. Note that we empirically found that the 3-D models in ShapeNet [36], which was used in a previous work [4], are generally too simple (e.g., models are often low-polygonal and mostly planar) to train the network.

Figure 4: (a) The range of each parameter in the principled BSDF [10] is restricted by three different material configurations (Diffuse, Specular, Metallic). (b) The material parameters are passed to the renderer in the form of a 2-D texture map.

The representation of the reflectance is also important for making the network robust to a wide variety of real-world materials. Due to its representability, we choose Disney’s principled BSDF [10], which integrates five different BRDFs controlled by eleven parameters (baseColor, subsurface, metallic, specular, specularTint, roughness, anisotropic, sheen, sheenTint, clearcoat, clearcoatGloss). Since our target is isotropic materials without subsurface scattering, we neglect parameters such as subsurface and anisotropic. We also neglect specularTint, which artistically colorizes the specularity, and clearcoat and clearcoatGloss, which do not strongly affect the rendering results. While the principled BSDF is effective, we found that there are some unrealistic combinations of parameters that we want to skip (e.g., metallic = 1 and roughness = 0, or metallic = 0.5). To avoid those unrealistic parameters, we divide the entire parameter space into three categories: (a) Diffuse, (b) Specular and (c) Metallic. We generate three datasets individually and merge them evenly when training the network. The value of each parameter is randomly selected within a range specific to that parameter (see Fig. 4-(a)). To realize spatially varying materials, we divide the object region in the rendered image into superpixels (5000 for the training data) and use the same set of parameters for all pixels within a superpixel (see Fig. 4-(b)).

For simulating complex light transport, we use the Cycles [9] renderer bundled in Blender [37]. The orthographic camera and the directional light are specified. For each rendering, we choose a set of an object, BSDF parameter maps (one for each parameter), and a lighting configuration (i.e., roughly 1300 lights are uniformly distributed on the hemisphere, and small random noise is then added to each light). Once the images are rendered, we create the CyclesPS dataset by generating observation maps pixelwise. To make the network robust to test data with any number of images, observation maps are generated from a pixelwise different number of images. Concretely, when generating an observation map, we pick a random subset of images whose number lies within a fixed range and whose corresponding light elevation angle is more than a random threshold value, itself drawn from a fixed range of degrees. (Footnote 7: The minimum number of images is 50 to avoid too sparse an observation map, and we only picked lights whose elevation angles were more than 20 degrees, since in practice a scene is rarely illuminated from the side.)
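The per-pixel subsampling of lights can be sketched as below. Only the lower bounds (at least 50 images, elevation above 20 degrees) and the figure of roughly 1300 lights come from the text; everything else is an assumption made for illustration, including the function name `sample_light_subset`, the 90-degree upper bound on the threshold, the redraw-on-failure loop, and the way the hemisphere lights are generated.

```python
import numpy as np

def sample_light_subset(light_dirs, rng, min_count=50,
                        elev_range_deg=(20.0, 90.0), max_tries=100):
    """Pick a random subset of lights above a random elevation threshold.

    Returns (indices, threshold_deg). The threshold is redrawn when fewer than
    min_count lights lie above it (an assumed policy, not taken from the paper).
    """
    L = np.asarray(light_dirs, dtype=float)
    elevation = np.degrees(np.arcsin(np.clip(L[:, 2], -1.0, 1.0)))
    for _ in range(max_tries):
        threshold = rng.uniform(*elev_range_deg)
        candidates = np.flatnonzero(elevation > threshold)
        if candidates.size >= min_count:
            # Subset size is uniform in [min_count, number of candidates].
            count = rng.integers(min_count, candidates.size + 1)
            chosen = rng.choice(candidates, size=count, replace=False)
            return chosen, threshold
    raise RuntimeError("not enough lights above the sampled thresholds")

# Roughly uniform lights on the upper hemisphere (illustrative generation).
rng = np.random.default_rng(1)
lights = rng.normal(size=(1300, 3))
lights[:, 2] = np.abs(lights[:, 2])
lights /= np.linalg.norm(lights, axis=1, keepdims=True)

idx, thr = sample_light_subset(lights, rng)
subset_dirs = lights[idx]  # these directions would feed one observation map
```

Drawing a different subset for every pixel exposes the network to many light counts and distributions during training, which is the stated purpose of this step.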

The training process takes 10 epochs for 150 image sets (i.e., 15 objects × 10 rotations for the rotational pseudo-invariance). Each image set contains around 50000 samples (i.e., the number of pixels in the object mask).

4 Experimental Results

Figure 5: Evaluation on the MERLSphere dataset. A sphere is rendered with the 100 measured BRDFs in the MERL BRDF database [29]. Our CNN-based method is compared against a model-based algorithm (IA14 [17]) based on the mean angular errors of the predicted surface normals in degrees. We also show some examples of rendered images and observation maps for further analysis (see Section 4.2).

We evaluate our method on synthetic and real datasets. All experiments were performed on a machine with three GeForce GTX 1080 Ti GPUs and 64GB RAM. For training and prediction, we use the Keras library with the TensorFlow backend and the default learning parameters. The training process took around 3 hours.

4.1 Datasets

We evaluated our method on three datasets, two are synthetic and one is real.

MERLSphere is a synthetic dataset where images are rendered with the one hundred isotropic BRDFs in the MERL database [29], ranging from diffuse to metallic. We generated 32-bit HDR images of a sphere with a ground truth surface normal map and a foreground mask. There are no cast shadows or inter-reflections.

CyclesPSTest is a synthetic dataset of three objects: SPHERE, TURTLE and PAPERBOWL. TURTLE and PAPERBOWL are non-convex objects where inter-reflections and cast shadows appear in the rendered images. This dataset was generated in the same manner as the CyclesPS training dataset, except that a different number of superpixels was used in the parameter map and the material condition was either Specular or Metallic (note that the objects and parameter maps in CyclesPSTest are NOT in CyclesPS). Each data item contains 16-bit integer images under 17 or 305 known uniform lightings.

DiLiGenT [11] is a public benchmark dataset of 10 real objects of general reflectance. Each data item provides 16-bit integer images with a resolution of $612 \times 512$ from 96 different known lighting directions. The ground truth surface normals for the orthographic projection and the single-view setup are also provided.

4.2 Evaluation on MERLSphere dataset

We compared our method (with the rotational merging of Eq. (9)) against one of the state-of-the-art isotropic photometric stereo algorithms (IA14 [17]) on the MERLSphere dataset. (Footnote 8: We used the authors’ implementation of [17] with the retro-reflection handling turned on. Attached shadows were removed by simple thresholding. Note that our method takes into account all the input information, unlike [17].) Since there are no global illumination effects, this experiment simply evaluates the ability of our network to represent a wide variety of materials compared to the sum-of-lobes BRDF [24] used in IA14. The results are illustrated in Fig. 5. We observed that our CNN-based algorithm performs comparably well, though not better than IA14, for most materials, which indicates that Disney’s principled BSDF [10] covers various real-world materials. We should note that, as was commented in [10], some very shiny materials, particularly the metals (e.g., chrome-steel and tungsten-carbide), exhibited asymmetric highlights suggestive of lens flare or perhaps anisotropic surface scratches. Since our network was trained on purely isotropic materials, these materials inevitably degrade the performance.
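All comparisons in this section are reported as mean angular errors in degrees. A minimal sketch of that metric is given below; the function name `mean_angular_error_deg` and the optional foreground `mask` argument are illustrative assumptions, not part of any benchmark toolkit.

```python
import numpy as np

def mean_angular_error_deg(pred, gt, mask=None):
    """Mean angle in degrees between predicted and ground truth normals.

    pred, gt: arrays of shape (..., 3); mask: optional boolean array matching
    the leading dimensions, selecting the pixels to evaluate.
    """
    pred = np.asarray(pred, dtype=float).reshape(-1, 3)
    gt = np.asarray(gt, dtype=float).reshape(-1, 3)
    if mask is not None:
        keep = np.asarray(mask, dtype=bool).reshape(-1)
        pred, gt = pred[keep], gt[keep]
    # Normalize both sets so the dot product equals the cosine of the angle.
    pred = pred / np.linalg.norm(pred, axis=1, keepdims=True)
    gt = gt / np.linalg.norm(gt, axis=1, keepdims=True)
    cosine = np.clip(np.sum(pred * gt, axis=1), -1.0, 1.0)
    return float(np.degrees(np.arccos(cosine)).mean())

gt_normals = np.array([[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]])
pred_normals = np.array([[0.0, 0.0, 2.0], [1.0, 0.0, 0.0]])
error = mean_angular_error_deg(pred_normals, gt_normals)  # angles of 0 and 90 degrees
```

Clipping the cosine guards against values slightly outside [-1, 1] caused by floating-point rounding, which would otherwise produce NaN from `arccos`.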

4.3 Evaluation on CyclesPSTest dataset

Table 1: Evaluation on the CyclesPSTest dataset. Each dataset is labeled with its number of input images and its material type, i.e., Specular (S) or Metallic (M) (see Fig. 4 for details). Each cell shows the average angular error in degrees.
Table 2: Evaluation on the DiLiGenT dataset. We show the angular errors averaged within each object and over all the objects. (*) Our method discarded the first 20 images of BEAR since they are corrupted (we explain this issue in the supplementary).

To evaluate the ability of our method to recover non-convex surfaces, we tested it on CyclesPSTest. Our method was compared against two robust algorithms, IW12 [6] and IW14 [7], two model-based algorithms, ST14 [18] and IA14 [17], and BASELINE [12]. (Footnote 9: We used the authors’ implementations of [6] and [7].) (Footnote 10: We used our own implementation of [18].) When running the algorithms other than ours, we discarded samples whose intensity values fell below a fixed threshold in the 16-bit integer images for shadow removal. In this experiment, we also studied the effects of the number of images and of the rotational merging in the prediction. (Footnote 11: We still augment the data by rotations in the training step.) Concretely, we tested our method on 17 or 305 images, with and without the rotational merging of Eq. (9). We show the results in Table 1 and Fig. 6. We observed that all the algorithms worked well on the convex specular SPHERE dataset. However, when the surfaces were non-convex, all the algorithms except ours failed in the estimation due to strong cast shadows and inter-reflections. It is interesting to see that even the robust algorithms (IW12 [6] and IW14 [7]) could not treat the global effects as outliers. We also observed that the rotational averaging based on the rotational pseudo-invariance consistently improved the accuracy, though only by a small margin.

4.4 Evaluation on DiLiGenT dataset

Figure 6: Recovered surface normals and error maps for (a) TURTLE and (b) PAPERBOWL of Specular material. Images were rendered under uniform 305 lightings.
Figure 7: Recovered surface normals and error maps for (a) HARVEST and (b) READING in the DiLiGenT dataset.

Finally, we present a side-by-side comparison on the DiLiGenT dataset [11]. We collected existing benchmark results for the calibrated photometric stereo algorithms [12, 13, 14, 15, 5, 6, 16, 7, 17, 18, 8, 19, 20, 21]. Note that we compared the mean angular errors of [12, 13, 14, 15, 5, 16, 17, 18] reported in [11], ones reported in their own works [19, 20, 21] and ones from our experiment using authors’ implementation [6, 7, 8] (footnote 12: as for [8], we used the default setting of their package except that we gave the camera intrinsics provided by [11] and changed the noise variance to zero).

The results are shown in Table 2. Due to the space limit, we only show the top-10 algorithms with respect to the overall mean angular error, together with BASELINE [12] (footnote 13: please find the full comparison in our supplementary). We observed that our method achieved the smallest error averaged over the 10 objects and the best scores for 6 of the 10 objects. It is worth noting that the other top-ranked algorithms [20, 21] are time-consuming, since HS17 [20] requires dictionary learning for every different light configuration and TM18 [21] needs unsupervised training for every estimation, while our inference time is less than five seconds (when ) for each dataset on CPU. Taking a close look at each object, Fig. 7 provides some important insights. HARVEST is the most non-convex scene in DiLiGenT, and other state-of-the-art algorithms (TM18 [21], IW14 [7], ST14 [18]) failed in the estimation of normals inside the “bag” due to strong shadows and inter-reflections. Our CNN-based method estimated much more reasonable surface normals there, thanks to the network being trained on the carefully created CyclesPS dataset. On the other hand, our method did not perform best (though not badly) on READING, which is another non-convex scene. Our analysis indicated that this is because of the inter-reflection of high-intensity narrow specularities that were rarely observed in our training dataset (narrow specularities appear only when roughness in the principled BSDF is near zero).
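Tables 1 and 2 report mean angular errors in degrees. As a minimal sketch of how such a score can be computed from an estimated and a ground-truth normal map, assuming an optional boolean object mask (the function name `mean_angular_error_deg` and its arguments are illustrative, not taken from the benchmark code):

```python
import numpy as np


def mean_angular_error_deg(estimated, ground_truth, mask=None):
    """Mean angle in degrees between two normal maps of shape (..., 3).

    mask: optional boolean array selecting the pixels to evaluate,
          e.g. the object region; background pixels are skipped.
    """
    est = np.asarray(estimated, dtype=float)
    gt = np.asarray(ground_truth, dtype=float)
    if mask is not None:
        m = np.asarray(mask, dtype=bool)
        est, gt = est[m], gt[m]

    # Normalize both sets of normals so the dot product is a cosine.
    est = est / np.linalg.norm(est, axis=-1, keepdims=True)
    gt = gt / np.linalg.norm(gt, axis=-1, keepdims=True)

    # Clip to guard against values slightly outside [-1, 1] from rounding.
    cos = np.clip(np.sum(est * gt, axis=-1), -1.0, 1.0)
    return float(np.degrees(np.arccos(cos)).mean())


est = np.array([[0.0, 0.0, 1.0], [1.0, 0.0, 0.0]])
gt = np.array([[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]])
err_all = mean_angular_error_deg(est, gt)                        # 0 and 90 degrees
err_masked = mean_angular_error_deg(est, gt, mask=[True, False])
```

An overall score across objects can then be obtained by averaging the per-object values, which is how the last column of a table like Table 2 would typically be formed.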

5 Conclusion

In this paper, we have presented a CNN-based photometric stereo method that works for various kinds of isotropic scenes with global illumination effects. By projecting photometric images and lighting information onto the observation map, unstructured information is naturally fed into the CNN. Our detailed experimental results have shown the state-of-the-art performance of our method on both synthetic and real data, especially when the surface is non-convex. Building a better training set for handling narrow inter-reflections is our future direction.


  • [1] Kendall, A., Martirosyan, H., Dasgupta, S., Henry, P., Kennedy, R., Bachrach, A., Bry, A.: End-to-end learning of geometry and context for deep stereo regression. Proc. ICCV (2017)
  • [2] Vijayanarasimhan, S., Ricco, S., Schmid, C., Sukthankar, R., Fragkiadaki, K.: Sfm-net: Learning of structure and motion from video. arXiv preprint arXiv:1704.07804 (2017)
  • [3] Kar, A., Häne, C., Malik, J.: Learning multi-view stereo machine. Proc. NIPS (2017)
  • [4] Kim, K., Gu, J., Tyree, S., Molchanov, P., Niessner, M., Kautz, J.: A lightweight approach for on-the-fly reflectance estimation. Proc. ICCV (2017)
  • [5] Wu, L., Ganesh, A., Shi, B., Matsushita, Y., Wang, Y., Ma, Y.: Robust photometric stereo via low-rank matrix completion and recovery. In: Proc. ACCV. (2010)
  • [6] Ikehata, S., Wipf, D., Matsushita, Y., Aizawa, K.: Robust photometric stereo using sparse regression. In: Proc. CVPR. (2012)
  • [7] Ikehata, S., Wipf, D., Matsushita, Y., Aizawa, K.: Photometric stereo using sparse bayesian regression for general diffuse surfaces. IEEE Trans. Pattern Anal. Mach. Intell. 36(9) (2014) 1816–1831
  • [8] Quéau, Y., Wu, T., Lauze, F., Durou, J.D., Cremers, D.: A non-convex variational approach to photometric stereo under inaccurate lighting. In: Proc. CVPR. (2017)
  • [9] Cycles.
  • [10] Burley, B.: Physically-based shading at disney, part of practical physically based shading in film and game production. SIGGRAPH 2012 Course Notes (2012)
  • [11] Shi, B., Mo, Z., Wu, Z., Duan, D., Yeung, S.K., Tan, P.: A benchmark dataset and evaluation for non-lambertian and uncalibrated photometric stereo. IEEE Trans. Pattern Anal. Mach. Intell. (2018) (to appear)
  • [12] Woodham, R.J.: Photometric method for determining surface orientation from multiple images. Opt. Eng. 19(1) (1980) 139–144
  • [13] Alldrin, N., Zickler, T., Kriegman, D.: Photometric stereo with non-parametric and spatially-varying reflectance. In: Proc. CVPR. (2008)
  • [14] Goldman, D.B., Curless, B., Hertzmann, A., Seitz, S.M.: Shape and spatially-varying brdfs from photometric stereo. IEEE Trans. Pattern Anal. Mach. Intell. 32(6) (2010) 1060–1071
  • [15] Higo, T., Matsushita, Y., Ikeuchi, K.: Consensus photometric stereo. In: Proc. CVPR. (2010)
  • [16] Shi, B., Tan, P., Matsushita, Y., Ikeuchi, K.: Elevation angle from reflectance monotonicity. In: Proc. ECCV. (2012)
  • [17] Ikehata, S., Aizawa, K.: Photometric stereo using constrained bivariate regression for general isotropic surfaces. In: Proc. CVPR. (2014)
  • [18] Shi, B., Tan, P., Matsushita, Y., Ikeuchi, K.: Bi-polynomial modeling of low-frequency reflectances. IEEE Trans. Pattern Anal. Mach. Intell. 36(6) (2014) 1078–1091
  • [19] Santo, H., Samejima, M., Sugano, Y., Shi, B., Matsushita, Y.: Deep photometric stereo network. In: International Workshop on Physics Based Vision meets Deep Learning (PBDL) in Conjunction with IEEE International Conference on Computer Vision (ICCV). (2017)
  • [20] Hui, Z., Sankaranarayanan, A.C.: Shape and spatially-varying reflectance estimation from virtual exemplars. IEEE Trans. Pattern Anal. Mach. Intell. 39(10) (2017) 2060–2073
  • [21] Taniai, T., Maehara, T.: Neural Inverse Rendering for General Reflectance Photometric Stereo. In: Proc. ICML. (2018)
  • [22] Goldman, D., Curless, B., Hertzmann, A., Seitz, S.: Shape and spatially-varying brdfs from photometric stereo. In: Proc. ICCV. (October 2005)
  • [23] Ward, G.: Measuring and modeling anisotropic reflection. Computer Graphics 26(2) (1992) 265–272
  • [24] Chandraker, M., Ramamoorthi, R.: What an image reveals about material reflectance. In: Proc. ICCV. (2011)
  • [25] Shen, H.L., Han, T.Q., Li, C.: Efficient photometric stereo using kernel regression. IEEE Transactions on Image Processing 26(1) (2017) 439–451
  • [26] Silver, W.M.: Determining shape and reflectance using multiple images. Master’s thesis, MIT (1980)
  • [27] Hertzmann, A., Seitz, S.: Example-based photometric stereo: shape reconstruction with general, varying brdfs. IEEE Trans. Pattern Anal. Mach. Intell. 27(8) (2005) 1254–1264
  • [28] Huang, G., Liu, Z., van der Maaten, L., Weinberger, K.Q.: Densely connected convolutional networks. In: Proc. CVPR. (2017)
  • [29] Matusik, W., Pfister, H., Brand, M., McMillan, L.: A data-driven reflectance model. ACM Trans. on Graph. 22(3) (2003) 759–769
  • [30] Alldrin, N., Kriegman, D.: Toward reconstructing surfaces with arbitrary isotropic reflectance: A stratified photometric stereo approach. In: Proc. ICCV. (2007)
  • [31] Stark, M., Arvo, J., Smits, B.: Barycentric parameterizations for isotropic brdfs. IEEE Trans. on Visualization and Computer Graphics 11(2) (2011) 126–138
  • [32] Montes, R., Urena, C.: An overview of brdf models. Technical report, LSI-2012-001 en Digibug Coleccion: TIC167 - Articulos (2012)
  • [33] Simard, P.Y., Steinkraus, D., Platt, J.C.: Best practices for convolutional neural networks applied to visual document analysis. In: Proc. ICDAR. (2003)
  • [34] Schmidt, U., Roth, S.: Learning rotation-aware features: From invariant priors to equivariant descriptors. In: Proc. CVPR. (2012)
  • [35] Kingma, D., Ba, J.: Adam: A method for stochastic optimization. In: Proc. ICLR. (2014)
  • [36] Chang, A.X., Funkhouser, T., Guibas, L., Hanrahan, P., Huang, Q., Li, Z., Savarese, S., Savva, M., Song, S., Su, H., Xiao, J., Yi, L., Yu, F.: ShapeNet: An Information-Rich 3D Model Repository. Technical Report arXiv:1512.03012 [cs.GR], Stanford University — Princeton University — Toyota Technological Institute at Chicago (2015)
  • [37] Blender.
  • [38] Chollet, F., et al.: Keras. (2015)