Physics-based Neural Networks for Shape from Polarization

03/25/2019 ∙ by Yunhao Ba, et al.

How should prior knowledge from physics inform a neural network solution? We study the blending of physics and deep learning in the context of Shape from Polarization (SfP). The classic SfP problem recovers an object's shape from polarized photographs of the scene. The SfP problem is special because the physical models are only approximate. Previous attempts to solve SfP have been purely model-based, and are susceptible to errors when real-world conditions deviate from the idealized physics. In our solution, there is a subtlety to combining physics and neural networks. Our final solution blends deep learning with synthetic renderings (derived from physics) in the framework of a two-stage encoder. The lessons learned from this exemplary problem foreshadow the future impact of physics-based learning.




1. Introduction

How can an uncertain physical prior be blended into a deep learning framework? We address this question by rethinking a classic computer vision problem for which the physics are approximate. The Shape from Polarization (SfP) problem involves the capture of polarized photographs of a scene to estimate the shape of an object. The motivation is easy to grasp: light reflecting off an object has a polarization state that relates to the object’s shape. This problem is interesting because the physics of polarized light reflections are idealized, leading to unusual forms of model mismatch. This special uncertainty in the physics-based prior makes it difficult to follow previous strategies of blending priors with deep learning (Karpatne et al., 2017; Le et al., 2017; Jin et al., 2017; Diamond et al., 2017; Stewart and Ermon, 2017; Li et al., 2018a; Pan et al., 2018; Che et al., 2018; Shi et al., 2018; Chen et al., 2018a; Goy et al., 2018a; Goy et al., 2018b). Figure 1 is conceptual, but reflects our observation that the suitability of a blending method depends on the relative robustness of the model versus the data.

Figure 1. Blending physical priors with deep learning requires a subtle touch. The fusion algorithm depends heavily on the quality of the physical prior. Here, we’ve selected a problem where the physics is highly approximate (shape from polarization). A multi-stream encoder is found to be a viable blending approach. Previous blending approaches, e.g. unrolled networks, have been used when the physical models are well-characterized.
Figure 2. Ordinary neural networks are unable to solve complicated model-based problems. Here, we use physics-based neural networks to address the shape from polarization (SfP) problem. SfP is a distinctive imaging problem with significant model-based uncertainty. We study SfP as a test case that highlights the importance of combining physical priors with neural networks.

We now expand on the unique uncertainties present in SfP, starting with the ambiguity problem. This problem arises because a linear polarizer cannot distinguish between polarized light that is rotated by $\pi$ radians. This results in two confounding estimates for the azimuth angle. Previous work in SfP has used additional information to constrain the ambiguity problem. For instance, (Smith et al., 2016) use both polarization and shading constraints as linear equations when solving object depth, and (Mahmoud et al., 2012) use shape from shading constraints to correct the ambiguities. Other authors assume surface convexity to constrain the azimuth angle (Miyazaki et al., 2003; Atkinson and Hancock, 2006). Yet another solution is to use a coarse depth map to constrain the ambiguity (Kadambi et al., 2015, 2017). Figure 3 compares the tradeoffs of our proposed technique with these alternatives.

Another physical challenge in SfP is the refractive problem. SfP requires knowledge of per-pixel refractive indices. Previous work has used hard-coded values to estimate the refractive index of scenes. This leads to a relative shape that is recovered with refractive distortion. Another physical challenge is the noise problem. SfP is ill-conditioned, requiring input images that are relatively noise-free. Ironically, a polarizer reduces the captured light intensity by 50 percent, worsening the effects of Poisson shot noise.
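The effect of the polarizer on shot noise can be made concrete with a short worked equation. This is a standard Poisson-statistics argument rather than a derivation from the paper; the symbol $N$ (mean photon count per pixel without the polarizer) is introduced here purely for illustration.

```latex
% Poisson shot noise: the variance equals the mean photon count N.
\mathrm{SNR}(N) = \frac{N}{\sqrt{N}} = \sqrt{N}
% A polarizer that passes half of the light leaves N/2 photons on average:
\mathrm{SNR}\!\left(\tfrac{N}{2}\right) = \sqrt{\tfrac{N}{2}}
  = \frac{\mathrm{SNR}(N)}{\sqrt{2}} \approx 0.71\,\mathrm{SNR}(N)
```

Under this idealized model, halving the light costs roughly 29 percent of the signal-to-noise ratio before any other noise source is considered.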

We address these SfP pitfalls by moving away from a physics-only solution, toward the realm of data-driven techniques. A reasonable first attempt could apply vanilla convolutional neural networks (CNN) to the SfP problem. Unfortunately, machine learning alone is not a satisfactory solution. As illustrated in Figure 2, a naive CNN implementation does not work even on the simplest of scenes. In contrast to prior work, we fuse both physics and deep learning in symbiosis. This hybrid approach outperforms previous SfP methods.

1.1. Contributions

In the context of prior work in SfP, this paper demonstrates two technical first attempts:

  1. using deep learning techniques to solve the SfP problem; and

  2. blending approximate physics into the deep learning approach.


Because this is only a first attempt at blending SfP with deep learning, the proposed solution is not perfect, particularly when obtaining the shape of objects with mixed reflectivities. However, all prior methods in SfP also fail in this scenario. Our physics-based approach to neural networks outperforms both the physics-only and the learning-only strategies, but it remains a first attempt at the problem.

2. Related Work

Figure 3. Summarizing the tradeoffs of our proposed physics-based neural networks (NN) versus physics-only and learning-only approaches.

Polarization cues have been employed previously for different tasks, such as reflectometry estimation (Ghosh et al., 2010), facial geometry reconstruction (Ghosh et al., 2011), dynamic interferometry (Maeda et al., 2018), recovery of polarimetric spatially varying surface reflectance functions (SVBRDF) (Baek et al., 2018), and object shape acquisition (Ma et al., 2007; Guarnera et al., 2012; Riviere et al., 2017). This paper sits at the seam between deep learning and SfP, offering performance tradeoffs that differ from those of prior work. Refer to Figure 3 for an overview.

Shape from polarization   infers the shape of a surface (usually represented as surface normals) by observing the correlated changes of image intensity with the polarization information. Changes of polarization information can be captured by rotating a linear polarizer in front of an ordinary camera (Wolff, 1997; Atkinson and Ernst, 2018) or by polarization cameras that capture a single shot in real time (e.g., the PolarM camera (PolarM polarization camera, 2017) used in (Yang et al., 2018)). Conventional shape from polarization decodes such information to recover the surface normal up to some ambiguity. If only images with different polarization information are available, heuristic priors such as the surface normals along the boundary and convexity of the objects are employed to remove the ambiguity (Miyazaki et al., 2003; Atkinson and Hancock, 2006). Photometric constraints from shape from shading (Mahmoud et al., 2012) and photometric stereo (Drbohlav and Sara, 2001; Ngo et al., 2015; Atkinson, 2017) complement polarization constraints to make the normal estimates unique. If multi-spectral measurements are available, the surface normal and its refractive index can be estimated at the same time (Huynh et al., 2010, 2013). More recently, a joint linear formulation of shape from shading and shape from polarization has been shown to directly estimate the depth of the surface (Smith et al., 2016; Tozza et al., 2017; Smith et al., 2018). This paper is the first attempt at studying deep learning and SfP together.

Polarized 3D   involves stronger assumptions than SfP and has different inputs and outputs. Recognizing that SfP alone is a limited technique, the Polarized 3D class of methods integrates shape from polarization with a low resolution depth estimate. This additional constraint allows not just recovery of shape but also a high-quality 3D model. The low resolution depth can be obtained from two-view (Miyazaki et al., 2004; Atkinson and Hancock, 2005; Berger et al., 2017), three-view (Chen et al., 2018c), or multi-view (Miyazaki et al., 2016; Cui et al., 2017) stereo, or even in real time by using a SLAM system (Yang et al., 2018). These depth estimates from geometric methods are not reliable in textureless regions, where finding correspondences for triangulation is difficult. Polarimetric cues can be used jointly to improve such unreliable depth estimates and obtain a more complete shape estimate. A depth sensor such as the Kinect can also provide a coarse depth prior to disambiguate the ambiguous normal estimates given by SfP (Kadambi et al., 2015, 2017). The key step that characterizes Polarized 3D is a holistic approach that rethinks both SfP and the depth-normal fusion process. The main limitation of Polarized 3D is the strong requirement that a coarse depth map be available, a requirement that our proposed technique does not share.

Data-driven computational imaging   approaches have drawn much attention in recent years thanks to the powerful modeling ability of deep neural networks. Various types of convolutional neural networks (CNNs) have been designed and trained to enable 3D imaging for different types of sensors and measurements. From single photon sensor measurements, a multi-scale denoising and upsampling CNN is proposed to refine depth estimates (Lindell et al., 2018). CNNs also show advantages in jointly solving phase unwrapping, multipath interference, and denoising from raw time-of-flight measurements (Marco et al., 2017; Su et al., 2018). From multi-directional lighting measurements, a fully-connected network is first proposed to solve photometric stereo for general reflectance with a pre-defined set of light directions (Santo et al., 2017). Then the fully-convolutional network with an order-agnostic max-pooling operation (Chen et al., 2018b) and the observation map invariant to the number and permutation of the images (Ikehata, 2018) are concurrently proposed to deal with an arbitrary set of light directions. Normal estimates from photometric stereo can also be learned in an unsupervised manner by minimizing the reconstruction loss (Taniai and Maehara, 2018). Other than 3D imaging, deep learning has been used to solve several inverse problems in the field of computational imaging (Satat et al., 2017; Tancik et al., 2018a, b). Separation of shape, reflectance and illuminance maps for wild facial images can be achieved with the assistance of CNNs as well (Sengupta et al., 2018). In addition, CNNs show potential for modeling the SVBRDF of near-planar surfaces (Li et al., 2017; Ye et al., 2018; Li et al., 2018b; Deschaintre et al., 2018) and of more complex objects (Li et al., 2018c). The challenge with existing deep learning frameworks is that they do not leverage the unique physics of polarization.

3. Proposed Method

Figure 4. Overview of our proposed physics-based neural network. The network follows an encoder-decoder architecture and is fully convolutional. We use an addition operation as the mixer to integrate both low-level and high-level features from the polarized images and the ambiguous surface normals.

In this section, we first introduce some basic knowledge of SfP, and then present our physics-based convolutional neural network architecture. The blending of physics into deep learning helps improve the performance and generalizability of the method.

3.1. Image Formation and Physical Solution

Figure 5. SfP lacks a unique solution due to the ambiguity problem. Here, two different surface orientations could result in exactly the same polarization signal, represented by dots and hashes. The dots represent polarization out of the plane of the paper and the hashes represent polarization within the plane of the paper. Based on the measured data, it is unclear which orientation is correct.

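Our objective is to reconstruct surface normals from a set of polarized images $\{I_{\phi_1}, I_{\phi_2}, \ldots, I_{\phi_M}\}$ captured at different polarizer angles. For a specific polarizer angle $\phi_{pol}$, the intensity at a pixel of a captured image follows a sinusoidal variation under unpolarized illumination:

$$I(\phi_{pol}) = \frac{I_{max} + I_{min}}{2} + \frac{I_{max} - I_{min}}{2} \cos\big(2(\phi_{pol} - \phi)\big), \tag{1}$$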


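where $\phi$ denotes the phase angle, and $I_{min}$ and $I_{max}$ are lower and upper bounds for the observed intensity. Equation 1 has a $\pi$-ambiguity in the context of $\phi$: two phase angles that differ by a $\pi$ shift result in the same intensity in the captured images. Based on the phase angle $\phi$, the azimuth angle $\varphi$ can be retrieved with $\frac{\pi}{2}$-ambiguity as follows (Cui et al., 2017):

$$\varphi = \begin{cases} \phi, & \text{if diffuse reflection dominates}, \\ \phi - \frac{\pi}{2}, & \text{if specular reflection dominates}. \end{cases} \tag{2}$$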


The zenith angle $\theta$ is related to the degree of polarization $\rho$, which can be written as:

$$\rho = \frac{I_{max} - I_{min}}{I_{max} + I_{min}}. \tag{3}$$
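When the polarizer angles are 0°, 45°, 90° and 135°, the sinusoid of Equation 1 can be fitted per pixel in closed form. The sketch below is illustrative only and is not the authors' code: the function names and the route through Stokes-like sums and differences are assumptions, and the sinusoid is written out in the standard form with mean $(I_{max}+I_{min})/2$ and amplitude $(I_{max}-I_{min})/2$.

```python
import math

def intensity(phi_pol, i_max, i_min, phase):
    """Sinusoidal intensity behind a polarizer at angle phi_pol (radians)."""
    mean = 0.5 * (i_max + i_min)
    amplitude = 0.5 * (i_max - i_min)
    return mean + amplitude * math.cos(2.0 * (phi_pol - phase))

def fit_polarization(i0, i45, i90, i135):
    """Recover (i_max, i_min, phase, dop) from four polarizer angles.

    Hypothetical helper: assumes the angles are exactly 0, 45, 90, 135
    degrees and that the total intensity is non-zero.
    """
    s0 = 0.5 * (i0 + i45 + i90 + i135)    # equals I_max + I_min
    s1 = i0 - i90                         # (I_max - I_min) * cos(2 * phase)
    s2 = i45 - i135                       # (I_max - I_min) * sin(2 * phase)
    amplitude = 0.5 * math.hypot(s1, s2)  # (I_max - I_min) / 2
    mean = 0.5 * s0
    phase = (0.5 * math.atan2(s2, s1)) % math.pi  # defined only modulo pi
    dop = math.hypot(s1, s2) / s0         # degree of polarization
    return mean + amplitude, mean - amplitude, phase, dop
```

Because the phase enters only through $\cos(2(\phi_{pol} - \phi))$, the fit can return it only modulo $\pi$, which is the ambiguity discussed above.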


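When diffuse reflection is dominant, the degree of polarization can be expressed with the zenith angle $\theta$ and the refractive index $n$ as follows (Atkinson and Hancock, 2006):

$$\rho_d = \frac{\left(n - \frac{1}{n}\right)^2 \sin^2\theta}{2 + 2n^2 - \left(n + \frac{1}{n}\right)^2 \sin^2\theta + 4\cos\theta\sqrt{n^2 - \sin^2\theta}}. \tag{4}$$

The effect of $n$ is not decisive, and we assume $n = 1.5$ throughout the rest of this paper. With this known $n$, Equation 4 can be rearranged to obtain a closed-form estimate of the zenith angle for the diffuse-dominant case.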
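The inversion from degree of polarization to zenith angle can be sketched numerically. The block below deliberately swaps the closed-form rearrangement for a plain bisection, which relies on the diffuse degree of polarization increasing monotonically with the zenith angle on $[0, \pi/2]$. The function names, the default refractive index of 1.5 and the iteration count are assumptions made for illustration, and the diffuse model is the standard expression from Atkinson and Hancock (2006).

```python
import math

def diffuse_dop(theta, n=1.5):
    """Degree of polarization for diffuse reflection at zenith angle theta."""
    s2 = math.sin(theta) ** 2
    numerator = (n - 1.0 / n) ** 2 * s2
    denominator = (2.0 + 2.0 * n ** 2 - (n + 1.0 / n) ** 2 * s2
                   + 4.0 * math.cos(theta) * math.sqrt(n ** 2 - s2))
    return numerator / denominator

def zenith_from_diffuse_dop(rho, n=1.5, iters=60):
    """Invert diffuse_dop by bisection (a stand-in for the closed form)."""
    lo, hi = 0.0, 0.5 * math.pi
    # Clamp to the attainable range so noisy inputs still give an answer.
    rho = min(max(rho, 0.0), diffuse_dop(hi, n))
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if diffuse_dop(mid, n) < rho:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

For example, `zenith_from_diffuse_dop(diffuse_dop(0.7))` recovers a zenith angle of about 0.7 radians.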

When specular reflection is dominant, the degree of polarization can be written as (Atkinson and Hancock, 2006):

$$\rho_s = \frac{2\sin^2\theta\,\cos\theta\,\sqrt{n^2 - \sin^2\theta}}{n^2 - \sin^2\theta - n^2\sin^2\theta + 2\sin^4\theta}. \tag{5}$$

Equation 5 cannot be inverted analytically, and solving for the zenith angle by numerical interpolation produces two solutions if there are no additional constraints. For real-world objects, specular and diffuse reflection are mixed in proportions that depend on the surface material. As shown in Figure 5, the ambiguity in the azimuth angle and the uncertainty in the zenith angle are fundamental limitations of SfP. Overcoming these limitations through physics-based neural networks is the primary focus of this paper.
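The two zenith solutions of the specular model can be illustrated with a small numerical sketch. It assumes the standard specular expression from Atkinson and Hancock (2006), a refractive index of 1.5, and uses bisection on either side of the Brewster angle $\arctan n$, where the specular degree of polarization peaks at 1. The helper names are hypothetical, and the paper itself mentions numerical interpolation rather than bisection.

```python
import math

def specular_dop(theta, n=1.5):
    """Degree of polarization for specular reflection at zenith angle theta."""
    s2 = math.sin(theta) ** 2
    numerator = 2.0 * s2 * math.cos(theta) * math.sqrt(n ** 2 - s2)
    denominator = n ** 2 - s2 - n ** 2 * s2 + 2.0 * s2 ** 2
    return numerator / denominator

def _bisect(f, lo, hi, iters=60):
    """Root of f on [lo, hi], assuming f(lo) and f(hi) differ in sign."""
    f_lo = f(lo)
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        f_mid = f(mid)
        if (f_mid > 0) == (f_lo > 0):
            lo, f_lo = mid, f_mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def zeniths_from_specular_dop(rho, n=1.5):
    """Return both zenith candidates for a specular DoP with 0 < rho < 1."""
    brewster = math.atan(n)  # the specular DoP reaches 1 at this angle
    residual = lambda theta: specular_dop(theta, n) - rho
    below = _bisect(residual, 0.0, brewster)
    above = _bisect(residual, brewster, 0.5 * math.pi)
    return below, above
```

Both returned angles reproduce the same degree of polarization, which is why the measurement alone cannot choose between them.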

3.2. Learning with Physics

Large amounts of labeled data are critical to the success of neural networks. To alleviate the burden of data requirement, one possible method is to blend physical priors during learning. However, it is inherently difficult to use physical information for SfP tasks, for the following reasons: 1. Polarization normals contain ambiguous azimuth angles. 2. Specular reflection and diffuse reflection coexist, and determining the proportion of each type is complicated. 3. Polarization normals are usually noisy, especially when the degree of polarization is low. For noisy images, shifting the azimuth angles by $\pi$ or $\frac{\pi}{2}$ cannot reconstruct the surface normals properly.

Due to the above reasons, regularization from the physical azimuth angle or the physical zenith angle will degrade the network performance and lead to a fragile model. Therefore, instead of using physical solutions as regularization, we directly feed both the polarized images and the ambiguous normal maps into the network, and leave the network to learn how to combine physical solutions with the polarized images effectively. The estimated surface normals can be structured as follows:

$$\hat{N} = f\left(I_{\phi_1}, I_{\phi_2}, \ldots, I_{\phi_M}, N_{diff}, N_{spec1}, N_{spec2}\right), \tag{6}$$

where $f(\cdot)$ is the proposed prediction model, $\{I_{\phi_1}, I_{\phi_2}, \ldots, I_{\phi_M}\}$ is a set of polarized images, and $\hat{N}$ is the estimated surface normals. We use the diffuse model in Section 3.1 to calculate $N_{diff}$, and $N_{spec1}$, $N_{spec2}$ are the two solutions from the specular model.
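The three physics-based normal candidates can be assembled per pixel from the angles of Section 3.1. The sketch below is an assumption-laden illustration rather than the authors' code: it assumes a camera looking along the z axis, the spherical convention $(\sin\theta\cos\varphi, \sin\theta\sin\varphi, \cos\theta)$, an azimuth equal to the phase angle in the diffuse case, and an azimuth shifted by $\pi/2$ in the specular case. All function names are hypothetical.

```python
import math

def normal_from_angles(azimuth, zenith):
    """Unit normal from azimuth and zenith (assumed camera along +z)."""
    return (math.sin(zenith) * math.cos(azimuth),
            math.sin(zenith) * math.sin(azimuth),
            math.cos(zenith))

def candidate_normals(phase, zenith_diffuse, zenith_specular_pair):
    """Return (n_diff, n_spec1, n_spec2) for a single pixel.

    zenith_specular_pair holds the two zenith solutions of the specular
    model; both share the azimuth phase - pi/2 (an assumed convention).
    """
    n_diff = normal_from_angles(phase, zenith_diffuse)
    specular_azimuth = phase - 0.5 * math.pi
    n_spec1 = normal_from_angles(specular_azimuth, zenith_specular_pair[0])
    n_spec2 = normal_from_angles(specular_azimuth, zenith_specular_pair[1])
    return n_diff, n_spec1, n_spec2
```

None of the three candidates is trusted on its own; they are all passed to the network, which learns how to weigh them against the raw polarized images.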

The remaining question is how to combine the ambiguous surface normals with the polarized images in the network. In our tests, simply concatenating the ambiguous normals with the polarized images did not produce the expected improvement. One explanation is that the low-level features of polarized images and the low-level features of ambiguous normals are different, and it is burdensome for convolutional layers to learn these two types of features concurrently. An alternative is to use two separate encoder streams to encode these two types of features at the low-level stage, and to merge the high-level features in deeper layers. With the proposed two-stream encoder, the ambiguous normals can implicitly direct the network to learn some physical information and serve as a good initialization that improves generalizability.

3.3. Network Architecture

Layer   Encoder block
1       Conv[(k × k), C_in, C_out, stride=2], BN, LeakyReLU
2       Conv[(k × k), C_in, C_out, stride=1], BN, LeakyReLU
3       Conv[(k × k), C_in, C_out, stride=1], BN, LeakyReLU

Layer   Decoder block
1       Deconv[(k × k), C_in, C_out, stride=2], BN, LeakyReLU
2       Conv[(k × k), C_in, C_out, stride=1], BN, LeakyReLU
3       Conv[(k × k), C_in, C_out, stride=1], BN, LeakyReLU

Table 1. Convolutional layers in each encoder block and decoder block. Conv[(k × k), C_in, C_out, stride=s] represents a 2D convolutional layer with kernel size of (k × k), C_in input channels, C_out output channels, and stride s. Deconv denotes a 2D transposed convolutional layer, and BN denotes a batch normalization layer. We use LeakyReLU (Maas et al., 2013) with a negative slope of 0.1 as the activation function.

Our network structure is illustrated in Fig. 4. It consists of two independent encoders that extract features from the polarized images and the ambiguous surface normals separately, and a common decoder that outputs the surface normals $\hat{N}$. A variation of U-Net (Ronneberger et al., 2015) and LinkNet (Chaurasia and Culurciello, 2017) is used to connect the encoder block and decoder block at the same hierarchical level. We argue that addition is superior to concatenation when merging feature maps, since in our tests it achieves comparable performance while generally requiring less memory and computation.

There are 7 encoder blocks, which encode the input into a compact tensor whose leading dimension is the minibatch size $B$; this depth guarantees a sufficiently large receptive field. The encoded tensor is then decoded by the same number of decoder blocks to produce the estimated surface normals $\hat{N}$. An L2-normalization layer is appended after the last decoder block to convert the corresponding feature maps into surface normals. Table 1 shows the structure of each encoder and decoder block. Two additional feature extractors, each containing 3 convolutional layers, are placed before the first encoder block to prepare feature maps suitable for downsampling. We use convolutional layers with a stride of 2 for downsampling, and transposed convolutional layers for upsampling. Batch normalization layers (Ioffe and Szegedy, 2015) are inserted after each layer except the output layer, where batch normalization would distort the estimated surface normals $\hat{N}$. After batch normalization, LeakyReLU with a negative slope of 0.1 is used as the activation function.
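The effect of seven stride-2 encoder blocks on spatial resolution can be checked with simple arithmetic. The sketch below uses the standard convolution output-size formula; the kernel size of 3, the padding of 1 and the 256-pixel input side are assumptions chosen for illustration and are not stated in the text.

```python
def conv_output_size(size, kernel=3, stride=2, padding=1):
    """Standard convolution output-size formula along one spatial axis."""
    return (size + 2 * padding - kernel) // stride + 1

def encoder_sizes(size, num_blocks=7):
    """Spatial size after each of num_blocks stride-2 encoder blocks."""
    sizes = [size]
    for _ in range(num_blocks):
        sizes.append(conv_output_size(sizes[-1]))
    return sizes

print(encoder_sizes(256))  # [256, 128, 64, 32, 16, 8, 4, 2]
```

Each block halves the resolution, so seven blocks shrink the spatial side by a factor of 128, which is why a single feature at the bottleneck sees a large portion of the input patch.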

For the image encoder, pictures captured with a polarizer at angles $\phi_{pol} \in \{0^\circ, 45^\circ, 90^\circ, 135^\circ\}$ are selected for training and testing. Three values of $\phi_{pol}$ are sufficient to solve for the polarization cues; nevertheless, we use four values to ensure robustness to noise. The four polarized images are stacked to form a four-channel tensor of spatial size $H \times W$, where $H \times W$ is the spatial resolution of the polarized images. Our motivation is that, since the relative 3D information from polarization comes essentially from the intensity difference between polarized images, it is beneficial for convolutional layers to learn this difference from images concatenated along the channel dimension. For the normal encoder, we use the identical architecture for the sake of feature map addition. We use ground truth surface normals to supervise the physics-based neural networks with the cosine similarity loss function:


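$$L_{cosine} = \frac{1}{W \times H} \sum_{i}^{W} \sum_{j}^{H} \left(1 - \langle \hat{N}_{ij}, N_{ij} \rangle\right), \tag{7}$$

where $\langle \cdot, \cdot \rangle$ denotes the dot product, $\hat{N}_{ij}$ is the estimated surface normal at pixel location $(i, j)$, and $N_{ij}$ is the corresponding ground truth surface normal. This loss is minimized when $\hat{N}_{ij}$ and $N_{ij}$ have identical orientation.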
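A cosine similarity loss of this kind can be sketched in a few lines. The version below takes the loss as the per-pixel mean of one minus the dot product of unit normals; that exact normalization, the nested-list data layout and the function names are assumptions for illustration, and a real training loop would use a tensor library instead.

```python
import math

def normalize(v):
    """Scale a 3-vector to unit length (assumes a non-zero vector)."""
    norm = math.sqrt(sum(c * c for c in v))
    return tuple(c / norm for c in v)

def cosine_loss(pred, target):
    """Mean of (1 - <n_hat, n>) over H x W grids of 3-vectors."""
    total, count = 0.0, 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            p, t = normalize(p), normalize(t)
            total += 1.0 - sum(a * b for a, b in zip(p, t))
            count += 1
    return total / count
```

The loss is 0 for identical orientations, 1 for orthogonal normals and 2 for opposite normals, so it depends only on direction and not on vector length.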

4. Dataset and Implementation Details

In what follows, we describe the dataset capture and organization as well as software implementation details, including comparison implementations.

4.1. Dataset

Figure 6. Physical setup to capture polarized images. We use a polarization camera to capture four polarized images of an object in a single shot. The scanner is put next to the camera for obtaining the 3D shape of the object. The setup is put in an indoor environment with typical office lighting.
Figure 7. Overview of our real (upper part) and synthetic (lower part) datasets. We show 10 objects (out of 58) in the training set of our real dataset, and 10 objects (out of 10) of our synthetic dataset. In each example, the polarization images are shown at thumbnail size on top, and the corresponding normal maps are shown below them. Note that the polarization camera captures grayscale images, which are used as input for computation.

To train the physics-based neural network, polarization images with corresponding normal maps are needed. However, neither synthetic nor real datasets for such a purpose are publicly available. We therefore create the first real and synthetic datasets for data-driven SfP as illustrated in Fig. 7.

Real dataset:

A camera with a layer of polarizers above the photodiodes (Lucid Vision Phoenix polarization camera, 2018) is used to capture four polarized images at angles $0^\circ$, $45^\circ$, $90^\circ$, and $135^\circ$ in a single shot. Then a structured light based 3D scanner (SHINING 3D scanner, 2018) (with single shot accuracy no more than 0.1 mm, point distance from 0.17 mm to 0.2 mm, and a synchronized turntable for automatically registering scans from multiple viewpoints) is used to obtain high-quality 3D shapes. Our real data capture setup is shown in Fig. 6. The scanned 3D shapes are aligned from the scanner’s coordinate system to the image coordinate system of the polarization camera by using the shape-to-image alignment method adopted in (Shi et al., 2019). Finally, we compute the surface normals of the aligned shapes with the Mitsuba renderer (Jakob, 2010) and use them as ground truth. In total, we capture 65 sets of real data (each with 4 polarized images plus a surface normal map), of which 58 sets are used for training and the remaining 7 sets for testing and quantitative evaluation.

Synthetic dataset:

The scanned real data are not sufficient in terms of scale and lighting variation for training a deep neural network. We further create a synthetic dataset to complement the real one. We use the normal maps provided in (Shi et al., 2019), since they cover a great diversity of geometry, from a simple sphere to surfaces with highly delicate structures. Given a normal map, we calculate its diffuse shading by assuming Lambertian reflectance and a distant environment map (Debevec, 2008). The polarized images are then calculated using Equation 1. By using 10 different environment maps on 10 different normal maps, we obtain 100 sets of synthetic data, and all these data are used for training.
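One way such polarized intensities could be synthesized for a single pixel is sketched below. This is not the authors' rendering pipeline: it assumes a camera looking along the z axis, purely diffuse polarization with a refractive index of 1.5, a phase angle equal to the azimuth of the normal, and a shading value that equals the mean of $I_{max}$ and $I_{min}$. The function names are hypothetical.

```python
import math

def diffuse_dop(theta, n=1.5):
    """Diffuse degree of polarization (Atkinson and Hancock, 2006)."""
    s2 = math.sin(theta) ** 2
    numerator = (n - 1.0 / n) ** 2 * s2
    denominator = (2.0 + 2.0 * n ** 2 - (n + 1.0 / n) ** 2 * s2
                   + 4.0 * math.cos(theta) * math.sqrt(n ** 2 - s2))
    return numerator / denominator

def render_polarized_pixel(normal, shading, polarizer_angles, n=1.5):
    """Intensities behind each polarizer angle for one surface point."""
    nx, ny, nz = normal
    zenith = math.acos(max(-1.0, min(1.0, nz)))
    azimuth = math.atan2(ny, nx)      # used as the phase angle (assumed)
    rho = diffuse_dop(zenith, n)
    i_max = shading * (1.0 + rho)     # chosen so the mean equals shading
    i_min = shading * (1.0 - rho)
    mean = 0.5 * (i_max + i_min)
    amplitude = 0.5 * (i_max - i_min)
    return [mean + amplitude * math.cos(2.0 * (a - azimuth))
            for a in polarizer_angles]
```

With polarizer angles of 0, 45, 90 and 135 degrees, the four returned values average to the shading value, and a normal facing the camera yields four identical intensities because its degree of polarization is zero.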

4.2. Software Implementation

Our model was implemented in PyTorch (Paszke et al., 2017), and trained for 500 epochs with a batch size of 64. It took 8 hours for the network to converge on a single NVIDIA Titan V GPU. We used the Adam optimizer (Kingma and Ba, 2014) with default parameters ($\beta_1$ = 0.9 and $\beta_2$ = 0.999), and the base learning rate was set to 0.01. The learning rate was multiplied by a factor of 0.8 whenever the loss reached a plateau during training. We tried both He initialization (He et al., 2015) and Xavier initialization (Glorot and Bengio, 2010) on the convolutional weights, and Xavier initialization performed slightly better. For data augmentation, fixed-size image patches are randomly cropped during training, and a patch is discarded if its foreground ratio is less than 20%. No random rescaling is used, in order to preserve the original high-resolution details and aspect ratio. The final prediction is the average over 32 shifted inputs, which preserves accuracy at the boundaries of each patch.
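The patch-cropping rule can be illustrated with a short sketch. Only the 20% foreground threshold comes from the text; the mask representation (a nested list of booleans), the retry limit and the function names are assumptions for illustration.

```python
import random

def foreground_ratio(mask, top, left, size):
    """Fraction of foreground (truthy) mask entries in a square patch."""
    hits = sum(1 for r in range(top, top + size)
               for c in range(left, left + size) if mask[r][c])
    return hits / float(size * size)

def sample_patch(mask, size, min_ratio=0.2, rng=random, max_tries=1000):
    """Draw random crops until one has enough foreground.

    Returns the (top, left) corner of an accepted patch, or None if no
    acceptable patch was found within max_tries draws.
    """
    height, width = len(mask), len(mask[0])
    for _ in range(max_tries):
        top = rng.randint(0, height - size)
        left = rng.randint(0, width - size)
        if foreground_ratio(mask, top, left, size) >= min_ratio:
            return top, left
    return None
```

Rejecting mostly-background crops keeps each training patch informative, since background pixels carry no surface normal to supervise.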

4.3. Comparisons to Physics-only SfP

We used a test dataset consisting of the following scenes: ball, horse, vase, half painted vase, Christmas, flamingo, and rabbit. On this test set, we compared performance between our proposed method and three physics-only methods for SfP: 1. (Smith et al., 2016). 2. (Mahmoud et al., 2012). 3. (Miyazaki et al., 2003; Atkinson and Hancock, 2006). The first method recovers the depth map directly, and we only use the diffuse model due to the lack of specular reflection masks. The surface normals are obtained from the estimated depth with a bicubic fit. Both the first and the second methods require lighting input, and we use the estimated lighting from the first method during comparison. The second method also requires known albedo, and following convention, we assume a uniform albedo of 1. All the comparison codes were provided by Smith et al. (Smith et al., 2016). Source codes of (Tozza et al., 2017; Smith et al., 2018) are not currently publicly available, so we were not able to conduct a fair comparison with these two methods.

5. Results

In this section, we evaluate our model with the presented challenging real-world scene benchmark, and compare it against three physics-only methods for SfP. Mean angular error (MAE) is selected as the metric to quantify the accuracy of the estimated surface normals during comparison.
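The mean angular error can be computed as sketched below. Reporting the error in degrees and averaging over a flat list of foreground pixels are assumptions made for illustration, as are the function names.

```python
import math

def angular_error_deg(a, b):
    """Angle in degrees between two non-zero 3-vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = (math.sqrt(sum(x * x for x in a))
            * math.sqrt(sum(y * y for y in b)))
    cos_angle = max(-1.0, min(1.0, dot / norm))  # guard against rounding
    return math.degrees(math.acos(cos_angle))

def mean_angular_error(pred, target):
    """Average angular error over paired lists of normals."""
    errors = [angular_error_deg(p, t) for p, t in zip(pred, target)]
    return sum(errors) / len(errors)
```

For instance, one perfectly estimated normal and one that is off by 90 degrees give a mean angular error of 45 degrees.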

Figure 8. Our method can handle shiny scenes with high-frequency details. Although the proposed method does not recover all of the detail that was present in ground truth, global errors in shape are not present. By comparison, the physics-only methods exhibit large errors in shape recovery.
Figure 9. Choice of neural network loss function affects result quality. Motivated by this example, we choose the cosine loss function as it returns the lowest error and appears to recover relatively more detail. Compared results are obtained on a small training set with 32 training samples (16 synthetic samples and 16 real samples).
Figure 10. The proposed method handles cases when the input images are noisy. Noise-tolerant performance is particularly important when using polarizing optics. Note a polarizing filter reduces the light intensity by 50 percent.

5.1. Machine Learning Alone is Insufficient (Ball Scene)

As illustrated in Figure 2, a naive approach to deep learning that does not blend physics is insufficient. On one of the simplest scenes possible (a white ping-pong ball), the naive neural network cannot recover accurate surface normals. There is only a slight difference between images captured at different polarizer angles, and it is difficult for a naive neural network to learn from these differences with a limited number of training samples. The proposed method incorporates multiple physical solutions. Therefore, apart from learning from the polarized images alone, which is difficult, the network can also learn from the physical solutions, which may be easier. The generalizability of the network is thus improved, and it becomes realistic for the network to predict high-quality normals in this case.

5.2. Choice of Loss Function is Important (Vase Scene)

As shown in Figure 9, the choice of loss function affects both the quantitative error and the recovery of qualitative detail. Of the two norm-based losses compared ($\ell_1$ and $\ell_2$), one gives an overall smoothed result, while the other widens the ridges in the vase. The cosine loss result is closest to the ground truth, and the cosine loss is used for all other scenes in the paper. Its success may come from its emphasis on orientation information: both the $\ell_1$ and $\ell_2$ losses penalize the length of the estimated surface normals, even though the normalization layer at the end has already constrained the normal length.

5.3. Improved Performance on Shiny and Detailed Scene (Horse Scene)

Here, we show improved performance on a relatively shiny scene with surface details. As illustrated in Figure 8, the proposed physics-based NN achieves the highest qualitative and quantitative accuracy. It is worth noting that the method of (Smith et al., 2016) does not perform well on the horse scene: its simple hybrid reflection model and spherical harmonics based lighting model are not well satisfied there, so the estimated depth becomes inaccurate, which results in a normal map with a large error.

5.4. Improved Performance in Noise-degraded Environments (Vase Scene)

Here, we show that the physics-based NN approach outperforms physics-only approaches when the signal-to-noise level drops. As illustrated in Figure 10, the input to each of the methods is a set of noisy polarization images. This noise was generated in simulation to mimic low light levels (when shot noise dominates). The proposed physics-based NN approach shows a qualitative and quantitative improvement over the physics-only methods. Our approach remains effective at low signal-to-noise levels because of the encoder-decoder architecture. Both the polarized images and the physical solutions are downsampled into a condensed feature map by the encoder, and the decoder has to use this condensed feature map to recover the normal map. With a limited number of parameters, the network has to learn some intrinsic representation of the input, which provides robustness to noise.

Figure 11. The proposed method has the lowest angular error in recovering normal maps. We compare with the SfP methods of (Smith et al., 2016), (Mahmoud et al., 2012), and (Miyazaki et al., 2003). Not shown is the performance of (Atkinson and Hancock, 2006), which behaves similarly to (Miyazaki et al., 2003).

5.5. Additional Scenes

Across all tested scenes in the paper, the proposed physics-based neural network outperforms the physics-only methods of (Miyazaki et al., 2003; Mahmoud et al., 2012; Smith et al., 2016). In particular, Figure 11 shows that the proposed method recovers surface normals that are quantitatively and qualitatively closest to ground truth. The large region-wise anomalies in many of the results from (Miyazaki et al., 2003) stem from the region-growing convexity constraint that the method imposes. The method of (Mahmoud et al., 2012) uses shading constraints that require a distant light source, which is not the case for the tested scenes. Finally, the results of (Smith et al., 2016) are explained both by the use of 4 polarized images as input (ordinarily the method requires 18) and by the change in lighting direction.

5.6. SfP Still Fails on Mixed Material Scenes

Figure 12. All SfP methods, including the proposed method, fail on a scene with mixed paints. A texture copy artifact is seen in all the SfP methods at the point of material transition. While all SfP methods can be seen as failing in that regard, the proposed method still has the lowest error.

This paper, like other SfP methods, is unable to solve the mixed material problem. This problem occurs when the polarimetric signal is not just due to surface geometry, but also material effects. Figure 12 shows one such scene, consisting of a vase painted with two different styles of paint. While the physics-based NN result has the lowest quantitative error, none of the SfP methods are correct. There is a texture copy artifact at the point where the paints change.

6. Discussion

In summary, we have presented a first attempt at blending the physics of SfP with deep learning. This blending is unusual because of the uncertain physics inherent to SfP. This special uncertainty in the physics-based prior motivates our use of a novel, multi-stream encoder, as compared to existing deep learning approaches.

In addition, we report a performance improvement over existing methods to solve SfP. However, there are still open problems. We find that existing SfP methods (including this paper) fail on scenes with mixed reflectivity. It would be interesting to study how material properties could be incorporated into the physics-based NN architecture. Part of the solution may also rely on expanding the training dataset, to include a wider variety of object materials and paints. For these types of computational photography problems, where the capture procedure is labor intensive, it is likely that dataset sizes will be small. This underscores the importance of including physical priors in the network model. With this inclusion, we were able to obtain results from a relatively small dataset size.

The lessons learned in this “Deep Shape from Polarization” study may also apply to a future “Deep Polarized 3D” study. The physics-only family of Polarized 3D techniques benefits from robust integration of surface normals with a depth prior. The state-of-the-art Polarized 3D integration has been performed with a simplistic matrix inversion (Kadambi et al., 2015). A physics-based NN approach might be able to learn this elementary function to potentially obtain state-of-the-art results. Overall, this paper’s results appear to validate the direction of jointly studying deep learning and SfP.
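As a rough illustration of what integrating surface normals with a depth prior by matrix inversion can look like, the sketch below solves a one-dimensional regularized least-squares problem: find a depth profile that stays close to a noisy coarse depth while matching the gradients implied by the normals. This is a toy formulation chosen for clarity, not the formulation of (Kadambi et al., 2015). The function name, the weight `lam`, and the synthetic data are all hypothetical.

```python
import numpy as np

def fuse_depth_with_gradients(z_prior, grad, lam=0.1):
    """Minimize lam * ||z - z_prior||^2 + ||D z - grad||^2 over z.

    z_prior: coarse depth samples, shape (n,)
    grad:    depth differences implied by surface normals, shape (n - 1,)
    lam:     weight on the depth prior (hypothetical default)
    """
    n = z_prior.size
    # Forward-difference operator D, so that (D z)[i] = z[i + 1] - z[i].
    D = np.zeros((n - 1, n))
    idx = np.arange(n - 1)
    D[idx, idx] = -1.0
    D[idx, idx + 1] = 1.0
    # Normal equations of the least-squares problem: one linear solve.
    A = lam * np.eye(n) + D.T @ D
    b = lam * z_prior + D.T @ grad
    return np.linalg.solve(A, b)

# Synthetic profile: accurate gradients (from normals), noisy coarse depth.
x = np.linspace(0.0, 1.0, 50)
z_true = np.sin(2.0 * np.pi * x)
grad = np.diff(z_true)
rng = np.random.default_rng(0)
z_prior = z_true + 0.2 * rng.standard_normal(x.size)

z_fused = fuse_depth_with_gradients(z_prior, grad)
err_prior = np.abs(z_prior - z_true).mean()
err_fused = np.abs(z_fused - z_true).mean()
print(f"mean abs error, prior: {err_prior:.3f}, fused: {err_fused:.3f}")
```

The fixed linear solve is the kind of elementary function that a learned model could, in principle, replace or refine when the normals themselves are unreliable.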


  • Atkinson (2017) Gary A. Atkinson. 2017. Polarisation photometric stereo. Computer Vision and Image Understanding 160 (2017), 158–167.
  • Atkinson and Ernst (2018) Gary A. Atkinson and Jürgen D. Ernst. 2018. High-sensitivity analysis of polarization by surface reflection. Machine Vision and Applications 29, 7 (2018), 1171–1189.
  • Atkinson and Hancock (2005) Gary A. Atkinson and Edwin R. Hancock. 2005. Multi-view Surface Reconstruction using Polarization. In Proc. of International Conference on Computer Vision.
  • Atkinson and Hancock (2006) Gary A Atkinson and Edwin R Hancock. 2006. Recovery of surface orientation from diffuse polarization. IEEE Transactions on Image Processing 15, 6 (2006), 1653–1664.
  • Baek et al. (2018) Seung-Hwan Baek, Daniel S Jeon, Xin Tong, and Min H Kim. 2018. Simultaneous acquisition of polarimetric SVBRDF and normals. ACM Trans. Graph 37, 6 (2018).
  • Berger et al. (2017) Kai Berger, Randolph Voorhies, and Larry H. Matthies. 2017. Depth from stereo polarization in specular scenes for urban robotics. In Proc. of International Conference on Robotics and Automation.
  • Chaurasia and Culurciello (2017) Abhishek Chaurasia and Eugenio Culurciello. 2017. LinkNet: Exploiting encoder representations for efficient semantic segmentation. In Proc. of IEEE International Conference on Visual Communications and Image Processing.
  • Che et al. (2018) Chengqian Che, Fujun Luan, Shuang Zhao, Kavita Bala, and Ioannis Gkioulekas. 2018. Inverse Transport Networks. arXiv preprint arXiv:1809.10820 (2018).
  • Chen et al. (2018b) Guanying Chen, Kai Han, and Kwan-Yee K. Wong. 2018b. PS-FCN: A flexible learning framework for photometric stereo. In Proc. of European Conference on Computer Vision.
  • Chen et al. (2018a) Huaijin Chen, Jinwei Gu, Orazio Gallo, Ming-Yu Liu, Ashok Veeraraghavan, and Jan Kautz. 2018a. Reblur2deblur: Deblurring videos via self-supervised learning. In 2018 IEEE International Conference on Computational Photography (ICCP). IEEE, 1–9.
  • Chen et al. (2018c) Lixiong Chen, Yinqiang Zheng, Art Subpa-asa, and Imari Sato. 2018c. Polarimetric Three-View Geometry. In Proc. of European Conference on Computer Vision.
  • Cui et al. (2017) Zhaopeng Cui, Jinwei Gu, Boxin Shi, Ping Tan, and Jan Kautz. 2017. Polarimetric Multi-View Stereo. In Proc. of Computer Vision and Pattern Recognition.
  • Debevec (2008) Paul Debevec. 2008. Rendering Synthetic Objects into Real Scenes: Bridging Traditional and Image-based Graphics with Global Illumination and High Dynamic Range Photography. In ACM SIGGRAPH 2008 Classes. 32:1–32:10.
  • Deschaintre et al. (2018) Valentin Deschaintre, Miika Aittala, Fredo Durand, George Drettakis, and Adrien Bousseau. 2018. Single-image svbrdf capture with a rendering-aware deep network. ACM Transactions on Graphics (TOG) 37, 4 (2018), 128.
  • Diamond et al. (2017) Steven Diamond, Vincent Sitzmann, Felix Heide, and Gordon Wetzstein. 2017. Unrolled optimization with deep priors. arXiv preprint arXiv:1705.08041 (2017).
  • Drbohlav and Sara (2001) Ondrej Drbohlav and Radim Sara. 2001. Unambiguous determination of shape from photometric stereo with unknown light sources. In Proceedings Eighth IEEE International Conference on Computer Vision. ICCV 2001, Vol. 1. IEEE, 581–586.
  • Ghosh et al. (2010) Abhijeet Ghosh, Tongbo Chen, Pieter Peers, Cyrus A Wilson, and Paul Debevec. 2010. Circularly polarized spherical illumination reflectometry. ACM Transactions on Graphics (TOG) 29, 6 (2010), 162.
  • Ghosh et al. (2011) Abhijeet Ghosh, Graham Fyffe, Borom Tunwattanapong, Jay Busch, Xueming Yu, and Paul Debevec. 2011. Multiview face capture using polarized spherical gradient illumination. In ACM Transactions on Graphics (TOG), Vol. 30. ACM, 129.
  • Glorot and Bengio (2010) Xavier Glorot and Yoshua Bengio. 2010. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the thirteenth international conference on artificial intelligence and statistics. 249–256.
  • Goy et al. (2018a) Alexandre Goy, Kwabena Arthur, Shuai Li, and George Barbastathis. 2018a. Low photon count phase retrieval using deep learning. Physical review letters 121, 24 (2018), 243902.
  • Goy et al. (2018b) Alexandre Goy, Girish Roghoobur, Shuai Li, Kwabena Arthur, Akintunde I Akinwande, and George Barbastathis. 2018b. High-Resolution Limited-Angle Phase Tomography of Dense Layered Objects Using Deep Neural Networks. arXiv preprint arXiv:1812.07380 (2018).
  • Guarnera et al. (2012) Giuseppe Claudio Guarnera, Pieter Peers, Paul Debevec, and Abhijeet Ghosh. 2012. Estimating surface normals from spherical stokes reflectance fields. In European Conference on Computer Vision. Springer, 340–349.
  • He et al. (2015) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2015. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In Proc. of International Conference on Computer Vision.
  • Huynh et al. (2010) Cong Phuoc Huynh, A. Robles-Kelly, and Edwin R. Hancock. 2010. Shape and refractive index recovery from single-view polarisation images. In Proc. of Computer Vision and Pattern Recognition.
  • Huynh et al. (2013) Cong Phuoc Huynh, A. Robles-Kelly, and Edwin R. Hancock. 2013. Shape and refractive index from single-view spectro-polarimetric images. International Journal of Computer Vision 101, 1 (2013), 64.
  • Ikehata (2018) Satoshi Ikehata. 2018. CNN-PS: CNN-based photometric stereo for general non-convex surfaces. In Proc. of European Conference on Computer Vision.
  • Ioffe and Szegedy (2015) Sergey Ioffe and Christian Szegedy. 2015. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167 (2015).
  • Jakob (2010) Wenzel Jakob. 2010. Mitsuba renderer. (2010).
  • Jin et al. (2017) Kyong Hwan Jin, Michael T McCann, Emmanuel Froustey, and Michael Unser. 2017. Deep convolutional neural network for inverse problems in imaging. IEEE Transactions on Image Processing 26, 9 (2017), 4509–4522.
  • Kadambi et al. (2015) Achuta Kadambi, Vage Taamazyan, Boxin Shi, and Ramesh Raskar. 2015. Polarized 3d: High-quality depth sensing with polarization cues. In Proceedings of the IEEE International Conference on Computer Vision. 3370–3378.
  • Kadambi et al. (2017) Achuta Kadambi, Vage Taamazyan, Boxin Shi, and Ramesh Raskar. 2017. Depth sensing using geometrically constrained polarization normals. International Journal of Computer Vision 125, 1-3 (2017), 34–51.
  • Karpatne et al. (2017) Anuj Karpatne, William Watkins, Jordan Read, and Vipin Kumar. 2017. Physics-guided neural networks (pgnn): An application in lake temperature modeling. arXiv preprint arXiv:1710.11431 (2017).
  • Kingma and Ba (2014) Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014).
  • Le et al. (2017) Hoang M Le, Yisong Yue, Peter Carr, and Patrick Lucey. 2017. Coordinated multi-agent imitation learning. In Proceedings of the 34th International Conference on Machine Learning-Volume 70. JMLR. org, 1995–2003.
  • Li et al. (2018a) Lerenhan Li, Jinshan Pan, Wei-Sheng Lai, Changxin Gao, Nong Sang, and Ming-Hsuan Yang. 2018a. Learning a discriminative prior for blind image deblurring. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 6616–6625.
  • Li et al. (2017) Xiao Li, Yue Dong, Pieter Peers, and Xin Tong. 2017. Modeling surface appearance from a single photograph using self-augmented convolutional neural networks. ACM Transactions on Graphics (TOG) 36, 4 (2017), 45.
  • Li et al. (2018b) Zhengqin Li, Kalyan Sunkavalli, and Manmohan Chandraker. 2018b. Materials for masses: SVBRDF acquisition with a single mobile phone image. In Proceedings of the European Conference on Computer Vision (ECCV). 72–87.
  • Li et al. (2018c) Zhengqin Li, Zexiang Xu, Ravi Ramamoorthi, Kalyan Sunkavalli, and Manmohan Chandraker. 2018c. Learning to reconstruct shape and spatially-varying reflectance from a single image. In SIGGRAPH Asia 2018 Technical Papers. ACM, 269.
  • Lindell et al. (2018) David B. Lindell, Matthew O’Toole, and Gordon Wetzstein. 2018. Single-Photon 3D Imaging with Deep Sensor Fusion. ACM Transactions on Graphics (Proc. of ACM SIGGRAPH) 37, 4 (2018), 113.
  • Lucid Vision Phoenix polarization camera (2018) Lucid Vision Phoenix polarization camera. 2018.
  • Ma et al. (2007) Wan-Chun Ma, Tim Hawkins, Pieter Peers, Charles-Felix Chabert, Malte Weiss, and Paul Debevec. 2007. Rapid acquisition of specular and diffuse normal maps from polarized spherical gradient illumination. In Proceedings of the 18th Eurographics conference on Rendering Techniques. Eurographics Association, 183–194.
  • Maas et al. (2013) Andrew L Maas, Awni Y Hannun, and Andrew Y Ng. 2013. Rectifier nonlinearities improve neural network acoustic models. In Proc. icml, Vol. 30. 3.
  • Maeda et al. (2018) Tomohiro Maeda, Achuta Kadambi, Yoav Y Schechner, and Ramesh Raskar. 2018. Dynamic heterodyne interferometry. In 2018 IEEE International Conference on Computational Photography (ICCP). IEEE, 1–11.
  • Mahmoud et al. (2012) Ali H Mahmoud, Moumen T El-Melegy, and Aly A Farag. 2012. Direct method for shape recovery from polarization and shading. In Proc. of International Conference on Image Processing. IEEE.
  • Marco et al. (2017) Julio Marco, Quercus Hernandez, Adolfo Munoz, Yue Dong, Adrian Jarabo, Min H Kim, Xin Tong, and Diego Gutierrez. 2017. DeepToF: off-the-shelf real-time correction of multipath interference in time-of-flight imaging. ACM Transactions on Graphics (Proc. of ACM SIGGRAPH Asia) 36, 6 (2017), 219.
  • Miyazaki et al. (2004) Daisuke Miyazaki, Masataka Kagesawa, and Katsushi Ikeuchi. 2004. Transparent surface modeling from a pair of polarization images. IEEE Transactions on Pattern Analysis and Machine Intelligence 26, 1 (2004), 73–82.
  • Miyazaki et al. (2016) Daisuke Miyazaki, Takuya Shigetomi, Masashi Baba, Ryo Furukawa, Shinsaku Hiura, and Naoki Asada. 2016. Surface normal estimation of black specular objects from multiview polarization images. Optical Engineering 56, 4 (2016), 041303.
  • Miyazaki et al. (2003) Daisuke Miyazaki, Robby T Tan, Kenji Hara, and Katsushi Ikeuchi. 2003. Polarization-based inverse rendering from a single view. In Proc. of International Conference on Computer Vision.
  • Ngo et al. (2015) Trung Thanh Ngo, Hajime Nagahara, and R. Taniguchi. 2015. Shape and light directions from shading and polarization. In Proc. of Computer Vision and Pattern Recognition.
  • Pan et al. (2018) Jinshan Pan, Yang Liu, Jiangxin Dong, Jiawei Zhang, Jimmy Ren, Jinhui Tang, Yu-Wing Tai, and Ming-Hsuan Yang. 2018. Physics-based generative adversarial models for image restoration and beyond. arXiv preprint arXiv:1808.00605 (2018).
  • Paszke et al. (2017) Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. 2017. Automatic differentiation in pytorch. (2017).
  • PolarM polarization camera (2017) PolarM polarization camera. 2017.
  • Riviere et al. (2017) Jérémy Riviere, Ilya Reshetouski, Luka Filipi, and Abhijeet Ghosh. 2017. Polarization imaging reflectometry in the wild. ACM Transactions on Graphics (TOG) 36, 6 (2017), 206.
  • Ronneberger et al. (2015) Olaf Ronneberger, Philipp Fischer, and Thomas Brox. 2015. U-net: Convolutional networks for biomedical image segmentation. In Proc. of International Conference on Medical image computing and computer-assisted intervention. Springer.
  • Santo et al. (2017) Hiroaki Santo, Masaki Samejima, Yusuke Sugano, Boxin Shi, and Yasuyuki Matsushita. 2017. Deep photometric stereo network. In Proc. of International Conference on Computer Vision Workshops.
  • Satat et al. (2017) Guy Satat, Matthew Tancik, Otkrist Gupta, Barmak Heshmat, and Ramesh Raskar. 2017. Object classification through scattering media with deep learning on time resolved measurement. Optics express 25, 15 (2017), 17466–17479.
  • Sengupta et al. (2018) Soumyadip Sengupta, Angjoo Kanazawa, Carlos D Castillo, and David W Jacobs. 2018. SfSNet: Learning Shape, Reflectance and Illuminance of Faces ‘in the Wild’. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 6296–6305.
  • Shi et al. (2019) Boxin Shi, Zhipeng Mo, Zhe Wu, Dinglong Duan, Sai-Kit Yeung, and Ping Tan. 2019. A Benchmark Dataset and Evaluation for Non-Lambertian and Uncalibrated Photometric Stereo. IEEE Transactions on Pattern Analysis and Machine Intelligence 41, 2 (2019), 271–284.
  • Shi et al. (2018) Guanya Shi, Xichen Shi, Michael O’Connell, Rose Yu, Kamyar Azizzadenesheli, Animashree Anandkumar, Yisong Yue, and Soon-Jo Chung. 2018. Neural lander: Stable drone landing control using learned dynamics. arXiv preprint arXiv:1811.08027 (2018).
  • SHINING 3D scanner (2018) SHINING 3D scanner. 2018.
  • Smith et al. (2016) William A. P. Smith, Ravi Ramamoorthi, and Silvia Tozza. 2016. Linear depth estimation from an uncalibrated, monocular polarisation image. In Proc. of European Conference on Computer Vision.
  • Smith et al. (2018) William A. P. Smith, Ravi Ramamoorthi, and Silvia Tozza. 2018. Height-from-Polarisation with Unknown Lighting or Albedo. IEEE Transactions on Pattern Analysis and Machine Intelligence (2018).
  • Stewart and Ermon (2017) Russell Stewart and Stefano Ermon. 2017. Label-free supervision of neural networks with physics and domain knowledge. In Thirty-First AAAI Conference on Artificial Intelligence.
  • Su et al. (2018) Shuochen Su, Felix Heide, Gordon Wetzstein, and Wolfgang Heidrich. 2018. Deep End-to-End Time-of-Flight Imaging. In Proc. of Computer Vision and Pattern Recognition.
  • Tancik et al. (2018a) Matthew Tancik, Guy Satat, and Ramesh Raskar. 2018a. Flash Photography for Data-Driven Hidden Scene Recovery. arXiv preprint arXiv:1810.11710 (2018).
  • Tancik et al. (2018b) Matthew Tancik, Tristan Swedish, Guy Satat, and Ramesh Raskar. 2018b. Data-Driven Non-Line-of-Sight Imaging With A Traditional Camera. In Imaging Systems and Applications. Optical Society of America, IW2B–6.
  • Taniai and Maehara (2018) Tatsunori Taniai and Takanori Maehara. 2018. Neural inverse rendering for general reflectance photometric stereo. In Proc. of International Conference on Machine Learning.
  • Tozza et al. (2017) Silvia Tozza, William A. P. Smith, Dizhong Zhu, Ravi Ramamoorthi, and Edwin R. Hancock. 2017. Linear Differential Constraints for Photo-polarimetric Height Estimation. In Proc. of International Conference on Computer Vision.
  • Wolff (1997) Lawrence B. Wolff. 1997. Polarization vision: A new sensory approach to image understanding. Image Vision Computing 15, 2 (1997), 81–93.
  • Yang et al. (2018) Luwei Yang, Feitong Tan, Ao Li, Zhaopeng Cui, Yasutaka Furukawa, and Ping Tan. 2018. Polarimetric Dense Monocular SLAM. In Proc. of Computer Vision and Pattern Recognition.
  • Ye et al. (2018) Wenjie Ye, Xiao Li, Yue Dong, Pieter Peers, and Xin Tong. 2018. Single Image Surface Appearance Modeling with Self-augmented CNNs and Inexact Supervision. In Computer Graphics Forum, Vol. 37. Wiley Online Library, 201–211.