1 Introduction
3D shape recovery from images is a central problem in computer vision. While geometric approaches such as binocular (Kendall et al., 2017; Taniai et al., 2017) and multi-view stereo (Furukawa & Ponce, 2010) use images from different viewpoints to triangulate 3D points, photometric stereo (Woodham, 1980) uses varying shading cues across multiple images to recover 3D surface normals. It is well known that photometric methods prevail in recovering fine surface details, and they play an essential role in highly accurate 3D shape recovery by combined approaches (Nehab et al., 2005; Esteban et al., 2008; Park et al., 2017). Although there exists a closed-form least squares solution for the simplest Lambertian surfaces, such ideally diffuse materials rarely exist in the real world. Photometric stereo for surfaces with unknown general reflectance properties (i.e., bidirectional reflectance distribution functions or BRDFs) still remains a fundamental challenge (Shi et al., 2018).
Meanwhile, deep learning technologies have drastically pushed the envelope of the state of the art in many computer vision tasks such as image recognition (Krizhevsky et al., 2012; He et al., 2015, 2016), segmentation (He et al., 2017b), and stereo vision (Kendall et al., 2017). As for photometric stereo, it is promising to replace hand-crafted reflectance models with deep neural networks that learn complicated BRDFs. However, studies in this direction have so far been surprisingly limited (Santo et al., 2017; Hold-Geoffroy et al., 2018). This is possibly due to the difficulty of producing a large amount of training data with ground truth. Accurately measuring surface normals of real objects is very difficult, because we need highly accurate 3D shapes to reliably compute surface gradients. In fact, a real-world scene benchmark of photometric stereo with ground truth has only recently been introduced, by precisely registering laser-scanned 3D meshes onto 2D images (Shi et al., 2018). Using synthetic training data is possible (Santo et al., 2017), but we then need photorealistic rendering that should ideally account for various realistic BRDFs and object shapes, spatially-varying BRDFs and materials, the presence of cast shadows and inter-reflections, etc. This is more demanding than training-data synthesis for stereo and optical flow (Mayer et al., 2016), where rendering with the simplest Lambertian reflectance often suffices. Also, measuring BRDFs of real materials requires effort, and an existing BRDF database (Matusik et al., 2003) provides only a limited number of materials. As another difficulty of applying deep learning to photometric stereo, when networks are pre-trained, they need to be invariant to permutation of inputs, i.e., permuting input images (and corresponding illuminations) should not change the resulting surface normals. Existing neural network methods (Santo et al., 2017) avoid this problem by assuming the same illumination patterns throughout the training and testing phases, which limits the application scenarios of such methods.
In this paper, we propose a novel convolutional neural network (CNN) architecture for general BRDF photometric stereo. Given observed images and corresponding lighting directions, our network inverse-renders surface normals and spatially-varying BRDFs from the images, which are further fed into the reflectance (or rendering) equation to synthesize the observed images (see Fig. 1). The network weights are optimized by minimizing the reconstruction loss between observed and synthesized images, enabling unsupervised learning that does not use ground truth normals. Furthermore, learning is performed directly on a test scene during the testing phase without any pre-training, so the permutation invariance problem does not arise in our framework. Our method is evaluated on a challenging real-world scene benchmark (Shi et al., 2018) and is shown to outperform a state-of-the-art learning-based method (Santo et al., 2017) as well as classical unsupervised methods (Shi et al., 2014, 2012; Ikehata & Aizawa, 2014; Ikehata et al., 2012; Wu et al., 2010; Goldman et al., 2010; Higo et al., 2010; Alldrin et al., 2008). We summarize the advantages of our method as follows.

Existing neural network methods require pre-training on synthetic data whenever the illumination conditions of test scenes change from the trained ones. In contrast, our physics-based approach can directly fit the network weights to a test scene in an unsupervised fashion.

Compared to classical physics-based approaches, we leverage deep neural networks to learn complicated reflectance models, rather than manually analyzing reflectance properties and inventing reflectance models.

Yet, our physics-based network architecture allows us to exploit prior knowledge about reflectance properties that has been broadly studied in the literature.
2 Preliminaries
Before presenting our method, we recap basic settings and approaches in photometric stereo. Suppose a reflective surface with a unit normal vector $\boldsymbol{n} \in \mathbb{R}^3$ is illuminated by a point light source $\bar{\boldsymbol{l}} = e\boldsymbol{l}$ (where $\bar{\boldsymbol{l}}$ has an intensity $e$ and a unit direction $\boldsymbol{l}$), without inter-reflection and ambient lighting. When this surface is observed by a linear-response camera in a view direction $\boldsymbol{v}$, its pixel intensity $I$ is determined as follows:
$$I = s\,\rho(\boldsymbol{n}, \boldsymbol{l}, \boldsymbol{v})\,\max(\boldsymbol{n}^\top \bar{\boldsymbol{l}}, 0). \quad (1)$$
Here, $s \in \{0, 1\}$ is a binary function for the presence of a cast shadow, $\rho(\boldsymbol{n}, \boldsymbol{l}, \boldsymbol{v})$ is a BRDF, and $\max(\boldsymbol{n}^\top \bar{\boldsymbol{l}}, 0)$ represents an attached shadow. Figure 2 illustrates this situation.
The goal of photometric stereo is to recover the surface normal $\boldsymbol{n}$ from intensities $I_1, I_2, \ldots, I_M$ observed while changing the illuminations $\bar{\boldsymbol{l}}_1, \bar{\boldsymbol{l}}_2, \ldots, \bar{\boldsymbol{l}}_M$. Here, we usually assume a camera with a fixed viewpoint and an orthographic projection model, in which case the view direction is constant and typically $\boldsymbol{v} = [0, 0, 1]^\top$. Also, light sources are assumed to be infinitely distant, so that each $\bar{\boldsymbol{l}}_i$ is uniform over the entire object surface.
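To make the setup concrete, the reflectance equation of Eq. (1) can be evaluated in a few lines of NumPy. This is an illustrative sketch (the function and variable names are our own), using a constant Lambertian BRDF as the simplest example:

```python
import numpy as np

def render_pixel(n, l_bar, v, brdf, cast_shadow=1.0):
    """Eq. (1): I = s * rho(n, l, v) * max(n . l_bar, 0).

    n           : unit surface normal, shape (3,)
    l_bar       : light vector = intensity * unit direction, shape (3,)
    v           : unit view direction, shape (3,)
    brdf        : callable rho(n, l, v) returning a scalar reflectance
    cast_shadow : binary s in {0, 1}
    """
    l = l_bar / np.linalg.norm(l_bar)        # unit light direction
    shading = max(float(n @ l_bar), 0.0)     # attached shadow via max(., 0)
    return cast_shadow * brdf(n, l, v) * shading

# Lambertian example: constant rho = 0.5, frontal light of intensity 2.
n = v = np.array([0.0, 0.0, 1.0])
l_bar = np.array([0.0, 0.0, 2.0])
lambertian = lambda n, l, v: 0.5
I = render_pixel(n, l_bar, v, lambertian)    # 0.5 * max(2, 0) = 1.0
```

A normal facing away from the light yields a zero intensity through the `max(., 0)` term, which is exactly the attached-shadow behavior described above.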
2.1 Lambertian model and least squares method
When the BRDF is constant as $\rho(\boldsymbol{n}, \boldsymbol{l}, \boldsymbol{v}) = \rho$, the surface is purely diffuse. Such a model is called the Lambertian reflectance, and the value $\rho$ is called the albedo. In this case, estimation of $\boldsymbol{n}$ is relatively easy, because for bright pixels ($I > 0$) the reflectance equation of Eq. (1) becomes a linear equation: $I = \boldsymbol{b}^\top \bar{\boldsymbol{l}}$, where $\boldsymbol{b} = \rho\boldsymbol{n}$. Therefore, if we know at least three intensity measurements $\boldsymbol{I} = [I_1, I_2, \ldots, I_M]^\top$ ($M \geq 3$) and their lighting conditions $L = [\bar{\boldsymbol{l}}_1, \bar{\boldsymbol{l}}_2, \ldots, \bar{\boldsymbol{l}}_M]^\top$, then we obtain a linear system
$$L\boldsymbol{b} = \boldsymbol{I}, \quad (2)$$
which is solved by least squares as
$$\boldsymbol{b} = L^{+}\boldsymbol{I}. \quad (3)$$
Here, $L^{+}$ is the pseudo-inverse of $L$, and the resulting vector $\boldsymbol{b}$ is then L2-normalized to obtain the unit normal $\boldsymbol{n}$.
In practice, images are contaminated as $\boldsymbol{I} + \boldsymbol{e}$ due to sensor noise, inter-reflections, etc. Therefore, we often set a threshold $\tau$ for selecting inlier observation pixels ($I_i > \tau$).
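The least squares solution of Eqs. (2)–(3) can be sketched as follows. The example simulates a noise-free Lambertian pixel under five frontal-ish lights, so no thresholding is needed; all names are illustrative:

```python
import numpy as np

def lambertian_normal(I, L):
    """Solve L b = I by least squares (Eq. (3)): b = pinv(L) @ I, b = albedo * n."""
    b = np.linalg.pinv(L) @ I      # b points along the normal, scaled by albedo
    albedo = np.linalg.norm(b)
    return b / albedo, albedo      # L2-normalize to get the unit normal

# Simulated Lambertian pixel: albedo 0.8, known normal, five light vectors.
n_true = np.array([0.3, -0.2, 0.933])
n_true /= np.linalg.norm(n_true)
L = np.array([[0.0,  0.0, 1.0],
              [0.5,  0.0, 1.0],
              [0.0,  0.5, 1.0],
              [-0.3, 0.2, 1.0],
              [0.2, -0.4, 1.0]])
I = np.maximum(L @ (0.8 * n_true), 0.0)   # Eq. (1) with rho = 0.8, no shadows
n_est, albedo_est = lambertian_normal(I, L)
```

Since all five lights illuminate this normal (no `max` clipping is active) and the system is noise-free, the recovery is exact; with real data, the thresholding described above would be applied first.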
When the lighting conditions $\{\bar{\boldsymbol{l}}_i\}$ are unknown, the problem is called uncalibrated photometric stereo. It is known that this problem has the so-called bas-relief ambiguity (Belhumeur et al., 1999) and is difficult even for Lambertian surfaces. In this paper, we focus on calibrated photometric stereo settings that assume known lighting conditions.
2.2 Photometric stereo for general BRDF surfaces
When the BRDF $\rho$ has unknown non-Lambertian properties, photometric stereo becomes very challenging, because we essentially need to know the form of the BRDF, either by assuming some reflectance model for it or by directly estimating $\rho$ along with the surface normal $\boldsymbol{n}$. Below, we briefly review such existing approaches and their limitations. For more comprehensive reviews, please refer to the recent excellent survey by Shi et al. (2018).
Outlier rejection based methods.
A group of methods treat non-Lambertian reflectance components, including specular highlights and shadows, as outliers to the Lambertian model. Thus, Eq. (2) is rewritten to
$$L\boldsymbol{b} + \boldsymbol{e} = \boldsymbol{I}, \quad (4)$$
where the non-Gaussian outliers $\boldsymbol{e}$ are assumed to be sparse. Recent methods solve this sparse regression problem by using robust statistical techniques (Wu et al., 2010; Ikehata et al., 2012) or learnable optimization networks (Xin et al., 2016; He et al., 2017a). However, this approach cannot handle broad and soft specularity, for which the sparse outlier assumption collapses (Shi et al., 2018).
Analytic BRDF models.
Another type of method uses more realistic BRDF models than the Lambertian model, matured in the computer graphics literature, e.g., the Torrance-Sparrow model (Georghiades, 2003), the Ward model (Chung & Jia, 2008), or a Ward mixture model (Goldman et al., 2010). These models explicitly consider specularity rather than treating it as outliers, and often take the form of a sum of diffuse and specular components:
$$\rho(\boldsymbol{n}, \boldsymbol{l}, \boldsymbol{v}) = \rho_{\mathrm{d}} + \rho_{\mathrm{s}}(\boldsymbol{n}, \boldsymbol{l}, \boldsymbol{v}). \quad (5)$$
However, these methods rely on hand-crafted models that can only handle narrow classes of materials.
General isotropic BRDF properties.
More advanced methods directly estimate the unknown BRDF $\rho$ by exploiting some general BRDF properties. For example, many materials have an isotropic BRDF that only depends on relative angles between $\boldsymbol{n}$, $\boldsymbol{l}$, and $\boldsymbol{v}$. Given the isotropy, Ikehata & Aizawa (2014) further assume the following bivariate BRDF function
$$\rho(\boldsymbol{n}, \boldsymbol{l}, \boldsymbol{v}) = \bar{\rho}(\boldsymbol{n}^\top\boldsymbol{h},\, \boldsymbol{l}^\top\boldsymbol{h}) \quad (6)$$
with monotonicity and non-negativity constraints. Similarly, Shi et al. (2014) exploit a low-frequency prior of BRDFs and propose a bi-polynomial BRDF:
$$\rho(\boldsymbol{n}, \boldsymbol{l}, \boldsymbol{v}) = \Big(\sum_{p=0}^{k} a_p x^p\Big)\Big(\sum_{q=0}^{k} b_q y^q\Big), \quad (7)$$
where $x = \boldsymbol{n}^\top\boldsymbol{h}$, $y = \boldsymbol{l}^\top\boldsymbol{h}$, and $\boldsymbol{h} = (\boldsymbol{l} + \boldsymbol{v})/\|\boldsymbol{l} + \boldsymbol{v}\|$ is the half vector between the lighting and viewing directions.
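As a minimal illustration, a bi-polynomial BRDF of the form in Eq. (7) can be evaluated as below. The variable choices ($x$ as the normal/half-vector dot product, $y$ as the light/half-vector dot product) and the coefficient arrays are stated here as assumptions for the sketch:

```python
import numpy as np

def bipolynomial_brdf(a, b, n, l, v):
    """rho = (sum_p a[p] x^p) * (sum_q b[q] y^q), with x = n.h and y = l.h,
    where h is the half vector between light and view directions.
    Coefficients a, b are ordered from degree 0 upward."""
    h = (l + v) / np.linalg.norm(l + v)
    x, y = float(n @ h), float(l @ h)
    return (np.polynomial.polynomial.polyval(x, a)
            * np.polynomial.polynomial.polyval(y, b))

n = v = l = np.array([0.0, 0.0, 1.0])                 # frontal light and view
rho = bipolynomial_brdf([0.5, 0.5], [1.0], n, l, v)   # x = y = 1, so rho = 1.0
```

Fitting the low-degree coefficients `a` and `b` to observations is what gives this model its low-frequency character: each factor is a smooth univariate polynomial.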
Our method is close to the last approach in that we learn broad classes of BRDFs from observations without restricting them to a particular reflectance model. However, unlike those methods that fully rely on careful human analysis of BRDF properties, we leverage the powerful expressibility of deep neural networks to learn general, complicated BRDFs. Yet, our network architecture also explicitly uses the physical reflectance equation of Eq. (1) internally, which allows us to incorporate abundant wisdom about reflectance developed in the literature into neural network based approaches.
3 Proposed method
In this section, we present our novel inverse-rendering based neural network architecture for photometric stereo, and explain its learning procedure with a technique of early-stage weak supervision. Here, as standard settings of calibrated photometric stereo, we assume $M$ patterns of light source directions $\bar{\boldsymbol{l}}_i$ ($i = 1, 2, \ldots, M$) and corresponding image observations $\{\boldsymbol{I}_1, \boldsymbol{I}_2, \ldots, \boldsymbol{I}_M\}$ as inputs. We also assume that the mask of target object regions is provided. Our goal is to estimate the surface normal map $\boldsymbol{N}$ of the target object regions.
Notations.
We use bold capital letters for tensors and matrices, and bold small letters for vectors. We use tensors of dimensionality $C \times H \times W$ to represent images, normal maps, and other feature maps, where $C$ is some channel number and $H \times W$ is the spatial resolution. Thus, $\boldsymbol{N} \in \mathbb{R}^{3 \times H \times W}$ and $\boldsymbol{I}_i \in \mathbb{R}^{C_I \times H \times W}$, where $C_I$ is the number of color channels of images. We use the subscript $p$ to denote a pixel location of such tensors, e.g., $\boldsymbol{n}_p \in \mathbb{R}^3$ is a normal vector at $p$. The light vectors $\bar{\boldsymbol{l}}_i$ can also have color channels, in which case they are $3 \times C_I$ matrices, but we use a small letter for intuitiveness. The index $i$ is always used to denote the observation index ($i = 1, 2, \ldots, M$). When we use tensors of dimensionality $B \times C \times H \times W$, the first dimension $B$ denotes a minibatch size processed in one SGD iteration.
3.1 Network architecture
We illustrate our network architecture in Fig. 3. Our method uses two subnetworks, which we name the photometric stereo network (PSNet) and the image reconstruction network (IRNet). PSNet predicts a surface normal map as the desired output, given the input images. On the other hand, IRNet synthesizes the observed images using the rendering equation of Eq. (1). The synthesized images are used to define a reconstruction loss against the observed images, which produces gradients flowing into both networks and enables learning without ground truth supervision. We explain these two networks in more detail below.
3.1.1 Photometric stereo network
Given a tensor $\boldsymbol{I}$ that concatenates all input images along the channel axis, PSNet first converts it to an abstract feature map $\boldsymbol{\Phi}$ as
$$\boldsymbol{\Phi} = \psi_{\mathrm{feat}}(\boldsymbol{I}), \quad (8)$$
and then outputs a surface normal map given as
$$\boldsymbol{N} = \psi_{\mathrm{normal}}(\boldsymbol{\Phi}). \quad (9)$$
Here, $\psi_{\mathrm{feat}}$ is a feed-forward CNN of three layers with learnable parameters, where each layer applies 3x3 Conv, BatchNorm (Ioffe & Szegedy, 2015), and ReLU with a fixed channel width, and uses no skip-connections or pooling. Similarly, $\psi_{\mathrm{normal}}$ applies 3x3 Conv and L2 normalization that makes each $\boldsymbol{n}_p$ a unit vector.
3.1.2 Image reconstruction network
IRNet synthesizes each observed image $\boldsymbol{I}_i$ as $\hat{\boldsymbol{I}}_i$ based on the rendering equation of Eq. (1). Specifically, IRNet first predicts $\boldsymbol{R}_i$, the multiplication of a cast shadow and a BRDF, under a particular illumination $\bar{\boldsymbol{l}}_i$ as
$$\boldsymbol{R}_i = \psi_{\mathrm{IR}}(\boldsymbol{I}_i, \boldsymbol{\Phi}, \bar{\boldsymbol{l}}_i). \quad (10)$$
Here, we call $\boldsymbol{R}_i$ a reflectance image, which is produced by a CNN $\psi_{\mathrm{IR}}$ as explained later. Then, IRNet synthesizes each image by the rendering equation below:
$$\hat{\boldsymbol{I}}_i = \boldsymbol{R}_i \odot \max(\boldsymbol{N}^\top \bar{\boldsymbol{l}}_i, 0). \quad (11)$$
Here, the inner products between light and normal vectors are computed at each pixel $p$ as $\boldsymbol{n}_p^\top \bar{\boldsymbol{l}}_i$. Note that when $\bar{\boldsymbol{l}}_i$ has color channels, we multiply the matrix $\bar{\boldsymbol{l}}_i^\top$ with $\boldsymbol{n}_p$. Consequently, $\hat{\boldsymbol{I}}_i$ and $\boldsymbol{R}_i$ have the same dimensions as $\boldsymbol{I}_i$. The $\max$ is done elementwise and is implemented by ReLU, and $\odot$ is elementwise multiplication. We now explain the details of $\psi_{\mathrm{IR}}$ by dividing it into three parts.
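The rendering equation of Eq. (11) vectorizes naturally over all observations at once. A grayscale NumPy sketch (function and variable names are illustrative, not the paper's implementation):

```python
import numpy as np

def render_images(R, N, lights):
    """Batched Eq. (11): I_hat_i = R_i * max(N^T l_i, 0), grayscale case.

    R      : (M, H, W) reflectance images (cast shadow * BRDF)
    N      : (3, H, W) unit normal map
    lights : (M, 3) light vectors
    Returns (M, H, W) synthesized images.
    """
    shading = np.einsum('mc,chw->mhw', lights, N)  # n_p . l_i for all i, p
    shading = np.maximum(shading, 0.0)             # elementwise max = ReLU
    return R * shading                             # elementwise multiplication

# Flat surface (normals all +z), lit frontally (intensity 2) and from behind.
N = np.zeros((3, 2, 2)); N[2] = 1.0
lights = np.array([[0.0, 0.0, 2.0], [0.0, 0.0, -1.0]])
out = render_images(np.ones((2, 2, 2)), N, lights)
```

The back-lit observation renders to zero everywhere through the ReLU, which is how attached shadows enter the reconstruction.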
Individual observation transform.
The first part transforms each observed image (which we denote as $\boldsymbol{I}'_i$ after the augmentation below) into a feature map $\boldsymbol{F}_i$ as follows:
$$\boldsymbol{F}_i = \psi_{\mathrm{obs}}(\boldsymbol{I}'_i). \quad (12)$$
The network architecture of $\psi_{\mathrm{obs}}$ is the same as that of the feature CNN in Eq. (8), except for the channel widths. To more effectively learn BRDFs, we use an additional specularity channel $\boldsymbol{S}_i$ for the input as
$$\boldsymbol{I}'_i = [\boldsymbol{I}_i; \boldsymbol{S}_i], \quad (13)$$
where $\boldsymbol{S}_i$ is computed at each pixel $p$ as
$$S_{ip} = \boldsymbol{r}_{ip}^\top \boldsymbol{v}, \qquad \boldsymbol{r}_{ip} = 2(\boldsymbol{n}_p^\top \boldsymbol{l}_i)\boldsymbol{n}_p - \boldsymbol{l}_i. \quad (14)$$
Here, $\boldsymbol{r}_{ip}$ is the direction of the specular reflection (the dashed line between $\boldsymbol{l}_i$ and $\boldsymbol{v}$ in Fig. 2). It is well known from past studies that this quantity is highly correlated with the actual specular component of a BRDF. Therefore, directly giving it as a hint to the network promotes learning of complex BRDFs.
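A sketch of such a specularity channel is given below. The classic mirror-reflection form $\boldsymbol{r} = 2(\boldsymbol{n}^\top\boldsymbol{l})\boldsymbol{n} - \boldsymbol{l}$ with channel value $\boldsymbol{r}^\top\boldsymbol{v}$ is an assumption here, and all names are illustrative:

```python
import numpy as np

def specularity_channel(N, l, v):
    """Per-pixel hint channel correlated with specular reflection.

    Assumed form: r_p = 2 (n_p . l) n_p - l is the mirror reflection of the
    light direction l about the normal n_p; the channel value is r_p . v.

    N : (3, H, W) unit normal map; l, v : (3,) unit vectors.
    Returns an (H, W) channel.
    """
    ndotl = np.einsum('c,chw->hw', l, N)           # n_p . l at every pixel
    r = 2.0 * ndotl[None] * N - l[:, None, None]   # mirror direction per pixel
    return np.einsum('c,chw->hw', v, r)            # r_p . v

# Frontal normal: a frontal light mirrors straight back into the view.
N = np.zeros((3, 1, 1)); N[2] = 1.0
v = np.array([0.0, 0.0, 1.0])
S = specularity_channel(N, np.array([0.0, 0.0, 1.0]), v)   # value 1.0
```

The channel peaks where the mirror direction aligns with the view, i.e., exactly where specular highlights are expected.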
Global observation blending.
Because $\boldsymbol{F}_i$ has limited observation information under a particular illumination $\bar{\boldsymbol{l}}_i$, we enrich it with $\boldsymbol{\Phi}$ in Eq. (8), which has more comprehensive information of the scene. We do this similarly to the global and local feature blending in (Charles et al., 2017; Iizuka et al., 2016) as
$$\boldsymbol{F}'_i = \psi_{\mathrm{blend}}([\boldsymbol{F}_i; \boldsymbol{\Phi}]), \quad (15)$$
where $\psi_{\mathrm{blend}}$ applies 3x3 Conv, BatchNorm, and ReLU to the concatenated features. Note that applying Conv to $[\boldsymbol{F}_i; \boldsymbol{\Phi}]$ is efficiently done as $\mathrm{Conv}(\boldsymbol{F}_i) + \mathrm{Conv}(\boldsymbol{\Phi})$, where the Conv of $\boldsymbol{\Phi}$ is computed only once and reused for all observations $i$.
Output.
After the blending, we finally output $\boldsymbol{R}_i$ by
$$\boldsymbol{R}_i = \psi_{\mathrm{out}}(\boldsymbol{F}'_i), \quad (16)$$
where $\psi_{\mathrm{out}}$ is 3x3 Conv, BatchNorm, ReLU, and 3x3 Conv. As explained in Eq. (11), the resulting $\boldsymbol{R}_i$ is used to reconstruct each image as $\hat{\boldsymbol{I}}_i$, which is the final output of IRNet.
Note that the internal layers of IRNet all use the same channel width. Also, IRNet simultaneously reconstructs all $M$ images during SGD iterations by treating them as a minibatch. This learning procedure is explained further in the next section.
3.2 Learning procedures (optimization)
We optimize the network parameters of PSNet and IRNet by minimizing the following loss function using SGD:
$$\mathcal{L} = \mathcal{L}_{\mathrm{rec}} + \lambda(t)\,\mathcal{L}_{\mathrm{prior}}. \quad (17)$$
The first term $\mathcal{L}_{\mathrm{rec}}$ defines the reconstruction loss between the synthesized images $\hat{\boldsymbol{I}}_i$ and the observed images $\boldsymbol{I}_i$, which is explained in Sec. 3.2.1. The second term $\mathcal{L}_{\mathrm{prior}}$ defines a weak supervision loss between the predicted $\boldsymbol{N}$ and some prior normal map $\boldsymbol{N}^0$. This term is only activated in the early iterations of SGD (i.e., $\lambda(t) > 0$ only for small iteration counts $t$) in order to warm up the randomly initialized networks and stabilize the learning. This is explained further in Sec. 3.2.2. Other implementation details and hyperparameter settings are described in Sec. 3.2.3.
Most importantly, the network is directly fitted to a particular test scene without any pre-training on other data, by updating the network parameters over SGD iterations. Final results are obtained at convergence.
3.2.1 Reconstruction loss
The reconstruction loss is defined as the mean absolute error between $\hat{\boldsymbol{I}}_i$ and $\boldsymbol{I}_i$ over the target object regions as
$$\mathcal{L}_{\mathrm{rec}} = \frac{1}{M|\mathcal{O}|} \sum_{i=1}^{M} \sum_{p \in \mathcal{O}} \big| \hat{I}_{ip} - I_{ip} \big|. \quad (18)$$
Here, $\mathcal{O}$ is the set of pixels in the binary object mask, and $|\mathcal{O}|$ is the object area size. Using absolute errors increases the robustness to high-intensity specular highlights.
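In NumPy terms, a masked mean absolute error of the form in Eq. (18) looks as follows (grayscale sketch with illustrative names):

```python
import numpy as np

def reconstruction_loss(I_hat, I, mask):
    """Masked mean absolute error, Eq. (18)-style.

    I_hat, I : (M, H, W) synthesized / observed images
    mask     : (H, W) binary object mask
    """
    area = mask.sum()                    # |O|, the object area size
    abs_err = np.abs(I_hat - I) * mask   # zero out pixels outside the object
    return abs_err.sum() / (I.shape[0] * area)
```

Because the L1 penalty grows only linearly with the residual, a few saturated specular pixels cannot dominate the loss the way they would under a squared error.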
3.2.2 Earlystage weak supervision
If the target scene has relatively simple reflectance properties, the reconstruction loss alone can often lead to a good solution, even when starting from randomly initialized networks. However, for complex scenes, we need to warm up the networks by adding the following weak supervision:
$$\mathcal{L}_{\mathrm{prior}} = \frac{1}{|\mathcal{O}|} \sum_{p \in \mathcal{O}} \big\| \boldsymbol{n}_p - \boldsymbol{n}^0_p \big\|^2. \quad (19)$$
Here, the prior normal map $\boldsymbol{N}^0$ is obtained by the simplest least squares method described in Sec. 2.1, using all observed pixels without any thresholding. Due to the presence of shadows and non-Lambertian specularity, this least squares solution can be very inaccurate. However, even such priors work well in our method, because we only use them to guide the optimization in its early stage. For this, we set the supervision weight $\lambda(t)$ to $\mu$ for the initial 50 iterations, and then set it to zero afterwards. The coefficient $\mu$ adaptively balances the weights between $\mathcal{L}_{\mathrm{rec}}$ and $\mathcal{L}_{\mathrm{prior}}$, and is computed as the mean intensity of the observed images over the target object regions.
3.2.3 Implementation details
We use Adam (Kingma & Ba, 2015) as the optimizer. For each test scene, we iterate SGD updates for 1000 steps. Adam's step-size hyperparameter is set to a larger value for the first 900 iterations, and then decreased for the last 100 iterations for fine-tuning. We use the default values for the other hyperparameters. The convolution weights are randomly initialized by He initialization (He et al., 2015).
In each iteration, PSNet predicts a surface normal map $\boldsymbol{N}$, and then IRNet reconstructs all $M$ observed images as samples of a minibatch. Given $\{\hat{\boldsymbol{I}}_i\}$ and $\{\boldsymbol{I}_i\}$, we compute the loss and update the parameters of both networks.
When computing the reconstruction loss in Eq. (18), we randomly drop out 90% of its elements and rescale the rest by a factor of 10 instead. This treatment compensates for the well-known issue of poor local convergence of SGD when a large minibatch is used (Keskar et al., 2017).
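The loss-dropout trick can be sketched as follows; dividing the surviving elements by the keep probability leaves the expected loss unchanged while injecting SGD-like gradient noise (names are illustrative):

```python
import numpy as np

def dropout_loss(errors, rng, drop_rate=0.9):
    """Randomly zero out a drop_rate fraction of per-element loss values and
    rescale the survivors by 1 / (1 - drop_rate), keeping the loss unbiased
    in expectation."""
    keep = (rng.random(errors.shape) >= drop_rate).astype(errors.dtype)
    return errors * keep / (1.0 - drop_rate)
```

With `drop_rate=0.9`, each element survives with probability 0.1 and is scaled by 10, so the mean of the modified loss matches the original in expectation, but each SGD step sees a different random 10% subset of pixels.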
Because we learn the network parameters during testing, we always run BatchNorm in training mode using the statistics of the given data (i.e., we never use moving-average statistics).
Before being fed into the network, the input images are cropped by a loose bounding box of the target object regions to reduce redundant computations. Then, the images are normalized by global scaling as
$$\boldsymbol{I}_i \leftarrow \boldsymbol{I}_i / \sigma, \quad (20)$$
where $\sigma$ is the square root of the mean squared intensity of the observed images over the target regions. For PSNet, the normalized image tensor is further concatenated with the binary mask as input.
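The global scaling of Eq. (20) amounts to dividing every image by one scalar computed over the masked region (sketch with illustrative names):

```python
import numpy as np

def normalize_images(images, mask):
    """Divide all images by the square root of the mean squared intensity
    over the object region (a single global scalar), as in Eq. (20).

    images : (M, H, W) observed images
    mask   : (H, W) binary object mask
    """
    m = mask.astype(bool)
    sigma = np.sqrt((images[:, m] ** 2).mean())  # one scalar for all images
    return images / sigma
```

Using one global scalar (rather than per-image normalization) preserves the relative brightness between observations, which carries the shading information photometric stereo relies on.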
4 Experiments
In this section, we evaluate our method using a challenging real-world scene benchmark called DiLiGenT (Shi et al., 2018). In Sec. 4.1, we show comparisons with state-of-the-art photometric stereo methods. We then further analyze our network architecture in Sec. 4.2 and the weak supervision technique in Sec. 4.3. In the experiments, we use all 96 observed images for each scene provided by the DiLiGenT dataset. Our method is implemented in Chainer (Tokui et al., 2015) and is run on a single NVIDIA Tesla V100 GPU with 16 GB memory and 32-bit floating-point precision.
4.1 Realworld scene benchmark (DiLiGenT)
We show our results on the DiLiGenT benchmark (Shi et al., 2018) in Table 1, where we compare our method with ten existing methods by mean angular errors. We also show visual comparisons of the top three and baseline methods for reading and harvest in Fig. 1. Our method achieves the best average score and the best individual scores for eight scenes (except for the two scenes goblet and harvest) that contain various materials and reflectance surfaces. This is remarkable considering that another neural network method (Santo et al., 2017) outperforms the other existing methods only for harvest, in spite of its supervised learning. This harvest is the most difficult scene of all, due to heavy interactions of cast shadows and inter-reflections as well as spatially-varying materials and complex metallic BRDFs. For such complex scenes, supervised pre-training (Santo et al., 2017) is effective. The baseline least squares method performs poorly, especially for specular objects. Although we use its results as guidance priors, its low accuracy is not critical to our method thanks to the proposed early-stage supervision. We analyze this further in Sec. 4.3.
4.2 Analysis of the network architecture
In the middle part of Table 2, we show the performance changes of our method when modifying its architecture. Specifically, we test two settings where we disable the two connections from PSNet to IRNet, i.e., the specularity channel input and the global observation blending described in Sec. 3.1.2. As shown, the proposed full architecture performs best, while the removal of the specularity channel input has the most negative impact. As expected, directly inputting a specularity channel indeed eases learning of complex BRDFs (e.g., metallic surfaces in cow), demonstrating a strength of our physics-based network architecture, which can exploit known physical reflectance properties for BRDF learning.
4.3 Effects of earlystage weak supervision
We here evaluate the effectiveness of our learning strategy using early-stage weak supervision, by comparing with two cases where we use no supervision or all-stage supervision (i.e., the supervision weight is zero or constant). See the bottom part of Table 2 for performance comparisons. Learning with no supervision produces comparable median scores but worse mean scores compared to early-stage supervision. This indicates that learning with no supervision is very unstable and often gets stuck at bad local minima, as shown in Fig. 5 (green profiles). On the other hand, learning with all-stage supervision is relatively stable but is strongly biased by the inaccurate least-squares priors, often producing worse solutions, as shown in Fig. 5 (blue profiles). In contrast, learning with the proposed early-stage supervision (red profiles) is more stable and persistently continues to improve accuracy even after terminating the supervision at the 50th iteration (shown as vertical dashed lines).
5 Discussions and related work
Our method is inspired by the recent work on deep image priors by Ulyanov et al. (2018). They show that the architectures of CNNs themselves behave as good regularizers for natural images, and show successful results for unsupervised tasks such as image super-resolution and inpainting by fitting a CNN to a single test image. However, their simple hourglass network does not directly apply to photometric stereo, because we here need to simultaneously consider surface normal estimation, which accounts for global statistics of the observations, as well as reconstruction of individual observations for defining the loss. Our novel architecture addresses this problem by resorting to ideas from classical physics-based approaches to photometric stereo.
Our network architecture is also partly influenced by that of (Santo et al., 2017), which regresses per-pixel observations to a 3D normal vector using a simple feed-forward network of five fully-connected and ReLU layers plus an output layer. Our PSNet becomes similar to theirs if we use 1x1 Conv with more layers and channels. Since our method only needs to learn the reflectance properties of a single test scene, our PSNet requires fewer layers and channels. More importantly, we additionally introduce IRNet, which allows direct unsupervised learning on test data.
There are some other early studies on photometric stereo using (shallow) neural networks. These methods work under more restricted conditions, e.g., assuming pre-training with a calibration sphere of the same material as the target objects (Iwahori et al., 1993, 1995), special image capturing setups (Iwahori et al., 2002; Ding et al., 2009), or Lambertian surfaces (Cheng, 2006; Elizondo et al., 2008), whereas none of these is required by our method.
Currently, our method has the limitations of a slow running time (e.g., 1 hour for 1000 SGD iterations on each scene) and limited performance on complex scenes (e.g., harvest). However, several studies (Akiba et al., 2017; You et al., 2017; Goyal et al., 2017) show fast training of CNNs using extremely large minibatches and tuned scheduling of SGD step-sizes. Since our dense prediction method can use a very large minibatch of pixel samples, the use of such acceleration schemes may improve the convergence speed. Also, a pre-training approach similar to (Santo et al., 2017) is still feasible for our method, which would accelerate convergence and also increase accuracy on complex scenes (at the cost of permutation invariance). Thorough analyses in these directions are left as future work.
6 Conclusions
In this paper, we have presented a novel CNN architecture for photometric stereo. The proposed unsupervised learning approach bridges a gap between existing supervised neural network methods and many classical physics-based unsupervised methods. Consequently, our method can learn complicated BRDFs by leveraging both the powerful expressibility of deep neural networks and the physical reflectance properties known from past studies, achieving state-of-the-art performance in an unsupervised fashion, just like classical methods. We also hope that our idea of physics-based unsupervised learning stimulates further research on tasks that lack ground truth data for training, because even then, physics is everywhere in the real world, providing strong clues about the hidden data we desire.
Acknowledgements
The authors would like to thank Shi et al. (2018) for building a photometric stereo benchmark, Santo et al. (2017) for providing us with their results, and Profs. Yoichi Sato and Ryo Yonetani and the anonymous reviewers for their helpful feedback. The authors gratefully acknowledge the support of NVIDIA Corporation with the donation of a Titan Xp GPU.
References
 Akiba et al. (2017) Akiba, T., Suzuki, S., and Fukuda, K. Extremely Large Minibatch SGD: Training ResNet50 on ImageNet in 15 Minutes. CoRR, abs/1711.04325, 2017.

 Alldrin et al. (2008) Alldrin, N., Zickler, T., and Kriegman, D. Photometric stereo with non-parametric and spatially-varying reflectance. In Proc. IEEE Conf. Comput. Vis. and Pattern Recognit. (CVPR), 2008.
 Belhumeur et al. (1999) Belhumeur, P. N., Kriegman, D. J., and Yuille, A. L. The Bas-Relief Ambiguity. Int’l J. Comput. Vis. (IJCV), 35(1):33–44, Nov 1999.
 Charles et al. (2017) Charles, R. Q., Su, H., Kaichun, M., and Guibas, L. J. PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation. In Proc. IEEE Conf. Comput. Vis. and Pattern Recognit. (CVPR), pp. 77–85, 2017.
 Cheng (2006) Cheng, W.C. Neuralnetworkbased photometric stereo for 3d surface reconstruction. In Proc. IEEE Int’l Joint Conf. Neural Network, pp. 404–410, 2006.
 Chung & Jia (2008) Chung, H.S. and Jia, J. Efficient photometric stereo on glossy surfaces with wide specular lobes. In Proc. IEEE Conf. Comput. Vis. and Pattern Recognit. (CVPR), 2008.
 Ding et al. (2009) Ding, Y., Iwahori, Y., Nakamura, T., Woodham, R. J., He, L., and Itoh, H. Selfcalibration and Image Rendering Using RBF Neural Network. In Proc. Int’l Conf. KnowledgeBased and Intell. Inf. and Engin. Syst., pp. 705–712, 2009.
 Elizondo et al. (2008) Elizondo, D., Zhou, S.M., and Chrysostomou, C. Surface Reconstruction Techniques Using Neural Networks to Recover Noisy 3D Scenes. In Proc. Int’l Conf. Artificial Neural Networks, pp. 857–866, 2008.
 Esteban et al. (2008) Esteban, C. H., Vogiatzis, G., and Cipolla, R. Multiview photometric stereo. IEEE Trans. Pattern Anal. Mach. Intell. (TPAMI), 30(3):548–554, 2008.
 Furukawa & Ponce (2010) Furukawa, Y. and Ponce, J. Accurate, Dense, and Robust Multiview Stereopsis. IEEE Trans. Pattern Anal. Mach. Intell. (TPAMI), 32(8):1362–1376, 2010.
 Georghiades (2003) Georghiades, A. S. Incorporating the Torrance and Sparrow model of reflectance in uncalibrated photometric stereo. In Proc. Int’l Conf. Comput. Vis. (ICCV), pp. 816–823, 2003.
 Goldman et al. (2010) Goldman, D. B., Curless, B., Hertzmann, A., and Seitz, S. M. Shape and SpatiallyVarying BRDFs from Photometric Stereo. IEEE Trans. Pattern Anal. Mach. Intell. (TPAMI), 32(6):1060–1071, 2010.
 Goyal et al. (2017) Goyal, P., Dollár, P., Girshick, R. B., Noordhuis, P., Wesolowski, L., Kyrola, A., Tulloch, A., Jia, Y., and He, K. Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour. CoRR, abs/1706.02677, 2017.
 He et al. (2017a) He, H., Xin, B., Ikehata, S., and Wipf, D. P. From Bayesian Sparsity to Gated Recurrent Nets. In Adv. Neural Inf. Process. Syst. (NIPS), pp. 5560–5570. 2017a.

 He et al. (2015) He, K., Zhang, X., Ren, S., and Sun, J. Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification. In Proc. Int’l Conf. Comput. Vis. (ICCV), pp. 1026–1034, 2015.
 He et al. (2016) He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proc. IEEE Conf. Comput. Vis. and Pattern Recognit. (CVPR), pp. 770–778, 2016.
 He et al. (2017b) He, K., Gkioxari, G., Dollár, P., and Girshick, R. Mask RCNN. In Proc. Int’l Conf. Comput. Vis. (ICCV), 2017b.
 Higo et al. (2010) Higo, T., Matsushita, Y., and Ikeuchi, K. Consensus photometric stereo. In Proc. IEEE Conf. Comput. Vis. and Pattern Recognit. (CVPR), pp. 1157–1164, 2010.
 Hold-Geoffroy et al. (2018) Hold-Geoffroy, Y., Gotardo, P. F. U., and Lalonde, J. Deep Photometric Stereo on a Sunny Day. CoRR, abs/1803.10850, 2018.

 Iizuka et al. (2016) Iizuka, S., Simo-Serra, E., and Ishikawa, H. Let there be Color!: Joint End-to-end Learning of Global and Local Image Priors for Automatic Image Colorization with Simultaneous Classification. ACM Trans. Graphics (ToG), 35(4):110:1–110:11, 2016.
 Ikehata & Aizawa (2014) Ikehata, S. and Aizawa, K. Photometric Stereo Using Constrained Bivariate Regression for General Isotropic Surfaces. In Proc. IEEE Conf. Comput. Vis. and Pattern Recognit. (CVPR), pp. 2187–2194, 2014.
 Ikehata et al. (2012) Ikehata, S., Wipf, D., Matsushita, Y., and Aizawa, K. Robust photometric stereo using sparse regression. In Proc. IEEE Conf. Comput. Vis. and Pattern Recognit. (CVPR), pp. 318–325, 2012.
 Ioffe & Szegedy (2015) Ioffe, S. and Szegedy, C. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. In Proc. Int’l Conf. Mach. Learn. (ICML), volume 37, pp. 448–456, 2015.
 Iwahori et al. (1993) Iwahori, Y., Woodham, R. J., Tanaka, H., and Ishii, N. Neural network to reconstruct specular surface shape from its three shading images. In Proc. 1993 Int’l Conf. Neural Networks, volume 2, pp. 1181–1184, 1993.
 Iwahori et al. (1995) Iwahori, Y., Bagheri, A., and Woodham, R. J. Neural network implementation of photometric stereo. In Proc. Vision Interface, 1995.
 Iwahori et al. (2002) Iwahori, Y., Watanabe, Y., Woodham, R. J., and Iwata, A. Selfcalibration and neural network implementation of photometric stereo. In Proc. Int’l Conf. Pattern Recognit. (ICPR), volume 4, pp. 359–362, 2002.
 Kendall et al. (2017) Kendall, A., Martirosyan, H., Dasgupta, S., Henry, P., Kennedy, R., Bachrach, A., and Bry, A. EndToEnd Learning of Geometry and Context for Deep Stereo Regression. In Proc. Int’l Conf. Comput. Vis. (ICCV), 2017.
 Keskar et al. (2017) Keskar, N. S., Mudigere, D., Nocedal, J., Smelyanskiy, M., and Tang, P. T. P. On LargeBatch Training for Deep Learning: Generalization Gap and Sharp Minima. In Proc. Int’l Conf. Learn. Repres. (ICLR), 2017.
 Kingma & Ba (2015) Kingma, D. P. and Ba, J. Adam: A Method for Stochastic Optimization. In Proc. Int’l Conf. Learn. Repres. (ICLR), 2015.
 Krizhevsky et al. (2012) Krizhevsky, A., Sutskever, I., and Hinton, G. E. ImageNet Classification with Deep Convolutional Neural Networks. In Adv. Neural Inf. Process. Syst. (NIPS), pp. 1097–1105. 2012.
 Matusik et al. (2003) Matusik, W., Pfister, H., Brand, M., and McMillan, L. Efficient Isotropic BRDF Measurement. In Proc. Eurographics Workshop on Rendering, 2003.
 Mayer et al. (2016) Mayer, N., Ilg, E., Häusser, P., Fischer, P., Cremers, D., Dosovitskiy, A., and Brox, T. A Large Dataset to Train Convolutional Networks for Disparity, Optical Flow, and Scene Flow Estimation. In Proc. IEEE Conf. Comput. Vis. and Pattern Recognit. (CVPR), pp. 4040–4048, 2016.
 Nehab et al. (2005) Nehab, D., Rusinkiewicz, S., Davis, J., and Ramamoorthi, R. Efficiently Combining Positions and Normals for Precise 3D Geometry. ACM Trans. Graphics (ToG), 24(3):536–543, 2005.
 Park et al. (2017) Park, J., Sinha, S. N., Matsushita, Y., Tai, Y. W., and Kweon, I. S. Robust Multiview Photometric Stereo using Planar Mesh Parameterization. IEEE Trans. Pattern Anal. Mach. Intell. (TPAMI), 39(8):1591–1604, 2017.
 Santo et al. (2017) Santo, H., Samejima, M., Sugano, Y., Shi, B., and Matsushita, Y. Deep Photometric Stereo Network. In Proc. Int’l Workshop Physics Based Vision meets Deep Learning (PBDL) in ICCV, 2017.
 Shi et al. (2012) Shi, B., Tan, P., Matsushita, Y., and Ikeuchi, K. Elevation Angle from Reflectance Monotonicity: Photometric Stereo for General Isotropic Reflectances. In Proc. Eur. Conf. Comput. Vis. (ECCV), pp. 455–468, 2012.
 Shi et al. (2014) Shi, B., Tan, P., Matsushita, Y., and Ikeuchi, K. BiPolynomial Modeling of LowFrequency Reflectances. IEEE Trans. Pattern Anal. Mach. Intell. (TPAMI), 36(6):1078–1091, 2014.
 Shi et al. (2018) Shi, B., Wu, Z., Mo, Z., Duan, D., Yeung, S.K., and Tan, P. A Benchmark Dataset and Evaluation for NonLambertian and Uncalibrated Photometric Stereo. IEEE Trans. Pattern Anal. Mach. Intell. (TPAMI), 2018. (to appear).
 Taniai et al. (2017) Taniai, T., Matsushita, Y., Sato, Y., and Naemura, T. Continuous 3D Label Stereo Matching using Local Expansion Moves. IEEE Trans. Pattern Anal. Mach. Intell. (TPAMI), 2017. (accepted).
 Tokui et al. (2015) Tokui, S., Oono, K., Hido, S., and Clayton, J. Chainer: a nextgeneration open source framework for deep learning. In Proc. Workshop Mach. Learn. Syst. (LearningSys) in NIPS, 2015. URL https://chainer.org.
 Ulyanov et al. (2018) Ulyanov, D., Vedaldi, A., and Lempitsky, V. S. Deep Image Prior. In Proc. IEEE Conf. Comput. Vis. and Pattern Recognit. (CVPR), 2018. (to appear).
 Woodham (1980) Woodham, R. J. Photometric method for determining surface orientation from multiple images. Optical Engineering, 19(1):139–144, 1980.
 Wu et al. (2010) Wu, L., Ganesh, A., Shi, B., Matsushita, Y., Wang, Y., and Ma, Y. Robust photometric stereo via lowrank matrix completion and recovery. In Proc. Asian Conf. Comput. Vis. (ACCV), pp. 703–717, 2010.
 Xin et al. (2016) Xin, B., Wang, Y., Gao, W., Wipf, D. P., and Wang, B. Maximal Sparsity with Deep Networks? In Adv. Neural Inf. Process. Syst. (NIPS), pp. 4340–4348. 2016.
 You et al. (2017) You, Y., Zhang, Z., Hsieh, C., and Demmel, J. ImageNet Training in Minutes. CoRR, abs/1709.05011, 2017.
A Additional visual comparisons
In Figures A1–A10, we show surface normal estimates and their angular error maps for all ten scenes from the DiLiGenT dataset, comparing our method with the three methods by Santo et al. (2017), Shi et al. (2014), and Ikehata & Aizawa (2014), as well as the baseline least squares method. Here, the best result for each scene is obtained by one of the top four methods (8 of the 10 best results are by our method), whose mean angular error is shown in bold.
B Additional image reconstruction results
In Figures A11–A20, we show our image reconstruction results for all ten scenes from the DiLiGenT dataset. For each scene, we show 6 of the 96 observed images, and for each observed image we show our synthesized image, the intermediate reflectance image, and a reconstruction error map. Here, the reflectance images represent the multiplication of spatially-varying BRDFs and cast shadows under a particular illumination condition. We can clearly see cast shadows appearing in the reflectance images for bear and buddha. Note that, for better visualization, the image intensities are scaled up after the proposed global scaling normalization.
C Additional convergence analyses
In Figures A21–A30, we show convergence analyses for all ten scenes from the DiLiGenT dataset, where the proposed early-stage supervision is compared with no and all-stage supervision. Training without supervision is very unstable, while training with all-stage supervision is strongly biased by the inaccurate least squares priors. Note that, for better comparison, the median profiles of the proposed early-stage weak supervision (red solid lines) are overlaid in the plots of no and all-stage supervision.