Learning Inter- and Intra-frame Representations for Non-Lambertian Photometric Stereo

In this paper, we build a two-stage Convolutional Neural Network (CNN) architecture that constructs inter- and intra-frame representations from an arbitrary number of images captured under different light directions, performing accurate normal estimation of non-Lambertian objects. We experimentally investigate numerous network design alternatives to identify the optimal scheme for deploying inter-frame and intra-frame feature extraction modules in the photometric stereo problem. Moreover, we propose to utilize the easily obtained object mask to eliminate adverse interference from invalid background regions in intra-frame spatial convolutions, thus effectively improving the accuracy of normal estimation for surfaces made of dark materials or with cast shadows. Experimental results demonstrate that the proposed masked two-stage photometric stereo CNN model (MT-PS-CNN) performs favorably against state-of-the-art photometric stereo techniques in terms of both accuracy and efficiency. In addition, the proposed method is capable of predicting accurate and rich surface normal details for non-Lambertian objects of complex geometry and performs stably given inputs captured under both sparse and dense lighting distributions.


I Introduction

In recent years, photometric stereo has received significant attention in the fields of optical engineering and computer vision. Photometric stereo techniques estimate accurate and highly detailed surface normals of a target object based on a set of images captured under different light directions using a fixed-viewpoint camera [Woodham1980]. They can generate 3D models with rich details to facilitate various applications such as automated industrial quality inspection [3dDetection], high-fidelity 3D reconstruction/modeling [3DReconstruction], and face recognition and verification [3dface].

The basic theory of photometric stereo was first proposed by Woodham based on the assumption of ideal Lambertian reflectance [Woodham1980]. However, most real-world objects are non-Lambertian. Therefore, many researchers have attempted to utilize flexible surface reflection models such as bidirectional reflectance distribution functions (BRDFs) to develop more applicable photometric stereo techniques that work well for real-world objects [Goldman2005, Shi2014].

Fig. 1: Left: Photometric stereo hardware setup which captures images of a target object under different light directions using a fixed-viewpoint camera. Right: Illustration of the proposed MT-PS-CNN model and comparative results with state-of-the-art CNN-based photometric stereo methods [Chen2018ECCV_PS-FCN, Ikehata2018ECCV-CNN-PS] for the Harvest object in the DiLiGenT dataset.

With their powerful learning capacity, Convolutional Neural Networks (CNNs) have become a prevalent tool for generating high-dimensional feature representations in various computer vision tasks such as object detection/recognition [ren2015faster-rcnn, redmon2016yolo], nonlinear data mapping [huang2017wavelet, he2018single], and video sequence analysis [wang2016beyond, tokmakov2017learning]. Santo et al. presented the first deep learning-based photometric stereo network (DPSN) to learn the complex mapping function from per-pixel observations of inter-frame lighting variations to surface normals [Santo2017]. To improve the applicability of photometric stereo, Ikehata et al. merged all the input data into an observation map to effectively handle an arbitrary number of unordered images, and exploited the rotational pseudo-invariance of the observation map to significantly improve the accuracy of surface normal estimation [Ikehata2018ECCV-CNN-PS]. However, the above-mentioned per-pixel photometric stereo methods typically analyze inter-frame reflective signals of individual pixels, neglecting that local spatial intensity variation among neighboring pixels encodes useful cues for predicting surface normals [Chen2018ECCV_PS-FCN]. Moreover, the performance of photometric stereo methods based solely on information in the inter-frame domain drops significantly when the number of input images decreases. To overcome the limits of per-pixel photometric stereo techniques, Chen et al. proposed the first fully convolutional network-based photometric stereo method (PS-FCN), which takes an arbitrary number of unordered full-size images as input and takes advantage of local context information to densely estimate the surface normals [Chen2018ECCV_PS-FCN]. A noticeable drawback of PS-FCN is that it does not explore the very important frame-to-frame lighting variations of individual pixels. Thus, PS-FCN performs unfavorably when compared with some inter-frame photometric stereo techniques [Ikehata2018ECCV-CNN-PS, Wang2020TIP]. Therefore, it is highly desirable to develop a unified method performing both inter- and intra-frame analysis for the challenging surface normal estimation task.

Given an arbitrary number of unordered images captured under different lighting configurations, we present a novel two-stage CNN architecture that explores valuable information encoded both in local image patches and across adjacent frames, constructing inter- and intra-frame feature representations for high-quality surface normal estimation of non-Lambertian objects, as illustrated in Fig. 1. More specifically, we utilize inter-frame convolutional layers to analyze lighting variations of individual pixels across adjacent frames and intra-frame convolutional layers to capture intensity variation among local spatial pixels. We experimentally evaluate the performance of various network design alternatives in an attempt to identify the optimal sequence and strategy for deploying inter-frame and intra-frame feature extraction modules. We make two important findings: (1) it is better to first perform inter-frame feature extraction and then intra-frame feature extraction, since the frame-to-frame observations provide important information for the photometric stereo task; (2) it is desirable to divide the entire feature extraction process into two individual stages, while mixing the inter- and intra-frame feature extraction steps adversely affects the performance of surface normal estimation. Moreover, we propose to utilize the easily obtained object mask to eliminate adverse interference from invalid background regions in intra-frame spatial convolutions, thus effectively improving the accuracy of normal estimation for surfaces with insufficient reflectance observations (e.g., made of highly absorptive dark materials or with cast shadows). Compared with the state-of-the-art CNN-based photometric stereo techniques [Chen2018ECCV_PS-FCN, Ikehata2018ECCV-CNN-PS], our proposed MT-PS-CNN is capable of estimating more accurate surface normals using fewer parameters, and it performs consistently well across various image capturing configurations. This work has the following three main contributions.

  • We study and identify an optimal scheme to deploy inter-frame and intra-frame feature extraction modules for the photometric stereo problem and present a two-stage CNN architecture to perform high-quality surface normal estimation of non-Lambertian objects.

  • We propose to utilize the easily obtained object mask to eliminate adverse interference from invalid background regions during intra-frame spatial convolutions, which provides an effective technique to facilitate accurate normal estimation for surfaces made of highly absorptive dark materials or with cast shadows.

  • With fewer parameters, our proposed MT-PS-CNN outperforms state-of-the-art photometric stereo techniques [Chen2018ECCV_PS-FCN, Ikehata2018ECCV-CNN-PS]. Moreover, the proposed method is capable of predicting accurate and rich surface normal details for non-Lambertian objects and performs well with sparse input frames.

II Related Work

In this section, we provide a brief overview of recently proposed deep learning-based photometric stereo methods for non-Lambertian objects. For a detailed review of recent research on photometric stereo, readers may refer to [Ackermann2015].

Fig. 2: The overall architecture of the proposed MT-PS-CNN model.

Deep learning techniques, with their ability to approximate highly non-linear mappings, have recently been utilized for solving the complex photometric stereo problem. Santo et al. [Santo2017] first solved the photometric stereo task via deep learning, proposing a pixel-wise method which utilizes a fully-connected network (DPSN) to establish a mapping from the given observations to surface normals. The assumption that light directions must be pre-defined and remain the same during the training and test phases restricts the application of DPSN. Ikehata et al. [Ikehata2018ECCV-CNN-PS] introduced a CNN-based method, CNN-PS, which relaxes the limitations on lighting information and image structure through a novel network input (the 2D observation map). A synthetic photometric stereo dataset, called CyclesPS, was presented in that paper, considering global illumination effects. CNN-PS performs significantly better on the benchmark dataset than most traditional photometric stereo approaches. Chen et al. [Chen2018ECCV_PS-FCN] first proposed a fully convolutional network, called PS-FCN, that also takes unstructured images and lighting information as input. PS-FCN takes full-size images into account and effectively utilizes the spatial information among local pixels, which is neglected in the previously proposed pixel-wise methods. Thus, the frame-based PS-FCN method performs better with a sparse input setting than the pixel-wise methods. Chen et al. [Chen2019] then extended PS-FCN and proposed an uncalibrated PS method (SDPS-Net) for non-Lambertian surfaces. SDPS-Net contains two processing steps: a classification network (LCNet) first approximates the lighting information, and a subsequent prediction network (NENet) then estimates the normals.

Given enough input images, pixel-wise methods (e.g., CNN-PS [Ikehata2018ECCV-CNN-PS] and DPSN [Santo2017]) are generally more accurate at surface normal estimation than frame-wise ones (e.g., PS-FCN [Chen2018ECCV_PS-FCN] and SDPS-Net [Chen2019]). However, a sparse input setting adversely affects the performance of pixel-wise methods because they ignore spatial information. To handle sparse input, Li et al. [Li2019] proposed a pixel-wise deep learning approach which applies a connection table to select the relatively effective light directions, reducing the input required by pixel-wise photometric stereo methods. Zheng et al. [Zheng2019_SPLINE-Net] proposed SPLINE-Net, which generates dense lighting observations by lighting interpolation to improve the performance of sparse photometric stereo. Wang et al. [FuLin2020] proposed a photometric stereo network which utilizes the collocated light image as supplementary information to improve performance.

III Approach

Given $F$ RGB images of a target object with $P$ pixels ($W$ and $H$ are the width and height of the input images, $P = W \times H$) captured under $F$ different light directions, surface normals $N \in \mathbb{R}^{3 \times P}$ of the $P$ pixels, and light directions $\{l_j\}_{j=1}^{F}$ of the $F$ images, the observation matrix can be formulated as [Chen2018ECCV_PS-FCN]

$$I_j = \Phi(N, l_j, v) \odot \max\!\big(N^{\top} l_j,\, 0\big), \quad j = 1, \dots, F, \qquad (1)$$

where $\Phi$ is a complex function of surface normal, light direction, and viewing direction (the viewing direction is set to $v = [0, 0, 1]^{\top}$, which is parallel to the z-axis of the world coordinates), and $\odot$ represents the element-wise dot product. In this paper, we present a two-stage CNN model, MT-PS-CNN, which extracts inter- and intra-frame feature representations to directly estimate the normal matrix $N$ based on the observation matrix and the light direction matrix, without explicitly modeling the complex function $\Phi$.
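To make the image formation model concrete, the sketch below renders per-pixel intensities for the ideal Lambertian special case, where $\Phi$ reduces to a constant albedo $\rho$ and the clamped shading term accounts for attached shadows. This is an illustrative simplification (the function name and the albedo value are our own choices), not the general non-Lambertian $\Phi$ that MT-PS-CNN learns implicitly.

```python
import torch

def render_lambertian(normals, light, albedo=0.5):
    """Ideal Lambertian special case of Eq. (1): I = rho * max(n . l, 0).

    normals: (3, P) unit surface normals for P pixels
    light:   (3,) unit light-direction vector
    Returns (P,) per-pixel intensities for one image.
    """
    shading = normals.t() @ light                    # per-pixel n . l
    return albedo * torch.clamp(shading, min=0.0)    # attached shadows clip to 0
```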

III-A Network architecture

As illustrated in Fig. 2, our proposed MT-PS-CNN model consists of three major components: initial feature extraction, inter- and intra-frame feature extraction, and normal map estimation. For each captured image, we follow the conventional practice of replicating its corresponding 3-vector light direction along the spatial directions to obtain a lighting map of the same spatial size as the image [Chen2018ECCV_PS-FCN]. We apply the binary object mask to the lighting map, keeping light directions for the target object (mask pixel values equal to 1) and eliminating invalid background regions (mask pixel values equal to 0). More details on generating 2D masked lighting maps are provided in Sec. III-C. The obtained 3-channel masked lighting map is concatenated with the 3-channel RGB image to generate a 6-channel image-light data matrix. We stack the $F$ image-light matrices together to obtain the input of our MT-PS-CNN. Note that we purposely store the input images and masked lighting maps in 2D spatial form for the following spatial convolutions.
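The following PyTorch sketch illustrates one way this input tensor could be assembled; the function and variable names (build_input, imgs, light_dirs, mask) are our own illustrative choices, not the authors' released code.

```python
import torch

def build_input(imgs, light_dirs, mask):
    """Assemble the 6-channel image-light input tensor.

    imgs:       (F, 3, H, W) RGB images captured under F light directions
    light_dirs: (F, 3) unit light-direction vectors
    mask:       (1, H, W) binary object mask (1 = object, 0 = background)
    Returns a (6, F, H, W) tensor: channels x frames x height x width.
    """
    F_, _, H, W = imgs.shape
    # Replicate each 3-vector light direction over the spatial plane,
    # then zero it out on background pixels using the object mask.
    lmaps = light_dirs.view(F_, 3, 1, 1).expand(F_, 3, H, W) * mask
    # Concatenate RGB image and masked lighting map along the channel axis.
    x = torch.cat([imgs, lmaps], dim=1)        # (F, 6, H, W)
    # Reorder to (channels, frames, H, W) for the following 3D convolutions.
    return x.permute(1, 0, 2, 3).contiguous()  # (6, F, H, W)
```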

Given the input $D \in \mathbb{R}^{6 \times F \times H \times W}$, we first deploy the initial feature extraction module to compute the feature map $X_0$ as

$$X_0 = f_{\mathrm{IFE}}(D), \qquad (2)$$

where $f_{\mathrm{IFE}}$ denotes a 3D convolutional layer of $C$ kernels ($C$ is the number of output feature channels). Note that the kernel size of this 3D convolution is chosen so as to process the concatenated 6-channel image-light data matrices. Following the 3D convolutional layer, a leaky ReLU layer and a dropout layer are deployed to activate the values and simulate the cast shadow effects [Santo2017, Ikehata2018ECCV-CNN-PS].

Then we deploy a number of inter- and intra-frame feature extraction blocks to analyze frame-to-frame and per-frame lighting variations. Within each InteR-frame Feature Extraction (IRFE) block, we utilize a 3D convolutional layer of kernel size $M \times 1 \times 1$ to process lighting variations of individual pixels across $M$ adjacent frames. Such inter-frame information provides important clues to eliminate the influence of outliers (i.e., shadows, inter-reflections, specularities, etc.) and thus leads to accurate restoration results in photometric stereo tasks [Mukaigawa2007]. Each convolution is followed by a leaky ReLU activation and a dropout layer. We stack the IRFE blocks to compute inter-frame feature representations as

$$X_r = f_{\mathrm{IRFE}}^{(r)}(X_{r-1}), \quad r = 1, \dots, R, \qquad (3)$$

where $f_{\mathrm{IRFE}}^{(r)}$ represents the operations of the $r$-th IRFE block and $R$ is the total number of stacked IRFE blocks.

The computed feature map $X_R$ is then fed to a number of IntrA-frame Feature Extraction (IAFE) blocks to exploit the spatial information in local image patches and compute intra-frame feature representations as

$$X_{R+a} = f_{\mathrm{IAFE}}^{(a)}(X_{R+a-1}), \quad a = 1, \dots, A, \qquad (4)$$

where $A$ is the total number of stacked IAFE blocks, and $f_{\mathrm{IAFE}}^{(a)}$ denotes the operations of the $a$-th IAFE block. Note that each IAFE block contains a 3D convolutional layer of kernel size $1 \times N \times N$ and a leaky ReLU activation to capture intra-frame intensity variation among local pixels. The extracted local context information improves the ability of CNN models to handle various reflectances and to perform robustly under sparse lighting distributions [Li2019CVPR, Wang2020TIP]. In our implementation, the numbers of stacked IRFE and IAFE blocks are set experimentally to achieve a good balance between model complexity and performance.
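As a concrete illustration, the following PyTorch sketch shows how the two block types could be realized with nn.Conv3d. The kernel shapes (M, 1, 1) and (1, N, N) and the dropout placement follow the description above, while the channel width, LeakyReLU slope, and dropout rate are illustrative assumptions.

```python
import torch.nn as nn

class IRFEBlock(nn.Module):
    """Inter-frame feature extraction: M x 1 x 1 kernels convolve each
    pixel's observations across M adjacent frames (no spatial mixing)."""
    def __init__(self, channels=64, M=5, p_drop=0.2):  # widths/rates assumed
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv3d(channels, channels, kernel_size=(M, 1, 1),
                      padding=(M // 2, 0, 0)),
            nn.LeakyReLU(0.1, inplace=True),
            nn.Dropout3d(p_drop),  # simulates cast-shadow effects
        )

    def forward(self, x):  # x: (B, C, F, H, W)
        return self.body(x)

class IAFEBlock(nn.Module):
    """Intra-frame feature extraction: 1 x N x N kernels convolve
    spatial neighborhoods within each frame (no frame mixing)."""
    def __init__(self, channels=64, N=3):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv3d(channels, channels, kernel_size=(1, N, N),
                      padding=(0, N // 2, N // 2)),
            nn.LeakyReLU(0.1, inplace=True),
        )

    def forward(self, x):  # x: (B, C, F, H, W)
        return self.body(x)
```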

In order to handle a flexible number of input images in photometric stereo tasks, order-agnostic operations (e.g., pooling layers) [Hartmann2017learned, Wiles2017silnet] are typically utilized to standardize/fix the channel number of feature maps. Following the research work of Chen et al. [Chen2018ECCV_PS-FCN], we apply the max-pooling operation along the frame dimension to compress the feature map $X_{R+A}$ as

$$Z = \mathrm{maxpool}\big(X_{R+A}\big) = \max_{j = 1, \dots, F} X_{R+A}^{(j)}, \qquad (5)$$

where $X_{R+A}^{(j)}$ denotes the feature slice of the $j$-th frame and $Z$ is the output of the max-pooling operation. It is a representation with a fixed number of channels, obtained by aggregating the most salient features from images captured under different light directions.
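A minimal sketch of this order-agnostic aggregation, assuming the (B, C, F, H, W) feature layout used above:

```python
import torch

def frame_max_pool(features):
    """Order-agnostic aggregation over the frame dimension.

    features: (B, C, F, H, W) inter-/intra-frame feature maps
    Returns:  (B, C, H, W), independent of the number of frames F.
    """
    # The element-wise maximum over frames keeps the most salient
    # response at each channel/pixel, so any number of unordered
    # input images yields a fixed-size representation.
    return features.max(dim=2).values
```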

A Normal Map Estimation sub-network is appended after the max-pooling operation, converting the computed feature $Z$ into the surface normal $N(x, y)$ ($x$ and $y$ denote the spatial coordinates of the normal map) as

$$N(x, y) = f_{\mathrm{NME}}(Z)(x, y), \qquad (6)$$

where $f_{\mathrm{NME}}$ denotes the operations of the Normal Map Estimation module, which contains three convolutional layers, two leaky ReLU activations, and an L2-normalization layer for predicting the normal map.
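The sketch below follows this description (three convolutions, two leaky ReLUs, final L2 normalization); the channel widths and kernel sizes are our assumptions, not values stated in the paper.

```python
import torch.nn as nn
import torch.nn.functional as F

class NormalMapHead(nn.Module):
    """Normal map estimation: three conv layers, two leaky ReLUs,
    and a final per-pixel L2 normalization."""
    def __init__(self, in_channels=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, 64, 3, padding=1),
            nn.LeakyReLU(0.1, inplace=True),
            nn.Conv2d(64, 64, 3, padding=1),
            nn.LeakyReLU(0.1, inplace=True),
            nn.Conv2d(64, 3, 3, padding=1),
        )

    def forward(self, z):                  # z: (B, C, H, W)
        n = self.net(z)                    # (B, 3, H, W) raw normals
        return F.normalize(n, p=2, dim=1)  # unit length per pixel
```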

The training of the proposed MT-PS-CNN model is driven by minimizing the error between the predicted and the ground truth normal maps. We adopt the commonly used cosine similarity loss, which can be formulated as

$$\mathcal{L} = \frac{1}{P} \sum_{p = 1}^{P} \left(1 - \tilde{N}_p \cdot N_p\right), \qquad (7)$$

where $\tilde{N}$ and $N$ denote the predicted and the ground truth normal maps, respectively, and $\cdot$ denotes the dot product operation. When a predicted normal has a similar orientation to the ground truth, $\tilde{N}_p \cdot N_p$ approaches 1 and the loss approaches 0, and vice versa.
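A minimal sketch of this loss; restricting the average to object pixels via the mask is our assumption, consistent with the paper's use of object masks elsewhere.

```python
import torch

def cosine_similarity_loss(pred, gt, mask=None):
    """1 - <n_pred, n_gt>, averaged over valid pixels.

    pred, gt: (B, 3, H, W) unit-normal maps
    mask:     optional (B, 1, H, W) binary object mask
    """
    dot = (pred * gt).sum(dim=1, keepdim=True)  # per-pixel cosine
    loss = 1.0 - dot
    if mask is not None:
        # Average only over object pixels, ignoring the background.
        return (loss * mask).sum() / mask.sum().clamp(min=1)
    return loss.mean()
```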

III-B Network design alternatives

Fig. 3: A number of network design alternatives for inter- and intra-frame feature extraction. (a) T-IRFE-IAFE; (b) T-IAFE-IRFE; (c) M-IRFE-IAFE.

The design of a CNN architecture that extracts inter- and intra-frame features for accurate surface normal estimation involves two critical issues. The first is that both frame-to-frame (inter-) and per-frame (intra-) observations provide important information for the photometric stereo task, so which feature extraction module should be deployed first? The second is whether the CNN architecture should be divided into two individual stages that perform inter- and intra-frame feature extraction separately, or should mix the inter- and intra-frame feature extraction steps.

Based on the above two critical design issues, we design three different network alternatives (Fig. 3) which employ different schemes to deploy the inter- and intra-frame feature extraction modules, as follows:

(1) Two-stage IRFE-IAFE design (T-IRFE-IAFE): The first design incorporates a two-stage architecture by deploying inter- and intra-frame feature extraction steps in a cascaded manner. More specifically, it first deploys a number of IRFE blocks to compute inter-frame features based on frame-to-frame lighting variations of individual pixels and then a number of IAFE blocks to compute intra-frame features via analyzing intensity variation among spatial neighboring pixels.

(2) Two-stage IAFE-IRFE design (T-IAFE-IRFE): The second design also utilizes a two-stage architecture but switches the order of the inter- and intra-frame feature extraction modules. Thus, it first deploys a number of IAFE blocks and then a number of IRFE blocks to compute features for surface normal estimation.

(3) Mixed IRFE-IAFE design (M-IRFE-IAFE): Different from the above two-stage designs, M-IRFE-IAFE deploys the inter- and intra-frame feature extraction steps in an alternating manner. More specifically, it makes use of a number of grouped IRFE and IAFE blocks to compute inter- and intra-frame features in different convolutional stages.

In our implementation, the number of feature extraction blocks in each design is set experimentally to achieve a good balance between model complexity and performance. We experimentally evaluate these three designs and discuss the best way to deploy inter-frame and intra-frame feature extraction modules for high-accuracy surface normal estimation in Sec. IV-B1.

III-C Masked lighting map

Fig. 4: Illustration of generating the masked lighting map.

In the photometric stereo task, large field-of-view cameras are typically utilized to cover the entire target object during image capturing. As a result, a large area of the input image covers invalid background and contains pixels of unchanged RGB values. Note that these background pixels present lighting variation patterns similar to those of object pixels with insufficient reflectance observations (e.g., surfaces made of highly absorptive dark materials or with cast shadows) and will cause confusion for surface normal inference, particularly in intra-frame spatial convolutions. Therefore, we argue that it is important to exclude such invalid background pixels in the training/testing of CNN-based surface normal estimation models.

As a simple yet effective solution, we utilize the easily obtained object mask to extract pixels of the target object and eliminate invalid background regions. More specifically, we follow the conventional practice of replicating the 3-vector light direction of a captured image along the x and y directions to obtain a lighting map [Chen2018ECCV_PS-FCN], and then apply the binary object mask (a mask value of 0 defines a background pixel and a mask value of 1 defines an object pixel) to generate the masked lighting map, as illustrated in Fig. 4. We experimentally evaluate the effectiveness of utilizing masked lighting maps to eliminate adverse interference from invalid background regions and to improve the accuracy of normal estimation for surfaces made of highly absorptive dark materials or with specular highlights and cast shadows.

IV Experimental Results

In this section, we systematically evaluate the performance of our proposed MT-PS-CNN and compare it with the state-of-the-art photometric stereo methods on commonly used synthetic (MERL [Matusik2003]) and real-world (DiLiGenT [shi2016benchmark] and Light Stage Data Gallery [chabert2006relighting]) datasets.

IV-A Implementation details

All experiments were performed on a PC with a GeForce GTX 1080Ti GPU and 96GB RAM. Our model was implemented in PyTorch and has 723K learnable parameters. The Adam optimizer is used to train the network. We use the Blobby and Sculpture datasets provided by [Chen2018ECCV_PS-FCN] as our training data. There are 85,212 samples in the two datasets, where each sample contains 64 images rendered under randomly sampled light directions. The training process takes 30 epochs with a batch size of 32 (around 17 hours). We applied the same data augmentation techniques as suggested in PS-FCN [Chen2018ECCV_PS-FCN], except for adding extra noise disturbances. The accuracy of the estimated surface normals is quantitatively evaluated by computing the mean angular error (MAE); a lower MAE value indicates better estimation accuracy.
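For reference, the MAE metric can be computed as in the following sketch; this is the standard definition of mean angular error, not code from the paper.

```python
import torch

def mean_angular_error(pred, gt, mask):
    """Mean angular error (degrees) between unit-normal maps.

    pred, gt: (B, 3, H, W) unit normals; mask: (B, 1, H, W) object mask.
    """
    cos = (pred * gt).sum(dim=1, keepdim=True).clamp(-1.0, 1.0)
    ang = torch.rad2deg(torch.acos(cos))    # per-pixel angular error
    return (ang * mask).sum() / mask.sum()  # average over object pixels
```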

IV-B Performance Analysis

In this section, we set up ablation experiments to evaluate the effects of (1) the network architecture design, (2) the incorporation of the masked lighting map, and (3) the setting of kernel sizes.

IV-B1 Evaluation of network designs

In this subsection, we experimentally evaluate the performance of the three network architectures which employ different schemes to deploy inter- and intra-frame feature extraction modules, as illustrated in Fig. 3. For a fair comparison, the three network design alternatives (T-IRFE-IAFE, T-IAFE-IRFE, and M-IRFE-IAFE) are trained and evaluated using the same hyper-parameter settings. The comparative results (MAE) on the DiLiGenT dataset are shown in Table I.

Method BALL CAT POT1 BEAR POT2 BUDD. GOBL. READ. COW HARV. Avg.
T-IR.-IA. 2.82 5.79 6.92 5.59 7.55 6.93 7.98 11.85 7.82 14.18 7.74
T-IA.-IR. 2.58 6.02 6.99 6.62 8.68 7.34 9.46 13.68 8.79 15.95 8.61
M-IR.-IA. 2.76 5.99 7.24 6.84 7.79 7.23 8.39 11.71 8.32 14.79 8.11
TABLE I: The quantitative evaluation (MAE) of three different architectures (T-IRFE-IAFE, T-IAFE-IRFE, and M-IRFE-IAFE) to deploy inter- and intra-frame feature extraction blocks on the DiLiGenT dataset.

It is observed that the T-IRFE-IAFE design significantly outperforms T-IAFE-IRFE, reducing the average MAE over the 10 objects from 8.61 to 7.74. This comparison manifests that it is better to first perform inter-frame feature extraction and then intra-frame feature extraction. This might be explained by the fact that, although both inter- and intra-frame features are useful for normal prediction, the inter-frame information, despite being sensitive to outliers (specularity, shadows, etc.), provides the more essential cues for generating high-fidelity normal maps. In comparison, the intra-frame features provide complementary information that efficiently eliminates the interference from outliers but tends to smooth out textured details. Our experimental results are consistent with previous research findings that pixel-wise methods are generally more accurate at surface normal estimation than frame-wise ones when dense input images are provided [Ikehata2018ECCV-CNN-PS, Santo2017]. Therefore, it is reasonable to first extract inter-frame features, which provide fundamental information for the photometric stereo task, and then perform spatial reasoning through intra-frame feature extraction to further improve the accuracy of surface normal estimation.

Another interesting finding is that although M-IRFE-IAFE (alternating inter- and intra-frame feature extraction steps) has proven to be an effective design in many video sequence analysis tasks such as gesture/action recognition [Tran2018], it is not well suited to the photometric stereo task (M-IRFE-IAFE vs. T-IRFE-IAFE: 8.11 vs. 7.74 average MAE). Since inter- and intra-frame features provide two different kinds of information for estimating normal maps, it is more reasonable to divide the entire feature extraction process into two individual stages instead of mixing the inter- and intra-frame feature extraction steps.

IV-B2 Effectiveness of masked lighting map

Object Method M=3, N=1 M=1, N=3 M=3, N=3
Sphere Mask 8.89 4.51 4.16
No-Mask 8.82 5.97 5.73
Bunny Mask 7.93 4.39 4.03
No-Mask 7.86 5.01 4.73
TABLE II: Comparative results of various surface normal estimation models with/without referring to the masked lighting maps. For Sphere and Bunny objects, we calculate the average MAE values for 100 different materials.

In this subsection, we experimentally evaluate the effectiveness of utilizing masked lighting maps to improve the accuracy of normal estimation. Here, we adopt the best-performing two-stage CNN architecture (T-IRFE-IAFE) for inter- and intra-frame feature extraction and set the kernel size parameters M (defining how many adjacent frames to process) and N (defining how many neighboring pixels to process) to different values. Note that by setting M=1 the model only performs intra-frame convolutions, since it only considers intensity variation among spatial neighboring pixels on the current frame. Similarly, by setting N=1 the model becomes a per-pixel method and computes inter-frame features solely based on the lighting variations of single pixels.

Fig. 5: Quantitative comparison of CNN models with/without referring to the object mask for Bunny made of 30 different materials. Images in the second row show samples of 5 representative materials.

To test our model on different materials, we rendered the Sphere and Bunny objects using 100 different BRDFs from the MERL dataset. We calculate the average MAE values of the various CNN models with/without referring to the masked lighting maps over the 100 materials, as illustrated in Table II. It is observed that the performance of the per-pixel CNN model (M=3, N=1) remains almost unchanged after incorporating the masked lighting map. In comparison, the CNN-based models performing intra-frame spatial convolutions (setting N=3) achieve more accurate surface normal estimation results by referring to the binary object mask, which defines the pixels of target and background. For instance, the average MAE over the 100 materials for Sphere is significantly reduced from 5.97 to 4.51 for the model using M=1, N=3 and from 5.73 to 4.16 for the model using M=3, N=3. These quantitative results illustrate that it is important to integrate the easily obtained object mask into CNN models which involve intra-frame spatial convolutions, eliminating adverse interference from invalid background regions for high-accuracy surface normal estimation.

In Fig. 5, we show the calculated MAE values of CNN models with/without referring to the object mask for the Bunny object made of 30 representative materials. Note that the reflectance observations of background pixels remain zero, presenting variation patterns similar to those of object pixels with inadequate reflectance observations (e.g., made of highly absorptive dark materials). As a result, utilizing the binary object mask to eliminate adverse interference from invalid background regions leads to a significant decrease of MAE values for objects of dark materials such as ss400 and specular-black-phenolic, as shown in the right part of Fig. 5. In comparison, such improvement is almost negligible for Bunny made of light materials such as pink-felt and silver-paint, as illustrated in the left part of Fig. 5.

Fig. 6: Qualitative results of CNN models with/without referring to the object mask for real-world objects (Bear, Reading and Pot2) in the DiLiGenT dataset. More accurate normal estimation results are achieved for regions with cast shadows by referring to the easily obtained object mask. Zoom in to check more details in the highlighted regions.

In Fig. 6, we visualize the surface normal estimation results with/without considering the object mask for real-world objects (Bear, Reading and Pot2) in the DiLiGenT dataset. It is observed that more accurate surface normal estimation results are generated in regions with complex geometry and obvious cast shadows by referring to the easily obtained object mask.

IV-B3 The setting of kernel sizes

Kernel Size BALL CAT POT1 BEAR POT2 BUDD. GOBL. READ. COW HARV. Avg.
N=3, M=1 2.97 6.23 7.90 7.13 10.24 8.23 9.30 13.37 10.01 15.80 9.12
N=3, M=3 2.82 5.95 6.91 5.46 8.10 7.05 8.38 11.85 8.16 14.09 7.88
N=3, M=5 2.29 5.87 6.92 5.79 6.89 6.85 7.88 11.94 7.48 13.71 7.56
N=3, M=7 2.21 5.46 6.53 5.62 7.42 6.99 7.86 12.56 7.30 13.93 7.59
M=5, N=1 2.80 5.63 6.74 6.94 9.23 7.42 10.28 12.51 12.32 15.22 8.91
M=5, N=3 2.29 5.87 6.92 5.79 6.89 6.85 7.88 11.94 7.48 13.71 7.56
M=5, N=5 2.30 6.57 7.16 5.11 7.97 7.13 8.24 12.51 7.45 13.60 7.80
TABLE III: The comparative results (MAE) of MT-PS-CNN models using different M and N parameters on objects in the DiLiGenT benchmark.

The kernel sizes of the 3D convolutional layers (M and N) in the inter- and intra-frame feature extractors are critical parameters determining the complexity and performance of the proposed model. The performance of MT-PS-CNN models using different M and N parameters is experimentally evaluated, and the comparative results (MAE values) on objects in the DiLiGenT benchmark are shown in Table III. It is observed that networks using larger kernel sizes generally produce lower MAE values: a more complex model (considering more adjacent frames and more neighboring pixels) provides better expressive/generalization ability and achieves more accurate surface normal estimation results. However, the improvements become insignificant when M and N are larger than 3. In our implementation, we set M=5 and N=3 to achieve a good balance between model complexity and performance.

IV-C Comparisons with state-of-the-art methods

Method BALL CAT POT1 BEAR POT2 BUDD. GOBL. READ. COW HARV. Avg.
Proposed 2.29 5.87 6.92 5.79 6.89 6.85 7.88 11.94 7.48 13.71 7.56
JU-19[Li2019] 2.40 6.11 6.54 5.23 7.49 9.89 8.61 13.68 7.98 16.18 8.41
CH-18[Chen2018ECCV_PS-FCN] 2.82 6.16 7.13 7.55 7.25 7.91 8.60 13.33 7.33 15.85 8.39
SI-18[Ikehata2018ECCV-CNN-PS] 2.20 4.60 5.40 12.30 6.00 7.90 7.30 12.60 7.90 13.90 8.01
TM-18[Taniai2018] 1.47 5.44 6.09 5.79 7.76 10.36 11.47 11.03 6.32 22.59 8.83
HI-17[Santo2017] 2.02 6.54 7.05 6.31 7.86 12.68 11.28 15.51 8.01 16.86 9.41
ST-14[Shi2014] 1.74 6.12 6.51 6.12 8.78 10.60 10.09 13.63 13.93 25.44 10.30
IA-14[Ikehata2014] 3.34 6.74 6.64 7.11 8.77 10.47 9.71 14.19 13.05 25.95 10.60
BASELINE[Woodham1980] 4.10 8.41 8.89 8.39 14.65 14.92 18.50 19.80 25.60 30.62 15.39
  • For SI-18, we use all 96 images to estimate the normal map of Bear. The result reported in SI-18 [Ikehata2018ECCV-CNN-PS] (Bear: 4.1) was achieved by discarding the first 20 input images.

TABLE IV: Quantitative results of our proposed MT-PS-CNN model and state-of-the-art photometric stereo methods on the DiLiGenT benchmark dataset.

In this subsection, we compare our proposed MT-PS-CNN model with a number of state-of-the-art photometric stereo solutions including IK-12 [Ikehata2012], IA-14 [Ikehata2014], ST-14 [Shi2014], HI-17 [Santo2017], TM-18 [Taniai2018], CH-18 [Chen2018ECCV_PS-FCN], SI-18 [Ikehata2018ECCV-CNN-PS], and JU-19 [Li2019]. The source codes or evaluation results of these methods are publicly available.

Quantitative evaluation results on the DiLiGenT benchmark dataset are provided in Table IV. When the number of input images is 96, our proposed MT-PS-CNN model achieves the lowest average MAE (7.56) on the 10 real-world objects of DiLiGenT. It is worth mentioning that the improvements are particularly obvious for objects with complex geometry (e.g., Buddha, Reading, and Harvest).

Fig. 7 and Fig. 8 show qualitative results on the two real-world datasets, DiLiGenT and Light Stage Data Gallery, respectively. It is observed that our proposed MT-PS-CNN model is capable of predicting surface normal maps with rich details. Moreover, it generates more accurate surface normal estimation results in the highlighted regions with complex geometry compared with the state-of-the-art PS-FCN method (CH-18) [Chen2018ECCV_PS-FCN].

Fig. 7: Surface normal estimation results using 96 input images on the DiLiGenT dataset. Zoom in to check estimation results in regions with complex geometry.
Fig. 8: Surface normal estimation results using 36 input images on the Light Stage Data Gallery dataset. Zoom in to check more details in the highlighted regions.
HI-17[Santo2017] CH-18[Chen2018ECCV_PS-FCN] SI-18[Ikehata2018ECCV-CNN-PS] MT-PS-CNN
Model size 33.97M 2.21M 2.65M 0.72M
TABLE V: Model size of various deep learning-based photometric stereo approaches.

Table V shows the model sizes of MT-PS-CNN and three other deep learning-based approaches. Our proposed MT-PS-CNN model achieves the highest normal estimation accuracy using significantly fewer parameters. As a result, it is more suitable for deployment on memory-restricted devices.

Method 96 16 10 8 6
Proposed 7.56 8.82 9.84 10.75 12.30
JU-19[Li2019] 8.43 9.66 10.02 10.39 12.16
CH-18[Chen2018ECCV_PS-FCN] 8.39 9.37 10.33 11.13 12.56
TABLE VI: Surface normal estimation results using different number of input images (96, 16, 10, 8, 6) on the DiLiGenT dataset. We calculate the average MAE for 10 objects.

In Table VI, we show the normal estimation results (the average MAE over the 10 objects in the DiLiGenT dataset) of different deep learning-based photometric stereo methods using 96, 16, 10, 8, and 6 input images. A noticeable advantage of the proposed MT-PS-CNN is that it performs robustly when the number of input images changes significantly. Note that JU-19 [Li2019] is designed to decrease the demand on the number of images for the photometric stereo task by learning the most informative ones under different illumination conditions; thus it performs the best with 6 and 8 input images. However, its performance is suboptimal when processing dense input images. In comparison, our proposed method performs consistently well on images captured under both sparse and dense lighting distributions. As illustrated in Table VII, MT-PS-CNN performs significantly better than the other alternatives for objects with complex geometry (e.g., Buddha, Reading, and Harvest), which represent a more challenging normal estimation task.

Method BALL CAT POT1 BEAR POT2 BUDD. GOBL. READ. COW HARV. Avg.
Proposed 4.20 7.30 8.78 8.59 9.85 8.25 10.44 13.17 10.84 16.97 9.84
JU-19[Li2019] 3.97 6.69 7.30 8.73 9.74 11.36 10.46 14.37 10.19 17.33 10.02
CH-18 4.26 8.12 9.84 8.07 9.29 9.24 10.61 15.11 9.90 18.90 10.33
TABLE VII: Surface normal estimation results of objects in the DiLiGenT dataset using 10 illumination directions. We randomly selected 10 images from the 96 inputs and calculated the average MAE over 10 trials.

V Conclusion

In this paper, we present a two-stage CNN architecture to extract inter- and intra-frame feature representations for high-quality surface normal estimation of non-Lambertian objects. We experimentally investigate a number of network design alternatives to identify the optimal scheme (T-IRFE-IAFE) for deploying inter-frame and intra-frame feature extraction modules in the photometric stereo problem. Moreover, we integrate the easily obtained object mask into the intra-frame spatial convolutions to improve the accuracy of normal estimation for surfaces made of highly absorptive dark materials or with obvious cast shadows. The advantages of the proposed MT-PS-CNN model include producing more accurate normal estimation results using significantly fewer parameters and performing robustly under both dense and sparse image capturing configurations. The source code of the MT-PS-CNN model will be made publicly available.

References