Photo-Geometric Autoencoding to Learn 3D Objects from Unlabelled Images

06/04/2019 ∙ by Shangzhe Wu, et al. ∙ University of Oxford

We show that generative models can be used to capture visual geometry constraints statistically. We use this fact to infer the 3D shape of object categories from raw single-view images. Differently from prior work, we use no external supervision, nor do we use multiple views or videos of the objects. We achieve this by a simple reconstruction task, exploiting the symmetry of the objects' shape and albedo. Specifically, given a single image of the object seen from an arbitrary viewpoint, our model predicts a symmetric canonical view, the corresponding 3D shape and a viewpoint transformation, and trains with the goal of reconstructing the input view, resembling an auto-encoder. Our experiments show that this method can recover the 3D shape of human faces, cat faces, and cars from single view images, without supervision. On benchmarks, we demonstrate superior accuracy compared to other methods that use supervision at the level of 2D image correspondences.


1 Introduction

Given enough labelled data, deep neural networks can learn tasks such as object recognition and monocular depth estimation. However, doing so from unlabelled data is much more difficult. In fact, it is often unclear what can be learned when labels are missing. In this paper, we consider the problem of learning the 3D shape of object categories from raw images and seek to solve it by making minimal assumptions on the data. In particular, we do not wish to use any external image annotation.

Given multiple views of the same 3D object, techniques such as structure-from-motion (SFM) can be used to reconstruct the object’s 3D shape Faugeras and Luong (2001). In fact, visual geometry suggests that multiple views are not only sufficient, but also necessary for reconstruction. However, this requirement can be relaxed if one has prior information on the possible shapes of the objects. By learning such a prior, authors have demonstrated that monocular 3D reconstruction is in fact possible and practical Zhou et al. (2017); Ummenhofer et al. (2017). However, this does not clarify how the necessary prior can be acquired in the first place. Recent methods such as SFM-Learner Zhou et al. (2017) have shown that the prior too can be extracted from a collection of unlabelled images with no externally-provided 3D information. However, these methods require multiple views of the same object, just as SFM does.

In this paper, we wish to relax these conditions even further and reconstruct objects from an unconstrained collection of images. By “unconstrained”, we mean that images are i.i.d. samples from a distribution of views of different object instances, such as a gallery of human faces. In particular, each image may contain a different object identity, preventing a direct application of the geometric principles leveraged by SFM and SFM-Learner. Even so, we argue that these principles are still relevant and applicable, albeit in a statistical sense.

The recent works of Moniz et al. (2018) and Kanazawa et al. (2018), and more generally non-rigid SFM approaches, have demonstrated that 3D reconstruction from unconstrained images is possible provided that at least 2D object keypoint annotations are available. However, the knowledge of keypoints captures a significant amount of information, as shown by the fact that learning a keypoint detector in an unsupervised manner is a very difficult problem in its own right Thewlis et al. (2017a). Ultimately, 3D reconstruction of deformable objects from unconstrained images and without keypoint annotations remains an open challenge.

Figure 1: From single image to 3D without supervision. The leftmost column shows the input image to our model and on the right we show the predicted 3D representation from novel viewpoints and modified lighting. During training we only use a collection of “in-the-wild” face images with one view per face. Our model learns the geometry, lighting and shading implicitly from the data.

In this work, we suggest that the constraints that are explicitly provided by 2D or 3D data annotations can be replaced by weaker constraints on the statistics of the data that do not require data annotation of any kind. We capture such constraints as follows (section 3). First, we observe that understanding 3D geometry can explain much of the variation in a dataset of images. For example, the depth map extracted from an image of a human face depends on the face shape as well as the viewpoint; if the viewpoint is registered, then the variability is significantly reduced. We can exploit this fact by seeking a generative model that explains the data as a combination of three partially independent factors: viewpoint, geometry and texture. We pair the generative model with an encoder that can explain any given image of the object as a combination of these factors. The resulting autoencoder maps the data into a structured space. The structure is specified by how the factors are combined: shape and viewpoint induce, based on the camera geometry, a 2D deformation of the canonical texture, which in turn matches the observed data.

This process can also be seen as extracting 3D information from 2D image deformations. However, deformations arising from small depth variations, which are characteristic of objects such as faces, are subtle. Illumination provides a complementary cue for high-frequency but shallow depth variations. Using a generative model allows us to integrate this cue in a simple manner: we estimate an additional factor, representing the main lighting direction, and combine the latter with the estimated depth and texture to infer the object shading. Finally, we propose to exploit the fact that many object classes of interest are symmetric, in most cases bilaterally. We show how this constraint can be leveraged by enforcing that, in the canonical modelling space, appearance and geometry are mirror-symmetric.

We show empirically (section 4) that these cues, when combined, lead to a good monocular reconstruction of 3D object categories such as human and animal faces and objects such as cars. Quantitatively, we test our model on a large dataset of synthetic faces generated using computer graphics, so that accurate ground-truth information is available for assessment. We also consider a benchmark dataset of real human faces for which 3D annotations are available and outperform a recent state-of-the-art method that uses keypoint supervision, while our method uses no supervision at all. We also test the components of our method via an ablation study and show the different benefits they bring to the quality of the reconstruction.

2 Related Work

The literature on estimating 3D structures from (collections of) images is vast. Here, we restrict the overview to mostly learning-based methods. Traditionally, 3D reconstruction can be achieved by means of multiple view geometry Hartley and Zisserman (2003), where correspondences over multiple views can be used to reconstruct the shape and the camera parameters. Another important cue for recovering a surface from an image is the shading Woodham (1980). In fact, our model also uses the link between shape and shading to recover the geometry of objects. When multiple distinct views are not available, information can be gained from various other sources.

Zhou et al. (2017); Wang et al. (2018); Novotny et al. (2017); Agrawal et al. (2015) learn from videos, while Godard et al. (2017); Luo et al. (2018) train using stereo image pairs. Recently, methods that learn instance geometry from collections of single-view instance images, rather than from multiple views of the same instance, have emerged. DAE Shu et al. (2018) learns to predict a deformation field by heavily constraining an autoencoder with a small bottleneck embedding. A follow-up work Sahasrabudhe et al. (2019) learns to disentangle the 3D mesh and the viewpoint from the deformation field. Similarly, SfSNet Sengupta et al. (2018) learns with partial supervision from synthetic ground-truth data, and Kanazawa et al. (2018) needs foreground segmentation and 2D keypoints to learn a parametric 3D model. GANs have been proposed to learn a generative model for 3D shapes by discriminating their reprojection into images Kato and Harada (2019); Henzler et al. (2018); Szabó and Favaro (2018); Nguyen-Phuoc et al. (2019); Zhu et al. (2018). Shape models have been used as a form of supervised constraint to learn a mapping between images and shape parameters Wang et al. (2017); Gecer et al. (2019). Thewlis et al. (2018, 2017b) demonstrate that the symmetry of objects can be leveraged to learn a canonical representation. Depth can also be learned from keypoints Moniz et al. (2018); Kudo et al. (2018); Chen et al. (2019); Suwajanakorn et al. (2018), which serve as a form of correspondence even across instances of different objects.

Since our model generates images from an internal 3D representation, one part of the model is a differentiable renderer. However, with a traditional rendering pipeline, gradients across occlusions and object boundaries are undefined. Several soft relaxations have thus been proposed Kato et al. (2018); Liu et al. (2019); Loper and Black (2014). In this work we use an implementation of Kato et al. (2018) (https://github.com/daniilidis-group/neural_renderer).

3 Method

(a) Unsupervised model.
(b) Canonical view and symmetry.
Figure 2: Left: Our network decomposes an input image into shape, albedo, viewpoint and shading. It is trained in an unsupervised fashion to reconstruct the input images. Right: The process is regularized by mapping the image into a space where the symmetry of the object is apparent.

Our method takes as input an unconstrained collection of images of an object category, such as human faces, and returns as output a model that can explain each image as the combination of a 3D shape, a texture, an illumination and a viewpoint, as illustrated in fig. 2(a).

Formally, an image $\mathbf{I}$ is a function defined on a lattice $\Omega = \{0, \dots, W-1\} \times \{0, \dots, H-1\}$, or, equivalently, a tensor in $\mathbb{R}^{3 \times W \times H}$. We assume that the image is roughly centered on an instance of the object of interest. The goal is to learn a function $\Phi$, implemented as a neural network, that maps the image $\mathbf{I}$ to four factors $(d, a, w, l)$ consisting of a depth map $d$, an albedo image $a$, a viewpoint $w$ and a global light direction $l$.

In order to learn this disentangled representation without supervision, we task the model with the goal of reconstructing the input image from the four factors. The reconstruction is a differentiable operation composed of two steps: lighting $\Lambda$ and reprojection $\Pi$, as follows:

$\hat{\mathbf{I}} = \Pi\big(\Lambda(a, d, l),\, d,\, w\big). \quad (1)$

The lighting function $\Lambda$ generates a shaded version of the face based on the depth map $d$, the light direction $l$ and the albedo $a$ as seen from a canonical viewpoint. For example, for faces a natural choice for the canonical viewpoint is a frontal view, since this minimizes self-occlusions, but we let the network choose one automatically. The viewpoint $w$ then represents the transformation between the canonical view and the viewpoint of the actual input image $\mathbf{I}$. Then, the reprojection function $\Pi$ simulates the effect of a viewpoint change and generates the image $\hat{\mathbf{I}}$ given the canonical depth $d$ and the shaded canonical image $J = \Lambda(a, d, l)$.
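The decomposition-and-reconstruction pipeline of eq. (1) can be summarized in a short sketch. The code below is purely illustrative: the sub-networks (`depth_net`, `albedo_net`, `view_net`, `light_net`) and the `lighting`/`reproject` functions are hypothetical placeholders for the components described in the following subsections, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def photo_geometric_autoencode(image, depth_net, albedo_net, view_net, light_net,
                               lighting, reproject):
    """Illustrative forward pass of the photo-geometric autoencoder (eq. 1)."""
    d = depth_net(image)      # canonical depth map,  B x 1 x H x W
    a = albedo_net(image)     # canonical albedo,     B x 3 x H x W
    w = view_net(image)       # viewpoint parameters, B x 6
    l = light_net(image)      # lighting parameters (direction and coefficients)

    J = lighting(a, d, l)         # shaded canonical image, Lambda(a, d, l)
    I_hat = reproject(J, d, w)    # warped to the input viewpoint, Pi(J, d, w)

    rec_loss = F.l1_loss(I_hat, image)   # photometric reconstruction loss
    return I_hat, rec_loss
```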

Next, we discuss the functions $\Pi$ and $\Lambda$ in model (1) in detail.

Reprojection function $\Pi$ and camera model.

The image $\mathbf{I}$ is formed by a camera sensor looking at a 3D object. If we denote with $P = (P_x, P_y, P_z)$ a 3D point expressed in the reference frame of the camera, this is mapped to the pixel $p = (u, v, 1)$ by the following projection equation:

$p \propto KP, \qquad K = \begin{pmatrix} f & 0 & c_u \\ 0 & f & c_v \\ 0 & 0 & 1 \end{pmatrix}, \qquad c_u = \tfrac{W-1}{2}, \; c_v = \tfrac{H-1}{2}, \; f = \tfrac{W-1}{2\tan(\theta_{\mathrm{FOV}}/2)}. \quad (2)$

This model assumes a perspective camera with field of view (FOV) $\theta_{\mathrm{FOV}}$ in the horizontal direction. Given that the images are cropped around a particular object, we assume a relatively narrow FOV. The object is assumed to be approximately at a fixed distance from the camera.

The depth map $d$ associates to each pixel $(u, v) \in \Omega$ a depth value $d_{uv}$ in the canonical view. Inverting the camera model (2), this corresponds to the 3D point $P = d_{uv} \cdot K^{-1} p$.

The viewpoint $w \in \mathbb{R}^6$ represents an Euclidean transformation $(R, T) \in SE(3)$ such that $R = \exp(\hat{w}_{1:3})$ (this is the exponential map of the rotation parameters) and $T = w_{4:6}$.

The map $(R, T)$ transforms 3D points from the canonical view to the actual view. Thus a pixel $(u, v)$ in the canonical view is mapped to the pixel $(u', v')$ in the actual view by the warping function $\eta_{d,w} : (u, v) \mapsto (u', v')$ given by:

$p' \propto K\big(d_{uv} \cdot R K^{-1} p + T\big), \quad (3)$

where $p = (u, v, 1)$ and $p' = (u', v', 1)$.

Finally, the reprojection function $\Pi$ takes as input the depth $d$ and the viewpoint change $w$ and applies the resulting warp to the canonical image $J$ to obtain the actual image as $\hat{\mathbf{I}}_{u'v'} = J_{uv}$, where $(u', v') = \eta_{d,w}(u, v)$.

Notice that this requires computing the inverse of the warp $\eta_{d,w}$. This issue is discussed in detail in section 3.1.
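Eqs. (2)–(3) translate directly into code. The sketch below is a minimal illustration under simplifying assumptions: it ignores batching and takes the viewpoint already converted to a rotation matrix `R` and translation vector `T` (the exponential map is omitted); it is not the authors' implementation.

```python
import math
import torch

def intrinsics(W, H, fov_deg):
    """Pinhole intrinsics for a camera with horizontal field of view fov_deg (eq. 2)."""
    f = (W - 1) / (2.0 * math.tan(math.radians(fov_deg) / 2.0))
    cu, cv = (W - 1) / 2.0, (H - 1) / 2.0
    return torch.tensor([[f, 0.0, cu],
                         [0.0, f, cv],
                         [0.0, 0.0, 1.0]])

def warp_canonical_to_actual(depth, K, R, T):
    """Map every canonical pixel (u, v) to its location (u', v') in the actual view (eq. 3).

    depth: H x W canonical depth map; R: 3x3 rotation; T: 3-vector translation.
    Returns an H x W x 2 field of (u', v') pixel coordinates.
    """
    H, W = depth.shape
    v, u = torch.meshgrid(torch.arange(H, dtype=depth.dtype),
                          torch.arange(W, dtype=depth.dtype), indexing="ij")
    p = torch.stack([u, v, torch.ones_like(u)], dim=-1)                      # H x W x 3
    # back-project pixels to 3D using the canonical depth: P = d * K^-1 p
    P = depth.unsqueeze(-1) * (torch.inverse(K) @ p.reshape(-1, 3).T).T.reshape(H, W, 3)
    P = (R @ P.reshape(-1, 3).T).T + T                                        # rigid transform
    p_prime = (K @ P.T).T                                                     # project back
    return (p_prime[:, :2] / p_prime[:, 2:3]).reshape(H, W, 2)
```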

Lighting function $\Lambda$.

The goal of the lighting function $\Lambda$ is to generate the canonical image $J$ as a combination of albedo, 3D shape and light direction. Note that the effect of lighting could be incorporated in the factor $a$ by interpreting the latter as a texture rather than as the object’s albedo. However, there are two good reasons for avoiding this. First, the albedo is often symmetric even if the illumination causes the corresponding texture to look asymmetric. Separating them allows us to more effectively incorporate the symmetry constraint described below. Second, shading provides an additional cue on the underlying 3D shape Woodham (1980); Horn (1975); Belhumeur et al. (1999). In particular, unlike the recent work of Shu et al. (2018), where a shading map is predicted independently from shape, our model computes the shading based on the predicted depth, constraining the two.

Formally, given the depth map $d$, we derive the normal map $n$ by associating to each pixel $(u, v)$ a vector $n_{uv}$ normal to the underlying 3D surface. In order to find this vector, we compute the vectors $t^u_{uv}$ and $t^v_{uv}$ tangent to the surface along the $u$ and $v$ directions. For example, the first one is $t^u_{uv} = P_{u+1,v} - P_{u-1,v}$, where $P_{uv} = d_{uv} \cdot K^{-1} p_{uv}$ is the back-projection of pixel $(u, v)$. Then the normal is obtained by taking the vector product $n_{uv} \propto t^u_{uv} \times t^v_{uv}$.

The normal $n_{uv}$ is multiplied by the light direction $l$ to obtain a value for the direct illumination and the latter is added to the ambient light. Finally, the result is multiplied by the albedo to obtain the illuminated texture, as follows:

$J_{uv} = \big(k_s + k_d \max\{0, \langle l, n_{uv}\rangle\}\big) \cdot a_{uv}. \quad (4)$

Here $k_s$ and $k_d$ are the scalar coefficients weighting the ambient and direct terms.
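A minimal sketch of this shading model follows. It assumes the normal map is obtained from central differences of the back-projected 3D points (one plausible discretization, not necessarily the exact one used in the paper) and that albedo and normals are stored as H x W x 3 arrays.

```python
import torch
import torch.nn.functional as F

def normals_from_depth(points3d):
    """Normal map (H x W x 3) from back-projected 3D points via central differences."""
    P = points3d.permute(2, 0, 1).unsqueeze(0)                 # 1 x 3 x H x W
    P = F.pad(P, (1, 1, 1, 1), mode="replicate")               # replicate borders
    tu = P[..., 1:-1, 2:] - P[..., 1:-1, :-2]                  # tangent along u
    tv = P[..., 2:, 1:-1] - P[..., :-2, 1:-1]                  # tangent along v
    n = torch.cross(tu, tv, dim=1)                             # 1 x 3 x H x W
    return F.normalize(n, dim=1).squeeze(0).permute(1, 2, 0)   # H x W x 3

def shade(albedo, normals, light_dir, ks, kd):
    """Lambertian shading as in eq. (4): (ks + kd * max(0, <l, n>)) * albedo."""
    l = F.normalize(light_dir, dim=-1)                         # unit light direction
    diffuse = (normals * l).sum(dim=-1, keepdim=True).clamp(min=0.0)
    return (ks + kd * diffuse) * albedo
```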

3.1 Differentiable rendering layer

As noted in the previous section, the reprojection function $\Pi$ warps the canonical image $J$ to generate the actual image $\hat{\mathbf{I}}$. In CNNs, image warping is usually regarded as a simple operation that can be implemented efficiently using a bilinear resampling layer Jaderberg et al. (2015). However, this is true only if we can easily send pixels in the warped image $\hat{\mathbf{I}}$ back to pixels in the source image $J$, a process also known as backward warping. Unfortunately, in our case the warp $\eta_{d,w}$ obtained from eq. (3) sends pixels in the opposite direction.

Implementing a forward warping layer is surprisingly delicate. One way of approaching the problem is to regard this task as a special case of rendering a textured mesh. The recent Neural Mesh Renderer (NMR) of Kato et al. (2018) is a differentiable renderer of this type. In our case, however, the mesh has one vertex per pixel and each group of $2 \times 2$ adjacent pixels is tessellated by two triangles. Empirically, we found the quality of the texture gradients computed by NMR to be poor in this case, probably also due to the high-frequency content of the texture image $J$.

We solve the problem as follows. First, we use NMR to warp not the albedo, but the depth map $d$ itself, obtaining a version of the depth map as seen from the actual viewpoint. This has two advantages: first, NMR is much faster when the task is limited to rendering the depth map instead of warping an actual texture; second, the gradients are more stable, probably also due to the comparatively smooth nature of the depth map compared to the texture image $J$. Given the warped depth map, we then use the inverse of eq. (3) to find the warp field from the observed viewpoint to the canonical viewpoint, and bilinearly resample the canonical image to obtain the reconstruction $\hat{\mathbf{I}}$.
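Only the final backward-warping step is sketched below: given a field of canonical-image coordinates for each pixel of the actual view (how this field is derived from the NMR-warped depth map and the inverse of eq. (3) is omitted here), the canonical image is resampled bilinearly with `torch.nn.functional.grid_sample`. This is an illustrative sketch, not the authors' code.

```python
import torch
import torch.nn.functional as F

def backward_warp(canonical_image, backward_coords):
    """Bilinearly resample the canonical image J at the given pixel coordinates.

    canonical_image: B x 3 x H x W shaded canonical view.
    backward_coords: B x H x W x 2 (u, v) coordinates in the canonical image,
                     one per pixel of the actual view.
    """
    B, C, H, W = canonical_image.shape
    # grid_sample expects coordinates normalized to [-1, 1]
    u = 2.0 * backward_coords[..., 0] / (W - 1) - 1.0
    v = 2.0 * backward_coords[..., 1] / (H - 1) - 1.0
    grid = torch.stack([u, v], dim=-1)
    return F.grid_sample(canonical_image, grid, mode="bilinear",
                         padding_mode="border", align_corners=True)
```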

Discussion.

Several alternative architectures were tested and discarded in favor of the one outlined above. Among those, one option is to task the network with estimating the depth map in the actual view as well as the canonical one. However, this requires ensuring that the two depth maps are compatible, which adds extra complexity to the model and did not work as well.

3.2 Symmetry

A constraint that is often useful in modelling object categories is the fact that these have a bilateral symmetry, both in shape and albedo. Under the assumption of bilateral symmetry, we are able to obtain a second virtual view of an object simply by flipping the image horizontally, as shown in fig. 2(b). Note that, if we were given the correspondences between symmetric points of the object (such as the corners of the two eyes), we could use this information to infer the object’s 3D shape Gao and Yuille (2017); Gordon (1990). While such correspondences are not given to us, as the system is unsupervised, we estimate them implicitly by mapping the image to the canonical space.

In practice, there are various ways to enforce a symmetry constraint. For example, one can add a symmetry loss term to the learning objective as a regularizer. However, this requires balancing more terms in the objective. Instead, we incorporate symmetry by performing the reconstruction from both the canonical image and its mirrored version.

In order to do so, we introduce the (horizontal) flipping operator, whose action on a tensor $a$ is given by $(\mathrm{flip}\,a)_{c,u,v} = a_{c,\,W-1-u,\,v}$. During training, we randomly choose to flip the canonical albedo and depth before we reconstruct the image using eq. (1). Implicitly, and without introducing an additional loss term, this imposes several constraints on the model. Both depth and albedo will be predicted with horizontal symmetry, to overcome the confusion that is introduced by the flipping operation. Additionally, this constrains the canonical viewpoint to align the object’s plane of symmetry with the vertical centerline of the image. Finally, flipping helps to disentangle albedo and shading: if an object is lit from one side and the albedo is flipped, the target still needs to be lit from the same side, requiring the shading to arise from the geometry and not from the texture.
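A sketch of the random-flipping scheme, assuming the canonical albedo and depth are stored in B x C x H x W layout; the function name and flip probability are illustrative choices, not taken from the paper.

```python
import random
import torch

def maybe_flip(albedo, depth, p=0.5):
    """Randomly mirror the canonical albedo and depth about the vertical centerline.

    Reconstructing the input from the flipped factors, without any extra loss
    term, pushes the predicted albedo and depth towards bilateral symmetry.
    """
    if random.random() < p:
        albedo = torch.flip(albedo, dims=[-1])   # flip along the width dimension
        depth = torch.flip(depth, dims=[-1])
    return albedo, depth
```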

3.3 Loss, regularizer, and objective function

The primary loss function of our model is the $\ell_1$ loss on the reconstruction $\hat{\mathbf{I}}$ and the input image $\mathbf{I}$:

$\mathcal{L}_{\mathrm{rec}}(\hat{\mathbf{I}}, \mathbf{I}) = \|\hat{\mathbf{I}} - \mathbf{I}\|_1. \quad (5)$

However, this loss is sensitive to small geometric imperfections and tends to result in blurry reconstructions; to avoid that, we add a perceptual loss, which is more robust to such geometric imperfections and eventually leads to a much sharper canonical image. This is obtained by using an off-the-shelf image encoder (VGG16 in our case, Simonyan and Zisserman (2015)), and is given by $\mathcal{L}_{\mathrm{perc}}(\hat{\mathbf{I}}, \mathbf{I}) = \sum_k \|e_k(\hat{\mathbf{I}}) - e_k(\mathbf{I})\|^2$, where $e_k$ is the feature map computed by the $k$-th layer of the encoder network.
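A sketch of the perceptual term, assuming a frozen feature extractor that returns a list of intermediate feature maps (e.g. selected VGG16 layers); the exact layers and weighting used in the paper are not restated here.

```python
import torch

def perceptual_loss(recon, target, feature_extractor):
    """Compare images in the feature space of a frozen encoder.

    `feature_extractor` maps an image batch to a list of feature maps e_k(I);
    the loss sums mean squared differences over the selected layers.
    """
    with torch.no_grad():
        target_feats = feature_extractor(target)
    recon_feats = feature_extractor(recon)
    return sum(torch.mean((f_r - f_t) ** 2)
               for f_r, f_t in zip(recon_feats, target_feats))
```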

We regularize the viewpoint by pulling its mean to zero, breaking the tie between equivalent rotations (which have a period of $2\pi$) and aligning the canonical view to the mean viewpoint in the dataset. This is achieved by minimizing the function $\mathcal{R}_w = \big\| \frac{1}{N}\sum_{i=1}^{N} w_i \big\|^2$, where $w_i$ is the viewpoint estimated for image $\mathbf{I}_i$ in a batch of $N$ images. We also regularize the depth by shrinking its variance between faces. We do so via the regularization term $\mathcal{R}_d = \frac{1}{|\Omega|}\sum_{uv \in \Omega} (d_{uv} - d'_{uv})^2$, where $d$ and $d'$ are the depth maps obtained from a pair of example images $\mathbf{I}$ and $\mathbf{I}'$. Losses and regularizers are averaged over a batch, yielding the objective:

$\mathcal{L} = \mathcal{L}_{\mathrm{rec}} + \lambda_{\mathrm{p}}\,\mathcal{L}_{\mathrm{perc}} + \lambda_{\mathrm{w}}\,\mathcal{R}_w + \lambda_{\mathrm{d}}\,\mathcal{R}_d. \quad (6)$
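The regularizers and the combined objective can be sketched as follows, assuming the batch viewpoints are stacked into a B x 6 tensor and the two depth maps of a pair have matching shapes; the weight names `lam_*` are placeholders, not the values used in the paper (see the supplementary tables).

```python
import torch

def viewpoint_regularizer(w):
    """Penalize the squared norm of the mean viewpoint over a batch (B x 6 tensor)."""
    return (w.mean(dim=0) ** 2).sum()

def depth_regularizer(d, d_prime):
    """Shrink the squared difference between the depth maps of a pair of examples."""
    return ((d - d_prime) ** 2).mean()

def total_objective(l_rec, l_perc, r_view, r_depth,
                    lam_perc=1.0, lam_view=1.0, lam_depth=1.0):
    """Weighted sum of losses and regularizers as in eq. (6); weights are placeholders."""
    return l_rec + lam_perc * l_perc + lam_view * r_view + lam_depth * r_depth
```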

3.4 Neural network architecture

We use different networks to extract depth, albedo, viewpoint and lighting from a single image of the object. The depth and albedo are generated by encoder-decoder networks, while viewpoint and lighting are regressed using simple encoder networks. In particular, we use a DenseNet-style architecture Huang et al. (2017) for albedo prediction, which is deeper than the standard encoder-decoder used for depth prediction, because we would like the albedo to capture more detail than the depth. We do not use skip connections between encoder and decoder because the network is generating a different view, and thus pixel alignment is not desirable.

4 Experiments

No. Method Scale-inv. err. Normal err. (deg.)
(1) Supervised
(2) Supervised
(3) Const. null depth
(4) Const. blob depth
(5) Average g.t. depth
(6) Ours

Method Depth Corr.
Ground truth
AIGN Tung et al. (2017) (supervised)
DepthNetGAN Moniz et al. (2018) (supervised)
MOFA Tewari et al. (2017) (model-based)
DepthNet (paper) Moniz et al. (2018)
DepthNet (github) Moniz et al. (2018)
Ours

Table 1: Performance bounds. We compare our method to fully-supervised and trivial baselines.
Table 2: 3DFAW - Keypoint depth. Depth correlation between ground truth and prediction evaluated at facial keypoint locations.

We first analyze the contribution of the individual components of our model (1) and of the regularizers. We do so quantitatively, by using a synthetic face dataset where 3D ground truth is available to measure the quality of the predicted depth maps. However, we also show qualitatively that these metrics do not fully account for all the aspects that make a good 3D reconstruction, and we demonstrate that some components of our model are particularly good at addressing those. We also compare our method on real data to Moniz et al. (2018) who estimate depth for facial keypoints, and test the generalization of the model to objects other than human faces by training on synthetic ShapeNet cars and real cat faces.

For reproducibility and future comparisons, we describe network architecture details and hyperparameters in the supplementary material. We will release the code, trained models and the synthetic dataset upon acceptance of the paper.

4.1 Quantitative assessment and ablation

To evaluate the model quantitatively, we utilize synthetic data, where we know the ground truth depth. We follow in particular the protocol of Sengupta et al. (2018) to generate a large dataset of synthetic faces using the Basel Face Model Paysan et al. (2009). The faces are rendered with shapes, textures, illuminations, and rotations randomly sampled from the model. We use images from SUN Database Xiao et al. (2010) as background and render the images together with ground truth depth maps for evaluation.

Since the scale of 3D reconstruction from projective cameras is inherently ambiguous, we adjust for it in the evaluation. Specifically, we take the depth map predicted by the model in the canonical view, map it to a depth map in the actual view, and compare the latter to the ground-truth depth map using the scale-invariant error of Eigen et al. (2014), which is computed on differences of log depths and is therefore invariant to a global scaling of the prediction. Additionally, we report the mean angle deviation between normals computed from the ground-truth depth and from the depth prediction, which measures how well the surface is captured by the prediction.
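Both metrics can be computed as below. This is a standard formulation of the scale-invariant log-depth error and of the mean normal angle, not necessarily the exact evaluation script used by the authors.

```python
import torch

def scale_invariant_error(d_pred, d_gt, eps=1e-8):
    """Scale-invariant depth error (Eigen et al., 2014), computed on log depths."""
    delta = torch.log(d_pred + eps) - torch.log(d_gt + eps)
    return torch.sqrt((delta ** 2).mean() - delta.mean() ** 2)

def mean_normal_angle_error(n_pred, n_gt):
    """Mean angle (degrees) between predicted and ground-truth unit normals (... x 3)."""
    cos = (n_pred * n_gt).sum(dim=-1).clamp(-1.0, 1.0)
    return torch.rad2deg(torch.acos(cos)).mean()
```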

In table 1 we estimate upper bounds on the model performance by comparing it to supervised baselines using the same network architectures. We also see a significant improvement over various constant prediction baselines.

To understand the influence of the individual parts, we remove each one of them and evaluate the ablated model in fig. 3. However, since the error is computed in the original view of the image, it does not evaluate the quality of the parts of the 3D shape that are not visible from that vantage point. Thus, we also visualize the canonical albedo and the normal map computed from the canonical depth map. We can see that all components reduce the error as well as improve the visual quality of the samples. The symmetry constraint on albedo and depth has the strongest impact on the model, while the perceptual loss improves the quality of the reconstruction and helps to avoid local minima during training. The regularizers improve the canonical representation, and, lastly, lighting helps to resolve possible ambiguities in the geometric reconstruction, particularly in texture-less regions.

Figure 3: Ablation study. We compute the overall test set performance when turning off different components of our model and report the scale-invariant error and the absolute angle error (in degrees). Additionally, we show, for one image, the canonical albedo and the normal map computed from the canonical depth.

4.2 Qualitative results

To evaluate the performance on real data, we conduct experiments using two datasets. CelebA Liu et al. (2015) contains a large number of real human face images, and 3DFAW Gross et al. (2010); Jeni et al. (2015); Zhang et al. (2014); Yin et al. (2008) contains images with 3D keypoint annotations. We crop the faces from the original images using the provided keypoints, and follow the official train/val/test splits. In fig. 1 we show qualitative results on both datasets. The 3D shape of the faces is recovered very well by the model, including details of the nose, eyes and mouth, despite the presence of extreme facial expressions.

4.3 Comparison with the state of the art

To the best of our knowledge, there is no prior work on fully unsupervised dense object depth estimation that we can directly compare with. However, the DepthNet model of Moniz et al. (2018) predicts depth for selected facial keypoints given the 2D keypoint locations as input. Hence, we can evaluate the reconstruction obtained by our method on this sparse set of points. We also compare to the baselines MOFA Tewari et al. (2017) and AIGN Tung et al. (2017) reported in Moniz et al. (2018). For a fair comparison, we use their public code, which computes a depth correlation score at the keypoint locations. We use the 2D keypoint locations to sample our predicted depth and then evaluate the same metric. The set of test images from 3DFAW Gross et al. (2010); Jeni et al. (2015); Zhang et al. (2014); Yin et al. (2008) and the preprocessing are identical to Moniz et al. (2018).

In table 2 we report the results from their paper and the slightly higher results we obtained from their publicly-available implementation. The paper also evaluates a supervised model using a GAN discriminator trained with ground-truth depth information. Our fully unsupervised model outperforms DepthNet and reaches close-to-supervised performance, indicating that we learn reliable depth maps.

4.4 Generalization to other objects

To understand how the method generalizes to other symmetric objects, we train on two additional datasets. We use the cat dataset provided by Zhang et al. (2008), crop the cat heads using the keypoint annotations, and split the images into train, validation and test sets. For car images, we render ShapeNet’s Chang et al. (2015) synthetic car models with various viewpoints and textures.

Figure 4: Other datasets. Results from training on cat faces and cars.

We are able to reconstruct both object categories well; the results are visualized in fig. 4. Although we assume Lambertian surfaces to estimate the shading, our model can reconstruct cat faces convincingly despite their fur, which has complicated light-transport properties. This shows that the other parts of the model constrain the shape enough to still converge to meaningful representations. Overall, the model is able to reconstruct cats and cars as well as human faces, showing that the method generalizes across object categories.

5 Conclusions

We have presented a method that can learn, from an unconstrained collection of single-view images of a given category, to reconstruct the 3D shapes of individual instances. The model is fully unsupervised and learns based on a reconstruction loss, similar to an autoencoder. We have shown that lighting and symmetry are strong indicators of shape and help the model to converge to a meaningful reconstruction. Our model outperforms a current state-of-the-art method that uses 2D keypoint supervision. As for future work, the model currently represents 3D shape from a canonical viewpoint, which is sufficient for objects such as faces that have a roughly convex shape and a natural canonical viewpoint. In order to handle more complex objects, it may be possible to extend the model to use either a collection of canonical views or a 3D representation such as a mesh or a voxel map.

Acknowledgement

We gratefully thank Soumyadip Sengupta for sharing with us the code to generate synthetic face datasets, and members of Visual Geometry Group for insightful discussion. Shangzhe Wu is supported by Facebook Research. Christian Rupprecht is supported by ERC Stg Grant IDIU-638009.

References

6 Appendix

6.1 Further Implementation Details

We will release the code and the datasets for future benchmarks upon acceptance of this paper.

Table 3 summarizes the number of images in each of the datasets used in this paper. We use the same image size in all experiments. We also report all hyper-parameter settings in table 6. Our models were trained for roughly one day each on a Titan X Pascal GPU. To avoid border issues after the viewpoint transformation, we predict depth maps at twice the target size and crop the center after warping.
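The border-handling trick can be sketched as a simple center crop applied after warping the oversized depth prediction; the function below is illustrative only.

```python
import torch

def center_crop(x, out_h, out_w):
    """Keep the central out_h x out_w window of a B x C x H x W tensor.

    Used after warping a depth map predicted at twice the target resolution,
    so that pixels pushed outside the frame by the viewpoint change do not
    create empty borders.
    """
    _, _, H, W = x.shape
    top = (H - out_h) // 2
    left = (W - out_w) // 2
    return x[:, :, top:top + out_h, left:left + out_w]
```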

Architecture

We use standard encoder networks for both viewpoint and lighting prediction, and encoder-decoder networks for albedo and depth prediction. The architecture of each network is detailed in table 4, table 5, and table 7. Abbreviations of the building blocks are defined as follows:

  • Conv(c_in, c_out, k, s, p): convolution with c_in input channels, c_out output channels, kernel size k, stride s and padding p.

  • Deconv(c_in, c_out, k, s, p): transposed convolution (deconvolution) with c_in input channels, c_out output channels, kernel size k, stride s and padding p.

  • DBE(c, n): dense encoder block with n convolutions with c channels, each followed by batch normalization and ReLU.

  • TBE(c_in, c_out, s): encoder transition block with convolutions with c_in input channels, c_out output channels and stride s, each followed by batch normalization and LeakyReLU.

  • DBD(c, n): dense decoder block with n deconvolutions with c channels, each followed by batch normalization and ReLU.

  • TBD(c_in, c_out, s): decoder transition block with deconvolutions with c_in input channels, c_out output channels and stride s, each followed by batch normalization and ReLU.

Total Train Val Test
Syn Face
CelebA
3DFAW
Cats
Cars
Table 3: Dataset split sizes for training, validation and testing.


Encoder
Conv(3, 64, 4, 2, 1) + LeakyReLU(0.2)
Conv(64, 128, 4, 2, 1) + BN + LeakyReLU(0.2)
Conv(128, 256, 4, 2, 1) + BN + LeakyReLU(0.2)
Conv(256, 512, 4, 2, 1) + BN + LeakyReLU(0.2)
Conv(512, 512, 4, 2, 1) + BN + LeakyReLU(0.2)
Conv(512, 128, 2, 1, 0) + BN + LeakyReLU(0.2)
Decoder
Deconv(128, 512, 2, 1, 0) + BN + ReLU
Deconv(512, 512, 4, 2, 1) + BN + ReLU
Deconv(512, 256, 4, 2, 1) + BN + ReLU
Deconv(256, 128, 4, 2, 1) + BN + ReLU
Deconv(128, 64, 4, 2, 1) + BN + ReLU
Deconv(64, 64, 4, 2, 1) + BN + ReLU
Conv(64, 1, 5, 1, 2) + Tanh
Table 4: Depth network
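For concreteness, the depth network of Table 4 can be written down directly in PyTorch. The sketch below assumes 64 x 64 inputs (the exact image resolution is not restated here), under which the bottleneck is a 128-dimensional 1 x 1 feature and the output is a single-channel map at the input resolution; it is an illustrative construction, not the released code.

```python
import torch.nn as nn

def conv(cin, cout, k, s, p):
    """Conv + BN + LeakyReLU(0.2), as used in the encoder of Table 4."""
    return [nn.Conv2d(cin, cout, k, s, p),
            nn.BatchNorm2d(cout), nn.LeakyReLU(0.2, inplace=True)]

def deconv(cin, cout, k, s, p):
    """Deconv + BN + ReLU, as used in the decoder of Table 4."""
    return [nn.ConvTranspose2d(cin, cout, k, s, p),
            nn.BatchNorm2d(cout), nn.ReLU(inplace=True)]

depth_net = nn.Sequential(
    # encoder
    nn.Conv2d(3, 64, 4, 2, 1), nn.LeakyReLU(0.2, inplace=True),
    *conv(64, 128, 4, 2, 1),
    *conv(128, 256, 4, 2, 1),
    *conv(256, 512, 4, 2, 1),
    *conv(512, 512, 4, 2, 1),
    *conv(512, 128, 2, 1, 0),
    # decoder
    *deconv(128, 512, 2, 1, 0),
    *deconv(512, 512, 4, 2, 1),
    *deconv(512, 256, 4, 2, 1),
    *deconv(256, 128, 4, 2, 1),
    *deconv(128, 64, 4, 2, 1),
    *deconv(64, 64, 4, 2, 1),
    nn.Conv2d(64, 1, 5, 1, 2), nn.Tanh(),
)
```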
Dense Encoder
Conv(3, 64, 4, 2, 1)
DBE(64, 6) + TBE(64, 128, 2)
DBE(128, 12) + TBE(128, 256, 2)
DBE(256, 24) + TBE(256, 512, 2)
DBE(512, 16) + TBE(512, 128, 4)
Sigmoid
Dense Decoder
Deconv(128, 512, 4, 1, 0)
DBD(512, 16) + TBD(512, 256, 2)
DBD(256, 24) + TBD(256, 128, 2)
DBD(128, 12) + TBD(128, 64, 2)
DBD(64, 6) + TBD(64, 64, 2)
BN + ReLU + Conv(64, 3, 5, 1, 2)
Tanh
Table 5: Albedo network


Parameter Value/Range
Optimizer Adam
Learning rate
Number of epochs
Batch size
Loss weight
Loss weight
Loss weight
Loss weight
Depth (Human face)
Depth (Cat head)
Depth (Car)
Albedo
Light coefficient
Light coefficient
Light direction x/y
Viewpoint rotation x/y/z
Viewpoint translation x/y/z
Table 6: Hyper-parameter settings
Encoder
Conv(3, 32, 4, 2, 1) + ReLU
Conv(32, 64, 4, 2, 1) + ReLU
Conv(64, 128, 4, 2, 1) + ReLU
Conv(128, 256, 4, 1, 0) + ReLU
Conv(256, 256, 4, 2, 1) + ReLU
FC(256, output dim) + Tanh
Table 7: Viewpoint & light networks

6.2 More Qualitative Results

Figure 5: More results on faces, cats and cars. The left-hand side shows the input images, and the right-hand side shows the recovered 3D objects rendered from different viewpoints.






Figure 6: More results of intrinsic image decomposition on synthetic faces. Columns: input, reconstruction, albedo, normal map, shading, shaded result, and side view.





Figure 7: More results of intrinsic image decomposition on CelebA faces. Columns: input, reconstruction, albedo, normal map, shading, shaded result, and side view.
Figure 8: More results on real faces from 3DFAW. The first column is the input image, and the rest illustrates the recovered 3D face and normal map from multiple viewpoints.





Figure 9: More results on real faces from CelebA. The first column is the input image, and the rest illustrates the recovered 3D face and normal map from multiple viewpoints.