
Toward Realistic Single-View 3D Object Reconstruction with Unsupervised Learning from Multiple Images

by   Long-Nhat Ho, et al.

Recovering the 3D structure of an object from a single image is a challenging task due to its ill-posed nature. One approach is to utilize the plentiful photos of the same object category to learn a strong 3D shape prior for the object. This approach has been successfully demonstrated by the recent work of Wu et al. (2020), which obtained impressive 3D reconstruction networks with unsupervised learning. However, their algorithm is only applicable to symmetric objects. In this paper, we eliminate the symmetry requirement with a novel unsupervised algorithm that can learn a 3D reconstruction network from a multi-image dataset. Our algorithm is more general and covers the symmetry-required scenario as a special case. Besides, we employ a novel albedo loss that improves the detail and realism of the reconstructions. Our method surpasses the previous work in both quality and robustness, as shown in experiments on datasets of various structures, including single-view, multi-view, image-collection, and video sets.





1 Introduction

Images are 2D projections of real-world 3D objects, and recovering the 3D structure from a 2D image is an important computer vision task with many applications. Most image-based 3D modeling methods rely on multi-view inputs [42, 43, 16, 17, 11, 57, 48, 21], requiring multiple images of the target object captured from different views. However, these methods are not applicable to scenarios where only a single input image is available, which is the focus of our work in this paper. This problem is called single-view 3D reconstruction, and it is ill-posed since an image can be a projection of infinitely many 3D shapes. Interestingly, humans are very good at estimating the 3D structure of an object of any known class from a single image; we can even predict how it looks in unseen views. This is perhaps because humans have strong prior knowledge about the 3D shape and texture of the object class in consideration. Inspired by this observation, many category-specific 3D modeling methods have been proposed for specific object categories such as faces [3, 40, 59, 46, 39, 44, 47, 13], hands [60, 30, 4, 18], and bodies [34, 24].

In this paper, instead of focusing on any individual category, we aim to develop a general framework that can work for any object category, as long as there are many images from that category to train a single-view 3D reconstruction network. Furthermore, given the difficulty of acquiring 3D ground-truth annotation, we also aim to develop an unsupervised learning method which does not require the ground-truth 3D structures for the objects in the training images. However, this is a challenging problem due to the huge variation of the training images, regarding their viewpoint, appearance, illumination, and background.

A recent study [52] made a breakthrough in solving this problem with a novel end-to-end trainable deep network. Their network consisted of several modules to regress the components of image formation, including the object’s 3D shape, texture, viewpoint, and lighting parameters, so that the rendered image was similar to the input. The modules were trained in an unsupervised manner on image datasets. They assumed a single image per training example, so the problem was still highly under-constrained. To make this training procedure converge, the authors proposed using the symmetry constraint. Their system successfully recovered the 3D shapes of human faces, cat faces, and synthetic cars after training on the respective datasets. For convenience, from now on we will refer to this Learning-from-Symmetry method as LeSym.

While showing good initial results, LeSym has several limitations. First, it requires the target object to be almost symmetric, severely restricting its applicability to certain object classes. For highly asymmetric objects, this method does not work, and for nearly symmetric objects, it would not preserve the asymmetric details. Second, with a strong symmetry constraint, an incorrect mirror line estimation would lead to unrealistic 3D reconstruction. Some examples and detailed discussions on these issues can be found in Sec. 4. Third, when multiple images of the same object in the training dataset are available, LeSym cannot correlate and leverage these images to improve the reconstruction accuracy and stability. This is a drawback because there are many imagery datasets that contain multiple images for each object. For example, multiview stereo datasets have photos of each object captured at different views. Some datasets instead have multiple pictures of the same view but with different lighting conditions or focal lengths. Facial datasets often have multiple images for each person, and video datasets have a large number of frames covering the same object in each video.

In this paper, we propose a more general framework, called LeMul, that effectively Learns from Multi-image datasets for more flexible and reliable unsupervised training of 3D reconstruction networks. It employs loose shape and texture consistency losses based on component swapping across views. This is an “unsupervised” method since it does not require any 3D ground-truth data in training. Although it exploits multiple images per training instance, these images are too diverse to be combined by traditional approaches into any form of 3D supervision. LeMul can cover the symmetric objects addressed in LeSym by using the original and the flipped image, with less regularized results. More importantly, it handles a wider range of training datasets and object classes.

Besides, we employ an albedo loss in LeMul, which accurately recovers fine details of the 3D shape. This loss is inspired by a well-known Shape-from-Shading (SfS) method [32]. It greatly improves the realism of the reconstructed 3D model, sometimes approaching laser-scan quality, from a low-resolution single input image.

In short, our contributions are: (1) we introduce a general framework, called LeMul, that can exploit multi-image datasets in learning 3D object reconstruction from a single image without the symmetry constraint; (2) we employ shape and texture consistency losses to make that unsupervised learning converge; (3) we apply an albedo loss to improve the realism of the reconstruction results; (4) LeMul shows state-of-the-art performance, qualitatively and quantitatively, on a wide range of datasets.

2 Related Work

In this section, we briefly review the existing image-based 3D reconstruction approaches, from classical to deep-learning-based algorithms.

Multi-view 3D reconstruction. This approach requires multiple images of the target object captured at different viewpoints. It consists of two sub-tasks: Structure-from-Motion (SfM) and Multi-view Stereo (MVS). SfM estimates from the input images the camera matrices and a sparse 3D reconstructed point-cloud [42, 43]. SfM requires robust keypoints extracted from each input view for matching and reconstruction. MVS assumes known camera matrices for a dense 3D reconstruction [17, 16]. These tasks are often combined to form end-to-end systems: SfM provides camera matrix estimation as an input to MVS [51]. These approaches were well-studied in classical literature, and they have been further improved with deep learning [57, 48, 21]. These methods, however, are unfit for our objective of 3D reconstruction from a single image at inference time. Even at training time, they hardly work with our in-the-wild inputs with low image quality, diverse capturing conditions, and freely non-rigid deformation.

Shape from X is another common 3D modeling approach that relies on a specific aspect of the image(s) such as silhouettes [27], focus [12], symmetry [31, 14, 45, 41], and shading [55, 26, 2, 32]. These methods only work under restricted conditions and thus do not apply to in-the-wild data. We focus on the latter two directions since they are applicable to our problem. Shape-from-symmetry assumes the target object is symmetric, thus using the original and flipped image as a stereo pair for 3D reconstruction. Shape-from-shading (SfS) relies on some shading model, normally Phong shading [36] or Spherical Harmonic Lighting [22], and solves an inverse rendering problem to decompose the image’s intrinsic components, including 3D shape, albedo, and illumination. SfS methods often either refine an initial 3D shape [26, 32] or solve an optimization problem with multiple heuristic constraints [2]. We are particularly interested in [32], which employs bilateral-like loss functions to obtain fine details on an initial raw depth map.

Deep-learning-based 3D modeling. Deep learning provides a powerful tool to handle challenging computer vision problems, including 3D reconstruction from a single image. Some studies addressed monocular depth estimation [9, 53, 38, 15] from a single image via supervised learning on ground-truth datasets. Other studies learned a 3D shape representation from 3D datasets, using a generative model such as a GAN or VAE, and fit it to the input image either with or without supervision [5, 20, 58, 28]. These methods, however, require ground-truth data for supervision or 3D shape datasets for prior learning. They are not unsupervised and cannot handle a new object class that has no available 3D data.

Figure 1: Overview of the proposed system. We train a decomposing network to optimize different loss components. Note that we omit the confidence maps in this figure for simplicity. Also, we use diffuse shading images to visualize depth maps.

Category-specific 3D reconstruction. Some studies focus on reconstructing 3D models of a specific object class, such as human faces [3, 40, 59, 46, 39, 44, 47, 13], hands [60, 30, 4, 18], and bodies [34, 24]. The 3D modeling process often relies heavily on well-defined shape priors. For instance, early 3D face modeling studies used simple PCA models learned from facial landmarks, such as the Active Appearance Model (AAM) [6, 7] and the Constrained Local Model (CLM) [8, 1]. Later, statistical models for 3D face shape and albedo learned from 3D face scans, called 3D Morphable Models (3DMMs) [35, 19], were used as an effective prior in 3D face modeling algorithms [3, 40, 59, 46, 39, 44]. Recently, many works have explored other 3D face representations, such as non-linear 3DMMs [47] or GCN-based features [37, 49]. Instead of learning specific models based on the characteristics of each object class, we target a general framework that can extract a 3D shape prior for any class just from in-the-wild images.

LeSym [52] was the first work that could handle the task of 3D modeling from a single image in a general and unsupervised manner. It followed the SfS approach to extract the image’s intrinsic components, including 3D shape, texture, view, and illumination parameters. The network was trained to minimize the reconstruction loss, comparing the rendered image and the input, using a differentiable renderer on a large image set of same-class objects. The optimization problem was under-constrained, so the authors assumed symmetry on the target object and incorporated the flipped image as in Shape-from-Symmetry algorithms. LeSym showed impressive reconstruction results on human faces, cat faces, and synthetic cars. However, the symmetry assumption strongly regularized the estimated 3D models and restricted LeSym’s applications. Also, the reconstructed 3D models are still raw, with many details missing.

3 Learning from Multi-Image Datasets

3.1 Overview

We revise the mechanism used in LeSym to get LeMul as a more general, effective, and accurate unsupervised 3D reconstruction method. Two key ideas in our proposal are: a multi-image based unsupervised training and a novel albedo loss. The system overview is illustrated in Fig. 1.

Unlike LeSym, we do not require the modeling target to be symmetric. Instead, we assume more than one image for each object in the training data. We run the network modules over each image and enforce shape and albedo consistency. Note that having a single image of a symmetric object is a special case of ours; we can simply use the original and flipped input as two images of each training instance, and the 3D model consistency will enforce the object’s symmetry. Moreover, this configuration can account for many other common scenarios such as multi-view, multi-exposure, multi-frame datasets. The multi-image configuration is only needed in training. During inference, the system can output a 3D model from a single input image.
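The symmetric special case mentioned above can be sketched as follows. This is a minimal illustration, assuming images are stored as H x W x C NumPy arrays; the function name `make_symmetric_pair` is hypothetical and not taken from the released code.

```python
import numpy as np

def make_symmetric_pair(image):
    """Build a two-image training instance from one image of a symmetric object.

    Assumption: `image` is an H x W x C array. The second "view" is simply the
    horizontally mirrored input.
    """
    flipped = image[:, ::-1].copy()  # mirror along the width axis
    return image, flipped

# Usage: a tiny 2 x 4 RGB image and its mirrored counterpart.
img = np.arange(24).reshape(2, 4, 3)
original, mirrored = make_symmetric_pair(img)
```

Enforcing consistency between the 3D models estimated from the two returned images then plays the role of the symmetry constraint.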

Consider a training example, and let $\{I_i\}_{i=1}^{M}$ denote the set of $M$ images of an object taken under different conditions. Each image $I_i$ can be decomposed into four components $(d_i, a_i, l_i, v_i)$. The first two components represent the object’s 3D model in a canonical view that is independent of the camera pose, where $d_i$ is the depth map and $a_i$ is the albedo map. The latter two components model the capturing conditions, where $l_i$ is a vector of illumination parameters and $v_i$ is the viewing vector. The image is formed by a shading function $\Lambda$:

$$I_i = \Lambda(d_i, a_i, l_i, v_i) + \eta_i,$$

where $\eta_i$ is the noise term for factors such as background clutter and occlusions. The shading model is a differentiable renderer [25], which uses a perspective projection camera, the Phong shading model, and a Lambertian surface assumption. The illumination parameters include the weighting coefficients for the ambient term and the diffuse term, and the light direction. Other details are described in [52].
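To make the image formation model more concrete, the sketch below shades a depth map with an ambient plus diffuse (Lambertian) term. It is a simplified illustration only: it works on the pixel grid and ignores the perspective camera and the viewpoint transform of the actual differentiable renderer [25]. The function names, the finite-difference normals, and the convention that normals point toward +z are all assumptions made for this example.

```python
import numpy as np

def normals_from_depth(depth):
    """Approximate unit surface normals of an H x W depth map by finite differences."""
    dz_dy, dz_dx = np.gradient(depth)  # derivatives along rows, then columns
    n = np.stack([-dz_dx, -dz_dy, np.ones_like(depth)], axis=-1)
    return n / np.linalg.norm(n, axis=-1, keepdims=True)

def shade(depth, albedo, k_ambient, k_diffuse, light_dir):
    """Render albedo * (ambient + diffuse * max(0, <normal, light>)).

    depth: H x W, albedo: H x W x 3, light_dir: 3-vector (normalized internally).
    """
    light = np.asarray(light_dir, dtype=float)
    light = light / np.linalg.norm(light)
    diffuse = np.clip(normals_from_depth(depth) @ light, 0.0, None)
    shading = k_ambient + k_diffuse * diffuse  # H x W shading map
    return albedo * shading[..., None]

# Usage: a flat surface lit head-on receives the full ambient + diffuse shading.
image = shade(np.ones((4, 4)), np.full((4, 4, 3), 0.5), 0.3, 0.7, [0.0, 0.0, 1.0])
```

Because every step is differentiable, the same computation written with an automatic differentiation library lets gradients of an image loss flow back to the depth, albedo, and lighting estimates.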

Our decomposing network consists of four modules that estimate the four intrinsic components of an input image $I_i$. We denote these modules as $\Phi_d$, $\Phi_a$, $\Phi_l$, and $\Phi_v$, respectively. $\Phi_d$ and $\Phi_a$ translate the input into output maps that have the same spatial resolution, while $\Phi_l$ and $\Phi_v$ are regression networks that output parameter vectors. The outputs of these modules, denoted as $(\hat{d}_i, \hat{a}_i, \hat{l}_i, \hat{v}_i)$, are used to reconstruct the input image with the shading function $\Lambda$:

$$\hat{I}_i = \Lambda(\hat{d}_i, \hat{a}_i, \hat{l}_i, \hat{v}_i).$$

There are two desired criteria: (1) the reconstructed image $\hat{I}_i$ should be similar to the input $I_i$; (2) for any pair of images $I_i$ and $I_j$ coming from the same training sample, the estimated canonical depth and albedo maps $(\hat{d}_i, \hat{a}_i)$ and $(\hat{d}_j, \hat{a}_j)$ should be nearly identical and interchangeable. These criteria can be formulated as two losses, $\mathcal{L}_{rec}$ and $\mathcal{L}_{cross}$, respectively. Furthermore, we employ novel loss functions, called albedo losses, inspired by [32] to further improve the reconstruction of fine details. These losses follow the same-view and cross-view settings, and we denote them as $\mathcal{L}_{al}^{same}$ and $\mathcal{L}_{al}^{cross}$. The total training loss, therefore, will be:


with and being weighting hyper-parameters. We will now discuss each loss component above.

3.2 Reconstruction loss

We inherit this loss from LeSym. It enforces the reconstructed image to be similar to the input. To discount the effect of the noise, another sub-network is used to regress a pair of confidence maps that weigh the pixels in computing the reconstruction loss. The total reconstruction loss is summed over all input views and combines a pixel-wise term with a perceptual term. The perceptual term compares features extracted from a layer of a VGG-16 network pre-trained on ImageNet, and a weighting hyper-parameter balances the two terms.

Assuming Gaussian distributions, the mentioned loss components have the following detailed expressions:


with and as pixel sets in image and feature space.
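As an illustration of a confidence-weighted loss under a Gaussian assumption, the sketch below computes the standard per-pixel Gaussian negative log-likelihood, where a predicted confidence map `sigma` acts as the standard deviation. This is a generic form offered for intuition, not necessarily the exact expression used in the paper; the function name and arguments are hypothetical.

```python
import numpy as np

def gaussian_confidence_loss(pred, target, sigma):
    """Mean Gaussian negative log-likelihood of `target` given `pred`.

    Assumption: `sigma` is a positive confidence (uncertainty) map that
    broadcasts against `pred` and `target`. A large sigma down-weights the
    squared residual at that pixel but pays a log(sigma) penalty.
    """
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    residual = pred - target
    nll = residual ** 2 / (2.0 * sigma ** 2) + np.log(np.sqrt(2.0 * np.pi) * sigma)
    return float(nll.mean())

# Usage: for a pixel with a large residual, a larger sigma gives a lower loss,
# which is how the network can learn to discount background clutter or occlusions.
confident = gaussian_confidence_loss([3.0], [0.0], [1.0])
uncertain = gaussian_confidence_loss([3.0], [0.0], [3.0])
```

The same function can be applied to image pixels or to feature-map entries, matching the image-space and feature-space pixel sets mentioned above.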

3.3 Cross-view consistency loss

Reconstruction loss alone is not enough to constrain the reconstruction outcome. Since we have multiple images per training instance, we can enforce the reconstructed 3D models (the canonical depth and albedo maps) to be consistent via a cross-view consistency loss.

In theory, we can simply minimize the distance between the 3D models estimated from different views, but we found this ineffective in practice: it makes training unstable and hard to converge. Instead, we propose to implement the consistency loss based on a component swapping mechanism. For each pair of views, we can swap the estimated 3D model (depth and albedo) from one view into the other and render a cross-model image with the other view’s lighting and viewpoint:


This image should be almost the same as the input image of the view that supplies the lighting and viewpoint. As with the reconstruction loss, we employ confidence maps in the loss computation. However, these maps depend on both input images of the pair, requiring another confidence network. This cross-view confidence network takes the image pair, stacked along the channel dimension, and returns a pair of confidence maps. The cross-view loss term for this image pair can be computed as follows:

We can compute the cross-view entropy loss for all pairs of views, but this can be computationally expensive if the number of views is large. For computational efficiency, we select the first view as a pivot and use only the pairs that involve the first view:
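The saving from the pivot strategy can be seen by enumerating the view pairs. The sketch below is illustrative: the helper names are hypothetical, views are indexed from 0, and including both swap directions for each pivot pair is an assumption.

```python
from itertools import permutations

def all_ordered_pairs(num_views):
    """Every ordered pair (i, j), i != j: grows quadratically with the view count."""
    return list(permutations(range(num_views), 2))

def pivot_pairs(num_views, pivot=0):
    """Only the pairs that involve the pivot view: grows linearly with the view count."""
    pairs = []
    for k in range(num_views):
        if k != pivot:
            pairs.append((pivot, k))  # swap the 3D model of view k into the pivot view
            pairs.append((k, pivot))  # swap the 3D model of the pivot into view k
    return pairs

# Usage: with 5 views there are 20 ordered pairs but only 8 pivot pairs.
full = all_ordered_pairs(5)
reduced = pivot_pairs(5)
```

Every non-pivot view is still tied to the pivot, so consistency propagates to all views while the number of rendered cross-model images stays linear in the number of views.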


3.4 Albedo losses

Although the 3D reconstructed shapes obtained with the above losses are reasonably accurate already, the 3D shapes tend to be over-smooth with many fine details of the 3D surface being inaccurately transferred to the albedo map. For sharper 3D reconstruction, we apply a regularization on the albedo map to avoid overfitting to pixel intensities. This regularization should guarantee that the albedo is smooth at non-edge pixels while preserving the edges. Following [32], we implement such regularization by albedo loss terms.

An albedo loss requires three aligned inputs: an input image and the corresponding depth and albedo maps. It enforces smoothness on the albedo map by penalizing the albedo difference between each pixel and its neighbors. Each difference is weighted by two terms: an intensity weighting term and a depth weighting term.

The weighting terms suppress the effect of neighbor pixels that likely come from other regions, as indicated by a large gap in intensity or depth compared with the current pixel. We use two parameters to control the allowed intensity and depth discontinuity.
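A bilateral-style albedo smoothness term of this kind can be sketched as follows. This is an illustrative form in the spirit of [32], not the paper's exact expression: the Gaussian shape of the weights, the 4-connected neighborhood, and the names `sigma_c` and `sigma_d` (standing in for the two discontinuity parameters) are assumptions.

```python
import numpy as np

def albedo_smoothness_loss(image, depth, albedo, sigma_c=0.1, sigma_d=0.05):
    """Weighted squared albedo differences between 4-connected neighbors.

    image, albedo: H x W x 3 arrays; depth: H x W array; all pixel-aligned.
    A neighbor pair with a large intensity or depth gap gets a weight near
    zero, so albedo edges that coincide with image or depth edges are kept.
    """
    total, count = 0.0, 0
    for axis in (0, 1):  # vertical neighbors, then horizontal neighbors
        d_img = np.diff(image, axis=axis)
        d_dep = np.diff(depth, axis=axis)
        d_alb = np.diff(albedo, axis=axis)
        w_c = np.exp(-np.sum(d_img ** 2, axis=-1) / (2.0 * sigma_c ** 2))  # intensity weight
        w_d = np.exp(-(d_dep ** 2) / (2.0 * sigma_d ** 2))                 # depth weight
        total += np.sum(w_c * w_d * np.sum(d_alb ** 2, axis=-1))
        count += w_c.size
    return total / count

# Usage: an albedo edge is penalized when the image is flat there,
# but almost free when the image has the same edge.
albedo = np.zeros((4, 4, 3)); albedo[:, 2:] = 1.0
depth = np.zeros((4, 4))
loss_flat_image = albedo_smoothness_loss(np.full((4, 4, 3), 0.5), depth, albedo)
loss_edge_image = albedo_smoothness_loss(albedo.copy(), depth, albedo)
```

Penalizing albedo variation in regions where the image and depth are smooth discourages the network from explaining shading detail with the albedo map, which pushes that detail into the shape instead.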

Note that the three inputs of the albedo loss need to be aligned pixel by pixel. We keep the original input image, which is at an estimated view. Therefore, we cannot use the canonical depth and albedo maps directly, so we transform them to that view with a warping function. This function first computes the 3D shape from the canonical depth map, then projects and renders it at the estimated view. The outputs are the transformed depth and albedo maps:


Similar to the previous loss terms, we compute the albedo loss in same-view and cross-view settings:


4 Experiments

4.1 Experimental setups

4.1.1 Implementation details

We implemented our system in PyTorch. The component networks had the same structure as in the officially released code of LeSym. The cross-view confidence network was similar to the single-view confidence network, except for having six input channels instead of three. In all experiments, we used the same input image size . The hyper-parameters were set as , , and . The networks were jointly trained with the Adam optimizer at a fixed learning rate until convergence.

4.1.2 Datasets

To evaluate the proposed algorithm, we run experiments on datasets with various capturing settings and data structures (single-view, multi-view, image-collection, or video):

BFM is a synthetic dataset of 200K human face images proposed by LeSym. Each image is rendered with a 3D shape and texture randomly sampled from the Basel Face Model [35], a random view, and one of the spherical harmonics lights estimated from CelebA images [29]. Besides RGB images, the ground-truth 3D depth-maps are also provided. We use this dataset to quantitatively evaluate our approach as well as comparing it with other baselines.

CelebA [29] is a popular facial dataset of more than 200K celebrity images. The images were captured under in-the-wild conditions. It is split into three subsets for training, validation, and testing with 162K, 20K, and 20K images, respectively. We use this dataset to compare LeMul and LeSym under the “single-view” and “symmetric-objects” settings. We generate two image inputs for each training instance, including the original and the flipped image.

Cat Faces is a dataset of 11.2K images capturing cat faces in-the-wild. This dataset was constructed in LeSym by combining two previous datasets [56, 33]. This set is split into 8930 training and 2256 testing instances. This dataset is also under the “single-view” and “symmetric-objects” settings, and its two-view data is formed similar to CelebA.

Multi-PIE [23] is a large human-face dataset captured in studio settings. It contains more than 750K images of 337 people, each involved in one to four recording sessions. In each session, each subject has a collection of images captured at 15 viewpoints, under 19 illuminations, and with several expressions. We excluded images with extreme lighting or exaggerated expressions and selected those at three viewing angles corresponding to frontal, -to-the-left, and -to-the-right views, to form a multi-view image set. Each training instance is a set of three images of one person, captured at the selected views. We use random illumination, which makes the three input views drastically different and unusable for traditional multi-view stereo methods.

CASIA-WebFace [54] has 500K face images of 10K people collected from the Internet. Each person has on average 50 in-the-wild images with drastically different conditions. We keep the last 200 subjects for testing and use the rest for training. In each training epoch, for each subject, we randomly select images of that person regardless of pose, expression, and illumination to form a training example.

YouTube Faces (YTF) [50] is a video dataset that consists of 3425 videos of 1595 people. The videos have low-quality frames, which were severely degraded by video compression. Many videos are also bad for 3D face modeling, with the target faces at non-frontal views and barely moving. Still, we aim to evaluate our method on such extreme conditions. For each video, we extract the frames and crop them around the target faces. We split the videos for training (3299) and testing (126). Similar to CASIA, in each training epoch and with each video, we randomly select frames to form a training instance.

Quantitative Metrics. For a fair comparison, we use the same metrics as LeSym. The first metric is the Scale-Invariant Depth Error (SIDE) [10], which computes the standard deviation of the difference between the estimated depth map at the input view and the ground-truth depth map in log scale. However, we argue that this metric is not a strong indicator of reconstruction quality. A reasonable error in the estimated object distance, while not affecting the projected image, can cause SIDE to vary. Moreover, SIDE is ineffective at evaluating the quality of the reconstructed surface: smoothing the depth map or adding small random noise to it causes only a minimal change in the SIDE value.

Instead, we focus on the second metric, the Mean Angle Deviation (MAD) [52] between the normal maps computed from the estimated depth map and the ground-truth depth map. It measures how well the surface is captured and is sensitive to surface noise.
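The two metrics can be sketched as below, following the descriptions above. This is a minimal illustration rather than the official evaluation code: normals are approximated with finite differences on the pixel grid, no foreground mask is applied, and the function names are hypothetical.

```python
import numpy as np

def side(depth_pred, depth_gt):
    """Scale-invariant depth error: standard deviation of the log-depth difference."""
    delta = np.log(depth_pred) - np.log(depth_gt)
    variance = np.mean(delta ** 2) - np.mean(delta) ** 2
    return float(np.sqrt(np.maximum(variance, 0.0)))

def normals_from_depth(depth):
    """Approximate unit normals of an H x W depth map by finite differences."""
    dz_dy, dz_dx = np.gradient(depth)
    n = np.stack([-dz_dx, -dz_dy, np.ones_like(depth)], axis=-1)
    return n / np.linalg.norm(n, axis=-1, keepdims=True)

def mad_degrees(depth_pred, depth_gt):
    """Mean angle deviation (degrees) between normals of the two depth maps."""
    cos = np.sum(normals_from_depth(depth_pred) * normals_from_depth(depth_gt), axis=-1)
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))).mean())

# Usage: shifting the whole surface away from the camera changes SIDE
# but leaves the normals, and hence MAD, unchanged.
gt = np.tile(np.arange(1.0, 6.0), (5, 1))  # a ramp with depth values 1..5
shifted = gt + 0.5
side_shifted, mad_shifted = side(shifted, gt), mad_degrees(shifted, gt)
```

In this toy example a pure distance offset yields a non-zero SIDE and a zero MAD, which illustrates why the two metrics can disagree about which reconstruction is better.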

No | Baseline | SIDE (×10⁻²) | MAD (deg.)
(1) | Supervised | |
(2) | Const. null depth | |
(3) | Average G.T. depth | |
(4) | LeSym | 0.793 ± 0.140 |
(5) | LeMul (proposed) | | 15.49 ± 1.50
Table 1: BFM results comparison with baselines.

4.2 Quantitative experiments

In this section, we perform quantitative evaluations on the BFM dataset with provided ground-truth data.

BFM results. We trained and tested our algorithm on the BFM dataset, and the results are reported in Table 1, along with some baselines: (1) supervised 3D reconstruction network as the upper bound, (2) a dummy network returning a constant null depth, (3) a dummy network producing a constant mean depth computed over the ground-truth one, and (4) LeSym. As can be seen, LeMul outperforms the dummy networks by a wide margin. Compared with LeSym, it achieves a better MAD number with decrease, implying better reconstructed 3D surfaces with details recovered.

As for SIDE, we examine the error maps and find that LeMul provides a better overall depth estimation. However, the outliers, particularly on the face boundary or outer components (ears, neck), are more unstable and skew the average score. If we compute the SIDE metric over the facial region bounded by Dlib’s 68 landmarks, LeMul has a lower error (0.00534) compared with LeSym (0.00564), confirming this observation. Fig. 2(a) shows a common scenario in which LeMul provides a lower error on most facial areas but higher errors on the boundary and an ear.

Figure 2: Qualitative analyses. (a) LeMul vs. LeSym (SIDE) and (b) Texture refinement.

Ablation studies. We run ablation experiments to evaluate each proposed component’s contribution to our result on the BFM dataset. From LeSym as the baseline, we can modify it to follow our multi-view scheme or integrate the albedo losses. As reported in Table 2, each of our proposals positively affects the MAD numbers. We achieve the best reconstructed 3D surfaces when combining both techniques.

No | Method | SIDE (×10⁻²) | MAD (deg.)
(1) | Baseline [52] | |
(2) | + multi-view | 0.728 ± 0.135 |
(3) | + albedo loss | |
(4) | + mul+al (full) | | 15.49 ± 1.50
Table 2: Ablation studies on the BFM dataset.

4.3 Qualitative experiments

We qualitatively compare our method to the LeSym baseline on the mentioned datasets in the teaser figure and Fig. 3. In all experiments, LeSym uses each single image as a training instance and applies the symmetry constraint. Our method also assumes the symmetry property on BFM, CelebA, and Cat Faces by using pairs of original and flipped images as training instances. However, on Multi-PIE, CASIA-WebFace, and YouTube Faces, we completely drop that assumption and use multi-image examples in training.

Three symmetry-assumed datasets. On BFM, CelebA, and Cat Faces, both LeSym and LeMul can reconstruct reasonable 3D models. However, thanks to the albedo loss, LeMul can recover more 3D details such as human hair, beards, and cat fur. The 3D models are well recognizable even without texture.

Multi-PIE results. LeSym completely collapsed, perhaps due to the limited number of poses and the asymmetric lighting. LeMul, in contrast, performed well on this data configuration and produced high-quality 3D models.

CASIA-WebFace results. LeMul learns the 3D face structure well. This is impressive since the images used in each training example are wildly different; they are challenging even for humans to correlate, as illustrated in the teaser figure (second row). In both the teaser figure and Fig. 3, LeMul captures asymmetric details such as a one-sided hairstyle and a lopsided smile. In contrast, LeSym over-regularized the 3D shapes with the symmetry constraint, producing incorrect 3D shapes.

YouTube Faces results. This dataset is challenging for training due to the low-quality images and the limited variation between frames in each video. Still, LeMul manages to converge and produce reasonable results at test time. When the input image is not too blurry, LeMul reconstructs a 3D model with more details than LeSym, and it does not suffer from the symmetry assumption.

4.4 User surveys

We further compared our method with the baseline via user surveys. We skipped this test on BFM, which was already used in quantitative evaluations, and Multi-PIE, in which LeSym completely failed. For each remaining dataset, we created a survey with testing images randomly sampled from the respective test set. We generated two 3D models, estimated by LeSym and LeMul, for each image and produced corresponding videos to illustrate these models in various viewing angles. Each tester was asked to pick which model was better. At least 40 people took each survey, leading to at least 1200 answers per dataset.

We report the rate at which each method is selected in Table 3. LeMul outperforms LeSym on the CelebA and CASIA datasets by a wide margin, showing that LeMul can recover better 3D models in good training conditions. Notably, it was selected about 79% of the time on CASIA, demonstrating the advantage of the multi-image setting over the symmetry constraint. It also beats LeSym on the YTF and Cat Faces datasets, but with smaller gaps. We found many YTF frames too blurry, making both 3D models smooth and hard to compare. The small gap of LeMul over LeSym came from the clear frames, which is still meaningful. Finally, on Cat Faces, while our models are more detailed, some testers preferred the smooth 3D models from LeSym, decreasing our selection rate. This phenomenon suggests that having many details is not always preferable, which points to a direction for future research to improve our method.

Method CelebA CatFaces CASIA YTF
LeSym [52] 36.01 47.03 20.86 45.23
Ours 63.99 52.97 79.14 54.77
Table 3: User survey results. For each dataset, we report the rate (%) that each method is selected by the tester for providing a better 3D model.
Figure 3: Comparing the reconstructed 3D models from the baseline method LeSym [52] and ours. The datasets from top to bottom: BFM [52], CelebA [29], Cat Faces [52], Multi-PIE [23], CASIA-WebFace [54], and YouTube Faces [50]. For each 3D model, we provide two textureless views, two textured views, and the canonical normal map.
Figure 4: Reconstructed 3D models from in-the-wild images. We compare the baseline model LeSym [52] trained on CelebA dataset [29], and our method trained on CASIA-WebFace [54] dataset. For each 3D model, we provide two textureless views, two textured views, and the canonical normal map.

4.5 In-the-wild tests

Finally, we run an evaluation on in-the-wild facial images collected from the Internet. We select the LeMul model trained on the CASIA dataset since it can capture even asymmetric details. Among the LeSym models for human faces, the released model trained on the CelebA dataset shows the best reconstruction quality. We compare these models on some in-the-wild images in Fig. 4. The 3D shapes generated by LeSym are often distorted by the symmetry regularization. Our results, instead, look more natural and detailed. In particular, LeMul can create a realistic-looking 3D model from a cartoon drawing (the fourth row).

4.6 Texture refinement

We observe that the texture regressed by models trained on the CASIA and YTF datasets is a bit blurry, possibly for two reasons. First, these datasets have lower image quality compared to CelebA; many images have blur, noise, or JPEG artifacts. Second, the models have to learn the subject’s albedo from vastly different inputs, causing blurry texture. We propose a simple solution to fix the second issue. After getting a trained model, we can finetune , , and while freezing the other modules for a few epochs on single-image inputs of the same training set. As shown in Fig. 2(b), the estimated texture is significantly improved. Note that this refinement preserves the high-quality 3D shape evaluated in the previous experiments.

5 Discussions

In this paper, we present a novel system that shows the state-of-the-art 3D modeling quality in unsupervised learning for single-view 3D object reconstruction. The key insights are to exploit multi-image datasets in training and to employ albedo losses for improved detailed reconstruction.

Our method works with a variety of training datasets, ranging from single- and multi-view datasets to image collections and video data. However, a current limitation is that images of the target object must be compatible with the depth-map representation, i.e., primarily frontal views without self-occlusion. We plan to address this limitation in future work to increase the applicability of our method.


  • [1] T. Baltrušaitis, P. Robinson, and L. Morency (2012) 3D constrained local model for rigid and non-rigid facial tracking. In cvpr, Cited by: §2.
  • [2] J. T. Barron and J. Malik (2014) Shape, illumination, and reflectance from shading. tpami, pp. 1670–1687. Cited by: §2.
  • [3] V. Blanz and T. Vetter (1999) Morphable model for the synthesis of 3D faces. In siggraph, Cited by: §1, §2.
  • [4] A. Boukhayma, R. d. Bem, and P. H. Torr (2019) 3d hand shape and pose from images in the wild. In cvpr, Cited by: §1, §2.
  • [5] C. B. Choy, D. Xu, J. Gwak, K. Chen, and S. Savarese (2016) 3d-r2n2: a unified approach for single and multi-view 3d object reconstruction. In eccv, Cited by: §2.
  • [6] T. F. Cootes, G. J. Edwards, and C. J. Taylor (1998) Active appearance models. In eccv, Cited by: §2.
  • [7] T. F. Cootes, G. J. Edwards, and C. J. Taylor (2001) Active appearance models. tpami, pp. 681–685. Cited by: §2.
  • [8] D. Cristinacce and T. Cootes (2008) Automatic feature localisation with constrained local models. Pattern Recognition, pp. 3054–3067. Cited by: §2.
  • [9] D. Eigen and R. Fergus (2015) Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture. In iccv, Cited by: §2.
  • [10] D. Eigen, C. Puhrsch, and R. Fergus (2014) Depth map prediction from a single image using a multi-scale deep network. In nips, Z. Ghahramani, M. Welling, C. Cortes, N. Lawrence, and K. Q. Weinberger (Eds.), Cited by: §4.1.2.
  • [11] O. Faugeras, Q. Luong, and T. Papadopoulo (2001) The geometry of multiple images: the laws that govern the formation of multiple images of a scene and some of their applications. Cited by: §1.
  • [12] P. Favaro and S. Soatto (2005) A geometric approach to shape from defocus. tpami, pp. 406–417. Cited by: §2.
  • [13] Y. Feng, F. Wu, X. Shao, Y. Wang, and X. Zhou (2018) Joint 3d face reconstruction and dense alignment with position map regression network. In eccv, Cited by: §1, §2.
  • [14] A. R. François, G. G. Medioni, and R. Waupotitsch (2003) Mirror symmetry 2-view stereo geometry. Image and Vision Computing, pp. 137–143. Cited by: §2.
  • [15] H. Fu, M. Gong, C. Wang, K. Batmanghelich, and D. Tao (2018) Deep ordinal regression network for monocular depth estimation. In cvpr, Cited by: §2.
  • [16] Y. Furukawa, B. Curless, S. M. Seitz, and R. Szeliski (2010) Towards internet-scale multi-view stereo. In cvpr, Cited by: §1, §2.
  • [17] Y. Furukawa and J. Ponce (2007) Accurate, dense, and robust multi-view stereopsis (pmvs). In cvpr, Cited by: §1, §2.
  • [18] L. Ge, Z. Ren, Y. Li, Z. Xue, Y. Wang, J. Cai, and J. Yuan (2019) 3d hand shape and pose estimation from a single rgb image. In cvpr, Cited by: §1, §2.
  • [19] T. Gerig, A. Morel-Forster, C. Blumer, B. Egger, M. Luthi, S. Schönborn, and T. Vetter (2018) Morphable face models-an open framework. In fg, Cited by: §2.
  • [20] R. Girdhar, D. F. Fouhey, M. Rodriguez, and A. Gupta (2016) Learning a predictable and generative vector representation for objects. In eccv, Cited by: §2.
  • [21] C. Godard, O. Mac Aodha, and G. J. Brostow (2017) Unsupervised monocular depth estimation with left-right consistency. In cvpr, Cited by: §1, §2.
  • [22] R. Green (2003) Spherical harmonic lighting: the gritty details. Cited by: §2.
  • [23] R. Gross, I. Matthews, J. Cohn, T. Kanade, and S. Baker (2008) Multi-pie. In fg, Cited by: Figure 3, §4.1.2.
  • [24] W. Jiang, N. Kolotouros, G. Pavlakos, X. Zhou, and K. Daniilidis (2020) Coherent reconstruction of multiple humans from a single image. In cvpr, Cited by: §1, §2.
  • [25] H. Kato, Y. Ushiku, and T. Harada Neural 3d mesh renderer. In cvpr, Cited by: §3.1.
  • [26] I. Kemelmacher-Shlizerman and R. Basri (2010) 3D face reconstruction from a single image using a single reference face shape. tpami, pp. 394–405. Cited by: §2.
  • [27] J. J. Koenderink (1984) What does the occluding contour tell us about solid shape?. Perception, pp. 321–330. Cited by: §2.
  • [28] A. Kundu, Y. Li, and J. M. Rehg (2018) 3d-rcnn: instance-level 3d object reconstruction via render-and-compare. In cvpr, Cited by: §2.
  • [29] Z. Liu, P. Luo, X. Wang, and X. Tang (2015) Deep learning face attributes in the wild. In iccv, Cited by: Figure 3, Figure 4, §4.1.2, §4.1.2.
  • [30] F. Mueller, F. Bernard, O. Sotnychenko, D. Mehta, S. Sridhar, D. Casas, and C. Theobalt (2018) Ganerated hands for real-time 3d hand tracking from monocular rgb. In cvpr, Cited by: §1, §2.
  • [31] D. P. Mukherjee, A. P. Zisserman, M. Brady, and F. Smith (1995) Shape from symmetry: detecting and exploiting symmetry in affine images. Philosophical Transactions of the Royal Society of London. Series A: Physical and Engineering Sciences, pp. 77–106. Cited by: §2.
  • [32] R. Or-El, G. Rosman, A. Wetzler, R. Kimmel, and A. M. Bruckstein (2015) Rgbd-fusion: real-time high precision depth recovery. In cvpr, Cited by: §1, §2, §3.1, §3.4.
  • [33] O. M. Parkhi, A. Vedaldi, A. Zisserman, and C. Jawahar (2012) Cats and dogs. In cvpr, Cited by: §4.1.2.
  • [34] G. Pavlakos, V. Choutas, N. Ghorbani, T. Bolkart, A. A. Osman, D. Tzionas, and M. J. Black (2019) Expressive body capture: 3d hands, face, and body from a single image. In cvpr, Cited by: §1, §2.
  • [35] P. Paysan, R. Knothe, B. Amberg, S. Romdhani, and T. Vetter (2009) A 3d face model for pose and illumination invariant face recognition. In avss, Cited by: §2, §4.1.2.
  • [36] B. T. Phong (1975) Illumination for computer generated pictures. Communications of the ACM, pp. 311–317. Cited by: §2.
  • [37] A. Ranjan, T. Bolkart, S. Sanyal, and M. J. Black (2018) Generating 3d faces using convolutional mesh autoencoders. In eccv, Cited by: §2.
  • [38] E. Ricci, W. Ouyang, X. Wang, N. Sebe, et al. (2018) Monocular depth estimation using multi-scale continuous crfs as sequential deep networks. tpami, pp. 1426–1440. Cited by: §2.
  • [39] E. Richardson, M. Sela, and R. Kimmel (2016) 3D face reconstruction by learning from synthetic data. In threedv, Cited by: §1, §2.
  • [40] S. Romdhani and T. Vetter (2003) Efficient, robust and accurate fitting of a 3D morphable model. In iccv, Cited by: §1, §2.
  • [41] S. N. Sinha, K. Ramnath, and R. Szeliski (2012) Detecting and reconstructing 3d mirror symmetric objects. In eccv, Cited by: §2.
  • [42] N. Snavely, S. M. Seitz, and R. Szeliski (2006) Photo tourism: exploring photo collections in 3d. In siggraph, Cited by: §1, §2.
  • [43] N. Snavely, S. M. Seitz, and R. Szeliski (2008) Modeling the world from internet photo collections. ijcv, pp. 189–210. Cited by: §1, §2.
  • [44] A. Tewari, M. Zollhofer, H. Kim, P. Garrido, F. Bernard, P. Perez, and C. Theobalt (2017) MoFA: model-based deep convolutional face autoencoder for unsupervised monocular reconstruction. In iccv, Cited by: §1, §2.
  • [45] S. Thrun and B. Wegbreit (2005) Shape from symmetry. In iccv, Cited by: §2.
  • [46] A. Tran, T. Hassner, I. Masi, and G. Medioni (2017) Regressing robust and discriminative 3D morphable models with a very deep neural network. In cvpr, Cited by: §1, §2.
  • [47] L. Tran and X. Liu (2018) Nonlinear 3d face morphable model. In cvpr, Cited by: §1, §2.
  • [48] B. Ummenhofer, H. Zhou, J. Uhrig, N. Mayer, E. Ilg, A. Dosovitskiy, and T. Brox (2017) Demon: depth and motion network for learning monocular stereo. In cvpr, Cited by: §1, §2.
  • [49] H. Wei, S. Liang, and Y. Wei (2019) 3d dense face alignment via graph convolution networks. arXiv preprint arXiv:1904.05562. Cited by: §2.
  • [50] L. Wolf, T. Hassner, and I. Maoz (2011) Face recognition in unconstrained videos with matched background similarity. In cvpr, Cited by: Figure 3, §4.1.2.
  • [51] C. Wu et al. VisualSFM: a visual structure from motion system. Cited by: §2.
  • [52] S. Wu, C. Rupprecht, and A. Vedaldi (2020) Unsupervised learning of probably symmetric deformable 3d objects from images in the wild. In cvpr, Cited by: §1, §2, §3.1, Figure 3, Figure 4, §4.1.2, Table 2, Table 3.
  • [53] D. Xu, W. Wang, H. Tang, H. Liu, N. Sebe, and E. Ricci (2018) Structured attention guided convolutional neural fields for monocular depth estimation. In cvpr, Cited by: §2.
  • [54] D. Yi, Z. Lei, S. Liao, and S. Z. Li (2014) Learning face representation from scratch. arXiv preprint arXiv:1411.7923. Cited by: Figure 3, Figure 4, §4.1.2.
  • [55] R. Zhang, P. Tsai, J. E. Cryer, and M. Shah (1999) Shape-from-shading: a survey. tpami, pp. 690–706. Cited by: §2.
  • [56] W. Zhang, J. Sun, and X. Tang (2008) Cat head detection-how to effectively exploit shape and texture features. In eccv, Cited by: §4.1.2.
  • [57] T. Zhou, M. Brown, N. Snavely, and D. G. Lowe (2017) Unsupervised learning of depth and ego-motion from video. In cvpr, Cited by: §1, §2.
  • [58] R. Zhu, H. Kiani Galoogahi, C. Wang, and S. Lucey (2017) Rethinking reprojection: closing the loop for pose-aware shape reconstruction from a single image. In iccv, Cited by: §2.
  • [59] X. Zhu, Z. Lei, X. Liu, H. Shi, and S. Z. Li. (2016) Face alignment across large poses: a 3D solution. In cvpr, Cited by: §1, §2.
  • [60] C. Zimmermann and T. Brox (2017) Learning to estimate 3d hand pose from single rgb images. In iccv, Cited by: §1, §2.