List of projects for 3d reconstruction
3D reconstruction is a long-standing ill-posed problem, which has been explored for decades by the computer vision, computer graphics, and machine learning communities. Since 2015, image-based 3D reconstruction using convolutional neural networks (CNN) has attracted increasing interest and demonstrated impressive performance. Given this new era of rapid evolution, this article provides a comprehensive survey of the recent developments in this field. We focus on the works which use deep learning techniques to estimate the 3D shape of generic objects either from a single or multiple RGB images. We organize the literature based on the shape representations, the network architectures, and the training mechanisms they use. While this survey is intended for methods which reconstruct generic objects, we also review some of the recent works which focus on specific object classes such as human body shapes and faces. We provide an analysis and comparison of the performance of some key papers, summarize some of the open problems in this field, and discuss promising directions for future research.
The goal of image-based 3D reconstruction is to infer the 3D geometry and structure of objects and scenes from one or multiple 2D images. This long-standing, ill-posed problem is fundamental to many applications such as robot navigation, object recognition and scene understanding, 3D modeling and animation, industrial control, and medical diagnosis.
Recovering the lost dimension from just 2D images has been the goal of classic multiview stereo and shape-from-X methods, which have been extensively investigated for many decades. The first generation of methods approached the problem from the geometric perspective; they focused on understanding and formalizing, mathematically, the 3D to 2D projection process, with the aim of devising mathematical or algorithmic solutions to the ill-posed inverse problem. Effective solutions typically require multiple images, captured using accurately calibrated cameras. Stereo-based techniques, for example, require matching features across images captured from slightly different viewing angles, and then use the triangulation principle to recover the 3D coordinates of the image pixels. Shape-from-silhouette, or shape-by-space-carving, methods require accurately segmented 2D silhouettes. These methods, which have led to reasonable quality 3D reconstructions, require multiple images of the same object captured by well-calibrated cameras. This, however, may not be practical or feasible in many situations.
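To make the triangulation principle concrete, the following sketch recovers a 3D point from two calibrated views using the classic direct linear transform (DLT). The camera matrices and the point are made up for illustration, not taken from any surveyed method.

```python
import numpy as np

def triangulate(P1, P2, x1, x2):
    """Linear (DLT) triangulation: recover a 3D point from its projections
    x1, x2 in two views with 3x4 camera matrices P1, P2."""
    A = np.stack([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    _, _, Vt = np.linalg.svd(A)   # null vector of A is the homogeneous point
    X = Vt[-1]
    return X[:3] / X[3]           # dehomogenize

# Two illustrative cameras: identity pose, and a 1-unit baseline along x.
P1 = np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = np.hstack([np.eye(3), np.array([[-1.0], [0.0], [0.0]])])
X_true = np.array([0.5, 0.2, 4.0, 1.0])
x1 = (P1 @ X_true)[:2] / (P1 @ X_true)[2]
x2 = (P2 @ X_true)[:2] / (P2 @ X_true)[2]
print(triangulate(P1, P2, x1, x2))  # ≈ [0.5, 0.2, 4.0]
```

With noiseless projections the DLT recovers the point exactly; with noisy matches it gives the least-squares solution, which is why calibration accuracy matters for these methods.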
Interestingly, humans are good at solving such ill-posed inverse problems by leveraging prior knowledge. They can easily infer the approximate size and rough geometry of objects using only one eye. They can even guess what an object would look like from another viewpoint. We can do this because all the previously seen objects and scenes have enabled us to build prior knowledge and develop mental models of what objects look like. The second generation of 3D reconstruction methods tried to leverage this prior knowledge by formulating the 3D reconstruction problem as a recognition problem. The advent of deep learning techniques, and more importantly, the increasing availability of large training data sets, have led to a new generation of methods that are able to recover the 3D geometry and structure of objects from one or multiple RGB images without the complex camera calibration process. Despite being recent, these methods have demonstrated exciting and promising results on various tasks related to computer vision and graphics.
In this article, we provide a comprehensive and structured review of the recent advances in 3D object reconstruction using deep learning techniques. We first focus on generic shapes and then discuss specific cases, such as human body shapes and faces. We have gathered more than papers, which appeared since in leading computer vision, computer graphics, and machine learning conferences and journals. (This continuously increasing number, even at the time we are finalizing this article, does not include CVPR 2019 papers.) The goal is to help the reader navigate this emerging field, which has gained significant momentum in the past few years. Compared to the existing literature, the main contributions of this article are as follows:
To the best of our knowledge, this is the first survey paper in the literature which focuses on image-based 3D object reconstruction using deep learning.
We cover the contemporary literature with respect to this area. We present a comprehensive review of methods, which appeared since .
This survey also provides a comprehensive review and an insightful analysis on all aspects of 3D reconstruction using deep learning, including the training data, the choice of network architectures and their effect on the 3D reconstruction results, the training strategies, and the application scenarios.
We provide a comparative summary of the properties and performance of the reviewed methods for generic 3D object reconstruction. We cover algorithms for generic 3D object reconstruction, methods related to 3D face reconstruction, and methods for 3D human body shape reconstruction.
We provide a comparative summary of the methods in a tabular form.
The rest of this article is organized as follows: Section 2 formulates the problem and lays down the taxonomy. Section 3 reviews the latent spaces and the input encoding mechanisms. Section 4 surveys the volumetric reconstruction techniques, while Section 5 focuses on surface-based techniques. Section 6 shows how some of the state-of-the-art techniques use additional cues to boost the performance of 3D reconstruction. Section 7 discusses the training procedures. Section 8 focuses on specific objects such as human body shapes and faces. Section 9 compares and discusses the performance of some key methods. Finally, Section 10 discusses potential future research directions, while Section 11 concludes the paper with some important remarks.
Let $\mathcal{I} = \{I_k, k = 1, \dots, n\}$ be a set of $n \geq 1$ RGB images of one or multiple objects $X$. 3D reconstruction can be summarized as the process of learning a predictor $f_\theta$ that can infer a shape $\hat{X}$ that is as close as possible to the unknown shape $X$. In other words, the function $f_\theta$ is the minimizer of a reconstruction objective $\mathcal{L}(\mathcal{I}) = d(f_\theta(\mathcal{I}), X)$. Here, $\theta$ is the set of parameters of $f$, and $d(\cdot, \cdot)$ is a certain measure of distance between the target shape $X$ and the reconstructed shape $f_\theta(\mathcal{I})$. The reconstruction objective is also known as the loss function in the deep learning literature.
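As a toy illustration of this objective (not any specific surveyed method), the snippet below instantiates the predictor, the shape, and the distance with made-up stand-ins: shapes are occupancy grids and the distance is the mean binary cross-entropy, a common choice for voxel outputs.

```python
import numpy as np

def reconstruction_loss(f, images, X_true, d):
    return d(f(images), X_true)          # L(I) = d(f(I), X)

def bce(pred, target, eps=1e-7):
    """Mean binary cross-entropy between predicted and true occupancy."""
    pred = np.clip(pred, eps, 1 - eps)
    return float(-np.mean(target * np.log(pred)
                          + (1 - target) * np.log(1 - pred)))

f = lambda imgs: np.full((4, 4, 4), 0.9)  # dummy predictor output
X = np.ones((4, 4, 4))                    # ground-truth occupancy grid
loss = reconstruction_loss(f, None, X, bce)
print(round(loss, 4))  # 0.1054  (= -log 0.9)
```

Training then amounts to minimizing this loss over the parameters of the predictor; the surveyed methods differ mainly in what plays the role of the shape representation and the distance.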
This survey discusses and categorizes the state-of-the-art based on the nature of the input $\mathcal{I}$, the representation of the output, the deep neural network architectures used to approximate the predictor $f_\theta$, the training procedures they use, and their degree of supervision; see Table I for a visual summary. In particular, the input $\mathcal{I}$ can be (1) a single image, (2) multiple images captured using RGB cameras whose intrinsic and extrinsic parameters can be known or unknown, or (3) a video stream, i.e., a sequence of images with temporal correlation. The first case is very challenging because of the ambiguities in the 3D reconstruction. When the input is a video stream, one can exploit the temporal correlation to facilitate the 3D reconstruction while ensuring that the reconstruction is smooth and consistent across all the frames of the video stream. Also, the input can depict one or multiple 3D objects belonging to known or unknown shape categories. It can also include additional information such as silhouettes, segmentation masks, and semantic labels as priors to guide the reconstruction.
The representation of the output is crucial to the choice of the network architecture. It also has an impact on the computational efficiency and the quality of the reconstruction. In particular,
Volumetric representations, which have been extensively adopted in early deep learning-based 3D reconstruction techniques, allow the parametrization of 3D shapes using regular voxel grids. As such, 2D convolutions used in image analysis can be easily extended to 3D by using 3D convolutions. They are, however, very expensive in terms of memory requirements, and only a few techniques can achieve sub-voxel accuracy.
Surface-based representations: Other papers explored surface-based representations such as meshes and point clouds. While memory-efficient, such representations are not regular structures and thus, they do not easily fit into deep learning architectures.
Intermediation: While some 3D reconstruction algorithms predict the 3D geometry of an object from RGB images directly, others decompose the problem into sequential steps, where each step predicts an intermediate representation.
Table I: Taxonomy of deep learning-based 3D reconstruction methods.

| Aspect | Sub-aspect | Options |
|---|---|---|
| Input | Training | 1 vs. multiple RGB images, 3D ground truth; one vs. multiple objects; uniform vs. cluttered background |
| | Testing | 1 vs. multiple RGB images |
| Output | Volumetric | High vs. low resolution |
| | Surface | Parameterization, template deformation |
| | | Direct vs. intermediating |
| Network architecture | Training | Encoder-Decoder, GAN, 3D-GAN-VAE, 3D Variational Auto-Encoder (3D-VAE) |
| Training | Degree of supervision | 2D vs. 3D supervision; joint 2D-3D embedding; joint training with other tasks |
A variety of network architectures have been utilized to implement and train the predictor $f_\theta$. The backbone architecture is composed of an encoder $h$ followed by a decoder $g$, i.e., $f_\theta = g \circ h$. The encoder maps the input into a latent variable $\mathbf{x}$, referred to as a feature vector or a code, using a sequence of convolutions and pooling operations, followed by fully connected layers of neurons. The decoder, also called the generator, decodes the feature vector into the desired output by using either fully connected layers or a deconvolution network (a sequence of convolution and upsampling operations, also referred to as upconvolutions). The former is suitable for unstructured output, e.g., 3D point clouds, while the latter is used to reconstruct volumetric grids or parametrized surfaces. Since the introduction of this vanilla architecture, several extensions have been proposed by varying the architecture (e.g., ConvNet vs. ResNet, Convolutional Neural Networks (CNN) vs. Generative Adversarial Networks (GAN), CNN vs. Variational Auto-Encoders, and 2D vs. 3D convolutions), and by cascading multiple blocks, each one achieving a specific task.
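The data flow of this vanilla encoder-decoder can be sketched at the level of array shapes; the subsampling and upsampling operations below are crude stand-ins for the learned convolutions, and the layer sizes are illustrative only.

```python
import numpy as np

def encode(image):                      # h: image -> latent code x
    x = image
    for _ in range(4):                  # each conv + pool halves the resolution
        x = x[::2, ::2]                 # stand-in for conv/pool
    return x.reshape(-1)                # stand-in for the FC layers -> code

def decode(code, out_res=32):           # g: code -> voxel grid
    v = code[:8].reshape(2, 2, 2)       # seed a low-resolution volume
    while v.shape[0] < out_res:         # stand-in for 3D upconvolutions
        v = v.repeat(2, 0).repeat(2, 1).repeat(2, 2)
    return 1 / (1 + np.exp(-v))         # occupancy probabilities in (0, 1)

img = np.random.rand(128, 128)
voxels = decode(encode(img))
print(voxels.shape)  # (32, 32, 32)
```

The extensions mentioned above keep this overall shape of the computation and swap in different encoders, decoders, or training objectives.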
While the architecture of the network and its building blocks are important, the performance depends highly on the way it is trained. In this survey, we will look at:
Datasets: Various datasets are currently available for training and evaluating deep learning-based 3D reconstruction. Some of them use real data, while others are computer-generated (CG).
Loss function: The choice of the loss function can significantly impact the reconstruction quality. It also defines the degree of supervision.
Training procedure and degree of supervision: Some methods require real images annotated with their corresponding 3D models, which are very expensive to obtain. Other methods rely on a combination of real and synthetic data. Others completely avoid 3D supervision by using loss functions that exploit supervisory signals that are easy to obtain.
The following sections review these aspects in detail.
Deep learning-based 3D reconstruction algorithms encode the input $\mathcal{I}$ into a feature vector $\mathbf{x} = h(\mathcal{I}) \in \mathcal{X}$, where $\mathcal{X}$ is referred to as the latent space. A good mapping function $h$ should satisfy the following properties:
Two inputs $\mathcal{I}_1$ and $\mathcal{I}_2$ that represent similar 3D objects should be mapped into $\mathbf{x}_1$ and $\mathbf{x}_2$ that are close to each other in the latent space.
A small perturbation of x should correspond to a small perturbation of the shape of the input.
The latent representation induced by $h$ should be invariant to extrinsic factors such as camera pose.
A 3D model and its corresponding 2D images should be mapped onto the same point in the latent space. This will ensure that the representation is not ambiguous and thus facilitate the reconstruction.
The first two conditions have been addressed by using encoders that map the input onto discrete (Section 3.1) or continuous (Section 3.2) latent spaces. These can be flat or hierarchical (Section 3.3). The third condition has been addressed by using disentangled representations (Section 3.4). The fourth has been addressed by using TL-architectures during the training phase. This is covered in Section 7.3.1 as one of the many training mechanisms which have been used in the literature. Table II summarizes this taxonomy.
Table II: Taxonomy of the latent representations.

| Latent space | Architectures |
|---|---|
| Discrete (3.1) vs. continuous (3.2) | ConvNet, ResNet |
| Flat vs. hierarchical (3.3) | FC, 3D-VAE |
| Disentangled representation (3.4) | |
Wu et al. in their seminal work introduced 3D ShapeNet, a 3D shape encoding network that maps a 3D shape, represented as a discretized volumetric grid of size , into a latent representation of size . Its core network is composed of convolutional layers, each one using 3D convolution filters, followed by fully connected layers. This standard vanilla architecture was initially used for 3D shape classification and retrieval, and later applied to 3D volumetric reconstruction from depth maps represented as voxel grids. It has also been used in 3D reconstruction from one or multiple RGB images by replacing the 3D convolutions with 2D convolutions [5, 6, 7, 8, 9, 10, 11, 12].
Since its introduction, several variants of this vanilla architecture have been proposed. Early works differ in the type and number of layers they use. For instance, Yan et al. used convolutional layers with , and channels, respectively, and fully-connected layers with , , and neurons, respectively. Wiles and Zisserman used convolutional layers of , and channels, respectively. Other works add pooling layers [8, 13, 14]. For example, Wiles and Zisserman used max pooling layers between each pair of convolutional layers, except after the first layer and before the last layer. Leaky ReLU layers improve learning since their gradient during back-propagation is never zero.
Another commonly used variant of the encoder is the deep residual network (ResNet), which adds residual connections between the convolutional layers in order to improve and speed up the learning process for very deep networks. Such residual networks have been used in [8, 7, 10] to encode images of size or into feature vectors of low dimension.
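The idea of a residual connection can be sketched in a few lines; the dense layer below is a stand-in for the convolutional layers of an actual ResNet block, with made-up sizes.

```python
import numpy as np

def residual_block(x, weight):
    """The block learns a residual F(x) and outputs F(x) + x, so
    gradients always have an identity path back through the network."""
    f = np.maximum(0, x @ weight)   # stand-in for conv + ReLU
    return f + x                    # skip (residual) connection

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 16))
w = rng.standard_normal((16, 16)) * 0.1
y = residual_block(x, w)
print(y.shape)  # (4, 16)
```

Note that with zero weights the block reduces to the identity, which is precisely why very deep stacks of such blocks remain trainable.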
With the encoders presented in the previous section, the latent space $\mathcal{X}$ may not be continuous, and thus it does not allow easy interpolation. In other words, if $\mathbf{x}_1 = h(\mathcal{I}_1)$ and $\mathbf{x}_2 = h(\mathcal{I}_2)$, then there is no guarantee that $\frac{1}{2}(\mathbf{x}_1 + \mathbf{x}_2)$ can be decoded into a valid 3D shape. Also, small perturbations of $\mathbf{x}$ do not necessarily correspond to small perturbations of the input. Variational Autoencoders (VAE) and their 3D extension (3D-VAE) have one fundamentally unique property that makes them suitable for generative modeling: their latent spaces are, by design, continuous, allowing easy sampling and interpolation. The key idea is that instead of mapping the input into a feature vector, it is mapped into the mean vector $\mu$ and the vector of standard deviations $\sigma$ of a multivariate Gaussian distribution. A sampling layer then takes these two vectors and generates, by random sampling from the multivariate Gaussian distribution, a feature vector $\mathbf{x}$, which serves as input to the subsequent decoding stages.
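A minimal sketch of this sampling layer (the reparameterization trick) is shown below; the 200-dimensional latent space is illustrative, and the KL term is the standard regularizer that keeps the latent distribution close to N(0, I).

```python
import numpy as np

rng = np.random.default_rng(0)

# mu and log_var stand in for the two vectors produced by the encoder.
def sample_latent(mu, log_var):
    eps = rng.standard_normal(mu.shape)       # eps ~ N(0, I)
    return mu + np.exp(0.5 * log_var) * eps   # x = mu + sigma * eps

def kl_divergence(mu, log_var):
    """KL( N(mu, diag(sigma^2)) || N(0, I) ), in closed form."""
    return -0.5 * np.sum(1 + log_var - mu**2 - np.exp(log_var))

mu, log_var = np.zeros(200), np.zeros(200)    # illustrative 200-d latent space
x = sample_latent(mu, log_var)
print(x.shape)  # (200,)
```

Because the randomness enters only through `eps`, gradients can flow through `mu` and `log_var` during back-propagation, which is what makes the sampling layer trainable.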
This architecture has been used to learn a continuous latent space for volumetric [17, 18], depth-based, surface-based, and point-based 3D reconstruction. In Wu et al., for example, the image encoder takes as input an RGB image and outputs two -dimensional vectors representing, respectively, the mean and the standard deviation of a Gaussian distribution in the -dimensional space. Compared to standard encoders, 3D-VAE can be used to randomly sample from the latent space, to generate variations of an input, and to reconstruct multiple plausible 3D shapes from an ambiguous input, e.g., an image with occlusions [21, 22]. It also generalizes well to images that were not seen during training.
Liu et al.  showed that encoders that map the input into a single latent representation cannot extract rich structures and thus may lead to blurry reconstructions. To improve the quality of the reconstruction, Liu et al.  introduced a more complex internal variable structure, with the specific goal of encouraging the learning of a hierarchical arrangement of latent feature detectors. The approach starts with a global latent variable layer that is hardwired to a set of local latent variable layers, each tasked with representing one level of feature abstraction. The skip-connections tie together the latent codes, and in a top-down directed fashion, local codes closer to the input will tend to represent lower-level features while local codes farther away from the input will tend towards representing higher-level features. Finally, the local latent codes are concatenated to a flattened structure when fed into the task-specific models such as 3D reconstruction.
The appearance of an object in an image is affected by multiple factors such as the object’s shape, the camera pose, and the lighting conditions. Standard encoders represent all these variabilities in the learned code x. This is not desirable in applications such as recognition and classification , which should be invariant to extrinsic factors such as pose and lighting. 3D reconstruction can also benefit from disentangled representations where shape, pose, and lighting are represented with different codes. To this end, Grant et al.  proposed an encoder, which maps an RGB image into a shape code and a transformation code. The former is decoded into a 3D shape. The latter, which encodes lighting conditions and pose, is decoded into (1) another RGB image with correct lighting, using upconvolutional layers, and (2) camera pose using fully-connected layers (FC). To enable a disentangled representation, the network is trained in such a way that in the forward pass, the image decoder receives input from the shape code and the transformation code. In the backward pass, the signal from the image decoder to the shape code is suppressed to force it to only represent shape.
Zhu et al.  followed the same idea by decoupling the 6DOF pose parameters and shape. The network reconstructs from the 2D input the 3D shape but in a canonical pose. At the same time, a pose regressor estimates the 6DOF pose parameters, which are then applied to the reconstructed canonical shape. Decoupling pose and shape reduces the number of free parameters in the network, which results in improved efficiency.
Volumetric representations discretize the space around a 3D object into a 3D voxel grid $V$. The finer the discretization, the more accurate the representation. The goal is then to recover a grid $\hat{V}$ such that the 3D shape it represents is as close as possible to the unknown real 3D shape $X$. The main advantage of using volumetric grids is that many of the existing deep learning architectures that have been designed for 2D image analysis can be easily extended to 3D data by replacing the 2D pixel array with its 3D analogue and then processing the grid using 3D convolution and pooling operations. This section looks at the different volumetric representations (Section 4.1) and reviews the decoder architectures for low-resolution (Section 4.2) and high-resolution (Section 4.3) 3D reconstruction.
Table: Taxonomy of volumetric reconstruction techniques.

| Sampling | Content | Resolution | High-resolution strategies (4.3) | Network | Intermediation (6.1) |
|---|---|---|---|---|---|
| Regular, fruxel, adaptive | Occupancy, SDF, TSDF | Low (4.2) vs. high (4.3) | Space partitioning (4.3.1): HSP, OGN, O-CNN, OctNet, patch-guided; shape partitioning (4.3.2): parts, patches; subspace parameterization (4.3.3): PCA, DCT; refinement (4.3.4): upsampling, volume slicing, patch synthesis | FC, UpConv | (1) image → voxels; (2) image → (2.5D, silhouettes) → voxels |

Representative methods:

| Sampling | Content | Low res. (4.2) | High res. (4.3) | Network | Intermediation (6.1) |
|---|---|---|---|---|---|
| Regular | Occupancy | ✓ | | LSTM + UpConv | image → voxels |
| Regular | SDF | ✓ | Patch synthesis | UpConv | scans → voxels |
| Regular | Occupancy | | Volume slicing | CNN + LSTM + CNN | image → voxels |
| Regular | TSDF | | Parts | LSTM + MDN | depth → bounding box |
| Regular | TSDF | | OctNet, global-to-local | UpConv | scans → voxels |
| Adaptive | Occupancy | | HSP | UpConv nets | image → voxels |
There are four main volumetric representations that have been used in the literature:
Binary occupancy grid. In this representation, a voxel is set to one if it belongs to the objects of interest, whereas background voxels are set to zero.
Probabilistic occupancy grid. Each voxel in a probabilistic occupancy grid encodes its probability of belonging to the surface of the objects of interest.
The Signed Distance Function (SDF). Each voxel encodes its signed distance to the closest surface point. It is positive if the voxel is located inside the object and negative otherwise.
Truncated Signed Distance Function (TSDF). Introduced by Curless and Levoy , TSDF is computed by first estimating distances along the lines of sight of a range sensor, forming a projective signed distance field, and then truncating the field at small negative and positive values.
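These encodings can be illustrated on a synthetic sphere; the radius, grid size, and truncation value below are arbitrary, and the sign convention follows the text (positive inside the object).

```python
import numpy as np

# A sphere of radius 0.5 centred in a [-1, 1]^3 grid of resolution 32.
n = 32
axis = np.linspace(-1, 1, n)
X, Y, Z = np.meshgrid(axis, axis, axis, indexing="ij")
dist = np.sqrt(X**2 + Y**2 + Z**2) - 0.5   # distance to the sphere surface

sdf = -dist                                 # positive inside, negative outside
occupancy = (sdf > 0).astype(np.uint8)      # binary occupancy grid
prob = 1 / (1 + np.exp(-sdf / 0.1))         # probabilistic occupancy (soft sign)
tsdf = np.clip(sdf, -0.2, 0.2)              # truncated SDF, truncation at 0.2
```

The truncation step is what makes TSDFs cheap to update locally from partial observations such as depth maps, at the cost of discarding the field far from the surface.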
Probabilistic occupancy grids are particularly suitable for machine learning algorithms which output likelihoods. SDFs provide an unambiguous estimate of surface positions and normal directions. However, they are not trivial to construct from partial data such as depth maps. TSDFs sacrifice the full signed distance field that extends indefinitely away from the surface geometry, but allow for local updates of the field based on partial observations. They are suitable for reconstructing 3D volumes from a set of depth maps [25, 37, 30, 34].
In general, volumetric representations are created by regular sampling of the volume around the objects. Knyaz et al.  introduced a representation method called Frustum Voxel Model or Fruxel, which combines depth representation with voxel grids. It uses the slices of the camera’s 3D frustum to build the voxel space, and thus provides precise alignment of voxel slices with the contours in the input image.
Once a compact vector representation of the input is learned using an encoder, the next step is to learn the decoding function $g$, known as the generator or generative model, which maps the vector representation into a volumetric voxel grid. The standard approach uses a convolutional decoder, also called an up-convolutional network, which mirrors the convolutional encoder. Wu et al. were among the first to propose this methodology to reconstruct 3D volumes from depth maps. Wu et al. proposed a two-stage reconstruction network called MarrNet. The first stage uses an encoder-decoder architecture to reconstruct, from an input image, the depth map, the normal map, and the silhouette map. These three maps, referred to as sketches, are then used as input to another encoder-decoder architecture, which regresses a volumetric 3D shape. The network was later extended by Sun et al. to also regress the pose of the input. The main advantage of this two-stage approach is that, compared to full 3D models, depth maps, normal maps, and silhouette maps are much easier to recover from 2D images. Likewise, 3D models are much easier to recover from these three modalities than from 2D images alone. This method, however, fails to reconstruct complex, thin structures.
Wu et al.’s work has led to several extensions [8, 17, 26, 38, 9]. In particular, recent works tried to directly regress the 3D voxel grid [9, 18, 14, 12] without intermediation. Tulsiani et al., and later in , used a decoder composed of 3D upconvolution layers to predict the voxel occupancy probabilities. Liu et al. used a 3D upconvolutional neural network, followed by an element-wise logistic sigmoid, to decode the learned latent features into a 3D occupancy probability grid. These methods have been successful in performing 3D reconstruction from a single image or a collection of images captured with uncalibrated cameras. Their main advantage is that the deep learning architectures proposed for the analysis of 2D images can be easily adapted to 3D models by replacing the 2D up-convolutions in the decoder with 3D up-convolutions, which can also be efficiently implemented on the GPU. However, given the computational complexity and memory requirements, these methods produce low-resolution grids, usually of size or . As such, they fail to recover fine details.
There have been attempts to upscale the deep learning architectures for high-resolution volumetric reconstruction. For instance, Wu et al. were able to reconstruct voxel grids of size by simply expanding the network. Volumetric grids, however, are very expensive in terms of memory requirements, which grow cubically with the grid resolution. This section reviews some of the techniques that have been used to infer high-resolution volumetric grids while keeping the computational and memory requirements tractable. We classify these methods into four categories based on whether they use space partitioning, shape partitioning, subspace parameterization, or coarse-to-fine refinement strategies.
Fig. 1: (a) Octree Network (OctNet). (b) Hierarchical Surface Prediction (HSP). (c) Octree Generating Network (OGN).
While regular volumetric grids facilitate convolutional operations, they are very sparse, since surface elements are contained in only a few voxels. Several papers have exploited this sparsity to address the resolution problem [39, 31, 32, 40]. They were able to reconstruct 3D volumetric grids of size to by using space partitioning techniques such as octrees. There are, however, two main challenging issues when using octree structures for deep learning-based reconstruction. The first one is computational, since convolutional operations are easier to implement (especially on GPUs) when operating on regular grids. For this purpose, Wang et al. designed O-CNN, a novel octree data structure, to efficiently store the octant information and CNN features in the graphics memory and execute the entire training and evaluation on the GPU. O-CNN supports various CNN structures and works with 3D shapes of different representations. By restraining the computations to the octants occupied by 3D surfaces, the memory and computational costs of the O-CNN grow quadratically as the depth of the octree increases, which makes the 3D CNN feasible for high-resolution 3D models.
The second challenge stems from the fact that the octree structure is object-dependent. Thus, ideally, the deep neural network needs to learn how to infer both the structure of the octree and its content. In this section, we will discuss how these challenges have been addressed in the literature.
The simplest approach is to assume that, at runtime, the structure of the octree is known. This is fine for applications such as semantic segmentation where the structure of the output octree can be set to be identical to that of the input. However, in many important scenarios, e.g., 3D reconstruction, shape modeling, and RGB-D fusion, the structure of the octree is not known in advance and must be predicted. To this end, Riegler et al.  proposed a hybrid grid-octree structure called OctNet (Fig. 1-(a)). The key idea is to restrict the maximal depth of an octree to a small number, e.g., three, and place several such shallow octrees on a regular grid. This representation enables 3D convolutional networks that are both deep and of high resolution. However, at test time, Riegler et al.  assume that the structure of the individual octrees is known. Thus, although the method is able to reconstruct 3D volumes at a resolution of , it lacks flexibility since different types of objects may require different training.
Ideally, the octree structure and its content should be simultaneously estimated. This can be done as follows:
First, the input is encoded into a compact feature vector using a convolutional encoder (Section 3).
Next, the feature vector is decoded using a standard up-convolutional network. This results in a coarse volumetric reconstruction of the input, usually of resolution (Section 4.2).
The reconstructed volume, which forms the root of the octree, is subdivided into octants. Octants with boundary voxels are upsampled and further processed, using an up-convolutional network, to refine the reconstruction of the regions in that octant.
The octants are processed recursively until the desired resolution is reached.
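The four steps above can be sketched as a recursive refinement loop. Here `refine` is a nearest-neighbour upsampling placeholder for the per-octant up-convolutional network, and the seed resolution is illustrative.

```python
import numpy as np

def refine(block):
    """Placeholder for the learned per-octant up-convolutional network:
    simply doubles the resolution by nearest-neighbour upsampling."""
    return block.repeat(2, 0).repeat(2, 1).repeat(2, 2)

def octree_decode(volume, target_res):
    while volume.shape[0] < target_res:
        out = np.zeros(tuple(2 * s for s in volume.shape))
        h = volume.shape[0] // 2
        for i in (0, h):
            for j in (0, h):
                for k in (0, h):
                    octant = volume[i:i+h, j:j+h, k:k+h]
                    # Only boundary octants (mixed occupancy) are refined;
                    # uniform octants are filled with their constant value.
                    if 0 < octant.mean() < 1:
                        out[2*i:2*(i+h), 2*j:2*(j+h), 2*k:2*(k+h)] = refine(octant)
                    else:
                        out[2*i:2*(i+h), 2*j:2*(j+h), 2*k:2*(k+h)] = octant.flat[0]
        volume = out
    return volume

coarse = np.zeros((4, 4, 4))
coarse[1:3, 1:3, 1:3] = 1        # toy coarse reconstruction (step 2 output)
fine = octree_decode(coarse, 16)
print(fine.shape)  # (16, 16, 16)
```

In an actual network, the decision of which octants to refine is itself predicted, and `refine` is a learned up-convolution rather than upsampling.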
Häne et al. introduced Hierarchical Surface Prediction (HSP), see Fig. 1-(b), which uses the approach described above to reconstruct volumetric grids of resolution up to . In this approach, the octree is explored in a depth-first manner. Tatarchenko et al., on the other hand, proposed the Octree Generating Network (OGN), which follows the same idea but explores the octree in a breadth-first manner, see Fig. 1-(c). As such, OGN produces a hierarchical reconstruction of the 3D shape. The approach was able to reconstruct volumetric grids of size .
Wang et al. introduced a patch-guided partitioning strategy. The core idea is to represent a 3D shape with an octree where each of its leaf nodes approximates a planar surface. To infer such a structure from a latent representation, Wang et al. used a cascade of decoders, one per octree level. At each octree level, a decoder predicts the planar patch within each cell, and a predictor (composed of fully connected layers) predicts the patch approximation status for each octant, i.e., whether the cell is "empty", "surface well approximated" with a plane, or "surface poorly approximated". Cells with poorly approximated surface patches are further subdivided and processed by the next level. This approach reduces the memory requirements from GB for volumetric grids of to GB, and the computation time from s to s, while maintaining the same level of accuracy. Its main limitation is that adjacent patches are not seamlessly reconstructed. Also, since a plane is fitted to each octree cell, it does not approximate curved surfaces well.
Instead of partitioning the volumetric space in which the 3D shapes are embedded, an alternative approach is to consider the shape as an arrangement of geometric parts, reconstruct the individual parts independently from each other, and then stitch the parts together to form the complete 3D shape. There have been a few works that attempted this approach. For instance, Li et al. only generate voxel representations at the part level. They proposed a Generative Recursive Autoencoder for Shape Structure (GRASS). The idea is to split the problem into two steps. The first step uses a Recursive Neural Network (RvNN) encoder-decoder architecture coupled with a Generative Adversarial Network to learn how to best organize a shape structure into a symmetry hierarchy and how to synthesize the part arrangements. The second step learns, using another generative model, how to synthesize the geometry of each part, represented as a voxel grid of size . Thus, although the part generator network synthesizes the 3D geometry of parts at only resolution, the fact that individual parts are treated separately enables the reconstruction of 3D shapes at high resolution.
Zou et al. reconstruct a 3D object as a collection of primitives using a generative recurrent neural network called 3D-PRNN. The architecture transforms the input into a feature vector of size via an encoder network. Then, a recurrent generator composed of stacks of Long Short-Term Memory (LSTM) units and a Mixture Density Network (MDN) sequentially predicts the different parts of the shape from the feature vector. At each time step, the network predicts a set of primitives conditioned on both the feature vector and the previously estimated primitive. The predicted parts are then combined to form the reconstruction result. This approach predicts only an abstracted representation in the form of cuboids. Coupling it with volumetric-based reconstruction techniques, which would focus on individual cuboids, could lead to refined 3D reconstruction at the part level.
The space of all possible shapes can be parameterized using a set of orthogonal bases $B = \{b_1, \dots, b_n\}$. Every shape $V$ can then be represented as a linear combination of the bases, i.e., $V = \sum_{i=1}^{n} \alpha_i b_i$. This formulation simplifies the reconstruction problem; instead of trying to learn how to reconstruct the volumetric grid $V$, one can design a decoder composed of fully connected layers to estimate the coefficients $\alpha = (\alpha_1, \dots, \alpha_n)$ of the latent representation, and then recover the complete 3D volume. Johnston et al.  used the Discrete Cosine Transform-II (DCT-II) to define $B$. They then proposed a convolutional encoder to predict the low-frequency DCT-II coefficients. These coefficients are converted by a DCT decoder, which replaces the decoding network, into a solid 3D volume. This has a profound impact on the computational cost of training and inference: using DCT coefficients, the network is able to reconstruct surfaces at high volumetric grid resolutions.
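The coefficient-space decoding can be illustrated with an ordinary DCT. The sketch below is ours, not Johnston et al.'s actual network: it truncates a toy volume to its low-frequency DCT-II coefficients and recovers a solid volume with an inverse-DCT "decoder" (a learned encoder would regress the coefficients instead).

```python
import numpy as np
from scipy.fft import dctn, idctn

def compress_volume(volume, k):
    """Keep only the k x k x k lowest-frequency DCT-II coefficients."""
    coeffs = dctn(volume, norm="ortho")
    truncated = np.zeros_like(coeffs)
    truncated[:k, :k, :k] = coeffs[:k, :k, :k]
    return truncated

def decode_volume(coeffs, threshold=0.5):
    """Inverse-DCT 'decoder': map coefficients back to a binary volume."""
    return (idctn(coeffs, norm="ortho") > threshold).astype(np.uint8)

# Toy 16^3 occupancy volume containing a solid cube.
vol = np.zeros((16, 16, 16))
vol[4:12, 4:12, 4:12] = 1.0
recon = decode_volume(compress_volume(vol, k=8))
iou = np.logical_and(vol, recon).sum() / np.logical_or(vol, recon).sum()
print("IoU after low-frequency truncation:", iou)
```

Even with only the low-frequency coefficients retained, the thresholded reconstruction stays close to the original solid, which is what makes the compact coefficient vector an attractive decoder target.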
The main issue when using generic bases such as the DCT basis is that, in general, a large number of bases is required to accurately represent 3D objects with complex geometry. In practice, we usually deal with objects of known categories, e.g., human faces and 3D human bodies, for which training data is usually available, see Section 8. As such, one can use Principal Component Analysis (PCA) bases, learned from the training data, to parameterize the space of shapes. This requires a significantly smaller number of bases than generic bases, whose number is in the order of thousands.
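A minimal sketch of the PCA alternative, with a synthetic shape family standing in for real category-specific training data (all sizes and names below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n_shapes, n_verts = 50, 100
mean_shape = rng.standard_normal(n_verts * 3)
true_basis = rng.standard_normal((5, n_verts * 3))   # 5 hidden modes of variation
weights = rng.standard_normal((n_shapes, 5))
shapes = mean_shape + weights @ true_basis           # toy training set, one row per shape

# Learn a PCA basis from the training shapes via SVD of the centered data.
mu = shapes.mean(axis=0)
_, s, Vt = np.linalg.svd(shapes - mu, full_matrices=False)
basis = Vt[:5]                                       # top-5 principal components

# Any training shape is recovered from just 5 coefficients.
alpha = (shapes[0] - mu) @ basis.T
recon = mu + alpha @ basis
err = np.abs(recon - shapes[0]).max()
print("max reconstruction error:", err)
```

Because the toy data lies exactly in a 5-dimensional affine subspace, five PCA coefficients suffice; for real face or body data, a few tens of components typically capture most of the variation, far fewer than a generic frequency basis would need.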
Another way to improve the resolution of volumetric techniques is to use multi-staged approaches [25, 41, 27, 42, 34]. The first stage recovers a low-resolution voxel grid using an encoder-decoder architecture. The subsequent stages, which function as upsampling networks, refine the reconstruction by focusing on local regions. Yang et al.  used an upsampling module that simply consists of two up-convolutional layers. This simple module upgrades the output 3D shape to a higher resolution.
Wang et al.  treat the reconstructed coarse voxel grid as a sequence of images (or slices). The 3D object is then reconstructed slice by slice at high resolution. While this approach allows efficient refinement using 2D up-convolutions, the 3D shapes used for training should be consistently aligned so that the volumes can be sliced into 2D images along the first principal direction. Also, reconstructing individual slices independently of each other may result in discontinuities and incoherence in the final volume. Wang et al.  overcome this limitation by using a Long-term Recurrent Convolutional Network (LRCN) . The LRCN encodes five consecutive slices into a fixed-length vector representation, which serves as input to an LSTM. The output of the LSTM is passed through a 2D convolutional decoder to produce a high-resolution image. The sequence of high-resolution 2D images forms the output 3D volume.
Instead of using volume slicing, other papers use additional CNN modules that focus on regions requiring refinement. For example, Dai et al.  first predict a coarse but complete shape volume and then refine it into a higher-resolution grid via an iterative volumetric patch synthesis process, which copy-pastes voxels from the k-nearest neighbors retrieved from a database of 3D models. Han et al.  extended Dai et al.'s approach by introducing a local 3D CNN that performs patch-level surface refinement. Cao et al. , who recover a coarse volumetric grid in the first stage, take fixed-size volumetric blocks and predict whether they require further refinement. Blocks that require refinement are resampled to a higher resolution and fed into another encoder-decoder, along with the initial coarse prediction to guide the refinement. Both subnetworks adopt the U-net architecture  while substituting convolution and pooling layers with the corresponding operations from OctNet .
Note that these methods need separate and sometimes time-consuming steps before local inference. For example, Dai et al.  require nearest neighbor searches from a 3D database. Han et al.  require 3D boundary detection while Cao et al.  require assessing whether a block requires further refinement or not.
While volumetric representations can handle 3D shapes of arbitrary topologies, they require a post-processing step, e.g., marching cubes , to retrieve the actual 3D surface mesh, which is the quantity of interest in 3D reconstruction. As such, the whole pipeline cannot be trained end-to-end. To overcome this limitation, Liao et al.  introduced Deep Marching Cubes, an end-to-end trainable network, which predicts explicit surface representations of arbitrary topology. They use a modified differentiable representation, which separates the mesh topology from the geometry. The network is composed of an encoder and a two-branch decoder. Instead of predicting signed distance values, the first branch predicts the probability of occupancy for each voxel. The mesh topology within a cell is then implicitly (and probabilistically) defined by the state of the occupancy variables at its corners. The second branch of the decoder predicts a vertex location for every edge of each cell. The combination of the implicitly-defined topology and the vertex locations defines a distribution over meshes that is differentiable and can be used for backpropagation. While the approach enables end-to-end training, it is limited to low-resolution grids.
Instead of directly estimating high-resolution volumetric grids, some methods produce multi-view depth maps, which are fused into an output volume. The main advantage is that, in the decoding stage, one can use 2D convolutions, which are more efficient, in terms of computation and memory, than 3D convolutions. Their main limitation, however, is that depth maps only encode the external surface. To capture internal structures, Richter et al.  introduced Matryoshka Networks, which use nested depth layers; the shape is recursively reconstructed by first fusing the depth maps in the first layer, then subtracting the shapes in even layers, and adding the shapes in odd layers. The method is able to reconstruct high-resolution volumetric grids.
Volumetric representation-based methods are computationally very wasteful since information is rich only on or near the surfaces of 3D shapes. The main challenge when working directly with surfaces is that common representations such as meshes or point clouds are not regularly structured and thus, they do not easily fit into deep learning architectures, especially those using CNNs. This section reviews the techniques used to address this problem. We classify the state-of-the-art into three main categories: parameterization-based (Section 5.1), template deformation-based (Section 5.2), and point-based methods (Section 5.3).
|Representation|Deformation model|Template / domain|Decoder architecture|
|Geometry image|Vertex defo.|Sphere / ellipse|FC layers|
|Geometry image|—|Geometry image|ResNet blocks + 2 Conv layers|
|Mesh|Vertex defo.|Ellipse|GCNN blocks|
|Mesh|Vertex defo.|Learned (CNN)|FC layer|
Taxonomy of mesh decoders. GCNN: graph CNN. MLP: Multilayer Perceptron. Param.: parameterization. FC: fully connected.
Instead of working directly with triangular meshes, we can represent the surface of a 3D shape as a mapping $f: D \to \mathbb{R}^3$, where $D$ is a regular parameterization domain. The goal of the 3D reconstruction process is then to recover the shape function $f$ from an input image I. When $D$ is a 3D domain, the methods in this class fall within the volumetric techniques described in Section 4. Here, we focus on the case where $D$ is a regular 2D domain, which can be a subset of the two-dimensional plane, e.g., the unit square $[0, 1]^2$, or the unit sphere $\mathbb{S}^2$. In the first case, one can implement encoder-decoder architectures using standard 2D convolution operations. In the latter case, one has to use spherical convolutions  since the domain is spherical.
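For intuition, a parameterization $f$ over the unit square can be sampled on a regular grid, yielding a 3-channel "shape image" that standard 2D convolutions can process. The spherical mapping below is an illustrative example of such an $f$, not a method from the surveyed papers:

```python
import numpy as np

def sphere_param(u, v):
    """Map the unit square (u, v) in [0,1]^2 onto the unit sphere in R^3."""
    theta, phi = u * 2 * np.pi, v * np.pi
    return np.stack([np.cos(theta) * np.sin(phi),
                     np.sin(theta) * np.sin(phi),
                     np.cos(phi)], axis=-1)

# Sampling f on a 32 x 32 grid gives a geometry image of the sphere:
# each pixel stores the (x, y, z) coordinates of a surface point.
u, v = np.meshgrid(np.linspace(0, 1, 32), np.linspace(0, 1, 32))
shape_image = sphere_param(u, v)
print(shape_image.shape)
```

A reconstruction network in this family would regress such a 3-channel image directly, after which the surface is recovered by connecting neighboring pixels into a mesh.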
Spherical parameterizations and geometry images [58, 59, 60] are the most commonly used parameterizations. They are, however, suitable only for genus-0 and disk-like surfaces. When dealing with surfaces of arbitrary topology, the surface needs to be cut into disk-like patches, and then unfolded into a regular 2D domain. Finding the optimal cut for a given surface, and more importantly, finding cuts that are consistent across shapes within the same category, is challenging. In fact, naively creating independent geometry images for a shape category and feeding them into deep neural networks would fail to generate coherent 3D shape surfaces .
To create, for genus-0 surfaces, robust geometry images that are consistent across a shape category, the 3D objects within the category should first be put in correspondence [61, 62, 63]. Sinha et al.  proposed a cut-invariant procedure, which solves a large-scale correspondence problem, and an extension of deep residual networks to automatically generate geometry images encoding the surface coordinates. The approach uses three separate encoder-decoder networks, which learn, respectively, the x, y, and z geometry images. The three networks are composed of standard convolutions and up- and down-residual blocks. They take as input a depth image or an RGB image, and learn the 3D reconstruction by minimizing a shape-aware loss function.
Pumarola et al.  reconstruct the shape of a deformable surface using a network with two branches, a detection branch and a depth estimation branch, which operate in parallel, and a third shape branch, which merges the detection mask and the depth map into a parameterized surface. Groueix et al.  decompose the surface of a 3D object into $N$ patches, each defined as a mapping $f_i: [0, 1]^2 \to \mathbb{R}^3$. They then design a decoder composed of $N$ branches, where the $i$-th branch reconstructs the $i$-th patch by estimating the function $f_i$. At the end, the reconstructed patches are merged together to form the entire surface. Although this approach can handle surfaces of high genus, it is still not general enough to handle surfaces of arbitrary genus; in fact, the optimal number of patches depends on the genus of the surface. Also, the patches are not guaranteed to be connected, although in practice one can still post-process the result and fill in the gaps between disconnected patches.
In summary, parameterization methods are limited to low-genus surfaces. As such, they are suitable for the reconstruction of objects that belong to a given shape category, e.g., human faces and bodies.
Methods in this class take an input I and estimate a deformation field $\Delta$, which, when applied to a template 3D shape, results in the reconstructed 3D model $X$. Existing techniques differ in the type of deformation model they use (Section 5.2.1), the way the template is defined (Section 5.2.2), and the network architecture used to estimate the deformation field (Section 5.2.3). In what follows, we assume that a 3D shape $X$ is represented by its vertices $V = \{v_1, \dots, v_{n_v}\}$ and faces $F$. Let $\bar{X}$, with vertices $\bar{v}_i$, denote a template shape.
(1) Vertex deformation. This model assumes that a 3D shape can be written in terms of linear displacements of the individual vertices of a template, i.e., $v_i = \bar{v}_i + \delta_i$, where $\bar{v}_i$ is the $i$-th vertex of the template. The deformation field is defined as $\Delta = \{\delta_i\}_{i=1}^{n_v}$. This deformation model, illustrated in Fig. 2-(left), has been used by Kato et al.  and Kanazawa et al. . It assumes that (1) there is a one-to-one correspondence between the vertices of the shape $X$ and those of the template $\bar{X}$, and (2) the shape $X$ has the same topology as the template $\bar{X}$.
(2) Morphable model. Alternatively, one can use learned morphable models to parameterize a 3D mesh. Let $\bar{X}$ be the mean shape and $\{B_i\}_{i=1}^{n}$ a set of orthonormal bases. Any shape $X$ can be written in the form:

$$X = \bar{X} + \sum_{i=1}^{n} \alpha_i B_i. \qquad (1)$$
The second term of Equation (1), $\Delta = \sum_{i=1}^{n} \alpha_i B_i$, can be seen as a deformation field applied to the vertices of the mean shape. By setting $B_0 = \bar{X}$ and $\alpha_0 = 1$, Equation (1) can be written as $X = \sum_{i=0}^{n} \alpha_i B_i$. In this case, the mean is treated as a bias term.
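Both forms of the morphable model can be checked numerically; in the sketch below the mean shape, basis, and coefficients are random placeholders rather than a learned model:

```python
import numpy as np

rng = np.random.default_rng(1)
n_verts = 4
mean_shape = rng.standard_normal((n_verts, 3))   # mean shape X_bar
B = rng.standard_normal((3, n_verts, 3))         # 3 basis shapes B_1..B_3
alpha = np.array([0.5, -1.0, 0.2])               # shape coefficients

# Form (1): mean shape plus a linear combination of the basis shapes.
X1 = mean_shape + np.tensordot(alpha, B, axes=1)

# Bias-term form: prepend the mean as B_0 with a fixed coefficient of 1.
B0 = np.concatenate([mean_shape[None], B], axis=0)
alpha0 = np.concatenate([[1.0], alpha])
X2 = np.tensordot(alpha0, B0, axes=1)
print(np.allclose(X1, X2))
```

The decoder in such a pipeline only has to regress the low-dimensional coefficient vector; the linear combination above then deterministically produces the mesh vertices.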
One approach to learning a morphable model is by using Principal Component Analysis (PCA) on a collection of clean 3D mesh exemplars. Recent techniques showed that, with only 2D annotations, it is possible to build category-specific 3D morphable models from 2D silhouettes or 2D images [65, 66]. These methods require efficient detection and segmentation of the objects, and camera pose estimation, which can also be done using CNN-based techniques.
(3) Free-Form Deformation (FFD). Instead of directly deforming the vertices of the template $\bar{X}$, one can deform the space around it, see Fig. 2-(right). This is done by defining a set of control points $P$, called deformation handles, around $\bar{X}$. When the deformation field $\Delta P$ is applied to these control points, it deforms the entire space around the shape and thus also the vertices of the shape, according to the following equation:

$$V = B\,(P + \Delta P), \qquad (2)$$

where $B$ is the deformation matrix, e.g., of Bernstein basis coefficients, which expresses the template vertices in terms of the control points.
This approach has been used by Kuryenkov et al. , Pontes et al. , and Jack et al. . The main advantage of free-form deformation is that it does not require one-to-one correspondence between the shapes and the template. However, the shapes that can be approximated by the FFD of the template are only those that have the same topology as the template.
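The FFD mechanism of Equation (2) is easy to sketch: each vertex is expressed in a Bernstein basis of a control-point lattice, so displacing the control points displaces every vertex. The degree and lattice below are our illustrative choices, not those of any particular paper:

```python
import numpy as np
from math import comb

def bernstein(n, i, t):
    """Bernstein basis polynomial B_i^n(t)."""
    return comb(n, i) * (1 - t) ** (n - i) * t ** i

def ffd(vertices, control_points, degree=2):
    """Deform unit-cube vertices by a (degree+1)^3 Bernstein control lattice."""
    out = np.zeros_like(vertices)
    n = degree
    for v_idx, (s, t, u) in enumerate(vertices):
        for i in range(n + 1):
            for j in range(n + 1):
                for k in range(n + 1):
                    w = bernstein(n, i, s) * bernstein(n, j, t) * bernstein(n, k, u)
                    out[v_idx] += w * control_points[i, j, k]
    return out

degree = 2
grid = np.linspace(0.0, 1.0, degree + 1)
P = np.stack(np.meshgrid(grid, grid, grid, indexing="ij"), axis=-1)  # identity lattice
verts = np.random.default_rng(2).random((10, 3))

identity_ok = np.allclose(ffd(verts, P), verts)          # undisplaced lattice: no change
shifted = ffd(verts, P + np.array([0.5, 0.0, 0.0]))      # displace all control points
print(identity_ok, np.allclose(shifted, verts + [0.5, 0.0, 0.0]))
```

The identity lattice reproduces the vertices exactly (linear precision of the Bernstein basis), which is why a network only needs to regress the control-point displacements $\Delta P$ rather than per-vertex offsets.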
Kato et al.  used a sphere as a template. Wang et al.  used an ellipsoid. Henderson et al.  defined two types of templates: a complex shape abstracted into cuboidal primitives, and a cube subdivided into multiple vertices. While the former is suitable for man-made shapes composed of multiple components, the latter is suitable for representing genus-0 shapes and does not offer an advantage over a sphere or an ellipsoid.
To speed up the convergence, Kuryenkov et al.  introduced DeformNet, which takes an image as input, searches the nearest shape from a database, and then deforms, using the FFD model of Equation (2), the retrieved model to match the query image. This method allows detail-preserving 3D reconstruction.
Pontes et al.  used an approach that is similar to DeformNet . However, once the FFD field is estimated and applied to the template, the result is further refined by adding a residual defined as a weighted sum of some 3D models retrieved from a dictionary. The role of the deep neural network is to learn how to estimate the deformation field and the weights used in computing the refinement residual. Jack et al. , on the other hand, deform, using FFD, multiple templates and select the one that provides the best fitting accuracy.
Another approach is to learn the template, either separately using statistical shape analysis techniques, e.g., PCA, on a set of training data, or jointly with the deformation field using deep learning techniques. For instance, Tulsiani et al.  use the mean shape of each category of 3D models as a class-specific template. The deep neural network estimates both the class of the input shape, which is used to select the class-specific mean shape, and the deformation field that needs to be applied to the class-specific mean shape. Kanazawa et al.  learn, at the same time, the mean shape and the deformation field. Thus, the approach does not require a separate 3D training set to learn the morphable model. In both cases, the reconstruction results lack details and are limited to popular categories such as cars and birds.
Deformation-based methods also use encoder-decoder architectures. The encoder maps the input into a latent variable x using successive convolutional operations. The latent space can be discrete or continuous as in , which used a variational auto-encoder (see Section 3). The decoder is, in general, composed of fully-connected layers. Kato et al. , for example, used two fully connected layers to estimate the deformation field to apply to a sphere to match the input’s silhouette.
Instead of deforming a sphere or an ellipse, Kuryenkov et al.  retrieve from a database the 3D model that is most similar to the input I, and then estimate the FFD needed to deform it to match the input. The retrieved template is first voxelized and encoded, using a 3D CNN, into a second latent variable. The latent representations of the input image and of the retrieved template are then concatenated and decoded, using an up-convolutional network, into an FFD field defined on the vertices of a voxel grid.
Pontes et al.  used a similar approach, but the latent variable x is fed into a classifier that finds, in a database, the closest model to the input. At the same time, the latent variable is decoded, using a feed-forward network, into a deformation field and a set of weights. The retrieved template is then deformed using the estimated field and refined with a weighted combination of models from a dictionary of CAD models, using the estimated weights.
Note that one can design several variants of these approaches. For instance, instead of using a 3D model retrieved from a database as a template, one can use a class-specific mean shape. In this case, the latent variable x can be used to classify the input into one of the shape categories, and the learned mean shape of this category is then picked as the template. Also, instead of learning the mean shape separately, e.g., using morphable models, Kanazawa et al.  treated the mean shape as a bias term, which can then be predicted by the network along with the deformation field. Finally, Wang et al.  adopted a coarse-to-fine strategy, which makes the procedure more stable. They proposed a deformation network composed of three deformation blocks, each a graph-based CNN (GCNN), interleaved with two graph unpooling layers. The deformation blocks update the locations of the vertices while the graph unpooling layers increase the number of vertices.
Parameterization and deformation-based techniques can only reconstruct surfaces of fixed topology. The former is limited to surfaces of low genus while the latter is limited to the topology of the template.
A 3D shape can be represented using an unordered set of points. Such a point-based representation is simple but efficient in terms of memory requirements, and it is well suited for objects with intricate parts and fine details. As such, an increasing number of papers [69, 70, 21, 22, 71, 72, 73, 74, 75, 76, 77, 78], including several in 2019, have explored its usage for deep learning-based reconstruction. This section discusses the state-of-the-art point-based representations and their corresponding network architectures.
The main challenge with point clouds is that they are not regular structures and do not easily fit into the convolutional architectures that exploit spatial regularity. Three representations have been proposed to overcome this limitation: (1) point set representations, which treat a point cloud as an unordered matrix of size N × 3; (2) one or multiple 3-channel grids, where each pixel encodes the (x, y, z) coordinates of a point; and (3) depth maps from one or multiple viewpoints.
The last two representations, hereinafter referred to as grid representations, are well suited for convolutional networks. They are also computationally efficient as they can be inferred using only 2D convolutions. Note that depth map-based methods require an additional fusion step to infer the entire 3D shape of an object. This can be done in a straightforward manner if the camera parameters are known. Otherwise, the fusion can be done using point cloud registration techniques [80, 81] or fusion networks . Also, point set representations require fixing in advance the number of points while in methods that use grid representations, the number of points can vary based on the nature of the object but it is always bounded by the grid resolution.
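When the camera intrinsics are known, the fusion step reduces to back-projecting each depth map into a shared coordinate frame. A minimal pinhole-camera sketch (the intrinsic values are made up for illustration):

```python
import numpy as np

def backproject(depth, fx, fy, cx, cy):
    """Lift an H x W depth map to an (H*W, 3) point cloud in the camera frame."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))   # pixel coordinates
    z = depth
    x = (u - cx) * z / fx                            # pinhole model inversion
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=-1).reshape(-1, 3)

depth = np.full((4, 4), 2.0)                         # toy depth map: flat plane at z = 2
pts = backproject(depth, fx=1.0, fy=1.0, cx=2.0, cy=2.0)
print(pts.shape)
```

With multiple views, each back-projected cloud would additionally be transformed by its camera-to-world pose before the union is taken; without known poses, registration or a fusion network takes over, as noted above.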
Similar to volumetric and surface-based representations, techniques that use point-based representations follow the encoder-decoder model. While they all use the same architecture for the encoder, they differ in the type and architecture of their decoder, see Fig. 3. In general, grid representations use up-convolutional networks to decode the latent variable [68, 78, 69, 74], see Fig. 3-(a) and (b). Point set representations (Fig. 3-(c)) use fully connected layers [68, 70, 21, 73, 77] since point clouds are unordered. The main advantage of fully-connected layers is that they capture global information. However, compared to convolutional operations, they are computationally expensive. To benefit from the efficiency of convolutional operations, Gadelha et al.  order the point cloud spatially using a space-partitioning tree such as a KD-tree, and then process it using 1D convolutional operations, see Fig. 3-(d). With a conventional CNN, each convolutional operation has a restricted receptive field and is not able to leverage both global and local information effectively. Gadelha et al.  resolve this issue by maintaining three different resolutions. That is, the latent variable is decoded into three different resolutions, which are then concatenated and further processed with 1D convolutional layers to generate a point cloud of size K.
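The KD-tree ordering trick can be sketched as a recursive median split along alternating axes, so that spatially nearby points become neighbors in the 1D sequence. This is our simplified version, not Gadelha et al.'s exact procedure:

```python
import numpy as np

def kdtree_order(points, axis=0):
    """Return the points reordered by a recursive median split (KD-tree order)."""
    if len(points) <= 1:
        return points
    order = np.argsort(points[:, axis])       # sort along the current axis
    pts = points[order]
    mid = len(pts) // 2
    nxt = (axis + 1) % points.shape[1]        # cycle x -> y -> z -> x ...
    return np.concatenate([kdtree_order(pts[:mid], nxt),
                           kdtree_order(pts[mid:], nxt)])

cloud = np.random.default_rng(3).random((16, 3))
ordered = kdtree_order(cloud)
print(ordered.shape)
```

After this reordering, consecutive rows tend to be spatially close, which is what makes 1D convolutions over the sequence meaningful.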
Fan et al.  proposed a generative deep network that combines the point set representation and the grid representation (Fig. 3-(a)). The network is composed of a cascade of encoder-decoder blocks:
The first block takes the input image and maps it into a latent representation, which is then decoded into a 3-channel image. The three values at each pixel are the coordinates of a point.
Each of the subsequent blocks takes the output of its previous block and further encodes and decodes it into another 3-channel image.
The last block is an encoder, of the same type as the previous ones, followed by a predictor composed of two branches. The first branch is a decoder that predicts a 3-channel image, of which the three values at each pixel are the coordinates of a point. The second branch is a fully-connected network that predicts an N × 3 matrix, each row of which is a 3D point.
The predictions of the two branches are merged using set union to produce the final 3D point set.
Tatarchenko et al. , Wang et al. , and Lin et al.  followed the same idea but their decoders regress grids, see Fig. 3-(a). Each grid encodes the depth map  or the (x, y, z) coordinates [78, 69] of the surface visible from a given viewpoint. The viewpoint, encoded with a sequence of fully connected layers, is provided as input to the decoder along with the latent representation of the input image. Li et al. , on the other hand, used a multi-branch decoder, one branch per viewpoint, see Fig. 3-(b). Unlike the above methods, each branch regresses a canonical depth map from a given viewpoint and a deformation field, which deforms the estimated canonical depth map to match the input, using Grid Deformation Units (GDUs). The reconstructed grids are then lifted to 3D and merged together.
Similar to volumetric techniques, the vanilla architecture for point-based 3D reconstruction only recovers low-resolution geometry. For high-resolution reconstruction, Mandikal et al. , see Fig. 3-(c), use a cascade of multiple networks. The first network predicts a low-resolution point cloud. Each subsequent block takes the previously predicted point cloud and computes global features, using a multi-layer perceptron (MLP) architecture similar to PointNet  or PointNet++ , as well as local features, by applying MLPs in balls around each point. Local and global features are then aggregated and fed to another MLP, which predicts a dense point cloud. The process can be repeated recursively until the desired resolution is reached.
Mandikal et al.  combine TL-embedding with a variational auto-encoder (Fig. 3-(c)). The former allows mapping a 3D point cloud and its corresponding views onto the same location in the latent space. The latter enables the reconstruction of multiple plausible point clouds from the input image(s).
Finally, point-based representations can handle 3D shapes of arbitrary topologies. However, they require a post-processing step, e.g., Poisson surface reconstruction  or SSD , to retrieve the 3D surface mesh, which is the quantity of interest. The pipeline, from the input until the final mesh, cannot be trained end-to-end. Thus, these methods only optimize an auxiliary loss defined on an intermediate representation.
The previous sections discussed methods that directly reconstruct 3D objects from their 2D observations. This section shows how additional cues such as intermediate representations (Section 6.1) and temporal correlations (Section 6.2), can be used to boost 3D reconstruction.
Many of the deep learning-based 3D reconstruction algorithms predict the 3D geometry of an object directly from RGB images. Some techniques, however, decompose the problem into sequential steps that estimate 2.5D information such as depth maps, normal maps, and/or segmentation masks, see Fig. 4. The last step, which can be implemented using traditional techniques such as space carving or 3D back-projection followed by filtering and registration, recovers the full 3D geometry and the pose of the input.
While early methods train the different modules separately, recent works propose end-to-end solutions [7, 49, 10, 87, 37, 76, 88]. For instance, Wu et al.  and later Sun et al.  used two blocks. The first block is an encoder followed by a three-branch decoder, which estimates the depth map, the normal map, and the segmentation mask (called 2.5D sketches). These are then concatenated and fed into another encoder-decoder, which regresses a full 3D volumetric grid [7, 10, 87], and a set of fully-connected layers, which regress the camera pose . The entire network is trained end-to-end.
Other techniques convert the intermediate depth map into (1) a 3D occupancy grid  or a truncated signed distance function volume , which is then processed using a 3D encoder-decoder network for completion and refinement, or (2) a partial point cloud, which is further processed using a point-cloud completion module . Zhang et al.  convert the inferred depth map into a spherical map and inpaint it, to fill in holes, using another encoder-decoder. The inpainted spherical depth map is then back-projected to 3D and refined using a voxel refinement network, which estimates a voxel occupancy grid.
Other techniques estimate multiple depth maps from pre-defined or arbitrary viewpoints. Tatarchenko et al.  proposed a network that takes as input an RGB image and a target viewpoint, and infers the depth map of the object as seen from that viewpoint. By varying the viewpoint, the network is able to estimate multiple depth maps, which can then be merged into a complete 3D model. The approach uses a standard encoder-decoder, plus an additional network composed of three fully-connected layers to encode the viewpoint. Soltani et al.  and Lin et al.  followed the same approach but predict the depth maps, along with their binary masks, from pre-defined viewpoints. In both methods, the merging is performed in a post-processing step. Smith et al.  first estimate a low-resolution voxel grid, then take the depth maps computed from the six axis-aligned views and refine them using a silhouette and depth refinement network. The refined depth maps are finally combined into a high-resolution volumetric grid using space carving techniques.
Tatarchenko et al. , Lin et al. , and Sun et al.  also estimate binary silhouette masks along with the depth maps. The binary masks are used to filter out points that do not back-project onto the surface in 3D space. The side effect of these depth-mask-based approaches is a large waste of computation, since many points are discarded, especially for objects with thin structures. Li et al.  overcome this problem by deforming a regular depth map using a learned deformation field. Instead of directly inferring depth maps that best fit the input, they infer a set of 2D pre-deformation depth maps and their corresponding deformation fields at pre-defined canonical viewpoints. Each of these is passed to a Grid Deformation Unit (GDU) that transforms the regular grid of the depth map into a deformed depth map. Finally, the deformed depth maps are transformed into a common coordinate frame for fusion into a dense point cloud.
The main advantage of multi-staged approaches is that depth, normal, and silhouette maps are much easier to recover from 2D images. Likewise, 3D models are much easier to recover from these three modalities than from 2D images alone.
There are many situations where multiple, spatially distributed images of the same object(s) are acquired over an extended period of time. Single-image reconstruction techniques can be applied to individual frames independently of each other, and the reconstructions can then be merged using registration techniques. Ideally, however, we would like to leverage the spatio-temporal correlations between the frames to resolve ambiguities, especially in the presence of occlusions and highly cluttered scenes. In particular, the network at time $t$ should remember what has been reconstructed up to time $t-1$, and use this memory, in addition to the new input, to reconstruct the scene or objects at time $t$. This problem of processing sequential data has been addressed using Recurrent Neural Networks (RNN) and Long Short-Term Memory (LSTM) networks, which enable networks to remember their inputs over a period of time.
Choy et al.  proposed an architecture called 3D Recurrent Reconstruction Network (3D-R2N2), which allows the network to adaptively and consistently learn a suitable 3D representation of an object as (potentially conflicting) information from different viewpoints becomes available. The network can perform incremental refinement every time a new view becomes available. It is composed of two parts: a standard convolutional encoder-decoder and a set of 3D Convolutional Long Short-Term Memory (3D-LSTM) units placed at the start of the convolutional decoder. These units take the output of the encoder and either selectively update their cell states or retain them by closing the input gate. The decoder then decodes the hidden states of the LSTM units and generates a probabilistic reconstruction in the form of a voxel occupancy map.
The 3D-LSTM allows the network to retain what it has seen and update its memory when it sees a new image. It is able to effectively handle object self-occlusions when multiple views are fed to the network. At each time step, it selectively updates the memory cells that correspond to parts that became visible while retaining the states of the other parts.
LSTMs and RNNs are time-consuming since the input images are processed sequentially without parallelization. Also, when given the same set of images in different orders, RNNs are unable to estimate the 3D shape of an object consistently, due to permutation variance. To overcome these limitations, Xie et al.  introduced Pix2Vox, which is composed of multiple encoder-decoder blocks running in parallel, each of which predicts a coarse volumetric grid from its input frame. This eliminates the effect of the order of the input images and accelerates the computation. A context-aware fusion module then selects high-quality reconstructions from the coarse 3D volumes and generates a fused 3D volume, which fully exploits the information of all input images without long-term memory loss.
In addition to their architectures, the performance of deep learning networks depends on the way they are trained. This section discusses the various supervisory modes (Section 7.1) and training procedures that have been used in the literature (Section 7.3).
Early methods rely on 3D supervision (Section 7.1.1). However, obtaining ground-truth 3D data, whether manually or using traditional 3D reconstruction techniques, is extremely difficult and expensive. As such, recent techniques try to minimize the amount of 3D supervision by exploiting other supervisory signals such as consistency across views (Section 7.1.2).
Supervised methods require training images paired with their corresponding ground-truth 3D shapes. The training process then minimizes a loss function that measures the discrepancy between the reconstructed 3D shape and the corresponding ground-truth 3D model. The discrepancy is measured using loss functions, which are required to be differentiable so that gradients can be computed. Examples of such functions include:
(1) Volumetric loss. It is defined as the distance between the reconstructed and ground-truth volumes:
$$\mathcal{L}_{vol}(\hat{V}, V) = d(\hat{V}, V).$$
Here, $d$ can be the $L_2$ distance between the two volumes or the negative Intersection over Union (see Equation (16)). Both metrics are suitable for binary occupancy grids and TSDF representations. For probabilistic occupancy grids, the cross-entropy loss is the most commonly used:
$$\mathcal{L}_{vol} = -\frac{1}{N}\sum_{i=1}^{N}\left[p_i \log \hat{p}_i + (1 - p_i)\log(1 - \hat{p}_i)\right].$$
Here, $p_i$ is the ground-truth probability of voxel $i$ being occupied, $\hat{p}_i$ is the estimated probability, and $N$ is the number of voxels.
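The probabilistic-occupancy cross-entropy described above reduces to a few lines; the function below is a generic sketch (names are illustrative, not taken from any cited implementation):

```python
import numpy as np

def voxel_cross_entropy(p_gt, p_pred, eps=1e-7):
    """Binary cross-entropy between a ground-truth occupancy grid and a
    predicted probabilistic occupancy grid, averaged over the N voxels."""
    p_pred = np.clip(p_pred, eps, 1.0 - eps)  # avoid log(0)
    return -np.mean(p_gt * np.log(p_pred) + (1.0 - p_gt) * np.log(1.0 - p_pred))
```

For example, a completely uninformative prediction of 0.5 for every voxel yields a loss of $\log 2$ regardless of the ground truth, while a perfect prediction drives the loss to zero.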
(2) Point set loss. When using point-based representations, the reconstruction loss can be measured using the Earth Mover's distance (EMD) [55, 68] or the Chamfer Distance (CD) [55, 68]. The EMD is defined as the minimum of the sum of distances between a point in one set and a point in the other set over all possible permutations of the correspondences. More formally, given two sets of points $S_1$ and $S_2$, the EMD is defined as:
$$d_{EMD}(S_1, S_2) = \min_{\phi: S_1 \to S_2} \sum_{x \in S_1} \|x - \phi(x)\|_2.$$
Here, $\phi(x)$ is the point of $S_2$ assigned to $x$ by the bijection $\phi$. The CD loss, on the other hand, is defined as:
$$d_{CD}(S_1, S_2) = \frac{1}{|S_1|}\sum_{x \in S_1} \min_{y \in S_2}\|x - y\|_2^2 + \frac{1}{|S_2|}\sum_{y \in S_2} \min_{x \in S_1}\|x - y\|_2^2.$$
$|S_1|$ and $|S_2|$ are, respectively, the sizes of $S_1$ and $S_2$. The CD is computationally easier than the EMD since it uses sub-optimal (nearest-neighbour) matching to determine the pairwise relations.
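As an illustration, a brute-force Chamfer distance over two small point sets can be computed directly from the pairwise squared distances:

```python
import numpy as np

def chamfer_distance(s1, s2):
    """Chamfer distance between two point sets of shape (n1, 3) and (n2, 3):
    for each point, the squared distance to its nearest neighbour in the
    other set, averaged over each set and summed."""
    d = np.sum((s1[:, None, :] - s2[None, :, :]) ** 2, axis=-1)  # pairwise squared distances
    return d.min(axis=1).mean() + d.min(axis=0).mean()
```

This O(n1 x n2) formulation is fine for small clouds; practical implementations use spatial data structures or GPU batching for the nearest-neighbour search.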
(3) Learning to generate multiple plausible reconstructions. 3D reconstruction from a single image is an ill-posed problem; for a given input there may be multiple plausible reconstructions. Fan et al. proposed the Min-of-N (MoN) loss to train neural networks to generate a distributional output. The idea is to use random vectors drawn from a certain distribution to perturb the input. The network learns to generate a plausible 3D shape from each perturbation of the input. It is trained using a loss defined as follows:
$$\mathcal{L}_{MoN} = \sum_{k} \min_{r_j \sim \mathcal{N}(0, \mathbf{I})} \mathcal{L}\left(X_{r_j}, X_{gt}\right).$$
Here, $X_{r_j}$ is the reconstructed 3D point cloud after perturbing the input with the random vector $r_j$ sampled from the multivariate normal distribution $\mathcal{N}(0, \mathbf{I})$, $X_{gt}$ is the ground-truth point cloud, and $\mathcal{L}$ is a reconstruction loss, which can be any of the loss functions defined above. At run time, various plausible reconstructions can be generated from a given input by sampling different random vectors from $\mathcal{N}(0, \mathbf{I})$.
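A minimal sketch of the MoN objective for one training example; `decoder` is a hypothetical function mapping an image feature and a noise vector to a point cloud, and `recon_loss` can be any of the reconstruction losses above:

```python
import numpy as np

def min_of_n_loss(decoder, image_feat, gt_points, recon_loss, n=5, rng=None):
    """Min-of-N loss: perturb the input with n random normal vectors, decode a
    candidate point cloud from each perturbation, and keep only the smallest
    reconstruction loss, so the network is rewarded for covering the
    ground truth with at least one of its samples."""
    rng = np.random.default_rng(0) if rng is None else rng
    losses = [recon_loss(decoder(image_feat, rng.standard_normal(image_feat.shape)),
                         gt_points)
              for _ in range(n)]
    return min(losses)
```

Increasing `n` can only decrease (or keep) the loss for a fixed sequence of noise draws, which reflects the intuition that more samples give the network more chances to explain the ground truth.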
Obtaining 3D ground-truth data for supervision is an expensive and tedious process, even at a small scale. Obtaining multiview 2D or 2.5D images for training is, however, relatively easy. Methods in this category use the fact that if the estimated 3D shape is as close as possible to the ground truth, then the discrepancy between views of the original 3D model and the projections of the reconstructed 3D model onto any of these views is also minimized. Implementing this idea requires defining a projection operator, which renders the reconstructed 3D model from a given viewpoint (Section 7.1.2), and a loss function that measures the reprojection error (Section 7.1.2).
Techniques from projective geometry can be used to render views of a 3D object. However, to enable end-to-end training without gradient approximation, the projection operator should be differentiable. Gadelha et al. introduced a differentiable projection operator defined as $P((i,j), V) = 1 - e^{-\sum_k V(i,j,k)}$, where $V$ is the 3D voxel grid. This operator sums up the voxel occupancy values along each line of sight. However, it assumes an orthographic projection. Loper and Black introduced OpenDR, an approximate differentiable renderer, which is suitable for orthographic and perspective projections.
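To make the line-of-sight idea concrete, here is a minimal orthographic projection that sums occupancies along one axis and squashes the result into $[0, 1)$; the exponential squashing follows the spirit of Gadelha et al.'s operator, but this is a sketch, not their implementation:

```python
import numpy as np

def project_orthographic(voxels):
    """Differentiable orthographic projection of an occupancy grid: sum the
    occupancies along each line of sight (the last axis) and squash with
    1 - exp(-x), producing a soft silhouette in [0, 1)."""
    return 1.0 - np.exp(-voxels.sum(axis=2))
```

An empty line of sight projects to exactly 0, while any occupied voxels along the ray push the pixel value smoothly towards 1, so gradients flow back to every voxel on the ray.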
Petersen et al. introduced a novel smooth differentiable renderer for image-to-geometry reconstruction. The idea is that instead of taking a discrete decision of which triangle is visible from a pixel, the approach softly blends their visibility. Taking the weighted SoftMin of the $z$-positions in camera space constitutes a smooth z-buffer, which leads to a smooth renderer where the $z$-positions of triangles are differentiable with respect to occlusions. In previous renderers, only the $x, y$-coordinates were locally differentiable with respect to occlusions.
Finally, instead of using fixed renderers, Rezende et al.  proposed a learned projection operator, or a learnable camera, which is built by first applying an affine transformation to the reconstructed volume, followed by a combination of 3D and 2D convolutional layers, which map the 3D volume onto a 2D image.
There are several loss functions that have been proposed for 3D reconstruction using 2D supervision. We classify them into two main categories: (1) silhouette-based and (2) normal and depth-based loss functions.
(1) Silhouette-based loss functions. The idea is that a 2D silhouette projected from the reconstructed volume, under certain camera intrinsic and extrinsic parameters, should match the ground-truth 2D silhouette of the input image. The discrepancy, which is inspired by space carving, is then:
$$\mathcal{L}_{proj} = \sum_{j=1}^{n} d\left(P(\hat{X}; \alpha_j, \beta_j), S_j\right),$$
where $S_j$ is the $j$-th ground-truth 2D silhouette of the original 3D object $X$, $n$ is the number of silhouettes or views used for each 3D model, $P$ is a 3D-to-2D projection function, and $\alpha_j, \beta_j$ are the camera parameters of the $j$-th silhouette. The distance metric $d$ can be the standard $L_2$ metric, the negative Intersection over Union (IoU) between the true and reconstructed silhouettes, or the binary cross-entropy loss [23, 5].
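The silhouette loss with the negative-IoU metric can be sketched as follows; `project` is a hypothetical stand-in for the camera-dependent projection of the reconstructed volume onto view `j`:

```python
import numpy as np

def neg_iou_loss(pred_sil, gt_sil, eps=1e-7):
    """Negative Intersection-over-Union between a predicted (soft) silhouette
    and a binary ground-truth silhouette; minimising it maximises overlap."""
    inter = np.sum(pred_sil * gt_sil)
    union = np.sum(pred_sil + gt_sil - pred_sil * gt_sil)
    return -(inter / (union + eps))

def silhouette_loss(voxels, gt_sils, project):
    """Sum of per-view discrepancies between projections of the reconstructed
    volume and the n ground-truth silhouettes."""
    return sum(neg_iou_loss(project(voxels, j), s) for j, s in enumerate(gt_sils))
```

With a perfect reconstruction, each view contributes a loss of -1, so the total reaches its minimum of -n.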
Kundu et al. introduced the render-and-compare loss, which is defined in terms of the IoU between the ground-truth silhouette $S$ and the rendered silhouette $\hat{S}$, and the $L_2$ distance between the ground-truth depth $D$ and the rendered depth $\hat{D}$, i.e.,
$$\mathcal{L}_{rc} = 1 - \text{IoU}(\hat{S}, S; M_S) + d_{L_2}(\hat{D}, D; M_D).$$
Here, $M_S$ and $M_D$ are binary ignore masks that have a value of one at pixels which do not contribute to the loss. Since this loss is not differentiable, Kundu et al. used finite differences to approximate its gradients.
Silhouette-based loss functions cannot distinguish between some views, e.g., front and back. To alleviate this issue, Insafutdinov and Dosovitskiy  use multiple pose regressors during training, each one using silhouette loss. The overall network is trained with the min of the individual losses. The predictor with minimum loss is used at test time.
Gwak et al. minimize the reprojection error subject to the reconstructed shape being a valid member of a certain class, e.g., chairs. To constrain the reconstruction to remain in the manifold of the shape class, the approach defines a barrier function $\Phi$, which is set to $1$ if the shape is in the manifold and $0$ otherwise. The loss then augments the reprojection error with the log-barrier term $-\log \Phi(\hat{X})$, which diverges when the reconstruction $\hat{X}$ leaves the manifold.
The barrier function is learned as the discriminative function of a GAN, see Section 7.3.2.
Finally, Tulsiani et al. define the reprojection loss using a differentiable ray consistency loss for volumetric reconstruction. The approach assumes that the estimated shape is defined in terms of a probability occupancy grid. Let $(O, C)$ be an observation-camera pair, and let $\{r\}$ be a set of rays, where each ray $r$ has the camera center of $C$ as its origin and is cast through the image plane of the camera $C$. The ray consistency loss is then defined as:
$$\mathcal{L}_{ray}(\hat{X}; (O, C)) = \sum_{r} \mathcal{L}_r(\hat{X}),$$
where $\mathcal{L}_r(\hat{X})$ captures whether the inferred 3D model correctly explains the observations associated with the specific ray $r$. If the observation $O$ is a ground-truth foreground mask taking the value $1$ at foreground pixels and $0$ elsewhere, then $\mathcal{L}_r$ is the probability that the ray $r$ hits a surface voxel, weighted by the mask value at the pixel associated with the ray $r$. This loss is differentiable with respect to the network predictions. Note that when using foreground masks as observations, this loss, which requires known camera parameters, is similar to the approaches designed specifically for mask supervision, where a learned or a fixed reprojection function is used. Also, the binary cross-entropy loss used in [5, 23] can be thought of as an approximation of the one derived using ray consistency.
(2) Surface normal and depth-based loss. Additional cues such as surface normals and depth values can be used to guide the training process. Let $\mathbf{n} = (n_x, n_y, n_z)$ be a normal vector to the surface at a point $(x, y, z)$. The vectors $(0, -n_z, n_y)$ and $(-n_z, 0, n_x)$ are orthogonal to $\mathbf{n}$. By normalizing them, we obtain two vectors $\mathbf{n}_1$ and $\mathbf{n}_2$. The projected surface normal loss encourages the voxels reached from $(x, y, z)$ along $\mathbf{n}_1$ and $\mathbf{n}_2$ to be occupied, so that the reconstructed surface matches the estimated surface normals. This constraint only applies when the target voxels are inside the estimated silhouette.
This loss has been used by Wu et al., who, in addition to the normal loss, also include a projected depth loss. The idea is that the voxel at the estimated depth $d(x, y)$ along the ray through pixel $(x, y)$ should be occupied ($1$), and all voxels in front of it should be empty ($0$). This ensures that the estimated 3D shape matches the estimated depth values.
(3) Combining multiple losses. One can also combine a 2D loss with a 3D loss. This is particularly useful when some ground-truth 3D data is available. One can, for example, first train the network using 3D supervision, and then fine-tune it using 2D supervision. Yan et al., on the other hand, take a weighted sum of a 2D and a 3D loss.
In addition to the reconstruction loss, one can impose additional constraints on the solution. For instance, Kato et al. used a weighted sum of a silhouette loss, defined as the negative intersection over union (IoU) between the true and reconstructed silhouettes, and a smoothness loss. For surfaces, the smoothness loss ensures that the angles between adjacent faces are close to $180°$, encouraging flatness.
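A simplified version of such a flatness-encouraging smoothness term can be written over the unit normals of adjacent face pairs; this is a sketch of the idea, not Kato et al.'s exact formulation:

```python
import numpy as np

def flatness_loss(normals_a, normals_b):
    """Simplified smoothness loss over pairs of adjacent faces, given their
    unit normals as (n, 3) arrays.  When the dihedral angle is 180 degrees
    (coplanar faces) the two normals coincide and the term vanishes; the
    more the surface folds, the larger the penalty."""
    cos = np.sum(normals_a * normals_b, axis=-1)  # cosine between paired normals
    return np.mean((1.0 - cos) ** 2)
```

In a training loop this term would be weighted and added to the silhouette loss, trading reprojection fidelity against surface smoothness.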
Reprojection-based loss functions use the camera parameters to render the estimated 3D shape onto image planes. Some methods assume the availability of one or multiple observation-camera pairs [9, 11, 5]. Here, the observation can be an RGB image, a silhouette/foreground mask or a depth map of the target 3D shape. Other methods optimize at the same time for the camera parameters and the 3D reconstruction that best describe the input [26, 73].
Gadelha et al. encode an input image into a latent representation and a pose code using fully-connected layers. The pose code is then used as input to the 2D projection module, which renders the estimated 3D volume onto the view of the input. Insafutdinov and Dosovitskiy, on the other hand, take two views of the same object, and predict the corresponding shape (represented as a point cloud) from the first view and the camera pose (represented as a quaternion) from the second one. The approach then uses a differentiable projection module to generate the view of the predicted shape from the predicted camera pose. The shape and pose predictor is implemented as a convolutional network with two branches. The network starts with a convolutional encoder followed by shared fully connected layers, after which it splits into two branches for shape and pose prediction. The pose branch is implemented as a multi-layer perceptron.
There have been a few papers that only estimate the camera pose [94, 95, 66]. Unlike techniques that simultaneously perform reconstruction, these approaches are trained with pose annotations only. For instance, Kendall et al. introduced PoseNet, a convolutional neural network which estimates the camera pose from a single image. The network, which represents the camera pose by its location vector and orientation quaternion, is trained to minimize the loss between the ground-truth and the estimated pose. Su et al., on the other hand, found that CNNs trained for viewpoint estimation of one class do not perform well on another class, possibly due to the large geometric variation between classes. As such, they proposed a network architecture where the lower layers (both convolutional and fully connected) are shared by all classes, while class-dependent fully-connected layers are stacked on top of them.
Another approach to significantly lower the level of supervision required to learn the 3D geometry of objects is to replace 3D supervision with motion. To this end, Novotny et al. used Structure-from-Motion (SfM) to generate a supervisory signal from videos. That is, at training time, the approach takes a video sequence and generates a partial point cloud and the relative camera parameters using SfM. Each RGB frame is then processed with a network that estimates a depth map, an uncertainty map, and the camera parameters. The different depth estimates are fused, using the estimated camera parameters, into a partial point cloud, which is further processed for completion using the point cloud completion network PointNet. The network is trained using the SfM estimates as supervisory signals; the loss functions measure the discrepancy between the depth maps estimated by the network and those estimated by SfM, and between the camera parameters estimated by the network and those estimated by SfM. At test time, the network is able to recover full 3D geometry from a single RGB image.
In addition to the datasets, loss functions, and degree of supervision, there are a few practical aspects that one needs to consider when training deep learning architectures for 3D reconstruction.
Fig. 5: (a) Joint 2D-3D embedding. (b) TL-network. (c) 3D-VAE-GAN architecture.
Most of the state-of-the-art works map the input (e.g., RGB images) into a latent representation, and then decode the latent representation into a 3D model. A good latent representation should be (1) generative in 3D, i.e., we should be able to reconstruct objects in 3D from it, and (2) it must be predictable from 2D, i.e., we should be able to easily infer this representation from images . Achieving these two goals has been addressed by using TL-embedding networks during the training phase, see Fig. 5-(a) and (b). It is composed of two jointly trained encoding branches: the 2D encoder and the 3D encoder. They map, respectively, a 2D image and its corresponding 3D annotation into the same point in the latent space [24, 23].
Girdhar et al., who use the TL-embedding network to reconstruct volumetric shapes from RGB images, train the network using batches of (image, voxel) pairs. The images are generated by rendering the 3D models, and the network is then trained in a three-stage procedure.
In the first stage, the 3D encoder part of the network and its decoder are initialized at random. They are then trained, end-to-end with the sigmoid cross-entropy loss, independently of the 2D encoder.
In the second stage, the 2D encoder is trained to regress the latent representation: the 3D encoder generates the embedding of the voxel grid, and the image network is trained to regress that embedding.
The final stage jointly fine-tunes the entire network.
In general, a good reconstruction model should be able to go beyond what has been seen during training. Networks trained with standard procedures may not generalize well to unseen data. Also, Yang et al. noted that the results of standard techniques tend to be grainy and lack fine details. To overcome these issues, several recent papers train their networks with an adversarial loss by using Generative Adversarial Networks (GAN), which generate a signal from a given random vector. Conditional GANs, on the other hand, condition the generated signal on the input image(s), see Fig. 5-(c). The architecture consists of a generator $G$, which mirrors the encoder, and a discriminator $D$, which mirrors the generator.
In the case of 3D reconstruction, the encoder can be a ConvNet/ResNet [99, 42] or a variational auto-encoder (VAE). The generator decodes the latent vector $\mathbf{x}$ into a 3D shape $\hat{X}$. The discriminator, which is only used during training, evaluates the authenticity of the decoded data. It outputs a confidence between $0$ and $1$ of whether the 3D object is real or synthetic, i.e., coming from the generator. The goal is to jointly train the generator and the discriminator to make the reconstructed shape as close as possible to the ground truth.
Central to GANs is the adversarial loss function used to jointly train the discriminator and the generator. Following Goodfellow et al., Wu et al. use the binary cross-entropy as the classification loss. The overall adversarial loss function is defined as:
$$\mathcal{L}_{GAN} = \log D(X) + \log\left(1 - D(G(I))\right).$$
Here, $I$ is the 2D image(s) of the training shape $X$. Yang et al. [99, 42] observed that the original GAN loss function presents an overall loss for both real and fake input. They then proposed to use the WGAN-GP loss [100, 101], which separately represents the loss for generating fake reconstruction pairs and the loss for discriminating fake and real reconstruction pairs; see [100, 101] for the details.
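The classification-style adversarial objective can be sketched with a small helper; this is a generic binary cross-entropy GAN loss over batches of discriminator outputs, not the exact implementation of any cited paper:

```python
import numpy as np

def adversarial_loss(d_real, d_fake, eps=1e-7):
    """Adversarial objective log D(X) + log(1 - D(G(I))) averaged over a
    batch: d_real holds discriminator confidences on real shapes, d_fake on
    reconstructed shapes.  The discriminator maximises this quantity; the
    generator minimises the second term."""
    d_real = np.clip(d_real, eps, 1.0 - eps)  # keep log() finite
    d_fake = np.clip(d_fake, eps, 1.0 - eps)
    return np.mean(np.log(d_real)) + np.mean(np.log(1.0 - d_fake))
```

A perfectly confident discriminator drives the objective towards its maximum of 0; a maximally confused one (0.5 everywhere) yields $2\log 0.5 \approx -1.386$.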
To jointly train the three components of the network, i.e., the encoder, the generator, and the discriminator, the overall loss is defined as the sum of the reconstruction loss, see Section 7.1, and the GAN loss. When the network uses a variational auto-encoder, e.g., the 3D VAE-GAN , then an additional term is added to the overall loss in order to push the variational distribution towards the prior distribution. For example, Wu et al.  used a KL-divergence metric, and a multivariate Gaussian distribution with zero-mean and unit variance as a prior distribution.
GANs have been used for volumetric [17, 38, 99, 42, 14, 29] and point cloud [70, 71] reconstruction. They have been used with 3D supervision [17, 38, 99, 42, 29] and with 2D supervision as in [26, 93, 14] where the reconstruction error is measured using the reprojection loss, see Section 7.1.2. Their potential is huge, because they can learn to mimic any distribution of data. They are also very suitable for single-view 3D shape reconstruction, which is challenging, because among the many possible shapes that explain an observation, most are implausible and do not correspond to natural objects . Also, among plausible shapes, there are still multiple shapes that fit the 2D image equally well. To address this ambiguity, Wu et al.  used the discriminator of the GAN to penalize the 3D estimator if the predicted 3D shape is unnatural.
GANs are hard to train, especially for the complex joint data distribution over 3D objects of many categories and orientations. They also become unstable for high-resolution shapes. In fact, one must carefully balance the learning of the generator and the discriminator, otherwise the gradients can vanish, which will prevent improvement . To address this issue, Smith and Meger  and later Wu et al.  used as a training objective the Wasserstein distance normalized with the gradient penalization.
Jointly training for reconstruction and segmentation leads to improved performance in both tasks, when compared to training for each task individually. Mandikal et al.  proposed an approach, which generates a part-segmented 3D point cloud from one RGB image. The idea is to enable propagating information between the two tasks so as to generate more faithful part reconstructions while also improving segmentation accuracy. This is done using a loss defined as a weighted sum of a reconstruction loss, defined using the Chamfer distance, and a segmentation loss. The latter is defined using the symmetric softmax cross-entropy loss.
Image-based 3D reconstruction is an important problem and a building block for many applications, ranging from robotics and autonomous navigation to graphics and entertainment. While some of these applications deal with generic objects, many of them deal with objects that belong to specific classes such as human bodies or body parts (e.g., faces and hands), animals in the wild, and cars. While the techniques described above can be applied to these specific classes of shapes, we advocate that the quality of the reconstruction can be significantly improved by designing customised methods that leverage prior knowledge of the shape class. In this section, we briefly summarize recent developments in the image-based 3D reconstruction of human body shapes (Section 8.1), and body parts such as faces (Section 8.2).
3D static and dynamic digital humans are essential for a variety of applications, ranging from gaming and visual effects to free-viewpoint videos. However, high-end capture solutions use a large number of cameras and active sensors, and are restricted to professional settings as they operate under controlled lighting and studio conditions. With the advent of deep learning techniques, several papers have explored more lightweight solutions that are able to recover 3D human shape and pose from a few RGB images. We can distinguish two classes of methods: (1) volumetric methods (Section 4), and (2) template- or parametric-based methods (Section 5.2).
Parametric methods regularize the problem using statistical models. The problem of human body shape reconstruction then boils down to estimating the parameters of the model. Popular models include morphable models , SCAPE , and SMPL .
Dibra et al.  used an encoder followed by three fully connected layers which regress the SCAPE parameters from one or multiple silhouette images. Later, Dibra et al.  first learn a common embedding of 2D silhouettes and 3D human body shapes (see Section 7.3.1). The latter are represented using their Heat Kernel Signatures . Both methods can only predict naked body shapes in nearly neutral poses.
Early methods focus on 3D human body shapes in static poses. Bogo et al. proposed SMPLify, the first method to estimate 3D human pose and shape from a single image. They first used a CNN-based architecture, DeepCut, to estimate the 2D joint locations. They then fit the 3D generative model SMPL to the predicted 2D joints, yielding an estimate of the 3D human body pose and shape. The fitting minimizes an objective function of five terms: a joint-based data term, three pose priors, and a shape prior. Experimental results show that this method is effective in reconstructing 3D human bodies in arbitrary poses.
Kanazawa et al. , on the other hand, argue that such a stepwise approach is not optimal and propose an end-to-end solution to learn a direct mapping from image pixels to model parameters. This approach addresses two important challenges: (1) the lack of large scale ground truth 3D annotations for in-the-wild images, and (2) the inherent ambiguities in single-view 2D-to-3D mapping of human body shapes. An example is depth ambiguity where multiple 3D body configurations explain the same 2D projections . To address the first challenge, Kanazawa et al. observe that there are large-scale 2D keypoint annotations of in-the-wild images and a separate large-scale dataset of 3D meshes of people with various poses and shapes. They then take advantage of these unpaired 2D keypoint annotations and 3D scans in a conditional generative adversarial manner. They propose a network that infers the SMPL  parameters of a 3D mesh and the camera parameters such that the 3D keypoints match the annotated 2D keypoints after projection. To deal with ambiguities, these parameters are sent to a discriminator whose task is to determine if the 3D parameters correspond to bodies of real humans or not. Hence, the network is encouraged to output parameters on the human manifold. The discriminator acts as a weak supervisor.
This approach can handle complex poses in images with complex backgrounds, but it is limited to a single person per image and does not handle clothing. These are better handled using volumetric techniques, which, in general, do not incorporate class-specific knowledge. An example is the work of Huang et al., which takes multiple RGB views and their corresponding camera calibration parameters as input, and predicts a dense 3D field that encodes, for each voxel, its probability of being inside or outside the human body shape. The surface geometry can then be faithfully reconstructed from the 3D probability field using marching cubes. The approach uses a multi-branch encoder, one branch per image, followed by a multi-layer perceptron which aggregates the features that correspond to the same 3D point into a probability value. The approach is able to recover detailed geometry, even on clothed bodies, but it is limited to simple backgrounds.
Detailed and dense image-based 3D reconstruction of the human face, which aims to recover shape, pose, expression, skin reflectance, and finer scale surface details, is a longstanding problem in computer vision and computer graphics. Recently, this problem has been formulated as a regression problem and solved using convolutional neural networks.
In this section, we review some of the representative papers. Most of the recent techniques use parametric representations, which parametrize the manifold of 3D faces. The most commonly used representation is the 3D morphable model (3DMM) of Blanz and Vetter, which is an extension of the 2D active appearance model (see also Section 5.2.1). The model captures facial variability in terms of geometry and texture. Gerig et al. extended the model by including expressions as a separate space. Below, we discuss the various network architectures (Section 8.2.1) and their training procedures (Section 8.2.2). We will also discuss some of the model-free techniques (Section 8.2.3).
The backbone architecture is an encoder, which maps the input image into the parameters of the parametric model. It is composed of convolutional layers followed by fully connected layers. In general, existing techniques use generic networks such as AlexNet, or networks specifically trained on facial images such as VGG-Face or FaceNet. Tran et al. use this architecture to regress the parameters of a 3DMM that encodes facial identity (geometry) and texture. It has been trained with 3D supervision using an asymmetric loss, i.e., a loss function that favours 3D reconstructions that are far from the mean.
Richardson et al. used a similar architecture but perform the reconstruction iteratively. At each iteration, the network takes the previously reconstructed face, projected onto an image using a frontal camera, together with the input image, and regresses the parameters of a 3DMM. The reconstruction is initialized with the average face. Results show that, with three iterations, the approach can successfully handle face reconstruction from images with various expressions and illumination conditions.
One of the main issues with 3DMM-based approaches is that they tend to reconstruct smooth facial surfaces, which lack fine details such as wrinkles and dimples. As such, methods in this category use a refinement module to recover the fine details. For instance, Richardson et al. refine the reconstructed face using Shape-from-Shading (SfS) techniques. Richardson et al., on the other hand, add a second refinement block, FineNet, which takes as input the depth map of the coarse estimation and recovers a high-resolution facial depth map using an encoder-decoder network. To enable end-to-end training, the two blocks are connected by a differentiable rendering layer. Unlike traditional SfS, the introduction of FineNet treats the calculation of the albedo and lighting coefficients as part of the loss function without explicitly estimating this information. However, lighting is modeled by first-order spherical harmonics, which leads to inaccurate detail reconstruction.
One of the main challenges is how to collect enough training images labelled with their corresponding 3D faces to feed the network. Richardson et al. [118, 119] generate synthetic training data by drawing random samples from the morphable model and rendering the resulting faces. However, a network trained on purely synthetic data may perform poorly when faced with occlusions, unusual lighting, or ethnicities that are not well represented. Genova et al. address the lack of training data by including randomly generated synthetic faces in each training batch to provide ground-truth 3D coordinates, while training the network on real photographs at the same time. Tran et al. use an iterative optimization to fit an expressionless model to a large number of photographs, and treat the results where the optimization converged as ground truth. To generalize to faces with expressions, identity labels and at least one neutral image are required; thus, the potential size of the training dataset is restricted.
Tewari et al.  train, without 3D supervision, an encoder-decoder network to simultaneously predict facial shape, expression, texture, pose, and lighting. The encoder is a regression network from images to morphable model coordinates, and the decoder is a fixed, differentiable rendering layer that attempts to reproduce the input photograph. The loss measures the discrepancy between the reproduced photograph and the input one. Since the training loss is based on individual image pixels, the network is vulnerable to confounding variation between related variables. For example, it cannot readily distinguish between dark skin tone and a dim lighting environment.
To remove the need for supervised training data and the reliance on inverse rendering, Genova et al. propose a framework that learns to minimize a loss based on the facial identity features produced by a face recognition network such as VGG-Face or Google's FaceNet. In other words, the face recognition network encodes both the input photograph and the image rendered from the reconstructed face into feature vectors that are robust to pose, expression, lighting, and even non-photorealistic inputs. The method then applies a loss that measures the discrepancy between these two feature vectors instead of using a pixel-wise distance between the rendered image and the input photograph. The 3D facial shape and texture regressor network is trained using only a face recognition network, a morphable face model, and a dataset of unlabelled facial images. The approach not only improves on the accuracy of previous works but also produces 3D reconstructions that are often recognizable as the original subjects.
One of the main limitations of 3DMM-based techniques is that they are confined to the modelled subspace; implausible reconstructions are possible outside the span of the training data. Other representations such as volumetric grids, which do not suffer from this problem, have also been explored in the context of 3D face reconstruction. Jackson et al., for example, propose a Volumetric Regression Network (VRN). The framework takes as input 2D images paired with their corresponding 3D binary volumes instead of a 3DMM. Unlike 3DMM-based methods, the approach can deal with a wide range of expressions, poses, and occlusions without alignment or correspondences. It, however, fails to recover fine details due to the resolution restriction of volumetric techniques.
Other techniques use intermediate representations. For example, Sela et al. used an Image-to-Image Translation Network based on U-Net to estimate a depth image and a facial correspondence map. Then, an iterative deformation-based registration is performed, followed by a geometric refinement procedure to reconstruct subtle facial details. Unlike 3DMM-based methods, this method can handle large geometric variations.
Feng et al.  also investigated a model-free method. First, a densely connected CNN framework is designed to regress 3D facial curves from horizontal and vertical epipolar plane images. Then, these curves are transformed into a 3D point cloud and the grid-fit algorithm  is used to fit a facial surface. Experimental results suggest that this approach is robust to varying poses, expressions and illumination.
Methods discussed so far are primarily dedicated to the 3D reconstruction of objects in isolation. Scenes with multiple objects, however, pose the additional challenges of delineating objects, properly handling occlusions, clutter, and uncertainty in shape and pose, and estimating the scene layout. Solutions to this problem involve 3D object detection and recognition, pose estimation, and 3D reconstruction. Traditionally, many of these tasks have been addressed using hand-crafted features. In the deep learning-based era, several of the blocks of the pipeline have been replaced with CNNs.
For instance, Izadinia et al. proposed an approach based on recognizing objects in indoor scenes, inferring the room geometry, and optimizing the 3D object poses and sizes in the room to best match synthetic renderings to the input photo. The approach detects object regions, retrieves the most similar shapes from a CAD database, and then deforms them to fit the input. The room geometry is estimated using a fully convolutional network. Both the detection and retrieval of objects are performed using Faster R-CNN. The deformation and fitting, however, are performed via render-and-match. Tulsiani et al., on the other hand, proposed an approach that is entirely based on deep learning. The input, which consists of an RGB image and the bounding boxes of the objects, is processed with a four-branch network. The first branch is an encoder-decoder with skip connections, which estimates the disparity of the scene layout. The second branch takes a low-resolution image of the entire scene and maps it into a latent space using a CNN followed by three fully-connected layers. The third branch, which has the same architecture as the second one, maps the image at its original resolution to convolutional feature maps, followed by ROI pooling to obtain features for the ROI. The fourth branch maps the bounding box location through fully connected layers. These three features are then concatenated and further processed with fully-connected layers followed by a decoder, which produces a voxel grid of the object in the ROI and its pose in the form of position, orientation, and scale. The method has been trained using synthetically-rendered images with their associated ground-truth 3D data.
This section discusses the performance of some key methods. We will describe the various datasets used to benchmark deep learning-based 3D shape reconstruction algorithms (Section 9.1), present the various performance criteria and metrics (Section 9.2), and discuss and compare the performance of some key methods (Section 9.3).
|Name||Data type||3D models||Images||Background|
|IKEA ||Real, indoor||Mesh||Real||Cluttered|
|Pix3D ||Real, indoor||Mesh||Real||Cluttered|
|PASCAL 3D+ ||Indoor/outdoor||Mesh||Real||Cluttered|
Table V summarizes the most commonly used datasets to train and evaluate the performance of deep learning-based 3D reconstruction algorithms. Unlike traditional techniques, training and evaluating deep-learning architectures for 3D reconstruction requires large amounts of annotated data. This annotated data should be in the form of pairs of natural images and their corresponding 3D shapes/scenes. Obtaining such data, and at large scale, is challenging. In fact, most of the existing datasets (see Table V) are not specifically designed to benchmark deep learning-based 3D reconstruction. For instance, ShapeNet and ModelNet, the largest 3D datasets currently available, contain 3D CAD models without their corresponding natural images. In other datasets such as IKEA, PASCAL 3D+, and ObjectNet3D, only a relatively small subset of the images are annotated with their corresponding 3D models.
This issue has been addressed in the literature by data augmentation or domain adaptation techniques. Data augmentation is the process of augmenting the original sets with synthetically-generated data. For instance, one can generate new instances of 3D models by applying geometric transformations, e.g., translation, rotation, and scaling, to existing ones. Note that, although some transformations are similarity-preserving, they still enrich the datasets. One can also synthetically render, from existing 3D models, new 2D and 2.5D (i.e., depth) views from various (random) viewpoints, poses, lighting conditions, and backgrounds. These renderings can also be overlaid with natural images or random textures. Also, instead of annotating natural images with 3D models, some papers annotate 2D images with segmentation masks, e.g., MS COCO. This is particularly useful for approaches that rely on 2D supervision, see Section 7.1.2.
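As an illustration of the geometric augmentations described above, the following sketch applies a random similarity transform (rotation, scaling, translation) to a point-sampled shape; the function names and parameter ranges here are illustrative, not taken from any surveyed method.

```python
import numpy as np

def augment(points, rng):
    """Apply a random similarity transform to an (N, 3) point-sampled shape."""
    theta = rng.uniform(0.0, 2.0 * np.pi)       # random rotation about the up axis
    c, s = np.cos(theta), np.sin(theta)
    rot = np.array([[c, 0.0, s],
                    [0.0, 1.0, 0.0],
                    [-s, 0.0, c]])
    scale = rng.uniform(0.8, 1.2)               # random uniform scaling
    shift = rng.uniform(-0.1, 0.1, size=3)      # random translation
    return scale * points @ rot.T + shift

rng = np.random.default_rng(0)
cube = rng.uniform(-0.5, 0.5, size=(1024, 3))   # stand-in for a sampled CAD model
print(augment(cube, rng).shape)                  # (1024, 3)
```

In practice, such transforms are combined with synthetic re-rendering under varied viewpoints, lighting, and backgrounds, as discussed above.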
Domain adaptation, on the other hand, is not commonly used for 3D reconstruction, with the exception of the recent work of Petersen et al. Unlike previous work, which can only train on images of a similar appearance to those rendered by a differentiable renderer, Petersen et al. introduced the Reconstructive Adversarial Network (RAN), which is able to train on different types of images.
Let $\mathcal{X}$ be the ground-truth 3D shape and $\hat{\mathcal{X}}$ the reconstructed one. The most commonly used quantitative metrics for evaluating the accuracy of 3D reconstruction algorithms include:
(1) The Mean Squared Error (MSE). It is defined as the symmetric surface distance between the reconstructed shape $\hat{\mathcal{X}}$ and the ground-truth shape $\mathcal{X}$, i.e.,
$$d(\mathcal{X}, \hat{\mathcal{X}}) = \frac{1}{n_{\mathcal{X}}} \sum_{p \in \mathcal{X}} d(p, \hat{\mathcal{X}}) + \frac{1}{n_{\hat{\mathcal{X}}}} \sum_{\hat{p} \in \hat{\mathcal{X}}} d(\hat{p}, \mathcal{X}).$$
Here, $n_{\mathcal{X}}$ and $n_{\hat{\mathcal{X}}}$ are, respectively, the number of densely sampled points on $\mathcal{X}$ and $\hat{\mathcal{X}}$, and $d(p, \mathcal{X})$ is the distance of a point $p$ to the surface $\mathcal{X}$ along the normal direction to $\mathcal{X}$. The smaller this measure is, the better the reconstruction.
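A brute-force sketch of the symmetric point-set distance described above, using nearest-neighbor Euclidean distances as a stand-in for the normal-direction distance (function and variable names are illustrative):

```python
import numpy as np

def symmetric_distance(X, Y):
    """Brute-force symmetric surface distance between (N,3) and (M,3) point sets."""
    d = np.linalg.norm(X[:, None, :] - Y[None, :, :], axis=-1)  # pairwise distances
    return d.min(axis=1).mean() + d.min(axis=0).mean()          # X->Y plus Y->X

X = np.zeros((4, 3))   # toy "ground-truth" samples
Y = np.ones((5, 3))    # toy "reconstructed" samples
print(symmetric_distance(X, Y))   # 2 * sqrt(3), approx. 3.464
```

For densely sampled shapes, a k-d tree would replace the quadratic pairwise computation.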
(2) Intersection over Union (IoU). The IoU measures the ratio of the intersection between the volume of the predicted shape and the volume of the ground truth to the union of the two volumes, i.e.,
$$\text{IoU} = \frac{\sum_i \mathbb{I}\{\hat{V}_i > \epsilon\}\,\mathbb{I}\{V_i > \epsilon\}}{\sum_i \mathbb{I}\{\mathbb{I}\{\hat{V}_i > \epsilon\} + \mathbb{I}\{V_i > \epsilon\}\}},$$
where $\mathbb{I}\{\cdot\}$ is the indicator function, $\hat{V}_i$ is the predicted value at the $i$-th voxel, $V_i$ is the ground truth, and $\epsilon$ is a threshold. The higher the IoU value, the better the reconstruction. This metric is suitable for volumetric reconstructions. Thus, when dealing with surface-based representations, the reconstructed and ground-truth 3D models need to be voxelized.
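The voxel-based IoU can be sketched as follows; the grids and threshold used here are purely illustrative:

```python
import numpy as np

def voxel_iou(pred, gt, threshold=0.5):
    """IoU between a predicted occupancy grid and a binary ground-truth grid."""
    p = pred > threshold
    g = gt > threshold
    return np.logical_and(p, g).sum() / np.logical_or(p, g).sum()

gt = np.zeros((4, 4, 4)); gt[:2] = 1.0         # ground truth fills half the grid
pred = np.zeros((4, 4, 4)); pred[1:3] = 0.9    # prediction overlaps half of it
print(voxel_iou(pred, gt))                      # intersection 16 / union 48 -> 1/3
```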
(3) Mean of Cross Entropy (CE) loss. It is defined as follows:
$$\text{CE} = -\frac{1}{N} \sum_{i=1}^{N} \left[ p_i \log \hat{p}_i + (1 - p_i) \log(1 - \hat{p}_i) \right],$$
where $N$ is the total number of voxels or points, depending on whether a voxel-based or a point-based representation is used, and $p_i$ and $\hat{p}_i$ are, respectively, the ground-truth and the predicted value at the $i$-th voxel or point. The lower the CE value, the better the reconstruction.
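A minimal sketch of the mean binary cross-entropy between ground-truth and predicted occupancies (the clipping constant is an implementation detail to avoid log(0)):

```python
import numpy as np

def mean_cross_entropy(gt, pred, eps=1e-7):
    """Mean binary cross-entropy between ground-truth and predicted occupancies."""
    pred = np.clip(pred, eps, 1.0 - eps)   # avoid log(0)
    return -np.mean(gt * np.log(pred) + (1.0 - gt) * np.log(1.0 - pred))

gt = np.array([1.0, 0.0, 1.0, 0.0])
good = np.array([0.9, 0.1, 0.8, 0.2])   # confident, mostly correct prediction
bad = np.array([0.2, 0.8, 0.3, 0.7])    # mostly wrong prediction
print(mean_cross_entropy(gt, good) < mean_cross_entropy(gt, bad))  # True
```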
In addition to these quantitative metrics, there are several qualitative aspects that are used to evaluate the efficiency of these methods. These include:
(1) Degree of 3D supervision. An important aspect of deep learning-based 3D reconstruction methods is the degree of 3D supervision they require at training. In fact, while obtaining RGB images is easy, obtaining their corresponding ground-truth 3D data is quite challenging. As such, techniques that require minimal or no 3D supervision are usually preferred over those that require ground-truth 3D information during training.
(2) Computation time. While training can be slow in general, it is desirable to achieve real-time performance at runtime.
(3) Memory footprint. Deep neural networks have a large number of parameters. Some of them operate on volumes using 3D convolutions. This would require large memory storage, which can affect their performance at runtime.
|Fig. 6: (a) IoU of volumetric methods. (b) IoU of surface-based methods. (c) CD of surface-based methods.|
The majority of early works resort to voxelized representations [8, 24, 134, 17, 5], which can represent both the surface and the internal details of complex objects of arbitrary topology. With the introduction of space partitioning techniques such as O-CNN, OGN, and OctNet, volumetric techniques can attain relatively high resolutions, thanks to a significant gain in memory efficiency; the octree-based OGN, for instance, substantially reduces the memory required to reconstruct large volumetric grids compared to dense representations (see Table VI). However, only a few papers have adopted these techniques, owing to the complexity of their implementation. To achieve high-resolution 3D volumetric reconstruction, many recent papers use intermediation, through multiple depth maps, followed by volumetric [42, 37, 47, 88] or point-based fusion.
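As a back-of-the-envelope illustration of why space partitioning helps, the memory of a dense float32 occupancy grid grows cubically with resolution:

```python
# Dense float32 occupancy grid: 4 bytes per voxel, cubic growth with resolution.
for r in (32, 64, 128, 256, 512):
    print(f"{r}^3 grid: {r**3 * 4 / 2**20:.3f} MiB")
```

Octree-based representations avoid most of this cost by allocating fine cells only near the surface, where the occupancy actually changes.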
Fig. 6 shows the evolution of the performance over the years, using the ShapeNet dataset as a benchmark. On the IoU metric, computed on volumetric grids, we can see that methods that use multiple views at training and/or at testing outperform those that are based solely on single views. Also, surface-based techniques, which emerged more recently (both mesh-based and point-based [68, 55]), slightly outperform volumetric methods. Mesh-based techniques, however, are limited to genus-0 surfaces or surfaces with the same topology as the template.
Fig. 6 shows that, since their introduction in 2017 by Yan et al., 2D supervision-based methods have significantly improved in performance. The IoU curves of Figures 6-(a) and (b), however, show that methods that use 3D supervision achieve slightly better performance. This can be attributed to the fact that 2D supervision methods use loss functions that are based on 2D binary masks and silhouettes; however, multiple 3D objects can explain the same 2D projections. This 2D-to-3D ambiguity has been addressed either by using multiple binary masks captured from multiple viewpoints, which can only reconstruct the visual hull and are thus limited in accuracy, or by using adversarial training [93, 87], which constrains the reconstructed 3D shapes to lie within the manifold of valid classes.
|Method||Training data||Test input||Background||Representation||Supervision||Architecture||(# params., mem.)|
|Xie et al. ||RGB, 3D GT||1 RGB||Clutter||Volumetric||3D||Encoder (VGG16),||(M, )|
|20 RGB||3D||Decoder, Refiner|
|Richter et al. ||RGB, 3D GT||1 RGB||Clean||Volumetric||3D||encoder, 2D decoder|
|Tulsiani et al. ||RGB, segmentation||1 RGB||Clutter||Volumetric + pose||2D (multiview)||Encoder, decoder,|
|Tatarchenko et al. ||RGB + 3D GT||1 RGB||Clutter||Volumetric||3D||Octree Generating||s||(, GB)|
|Tulsiani et al. ||RGB, silh., (depth)||1 RGB||Clutter||Volumetric||2D (multiview)||encoder-decoder|
|Wu et al. ||RGB, 3D GT||1 RGB||Clutter||Volumetric||D and D||TL, 3D-VAE +|
|Gwak et al. ||1 RGB, silh., pose||1 RGB||2D||GAN|
|1 RGB, silh., pose, U3D||1 RGB||Clutter||Volumetric||2D, U3D||GAN|
|5 RGB, silh., pose||1 RGB||2D||GAN|
|5 RGB, silh., pose, U3D||1 RGB||2D, U3D||GAN|
|Johnston et al. ||RGB, 3D GT||1 RGB||Clutter||Volumetric||3D||Encoder + IDCT||(, GB)|
|Volumetric||3D||Encoder + IDCT||(, GB)|
|Yan et al. ||RGB, silh., pose||1 RGB||Clean||Volumetric||2D||encoder-decoder|
|Choy et al. ||RGB, 3D GT||1 RGB||Clutter||Volumetric||3D||encoder-LSTM-||(M, GB)|
|Kato et al. ||RGB, silh., pose||1 RGB||Clutter||Mesh + texture||2D||Encoder-Decoder|
|RGB, silh., poses|
|Mandikal et al. ||RGB, 3D GT||1 RGB||Clean||Point cloud||3D||Conv + FC layers,||(M, )|
|Global to local|
|Jiang et al. ||RGB + 3D GT||1 RGB||Clutter||Point cloud||3D||(encoder-decoder),|
|Zeng et al. ||1 RGB, silh., pose, 3D||1 RGB||Clutter||Point cloud||3D + self||encoder-decoder|
|Kato et al. ||RGB, silh.,||1 RGB||Clutter||Mesh||2D||encoder + FC layers|
|Jack et al. ||1 RGB, FFD params||1 RGB||Clean||Mesh||3D||Encoder + FC layers|
|Groueix et al. ||RGB, 3D GT||1 RGB||Clean||Mesh||3D||Multibranch MLP|
|Wang et al. ||1 RGB, 3D GT||1 RGB||Clean||Mesh||3D||see Fig. 2 (left)|
|Mandikal et al. ||RGB, 3D GT||1 RGB||Clutter||Point cloud||3D||3D-VAE,|
|Soltani et al. ||20 silh., poses||1 silh.||Clean||20 depth maps||2D||3D-VAE|
|1 silh.||1 silh.||20 depth maps|
|Fan et al. ||RGB, 3D GT||1 RGB||Clutter||Point set||3D||see Fig. 3-(a)|
|Pontes et al. ||RGB, GT FFD||1 RGB||Clean||Mesh||3D||Encoder + FC layers|
|Kurenkov et al. ||RGB, 3D GT||1 RGB||Clean||Point cloud||3D||2D encoder, 3D encoder,|
In light of the extensive research undertaken in the past five years, image-based 3D reconstruction using deep learning techniques has achieved promising results. The topic, however, is still in its infancy and further developments are yet to be expected. In this section, we present some of the current issues and highlight directions for future research.
(1) Training data issue. The success of deep learning techniques depends heavily on the availability of training data. Unfortunately, the size of the publicly available datasets that include both images and their 3D annotations is small compared to the training datasets used in tasks such as classification and recognition . 2D supervision techniques have been used to address the lack of 3D training data. Many of them, however, rely on silhouette-based supervision and thus they can only reconstruct the visual hull. As such, we expect to see in the future more papers proposing new large-scale datasets, new weakly-supervised and unsupervised methods that leverage various visual cues, and new domain adaptation techniques where networks trained with data from a certain domain, e.g., synthetically rendered images, are adapted to a new domain, e.g., in-the-wild images, with minimum retraining and supervision. Research on realistic rendering techniques that are able to close the gap between real images and synthetically rendered images can potentially contribute towards addressing the training data issue.
(2) Generalization to unseen objects. Most of the state-of-the-art papers split a benchmark dataset, e.g., ShapeNet or Pix3D, into training, validation, and test subsets, and then report the performance on the test subset. However, it is not clear how these methods would perform on completely unseen object/image categories. In fact, the ultimate goal of 3D reconstruction is to be able to reconstruct any arbitrary 3D shape from arbitrary images. Learning-based techniques, however, perform well only on images and objects spanned by the training set. Some recent papers, e.g., Cherabier et al., started to address this issue. An interesting direction for future research would be to combine traditional and learning-based techniques to improve the generalization of the latter.
(3) Fine-scale 3D reconstruction. Current state-of-the-art techniques are able to recover the coarse 3D structure of shapes. Although recent works have significantly improved the resolution of the reconstruction by using refinement modules, they still fail to recover thin and small parts such as plants, hair, and fur.
(4) Specialized instance reconstruction. There is increasing interest in reconstruction methods that are specialized in specific classes of objects, such as human bodies and body parts, which we briefly covered in this survey, vehicles, animals, and buildings. Specialized methods can greatly benefit from prior knowledge, e.g., by using statistical shape models to optimize the network architecture and its training process. Kanazawa et al. have already addressed this problem by jointly learning class-specific morphable models and the reconstruction function from images. We expect in the future to see more synergy between advanced statistical shape models, which can capture bending, elasticity, and topological variabilities [136, 61, 62, 63, 137], and deep learning-based 3D reconstruction.
(5) Handling multiple objects in the presence of occlusions and cluttered backgrounds. Most of the state-of-the-art techniques deal with images that contain a single object. In-the-wild images, however, contain multiple objects of different categories. Previous works employ detection followed by reconstruction within regions of interest, with the detection and reconstruction modules operating independently of each other. These tasks, however, are inter-related and can benefit from each other if solved jointly. Towards this goal, two important issues should be addressed. The first is the lack of training data for multiple-object reconstruction. The second is the design of appropriate CNN architectures, loss functions, and learning methods, which is especially important for methods that are trained without 3D supervision; these generally use silhouette-based loss functions, which require accurate object-level segmentation.
(6) 3D video. This paper focused on 3D reconstruction from one or multiple images, but with no temporal correlation. There is, however, a growing interest in 3D video, i.e., 3D reconstruction of entire video streams where successive frames are temporally correlated. On one hand, the availability of a sequence of frames can improve the reconstruction, since one can exploit the additional information available in subsequent frames to disambiguate and refine the reconstruction at the current frame. On the other hand, the reconstruction should be smooth and consistent across frames.
(7) Towards full 3D scene parsing. Finally, the ultimate goal is to semantically parse a full 3D scene from one or multiple of its images. This requires joint detection, recognition, and reconstruction, as well as capturing and modeling spatial relationships and interactions between objects and between object parts. While there have been a few attempts to address this problem, they are mostly limited to indoor scenes with strong assumptions about the geometry and locations of the objects that compose the scene.
This paper provides a comprehensive survey of the developments over the past five years in the field of image-based 3D object reconstruction using deep learning techniques. We classified the state of the art into volumetric, surface-based, and point-based techniques. We then discussed methods in each category based on their input, the network architectures, and the training mechanisms they use. We have also discussed and compared the performance of some key methods.
This survey focused on methods that define 3D reconstruction as the problem of recovering the 3D geometry of objects from one or multiple RGB images. There are, however, many other related problems that share similar solutions. The closest topics include depth reconstruction from RGB images, which has been recently addressed using deep learning techniques, see the recent survey of Laga , 3D shape completion [134, 99, 25, 41, 27, 139, 140], 3D reconstruction from depth images , which can be seen as a 3D completion problem, 3D reconstruction and modelling from hand-drawn 2D sketches [141, 142], novel view synthesis [143, 144], and 3D shape structure recovery [79, 11, 28, 92]. These topics have been extensively investigated in the past five years and require separate survey papers.
Xian-Feng Han is supported by a China Scholarship Council (CSC) scholarship. This work was supported in part by ARC DP150100294 and DP150104251.
V. A. Knyaz, V. V. Kniaz, and F. Remondino, “Image-to-voxel model translation with conditional adversarial networks,” in ECCV, 2018.
E. Insafutdinov and A. Dosovitskiy, “Unsupervised learning of shape and pose with differentiable point clouds,” in NIPS, 2018, pp. 2802–2812.
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 4104–4113.