Recent improvements in methods for the acquisition and rendering of 3D models have resulted in consolidated repositories containing massive numbers of 3D shapes on the Internet. With the increased availability of 3D models, we have been seeing an explosion in the demand for processing, generation and visualization of 3D models in a variety of disciplines, such as medicine, architecture and entertainment. The techniques for matching, identification and manipulation of 3D shapes have become fundamental building blocks in modern computer vision and computer graphics systems. Due to the complexity and irregularity of 3D shape data, how to effectively represent 3D shapes remains a challenging problem. Thus, there have been extensive research efforts concentrating on how to process and generate 3D shapes based on different representations.
In early research on 3D shape representations, 3D objects were normally modeled with a global approach, such as constructive solid geometry and deformed superquadrics. Those approaches have several drawbacks when used for tasks like recognition and retrieval. First, when representing imperfect 3D shapes, including those with noise and incompleteness, which are common in practice, such representations may degrade matching performance. Second, their high dimensionality heavily burdens computation and tends to make models overfit. Hence, more sophisticated methods have been designed to extract representations of 3D shapes in a more concise, yet discriminative and informative form.
In this survey, we mainly review deep learning methods on 3D shape representations and discuss their disadvantages and advantages considering different application scenarios. We now give a brief summary of different 3D shape representation categories.
Depth and multi-view images can be used to represent 3D models in the 2D domain. The regular structure of images makes them efficient to process. Depending on whether depth maps are included, 3D shapes can be represented by RGB (color) or RGB-D (color and depth) images viewed from different viewpoints. Because of the influx of available depth data due to the popularity of 2.5D sensors, such as Microsoft Kinect and Intel RealSense, multi-view RGB-D images are widely used to represent real-world 3D shapes. The large body of existing image-based processing models can be leveraged with this representation. However, this kind of representation inevitably loses some geometric features.
A voxel is a 3D extension of the concept of a pixel. Similar to pixels in 2D, the voxel-based representation also has a regular structure in 3D space. The architectures of some neural networks which have been demonstrated useful in the 2D image field [krizhevsky2012imagenet, lecun2010convolutional] can be easily extended to the voxel form. Nevertheless, adding one dimension means that the data size grows cubically rather than quadratically with resolution. As the resolution increases, the required memory and computational costs increase dramatically, which restricts the representation to low resolutions when representing 3D shapes.
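To make this growth concrete, the following minimal sketch computes the memory footprint of a dense occupancy grid at several resolutions. It assumes one byte per voxel; the function name and the chosen resolutions are illustrative and are not taken from any cited method.

```python
def dense_grid_bytes(resolution: int, bytes_per_voxel: int = 1) -> int:
    """Memory needed by a dense resolution^3 voxel grid (assumed 1 byte per voxel)."""
    return resolution ** 3 * bytes_per_voxel


for r in (32, 64, 128, 256):
    mib = dense_grid_bytes(r) / 2 ** 20
    print(f"{r}^3 grid: {mib:.3f} MiB")
```

Under these assumptions, each doubling of the resolution multiplies the memory by eight.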
Surface-based representation describes 3D shapes by encoding their surfaces, which can also be regarded as 2-manifolds. Point clouds and meshes are both discretized forms of 3D shape surfaces. Point clouds use a set of sampled 3D point coordinates to represent the surface. They can be easily generated by scanners but are difficult to process due to their lack of order and connectivity information. Researchers use order-invariant operators such as the max pooling operator in deep neural networks [qi2017pointnet, qi2017pointnet++] to mitigate the lack of order. Meshes can depict higher quality 3D shapes with less memory and computational cost compared with point clouds and voxels. A mesh contains a vertex set and an edge set. Due to its graph-like nature, researchers have made attempts to build graph-based convolutional neural networks for coping with meshes. Some other methods regard meshes as the discretization of 2-manifolds. Moreover, meshes are more suitable for 3D shape deformation. One can deform a mesh model by transforming vertices while keeping the connectivity unchanged.
Implicit surface representation exploits implicit field functions, such as occupancy functions [mescheder2019occupancy] and signed distance functions [xu2019disn], to describe the surface of 3D shapes. The implicit functions learned by deep neural networks define the spatial relationship between points and surfaces. They provide a description with infinite resolution of 3D shapes with reasonable memory consumption, and are capable of representing shapes with changing topology. Nevertheless, implicit representations cannot reflect the geometric features of 3D shapes directly, and usually need to be transformed to explicit representations such as meshes. Most methods apply iso-surfacing, such as marching cubes [lorensen1987marching], which is an expensive operation.
Structured representation. One way to cope with complex 3D shapes is to decompose them into structure and geometric details, leading to structured representations. Recently, increasingly more methods regard a 3D shape as a collection of parts and organize them linearly or hierarchically. The structure of 3D shapes is processed by Recurrent Neural Networks (RNNs) [zou20173d], Recursive Neural Networks (RvNNs) [li2017grass] or other network architectures. Each part of the shape can be processed by unstructured models. The structured representation focuses on the relations (such as symmetry, supporting, being supported, etc.) between different parts within a 3D shape, which provides better description capability than alternative representations.
Deformation-based representation. Unlike rigid man-made 3D shapes such as chairs and tables, there are also a large number of non-rigid (e.g. articulated) 3D shapes such as human bodies, which also play an important role in computer animation, augmented reality, etc. The deformation-based representation is proposed mainly for describing the intrinsic deformation properties while ignoring the extrinsic transformation properties. Many methods use rotation-invariant local features for describing shape deformation to reduce the distortion and keep the geometry details at the same time.
Recently, deep learning has achieved superior performance compared with classical methods in many fields, including 3D shape analysis, reconstruction, etc. A variety of deep network architectures have been designed to process or generate 3D shape representations, which we refer to as geometry learning. In the following sections, we focus on the most recent deep learning based methods for representing and processing 3D shapes in different forms. According to how the representation is encoded and stored, our survey is organized as follows: Section 2 reviews image-based shape representation methods. Sections 3 and 4 introduce voxel-based and surface-based representations respectively. Section 5 further introduces implicit surface representations. Sections 6 and 7 review structure-based and deformation-based description methods. We then summarize typical datasets in Section 8 and typical applications for shape analysis and reconstruction in Section 9, before concluding the paper in Section 10. Figure 1 summarizes the timeline of representative deep learning methods based on various 3D shape representations.
2 Image-based methods
2D images are the projections of 3D entities. Although the geometric information carried by one image is incomplete, a plausible 3D shape could be inferred from a set of images with different perspectives. The extra channel of depth in RGB-D data further enhances the capacity of image-based representations on encoding geometric cues. Benefiting from its image-like structure, the research using deep neural networks on 3D shape inferences from images started earlier than alternative representations that can depict the surface or geometry of 3D shapes explicitly.
Socher et al. [socher2012convolutional] proposed a convolutional and recursive neural network for 3D object recognition, which copes with RGB and depth images by single convolutional layers separately and merges the features by a recursive network. Eigen et al. [eigen2014depth] first proposed to reconstruct the depth map from a single RGB image and designed a new scale invariant loss for the training stage. Gupta et al. [gupta2014learning] encoded the depth map into three channels including disparity, height and angle. Other deep learning methods based on RGB-D images are designed for 3D object detection [gupta2015aligning, song2016deep], outperforming previous methods.
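The idea of a scale-invariant depth loss can be illustrated with a minimal sketch. It assumes the commonly cited log-space form $\frac{1}{n}\sum_i d_i^2 - \frac{\lambda}{n^2}\left(\sum_i d_i\right)^2$ with $d_i = \log y_i - \log y_i^*$; the function name, the default $\lambda = 1$ and the toy depth values are illustrative assumptions, not details given in this survey.

```python
import math


def scale_invariant_loss(pred, target, lam=1.0):
    """Log-space depth error; with lam=1 a global rescaling of pred has no effect."""
    d = [math.log(p) - math.log(t) for p, t in zip(pred, target)]
    n = len(d)
    return sum(di ** 2 for di in d) / n - lam * sum(d) ** 2 / n ** 2


target = [1.0, 2.0, 4.0]              # hypothetical ground-truth depths
scaled = [3.0 * t for t in target]    # prediction that is off by a global factor

print(scale_invariant_loss(scaled, target))           # vanishes up to rounding
print(scale_invariant_loss(scaled, target, lam=0.0))  # plain mean squared log error
```

With `lam=1.0` the second term cancels the error caused by a global scale factor, whereas `lam=0.0` reduces the expression to an ordinary squared error in log space.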
Images from different viewpoints can provide complementary cues to infer 3D objects. Thanks to the development of deep learning models in 2D fields, the learning methods based on multi-view image representation perform better in 3D shape recognition than those based on other 3D representations. Su et al. [su2015multi] proposed MVCNN (Multi-View Convolutional Neural Network) for 3D object recognition. MVCNN first processes the images from different views separately with the first part of a CNN, then aggregates the features extracted from the different views with view-pooling layers, and finally feeds the merged feature to the remaining part of the CNN. Qi et al. [qi2016volumetric] proposed introducing multi-resolution into MVCNN for higher classification accuracy.
3 Voxel-based representations
3.1 Dense Voxel Representation
The voxel-based representation is traditionally a dense representation, which describes 3D shape data by volumetric grids in 3D space. Each voxel in the grid records the status of occupancy (e.g., occupied or unoccupied) within a cuboid grid.
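As a minimal sketch of this dense representation, the following code converts a set of 3D points into a binary occupancy grid. It assumes that the points are already normalized to the unit cube and that the grid is stored as a flat list; both choices, and the function name, are illustrative.

```python
def voxelize(points, resolution):
    """Dense occupancy grid (flat list of 0/1, length resolution^3) for points in [0, 1]^3."""
    grid = [0] * resolution ** 3
    for point in points:
        # Clamp so that a coordinate of exactly 1.0 falls into the last cell.
        i, j, k = (min(int(c * resolution), resolution - 1) for c in point)
        grid[(i * resolution + j) * resolution + k] = 1
    return grid


points = [(0.1, 0.1, 0.1), (0.12, 0.11, 0.1), (0.9, 0.5, 0.2)]
grid = voxelize(points, resolution=4)
print(sum(grid), "of", len(grid), "cells are occupied")
```

The first two points fall into the same cell, so only two of the 64 cells are marked as occupied, which already hints at the sparsity of dense grids.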
One of the earliest methods that applies deep neural networks to volumetric representations, called 3D ShapeNets, was proposed by Wu et al. [wu20153d] in 2015. Wu et al. assigned three different states to the voxels in the volumetric representation produced from 2.5D depth maps: observed, unobserved and free. 3D ShapeNets extended the deep belief network (DBN) [hinton2006fast] from pixel data to voxel data and replaced fully connected layers in the DBN with convolutional layers. The model takes the aforementioned volumetric representation as input, and outputs category labels and the predicted 3D shape by iterative computations. Concurrently, Maturana et al. proposed to process the volumetric representation with 3D Convolutional Neural Networks (3D CNNs) [maturana20153d] and designed VoxNet [maturana2015voxnet] for object recognition. VoxNet defines several volumetric layers, including Input Layer, Convolutional Layers, Pooling Layers and Fully Connected Layers. Although these layers are simple extensions of traditional 2D CNNs [krizhevsky2012imagenet] to 3D, VoxNet is easy to implement and train and achieves promising performance as the first attempt at volumetric convolutions. In addition, to make VoxNet invariant to orientation, Maturana et al. further augment the input data by rotating each shape into instances with different orientations in the training stage, and add a pooling operation after the output layer to group the predictions from all instances in the test stage.
In addition to the development of deep belief networks and convolutional neural networks in shape analysis based on volumetric representation, the two most successful generative models, namely auto-encoders and Generative Adversarial Networks (GANs) [goodfellow2014generative], have also been extended to support this representation. Inspired by Denoising Auto-Encoders (DAEs) [vincent2008extracting, vincent2010stacked], Sharma et al. proposed an autoencoder model, VConv-DAE, for coping with voxels [sharma2016vconv]. It is one of the earliest unsupervised learning approaches in voxel-based shape analysis to our knowledge. Without object labels for training, VConv-DAE chooses mean square loss or cross entropy loss as the reconstruction loss function. Girdhar et al. [Girdhar16b] also proposed the TL-embedding Network, which combines an auto-encoder for generating a voxel-based representation with a convolutional neural network for predicting the embeddings from 2D images.
Choy et al. [choy20163d] proposed 3D-R2N2, which takes single or multiple images as input and reconstructs objects in occupancy grids. 3D-R2N2 regards input images as a sequence and designs a 3D recurrent neural network based on LSTM (Long Short-Term Memory) [hochreiter1997long] or GRU (Gated Recurrent Unit) [cho2014learning]. The architecture consists of three parts: an image encoder to extract features from 2D images, a 3D-LSTM to predict hidden states as coarse representations of the final 3D models, and a decoder to increase the resolution and generate target shapes.
Wu et al. [wu2016learning] designed a generative model called 3D-GAN that applies the Generative Adversarial Network (GAN) [goodfellow2014generative] to voxel data. 3D-GAN learns to synthesize a 3D object from a latent vector sampled from a given probability distribution. Moreover, [wu2016learning] also proposed 3D-VAE-GAN, inspired by VAE-GAN [larsen2015autoencoding], for the object reconstruction task. 3D-VAE-GAN puts an encoder before 3D-GAN for inferring the latent vector from input 2D images and shares the decoder with the generator of 3D-GAN.
After the early attempts in dealing with volumetric representations by deep learning, researchers began to optimize the architecture of volumetric networks for better performance and more applications. A motivation is that the naive extension from traditional 2D domain networks often does not perform better than image-based CNNs such as MVCNN [su2015multi]. The main challenges affecting the performance include overfitting, orientation, data sparsity and low resolution.
Qi et al. [qi2016volumetric] proposed two new network structures aiming to improve the performance of volumetric CNNs. One introduces an extra task namely predicting class labels with subvolume space to prevent overfitting, and another utilizes elongated kernels to compress the 3D information into the 2D field in order to use 2D CNNs directly. Both of them use mlpconv layers [lin2013network] to replace traditional convolutional layers. [qi2016volumetric] also augments the input data in different orientation and elevation to encourage the network to get more local features in different poses so that the results are less influenced by orientation changes. To further mitigate the orientation impact on recognition accuracy, instead of using data augmentation like [maturana2015voxnet, qi2016volumetric], [sedaghat2016orientation] proposed a new model called ORION which extends VoxNet [maturana2015voxnet] and uses a fully connected layer to predict the object class label and orientation label simultaneously.
3.2 Sparse Voxel Representation (Octree)
Voxel-based representations often lead to high computational cost because of the steep increase in computation from pixels to voxels. Most of the methods cannot cope with or generate high-resolution models within reasonable time. For instance, TL-embedding Network [Girdhar16b], 3DShapeNets [wu20153d], VConv-DAE [sharma2016vconv], VoxNet [maturana2015voxnet], 3D-R2N2 [choy20163d] and ORION [sedaghat2016orientation] were each designed for voxel grids of a fixed, low resolution (3DShapeNets and VConv-DAE additionally use 3 voxels of padding in each direction of the grid), and 3D-GAN was designed for generating occupancy grids of a fixed resolution as the 3D shape representation. As the voxel resolution increases, the occupied cells become sparser in the whole 3D space, which leads to more unnecessary computation. To address this problem, Li et al. [li2016fpnn] designed a novel method called FPNN to cope with the data sparsity.
Some methods instead encode the voxel grids by a sparse, adaptive data structure, namely the octree [meagher1982geometric], to reduce the dimensionality of the input data. Häne et al. [hane2017hierarchical] proposed Hierarchical Surface Prediction (HSP), which can generate voxel grids in the form of an octree from coarse to fine. Häne et al. observed that only the voxels near the object surface need to be predicted at a high resolution, so HSP can avoid unnecessary calculation and make the generation of high-resolution voxel grids affordable. As introduced in [hane2017hierarchical], each node in the octree is defined as a voxel block containing a fixed number of voxels whose size depends on the octree level, and each voxel block is classified as occupied, boundary or free. The decoder of the model takes a feature vector as input, and predicts feature blocks that correspond to voxel blocks hierarchically. HSP defines the octree to have 5 layers with a fixed number of voxels per block, which together determine the maximum resolution of the generated voxel grids. Tatarchenko et al. [tatarchenko2017octree] also proposed a decoder called OGN for generating high-resolution volumetric representations. In [tatarchenko2017octree], nodes in the octree are separated into three categories: “empty”, “filled” and “mixed”. The octree representing a 3D model and the feature map of the octree are stored in the form of hash tables which are indexed by the spatial position and the octree level. In order to process the feature maps represented as hash tables, Tatarchenko et al. designed a convolutional layer named OGN-Conv, which converts the convolutional operation into matrix multiplication. In each decoder layer, [tatarchenko2017octree] generates voxel grids of a different resolution by convolutional operations on feature maps, and then decides whether to propagate the features to the next layer according to the predicted labels (propagating the features if “mixed” and skipping the feature propagation if “empty” or “filled”).
Besides the decoder model design for synthesizing voxel grids, shape analysis methods are also designed using octrees. However, the conventional octree structure [meagher1982geometric] is difficult to use in deep networks, so many researchers try to resolve the problem by designing new octree structures and special operations such as convolution, pooling and unpooling on octrees. Riegler et al. [riegler2017octnet] proposed OctNet. The octree representation in [riegler2017octnet] has a more regular structure than a traditional octree, as it places shallow octrees in a regular 3D grid. Each shallow octree is constrained to have up to 3 levels and is encoded in 73 bits. Each bit determines whether the corresponding cell needs to be split. Wang et al. [wang2017cnn] also proposed an octree-based convolutional neural network called O-CNN, which likewise removes pointers, as in the shallow octree [riegler2017octnet], and stores the octree data and structure in a series of vectors including shuffle key vectors, labels and input signals.
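The figure of 73 bits follows from counting one split flag per cell over three levels (1 + 8 + 64). The sketch below reproduces this count and decodes such a bit string into a number of leaf cells. The index layout, in which the children of the cell at index i sit at 8i + 1 to 8i + 8, and the function names are assumptions made for illustration.

```python
def shallow_octree_bits(max_depth: int = 3) -> int:
    """Number of split flags: level d has 8**d cells and each cell needs one flag."""
    return sum(8 ** level for level in range(max_depth))


def count_leaves(bits, max_depth: int = 3) -> int:
    """Count leaf cells of a shallow octree from its split flags.

    Assumed layout: the children of the cell at index i sit at 8*i + 1 ... 8*i + 8.
    """
    def leaves(index, depth):
        # Cells at max_depth carry no flag and are always leaves.
        if depth == max_depth or not bits[index]:
            return 1
        return sum(leaves(8 * index + 1 + c, depth + 1) for c in range(8))

    return leaves(0, 0)


print(shallow_octree_bits())              # 73
print(count_leaves([1] + [0] * 72))       # 8: only the root is split
print(count_leaves([1] * 73))             # 512: every cell is split
```

Because the structure is a fixed-length bit string, no pointers are needed to traverse it, which is what makes this kind of octree convenient for network layers.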
In addition to representing voxels, octree structure can also be utilized to represent 3D shape surfaces with planar patches. Wang et al. [wang2018adaptive] proposed Adaptive O-CNN, where they defined another form of octree named patch-guided adaptive octree, which divides a 3D shape surface into a set of planar patches restricted by bounding boxes corresponding to octants. They also provided an encoder and a decoder for the octree defined by this paper.
4 Surface-based representations
4.1 Point-based Representation
The typical point-based representation is also referred to as point clouds or point sets. They can be raw data generated by 3D scanning devices. Because of its unordered and irregular structure, this kind of representation is relatively difficult to handle with traditional deep learning methods. Therefore, most researchers avoided using point clouds directly at the early stage of deep learning-based geometry research. One of the first models to generate point clouds by deep learning came out in 2017 [fan2017point]. The authors designed a neural network to learn a point sampler based on the 3D shape point distribution. The network takes a single image and a random vector as input, and outputs an $N \times 3$ matrix representing the predicted point set ($x$, $y$, $z$ coordinates for $N$ points). In addition, [fan2017point] proposed to use Chamfer Distance (CD) and Earth Mover’s Distance (EMD) [rubner2000earth] as the loss functions to train the networks.
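A minimal sketch of the Chamfer Distance between two point sets is given below. It assumes the variant that sums squared Euclidean nearest-neighbor distances in both directions; implementations differ in whether they sum or average and whether they square the distances, so the exact form used by a given method should be checked against its paper. The brute-force search is quadratic and is meant only for illustration.

```python
def chamfer_distance(set_a, set_b):
    """Symmetric Chamfer distance with squared Euclidean distances (summed, not averaged)."""
    def sq_dist(p, q):
        return sum((pi - qi) ** 2 for pi, qi in zip(p, q))

    def one_way(src, dst):
        # For every source point, take the squared distance to its nearest target point.
        return sum(min(sq_dist(p, q) for q in dst) for p in src)

    return one_way(set_a, set_b) + one_way(set_b, set_a)


a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
b = [(0.0, 0.0, 0.0), (1.0, 1.0, 0.0)]
print(chamfer_distance(a, b))  # 2.0
```

Unlike the Earth Mover's Distance, this measure does not require a one-to-one matching between the two sets, so the sets may contain different numbers of points.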
PointNet. At almost the same time, Qi et al. [qi2017pointnet] proposed PointNet for shape analysis, which was the first successful deep network architecture that directly processes point clouds without unnecessary rendering. The pipeline of PointNet is illustrated in Figure 2. Based on three properties of point sets identified in [qi2017pointnet], PointNet includes three components: max-pooling layers as symmetric functions for dealing with the unordered property, concatenation of global and local features for point interaction, and a joint alignment network for transformation invariance. Building on PointNet, Qi et al. proposed PointNet++ [qi2017pointnet++] to address the problem that PointNet cannot adequately capture local features induced by the metric. Compared with PointNet, PointNet++ introduces a hierarchical structure, so that the model can capture features at different scales, which improves the capability of extracting 3D shape features. As PointNet and PointNet++ showed state-of-the-art performance in shape classification and semantic segmentation, more and more deep learning models were proposed based on point-based representations.
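The role of max pooling as a symmetric function can be sketched as follows. The per-point map here is a single random linear layer with ReLU, a toy stand-in for a learned shared MLP; the weights, dimensions and function names are all hypothetical.

```python
import random


def point_feature(point, weights):
    """Shared per-point map: one linear layer followed by ReLU (toy stand-in for an MLP)."""
    return [max(0.0, sum(w * c for w, c in zip(row, point))) for row in weights]


def global_feature(points, weights):
    """Max-pool per-point features channel-wise to get an order-independent descriptor."""
    per_point = [point_feature(p, weights) for p in points]
    return [max(channel) for channel in zip(*per_point)]


random.seed(0)
weights = [[random.uniform(-1, 1) for _ in range(3)] for _ in range(16)]  # 3 -> 16 channels
cloud = [[random.uniform(-1, 1) for _ in range(3)] for _ in range(100)]

reordered = cloud[:]
random.shuffle(reordered)
# The descriptor does not depend on the order in which the points are listed.
assert global_feature(cloud, weights) == global_feature(reordered, weights)
```

Any other symmetric reduction, such as a sum or a mean, would give the same invariance; max pooling is simply the choice described above.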
Other Point Cloud Processing Techniques using Neural Networks. Klokov et al. [klokov2017escape] proposed Kd-Network to process point clouds based on the form of kd-trees. Yang et al. [yang2018foldingnet] proposed FoldingNet, an end-to-end auto-encoder for further compressing a point-based representation with unsupervised learning. Because point clouds can be transformed into 2D grids by folding operations, FoldingNet integrates folding operations in their encoder-decoder to recover input 3D shapes. Mehr et al. [mehr2019disconet] further proposed DiscoNet for 3D model editing by combining multiple autoencoders which are trained for different types of 3D shapes specifically. The autoencoders use pre-learned mean geometry of training 3D shapes as their templates.
Although the point-based representation can be more easily obtained by 3D scanners than other 3D representations, this raw form of 3D shapes is often unsuitable for 3D shape analysis due to noise and data sparsity. Therefore, compared with other representations, it is essential for the point-based representation to incorporate an upsampling module to obtain fine-grained point clouds, such as PU-NET [yu2018pu], MPU [yifan2019patch], PU-GAN [li2019pu], etc. Guo et al. [guo2019deep] presented a survey focusing on deep learning models for point clouds, which provides more details in this field.
4.2 Mesh-based Representations
Compared with point-based representations, mesh-based representations contain connectivity between neighboring points, so they are more suitable for describing local regions on surfaces. As a typical type of representation in non-Euclidean space, mesh-based representations can be processed by deep learning models both in spatial and spectral domains [bronstein2017geometric]. However, directly applying CNNs to irregular data structures like meshes is non-trivial, so there emerged a handful of approaches converting 3D shape surfaces into 2D geometry images and applying traditional 2D CNNs on them [sinha2016deep, maron2017convolutional]. However, such methods do not take full advantage of the mesh-based representation. In this subsection, we will review deep learning models according to how meshes are treated as input, and introduce generative models working on meshes.
Graphs. The mesh-based representation is constructed from sets of vertices and edges, which can be seen as a graph. Some models were proposed based on spectral graph theory. They generalize CNNs to graphs [bruna2013spectral, henaff2015deep, defferrard2016convolutional, kipf2016semi, atwood2016diffusion] by eigen-decomposition of Laplacian matrices, which makes it possible to define convolutional operators in the spectral domain of graphs. Verma et al. [verma2018feastnet] proposed another graph-based CNN named FeaStNet, which computes the receptive fields of the convolution operator dynamically. Specifically, FeaStNet determines the assignment of the neighbor vertices by using features obtained in the network. Hanocka et al. [hanocka2019meshcnn] also designed convolution, pooling and unpooling operators for triangle meshes, and proposed MeshCNN. Different from other graph-based methods, MeshCNN focuses on processing the features stored on edges, and proposes a convolution operator that is applied to edges with a fixed number of neighbors and a pooling operator based on edge collapse. MeshCNN extracts 3D shape features with respect to specific tasks, and the network learns to preserve the important features and ignore the unimportant ones.
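Spectral approaches start from a Laplacian matrix of the mesh graph. As a minimal sketch, the code below builds the combinatorial Laplacian $L = D - A$ of the vertex graph from a list of triangles. The cited methods may use normalized or geometry-aware Laplacians instead, and the eigen-decomposition step, which would normally rely on a numerical library, is omitted; the function name and the tetrahedron example are illustrative.

```python
def mesh_laplacian(num_vertices, faces):
    """Combinatorial graph Laplacian L = D - A of the vertex graph of a triangle mesh."""
    edges = set()
    for a, b, c in faces:
        for u, v in ((a, b), (b, c), (c, a)):
            edges.add((min(u, v), max(u, v)))  # store each undirected edge once
    lap = [[0] * num_vertices for _ in range(num_vertices)]
    for u, v in edges:
        lap[u][v] -= 1   # off-diagonal entries: negative adjacency
        lap[v][u] -= 1
        lap[u][u] += 1   # diagonal entries: vertex degree
        lap[v][v] += 1
    return lap


# A tetrahedron: 4 vertices, 4 triangular faces, every vertex adjacent to the other three.
faces = [(0, 1, 2), (0, 1, 3), (0, 2, 3), (1, 2, 3)]
laplacian = mesh_laplacian(4, faces)
for row in laplacian:
    print(row)
```

Each row sums to zero, so the constant vector is an eigenvector with eigenvalue zero; the remaining eigenvectors play the role of a Fourier basis on the graph.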
2-Manifolds. The mesh-based representation can be viewed as the discretization of 2-manifolds. Several works operate on 2-manifolds with a series of refined CNN operators adapted to this non-Euclidean space. These methods define their own local patches and kernel functions for generalizing CNN models. Masci et al. [masci2015geodesic] proposed Geodesic Convolutional Neural Networks (GCNNs) for manifolds, which extract and discretize local geodesic patches and apply convolutional filters on these patches in polar coordinates. The convolution operator is designed in the spatial domain and their Geodesic CNN is quite similar to conventional CNNs applied in Euclidean space. Localized Spectral CNNs [boscaini2015learning] proposed by Boscaini et al. apply the windowed Fourier transform to non-Euclidean space. Anisotropic Convolutional Neural Networks (ACNNs) [boscaini2016learning] further designed an anisotropic heat kernel to replace the isotropic patch operator in GCNN [masci2015geodesic], which gives another solution to avoid ambiguity. Xu et al. [xu2017directionally] proposed Directionally Convolutional Networks (DCNs), which define local patches based on the faces of the mesh representation. In this work, the authors also designed a two-stream network for 3D shape segmentation, which takes local face normals and the global face distance histogram as input for training. Monti et al. [monti2017geometric] proposed MoNet to replace the weight functions in [masci2015geodesic, boscaini2016learning] with Gaussian kernels with learnable parameters. Fey et al. [fey2018splinecnn] proposed SplineCNN, which uses a convolutional operator based on B-splines. Pan et al. [pan2018convolutional] designed a surface CNN for irregular 3D surfaces that preserves the standard CNN property of translation equivariance by using parallel translation frames and group convolutional operations.
Generative Models. There are also many generative models for the mesh-based representation. Wang et al. [wang2018pixel2mesh] proposed Pixel2Mesh for reconstructing 3D shapes from single images, which generates the target triangular mesh by deforming an ellipsoid template. As shown in Figure 3, the Pixel2Mesh network is implemented based on Graph-based Convolutional Networks (GCNs) [bronstein2017geometric] and generates the target mesh from coarse to fine by an unpooling operation. Wen et al. [wen2019pixel2mesh++] advanced Pixel2Mesh and proposed Pixel2Mesh++, which extends single-image 3D shape reconstruction to 3D shape reconstruction from multi-view images. To achieve this, Pixel2Mesh++ adds a Multi-view Deformation Network (MDN) to the original Pixel2Mesh, and the MDN incorporates cross-view information into the process of mesh generation. Groueix et al. [groueix2018atlasnet] proposed AtlasNet, which generates 3D surfaces from multiple patches. AtlasNet learns to convert 2D square patches into 2-manifolds that cover the surface of 3D shapes using an MLP (Multi-Layer Perceptron). Ben-Hamu et al. [ben2018multi] proposed a multi-chart generative model for 3D shape generation. The method uses a multi-chart structure as input and builds the network architecture based on a standard image GAN [goodfellow2014generative]. The transformation between the 3D surface and the multi-chart structure is based on [maron2017convolutional].
5 Implicit representations
In addition to explicit representations such as point clouds and meshes, implicit fields have gained popularity in recent studies. A major reason is that the implicit representation is not limited by fixed topology and resolution. An increasing number of deep models define their own implicit representations and build on them to propose various methods for shape analysis and generation.
The Occupancy/Indicator Function is one of the forms used to represent 3D shapes implicitly. Occupancy Network was proposed by Mescheder et al. [mescheder2019occupancy] to learn a continuous occupancy function as a new representation of 3D shapes by neural networks. The occupancy function reflects the status of a 3D point with respect to the 3D shape surface, where 1 means inside the surface and 0 otherwise. The authors regarded this problem as a binary classification task and designed an occupancy network which takes a 3D point position and a 3D shape observation as input and outputs the probability of occupancy. The generated implicit field is then processed by a Multi-resolution IsoSurface Extraction (MISE) method and the marching cubes algorithm [lorensen1987marching] to obtain meshes. Moreover, the authors introduce encoder networks to obtain latent embeddings. Similarly, Chen et al. [chen2019learning] designed IM-NET as a decoder for learning generative models, which also uses an implicit function in the form of an indicator function.
Signed Distance Functions (SDFs) are also a form of implicit representation. Signed distance functions map a 3D point to a real value instead of a probability, which indicates the spatial relation and distance to the 3D surface. Denote by $\mathrm{SDF}(\mathbf{x})$ the signed distance value of a given 3D point $\mathbf{x}$. Then $\mathrm{SDF}(\mathbf{x}) > 0$ if point $\mathbf{x}$ is outside the 3D shape surface, $\mathrm{SDF}(\mathbf{x}) < 0$ if point $\mathbf{x}$ is inside the surface, and $\mathrm{SDF}(\mathbf{x}) = 0$ means point $\mathbf{x}$ is on the surface. The absolute value of $\mathrm{SDF}(\mathbf{x})$ refers to the distance between point $\mathbf{x}$ and the surface. Park et al. [park2019deepsdf] proposed DeepSDF, an auto-decoder-based model that learns SDFs as a new 3D shape representation. Xu et al. [xu2019disn] also proposed Deep Implicit Surface Networks (DISNs) for single-view 3D reconstruction based on SDFs. Thanks to the advantages of SDFs, DISN was the first to reconstruct 3D shapes with flexible topology and thin structure in the single-view reconstruction task, which is difficult for other 3D representations.
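Both kinds of implicit function can be illustrated on an analytic shape. The sketch below evaluates the signed distance to a sphere, assuming the common convention that the value is negative inside the surface and positive outside, and derives an occupancy value from it. In the learned setting these functions are approximated by a neural network; the sphere and the function names are only a hand-written stand-in.

```python
import math


def sphere_sdf(point, center=(0.0, 0.0, 0.0), radius=1.0):
    """Signed distance to a sphere: negative inside, zero on the surface, positive outside."""
    return math.dist(point, center) - radius


def occupancy(point):
    """Occupancy derived from the SDF: 1 inside the surface, 0 otherwise."""
    return 1 if sphere_sdf(point) < 0 else 0


for p in [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]:
    print(p, sphere_sdf(p), occupancy(p))
```

Because the function can be queried at any point, the surface can be extracted at an arbitrary resolution, for example by sampling a grid and running an iso-surfacing algorithm on the zero level set.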
Function Sets. The occupancy functions and signed distance functions represent the 3D shape surface by a single function learned by a deep neural network. Genova et al. [genova2019learning, genova2019deep] proposed to represent the whole 3D shape by combining a set of shape elements. In [genova2019learning], researchers proposed Structured Implicit Functions (SIFs) where each element is represented by a scaled axis-aligned anisotropic 3D Gaussian, and the sum of these shape elements represents the whole 3D shape. The parameters of Gaussians are learned by the CNN. [genova2019deep] improved the SIF and proposed Deep Structured Implicit Functions (DSIFs) which added deep neural networks as Deep Implicit Functions (DIFs) to provide local geometry details. To summarize, DSIF exploits SIF to depict coarse information of each shape element, and applies DIF for local shape details.
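The idea of summing shape elements can be sketched as follows. Each element is a scaled, axis-aligned anisotropic Gaussian given by a scale constant, a center and per-axis radii, and the shape is taken to be a level set of the sum. The parameter values and function names here are hypothetical; in the cited methods the parameters are predicted by a network rather than written by hand.

```python
import math


def gaussian_element(point, scale, center, radii):
    """Scaled axis-aligned anisotropic Gaussian evaluated at a 3D point."""
    exponent = sum((x - c) ** 2 / (2.0 * r ** 2) for x, c, r in zip(point, center, radii))
    return scale * math.exp(-exponent)


def structured_implicit(point, elements):
    """Sum of all shape elements; the shape is a level set of this function."""
    return sum(gaussian_element(point, *element) for element in elements)


# Two hand-picked elements: (scale, center, per-axis radii)
elements = [
    (1.0, (0.0, 0.0, 0.0), (0.5, 0.5, 0.5)),
    (1.0, (1.0, 0.0, 0.0), (0.3, 0.6, 0.3)),
]
print(structured_implicit((0.0, 0.0, 0.0), elements))
print(structured_implicit((5.0, 5.0, 5.0), elements))
```

The value is large near the element centers and decays quickly away from them, so thresholding it yields a coarse, blob-like approximation of the shape.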
The above implicit representation models need to sample 3D points in the 3D shape bounding box as ground truth and are trained with supervised learning. Liu et al. [liu2019learning] first proposed a framework which learns implicit representations without 3D ground truth. The model uses a field probing algorithm to bridge the gap between the 3D shape and 2D images, and designs a silhouette loss to constrain the 3D shape outline and a geometry regularization to constrain the surface to be plausible.
6 Structure-based representations
Recently, more and more researchers have begun to realize the importance of the structure of 3D shapes and to integrate structural information into deep learning models. Primitive representations are a typical type of structure-based representation that depicts 3D shape structure well. A primitive representation represents the 3D shape with primitives such as oriented 3D boxes. Instead of describing geometry details, the primitive representation concentrates on the overall structure of 3D shapes, representing it as several primitives with a compact parameter set. More importantly, obtaining a primitive representation first helps in generating more detailed and plausible 3D shapes.
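As an illustration of how compact such a parameter set can be, the sketch below stores an oriented 3D box as a center, a size and a rotation, and recovers its eight corners on demand. The class `OrientedBox`, the helper `rot_z` and the two example parts are hypothetical and are not taken from any of the cited methods.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class OrientedBox:
    center: np.ndarray    # (3,) box center
    size: np.ndarray      # (3,) full edge lengths along the local axes
    rotation: np.ndarray  # (3, 3) rotation from local to world coordinates

    def corners(self):
        """Return the 8 corner points of the box in world coordinates."""
        signs = np.array([[sx, sy, sz]
                          for sx in (-1, 1) for sy in (-1, 1) for sz in (-1, 1)],
                         dtype=float)
        local = 0.5 * signs * self.size          # (8, 3) corners in the local frame
        return local @ self.rotation.T + self.center

def rot_z(angle):
    """Rotation matrix about the z axis."""
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

# A coarse two-part shape: each part costs only center + size + rotation.
seat = OrientedBox(np.array([1.0, 1.0, 1.0]), np.array([2.0, 4.0, 6.0]), np.eye(3))
back = OrientedBox(np.array([0.0, 0.0, 0.0]), np.array([2.0, 4.0, 6.0]), rot_z(np.pi / 2))
seat_corners = seat.corners()
back_corners = back.corners()
```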
Linearly Organized. Observing that humans often regard 3D shapes as a collection of parts, Zou et al. [zou20173d] proposed 3D-PRNN, which applies an LSTM in a primitive generator so that primitives are generated sequentially. The generated primitive representations are highly efficient at depicting simple and regular 3D shapes. Wu et al. [wu2019pq] further proposed an RNN-based method called PQ-NET, which also regards 3D shape parts as a sequence. The difference is that PQ-NET also encodes geometry features in the network. Gao et al. [gao2019sdm] proposed a deep generative model named SDM-NET (Structured Deformable Mesh-Net). They designed a two-level VAE, containing a PartVAE for part geometry and an SP-VAE (Structured Parts VAE) for both structure and geometry features. In [gao2019sdm], each shape part is encoded in a well-designed form that records both structure information (symmetry, supporting and supported relations) and geometry features.
Hierarchically Organized. Li et al. [li2017grass] proposed GRASS (Generative Recursive Autoencoders for Shape Structures), one of the first attempts to encode 3D shape structure with a neural network. They describe the shape structure by a hierarchical binary tree, in which child nodes are merged into the parent node by either adjacency or symmetry relations. Leaves in this structure tree represent the oriented bounding boxes (OBBs) and geometry features of each part, and intermediate nodes represent both the geometry features of the child nodes and the relations between them. Inspired by recursive neural networks (RvNNs) [socher2011parsing, socher2012convolutional], GRASS recursively merges the codes representing the OBBs into a root code that depicts the whole shape structure. The architecture of GRASS can be divided into three parts: (1) an RvNN autoencoder for encoding a 3D shape into a fixed-length code, (2) a GAN for learning the distribution of root codes and generating plausible structures, and (3) another autoencoder, inspired by [Girdhar16b], for synthesizing the geometry of each part. Furthermore, to synthesize fine-grained geometry in voxel grids, a structure-aware recursive feature (SARF) is proposed, which contains both the geometry features of each part and the global and local OBB layout.
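The recursive merging of part codes into a fixed-length root code can be sketched in a few lines. This is a simplified illustration with random, untrained weights and single-layer merges; the dimensions `CODE_DIM`, `BOX_DIM` and `SYM_DIM` and the tuple-based tree encoding are assumptions made for the example and do not reproduce the actual GRASS architecture.

```python
import numpy as np

CODE_DIM = 8   # length of every node code (hypothetical)
BOX_DIM = 12   # length of an OBB parameter vector (hypothetical)
SYM_DIM = 8    # length of a symmetry parameter vector (hypothetical)

rng = np.random.default_rng(0)
W_box = rng.normal(scale=0.1, size=(CODE_DIM, BOX_DIM))             # leaf encoder
W_adj = rng.normal(scale=0.1, size=(CODE_DIM, 2 * CODE_DIM))        # adjacency merge
W_sym = rng.normal(scale=0.1, size=(CODE_DIM, CODE_DIM + SYM_DIM))  # symmetry merge

def encode(node):
    """Recursively encode a structure tree into one fixed-length code."""
    kind = node[0]
    if kind == "box":   # leaf: ("box", obb_parameters)
        return np.tanh(W_box @ node[1])
    if kind == "adj":   # ("adj", left_subtree, right_subtree)
        left, right = encode(node[1]), encode(node[2])
        return np.tanh(W_adj @ np.concatenate([left, right]))
    if kind == "sym":   # ("sym", subtree, symmetry_parameters)
        child = encode(node[1])
        return np.tanh(W_sym @ np.concatenate([child, node[2]]))
    raise ValueError(f"unknown node kind: {kind}")

def random_box():
    return ("box", rng.normal(size=BOX_DIM))

small_tree = ("adj", random_box(), random_box())
large_tree = ("adj", small_tree, ("sym", random_box(), rng.normal(size=SYM_DIM)))
small_code = encode(small_tree)
large_code = encode(large_tree)
```

Both trees map to a code of the same length, which is what allows a generative model to be trained on root codes regardless of how many parts a shape has.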
However, GRASS [li2017grass] uses a binary tree to organize the part structure, which leads to ambiguity because an n-ary part hierarchy can be binarized in more than one way. Therefore, binary trees are not suitable for large-scale datasets. To address this problem, Mo et al. [mo2019structurenet] proposed StructureNet, which organizes the hierarchical structure in the form of graphs.
BSP-Net (Binary Space Partitioning-Net), proposed by Chen et al. [chen2019bsp], is the first method to depict sharp geometry features; it constructs a 3D shape from convexes organized by a BSP-tree. The Binary Space Partitioning (BSP) tree defined in [chen2019bsp] represents a 3D shape as a collection of convexes and is built in three layers, namely hyperplane extraction, hyperplane grouping and shape assembly. The convexes can also be seen as a new form of primitive that can represent the geometry details of 3D shapes rather than only their general structure.
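The three stages (planes, grouping into convexes, assembly into a shape) can be illustrated with hand-written planes instead of learned ones. In this sketch, a point is inside a convex when it lies on the inner side of every plane assigned to that convex, and inside the shape when it is inside at least one convex. The function `bsp_inside`, the plane values and the binary grouping matrix are assumptions for illustration, not the trained network of [chen2019bsp].

```python
import numpy as np

def bsp_inside(points, planes, grouping):
    """Test points against a union of convexes built from shared planes.

    planes:   (H, 4) rows (a, b, c, d); a point is on the inner side when
              a*x + b*y + c*z + d <= 0.
    grouping: (H, C) binary matrix; grouping[h, c] = 1 if plane h bounds convex c.
    """
    points = np.asarray(points, dtype=float)
    homogeneous = np.hstack([points, np.ones((len(points), 1))])
    plane_values = homogeneous @ np.asarray(planes, dtype=float).T  # (P, H)
    violation = np.maximum(plane_values, 0.0)     # 0 where the point is on the inner side
    convex_values = violation @ np.asarray(grouping, dtype=float)   # (P, C)
    shape_values = convex_values.min(axis=1)      # union: inside any convex
    return shape_values <= 0.0

# Two unit cubes, [0,1]^3 and [2,3]x[0,1]x[0,1], sharing their y and z planes.
planes = [
    [1, 0, 0, -1],   # x <= 1 (cube A)
    [-1, 0, 0, 0],   # x >= 0 (cube A)
    [1, 0, 0, -3],   # x <= 3 (cube B)
    [-1, 0, 0, 2],   # x >= 2 (cube B)
    [0, 1, 0, -1],   # y <= 1 (shared)
    [0, -1, 0, 0],   # y >= 0 (shared)
    [0, 0, 1, -1],   # z <= 1 (shared)
    [0, 0, -1, 0],   # z >= 0 (shared)
]
grouping = [[1, 0], [1, 0], [0, 1], [0, 1], [1, 1], [1, 1], [1, 1], [1, 1]]

query = [[0.5, 0.5, 0.5], [2.5, 0.5, 0.5], [1.5, 0.5, 0.5], [0.5, 2.0, 0.5]]
occupancy = bsp_inside(query, planes, grouping)
print(occupancy)
```

Because each convex is bounded by flat planes, its edges and corners stay sharp, which is the property the paragraph above attributes to this representation.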
Structure and Geometry. Researchers have tried to encode 3D shape structure and geometry features either separately [li2017grass] or jointly [wu2019sagnet]. Wang et al. [wang2018global] proposed the Global-to-Local (G2L) generative model to generate man-made 3D shapes from coarse to fine. To address the problem that GANs cannot generate geometry details well [wu2016learning], G2L first applies a GAN to generate coarse voxel grids with semantic labels that represent shape structure at the global level, and then feeds the voxels, separated by semantic label, into an autoencoder called the Part Refiner (PR) to optimize geometry details part by part at the local level. Wu et al. [wu2019sagnet] proposed SAGNet for detailed 3D shape generation, which encodes structure and geometry jointly with a GRU [cho2014learning] architecture in order to find the relations between them. SAGNet performs better on tenon-mortise joints than other structure-based learning methods.
7 Deformation-based representations
Deformable 3D models play an important role in computer animation. However, most of the methods mentioned above focus on rigid 3D models and pay less attention to the deformation of non-rigid models. Compared with other representations, deformation-based representations parameterize the deformation information and perform better on non-rigid 3D shapes, such as articulated models.
Mesh-based Deformation Description. A mesh can be seen as a graph, which makes it convenient to manipulate vertex positions while maintaining the connectivity between vertices. Therefore, a great number of methods choose meshes to represent deformable 3D shapes. Moreover, the graph structure makes it easy to store deformation information as vertex features, which can be seen as deformation representations. Gao et al. [gao2016efficient] designed an efficient and rotation-invariant deformation representation called Rotation-Invariant Mesh Difference (RIMD), which achieves high performance in shape reconstruction, deformation and registration. Based on [gao2016efficient], Tan et al. [tan2018variational] proposed Mesh VAE for deformable shape analysis and synthesis, which takes RIMD as the feature input of a VAE and uses fully connected layers for the encoder and decoder. To overcome the problem that deformation gradients do not work well for large-scale deformations, Gao et al. [gao2019sparse] designed an as-consistent-as-possible (ACAP) representation to constrain the rotation angles and rotation axes between adjacent vertices in the deformable mesh. Tan et al. [tan2018mesh] proposed SparseAE based on the ACAP representation [gao2019sparse], which applies graph convolutional operators [duvenaud2015convolutional] in the network. Gao et al. [gao2018automatic] proposed VC-GAN (VAE CycleGAN) for unpaired mesh deformation transfer, which takes the ACAP representation as input, encodes it into a latent space with a VAE, and then transfers between source and target in the latent space using a CycleGAN [zhu2017unpaired] architecture.
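The graph view of a mesh can be sketched as follows: connectivity is derived once from the faces, and deformation is stored as a per-vertex feature. Note that the feature used here is a plain per-vertex displacement, chosen only for simplicity; it is neither RIMD nor ACAP, and the helper `vertex_adjacency` and the tiny two-triangle mesh are hypothetical.

```python
import numpy as np

def vertex_adjacency(faces, num_vertices):
    """Build the vertex graph of a triangle mesh as sorted neighbor lists."""
    neighbors = [set() for _ in range(num_vertices)]
    for a, b, c in faces:
        neighbors[a].update((b, c))
        neighbors[b].update((a, c))
        neighbors[c].update((a, b))
    return [sorted(n) for n in neighbors]

# A unit square made of two triangles sharing the edge (0, 2).
faces = [(0, 1, 2), (0, 2, 3)]
rest = np.array([[0, 0, 0], [1, 0, 0], [1, 1, 0], [0, 1, 0]], dtype=float)

# Deform by lifting one vertex; the faces (connectivity) stay untouched.
deformed = rest.copy()
deformed[2] += [0.0, 0.0, 0.5]

adjacency = vertex_adjacency(faces, len(rest))
displacement = deformed - rest   # one 3D feature vector per graph node
print(adjacency)
print(displacement)
```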
Implicit surface based approaches. With the development of implicit surface representations, Jeruzalski et al. [jeruzalski2019nasa] proposed a method to represent articulated deformable shapes by pose parameters, called Neural Articulated Shape Approximation (NASA). The pose parameters in [jeruzalski2019nasa] record the transformations of the bones defined in the models. They compared three network architectures, namely an unstructured model (U), a piecewise rigid model (R) and a piecewise deformable model (D), on the training and test datasets, which opens another direction for representing deformable 3D shapes.
8 Datasets
With the development of 3D scanners, 3D models have become easier to obtain, and a growing number of 3D shape datasets have been released in different 3D representations. Larger datasets with more details bring more challenges for existing techniques, which further promotes the development of deep learning on different 3D representations.
The datasets can be divided into several types according to their representations and applications. Choosing an appropriate dataset benefits the performance and generalization of learning-based models.
RGB-D Images. RGB-D image datasets can be collected by depth sensors like Microsoft Kinect. Most RGB-D image datasets can be regarded as video sequences. The indoor scene RGB-D image dataset NYU Depth [silberman11indoor, silberman2012indoor] was first provided for the segmentation problem; the v1 version [silberman11indoor] collects 64 categories, while the v2 version [silberman2012indoor] collects 464 categories. The KITTI [geiger2013vision] dataset provides outdoor scene images mainly for autonomous driving and contains 5 categories: ‘Road’, ‘City’, ‘Residential’, ‘Campus’ and ‘Person’. The depth maps of the images can be calculated with the development kit provided by the KITTI dataset. The KITTI dataset also contains 3D object annotations for applications such as object detection.
Man-made 3D Object Datasets. ModelNet [wu20153d] is one of the best-known CAD model datasets for 3D shape analysis, including 127,915 3D CAD models in 662 categories. ModelNet provides two subsets, ModelNet10 and ModelNet40. ModelNet10 includes 10 categories from the whole dataset, and its 3D models are aligned manually; ModelNet40 includes 40 categories, and its 3D models are also aligned. ShapeNet [chang2015shapenet] provides a larger-scale dataset, containing more than 3 million models in more than 4K categories. ShapeNet also contains two smaller subsets: ShapeNetCore and ShapeNetSem. For various geometry applications, ShapeNet [chang2015shapenet] provides rich annotations for the 3D objects in the dataset, including category labels, part labels, symmetry information, etc. PartNet [mo2019partnet] provides a more detailed CAD model dataset with fine-grained, hierarchical part annotations, which brings more challenges and resources for 3D object applications such as semantic segmentation, shape editing and shape generation.
Non-Rigid Model Datasets. TOSCA [bronstein2008numerical] is a high-resolution 3D non-rigid model dataset, which contains 80 objects in 9 categories. The models are in the mesh representation, and the objects within the same category have the same resolution. FAUST [bogo2014faust] is a dataset of 3D human body scans of 10 different people in a variety of poses, with ground truth correspondences also provided. Because FAUST was proposed for real-world shape registration, the scans provided in the dataset are noisy and incomplete, but the corresponding ground truth is watertight and aligned.
| Source | Type | Dataset | Year | Categories | Size | Description |
|---|---|---|---|---|---|---|
| Real-world | RGB-D Images | NYU Depth v1 [silberman11indoor] | 2011 | 64 | - | Indoor Scene |
| Real-world | RGB-D Images | NYU Depth v2 [silberman2012indoor] | 2012 | 464 | 407,024 | Indoor Scene |
| Real-world | RGB-D Images | KITTI [geiger2013vision] | 2013 | 5 | - | Outdoor Scene |
| Synthetic | 3D CAD Models | ModelNet [wu20153d] | 2015 | 662 | 127,915 | Mesh Representation |
| Synthetic | 3D CAD Models | ModelNet10 [wu20153d] | 2015 | 10 | 4,899 | - |
| Synthetic | 3D CAD Models | ModelNet40 [wu20153d] | 2015 | 40 | 12,311 | - |
| Synthetic | 3D CAD Models | ShapeNet [chang2015shapenet] | 2015 | 4K | 3 million | Rich Annotations |
| Synthetic | 3D CAD Models | ShapeNetCore [chang2015shapenet] | 2015 | 55 | 51,300 | - |
| Synthetic | 3D CAD Models | ShapeNetSem [chang2015shapenet] | 2015 | 270 | 12,000 | - |
| Synthetic | 3D CAD Models | PartNet [mo2019partnet] | 2019 | 24 | 26,671 | 573,585 Part Instances |
| Real-world | Non-Rigid Models | FAUST [bogo2014faust] | 2014 | 10 | 300 | Human Bodies |
9 Shape Analysis and Reconstruction
The shape representations mentioned above are fundamental for shape analysis and shape reconstruction. In this section, we summarize representative works in these two directions respectively and compare the performance of these works.
9.1 Shape Analysis
Shape analysis methods usually extract latent codes from different 3D shape representations using different network architectures. The latent codes are then used for specific applications such as shape classification, shape retrieval and shape segmentation. Different representations are usually suitable for different applications. We now review the performance of different representations in different models and discuss suitable representations for specific applications.
Shape Classification and Retrieval are the basic problems of shape analysis. Both rely on the feature vectors extracted by the analysis networks. For shape classification, the datasets ModelNet10 and ModelNet40 [wu20153d] are widely used, and Table 2 shows the accuracy of different methods on ModelNet10 and ModelNet40. For shape retrieval, given a 3D shape as a query, the goal is to find the most similar shape(s) in the dataset. Retrieval methods usually learn a compact code that represents the object in a latent space, and return the closest objects according to the Euclidean distance, the Mahalanobis distance or another distance metric. Unlike the classification task, the shape retrieval task has a number of evaluation measures, including precision, recall and mAP (mean average precision).
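The retrieval procedure and one of its evaluation measures can be sketched as follows. The example ranks a small gallery by Euclidean distance to a query code and computes the average precision for that query; mAP is then the mean of this value over all queries. The 2D codes, the labels and the function names `retrieve` and `average_precision` are made up for illustration, whereas real systems would use codes produced by a trained network.

```python
import numpy as np

def retrieve(query_code, gallery_codes):
    """Return gallery indices sorted from nearest to farthest (Euclidean distance)."""
    distances = np.linalg.norm(gallery_codes - query_code, axis=1)
    return np.argsort(distances, kind="stable")

def average_precision(ranked_labels, query_label):
    """Average of precision@k over the ranks k at which a relevant item appears."""
    hits = np.asarray(ranked_labels) == query_label
    if not hits.any():
        return 0.0
    precision_at_k = np.cumsum(hits) / np.arange(1, len(hits) + 1)
    return float(precision_at_k[hits].mean())

# Hypothetical latent codes and category labels for a four-shape gallery.
gallery_codes = np.array([[0.1, 0.0], [0.3, 0.0], [5.0, 5.0], [1.0, 0.0]])
gallery_labels = np.array(["chair", "table", "table", "chair"])

query_code = np.array([0.0, 0.0])
ranking = retrieve(query_code, gallery_codes)
ap = average_precision(gallery_labels[ranking], "chair")
print(ranking, ap)
```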
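| Representation | Method | ModelNet10 accuracy (%) | ModelNet40 accuracy (%) |
|---|---|---|---|
| Voxel | Qi et al. [qi2016volumetric] | - | 86 |
| Multi-view | Qi et al. [qi2016volumetric] | - | 91.4 |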
Shape Segmentation aims to discriminate the part categories of a 3D shape. This task plays an important role in understanding 3D shapes. The mean Intersection-over-Union (mIoU) is often used as the evaluation metric for shape segmentation. Most researchers choose the point-based representation for the segmentation task [klokov2017escape, qi2017pointnet, qi2017pointnet++, li2018pointcnn].
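For reference, the mean Intersection-over-Union of a single shape can be computed from per-point part labels as in the sketch below. Counting a part that is absent from both prediction and ground truth as an IoU of 1 is an assumed convention, and benchmarks differ in how they average over shapes and categories; the function name `part_miou` and the labels are illustrative.

```python
import numpy as np

def part_miou(pred, target, num_parts):
    """Mean IoU over part labels for one shape, given per-point labels."""
    pred = np.asarray(pred)
    target = np.asarray(target)
    ious = []
    for part in range(num_parts):
        pred_mask = pred == part
        target_mask = target == part
        union = np.logical_or(pred_mask, target_mask).sum()
        if union == 0:
            ious.append(1.0)  # part absent in both: counted as perfect (assumption)
            continue
        intersection = np.logical_and(pred_mask, target_mask).sum()
        ious.append(intersection / union)
    return float(np.mean(ious))

# Six points with two part labels; one point is mislabeled.
target_labels = [0, 0, 0, 1, 1, 1]
pred_labels = [0, 0, 1, 1, 1, 1]
miou = part_miou(pred_labels, target_labels, num_parts=2)
print(miou)  # mean of 2/3 (part 0) and 3/4 (part 1)
```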
9.2 Shape Reconstruction
Learning-based generative models have been proposed for different representations. Reconstruction applications include single-view shape reconstruction, shape generation, shape editing, etc. The generation methods can be summarized on the basis of their representations. For voxel-based representations, learning-based models try to predict the occupancy probability of each voxel in the grid. For point-based representations, learning-based models either sample 3D points in space or fold 2D grids into the target 3D objects. For mesh-based representations, most generation methods choose to deform a mesh template into the final mesh. In recent studies, more and more methods use structured representations and generate 3D shapes from coarse to fine.
In this survey, we review a series of deep learning methods based on different 3D object representations. We first give an overview of different 3D representation learning models. The trend in geometry learning is toward methods that demand less computation and memory while producing more detailed and structured results. Then, we introduce the 3D datasets that are widely used in research. These datasets provide rich resources and support evaluation for data-driven learning methods. Finally, we discuss 3D shape applications based on different 3D representations, including shape analysis and shape reconstruction. Different representations are usually suitable for different applications. Therefore, it is vitally important to choose suitable 3D representations for specific tasks.
This work was supported by the National Natural Science Foundation of China (No. 61828204 and No. 61872440), the Beijing Municipal Natural Science Foundation (No. L182016), the Youth Innovation Promotion Association CAS, and the CCF-Tencent Open Fund.