3D shape analysis and generation are two key tasks in computer graphics. Traditional approaches generally have limited ability to handle complex shapes, or require significant time and effort from users to achieve acceptable results. With the rapid growth of created and captured 3D data, it has become possible to learn the shape space from a large 3D dataset with the aid of machine learning techniques and guide the shape analysis and generation with the learned features. Recently, deep learning with convolutional neural networks has been applied to 3D shape analysis and synthesis.
Different from images, whose grid-based representation is simple and regular, 3D shapes have a variety of representations because of different demands from real applications. For the learning-based shape generation task, the representation of 3D shapes plays a vital role, affecting both the design of learning architectures and the quality of generated shapes. The (dense)-voxel representation is the most popular in existing 3D learning and generation frameworks (Wu et al., 2015; Wu et al., 2016) since it is a natural extension of 2D images and is well-suited to existing learning frameworks, like convolutional neural networks. However, its high memory storage and costly computation are major drawbacks, and high-resolution outputs are hard to produce in practice. Multi-view images (Su et al., 2015) have been widely used in shape generation. The generated multi-view images can be fused to reconstruct the complete shape. Proper view selection, enforcing consistency of different views, and shape occlusion are the main challenges for this representation. Recently, point clouds, another common 3D representation, have become suitable for shape analysis and generation with the development of PointNet (Qi et al., 2017a) and its variants. However, their output quality is limited by the number of points, and extracting high-quality surfaces from the point cloud requires additional processing. As the favorite 3D format in computer graphics, the polygonal mesh has recently been used in learning-based shape generation. Surface patches or meshes can be predicted directly by a neural network that deforms a template mesh or finds a 2D-to-3D mapping (Groueix et al., [n. d.]; Kato et al., 2018; Wang et al., 2018). However, the predefined mesh topology and the regular tessellation of the template mesh prevent generating high-quality results, especially for irregular and complex shapes.
The octree, which is the most representative sparse-voxel representation, has recently been integrated with convolutional neural networks for shape analysis (Riegler et al., 2017a; Wang et al., 2017), and its memory and computational efficiency makes it suitable for generating high-resolution shapes (Tatarchenko et al., 2017; Häne et al., 2017). The octree-based generation network usually predicts the occupancy probability of an octant (occupied, free, or boundary) and splits the octants labeled boundary. The prediction and splitting procedures are recursively performed until the predefined max depth of the octree is reached. At the finest level the non-empty leaf octants represent the predicted surface. In existing work, the non-empty leaf octants at the finest level can be regarded as uniform samples of the shape in the x, y, and z directions. We observe that it is actually not necessary to store the shape information in this uniform way since the local shape inside some octants can be represented by simple patches, like planar patches. Therefore, by storing the patch information and terminating the octant split early if the patch associated with the octant well approximates the local shape, the generated octree can have a more compact and adaptive representation. Furthermore, the stored patch has a higher order approximation accuracy than using the center or one of the corners of the octant as the sample of the surface.
Based on the above observations, we propose a novel 3D convolutional neural network for 3D shapes called Adaptive Octree-based CNN, or Adaptive O-CNN for short. Adaptive O-CNN is based on a novel patch-guided adaptive octree shape representation which adaptively splits the octant according to the approximation error of the estimated simple patch to the local shape contained by the octant. The decoder of Adaptive O-CNN predicts the occupancy probability of octants (empty, surface-well-approximated, and surface-poorly-approximated), infers the local patch at each non-empty octant at each level, and splits octants whose label is surface-poorly-approximated. This results in an adaptive octree whose estimated local patches at non-empty leaf octants are a multi-scale and adaptive representation of the predicted shape. Besides the decoder, we also develop an efficient 3D encoder for adaptive octrees and use it for shape classification and as part of a 3D autoencoder.
Our Adaptive O-CNN inherits the advantages of octree-based CNNs and gains substantial efficiency in memory and computation cost compared with the existing octree-based CNNs due to the use of the adaptive octree data structure. The local patch estimation at each level also enhances the generated shape quality significantly. With all of these features, Adaptive O-CNN is capable of generating high-resolution and high-quality shapes efficiently. We evaluate Adaptive O-CNN on different tasks, including shape classification, 3D autoencoding, shape prediction from a single image, and shape completion for incomplete data. We demonstrate the superiority of Adaptive O-CNN over the state-of-the-art learning-based shape generation techniques in terms of shape quality.
2. Related Work
Shape representations for 3D CNNs
Due to the variety of 3D shape representations, there is no universal representation for 3D learning. The (dense)-voxel representation equipped with binary occupancy signals or signed distance values is popular in existing 3D CNN frameworks (Wu et al., 2015; Maturana and Scherer, 2015) due to its simplicity and similarity to its 2D counterpart, images. Voxel-based 3D CNNs often suffer from high memory consumption, and thus have difficulty supporting high-resolution input. Since the 3D shape only occupies a small region in its bounding volume, there is a trend toward building a sparse-voxel representation for 3D CNNs. A series of works including (Graham, 2015; Uhrig et al., 2017; Wang et al., 2017; Riegler et al., 2017a) explore the sparsity of voxels and define proper convolution and pooling operations on sparse voxels with the aid of the octree structure and its variants. Our patch-guided adaptive octree also belongs to this type of representation, but with greater sparsity and better accuracy because of its adaptiveness and patch fitting.
The multi-view representation regards the shape as a collection of images rendered from different views (Su et al., 2015). The images can contain RGB color information or view-dependent depth values, and it is easy to feed them to 2D CNNs and utilize networks pretrained on ImageNet (Deng et al., 2009). However, the multi-view representation may miss partial information of the shape due to occlusion and insufficient views. Recently, the point-based representation has become popular due to its simplicity and flexibility. PointNet (Qi et al., 2017a) and its successor PointNet++ (Qi et al., 2017b) regard a shape as an unorganized point cloud and use symmetric functions to achieve the permutation invariance of points. These point-based CNNs are suited to applications whose input can be well approximated by a set of points or naturally has a point representation, like LiDAR scans. For mesh inputs where the neighbor region is well-defined, graph-based CNNs (Bronstein et al., 2017) and manifold-based CNNs (Boscaini et al., 2015; Masci et al., 2015) find their unique advantages for shape analysis, especially for solving the shape correspondence problem.
Developing effective 3D decoders is the key to the learning-based shape generation task. The existing work can be categorized according to their shape representations.
Dense voxel-based decoder. Brock et al. (2016) proposed a voxel-based variational autoencoder (Kingma and Welling, 2014) to reconstruct 3D shapes and utilized the trained latent code for shape classification. Choy et al. (2016) combined the power of the 3D volumetric autoencoder and the long short-term memory (LSTM) technique to reconstruct a volumetric grid from single-view or multi-view images. Generative adversarial networks (GAN) (Goodfellow et al., 2016) were introduced to voxel-based shape generation and reconstruction (Wu et al., 2016; Yang et al., 2018b) with different improvement strategies. However, the low-resolution limitation of the voxel representation remains.
Sparse voxel-based decoder. The works of (Tatarchenko et al., 2017; Häne et al., 2017) show that the octree-based representation offers better efficiency and higher resolution than the (dense)-voxel representation for shape prediction. Riegler et al. (2017b) demonstrated the usage of the octree-based decoder on depth fusion. High-resolution outputs are made possible by octree-based decoders. Our Adaptive O-CNN further improves the efficiency and the prediction accuracy of the octree-based generation network.
Multi-view decoder. Soltani et al. ([n. d.]) proposed to learn a generative model over multi-view depth maps or their corresponding silhouettes, and reconstruct 3D shapes via a deterministic rendering function. Lun et al. (2017) used an autoencoder structure with a GAN to infer view-dependent depths of a category-specified shape from one or two sketch inputs and fused all the outputs to reconstruct the shape. Lin et al. (2018) used the projection loss between the point cloud assembled from different views and the ground truth to further refine the predicted shape.
Point-based decoder. Su et al. (2017) designed PointSetGen to predict point coordinates from a single image. The Chamfer distance and Earth Mover’s distance metrics are used as the loss functions to penalize the deviation between the prediction and the ground truth. The generated point set roughly approximates the expected shape. Recently, Achlioptas et al. (2018) adapted the GAN technique to improve the point-set generation.
Mesh-based decoder. By assuming that the topology of the generated shape is genus-zero or of a disk topology, a series of works (Sinha et al., 2017; Wang et al., 2018; Kato et al., 2018; Yang et al., 2018a) predicts the deformation of template mesh/point cloud vertices via CNNs. Groueix et al. ([n. d.]) relaxed the topology constraints by introducing multiple 2D patches and predicting the mappings from 2D to 3D, and achieved better-quality shapes. However, the uncontrolled distortion introduced by the deformation or the mapping often yields highly irregular and distorted mesh elements that degrade the predicted shape quality.
Primitive decoder. Many shapes like human-made objects consist of simple parts. So instead of predicting low-level elements like points and voxels, predicting middle-level or even high-level primitives is essential to understanding the shape structure. Li et al. (2017) proposed a recursive neural network based on an autoencoder to generate the hierarchical structure of shapes. Tulsiani et al. (2017) abstracted the input volume by a set of simple primitives, like cuboids, via an unsupervised learning approach. Zou et al. (2017) built a training dataset where the shapes are approximated by a set of primitives as the ground truth, and they proposed a generative recurrent neural network to generate a set of simple primitives from a single image to reconstruct the shape. Sharma et al. (2018) attempted to solve a more challenging problem: decoding a shape to a CSG tree. We regard a 3D shape as a collection of simple surface patches and use an adaptive octree to organize them for efficient processing. In our Adaptive O-CNN, a simple primitive patch, the planar patch, is estimated at each octant to approximate the local shape.
The octree technique (Meagher, 1982) partitions a three-dimensional space recursively by subdividing it into eight octants. It serves as a central technique for many computer graphics applications, like rendering, shape reconstruction and collision detection. Due to its spatial efficiency and friendliness to GPU implementation, the octree and its variants have been used as the shape representation for 3D learning as described above. The commonly used octant partitioning depends on the existence of the shape inside the octant and is performed until the max tree depth is reached, which usually results in a uniform sampling. Since the shape signal is actually distributed unevenly in the space, an adaptive sampling strategy can be integrated with the octree to further reduce the size of the octree. Frisken et al. (2000) proposed the octree-based adaptive distance field (ADF) to maintain high sampling rates in regions where the distance field contains fine detail and low sampling rates where the field varies smoothly. They subdivide a cell in which the distance field cannot be well approximated by bilinear interpolation of the corner values. The ADF greatly reduces the memory cost and accelerates many processing operations. Our patch-guided adaptive octree follows this adaptive principle and uses the fitting error of the local patch to guide the partitioning. The shape is approximated by all the patches at the leaf octants of the octree with a guaranteed approximation accuracy.
3. Patch-guided Adaptive Octree
We introduce a patch-guided partitioning strategy to generate adaptive octrees. For a given surface $S$, we start with its bounding box and perform 1-to-8 subdivision. For an octant $O$, denote by $S_O$ the local surface of $S$ restricted by the cubical region of $O$. If $S_O \neq \emptyset$, we approximate $S_O$ by a simple surface patch. In this paper, we choose the simplest surface, a planar patch, to guide the adaptive octree construction. The best plane $P$ with the least approximation error to $S_O$ is the minimizer of the following objective:
$$\min_{\mathbf{n}, d} \int_{\mathbf{p} \in S_O} (\mathbf{n} \cdot \mathbf{p} + d)^2 \, \mathrm{d}\mathbf{p}, \quad \text{s.t. } \|\mathbf{n}\| = 1.$$
Here $\mathbf{n} \in \mathbb{R}^3$ is the unit normal vector of the plane and the plane equation is $P: \mathbf{n} \cdot \mathbf{x} + d = 0$. To make the normal direction consistent with the underlying shape normal, we check whether the angle between $\mathbf{n}$ and the average normal of $S_O$ is less than 90 degrees: if not, $\mathbf{n}$ and $d$ are multiplied by $-1$. In the rest of the paper, we always assume that the local planes are reoriented in this way.
We denote by $P_O$ the planar patch of $P$ restricted by the cubical region of $O$. The shape approximation quality of the local patch, $\delta_O$, is defined by the Hausdorff distance between $P_O$ and $S_O$:
$$\delta_O = \mathrm{dist}_H(P_O, S_O).$$
The revised partitioning rule of the octree is: for any octant $O$ which is not at the max depth level, subdivide it if $S_O \neq \emptyset$ and $\delta_O$ is larger than the predefined threshold $\hat{\delta}$.
By following this rule, a patch-guided adaptive octree can be generated. The patches at all the non-empty leaf octants provide a good approximation to the input 3D shape: the Hausdorff distance between them and the input is bounded by $\hat{\delta}$. In practice, $\hat{\delta}$ is set relative to $h$, the edge length of the finest grid of the octree.
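The plane fitting and the subdivision test can be sketched for a sampled surface as follows. This is a minimal illustration, not the authors' implementation: it assumes the local surface is given as a NumPy array of sample points with normals, replaces the integral by a sum over samples, and uses the largest point-to-plane distance as a simple stand-in for the Hausdorff distance. The function names `fit_plane` and `needs_subdivision` are hypothetical.

```python
import numpy as np

def fit_plane(points, normals):
    """Least-squares plane n.x + d = 0 through sampled surface points."""
    centroid = points.mean(axis=0)
    # The unit normal minimizing sum((n.p + d)^2) is the eigenvector of the
    # scatter matrix with the smallest eigenvalue, and d = -n.centroid.
    centered = points - centroid
    _, eigvecs = np.linalg.eigh(centered.T @ centered)  # ascending eigenvalues
    n = eigvecs[:, 0]
    d = -float(n @ centroid)
    # Reorient so the plane normal agrees with the average surface normal.
    if n @ normals.mean(axis=0) < 0:
        n, d = -n, -d
    return n, d

def needs_subdivision(points, normals, threshold, depth, max_depth):
    """Patch-guided rule: split a non-empty octant whose plane fits poorly."""
    if depth >= max_depth or len(points) == 0:
        return False
    n, d = fit_plane(points, normals)
    # Assumption: max point-to-plane distance approximates the Hausdorff error.
    error = float(np.abs(points @ n + d).max())
    return error > threshold
```

An octant whose samples lie on a plane is kept as a leaf, while a bent patch is split until the fit is good enough or the max depth is reached.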
Figure 2 shows a planar-patch-guided adaptive octree for the 3D Bunny model. We can see that the planar patches are of different sizes due to the adaptiveness of the octree.
For better visualization, we also illustrate the adaptive octree idea in 2D (see Figure 3) for a 2D curve input. It is clear that the line-segment-guided adaptive quadtree takes much less memory compared to the quadtree, and the collection of line segments is a good approximation to the input.
Watertight mesh conversion
Due to the approximation error, the local patches of adjacent non-empty leaf octants are not seamlessly connected, and gaps exist on the boundary region of octants. This artifact can be found in Figure 2(e) and Figure 3-right. To fill these gaps, surface reconstruction (Kazhdan and Hoppe, 2013; Fuhrmann and Goesele, 2014) and polygonal repairing (Ju, 2004; Attene et al., 2013) techniques can be employed.
4. Adaptive O-CNN
The major components of a 3D CNN include the encoder and the decoder, which are essential to shape classification, shape generation and other tasks. In Section 4.1 and Section 4.2, we introduce the 3D encoder and decoder of our adaptive octree-based CNN.
4.1. 3D Encoder of Adaptive O-CNN
Since the main difference between the octree and the adaptive octree is the subdivision rule, the efficient GPU implementation of the octree (Wang et al., 2017) can be adapted to handle the adaptive octree easily. In the following, we first briefly review O-CNN (Wang et al., 2017), then introduce the Adaptive O-CNN 3D encoder.
Recap of O-CNN encoder
The key idea of O-CNN is to store the sparse surface signal, such as normals, in the finest non-empty octants and constrain the CNN computation within the octree. In each level of the octree, each octant is identified by its shuffled key (Wilhelms and Van Gelder, 1992). The shuffled keys are sorted in ascending order and stored in a contiguous array. Given the shuffled key of an octant, we can immediately calculate the shuffled keys of its neighbor octants and retrieve the corresponding neighborhood information, which is essential to implementing efficient CNN operations. To obtain the parent-children correspondence between the octants in two consecutive octree levels and mark out the empty octants, an additional Label array is introduced to record the information for each octant. Common CNN operations defined on the octree, such as convolution and pooling, are similar to volumetric CNN operations. The only difference is that the octree-based CNN operations are constrained within the octree by following the principle: “where there is an octant, there is CNN computation”. Initially, the shape signal exists in the finest octree level, then at each level of the octree, the CNN operations are applied sequentially. When the stride of the CNN operation is 1, the signal is processed with unchanged resolution and it remains in the current octree level. When the stride of the CNN operation is larger than 1, the signal is condensed and flows along the octree from the bottom to the top. Figure 4-upper illustrates the encoder structure of O-CNN.
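To make the shuffled-key lookup concrete, the sketch below interleaves the bits of integer octant coordinates and finds a neighbor in the sorted key array by binary search. It is only an illustration under assumptions: the bit order inside each triple (x, then y, then z) and the helper names are choices made here, and the actual GPU implementation may differ.

```python
import bisect

def shuffled_key(x, y, z, depth):
    """Interleave the bits of integer octant coordinates (x, y, z)."""
    key = 0
    for i in range(depth):
        key |= ((x >> i) & 1) << (3 * i + 2)
        key |= ((y >> i) & 1) << (3 * i + 1)
        key |= ((z >> i) & 1) << (3 * i)
    return key

def unshuffle(key, depth):
    """Recover (x, y, z) from a shuffled key."""
    x = y = z = 0
    for i in range(depth):
        x |= ((key >> (3 * i + 2)) & 1) << i
        y |= ((key >> (3 * i + 1)) & 1) << i
        z |= ((key >> (3 * i)) & 1) << i
    return x, y, z

def neighbor_key(key, offset, depth):
    """Key of the octant displaced by `offset`, or None outside the grid."""
    coords = [c + o for c, o in zip(unshuffle(key, depth), offset)]
    if any(c < 0 or c >= (1 << depth) for c in coords):
        return None
    return shuffled_key(*coords, depth)

def find_octant(sorted_keys, key):
    """Binary search in the sorted key array; -1 if the octant is absent."""
    i = bisect.bisect_left(sorted_keys, key)
    return i if i < len(sorted_keys) and sorted_keys[i] == key else -1
```

With this layout the key of the parent octant is simply `key >> 3`, so the eight children of one parent occupy consecutive positions in the sorted array.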
We reuse O-CNN’s octree implementation for the 3D encoder of Adaptive O-CNN. The data storage of the adaptive octree on the GPU and the convolution and pooling operations are the same as in O-CNN. There are two major differences between Adaptive O-CNN and O-CNN: (1) the input signal appears at all the octants, not only at the finest octants; (2) the computation starts from leaf octants at different levels simultaneously and the computed features are assembled across different levels. We detail these differences as follows.
Different from O-CNN, which only stores the shape signals at the finest octants, we utilize all the estimated local planes as the input signal. For an octant $O$ at the $l$-th level whose local plane is $\mathbf{n} \cdot \mathbf{x} + d = 0$, we set a four-channel input signal in it: $(\mathbf{n}, d^*)$. Here $\mathbf{c}$ is the center point of $O$ and $d^* = d + \mathbf{n} \cdot \mathbf{c}$. Note that $\mathbf{n} \cdot (\mathbf{x} - \mathbf{c}) + d^* = 0$ is the same plane equation. We use $d^*$ instead of $d$ because $d^*$ is bounded by the grid size of the $l$-th level and is a relative value, while $d$ has a large range since it measures the distance from the origin to the plane. For an empty octant, the input signal is set to $(0, 0, 0, 0)$.
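As a small worked example of the four-channel signal, the sketch below rewrites a plane n·x + d = 0 relative to the octant center c, assuming the relative offset is d* = d + n·c so that n·(x − c) + d* = 0 describes the same plane. The names `octant_signal` and `EMPTY_SIGNAL` are illustrative only.

```python
def octant_signal(n, d, center):
    """Four-channel input (n_x, n_y, n_z, d_star) for a non-empty octant."""
    # d_star is the signed offset of the plane measured from the octant center.
    d_star = d + sum(ni * ci for ni, ci in zip(n, center))
    return (*n, d_star)

# Assumed all-zero signal for empty octants.
EMPTY_SIGNAL = (0.0, 0.0, 0.0, 0.0)

# Plane z = 0.3 seen from an octant centered at (0.25, 0.25, 0.25):
signal = octant_signal((0.0, 0.0, 1.0), -0.3, (0.25, 0.25, 0.25))
```

Here the last channel is about -0.05: the plane lies 0.05 above the octant center, a small relative value, whereas d = -0.3 depends on where the origin is.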
Adaptive O-CNN 3D encoder
We design a novel network structure to take the adaptive octree as input. On each level of the octree, we apply a series of convolution operators and ReLUs to the features on all the octants at this level, and the convolution kernel is shared by these octants. Then the processed features at the $l$-th level are downsampled to the $(l-1)$-th level via pooling and are fused with the features at the $(l-1)$-th level by the element-wise max operation. These new features can be further processed and fused with features at the $(l-2)$-th level, $(l-3)$-th level, …, up to the coarsest level. In our implementation, the coarsest level is set to 2, where the octants at the 2nd level are enforced to be full so that the features all have the same dimension. Figure 4-lower illustrates our Adaptive O-CNN 3D encoder architecture.
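The cross-level fusion can be sketched on plain dictionaries as follows. This is a simplified illustration, not the network code: features are Python lists keyed by hypothetical octant ids, pooling is a channel-wise max over the children of each parent, and the convolutions between fusion steps are omitted.

```python
def max_pool_children(child_feats):
    """Channel-wise max over the features of the (up to 8) children."""
    return [max(channel) for channel in zip(*child_feats)]

def fuse_levels(fine, coarse, parent_of):
    """Pool level-l features to level l-1 and fuse them by element-wise max.

    fine:      {octant id: feature list} at level l
    coarse:    {octant id: feature list} at level l-1
    parent_of: {fine octant id: coarse octant id}
    """
    grouped = {}
    for oid, feat in fine.items():
        grouped.setdefault(parent_of[oid], []).append(feat)
    fused = dict(coarse)
    for pid, feats in grouped.items():
        pooled = max_pool_children(feats)
        if pid in fused:
            fused[pid] = [max(a, b) for a, b in zip(fused[pid], pooled)]
        else:
            fused[pid] = pooled
    return fused
```

Repeating `fuse_levels` from the finest level upward, with convolutions in between, would carry information from leaf octants at every depth to the coarsest level.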
4.2. 3D Decoder of Adaptive O-CNN
We design a 3D decoder to generate an adaptive octree from a given latent code. The decoder structure is shown in Figure 5. At each octree level, we train a neural network to predict the patch approximation status of each octant (empty, surface-well-approximated, or surface-poorly-approximated) and regress the local patch parameters. Octants labeled surface-poorly-approximated are subdivided, and the features in them are passed to their children via a deconvolution operation (also known as “transposed convolution” or “up-convolution”). The label surface-well-approximated implies that the local patch approximates the local shape well, with the error bounded by $\hat{\delta}$; the network stops splitting at such octants and leaves them as leaf nodes at the current octree level. An adaptive octree can be predicted in this recursive way until the max depth is reached.
The neural network for prediction is simple. It consists of “FC + BN + ReLU + FC” operations. Here BN represents Batch Normalization and FC represents a Fully Connected layer. This module is shared across all octants at the same level of the adaptive octree. The output of the prediction module includes the patch approximation status and the plane patch parameters $(\mathbf{n}, d^*)$. The patch approximation status guides the subdivision of the octree and is also required by the octree-based deconvolution operator.
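The control flow of this recursive generation can be sketched as below. It is a schematic under assumptions: `predict` and `deconv` are hypothetical callables standing in for the shared prediction module and the deconvolution at each level, and an octant still labeled poorly approximated at the max depth is simply kept as a leaf.

```python
from collections import deque

EMPTY, WELL, POORLY = 0, 1, 2

def decode(root_features, predict, deconv, start_depth, max_depth):
    """Grow an adaptive octree breadth-first from coarse octant features.

    predict(feature, depth) -> (status, plane)   # hypothetical predictor
    deconv(feature, depth)  -> 8 child features  # hypothetical up-convolution
    Returns the leaf patches as (depth, path, plane) tuples.
    """
    leaves = []
    queue = deque((start_depth, (i,), f) for i, f in enumerate(root_features))
    while queue:
        depth, path, feat = queue.popleft()
        status, plane = predict(feat, depth)
        if status == EMPTY:
            continue  # nothing to output and nothing to refine
        if status == WELL or depth == max_depth:
            leaves.append((depth, path, plane))  # stop splitting here
            continue
        for k, child in enumerate(deconv(feat, depth)):
            queue.append((depth + 1, path + (k,), child))
    return leaves
```

Only octants labeled poorly approximated reach the next level, so deeper layers see just the regions that still need refinement.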
The loss function of the Adaptive O-CNN decoder includes the structure loss and the patch loss. The structure loss measures the difference between the predicted octree structure and its ground truth. Since the determination of octant status is a 3-class classification, we use the cross entropy loss to define the structure loss. Denoting by $H_l$ the cross entropy at the $l$-th level of the octree, the structure loss is formed by the weighted sum of cross entropies across all the levels:
$$L_{struct} = \sum_{l=3}^{l_{max}} \frac{w_l}{n_l} H_l.$$
Here $n_l$ is the number of octants at the $l$-th level of the predicted octree, $l_{max}$ is the max depth, and $w_l$ is the weight defined on each level. Similar to the encoder, the coarsest level of the octree is 2 and is full of octants, so $l$ starts with 3 in the above equation. In our implementation, we set $w_l$ to 1.
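A minimal sketch of such a structure loss is given below, assuming that the cross entropy of each level is summed over its octants and divided by the octant count before the weighted sum over levels. The data layout (a dictionary from level to a list of probability vectors and labels) is a choice made for illustration.

```python
import math

def cross_entropy(probs, label):
    """Cross entropy of one octant: -log of the true class probability."""
    return -math.log(probs[label])

def structure_loss(levels, weights=None):
    """levels: {l: list of (predicted class probabilities, true label)}."""
    total = 0.0
    for l, octants in levels.items():
        w = 1.0 if weights is None else weights[l]  # default weight 1 per level
        h_l = sum(cross_entropy(p, y) for p, y in octants)
        total += w * h_l / len(octants)
    return total
```

Normalizing by the octant count keeps deep levels, which contain many more octants, from dominating the sum.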
The patch loss measures the squared distance error between the plane parameters and the ground truth at all the leaf octants in each level:
$$L_{patch} = \sum_{l=4}^{l_{max}} \frac{\lambda}{n'_l} \sum_{i=1}^{n'_l} \left( \|\mathbf{n}_i - \bar{\mathbf{n}}_i\|^2 + |d^*_i - \bar{d}^*_i|^2 \right).$$
Here $\mathbf{n}_i$ and $d^*_i$ are the predicted parameters, $\bar{\mathbf{n}}_i$ and $\bar{d}^*_i$ are the corresponding ground-truth values, $n'_l$ is the number of leaf octants at the $l$-th level of the predicted octree, and $\lambda$ is a constant weight. In our implementation we make the octree adaptive when the octree level is over 4, so $l$ starts with 4 in the above equation. Note that for the wrongly generated octants that do not exist in the ground truth, there is no patch loss, and they are penalized by the structure loss only.
We use $L_{struct} + L_{patch}$ as the loss function for our decoder. Since the predicted plane should pass through the octant (otherwise it violates the assumption that the planar patch is inside the octant cube), we add the constraint $|d^*| < \frac{\sqrt{3}}{2} h_l$, where $h_l$ is the grid size of the $l$-th level octants, by applying the $\tanh$ function to the network output.
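The patch loss and the offset constraint can be illustrated as follows. This is a sketch under stated assumptions: the bound is taken to be half the cube diagonal, sqrt(3)/2 times the grid size, because no point of the octant is farther from its center than that; tanh is used as one bounded function that enforces it; and the default weight `lam=1.0` is a placeholder, not a value from the text.

```python
import math

def bounded_offset(raw, grid_size):
    """Squash an unbounded network output so the plane can reach the octant."""
    # Assumed bound: half the cube diagonal, sqrt(3)/2 * grid_size.
    return 0.5 * math.sqrt(3.0) * grid_size * math.tanh(raw)

def patch_loss(levels, lam=1.0):
    """levels: {l: list of ((n, d_star), (n_gt, d_star_gt)) per leaf octant}."""
    total = 0.0
    for leaves in levels.values():
        if not leaves:
            continue  # a level without leaf octants contributes nothing
        err = 0.0
        for (n, d), (n_gt, d_gt) in leaves:
            err += sum((a - b) ** 2 for a, b in zip(n, n_gt)) + (d - d_gt) ** 2
        total += lam * err / len(leaves)
    return total
```

Octants that were generated by mistake have no ground-truth plane, so they would simply be left out of `levels` and penalized by the structure loss alone.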
5. Experiments and Comparisons
To evaluate Adaptive O-CNN, we conduct three experiments: 3D shape classification, 3D autoencoding and 3D shape prediction from a single image. All the experiments were done on a desktop computer with an Intel Core I7-6900K CPU (3.2GHz) and a GeForce GTX Titan X GPU (12 GB memory). Our implementation is based on the Caffe framework (Jia et al., 2014) and the source code is available at https://github.com/Microsoft/O-CNN. The detailed Adaptive O-CNN network configuration is provided in the supplemental material.
For building the training dataset for our experiments, we first follow the approach of (Wang et al., 2017) to obtain a dense point cloud with oriented normals by virtual 3D scanning, then we build the planar-patch-guided adaptive octree from it via the construction procedure introduced in Section 3.
5.1. Shape classification
We evaluate the efficiency and efficacy of our Adaptive O-CNN encoder on the 3D shape classification task.
We performed the shape classification task on the ModelNet40 dataset (Wu et al., 2015), which contains 12,311 well-annotated CAD models from 40 categories. The training data is augmented by rotating each model about its upright axis at 12 uniform intervals. The planar-patch-guided adaptive octrees are generated at four different resolutions, and we conducted the shape classification experiment on each of them.
To clearly demonstrate the advantages of our adaptive octree-based encoder over the O-CNN (Wang et al., 2017), we use the same network parameters of O-CNN including the parameters of CNN operations, the number of training parameters and the dropout strategy. The only difference is the encoder network structure as shown in Figure 4. After training, we use the orientation pooling technique (Qi et al., 2016; Su et al., 2015) to vote for the results from the 12 predictions of the same object under different poses.
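The voting step can be illustrated with a short sketch. It assumes that the class scores of the rotated copies are averaged before taking the arg-max, which is one common way to pool over orientations; the exact pooling used in the experiments may differ.

```python
def orientation_vote(predictions):
    """Average class scores over the rotated copies and pick the best class."""
    num_classes = len(predictions[0])
    mean = [sum(p[c] for p in predictions) / len(predictions)
            for c in range(num_classes)]
    return max(range(num_classes), key=mean.__getitem__)
```

For example, scores `[[0.6, 0.4], [0.2, 0.8], [0.3, 0.7]]` from three poses of one object vote for class 1, even though the first pose alone would have chosen class 0.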
|Memory||Voxel||0.71 GB||3.7 GB||-||-|
|Memory||O-CNN||0.58 GB||1.1 GB||2.7 GB||6.4 GB|
|Memory||Adaptive O-CNN||0.51 GB||0.95 GB||1.5 GB||1.7 GB|
|Time||Voxel||425 ms||1648 ms||-||-|
|Time||O-CNN||41 ms||117 ms||334 ms||1393 ms|
|Time||Adaptive O-CNN||34 ms||63 ms||112 ms||307 ms|
|PointNet (Qi et al., 2017a)||89.2%||PointNet++ (Qi et al., 2017b)||91.9%|
|VRN Ensemble (Brock et al., 2016)||95.5%||SubVolSup (Qi et al., 2016)||89.2%|
|OctNet (Riegler et al., 2017a)||86.5%||O-CNN (Wang et al., 2017)||90.6%|
|Kd-Network (Klokov and Lempitsky, 2017)||91.8%||Adaptive O-CNN||90.5%|
We record the peak memory consumption and the average time of one forward and backward iteration with batch size 32, and report them with the classification accuracy on the test dataset in Table 1. The experiments show that the classification accuracy of Adaptive O-CNN is comparable to that of O-CNN under all the resolutions, and that the memory and computational costs of Adaptive O-CNN are significantly lower, especially on high-resolution input: at the highest resolution, Adaptive O-CNN gains about a 4-times speed-up and reduces GPU memory consumption by 73% compared to O-CNN. Compared with state-of-the-art learning-based methods, the classification accuracy of Adaptive O-CNN is also comparable (see Table 2).
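The speed-up and memory figures quoted above follow from the last column of the memory and time rows, as this short check shows (the variable names are illustrative):

```python
ocnn_time_ms, adaptive_time_ms = 1393, 307   # last column of the Time rows
ocnn_mem_gb, adaptive_mem_gb = 6.4, 1.7      # last column of the Memory rows

speedup = ocnn_time_ms / adaptive_time_ms
memory_saving = 1 - adaptive_mem_gb / ocnn_mem_gb
print(f"speed-up: {speedup:.1f}x, memory saving: {memory_saving:.0%}")
# prints "speed-up: 4.5x, memory saving: 73%"
```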
As seen from Table 1, when the input resolution is beyond , the classification accuracy of Adaptive O-CNN drops slightly. We find that when the input resolution increases from to , the training loss decreases from 0.168 to 0.146, whereas the testing loss increases from 0.372 to 0.375. We conclude that Adaptive O-CNN with a deeper octree tends to overfit the training data of ModelNet40. The result is also consistent with the observation of (Wang et al., 2017) on O-CNN. With more training data, for instance, by rotating each training object 24 times around their upright axis uniformly, the classification accuracy can increase by 0.2% under the resolution of .
5.2. 3D Autoencoding
The Autoencoder technique is able to learn a compact representation for the input and recovers the signal from the latent code via a decoder. We use the Adaptive O-CNN encoder and decoder presented in Section 4 to form a 3D autoencoder.
We trained our 3D autoencoder on the ShapeNet Core v2 dataset (Chang et al., 2015), which consists of 39,715 3D models from 13 categories. The training/test split is the same as the one used in AtlasNet (Groueix et al., [n. d.]) and the point-based decoder (PSG) (Su et al., 2017). The adaptive octree we used has a max depth of 7, i.e., the voxel resolution is $128^3$.
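We evaluate the quality of the decoded shape by measuring the Chamfer distance between it and its ground-truth shape. With the ground-truth point cloud denoted by $S_g$, and the points predicted by the neural network by $S$, the Chamfer distance between $S_g$ and $S$ is defined as:
$$D(S_g, S) = \frac{1}{|S_g|} \sum_{\mathbf{x} \in S_g} \min_{\mathbf{y} \in S} \|\mathbf{x} - \mathbf{y}\|_2^2 + \frac{1}{|S|} \sum_{\mathbf{y} \in S} \min_{\mathbf{x} \in S_g} \|\mathbf{x} - \mathbf{y}\|_2^2.$$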
Because our decoder outputs a patch-guided adaptive octree, to calculate the Chamfer distance, we sample a set of dense points uniformly from the estimated planar patches: we first subdivide the planar patch contained in the non-empty leaf node of the adaptive octree towards the resolution of , then randomly sample one point on each of the subdivided planar patches to form the output point cloud. For the ground-truth dense point cloud, we also uniformly sample points from it under the resolution of .
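For small point sets the Chamfer distance can be evaluated by brute force, as in the sketch below. It assumes the symmetric form that averages the squared distance to the nearest neighbor in each direction; other conventions (unsquared distances, sums instead of means) also appear in the literature, and a spatial index would be needed for dense point clouds.

```python
def chamfer_distance(a, b):
    """Symmetric mean squared nearest-neighbor distance between point sets."""
    def sq(p, q):
        return sum((pi - qi) ** 2 for pi, qi in zip(p, q))

    def one_sided(src, dst):
        # Average over src of the squared distance to the closest dst point.
        return sum(min(sq(p, q) for q in dst) for p in src) / len(src)

    return one_sided(a, b) + one_sided(b, a)
```

The two terms play different roles: the first penalizes ground-truth regions that the prediction misses, and the second penalizes predicted points that lie far from the ground truth.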
The quality measurement is summarized in Table 3, where we also compare with two state-of-the-art 3D autoencoder methods: AtlasNet (Groueix et al., [n. d.]), which generates a set of mesh patches as the approximation of the shape, and PointSetGen (PSG) (Su et al., 2017), which generates a point cloud output. Compared with the two types of AtlasNet, which predict 25 and 125 mesh patches, respectively, our Adaptive O-CNN autoencoder achieves the best quality on average. Note that the loss function of AtlasNet is exactly the Chamfer distance, while our autoencoder is not trained specifically for this loss but still performs well. Both our method and AtlasNet are clearly much better than PSG (Su et al., 2017).
Our Adaptive O-CNN performs worse than AtlasNet in some categories, such as plane, chair and firearm. We found on relatively thin parts of the models in those categories, such as the wing of the plane, arm of the chair, as well as the barrel of the gun, our Adaptive O-CNN has a larger geometry deviation from the original shape. However, for AtlasNet, although its deviation is smaller, we found that it approximates the thin parts with a single patch or messy patches (e.g. the single patch for the right arm of the chair, the folded patches for the gun barrel and the plane wing as seen in Figure 6), and the volume structure of the thin parts is totally lost and it is difficult or even impossible to define the inside and outside on those regions. We conclude that the Chamfer distance loss function used in AtlasNet does not penalize this structure loss. On the contrary, because our Adaptive O-CNN is trained with both the octree-based structure loss and patch loss, it successfully approximates the thin parts with better volume structures (e.g. the cylinder like shape for the gun barrel, chair support and the two-side surfaces for the plane wing). The zoom-in images in Figure 6 highlight these differences.
We designed three baseline autoencoders based on the standard octree structure to demonstrate the need for using the adaptive octree:
O-CNN(binary): A vanilla octree based autoencoder. The encoder is presented in Figure 4-upper. The decoder is similar to the Adaptive O-CNN decoder but with two differences: (1) the prediction module only predicts whether a given octant has an intersection with the ground-truth surface. If intersected, the octant will be further subdivided; (2) the loss only involves the structure loss.
O-CNN(patch): An enhanced version of O-CNN(binary). The prediction module also predicts the plane patch on each leaf node at the finest level and the patch loss is added.
O-CNN(patch*): An enhanced version of O-CNN(patch). The prediction module predicts the plane patch on each leaf node at each level and the patch loss is added.
These three networks are trained on the 3D autoencoding task. The statistics of the results are also summarized in Table 3. The Chamfer distance metric of O-CNN(patch) is slightly better than O-CNN(binary) since the regression of plane patches at the finest level enables sub-voxel precision. By considering the patch loss at each depth level, O-CNN(patch*) further improves the reconstruction accuracy due to the hierarchical supervision in the training. However, it is still worse than Adaptive O-CNN. The reason is as follows: during the shape generation of Adaptive O-CNN, if the plane patch generated in an octant in the coarser level can well approximate the ground-truth shape, the Adaptive O-CNN will stop subdividing this octant and the network layers in the finer level are trained to focus on the region with more geometry details. As a result, the Adaptive O-CNN not only avoids the holes in the region that can be well approximated by a large planar patch, but also generates better results for the surface region with more shape details. On the contrary, no matter whether a region can be modeled by a large plane patch or not, the O-CNN based networks subdivide all non-empty octants at each level and predict the surface at the finest level. Therefore, the O-CNN has more chances to predict the occupancy of the finest level voxels wrongly.
The visualization in Figure 7 also demonstrates that Adaptive O-CNN generates more visually pleasing results and outputs large planar patches on flat regions, while the outputs of O-CNN(binary), O-CNN(patch) and O-CNN(patch*) contain more holes due to the inaccurate prediction.
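The difference between the two subdivision rules can be illustrated with a toy sketch. The `Status` names, the octant tuple layout, and the `plane_classify` stand-in for the network's prediction are hypothetical and are not the paper's implementation; the snippet only counts how many octants each rule visits for a flat surface.

```python
from enum import Enum
from typing import Callable, List, Tuple

class Status(Enum):
    EMPTY = 0
    WELL_APPROXIMATED = 1
    POORLY_APPROXIMATED = 2

# Hypothetical octant id: (depth, ix, iy, iz) inside the unit cube.
Octant = Tuple[int, int, int, int]

def children(o: Octant) -> List[Octant]:
    d, x, y, z = o
    return [(d + 1, 2 * x + i, 2 * y + j, 2 * z + k)
            for i in (0, 1) for j in (0, 1) for k in (0, 1)]

def count_octants(classify: Callable[[Octant], Status],
                  max_depth: int, adaptive: bool) -> int:
    """Count the octants visited while growing an octree from the root."""
    visited = 0
    stack = [(0, 0, 0, 0)]
    while stack:
        o = stack.pop()
        visited += 1
        status = classify(o)
        if o[0] >= max_depth or status is Status.EMPTY:
            continue
        if adaptive and status is Status.WELL_APPROXIMATED:
            continue  # the planar patch is good enough: stop subdividing
        stack.extend(children(o))
    return visited

def plane_classify(o: Octant) -> Status:
    """Toy stand-in for the prediction module: the surface is the plane z = 0.3."""
    d, _, _, z = o
    size = 1.0 / 2 ** d
    if not (z * size <= 0.3 < (z + 1) * size):
        return Status.EMPTY
    # Assumption for the toy: patches fit well from depth 2 onwards.
    return Status.WELL_APPROXIMATED if d >= 2 else Status.POORLY_APPROXIMATED

adaptive_count = count_octants(plane_classify, max_depth=4, adaptive=True)
full_count = count_octants(plane_classify, max_depth=4, adaptive=False)
```

In this toy setting the adaptive rule visits 41 octants while the non-adaptive rule visits 681, which mirrors the argument above: the non-adaptive decoder has to make many more fine-level occupancy predictions on a region that a single coarse patch already covers.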
Application: shape completion
A 3D autoencoder can be used to recover the missing part of a geometric shape and fair the noisy input. We conduct a shape completion task to demonstrate the efficacy of our Adaptive O-CNN. We choose the car category from the ShapeNet Core v2 dataset as the ground-truth data. For each car, we choose 3 to 5 views randomly and sample dense points from these views. On each view, we also randomly crop some regions to mimic holes and perturb point positions slightly to model scan noise. These views are assembled together to serve as the incomplete and noisy data. We trained our Adaptive O-CNN based autoencoder on this synthetic dataset with the incomplete shape as input and the corresponding complete shape as the target. For reference, we also trained the O-CNN(patch) based autoencoder on it. The max depth of the octree in all the networks is set to 7. Figure 8 shows two completion examples. The results from Adaptive O-CNN are closer to the ground-truth, while the O-CNN(patch) misses filling some holes.
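A rough sketch of how such incomplete and noisy inputs could be synthesized is given below. It assumes the per-view point samples are already available as a list of arrays; the function names, the spherical crop, the Gaussian jitter, and all parameter values are illustrative assumptions rather than the settings used in our experiments.

```python
import numpy as np

def corrupt_view(points, rng, n_holes=2, hole_radius=0.1, noise_sigma=0.005):
    """Crop spherical regions around random points, then jitter what is left.

    points: (N, 3) array sampled from one view. All defaults are
    illustrative assumptions, not values from the paper.
    """
    keep = np.ones(len(points), dtype=bool)
    for _ in range(n_holes):
        center = points[rng.integers(len(points))]
        keep &= np.linalg.norm(points - center, axis=1) > hole_radius
    kept = points[keep]
    # Small Gaussian perturbation to mimic scan noise.
    return kept + rng.normal(scale=noise_sigma, size=kept.shape)

def make_incomplete_shape(views, rng):
    """Pick 3 to 5 views at random, corrupt each one, and merge them."""
    n_views = int(rng.integers(3, 6))  # upper bound is exclusive
    chosen = rng.choice(len(views), size=n_views, replace=False)
    return np.concatenate([corrupt_view(views[i], rng) for i in chosen], axis=0)
```

The merged array would serve as the network input, with the complete shape as the target.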
5.3. Shape reconstruction from a single image
Reconstructing 3D shapes from 2D images is an important topic in computer vision and graphics. With the development of 3D deep learning techniques, the task of inferring a 3D shape from a single image has gained much attention in the research community. We conduct experiments on this task for our Adaptive O-CNN and compare it with state-of-the-art methods (Groueix et al., [n. d.]; Su et al., 2017; Tatarchenko et al., 2017).
For the comparisons with the AtlasNet (Groueix et al., [n. d.]) and PointSetGen (PSG) (Su et al., 2017), we use the same dataset which is originally from (Choy et al., 2016). The ground-truth 3D shapes come from ShapeNet Core v2 (Chang et al., 2015), and each object is rendered from 24 viewpoints with a transparent background. For the comparison with OctGen (Tatarchenko et al., 2017), since OctGen was only trained on the car category with the octree of resolution , we also trained our network on the car dataset with the same resolution.
We report the Chamfer distance between the predicted points and points sampled from the original mesh for PointSetGen, AtlasNet and our method in Table 4. As mentioned in (Groueix et al., [n. d.]), they randomly selected 260 shapes (20 per category) to form the testing database. To compare with PointSetGen, they ran the ICP algorithm (Besl and McKay, 1992) to align the predicted points from both PointSetGen and AtlasNet with the ground-truth point cloud. Note that after the ICP alignment, the Chamfer distance error is slightly improved. For a fair comparison, we also ran the ICP algorithm to align our results with the ground-truth. Our method achieves the best performance on 8 out of 13 categories, especially for objects with large flat regions, such as car and cabinet. In Figure 9 and Figure 1 we illustrate some sample outputs from these networks. It is clear that our outputs are more visually pleasing. For the flip phone image in the last row of Figure 9, the reconstruction quality of all methods is lower than for the other input images, because flip phones are rare in the training dataset.
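For readers who wish to reproduce this kind of evaluation, a minimal sketch of a symmetric Chamfer distance between two point sets follows. It uses one common definition (mean squared nearest-neighbor distance in both directions); the exact normalization and scaling behind the numbers in Table 4 are not restated here, so the function should be read as an assumption. The ICP alignment step is not included.

```python
import numpy as np

def chamfer_distance(a, b):
    """Symmetric Chamfer distance between point sets a (N, 3) and b (M, 3)."""
    # Pairwise squared distances, shape (N, M); fine for a few thousand points.
    d2 = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)
    # Mean nearest-neighbor distance from a to b, plus from b to a.
    return d2.min(axis=1).mean() + d2.min(axis=0).mean()
```

For large point sets, the dense distance matrix would typically be replaced by a spatial index such as a k-d tree.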
To compute the Chamfer distance for the output of OctGen, we densely sample points from the boundary octant boxes for evaluation. Our Adaptive O-CNN has a lower Chamfer distance error than OctGen: 0.274 vs. 0.294. A visual comparison is shown in Figure 10: our results preserve more details than OctGen, and the resulting surface patches are much more faithful to the ground-truth, especially on the flat regions.
We present a novel Adaptive O-CNN for 3D encoding and decoding. The encoder and decoder of Adaptive O-CNN utilize the nice properties of the patch-guided adaptive octree structure: compactness, adaptiveness, and high-quality approximation of the shape. We show the high memory and computational efficiency of Adaptive O-CNN, and demonstrate its superiority over other state-of-the-art methods including existing octree-based CNNs on some typical 3D learning tasks, including 3D autoencoding, surface completion from noisy and incomplete point clouds, and surface prediction from images.
One limitation of our implementation is that adjacent patches in the adaptive octree are not seamless. To obtain a well-connected mesh output, we need to use other mesh repairing or surface reconstruction techniques. In fact, we observe that most of the seams can be stitched by snapping the nearby vertices of adjacent patches. We would like to add a regularization loss function to reduce the seams, and develop a post-processing method to stitch all the gaps.
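As an illustration of the vertex-snapping idea, the sketch below merges vertices that lie within a tolerance of each other and moves each merged group to its centroid. This is only one possible realization and not a method evaluated in this paper; the function name `snap_vertices`, the tolerance `eps`, and the brute-force pair search are assumptions made for brevity.

```python
import numpy as np

def snap_vertices(vertices, eps):
    """Merge vertices closer than eps and move each cluster to its centroid."""
    n = len(vertices)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # Brute-force pair search: O(n^2), adequate for a small illustration.
    d = np.linalg.norm(vertices[:, None, :] - vertices[None, :, :], axis=-1)
    for i, j in zip(*np.nonzero(np.triu(d < eps, k=1))):
        ri, rj = find(int(i)), find(int(j))
        if ri != rj:
            parent[rj] = ri  # union the two clusters

    roots = np.array([find(i) for i in range(n)])
    snapped = vertices.astype(float).copy()
    for r in np.unique(roots):
        members = roots == r
        snapped[members] = vertices[members].mean(axis=0)
    return snapped
```

Because clusters are formed transitively, a chain of close vertices collapses into a single point, so `eps` should stay well below the size of the finest octant.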
Another limitation is that the planar patch used in Adaptive O-CNN does not approximate curved features very well; see, for instance, the car wheel in Figure 7. In the future, we would like to explore the use of non-planar surface patches in Adaptive O-CNN. Quadratic surface patches and their subclasses, such as parabolic and ellipsoidal patches, are promising candidates because they have simple expressions and include planes as a special case. Another direction is to use other fitting quality metrics to guide the subdivision of octants, for instance, using the topological similarity between the local fitted patch and the ground-truth surface patch as guidance to ensure that the fitted patch approximates the local shape well both in geometry and topology.
Acknowledgements. We wish to thank the authors of ModelNet and ShapeNet for sharing data, Stephen Lin for proofreading the paper, and the anonymous reviewers for their valuable feedback.
- Achlioptas et al. (2018) Panos Achlioptas, Olga Diamanti, Ioannis Mitliagkas, and Leonidas Guibas. 2018. Learning representations and generative models for 3D point clouds. In International Conference on Learning Representations.
- Attene et al. (2013) Marco Attene, Marcel Campen, and Leif Kobbelt. 2013. Polygon mesh repairing: An application perspective. ACM Comput. Surv. 45, 2 (2013), 15:1–15:33.
- Besl and McKay (1992) P. J. Besl and N. D. McKay. 1992. A method for registration of 3-D shapes. IEEE Trans. Pattern. Anal. Mach. Intell. 14, 2 (1992), 239–256.
- Boscaini et al. (2015) D. Boscaini, J. Masci, S. Melzi, M. M. Bronstein, U. Castellani, and P. Vandergheynst. 2015. Learning class-specific descriptors for deformable shapes using localized spectral convolutional networks. Comput. Graph. Forum 34, 5 (2015), 13–23.
- Brock et al. (2016) Andrew Brock, Theodore Lim, J.M. Ritchie, and Nick Weston. 2016. Generative and discriminative voxel modeling with convolutional neural networks. In 3D deep learning workshop (NIPS).
- Bronstein et al. (2017) M. M. Bronstein, J. Bruna, Y. LeCun, A. Szlam, and P. Vandergheynst. 2017. Geometric deep learning: going beyond Euclidean data. IEEE Sig. Proc. Magazine 34 (2017), 18 – 42. Issue 4.
- Chang et al. (2015) Angel X. Chang, Thomas Funkhouser, et al. 2015. ShapeNet: an information-rich 3D model repository. arXiv:1512.03012 [cs.GR].
- Choy et al. (2016) Christopher B Choy, Danfei Xu, JunYoung Gwak, Kevin Chen, and Silvio Savarese. 2016. 3D-R2N2: A unified approach for single and multi-view 3D object reconstruction. In European Conference on Computer Vision (ECCV). 628–644.
- Deng et al. (2009) Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. 2009. ImageNet: a large-scale hierarchical image database. In Computer Vision and Pattern Recognition (CVPR).
- Frisken et al. (2000) Sarah F. Frisken, Ronald N. Perry, Alyn P. Rockwood, and Thouis R. Jones. 2000. Adaptively sampled distance fields: A general representation of shape for computer graphics. In SIGGRAPH. 249–254.
- Fuhrmann and Goesele (2014) Simon Fuhrmann and Michael Goesele. 2014. Floating scale surface reconstruction. ACM Trans. Graph. (SIGGRAPH) 33, 4 (2014), 46:1–46:11.
- Goodfellow et al. (2016) Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2016. Generative adversarial networks. In Neural Information Processing Systems (NIPS).
- Graham (2015) Ben Graham. 2015. Sparse 3D convolutional neural networks. In British Machine Vision Conference (BMVC).
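- Groueix et al. ([n. d.]) Thibault Groueix, Matthew Fisher, Vladimir G. Kim, Bryan C. Russell, and Mathieu Aubry. [n. d.]. AtlasNet: A Papier-Mâché approach to learning 3D surface generation. In Computer Vision and Pattern Recognition (CVPR).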
- Häne et al. (2017) Christian Häne, Shubham Tulsiani, and Jitendra Malik. 2017. Hierarchical surface prediction for 3D object reconstruction. In Proc. Int. Conf. on 3D Vision (3DV).
- He et al. (2016) K. He, X. Zhang, S. Ren, and J. Sun. 2016. Deep residual learning for image recognition. In Computer Vision and Pattern Recognition (CVPR).
- Jia et al. (2014) Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick, Sergio Guadarrama, and Trevor Darrell. 2014. Caffe: convolutional architecture for fast feature embedding. In ACM Multimedia (ACMMM). 675–678.
- Ju (2004) Tao Ju. 2004. Robust repair of polygonal models. ACM Trans. Graph. (SIGGRAPH) 23, 3 (2004), 888–895.
- Kato et al. (2018) Hiroharu Kato, Yoshitaka Ushiku, and Tatsuya Harada. 2018. Neural 3D Mesh Renderer. In Computer Vision and Pattern Recognition (CVPR).
- Kazhdan and Hoppe (2013) Michael Kazhdan and Hugues Hoppe. 2013. Screened Poisson surface reconstruction. ACM Trans. Graph. 32, 3 (2013), 29:1–29:13.
- Kingma and Welling (2014) Diederik P. Kingma and Max Welling. 2014. Auto-encoding variational Bayes. In International Conference on Learning Representations.
- Klokov and Lempitsky (2017) Roman Klokov and Victor Lempitsky. 2017. Escape from cells: Deep Kd-networks for the recognition of 3D point cloud models. In International Conference on Computer Vision (ICCV).
- Lecun et al. (1998) Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner. 1998. Gradient-based learning applied to document recognition. Proc. IEEE 86, 11 (1998), 2278–2324.
- Li et al. (2017) Jun Li, Kai Xu, Siddhartha Chaudhuri, Ersin Yumer, Hao Zhang, and Leonidas Guibas. 2017. GRASS: Generative recursive autoencoders for shape structures. ACM Trans. Graph. (SIGGRAPH) 36, 4 (2017), 52:1–52:14.
- Lin et al. (2018) Chen-Hsuan Lin, Chen Kong, and Simon Lucey. 2018. Learning efficient point cloud generation for dense 3D object reconstruction. arXiv:1706.07036 [cs.CV]. In AAAI Conference on Artificial Intelligence.
- Lun et al. (2017) Zhaoliang Lun, Matheus Gadelha, Evangelos Kalogerakis, Subhransu Maji, and Rui Wang. 2017. 3D shape reconstruction from sketches via multi-view convolutional networks. In Proc. Int. Conf. on 3D Vision (3DV).
- Masci et al. (2015) Jonathan Masci, Davide Boscaini, Michael M. Bronstein, and Pierre Vandergheynst. 2015. Geodesic convolutional neural networks on Riemannian manifolds. In International Conference on Computer Vision (ICCV).
- Maturana and Scherer (2015) D. Maturana and S. Scherer. 2015. VoxNet: A 3D convolutional neural network for real-time object recognition. In International Conference on Intelligent Robots and Systems (IROS).
- Meagher (1982) Donald Meagher. 1982. Geometric modeling using octree encoding. Computer Graphics and Image Processing 19 (1982), 129–147.
- Qi et al. (2017a) Charles R. Qi, Hao Su, Kaichun Mo, and Leonidas J. Guibas. 2017a. PointNet: Deep learning on point sets for 3D classification and segmentation. In Computer Vision and Pattern Recognition (CVPR).
- Qi et al. (2016) Charles Ruizhongtai Qi, Hao Su, Matthias Nießner, Angela Dai, Mengyuan Yan, and Leonidas J. Guibas. 2016. Volumetric and multi-view CNNs for object classification on 3D data. In Computer Vision and Pattern Recognition (CVPR).
- Qi et al. (2017b) Charles R Qi, Li Yi, Hao Su, and Leonidas J Guibas. 2017b. PointNet++: Deep hierarchical feature learning on point sets in a metric space. In Neural Information Processing Systems (NIPS).
- Riegler et al. (2017b) Gernot Riegler, Ali Osman Ulusoy, Horst Bischof, and Andreas Geiger. 2017b. OctNetFusion: Learning depth fusion from data. In Proc. Int. Conf. on 3D Vision (3DV).
- Riegler et al. (2017a) Gernot Riegler, Ali Osman Ulusoy, and Andreas Geiger. 2017a. OctNet: Learning deep 3D representations at high resolutions. In Computer Vision and Pattern Recognition (CVPR).
- Sharma et al. (2018) Gopal Sharma, Rishabh Goyal, Difan Liu, Evangelos Kalogerakis, and Subhransu Maji. 2018. CSGNet: Neural shape parser for constructive solid geometry. In Computer Vision and Pattern Recognition (CVPR).
- Sinha et al. (2017) Ayan Sinha, Asim Unmesh, Qixing Huang, and Karthik Ramani. 2017. SurfNet: Generating 3D shape surfaces using deep residual networks. In Computer Vision and Pattern Recognition (CVPR).
- Soltani et al. ([n. d.]) Amir Arsalan Soltani, Haibin Huang, Jiajun Wu, Tejas D. Kulkarni, and Joshua B. Tenenbaum. [n. d.]. In Computer Vision and Pattern Recognition (CVPR).
- Su et al. (2017) Hao Su, Haoqiang Fan, and Leonidas Guibas. 2017. A point set generation network for 3D object reconstruction from a single image. In Computer Vision and Pattern Recognition (CVPR).
- Su et al. (2015) H. Su, S. Maji, E. Kalogerakis, and E. Learned-Miller. 2015. Multi-view convolutional neural networks for 3D shape recognition. In International Conference on Computer Vision (ICCV).
- Tatarchenko et al. (2017) M. Tatarchenko, A. Dosovitskiy, and T. Brox. 2017. Octree generating networks: efficient convolutional architectures for high-resolution 3D outputs. In International Conference on Computer Vision (ICCV).
- Tulsiani et al. (2017) Shubham Tulsiani, Hao Su, Leonidas Guibas, Alexei A. Efros, and Jitendra Malik. 2017. Learning shape abstractions by assembling volumetric primitives. In Computer Vision and Pattern Recognition (CVPR).
- Uhrig et al. (2017) Jonas Uhrig, Nick Schneider, Lukas Schneider, Thomas Brox, and Andreas Geiger. 2017. Sparsity invariant CNNs. In Proc. Int. Conf. on 3D Vision (3DV).
- Wang et al. (2018) Nanyang Wang, Yinda Zhang, Zhuwen Li, Yanwei Fu, Wei Liu, and Yu-Gang Jiang. 2018. Pixel2Mesh: Generating 3D mesh models from single RGB images. In Computer Vision and Pattern Recognition (CVPR).
- Wang et al. (2017) Peng-Shuai Wang, Yang Liu, Yu-Xiao Guo, Chun-Yu Sun, and Xin Tong. 2017. O-CNN: Octree-based convolutional neural networks for 3D shape analysis. ACM Trans. Graph. (SIGGRAPH) 36, 4 (2017), 72:1–72:11.
- Wilhelms and Van Gelder (1992) Jane Wilhelms and Allen Van Gelder. 1992. Octrees for faster isosurface generation. ACM Trans. Graph. 11, 3 (1992), 201–227.
- Wu et al. (2016) Jiajun Wu, Chengkai Zhang, Tianfan Xue, William T. Freeman, and Joshua B. Tenenbaum. 2016. Learning a probabilistic latent space of object shapes via 3D generative-adversarial modeling. In Neural Information Processing Systems (NIPS).
- Wu et al. (2015) Zhirong Wu, Shuran Song, Aditya Khosla, Fisher Yu, Linguang Zhang, Xiaoou Tang, and Jianxiong Xiao. 2015. 3D ShapeNets: A deep representation for volumetric shape modeling. In Computer Vision and Pattern Recognition (CVPR).
- Yang et al. (2018b) Bo Yang, Stefano Rosa, Andrew Markham, Niki Trigoni, and Hongkai Wen. 2018b. 3D object dense reconstruction from a single depth view. arXiv:1802.00411 [cs.CV].
- Yang et al. (2018a) Yaoqing Yang, Chen Feng, Yiru Shen, and Dong Tian. 2018a. FoldingNet: Point cloud auto-encoder via deep grid deformation. In Computer Vision and Pattern Recognition (CVPR).
- Zou et al. (2017) Chuhang Zou, Ersin Yumer, Jimei Yang, Duygu Ceylan, and Derek Hoiem. 2017. 3D-PRNN: Generating shape primitives with recurrent neural networks. In International Conference on Computer Vision (ICCV).
Appendix: Parameter Setting of Adaptive O-CNN
The detailed Adaptive O-CNN encoder and decoder networks for an octree with max-depth 7 is shown in Figure 11. In the figure, represents the input feature at the octree level. represents the convolution operation with kernel size and output channel number . represents the deconvolution operation with kernel size , stride and output channel number . The kernel size and stride of the operation are both . represents the fully connected layer with output channel number . is the prediction module introduced in Section 4.2, which includes two operations. Here is the number of output channels of the first operation, and is fixed to . And is the number of output channels of the second operation, with which the plane parameters and the octant statuses are predicted. Since we make the octree adaptive from the level, the value of at the second and the third level in Figure 11 is set to , predicting whether to split an octant or not. From the octree level the value of is set to : channels of the output are used to predict the octant fitting status: empty, surface-well-approximated, and surface-poorly-approximated; the other channels are used to regress the plane parameters . The input latent code dimension of the decoder is set to 128.
We use the SGD solver to optimize the neural network, and the batch size is set to 32. In the shape classification experiment, the initial learning rate is 0.1, and it is decreased by a factor of 10 after every 10 epochs, and stops after 40 epochs. In the 3D autoencoding experiment, the initial learning rate is 0.1, and it is decreased by a factor of 10 after 100k, 200k, 250k iterations respectively, and stops after 350k iterations. In the shape prediction from a single image task, the initial learning rate is 0.1, and it is decreased by a factor of 10 after 150k, 300k, 350k iterations respectively, and stops after 400k iterations.
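The step schedules above can be written compactly as follows. This is a sketch under one assumption: the drop is applied from the stated iteration onwards, which the text does not specify precisely. The function name `step_lr` and its parameter names are illustrative and do not correspond to a particular solver configuration.

```python
def step_lr(iteration, base_lr=0.1,
            milestones=(100_000, 200_000, 250_000), gamma=0.1):
    """Learning rate after one tenfold drop per milestone already reached.

    The default milestones follow the 3D autoencoding experiment; pass
    (150_000, 300_000, 350_000) for the single-image prediction task.
    """
    drops = sum(iteration >= m for m in milestones)
    return base_lr * gamma ** drops
```

For example, with the default milestones the rate is 0.1 at the start, about 0.01 between 100k and 200k iterations, and about 0.0001 from 250k iterations until training stops at 350k.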