1. Introduction
3D shape analysis and generation are two key tasks in computer graphics. Traditional approaches generally have limited ability to handle complex shapes, or require significant time and effort from users to achieve acceptable results. With the rapid growth of created and captured 3D data, it has become possible to learn the shape space from a large 3D dataset with the aid of machine learning techniques and to guide shape analysis and generation with the learned features. Recently, deep learning with convolutional neural networks has been applied to 3D shape analysis and synthesis.
Different from images, whose grid-based representation is simple and regular, 3D shapes have a variety of representations because of different demands from real applications. For the learning-based shape generation task, the representation of 3D shapes plays a vital role: it affects both the design of learning architectures and the quality of the generated shapes. The (dense) voxel representation is the most popular in existing 3D learning and generation frameworks (Wu et al., 2015; Wu et al., 2016) since it is a natural extension of 2D images and is well suited to existing learning frameworks, like convolutional neural networks. However, its high memory consumption and costly computation are major drawbacks, and high-resolution outputs are hard to produce in practice. Multi-view images (Su et al., 2015) have been widely used in shape generation; the generated multi-view images can be fused to reconstruct the complete shape. Proper view selection, enforcing consistency across different views, and shape occlusion are the main challenges for this representation. Recently, the point cloud, another common 3D representation, has become suitable for shape analysis and generation with the development of PointNet (Qi et al., 2017a) and its variants. However, its output quality is limited by the number of points, and extracting high-quality surfaces from a point cloud requires additional processing. As the favorite 3D format in computer graphics, the polygonal mesh has recently been used in learning-based shape generation: surface patches or meshes can be predicted directly by a neural network that deforms a template mesh or finds a 2D-to-3D mapping (Groueix et al., [n. d.]; Kato et al., 2018; Wang et al., 2018). However, the predefined mesh topology and the regular tessellation of the template mesh prevent generating high-quality results, especially for irregular and complex shapes.
The octree, the most representative sparse-voxel representation, has recently been integrated with convolutional neural networks for shape analysis (Riegler et al., 2017a; Wang et al., 2017), and its memory and computational efficiency make it suitable for generating high-resolution shapes (Tatarchenko et al., 2017; Häne et al., 2017). The octree-based generation network usually predicts the occupancy probability of an octant, with labels occupied, free, and boundary, and splits octants with the label boundary. The prediction and splitting procedures are performed recursively until the predefined max depth of the octree is reached. At the finest level, the non-empty leaf octants represent the predicted surface. In existing work, the non-empty leaf octants at the finest level can be regarded as uniform samples of the shape in the x, y, and z directions. We observe that it is actually not necessary to store the shape information in this uniform way, since the local shape inside some octants can be represented by simple patches, like planar patches. Therefore, by storing the patch information and terminating the octant split early when the patch associated with the octant approximates the local shape well, the generated octree becomes a more compact and adaptive representation. Furthermore, the stored patch has higher-order approximation accuracy than using the center or one of the corners of the octant as a sample of the surface.

Based on the above observations, we propose a novel 3D convolutional neural network for 3D shapes called Adaptive Octree-based CNN, or Adaptive O-CNN for short. Adaptive O-CNN is based on a novel patch-guided adaptive octree shape representation which adaptively splits an octant according to the approximation error between the estimated simple patch and the local shape contained in the octant. The decoder of Adaptive O-CNN predicts the occupancy probability of octants (empty, surface-well-approximated, and surface-poorly-approximated), infers the local patch at each non-empty octant at each level, and splits octants whose label is surface-poorly-approximated. The result is an adaptive octree whose estimated local patches at non-empty leaf octants form a multi-scale and adaptive representation of the predicted shape. Besides the decoder, we also develop an efficient 3D encoder for adaptive octrees and use it for shape classification and as part of a 3D autoencoder.
Our Adaptive O-CNN inherits the advantages of octree-based CNNs and gains substantial memory and computational efficiency compared with existing octree-based CNNs due to the use of the adaptive octree data structure. The local patch estimation at each level also enhances the generated shape quality significantly. With all of these features, Adaptive O-CNN is capable of generating high-resolution and high-quality shapes efficiently. We evaluate Adaptive O-CNN on different tasks, including shape classification, 3D autoencoding, shape prediction from a single image, and shape completion for incomplete data. We demonstrate the superiority of Adaptive O-CNN over state-of-the-art learning-based shape generation techniques in terms of shape quality.
2. Related Work
Shape representations for 3D CNNs
Due to the variety of 3D shape representations, there is no universal representation for 3D learning. The (dense) voxel representation, equipped with binary occupancy signals or signed distance values, is popular in existing 3D CNN frameworks (Wu et al., 2015; Maturana and Scherer, 2015) due to its simplicity and similarity to images, its 2D counterpart. Voxel-based 3D CNNs often suffer from high memory consumption and thus have difficulty supporting high-resolution input. Since a 3D shape only occupies a small region of its bounding volume, there is a trend toward building sparse-voxel representations for 3D CNNs. A series of works, including (Graham, 2015; Uhrig et al., 2017; Wang et al., 2017; Riegler et al., 2017a), exploits the sparsity of voxels and defines proper convolution and pooling operations on sparse voxels with the aid of the octree structure and its variants. Our patch-guided adaptive octree also belongs to this type of representation, but with greater sparsity and better accuracy because of its adaptiveness and patch fitting. The multi-view representation regards the shape as a collection of images rendered from different views (Su et al., 2015). The images can contain RGB color information or view-dependent depth values, and it is easy to feed them to 2D CNNs and utilize networks pretrained on ImageNet (Deng et al., 2009). However, the multi-view representation may miss parts of the shape due to occlusion and insufficient views. Recently, the point-based representation has become popular due to its simplicity and flexibility. PointNet (Qi et al., 2017a) and its successor PointNet++ (Qi et al., 2017b) regard a shape as an unorganized point cloud and use symmetric functions to achieve permutation invariance of the points. These point-based CNNs are suited to applications whose input can be well approximated by a set of points or naturally has a point representation, like LiDAR scans. For mesh inputs where the neighborhood is well defined, graph-based CNNs (Bronstein et al., 2017) and manifold-based CNNs (Boscaini et al., 2015; Masci et al., 2015) find their unique advantages for shape analysis, especially for solving the shape correspondence problem.
3D decoders
Developing effective 3D decoders is key to the learning-based shape generation task. Existing work can be categorized according to the shape representation used.


Dense voxel-based decoder. Brock et al. (2016) proposed a voxel-based variational autoencoder (Kingma and Welling, 2014) to reconstruct 3D shapes and utilized the trained latent code for shape classification. Choy et al. (2016) combined the power of a 3D volumetric autoencoder and the long short-term memory (LSTM) technique to reconstruct a volumetric grid from single-view or multi-view images. Generative adversarial networks (GANs) (Goodfellow et al., 2014) were introduced to voxel-based shape generation and reconstruction (Wu et al., 2016; Yang et al., 2018b) with different improvement strategies. However, the low-resolution limitation of the voxel representation remains.
Sparse voxel-based decoder. The works of Tatarchenko et al. (2017) and Häne et al. (2017) show that the octree-based representation offers better efficiency and higher resolution than the (dense) voxel representation for shape prediction. Riegler et al. (2017b) demonstrated the use of an octree-based decoder for depth fusion, and even higher-resolution outputs are made possible by octree-based decoders. Our Adaptive O-CNN further improves the efficiency and the prediction accuracy of the octree-based generation network.

Multi-view decoder. Soltani et al. ([n. d.]) proposed to learn a generative model over multi-view depth maps or their corresponding silhouettes, and to reconstruct 3D shapes via a deterministic rendering function. Lun et al. (2017) used an autoencoder structure with a GAN to infer view-dependent depths of a category-specific shape from one or two sketch inputs and fused the outputs to reconstruct the shape. Lin et al. (2018) used the projection loss between the point cloud assembled from different views and the ground truth to further refine the predicted shape.

Point-based decoder. Su et al. (2017) designed PointSetGen to predict point coordinates from a single image. The Chamfer distance and the Earth Mover's distance are used as loss functions to penalize the deviation between the prediction and the ground truth. The generated point set roughly approximates the expected shape. Recently, Achlioptas et al. (2018) adapted the GAN technique to improve point-set generation.
Mesh-based decoder. By assuming that the generated shape is genus-zero or of disk topology, a series of works (Sinha et al., 2017; Wang et al., 2018; Kato et al., 2018; Yang et al., 2018a) predicts the deformation of template mesh or point cloud vertices via CNNs. Groueix et al. ([n. d.]) relaxed the topology constraints by introducing multiple 2D patches and predicting the mappings from 2D to 3D, achieving better-quality shapes. However, the uncontrolled distortion of the deformation or the mapping often yields highly irregular and distorted mesh elements that degrade the predicted shape quality.

Primitive decoder. Many shapes, like human-made objects, consist of simple parts, so instead of predicting low-level elements like points and voxels, predicting middle-level or even high-level primitives is essential to understanding the shape structure. Li et al. (2017) proposed a recursive neural network based on an autoencoder to generate the hierarchical structure of shapes. Tulsiani et al. (2017) abstracted the input volume by a set of simple primitives, like cuboids, via an unsupervised learning approach. Zou et al. (2017) built a training dataset where the shapes are approximated by a set of primitives as the ground truth, and proposed a generative recurrent neural network to generate a set of simple primitives from a single image to reconstruct the shape. Sharma et al. (2018) attempted to solve a more challenging problem: decoding a shape to a CSG tree. We regard a 3D shape as a collection of simple surface patches and use an adaptive octree to organize them for efficient processing. In our Adaptive O-CNN, a simple primitive patch, a planar patch, is estimated at each octant to approximate the local shape.
Octree techniques
The octree technique (Meagher, 1982) partitions a three-dimensional space recursively by subdividing it into eight octants. It serves as a central technique for many computer graphics applications, like rendering, shape reconstruction, and collision detection. Due to its spatial efficiency and friendliness to GPU implementation, the octree and its variants have been used as the shape representation for 3D learning, as described above. The commonly used octant partitioning depends on the existence of the shape inside the octant, the partitioning is performed until the max tree depth is reached, and the result is usually a uniform sampling. Since the shape signal is actually distributed unevenly in space, an adaptive sampling strategy can be integrated with the octree to further reduce its size. Frisken et al. (2000) proposed the octree-based adaptive distance field (ADF) to maintain high sampling rates in regions where the distance field contains fine detail and low sampling rates where the field varies smoothly. They subdivide a cell in which the distance field cannot be well approximated by bilinear interpolation of the corner values. The ADF greatly reduces the memory cost and accelerates many processing operations. Our patch-guided adaptive octree follows this adaptive principle and uses the fitting error of the local patch to guide the partitioning. The shape is approximated by all the patches at the leaf octants of the octree with a guaranteed approximation accuracy.
3. Patch-guided Adaptive Octree
We introduce a patch-guided partitioning strategy to generate adaptive octrees. For a given surface $S$, we start with its bounding box and perform 1-to-8 subdivision. For an octant $O$, denote by $S_O$ the local surface of $S$ restricted to the cubical region of $O$. If $S_O \neq \emptyset$, we fit a simple surface patch to $S_O$. In this paper, we choose the simplest surface, a planar patch, to guide the adaptive octree construction. The best plane with the least approximation error to $S_O$ is the minimizer of the following objective:

(1)   $\min_{n, d} \int_{p \in S_O} (n \cdot p + d)^2 \, \mathrm{d}p$

Here $n$ is the unit normal vector of the plane and the plane equation is $n \cdot x + d = 0$. To make the normal direction consistent with the underlying shape normal, we check whether the angle between $n$ and the average normal of $S_O$ is less than 90 degrees: if not, $n$ and $d$ are multiplied by $-1$. In the rest of the paper, we always assume that the local planes are reoriented in this way.

We denote by $P_O$ the planar patch obtained by restricting the fitted plane to the cubical region of $O$. The shape approximation quality of the local patch, $\delta_O$, is defined by the Hausdorff distance between $P_O$ and $S_O$: $\delta_O = d_H(P_O, S_O)$.
The revised partitioning rule of the octree is: for any octant $O$ not at the max-depth level, subdivide it if $S_O \neq \emptyset$ and $\delta_O$ is larger than a predefined threshold $\hat{\delta}$.
By following this rule, a patch-guided adaptive octree can be generated. The patches at all the non-empty leaf octants provide a good approximation to the input 3D shape: the Hausdorff distance between them and the input is bounded by $\hat{\delta}$. In practice, we set $\hat{\delta} = h$, where $h$ is the edge length of the finest grid of the octree.
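As a concrete illustration, the plane fit of Equation (1) and the patch-guided subdivision rule can be sketched as follows. This is a minimal sketch, assuming the local surface $S_O$ is available as a sampled point set; `fit_plane` and `should_subdivide` are our own hypothetical names, and the maximum point-to-plane distance is used as a simple one-sided stand-in for the Hausdorff distance $\delta_O$:

```python
import numpy as np

def fit_plane(points):
    """Least-squares plane fit: returns a unit normal n and offset d of
    the plane n.x + d = 0 minimizing sum over points of (n.p + d)^2."""
    c = points.mean(axis=0)
    # The normal is the direction of smallest variance of the centered
    # points, i.e. the last right-singular vector of (points - c).
    _, _, vt = np.linalg.svd(points - c)
    n = vt[-1]
    d = -np.dot(n, c)
    return n, d

def should_subdivide(points, avg_normal, delta_hat):
    """Patch-guided rule: subdivide a non-empty octant if the fitted
    plane approximates the local surface worse than delta_hat.
    (Max point-to-plane distance stands in for the Hausdorff distance.)"""
    if len(points) == 0:
        return False                      # empty octant: never split
    n, d = fit_plane(points)
    if np.dot(n, avg_normal) < 0:         # orient n with the surface normal
        n, d = -n, -d
    err = np.abs(points @ n + d).max()
    return err > delta_hat
```

For a perfectly planar point set the fitted error is zero and the octant is kept as a leaf; for a curved patch the error exceeds the threshold and the octant is split.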
Figure 2 shows a planar-patch-guided adaptive octree for the 3D Bunny model. We can see that the planar patches are of different sizes due to the adaptiveness of the octree.
For better visualization, we also illustrate the adaptive octree idea in 2D (see Figure 3) for a 2D curve input. It is clear that the line-segment-guided adaptive quadtree takes much less memory than the standard quadtree, and the collection of line segments is a good approximation to the input.
Watertight mesh conversion
Due to the approximation error, the local patches of adjacent non-empty leaf octants are not seamlessly connected; gaps exist in the boundary regions between octants. This artifact can be seen in Figure 2(e) and Figure 3 (right). To fill these gaps, surface reconstruction (Kazhdan and Hoppe, 2013; Fuhrmann and Goesele, 2014) and polygonal repair (Ju, 2004; Attene et al., 2013) techniques can be employed.
4. Adaptive O-CNN
The major components of a 3D CNN are the encoder and the decoder, which are essential to shape classification, shape generation, and other tasks. In Section 4.1 and Section 4.2, we introduce the 3D encoder and decoder of our adaptive octree-based CNN.
4.1. 3D Encoder of Adaptive O-CNN
Since the main difference between the octree and the adaptive octree is only the subdivision rule, the efficient GPU implementation of the octree (Wang et al., 2017) can easily be adapted to handle the adaptive octree. In the following, we first briefly review O-CNN (Wang et al., 2017), then introduce the Adaptive O-CNN 3D encoder.
Recap of the O-CNN encoder
The key idea of O-CNN is to store the sparse surface signal, such as normals, in the finest non-empty octants and to constrain the CNN computation to the octree. In each level of the octree, each octant is identified by its shuffled key (Wilhelms and Van Gelder, 1992). The shuffled keys are sorted in ascending order and stored in a contiguous array. Given the shuffled key of an octant, we can immediately calculate the shuffled keys of its neighboring octants and retrieve the corresponding neighborhood information, which is essential for implementing efficient CNN operations. To obtain the parent-child correspondence between the octants of two consecutive octree levels and to mark the empty octants, an additional label array records this information for each octant. Common CNN operations defined on the octree, such as convolution and pooling, are similar to volumetric CNN operations. The only difference is that the octree-based CNN operations are constrained to the octree by the principle: "where there is an octant, there is CNN computation". Initially, the shape signal exists at the finest octree level; then, at each level of the octree, the CNN operations are applied sequentially. When the stride of a CNN operation is 1, the signal is processed at unchanged resolution and remains at the current octree level. When the stride is larger than 1, the signal is condensed and flows along the octree from the bottom to the top.
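The shuffled-key indexing described above can be sketched as follows; this is an illustrative Morton-code implementation, and the exact bit layout of the released O-CNN code may differ:

```python
def shuffled_key(x, y, z, depth):
    """Interleave the bits of integer octant coordinates (x, y, z) into
    a single 'shuffled key' (Morton code): x_{d-1} y_{d-1} z_{d-1} ... x_0 y_0 z_0.
    Sorting octants by this key stores the eight siblings contiguously,
    and neighbor keys can be computed by bit arithmetic."""
    key = 0
    for i in range(depth):
        key |= ((x >> i) & 1) << (3 * i + 2)   # x bit
        key |= ((y >> i) & 1) << (3 * i + 1)   # y bit
        key |= ((z >> i) & 1) << (3 * i)       # z bit
    return key
```

Because a child key is formed by appending three bits to its parent's key, the parent of an octant with key `k` is simply `k >> 3`, which is what makes parent-child lookups on the sorted key array cheap.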
Figure 4 (upper) illustrates the encoder structure of O-CNN.
We reuse the O-CNN octree implementation for the 3D encoder of Adaptive O-CNN. The data storage of the adaptive octree on the GPU and the convolution and pooling operations are the same as for O-CNN. There are two major differences between Adaptive O-CNN and O-CNN: (1) the input signal appears at all the octants, not only at the finest octants; (2) the computation starts from leaf octants at different levels simultaneously, and the computed features are assembled across different levels. We detail these differences as follows.
Input signal
Different from O-CNN, which only stores shape signals at the finest octants, we utilize all the estimated local planes as the input signal. For an octant $O$ at level $l$ whose local plane is $n \cdot x + d = 0$, we set a four-channel input signal in it: $(n, d^\star)$. Here $c$ is the center point of $O$ and $d^\star = n \cdot c + d$. Note that $n \cdot (x - c) + d^\star = 0$ is the same plane equation. We use $d^\star$ instead of $d$ because $|d^\star|$ is bounded by the grid size of level $l$ and is a relative value, while $d$ has a large range since $|d|$ measures the distance from the origin to the plane. For an empty octant, the input signal is set to $(0, 0, 0, 0)$.
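The conversion from a fitted plane to the four-channel octant signal can be sketched as below (a small illustration with our own function name; it assumes the plane parameters come from the fitting step of Section 3):

```python
import numpy as np

def octant_input_signal(n, d, center):
    """Four-channel input signal for a non-empty octant whose fitted
    plane is n.x + d = 0.  Re-expresses the plane relative to the octant
    center c as n.(x - c) + d_star = 0 with d_star = n.c + d, so the
    offset channel stays bounded by the octant size."""
    d_star = float(np.dot(n, center) + d)
    return np.array([n[0], n[1], n[2], d_star])

# Empty octants get the zero signal (0, 0, 0, 0).
```

For example, the plane z = 0.9 seen from an octant centered at (0.5, 0.5, 0.875) yields a small relative offset even though the global offset d is large.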
Adaptive O-CNN 3D encoder
We design a novel network structure to take the adaptive octree as input. At each level of the octree, we apply a series of convolution operators and ReLUs to the features of all the octants at this level, and the convolution kernel is shared by these octants. The processed features at the $l$-th level are then downsampled to the $(l-1)$-th level via pooling and fused with the features at the $(l-1)$-th level by an element-wise max operation. These new features can be further processed and fused with the features at the $(l-2)$-th level, the $(l-3)$-th level, and so on, up to the coarsest level. In our implementation, the coarsest level is set to 2, where the octants at the 2nd level are enforced to be full so that the features all have the same dimension. Figure 4 (lower) illustrates our Adaptive O-CNN 3D encoder architecture.
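The bottom-up feature flow of the encoder, downsampling to the parent level and element-wise max fusion with the features already present there, can be sketched with a dictionary keyed by shuffled keys (our own minimal illustration; the real O-CNN implementation operates on contiguous GPU arrays):

```python
import numpy as np

def pool_to_parent(child_feats):
    """Max-pool features from level-l octants to their level-(l-1)
    parents.  child_feats maps an octant's shuffled key to its feature
    vector; a child's parent key is key >> 3."""
    parents = {}
    for key, f in child_feats.items():
        p = key >> 3
        parents[p] = f if p not in parents else np.maximum(parents[p], f)
    return parents

def fuse(level_feats, pooled):
    """Element-wise max fusion of the downsampled features with the
    features already present at the coarser level (leaf octants there
    carry their own plane signal)."""
    out = dict(pooled)
    for key, f in level_feats.items():
        out[key] = f if key not in out else np.maximum(out[key], f)
    return out
```

Repeating `pool_to_parent` followed by `fuse` level by level assembles the features from leaves at different depths into one representation at the coarsest level.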
4.2. 3D Decoder of Adaptive O-CNN
We design a 3D decoder to generate an adaptive octree from a given latent code. The decoder structure is shown in Figure 5. At each octree level, a neural network predicts the patch approximation status of each octant (empty, surface-well-approximated, or surface-poorly-approximated) and regresses the local patch parameters. Octants with the label surface-poorly-approximated are subdivided, and their features are passed to their children via a deconvolution operation (also known as "transposed convolution" or "up-convolution"). The label surface-well-approximated implies that the local patch approximates the local shape with an error bounded by $\hat{\delta}$; the network stops splitting such octants and leaves them as leaf nodes at the current octree level. An adaptive octree is predicted in this recursive way until the max depth is reached.
Prediction module
The neural network used for prediction is simple. It consists of "FC + BN + ReLU + FC" operations, where BN denotes batch normalization and FC a fully connected layer. This module is shared across all octants at the same level of the adaptive octree. The output of the prediction module includes the patch approximation status and the plane patch parameters $(n, d^\star)$. The patch approximation status guides the subdivision of the octree and is also required by the octree-based deconvolution operator.
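A forward pass of this prediction module can be sketched in NumPy as follows (illustrative layer sizes; the actual channel counts are given in the supplemental material):

```python
import numpy as np

def prediction_module(feat, W1, b1, gamma, beta, W2, b2, eps=1e-5):
    """'FC + BN + ReLU + FC' head shared by all octants at one level.
    feat: (N, C) per-octant features.  Output: 3-way status logits
    (empty / well-approximated / poorly-approximated) and 4 plane
    parameters (n, d_star) per octant."""
    h = feat @ W1 + b1                        # FC
    mu, var = h.mean(0), h.var(0)             # BN with batch statistics
    h = gamma * (h - mu) / np.sqrt(var + eps) + beta
    h = np.maximum(h, 0.0)                    # ReLU
    out = h @ W2 + b2                         # FC
    return out[:, :3], out[:, 3:7]            # status logits, patch params
```

Sharing one such head per level keeps the number of parameters independent of the number of octants.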
Loss function
The loss function of the Adaptive O-CNN decoder includes a structure loss and a patch loss. The structure loss measures the difference between the predicted octree structure and its ground truth. Since the determination of the octant status is a 3-class classification, we use the cross-entropy loss to define the structure loss. Denoting by $H_l$ the sum of the cross entropies over the octants at the $l$-th level of the octree, the structure loss $L_{struct}$ is formed by the weighted sum of cross entropies across all the levels:

(2)   $L_{struct} = \sum_{l=2}^{d} \frac{w_l}{n_l} H_l$

Here $n_l$ is the number of octants at the $l$-th level of the predicted octree, $d$ is the max depth, and $w_l$ is the weight defined on each level. Similar to the encoder, the coarsest level of the octree is 2 and is full of octants, so $l$ starts from 2 in the above equation. In our implementation, we set $w_l$ to 1.
The patch loss measures the squared error between the predicted plane parameters and the ground truth at all the leaf octants of each level:

(3)   $L_{patch} = \sum_{l=4}^{d} \frac{w_l}{m_l} \sum_{o \in \mathcal{O}_l} \left( \lVert n_o - \bar{n}_o \rVert^2 + (d^\star_o - \bar{d}^\star_o)^2 \right)$

Here $n_o$ and $d^\star_o$ are the predicted parameters of leaf octant $o$, $\bar{n}_o$ and $\bar{d}^\star_o$ are the corresponding ground-truth values, $\mathcal{O}_l$ is the set of leaf octants at the $l$-th level of the predicted octree, $m_l$ is their number, and $w_l$ is the weight defined on each level. In our implementation we make the octree adaptive only when the octree level is above 4, so $l$ starts from 4 in the above equation. Note that for the wrongly generated octants that do not exist in the ground truth, there is no patch loss; they are penalized by the structure loss only.
We use $L_{struct} + L_{patch}$ as the loss function of our decoder. Since the predicted plane should pass through its octant (otherwise it would violate the assumption that the planar patch lies inside the octant cube), we add the constraint $|d^\star| < \frac{\sqrt{3}}{2} r_l$, where $r_l$ is the grid size of level-$l$ octants, by applying a scaled tanh function to the network output.
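The combined decoder loss, per-level cross entropy plus the patch regression term with a bounded offset, can be sketched as follows. This is a simplified illustration with our own data layout; the sqrt(3)/2 factor reflects the geometric fact that a plane passing through an octant cube has a center offset of at most half the cube diagonal:

```python
import numpy as np

def cross_entropy(logits, labels):
    """Mean 3-class cross entropy over the octants of one level."""
    z = logits - logits.max(axis=1, keepdims=True)
    logp = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -logp[np.arange(len(labels)), labels].mean()

def decoder_loss(per_level, weights, grid_sizes):
    """L = L_struct + L_patch over octree levels.  per_level maps a
    level l to (status logits, status labels, predicted normals, raw
    offsets, ground-truth normals, ground-truth offsets) for that
    level's octants; structure-only levels pass None for the normals.
    The raw offset is squashed with tanh so the predicted plane always
    stays inside the octant cube."""
    loss = 0.0
    for l, (logits, labels, n_pred, d_raw, n_gt, d_gt) in per_level.items():
        loss += weights[l] * cross_entropy(logits, labels)    # structure
        if n_pred is not None:                                # patch
            r = grid_sizes[l]
            d_pred = np.tanh(d_raw) * (np.sqrt(3.0) / 2.0) * r
            loss += weights[l] * (
                ((n_pred - n_gt) ** 2).sum(axis=1).mean()
                + ((d_pred - d_gt) ** 2).mean())
    return loss
```

With a confidently correct status prediction and exact patch parameters, the loss is close to zero, which is the sanity check one would expect from Equations (2) and (3).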
5. Experiments and Comparisons
To evaluate Adaptive O-CNN, we conduct three experiments: 3D shape classification, 3D autoencoding, and 3D shape prediction from a single image. All the experiments were done on a desktop computer with an Intel Core i7-6900K CPU (3.2 GHz) and a GeForce GTX Titan X GPU (12 GB memory). Our implementation is based on the Caffe framework (Jia et al., 2014), and the source code is available at https://github.com/Microsoft/OCNN. The detailed Adaptive O-CNN network configuration is provided in the supplemental material.
Dataset preprocessing
To build the training dataset for our experiments, we first follow the approach of (Wang et al., 2017) to obtain a dense point cloud with oriented normals by virtual 3D scanning; we then build the planar-patch-guided adaptive octree from it via the construction procedure introduced in Section 3.
5.1. Shape classification
We evaluate the efficiency and efficacy of our Adaptive O-CNN encoder on the 3D shape classification task.
Dataset
We performed the shape classification task on the ModelNet40 dataset (Wu et al., 2015), which contains 12,311 well-annotated CAD models from 40 categories. The training data is augmented by rotating each model along its upright axis at 12 uniform intervals. The planar-patch-guided adaptive octrees are generated at different resolutions: $32^3$, $64^3$, $128^3$, and $256^3$. We conducted the shape classification experiment at each of these resolutions.
Network configuration
To clearly demonstrate the advantages of our adaptive octree-based encoder over O-CNN (Wang et al., 2017), we use the same network parameters as O-CNN, including the parameters of the CNN operations, the number of trainable parameters, and the dropout strategy. The only difference is the encoder network structure, as shown in Figure 4. After training, we use the orientation pooling technique (Qi et al., 2016; Su et al., 2015) to vote among the 12 predictions of the same object under different poses.
|  | Method | $32^3$ | $64^3$ | $128^3$ | $256^3$ |
| --- | --- | --- | --- | --- | --- |
| Memory | Voxel | 0.71 GB | 3.7 GB | — | — |
|  | O-CNN | 0.58 GB | 1.1 GB | 2.7 GB | 6.4 GB |
|  | Adaptive O-CNN | 0.51 GB | 0.95 GB | 1.5 GB | 1.7 GB |
| Time | Voxel | 425 ms | 1648 ms | — | — |
|  | O-CNN | 41 ms | 117 ms | 334 ms | 1393 ms |
|  | Adaptive O-CNN | 34 ms | 63 ms | 112 ms | 307 ms |
| Accuracy | O-CNN | 90.4% | 90.6% | 90.1% | 90.2% |
|  | Adaptive O-CNN | 90.5% | 90.4% | 90.0% | 90.2% |
| Method | Accuracy | Method | Accuracy |
| --- | --- | --- | --- |
| PointNet (Qi et al., 2017a) | 89.2% | PointNet++ (Qi et al., 2017b) | 91.9% |
| VRN Ensemble (Brock et al., 2016) | 95.5% | SubVolSup (Qi et al., 2016) | 89.2% |
| OctNet (Riegler et al., 2017a) | 86.5% | O-CNN (Wang et al., 2017) | 90.6% |
| Kd-Network (Klokov and Lempitsky, 2017) | 91.8% | Adaptive O-CNN | 90.5% |
| Method | mean | pla. | ben. | cab. | car | cha. | mon. | lam. | spe. | fir. | cou. | tab. | cel. | wat. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| PSG | 1.91 | 1.11 | 1.46 | 1.91 | 1.59 | 1.90 | 2.20 | 3.59 | 3.07 | 0.94 | 1.83 | 1.83 | 1.71 | 1.69 |
| AtlasNet (25) | 1.56 | 0.87 | 1.25 | 1.78 | 1.58 | 1.56 | 1.72 | 2.30 | 2.61 | 0.68 | 1.83 | 1.52 | 1.27 | 1.33 |
| AtlasNet (125) | 1.51 | 0.86 | 1.15 | 1.76 | 1.56 | 1.55 | 1.69 | 2.26 | 2.55 | 0.59 | 1.69 | 1.47 | 1.31 | 1.23 |
| Adaptive O-CNN | 1.44 | 1.19 | 1.27 | 1.01 | 0.96 | 1.65 | 1.41 | 2.83 | 1.97 | 1.06 | 1.14 | 1.46 | 0.73 | 1.82 |
| O-CNN (binary) | 1.60 | 1.12 | 1.30 | 1.06 | 1.02 | 1.79 | 1.62 | 3.71 | 2.56 | 0.98 | 1.17 | 1.67 | 0.79 | 1.88 |
| O-CNN (patch) | 1.59 | 1.10 | 1.29 | 1.06 | 1.02 | 1.79 | 1.60 | 3.70 | 2.55 | 0.97 | 1.18 | 1.66 | 0.79 | 1.87 |
| O-CNN (patch*) | 1.53 | 1.09 | 1.31 | 0.91 | 0.97 | 1.77 | 1.58 | 3.64 | 2.28 | 0.97 | 1.14 | 1.65 | 0.73 | 1.91 |
Experimental results
We record the peak memory consumption and the average time of one forward and backward iteration with batch size 32, and report them together with the classification accuracy on the test dataset in Table 1. The experiments show that the classification accuracy of Adaptive O-CNN is comparable to O-CNN at all resolutions, while the memory and computational cost of Adaptive O-CNN is significantly lower, especially for high-resolution input: at $256^3$ resolution, Adaptive O-CNN gains about a 4-times speedup and reduces GPU memory consumption by 73% compared with O-CNN. The classification accuracy of Adaptive O-CNN is also comparable to state-of-the-art learning-based methods (see Table 2).
Discussion
As seen in Table 1, when the input resolution is beyond $64^3$, the classification accuracy of Adaptive O-CNN drops slightly. We find that when the input resolution increases from $64^3$ to $128^3$, the training loss decreases from 0.168 to 0.146, whereas the testing loss increases from 0.372 to 0.375. We conclude that Adaptive O-CNN with a deeper octree tends to overfit the training data of ModelNet40. This result is also consistent with the observation of (Wang et al., 2017) on O-CNN. With more training data, for instance, by rotating each training object 24 times around its upright axis uniformly, the classification accuracy can increase by 0.2% at the highest resolution.
5.2. 3D Autoencoding
The autoencoder technique learns a compact representation of the input and recovers the signal from the latent code via a decoder. We use the Adaptive O-CNN encoder and decoder presented in Section 4 to form a 3D autoencoder.
Dataset
We trained our 3D autoencoder on the ShapeNet Core v2 dataset (Chang et al., 2015), which consists of 39,715 3D models from 13 categories. The training and test split is the same as the one used in AtlasNet (Groueix et al., [n. d.]) and the point-based decoder (PSG) (Su et al., 2017). The adaptive octree we used has max depth 7, i.e., a voxel resolution of $128^3$.
Quality metric
We evaluate the quality of a decoded shape by measuring the Chamfer distance between it and its ground-truth shape. With the ground-truth point cloud denoted by $P_g$ and the points predicted by the neural network by $P_p$, the Chamfer distance between $P_p$ and $P_g$ is defined as:

$d(P_p, P_g) = \frac{1}{|P_p|} \sum_{x \in P_p} \min_{y \in P_g} \lVert x - y \rVert_2^2 + \frac{1}{|P_g|} \sum_{y \in P_g} \min_{x \in P_p} \lVert x - y \rVert_2^2$

Because our decoder outputs a patch-guided adaptive octree, to calculate the Chamfer distance we sample a set of dense points uniformly from the estimated planar patches: we first subdivide the planar patch contained in each non-empty leaf node of the adaptive octree to a fine, fixed resolution, then randomly sample one point on each of the subdivided planar patches to form the output point cloud. We also uniformly sample the ground-truth dense point cloud at the same resolution.
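The Chamfer distance used for evaluation can be computed with a brute-force nearest-neighbor search (a minimal sketch; evaluation on large point sets would typically use a KD-tree instead):

```python
import numpy as np

def chamfer_distance(pred, gt):
    """Symmetric Chamfer distance between point sets pred (N, 3) and
    gt (M, 3): mean squared distance from each point to its nearest
    neighbor in the other set, summed over both directions.
    O(N*M) brute force."""
    d2 = ((pred[:, None, :] - gt[None, :, :]) ** 2).sum(-1)  # (N, M)
    return d2.min(axis=1).mean() + d2.min(axis=0).mean()
```

Identical point sets give a distance of zero, and the measure is symmetric in the two sets by construction.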
Experimental results
The quality measurements are summarized in Table 3, where we also compare with two state-of-the-art 3D autoencoding methods: AtlasNet (Groueix et al., [n. d.]), which generates a set of mesh patches as an approximation of the shape, and PointSetGen (PSG) (Su et al., 2017), which generates a point-cloud output. Compared with the two variants of AtlasNet, which predict 25 and 125 mesh patches respectively, our Adaptive O-CNN autoencoder achieves the best quality on average. Note that the loss function of AtlasNet is exactly the Chamfer distance, while our autoencoder is not trained with this loss but still performs well. Compared with PSG (Su et al., 2017), both our method and AtlasNet are clearly much better.
Discussion
Our Adaptive O-CNN performs worse than AtlasNet in some categories, such as plane, chair, and firearm. We found that on relatively thin parts of the models in these categories, such as the wing of a plane, the arm of a chair, and the barrel of a gun, our Adaptive O-CNN has a larger geometric deviation from the original shape. For AtlasNet, although its deviation is smaller, it approximates the thin parts with a single patch or messy patches (e.g., the single patch for the right arm of the chair, and the folded patches for the gun barrel and the plane wing, as seen in Figure 6); the volumetric structure of the thin parts is totally lost, and it is difficult or even impossible to define the inside and outside in those regions. We conclude that the Chamfer distance loss used by AtlasNet does not penalize this loss of structure. In contrast, because our Adaptive O-CNN is trained with both the octree-based structure loss and the patch loss, it successfully approximates the thin parts with better volumetric structure (e.g., the cylinder-like shapes for the gun barrel and chair support, and the two-sided surfaces for the plane wing). The zoom-in images in Figure 6 highlight these differences.
Ablation study
We designed three baseline autoencoders based on the standard octree structure to demonstrate the benefit of using the adaptive octree:


O-CNN (binary): A vanilla octree-based autoencoder. The encoder is the one shown in Figure 4 (upper). The decoder is similar to the Adaptive O-CNN decoder but with two differences: (1) the prediction module only predicts whether a given octant intersects the ground-truth surface; if it does, the octant is further subdivided; (2) the loss involves the structure loss only.

O-CNN (patch): An enhanced version of O-CNN (binary). The prediction module also predicts the plane patch in each leaf node at the finest level, and the patch loss is added.

OCNN(patch*): An enhanced version of OCNN(patch). The prediction module predicts the plane patch on each leaf node at each level and the patch loss is added.
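The two loss terms that distinguish these baselines can be sketched as follows. The softmax classification over octant statuses and the L2 patch term are illustrative assumptions; the weighting and the exact patch loss are not the paper's specified choices.

```python
import numpy as np

def softmax_xent(logits, labels):
    """Cross-entropy over per-octant status logits (one row per octant)."""
    z = logits - logits.max(axis=1, keepdims=True)
    logp = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -logp[np.arange(len(labels)), labels].mean()

def total_loss(status_logits, status_labels, pred_planes, gt_planes,
               patch_weight=1.0):
    """Sketch of the two-term objective contrasted above: a structure
    (classification) loss on octant statuses plus a regression loss on
    plane parameters. patch_weight and the L2 form are assumptions."""
    structure = softmax_xent(status_logits, status_labels)
    patch = np.mean((pred_planes - gt_planes) ** 2)
    return structure + patch_weight * patch
```

O-CNN(binary) corresponds to using only the structure term; O-CNN(patch) and O-CNN(patch*) add the patch term at the finest level or at every level, respectively.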
These three networks are trained on the 3D autoencoding task, and the statistics of their results are also summarized in Table 3. The Chamfer distance of O-CNN(patch) is slightly better than that of O-CNN(binary), since regressing plane patches at the finest level enables sub-voxel precision. By considering the patch loss at every depth level, O-CNN(patch*) further improves the reconstruction accuracy due to the hierarchical supervision during training. However, it is still worse than Adaptive O-CNN, for the following reason: during shape generation with Adaptive O-CNN, if the plane patch generated for an octant at a coarser level already approximates the ground-truth shape well, the subdivision of that octant stops, and the network layers at finer levels are trained to focus on regions with more geometric detail. As a result, Adaptive O-CNN not only avoids holes in regions that can be well approximated by a large planar patch, but also generates better results in surface regions with more shape detail. In contrast, regardless of whether a region can be modeled by a large plane patch, the O-CNN-based networks subdivide all non-empty octants at every level and predict the surface at the finest level; they therefore have more chances to wrongly predict the occupancy of the finest-level voxels.
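The adaptive refinement rule described above can be sketched as a recursion: fit a patch per octant, emit it where it already fits well, and subdivide (up to a maximum depth) where it does not. All names and the threshold criterion here are hypothetical illustrations, not the network's learned behaviour:

```python
import numpy as np

def fit_patch(points):
    """PCA plane fit; returns (max deviation, (normal, offset))."""
    c = points.mean(axis=0)
    _, _, vt = np.linalg.svd(points - c)
    n = vt[-1]                       # direction of least variance
    return np.abs((points - c) @ n).max(), (n, -n @ c)

def decode_octant(points, lo, hi, depth, max_depth, threshold, out):
    """Emit a planar patch where the fit is good; otherwise split the
    octant [lo, hi) into its 8 children and recurse."""
    if len(points) < 3:
        return                       # (near-)empty octant: stop
    err, plane = fit_patch(points)
    if err < threshold or depth == max_depth:
        out.append((depth, plane))   # well approximated: stop here
        return
    mid = 0.5 * (lo + hi)
    for mask in range(8):            # recurse into the 8 children
        side = np.array([(mask >> i) & 1 for i in range(3)], dtype=bool)
        child_lo = np.where(side, mid, lo)
        child_hi = np.where(side, hi, mid)
        sel = np.all((points >= child_lo) & (points < child_hi), axis=1)
        decode_octant(points[sel], child_lo, child_hi,
                      depth + 1, max_depth, threshold, out)
```

On a flat region this recursion terminates at a coarse level with one large patch, which is exactly why the non-adaptive baselines, forced to subdivide everywhere, accumulate more finest-level prediction errors.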
The visualization in Figure 7 also shows that Adaptive O-CNN generates more visually pleasing results and outputs large planar patches in flat regions, while the outputs of O-CNN(binary), O-CNN(patch), and O-CNN(patch*) contain more holes due to inaccurate predictions.
Application: shape completion
A 3D autoencoder can be used to recover the missing parts of a geometric shape and to fair noisy input. We conduct a shape completion task to demonstrate the efficacy of our Adaptive O-CNN. We choose the car category from the ShapeNet Core v2 dataset as the ground-truth data. For each car, we randomly choose 3 to 5 views and sample dense points from them. In each view, we also randomly crop some regions to mimic holes and slightly perturb the point positions to model scan noise. The views are then assembled to serve as the incomplete and noisy input. We trained our Adaptive O-CNN autoencoder on this synthetic dataset with the incomplete shape as input and the corresponding complete shape as the target; for reference, we also trained the O-CNN(patch) autoencoder on it. The maximum octree depth in all networks is set to 7. Figure 8 shows two completion examples. The results from Adaptive O-CNN are closer to the ground truth, while O-CNN(patch) fails to fill some of the holes.
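A simplified stand-in for this data synthesis can be sketched directly on a point cloud: instead of rendering views, crop random spherical regions (mimicking holes) and jitter the surviving points (mimicking scan noise). All parameter names and values here are illustrative, not the paper's:

```python
import numpy as np

def corrupt_scan(points, n_holes=4, hole_radius=0.1,
                 noise_sigma=0.005, rng=None):
    """Drop points inside a few random spherical regions, then add
    Gaussian jitter to the remaining points."""
    rng = rng if rng is not None else np.random.default_rng()
    keep = np.ones(len(points), dtype=bool)
    for _ in range(n_holes):
        center = points[rng.integers(len(points))]
        dist = np.linalg.norm(points - center, axis=1)
        keep &= dist > hole_radius   # carve a hole around the center
    partial = points[keep]
    return partial + rng.normal(0.0, noise_sigma, partial.shape)
```

Training pairs are then (corrupt_scan(shape), shape), with the clean shape as the reconstruction target.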
5.3. Shape reconstruction from a single image
Reconstructing 3D shapes from 2D images is an important topic in computer vision and graphics. With the development of 3D deep learning techniques, the task of inferring a 3D shape from a single image has gained much attention in the research community. We conduct experiments on this task with our Adaptive O-CNN and compare it with the state-of-the-art methods (Groueix et al., [n. d.]; Su et al., 2017; Tatarchenko et al., 2017).
Dataset
For the comparisons with AtlasNet (Groueix et al., [n. d.]) and PointSetGen (PSG) (Su et al., 2017), we use the same dataset, which originally comes from (Choy et al., 2016). The ground-truth 3D shapes come from ShapeNet Core v2 (Chang et al., 2015), and each object is rendered from 24 viewpoints with a transparent background. For the comparison with OctGen (Tatarchenko et al., 2017), since OctGen was only trained on the car category with the octree of resolution , we also trained our network on the car dataset at the same resolution.
Image encoders
Experimental results
We report the Chamfer distance between the predicted points and points sampled from the original mesh for PointSetGen, AtlasNet, and our method in Table 4. As mentioned in (Groueix et al., [n. d.]), 260 shapes (20 per category) were randomly selected to form the test set. To compare with PointSetGen, they ran the ICP algorithm (Besl and McKay, 1992) to align the predicted points from both PointSetGen and AtlasNet with the ground-truth point cloud; note that the ICP alignment slightly improves the Chamfer distance error. For a fair comparison, we also ran ICP to align our results with the ground truth. Our method achieves the best performance on 8 out of 13 categories, especially for objects with large flat regions, such as car and cabinet. In Figure 9 and Figure 1 we illustrate sample outputs from these networks; our outputs are clearly more visually pleasing. For the flip-phone image in the last row of Figure 9, the reconstruction quality is relatively low for all methods, because flip phones are rare in the training dataset.
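The pre-evaluation alignment step can be illustrated with a minimal point-to-point ICP (Besl and McKay, 1992): alternate brute-force nearest-neighbour matching with a closed-form (Kabsch) rigid fit. This is a sketch, not the implementation used in the comparison:

```python
import numpy as np

def icp(src, dst, iters=20):
    """Rigidly align src to dst by iterating matching + Kabsch fit."""
    cur = src.copy()
    for _ in range(iters):
        # Nearest neighbour in dst for every point of cur (brute force).
        d2 = np.sum((cur[:, None, :] - dst[None, :, :]) ** 2, axis=-1)
        match = dst[d2.argmin(axis=1)]
        # Closed-form rigid fit of cur onto its matches.
        mu_c, mu_m = cur.mean(axis=0), match.mean(axis=0)
        u, _, vt = np.linalg.svd((cur - mu_c).T @ (match - mu_m))
        if np.linalg.det(u @ vt) < 0:   # avoid reflections
            vt[-1] *= -1
        r = (u @ vt).T
        cur = (cur - mu_c) @ r.T + mu_m
    return cur
```

The aligned output, not the raw prediction, is then fed to the Chamfer distance evaluation.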
To compute the Chamfer distance for the output of OctGen, we densely sample points from the boundary octant boxes. Our Adaptive O-CNN has a lower Chamfer distance error than OctGen: 0.274 vs. 0.294. A visual comparison is shown in Figure 10: our results preserve more detail than OctGen's, and the resulting surface patches are much more faithful to the ground truth, especially in flat regions.
Method          mean  pla.  ben.  cab.  car   cha.  mon.  lam.  spe.   fir.  cou.  tab.  cel.  wat.
PSG             6.41  3.36  4.31  8.51  8.63  6.35  6.47  7.66  15.9   1.58  6.92  3.93  3.76  5.94
AtlasNet(25)    5.11  2.54  3.91  5.39  4.18  6.77  6.71  7.24  8.18   1.63  6.76  4.35  3.91  4.91
Adaptive O-CNN  4.63  2.45  2.69  2.67  1.80  6.13  6.27  10.92 9.43   1.68  4.42  4.19  2.51  5.04
(Category abbreviations: pla.=plane, ben.=bench, cab.=cabinet, cha.=chair, mon.=monitor, lam.=lamp, spe.=speaker, fir.=firearm, cou.=couch, tab.=table, cel.=cellphone, wat.=watercraft.)
6. Conclusion
We present a novel Adaptive O-CNN for 3D shape encoding and decoding. The encoder and decoder of Adaptive O-CNN exploit the nice properties of the patch-guided adaptive octree structure: compactness, adaptiveness, and high-quality approximation of the shape. We show the memory and computational efficiency of Adaptive O-CNN, and demonstrate its superiority over other state-of-the-art methods, including existing octree-based CNNs, on typical 3D learning tasks: 3D autoencoding, surface completion from noisy and incomplete point clouds, and surface prediction from a single image.
One limitation of our implementation is that adjacent patches in the adaptive octree are not seamless. To obtain a well-connected mesh output, other mesh repairing or surface reconstruction techniques are needed. In practice, we observe that most of the seams can be stitched by snapping nearby vertices of adjacent patches. We would like to add a regularization loss to reduce the seams, and to develop a post-processing method to stitch the remaining gaps.
Another limitation is that the planar patches used in Adaptive O-CNN do not approximate curved features well; see, for instance, the car wheel in Figure 7. In the future, we would like to explore the use of non-planar surface patches in Adaptive O-CNN. Quadratic surface patches and their subclasses, such as parabolic and ellipsoidal patches, are promising candidates because they have simple expressions and include planes as a special case. Another direction is to use other fitting-quality metrics to guide the subdivision of octants, for instance, the topological similarity between the locally fitted patch and the ground-truth surface patch, so that the fitted patch approximates the local shape well in both geometry and topology.
Acknowledgements.
We wish to thank the authors of ModelNet and ShapeNet for sharing their data, Stephen Lin for proofreading the paper, and the anonymous reviewers for their valuable feedback.
References
 Achlioptas et al. (2018) Panos Achlioptas, Olga Diamanti, Ioannis Mitliagkas, and Leonidas Guibas. 2018. Learning representations and generative models for 3D point clouds. In International Conference on Learning Representations.
 Attene et al. (2013) Marco Attene, Marcel Campen, and Leif Kobbelt. 2013. Polygon mesh repairing: An application perspective. ACM Comput. Surv. 45, 2 (2013), 15:1–15:33.
 Besl and McKay (1992) P. J. Besl and N. D. McKay. 1992. A method for registration of 3D shapes. IEEE Trans. Pattern. Anal. Mach. Intell. 14, 2 (1992), 239–256.
 Boscaini et al. (2015) D. Boscaini, J. Masci, S. Melzi, M. M. Bronstein, U. Castellani, and P. Vandergheynst. 2015. Learning class-specific descriptors for deformable shapes using localized spectral convolutional networks. Comput. Graph. Forum 34, 5 (2015), 13–23.
 Brock et al. (2016) Andrew Brock, Theodore Lim, J.M. Ritchie, and Nick Weston. 2016. Generative and discriminative voxel modeling with convolutional neural networks. In 3D deep learning workshop (NIPS).
 Bronstein et al. (2017) M. M. Bronstein, J. Bruna, Y. LeCun, A. Szlam, and P. Vandergheynst. 2017. Geometric deep learning: going beyond Euclidean data. IEEE Sig. Proc. Magazine 34 (2017), 18 – 42. Issue 4.
 Chang et al. (2015) Angel X. Chang, Thomas Funkhouser, et al. 2015. ShapeNet: an information-rich 3D model repository. arXiv:1512.03012 [cs.GR].
 Choy et al. (2016) Christopher B. Choy, Danfei Xu, JunYoung Gwak, Kevin Chen, and Silvio Savarese. 2016. 3D-R2N2: A unified approach for single and multi-view 3D object reconstruction. In European Conference on Computer Vision (ECCV). 628–644.

 Deng et al. (2009) Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. 2009. ImageNet: a large-scale hierarchical image database. In Computer Vision and Pattern Recognition (CVPR).
 Frisken et al. (2000) Sarah F. Frisken, Ronald N. Perry, Alyn P. Rockwood, and Thouis R. Jones. 2000. Adaptively sampled distance fields: A general representation of shape for computer graphics. In SIGGRAPH. 249–254.
 Fuhrmann and Goesele (2014) Simon Fuhrmann and Michael Goesele. 2014. Floating scale surface reconstruction. ACM Trans. Graph. (SIGGRAPH) 33, 4 (2014), 46:1–46:11.
 Goodfellow et al. (2016) Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2016. Generative adversarial networks. In Neural Information Processing Systems (NIPS).
 Graham (2015) Ben Graham. 2015. Sparse 3D convolutional neural networks. In British Machine Vision Conference (BMVC).
 Groueix et al. ([n. d.]) Thibault Groueix, Matthew Fisher, Vladimir G. Kim, Bryan C. Russell, and Mathieu Aubry. [n. d.]. AtlasNet: A papier-mâché approach to learning 3D surface generation.
 Häne et al. (2017) Christian Häne, Shubham Tulsiani, and Jitendra Malik. 2017. Hierarchical surface prediction for 3D object reconstruction. In Proc. Int. Conf. on 3D Vision (3DV).
 He et al. (2016) K. He, X. Zhang, S. Ren, and J. Sun. 2016. Deep residual learning for image recognition. In Computer Vision and Pattern Recognition (CVPR).
 Jia et al. (2014) Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick, Sergio Guadarrama, and Trevor Darrell. 2014. Caffe: convolutional architecture for fast feature embedding. In ACM Multimedia (ACMMM). 675–678.
 Ju (2004) Tao Ju. 2004. Robust repair of polygonal models. ACM Trans. Graph. (SIGGRAPH) 23, 3 (2004), 888–895.
 Kato et al. (2018) Hiroharu Kato, Yoshitaka Ushiku, and Tatsuya Harada. 2018. Neural 3D Mesh Renderer. In Computer Vision and Pattern Recognition (CVPR).
 Kazhdan and Hoppe (2013) Michael Kazhdan and Hugues Hoppe. 2013. Screened Poisson surface reconstruction. ACM Trans. Graph. 32, 3 (2013), 29:1–29:13.
 Kingma and Welling (2014) Diederik P. Kingma and Max Welling. 2014. Auto-encoding variational Bayes. In International Conference on Learning Representations.
 Klokov and Lempitsky (2017) Roman Klokov and Victor Lempitsky. 2017. Escape from cells: Deep Kd-networks for the recognition of 3D point cloud models. In International Conference on Computer Vision (ICCV).
 Lecun et al. (1998) Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner. 1998. Gradient-based learning applied to document recognition. Proc. IEEE 86, 11 (1998), 2278–2324.
 Li et al. (2017) Jun Li, Kai Xu, Siddhartha Chaudhuri, Ersin Yumer, Hao Zhang, and Leonidas Guibas. 2017. GRASS: Generative recursive autoencoders for shape structures. ACM Trans. Graph. (SIGGRAPH) 36, 4 (2017), 52:1–52:14.
 Lin et al. (2018) Chen-Hsuan Lin, Chen Kong, and Simon Lucey. 2018. Learning efficient point cloud generation for dense 3D object reconstruction. arXiv:1706.07036 [cs.CV]. In AAAI Conference on Artificial Intelligence.
 Lun et al. (2017) Zhaoliang Lun, Matheus Gadelha, Evangelos Kalogerakis, Subhransu Maji, and Rui Wang. 2017. 3D shape reconstruction from sketches via multi-view convolutional networks. In Proc. Int. Conf. on 3D Vision (3DV).
 Masci et al. (2015) Jonathan Masci, Davide Boscaini, Michael M. Bronstein, and Pierre Vandergheynst. 2015. Geodesic convolutional neural networks on Riemannian manifolds. In International Conference on Computer Vision (ICCV).
 Maturana and Scherer (2015) D. Maturana and S. Scherer. 2015. VoxNet: A 3D convolutional neural network for realtime object recognition. In International Conference on Intelligent Robots and Systems (IROS).
 Meagher (1982) Donald Meagher. 1982. Geometric modeling using octree encoding. Computer Graphics and Image Processing 19 (1982), 129–147.
 Qi et al. (2017a) Charles R. Qi, Hao Su, Kaichun Mo, and Leonidas J. Guibas. 2017a. PointNet: Deep learning on point sets for 3D classification and segmentation. In Computer Vision and Pattern Recognition (CVPR).
 Qi et al. (2016) Charles Ruizhongtai Qi, Hao Su, Matthias Nießner, Angela Dai, Mengyuan Yan, and Leonidas J. Guibas. 2016. Volumetric and multiview CNNs for object classification on 3D data. In Computer Vision and Pattern Recognition (CVPR).
 Qi et al. (2017b) Charles R Qi, Li Yi, Hao Su, and Leonidas J Guibas. 2017b. PointNet++: Deep hierarchical feature learning on point sets in a metric space. In Neural Information Processing Systems (NIPS).
 Riegler et al. (2017b) Gernot Riegler, Ali Osman Ulusoy, Horst Bischof, and Andreas Geiger. 2017b. OctNetFusion: Learning depth fusion from data. In Proc. Int. Conf. on 3D Vision (3DV).
 Riegler et al. (2017a) Gernot Riegler, Ali Osman Ulusoy, and Andreas Geiger. 2017a. OctNet: Learning deep 3D representations at high resolutions. In Computer Vision and Pattern Recognition (CVPR).
 Sharma et al. (2018) Gopal Sharma, Rishabh Goyal, Difan Liu, Evangelos Kalogerakis, and Subhransu Maji. 2018. CSGNet: Neural shape parser for constructive solid geometry. In Computer Vision and Pattern Recognition (CVPR).
 Sinha et al. (2017) Ayan Sinha, Asim Unmesh, Qixing Huang, and Karthik Ramani. 2017. SurfNet: Generating 3D shape surfaces using deep residual networks. In Computer Vision and Pattern Recognition (CVPR).
 Soltani et al. ([n. d.]) Amir Arsalan Soltani, Haibin Huang, Jiajun Wu, Tejas D. Kulkarni, and Joshua B. Tenenbaum. [n. d.]. In Computer Vision and Pattern Recognition (CVPR).
 Su et al. (2017) Hao Su, Haoqiang Fan, and Leonidas Guibas. 2017. A point set generation network for 3D object reconstruction from a single image. In Computer Vision and Pattern Recognition (CVPR).
 Su et al. (2015) H. Su, S. Maji, E. Kalogerakis, and E. Learned-Miller. 2015. Multi-view convolutional neural networks for 3D shape recognition. In International Conference on Computer Vision (ICCV).
 Tatarchenko et al. (2017) M. Tatarchenko, A. Dosovitskiy, and T. Brox. 2017. Octree generating networks: efficient convolutional architectures for high-resolution 3D outputs. In International Conference on Computer Vision (ICCV).
 Tulsiani et al. (2017) Shubham Tulsiani, Hao Su, Leonidas Guibas, Alexei A. Efros, and Jitendra Malik. 2017. Learning shape abstractions by assembling volumetric primitives. In Computer Vision and Pattern Recognition (CVPR).
 Uhrig et al. (2017) Jonas Uhrig, Nick Schneider, Lukas Schneider, Thomas Brox, and Andreas Geiger. 2017. Sparsity invariant CNNs. In Proc. Int. Conf. on 3D Vision (3DV).
 Wang et al. (2018) Nanyang Wang, Yinda Zhang, Zhuwen Li, Yanwei Fu, Wei Liu, and YuGang Jiang. 2018. Pixel2Mesh: Generating 3D mesh models from single RGB images. In Computer Vision and Pattern Recognition (CVPR).
 Wang et al. (2017) Peng-Shuai Wang, Yang Liu, Yu-Xiao Guo, Chun-Yu Sun, and Xin Tong. 2017. O-CNN: Octree-based convolutional neural networks for 3D shape analysis. ACM Trans. Graph. (SIGGRAPH) 36, 4 (2017), 72:1–72:11.
 Wilhelms and Van Gelder (1992) Jane Wilhelms and Allen Van Gelder. 1992. Octrees for faster isosurface generation. ACM Trans. Graph. 11, 3 (1992), 201–227.
 Wu et al. (2016) Jiajun Wu, Chengkai Zhang, Tianfan Xue, William T. Freeman, and Joshua B. Tenenbaum. 2016. Learning a probabilistic latent space of object shapes via 3D generative-adversarial modeling. In Neural Information Processing Systems (NIPS).
 Wu et al. (2015) Zhirong Wu, Shuran Song, Aditya Khosla, Fisher Yu, Linguang Zhang, Xiaoou Tang, and Jianxiong Xiao. 2015. 3D ShapeNets: A deep representation for volumetric shape modeling. In Computer Vision and Pattern Recognition (CVPR).
 Yang et al. (2018b) Bo Yang, Stefano Rosa, Andrew Markham, Niki Trigoni, and Hongkai Wen. 2018b. 3D object dense reconstruction from a single depth view. arXiv:1802.00411 [cs.CV].
 Yang et al. (2018a) Yaoqing Yang, Chen Feng, Yiru Shen, and Dong Tian. 2018a. FoldingNet: Point cloud autoencoder via deep grid deformation. In Computer Vision and Pattern Recognition (CVPR).
 Zou et al. (2017) Chuhang Zou, Ersin Yumer, Jimei Yang, Duygu Ceylan, and Derek Hoiem. 2017. 3D-PRNN: Generating shape primitives with recurrent neural networks. In International Conference on Computer Vision (ICCV).
Appendix: Parameter Setting of Adaptive O-CNN
The detailed Adaptive O-CNN encoder and decoder networks for an octree with max depth 7 are shown in Figure 11. In the figure, represents the input feature at the octree level. represents the convolution operation with kernel size and output channel number . represents the deconvolution operation with kernel size , stride and output channel number . The kernel size and stride of the operation are both . represents the fully connected layer with output channel number . is the prediction module introduced in Section 4.2, which includes two operations. Here is the number of output channels of the first operation, and is fixed to . And is the number of output channels of the second operation, with which the plane parameters and the octant statuses are predicted. Since we make the octree adaptive from the level, the value of at the second and third levels in Figure 11 is set to , predicting whether or not to split an octant. From the octree level onward, the value of is set to : channels of the output are used to predict the octant fitting status (empty, surface-well-approximated, or surface-poorly-approximated); the other channels are used to regress the plane parameters. The input latent code dimension of the decoder is set to 128.
We use the SGD solver to optimize the networks, with a batch size of 32. In the shape classification experiment, the initial learning rate is 0.1; it is decreased by a factor of 10 after every 10 epochs, and training stops after 40 epochs. In the 3D autoencoding experiment, the initial learning rate is 0.1; it is decreased by a factor of 10 after 100k, 200k, and 250k iterations, and training stops after 350k iterations. In the shape prediction from a single image task, the initial learning rate is 0.1; it is decreased by a factor of 10 after 150k, 300k, and 350k iterations, and training stops after 400k iterations.
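The step schedules above can be written as a single function; the iteration boundaries shown are those of the single-image task, and the function is a sketch of the schedule, not the training code:

```python
def learning_rate(iteration, base_lr=0.1, steps=(150_000, 300_000, 350_000)):
    """Step schedule: start at base_lr and divide by 10 at each
    iteration listed in steps."""
    drops = sum(iteration >= s for s in steps)
    return base_lr * (0.1 ** drops)
```

The autoencoding schedule is obtained with steps=(100_000, 200_000, 250_000), and the classification schedule is the analogous rule over epochs.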