H-CNN: Spatial Hashing Based CNN for 3D Shape Analysis

03/30/2018 ∙ by Tianjia Shao, et al.

We present a novel spatial hashing based data structure to facilitate 3D shape analysis using convolutional neural networks (CNNs). Our method exploits the sparse occupancy of the 3D shape boundary and builds hierarchical hash tables for an input model under different resolutions. Based on this data structure, we design two efficient GPU algorithms, namely hash2col and col2hash, so that CNN operations like convolution and pooling can be efficiently parallelized. The spatial hashing is nearly minimal, and our data structure is almost of the same size as the raw input. Compared with state-of-the-art octree-based methods, our data structure significantly reduces the memory footprint during CNN training. As the input geometry features are more compactly packed, CNN operations also run faster with our data structure. Our experiments show that, under the same network structure, our method yields comparable or better benchmarks than the state of the art while consuming only about one third of the memory. Such superior memory performance allows the CNN to handle high-resolution shape analysis.


1 Introduction

3D shape analysis such as classification, segmentation, and retrieval has long stood as one of the most fundamental tasks in computer graphics. While many algorithms have been proposed (e.g., see [1]), they are often crafted for a sub-category of shapes by manually extracting case-specific features. A general-purpose shape analysis that handles a wide variety of 3D geometries is still considered challenging. On the other hand, convolutional neural networks (CNNs) are skilled at learning essential features out of the raw training data. They have demonstrated great success in many computer vision problems for 2D images/videos [2, 3, 4]. The impressive results from these works drive many follow-up investigations of leveraging various CNNs to tackle more challenging tasks in 3D shape analysis.

Projecting a 3D model into multiple 2D views is a straightforward idea that maximizes the re-usability of existing 2D CNN frameworks [5, 6, 7, 8]. However, if the input 3D model has complex geometry, degenerating it into multiple 2D projections may discard original shape features and lower the quality of the final result. It is known that most useful geometry information resides only on the surface of a 3D model. While embedded in R^3, this surface is essentially two-dimensional. Inspired by this fact, some prior works try to directly extract features out of the model's surface [9, 10] using, for instance, the Laplace-Beltrami operator [11]. These methods assume that the model's surface is second-order differentiable, which may not be the case in practice. In fact, many scanned or man-made 3D models consist of multiple components and are not even manifold, containing a large number of holes, dangling vertices and intersecting/interpenetrating polygons. Using a dense voxel-based discretization is another alternative [12, 13]. Unfortunately, treating a 3D model as a voxelized volume does not scale, as both memory usage and computational cost increase cubically with the voxel resolution. The input data can easily exceed the GPU memory limit even under moderate resolutions.

Octree-based model discretization significantly relieves the memory burden of 3D shape analysis [14, 15]. For instance, Wang et al. [15] proposed a framework named O-CNN (abbreviated as OCNN in this paper), which utilizes an octree to discretize the surface of a 3D shape. In octree-based methods, whether or not an octant is generated depends on whether or not its parent octant intersects with the input model. As a result, although the octree effectively reduces the memory footprint compared to the “brute-force” voxelization scheme, its memory overhead is still considerable since many redundant empty leaf octants are also generated, especially for high-resolution models.

In this paper, we provide a better answer to the question of how to wisely exploit the sparse occupancy of 3D models and structure them in a way that conveniently interfaces with various CNN architectures, as shown in Figure 1. In our framework, 3D shapes are packed using perfect spatial hashing (PSH) [16], and we name our framework Hash-CNN or HCNN. PSH is nearly minimal, meaning the size of the hash table is almost the same as the size of the input 3D model. As later discussed in Section 5.3, our memory overhead is tightly bounded by O(n^{4/3}) in the worst case, while OCNN has a memory overhead of O(n^2), not to mention other voxel-based 3D CNNs with O(n^3) consumption (here, n denotes the voxel resolution at the finest level). Due to this superior memory performance, HCNN is able to handle high-resolution shapes, which is hardly possible for the state of the art. Our primary contribution is investigating how to efficiently parallelize CNN operations using hash-based models. To this end, two GPU algorithms, namely hash2col and col2hash, are devised to facilitate CNN operations like convolution and pooling. Our experiments show that HCNN achieves comparable or better benchmarks under various shape analysis tasks compared with existing 3D CNN methods. In addition, HCNN consumes much less memory, and it also runs faster due to its compact data packing.

2 Related Work

Fig. 1: An overview of the HCNN framework for shape analysis. We construct a set of hierarchical PSHs to pack the surface geometric features of an input airplane model at different resolution levels. Compared with existing 3D CNN frameworks, our method fully utilizes the spatial sparsity of 3D models, and the PSH data structure is almost of the same size as the raw input. Therefore, we can perform high-resolution shape analysis with 3D CNNs efficiently. The final segmentation results demonstrate a clear advantage of high-resolution models: each part of the airplane model is much better segmented at the highest resolution, which is currently only possible with HCNN.

3D shape analysis [17, 18, 1] is one of the most fundamental tasks in computer graphics. Most existing works utilize manually crafted features for dedicated tasks such as shape retrieval and segmentation. Encouraged by great successes in 2D image analysis using CNN-based machine learning methods [19, 20, 21], many research efforts have been devoted to leveraging CNN techniques for 3D shape analysis.

A straightforward idea is to feed multiple projections of a 3D model as the CNN input [5, 6, 7, 8] so that existing CNN architectures for 2D images can be re-used. However, self-occlusion is almost inevitable for complicated shapes during the projection, and how to faithfully restore complete 3D information out of 2D projections remains an open problem.

Another direction is to perform CNN operations over geometric features defined on 3D model surfaces [22]. For instance, Boscaini et al. [10] used the windowed Fourier transform, and Masci et al. [9] used local geodesic polar coordinates to extract local shape descriptors for the CNN training. These methods, however, require the input models to be smooth manifolds, and therefore cannot be directly used for 3D models composed of point clouds or polygon soups. Alternatively, Sinha et al. [23] parameterized a 3D shape over a spherical domain and re-represented the input model using a geometry image [24], based on which the CNN training was carried out. Guo et al. [25] computed a collection of shape features and re-shaped them into a matrix as the CNN input. Recently, Qi et al. [26] used raw point clouds as the network input in a method referred to as PointNet, which uses shared multi-layer perceptrons and max pooling for feature extraction. Maron et al. [27] applied CNNs to sphere-type shapes using a global parametrization to a planar flat-torus.

Similar to considering images as arrays of 2D pixels, discretizing 3D models into voxels is a natural way to organize shape information for CNN-based shape analysis. Wu et al. [12] proposed 3D ShapeNets for 3D object detection. They represented a 3D shape as a probability distribution of binary variables on voxels. Maturana and Scherer [13] used a similar strategy to encode large point cloud datasets. They used a binary occupancy grid to distinguish free and occupied spaces, a density grid to estimate the probability that a voxel would block a sensor beam, and a hit grid to record the hit numbers. Such volumetric discretization consumes memory cubically w.r.t. the voxel resolution and is thus not feasible for high-resolution shape analysis. Observing that the spatial occupancy of 3D data is often sparse, Wang et al. [28] designed a feature-centric voting algorithm named Vote3D for fast recognition of cars, pedestrians and bicyclists from the KITTI database [29] using the sliding window method. More importantly, they demonstrated the mathematical equivalence between sparse convolution and voting. Based on this, Engelcke et al. [30] proposed a method called Vote3Deep that converts the convolution into voting procedures, which can be applied only to the non-empty voxels. However, with more convolution layers added to the CNN, this method quickly becomes prohibitive.

Octree-based data structures have proven to be an effective way to reduce the memory consumption of 3D shapes. For example, Riegler et al. [14] proposed a hybrid grid-octree data structure to support high-resolution 3D CNNs. Our work is most relevant to OCNN [15], which uses an octree to store the surface features of a 3D model and reduces the memory consumption of 3D CNNs to O(n^2), where n is the finest voxel resolution. In the octree data structure, an octant is subdivided into eight children octants if it intersects with the model's surface, regardless of whether all of those eight children octants are on the model. Therefore, the octree subdivision also yields futile octants that do not contain useful features of the model. We instead use a multi-level PSH [16] to organize voxelized 3D models. PSH is nearly minimal while retaining random access that is as cache-friendly as possible. As a result, the memory footprint of HCNN is close to the theoretical lower bound. Unlike in the original PSH work [16], the main hash table only stores the data index, and the real feature data is compactly assembled in a separate data array. We investigate how to seamlessly synergize hierarchical PSH-based models with CNN operations so that they can be efficiently executed on the GPU.

3 Spatial Hashing for 3D CNN

Fig. 2: An illustrative 2D example of the constitution of our PSH. The domain is a grid of 2D voxels (pixels). The red-shaded pixels stand for the input model. The green, blue, yellow and brown tables are the offset table (Φ), the hash table (H), the position tag (T) and the data array (D), respectively.

For a given input 3D shape, either a triangle soup/mesh or a point cloud, we first uniformly scale it to fit a unit sphere pivoted at the model's geometric center. Then, an axis-aligned bounding cube is built, whose edge length equals the sphere's diameter. Doing so ensures that the model remains inside the bounding cube under arbitrary rotations, so that we can apply data augmentation by rotation during training (see Section 5.4). This bounding cube is uniformly subdivided into grid cells or voxels along the x, y, and z axes. Each voxel is a small cube; it is considered non-empty when it encapsulates a small patch of the model's boundary surface. As suggested in [15], we put extra sample points on this embedded surface patch, and the averaged normal of all the sample points is fed to the CNN as the input signal. For an empty voxel, the input is simply a zero vector.
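To make this input signal concrete, below is a minimal C++ sketch (not the authors' code) of how the averaged normal of a non-empty voxel could be computed from its surface sample points; the Sample/Vec3 types and the flattened voxel key are illustrative assumptions.

#include <array>
#include <cmath>
#include <cstdint>
#include <unordered_map>
#include <vector>

struct Vec3 { float x = 0.f, y = 0.f, z = 0.f; };

// One sample point on the surface patch embedded in a voxel.
struct Sample { std::array<int, 3> voxel; Vec3 normal; };

// Returns, for each occupied voxel, the unit-length average normal of its samples.
// Empty voxels are simply absent from the map (their input signal is a zero vector).
std::unordered_map<uint64_t, Vec3> averageNormals(const std::vector<Sample>& samples, int n) {
    std::unordered_map<uint64_t, Vec3> acc;
    for (const Sample& s : samples) {
        uint64_t key = (uint64_t(s.voxel[0]) * n + s.voxel[1]) * n + s.voxel[2];
        Vec3& v = acc[key];                              // accumulate the sample normals
        v.x += s.normal.x; v.y += s.normal.y; v.z += s.normal.z;
    }
    for (auto& kv : acc) {                               // re-normalize the averaged normal
        Vec3& v = kv.second;
        float len = std::sqrt(v.x * v.x + v.y * v.y + v.z * v.z);
        if (len > 0.f) { v.x /= len; v.y /= len; v.z /= len; }
    }
    return acc;
}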

3.1 Multi-level PSH

A set of hierarchical PSHs is built. At each level of the hierarchy, we construct a data array D, a hash table H, an offset table Φ and a position tag T. The data array at the finest level stores the input feature (i.e., the normal direction of the voxel). Let U be a d-dimensional discrete spatial domain of voxels, out of which the sparse geometry data S occupies n grid cells (i.e., n = |S| ≪ |U|). In other words, U represents all the voxels within the bounding cube at the given resolution, and S represents the set of voxels intersecting with the input model. We seek a hash table H, which is a d-dimensional array of size m^d ≥ n, and a d-dimensional offset table Φ of size r^d. By building a map h_0(p) = p mod m from S to the hash table and a map h_1(p) = p mod r from S to the offset table, one can obtain the perfect hash function mapping each non-empty voxel p on the 3D shape to a unique slot in the hash table as:

h(p) = (h_0(p) + Φ[h_1(p)]) mod m,     (1)

where the modulo is applied component-wise.

Note that the hash table H possesses slightly excessive slots (i.e., m^d ≥ n) to make sure that the hashing representation of S is collision free. A NULL value is stored at those redundant slots in H. Clearly, these NULL values should not participate in CNN operations like batch normalization and scale. To this end, we assemble all the data for S into a compact one-dimensional array D of size n, and H only houses the data index in D. If a slot in H is redundant, its index is set to -1 so that the associated data query is skipped.

Fig. 3: PSH data structures for a mini-batch of two models. All the feature data for the red-shaded pixels are stored in the super data array D*, which concatenates the data arrays of the individual models. The super hash table H*, position tag T* and model index table M* are of the same size. For a given hash slot indexed at i, one can instantly know that this voxel, if not empty, is on the M*[i]-th model in the batch by checking the model index table M*. This information bridges the data sets from different hierarchy levels. With the auxiliary accumulated index tables, we can directly pinpoint a model's entries by offsetting its local indices accordingly. For instance, in this simple example, when the local hash index is computed using Eq. (1) for a non-empty voxel on the second model, its hash index in H* can then be obtained by offsetting the local hash index by the accumulated table size of the first model.

Empty voxels (i.e., voxels not in S) may also be visited during CNN operations like convolution and pooling. Plugging these voxels' indices into Eq. (1) is likely to return incorrect values that actually correspond to other non-empty grid cells. To avoid this mismatch, we adopt the strategy used in [16] and add an additional position tag table T, which has the same size as H. T stores the voxel index p for the corresponding slot in H. Therefore, when a grid cell p is queried, we first check its data index H[h(p)]. If it returns a valid index other than -1, we further check the position tag to make sure T[h(p)] = p. Otherwise, p is an off-model voxel and the associated CNN operation should be skipped. In our implementation, we use a 16-bit position tag for each of the x, y and z indices, which supports a voxelization resolution of up to 65,536^3.

Figure 2 gives an illustrative 2D toy example. The domain is a 2D pixel grid, and the red-shaded pixels stand for the input model. Assume that a pixel p on the model is queried. h_1(p) gives its 2D index in the offset table (the green table in the figure), and the stored offset Φ[h_1(p)] is added to h_0(p) to compute the final index in the hash table H (the blue table): h(p) = (h_0(p) + Φ[h_1(p)]) mod m. Before we access the corresponding data cell in D (the fourth cell in this example), the position tag table T (the yellow table) is queried. Since T[h(p)] equals the original pixel index of p, we know that p is indeed on the input model. Note that in this example one slot of H is redundant (colored in dark blue in Figure 2); therefore, the corresponding index stored there is -1.
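The lookup described by Eq. (1) and the position-tag test can be summarized by the following C++ sketch; the PSHLevel layout, the flattened array indexing and the function names are illustrative assumptions rather than the paper's implementation.

#include <array>
#include <vector>

using Coord = std::array<int, 3>;

// Hypothetical PSH tables for one hierarchy level, flattened to 1D arrays.
// H stores indices into the data array D, or -1 for redundant slots.
struct PSHLevel {
    int m = 0, r = 0;            // hash table is m^3, offset table is r^3
    std::vector<int> H;          // m^3 entries: data index or -1
    std::vector<Coord> offsets;  // r^3 entries: the offset table Phi
    std::vector<Coord> tags;     // m^3 entries: the position tags T
};

static int flatten(const Coord& c, int dim) { return (c[0] * dim + c[1]) * dim + c[2]; }

// Eq. (1): h(p) = (h_0(p) + Phi[h_1(p)]) mod m, applied component-wise.
// Returns the data index in D, or -1 if p is an off-model (empty) voxel.
int queryPSH(const PSHLevel& psh, const Coord& p) {
    Coord h1 = { p[0] % psh.r, p[1] % psh.r, p[2] % psh.r };
    const Coord& off = psh.offsets[flatten(h1, psh.r)];
    Coord h = { (p[0] % psh.m + off[0]) % psh.m,
                (p[1] % psh.m + off[1]) % psh.m,
                (p[2] % psh.m + off[2]) % psh.m };
    int slot = flatten(h, psh.m);
    if (psh.H[slot] == -1) return -1;   // redundant slot: no data behind it
    if (psh.tags[slot] != p) return -1; // position tag mismatch: p is an empty voxel
    return psh.H[slot];                 // index into the compact data array D
}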

3.2 Mini-batch with PSH

During the CNN training, the network parameters are typically optimized over a subset of the training data, referred to as a mini-batch. Let b be the batch size and l be the resolution level. In order to facilitate per-batch CNN training, we build a “super-PSH” by attaching H, Φ and T of all the models in a batch, yielding H*, Φ* and T*, as illustrated in Figure 3. That is, we expand each of these d-dimensional tables into a 1D array and concatenate them together. The data array D* of the batch is shaped as a row-major c_l × n* matrix, where c_l is the number of channels at level l, and n* is the total number of non-empty voxels of all the models in the batch. A column of D* is a c_l-vector, and it stores the features of the corresponding voxel. The dimensions of the per-model tables are packed alongside them so that each model's table sizes can be retrieved.

In addition, we also record accumulated indices for H*, Φ*, T* and D*, which store, for each model in the batch, the starting offsets of its hash table, offset table, position tag and data array within the corresponding super tables. For instance, the segment of H* between the i-th and the (i+1)-th accumulated hash indices corresponds to the hash table of the i-th model; the segments of Φ*, T* and D* are located analogously. Lastly, we build a model index table M* for the inverse query. Here, M* has the same size as H*, and each of its slots stores the index of the model in the batch to which the corresponding hash slot belongs.
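A minimal sketch of this batching scheme is given below; the struct and field names (SuperPSH, accumH, modelIndex) are illustrative assumptions, but the bookkeeping mirrors the description above: per-model tables are concatenated, and an accumulated index array converts a local index into a super-table index.

#include <vector>

struct SuperPSH {
    std::vector<int> H;           // concatenated hash tables of all models in the batch
    std::vector<int> accumH;      // accumH[j] = start of model j's hash table; accumH.back() = total size
    std::vector<int> modelIndex;  // same size as H: the model id owning each super slot
};

SuperPSH concatenate(const std::vector<std::vector<int>>& perModelH) {
    SuperPSH s;
    s.accumH.push_back(0);
    for (size_t j = 0; j < perModelH.size(); ++j) {
        s.H.insert(s.H.end(), perModelH[j].begin(), perModelH[j].end());
        s.modelIndex.insert(s.modelIndex.end(), perModelH[j].size(), int(j));
        s.accumH.push_back(int(s.H.size()));
    }
    return s;
}

// A local hash index computed with Eq. (1) for a voxel on model j becomes a
// super-table index by adding the model's accumulated offset.
inline int toSuperIndex(const SuperPSH& s, int modelJ, int localHashIndex) {
    return s.accumH[modelJ] + localHashIndex;
}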

4 CNN Operations with Multi-level PSH

In this section, we show how to apply CNN operations such as convolution/transposed convolution, pooling/unpooling, batch normalization and scale to the PSH-based data structure so that they can be efficiently executed on the GPU.

Convolution   The convolution operator in the unrolled form is:

f_o(v) = Σ_c Σ_{i,j,k} W^{(c)}_{ijk} · f_c(v_{ijk}),     (2)

where v_{ijk} is a neighboring voxel of voxel v, and f_c and W^{(c)} are the feature vector and the kernel weights of the c-th input channel. This nested summation can be reshaped as a matrix product [31] and computed efficiently on the GPU:

D_o = W · \tilde{D}_i.     (3)

Let i and o denote the input and output hierarchy levels of the convolution. D_o is essentially the matrix representation of the output data array. Each column of D_o is the feature signal of an output voxel. A row vector of W concatenates the vectorized kernel weights for all the input channels, and the number of rows in W equals the number of convolution kernels employed. We design a subroutine hash2col to assist the assembly of the matrix \tilde{D}_i, which fetches feature values out of the input data array D_i so that a column of \tilde{D}_i stacks the feature signals within the receptive field covered by the kernels.
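The matrix product of Eq. (3) is in practice delegated to a GPU GEMM routine (e.g., through Caffe); the naive C++ loop below is only a reference sketch of what that product computes, with illustrative row-major layouts and dimension names.

#include <vector>

// D_o = W * Dtilde_i, where Dtilde_i is the matrix assembled by hash2col.
// W is [numKernels x fanIn], Dtilde_i is [fanIn x numOutVoxels], D_o is [numKernels x numOutVoxels].
void convolveAsGemm(const std::vector<float>& W, const std::vector<float>& Dtilde,
                    std::vector<float>& Dout, int numKernels, int fanIn, int numOutVoxels) {
    Dout.assign(size_t(numKernels) * numOutVoxels, 0.f);
    for (int k = 0; k < numKernels; ++k)
        for (int c = 0; c < fanIn; ++c) {
            float w = W[size_t(k) * fanIn + c];
            for (int v = 0; v < numOutVoxels; ++v)
                Dout[size_t(k) * numOutVoxels + v] += w * Dtilde[size_t(c) * numOutVoxels + v];
        }
}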

Input: , , , , , , , , , , , , ,
Output:
launch threads;
/* is the thread index */
for  do
        ;
         // is the channel index
        ;
        ;
         // is the model index in the mini-batch
        ;
         // is the column index
        if  then
               return ;
                // points to an empty hash slot
              
        end if
       else
               ;
                // is the voxel position
               ;
                // is the receptive field on
               /* s and p are the stride size and padding size */
               if  then
                      ;
                     
               end if
              else
                      ;
                     
               end if
              ;
                // is current row index in
               /* iterate all the voxels on the receptive field */
               for  do
                      ;
                      );
                      if  and  then
                             ;
                             ;
                            
                      end if
                     else
                             ;
                            
                      end if
                     /* assume is iterated according to its spatial arrangement in */
                      ;
                     
               end for
              
        end if
       
end for
Algorithm 1 hash2col subroutine

The algorithmic procedure for hash2col is detailed in Algorithm 1. In practice, we launch one CUDA thread per pair of output hash slot and input channel, i.e., c_i · n*_o threads in total, where c_i is the number of input channels and n*_o, the last entry of the accumulated hash index array at the output level, gives the total number of hash slots of the batch at that level. Hence, our parallelization scheme can be understood as assigning a thread to collect the necessary feature values within the receptive field for each output hash slot per channel. The basic idea is to find the receptive field that corresponds to an output voxel and retrieve the features inside it. A practical challenge lies in the fact that the output and input data arrays may reside on voxel grids of different hierarchy levels. Therefore, we resort to the PSH mapping (Eq. (1)) and the position tag table to build the necessary output-input correspondence.

Given a thread index t, we compute its associated channel index as c = ⌊t / n*_o⌋. Its super hash index (i.e., the index in H*_o) is simply t mod n*_o, from which the model index table M* tells us which model in the batch this thread works on. If the hash slot stores a valid data index (i.e., not -1), meaning the thread corresponds to a non-empty output voxel, the index of the column in D_o (and of the column of \tilde{D}_i to be assembled) that houses the corresponding output feature follows from that data index together with the accumulated data indices of the batch.

With the help of the position tag table T*_o, the position q of the output voxel associated with the thread can be retrieved, based on which we can obtain the input voxel positions within the receptive field and construct the corresponding column in \tilde{D}_i. Specifically, if the stride size is one, indicating that the voxel resolution is unchanged after the convolution, the input model has the same hash structure as the output. In this case, the receptive field associated with q spans from q - (k-1)/2 to q + (k-1)/2 along each dimension, where k is the kernel size. On the other hand, if the stride size s is larger than one, the convolution down-samples the input feature, and the receptive field of q starts at s·q - p and spans k voxels along each dimension, where p is the padding size. For irregular kernels [32, 33], we can similarly obtain the corresponding receptive field based on the kernel's support.

As mentioned, for a voxel q' within the receptive field of a given output voxel, we know that it belongs to the j-th model of the batch, where j is read from the model index table. Therefore, its index in the super offset table Φ*_i can be computed as:

φ*(q') = A^Φ_i[j] + h_1(q'),     (4)

where A^Φ_i is the accumulated offset index array at level i, A^Φ_i[j] returns the starting index of the j-th model's offset table in the super table Φ*_i, and h_1(q') computes the (local) offset index. Thus, the offset value of q' can be queried by Φ*_i[φ*(q')]. The index of q' in the super hash table H*_i can be computed similarly as:

h*(q') = A^H_i[j] + (h_0(q') + Φ*_i[φ*(q')]) mod m,     (5)

where h_0 and h_1 are the maps defined on hierarchy level i and A^H_i is the accumulated hash index array at that level. If H*_i[h*(q')] ≠ -1 and the position tag is also consistent (i.e., T*_i[h*(q')] = q'), we fetch the feature from the data array at the column

d*(q') = A^D_i[j] + H*_i[h*(q')],     (6)

where A^D_i is the accumulated data index array at level i. Otherwise, a zero value is returned.
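The gather performed by each hash2col thread can be summarized by the following CPU reference sketch for a single output voxel and all input channels; it reuses the hypothetical PSHLevel/queryPSH helpers sketched in Section 3.1, and the stride/padding arithmetic follows the receptive-field description above (one CUDA thread would run this loop per output hash slot and channel).

#include <array>
#include <vector>

// Fill one column of Dtilde_i for the output voxel outVoxel. Din is the compact input
// data array laid out as [channels x nIn]; k, s, p are kernel, stride and padding sizes.
void gatherColumn(const PSHLevel& input, const std::vector<float>& Din, int channels, int nIn,
                  const std::array<int, 3>& outVoxel, int k, int s, int p,
                  std::vector<float>& column /* size channels * k^3 */) {
    column.assign(size_t(channels) * k * k * k, 0.f);   // empty voxels contribute zeros
    int row = 0;
    for (int dx = 0; dx < k; ++dx)
        for (int dy = 0; dy < k; ++dy)
            for (int dz = 0; dz < k; ++dz, ++row) {
                // Receptive-field voxel on the input grid: start = s*q - p.
                std::array<int, 3> q = { outVoxel[0] * s - p + dx,
                                         outVoxel[1] * s - p + dy,
                                         outVoxel[2] * s - p + dz };
                if (q[0] < 0 || q[1] < 0 || q[2] < 0) continue;   // outside the padded grid
                int dataIdx = queryPSH(input, q);                 // -1 for empty voxels
                if (dataIdx < 0) continue;
                for (int c = 0; c < channels; ++c)
                    column[size_t(c) * k * k * k + row] = Din[size_t(c) * nIn + dataIdx];
            }
}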

Input: , , , , , , , , , , , , ,
Output:
;
  // all the entries in are initialized as 0
launch threads;
/* is the thread index */
for  do
        ;
         // is the channel index
        ;
        ;
         // is the model index in the mini-batch
        ;
         // is the column index
        if  then
               return ;
                // points to an empty hash slot
              
        end if
       else
               ;
                // is the voxel position
               ;
                // is the receptive field on
               /* and are the stride size and padding size */
               if  then
                      ;
                     
               end if
              else
                      ;
                     
               end if
              ;
                // is current row index in
               /* iterate all the voxels on the receptive field */
               for  do
                      ;
                      ;
                      if  and  then
                             ;
                             ;
                      end if
                     ;
                     
               end for
              
        end if
       
end for
Algorithm 2 col2hash subroutine

Back propagation & weight update   During the CNN training and optimization, the numerical gradient of the kernel weights is computed as:

∇W = ∇D_o · \tilde{D}_i^T,     (7)

where ∇D_o is the variation of the output data array D_o. In order to apply Eq. (7) in previous CNN layers, we also calculate how the variational error is propagated back:

∇\tilde{D}_i = W^T · ∇D_o.     (8)

Clearly, we need to re-pack the errors in ∇\tilde{D}_i in accordance with the format of the data array D_i so that the resulting matrix can be sent to the previous CNN layer. This process is handled by the col2hash subroutine, outlined in Algorithm 2. As the name implies, col2hash is quite similar to hash2col except at line 26, where the variational errors from the receptive field are lumped into a single accumulated error.
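Conversely, the col2hash scatter can be sketched as the mirror image of the gather above: each entry of a back-propagated column is accumulated into the gradient of the compact input data array (on the GPU the accumulation becomes an atomicAdd, since neighboring receptive fields overlap). Again, this is an illustrative CPU reference, not the authors' kernel.

#include <array>
#include <vector>

// Accumulate one column of the back-propagated errors (a column of W^T * grad(D_o))
// into gradDin, which is laid out as [channels x nIn] like the input data array.
void scatterColumn(const PSHLevel& input, std::vector<float>& gradDin, int channels, int nIn,
                   const std::array<int, 3>& outVoxel, int k, int s, int p,
                   const std::vector<float>& column /* size channels * k^3 */) {
    int row = 0;
    for (int dx = 0; dx < k; ++dx)
        for (int dy = 0; dy < k; ++dy)
            for (int dz = 0; dz < k; ++dz, ++row) {
                std::array<int, 3> q = { outVoxel[0] * s - p + dx,
                                         outVoxel[1] * s - p + dy,
                                         outVoxel[2] * s - p + dz };
                if (q[0] < 0 || q[1] < 0 || q[2] < 0) continue;
                int dataIdx = queryPSH(input, q);
                if (dataIdx < 0) continue;            // errors falling on empty voxels are dropped
                for (int c = 0; c < channels; ++c)    // the "+=" becomes atomicAdd on the GPU
                    gradDin[size_t(c) * nIn + dataIdx] += column[size_t(c) * k * k * k + row];
            }
}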

Pooling, unpooling & transposed convolution   The pooling layer condenses the spatial size of the input features by using a single representative activation per receptive field. This operation can be regarded as a special type of convolution with a stride size larger than one. Therefore, the hash2col subroutine can also assist the pooling operation. Average-pooling is handled as a convolution whose kernel weights all equal the reciprocal of the window size. For max-pooling, instead of performing a stretched inner product across the receptive field, we output the maximum signal after the traversal of the receptive field (the for loop at line 20 in Algorithm 1). Unlike OCNN [15], our framework supports arbitrary stride sizes for the pooling since the PSH can be generated on a grid of any resolution.

The unpooling operation aims to partially revert the input activation after the pooling, which could be useful for understanding the CNN features [34, 35] or restoring the spatial structure of the input activations for segmentation [36], flow estimation [37], and generative modeling [36]. During the max-pooling, we record the index of the maximum activation for each receptive field (known as the switch). When performing the max-unpooling, the entire receptive field corresponding to an input voxel is initialized to be zero, and the feature signal is restored only at the recorded voxel index. The average-unpooling is similarly handled, where we evenly distribute the input activation over its receptive field.
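A minimal sketch of max-pooling with recorded switches, operating on a receptive field that has already been gathered into a column (layout and names are illustrative):

#include <algorithm>
#include <limits>
#include <vector>

// Max-pooling over one gathered receptive field: returns the maximum activation and
// records its position (the "switch") for the later unpooling.
float maxPoolWithSwitch(const std::vector<float>& field, int& switchIdx) {
    float best = -std::numeric_limits<float>::infinity();
    switchIdx = -1;
    for (int i = 0; i < int(field.size()); ++i)
        if (field[i] > best) { best = field[i]; switchIdx = i; }
    return best;
}

// Max-unpooling: the receptive field is zero-initialized and the pooled activation is
// restored only at the recorded switch position.
void maxUnpool(float pooled, int switchIdx, std::vector<float>& field) {
    std::fill(field.begin(), field.end(), 0.f);
    if (switchIdx >= 0) field[switchIdx] = pooled;
}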

Transposed convolution is also referred to as deconvolution or fractionally strided convolution [38], and it has been proven useful for enhancing the activation map [34, 39]. Mathematically, the transposed convolution is equivalent to a regular convolution and could be handled with the hash2col subroutine. However, doing so involves excessive zero padding and thus degrades the network's performance. In fact, the deconvolution flips the input and output of the forward convolution using a transposed kernel, i.e., \tilde{D}_o = W^T · D_i, which is exactly how we handle the error back propagation (i.e., Eq. (8)). Therefore, the col2hash subroutine can be directly used for deconvolution operations.

Other CNN operations   Because all the feature values in HCNN are compactly stored in the data array , operations that are directly applied to the feature values like batch normalization [40] and scale can be trivially parallelized on GPU.

5 Experimental Results

Our framework was implemented on a desktop computer equipped with an Intel i7-6950X CPU (3.0 GHz) and an nVidia GeForce GTX 1080 (Pascal) GPU with 8 GB of memory. We used the Caffe framework [41] for the CNN implementation. The 3D models used are from ModelNet40 [12] and ShapeNet Core55 [42]; both are publicly available. The source code of HCNN can be found in the accompanying supplementary file. The executable and some of the training data in PSH format can also be downloaded via the anonymous Google Drive link, which can also be found in the supplementary file. We encourage readers to test HCNN by themselves.

Model rectification   It has been noticed that the normal information of 3D models from the ModelNet database is often incorrect or missing. We fix the normal information by casting rays from 14 virtual cameras (at the six faces and eight corners of the bounding cube). Some 3D models use a degenerated 2D plane to represent a thin shape. For instance, the back of a chair model may consist of only two flat triangles. To restore the volumetric information of such thin geometries, we displace the sample points on the model along its normal direction by half a voxel, where the voxel size is determined by the resolution at the finest hierarchy level. In other words, the model's surface is slightly dilated by a half-voxel size.
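The half-voxel dilation can be sketched as below; the bounding-cube edge length (assumed to be 2, the unit-sphere diameter) and the exact displacement factor are illustrative assumptions.

#include <vector>

struct SamplePoint { float p[3]; float n[3]; };   // position and unit normal

// Displace every surface sample along its normal by half a voxel so that flat,
// zero-thickness parts still occupy voxels on both sides after voxelization.
void dilateSamples(std::vector<SamplePoint>& samples, int finestResolution) {
    const float voxelSize = 2.0f / float(finestResolution);   // cube edge / resolution
    const float halfVoxel = 0.5f * voxelSize;
    for (SamplePoint& s : samples)
        for (int i = 0; i < 3; ++i)
            s.p[i] += halfVoxel * s.n[i];
}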

5.1 Network Architecture

A carefully fine-tuned network architecture could significantly improve the CNN results and reduce the training effort. Nevertheless, this is neither the primary motivation nor the contribution of this work. In order to report an apples-to-apples comparison with our peers and benchmark our method objectively, we employ a network similar to the well-known LeNet [43].

In our framework, the convolution and pooling operations are repeated from the finest level, and ReLU is used as the activation function. A batch normalization (BN) is also applied to reduce the internal covariate shift [40]. Our PSH hierarchy allows very dense voxelization at a resolution of 512^3 (see Figure 4), i.e., hierarchy level l = 9 with resolution 2^l. Each coarser level reduces the resolution by half down to the coarsest level. Such a multi-level PSH configuration exactly matches the OCNN hierarchy, which allows us to better evaluate the performance of the two data structures. At each level, we apply the same operation sequence: convolution, BN, ReLU and pooling. The receptive field of the kernels is 3^3, and the number of channels at the l-th level is set to match the OCNN configuration.

Three classic shape analysis tasks, namely shape classification, retrieval, and segmentation, are benchmarked. For the classification, two fully connected (FC) layers, a softmax layer and two dropout layers [44, 45] are appended, ordered as dropout, FC, dropout, FC and softmax; the number of neurons at each FC layer follows the OCNN configuration. For the shape retrieval, we use the output from the object classification as the key to search for the shapes most similar to the query. For the segmentation, we follow the DeconvNet [36] structure, which adds a deconvolution network after a convolution network for dense predictions. The deconvolution network simply reverses the convolution procedure, with the convolution and pooling operators replaced by deconvolution and unpooling operators. Specifically, at each level we apply unpooling, deconvolution, BN and ReLU, and then move to the next finer level.

Fig. 4: The benefit of dense voxelization is obvious. The discretized model better captures the geometry of the original shape at higher resolutions.

The reader may notice that our experimental setting mirrors the one used in [15], except that all the features are organized using PSH rather than octrees. This is because we consider OCNN [15] our primary competitor and would like to report an objective side-by-side comparison with it. Lastly, we would like to remind the reader that HCNN is not restricted to power-of-two resolution changes. To the best of our knowledge, HCNN is compatible with all existing CNN architectures and operations.

Training specifications   The network is optimized using the stochastic gradient descent method with momentum and weight decay; the mini-batch size and dropout ratio follow the OCNN setup. The initial learning rate is attenuated by a factor of 10 after 10 epochs.

5.2 PSH Construction

As a data pre-processing step, we construct a multi-level PSH for each 3D model. The size of the hash table at each level is set as the smallest cube m^3 that is larger than or equal to the number n of non-empty voxels. Each hash table slot is an int, which stores the data array index of the corresponding voxel; this supports high-resolution models well beyond the resolutions used in our experiments. Next, we seek to make the offset table as compact as possible. The offset table size r^3 is initialized as the smallest cube no less than a fixed fraction of n, with the fraction empirically set as in [16]. An offset table cell is of 24 bits, and each offset value is an 8-bit unsigned char, which allows an offset of up to 255 in each dimension. If the hash construction fails, we increase r (i.e., double the offset table capacity) until the construction succeeds. We refer readers to [16] for implementation details. The construction of PSH is a pre-process and completely offline, yet it could be further accelerated on GPUs as in [46].
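The table sizing can be summarized by the sketch below; the load factor sigma and the tryBuildPSH callback (which attempts a collision-free PSH construction for the given sizes) are placeholders for the procedure of [16].

#include <cmath>
#include <functional>

// Smallest integer s with s^3 >= target, guarded against floating-point rounding.
int smallestCubeSide(double target) {
    int s = int(std::ceil(std::cbrt(target)));
    while (double(s) * s * s < target) ++s;
    return s;
}

// n: number of non-empty voxels; sigma: initial offset-table load factor;
// tryBuildPSH(m, r): returns true if a collision-free PSH was built for sizes m^3 and r^3.
void chooseTableSizes(int n, double sigma, const std::function<bool(int, int)>& tryBuildPSH,
                      int& m, int& r) {
    m = smallestCubeSide(n);               // hash table: smallest m^3 >= n
    r = smallestCubeSide(sigma * n);       // offset table: smallest r^3 >= sigma * n
    while (!tryBuildPSH(m, r))
        r = smallestCubeSide(2.0 * double(r) * r * r);   // double the offset table capacity
}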

5.3 Memory Analysis

An important advantage of using PSH is its excellent memory performance compared with state-of-the-art methods. Our closest competitor is OCNN [15], where the total number of octants at the finest level does not depend on whether the leaf octants intersect with the input model. Instead, it is determined by the occupancy of their parents: when a parent octant overlaps with the model's boundary, all of its eight children octants are generated. While OCNN's memory consumption is quadratically proportional to the voxel resolution in the asymptotic sense, it also wastes memory on leaf octants that are not on the model. On the other hand, the memory overhead of our PSH-based data structure primarily comes from the difference between the actual model size, i.e., the number N of voxels on the model at the finest level, and the hash table size (the offset tables are typically much smaller than the main hash table). Assume that the input model size is N. The hash table size is m^3, where m is the smallest integer satisfying m^3 ≥ N, i.e., m ≤ N^{1/3} + 1. The memory overhead of PSH can then be estimated via:

m^3 - N ≤ (N^{1/3} + 1)^3 - N = 3N^{2/3} + 3N^{1/3} + 1,     (9)

which is O(N^{2/3}). Since the number of surface voxels scales as N = O(n^2) with the finest resolution n, this overhead is O(n^{4/3}). In other words, the memory overhead of our HCNN is polynomially smaller than that of OCNN.
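As a purely hypothetical illustration of this bound (the numbers are not from the paper): a model occupying N = 50,000 surface voxels needs

m = ⌈N^{1/3}⌉ = 37,   m^3 − N = 50,653 − 50,000 = 653,

i.e., an overhead of about 1.3% of the raw model size, since 36^3 = 46,656 < 50,000 ≤ 37^3 = 50,653.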

Fig. 5: The sizes of the PSH and octree data structures used to encode the bunny model under increasing resolutions. Under each resolution, the total number of octants and the sizes of the hash table (H) and offset table (Φ) are reported. Their quadratic growth trends are also plotted.

Figure 5 compares the sizes of the primary data structures for the bunny model (Figure 4) using OCNN and HCNN, namely the total number of leaf octants and the size of the hash table H at the finest level. The size of the offset table is typically an order of magnitude smaller than that of H. Besides, the number of voxels on the model is also reported. It can be clearly seen from the figure that the size of the hash table is very close to the actual model size (i.e., the lower bound of the data structure); the latter is highlighted as grey bars in the figure. The asymptotic spatial complexities of both HCNN and OCNN are O(n^2); however, the plotted growth trends show that HCNN is much more memory efficient than OCNN.

Fig. 6: The actual memory consumption of OCNN and HCNN over a mini-batch of models. The physical memory cap of the card is 8 GB. HCNN allows much denser voxelization than OCNN, even with pre-stored neighbor information; OCNN can only handle lower resolutions when the neighborhood is recorded.

In reality, the memory footprint follows a similar pattern. Figure 6 compares the memory usage of OCNN and HCNN during mini-batch training. A mini-batch consists of random models, and the memory usage differs from batch to batch; therefore, we report the batch that uses the largest amount of memory during 1,000 forward and backward iterations. It can be seen from the figure that HCNN needs considerably less memory than OCNN at the same resolution. When the resolution is further increased, OCNN is unable to fit the entire batch into the 8 GB memory of the GTX 1080 video card, while our method is not even close to the cap. If one chooses to use the entire voxel grid, a mini-batch would need over 2 GB of memory (with nVidia cuDNN) even at a moderate resolution, which is roughly four times that of HCNN. During CNN training, one can accelerate the convolution-like operations by saving the neighborhood information for each non-empty voxel (or each leaf octant with OCNN). With this option enabled, OCNN fails at an even lower resolution, while our method can still handle the batch at the highest resolution tested. The plotted growth trends also suggest that the gap in memory consumption between OCNN and HCNN widens quickly as the voxel resolution increases.

Network architecture Without voting With voting
HCNN(32)
OCNN(32)
FullVox(32)
HCNN(64)
OCNN(64)
FullVox(64)
HCNN(128)
OCNN(128)
HCNN(256)
OCNN(256)
HCNN(512)
OCNN(512) OOM OOM
VoxNet(32)
Geometry image
SubVolSup(32)
FPNN(64)
PointNet
VRN(32)

TABLE I: Benchmark of shape classification on the ModelNet40 dataset. In the first portion of the table, we report the classification results using HCNN and OCNN. The classification accuracy using fully voxelized models (FullVox) is also reported. The number following a network name indicates the resolution of the discretization. In the second portion of the table, the benchmarks of other popular networks are listed for comparison. The best benchmark within a given group is highlighted in blue.

5.4 Shape Classification

The first shape analysis task is shape classification, which returns a label out of a pre-defined list that best describes the input model. The dataset used is ModelNet40 [12], consisting of 9,843 training models and 2,468 test models. The upright direction of each model is known, and we rotate each model about the upright direction, uniformly generating 12 poses for training. At the test stage, the scores of these 12 poses can be pooled together to increase the accuracy of the prediction. This strategy is known as orientation voting [13]. The classification benchmarks of HCNN under resolutions from 32^3 to 512^3, with and without voting, are reported in Table I.
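Orientation voting amounts to averaging the per-pose class scores before taking the best class; a minimal sketch (shapes and names are illustrative):

#include <vector>

// poseScores holds one score vector per pose (12 in our setting); the pooled scores
// are their average, and the predicted class is the argmax of the pooled vector.
int classifyWithVoting(const std::vector<std::vector<float>>& poseScores) {
    if (poseScores.empty()) return -1;
    std::vector<float> pooled(poseScores[0].size(), 0.f);
    for (const auto& scores : poseScores)
        for (size_t c = 0; c < scores.size(); ++c)
            pooled[c] += scores[c] / float(poseScores.size());
    int best = 0;
    for (size_t c = 1; c < pooled.size(); ++c)
        if (pooled[c] > pooled[best]) best = int(c);
    return best;
}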

In the first half of the table, we also list the prediction accuracy using OCNN [15] and FullVox under the same resolutions. The notation HCNN(32) in the table means that the highest resolution of the HCNN architecture is 32^3. The so-called FullVox refers to treating a 3D model as a fully voxelized bounding box, where a voxel houses either the corresponding normal vector if it intersects with the model's surface (as in HCNN or OCNN) or a zero vector otherwise. In theory, FullVox explicitly presents the original geometry of the model without missing any information, even for empty voxels. All the CNN operations like convolution and pooling are applied to both empty and non-empty voxels. This naïve discretization is not scalable and becomes prohibitive when the voxel resolution goes above 64^3. The reported performance of OCNN is based on the published executable at https://github.com/Microsoft/O-CNN. As mentioned above, we shape our HCNN architecture to exactly match the one used in OCNN to avoid any influences from different networks. We can see from the benchmarks that under moderate resolutions like 32^3 and 64^3, HCNN, OCNN and FullVox perform equally well, and employing the voting strategy improves the accuracy by another five percentage points on average. When the voxel resolution is further increased, overfitting may occur, as pointed out in [15], since there is not sufficient training data to fine-tune the network's parameters. As a result, the prediction accuracy slightly drops even with voting enabled.

The second half of Table I lists the classification accuracy of some other well-known techniques, including VoxNet [13], Geometry image [23], SubVolSup [6], FPNN [47], PointNet [26] and VRN [48]. We also noticed that the performance of OCNN in our experiment differs slightly from the one reported in the original OCNN paper. We suspect that this is because of different parameters used during the model rectification stage (i.e., the magnitude of the dilation).

Network architecture     32^3     64^3     128^3    256^3    512^3
HCNN                     25.2     73.1     217.3    794.3    2594.2
OCNN                     27.5     78.8     255.0    845.3    OOM
OCNN with neighbor       24.0     72.0     244.4    OOM      OOM
HCNN with neighbor       22.9     67.9     205.4    772.7    2555.5
FullVox                  39.7     269.0    OOM      OOM      OOM

TABLE II: Average forward-backward iteration time using HCNN, OCNN and FullVox (in milliseconds). For a fair comparison, we exclude the hard drive I/O time.

Compact hashing also improves the time performance of the networks. Table II reports the average forward-backward time in milliseconds. We can see that HCNN is consistently faster than OCNN under the same resolution, regardless of whether the neighbor information is pre-recorded, not to mention the FullVox scheme. For a fair comparison, the reported timing does not include the hard drive I/O latency.

5.5 Shape Retrieval

The next shape analysis task is shape retrieval. In this experiment, we use the ShapeNet Core55 dataset, which consists of 3D models from 55 categories; subcategory information associated with the models is ignored in this test. 70% of the data is used for training, 10% for validation, and the remaining 20% for testing. Data augmentation is performed by rotating each model about the upright direction to generate 12 poses, and orientation pooling is also used [6, 7]. The neural network produces a vector of category probability scores for an input query model, and the model is considered to belong to the category with the highest score. The retrieval set corresponding to an input query shape is a collection of models with the same category label, sorted according to the L2 distance between their feature vectors and the query shape's. Precision and recall are two widely used metrics: precision refers to the percentage of retrieved shapes that correctly match the category label of the query shape, and recall is the percentage of the shapes of the query category that have been retrieved. For a given query shape, as more instances are retrieved, the precision drops whenever a mislabeled instance is retrieved, while the recall goes up as more models of the query category are retrieved.
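For reference, the per-query precision/recall bookkeeping can be sketched as follows (an illustrative helper, not the SHREC evaluator):

#include <vector>

// rankedLabels: category labels of the retrieved shapes in ranked order;
// totalRelevant: number of database shapes sharing the query's category.
void precisionRecallCurve(const std::vector<int>& rankedLabels, int queryLabel, int totalRelevant,
                          std::vector<float>& precision, std::vector<float>& recall) {
    int hits = 0;
    precision.clear(); recall.clear();
    for (size_t i = 0; i < rankedLabels.size(); ++i) {
        if (rankedLabels[i] == queryLabel) ++hits;        // a correctly matched retrieval
        precision.push_back(float(hits) / float(i + 1));
        recall.push_back(totalRelevant > 0 ? float(hits) / float(totalRelevant) : 0.f);
    }
}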

Fig. 7: The precision-recall curves for HCNN and OCNN as well as five well-known multi-view CNN methods from SHREC16. The difference between HCNN and OCNN (under resolutions of 32^3 and 64^3) is quite subtle even after zooming in.

The comparative precision and recall curves are shown in Figure 7. Together with our HCNN under resolutions of 32^3 and 64^3, we also plot the curves for OCNN(32) and OCNN(64) as well as several widely known methods, including GIFT [5], Multi-view CNN [7], appearance-based feature extraction using a pre-trained CNN, and Channel-wise CNN [49]. The performance benchmarks of these latter methods are obtained using the published evaluator at https://shapenet.cs.stanford.edu/shrec16/. From the figure, we can see that 3D CNN methods like HCNN and OCNN outperform multi-view based methods, since the geometry information of the original models is much better encoded. The performances of HCNN and OCNN are very close to each other. After enlarging the curve segments associated with HCNN(32), HCNN(64), OCNN(32) and OCNN(64) within a narrow precision interval, one can see that OCNN(32) is slightly below (i.e., worse than) the other three.

Another interesting finding is that HCNN seems to be quite insensitive to the voxel resolution. As shown on the right, HCNN(32) already yields a very good result, and further increasing the resolution does not significantly improve the performance; the curves for HCNN(32) to HCNN(512) are hardly discernible. We believe this is reasonable, since identifying a high-level semantic label of an input 3D model does not, in general, require detailed local geometry information; even a rough shape contour may suffice. A similar conclusion can be drawn when evaluating the retrieval performance using other metrics, as reported in Table III. Here, in addition to precision and recall, we also compare the retrieval performance in terms of mAP, F-score and NDCG, where mAP is the mean average precision and F-score is the harmonic mean of precision and recall.