Structured 2D Representation of 3D Data for Shape Processing

03/25/2019 ∙ by Kripasindhu Sarkar, et al.

We represent 3D shapes by structured 2D representations of fixed length, making it feasible to apply well-investigated 2D convolutional neural networks (CNNs) to both discriminative and geometric tasks on 3D shapes. We first provide a general introduction to such structured descriptors, analyze their different forms, and show how a simple 2D CNN can be used to achieve good classification results. With a specialized classification network for images and our structured representation, we achieve a classification accuracy of 99.7% on the ModelNet40 test set, improving the previous state of the art by a large margin. We finally provide a novel framework for performing the geometric task of 3D segmentation using 2D CNNs and the structured representation, demonstrating the utility of such descriptors for both discriminative and geometric tasks.




1 Introduction

Convolutional Neural Networks (CNNs) have been a very powerful tool for solving problems in the 2D domain [14, 30, 9, 5, 6, 24, 16, 8, 23, 15, 20]. Because of the convolution operation (the convolutional layer), these networks require structured data as input, a requirement that 2D images satisfy seamlessly. However, because of the unstructured and unordered nature of 3D data, applying such learning methods in the 3D domain is not straightforward. Therefore, there have been various attempts to transcribe 3D data to a common parameterization, both in 3D and in 2D.




[Figure 1 plot. Y-axis: % Accuracy on ModelNet40; X-axis: methods in rough chronological order. Legend: [A] MVCNN [32]; [B] MVCNN-MultiR [22]; [C] VRN [1]; [D] Pointnet [21]; [E] Wang et al. [31]; [F] Kd-Net [13]; [G] MLH-MVCNN [26]; [H] RotationNet [12]; [I] GVCNN [4]; [J] MHBN [39]; [K] Ours.]

Figure 1: Evolution of classification methods and their performance on the ModelNet40 benchmark [36]. The X-axis shows a rough chronological order of the publications. Using our 10-layered Slice descriptors and a 2D CNN for classification, we achieve 99.7% accuracy on the ModelNet40 test set.

In the first strategy, 3D structured descriptors such as voxel occupancy provide a structured grid for the application of 3D convolutions. However, because the operation is expensive, the memory consumption of a 3D CNN grows cubically with the input resolution, limiting the existing networks to an input of size 32 [19, 36, 32, 1]. In the second strategy, which attempts to avoid the memory-expensive 3D convolutions, 3D shapes are converted to a 2D grid-like structure for the application of 2D CNNs [7, 3, 26, 37]. These existing methods design specific 2D descriptors suited to the problem at hand, which are often not transferable to other tasks. As a result, the structured 2D representation of 3D data is not standardized the way the voxel grid and 2D images are in the 3D and 2D domains respectively.

In this paper, we standardize the structured 2D representation of 3D data and provide its general description. Based on their properties, we categorize the representations into different forms and analyze them. We provide a general network structure for performing discriminative tasks on such representations and show how it can be easily integrated into any well-investigated 2D CNN designed for a specific task. We verify this by classifying 3D shapes with both vanilla and specialized 2D CNNs, using the general structured representation as input. Using a specialized 2D CNN for classification, we achieve state-of-the-art accuracy on the competitive ModelNet40 benchmark [36] with all forms of the representation. Our classification results are summarized in Figure 1.

Solving geometric tasks such as 3D segmentation using a 2D descriptor is challenging, because a significant amount of information is lost with the reduction of a dimension. In this paper we propose a novel architecture for performing the geometric task of 3D segmentation with height-map based representation as input. Therefore, we show the utility of the structured 2D descriptors in terms of both discriminative and geometric tasks.

The descriptors analyzed in this paper do not need a pre-estimation of the 3D mesh structure and topology, and can be computed directly on unordered point clouds. This is in contrast to shape-aware parameterizations that require knowledge of the 3D mesh topology, e.g. mesh quadrangulation, intrinsic shape description in the eigenspace of the mesh Laplacian, etc. [18, 27]. The structured 2D representation is suited for learning shape features from a large dataset, and is therefore ideal for the current data-driven approach of neural networks. As argued before, it is comparable to the standard voxel grid descriptor, but in 2D. It brings all the advantages of 2D CNNs, such as ease of performing convolution, initialization of network weights pretrained on millions of data samples, and memory efficiency, to the context of 3D data. It also makes it possible to use well-designed core 2D networks such as GoogLeNet [33] and ResNet [9] for shape analysis. Our contributions are the following:

  • We generalize structured 2D representations of 3D data and their usage with 2D CNN for discriminative tasks.

  • Using the structured 2D representations as input and a CNN framework for classification, we achieve 99.7% classification accuracy in the ModelNet40 [36] test set and improve the previous state-of-the-art by a large margin.

  • We then propose a novel network for performing geometric tasks on the 2D descriptors and solve the problem of 3D part segmentation and semantic segmentation.

With the publication, we will make different structured descriptors of ModelNet40 (from different orientations) and the implementation of our network public.

The following section describes the related literature. Section 3 describes the structured 2D representation and its different forms. Section 4 introduces general classification methods on the representation, followed by the details of the specialized network that jointly classifies and detects the orientation. We then introduce our novel framework for segmentation in Section 5. We finally present the results of our experiments in Section 6.

2 Related Work

3D voxel grid for 3D convolution

In these methods, a 3D shape is represented as binary occupancy in a 3D voxel grid, on which a 3D CNN is applied. Wu et al. [36] use a deep 3D CNN for voxelized shapes and provide the popular classification benchmark datasets ModelNet40 and ModelNet10. This work was quickly followed by network designs that take ideas from popular 2D CNNs, giving a big boost in performance over the baseline [19, 1]. [22, 28] design special CNNs optimized for the task of 3D classification. However, because of the fundamental problem of memory overhead associated with 3D networks, the input size was restricted to about $32 \times 32 \times 32$, making them the least accurate methods for both discriminative and geometric tasks. In contrast to the voxel grid, we use structured 2D descriptors with a 2D CNN and perform better in both classification and segmentation.

Rendered images as descriptor

These methods take virtual snapshots of the shape as input descriptors and perform the task of classification using a 2D CNN architecture. Their contributions are novel feature descriptors based on rendering [29, 10, 32] and specialized network designs for the purpose of classification [34, 12, 4]. The specialized CNN used for classification in this paper is inspired by [12], where classification and orientation estimation are performed jointly to increase the classification performance. Even though rendered images are by definition structured 2D descriptors, they do not provide any direct geometric information. For this reason they are not considered a part of the representation in this paper. With the same network architecture, all forms of our representation perform significantly better than the rendered images.

Figure 2: (Left) Different forms of structured 2D descriptors, from left to right: (a) Layered height-map (MLH), (b) Occupancy slices, (c) Occupancy volume. (Right) Visualization of the 3-layered MLH and Slice descriptors of a potted plant.

2D representations for 3D tasks

Detection: These methods project the point-cloud collected from 3D sensors such as LIDAR onto a plane and discretize it into a 2D grid, so that 2D convolution can be used for 3D object detection [3, 37, 17]. The projection on the ground plane, often referred to as the ‘Bird's Eye View’, is augmented with other information and finally fed to a network designed for 3D detection. Here the 3D data is assumed to be sparse along the Z direction, across which convolution is performed.

Classification: Gomez-Donoso et al. [7] represent a shape by ‘2D slices’, the binary occupancy information along the cross section of the shape at a fixed height. Sarkar et al. [26], on the other hand, represent a shape by height-maps at multiple layers from a 2D grid. Both of them combine descriptors from different views with an MVCNN-like [32] architecture for classification.

We standardize the structured 2D representation of 3D data with ideas spanning all these works. We provide a general description of such a representation, which can be used by any 2D CNN designed for image classification.

Specialized networks for 3D

Recently, there has been a serious effort to find alternative ways of applying CNNs to 3D data, such as OctNet [25] and PointNet [21]. OctNet uses a compact version of the voxel-based representation where only the occupied cells are stored in an octree instead of the entire voxel grid. PointNet [21] takes unstructured 3D points as input and obtains a global feature by using max pooling as a symmetric function on the output of a multi-layer perceptron applied to individual points. Our method is conceptually different, as it respects the actual spatial ordering of points in 3D space.

3 Structured 2D representation

In this section, we provide a general description of the structured 2D representation of 3D data. The ideas behind such descriptors are taken from the previous literature in 3D computer vision [7, 3, 26, 37]. We start by providing the notation that is followed throughout the paper.

Notation. Multidimensional arrays are denoted by bold letters (e.g. $\mathbf{X}$) and scalars are denoted by non-bold letters (e.g. $c$). For a given integer $N$, we define $[N] = \{1, \dots, N\}$. Let $\mathbf{X} \in \mathbb{R}^{N_1 \times N_2 \times N_3}$ be a multidimensional array. $\mathbf{X}(i, j, k)$ denotes the element of $\mathbf{X}$ at the index $(i, j, k)$. Using $:$ as an index means selecting all indices along that axis. Depending on the context, we also use the subscript notation (e.g. $\mathbf{X}_i$) to denote the sample number.


Let $\mathbf{P} \in \mathbb{R}^{3 \times n}$ be a 3D point-cloud comprising $n$ points, where the $i$-th column denotes an individual point. The point-cloud is obtained either directly from 3D scanners such as Kinect, Intel RealSense, LIDAR, etc., or is sampled from a shape. Unlike the pixel arrays of images, a point-cloud consists of points without any order, making it unordered and unstructured. This makes it difficult to use as input to standard machine learning algorithms such as Convolutional Neural Networks. Given such an unstructured point-cloud $\mathbf{P}$, we convert it into a structured 2D representation $\mathbf{X}$ of dimension $N \times N \times c$. This descriptor is both structured and of fixed length, which enables the application of standard machine learning algorithms.

3.1 Learning methods on structured input

Given data samples $\mathbf{X}_i \in \mathbb{R}^{N \times N \times c}$ from the structured representation, we search for a function $f$ mapping an input $\mathbf{X}_i$ to a task-specific output $y_i$. The function $f$ is often known as the ‘model’ or ‘score function’. Some examples of such models are the linear model, the multi-layer perceptron and the convolutional neural network (CNN). In the case of a CNN, 2D convolutional filters are applied across the first two spatial dimensions of the input (of size $N \times N$), while the third dimension is taken as the depth of the convolutional filter. Thus the structured 2D representation is encoded as $c$-channel (or $c$-layered) features of $N \times N$ spatial dimensions. We use the terms ‘channel’ and ‘layer’ interchangeably in this paper to denote the depth axis. Therefore, in this notation, the $l$-th layer of the feature is $\mathbf{X}(:, :, l)$.

3.2 Forms of structured 2D representations

As a first step of the 2D discretization of 3D shapes, a discrete 2D grid of spatial dimension $N \times N$ is placed on the 3D data. For each bin of the 2D grid, $c$ values are sampled according to a criterion. Inspired by the previous literature, we describe here different sampling strategies.

3.2.1 Layered height-maps

In this strategy, for each bin of the 2D grid, $c$ physical height values are sampled from all the points falling into the bin, i.e. $\mathbf{X}(i, j, l)$ represents the height, measured from the grid plane, of a sampled point falling into the bin $(i, j)$. The $c$ height values are chosen as percentile statistics to distribute the height information uniformly over all the layers, i.e. the $l$-th layer is computed as the $\frac{100\,(l-1)}{c-1}$-th percentile of the height values of the points falling in a bin, which samples the height values evenly w.r.t. the number of points for each layer (1st layer minimum height, $c$-th layer maximum height). An approximation of this strategy can be achieved by dividing the points equally into slices and computing a height value for each slice [3]. We use the percentile method when we compute this descriptor and refer to it as ‘multi-layered height-map (MLH)’ or ‘layered height-map’ in this paper.
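The percentile sampling described above can be sketched in a few lines of NumPy. This is an illustrative sketch under stated assumptions, not the authors' implementation: the function name `layered_height_map`, the choice of the XY plane as the grid plane, the per-axis bounding-box normalisation, and the fill value for empty bins are all hypothetical choices made here.

```python
import numpy as np

def layered_height_map(points, grid_size=8, num_layers=3, empty_value=0.0):
    """Compute an (N, N, c) layered height-map from a (3, n) point cloud.

    Assumption: the grid lies in the XY plane and heights are measured along Z.
    """
    xyz = points.T  # (n, 3)
    # Map X and Y into bin indices 0 .. grid_size - 1 using the bounding box.
    mins = xyz[:, :2].min(axis=0)
    extent = np.maximum(xyz[:, :2].max(axis=0) - mins, 1e-12)
    ij = np.floor((xyz[:, :2] - mins) / extent * grid_size).astype(int)
    ij = np.clip(ij, 0, grid_size - 1)

    # Evenly spaced percentile levels: first layer = minimum, last = maximum.
    levels = np.linspace(0.0, 100.0, num_layers)
    mlh = np.full((grid_size, grid_size, num_layers), empty_value)
    flat = ij[:, 0] * grid_size + ij[:, 1]
    for b in np.unique(flat):
        heights = xyz[flat == b, 2]
        mlh[b // grid_size, b % grid_size] = np.percentile(heights, levels)
    return mlh
```

With `num_layers=1` the single level is the 0th percentile, i.e. the minimum height in each bin. Empty bins keep `empty_value`; how empty bins should be encoded is not specified in the text and is left as a parameter here.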

3.2.2 Occupancy slices

In this strategy, occupancy slices are computed at predetermined height values. That is, for each bin of the 2D grid, slices of a specific thickness $t$ are made at fixed height values across the first two dimensions. The binary occupancy information of the points in the voxel formed by the 2D bin (the first two dimensions) and the slice (the third dimension) is taken as the feature. The slices are placed at heights equidistant from each other to cover the shape cross-section evenly. We refer to this descriptor as ‘Slice’ or ‘Occupancy slices’ in this paper.
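A minimal sketch of the slice computation follows, assuming the cloud is normalised to the unit cube and that the slice centres sit at the midpoints of equal height intervals. The function name `occupancy_slices`, the normalisation, and the centre placement are assumptions made for illustration; the text only states that slices are equidistant and have a fixed thickness.

```python
import numpy as np

def occupancy_slices(points, grid_size=8, num_layers=3, thickness=0.05):
    """Compute an (N, N, c) binary slice descriptor from a (3, n) point cloud."""
    xyz = points.T
    mins = xyz.min(axis=0)
    extent = np.maximum(xyz.max(axis=0) - mins, 1e-12)
    unit = (xyz - mins) / extent  # every coordinate now lies in [0, 1]
    ij = np.floor(unit[:, :2] * grid_size).astype(int)
    ij = np.clip(ij, 0, grid_size - 1)

    # Equidistant slice centres along the height axis (assumed placement).
    centres = (np.arange(num_layers) + 0.5) / num_layers
    slices = np.zeros((grid_size, grid_size, num_layers), dtype=np.uint8)
    for k, centre in enumerate(centres):
        # A point contributes to slice k if it lies within half a thickness.
        inside = np.abs(unit[:, 2] - centre) <= thickness / 2.0
        slices[ij[inside, 0], ij[inside, 1], k] = 1
    return slices
```

Because every entry is a single bit, this descriptor can be stored far more compactly than floating-point height values.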

3.2.3 Binary occupancy volume

In this strategy, the entire $N \times N \times c$ tensor represents the binary occupancy information of the 3D data. That is, a physical dimension $d \times d \times d$ is defined for the scene and a 3D grid of shape $N \times N \times c$ is placed on the data. The points inside this cubical space are then discretized with a resolution of $\frac{d}{N} \times \frac{d}{N} \times \frac{d}{c}$ as occupancy information. Note that this is the same as the traditional voxel occupancy descriptor. The important differences are that a) the resolution along the depth dimension is often different from that of the spatial dimensions, and b) the 2D convolution operation is applied along the spatial dimensions instead of an expensive 3D convolution along all three axes.

We point out that the binary occupancy volume descriptor reduces to occupancy slices when the slice thickness equals the depth resolution, $t = d/c$, and the slices are taken at even intervals along the depth dimension. Because of their similarity, we choose the occupancy slice descriptor to represent both categories. Figure 2 provides a visualization of the different forms of the aforementioned descriptors.
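For comparison, an occupancy volume with a coarse depth axis can be sketched by direct binning. As before, this is a hedged illustration: `occupancy_volume` is a hypothetical name, and normalising each axis by the bounding box (rather than by a fixed physical scene size) is an assumption made to keep the example short.

```python
import numpy as np

def occupancy_volume(points, grid_size=8, num_layers=4):
    """Binary (N, N, c) occupancy tensor of a (3, n) point cloud."""
    xyz = points.T
    mins = xyz.min(axis=0)
    extent = np.maximum(xyz.max(axis=0) - mins, 1e-12)
    unit = (xyz - mins) / extent
    shape = np.array([grid_size, grid_size, num_layers])
    # The depth axis uses num_layers cells, usually far fewer than grid_size.
    idx = np.clip(np.floor(unit * shape).astype(int), 0, shape - 1)
    volume = np.zeros(tuple(shape), dtype=np.uint8)
    volume[idx[:, 0], idx[:, 1], idx[:, 2]] = 1
    return volume
```

The result has the same layout as the slice descriptor, so a 2D CNN can treat the depth axis as input channels in both cases.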

3.3 Orientation views

As explained before, the structured 2D representation requires the placement of a 2D grid on the point cloud as a first step. Therefore, the feature descriptors depend on the orientation of the grid, where the orientation is defined as the direction of the plane of the grid. A good system should be able to merge the descriptors of an instance coming from different orientations. This can be treated as analogous to the ‘camera views’ used for generating rendered images of 3D shapes as feature descriptors [29, 10, 32, 34, 12]. We use the geometric structured representation and exploit the careful design choices of the image-based networks for merging orientations to perform the task of 3D classification.

4 Classification

4.1 Vanilla classification system from a single orientation

In the simple case, structured 2D representations of mean-centered and normalized shapes can be computed w.r.t. a canonical axis as the orientation, and a classification-based learning method can be applied as described in Section 3.1. Here each descriptor $\mathbf{X}_i$ is assigned a class label $y_i \in \{1, \dots, K\}$, where $K$ is the total number of classes. The network $f$, with a normalization such as softmax at the last layer, outputs a probability estimate $\mathbf{q}_i \in \mathbb{R}^{K}$ over the classes. The true distribution $\mathbf{p}_i$ is then the one-hot encoding of the class, i.e. its $k$-th element is

$$\mathbf{p}_i(k) = \begin{cases} 1 & \text{if } k = y_i, \\ 0 & \text{otherwise.} \end{cases} \tag{1}$$

The dissimilarity between the two distributions is measured by a loss function $\mathcal{L}(\mathbf{p}_i, \mathbf{q}_i)$. A commonly used function for classification is the cross-entropy between the two distributions. It is given by

$$H(\mathbf{p}_i, \mathbf{q}_i) = -\sum_{k=1}^{K} \mathbf{p}_i(k) \log \mathbf{q}_i(k). \tag{2}$$

Therefore, we can use a simple feed-forward 2D CNN such as VGG or ResNet, whose parameters are optimized with a classification loss. This enables classification from descriptors computed from a single orientation. To increase the classification performance, we can either perform orientation augmentation (or use an ensemble of classifiers) or perform a fine-grained analysis of orientation clusters, inspired by the networks that use 2D images. The main idea is to obtain different descriptors of a 3D model instance from different orientations and fuse them carefully for the final classification. The following subsection describes such a method in detail.
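As a small worked example of the classification loss, the following sketch computes a softmax output, a one-hot target and their cross-entropy. The helper names are illustrative, and class labels are 0-indexed here as is usual in Python.

```python
import numpy as np

def softmax(scores):
    """Normalise raw class scores into a probability distribution."""
    shifted = scores - scores.max()  # subtract the max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()

def one_hot(label, num_classes):
    """True distribution: 1 at the class index, 0 elsewhere."""
    p = np.zeros(num_classes)
    p[label] = 1.0
    return p

def cross_entropy(p, q, eps=1e-12):
    """Cross-entropy between a true distribution p and a prediction q."""
    return float(-np.sum(p * np.log(q + eps)))
```

For a one-hot target, the cross-entropy reduces to the negative log-probability that the network assigns to the true class.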

Figure 3: (Left) Summary of RotationNet. See text for more details. (Right) Orientation setup for the network - we compute feature descriptors at 12 orientations by rotating the grid around the Z axis.

4.2 RotationNet for 3D descriptors

Inspired by the network in [12], we jointly perform classification and orientation estimation for better aggregation of feature descriptors from different orientations. It is shown that when synthetic images of 3D models are used, the classification performance improves by joint estimation of orientation and categories [28].

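For each instance of the 3D shape we compute $M$ different descriptors $\{\mathbf{X}_i\}_{i=1}^{M}$ from $M$ pre-defined orientations. An orientation variable $o_i \in \{1, \dots, M\}$ is assigned to $\mathbf{X}_i$ such that $o_i = j$ when $\mathbf{X}_i$ is computed from the $j$-th orientation. A new class ‘incorrect view’ with class id $K+1$ is introduced in addition to the $K$ different classes for classification. The model consists of a base multi-layered network followed by the concatenation of $M$ softmax layers, which output the class probabilities $\mathbf{q}_i \in \mathbb{R}^{M \times (K+1)}$ w.r.t. the different views. For the input $\mathbf{X}_i$ with class label $y_i$, the true distribution $\mathbf{p}_i$ is defined such that its $(j, k)$-th element is given by

$$\mathbf{p}_i(j, k) = \begin{cases} 1 & \text{if } (j = o_i \text{ and } k = y_i) \text{ or } (j \neq o_i \text{ and } k = K+1), \\ 0 & \text{otherwise.} \end{cases} \tag{3}$$

The cross-entropy between the two distributions is then

$$H(\mathbf{p}_i, \mathbf{q}_i) = -\sum_{j=1}^{M} \sum_{k=1}^{K+1} \mathbf{p}_i(j, k) \log \mathbf{q}_i(j, k). \tag{4}$$

In order to learn the orientations together with the categories, we optimize the summation

$$\sum_{i=1}^{M} H(\mathbf{p}_i, \mathbf{q}_i) \tag{5}$$

by iterating the optimization process over the network parameters and the orientation variables $\{o_i\}_{i=1}^{M}$. Note that minimizing Equation 5 over the orientations reduces to minimizing it over all the candidate orientations $j \in \{1, \dots, M\}$.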
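The orientation update can be illustrated for a single descriptor as follows. This is a simplified sketch, not the authors' training code: it assumes the network output is an array with one softmax row per candidate orientation and the ‘incorrect view’ class in the last column, and it picks the orientation of each descriptor independently, ignoring any joint constraint across the descriptors of one instance. The function names are hypothetical and indices are 0-based.

```python
import numpy as np

def rotation_loss(q, orientation, label):
    """Cross-entropy of one descriptor for a given orientation assignment.

    q: (M, K + 1) array; row j is the softmax output for orientation j and
       the last column is the 'incorrect view' class.
    """
    num_orientations = q.shape[0]
    incorrect = q.shape[1] - 1
    log_q = np.log(q + 1e-12)
    others = np.arange(num_orientations) != orientation
    # True class at the assigned orientation, 'incorrect view' everywhere else.
    return float(-(log_q[orientation, label] + log_q[others, incorrect].sum()))

def best_orientation(q, label):
    """Pick the candidate orientation with the lowest loss."""
    losses = [rotation_loss(q, o, label) for o in range(q.shape[0])]
    return int(np.argmin(losses))
```

In an alternating scheme, `best_orientation` would be evaluated with the network fixed, after which the network parameters are updated with the selected orientations fixed.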

Candidate orientations for structured 2D representation

The choice of the candidate orientations is arbitrary, but it needs to be fixed for the entire training and testing. We follow the previous literature and assume that the input shapes are upright oriented along the Z axis [32, 22, 26, 12, 29] in order to derive a consistent set of candidate orientations. We mean-center the data and place the reference grid through the origin, pointing along the X axis (side view). We then create 12 orientations for computing the descriptors by rotating the grid around the Z axis in increments of 30 degrees. The setup is shown in Figure 3 (Right).
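The 12 candidate orientations can be generated as rotation matrices about the Z axis. In this sketch the cloud is rotated while the grid stays fixed, which is equivalent to rotating the grid up to the direction of rotation; the function names are illustrative assumptions.

```python
import numpy as np

def z_rotation(angle_deg):
    """3x3 rotation matrix about the Z axis."""
    a = np.deg2rad(angle_deg)
    return np.array([[np.cos(a), -np.sin(a), 0.0],
                     [np.sin(a),  np.cos(a), 0.0],
                     [0.0,        0.0,       1.0]])

def candidate_orientations(num_views=12):
    """Rotations in equal angular steps (30 degrees for 12 views)."""
    step = 360.0 / num_views
    return [z_rotation(i * step) for i in range(num_views)]

def oriented_clouds(points):
    """Mean-center a (3, n) cloud and return one rotated copy per orientation."""
    centred = points - points.mean(axis=1, keepdims=True)
    return [rot @ centred for rot in candidate_orientations()]
```

Each rotated copy can then be passed to the descriptor computation with a fixed side-view grid.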

5 Segmentation using layered height-map

Unlike rendered images, structured 2D representation is geometric and therefore can be used for solving geometric tasks. Because of the fact that occupancy slices (Section 3.2.2) and binary occupancy volume (Section 3.2.3) are occupancy based descriptors like voxel grids, 3D geometric tasks like segmentation can be solved in a similar way, as it is done for voxel grids using a 3D CNN. Performing such tasks using a 2D CNN is not straightforward.

In contrast, multi-layered height-map (Section 3.2.1) descriptors have rich information along the direction perpendicular to the plane of the grid as it directly stores the height information - making them a good choice for performing geometric tasks. In this section we introduce our novel method, that uses 2D multi-layered height-map as input and solves the problem of point-cloud segmentation.

5.1 Single view network

Input representation

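In point-cloud segmentation each point carries a segmentation label $s \in \{1, \dots, K\}$, where $K$ is the total number of class labels. We convert such a labeled point-cloud to a 2D MLH descriptor $\mathbf{X} \in \mathbb{R}^{N \times N \times c}$, where $\mathbf{X}(i, j, l)$ denotes the displacement value from the grid. We also prepare a 2D grid $\mathbf{Y}$ of the same dimension for the labels, such that $\mathbf{Y}(i, j, l)$ has the same label as the point that was considered for the displacement value at $\mathbf{X}(i, j, l)$.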

Network formulation

Given a set of input descriptors in the form of $\mathbf{X}$ and the label maps in the form of $\mathbf{Y}$, we solve a 2D segmentation task for each layer. That is, we solve $c$ segmentation tasks using a 2D CNN with $\mathbf{X}$ as input and $\mathbf{Y}(:, :, l)$ as output labels, where $l \in \{1, \dots, c\}$.

The 2D segmentation network has two parts: 1) a base network and 2) $c$ classification heads responsible for the final pixelwise classification for all the $c$ layers. In our implementation we choose the final 3 fully convolutional layers of the segmentation network as the classification head. The final loss is the summation of the losses coming from all the classification heads. We finally backpropagate through the entire network with the gradient computed from the final loss. The network architecture is summarized in Figure 4.


Note that, the 2D CNN for segmentation can be chosen to be any of the well investigated architectures such as FCN, UNet, PSPNet, etc [16, 40, 2]. The only difference is, that we have different final layers for classification after the base network.
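The summed loss over the classification heads can be sketched as below, with one pixelwise cross-entropy per descriptor layer. The array shapes and function names are assumptions for illustration; a real implementation would use the loss functions of a deep learning framework so that gradients are available.

```python
import numpy as np

def pixelwise_cross_entropy(logits, labels):
    """Mean cross-entropy over pixels.

    logits: (H, W, K) raw class scores, labels: (H, W) integer class ids.
    """
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    rows, cols = np.indices(labels.shape)
    return float(-log_probs[rows, cols, labels].mean())

def total_loss(head_logits, label_map):
    """Sum the losses of all heads.

    head_logits: list of c arrays of shape (H, W, K), one per head.
    label_map:   (H, W, c) integer label map, one layer per head.
    """
    return sum(pixelwise_cross_entropy(logits, label_map[:, :, l])
               for l, logits in enumerate(head_logits))
```

Each head sees the same base features but is supervised by its own layer of the label map, so the heads differ only in their targets.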

Figure 4: Our segmentation network for layered height-map descriptors. For each orientation of the descriptor, we solve a 2D segmentation problem with a base CNN and different fully convolutional classification heads.

3D segmentation

At test time we compute the $c$-layer MLH descriptor $\mathbf{X}$ of the input cloud and pass it through the network to get the $c$-layer segmentation mask $\hat{\mathbf{Y}}$. We are then required to transfer this 2D segmentation information back to the original point-cloud. We first convert the descriptor to a point cloud $\mathbf{P}' = \{(i\, s_x,\; j\, s_y,\; \mathbf{X}(i, j, l))\}$, where $s_x$ and $s_y$ are the sizes of a bin of the 2D grid in the X and Y directions. We assign each point of the computed cloud a label based on the label map at that location, i.e. $\hat{\mathbf{Y}}(i, j, l)$. We then place a voxel grid on this cloud and assign each voxel bin the mode of all the point labels inside that bin. We finally use this labeled voxel grid $\mathbf{V}$ to compute the point labels of the input cloud, completing the segmentation.
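The back-projection steps above can be sketched as three small functions. This is a hedged illustration under several assumptions: point clouds are stored as (n, 3) arrays here, empty bins are marked by a hypothetical `invalid` label, the voxel grid is a dictionary keyed by integer voxel indices, and input points falling into empty voxels are returned as unlabeled.

```python
import numpy as np
from collections import Counter, defaultdict

def descriptor_to_cloud(mlh, labels, bin_x, bin_y, invalid=-1):
    """Turn an (N, N, c) height-map and label map into labelled 3D points."""
    i, j, k = np.nonzero(labels != invalid)
    cloud = np.stack([i * bin_x, j * bin_y, mlh[i, j, k]], axis=1)
    return cloud, labels[i, j, k]

def voxel_label_grid(cloud, cloud_labels, voxel_size):
    """Map each occupied voxel index to the mode of the labels inside it."""
    votes = defaultdict(Counter)
    keys = map(tuple, np.floor(cloud / voxel_size).astype(int))
    for key, lab in zip(keys, cloud_labels):
        votes[key][int(lab)] += 1
    return {key: counts.most_common(1)[0][0] for key, counts in votes.items()}

def label_points(points, grid, voxel_size, unlabeled=-1):
    """Look up the voxel label for every (n, 3) input point."""
    keys = map(tuple, np.floor(points / voxel_size).astype(int))
    return np.array([grid.get(key, unlabeled) for key in keys])
```

The sketch assumes that the input cloud and the reconstructed cloud share one coordinate frame; in practice the grid placement and orientation used for the descriptor have to be undone first.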

5.2 Multi-view architecture

The MLH descriptors are lossy in nature. Surfaces that are approximately perpendicular to the 2D grid of the descriptors are represented by only $c$ values per bin, which results in a large loss of geometric information. This causes large empty areas in the computed label voxel grid, which lead to unlabeled points in the input cloud. To alleviate this issue, we compute the segmentation labels for 3 different orientations corresponding to the 3 axes of the input shape. We use the intuition that a surface cannot be perpendicular to all 3 axes at the same time.

Similar to the classification problem, we compute the layered descriptor and the corresponding label map for 3 different orientations. We then use 3 different CNNs and update them iteratively during training in each batch. At test time, we combine the point labels of all the views, i.e. $\mathbf{P}' = \mathbf{P}'_1 \cup \mathbf{P}'_2 \cup \mathbf{P}'_3$, where $\mathbf{P}'_v$ is the computed labeled cloud from the $v$-th orientation. We finally compute the labeled voxel grid for segmentation on the combined cloud $\mathbf{P}'$.

Number of layers and design choices

Since in realistic segmentation scenarios the point-cloud is sparse, we found $c = 2$ layers to work well. As a result, the MLH features capture the shape outline from the outside, which is often the case with real-world point-clouds.

6 Experiments

In this section we provide our experimental evaluation of the structured 2D descriptors for the task of classification and segmentation.

Descriptor       Accuracy (%)
MLH 5 layer      91.05
Slice 5 layer    88.21

Table 1: Classification accuracy on ModelNet40 with a single orientation.

6.1 3D Shape Classification

Dataset and general settings

We use the popular ModelNet40 dataset [36] for evaluating our shape classification results. It contains approximately 12k shapes from 40 categories and comes with its own training and testing splits of around 10k and 2.5k shapes. We first densely sample points from each shape and compute multi-layered height-map (MLH) and occupancy slice feature descriptors of dimension $N \times N \times c$. As found by the experiments (Section 6.1.3), we use $c = 5$ for most of the cases. We use ImageNet-pretrained weights for initialization in all our experiments to take advantage of 2D CNNs trained on millions of samples. In the first convolutional layer, where the depth of the filter does not match ($c$ instead of 3), we repeat the weights channel-wise.
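One way to repeat pretrained first-layer weights channel-wise is to tile the three RGB filter channels cyclically until the required depth is reached. The text does not specify the exact repetition scheme, so the cyclic order below, the weight layout (out_channels, in_channels, kh, kw) and the function name `adapt_first_conv` are assumptions.

```python
import numpy as np

def adapt_first_conv(weights, num_channels):
    """Expand a (out_channels, 3, kh, kw) filter bank to num_channels inputs."""
    repeats = int(np.ceil(num_channels / weights.shape[1]))
    # Tile along the input-channel axis: R, G, B, R, G, B, ...
    tiled = np.tile(weights, (1, repeats, 1, 1))
    return tiled[:, :num_channels]
```

For a 5-layer descriptor this gives input channels initialised from R, G, B, R, G. All other layers of the pretrained network can be loaded unchanged.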

6.1.1 Vanilla CNN for classification

We solve the problem of classification with a single vanilla CNN and a cross-entropy loss, using the descriptors from a single orientation. We use the simple VGG network (VGG16 with batch normalization) [30] as the base network for this task and show our results in Table 1. As seen, even with a single orientation and without any data augmentation or fine-grained analysis, our result with MLH descriptors is better than many multi-view baselines, including MVCNN [32]. This shows the strength and advantage of using structured 2D descriptors and a 2D CNN for 3D classification.

6.1.2 RotationNet for 3D descriptors

In this section we present our classification results from the network described in Section 4.2. As discussed, for each mean-centered instance we compute 12 descriptors from the grid rotated around the Z axis. Using VGG as our base network, we iteratively optimize the network and the orientation candidates based on Equation 5 and train until convergence. We use standard SGD with an initial learning rate of 0.01 and a momentum of 0.9.

We use both MLH and Slice descriptors with different numbers of layers in this experiment and present our results in Table 2. As seen, our RotationNet with both MLH and Slice descriptors improves the state of the art by a large margin. Note that the occupancy slice descriptor (for the 5-layered feature) is around 1/10th of the size of MLH. Even with this much smaller amount of information, it achieves results comparable to MLH. This shows the superiority of the slice descriptor on classification tasks.

Method                          Accuracy (%)   # Views / test instance
PANORAMA NN [29]                91.12          1
Dominant Set Clustering [31]    93.80          12
Pairwise [10]                   90.70          12
MVCNN 12-v [32]                 89.90          12
MVCNN 80-v [32]                 90.10          80
Kd-Net depth 10 [13]            90.60          10
Kd-Net depth 15 [13]            91.80          10
MVCNN-MultiRes [22]             91.40          20
VRN [1]                         91.33          24
Voxception [1]                  90.56          24
Pointnet [21]                   89.20          1
VoxNet [19]                     83.00          12
MLH-MV [26]                     93.11          3
RotNet-orig 12-v [12]           90.08          12
RotNet-orig 20-v [12]           97.37          20
GVCNN [4]                       93.10          8
MHBN [39]                       94.70          6
VGG + MLH 1-v                   91.05          1
Our RotNet + MLH-5L             99.56          12
Our RotNet + Slice-5L           99.51          12
Our RotNet + Slice-10L          99.76          12

Table 2: Comparison of classification accuracy on the ModelNet40 test split.

Figure 5: Effect of the number of layers on the ModelNet40 performance of the MLH and Slice descriptors.

6.1.3 Analysis of structured 2D representation

We perform experiments to analyze the optimal number of channels/layers for the structured 2D representations. We keep the same experimental setup of 12 view RotationNet, but change the number of layers for both MLH and Slice descriptors. For 1 layer descriptors, we took the minimum height for MLH, and slice occupancy at the mid-section of the shape for Slice descriptor. The result is shown in Figure 5.

We find that the single-layered structured descriptors perform extremely poorly in comparison to their multi-layered versions. The result improves quickly with the addition of more layers and saturates after 5 layers at an accuracy of around 99.5%. Adding more layers to MLH does not increase the accuracy further. In fact, at around 10 layers we see a drop in the accuracy of MLH. Note that, because it stores floating-point height values, the 5-layer MLH descriptor already consumes a significant amount of memory. With occupancy slices the accuracy increases up to 99.7% with 10 layers and does not improve further.

6.2 3D Segmentation

6.2.1 Experimental setup

Dataset and feature computation

We use the 3D part segmentation dataset ShapeNet parts, provided by [38], to evaluate our segmentation method. This dataset is a subset of ShapeNetCore and its segmentation labels are verified manually by experts. It also comes with its own training and testing split. We use the 4 categories with the most training instances, namely Table, Chair, Airplane, and Lamp, for our evaluation, because the other categories do not have enough instances for training a deep neural network, and in order to compare with [35], whose results are only available for these categories. Note that we handle each category separately with a different network. It can be argued that our performance could improve further (and the lack of training data for the other categories be resolved) if we trained our network across different categories with the category information as an external input to the final classification stage. Our intention here is not to push for the best result, but to validate our framework for 3D point segmentation with 2D descriptors.

Figure 6: Qualitative result of our 3D segmentation on the ShapeNet-parts test split.

We use our multi-view architecture for segmentation, and therefore compute features from the three axes as orientations (Section 5.2). Because of the low point density of the shapes in the segmentation dataset (around 3000 points), we compute low-resolution MLH features and label maps with $c = 2$ layers. We find that 2 layers are sufficient for the semantic information of the shape, as they capture the shape outline from the outside. At empty regions in the label map, we put an additional ‘invalid’ label. Therefore, during the optimization we classify into $K + 1$ classes, where $K$ is the original number of classes.

2D segmentation network

As discussed, our segmentation framework is not tied to any particular type of 2D segmentation networks. We use the simple model of Fully Convolutional Network (FCN8s) [16] as our base CNN. For the classification heads we choose the final 3 convolutional layers of the network. We added the loss coming from the two classification heads and optimized the network through the backpropagation based on this resultant loss (Figure 4). For training, we use the pretrained VGG weights for initialization and a standard SGD optimizer with momentum with the initial learning rate as 1.0e-10. We perform segmentation of each category by a different network and show our qualitative result in Figure 6. In Figure 7 we show the 2D segmentation result of different layers and views of a Table.

6.2.2 Analysis

We use the point IoU between the ground truth and predicted labels as our evaluation metric. The mean IoU of a category is taken as the mean IoU of all instances of the category. We then compare our result with the method of Wu et al. [35], PointNet [21], and a baseline 3D CNN for segmentation used in [21], and present our results in Table 3. Our result is quantitatively comparable to the existing state-of-the-art methods.
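The evaluation metric can be sketched as follows. The per-instance definition used here, averaging the IoU over part labels and counting a part that is absent from both prediction and ground truth as a perfect match, is a common convention and an assumption of this sketch; the text only states that the category score is the mean over instances.

```python
import numpy as np

def instance_iou(pred, truth, num_parts):
    """Point IoU of one shape, averaged over its part labels."""
    ious = []
    for part in range(num_parts):
        union = np.sum((pred == part) | (truth == part))
        if union == 0:
            ious.append(1.0)  # part absent in both: counted as perfect
        else:
            inter = np.sum((pred == part) & (truth == part))
            ious.append(inter / union)
    return float(np.mean(ious))

def category_miou(preds, truths, num_parts):
    """Mean IoU of a category: the mean over all of its instances."""
    return float(np.mean([instance_iou(p, t, num_parts)
                          for p, t in zip(preds, truths)]))
```

For example, with predictions `[0, 0, 1, 1]` and ground truth `[0, 1, 1, 1]`, part 0 has IoU 1/2 and part 1 has IoU 2/3, giving an instance IoU of 7/12.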

Our aim here is to provide an experimental verification of our segmentation framework instead of network engineering for the best result. The result can be improved by the following:

  1. Use of the latest 2D segmentation networks such as PSPNet [40], DeepLab [2], etc. instead of the basic FCN to improve the 2D segmentation result.

  2. Use of the 3D meshes of ShapeNet-parts [11] to get a denser point-cloud and increase the input resolution of the MLH features (without introducing a large amount of empty regions).

| Method | Table | Chair | Airplane | Lamp |
|---|---|---|---|---|
| Wu et al. [35] | 74.8 | 73.5 | 63.2 | 74.4 |
| PointNet [21] | 80.6 | 89.6 | 83.4 | 80.8 |
| 3D CNN [21] | 77.1 | 87.2 | 75.1 | 74.4 |
| Ours | 79.2 | 82.0 | 72.1 | 74.5 |

Table 3: Quantitative result of 3D Segmentation on ShapeNet-part in terms of mean point IoU %.

Figure 7: Example of segmented (Left) and ground truth (Right) label-map of a Table. Layers are shown in different rows, and the orientations are shown in different columns.

7 Conclusion

In this paper we provided a general introduction to structured 2D representations of 3D data and analyzed their various forms. We showed how a simple vanilla CNN like VGG can be used with such 2D descriptors to achieve good classification results. We then used a specialized 2D CNN to aggregate feature descriptors from different orientations and achieved 99.7% accuracy on the ModelNet40 test set, improving the previous state-of-the-art accuracy by a large margin. Finally, we provided our novel framework for performing the geometric task of 3D segmentation using 2D networks and structured 2D representations. We thus provided a general framework in which 2D descriptors solve both discriminative and geometric tasks. We conclude that the occupancy-slice descriptor is an excellent choice for shape classification because of its compactness, whereas the multi-layered height-map is a good choice for 2D segmentation because of its richness of information. We hope this work will make structured 2D representations more standard for 3D data processing in the future, similar to the voxel occupancy grid.


References

  • [1] A. Brock, T. Lim, J. M. Ritchie, and N. Weston. Generative and discriminative voxel modeling with convolutional neural networks. CoRR, abs/1608.04236, 2016.
  • [2] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. arXiv:1606.00915, 2016.
  • [3] X. Chen, H. Ma, J. Wan, B. Li, and T. Xia. Multi-view 3d object detection network for autonomous driving. In IEEE CVPR, 2017.
  • [4] Y. Feng, Z. Zhang, X. Zhao, R. Ji, and Y. Gao. Gvcnn: Group-view convolutional neural networks for 3d shape recognition. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
  • [5] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014.
  • [6] R. B. Girshick. Fast R-CNN. CoRR, abs/1504.08083, 2015.
  • [7] F. Gomez-Donoso, A. Garcia-Garcia, J. Garcia-Rodriguez, S. Orts-Escolano, and M. Cazorla. Lonchanet: A sliced-based cnn architecture for real-time 3d object recognition. In 2017 International Joint Conference on Neural Networks (IJCNN), pages 412–418, May 2017.
  • [8] K. He, G. Gkioxari, P. Dollár, and R. Girshick. Mask r-cnn. In Computer Vision (ICCV), 2017 IEEE International Conference on, pages 2980–2988. IEEE, 2017.
  • [9] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. CoRR, abs/1512.03385, 2015.
  • [10] E. Johns, S. Leutenegger, and A. J. Davison. Pairwise decomposition of image sequences for active multi-view recognition. In Computer Vision and Pattern Recognition (CVPR), 2016 IEEE Conference on, pages 3813–3822. IEEE, 2016.
  • [11] E. Kalogerakis, M. Averkiou, S. Maji, and S. Chaudhuri. 3D shape segmentation with projective convolutional networks. In Proc. IEEE Computer Vision and Pattern Recognition (CVPR), 2017.
  • [12] A. Kanezaki, Y. Matsushita, and Y. Nishida. Rotationnet: Joint object categorization and pose estimation using multiviews from unsupervised viewpoints. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5010–5019, June 2018.
  • [13] R. Klokov and V. Lempitsky. Escape from cells: Deep kd-networks for the recognition of 3d point cloud models. In 2017 IEEE International Conference on Computer Vision (ICCV), pages 863–872. IEEE, 2017.
  • [14] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 25, pages 1097–1105. Curran Associates, Inc., 2012.
  • [15] C. Ledig, L. Theis, F. Huszar, J. Caballero, A. P. Aitken, A. Tejani, J. Totz, Z. Wang, and W. Shi. Photo-realistic single image super-resolution using a generative adversarial network. CoRR, abs/1609.04802, 2016.
  • [16] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3431–3440, 2015.
  • [17] W. Luo, B. Yang, and R. Urtasun. Fast and furious: Real time end-to-end 3d detection, tracking and motion forecasting with a single convolutional net. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3569–3577, 2018.
  • [18] J. Masci, D. Boscaini, M. M. Bronstein, and P. Vandergheynst. Geodesic convolutional neural networks on riemannian manifolds. In IEEE Workshop on 3D Representation and Recognition (3DRR), pages 37–45, 2015.
  • [19] D. Maturana and S. Scherer. VoxNet: A 3D Convolutional Neural Network for Real-Time Object Recognition. In IROS, 2015.
  • [20] D. Pathak, P. Krähenbühl, J. Donahue, T. Darrell, and A. Efros. Context encoders: Feature learning by inpainting. 2016.
  • [21] C. R. Qi, H. Su, K. Mo, and L. J. Guibas. Pointnet: Deep learning on point sets for 3d classification and segmentation. arXiv preprint arXiv:1612.00593, 2016.
  • [22] C. R. Qi, H. Su, M. Nießner, A. Dai, M. Yan, and L. J. Guibas. Volumetric and multi-view cnns for object classification on 3d data. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5648–5656, 2016.
  • [23] A. Radford, L. Metz, and S. Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.
  • [24] S. Ren, K. He, R. B. Girshick, and J. Sun. Faster R-CNN: towards real-time object detection with region proposal networks. CoRR, abs/1506.01497, 2015.
  • [25] G. Riegler, A. O. Ulusoy, and A. Geiger. Octnet: Learning deep 3d representations at high resolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.
  • [26] K. Sarkar, B. Hampiholi, K. Varanasi, and D. Stricker. Learning 3d shapes as multi-layered height-maps using 2d convolutional networks. In The European Conference on Computer Vision (ECCV), September 2018.
  • [27] K. Sarkar, K. Varanasi, and D. Stricker. 3d shape processing by convolutional denoising autoencoders on local patches. In 2018 IEEE Winter Conference on Applications of Computer Vision, WACV 2018, Lake Tahoe, NV, USA, March 12-15, 2018, pages 1925–1934, 2018.
  • [28] N. Sedaghat, M. Zolfaghari, E. Amiri, and T. Brox. Orientation-boosted voxel nets for 3d object recognition. arXiv preprint arXiv:1604.03351, 2016.
  • [29] K. Sfikas, T. Theoharis, and I. Pratikakis. Exploiting the PANORAMA Representation for Convolutional Neural Network Classification and Retrieval. In I. Pratikakis, F. Dupont, and M. Ovsjanikov, editors, Eurographics Workshop on 3D Object Retrieval. The Eurographics Association, 2017.
  • [30] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556, 2014.
  • [31] A. A. Soltani, H. Huang, J. Wu, T. D. Kulkarni, and J. B. Tenenbaum. Synthesizing 3d shapes via modeling multi-view depth maps and silhouettes with deep generative networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1511–1519, 2017.
  • [32] H. Su, S. Maji, E. Kalogerakis, and E. G. Learned-Miller. Multi-view convolutional neural networks for 3d shape recognition. In Proc. ICCV, 2015.
  • [33] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In Computer Vision and Pattern Recognition (CVPR), 2015.
  • [34] C. Wang, M. Pelillo, and K. Siddiqi. Dominant set clustering and pooling for multi-view 3d object recognition. In Proceedings of British Machine Vision Conference (BMVC), 2017.
  • [35] Z. Wu, R. Shou, Y. Wang, and X. Liu. Interactive shape co-segmentation via label propagation. Computers & Graphics, 38:248–254, 2014.
  • [36] Z. Wu, S. Song, A. Khosla, F. Yu, L. Zhang, X. Tang, and J. Xiao. 3d shapenets: A deep representation for volumetric shapes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1912–1920, 2015.
  • [37] B. Yang, W. Luo, and R. Urtasun. Pixor: Real-time 3d object detection from point clouds. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7652–7660, 2018.
  • [38] L. Yi, V. G. Kim, D. Ceylan, I.-C. Shen, M. Yan, H. Su, C. Lu, Q. Huang, A. Sheffer, and L. Guibas. A scalable active framework for region annotation in 3d shape collections. SIGGRAPH Asia, 2016.
  • [39] T. Yu, J. Meng, and J. Yuan. Multi-view harmonized bilinear network for 3d object recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 186–194, 2018.
  • [40] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia. Pyramid scene parsing network. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.