TreeGCN-ED: Encoding Point Cloud using a Tree-Structured Graph Network

10/07/2021 · Prajwal Kumar Singh, et al. · IIT Gandhinagar

Point clouds are an efficient way of representing and storing 3D geometric data, and deep learning algorithms that operate directly on them are time and memory efficient. Several methods, such as PointNet and FoldingNet, have been proposed for processing point clouds. This work proposes an autoencoder-based framework that generates robust embeddings for point clouds by exploiting hierarchical information through graph convolution. We perform multiple experiments to assess the quality of the embeddings generated by the proposed encoder architecture and visualize a t-SNE map to highlight its ability to distinguish between different object classes. We further demonstrate the applicability of the proposed framework to applications such as 3D point cloud completion and single-image-based 3D reconstruction.

1 Introduction

Encoder-decoder based methods have been widely used for generating information-preserving embeddings for different data modalities. In the past decades, several encoder-decoder based methods have been proposed for 2D images [1, 2, 3, 4, 5] and have shown impressive results for image compression and filtering tasks [6, 7, 8]. However, it is challenging to extend these methods to 3D point cloud data because of its irregular structure compared to images and 3D voxel grids. In this work, we focus on designing a deep-learning based encoder-decoder framework for processing point cloud data.

Several methods have been proposed for encoding-decoding of point cloud data [9, 10, 11, 12]. PointNet [9] is a pioneering deep-learning-based method for encoding point cloud data into lower-dimensional embeddings. These embeddings carry rich information about the point clouds and can be used for downstream tasks such as segmentation and classification. In a recent work [13], the authors propose a tree-structured decoder that uses the idea of graph convolution [14] to generate a point cloud from a noise vector sampled from a normal distribution. To leverage the tree-structured decoder architecture, they aggregate information from parent nodes at each layer instead of from spatially adjacent nodes when applying graph convolution. The unique definition of graph convolution proposed in [13] highlights the effect of using information from parent nodes at multiple levels during the aggregation stage.
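As a rough illustration of this ancestor-based aggregation, the following PyTorch sketch combines a per-node loop term with one learned projection per ancestor level. The layer sizes and the exact form of the loop term are assumptions made for illustration, not the precise tree-GCN formulation of [13].

```python
import torch
import torch.nn as nn

class TreeGraphConv(nn.Module):
    """Sketch of tree-structured graph convolution: each node combines a
    transformed copy of its own feature (loop term) with linear projections
    of its ancestors' features, instead of aggregating spatial neighbours."""
    def __init__(self, in_dim, out_dim, n_ancestors):
        super().__init__()
        # loop term: a small MLP applied to the node's own feature
        self.loop = nn.Sequential(nn.Linear(in_dim, out_dim), nn.LeakyReLU(),
                                  nn.Linear(out_dim, out_dim))
        # one projection per ancestor level (assumes ancestor features have
        # already been brought to a common dimension `in_dim`)
        self.ancestor_proj = nn.ModuleList(
            [nn.Linear(in_dim, out_dim) for _ in range(n_ancestors)])
        self.bias = nn.Parameter(torch.zeros(out_dim))
        self.act = nn.LeakyReLU()

    def forward(self, node_feat, ancestor_feats):
        # node_feat: (B, N, in_dim); ancestor_feats: one (B, N, in_dim) tensor
        # per ancestor level, broadcast to the N nodes they influence.
        out = self.loop(node_feat)
        for proj, anc in zip(self.ancestor_proj, ancestor_feats):
            out = out + proj(anc)
        return self.act(out + self.bias)
```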

Inspired by the tree-structured decoder architecture in [13], we propose a tree-structured encoder and a graph convolution mechanism for down-sampling. We combine our proposed encoder with the decoder in [13] to create a complete tree-based encoder-decoder framework, denoted TreeGCN-ED (code is available at: https://github.com/prajwalsingh/TreeGCN-ED), for processing point clouds. To show the effectiveness of the proposed framework, we compare its results with the FoldingNet [15] architecture. The results show that TreeGCN-ED performs better than FoldingNet on two different evaluation metrics: Chamfer Distance (CD) [16] and Fréchet Point cloud Distance (FPD) [13]. We also observe that TreeGCN-ED learns the inherent semantic information of the point cloud and hence performs semantic segmentation without any explicit training. This highlights that our encoder generates more information-preserving embeddings. We also perform ablation studies to determine the effect of the feature embedding dimension and data augmentation on the proposed TreeGCN-ED network. We further use the learned embeddings in a transfer learning setup and compare the results for point cloud classification on the ModelNet10 and ModelNet40 datasets [17]. Finally, we demonstrate the applicability of the proposed framework for 3D point cloud completion and single-image-based 3D reconstruction.

Contributions. The following are the major contributions of this work.

  • A tree-structured encoder to generate robust embeddings for point cloud processing using graph convolution.

  • An autoencoder-based framework that combines the proposed encoder with the tree-based decoder of [13] for better point cloud reconstruction.

Figure 1: TreeGCN-ED Architecture. The encoder (on the left) consists of downsampling and graph convolution modules that encode the input 3D point cloud into a feature embedding. The decoder (on the right) takes the embedding from the encoder as input and reconstructs the 3D point cloud.
Figure 2: TreeGCN-ED Down-Branching and Up-Branching Architecture. The down-branching module (on the left) consists of a fully connected layer followed by max-pooling for feature extraction; the output of the fully connected layer is divided into equal-sized components that are passed to the max-pooling layer. The up-branching module collects information from the feature embeddings of the ancestors and upsamples it. The upsampled feature is passed to a graph convolution layer for further processing.

2 Related Work

Point cloud is an important data structure that can be used to store information about the geometry of any 3D shape, and there are various applications related to point cloud processing. Some of the most common are point cloud classification and segmentation [9, 18, 19, 20, 21, 22, 23], point cloud completion [24, 25, 26, 15], point cloud auto-encoders [15, 9], and Generative Adversarial Networks (GANs) for point clouds [13, 27].

In [9], the authors propose the first end-to-end deep auto-encoder to directly process point cloud data. The encoder uses 1D CNNs and global max-pooling to extract features of the input point cloud, which makes the model permutation invariant. The decoder reconstructs the point cloud using a three-layer fully connected network. The FoldingNet [15] model builds on the idea of PointNet [9] by proposing an auto-encoder network that uses a graph-based method to learn the encoding of a point cloud. An edge-based convolution method is proposed in [23] to learn both the local neighborhood and the global properties of the 3D shape. In [13], the authors propose a deep generative model for 3D point cloud generation. This method is unique because the authors apply graph convolution to point cloud data, which inherently does not contain any edge connections.

3 Method

3.1 Tree-GAN

Tree-GAN [13] proposes a deep generative model for 3D point cloud generation. It uses a branching method to gather information from neighbouring points. The accumulated information is then distributed to other points using graph convolution. The point cloud thus generated through this method is implicitly segmented.

In Tree-GAN [13], a noise vector sampled from a normal distribution is given as input to the generator network. Each layer of the generator consists of a branching network and a graph convolution layer. The branching network accumulates the feature vectors from the previous layers, which are then upsampled by the graph convolution layer to generate a new feature vector for that layer. This is repeated until a point cloud of the desired dimension is obtained at the output. Note that the feature vector for the first layer is the noise vector itself. The generator and discriminator are trained with the WGAN objective [28].
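For illustration, the following is a minimal PyTorch sketch of this alternating branch-and-convolve pipeline, assuming a 96-dimensional noise vector and illustrative branching degrees whose product is 2048. The graph convolution step is stood in for by a simple per-node transform, so this is a structural sketch rather than the exact tree-GAN layers.

```python
import torch
import torch.nn as nn

class Branching(nn.Module):
    """Up-branching step: every node is expanded into `degree` child nodes
    through a learned linear map (dimensions are illustrative assumptions)."""
    def __init__(self, dim, degree):
        super().__init__()
        self.degree = degree
        self.expand = nn.Linear(dim, dim * degree)

    def forward(self, x):                      # x: (B, N, dim)
        b, n, d = x.shape
        return self.expand(x).view(b, n * self.degree, d)

class TreeGenerator(nn.Module):
    """Alternates branching and a per-node feature transform (a stand-in for
    the graph convolution layer) until the desired number of points remains."""
    def __init__(self, degrees=(2, 4, 4, 4, 4, 4), dim=96):
        super().__init__()
        layers = []
        for deg in degrees:
            layers += [Branching(dim, deg), nn.Linear(dim, dim), nn.LeakyReLU()]
        self.layers = nn.Sequential(*layers)
        self.to_xyz = nn.Linear(dim, 3)        # project node features to xyz

    def forward(self, z):                      # z: (B, 1, dim), z ~ N(0, I)
        return self.to_xyz(self.layers(z))     # (B, 2*4^5, 3) = (B, 2048, 3)

z = torch.randn(4, 1, 96)                      # noise vectors for a batch of 4
print(TreeGenerator()(z).shape)                # torch.Size([4, 2048, 3])
```

In the actual tree-GAN, the per-node transform is the ancestor-aware graph convolution described earlier, and the generator is trained adversarially against a discriminator under the WGAN objective [28].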

3.2 TreeGCN Based Point Cloud Encoder-Decoder

Figure 3: Interpolation Results. Illustration of intra-class (on the left) and inter-class (on the right) point cloud interpolation.

In this section, we discuss the proposed method for 3D point cloud processing. The key idea of this approach is inspired by tree-GAN [13], a deep learning based model for generating a 3D point cloud from a noise vector. We use the idea of tree-based graph convolution from [13] to develop an encoder that extracts rich embeddings and performs well on unseen point cloud data. Our model takes a 3D point cloud as input and passes it through a sequence of graph-based operations to generate an encoding for the point cloud. The generated encoding is then passed through the decoder network, where a sequence of graph-based operations upsamples the encoding to obtain a point cloud as the output. The complete network, denoted TreeGCN-ED, is trained end-to-end by minimizing the Chamfer loss [16].

Fig. 1 shows the proposed architecture of TreeGCN-ED. A point cloud is given as input to the model and passes through a down-branching network that gathers features from the ancestors of each node. Fig. 2 shows the down-branching network: each ancestor is first passed through a sequence of fully connected layers, max pooling is then applied to extract the dominant features, and the result is passed to the next stage, the graph convolution network. The point cloud repeatedly passes through this sequence of down-branching and graph convolution networks until the desired encoding is obtained. The generated encoding is given as input to the decoder network. In the decoder, the embedding is first passed through an up-branching network. The internal working of the up-branching network is shown in Fig. 2: the encoding is first passed through a fully connected layer, the resulting feature vector is concatenated with the ancestor information, which is useful for reconstructing point clouds, and the constructed feature is then passed to the graph convolution network. This process is repeated until a point cloud of the original size is reconstructed. The decoder architecture is similar to tree-GAN [13]. The overall model is trained in an end-to-end manner using the Chamfer loss function [16].

The branching network is an essential part of the TreeGCN-ED network. It helps accumulate information from the ancestors of each node. Every ancestor feature is passed through a fully connected layer at each stage to help the network learn the relation between a node and its neighbours, which is important because the point cloud does not have edge connections between points. We use max pooling to select important features from the encoded point cloud; max pooling has been shown to be a permutation-invariant function [9]. We experimented with other pooling functions, such as averaging and adding feature vectors, but max-pooling worked better. The tree graph convolution network learns the semantic segmentation of the point cloud implicitly [13].
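As a concrete illustration, the following is a minimal sketch of one plausible reading of the down-branching step: a fully connected layer transforms the node features, the nodes are split into equal groups, and max-pooling keeps the dominant feature of each group, reducing the number of nodes by the chosen factor. The layer widths and the reduction factor are illustrative assumptions, not the exact values used in TreeGCN-ED.

```python
import torch
import torch.nn as nn

class DownBranching(nn.Module):
    """Sketch of a down-branching step: FC transform, grouping of nodes into
    blocks of size `degree`, and max-pooling over each block, which reduces
    the number of nodes by `degree`. Dimensions are assumptions."""
    def __init__(self, in_dim, out_dim, degree):
        super().__init__()
        self.degree = degree
        self.fc = nn.Linear(in_dim, out_dim)

    def forward(self, x):                          # x: (B, N, in_dim)
        b, n, _ = x.shape
        x = self.fc(x)                             # (B, N, out_dim)
        x = x.view(b, n // self.degree, self.degree, -1)
        return x.max(dim=2).values                 # (B, N // degree, out_dim)

pts = torch.randn(2, 2048, 3)                      # a batch of two point clouds
print(DownBranching(3, 64, 4)(pts).shape)          # torch.Size([2, 512, 64])
```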

3.3 Loss Function

To train TreeGCN-ED, we use the Chamfer loss [16], as it shows promising results for point cloud based reconstruction [13].

d_{CD}(S_1, S_2) = \sum_{x \in S_1} \min_{y \in S_2} \lVert x - y \rVert_2^2 + \sum_{y \in S_2} \min_{x \in S_1} \lVert x - y \rVert_2^2    (1)

In Equation 1, S_1 and S_2 represent two different point clouds. There are two specific reasons for using this loss function. First, it is permutation invariant [16]. Second, it penalises the reconstruction if a point from one set is not matched with its nearest neighbour in the other set, and vice-versa. This forces the model to learn an information-preserving embedding for the point cloud.
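A minimal PyTorch sketch of Equation 1 is given below. It assumes batched point clouds of shape (B, N, 3) and (B, M, 3) and sums the squared nearest-neighbour distances in both directions; some implementations normalise by the number of points instead of summing.

```python
import torch

def chamfer_distance(s1, s2):
    """Symmetric Chamfer distance (Eq. 1) between two batched point sets.
    s1: (B, N, 3), s2: (B, M, 3); returns a (B,) tensor."""
    d = torch.cdist(s1, s2, p=2) ** 2        # pairwise squared distances (B, N, M)
    term1 = d.min(dim=2).values.sum(dim=1)   # for each x in S1, nearest y in S2
    term2 = d.min(dim=1).values.sum(dim=1)   # for each y in S2, nearest x in S1
    return term1 + term2

s1, s2 = torch.rand(2, 2048, 3), torch.rand(2, 2048, 3)
print(chamfer_distance(s1, s2))              # Chamfer distance per batch element
```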

3.4 Data Preprocessing

To train our model, we use the ShapeNetBenchmarkV0 dataset [25], which consists of 16 object classes. The dataset is split into training, validation, and testing sets as per the standard ratio proposed in [25]. We uniformly sample 2048 points from the meshes of the ShapeNet dataset [25]. To ensure uniform sampling of points over the surface, we make use of barycentric coordinates.
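The following NumPy sketch shows one common way to do such area-uniform surface sampling with barycentric coordinates; the function name and the default of 2048 points are illustrative, and the paper's exact preprocessing pipeline may differ.

```python
import numpy as np

def sample_surface(vertices, faces, n_points=2048, seed=0):
    """Area-uniform surface sampling with barycentric coordinates: triangles
    are chosen with probability proportional to their area, then a point is
    drawn uniformly inside each chosen triangle.
    vertices: (V, 3) float array, faces: (F, 3) int array."""
    rng = np.random.default_rng(seed)
    tri = vertices[faces]                                    # (F, 3, 3)
    # triangle areas from the cross product of two edge vectors
    cross = np.cross(tri[:, 1] - tri[:, 0], tri[:, 2] - tri[:, 0])
    areas = 0.5 * np.linalg.norm(cross, axis=1)
    idx = rng.choice(len(faces), size=n_points, p=areas / areas.sum())
    # uniform barycentric coordinates via the square-root trick
    u, v = rng.random(n_points), rng.random(n_points)
    su = np.sqrt(u)
    b0, b1, b2 = 1.0 - su, su * (1.0 - v), su * v
    t = tri[idx]
    return b0[:, None] * t[:, 0] + b1[:, None] * t[:, 1] + b2[:, None] * t[:, 2]
```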

4 Experiments and Results

4.1 Training and Comparison of Encoder-Decoder Model

We use the ShapeNetBenchmarkV0 dataset [25], which consists of 16 object classes, with the official train-test split provided with the dataset. Each shape is uniformly sampled to 2048 points before being passed to the network. We train the complete network until convergence using the Chamfer distance [16] as the loss function. We compare the performance of TreeGCN-ED with the FoldingNet architecture [15] on the test set of [25], using two different metrics for evaluation: Chamfer Distance (CD) [16] and Fréchet Point cloud Distance (FPD) [13]. The results on the ShapeNetBenchmarkV0 dataset [25] are shown in Table 1. For a fair evaluation in the case of FoldingNet [15], we consider the first minimum point cloud distances when calculating CD, owing to the difference between the input and output point cloud sizes. The quantitative results in the table highlight that our proposed method performs better than FoldingNet [15] in terms of both Chamfer distance (CD) and Fréchet point cloud distance (FPD).

Object Class    FoldingNet [15]        TreeGCN-ED
                CD       FPD           CD       FPD
Airplane        0.67     11.10         0.50     5.79
Bag             3.12     87.45         1.88     21.02
Cap             2.82     117.36        1.62     16.14
Car             1.76     28.47         1.45     9.47
Chair           1.47     12.00         1.32     7.85
Earphone        3.34     152.04        1.91     51.79
Guitar          0.44     19.55         0.40     13.90
Knife           0.55     19.56         0.41     14.80
Lamp            2.60     45.19         1.97     21.82
Laptop          1.01     11.19         0.88     2.56
Motorbike       1.48     33.91         1.14     14.67
Mug             2.28     40.17         1.72     12.70
Pistol          1.16     30.14         0.79     9.62
Rocket          0.88     32.53         0.61     23.91
Skateboard      1.35     47.17         0.78     13.90
Table           1.70     24.62         1.41     13.90
Average         1.48     44.52         1.21     11.54
Table 1: Comparison of 3D point cloud encoding-decoding between our proposed architecture and the FoldingNet [15] model on the ShapeNetBenchmarkV0 dataset [25].

4.2 Point Cloud Interpolation

To show that our proposed encoder architecture learns an information-rich embedding, we perform inter-class and intra-class interpolation experiments between source and target point clouds. The interpolation results are shown in Fig. 3.

The intra-class interpolation results illustrate the ability of our model to synthesize novel shapes between two given shapes. We observe that the generated shapes faithfully represent the object class at each interpolation stage and the interpolation is observed to be very smooth. Similarly, in the case of inter-class interpolation, we observe a smooth transition of characteristic class features from one object class to another.
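A minimal sketch of this experiment is shown below; `encoder` and `decoder` stand for the two trained halves of TreeGCN-ED, and the assumed API (each callable on a single point cloud tensor or embedding) is for illustration only.

```python
import torch

def interpolate_embeddings(encoder, decoder, pc_src, pc_tgt, steps=8):
    """Encode a source and a target point cloud, linearly blend their
    embeddings, and decode each blend into an intermediate shape.
    `encoder`/`decoder` are the trained TreeGCN-ED halves (assumed API)."""
    with torch.no_grad():
        z_src, z_tgt = encoder(pc_src), encoder(pc_tgt)
        alphas = torch.linspace(0.0, 1.0, steps)
        return [decoder((1 - a) * z_src + a * z_tgt) for a in alphas]
```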

4.3 t-SNE Visualization

We use a t-SNE [29] plot to show how well our encoder model generates feature embeddings for each class, setting the perplexity value to 40. In the t-SNE plot shown in Fig. 4, the inter-class separation is high, which signifies the discriminative capacity of our proposed encoder model.

Figure 4: The visualization of t-SNE [29] clustering of the feature embeddings obtained from TreeGCN-ED model.
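The visualization can be reproduced with scikit-learn's t-SNE as sketched below; the file names for the saved embeddings and labels are hypothetical placeholders, and only the perplexity of 40 is taken from the experiment.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# Hypothetical files holding (n_samples, d) TreeGCN-ED embeddings and class ids.
embeddings = np.load("embeddings.npy")
labels = np.load("labels.npy")

# Project to 2D with the same perplexity (40) as in the experiment.
points_2d = TSNE(n_components=2, perplexity=40, init="pca",
                 random_state=0).fit_transform(embeddings)

plt.scatter(points_2d[:, 0], points_2d[:, 1], c=labels, s=3, cmap="tab20")
plt.title("t-SNE of TreeGCN-ED feature embeddings")
plt.show()
```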

4.4 Ablation Studies

We perform ablation studies to determine how the feature embedding dimension and data augmentation affect the ability of TreeGCN-ED to learn a meaningful feature representation. Four different training regimes are compared on the ShapeNetCore.v2 test set [30], which consists of 55 classes. In Regimes 1 and 2, the dimension of the feature embedding is fixed to 256 and 512, respectively, without augmentation. Similarly, in Regimes 3 and 4, the dimension of the feature embedding is fixed to 256 and 512, respectively, but with rotation augmentation. We use the ShapeNetCore.v2 dataset [30] to train TreeGCN-ED for all four regimes. Table 2 shows the variation in Chamfer distance for the above-mentioned regimes. Based on the results, Regime 4 gives the best model performance.

Dataset            No Augmentation        Rotation Augmentation
                   256-d      512-d       256-d      512-d
ShapeNetCore.v2    10.90      10.07       8.82       7.88
Table 2: Quantitative results for all four regimes for the task of 3D point cloud encoding-decoding. Chamfer distance [16] is used as the metric for comparison.

Furthermore, we also evaluate the efficiency of the feature representation learned by TreeGCN-ED on the ModelNet10 and ModelNet40 datasets [17] for all four regimes. We follow the same procedure as in [15] to train a linear SVM classifier on features extracted from the trained TreeGCN-ED for the ModelNet datasets [17]; a sketch of this evaluation is given after Table 3. Table 3 shows the variation in classification accuracy for all four regimes on the test sets of the ModelNet datasets [17]. Based on the results, Regime 4 gives the best model performance.

Dataset        No Augmentation        Rotation Augmentation
               256-d      512-d       256-d      512-d
ModelNet10     0.83       0.83        0.85       0.85
ModelNet40     0.71       0.72        0.73       0.73
Table 3: Quantitative results for all four regimes for the task of 3D point cloud classification using transfer learning on the ModelNet10 and ModelNet40 datasets [17].
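The transfer-learning evaluation referenced above can be sketched as follows; the feature and label file names are hypothetical placeholders, and the SVM regularisation constant is an illustrative choice rather than the value used in the paper.

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score

# Features extracted by the frozen TreeGCN-ED encoder on ModelNet shapes,
# stored in hypothetical .npy files, are classified with a linear SVM.
train_feat, train_lbl = np.load("mn_train_feat.npy"), np.load("mn_train_lbl.npy")
test_feat, test_lbl = np.load("mn_test_feat.npy"), np.load("mn_test_lbl.npy")

clf = LinearSVC(C=0.01, max_iter=10000)   # C is an illustrative choice
clf.fit(train_feat, train_lbl)
print("accuracy:", accuracy_score(test_lbl, clf.predict(test_feat)))
```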

One could argue that the tree-GAN [13] decoder alone is sufficient for the point cloud processing task at hand. However, to establish the need for and examine the strength of the proposed encoder, we perform an additional experiment in which we replace it with the PointNet [9] encoder and train the complete network on the ShapeNetBenchmarkV0 dataset [25]. We observed a lower average CD with the proposed encoder than with the PointNet encoder, which clearly establishes the efficacy of the proposed encoder.

4.5 Applications

We showcase two potential applications of our proposed method: 3D point cloud completion and single image based 3D reconstruction.

4.5.1 3D Point Cloud Completion

The task of 3D point cloud completion is to reconstruct a complete shape from an incomplete input point cloud. To achieve this, we train TreeGCN-ED on the Completion3D benchmark dataset [24, 25, 26] and perform a qualitative evaluation on the officially available test set. The qualitative results are shown in Fig. 5. Since the ground truth for the test set is not available, we only perform a qualitative analysis.

Figure 5: Qualitative results on the test set of Completion 3D benchmark dataset [24, 25, 26].

4.5.2 Single Image to 3D Reconstruction

3D reconstruction from a single image is an ill-posed problem. The ambiguity arises from the occluded parts of the object, which are not visible in the image. We attempt to solve this problem using the TreeGCN-ED architecture. We train the TreeGCN-ED model on the 16 classes of the ShapeNetBenchmarkV0 dataset [25] until convergence. We then replace the encoder of TreeGCN-ED with a CNN-based architecture to extract image features, freeze the trained weights of the decoder, and train the image encoder network end-to-end for 3D reconstruction, using the Chamfer Distance (CD) [16] as the loss function. We use the synthesized images available in the ShapeNetBenchmarkV0 dataset [25] to train the single-image-to-3D-shape reconstruction model. The qualitative results are shown in Fig. 6.

Figure 6: Qualitative results for single image to 3D reconstruction on the test set of ShapeNetBenchmarkV0 dataset [25].
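The setup can be sketched as follows; the ResNet-18 backbone and the 512-dimensional embedding are illustrative assumptions, since the paper only states that a CNN-based image encoder replaces the point cloud encoder and that the decoder weights are frozen.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class ImageEncoder(nn.Module):
    """CNN image encoder that maps an RGB image to the embedding space
    expected by the frozen TreeGCN-ED decoder (dimensions are assumptions)."""
    def __init__(self, embed_dim=512):
        super().__init__()
        backbone = models.resnet18(weights=None)
        backbone.fc = nn.Linear(backbone.fc.in_features, embed_dim)
        self.backbone = backbone

    def forward(self, img):            # img: (B, 3, H, W)
        return self.backbone(img)      # (B, embed_dim)

# `decoder` stands for the trained TreeGCN-ED decoder: freeze it and train
# only the image encoder with the Chamfer loss.
# image_encoder = ImageEncoder()
# for p in decoder.parameters():
#     p.requires_grad = False
# optimizer = torch.optim.Adam(image_encoder.parameters(), lr=1e-4)
```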

5 Conclusion

In this work, we propose a tree-structured graph convolution-based encoder architecture and combine it with the decoder of tree-GAN to create a complete tree-structured encoder-decoder model for processing 3D point cloud data. The experimental results of our proposed architecture highlight the effectiveness of the encoder model in learning information-rich features. We also showcase that TreeGCN-ED can be used for the task of point cloud completion and single image based 3D reconstruction.

6 Acknowledgments

This research is supported by the SERB MATRICS and SERB IMPRINT-2 grants. We would also like to thank Ashish Tiwari and Dhananjay Singh for their constructive and valuable feedback.

References

  • [1] Vijay Badrinarayanan, Alex Kendall, and Roberto Cipolla, “Segnet: A deep convolutional encoder-decoder architecture for image segmentation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 12, pp. 2481–2495, 2017.
  • [2] Olaf Ronneberger, Philipp Fischer, and Thomas Brox, “U-net: Convolutional networks for biomedical image segmentation,” 2015.
  • [3] Evan Shelhamer, Jonathan Long, and Trevor Darrell, “Fully convolutional networks for semantic segmentation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 4, pp. 640–651, 2017.
  • [4] Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L. Yuille, “Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 40, no. 4, pp. 834–848, 2018.
  • [5] Hengshuang Zhao, Jianping Shi, Xiaojuan Qi, Xiaogang Wang, and Jiaya Jia, “Pyramid scene parsing network,” in 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 6230–6239.
  • [6] Kai Zhang, Wangmeng Zuo, Yunjin Chen, Deyu Meng, and Lei Zhang, “Beyond a gaussian denoiser: Residual learning of deep cnn for image denoising,” IEEE Transactions on Image Processing, vol. 26, no. 7, pp. 3142–3155, 2017.
  • [7] Kai Zhang, Wangmeng Zuo, and Lei Zhang, “Ffdnet: Toward a fast and flexible solution for cnn-based image denoising,” IEEE Transactions on Image Processing, vol. 27, no. 9, pp. 4608–4622, 2018.
  • [8] Kai Zhang, Wangmeng Zuo, Shuhang Gu, and Lei Zhang, “Learning deep cnn denoiser prior for image restoration,” in 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 2808–2817.
  • [9] Charles R. Qi, Hao Su, Kaichun Mo, and Leonidas J. Guibas, “Pointnet: Deep learning on point sets for 3d classification and segmentation,” 2017.
  • [10] Yangyan Li, Rui Bu, Mingchao Sun, Wei Wu, Xinhan Di, and Baoquan Chen, “Pointcnn: Convolution on x-transformed points,” in NeurIPS, 2018.
  • [11] Yin Zhou and Oncel Tuzel, “Voxelnet: End-to-end learning for point cloud based 3d object detection,” 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4490–4499, 2018.
  • [12] Shaoshuai Shi, Xiaogang Wang, and Hongsheng Li, “Pointrcnn: 3d object proposal generation and detection from point cloud,” in 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 770–779.
  • [13] Dong Wook Shu, Sung Woo Park, and Junseok Kwon, “3d point cloud generative adversarial network based on tree structured graph convolutions,” 2019.
  • [14] Thomas Kipf and Max Welling, “Semi-supervised classification with graph convolutional networks,” ArXiv, vol. abs/1609.02907, 2017.
  • [15] Yaoqing Yang, Chen Feng, Yiru Shen, and Dong Tian, “Foldingnet: Point cloud auto-encoder via deep grid deformation,” 2018.
  • [16] Haoqiang Fan, Hao Su, and Leonidas Guibas, “A point set generation network for 3d object reconstruction from a single image,” 2016.
  • [17] Zhirong Wu, Shuran Song, Aditya Khosla, Fisher Yu, Linguang Zhang, Xiaoou Tang, and Jianxiong Xiao, “3d shapenets: A deep representation for volumetric shapes,” 2015.
  • [18] Alexandre Boulch, B. L. Saux, and Nicolas Audebert, “Unstructured point cloud semantic labeling using deep segmentation networks,” in 3DOR@Eurographics, 2017.
  • [19] Simon Christoph Stein, Markus Schoeler, Jeremie Papon, and Florentin Wörgötter, “Object partitioning using local convexity,” in 2014 IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 304–311.
  • [20] David Dohan, Brian Matejek, and Thomas A. Funkhouser, “Learning hierarchical semantic segmentations of lidar data,” 2015 International Conference on 3D Vision, pp. 273–281, 2015.
  • [21] Timo Hackel, Jan Dirk Wegner, and Konrad Schindler, “Fast semantic segmentation of 3d point clouds with strongly varying density,” ISPRS Annals of the Photogrammetry, Remote Sensing and Spatial Information Sciences, pp. 177–184, 2016.
  • [22] Jing Huang and Suya You, “Point cloud labeling using 3d convolutional neural network,” 2016 23rd International Conference on Pattern Recognition (ICPR), pp. 2670–2675, 2016.
  • [23] Yue Wang, Yongbin Sun, Ziwei Liu, Sanjay E. Sarma, Michael M. Bronstein, and Justin M. Solomon, “Dynamic graph cnn for learning on point clouds,” ACM Trans. Graph., vol. 38, no. 5, Oct. 2019.
  • [24] Wentao Yuan, Tejas Khot, David Held, Christoph Mertz, and Martial Hebert, “Pcn: Point completion network,” 2019.
  • [25] Angel X. Chang, Thomas Funkhouser, Leonidas Guibas, Pat Hanrahan, Qixing Huang, Zimo Li, Silvio Savarese, Manolis Savva, Shuran Song, Hao Su, Jianxiong Xiao, Li Yi, and Fisher Yu, “Shapenet: An information-rich 3d model repository,” 2015.
  • [26] Lyne P. Tchapmi, Vineet Kosaraju, Hamid Rezatofighi, Ian Reid, and Silvio Savarese, “Topnet: Structural point cloud decoder,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
  • [27] Chun-Liang Li, Manzil Zaheer, Yang Zhang, Barnabás Póczos, and Ruslan Salakhutdinov, “Point cloud gan,” ArXiv, vol. abs/1810.05795, 2019.
  • [28] Martin Arjovsky, Soumith Chintala, and Léon Bottou, “Wasserstein gan,” 2017.
  • [29] Laurens van der Maaten and Geoffrey E. Hinton, “Visualizing data using t-sne,” Journal of Machine Learning Research, vol. 9, pp. 2579–2605, 2008.
  • [30] Manolis Savva, Fisher Yu, Hao Su, Asako Kanezaki, Takahiko Furuya, Ryutarou Ohbuchi, Zhichao Zhou, Rui Yu, Song Bai, Xiang Bai, Masaki Aono, Atsushi Tatsuma, Spyridon Thermos, Apostolos Axenopoulos, Georgios Th. Papadopoulos, Petros Daras, Xiao Deng, Zhouhui Lian, Bo Li, Henry Johan, Yijuan Lu, and Sanjeev Mk, “Large-scale 3d shape retrieval from shapenet core55,” in 3DOR@Eurographics, 2017.