Hierarchical View Predictor: Unsupervised 3D Global Feature Learning through Hierarchical Prediction among Unordered Views

08/08/2021 · Zhizhong Han, et al. · Wayne State University, University of Maryland, Tsinghua University

Unsupervised learning of global features for 3D shape analysis is an important research challenge because it avoids manual effort for supervised information collection. In this paper, we propose a view-based deep learning model called Hierarchical View Predictor (HVP) to learn 3D shape features from unordered views in an unsupervised manner. To mine highly discriminative information from unordered views, HVP performs a novel hierarchical view prediction over a view pair, and aggregates the knowledge learned from the predictions in all view pairs into a global feature. In a view pair, we pose hierarchical view prediction as the task of hierarchically predicting a set of image patches in a current view from its complementary set of patches, and in addition, completing the current view and its opposite from any one of the two sets of patches. Hierarchical prediction, from patches to patches, patches to view, and view to view, enables HVP to effectively learn the structure of 3D shapes from the correlation between patches in the same view and the correlation between a pair of complementary views. In addition, the employed implicit aggregation over all view pairs enables HVP to learn global features from unordered views. Our results show that HVP can outperform state-of-the-art methods under large-scale 3D shape benchmarks in shape classification and retrieval.


1. Introduction

Learning discriminative global features is important for 3D shape analysis tasks such as classification (Sharma et al., 2016; Wu et al., 2016; Han et al., 2017, 2016, 2019a; Yang et al., 2018; Achlioptas et al., 2018; Han et al., 2018, 2019d), retrieval (Sharma et al., 2016; Wu et al., 2016; Han et al., 2017, 2016, 2019a; Yang et al., 2018; Achlioptas et al., 2018; Han et al., 2018), correspondence (Huang et al., 2017; Han et al., 2019a, 2017, 2016, 2018), segmentation (Qi et al., 2017a, b; Wen et al., 2020b), and reconstruction (Jiang et al., 2020; Wen et al., 2020c, 2021b, 2021a; Hu et al., 2020; Han et al., 2020c, a, b; Ma et al., 2021; Chen et al., 2021; Xiang et al., 2021). With supervised information (Huang et al., 2017; Qi et al., 2017a, b; Liu et al., 2019a), recent deep learning based methods have achieved remarkable results under large-scale 3D benchmarks. However, intensive manual labeling effort is required to obtain such supervised information. In contrast, unsupervised 3D feature learning is an appealing alternative because it avoids this manual labeling effort.

Several studies have addressed this challenge recently (Yan et al., 2016; Sharma et al., 2016; Wu et al., 2016; Han et al., 2017, 2016; Girdhar et al., 2016; Rezende et al., 2016; Choy et al., 2016; Han et al., 2019a; Yang et al., 2018; Achlioptas et al., 2018; Han et al., 2018, 2019d) by extracting “supervised” information in an unsupervised scenario for the training of deep learning models. Extracting self-supervised information is usually achieved by posing different prediction tasks, such as the prediction of a shape from itself by minimizing reconstruction error (Sharma et al., 2016; Wu et al., 2016; Han et al., 2019a; Yang et al., 2018; Achlioptas et al., 2018) or embedded energy (Han et al., 2017, 2016), the prediction of a 3D shape from its context given by 2D views of the shape (Yan et al., 2016; Choy et al., 2016; Han et al., 2019d) or local shape features (Han et al., 2018), or the prediction of a shape from views and itself together (Girdhar et al., 2016; Rezende et al., 2016). Among all these methods, multiple sequential views are usually employed to provide a holistic context of 3D shapes; however, unordered views still cannot be leveraged for learning.

In this paper, we propose a novel model for 3D shape feature learning using a self-supervised view-based prediction task, which is formulated using unordered views and not restricted to sequential views. As we demonstrate in our results, this leads to highly discriminative 3D shape features and state-of-the-art performance in 3D shape analysis tasks such as classification and retrieval.

A key idea of our deep learning model, called the Hierarchical View Predictor (HVP), is that it operates on pairs of opposing views of each shape. HVP learns to hierarchically make local view predictions in each view pair, i.e., patches to patches, patches to view, and view to view, and then aggregates the knowledge learned from the predictions in all view pairs into global features. Specifically, unordered views are taken around a 3D shape on a sphere, where a current view and its opposite view form a view pair. Splitting unordered views into multiple view pairs enables HVP to handle the lack of order among views. In each view pair, given a set of patches of the current view, HVP performs hierarchical view prediction by first predicting the complementary set of patches in the current view, and then completing the whole current view and its opposite from any one of the two sets of patches. Hierarchical view prediction aims to learn the structure of 3D shapes from the correlation between patches in the same view and the correlation between appearances in the pair of complementary views. In addition, HVP employs an effective aggregation technique that implicitly aggregates the knowledge learned in the hierarchical view predictions of all pairs of opposing views. In summary, our significant contributions are as follows:

Figure 1. The framework of HVP is demonstrated by learning from a view pair of an airplane from (a) to (f).
  • We propose HVP as a novel deep learning model to perform unsupervised 3D global feature learning through hierarchical view prediction, which leads to state-of-the-art results in classification and retrieval.

  • Hierarchical view prediction supports predictions among both unordered and ordered views, which enables HVP to comprehensively understand a 3D shape by hierarchically capturing the correlation between parts in the same view and the correlation between appearances in a pair of complementary views.

  • By simultaneously mining “supervised” information inside a view and between two views, HVP eliminates the requirement of learning from a dense set of neighboring views, which enables it to achieve high performance with sparse neighboring view sets.

2. Related work

Supervised 3D feature learning. With class labels, various deep learning models have been proposed to learn 3D features by capturing the distribution patterns among voxels (Wu et al., 2015; Qi et al., 2016b; Wang et al., 2017a), meshes (Han et al., 2018), point clouds (Qi et al., 2017a, b; Liu et al., 2019a; Wen et al., 2020a; Liu et al., 2020) and views (Bai et al., 2017; Shi et al., 2015; Sfikas et al., 2017; Sinha et al., 2016; Su et al., 2015; Huang et al., 2017; Savva et al., 2016; Johns et al., 2016; Wang et al., 2017b; Kanezaki et al., 2018; Han et al., 2019b; Liu et al., 2021; Han et al., 2019h, f; Han et al., 2019c). Among these methods, multi-view based methods perform the best, where pooling is widely used for view aggregation.

Unsupervised 3D feature learning. To mine “supervised” information in an unsupervised scenario, deep learning based methods adopt different prediction strategies, such as the prediction of a shape from itself by minimizing reconstruction error (Sharma et al., 2016; Wu et al., 2016; Han et al., 2019a; Yang et al., 2018; Achlioptas et al., 2018) or embedded energy (Han et al., 2017, 2016), the prediction of a shape from context (Yan et al., 2016; Choy et al., 2016; Han et al., 2018), or the prediction of a shape from context and itself together (Girdhar et al., 2016; Rezende et al., 2016). These methods employ different kinds of 3D raw representations, such as voxels (Yan et al., 2016; Sharma et al., 2016; Wu et al., 2016; Han et al., 2019a; Girdhar et al., 2016; Rezende et al., 2016), meshes (Han et al., 2017, 2016, 2018) or point clouds (Du et al., 2021; Sauder and Sievers, 2019; Yang et al., 2018; Achlioptas et al., 2018; Han et al., 2019g; Poursaeed et al., 2020; Gao et al., 2020; Zhang and Zhu, 2019; Hassani and Haley, 2019; Liu et al., 2019b), and accordingly, different kinds of context, such as spatial context of virtual words (Han et al., 2018) or views (Girdhar et al., 2016; Choy et al., 2016; Rezende et al., 2016; Yan et al., 2016; Han et al., 2019d; Gao et al., 2021), are employed. Different from these methods, HVP employs a novel hierarchical view prediction among unordered views to mine more and finer “supervised” information.

View synthesis. Early works teach deep learning models to predict novel views from input views and transformation parameters (Dosovitskiy et al., 2017). To generate views with more detail and fewer geometric distortions, external image sets (Flynn et al., 2016) or geometric constraints (Zhou et al., 2016; Ji et al., 2017) are further employed. Similarly, the information of multiple past frames is aggregated in video prediction (William et al., 2016). However, these methods cannot aggregate the knowledge learned in each prediction into discriminative global features.

3. Hierarchical view predictor

The rationale of hierarchical view prediction is to mimic human perception and understanding of 3D shapes. If a human knows a 3D shape, based on observing one part of a view, they can easily imagine the other part of the view, the full view, and even the opposite view. Therefore, HVP mimics this perception by hierarchical view prediction covering patches to patches, patches to view, and view to view, to learn the structure of a 3D shape from the correlation between parts in the same view and the correlation between appearances in the pair of complementary views.

Rather than learning from context by predicting a single patch with a CNN, as in the unsupervised image feature learning method of (Pathak et al., 2016), HVP employs an RNN-based architecture to predict a patch sequence. Since a rendered view contains only one freely rotated object on an empty background, the context that can be leveraged for learning is much scarcer than in a natural image containing multiple up-oriented objects. Thus, we capture more of the spatial relationships among parts in a view to remedy the scarce context.

Overview. The framework of HVP is illustrated in Fig. 1. By hierarchical view prediction, HVP aims to learn a global feature f of a 3D shape m from unordered views taken around m on a sphere. We place the cameras at the 20 vertices of a regular dodecahedron, such that the views are uniformly distributed. The feature vector f is learned for each shape similarly as in (Han et al., 2019d) via gradient descent, together with training the other parameters in HVP, starting from a random initialization.

For each shape m in Fig. 1(a), each view and its opposite view form a view pair in Fig. 1(b). In a view pair, the current view and the opposite view are denoted as v and v′ for short. Using a grid, we divide the current view v into 12 overlapping patches, which are further split into two subsets, an “O” like patch set O and an “I” like patch set I, as shown in Fig. 1(c). Each patch is a square.

In a view pair (v, v′), hierarchical view prediction consists of three prediction tasks in different spaces. Given either one of the two patch sets, HVP first predicts the other patch set in a feature space computed with a VGG19 network, as shown in Fig. 1(d). Then, the full current view is predicted in pixel space in Fig. 1(e). Finally, the opposite view is predicted based on the predicted current view in Fig. 1(f).
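The camera placement and the pairing of each view with its opposite can be sketched as follows; this is a minimal illustration, not the authors' rendering pipeline, and it assumes unit-sphere cameras at the standard dodecahedron vertex coordinates.

```python
# A minimal sketch: cameras at the 20 vertices of a regular dodecahedron on the unit
# sphere, and each view paired with its antipodal (opposite) view.
import itertools
import numpy as np

def dodecahedron_cameras():
    """Return 20 unit vectors: the vertices of a regular dodecahedron."""
    phi = (1 + 5 ** 0.5) / 2                                  # golden ratio
    verts = list(itertools.product([-1.0, 1.0], repeat=3))    # 8 cube vertices
    for a, b in itertools.product([-1.0, 1.0], repeat=2):     # 12 remaining vertices
        verts += [(0.0, a / phi, b * phi), (a / phi, b * phi, 0.0), (a * phi, 0.0, b / phi)]
    verts = np.array(verts)
    return verts / np.linalg.norm(verts, axis=1, keepdims=True)

def opposite_view_pairs(cams):
    """Pair each camera index with the index of its antipodal camera (the opposite view)."""
    return [(i, int(np.argmin(np.linalg.norm(cams + c, axis=1)))) for i, c in enumerate(cams)]

cams = dodecahedron_cameras()
print(len(cams), opposite_view_pairs(cams)[:3])   # 20 cameras, e.g. [(0, 7), ...]
```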

Figure 2. The gridding procedure splits a view into the two patch sets O and I.

Gridding. In a view pair, the current view v is divided into 12 overlapping square patches, as demonstrated in Fig. 2. The overlapping patches are obtained by uniformly moving a window across v in the vertical and horizontal directions, as demonstrated in Fig. 3. This gridding indexes the patches from 1 to 12 in order from left to right and top to bottom. We split the set of 12 patches into an “O” like set O and an “I” like set I of six patches each, where the patches in O and I are each sorted into a patch sequence in ascending index order, as illustrated in Fig. 2. We formulate patch prediction as a bidirectional task. That is, both the prediction of I from O and the prediction of O from I can be conducted using the same network structure (we take the former as the example in the following description).

Figure 3. The gridding procedure is demonstrated by moving a blue window over a current view in the vertical and horizontal directions.

Our strategy to define and split the patch set is motivated as follows: First, we want to split the patches of the current view into two equally sized subsets to facilitate bidirectional prediction using the same symmetrical structure in HVP. Second, the two patch sets should be distributed over the current view in a similar manner, which eliminates the bias of one prediction direction over the other. Third, each subset should contain a small number of patches to avoid redundancy and excessive computational cost.
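A minimal sketch of the gridding step follows. The 4×3 arrangement of window positions and the index lists used for the “O”/“I” split are assumptions for illustration; the paper specifies the actual split only in Fig. 2.

```python
# A minimal sketch (assumptions: a 224x224 view, a 4x3 grid of window positions, and
# hypothetical index lists for the "O"/"I" split).
import numpy as np

def grid_patches(view, patch=128, rows=4, cols=3):
    """Cut an HxWx3 view into rows*cols overlapping square patches of side `patch`."""
    H, W, _ = view.shape
    ys = np.linspace(0, H - patch, rows).astype(int)
    xs = np.linspace(0, W - patch, cols).astype(int)
    return [view[y:y + patch, x:x + patch] for y in ys for x in xs]

view = np.zeros((224, 224, 3), dtype=np.float32)
patches = grid_patches(view)                              # 12 patches, indexed 0..11 here (1..12 in the paper)
O_IDX, I_IDX = [0, 2, 5, 6, 9, 11], [1, 3, 4, 7, 8, 10]   # hypothetical split into six patches each
set_O = [patches[i] for i in O_IDX]
set_I = [patches[i] for i in I_IDX]
print(len(patches), len(set_O), len(set_I))
```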

Patch prediction. Patch prediction is performed to capture the correlation between parts in the same view. It predicts the features of the patches in set I (or O) from the features of the patches in set O (or I). According to the indexes established in gridding, we sort all patches in either O or I into a sequence in ascending order. This leads us to use a seq2seq model to implement patch prediction, as shown in Fig. 1(d).

We first extract a 4096 dimensional feature of each patch using the last fully connected layer of a VGG19 pretrained on ImageNet. Then, we use an encoder RNN to encode all the patch features in O, together with their spatial relationship. We provide the global feature f of shape m, our learning target, at the first step of the encoder. A key characteristic of our approach is that f is shared among all view prediction tasks for each shape m. Hence it serves as a knowledge container that keeps incorporating the knowledge derived from each hierarchical view prediction performed on m. Different from pooling, which is widely used as an explicit view aggregation, our implicit aggregation enables HVP to mine more and finer information from the views of m. The encoder summarizes O as a 4096 dimensional hidden state h at its last step. Finally, based on h, a decoder RNN is employed to predict the features of the patches in I, together with their spatial relationship. Similar to the encoder, we provide the global feature f at the first step of the decoder, which is regarded as a reference for the following patch feature predictions. For each view, we measure the patch prediction performance of HVP using the L2 loss between the predicted patch features x̂_j and the ground truth VGG19 features x_j of the patches in I, denoted as loss L_P,

L_P = Σ_j ‖x̂_j − x_j‖₂.    (1)
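The patch prediction stage can be sketched as below. The GRU cells, the teacher-forced decoder inputs, and the reduced feature dimension used in the demo are assumptions; the paper only states that an encoder RNN and a decoder RNN operate on 4096-dimensional features and that f is fed at the first step of both.

```python
# A minimal PyTorch sketch of the patch prediction stage (assumptions: GRU cells and
# teacher forcing; the paper does not specify the RNN cell type or the decoder inputs).
import torch
import torch.nn as nn

class PatchSeq2Seq(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.encoder = nn.GRU(dim, dim, batch_first=True)
        self.decoder = nn.GRU(dim, dim, batch_first=True)
        self.out = nn.Linear(dim, dim)

    def forward(self, f, src_feats, tgt_feats):
        # Encode the global feature f (first step) followed by the features of the known set.
        enc_in = torch.cat([f.unsqueeze(1), src_feats], dim=1)
        _, h = self.encoder(enc_in)
        # Decode with f at the first step; later steps see the previous target features
        # (teacher forcing during training is an assumption, not stated in the paper).
        dec_in = torch.cat([f.unsqueeze(1), tgt_feats[:, :-1]], dim=1)
        dec_out, _ = self.decoder(dec_in, h)
        return self.out(dec_out)

D = 256                                    # reduced from the paper's 4096 to keep the demo light
model = PatchSeq2Seq(D)
f = torch.zeros(2, D, requires_grad=True)  # learnable global features of two shapes
src = torch.randn(2, 6, D)                 # VGG19 features of the patches in O (4096-d in the paper)
tgt = torch.randn(2, 6, D)                 # VGG19 features of the patches in I
pred = model(f, src, tgt)
loss_patch = (pred - tgt).norm(dim=-1).sum(dim=1).mean()   # Eq. (1): L2 loss in feature space
loss_patch.backward()                      # gradients also flow into f, the shared knowledge container
```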

Current view prediction. Similar to patch prediction, current view prediction also aims to capture the correlation between parts in the same view, but in pixel space, which forces HVP to understand a 3D shape from the current view in different spaces. HVP predicts the full current view v, covering both patch sets O and I, based on the encoding h of the patch features in set O (or I). We employ a deconvolutional network to predict the current view in pixel space from h, as shown in Fig. 1(e).

By reshaping the 4096 dimensional h into 256 feature maps of size 4×4, the deconvolutional network generates the predicted current view v̂ through two deconvolutional layers. The two deconvolutional layers employ 128 and 3 kernels, respectively, each with a stride of 4, where a ReLU and a tanh follow the two layers as nonlinear activation functions, respectively. For each view, we utilize the L2 loss between the predicted current view v̂ and the ground truth current view v to measure the current view prediction performance of HVP, denoted as loss L_C,

L_C = ‖v̂ − v‖₂.    (2)
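A minimal sketch of the deconvolutional decoder is given below. The 4×4 kernel size is an assumption (the paper states only the stride of 4), which would turn the 256×4×4 reshaped code into 64×64 generated views.

```python
# A minimal PyTorch sketch of the deconvolutional decoder for current view prediction
# (assumption: 4x4 kernels; with a stride of 4 this yields a 64x64 RGB view).
import torch
import torch.nn as nn

class ViewDecoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.deconv1 = nn.ConvTranspose2d(256, 128, kernel_size=4, stride=4)  # 4x4 -> 16x16
        self.deconv2 = nn.ConvTranspose2d(128, 3, kernel_size=4, stride=4)    # 16x16 -> 64x64
        self.relu, self.tanh = nn.ReLU(), nn.Tanh()

    def forward(self, h):                      # h: (B, 4096) encoding of the known patch set
        x = h.view(-1, 256, 4, 4)              # reshape 4096 into 256 feature maps of size 4x4
        x = self.relu(self.deconv1(x))
        return self.tanh(self.deconv2(x))      # predicted current view, values in [-1, 1]

h = torch.randn(2, 4096)
v_pred = ViewDecoder()(h)
print(v_pred.shape)                            # torch.Size([2, 3, 64, 64])
```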

Opposite view prediction. We further perform opposite view prediction to capture the correlation between appearances in the pair of complementary views. This helps HVP to bridge one individual view to another and encode their relationship into the global feature. Based on the predicted current view v̂, HVP predicts the opposite view v′ of the current view in each view pair, which is the most challenging of the three prediction tasks in HVP. This is because the opposite view can only be correctly predicted if the predicted current view v̂ is very close to the ground truth current view v, and no other clue is available. This challenging criterion pushes HVP to comprehensively learn the intrinsic structure of a 3D shape.

We employ a convolutional network and the same deconvolutional network to implement the opposite view prediction, as shown in Fig. 1(f). The convolutional network abstracts the predicted current view v̂ into a 4096 dimensional feature through three convolutional blocks and one fully connected layer. The three blocks include 1, 2, and 3 convolutional layers, respectively, and each block is followed by a maxpool. All convolutional layers in the three blocks employ a stride of 1, and ReLU is used as the nonlinear activation function. Then, the deconvolutional network generates the predicted opposite view v̂′ based on the 4096 dimensional feature of v̂. For each view, we also utilize the L2 loss between the predicted opposite view v̂′ and the ground truth opposite view v′ to measure the opposite view prediction performance of HVP, denoted as loss L_O,

L_O = ‖v̂′ − v′‖₂.    (3)
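The opposite view predictor can be sketched as follows, reusing the ViewDecoder from the previous sketch. The 3×3 kernels, 2×2 max-pooling, and the channel widths 64/128/256 are assumptions, since the paper specifies only the block structure, the stride of 1, and the 4096-dimensional output of the fully connected layer.

```python
# A minimal PyTorch sketch of the opposite view predictor (assumed kernel sizes, pooling
# sizes, and channel widths; ViewDecoder is the decoder from the previous sketch).
import torch
import torch.nn as nn

def conv_block(c_in, c_out, n_layers):
    layers = []
    for i in range(n_layers):
        layers += [nn.Conv2d(c_in if i == 0 else c_out, c_out, kernel_size=3, stride=1, padding=1),
                   nn.ReLU()]
    layers.append(nn.MaxPool2d(2))
    return nn.Sequential(*layers)

class OppositeViewPredictor(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(conv_block(3, 64, 1),     # block 1: 1 convolutional layer
                                     conv_block(64, 128, 2),   # block 2: 2 convolutional layers
                                     conv_block(128, 256, 3),  # block 3: 3 convolutional layers
                                     nn.Flatten(),
                                     nn.Linear(256 * 8 * 8, 4096))
        self.decoder = ViewDecoder()                           # the same deconvolutional network

    def forward(self, v_pred):                                 # v_pred: predicted current view (B, 3, 64, 64)
        return self.decoder(self.encoder(v_pred))

v_opp = OppositeViewPredictor()(torch.randn(2, 3, 64, 64))
print(v_opp.shape)                                             # torch.Size([2, 3, 64, 64])
```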

Objective function. Finally, in each view pair, HVP is trained to minimize all the aforementioned losses involved in the three prediction tasks. Therefore, we define the objective function of HVP by combining the three losses as in Eq. (4), where the weights α and β are used to control the balance among them,

L = L_C + α·L_P + β·L_O.    (4)

Note that simultaneously with the other network parameters, we also optimize the learning target f by minimizing L. We use a standard gradient descent approach that iteratively updates f as in Eq. (5), where η is the learning rate,

f ← f − η·∂L/∂f.    (5)
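A minimal sketch of the objective in Eq. (4) and the manual update of f in Eq. (5) is shown below; the placeholder losses and the plain SGD update are illustrative assumptions standing in for the three prediction losses of one view pair.

```python
# A minimal sketch of Eq. (4) and Eq. (5) (illustrative weights and learning rate).
import torch

alpha, beta, eta = 1.0, 1.0, 1e-4                  # balance weights and learning rate (illustrative values)
f = torch.zeros(4096, requires_grad=True)          # global feature of one shape, learned jointly with HVP

# Placeholder losses standing in for the three prediction losses of one view pair.
loss_patch = (f ** 2).mean()
loss_current = (f - 1.0).abs().mean()
loss_opposite = (f + 1.0).abs().mean()

loss = loss_current + alpha * loss_patch + beta * loss_opposite   # Eq. (4)
loss.backward()
with torch.no_grad():
    f -= eta * f.grad                              # Eq. (5): f <- f - eta * dL/df
    f.grad.zero_()
```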

Testing modes. There are two typical modes of unsupervised learning of 3D shape features for testing, which we call the known-test mode and the unknown-test mode. In known-test mode, the test shapes are given at the same time as the training shapes, such that the features of the test shapes can be learned together with the features of the training shapes. In unknown-test mode, HVP is first pretrained using the set of training shapes only. At test time, we then iteratively learn the features of the test shapes by minimizing Eq. (4) while fixing the other pretrained parameters of HVP.
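Unknown-test mode can be sketched as below; view_loss_fn is a hypothetical callable standing in for the evaluation of Eq. (4) on one shape, and only the test features are optimized while the pretrained HVP parameters stay fixed.

```python
# A minimal sketch of unknown-test mode (assumption: `view_loss_fn(f, view_pairs)` is a
# hypothetical callable that evaluates Eq. (4) for one shape given its learnable feature f).
import torch

def learn_test_features(model, test_view_pairs, view_loss_fn, steps=100, eta=1e-4):
    for p in model.parameters():
        p.requires_grad_(False)                                   # fix the pretrained HVP parameters
    feats = [torch.zeros(4096, requires_grad=True) for _ in test_view_pairs]
    opt = torch.optim.SGD(feats, lr=eta)
    for _ in range(steps):
        opt.zero_grad()
        loss = sum(view_loss_fn(f, vp) for f, vp in zip(feats, test_view_pairs))
        loss.backward()
        opt.step()
    return [f.detach() for f in feats]                            # learned features of the test shapes
```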

4. Experimental results and analysis

In this section, the performance of HVP is evaluated and analyzed. First, we discuss the setup of the parameters involved in HVP. These parameters are tuned to demonstrate how they affect the discriminability of the learned features in shape classification under ModelNet10 (Wu et al., 2015). Then, ablation studies are presented to show the effectiveness of the important elements involved in HVP. Finally, HVP is compared with state-of-the-art methods in shape classification and retrieval under ModelNet10 (Wu et al., 2015) and ModelNet40 (Wu et al., 2015). In addition, some generated current views and opposite views are visualized to further justify HVP. Note that all classification is conducted by training a linear SVM on the global features learned by HVP.

Dataset and evaluations. The training and testing sets of ModelNet40 consist of 9,843 and 2,468 shapes, respectively. In addition, the training and testing sets of ModelNet10 consist of 3,991 and 908 shapes, respectively.

We employ both average instance accuracy (InsACC) and average class accuracy (ClaACC) to evaluate the classification results. Moreover, we use mAP and precision-recall (PR) curves as metrics in shape retrieval.
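A minimal sketch of this evaluation protocol is given below; it assumes scikit-learn's LinearSVC and hypothetical arrays holding the learned features and labels of the training and test shapes.

```python
# A minimal sketch (assumptions: scikit-learn's LinearSVC; train_f/train_y/test_f/test_y
# are hypothetical arrays of learned global features and labels).
import numpy as np
from sklearn.svm import LinearSVC

def classify_with_svm(train_f, train_y, test_f, test_y):
    svm = LinearSVC().fit(train_f, train_y)               # linear SVM on the learned features
    pred = svm.predict(test_f)
    ins_acc = (pred == test_y).mean()                     # average instance accuracy (InsACC)
    cla_acc = np.mean([(pred[test_y == c] == c).mean()    # average class accuracy (ClaACC)
                       for c in np.unique(test_y)])
    return ins_acc, cla_acc
```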

Parameter setup. Initially, the dimension of the global feature f is 4096, which is the same as the dimension of the patch features, and the views of all 3D shapes under ModelNet10 are employed to train HVP in known-test mode. Both balance weights α and β are set to 1, which makes the initial values of the losses L_P, L_C and L_O comparable to each other, and a normal distribution with a mean of 0 and a standard deviation of 0.02 is used to initialize the parameters involved in HVP. In addition, the patch size is 128, and both the prediction of I from O and the prediction of O from I are performed.

First, we conduct experiments to explore how the learning rate affects the performance of HVP, as shown in Table 1. We employ different learning rates with the initial parameters mentioned in the former paragraph, and achieve the best instance accuracy of 93.61% among these settings.

2 3 4 5 6 7
InsACC 91.74 92.29 92.51 93.61 92.73 92.84
ClaACC 91.48 92.05 92.15 93.25 92.35 92.45
Table 1. The effect of the learning rate under ModelNet10.

Next, we explore the balance weights α and β. We summarize the results in Table 2 and Table 3, which show that the weights are important for the performance of HVP. With β = 1, we explore the effect of α by setting it to 0.25, 0.5, 1, 2 and 4 in Table 2. The instance accuracy increases to a best of 93.61% with α = 1, and then decreases gradually. We observe a similar phenomenon in the exploration of the effect of β in Table 3. With α = 1 obtaining the best result in Table 2, we set β to 0.25, 0.5, 1, 2 and 4. The instance accuracy also reaches up to 93.61% with β = 1, and then decreases a little. These results show that both under-fitted and over-fitted patch prediction and opposite view prediction affect the discriminability of the learned features. In addition, in terms of the row averaged accuracies in Table 2 and Table 3, HVP is affected more by patch prediction than by opposite view prediction, since the row averaged accuracies of Table 2 drop more from the same highest accuracy than those of Table 3.

α 0.25 0.5 1 2 4 Avg
InsACC 92.18 92.84 93.61 92.18 92.51 92.66
ClaACC 91.85 92.75 93.25 91.85 92.11 92.36
Table 2. The effect of the patch prediction weight α under ModelNet10. β = 1.
β 0.25 0.5 1 2 4 Avg
InsACC 92.40 92.84 93.61 93.39 92.73 92.99
ClaACC 92.25 92.43 93.25 92.81 92.31 92.61
Table 3. The effect of the opposite view prediction weight β under ModelNet10. α = 1.

Finally, we explore how the patch size affects the performance of HVP, as shown in Table 4. Besides the patch size of 128 used in the former experiments, we also employ patches with sizes of 160, 180, and 200. Based on views of size 224 in our experiments, increasing the patch size improves the results up to 94.16% with a size of 160, while accuracy decreases gradually for even larger patches. These results show that both a lack of semantic information in small patches and too much redundant information among big overlapping patches reduce the discriminability of the learned features: small patches make the patch features meaningless, while big patches provide so much redundant information that HVP can easily solve the patch prediction without needing to store information in the global feature that is shared among all predictions for each shape.

Size 128 160 180 200
InsACC 93.61 94.16 93.94 93.73
ClaACC 93.25 93.80 93.60 93.36
Table 4. The effect of the patch size under ModelNet10.

Ablation studies. Based on the former experiments, we further explore the contribution of each prediction involved in hierarchical view prediction, as highlighted by the results in Table 5. We first train HVP using only the current view prediction loss L_C, then we incrementally add the opposite view prediction loss L_O or the patch prediction loss L_P, and finally we compare these results with our best results employing all three losses.

Compared to the results with L_C only, each incrementally added loss improves the performance of HVP in terms of both average instance accuracy and average class accuracy. In addition, the patch prediction loss helps HVP improve more than the opposite view prediction loss. To better visualize the effect of these losses on HVP, we show the generated current views and generated opposite views involved in the experiments of Table 5. In Fig. 4, the generated views and their distances to the ground truth are shown. The added opposite view prediction loss can degrade the generated current view compared to using L_C alone, while the added patch prediction loss improves both the generated current view and the generated opposite view.

Loss     L_C     L_C+L_O   L_C+L_P   L_C+L_P+L_O
InsACC   86.78   88.66     90.31     94.16
ClaACC   86.41   87.80     90.10     93.80
Table 5. The contribution of each prediction involved in hierarchical view prediction under ModelNet10.
Figure 4. The effect of different losses on the generated current view and opposite view in a view pair under ModelNet10. The distance between each generated current view and ground truth is also shown under each generated current view, where the red color means the distance is larger than the distance in the first column in the same row while the green color means it becomes smaller.

Then, we highlight the effect of the prediction direction in Table 6. In all the former experiments, we employ bidirectional prediction, which provides more data to train HVP thanks to the symmetrical structure of HVP. In the following experiments, we train HVP using different kinds of single directional data, i.e., from O to I, from I to O, and a randomly selected single direction for hierarchical view prediction on each view. As shown by these results, bidirectional prediction learns better features than single direction prediction.

Direction   O→I     I→O     O→I or I→O   O→I and I→O
InsACC      93.83   93.61   93.72        94.16
ClaACC      93.38   93.16   93.25        93.80
Table 6. The effect of the prediction direction under ModelNet10.

Finally, we justify our view aggregation method in Table 7. We first train HVP without the global feature f. To obtain a global shape feature without f, we use pooling on the encodings h of all view pairs for each shape (recall that we use h to solve the view prediction tasks as shown in Fig. 1). The results using mean and max pooling without f in the first column of Table 7 exhibit a large drop in performance compared to our approach. To better understand this, we perform a second experiment where we train HVP with f as described previously, but we again use the mean- or max-pooled features h as the global shape feature. The results in the second column of Table 7 (first and second row) show that using f improves the performance of the mean- and max-pooled features h. However, our approach that uses f itself as the shape feature (third row) achieves even better performance, indicating that HVP’s implicit pooling is superior to explicit schemes such as mean or max pooling. Intuitively, this is because max pooling can lose some information in each view pair, while mean pooling weights all view pairs equally and hence fails to give more weight to highly distinctive view pairs.
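The explicit pooling baselines in Table 7 can be sketched in a few lines; h here stands for the per-view-pair encodings of one shape, which is an assumption about what is pooled.

```python
# A minimal sketch of the explicit pooling baselines in Table 7 (assumption: h holds the
# per-view-pair encodings of one shape).
import torch

h = torch.randn(10, 4096)           # encodings of the 10 view pairs of one shape
mean_pooled = h.mean(dim=0)         # "MeanPool": weights all view pairs equally
max_pooled = h.max(dim=0).values    # "MaxPool": keeps only the strongest response per dimension
```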

                 Without f           With f
Methods      InsACC   ClaACC     InsACC   ClaACC
MeanPool     89.65    87.96      90.75    90.58
MaxPool      90.31    89.84      91.19    90.92
Our          -        -          94.16    93.80
Table 7. The effect of view aggregation under ModelNet10.
Methods Supervised Instance Class
ORION(Sedaghat et al., 2017) Yes - 93.80
3DDescriptorNet(Xie et al., 2018) Yes - 92.40
Pairwise(Johns et al., 2016) Yes - 92.80
GIFT(Bai et al., 2017) Yes - 91.50
VoxNet(Maturana and Scherer, 2015) Yes - 92.00
VRN(Brock et al., 2016) Yes 93.80 -
PANORAMA(Sfikas et al., 2017) Yes - 91.12
LFD(Chen et al., 2003) No 79.90 -
Vconv-DAE(Sharma et al., 2016) No - 80.50
3DGAN(Wu et al., 2016) No - 91.00
VSL(Liu et al., 2018) No 91.00 -
NSampler(Remelli et al., 2019) No 88.70 95.30
LGAN(Achlioptas et al., 2018) No 95.30 -
LGAN(Achlioptas et al., 2018)(MN10) No 92.18 -
FNet(Yang et al., 2018) No 94.40 -
FNet(Yang et al., 2018)(MN10) No 91.85 -
VIPGAN(Han et al., 2019d) No 94.05 93.71
Our No 94.16 93.80
Our(MN40) No 92.18 91.57
Table 8. Classification comparison under MN10.

Classification. We compare HVP with the state-of-the-art methods in classification under ModelNet10 and ModelNet40 in Table 8 and Table 9, respectively. The parameters under ModelNet40 are the same as those of our best results under ModelNet10 in Table 7. Under ModelNet10, HVP outperforms all its unsupervised competitors trained with the same data, as shown by “Our”, which is also the best result compared to eight top ranked supervised methods. Under ModelNet40, HVP achieves results comparable to the state of the art among all the unsupervised and supervised competitors. For fair comparison, the result of VRN (Brock et al., 2016) is presented without ensemble learning. The result of RotationNet (Kanezaki et al., 2018) is presented with views taken by the default camera system orientation, which is identical to the others. Although the results of LGAN, FNet and NSampler are better than our results under ModelNet10, it is inconclusive whether they are better than ours. This is because these methods are trained under a version of ShapeNet55 that contains more than 57,000 3D shapes, including a number of 3D point clouds. However, only 51,679 3D shapes from ShapeNet55 are available for public download. Therefore, we cannot use the same amount of training data to train HVP for comparison with them. To perform a fair comparison with “Our”, we use the code of LGAN and FNet to conduct experiments using only shapes in ModelNet, as shown by the “(MN10)” and “(MN40)” entries of LGAN and FNet, which employ the same training data as ours. These results show that our method is superior to these methods.

With its ability to mine correlations inside each view, HVP does not rely on dense neighboring views to learn. To justify this, we train HVP using only two views of each 3D shape from ModelNet40. Although VIPGAN achieves the best results with 12 views under ModelNet40, HVP works much better than VIPGAN under a sparse neighboring view set, as shown by the comparison between “Our(Two)” and “VIPGAN(Two)” in Table 9.

Moreover, we evaluate HVP in unknown-test mode by learning features of ModelNet10 using parameters pretrained under ModelNet40 (“Our” in Table 9). As shown by “Our(MN40)” in Table 8, HVP can still produce good results. These results show that HVP has a remarkable transfer learning ability based on a comprehensive understanding of 3D shapes, which benefits from mining the correlation inside a view and between complementary views.

Methods Supervised Instance Class
MVCNN(Qi et al., 2016a) Yes 92.0 89.7
MVCNN-Sphere(Qi et al., 2016a) Yes 89.5 86.6
Pairwise(Johns et al., 2016) Yes - 90.70
GIFT(Bai et al., 2017) Yes - 89.50
PointNet++(Qi et al., 2017b) Yes 91.90 -
VRN(Brock et al., 2016) Yes 91.33 -
RotationNet(Kanezaki et al., 2018) Yes 92.37 -
PANORAMA(Sfikas et al., 2017) Yes - 90.70
T-L Network(Girdhar et al., 2016) No - 74.40
Vconv-DAE(Sharma et al., 2016) No - 75.50
3DGAN(Wu et al., 2016) No - 83.30
VSL(Liu et al., 2018) No 84.50 -
LGAN(Achlioptas et al., 2018) No 85.70 -
LGAN(Achlioptas et al., 2018)(MN40) No 87.27 -
FNet(Yang et al., 2018) No 88.40 -
FNet(Yang et al., 2018)(MN40) No 84.36 -
NSampler(Remelli et al., 2019) No 88.70 -
MRTNet(Gadelha et al., 2018) No 86.40 -
3DCapsule(Zhao et al., 2018) No 88.90 -
PointGrow(Sun et al., 2018) No 85.80 -
PCGAN(Li et al., 2018) No 87.80 -
MAPVAE(Han et al., 2019g) No 90.15 -
OrientNet(Poursaeed et al., 2020) No 90.75 -
VIPGAN(Han et al., 2019d) No 91.98 -
VIPGAN(Two)(Han et al., 2019d) No 4.05 2.50
Our No 90.72 87.95
Our(Two) No 88.33 83.53
Table 9. Classification comparison under MN40.

Retrieval. We further evaluate HVP in shape retrieval under ModelNet40 and ModelNet10 by comparing with the state-of-the-art methods in Table 10. These experiments are conducted on the test set, where each 3D shape is used as a query to retrieve from the rest of the shapes, and the retrieval performance is evaluated by mAP. In addition, we employ the same parameters as our best classification results in Table 9 and Table 8 to extract the global features for the retrieval experiments under ModelNet40 and ModelNet10, respectively. As shown in Table 10, our results outperform all the compared results under ModelNet10 and are comparable to the state of the art under ModelNet40. In addition, the available PR curves under ModelNet40 and ModelNet10 are compared in Fig. 5, which also demonstrates our strong performance in shape retrieval.
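A minimal sketch of this retrieval protocol is given below; it assumes Euclidean distances between the learned global features and computes mAP with each test shape as a query.

```python
# A minimal sketch of the retrieval protocol (assumptions: `feats` are the learned global
# features of the test shapes, `labels` their class labels, and distances are Euclidean).
import numpy as np

def retrieval_map(feats, labels):
    """Each shape queries all remaining shapes; return mean average precision (mAP)."""
    d = np.linalg.norm(feats[:, None] - feats[None, :], axis=-1)   # pairwise distances
    aps = []
    for q in range(len(feats)):
        order = np.argsort(d[q])
        order = order[order != q]                                  # exclude the query itself
        rel = (labels[order] == labels[q]).astype(float)           # relevance of each retrieved shape
        if rel.sum() == 0:
            continue
        prec = np.cumsum(rel) / (np.arange(len(rel)) + 1)          # precision at each rank
        aps.append((prec * rel).sum() / rel.sum())                 # average precision for this query
    return float(np.mean(aps))
```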

Methods Range MN40 MN10
SHD (Kazhdan et al., 2003) Test-Test 33.26 44.05
LFD (Chen et al., 2003) Test-Test 40.91 49.82
3DShapeNets (Wu et al., 2015) Test-Test 49.23 68.26
GeomImage (Sinha et al., 2016) Test-Test 51.30 74.90
DeepPano (Shi et al., 2015) Test-Test 76.81 84.18
MVCNN (Su et al., 2015) Test-Test 79.50 -
PANORAMA (Sfikas et al., 2017) Test-Test 83.45 87.39
GIFT (Bai et al., 2017) Random 81.94 91.12
Triplet (He et al., 2018) Test-Test 88.00 -
SliceVoxel (Miyagi and Aono, 2017) Test-Test 77.48 85.34
SV2SL (Han et al., 2019e) Test-Test 89.00 89.55
VIPGAN (Han et al., 2019d) Test-Test 89.23 90.69
Serial (Xu et al., 2019) Test-Test 87.05 -
Ours Test-Test 87.13 91.19
Table 10. The comparison of retrieval in terms of mAP under ModelNet40 and ModelNet10.
Figure 5. The PR curve comparison under ModelNet40 and ModelNet10.
Figure 6. The generated current view and generated opposite view of a shape are visualized.

Visualization. For our classification results, we show the generated current views and the generated opposite views of all ten shape classes under ModelNet10, where two shapes are involved in each shape class, as shown in Fig. 6. These results show that HVP can generate plausible views based on a comprehensive understanding of 3D shapes. Another interesting observation is that the opposite view can be generated better than the current view. In addition, we show the confusion matrices of our classification results under ModelNet10 and ModelNet40 in Fig. 7 and Fig. 8, respectively. In each confusion matrix, a diagonal element is the percentage of 3D shapes that are correctly classified, while the other elements in the same row are the percentages of 3D shapes wrongly classified into other shape classes. The large diagonal elements in each confusion matrix show that HVP is able to learn highly discriminative features for 3D shapes, which enables HVP to achieve high performance in classifying large-scale 3D shape sets.

Figure 7. The confusion matrix of our results in 3D shape classification under ModelNet10.
Figure 8. The confusion matrix of our results in 3D shape classification under ModelNet40.

For our retrieval results, we show the top 5 retrieved 3D shapes for some queries from ModelNet10 and ModelNet40 in Fig. 9. According to the distance between the query and each retrieved shape (shown under each retrieved shape), we find HVP is capable of learning features to distinguish 3D shapes in detail, which enables retrieving 3D shapes with similar structures.

Figure 9. The top 5 retrieved 3D shapes for some queries from (a) ModelNet40 and (b) ModelNet10 are demonstrated. The distance between the query and each retrieved shape is also shown.

5. Conclusions

We proposed HVP for unsupervised 3D global feature learning from unordered views of 3D shapes. By implementing a novel hierarchical view prediction, HVP successfully mines highly discriminative information among unordered views in an unsupervised manner. Our results show that HVP effectively learns to hierarchically make patch predictions, current view prediction and opposite view prediction in each view pair, and then comprehensively aggregates the knowledge learned from the predictions in all view pairs into global features. HVP can not only learn from both unordered and ordered view sets, but also works well under sparse neighboring view sets, which eliminates the requirement of mining “supervised” information from dense neighboring views. Our results show that HVP outperforms its unsupervised counterparts, as well as some top ranked supervised methods, under large-scale benchmarks in shape classification and retrieval.

References

  • Achlioptas et al. (2018) Panos Achlioptas, Olga Diamanti, Ioannis Mitliagkas, and Leonidas J. Guibas. 2018. Learning Representations and Generative Models for 3D Point Clouds. In International Conference on Machine Learning. 40–49.
  • Bai et al. (2017) Song Bai, Xiang Bai, Zhichao Zhou, Zhaoxiang Zhang, and Longin Jan Latecki. 2017. GIFT: Towards Scalable 3D Shape Retrieval. IEEE Transaction on Multimedia 19, 6 (2017), 1257–1271.
  • Brock et al. (2016) Andrew Brock, Theodore Lim, J.M. Ritchie, and Nick Weston. 2016. Generative and discriminative voxel modeling with convolutional neural networks. In 3D Deep Learning Workshop (NIPS).
  • Chen et al. (2021) Chao Chen, Zhizhong Han, Yu-Shen Liu, and Matthias Zwicker. 2021. Unsupervised Learning of Fine Structure Generation for 3D Point Clouds by 2D Projections Matching. In IEEE International Conference on Computer Vision.
  • Chen et al. (2003) Dingyun Chen, Xiaopei Tian, Yute Shen, and Ming Ouhyoung. 2003. On visual similarity based 3D model retrieval. Computer Graphics Forum 22, 3 (2003), 223–232.
  • Choy et al. (2016) Christopher Bongsoo Choy, Danfei Xu, JunYoung Gwak, Kevin Chen, and Silvio Savarese. 2016. 3D-R2N2: A Unified Approach for Single and Multi-view 3D Object Reconstruction. In European Conference on Computer Vision. 628–644.
  • Dosovitskiy et al. (2017) Alexey Dosovitskiy, Jost Tobias Springenberg, Maxim Tatarchenko, and Thomas Brox. 2017. Learning to Generate Chairs, Tables and Cars with Convolutional Networks. IEEE Transactions on Pattern Analysis and Machine Intelligence 39, 4 (2017), 692–705.
  • Du et al. (2021) Bi’an Du, Xiang Gao, Wei Hu, and Xin Li. 2021. Self-Contrastive Learning with Hard Negative Sampling for Self-supervised Point Cloud Learning. CoRR abs/2107.01886 (2021).
  • Flynn et al. (2016) John Flynn, Ivan Neulander, James Philbin, and Noah Snavely. 2016. DeepStereo: Learning to Predict New Views From the World’s Imagery. In The IEEE Conference on Computer Vision and Pattern Recognition.
  • Gadelha et al. (2018) Matheus Gadelha, Rui Wang, and Subhransu Maji. 2018. Multiresolution Tree Networks for 3D Point Cloud Processing. In European Conference on Computer vision.
  • Gao et al. (2020) Xiang Gao, Wei Hu, and Guo-Jun Qi. 2020. GraphTER: Unsupervised Learning of Graph Transformation Equivariant Representations via Auto-Encoding Node-wise Transformations. In IEEE Conference on Computer Vision and Pattern Recognition.
  • Gao et al. (2021) Xiang Gao, Wei Hu, and Guo-Jun Qi. 2021. Self-Supervised Multi-View Learning via Auto-Encoding 3D Transformations. ArXiv abs/2103.00787 (2021).
  • Girdhar et al. (2016) Rohit Girdhar, David F. Fouhey, Mikel Rodriguez, and Abhinav Gupta. 2016. Learning a Predictable and Generative Vector Representation for Objects. In Proceedings of European Conference on Computer Vision. 484–499.
  • Han et al. (2020a) Zhizhong Han, Chao Chen, Yu-Shen Liu, and Matthias Zwicker. 2020a. DRWR: A Differentiable Renderer without Rendering for Unsupervised 3D Structure Learning from Silhouette Images. In International Conference on Machine Learning.
  • Han et al. (2019b) Zhizhong Han, Xinhai Liu, Yu-Shen Liu, and Matthias Zwicker. 2019b. Parts4Feature: Learning 3D Global Features from Generally Semantic Parts in Multiple Views. In IJCAI.
  • Han et al. (2017) Zhizhong Han, Zhenbao Liu, Junwei Han, Chi-Man Vong, Shuhui Bu, and C.L.Philip Chen. 2017. Mesh Convolutional Restricted Boltzmann Machines for Unsupervised Learning of Features With Structure Preservation on 3D Meshes. IEEE Transactions on Neural Network and Learning Systems 28, 10 (2017), 2268–2281.
  • Han et al. (2019a) Zhizhong Han, Zhenbao Liu, Junwei Han, Chi-Man Vong, Shuhui Bu, and C.L.P. Chen. 2019a. Unsupervised Learning of 3D Local Features from Raw Voxels Based on A Novel Permutation Voxelization Strategy. IEEE Transactions on Cybernetics 49, 2 (2019), 481–494.
  • Han et al. (2016) Zhizhong Han, Zhenbao Liu, Junwei Han, Chi-Man Vong, Shuhui Bu, and Xuelong Li. 2016. Unsupervised 3D Local Feature Learning by Circle Convolutional Restricted Boltzmann Machine. IEEE Transactions on Image Processing 25, 11 (2016), 5331–5344.
  • Han et al. (2018) Zhizhong Han, Zhenbao Liu, Chi-Man Vong, Yu-Shen Liu, Shuhui Bu, Junwei Han, and CL Philip Chen. 2018. Deep Spatiality: Unsupervised Learning of Spatially-Enhanced Global and Local 3D Features by Deep Neural Network with Coupled Softmax. IEEE Transactions on Image Processing 27, 6 (2018), 3049–3063.
  • Han et al. (2019c) Zhizhong Han, Honglei Lu, Zhenbao Liu, Chi-Man Vong, Yu-Shen Liu, Matthias Zwicker, Junwei Han, and C.L. Philip Chen. 2019c. 3D2SeqViews: Aggregating Sequential Views for 3D Global Feature Learning by CNN With Hierarchical Attention Aggregation. IEEE Transactions on Image Processing 28, 8 (2019), 3986–3999.
  • Han et al. (2020b) Zhizhong Han, Baorui Ma, Yu-Shen Liu, and Matthias Zwicker. 2020b. Reconstructing 3D Shapes from Multiple Sketches using Direct Shape Optimization. IEEE Transactions on Image Processing 29 (2020), 8721–8734.
  • Han et al. (2020c) Zhizhong Han, Guanhui Qiao, Yu-Shen Liu, and Matthias Zwicker. 2020c. SeqXY2SeqZ: Structure Learning for 3D Shapes by Sequentially Predicting 1D Occupancy Segments From 2D Coordinates. In European Conference on Computer Vision.
  • Han et al. (2019d) Zhizhong Han, Mingyang Shang, Yu-Shen Liu, and Matthias Zwicker. 2019d. View Inter-Prediction GAN: Unsupervised Representation Learning for 3D Shapes by Learning Global Shape Memories to Support Local View Predictions. In AAAI. 8376–8384.
  • Han et al. (2019e) Zhizhong Han, Mingyang Shang, Zhenbao Liu, Chi-Man Vong, Yu-Shen Liu, Matthias Zwicker, Junwei Han, and C.L. Philip Chen. 2019e. SeqViews2SeqLabels: Learning 3D Global Features via Aggregating Sequential Views by RNN With Attention. IEEE Transactions on Image Processing 28, 2 (2019), 685–672.
  • Han et al. (2019f) Zhizhong Han, Mingyang Shang, Xiyang Wang, Yu-Shen Liu, and Matthias Zwicker. 2019f. Y2Seq2Seq: Cross-Modal Representation Learning for 3D Shape and Text by Joint Reconstruction and Prediction of View and Word Sequences. In AAAI. 126–133.
  • Han et al. (2019g) Zhizhong Han, Xiyang Wang, Yu-Shen Liu, and Matthias Zwicker. 2019g. Multi-Angle Point Cloud-VAE:Unsupervised Feature Learning for 3D Point Clouds from Multiple Angles by Joint Self-Reconstruction and Half-to-Half Prediction. In IEEE International Conference on Computer Vision.
  • Han et al. (2019h) Zhizhong Han, Xiyang Wang, Chi-Man Vong, Yu-Shen Liu, Matthias Zwicker, and C.L. Philip Chen. 2019h. 3DViewGraph: Learning Global Features for 3D Shapes from A Graph of Unordered Views with Attention. In IJCAI.
  • Hassani and Haley (2019) Kaveh Hassani and Mike Haley. 2019. Unsupervised Multi-Task Feature Learning on Point Clouds. In IEEE International Conference on Computer Vision.
  • He et al. (2018) Xinwei He, Yang Zhou, Zhichao Zhou, Song Bai, and Xiang Bai. 2018. Triplet-Center Loss for Multi-View 3D Object Retrieval. In The IEEE Conference on Computer Vision and Pattern Recognition.
  • Hu et al. (2020) Tao Hu, Zhizhong Han, and Matthias Zwicker. 2020. 3D Shape Completion with Multi-view Consistent Inference. In AAAI.
  • Huang et al. (2017) H. Huang, E. Kalegorakis, S. Chaudhuri, D. Ceylan, V. Kim, and E. Yumer. 2017. Learning Local Shape Descriptors with View-based Convolutional Neural Networks. ACM Transactions on Graphics (2017).
  • Ji et al. (2017) Dinghuang Ji, Junghyun Kwon, Max McFarland, and Silvio Savarese. 2017. Deep View Morphing. The IEEE Conference on Computer Vision and Pattern Recognition (2017).
  • Jiang et al. (2020) Yue Jiang, Dantong Ji, Zhizhong Han, and Matthias Zwicker. 2020. SDFDiff: Differentiable Rendering of Signed Distance Fields for 3D Shape Optimization. In IEEE Conference on Computer Vision and Pattern Recognition.
  • Johns et al. (2016) Edward Johns, Stefan Leutenegger, and Andrew J. Davison. 2016. Pairwise Decomposition of Image Sequences for Active Multi-view Recognition. In IEEE Conference on Computer Vision and Pattern Recognition. 3813–3822.
  • Kanezaki et al. (2018) Asako Kanezaki, Yasuyuki Matsushita, and Yoshifumi Nishida. 2018. RotationNet: Joint Object Categorization and Pose Estimation Using Multiviews from Unsupervised Viewpoints. In IEEE Conference on Computer Vision and Pattern Recognition.
  • Kazhdan et al. (2003) Michael Kazhdan, Thomas Funkhouser, and Szymon Rusinkiewicz. 2003. Rotation invariant spherical harmonic representation of 3D shape descriptors. In Proceedings of Eurographics Symposium on Geometry Processing. 156–165.
  • Li et al. (2018) Chun-Liang Li, Manzil Zaheer, Yang Zhang, Barnabas Poczos, and Ruslan Salakhutdinov. 2018. Point Cloud GAN. CoRR abs/1810.05795 (2018).
  • Liu et al. (2018) Shikun Liu, C. Lee Giles, and Alexander G. Ororbia II. 2018. Learning a Hierarchical Latent-Variable Model of 3D Shapes. In 2018 International Conference on 3D Vision (3DV).
  • Liu et al. (2020) Xinhai Liu, Zhizhong Han, Fangzhou Hong, Yu-Shen Liu, and Matthias Zwicker. 2020. LRC-Net: Learning discriminative features on point clouds by encoding local region contexts. Computer Aided Geometric Design 79 (2020), 101859.
  • Liu et al. (2019a) Xinhai Liu, Zhizhong Han, Yu-Shen Liu, and Matthias Zwicker. 2019a. Point2Sequence: Learning the Shape Representation of 3D Point Clouds with an Attention-based Sequence to Sequence Network. In AAAI. 8778–8785.
  • Liu et al. (2021) Xinhai Liu, Zhizhong Han, Yu-Shen Liu, and Matthias Zwicker. 2021. Fine-Grained 3D Shape Classification With Hierarchical Part-View Attention. IEEE Transactions on Image Processing 30 (2021), 1744–1758.
  • Liu et al. (2019b) Xinhai Liu, Zhizhong Han, Wen Xin, Yu-Shen Liu, and Matthias Zwicker. 2019b. L2G Auto-encoder: Understanding Point Clouds by Local-to-Global Reconstruction with Hierarchical Self-Attention. In ACM International Conference on Multimedia.
  • Ma et al. (2021) Baorui Ma, Zhizhong Han, Yu-Shen Liu, and Matthias Zwicker. 2021. Neural-Pull: Learning Signed Distance Functions from Point Clouds by Learning to Pull Space onto Surfaces. In International Conference on Machine Learning.
  • Maturana and Scherer (2015) D. Maturana and S. Scherer. 2015. VoxNet: A 3D Convolutional Neural Network for real-time object recognition. In International Conference on Intelligent Robots and Systems. 922–928.
  • Miyagi and Aono (2017) Ryo Miyagi and Masaki Aono. 2017. Sliced voxel representations with LSTM and CNN for 3D shape recognition. In Asia-Pacific Signal and Information Processing Association Annual Summit and Conference.
  • Pathak et al. (2016) Deepak Pathak, Philipp Krähenbühl, Jeff Donahue, Trevor Darrell, and Alexei Efros. 2016. Context Encoders: Feature Learning by Inpainting. In Computer Vision and Pattern Recognition.
  • Poursaeed et al. (2020) Omid Poursaeed, Tianxing Jiang, Quintessa Qiao, Nayun Xu, and Vladimir G. Kim. 2020. Self-supervised Learning of Point Clouds via Orientation Estimation. 3DV (2020).
  • Qi et al. (2017a) Charles R. Qi, Hao Su, Kaichun Mo, and Leonidas J. Guibas. 2017a. PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation. In IEEE Conference on Computer Vision and Pattern Recognition.
  • Qi et al. (2016a) C R Qi, H Su, and M Niebner. 2016a. Volumetric and Multi-view CNNs for Object Classification on 3D Data. In IEEE Conference on Computer Vision and Pattern Recognition. 5648–5656.
  • Qi et al. (2016b) Charles Ruizhongtai Qi, Hao Su, Matthias Nießner, Angela Dai, Mengyuan Yan, and Leonidas Guibas. 2016b. Volumetric and Multi-View CNNs for Object Classification on 3D Data. In IEEE Conference on Computer Vision and Pattern Recognition. 5648–5656.
  • Qi et al. (2017b) Charles Ruizhongtai Qi, Li Yi, Hao Su, and Leonidas J. Guibas. 2017b. PointNet++: Deep Hierarchical Feature Learning on Point Sets in a Metric Space. In Advances in Neural Information Processing Systems. 5105–5114.
  • Remelli et al. (2019) Edoardo Remelli, Pierre Baque, and Pascal Fua. 2019. NeuralSampler: Euclidean Point Cloud Auto-Encoder and Sampler. CoRR abs/1901.09394 (2019).
  • Rezende et al. (2016) Danilo Jimenez Rezende, S. M. Ali Eslami, Shakir Mohamed, Peter Battaglia, Max Jaderberg, and Nicolas Heess. 2016. Unsupervised Learning of 3D Structure from Images. In Advances in Neural Information Processing Systems. 4997–5005.
  • Sauder and Sievers (2019) Jonathan Sauder and Bjarne Sievers. 2019. Self-Supervised Deep Learning on Point Clouds by Reconstructing Space. In Advances in Neural Information Processing Systems. 12962–12972.
  • Savva et al. (2016) M. Savva, F. Yu, Hao Su, M. Aono, B. Chen, D. Cohen-Or, W. Deng, Hang Su, S. Bai, and X. Bai. 2016. SHREC’16 Track Large-Scale 3D Shape Retrieval from ShapeNet Core55. In EG 2016 workshop on 3D Object Recognition.
  • Sedaghat et al. (2017) N. Sedaghat, M. Zolfaghari, E. Amiri, and T. Brox. 2017. Orientation-boosted voxel nets for 3D object recognition. In British Machine Vision Conference.
  • Sfikas et al. (2017) Konstantinos Sfikas, Theoharis Theoharis, and Ioannis Pratikakis. 2017. Exploiting the PANORAMA Representation for Convolutional Neural Network Classification and Retrieval. In Eurographics Workshop on 3D Object Retrieval. 1–7.
  • Sharma et al. (2016) Abhishek Sharma, Oliver Grau, and Mario Fritz. 2016. VConv-DAE: Deep Volumetric Shape Learning Without Object Labels. In Proceedings of European Conference on Computer Vision. 236–250.
  • Shi et al. (2015) B. Shi, S. Bai, Z. Zhou, and X. Bai. 2015. DeepPano: Deep panoramic representation for 3D shape recognition. IEEE Signal Processing Letters 22, 12 (2015), 2339–2343.
  • Sinha et al. (2016) Ayan Sinha, Jing Bai, and Karthik Ramani. 2016. Deep Learning 3D Shape Surfaces Using Geometry Images. In European Conference on Computer Vision. 223–240.
  • Su et al. (2015) Hang Su, Subhransu Maji, Evangelos Kalogerakis, and Erik G. Learned-Miller. 2015. Multi-view convolutional neural networks for 3D shape recognition. In International Conference on Computer Vision. 945–953.
  • Sun et al. (2018) Yongbin Sun, Yue Wang, Ziwei Liu, Joshua E. Siegel, and Sanjay E. Sarma. 2018. PointGrow: Autoregressively Learned Point Cloud Generation with Self-Attention. CoRR abs/1810.05591 (2018). arXiv:1810.05591 http://arxiv.org/abs/1810.05591
  • Wang et al. (2017b) Chu Wang, Marcello Pelillo, and Kaleem Siddiqi. 2017b. Dominant Set Clustering and Pooling for Multi-View 3D Object Recognition. In Proceedings of British Machine Vision Conference.
  • Wang et al. (2017a) Peng-Shuai Wang, Yang Liu, Yu-Xiao Guo, Chun-Yu Sun, and Xin Tong. 2017a. O-CNN: Octree-based Convolutional Neural Networks for 3D Shape Analysis. ACM Transactions on Graphics 36, 4 (2017), 72:1–72:11.
  • Wen et al. (2021a) Xin Wen, Zhizhong Han, Yan-Pei Cao, Pengfei Wan, Wen Zheng, and Yu-Shen Liu. 2021a. Cycle4Completion: Unpaired Point Cloud Completion using Cycle Transformation with Missing Region Coding. In IEEE Conference on Computer Vision and Pattern Recognition.
  • Wen et al. (2020a) Xin Wen, Zhizhong Han, Xinhai Liu, and Yu-Shen Liu. 2020a. Point2SpatialCapsule: Aggregating Features and Spatial Relationships of Local Regions on Point Clouds Using Spatial-Aware Capsules. IEEE Transactions on Image Processing 29 (2020), 8855–8869.
  • Wen et al. (2020b) Xin Wen, Zhizhong Han, Geunhyuk Youk, and Yu-Shen Liu. 2020b. CF-SIS: Semantic-Instance Segmentation of 3D Point Clouds by Context Fusion with Self-Attention. In ACM International Conference on Multimedia.
  • Wen et al. (2020c) Xin Wen, Tianyang Li, Zhizhong Han, and Yu-Shen Liu. 2020c. Point Cloud Completion by Skip-attention Network with Hierarchical Folding. In IEEE Conference on Computer Vision and Pattern Recognition.
  • Wen et al. (2021b) Xin Wen, Peng Xiang, Zhizhong Han, Yan-Pei Cao, Pengfei Wan, Wen Zheng, and Yu-Shen Liu. 2021b. PMP-Net: Point Cloud Completion by Learning Multi-step Point Moving Paths. In IEEE Conference on Computer Vision and Pattern Recognition.
  • William et al. (2016) Lotter William, Kreiman Gabriel, and Cox David. 2016. Unsupervised Learning of Visual Structure using Predictive Generative Networks. In ICLR.
  • Wu et al. (2016) Jiajun Wu, Chengkai Zhang, Tianfan Xue, Bill Freeman, and Josh Tenenbaum. 2016. Learning a Probabilistic Latent Space of Object Shapes via 3D Generative-Adversarial Modeling. In Advances in Neural Information Processing Systems. 82–90.
  • Wu et al. (2015) Zhirong Wu, Shuran Song, Aditya Khosla, Fisher Yu, Linguang Zhang, Xiaoou Tang, and Jianxiong Xiao. 2015. 3D ShapeNets: A Deep Representation for Volumetric Shapes. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. 1912–1920.
  • Xiang et al. (2021) Peng Xiang, Xin Wen, Yu-Shen Liu, Yan-Pei Cao, Pengfei Wan, Wen Zheng, and Zhizhong Han. 2021. SnowflakeNet: Point Cloud Completion by Snowflake Point Deconvolution with Skip-Transformer. In IEEE International Conference on Computer Vision.
  • Xie et al. (2018) Jianwen Xie, Zilong Zheng, Ruiqi Gao, Wenguan Wang, Song-Chun Zhu, and Ying Nian Wu. 2018. Learning Descriptor Networks for 3D Shape Synthesis and Analysis. In IEEE Conference on Computer Vision and Pattern Recognition.
  • Xu et al. (2019) Cheng Xu, Zhaoqun Li, Qiang Qiu, Biao Leng, and Jingfei Jiang. 2019. Enhancing 2D Representation via Adjacent Views for 3D Shape Retrieval. In The IEEE International Conference on Computer Vision.
  • Yan et al. (2016) Xinchen Yan, Jimei Yang, Ersin Yumer, Yijie Guo, and Honglak Lee. 2016. Perspective Transformer Nets: Learning Single-View 3D Object Reconstruction without 3D Supervision. In Advances in Neural Information Processing Systems. 1696–1704.
  • Yang et al. (2018) Yaoqing Yang, Chen Feng, Yiru Shen, and Dong Tian. 2018. FoldingNet: Point Cloud Auto-encoder via Deep Grid Deformation. In CVPR.
  • Zhang and Zhu (2019) L. Zhang and Z. Zhu. 2019. Unsupervised Feature Learning for Point Cloud Understanding by Contrasting and Clustering Using Graph Convolutional Neural Networks. In 3DV. 395–404.
  • Zhao et al. (2018) Yongheng Zhao, Tolga Birdal, Haowen Deng, and Federico Tombari. 2018. 3D Point-Capsule Networks. CoRR abs/1812.10775 (2018).
  • Zhou et al. (2016) Tinghui Zhou, Shubham Tulsiani, Weilun Sun, Jitendra Malik, and Alexei A. Efros. 2016. View Synthesis by Appearance Flow. In European Conference on Computer Vision, Vol. 9908. 286–301.