Learning 3D global features from multiple views is an effective approach for 3D shape understanding. A widely adopted strategy is to leverage deep neural networks to aggregate features hierarchically extracted from pixel-level information in each view. However, current approaches cannot employ part-level information. In this paper, we show for the first time how extracting part-level information over multiple views can be leveraged to learn 3D global features. We demonstrate that this approach further increases the discriminability of 3D global features and outperforms state-of-the-art methods on large-scale 3D shape benchmarks.
It is intuitive that learning to detect and localize semantic parts could help classify shapes more accurately. Previous studies on fine-grained image recognition also employ this intuition by combining local part detection and global feature learning. To learn highly discriminative features that distinguish subordinate categories, these methods first detect important parts, such as heads, wings and tails of birds, and then collect these part features into a global feature. However, these methods do not tackle the challenges that we face in the 3D domain. First, these methods require ground truth parts with specified semantic labels, while 3D shape classification benchmarks do not provide such labels. Second, the part detection knowledge learned by these methods cannot be transferred for general purpose use, such as non-fine-grained image classification, since it is specific to particular shape classes. Third, these methods are not designed to aggregate part information from multiple images, which correspond to multiple views of a 3D shape in our scenario. Therefore, simultaneously learning part detection and aggregating part-level information from multiple views becomes a unique challenge in 3D global feature learning.
To address these issues, we propose Parts4Feature, a deep neural network to learn 3D global features from semantic parts in multiple views. With a novel definition of generally semantic parts (GSPs), Parts4Feature learns to detect GSPs in multiple views from different 3D shape segmentation benchmarks. Moreover, it learns a 3D global feature from shape classification data sets by transferring the learned knowledge of part detection and leveraging the detected GSPs in multiple views. Specifically, Parts4Feature is mainly composed of a local part detection branch and a global feature learning branch. Both branches share a region proposal module, which enables locally and globally discriminative information to reinforce each other.
The local part detection branch employs a novel neural network derived from Fast R-CNN [Girshick2015] to learn to detect and localize GSPs in multiple views. In addition, the global feature learning branch incrementally aggregates the detected parts in terms of learned part patterns with multi-attention. We propose a novel multi-attention mechanism to further increase the discriminability of learned features by not only highlighting the distinctive parts and part patterns but also suppressing the ambiguous ones. Our novel view aggregation based on semantic parts prevents the information loss caused by the widely used pooling, and it can capture each detected part in more detail. In summary, our contributions are as follows:
- We propose Parts4Feature, a novel deep neural network to learn 3D global features from semantic parts in multiple views, by combining part detection and global feature learning together.
- We show that the novel structure of Parts4Feature is capable of learning and transferring universal knowledge of part detection, which allows Parts4Feature to leverage discriminative information from another source (3D shape segmentation) for 3D global feature learning.
- Our global feature learning branch introduces a novel view aggregation based on semantic parts, where the proposed multi-attention further improves the discriminability of learned features.
2 Related work
Mesh-based deep learning models. To directly learn 3D features from 3D meshes, novel concepts such as circle convolution [Han and others2016] and mesh convolution [Han and others2017] were proposed for use in deep learning models. These methods aim to learn global or local features from the geometry and spatial information on meshes to understand 3D shapes.
Voxel-based deep learning models. Similar to images, voxels have a regular structure that can be learned by deep learning models, such as CRBM [Wu and others2015], fully convolutional denoising autoencoders [Sharma et al.2016], CNNs [Qi et al.2016], and GAN [Wu and others2016]. These methods usually employ 3D convolution to better capture the contextual information in local regions. Moreover, Tags2Parts [Muralikrishnan et al.2018] discovered semantic regions that strongly correlate with user-prescribed tags by learning from voxels using a novel U-Net.
Deep learning models for point clouds. As pioneering work, PointNet++ [Qi and others2017] inspired various supervised methods to understand point clouds. Through self-reconstruction, FoldingNet [Yang et al.2018] and LatentGAN [Achlioptas and others2018] learned global features with different unsupervised strategies.
View-based deep learning models. Similar to the light field descriptor (LFD), GIFT [Bai and others2017] measured the difference between two 3D shapes using their corresponding view feature sets. Moreover, pooling panorama views [Shi and others2015, Sfikas and others2017] or rendered views [Su and others2015, Han et al.2019] is more widely used to learn global features. Different improvements based on camera trajectories [Johns et al.2016], view aggregation [Wang et al.2017, Han and others2019a], and pose estimation [Kanezaki et al.2018] have also been presented. However, these methods cannot leverage part-level information. In contrast, Parts4Feature learns and transfers universal knowledge of part detection to facilitate 3D global feature learning.
Overview. Parts4Feature consists of three main components as shown in Fig. 1: a local part detection branch, a global feature learning branch, and a region proposal module, where the region proposal module is shared by both branches and receives multiple views of a 3D shape as input. We train Parts4Feature simultaneously under a local part detection benchmark and a global feature learning benchmark. The local part detection branch learns to identify GSPs in multiple views under the local part detection benchmark, while the global feature learning branch learns a global feature from the detected GSPs in multiple views under the global feature learning benchmark.
For a 3D shape in either benchmark, we capture multiple views around it, forming a view sequence. First, the region proposal module provides the features of the regions proposed in each view. Then, by analyzing these region features, the local part detection branch learns to predict what and where GSPs are in multiple views. Finally, by aggregating the features of the top region proposals in each view, the global feature learning branch produces the global feature of the shape. Our approach to aggregating region proposal features is based on semantic part patterns with multi-attention for 3D shape classification, where the patterns are learned across all views in the global feature learning benchmark.
Generally semantic parts. We define a GSP as a local part in any semantic part category of any shape class, such as engines of airplanes or wheels of cars. Although our concept of GSPs simplifies all semantic part categories into a binary label by only determining whether a part is semantic or not, this allows us to exploit discriminative, part-level information from several different 3D shape segmentation benchmarks for global feature learning.
We use three 3D shape segmentation benchmarks involved in [Kalogerakis and others2017], namely ShapeNetCore, Labeled-PSB, and COSEG, to construct the local part detection benchmark and provide ground truth GSPs. We also split the 3D shapes in each segmentation benchmark into training and test sets according to [Kalogerakis and others2017]. Fig. 2 shows the construction of ground truth GSPs. For each view of a 3D shape shown in Fig. 2(a), we obtain its ground truth segmentation, visualized in Fig. 2(b), from the shape segmentation benchmark. Then, we isolate each part category to precisely locate GSPs, as shown from Fig. 2(c) to Fig. 2(f). We emphasize each isolated part category in blue, and we locate the corresponding GSPs by computing the bounding box (red) of the colored regions. Finally, we show all GSPs in the view in Fig. 2(g). We collect all GSPs of the shape by repeating these procedures in all its views. Note that we omit small GSPs (for example the landing gear in Fig. 2(f)) whose bounding boxes are smaller than 0.45 of the max bounding box in the same part category.
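The construction of ground truth GSPs in one view can be sketched as follows. This is a minimal illustration rather than the authors' implementation: it assumes each part is given as a binary pixel mask, that bounding box size is compared by area, and the function names `bounding_box`, `box_area` and `filter_small_gsps` are hypothetical.

```python
def bounding_box(mask):
    """Return (x_min, y_min, x_max, y_max) of the non-zero pixels in a 2D mask."""
    coords = [(x, y) for y, row in enumerate(mask) for x, on in enumerate(row) if on]
    if not coords:
        return None  # the part is not visible in this view
    xs, ys = zip(*coords)
    return (min(xs), min(ys), max(xs), max(ys))


def box_area(box):
    """Area in pixels of an inclusive (x_min, y_min, x_max, y_max) box."""
    x0, y0, x1, y1 = box
    return (x1 - x0 + 1) * (y1 - y0 + 1)


def filter_small_gsps(boxes_by_category, ratio=0.45):
    """Drop boxes smaller than `ratio` times the largest box of the same category.

    Assumption: "smaller" is measured by box area; the source does not say.
    """
    kept = {}
    for category, boxes in boxes_by_category.items():
        if not boxes:
            kept[category] = []
            continue
        largest = max(box_area(b) for b in boxes)
        kept[category] = [b for b in boxes if box_area(b) >= ratio * largest]
    return kept
```

Running `bounding_box` on every isolated part mask of a view and then `filter_small_gsps` on the result mirrors the steps shown in Fig. 2(c) to Fig. 2(g).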
Region proposal module. This module provides region proposals in all views together with their features, which are then forwarded to the local part detection and global feature learning branches. Shared by all views of a shape, the module is composed of a Deep Convolutional Network (DCN) and a Region Proposal Network (RPN) with Region of Interest (RoI) pooling [Girshick2015].
The DCN is modified from a VGGCNNM1024 network [Chatfield et al.2014], and it produces a feature for each view in the form of 512 feature maps. Based on this feature, RPN then proposes regions in a sliding-window manner. At each sliding-window location, centered at each pixel of the feature maps, a region is proposed by determining its location and predicting GSP probabilities with an anchor. The location is a four-dimensional vector representing the center, width and height of the bounding box. We use 6 scales and 3 aspect ratios to yield 6 × 3 = 18 anchors, which ensures a wide range of sizes to accommodate region proposals for GSPs that may be partially occluded in some views. The 6 scales are defined relative to the size of the views. Altogether, the anchors at all sliding-window locations determine the regions proposed in each view.
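The anchor layout at one sliding-window location can be sketched as below. The concrete scale and aspect ratio values are not recoverable from the text, so `SCALES` and `RATIOS` are hypothetical placeholders; only the counts (6 and 3) come from the description above, and the area-preserving parameterization is an assumption borrowed from common RPN practice.

```python
import math

# Hypothetical placeholder values: only the counts (6 scales, 3 ratios) are from the text.
SCALES = (0.1, 0.2, 0.3, 0.4, 0.5, 0.6)  # relative to the view size
RATIOS = (0.5, 1.0, 2.0)                 # width / height


def make_anchors(cx, cy, view_size, scales=SCALES, aspect_ratios=RATIOS):
    """Return one (cx, cy, w, h) anchor per (scale, aspect ratio) pair."""
    anchors = []
    for s in scales:
        side = s * view_size  # side length of the square anchor at this scale
        for r in aspect_ratios:
            # Keep the area fixed at side**2 while changing the shape.
            w = side * math.sqrt(r)
            h = side / math.sqrt(r)
            anchors.append((cx, cy, w, h))
    return anchors
```

Each anchor uses the same center, width and height parameterization as the four-dimensional location vector described above.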
To train RPN to predict GSP probabilities, we assign a binary label to each region indicating whether it is a GSP. We assign a positive label if the Intersection-over-Union (IoU) overlap between the region and any ground-truth GSP in the same view is higher than a threshold, and we use a negative label otherwise. In each view, we apply RoI pooling over the proposed regions on the feature maps. Hence, the features of all region proposals are vectors of the same dimension, which we forward to the local part detection branch. In addition, we provide the features of the top regions according to their GSP probability to the global feature learning branch.
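The labeling rule can be illustrated with a short sketch. It assumes boxes are stored as (center x, center y, width, height), matching the location vector used by RPN; the helper names are hypothetical and the threshold is left as a parameter.

```python
def to_corners(box):
    """Convert (cx, cy, w, h) to (x_min, y_min, x_max, y_max)."""
    cx, cy, w, h = box
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)


def iou(box_a, box_b):
    """Intersection-over-Union of two (cx, cy, w, h) boxes."""
    ax0, ay0, ax1, ay1 = to_corners(box_a)
    bx0, by0, bx1, by1 = to_corners(box_b)
    inter_w = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    inter_h = max(0.0, min(ay1, by1) - max(ay0, by0))
    inter = inter_w * inter_h
    union = box_a[2] * box_a[3] + box_b[2] * box_b[3] - inter
    return inter / union if union > 0 else 0.0


def assign_label(region, gt_gsps, threshold):
    """Return 1 if the region overlaps any ground-truth GSP above the threshold, else 0."""
    return int(any(iou(region, gt) > threshold for gt in gt_gsps))
```

For example, two 10 × 10 boxes whose centers are 5 pixels apart have an IoU of 1/3, so the region is positive for a threshold of 0.3 and negative for a threshold of 0.5.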
Local part detection branch. The purpose of this branch is to detect GSPs from the region proposals in each view. We employ this branch as an enhancer of RPN, where it aims to learn what and where GSPs are in each view without anchors and in a more precise manner. The intuition is that this in turn pushes RPN to propose more GSP-like regions, which we provide to the global feature learning branch.
We feed the region features into a sequence of fully connected layers followed by two output layers. The first one estimates the probability that a region is a GSP, using a sigmoid function as an indicator. The second one predicts the corresponding part location using a bounding box regressor, with the same bounding box parameters as in RPN. Similar to the threshold in the region proposal module, the local part detection branch employs another threshold to assign positive and negative labels for training. Given the ground truth probabilities and locations of positive and negative samples in RPN, and similarly in the local part detection branch, the objective function of Parts4Feature for GSP detection (Eq. 1) is formed by the losses of both components and is defined for each region proposal as a probability term plus a weighted location term. The probability term measures the accuracy in terms of GSP probability by the cross-entropy function of positive labels, while the location term measures the accuracy in terms of location by the robust loss function as in [Ren and others2015]. A balance parameter weights the two terms in both RPN and the local part detection branch, and a value of 1 works well in our experiments. In summary, Parts4Feature has a powerful ability to detect GSPs by simultaneously leveraging the view-level features in the region proposal module and the part-level features in the local part detection branch, which addresses the difficulty of GSP detection from multiple views caused by rotation and occlusion effects.
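A per-region detection loss of this form can be sketched as follows. The sketch makes assumptions that the text does not spell out: the probability term is a standard binary cross entropy, the robust location loss is the smooth L1 function used in [Ren and others2015], and the location term is only applied to positive regions, as in that work. The function names are hypothetical.

```python
import math


def binary_cross_entropy(p, label, eps=1e-12):
    """Cross entropy between a predicted GSP probability and a 0/1 label."""
    p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
    return -(label * math.log(p) + (1 - label) * math.log(1.0 - p))


def smooth_l1(pred, target):
    """Smooth L1 loss summed over the four box parameters."""
    total = 0.0
    for a, b in zip(pred, target):
        d = abs(a - b)
        total += 0.5 * d * d if d < 1.0 else d - 0.5
    return total


def region_loss(p, label, loc, gt_loc, balance=1.0):
    """Probability term plus balance-weighted location term for one region."""
    loss = binary_cross_entropy(p, label)
    if label == 1:  # assumption: only positive regions contribute a location term
        loss += balance * smooth_l1(loc, gt_loc)
    return loss
```

Under these assumptions, the GSP detection objective for one region proposal would be the sum of `region_loss` evaluated on the RPN outputs and on the local part detection branch outputs, each with its own labels.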
Global feature learning branch. This branch learns to map the features of the top region proposals in each view of a shape to the 3D global feature. To avoid the information loss caused by the widely used pooling for aggregation, it incrementally aggregates all region features in terms of semantic part patterns with multi-attention, where we learn the patterns across all training data in the global feature learning benchmark. The motivation for learning part patterns to aggregate regions is that the appearance of GSPs varies so much that it would otherwise limit the discriminability of global features. Our multi-attention mechanism includes attention weights for view aggregation on the part level and on the part-pattern level. Here, the part-level attention models how each of the patterns weights each of the regions, while the part-pattern-level attention measures how the final, global feature weights each of the patterns.
Specifically, we employ a single-layer perceptron in which each pattern has the same dimension as a region feature. The part-level attention is a matrix, where each entry is the attention paid to one of the regions by one of the patterns, measured by a softmax function. With the part-level attention, we first aggregate all region features into a pattern-specific aggregation for each pattern. Then, we further aggregate all pattern-specific aggregations into the final, global feature of the 3D shape. This is performed by linear weighting with the part-pattern-level attention vector. For clarity of exposition, we explain the details of how we obtain the part-pattern-level attention further below.
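The two-stage aggregation can be sketched in a few lines. Since the exact formulas are not recoverable from the text, this sketch rests on labeled assumptions: the attention score is a dot product between a pattern and a region feature, the softmax normalizes over regions for each pattern, each pattern-specific aggregation is an attention-weighted sum of region features, and the part-pattern-level attention `beta` is supplied from outside. All names are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def aggregate(patterns, regions, beta):
    """Aggregate region features into one global feature.

    patterns: N pattern vectors, regions: K region features (same dimension),
    beta: N part-pattern-level attention weights (assumed given here).
    """
    # Part-level attention: alpha[n][k], normalized over regions (assumption).
    alpha = [softmax([dot(theta, r) for r in regions]) for theta in patterns]
    dim = len(regions[0])
    # One attention-weighted sum of region features per pattern.
    per_pattern = [
        [sum(alpha[n][k] * regions[k][d] for k in range(len(regions))) for d in range(dim)]
        for n in range(len(patterns))
    ]
    # Linear weighting of the pattern-specific aggregations by beta.
    feature = [sum(beta[n] * per_pattern[n][d] for n in range(len(patterns))) for d in range(dim)]
    return alpha, per_pattern, feature
```

Setting every attention weight to a constant reduces this scheme to a plain sum over regions, which is the kind of pooling the multi-attention is meant to improve on.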
Finally, we use the global feature to classify the shape into one of the shape classes by a softmax function after a fully connected layer, where the softmax function outputs the classification probabilities. The objective function of the global feature learning branch (Eq. 2) is the cross entropy between the predicted probabilities and the ground truth probabilities.
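The classification step and its cross-entropy objective can be sketched as below. This is a generic illustration, assuming a one-hot ground truth and a fully connected layer stored as one weight vector and one bias per class; the function names are hypothetical.

```python
import math


def classify(feature, weights, biases):
    """Fully connected layer followed by softmax; returns class probabilities."""
    logits = [sum(w * f for w, f in zip(row, feature)) + b for row, b in zip(weights, biases)]
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, true_class, eps=1e-12):
    """Cross entropy against a one-hot ground truth distribution."""
    return -math.log(max(probs[true_class], eps))
```

With all weights and biases at zero, every class receives the same probability, so the loss equals the logarithm of the number of classes.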
The intuition behind modelling part-pattern-level attention is to enable Parts4Feature to weight the pattern-specific aggregations according to the 3D shape characteristics that it has learned. This leads Parts4Feature to differentiate shapes in detail. To implement this, the part-pattern-level attention is designed to capture the similarities between each of the pattern-specific aggregations and the shape classes. To represent the characteristics of shape classes, we propose to employ the weights in the fully connected layer before the last softmax function, as illustrated in Fig. 1. We first project the pattern-specific aggregations and these weights into a common space using two learnable matrices. Then we compute normalized similarities using a linear mapping with learnable vectors, where all vectors are stacked into a matrix row by row.
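Training. We train the region proposal module and the local part detection branch together under a local part detection benchmark, and the global feature learning branch under a global feature learning benchmark. The Parts4Feature objective is to simultaneously minimize Eq. 1 and Eq. 2, which leads to a combined loss in which the number of samples serves as a normalization factor and a balance parameter weights the two objectives. Since the region proposal module and the local part detection branch are based on the object detection architecture of Fast R-CNN [Girshick2015], we adopt the approximate approach in [Ren and others2015] to train them jointly and efficiently. In addition, we simultaneously update the weights of the softmax classifier in the global feature learning branch through both the classification objective and the part-pattern-level attention. This enables these weights to be learned more flexibly for optimization convergence. Depending on the setting, the parameters of the region proposal module and the two branches are either updated simultaneously or updated alternately. For example, parameters in the region proposal module and the local part detection branch are first updated under the local part detection benchmark; then, parameters in the region proposal module (except RPN) and the global feature learning branch are updated under the global feature learning benchmark, and this process is iterated until convergence.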
4 Experiments and analysis
Parameters. We investigate how some important parameters affect Parts4Feature in shape classification under ModelNet [Wu and others2015].
We first explore the IoU thresholds in the region proposal module and in the local part detection branch that are used to establish positive GSP samples, using ModelNet40 [Wu and others2015] as the global feature learning benchmark, as shown in Table 1. As the threshold values increase from 0.5 to 0.8, the mean Average Precision (mAP) under the test set of the local part detection benchmark decreases, and accordingly, the average instance accuracy under the test set of the global feature learning benchmark decreases relative to the highest classification accuracy. We also decrease a threshold to 0.5 and increase it to 0.8, respectively. The mAP only slightly drops from 77.28 to 75.39 and 72.32, although the corresponding accuracy decreases too. However, the mAP and the accuracy are not strictly positively correlated, as shown by “(0.6,0.6)”, which has lower mAP but higher accuracy than “(0.8,0.5)” and “(0.5,0.5)”. This comparison also implies that one of the two thresholds affects part detection more than the other.
Next, we apply the parameter setting “(0.7,0.5)” under ModelNet10 [Wu and others2015], as shown by the first accuracy in Table 2. Increasing a threshold to 0.7 leads to an even better result. We also find that the remaining parameters have only a slight effect on the performance.
We visualize the part detection and the multi-attention involved in our best result under ModelNet10 in Fig. 3 and Fig. 4, respectively. Although there are no ground truth GSPs under ModelNet10, Parts4Feature still successfully transfers the part detection knowledge learned from the local part detection benchmark to detect GSPs in multiple views. Moreover, the part-pattern-level attention is learned to focus on the patterns with high part-level attention, where the top-6 patterns with high part-level attention are shown below for clarity.
Ablation study. Finally, in Table 3 we highlight our semantic part based view aggregation and multi-attention method in the global feature learning branch under ModelNet10. We replace our view aggregation with max pooling, mean pooling, and NetVLAD, where we aggregate the region features for classification. Although these results are good, our novel aggregation with multi-attention further improves the results. To evaluate multi-attention, we leave the rest of the branch unchanged and set all entries of both attention weights to 1 (“NoAtt”). This leads to significantly worse performance compared to our “MultiAtt”. Next, we employ the two attentions separately. We find that both part attention and part pattern attention improve on “NoAtt”, but part attention (“PtAtt”) contributes less than part pattern attention (“PnAtt”). Moreover, we highlight the effect of the local part detection branch as an enhancer of the region proposal module by removing it from Parts4Feature, which is also justified by the degraded results.
Classification. Table 4 compares Parts4Feature with the state-of-the-art in 3D shape classification under ModelNet. The comparison is conducted under the same condition: we use the same modality of views from the same camera system, where the results of RotationNet are from Fig. 4 (a) and (b) in https://arxiv.org/pdf/1603.06208.pdf. Moreover, the benchmarks use the standard training and test split.
| Method | Modality | ModelNet40 | ModelNet10 |
|---|---|---|---|
| MVCN [Su and others2015] | View | 90.10 | - |
| MVVC [Qi et al.2016] | Voxel | 91.40 | - |
| 3DDt [Xie and others2018] | Voxel | - | 92.40 |
| PaiV [Johns et al.2016] | View | 90.70 | 92.80 |
| Sphe [Cao et al.2017] | View | 93.31 | - |
| GIFT [Bai and others2017] | View | 89.50 | 91.50 |
| RAMA [Sfikas and others2017] | View | 90.70 | 91.12 |
| VRN [Brock et al.2016] | Voxel | 91.33 | 93.80 |
| RNet [Kanezaki et al.2018] | View | 90.65 | 93.84 |
| PNetP [Qi and others2017] | Point | 91.90 | - |
| DSet [Wang et al.2017] | View | 92.20 | - |
| VGAN [Wu and others2016] | Voxel | 83.30 | 91.00 |
| LAN [Achlioptas and others2018] | Point | 85.70 | 95.30 |
| FNet [Yang et al.2018] | Point | 88.40 | 94.40 |
| SVSL [Han and others2019a] | View | 93.31 | 94.82 |
| VIPG [Han and others2019b] | View | 91.98 | 94.05 |
Under both benchmarks, Parts4Feature outperforms all its competitors under the same condition, where the “Our” results are obtained with the parameters of our best accuracy under ModelNet40 in Table 1 and under ModelNet10 in Table 2. This comparison shows that Parts4Feature effectively employs part-level information to significantly improve the discriminability of learned features. Parts4Feature also outperforms its competitors under ShapeNet55 with the same parameters as our best results under ModelNet10, as shown by the comparison in the last three rows in Table 7.
To better demonstrate our classification results, we visualize the confusion matrices of our classification results under ModelNet10 and ShapeNet55 in Fig. 5 and Fig. 6, respectively. In each confusion matrix, an element on the diagonal is the classification accuracy of a class, while the other elements in the same row are the misclassification rates. The large diagonal elements show that Parts4Feature is good at classifying large-scale 3D shapes.
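A row-normalized confusion matrix of this kind can be computed as in the following sketch, which assumes integer class labels and uses a hypothetical function name. Each row corresponds to a ground truth class, so the diagonal entry is the per-class accuracy and the off-diagonal entries in the same row are the fractions misclassified into other classes.

```python
def confusion_matrix(true_labels, predicted_labels, num_classes):
    """Return a row-normalized confusion matrix as a list of lists."""
    counts = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(true_labels, predicted_labels):
        counts[t][p] += 1
    matrix = []
    for row in counts:
        total = sum(row)
        # Rows of classes that never occur are left as zeros.
        matrix.append([c / total if total else 0.0 for c in row])
    return matrix
```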
We also conduct experiments with a reduced number of segmented shapes for training under ModelNet10. As shown in Table 5, when trained with increasing randomly sampled fractions of the 6,386 segmented shapes, our results increase accordingly. The good results with fewer segmented shapes show that we not only learn from pixel-level information in 3D classification benchmarks, similar to existing methods, but also improve performance further by absorbing part-level information from 3D segmentation benchmarks.
Retrieval. We further evaluate Parts4Feature in shape retrieval under ModelNet and ShapeNetCore55 by comparing with the state-of-the-art methods in Table 6 and Table 7. These experiments are conducted under the test set, where each 3D shape is used as a query to retrieve from the rest of the shapes, and the retrieval performance is evaluated by mAP. The compared results include LFD, SHD, Fisher vector, 3D ShapeNets [Wu and others2015], Pano [Shi and others2015], MVCN [Su and others2015], GIFT [Bai and others2017], RAMA [Sfikas and others2017] and Trip [He et al.2018] under ModelNet.
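The retrieval protocol described above can be sketched as follows. The sketch assumes that shapes are ranked by Euclidean distance between their global features, which the text does not specify, and that a query with no relevant shape contributes an average precision of 0; the function names are hypothetical.

```python
import math


def average_precision(ranked_labels, query_label):
    """Average precision of one ranked list with respect to the query's class."""
    hits = 0
    total = 0.0
    for rank, label in enumerate(ranked_labels, start=1):
        if label == query_label:
            hits += 1
            total += hits / rank  # precision at this relevant position
    return total / hits if hits else 0.0


def retrieval_map(features, labels):
    """Use each shape as a query against the rest and return the mean AP."""
    aps = []
    for q in range(len(features)):
        others = [i for i in range(len(features)) if i != q]
        # Assumption: smaller Euclidean distance means more similar.
        others.sort(key=lambda i: math.dist(features[q], features[i]))
        aps.append(average_precision([labels[i] for i in others], labels[q]))
    return sum(aps) / len(aps)
```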
As shown in Table 6 and Table 7, our results outperform all the compared results in each benchmark. Besides Taco [Cohen et al.2018], the compared micro-averaged results in Table 7 are from the SHREC2017 shape retrieval contest [Savva and others2017] with the same names. In addition, the available PR curves under ModelNet40 and ModelNet10 are also compared in Fig. 7, which further demonstrates our superior results in shape retrieval.
|SVSL[Han and others2019a]||85.5|
|VIPG[Han and others2019b]||83.0|
Parts4Feature is proposed to learn 3D global features from part-level information in a semantic way. It successfully learns universal knowledge of generally semantic part detection from 3D segmentation benchmarks, and effectively transfers the knowledge to other shape analysis benchmarks by learning 3D global features from detected parts in multiple views. Parts4Feature makes it feasible to improve 3D global feature learning by leveraging discriminative information from another source. Moreover, our novel view aggregation with multi-attention helps Parts4Feature learn more discriminative features than widely used aggregation procedures. Our results show that Parts4Feature outperforms other state-of-the-art methods.
This work was supported by National Key R&D Program of China (2018YFB0505400) and NSF under award number 1813583. We thank all anonymous reviewers for their constructive comments.
- [Achlioptas and others2018] Panos Achlioptas et al. Learning representations and generative models for 3D point clouds. In The International Conference on Machine Learning, pages 40–49, 2018.
- [Bai and others2017] Song Bai et al. GIFT: Towards scalable 3D shape retrieval. IEEE Transactions on Multimedia, 19(6):1257–1271, 2017.
- [Brock et al.2016] Andrew Brock, Theodore Lim, J.M. Ritchie, and Nick Weston. Generative and discriminative voxel modeling with convolutional neural networks. In 3D deep learning workshop (NIPS), 2016.
- [Cao et al.2017] Zhangjie Cao, Qixing Huang, and Karthik Ramani. 3D object classification via spherical projections. In International Conference on 3D Vision, 2017.
- [Chatfield et al.2014] Ken Chatfield, Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. Return of the devil in the details: Delving deep into convolutional nets. In British Machine Vision Conference, 2014.
- [Cohen et al.2018] Taco S. Cohen, Mario Geiger, Jonas Köhler, and Max Welling. Spherical CNNs. In International Conference on Learning Representations, 2018.
- [Girshick2015] Ross Girshick. Fast R-CNN. In IEEE International Conference on Computer Vision, pages 1440–1448, 2015.
- [Han and others2016] Zhizhong Han et al. Unsupervised 3D local feature learning by circle convolutional restricted boltzmann machine. IEEE Transactions on Image Processing, 25(11):5331–5344, 2016.
- [Han and others2017] Zhizhong Han et al. Mesh convolutional restricted boltzmann machines for unsupervised learning of features with structure preservation on 3D meshes. IEEE Transactions on Neural Networks and Learning Systems, 28(10):2268–2281, 2017.
- [Han and others2019a] Zhizhong Han et al. Seqviews 2seqlabels: Learning 3D global features via aggregating sequential views by rnn with attention. IEEE Transactions on Image Processing, 28(2):1941–0042, 2019.
- [Han and others2019b] Zhizhong Han et al. View inter-prediction gan: Unsupervised representation learning for 3D shapes by learning global shape memories to support local view predictions. In AAAI, 2019.
- [Han et al.2019] Zhizhong Han, Mingyang Shang, Xiyang Wang, Yu-Shen Liu, and Matthias Zwicker. Y2seq2seq: Cross-modal representation learning for 3D shape and text by joint reconstruction and prediction of view and word sequences. In AAAI, 2019.
- [He et al.2018] Xinwei He, Yang Zhou, Zhichao Zhou, Song Bai, and Xiang Bai. Triplet-center loss for multi-view 3D object retrieval. In The IEEE Conference on Computer Vision and Pattern Recognition, 2018.
- [Johns et al.2016] Edward Johns, Stefan Leutenegger, and Andrew J. Davison. Pairwise decomposition of image sequences for active multi-view recognition. In IEEE Conference on Computer Vision and Pattern Recognition, pages 3813–3822, 2016.
- [Kalogerakis and others2017] Evangelos Kalogerakis et al. 3D shape segmentation with projective convolutional networks. In IEEE Conference on Computer Vision and Pattern Recognition, pages 6630–6639, 2017.
- [Kanezaki et al.2018] Asako Kanezaki, Yasuyuki Matsushita, and Yoshifumi Nishida. Rotationnet: Joint object categorization and pose estimation using multiviews from unsupervised viewpoints. In IEEE Conference on Computer Vision and Pattern Recognition, 2018.
- [Muralikrishnan et al.2018] Sanjeev Muralikrishnan, Vladimir G. Kim, and Siddhartha Chaudhuri. Tags2parts: Discovering semantic regions from shape tags. In IEEE Conference on Computer Vision and Pattern Recognition, pages 2926–2935, 2018.
- [Qi and others2017] Charles Qi et al. Pointnet++: Deep hierarchical feature learning on point sets in a metric space. In Advances in Neural Information Processing Systems, pages 5105–5114, 2017.
- [Qi et al.2016] C R Qi, H Su, and M Nießner. Volumetric and multi-view cnns for object classification on 3D data. In IEEE Conference on Computer Vision and Pattern Recognition, pages 5648–5656, 2016.
- [Ren and others2015] Shaoqing Ren et al. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, pages 91–99, 2015.
- [Savva and others2017] Manolis Savva et al. SHREC’17 Large-Scale 3D Shape Retrieval from ShapeNet Core55. In Eurographics Workshop on 3D Object Retrieval, 2017.
- [Sfikas and others2017] Konstantinos Sfikas et al. Exploiting the PANORAMA Representation for Convolutional Neural Network Classification and Retrieval. In EG Workshop on 3D Object Retrieval, pages 1–7, 2017.
- [Sharma et al.2016] Abhishek Sharma, Oliver Grau, and Mario Fritz. VConv-DAE: Deep volumetric shape learning without object labels. In Proceedings of European Conference on Computer Vision, pages 236–250, 2016.
- [Shi and others2015] B. Shi et al. Deeppano: Deep panoramic representation for 3D shape recognition. IEEE Signal Processing Letters, 22(12):2339–2343, 2015.
- [Su and others2015] Hang Su et al. Multi-view convolutional neural networks for 3D shape recognition. In International Conference on Computer Vision, pages 945–953, 2015.
- [Wang et al.2017] Chu Wang, Marcello Pelillo, and Kaleem Siddiqi. Dominant set clustering and pooling for multi-view 3D object recognition. In Proceedings of British Machine Vision Conference, 2017.
- [Wu and others2015] Zhirong Wu et al. 3D ShapeNets: A deep representation for volumetric shapes. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, pages 1912–1920, 2015.
- [Wu and others2016] Jiajun Wu et al. Learning a probabilistic latent space of object shapes via 3D generative-adversarial modeling. In Advances in Neural Information Processing Systems, pages 82–90, 2016.
- [Xie and others2018] Jianwen Xie et al. Learning descriptor networks for 3D shape synthesis and analysis. In IEEE Conference on Computer Vision and Pattern Recognition, 2018.
- [Yang et al.2018] Yaoqing Yang, Chen Feng, Yiru Shen, and Dong Tian. Foldingnet: Point cloud auto-encoder via deep grid deformation. In IEEE Conference on Computer Vision and Pattern Recognition, 2018.