HRGE-Net: Hierarchical Relational Graph Embedding Network for Multi-view 3D Shape Recognition

08/27/2019 ∙ by Xin Wei, et al. ∙ Xi'an Jiaotong University

View-based approaches, which recognize a 3D shape through its projected 2D images, have achieved state-of-the-art performance for 3D shape recognition. One essential challenge for view-based approaches is how to aggregate the multi-view features extracted from 2D images into a global 3D shape descriptor. In this work, we propose a novel feature aggregation network that fully investigates the relations among views. We construct a relational graph with the multi-view images as nodes, and design a relational graph embedding that models pairwise and neighboring relations among views. By gradually coarsening the graph, we build a hierarchical relational graph embedding network (HRGE-Net) to aggregate the multi-view features into a global shape descriptor. Extensive experiments show that HRGE-Net achieves state-of-the-art performance for 3D shape classification and retrieval on benchmark datasets.




1 Introduction

3D shape recognition is an important research area in computer vision. 3D shapes, including real scanned objects and CAD models, retain richer geometric and shape information for recognition than 2D images captured from a single view. 3D shape recognition plays a critical role in applications such as autonomous driving [24], archaeology [27], and virtual reality / augmented reality [11].

Figure 1: Illustration of a 3D shape and its multi-view images. The images are rendered from 12 views by cameras placed around the shape, as shown in (a). The images from neighboring views are related in both pose and appearance, as shown in (b). In (c), the left and right pairs of images are symmetric in pose.

There have been tremendous advances in research on deep learning-based 3D shape recognition in recent years. According to shape representation, they can be divided into three categories including voxel-based, point-based, and view-based methods.

Voxel-based methods represent a 3D shape by a collection of voxels [25] in 3D Euclidean space, then build neural networks on the voxels to learn 3D features for recognition [34, 20]. Though effective, they commonly face challenges including computational complexity, limited voxel resolution, and the data sparsity caused by voxelization of the shape surface. Point-based methods directly define networks on point clouds or meshes. PointNet [4] is a simple but powerful deep architecture that takes point positions as input. Succeeding methods, e.g., PointNet++ [26], SpiderCNN [35], and PointCNN [18], achieve improved performance for 3D shape recognition. View-based methods [31, 25, 37, 15] recognize shape categories by extracting and aggregating multi-view features, and commonly achieve state-of-the-art performance for shape recognition. The remaining challenges are how to project 3D shapes and how to effectively aggregate the features learned from the multi-view 2D images.

In this work, we focus on view-based 3D shape recognition, and propose a novel network architecture that embeds the multi-view features into a global descriptor for 3D shape representation. For view-based methods, simple max-pooling or average-pooling of multi-view features ignores the relations among the multi-view images. As shown in Fig. 1(b), the rendered 2D images captured from neighboring views have strong relations, e.g., their poses and appearance are similar and change smoothly. Moreover, as shown in Fig. 1(c), paired multi-view 2D images from different views are also related, e.g., the example pairs of 2D images are symmetric. Therefore, we believe that the 2D images of neighboring views and pairwise views contain valuable information for aggregating multi-view features.

Motivated by the above analysis, we propose a novel relational graph network to effectively aggregate the multi-view features for 3D shape recognition. We first construct a relational graph over the multi-view images, and then design a network block of relational graph embedding over the graph. This block explicitly models the pairwise and neighboring view relations among multi-view images by a pairwise relation module and a neighboring relation module respectively. Based on this relational modeling, we successively coarsen the relational graph, and design a hierarchical relational graph embedding network, dubbed HRGE-Net, over the graph hierarchy. The learned HRGE-Net gradually aggregates the multi-view features while considering relations among views, producing a discriminative global shape descriptor.

Compared with traditional view-based methods, this relational graph can effectively explore the relations hidden among the views. We evaluate our method on 3D benchmark datasets, where our network achieves state-of-the-art performance, e.g., the best per-class and per-instance classification accuracies on ModelNet40 [34], and the best micro-averaged and macro-averaged retrieval mAP on ShapeNet Core55 [29].

2 Related Work

2.1 View-based 3D Shape Recognition

Multi-view images of a 3D shape contain rich shape information, and view-based methods commonly lead to higher accuracy for shape recognition than voxel- and point-based methods. In [1], an efficient 3D retrieval system was built by extracting features of multi-view images with a CNN and matching two shapes via a similarity defined between their two sets of view features. In [31, 32], multi-view image features were pooled across views by max-pooling and then passed through additional network layers to obtain a compact shape descriptor for recognition.

Recently, several works have considered more advanced fusion strategies for multi-view feature aggregation. [37] proposed a harmonized bilinear pooling for aggregating the multi-view features from the perspective of patch-to-patch similarity. In [5, 12], sequential multi-view images were selected and/or aggregated by an RNN with attention to produce a global shape representation. Both [8, 33] investigated the grouping relationship of multi-view features and designed feature pooling on view groups. In [36], a point-view network was proposed to integrate point cloud and multi-view data for joint 3D shape recognition. RotationNet [15] is one of the state-of-the-art methods for shape recognition; it treats the view index as an optimized latent variable when predicting shape labels.

These works achieved promising results for 3D shape classification and retrieval. Compared with them, our approach models the multi-view images of a shape as a graph and explicitly learns the relations among pairwise views and neighboring views with a novel hierarchical relational graph embedding network. As shown in the experiments, it achieves state-of-the-art results for shape recognition.

2.2 Relational Graph Network

Modeling the relations among entities or objects is an important task in real-world applications. Objects constitute the nodes of a graph, and the relations among objects can be modeled as graph edges. Early works [10, 9, 21] coped with object relations by a post-processing operation that re-weights the objects. More recently, relations have been modeled in deep learning frameworks, where LSTMs are utilized for sequential reasoning [17, 30]. The relation network [28] is a pioneering work on relational reasoning with a simple neural network. It first recognizes objects in the raw input data, and then utilizes a relational reasoning module to reason about the objects and their interactions. Based on [28], [22] proposed a recurrent relational reasoning module to model the message passing process on a graph. In [14], an attention strategy was introduced to relation networks for instance recognition. These works have shown notable performance in applications such as visual question answering (QA), text-based QA, and dynamic physical systems, justifying the effectiveness of relational modeling of objects / entities.

Our network is inspired by relation networks [28, 22], but introduces novel designs based on domain knowledge in 3D shape recognition. For example, we take the multi-view features as graph nodes instead of objects; we further design two types of relations, i.e., pairwise and neighboring view relations; and we organize these relation models into a hierarchical deep architecture over gradually coarsened graphs. Ablation studies show that our relational modules and graph architecture are beneficial for improving the performance of shape recognition.

Figure 2: Pipeline of hierarchical relational graph embedding network. It consists of three stages: the view feature extraction, hierarchical relational graph embedding (HRGE) and label prediction stages. The HRGE stage is proposed to aggregate multi-view features.

3 HRGE-Net

In this section, we present the details on the design of our HRGE-Net. As illustrated in Fig. 2, HRGE-Net consists of three stages. In the view feature extraction stage, multi-view features are extracted from projected 2D images of a 3D shape. In the hierarchical relational graph embedding (HRGE) stage, the features are aggregated to be a global feature to represent the shape, which is then sent to the last label prediction stage for predicting its category.

3.1 View Feature Extraction

We project a 3D shape to multi-view 2D images using similar settings as in [31, 32], rendered with the Phong reflection model [23]. Given a 3D object, we first rescale it to fit a fixed viewing volume. The virtual cameras are then placed with radial symmetry around the upright direction and elevated by 30 degrees. Finally, we render the object to the virtual camera planes to form a series of view-indexed images with black background. For more details, please refer to [31, 32].
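The camera placement described above can be sketched in a few lines. This is a minimal numpy illustration, not the paper's rendering code; the camera radius of 2.5 is an assumed value for illustration only.

```python
import numpy as np

def camera_positions(n_views=12, elevation_deg=30.0, radius=2.5):
    """Place n_views virtual cameras evenly on a circle around the object,
    elevated 30 degrees above the ground plane (MVCNN-style setting)."""
    elev = np.deg2rad(elevation_deg)
    azimuths = np.linspace(0.0, 2.0 * np.pi, n_views, endpoint=False)
    # Unit directions from the object centre to each camera, scaled by radius.
    x = np.cos(elev) * np.cos(azimuths)
    y = np.cos(elev) * np.sin(azimuths)
    z = np.full(n_views, np.sin(elev))
    return radius * np.stack([x, y, z], axis=1)  # shape: (n_views, 3)

cams = camera_positions()
```

Neighboring rows of `cams` differ by a 30-degree azimuth rotation, which is exactly why adjacent views share similar poses and appearance.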

For the rendered multi-view images, we extract their features with a fine-tuned ResNet-50 [13] network. The network is first pre-trained on ImageNet [7] for image classification, then fine-tuned on the shuffled multi-view images of all the 3D shapes for classification. The features before the last fully connected layer are taken as the multi-view features. All the 2D images from different views share the same ResNet-50 for feature extraction. After feature extraction, we derive the multi-view features that are taken as inputs to the following hierarchical relational graph embedding stage.

3.2 Hierarchical Relational Graph Embedding

In this stage, we aim to aggregate the extracted multi-view features into a global 3D shape descriptor. As illustrated in Fig. 1, the rendered images from neighboring views have strong relations (see Fig. 1(b)) in both pose and appearance. Moreover, we also believe that the rendered images between pairwise views have strong relations, e.g., the images may be symmetric for some specific view pairs (see Fig. 1(c)). These relations should provide additional discriminative information for shape recognition. Motivated by these observations, we design a network that learns to aggregate the multi-view features while considering their relations.

Multi-view Relational Graph. Given the multi-view features, a graph can be constructed using each view's feature as a node and the view-based relations (discussed below) as edges. Since we assume that the virtual cameras lie on a circle around the object, the graph is defined on a circle around the object, as illustrated in the HRGE stage of Fig. 2. Our hierarchical relational graph embedding hierarchically aggregates the input view features from a finer graph to coarser graphs with a decreasing number of nodes (i.e., views) by graph coarsening. Without loss of generality, at level l of the graph hierarchy, we denote the corresponding relational graph as G^l with node features f_i^l, where i indexes the views at level l.

At the l-th level of the graph hierarchy, we next design the relational graph embedding to aggregate shape features on the graph, considering the relations of pairwise views and neighboring views.

3.2.1 Pairwise Relation Module

This module models pairwise relations among the nodes (i.e., views) of the graph to investigate the relations of all pairwise views. As shown in Fig. 3, we set the graph edges to connect each pair of graph nodes. Then, we define the pairwise relation between nodes i and j of graph G^l as

r_{ij}^l = g_\theta([f_i^l, f_j^l]),    (1)

where [\cdot, \cdot] denotes the concatenation of two vectors, and g_\theta is a relation function with learnable parameters \theta, aiming at exploring the relation between two nodes of the graph. In our implementation, we design it as a three-layer MLP with 2048 units in each pairwise relation module.

By Eqn. (1), we derive the relations between any two nodes of the graph; we then gather these relations to obtain the relational feature for each graph node:

R_i^l = \sum_{j \in \mathcal{N}(i)} r_{ij}^l,    (2)

where \mathcal{N}(i) denotes all nodes that have edges connecting to node i. This is similar to a message passing process that collects information from the connected nodes.

For node i, we then design an operation to fuse the relational feature with its original feature:

\hat{f}_i^l = h_\phi([f_i^l, R_i^l]),    (3)

where h_\phi is a feature fusion function with learnable parameters \phi. In our implementation, it is designed as a simple fully connected layer with 2048 units. By Eqn. (3), the node features are updated considering pairwise relations on the graph, such that the updated features have a larger receptive field on the graph. After the pairwise relation module, we obtain the relational graph with updated node features \hat{f}_i^l, which are taken as inputs to the following neighboring relation module.

3.2.2 Neighboring Relation Module

We further model the relations among neighboring views, because neighboring views are related by continuously changing poses and appearance (e.g., Fig. 1(b)). For the graph with node features \hat{f}_i^l, we compute the neighboring relation for each graph node as

\tilde{f}_i^l = t_\psi([\hat{f}_{i-1}^l, \hat{f}_i^l, \hat{f}_{i+1}^l]),    (4)

where [\cdot, \cdot, \cdot] is a concatenation operation and t_\psi is a relation function with learnable parameters \psi; it investigates the relations among the triplet of neighboring views' features, and maps the high-dimensional concatenated features into a lower-dimensional feature space. In our work, we design t_\psi as a fully connected layer with 2048 units. By Eqn. (4), we fuse each node's features with its neighboring node features in the graph, which can be seen as a convolution operation on the graph with a spatial neighborhood of three views.

After updating the node features, we coarsen the graph by down-sampling the graph nodes with a view stride of 2 (the value used in our implementation), achieving a new relational graph with half the number of nodes.

3.2.3 Summary of Relational Graph Embedding

In summary, as shown in Fig. 3, the input multi-view features are successively processed by the pairwise relation module and the neighboring relation module. The features of each graph node are updated considering the features of the other views, and the relations are automatically learned by training the network. Note that, for simplicity, we implement the relation functions and the feature fusion function with simple neural network layers; they could also be designed as more complex functions with larger capacity.

3.2.4 Hierarchical Relational Graph Embedding

Figure 3: Relational graph embedding. The node features are first sent to pairwise relation module, followed by neighboring relation module and graph coarsening. The three neighboring views are indicated by green, yellow and red in the middle and right rings.

One relational graph embedding stage embeds the multi-view features on a graph and outputs updated features on a coarsened graph with fewer views. We stack multiple repetitions of relational graph embedding into a hierarchical deep architecture, i.e., the HRGE-Net. To retain the shape features at all levels of the hierarchy, at each stage except the last we max-pool the multi-view features and normalize (Norm) the result to obtain a per-level global shape descriptor. For the last relational graph, we directly max-pool and normalize its node features, and the final global shape feature is the concatenation of the global features at all levels.

The proposed hierarchical relational graph embedding is summarized in Algorithm 1 for clarity, where the parameters of the relation and fusion functions are learned. We next present two instances of HRGE-Net with different numbers of input views.

6-View HRGE-Net: The 6 views of a 3D shape construct a relational graph with 6 nodes. The graph is coarsened once with a stride of two, so the HRGE-Net performs relational graph embedding over a hierarchy of graphs with 6 and 3 nodes. The final global feature of a 3D shape is the concatenation of the global features from both levels.

12-View HRGE-Net: Taking 12 view features as a relational graph with 12 nodes, the HRGE-Net performs relational graph embedding over a graph hierarchy with 12, 6, and 3 nodes respectively. The final 3D shape descriptor is therefore the concatenation of the global features from the three levels. Please refer to Fig. 2 for an illustration of the 12-view HRGE-Net.
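The full 12 → 6 → 3 hierarchy, with per-level max-pooling, normalization, and final concatenation, can be sketched end-to-end. This is a minimal numpy sketch, assuming random linear layers as stand-ins for the learned relation modules and L2 normalization for the Norm step.

```python
import numpy as np

def hrge_forward(F, n_levels=2, stride=2, seed=0):
    """Minimal sketch of the HRGE hierarchy for 12 input views: at each
    level, update node features (stand-in: a random linear layer + ReLU),
    max-pool + L2-normalise into a per-level global feature, coarsen by
    the stride, and finally concatenate the per-level globals."""
    rng = np.random.default_rng(seed)
    d = F.shape[1]
    globals_per_level = []
    for _ in range(n_levels):
        W = rng.standard_normal((d, d)) / np.sqrt(d)  # relation-module stand-in
        F = np.maximum(F @ W, 0.0)
        g = F.max(axis=0)                              # max-pool over views
        globals_per_level.append(g / (np.linalg.norm(g) + 1e-12))  # Norm
        F = F[::stride]                                # graph coarsening
    g = F.max(axis=0)                                  # last graph: pool directly
    globals_per_level.append(g / (np.linalg.norm(g) + 1e-12))
    return np.concatenate(globals_per_level)           # final shape descriptor

desc = hrge_forward(np.abs(np.random.default_rng(2).standard_normal((12, 16))))
```

With 16-dimensional toy features, the descriptor concatenates three 16-dimensional per-level globals (levels of 12, 6, and 3 views), giving a 48-dimensional output.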

Comparison with CNNs and GNNs: Like CNNs defined over the image grid, our deep network is a hierarchical architecture, but it models relational patterns of multi-view features defined over a hierarchy of graphs. Compared with conventional graph neural networks (GNNs), our graph network is defined over graphs of multi-view features, and models the inter-relations among views motivated by domain knowledge of multi-view imaging.

Input: relational graph with multi-view features; number of levels L
for l = 1 to L do
    for each node i of the graph do
        compute the pairwise relations of node i     # learn pairwise relation
        update the view feature of node i            # update view feature
    end for
    compute the level-l global shape feature by max-pooling and Norm
    for each node i of the graph do
        compute the neighboring relation and update the view feature
    end for
    coarsen the graph by down-sampling
end for
compute the last-level global shape feature by max-pooling and Norm
concatenate the global features of all levels
Output: global feature
Algorithm 1: Forward computation of HRGE. The parameters of the relation and fusion functions are learned.

3.3 Label Prediction

In this stage, we take the shape feature learned from HRGE as input and predict the label of the 3D shape. In our implementation, the label predictor is a simple fully connected layer followed by a softmax operation and a cross-entropy loss function.

3.4 Network Training

We train HRGE-Net in two steps, similar to [31]. In the first step, the ImageNet-pre-trained ResNet-50 [7] is fine-tuned on all the multi-view images for classification; the fine-tuned ResNet-50, without its last fully connected layer, is then taken as the view feature extractor. In the second step, we train the whole pipeline, including the feature extractor, HRGE, and label predictor, end-to-end for shape recognition. The gradients of the loss w.r.t. the parameters of HRGE and ResNet-50 are computed by auto-differentiation.

When fine-tuning the ResNet-50, we use the Adam optimizer with weight decay, and the learning rate is reduced by half at regular epoch intervals. When training the whole architecture, we again use the Adam optimizer with weight decay and a step-decayed learning rate schedule. These two training steps take about 8 and 16 hours respectively on an NVIDIA GTX 1080 Ti GPU.

4 Experiments

In this section, we evaluate our HRGE-Net on benchmark datasets for 3D shape classification and retrieval.

4.1 Datasets

ModelNet40 [34]. This dataset consists of 12,311 3D shapes from 40 categories, with 9,843 training models and 2,468 test models for shape classification. The number of shapes varies across categories. Various methods have reported results on this dataset with different shape representations, including voxels, point clouds, and multi-view images.

ShapeNet Core55 [29]. This dataset contains a total of 51,162 3D models categorized into 55 classes, which are further divided into 203 sub-categories. The training, validation and test sets consist of 35,764, 5,133 and 10,265 shapes respectively. Different classes have varying numbers of samples. We use the "normal" version of the dataset, i.e., all shapes are consistently aligned and normalized to a unit-length cube.

4.2 Evaluation for 3D Shape Classification

We first evaluate our 12-view HRGE-Net on ModelNet40 for shape classification. We compare with diverse methods. MVCNN [31] is an effective deep-learning-based multi-view shape recognition method, and MVCNN-new [32] improves its results with a better rendering technique. PVRNet [36] fuses multi-view image and point cloud features. RCPCNN [33] exploits the relations among view features with a clustering strategy, and a similar strategy is employed in GVCNN [8]. MHBN [37] aggregates the multi-view features by bilinear pooling. We also compare with models based on points, voxels and mixed representations, including 3DShapeNets [34], VoxNet [20], VRN Ensemble [3], PointNet++ [26], Kd-Networks [16], and MVCNN-MultiRes [25].

The classification results of the above methods are presented in Table 1. We achieve the highest scores for both per-class and per-instance accuracies. Among the previous methods, MVCNN-new [32] achieved the highest accuracies, but our HRGE-Net is superior on both measurements. Compared with MVCNN-new [32], which aggregates the multi-view features by max-pooling, the improvement demonstrates the effectiveness of our feature aggregation strategy, i.e., fusing the multi-view features gradually with a relational graph. Compared with [33, 8], our HRGE-Net achieves at least 3% higher per-instance accuracy.

Figure 4: Per instance classification accuracy on ModelNet40 for some leading view-based methods with 12 views.

For fair comparison, we also compare with view-based methods with 12 views in Fig. 4. Compared with the traditional view-pooling methods like MVCNN [31], MVCNN-new [32], MHBN [37], RCPCNN [33] and GVCNN [8], our HRGE-Net achieves significantly higher accuracy. This improvement shows the effectiveness of our proposed hierarchical relational graph embedding.

RotationNet [15] is an interesting approach that estimates poses and tries several variants of camera settings for multi-view projections. In contrast, our approach focuses on how to aggregate the 3D multi-view features. As shown in Table 2, with ResNet-50 as the multi-view feature extractor and 12 views with upright orientation, our HRGE-Net achieves 6.19% higher accuracy than RotationNet. When varying the camera poses using 20 views from the vertices of a dodecahedron encompassing the object, the mean accuracy of RotationNet is 94.77%, which is also lower than that of our HRGE-Net. Our HRGE-Net, with fewer views and a fixed upright orientation, achieves almost the same accuracy as RotationNet-max with its best camera poses over 20 views.

Method               Input   Per Class  Per Ins.
3DShapeNets [34]     Voxels  77.3       –
VoxNet [20]          Voxels  83.0       –
VRN Ensemble [3]     Voxels  –          95.5
PointNet++ [26]      Points  –          91.9
Kd-Networks [16]     Points  88.5       91.8
MVCNN [31]           Images  90.1       90.1
MVCNN-new [32]       Images  94.0       95.5
MVCNN-MultiRes [25]  Images  91.4       93.8
PVRNet [36]          Images  91.6       93.2
GVCNN [8]            Images  90.7       93.1
RCPCNN [33]          Images  –          93.8
MHBN [37]            Images  93.1       94.7
HRGE-Net (ours)      Images  95.0       96.8
Table 1: Shape classification accuracy (in %) on ModelNet40.
Method            # Views  Per Ins. Acc.
RotationNet-mean  20       94.77 ± 1.10
RotationNet-max   20       96.92
RotationNet       12       90.65
HRGE-Net          12       96.84
Table 2: Shape classification accuracy (in %) on ModelNet40. All models use ResNet-50 as the view feature extractor.

4.3 Ablation Study

We next go deeper to justify the effectiveness of each component of our HRGE-Net, conducting experiments on the pairwise relation module, the neighboring relation module, and the hierarchical structure. We also show the effects of the number of views and feature normalization on performance.

Effectiveness of pairwise and neighboring relations. We first design the following baseline architectures. Baseline: the global shape feature is the max-pooling of the multi-view image features without graph embedding. PR: the multi-view features are first sent to the pairwise relation module, then max-pooled as the global shape feature. NR: the multi-view features are first sent to a neighboring relation module, whose outputs are then max-pooled as the global shape feature. Note that Baseline is in fact the model of MVCNN-new [32]. The results of the above architectures are presented in Table 3. From Baseline, PR improves per-class and per-instance accuracies by 0.74% and 0.73%, and NR by 0.05% and 0.65%, which demonstrates the effectiveness of the pairwise and neighboring relation modules.

Effectiveness of hierarchical architecture. Given 12 input views, we design the following variants. HRGE-Net-1L: our HRGE-Net as shown in Algorithm 1 but defined over a hierarchy of two graphs with 12 and 6 nodes. HRGE-Net: our full 12-view HRGE-Net. As shown in Table 3, compared with Baseline, HRGE-Net-1L achieves about 1% higher per-instance accuracy, showing the effectiveness of our proposed relational graph embedding module. HRGE-Net achieves the highest per-class and per-instance accuracies, which are 0.84% and 0.32% higher than those of HRGE-Net-1L. Furthermore, to justify the superiority of our neighboring relation module in the hierarchical architecture, we replace the relation function in HRGE-Net with max-pooling and average-pooling, resulting in HRGE-Net-MP and HRGE-Net-AP, and we design HRGE-Net-ID with an identity mapping that retains the feature of the node of interest. As shown in Table 3, HRGE-Net-MP is even worse than HRGE-Net-ID as a result of destroying the local structure, while HRGE-Net-AP, which averages the neighboring features, improves the per-instance accuracy by 0.12%. Compared with these two pooling variants, our full HRGE-Net improves HRGE-Net-ID by about 1.1% in per-class accuracy and 0.6% in per-instance accuracy, which demonstrates the advantage of our proposed neighboring relation module.

Effectiveness of normalization. In Table 3, we also present the results of HRGE-Net (w/o N), which has the same architecture as HRGE-Net except that the max-pooled global features are not normalized before concatenation. It achieves 0.54% and 0.37% lower per-class and per-instance accuracies, showing the necessity of feature normalization. It is worth noting that the result of HRGE-Net (w/o N) is similar to that of HRGE-Net-1L. This is because the global features at different layers are not on the same scale, and feature normalization is necessary for the network to benefit from deeper layers.

Effect of the number of views. The above architectures are all based on the 12-view HRGE-Net; we further test the 6-view HRGE-Net, which achieves 94.36% per-class and 96.39% per-instance accuracy, 0.61% and 0.45% lower than the 12-view HRGE-Net. Note that the number of views determines the number of levels in our networks.

# Views  Method            Per Class  Per Ins.
6        HRGE-Net          94.36      96.39
12       Baseline          94.00      95.50
12       PR                94.74      96.23
12       NR                94.05      96.15
12       HRGE-Net-1L       94.13      96.52
12       HRGE-Net (w/o N)  94.43      96.47
12       HRGE-Net-MP       93.60      95.99
12       HRGE-Net-AP       94.58      96.39
12       HRGE-Net-ID       93.90      96.27
12       HRGE-Net          94.97      96.84
Table 3: Results (in %) of variants of HRGE-Net for shape classification with different architectures.
                   microALL                       macroALL
Method             P@N   R@N   F1@N  mAP   NDCG   P@N   R@N   F1@N  mAP   NDCG
ZFDR               53.5  25.6  28.2  19.9  33.0   21.9  40.9  19.7  25.5  37.7
DeepVoxNet         79.3  21.1  25.3  19.2  27.7   59.8  28.3  25.8  23.2  33.7
DLAN               81.8  68.9  71.2  66.3  76.2   61.8  53.3  50.5  47.7  56.3
RotationNet [15]   81.0  80.1  79.8  77.2  86.5   60.2  63.9  59.0  58.3  65.6
Improved GIFT [2]  78.6  77.3  76.7  72.2  82.7   59.2  65.4  58.1  57.5  65.7
ReVGG              76.5  80.3  77.2  74.9  82.8   51.8  60.1  51.9  49.6  55.9
MVFusionNet        74.3  67.7  69.2  62.2  73.2   52.3  49.4  48.4  41.8  50.2
CM-VGG5-6DB        41.8  71.7  47.9  54.0  65.4   12.2  66.7  16.6  33.9  40.4
GIFT [1]           70.6  69.5  68.9  64.0  76.5   44.4  53.1  45.4  44.7  54.8
MVCNN [31]         77.0  77.0  76.4  73.5  81.5   57.1  62.5  57.5  56.6  64.0
Ours               76.8  81.5  78.2  77.2  85.4   52.2  71.8  57.2  63.8  69.6
Table 4: Shape retrieval results (in %) on the ShapeNet Core55 dataset.

4.4 Evaluation for 3D Shape Retrieval

We now evaluate our approach for 3D shape retrieval on ShapeNet Core55 [29], a challenging dataset containing 55 categories and 203 sub-categories. After rendering each 3D shape to 12 views, we train HRGE-Net for shape classification on these multi-view images, and the shape features learned during classifier training are taken as the features for retrieval. Following [15], we train two classification HRGE-Nets to respectively predict the shape categories (HRGE-Net-c55) and all the sub-categories (HRGE-Net-c203). For a test shape, we send it to HRGE-Net-c55 and take the output before the last fully connected layer, followed by normalization, as the 3D shape descriptor for retrieval. We first retrieve shapes by the distance between shape descriptors, dropping shapes whose distance is larger than a certain threshold. For the resulting list of retrieved shapes, we further apply HRGE-Net-c203 to predict their sub-category labels and re-rank the list such that shapes in the same sub-category as the query are ranked higher than the others.

Figure 5: Retrieval results on ShapeNet Core55 test set. Top 10 matched shapes are shown. Red color indicates failure cases.

We compare with the state-of-the-art methods that attended the SHREC'17 track on Large-Scale 3D Shape Retrieval [29] on the ShapeNet Core55 dataset. GIFT [1] and Improved GIFT [2] are multi-view retrieval methods using the GIFT and improved GIFT techniques. MVFusionNet takes multi-view images as input and employs a Compact Multi-View Descriptor (CMVD) [6] to generate hand-crafted features which are fused with CNN features. ReVGG extracts multi-view features with a reduced VGG-M network and defines similarity with a modified Neighbor Set Similarity. CM-VGG5-6DB combines multi-view features into a global feature and uses the Clock Matching algorithm [19] to compute shape dissimilarity. DLAN is a point-set based model built on deep aggregation of local 3D geometric features with two well-designed network blocks. ZFDR integrates both visual and geometric information as shape features. DeepVoxNet builds a network on binary voxel grids, and features extracted from an intermediate network layer are used for shape retrieval.

As shown in Table 4, our network achieves the highest accuracies for micro-averaged R@N and mAP, and for macro-averaged R@N, mAP and NDCG@N, and the second best for micro-averaged F1@N and NDCG. Compared with Improved GIFT and MVCNN, HRGE-Net improves mAP and NDCG@N by more than 3.7% and 2.7% for microALL, and by 6.3% and 3.9% for macroALL. Furthermore, we outperform the current state-of-the-art method, RotationNet, by 5.5% in mAP and 4.0% in NDCG@N for macroALL. This is notable because RotationNet achieves its high results by investigating different settings of camera views, while our HRGE-Net relies on the baseline 12-view setting popularly used in 3D shape recognition. We also present examples of retrieval results in Fig. 5. HRGE-Net can retrieve objects with diverse shapes. For example, in the fifth row, which takes a loudspeaker as the query object, we successfully retrieve various loudspeakers. The red shapes indicate failure cases, e.g., in the last row, which takes a potted plant as the query, we retrieve one wrong shape of a lamp.

5 Conclusion

In this work, we propose a novel deep relational graph network to aggregate multi-view features for 3D shape recognition. Our proposed network is defined over multi-view relational graphs with a hierarchical architecture, and learns to gradually aggregate multi-view features considering pairwise and neighboring relations among views. We have extensively compared our network with previous methods for shape recognition, and showed state-of-the-art performance on benchmark datasets.

In future work, we are interested in extending our current network to other camera-view settings, e.g., cameras placed on a sphere or a dodecahedron around the object; this can be achieved by defining our relation modules on the corresponding graph. We are also interested in introducing an attention mechanism into our relational modules to explore better ways of aggregating features.


  • [1] S. Bai, X. Bai, Z. Zhou, Z. Zhang, and L. J. Latecki (2016) GIFT: a real-time and scalable 3d shape search engine. In CVPR, Cited by: §2.1, §4.4, Table 4.
  • [2] S. Bai, X. Bai, Z. Zhou, Z. Zhang, Q. Tian, and L. J. Latecki (2017) GIFT: towards scalable 3d shape retrieval. IEEE Transactions on Multimedia 19 (6), pp. 1257–1271. Cited by: Table 4.
  • [3] A. Brock, T. Lim, J. M. Ritchie, and N. Weston (2016) Generative and discriminative voxel modeling with convolutional neural networks. arXiv preprint arXiv:1608.04236. Cited by: §4.2, Table 1.
  • [4] R. Q. Charles, H. Su, M. Kaichun, and L. J. Guibas (2017) PointNet: deep learning on point sets for 3d classification and segmentation. In CVPR, Cited by: §1.
  • [5] S. Chen, L. Zheng, Z. Yan, Z. Sun, and X. Kai (2018) VERAM: view-enhanced recurrent attention model for 3d shape classification. IEEE TVCG PP (99), pp. 1–1. Cited by: §2.1.
  • [6] P. Daras and A. Axenopoulos (2009) A compact multi-view descriptor for 3d object retrieval. In International Workshop on Content-Based Multimedia Indexing, Cited by: §4.4.
  • [7] J. Deng, W. Dong, R. Socher, L. J. Li, K. Li, and F. F. Li (2009) ImageNet: a large-scale hierarchical image database. In CVPR, Cited by: §3.1, §3.4.
  • [8] Y. Feng, Z. Zhang, X. Zhao, R. Ji, and Y. Gao (2018) GVCNN: group-view convolutional neural networks for 3d shape recognition. In CVPR, Cited by: §2.1, §4.2, §4.2, §4.2, Table 1.
  • [9] C. Galleguillos and S. Belongie (2010) Context based object categorization: a critical survey. CVIU 114 (6), pp. 712–722. Cited by: §2.2.
  • [10] C. Galleguillos, A. Rabinovich, and S. Belongie (2008) Object categorization using co-occurrence, location and appearance. In CVPR, Cited by: §2.2.
  • [11] N. Hagbi, O. Bergig, J. El-Sana, and M. Billinghurst (2011) Shape recognition and pose estimation for mobile augmented reality. IEEE TVCG 17 (10), pp. 1369–1379. Cited by: §1.
  • [12] Z. Han, M. Shang, Z. Liu, et al. (2019) SeqViews2SeqLabels: learning 3d global features via aggregating sequential views by rnn with attention. IEEE TIP 28 (2), pp. 658–672. Cited by: §2.1.
  • [13] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In CVPR, Cited by: §3.1.
  • [14] H. Hu, J. Gu, Z. Zhang, J. Dai, and Y. Wei (2018) Relation networks for object detection. In CVPR, Cited by: §2.2.
  • [15] A. Kanezaki, Y. Matsushita, and Y. Nishida (2018) RotationNet: joint object categorization and pose estimation using multiviews from unsupervised viewpoints. In CVPR, Cited by: §1, §2.1, §4.2, §4.4, Table 4.
  • [16] R. Klokov and V. Lempitsky (2017) Escape from cells: deep kd-networks for the recognition of 3d point cloud models. In ICCV, Cited by: §4.2, Table 1.
  • [17] J. Li, Y. Wei, X. Liang, J. Dong, T. Xu, J. Feng, and S. Yan (2017) Attentive contexts for object detection. IEEE Transactions on Multimedia 19 (5), pp. 944–954. Cited by: §2.2.
  • [18] Y. Li, R. Bu, M. Sun, W. Wu, X. Di, and B. Chen (2018) PointCNN: convolution on x-transformed points. In NIPS, Cited by: §1.
  • [19] Z. Lian, A. Godil, X. Sun, and J. Xiao (2013) CM-BOF: visual similarity-based 3d shape retrieval using clock matching and bag-of-features. Machine Vision and Applications 24 (8). Cited by: §4.4.
  • [20] D. Maturana and S. Scherer (2015) Voxnet: a 3d convolutional neural network for real-time object recognition. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Cited by: §1, §4.2, Table 1.
  • [21] R. Mottaghi, X. Chen, X. Liu, N. Cho, S. Lee, S. Fidler, R. Urtasun, and A. Yuille (2014) The role of context for object detection and semantic segmentation in the wild. In CVPR, Cited by: §2.2.
  • [22] R. Palm, U. Paquet, and O. Winther (2018) Recurrent relational networks. In NIPS, Cited by: §2.2, §2.2.
  • [23] B. T. Phong (1975) Illumination for computer generated pictures. Communications of the ACM 18 (6), pp. 311–317. Cited by: §3.1.
  • [24] T. Pylvanainen, K. Roimela, R. Vedantham, J. Itaranta, and R. Grzeszczuk (2010) Automatic alignment and multi-view segmentation of street view data using 3d shape priors. Symposium on 3D Data Processing, Visualization and Transmission (3DPVT) 737, pp. 738–739. Cited by: §1.
  • [25] C. R. Qi, H. Su, M. Nießner, A. Dai, M. Yan, and L. J. Guibas (2016) Volumetric and multi-view cnns for object classification on 3d data. In CVPR, Cited by: §1, §4.2, Table 1.
  • [26] C. R. Qi, L. Yi, H. Su, and L. J. Guibas (2017) PointNet++: deep hierarchical feature learning on point sets in a metric space. In NIPS, Cited by: §1, §4.2, Table 1.
  • [27] H. Richards-Rissetto, F. Remondino, G. Agugiaro, J. von Schwerin, J. Robertsson, and G. Girardi (2012) Kinect and 3d gis in archaeology. In International Conference on Virtual Systems and Multimedia, Cited by: §1.
  • [28] A. Santoro, D. Raposo, D. G. Barrett, M. Malinowski, R. Pascanu, P. Battaglia, and T. Lillicrap (2017) A simple neural network module for relational reasoning. In NIPS, Cited by: §2.2, §2.2.
  • [29] M. Savva, F. Yu, H. Su, A. Kanezaki, T. Furuya, R. Ohbuchi, Z. Zhou, R. Yu, S. Bai, X. Bai, et al. (2017) Large-scale 3d shape retrieval from shapenet core55: shrec’17 track. In Proceedings of the Workshop on 3D Object Retrieval, Cited by: §1, §4.1, §4.4, §4.4.
  • [30] R. Stewart, M. Andriluka, and A. Y. Ng (2016) End-to-end people detection in crowded scenes. In CVPR, Cited by: §2.2.
  • [31] H. Su, S. Maji, E. Kalogerakis, and E. G. Learnedmiller (2015) Multi-view convolutional neural networks for 3d shape recognition. In ICCV, Cited by: §1, §2.1, §3.1, §3.4, §4.2, §4.2, Table 1, Table 4.
  • [32] J. Su, M. Gadelha, R. Wang, and S. Maji (2018) A deeper look at 3d shape classifiers. In Second Workshop on 3D Reconstruction Meets Semantics, ECCV, Cited by: §2.1, §3.1, §4.2, §4.2, §4.2, §4.3, Table 1.
  • [33] C. Wang, M. Pelillo, and K. Siddiqi (2017) Dominant set clustering and pooling for multi-view 3d object recognition. In BMVC, Cited by: §2.1, §4.2, §4.2, §4.2, Table 1.
  • [34] Z. Wu, S. Song, A. Khosla, F. Yu, L. Zhang, X. Tang, and J. Xiao (2015) 3d shapenets: a deep representation for volumetric shapes. In CVPR, Cited by: §1, §1, §4.1, §4.2, Table 1.
  • [35] Y. Xu, T. Fan, M. Xu, L. Zeng, and Y. Qiao (2018) SpiderCNN: deep learning on point sets with parameterized convolutional filters. In ECCV, Cited by: §1.
  • [36] H. You, Y. Feng, X. Zhao, C. Zou, R. Ji, and Y. Gao (2019) PVRNet: point-view relation neural network for 3d shape recognition. In AAAI, Cited by: §2.1, §4.2, Table 1.
  • [37] T. Yu, J. Meng, and J. Yuan (2018) Multi-view harmonized bilinear network for 3d object recognition. In CVPR, Cited by: §1, §2.1, §4.2, §4.2, Table 1.