Auto-MVCNN: Neural Architecture Search for Multi-view 3D Shape Recognition

by Zhaoqun Li, et al.

In 3D shape recognition, multi-view based methods leverage human's perspective to analyze 3D shapes and have achieved significant outcomes. Most existing research works in deep learning adopt handcrafted networks as backbones due to their high capacity of feature extraction, and also benefit from ImageNet pretraining. However, whether these network architectures are suitable for 3D analysis or not remains unclear. In this paper, we propose a neural architecture search method named Auto-MVCNN which is particularly designed for optimizing architecture in multi-view 3D shape recognition. Auto-MVCNN extends gradient-based frameworks to process multi-view images, by automatically searching the fusion cell to explore intrinsic correlation among view features. Moreover, we develop an end-to-end scheme to enhance retrieval performance through the trade-off parameter search. Extensive experimental results show that the searched architectures significantly outperform manually designed counterparts in various aspects, and our method achieves state-of-the-art performance at the same time.





1 Introduction

Along with the emergence of large 3D repositories [Chang et al.2015, Wu et al.2015] and the development of Convolutional Neural Networks (CNNs), deep learning based 3D shape recognition has attracted strong research interest [Su et al.2015, Xie et al.2017, Qi et al.2017, Su et al.2018, Han et al.2019]. Among the different lines of work, multi-view based methods have achieved the best performance so far. In these methods, images are generally first rendered from a set of views and then passed into CNNs to obtain a shape descriptor.

Handcrafted networks are usually adopted as the backbone in current methods, where a variety of classic architectures (e.g. VGG [Simonyan and Zisserman2015], ResNet [He et al.2016]) have been employed for feature extraction. Over the past years, the majority of studies emphasized leveraging relationships among view images at the single-view feature level [Ma et al.2019, He et al.2019] or the multi-view feature level [Feng et al.2018, Han et al.2019], and therefore devoted their efforts to designing sub-networks on top of the backbone. Despite the remarkable progress achieved in previous studies, the effect of the CNN extractor has not been fully investigated, which restricts their performance to some extent. Meanwhile, in order to avoid excessive memory usage, many works [Su et al.2015, Ma et al.2019, He et al.2019] develop a multi-stage training scheme which only uses the backbone to extract view features, so the relation between feature extraction and feature fusion is neglected. These drawbacks may not only degrade performance but also increase training time and computation cost. As the neural network plays a crucial role in 3D shape recognition, it is desirable to design an efficient and powerful architecture that can process multi-view images with an end-to-end scheme.

In recent years, owing to the effectiveness of Neural Architecture Search (NAS) compared with human-designed structures [Fang et al.2020, Guo et al.2020], its application has expanded from image classification, to which NAS was originally dedicated, to various other benchmarks [Zoph et al.2018, Real et al.2019, Liu et al.2019a, Chen et al.2019]. Besides, it is worth mentioning that the combination of multiple loss functions is essential in multi-task learning and has a large impact on neural architecture design. Unfortunately, many AutoML methods based on reinforcement learning and evolutionary algorithms have extreme computational demands. Moreover, relatively few works study the balance of training and the associated techniques in the searching process. Darts [Liu et al.2019b] is a well-known gradient-based framework that largely reduces computation complexity. However, directly transferring the Darts algorithm to 3D shape recognition is not advisable. Firstly, Darts processes a single image instead of multi-view images, in which case the correlation across multiple views is neglected. In addition, we aim to develop a unified model for both classification and retrieval tasks, which differs from Darts, whose focus is single-task optimization.

To automatically search a suitable neural network for 3D shape analysis, in this paper, we propose our Auto-MVCNN which is particularly adaptive for multi-view shape recognition. Our network architecture contains three parts: a shared backbone for view feature extraction, a fusion module for multi-view feature fusion and a linear combination of loss functions for multi-task learning. The pipeline of our method is shown in Fig. 1. In the network, we propose a novel fusion cell which is specially designed for processing sequence view features. By continuous relaxation of discrete operations, it inherits the efficiency and effectiveness of Darts and can be integrated into the existing framework seamlessly. The equipped search space enables us to find appropriate fusion patterns that explore the correlation among views. For supervision signals, in addition to the shape classification loss on the top, we also add a view classification loss function for view feature enhancement and another retrieval loss function for the retrieval task. The trade-off parameters that linearly combine these loss functions are searched by an end-to-end scheme.

To evaluate the performance of Auto-MVCNN, we carry out experiments on two large-scale datasets and conduct the comparison from various aspects. Compared to handcrafted networks, with or without ImageNet pre-training, our searched networks are superior in both performance and computation resource savings. Besides, further experiments compare our method with state-of-the-art methods, indicating the effectiveness of our proposed framework. Finally, we also analyze the impact of the number of initial channels and the stability of the searching process.

To summarize, the contribution of our paper is four-fold:

  1. To the best of our knowledge, this is the first work on neural architecture search in the field of multi-view 3D shape recognition, replacing manual network design with an automatic search mechanism.

  2. We propose a novel fusion cell to process multi-view features that can be integrated into the existing framework seamlessly.

  3. We develop a simple scheme that dynamically searches loss weights of multiple loss functions, achieving appropriate training balance for multi-task learning.

  4. Extensive experiments show that the searched CNNs achieve state-of-the-art performance, using far fewer parameters than other baselines.

2 Related Work

2.1 Multi-view 3D Shape Recognition

On the basis of different formats of the processed 3D data, methods in 3D shape recognition could be roughly divided into two categories: model based methods [Osada et al.2002, Xie et al.2017, Li et al.2020] and view based methods [Su et al.2015, Wang et al.2017a]. In this section, we mainly introduce multi-view based methods which leverage 2D views’ information to construct 3D descriptors.

MVCNN [Su et al.2015] is a typical framework in which the whole pipeline is divided into two parts. The first part is the backbone for extracting view features and the other part is responsible for processing further shape features. Between the two parts, the view features are aggregated into a single shape representation through the element-wise maximum operation. In  [Bai et al.2016], a postprocessing algorithm adopting the inverted file is proposed for fast retrieval. Recently, leveraging correlation among views has become more and more popular in some research works. GVCNN [Feng et al.2018] introduces a multi-level descriptor by exploring the view-group-shape hierarchical correlation, which largely improves the performance on 3D shape classification and retrieval.  [Huang et al.] develops a local 3D shape descriptor, which makes full use of relations over points on the shape and can be directly utilized for a wide range of shape analysis tasks. Many research works [Su et al.2015, Feng et al.2018, He et al.2018] show that metric learning is essential in 3D shape retrieval task. [Li et al.2019c] designs two loss functions to separately deal with these two distances. The flexible combination property of the proposed loss functions provides effective tools to enhance retrieval performance.
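The element-wise maximum that MVCNN uses to aggregate view features into a single shape representation can be sketched as follows. This is a minimal illustration in plain Python, not the original implementation; the function name `view_max_pool` and the toy feature values are assumptions made for the example.

```python
from typing import List


def view_max_pool(view_features: List[List[float]]) -> List[float]:
    """Fuse V view feature vectors into one descriptor via element-wise max."""
    if not view_features:
        raise ValueError("need at least one view feature")
    dim = len(view_features[0])
    if any(len(f) != dim for f in view_features):
        raise ValueError("all view features must have the same length")
    return [max(f[d] for f in view_features) for d in range(dim)]


# Three hypothetical 3-dimensional view features of one shape.
views = [
    [0.1, 0.9, 0.3],
    [0.4, 0.2, 0.8],
    [0.7, 0.5, 0.6],
]
shape_descriptor = view_max_pool(views)
print(shape_descriptor)  # [0.7, 0.9, 0.8]
```

Because each output element comes from a single view, this pooling discards how the views relate to each other, which is the limitation that correlation-based methods try to address.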

2.2 Neural Architecture Search

The basic idea of NAS is to find candidate network structures through a search strategy in a defined search space, based on the feedback obtained from evaluation. The search space has developed from the entire structure at the beginning to stacked cells [Zoph et al.2018]. Cell-based search can greatly narrow the search space and improve the search efficiency, and has been applied in numerous subsequent works. However, search strategies based on reinforcement learning, such as the Q-learning algorithm in MetaQNN [Baker et al.2017], incur high computational complexity. AmoebaNet [Real et al.2019] adopts evolutionary algorithms instead of reinforcement learning to optimize performance. Although it achieves better results, it still takes 450 GPUs running for 7 consecutive days to complete the experiment.

NAS research has increasingly turned to the evident problem of heavy computation, which has led gradient-based methods and other efficient methods to emerge. ENAS [Pham et al.2018] employs weight sharing to accelerate validation, where the cell-based search mode greatly improves experimental results. Similar to ENAS, DARTS [Liu et al.2019b] also searches subgraphs in designing cells and conducts weight sharing as well. EfficientNet [Tan and Le2019] and MobileNetv3 [Howard et al.2019] use network search to obtain a fixed set of scaling factors to scale the width, depth, and resolution of the network respectively, achieving better efficiency and accuracy. To search for an appropriate loss function for face recognition, AM-LFS [Li et al.2019a] employs the REINFORCE [Williams1992] idea to automatically search for appropriate hyperparameters of the loss function, while retaining great transferability.

Figure 1: Illustration of the Auto-MVCNN pipeline. The supernet is composed of a backbone, a fusion module and three loss functions. The stem in the backbone consists of several convolutional layers, each of which has a fixed structure. Three types of cells are learned in the searching process: normal, reduction and fusion.

3 Architecture Search

The goal of Auto-MVCNN is to search for a neural architecture which is suitable for multi-view 3D shape feature representation. In this section, we first describe the formation of our end-to-end pipeline. Then we recall the definition of the neural cell and introduce our novel fusion cell for processing sequence features. Finally, we propose our loss combination scheme that automatically balances the training of classification and retrieval tasks.

3.1 Auto-MVCNN Network

In our method, a view image sequence of length $V$ is input to a shared backbone to obtain view feature vectors $\{f_i\}_{i=1}^{V}$. Then a shape descriptor is generated by fusing $\{f_i\}_{i=1}^{V}$. The pipeline is illustrated in Fig. 1.

The whole neural network, which is called supernet, is formed of multiple cells and multi-task loss functions. It is divided into two parts according to their functions: the backbone is used to extract view features and the fusion module is designed for view feature fusion. The backbone consists of a number of normal cells inserted with several reduction cells which are presented in a stacking manner. Different from the backbone composition, the fusion module is generated by fusion cells. The details of these components will be described in Sec. 3.2 and Sec. 3.3.

In our method, both classification and retrieval tasks are fulfilled with multiple supervision signals and we treat all the loss functions as important roles in neural network architecture. Besides the classification loss on the top of the supernet, we add an auxiliary classification loss and an auxiliary retrieval loss to enhance view features and boost retrieval performance respectively. The formulation of the loss functions will be described in Sec. 3.4.

3.2 Backbone Search

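In the searching stage, the cell is the basic searching component that contains the combination of all candidate operations. Formally, a cell is defined as a directed acyclic graph (DAG) which contains an ordered sequence of 7 nodes $\{x^{(0)}, \dots, x^{(6)}\}$. Each node $x^{(i)}$ represents a latent tensor (i.e. a feature map) and each edge $(i, j)$ consists of multiple parallel network layers. Two input tensors $x^{(0)}$ and $x^{(1)}$ are the outputs of previous cells and one output tensor $x^{(6)}$ is computed as the concatenation of the intermediate nodes, $x^{(6)} = \mathrm{concat}(x^{(2)}, \dots, x^{(5)})$. For an intermediate node $x^{(j)}$, the computation can be formulated as

$$x^{(j)} = \sum_{i<j} \bar{o}^{(i,j)}\left(x^{(i)}\right), \qquad (1)$$

where $\bar{o}^{(i,j)}$ is a continuous relaxation of the search space $\mathcal{O}$:

$$\bar{o}^{(i,j)}(x) = \sum_{o \in \mathcal{O}} \frac{\exp\left(\alpha_o^{(i,j)}\right)}{\sum_{o' \in \mathcal{O}} \exp\left(\alpha_{o'}^{(i,j)}\right)} \, o(x). \qquad (2)$$

Here, $\alpha_o^{(i,j)}$ is the architecture parameter that represents the weight of operation $o$ on the edge $(i, j)$. The continuous relaxation allows us to optimize $\alpha$ by gradient descent. $\mathcal{O}$ is the set of candidate operations, and for consistency we choose the same set as previous works [Liu et al.2019b, Zoph et al.2018, Liu et al.2018, Liu et al.2019a]: $3 \times 3$ and $5 \times 5$ separable convolutions, $3 \times 3$ and $5 \times 5$ atrous convolutions, $3 \times 3$ average pooling, $3 \times 3$ max pooling, skip connection, and zero.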

A cell that keeps the same spatial resolution as the previous cell is a normal cell, and a cell that divides the spatial dimension by 2 is a reduction cell. In both the training and the inference stage, view features are extracted simultaneously from the backbone.
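The continuous relaxation used in the backbone search can be illustrated with a small sketch: every edge outputs a softmax-weighted sum of its candidate operations, and an intermediate node sums the outputs of its incoming edges. The code below assumes the softmax weighting used by Darts [Liu et al.2019b]. The three toy operations and all function names are hypothetical stand-ins for real convolution and pooling layers.

```python
import math
from typing import Callable, Dict, List

Vector = List[float]


def softmax(logits: List[float]) -> List[float]:
    m = max(logits)
    exps = [math.exp(a - m) for a in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy stand-ins for candidate operations (assumption: real ones are conv/pool layers).
CANDIDATE_OPS: Dict[str, Callable[[Vector], Vector]] = {
    "skip": lambda x: list(x),
    "double": lambda x: [2.0 * v for v in x],
    "zero": lambda x: [0.0 for _ in x],
}


def mixed_op(x: Vector, alpha: Dict[str, float]) -> Vector:
    """Softmax-weighted sum of all candidate operations applied to x."""
    names = list(CANDIDATE_OPS)
    weights = softmax([alpha[n] for n in names])
    out = [0.0] * len(x)
    for w, n in zip(weights, names):
        y = CANDIDATE_OPS[n](x)
        out = [o + w * v for o, v in zip(out, y)]
    return out


def intermediate_node(inputs: List[Vector], alphas: List[Dict[str, float]]) -> Vector:
    """Sum of the mixed operations over all predecessor nodes."""
    out = [0.0] * len(inputs[0])
    for x, a in zip(inputs, alphas):
        out = [o + v for o, v in zip(out, mixed_op(x, a))]
    return out


uniform = {"skip": 0.0, "double": 0.0, "zero": 0.0}
print(mixed_op([3.0, 6.0], uniform))
```

With uniform architecture parameters each operation contributes one third of its output. As one parameter grows, the edge approaches that single operation, which is what makes the later discretization meaningful.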

3.3 Fusion Module Search

Figure 2: Fusion cell. We take $k = 3$ and one kernel as an example. The fused feature is obtained by iteratively aggregating 3 view features.

The NAS framework [Liu et al.2019b] focuses on image classification and can only generate single-view features. Though feasible, a simple combination (e.g. max-pooling, view-wise addition) of view-level features leads to information loss that largely degrades the performance. Drawing on the experience of previous works [Feng et al.2018, He et al.2019], we conclude that leveraging the spatial information and the feature correlation among views is essential for obtaining competitive performance.

Following the principle mentioned above, we design the fusion module in Auto-MVCNN to aggregate view features into a compact and discriminative shape feature. This module consists of two sequential fusion cells that are developed to process the sequence of $V$ view features $F = \{f_1, \dots, f_V\}$. In order to integrate the fusion cell into the existing optimization framework, the fusion cell has a search space $\mathcal{O}_f$ which is similar to the backbone search space $\mathcal{O}$ but applies a size adaption to the kernel size of all operations. Concretely, we regard $F$ as a three-dimensional tensor of shape $C \times V \times 1$ with $C$ channels and spatial dimension $V \times 1$. The size adaption changes the kernel size from $k \times k$ to $k \times 1$ for all operations in $\mathcal{O}$. In this way, we can apply all the operations in $\mathcal{O}_f$ to $F$. The size adaption of operations is illustrated in Fig. 2. The fusion cell is then formed by linking these operations using Eq. 2. It is worth pointing out that the size adaption is equivalent to zero-padding $F$ along its singleton spatial dimension, while the operations remain the same as in $\mathcal{O}$.

This compatibility gives the fusion cell several useful properties. Each operation in the fusion search space conserves the spatial relationship of its input tensors and thus can model the spatial information among views. The correlation among view features can also be revealed by a variety of operations. In addition, the fusion cell inherits the diversity of the combinations among different layers, which enables the fusion module to search for novel fusion patterns.
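The equivalence behind the size adaption can be checked numerically. The sketch below assumes a single channel and treats the view sequence as a $V \times 1$ map. Under that assumption, a square kernel applied with zero padding only ever touches real data through its middle column, so it gives the same result as the adapted one-column kernel. The helper `conv2d_same` and all values are illustrative, not taken from the paper.

```python
from typing import List

Matrix = List[List[float]]


def conv2d_same(x: Matrix, kernel: Matrix) -> Matrix:
    """Single-channel 2D cross-correlation with zero padding (odd kernel sizes)."""
    h, w = len(x), len(x[0])
    kh, kw = len(kernel), len(kernel[0])
    ph, pw = kh // 2, kw // 2
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            acc = 0.0
            for a in range(kh):
                for b in range(kw):
                    r, c = i + a - ph, j + b - pw
                    if 0 <= r < h and 0 <= c < w:  # positions outside are zero padding
                        acc += kernel[a][b] * x[r][c]
            out[i][j] = acc
    return out


# A sequence of V = 4 one-channel view features, viewed as a V x 1 map.
views: Matrix = [[1.0], [2.0], [3.0], [4.0]]

kernel_3x3: Matrix = [
    [0.5, 1.0, -0.5],
    [2.0, 3.0, 4.0],
    [-1.0, 0.5, 1.5],
]
# Size adaption: keep only the middle column, giving a 3 x 1 kernel.
kernel_3x1: Matrix = [[row[1]] for row in kernel_3x3]

full = conv2d_same(views, kernel_3x3)
adapted = conv2d_same(views, kernel_3x1)
print(full == adapted, adapted)  # True [[4.0], [8.5], [13.0], [15.0]]
```

Each output mixes up to three neighbouring views, which matches the caption of Fig. 2 where a fused feature aggregates 3 view features.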

3.4 Trade-off of Loss Functions

Auto-MVCNN aims to develop a network for both the shape classification task and the shape retrieval task, where training balance is extremely important. There are three loss functions in total in the supernet. Two of them are softmax losses located at the top and in the middle of Auto-MVCNN, in charge of shape classification and view feature enhancement respectively. The third is the loss function proposed in [Li et al.2019c], which is used for enlarging the inter-class distance:


where and are the batch size and the ground truth label respectively.

In general, classification places emphasis on predicting the right label, while for retrieval the feature distribution is more important. This phenomenon is also observed in [Su et al.2015, Feng et al.2018], which adopt offline metric learning algorithms to improve retrieval performance. In this paper, by contrast, we tackle the issue by adding a metric learning loss function within an end-to-end training scheme. In practice, the multi-task loss functions are linearly combined. Since our target is to search for a proper training balance, we normalize the loss weights using trade-off parameters and the total objective loss function is formulated as:


An appropriate value distribution of the trade-off parameters could enhance performance without impeding the classification task, while a relatively quick drop of one loss means that its gradient is large in backpropagation, which hampers the training of the other tasks. However, directly optimizing the trade-off parameters by minimizing the total loss is infeasible: if one loss is the minimum among the three losses, its weight will simply become large, which is irrelevant to the balance of training. To tackle this issue, we develop a scheme that searches for the balance of training by leveraging the performance on the validation set. Specifically, we track the value of each loss at every iteration on the validation set and define from it the training rate of that loss. We propose a regularization term that directly involves the trade-off parameters and the training rate:


The regularization term penalizes a trade-off parameter when its corresponding loss drops quickly and, in turn, augments the weight of a task whose training is relatively slow.
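Why the trade-off parameters cannot be learned by naively minimizing the combined loss can be seen in a small experiment. The sketch below assumes softmax-normalized weights, which is one common normalization and not necessarily the one used in this paper. The loss values are made up, and the regularization term itself is not reproduced here.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    s = sum(exps)
    return [e / s for e in exps]


def total_loss(logits: List[float], losses: List[float]) -> float:
    """Linear combination of the losses with normalized weights."""
    weights = softmax(logits)
    return sum(w * l for w, l in zip(weights, losses))


def grad_wrt_logits(logits: List[float], losses: List[float]) -> List[float]:
    """d(total)/d(logit_i) = w_i * (L_i - total) for softmax-normalized weights."""
    weights = softmax(logits)
    total = sum(w * l for w, l in zip(weights, losses))
    return [w * (l - total) for w, l in zip(weights, losses)]


losses = [1.2, 0.8, 0.3]   # hypothetical, fixed loss values of three tasks
logits = [0.0, 0.0, 0.0]   # trade-off parameters before normalization
initial = total_loss(logits, losses)
for _ in range(500):
    grads = grad_wrt_logits(logits, losses)
    logits = [a - 1.0 * g for a, g in zip(logits, grads)]
weights = softmax(logits)
print([round(w, 3) for w in weights])
```

Gradient descent pushes almost all of the weight onto the task with the smallest loss, regardless of how well the other tasks are trained. This is the degenerate behaviour that a regularizer tied to the training rate is meant to prevent.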

4 Training and Evaluation of Network

Input: architecture parameters , network parameters , ,
while not converged do
       Sample mini-batch from , calculate and ;
       Update by descending ;
       Update by descending ;
       Sample mini-batch from , calculate ;
       Update by descending ;
end while
Algorithm 1 The Auto-MVCNN search algorithm
Figure 3: The learned cells on ModelNet40. The “sep”, “dil”, “max”, “avg” in the figure represent depthwise-separable convolution, atrous convolution, max-pooling and average-pooling respectively.

4.1 Optimization

In our method, searching for a neural network is a bilevel optimization problem that has two different sets of parameters: the network parameters $w$ and the architecture parameters $\alpha$. We follow the first-order approximation proposed in Darts [Liu et al.2019b] and manually split the training data into two disjoint sets $D_{train}$ and $D_{val}$. The optimization of $w$ and $\alpha$ is carried out in an alternating fashion on $D_{train}$ and $D_{val}$ until convergence, as shown in Alg. 1. The stability of the search is discussed in the Appendix.
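The alternating first-order scheme can be illustrated on a toy problem. In the sketch below, a single edge mixes two hypothetical candidate operations, $x$ and $-x$, with a scalar network weight on top. The network weight is updated on one split and the architecture parameters on the other, using full-batch gradients for simplicity. The data, learning rates and function names are assumptions made for illustration and are unrelated to the actual Auto-MVCNN supernet.

```python
import math


def softmax2(a, b):
    m = max(a, b)
    ea, eb = math.exp(a - m), math.exp(b - m)
    return ea / (ea + eb), eb / (ea + eb)


def predict(x, w, alpha):
    # Mixed operation over two toy candidates: op_pos(x) = x and op_neg(x) = -x.
    p_pos, p_neg = softmax2(alpha[0], alpha[1])
    return w * (p_pos * x + p_neg * (-x))


def mse(data, w, alpha):
    return sum((predict(x, w, alpha) - y) ** 2 for x, y in data) / len(data)


def grad_w(data, w, alpha):
    p_pos, p_neg = softmax2(alpha[0], alpha[1])
    d = p_pos - p_neg
    return sum(2.0 * (w * d * x - y) * d * x for x, y in data) / len(data)


def grad_alpha(data, w, alpha):
    p_pos, p_neg = softmax2(alpha[0], alpha[1])
    d = p_pos - p_neg
    # Gradient w.r.t. each mixing probability, then chain rule through the softmax.
    g_pos = sum(2.0 * (w * d * x - y) * w * x for x, y in data) / len(data)
    g_neg = -g_pos
    mean_g = p_pos * g_pos + p_neg * g_neg
    return [p_pos * (g_pos - mean_g), p_neg * (g_neg - mean_g)]


# Disjoint toy splits of samples (x, y) generated by y = 2x.
d_train = [(1.0, 2.0), (2.0, 4.0)]
d_val = [(0.5, 1.0), (1.5, 3.0)]

w, alpha = 1.0, [0.0, 0.0]
lr_w, lr_alpha = 0.05, 0.05
for _ in range(300):
    w -= lr_w * grad_w(d_train, w, alpha)   # network step on the training split
    g = grad_alpha(d_val, w, alpha)         # architecture step on the validation split
    alpha = [a - lr_alpha * gi for a, gi in zip(alpha, g)]

p_pos, p_neg = softmax2(alpha[0], alpha[1])
print(round(p_pos, 3), round(mse(d_val, w, alpha), 6))
```

The architecture parameters end up favouring the operation that matches the data, while the network weight absorbs the remaining scale. Each update treats the other parameter set as fixed, which is what the first-order approximation amounts to.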

4.2 Evaluation

After the search convergence, the final cell for evaluation is pruned by selecting non-zero layers in the connection. The selection is achieved by retaining the top-k strongest operations from edge to edge :


We use a fixed $k$ in our method. The extracted cells are then stacked to form the supernet and retrained for evaluation. The final searched cells for evaluation are shown in Fig. 3 and the final searched loss weights are (0.216, 0.204, 0.580), as listed in the last row of Tab. 5. The search time is about 4 GPU hours.
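The pruning step can be sketched as follows, assuming the convention of Darts [Liu et al.2019b]: on every incoming edge of a node the strongest operation other than zero is selected, and only the $k$ strongest edges are kept. The default `k=2`, the operation names and the parameter values are assumptions for illustration, since the exact setting is not reproduced here.

```python
import math
from typing import Dict, List, Tuple


def softmax(logits: List[float]) -> List[float]:
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    s = sum(exps)
    return [e / s for e in exps]


def derive_node(incoming: Dict[int, Dict[str, float]], k: int = 2) -> List[Tuple[int, str]]:
    """incoming maps a predecessor index to {operation name: architecture parameter}.

    Returns the k strongest (predecessor, operation) pairs, ignoring "zero".
    """
    candidates = []
    for i, alpha in incoming.items():
        names = list(alpha)
        probs = softmax([alpha[n] for n in names])
        best_prob, best_op = max((p, n) for p, n in zip(probs, names) if n != "zero")
        candidates.append((best_prob, i, best_op))
    candidates.sort(reverse=True)
    return [(i, op) for _, i, op in candidates[:k]]


# Hypothetical architecture parameters on the three incoming edges of one node.
incoming = {
    0: {"sep_conv_3x3": 1.2, "max_pool_3x3": 0.1, "zero": 2.0},
    1: {"sep_conv_3x3": 0.3, "max_pool_3x3": 0.4, "zero": 0.0},
    2: {"sep_conv_3x3": -0.5, "max_pool_3x3": 1.5, "zero": 0.2},
}
print(derive_node(incoming))  # [(2, 'max_pool_3x3'), (1, 'max_pool_3x3')]
```

Edge 0 is dropped here even though its raw parameter for the separable convolution is large, because most of its softmax mass sits on the zero operation.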

For the sake of stability, the loss weights need to be rescaled in the retraining so that the weight of one designated loss equals 1. When we retrain the architecture, its initial number of channels is changed to 24 and 36 (to match the size of popular NAS architectures), generating our two representative models AM_c24 and AM_c36.

5 Experimental Results

| Backbone | Fusion layer | Params (M) | MACs (G) | Cputime (ms) | Accuracy (w/o pretrain) | mAP (w/o pretrain) | Accuracy (w/ pretrain) | mAP (w/ pretrain) |
|---|---|---|---|---|---|---|---|---|
| VGG-M | conv-5 | 90.5 | 34.6 | 204 | 81.1% | 66.2% | 90.4% | 78.7% |
| ResNet-18 | resblock-5 | 11.2 | 21.9 | 111 | 84.5% | 64.0% | 91.0% | 81.7% |
| ResNet-50 | resblock-5 | 23.6 | 49.6 | 353 | 85.5% | 63.4% | 91.3% | 82.9% |
| GoogLeNet | inception-5b | 10.1 | 18.3 | 209 | 88.8% | 75.0% | 91.9% | 85.0% |
| VGG-11 | conv-6 | 130.0 | 80.3 | 403 | 86.5% | 80.2% | 91.4% | 86.1% |
| VGG-11 | conv-8 | 130.0 | 90.5 | 460 | 86.9% | 78.3% | 91.3% | 85.3% |
| VGG-19 | conv-16 | 139.7 | 235.2 | 970 | 82.4% | 72.2% | 90.6% | 86.6% |
| AM_c24 | cell-5 | 2.1 | 3.2 | 200 | 90.5% | 86.9% | 93.9% | 90.9% |
| AM_c36 | cell-5 | 4.7 | 6.9 | 289 | 91.0% | 87.6% | 94.4% | 91.0% |

Table 1: The performance comparison of different architectures on ModelNet40.

5.1 Dataset and Metrics

To evaluate the performance of our method, we conduct experiments on the Princeton ModelNet dataset [Wu et al.2015] and the ShapeNetCore55 dataset [Savva et al.2016b].

ModelNet is a large-scale 3D shape dataset which contains 127,915 3D CAD models divided into 662 categories. We use its subset ModelNet40, which includes 12,311 manually cleaned models from 40 categories, in our evaluation. We follow the same training/testing split as described in [Su et al.2015], randomly selecting 100 unique models per category from the subset, of which 80 models are used for training and the rest for testing. The evaluation metrics adopted on this dataset include the (per-class) classification accuracy, the mean average precision (mAP) and the area under curve (AUC). Their detailed definitions can be found in [Wu et al.2015].
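Since the reported accuracy is per-class, it helps to see how it differs from plain instance accuracy. The sketch below uses the common definition (the mean of the per-category accuracies). The labels are made up and the function names are only illustrative.

```python
from collections import defaultdict
from typing import Sequence


def per_class_accuracy(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Mean of the accuracies computed separately for every category."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        correct[t] += int(t == p)
    return sum(correct[c] / total[c] for c in total) / len(total)


def instance_accuracy(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Fraction of all samples that are classified correctly."""
    return sum(int(t == p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical, imbalanced test set: four chairs and one lamp.
y_true = ["chair", "chair", "chair", "chair", "lamp"]
y_pred = ["chair", "chair", "chair", "lamp", "chair"]
print(per_class_accuracy(y_true, y_pred))  # 0.375
print(instance_accuracy(y_true, y_pred))   # 0.6
```

On imbalanced data the two numbers can differ considerably, because per-class accuracy gives every category the same weight regardless of its size.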

The ShapeNetCore55 dataset, introduced in the Shape Retrieval Contest (SHREC) 2016 competition track, contains 51,190 3D shapes from 55 common categories and is a subset of the full ShapeNet dataset with clean 3D models. Each model in this dataset carries a label from the 55 categories plus a fine-grained label from 204 subcategories. The dataset comes in two versions, named “normal” and “perturbed”: the 3D shapes in the former are aligned, while the latter is more challenging because all shapes are rotated randomly. For the split, 70% of the shapes are provided for training, 10% for validation, and the remaining 20% form the testing set. Refer to [Savva et al.2016b] for the definitions of the F-Measure (F-1) and NDCG metrics used in this paper.

5.2 Implementation Details

The experiments are carried out on a server with four Nvidia RTX 2080Ti GPUs, an Intel Xeon E5-2678 v3 CPU and 128 GB of RAM. Before training and testing, each shape is rendered to generate 12 images of size 224×224, following the same protocol as [Su et al.2018].

Architecture search on ModelNet40. In the experimental settings, we employ 3 normal cells, 2 reduction cells and 2 fusion cells to build the architecture space. The supernet contains a stem at the bottom followed by 7 cells stacked sequentially, as displayed in Fig. 1. During the searching process, half of the ModelNet40 training data is set aside as the validation set. The batch size is 36 and the initial number of channels is 16. Please refer to the Appendix for other hyperparameter settings.

Architecture evaluation on ModelNet40 and ShapeNetCore55. To evaluate the performance of the searched architectures, we need to retrain the derived supernet on our target dataset. For retraining, the batch size is set to 36 and SGD with an initial learning rate of 0.01 is adopted. The shape descriptor is extracted for the retrieval task using cosine distance. For a fair comparison with other methods, we also pretrain the supernet on the ImageNet classification benchmark. We propose two models for evaluation, AM_c24 and AM_c36, with 24 and 36 initial channels respectively. Please refer to the Appendix for detailed hyperparameter settings.
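Retrieval with cosine distance and its evaluation by average precision can be sketched as below. The code uses the common definition of average precision for a single query, and mAP is its mean over all queries. The descriptors and labels are toy values and the function names are hypothetical. The precise metric definitions used in the paper are those of the cited references.

```python
import math
from typing import List, Sequence


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def rank_gallery(query: Sequence[float], gallery: List[Sequence[float]]) -> List[int]:
    """Gallery indices sorted from the closest to the farthest descriptor."""
    return sorted(range(len(gallery)), key=lambda i: cosine_distance(query, gallery[i]))


def average_precision(ranked_labels: Sequence[str], query_label: str) -> float:
    """Mean of the precision values at the rank of every relevant item."""
    hits, precision_sum = 0, 0.0
    for rank, label in enumerate(ranked_labels, start=1):
        if label == query_label:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


# Toy 2-dimensional shape descriptors.
query, query_label = [1.0, 0.0], "chair"
gallery = [[0.9, 0.1], [0.0, 1.0], [0.5, 0.5], [-1.0, 0.0]]
labels = ["chair", "chair", "table", "table"]

order = rank_gallery(query, gallery)
ap = average_precision([labels[i] for i in order], query_label)
print(order, round(ap, 4))  # [0, 2, 1, 3] 0.8333
```

The second chair is retrieved only at rank 3, behind a table, so the average precision falls below 1 even though every chair is eventually found.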

5.3 Main Results

| Method | Backbone | Accuracy | AUC | mAP |
|---|---|---|---|---|
| 3DShapeNet | - | 77.3% | 49.9% | 49.2% |
| DeepPano | - | 82.5% | 77.6% | 76.8% |
| PointNet | - | 86.2% | - | - |
| Octree | - | 90.6% | - | - |
| MVCNN-su | VGG-M | 90.1% | - | 80.2% |
| ATCL | VGG-M | - | 87.2% | 86.1% |
| GVCNN | GoogLeNet | 93.1% | - | 85.7% |
| NCENet | GoogLeNet | - | 88.0% | 87.1% |
| MV-LSTM | ResNet18 | 91.1% | 85.7% | 84.3% |
| RED | ResNet50 | - | 87.0% | 86.3% |
| MVCNN-new | VGG-11 | 92.4% | - | - |
| TCL | VGG-11 | - | 89.0% | 88.0% |
| SeqViews | VGG-19 | 91.4% | - | 89.1% |
| VNN | VGG-19 | - | 90.2% | 89.3% |
| Auto-MVCNN | AM_c24 | 93.9% | 91.5% | 90.9% |
| Auto-MVCNN | AM_c36 | 94.4% | 91.6% | 91.0% |

Table 2: The performance comparison with state-of-the-art methods on ModelNet40.

Comparison with hand-crafted networks. For the comparative experiments, we train several popular hand-crafted networks in this domain using the same training protocol. The comparison covers several aspects, and the results are reported in Tab. 1. We use the number of network parameters (Params) and of multiply-accumulate operations (MACs) to measure the network size and the computation cost. Cputime is the network inference time averaged over 10 runs. These values are obtained by inputting an image sequence into the network. For a comprehensive comparison, we take the following factors into consideration: (1) the position of the fusion layer, (2) with or without ImageNet pretraining. We can see that the two Auto-MVCNN models of different sizes, AM_c24 and AM_c36, use the fewest parameters and have the lowest computation cost, consequently saving a lot of memory. In addition, Auto-MVCNN obtains the highest accuracy and mAP, demonstrating the best performance both with and without ImageNet pretraining.
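How Params and MACs are counted for a single layer can be sketched with the standard formulas for a convolution. The comparison with a depthwise-separable convolution hints at why cell-based networks built from separable operations stay small. The layer sizes are arbitrary examples, biases and normalization layers are ignored, and the helper names are hypothetical.

```python
def conv_params(c_in: int, c_out: int, k: int) -> int:
    """Weights of a standard k x k convolution (bias ignored)."""
    return c_in * c_out * k * k


def conv_macs(c_in: int, c_out: int, k: int, h_out: int, w_out: int) -> int:
    """Multiply-accumulate operations of a standard k x k convolution."""
    return conv_params(c_in, c_out, k) * h_out * w_out


def sep_conv_params(c_in: int, c_out: int, k: int) -> int:
    """Depthwise k x k convolution followed by a pointwise 1 x 1 convolution."""
    return c_in * k * k + c_in * c_out


def sep_conv_macs(c_in: int, c_out: int, k: int, h_out: int, w_out: int) -> int:
    return sep_conv_params(c_in, c_out, k) * h_out * w_out


# Example layer: 64 -> 64 channels, 3 x 3 kernel, 56 x 56 output map.
print(conv_params(64, 64, 3), conv_macs(64, 64, 3, 56, 56))          # 36864 115605504
print(sep_conv_params(64, 64, 3), sep_conv_macs(64, 64, 3, 56, 56))  # 4672 14651392
```

For a multi-view input, the backbone cost of one image is multiplied by the number of views, since every view passes through the shared backbone.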

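| Method | Micro F-1 | Micro mAP | Micro NDCG | Macro F-1 | Macro mAP | Macro NDCG |
|---|---|---|---|---|---|---|
| Wang | 24.6% | 60.0% | 77.6% | 16.3% | 47.8% | 69.5% |
| Li | 53.4% | 74.9% | 86.5% | 18.2% | 57.9% | 76.7% |
| Kd-network | 45.1% | 61.7% | 81.4% | 24.1% | 48.4% | 72.6% |
| MVCNN | 61.2% | 73.4% | 84.3% | 41.6% | 66.2% | 79.3% |
| GIFT | 66.1% | 81.1% | 88.9% | 42.3% | 73.0% | 84.3% |
| TCL | 67.9% | 84.0% | 89.5% | 43.9% | 78.3% | 86.9% |
| VNN | 71.3% | 84.3% | 89.7% | 50.1% | 78.0% | 86.8% |
| NCENet | 73.3% | 89.6% | 92.1% | 51.3% | 85.6% | 90.5% |
| Ours | 68.1% | 91.1% | 92.3% | 51.4% | 86.2% | 91.2% |

Table 3: The performance comparison on SHREC16 perturbed dataset.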

Comparison with state-of-the-art methods. We choose model-based methods including 3DShapeNet [Wu et al.2015], DeepPano [Shi et al.2015], PointNet [Qi et al.2017], Octree [Wang et al.2017b], and view-based methods including MVCNN-su [Su et al.2015], MVCNN-new [Su et al.2018], ATCL [Li et al.2019b], NCENet [Xu et al.2019], TCL [He et al.2018], RED [Song et al.2017], SeqViews [Han et al.2019], GVCNN [Feng et al.2018], VNN [He et al.2019] and MV-LSTM [Ma et al.2019] for comparison. The comparison results are reported in Tab. 2 (we report per-class accuracy). With ImageNet pretraining, Auto-MVCNN has the highest accuracy, AUC and mAP, reaching 94.4%, 91.6% and 91.0% respectively. In particular, AM_c36 achieves better performance than AM_c24.

Our proposed Auto-MVCNN is also evaluated on the ShapeNetCore55 perturbed dataset. This perturbed dataset is more challenging as all shapes are rotated randomly. Note that the architecture used here is the one searched on ModelNet40. We choose the participants of the competition [Savva et al.2016a, Klokov and Lempitsky2017] and other popular methods for comparison. As shown in Tab. 3, our method (AM_c36) outperforms the others in both the mAP and NDCG metrics. We attribute the slightly lower F-Measure to the transfer between datasets.

5.4 Ablation Study

Effect of fusion module. To demonstrate the effectiveness of our fusion cells, in our searched network, we manually replace the fusion cells with normal cells and conduct experiments on ModelNet40. Note that we maintain the same number of layers and same supervision signals. We also choose three popular cell-based NAS networks that have similar network size to ours as comparison (the supervision is single softmax loss). As these architectures are searched for single image classification, we adopt a view max-pooling operation to fuse the view features after the penultimate layer as [Su et al.2015]. As we can see from Tab. 4, owing to the ability of the fusion module that can explore the intrinsic correlation among view features, our learned network outperforms other NAS architectures. When compared with [Howard et al.2019] and [Tan and Le2019], the following two other factors also contribute to performance improvement: (1) Our network is derived directly from the ModelNet40 dataset while others are searched on classification benchmarks; (2) Multiple supervision signals and corresponding appropriate loss weights enhance the performance on both shape classification and shape retrieval.

Effect of dynamic loss balance. To reveal the superiority of our loss weight balancing method, we choose several commonly adopted loss combinations and compare their performance. The results are shown in Tab. 5. Note that the values in the loss combinations need to be rescaled to conduct retraining (see Sec. 4.2). We can see from the first three experiments that both auxiliary loss functions are essential for competitive retrieval performance, and that our result is better than the others. When a loss combination is close to ours, its performance is also similar to ours.

| Architecture | Fusion cell | Accuracy | mAP |
|---|---|---|---|
| Darts_v2 | no | 92.9% | 83.0% |
| MobileNetv3 | no | 92.1% | 73.3% |
| EfficientNet_b0 | no | 90.3% | 71.6% |
| AM_c24 | no | 92.9% | 87.0% |
| AM_c24 | yes | 93.9% | 90.9% |
| AM_c36 | no | 92.8% | 88.8% |
| AM_c36 | yes | 94.4% | 91.0% |

Table 4: The performance comparison of different NAS architectures with ImageNet pretraining.

| Loss combination | Accuracy (w/o pretrain) | mAP (w/o pretrain) | Accuracy (w/ pretrain) | mAP (w/ pretrain) |
|---|---|---|---|---|
| 1, 0, 0 | 88.8% | 77.4% | 90.5% | 80.4% |
| 0.5, 0.5, 0 | 90.3% | 82.3% | 93.3% | 86.4% |
| 0.5, 0, 0.5 | 88.6% | 82.2% | 89.9% | 83.8% |
| 0.5, 0.25, 0.25 | 89.6% | 85.4% | 92.6% | 89.3% |
| 0.4, 0.3, 0.3 | 87.4% | 82.8% | 93.5% | 89.9% |
| 0.2, 0.2, 0.6 | 90.7% | 87.2% | 94.1% | 90.9% |
| 0.216, 0.204, 0.580 | 91.0% | 87.6% | 94.4% | 91.0% |

Table 5: The results under different loss combinations.

6 Conclusions

In this paper, aiming at the problem of multi-view 3D shape recognition, we propose a novel neural architecture search framework to optimize architectures, named Auto-MVCNN. For the first time, it abandons hand-crafted networks as the backbone, which greatly reduces the number of parameters and the computational complexity. The proposed fusion cell enables the whole network to explore the intrinsic connections among view features automatically, making fuller use of the 3D information. In addition, we apply an end-to-end searching scheme for the training balance, improving both classification and retrieval performance. Extensive experiments show that Auto-MVCNN achieves the best performance in various aspects and confirm its effectiveness.


  • [Bai et al.2016] Song Bai, Xiang Bai, Zhichao Zhou, Zhaoxiang Zhang, and Longin Jan Latecki. Gift: A real-time and scalable 3d shape search engine. In CVPR, pages 5023–5032, 2016.
  • [Baker et al.2017] Bowen Baker, Otkrist Gupta, Nikhil Naik, and Ramesh Raskar. Designing neural network architectures using reinforcement learning. In ICLR, 2017.
  • [Chang et al.2015] Angel X Chang, Thomas Funkhouser, Leonidas Guibas, Pat Hanrahan, Qixing Huang, Zimo Li, Silvio Savarese, Manolis Savva, Shuran Song, Hao Su, et al. Shapenet: An information-rich 3d model repository. arXiv preprint arXiv:1512.03012, 2015.
  • [Chen et al.2019] Yukang Chen, Tong Yang, Xiangyu Zhang, Gaofeng Meng, Xinyu Xiao, and Jian Sun. Detnas: Backbone search for object detection. In NeurIPS, 2019.
  • [Fang et al.2020] Jiemin Fang, Yuzhu Sun, Qian Zhang, Yuan Li, Wenyu Liu, and Xinggang Wang. Densely connected search space for more flexible neural architecture search. 2020.
  • [Feng et al.2018] Yifan Feng, Zizhao Zhang, Xibin Zhao, Rongrong Ji, and Yue Gao. Gvcnn: Group-view convolutional neural networks for 3d shape recognition. In CVPR, pages 264–272, 2018.
  • [Guo et al.2020] Minghao Guo, Yuzhe Yang, Rui Xu, Ziwei Liu, and Dahua Lin. When nas meets robustness: In search of robust architectures against adversarial attacks. 2020.
  • [Han et al.2019] Z. Han, M. Shang, Z. Liu, C. Vong, Y. Liu, M. Zwicker, J. Han, and C. L. P. Chen. Seqviews2seqlabels: Learning 3d global features via aggregating sequential views by rnn with attention. IEEE Transactions on Image Processing, 28(2):658–672, 2019.
  • [He et al.2016] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, pages 770–778, 2016.
  • [He et al.2018] Xinwei He, Yang Zhou, Zhichao Zhou, Song Bai, and Xiang Bai. Triplet-center loss for multi-view 3d object retrieval. In CVPR, pages 1945–1954, 2018.
  • [He et al.2019] Xinwei He, Tengteng Huang, Song Bai, and Xiang Bai. View n-gram network for 3d object retrieval. In ICCV, 2019.
  • [Howard et al.2019] Andrew Howard, Mark Sandler, Grace Chu, Liang-Chieh Chen, Bo Chen, Mingxing Tan, Weijun Wang, Yukun Zhu, Ruoming Pang, Vijay Vasudevan, Quoc V. Le, and Hartwig Adam. Searching for mobilenetv3. In ICCV, 2019.
  • [Huang et al.] Haibin Huang, Evangelos Kalogerakis, Siddhartha Chaudhuri, Duygu Ceylan, Vladimir G. Kim, and Ersin Yumer. Learning local shape descriptors from part correspondences with multiview convolutional networks. volume 37, pages 1–14.
  • [Kingma and Ba2015] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
  • [Klokov and Lempitsky2017] Roman Klokov and Victor Lempitsky. Escape from cells: Deep kd-networks for the recognition of 3d point cloud models. In ICCV, pages 863–872, 2017.
  • [Li et al.2019a] Chuming Li, Xin Yuan, Chen Lin, Minghao Guo, Wei Wu, Junjie Yan, and Wanli Ouyang. Am-lfs: Automl for loss function search. In ICCV, 2019.
  • [Li et al.2019b] Zhaoqun Li, Cheng Xu, and Biao Leng. Angular triplet-center loss for multi-view 3d shape retrieval. In AAAI, pages 8682–8689, 2019.
  • [Li et al.2019c] Zhaoqun Li, Cheng Xu, and Biao Leng. Rethinking loss design for large-scale 3d shape retrieval. In IJCAI, pages 840–846, 2019.
  • [Li et al.2020] Guohao Li, Guocheng Qian, Itzel C Delgadillo, Matthias Müller, Ali Thabet, and Bernard Ghanem. Sgas: Sequential greedy architecture search. In CVPR, 2020.
  • [Liu et al.2018] Chenxi Liu, Barret Zoph, Maxim Neumann, Jonathon Shlens, Wei Hua, Li-Jia Li, Li Fei-Fei, Alan Yuille, Jonathan Huang, and Kevin Murphy. Progressive neural architecture search. In ECCV, 2018.
  • [Liu et al.2019a] Chenxi Liu, Liang-Chieh Chen, Florian Schroff, Hartwig Adam, Wei Hua, Alan L Yuille, and Li Fei-Fei. Auto-deeplab: Hierarchical neural architecture search for semantic image segmentation. In CVPR, pages 82–92, 2019.
  • [Liu et al.2019b] Hanxiao Liu, Karen Simonyan, and Yiming Yang. Darts: Differentiable architecture search. In ICLR, 2019.
  • [Ma et al.2019] Chao Ma, Yulan Guo, Jungang Yang, and Wei An. Learning multi-view representation with lstm for 3-d shape recognition and retrieval. IEEE Transactions on Multimedia, 21(5):1169–1182, 2019.
  • [Osada et al.2002] Robert Osada, Thomas Funkhouser, Bernard Chazelle, and David Dobkin. Shape distributions. ACM Transactions on Graphics, 21(4):807–832, 2002.
  • [Pham et al.2018] Hieu Pham, Melody Y. Guan, Barret Zoph, Quoc V. Le, and Jeff Dean. Efficient neural architecture search via parameter sharing. In ICML, 2018.
  • [Qi et al.2017] Charles R Qi, Hao Su, Kaichun Mo, and Leonidas J Guibas. Pointnet: Deep learning on point sets for 3d classification and segmentation. In CVPR, pages 77–85, 2017.
  • [Real et al.2019] Esteban Real, Alok Aggarwal, Yanping Huang, and Quoc V Le. Regularized evolution for image classifier architecture search. In AAAI, volume 33, pages 4780–4789, 2019.
  • [Savva et al.2016a] M. Savva, F. Yu, Hao Su, M. Aono, B. Chen, D. Cohen-Or, W. Deng, Hang Su, S. Bai, X. Bai, N. Fish, J. Han, E. Kalogerakis, E. G. Learned-Miller, Y. Li, M. Liao, S. Maji, A. Tatsuma, Y. Wang, N. Zhang, and Z. Zhou. Large-scale 3d shape retrieval from shapenet core55. In Eurographics Workshop on 3D Object Retrieval. The Eurographics Association, 2016.
  • [Savva et al.2016b] Manolis Savva, Fisher Yu, Hao Su, M Aono, B Chen, D Cohen-Or, W Deng, Hang Su, Song Bai, Xiang Bai, et al. Shrec’16 track large-scale 3d shape retrieval from shapenet core55. In Proceedings of the eurographics workshop on 3D object retrieval, 2016.
  • [Shi et al.2015] Baoguang Shi, Song Bai, Zhichao Zhou, and Xiang Bai. Deeppano: Deep panoramic representation for 3-d shape recognition. IEEE Signal Processing Letters, 22(12):2339–2343, 2015.
  • [Simonyan and Zisserman2015] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.
  • [Song et al.2017] Bai Song, Zhichao Zhou, Jingdong Wang, Bai Xiang, and Tian Qi. Ensemble diffusion for retrieval. In ICCV, pages 774–783, 2017.
  • [Su et al.2015] Hang Su, Subhransu Maji, Evangelos Kalogerakis, and Erik Learned-Miller. Multi-view convolutional neural networks for 3d shape recognition. In ICCV, pages 945–953, 2015.
  • [Su et al.2018] Jong-Chyi Su, Matheus Gadelha, Rui Wang, and Subhransu Maji. A deeper look at 3d shape classifiers. In Second Workshop on 3D Reconstruction Meets Semantics, ECCV, 2018.
  • [Tan and Le2019] Mingxing Tan and Quoc Le. EfficientNet: Rethinking model scaling for convolutional neural networks. In ICML, volume 97, pages 6105–6114, 2019.
  • [Wang et al.2017a] Chu Wang, Marcello Pelillo, and Kaleem Siddiqi. Dominant set clustering and pooling for multi-view 3d object recognition. In BMVC, 2017.
  • [Wang et al.2017b] Peng-Shuai Wang, Yang Liu, Yu-Xiao Guo, Chun-Yu Sun, and Xin Tong. O-cnn: Octree-based convolutional neural networks for 3d shape analysis. ACM Transactions on Graphics, 36(4):72:1–72:11, 2017.
  • [Williams1992] R. J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3-4):229–256, 1992.
  • [Wu et al.2015] Zhirong Wu, Shuran Song, Aditya Khosla, Fisher Yu, Linguang Zhang, Xiaoou Tang, and Jianxiong Xiao. 3d shapenets: A deep representation for volumetric shapes. In CVPR, pages 1912–1920, 2015.
  • [Xie et al.2017] Jin Xie, Guoxian Dai, Fan Zhu, Edward K Wong, and Yi Fang. Deepshape: Deep-learned shape descriptor for 3d shape retrieval. IEEE transactions on pattern analysis and machine intelligence, pages 1335–1345, 2017.
  • [Xu et al.2019] Cheng Xu, Zhaoqun Li, Qiang Qiu, Biao Leng, and Jingfei Jiang. Enhancing 2d representation via adjacent views for 3d shape retrieval. In ICCV, 2019.
  • [Zoph et al.2018] Barret Zoph, Vijay Vasudevan, Jonathon Shlens, and Quoc V Le. Learning transferable architectures for scalable image recognition. In CVPR, pages 8697–8710, 2018.

Appendix A Training details

A.1 Architecture search on ModelNet40

We adopt stochastic gradient descent (SGD) with an initial learning rate of 0.01, a momentum of 0.9, and a weight decay of 3e-4 to optimize the network weights. The architecture parameter is initialized from a Gaussian distribution with mean 0 and standard deviation 1e-3. […] is optimized using the same optimizer as the network weights, with a learning rate of 0.05. […] is optimized by Adam [Kingma and Ba2015] with an initial learning rate of 3e-4, momentum […], and a weight decay of 1e-3. For stability, gradient clipping is applied and a warmup scheme is used during the search.
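To make the weight update concrete, the following is a minimal pure-Python sketch of one SGD step with momentum, weight decay, and gradient clipping, using the learning rate, momentum, and weight decay stated for the network weights. It follows the common formulation in which L2 weight decay is folded into the gradient. The clipping threshold `MAX_GRAD_NORM` and the helper names are hypothetical, since the text does not give them, and the warmup scheme is left out because its form is not specified.

```python
import math

# Values stated in the text for the network weights.
LR, MOMENTUM, WEIGHT_DECAY = 0.01, 0.9, 3e-4
# Hypothetical clipping threshold; the text does not state one.
MAX_GRAD_NORM = 5.0


def clip_by_global_norm(grads, max_norm):
    """Rescale the gradient list so that its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm or norm == 0.0:
        return list(grads)
    scale = max_norm / norm
    return [g * scale for g in grads]


def sgd_step(weights, grads, velocity, lr=LR, momentum=MOMENTUM,
             weight_decay=WEIGHT_DECAY, max_norm=MAX_GRAD_NORM):
    """One SGD-with-momentum update; returns (new_weights, new_velocity)."""
    grads = clip_by_global_norm(grads, max_norm)
    new_weights, new_velocity = [], []
    for w, g, v in zip(weights, grads, velocity):
        g = g + weight_decay * w   # L2 weight decay folded into the gradient
        v = momentum * v + g       # momentum buffer
        new_weights.append(w - lr * v)
        new_velocity.append(v)
    return new_weights, new_velocity


# Toy example with two scalar weights and zero-initialized momentum buffers.
w, v = sgd_step([1.0, -2.0], [0.5, 0.1], [0.0, 0.0])
print(w, v)
```

In practice a deep learning framework's optimizer and clipping utilities would replace these helpers; the sketch only spells out the arithmetic.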

A.2 Pretraining on ImageNet

In 3D shape recognition, most approaches use ImageNet-pretrained networks to boost their performance. For a fair comparison, we also train the searched network on the ImageNet classification benchmark. Note that this classification task is trained on single images rather than groups of views, so only the parameters of the first 5 cells are updated. We train for 120 epochs with a batch size of 800, using SGD with an initial learning rate of 0.1 and a weight decay of 3e-4.

A.3 Architecture evaluation on ModelNet40 and ShapeNetCore55

To evaluate the performance of the searched architectures, we retrain the network derived from the supernet on the target dataset. Retraining uses a batch size of 36 and SGD with an initial learning rate of 0.01. The weight decay is 1e-3 without pretraining and 3e-4 with pretraining. The shape descriptor with dimension […] is used to conduct retrieval.
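For readers unfamiliar with the retrieval metric, the sketch below shows one common way to compute mean average precision (mAP) from shape descriptors. It assumes Euclidean distance between descriptors and uses every shape as a query against all the others. Both choices are assumptions, since the evaluation protocol is not spelled out in this section, and the toy descriptors and labels are hypothetical.

```python
import math


def average_precision(ranked_labels, query_label):
    """AP of one ranked list: mean of the precision at each relevant position."""
    hits, precisions = 0, []
    for rank, label in enumerate(ranked_labels, start=1):
        if label == query_label:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0


def mean_average_precision(descriptors, labels):
    """Use every shape as a query against all other shapes and average the APs."""
    aps = []
    for q, (q_desc, q_label) in enumerate(zip(descriptors, labels)):
        gallery = [i for i in range(len(labels)) if i != q]
        # Rank the gallery by Euclidean distance to the query descriptor.
        gallery.sort(key=lambda i: math.dist(q_desc, descriptors[i]))
        aps.append(average_precision([labels[i] for i in gallery], q_label))
    return sum(aps) / len(aps)


# Toy 2-D descriptors (hypothetical; real descriptors come from the network).
descs = [(0.0, 0.0), (0.1, 0.0), (1.0, 1.0), (1.1, 1.0)]
labels = ["chair", "chair", "table", "table"]
print(mean_average_precision(descs, labels))
```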

A.4 Stability of search

Since the search process is sensitive to initialization, the final searched architectures generally differ from one another under different random seeds. To investigate the stability of the search, we run the search experiment 5 times with the same hyperparameters but different random seeds. The results are shown in Tab. 6.

Seed Accuracy mAP
1 90.3% 83.8%
2 91.0% 87.6%
3 89.8% 82.3%
4 90.5% 84.2%
5 90.5% 84.3%
Table 6: Performance of architectures searched with different random seeds on ModelNet40.
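The spread across seeds in Table 6 can be summarized with a few lines of Python. This is only a convenience sketch over the reported numbers; the choice of the sample standard deviation is ours and the variable names are illustrative.

```python
from statistics import mean, stdev

# Rows of Table 6: seed -> (accuracy %, mAP %).
RESULTS = {
    1: (90.3, 83.8),
    2: (91.0, 87.6),
    3: (89.8, 82.3),
    4: (90.5, 84.2),
    5: (90.5, 84.3),
}

accuracy = [acc for acc, _ in RESULTS.values()]
m_ap = [m for _, m in RESULTS.values()]

for name, values in (("Accuracy", accuracy), ("mAP", m_ap)):
    print(f"{name}: mean={mean(values):.2f}, std={stdev(values):.2f}, "
          f"range={max(values) - min(values):.1f}")
```

On these five runs the mean accuracy is about 90.4% with a range of 1.2 points, while the mean mAP is about 84.4% with a range of 5.3 points, so the retrieval metric is noticeably more sensitive to the seed than the classification accuracy.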