
MV-C3D: A Spatial Correlated Multi-View 3D Convolutional Neural Networks

by   Qi Xuan, et al.

With the development of deep neural networks, 3D object recognition is becoming increasingly popular in the computer vision community. Many multi-view based methods have been proposed to improve category recognition accuracy. These approaches mainly rely on multi-view images rendered around the whole circumference. In real-world applications, however, 3D objects are mostly observed from partial viewpoints within a limited range. Therefore, we propose a multi-view based 3D convolutional neural network, which takes only a portion of contiguous multi-view images as input and can still maintain high accuracy. Moreover, our model takes these view images as a joint variable to better learn spatially correlated features using 3D convolution and 3D max-pooling layers. Experimental results on the ModelNet10 and ModelNet40 datasets show that our MV-C3D technique can achieve outstanding performance with multi-view images captured from partial angles within a limited range. The results on the 3D rotated real image dataset MIRO further demonstrate that MV-C3D is more adaptable in real-world scenarios. The classification accuracy can be further improved by increasing the number of view images.



I Introduction

Recently, deep learning technologies have been widely applied to several industrial manufacturing processes [1, 2, 3, 4, 5]. The development of convolutional neural networks (CNNs) has enabled dramatic progress in 3D object recognition, which has a wide range of applications, e.g., autonomous driving [6], robotics [7], and civil monitoring [8]. In this work, we propose a novel 3D CNN architecture that requires only multiple images from limited viewpoints and can still achieve satisfactory classification results.

Currently, most CNN architectures are designed specifically for 2D images [9]. Therefore, to classify 3D models, we first need to convert them into voxels or 2D images. In voxel based approaches, 3D models are organized as volumetric occupancy grids [10, 11, 12]. The main advantage of the voxel representation is that it maintains the full geometrical information of the original 3D objects. However, these approaches also suffer from resolution loss and exponentially growing computational cost [13].

Previously, researchers developed multi-view based methods, which can achieve comparable results at a much lower computational cost [14, 15, 16]. However, these methods require multiple images taken from various predefined viewpoints around the whole circumference, which is quite impractical for real-world applications. Thus, it is much more desirable to perform 3D recognition from multi-view images captured at limited viewpoints. The existing multi-view approaches [14, 16] treat each view image as an independent variable and feed the images into 2D CNNs separately; the final classification is derived by aggregating the feature vectors with view-pooling or clustering. By neglecting the spatial correlations between the multi-view images, these approaches can easily lead to inferior results.

Thus, to address these problems, we propose a multi-view based 3D CNN, or MV-C3D. As shown in the overview figure, our technique takes the multi-view images of objects as input and predicts the corresponding category labels. Unlike the existing multi-view based methods, our model uses multi-view images from only partial angles within a limited range, which makes it more adaptable to real-world applications. Moreover, our technique considers the images from different viewpoints as a joint variable instead of independent variables. With the help of 3D convolution and 3D max-pooling layers, our MV-C3D architecture can take advantage of the spatial correlations between multi-view images to learn features that distinguish different objects.

The main contributions of this work are summarized as follows.

  1. We propose, for the first time, a multi-view based 3D convolutional neural network, namely MV-C3D, which requires only partial multi-view images from limited viewpoints and outperforms the current state-of-the-art multi-view based classification performance on the ModelNet benchmark.

  2. We combine the images of different views into a joint variable to learn spatially correlated features by using 3D convolution and 3D max-pooling layers. The visualization of feature maps shows that our network can focus on the same part of an object in different view images.

  3. We demonstrate experimentally that MV-C3D achieves higher classification accuracy as the number of contiguous view images, taken from partial angles within a limited range, increases.

  4. We test MV-C3D on the 3D rotated real image dataset MIRO, using multiple images captured from arbitrary but contiguous viewpoints, to demonstrate its performance in real-world scenarios.

The experimental results show that, on the ModelNet dataset, our proposed architecture outperforms the state-of-the-art multi-view based method MVCNN [14] by 3.8%, the multi-modal based method Spherical Projections [17] by 0.6%, and DSCNN [16] by 1.7% when all methods use the same input modality.

The rest of the paper is organized as follows. Section II provides a detailed review of the related works. Section III describes our proposed MV-C3D architecture. Section IV presents the experimental setup and results. Section V concludes the paper and discusses future work.

II Related Work

Previously, researchers mainly relied upon local or global descriptors that map 3D shape information into feature vectors [18, 19, 20, 21]. With the breakthrough of CNNs, neural network based approaches are becoming more and more popular. The existing works fall into two categories: voxel based methods and multi-view image based methods.

II-A Voxel based method

Wu et al. [22] constructed a five-layer 3D convolutional deep belief network (CDBN), namely 3D ShapeNets, to learn the probability distribution of 3D voxel grids. Sedaghat et al. [11] considered 3D object classification as a multi-task problem by introducing object orientation prediction. This model achieved excellent performance, which demonstrates that orientation is also an important aspect of 3D object classification [10, 23].

II-B Multi-view images based method

2D image based methods are also important for the 3D object classification problem. Su et al. [14] proposed a multi-view CNN (MVCNN) based technique to aggregate multiple images into concise descriptors in a view pooling layer, which lies in the middle of a 2D CNN framework pre-trained on ImageNet [24]. Multi-view images are also used in 3D object retrieval applications [25]. Qi et al. [13] conducted a comprehensive study on voxel based and multi-view based CNNs for 3D object classification. According to these works, there are two important factors affecting the model performance: architecture and volume resolution. Therefore, two distinct volumetric networks and a multi-resolution filtering technique were proposed. In particular, Feng et al. [26] proposed a group-view convolutional framework, which is composed of a hierarchical view-group-shape architecture for correlation modeling towards discriminative 3D shape description. The current state-of-the-art method, DSCNN [16], learns feature vectors from multiple views by using a recurrent cluster strategy. In addition to the above methods, methods that work directly on point cloud data have attracted increasing attention [27, 28], but their performance is still worse than that of multi-view image based approaches.

FIGURE II-B: Comparison of 2D and 3D convolutions: (a) 2D convolution and (b) 3D convolution. In (b) the kernel size is $3 \times 3 \times 3$, which means that it computes the related features between 3 contiguous views.

Specifically, MVCNN [14] treats each view image as an independent variable and feeds it into 2D CNNs to compute feature maps. It then directly performs a full-stride channel-wise max pooling on these feature maps to generate a unified feature vector. Although MVCNN achieves great success in classification, this operation may destroy the spatial correlation information and thus could be further improved. Moreover, DSCNN [16] uses a clustering strategy, which may also cause the loss of information along the viewpoint dimension. Different from these existing methods, our MV-C3D technique treats the multi-view images as a joint variable, and uses 3D convolution and 3D max-pooling to learn both the spatial features and the intrinsic correlations among multi-view images simultaneously.
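The effect of pooling across views can be illustrated with a minimal sketch. It is not the MVCNN implementation: the function name `view_pool` and the toy feature values are assumptions made for illustration. Because the maximum is taken independently for each channel, reordering the views leaves the pooled descriptor unchanged, which is one way to see that the arrangement of the views is not encoded.

```python
from typing import List


def view_pool(features: List[List[float]]) -> List[float]:
    """Element-wise max over views; features[v][c] is channel c of view v."""
    return [max(channel) for channel in zip(*features)]


# Toy per-view feature vectors (hypothetical values, 3 views x 4 channels).
views = [
    [0.1, 0.9, 0.3, 0.0],
    [0.4, 0.2, 0.8, 0.1],
    [0.7, 0.5, 0.2, 0.6],
]

pooled = view_pool(views)
shuffled = view_pool([views[2], views[0], views[1]])
print(pooled)              # one value per channel
print(pooled == shuffled)  # the view order does not affect the result
```

A 3D convolution, by contrast, slides along the view axis, so its output depends on which views are neighbours.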

III MV-C3D Method

III-A Multi-view based 3D convolution

In 2D CNN applications, we only need to compute features from the 2D spatial dimensions, and thus a single-view image of an object is sufficient. For 3D object recognition, however, 3D object information must be encoded from three dimensions, where the different viewpoint images form the third dimension. Compared to 2D CNNs, 3D CNNs can be more efficient and accurate for multi-view feature learning. In 3D CNNs, 3D convolution is performed by applying a 3D kernel to the stack of view images. FIGURE II-B shows the difference between 2D and 3D convolutions. A 2D convolution kernel is applied to an image in the 2D spatial directions and therefore cannot include view information. A 3D convolution kernel, on the other hand, can preserve the spatial correlation between different view images. Moreover, unlike voxel-based 3D CNNs, which focus on learning geometrical features, multi-view based 3D CNNs can capture the correlated features between multi-view images. Formally, the value at position $(x, y, z)$ on the $j$-th feature map in the $i$-th layer is given by:

$$v_{ij}^{xyz} = b_{ij} + \sum_{m} \sum_{p=0}^{P_i-1} \sum_{q=0}^{Q_i-1} \sum_{r=0}^{R_i-1} w_{ijm}^{pqr} \, v_{(i-1)m}^{(x+p)(y+q)(z+r)}$$

where $b_{ij}$ is the bias, the sum over $m$ accumulates the results of convolution with the $m$-th feature map, $P_i \times Q_i \times R_i$ is the size of the 3D convolution kernel, $(p, q, r)$ is the offset, $R_i$ is the viewpoint dimension, $w_{ijm}^{pqr}$ is the kernel connected to the $m$-th feature map in the $(i-1)$-th layer, and $v_{(i-1)m}$ is the $m$-th feature map in the $(i-1)$-th layer.
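As an illustration of how a 3D kernel mixes neighbouring views, the following is a naive single-channel sketch in plain Python. It is not the authors' implementation: the function `conv3d_valid`, the index order `volume[x][y][z]` (with `z` as the view index), the absence of padding, and the toy values are all assumptions chosen to keep the example small.

```python
from typing import List

Volume = List[List[List[float]]]  # indexed as volume[x][y][z], z = view index


def conv3d_valid(volume: Volume, kernel: Volume, bias: float = 0.0) -> Volume:
    """Naive single-channel 3D convolution (cross-correlation, no padding)."""
    X, Y, Z = len(volume), len(volume[0]), len(volume[0][0])
    P, Q, R = len(kernel), len(kernel[0]), len(kernel[0][0])
    out = []
    for x in range(X - P + 1):
        plane = []
        for y in range(Y - Q + 1):
            row = []
            for z in range(Z - R + 1):
                acc = bias
                for p in range(P):
                    for q in range(Q):
                        for r in range(R):
                            acc += kernel[p][q][r] * volume[x + p][y + q][z + r]
                row.append(acc)
            plane.append(row)
        out.append(plane)
    return out


# Toy stack: 3x3 pixels, 4 views; each pixel value encodes only its view index.
stack = [[[float(z) for z in range(4)] for _ in range(3)] for _ in range(3)]
# A 3x3x3 averaging kernel mixes three contiguous views.
k3 = [[[1.0 / 27 for _ in range(3)] for _ in range(3)] for _ in range(3)]
# A 3x3x1 kernel sees one view at a time, like a 2D convolution.
k1 = [[[1.0 / 9] for _ in range(3)] for _ in range(3)]

out3 = conv3d_valid(stack, k3)
out1 = conv3d_valid(stack, k1)
print(len(out3[0][0]), len(out1[0][0]))  # output length along the view axis
```

With the depth-3 kernel each output value averages three neighbouring views, whereas the depth-1 kernel returns one response per view and never combines them.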

III-B Partial multi-view images setup

Typically, 3D models in online repositories are stored as polygon meshes, which are collections of vertices, edges, and faces that define the shape of a polyhedral object. We employ the Phong reflection model [29] to render 3D models at different predefined viewpoints. We assume that the 3D objects are upright oriented along the $z$-axis [26, 16]. As shown in FIGURE III-B, we fix the $z$-axis as the rotation axis and place 36 viewpoints around it, separated by an angle of 10°. The viewpoints are elevated from the ground plane by a fixed angle. As a result, we generate 36 view images. Unlike the existing methods based on omnidirectional viewpoints, only a portion of contiguous images from limited viewpoints is required.
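The viewpoint layout can be sketched as follows. This is a hypothetical helper rather than the rendering code used in the paper: the names `azimuths` and `contiguous_views`, the random choice of the starting viewpoint, and the wrap-around past 360° are assumptions. The 10° spacing follows from placing 36 viewpoints evenly on a full circle.

```python
import random
from typing import List

NUM_VIEWPOINTS = 36                  # views rendered around the rotation axis
STEP_DEG = 360 // NUM_VIEWPOINTS     # angular spacing between neighbouring views


def azimuths() -> List[int]:
    """Azimuth angle (degrees) of every rendered viewpoint."""
    return [i * STEP_DEG for i in range(NUM_VIEWPOINTS)]


def contiguous_views(num_views: int, start: int) -> List[int]:
    """Indices of `num_views` neighbouring viewpoints, wrapping past 360 degrees."""
    if not 1 <= num_views <= NUM_VIEWPOINTS:
        raise ValueError("num_views must be between 1 and 36")
    return [(start + i) % NUM_VIEWPOINTS for i in range(num_views)]


angles = azimuths()
start = random.randrange(NUM_VIEWPOINTS)   # random starting viewpoint (assumption)
subset = contiguous_views(12, start)
print([angles[i] for i in subset])         # azimuths of 12 neighbouring views
```

Selecting 12 neighbouring indices in this way covers an arc of 110° between the first and last viewpoint instead of the full circumference.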

Moreover, 2D images with a higher resolution preserve more information, which can lead to better performance at the cost of computation time. To balance computational cost and performance, we render all images at a single fixed size.

FIGURE III-B: Different input representation setup.

III-C Network Architecture

Based on the structure of VGGNet [30], we propose MV-C3D, which is essentially a 3D CNN capable of processing contiguous multi-view images (FIGURE III-C). The input is a stacked cube of multi-view RGB images whose size is determined by the number of views and by the height and width of a single image. Unlike the existing methods [14, 16], we do not compute 2D spatial features on different view images independently. Instead, we take these images as an entire instance and learn spatially correlated features between multi-view images.

Our network architecture contains eight 3D convolution layers, five 3D max-pooling layers, three fully-connected layers, and a softmax function to estimate the output distribution. The first two fully-connected layers have 4096 units each, while the size of the third is determined by the number of classes. The 3D convolution and fully-connected layers use rectified linear units (ReLUs) as the activation function. We also apply dropout after each of the first two fully-connected layers to reduce overfitting.

FIGURE III-C: The architecture of our MV-C3D. White layers represent 3D convolution operations, green layers represent 3D max-pooling, and yellow layers represent fully connected layers. $\lfloor \cdot \rfloor$ denotes the floor function, and the numbers in parentheses give the channels of each feature cube.

| Name | Type | Filter size/stride | Output size |
|---|---|---|---|
| Conv1 | Convolution | | |
| Pool1 | Max pooling | | |
| Conv2 | Convolution | | |
| Pool2 | Max pooling | | |
| Conv3_a | Convolution | | |
| Conv3_b | Convolution | | |
| Pool3 | Max pooling | | |
| Conv4_a | Convolution | | |
| Conv4_b | Convolution | | |
| Pool4 | Max pooling | | |
| Conv5_a | Convolution | | |
| Conv5_b | Convolution | | |
| Pool5 | Max pooling | | |
| Fc1 | Fully connected | | |
| Fc2 | Fully connected | | |
| Fc3 | Fully connected | | |

TABLE I: Size for each layer of MV-C3D.

The size of the convolution kernels is fixed to $3 \times 3 \times 3$, which follows the kernel size used in 2D CNNs [30]. It has been demonstrated that a small spatial receptive field of $3 \times 3$ can increase the performance of DNN models in 2D recognition. Therefore, we set the kernel size to $3 \times 3$ to compute the 2D spatial features in each view image and set the third, viewpoint dimension to 3 to aggregate spatially correlated features between view images. The numbers of filters in the eight convolution layers are 64, 128, 256, 256, 512, 512, 512, and 512, respectively. We add padding to both the spatial and view dimensions in all convolution layers, so that the size of the feature maps remains constant after these layers.
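The size-preserving padding can be checked with a short sketch. It assumes stride-1 convolutions with symmetric zero padding, which the text does not state explicitly; the helper names are hypothetical. The kernel sizes 1, 3, 5, and 7 are the viewpoint dimensions explored later in the experiments, and 12 is used as an example number of views.

```python
def same_padding(kernel: int) -> int:
    """Zeros added on each side so a stride-1 convolution keeps the size (odd kernels)."""
    if kernel % 2 == 0:
        raise ValueError("symmetric padding assumes an odd kernel size")
    return (kernel - 1) // 2


def conv_output_size(size: int, kernel: int, padding: int, stride: int = 1) -> int:
    """Standard output-length formula for one convolution axis."""
    return (size + 2 * padding - kernel) // stride + 1


for k in (1, 3, 5, 7):
    p = same_padding(k)
    # kernel size, padding per side, output length for an axis of 12 views
    print(k, p, conv_output_size(12, k, p))
```

Under these assumptions only the pooling layers change the size of the feature cube.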

For the pooling layers, to preserve the 2D spatial features in the single-view images, we set the kernel size of the first pooling layer to $2 \times 2 \times 1$ with a stride of $2 \times 2 \times 1$. In other words, we apply 2D spatial max-pooling to each view image. The remaining pooling layers implement 3D max-pooling with a kernel size of $2 \times 2 \times 2$ and a stride of $2 \times 2 \times 2$. Therefore, the spatial size of the output feature maps is scaled down by a factor of 32 ($2^5$) compared with the original input, while the viewpoint dimension is scaled down by a factor of 16 ($2^4$), as shown in TABLE I.
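The scaling factors can be traced with a small sketch. The input of 16 views at 224×224 pixels is a hypothetical size chosen so that every stage divides evenly, and rounding down at each stage is an assumption. How view counts that are not multiples of 16 are handled is not covered here.

```python
def pooled_sizes(height: int, width: int, views: int) -> list:
    """Trace (height, width, views) through five stride-2 pooling stages.

    Assumption: sizes are rounded down at each stage, and the first stage
    pools only spatially, leaving the number of views unchanged.
    """
    sizes = [(height, width, views)]
    for stage in range(5):
        height, width = height // 2, width // 2
        if stage > 0:                  # stages 2-5 also halve the view axis
            views = views // 2
        sizes.append((height, width, views))
    return sizes


# Hypothetical input: 16 views of 224x224 pixels (sizes chosen for illustration).
for size in pooled_sizes(224, 224, 16):
    print(size)
```

The spatial axes are halved five times and the view axis four times, which gives the factors of 32 and 16.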

IV Experiment

IV-A Experimental setup

Dataset. We evaluate our MV-C3D model on the ModelNet benchmark [22], a comprehensive collection of 3D CAD models that contains 127,915 models divided into 662 categories. As shown in TABLE II, two subsets of ModelNet are widely used: ModelNet10, with 4,899 object instances in 10 categories, and ModelNet40, with 12,311 object instances in 40 categories. Both are fully labeled and used in many state-of-the-art studies [16, 14, 10, 31, 32]. The datasets also provide training and testing splits: ModelNet10 has 3,991 training and 908 testing samples, and ModelNet40 has 9,843 training and 2,468 testing samples. We use these default splits in our experiments.

Training detail. We perform experiments on a machine with an NVIDIA TITAN X Pascal GPU, an Intel Core i7-6700K CPU, and 32GB RAM. Our proposed model is implemented in TensorFlow [33], a popular deep learning library from Google.

The neural network is trained using Adam [34] optimization. The initial learning rate is set to 0.0001 and divided by 10 every 20 epochs during training. The loss function is cross-entropy with weight regularization, as shown in the following equation:

$$L = -\frac{1}{M} \sum_{i=1}^{M} \sum_{c=1}^{C} \mathbb{1}\{y_i = c\} \log p_{i,c} + \lambda \sum_{k=1}^{K} w_k^2$$

where $M$ is the mini-batch size, $C$ is the number of categories (e.g., $C = 10$ for ModelNet10 and $C = 40$ for ModelNet40), $y_i$ and $p_{i,c}$ represent the true label and the prediction score, respectively, $\mathbb{1}\{\cdot\}$ is the indicator function, $\lambda$ is the weighting parameter, which is set to 0.0005 empirically, $w_k$ are the filter parameters, initialized from a zero-mean Gaussian distribution with a standard deviation of 0.05, and $K$ is the total number of filter parameters.
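A cross-entropy loss with weight regularization can be sketched in a few lines. The sketch assumes a squared L2 penalty and a softmax over raw class scores; the function names and the toy values are hypothetical, and only the weighting parameter 0.0005 is taken from the text.

```python
import math
from typing import List, Sequence


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over raw class scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def regularized_cross_entropy(batch_scores, labels, weights, weight_decay=0.0005):
    """Mean cross-entropy over the mini-batch plus a squared L2 penalty (assumed form)."""
    data_term = 0.0
    for scores, label in zip(batch_scores, labels):
        probs = softmax(scores)
        data_term -= math.log(probs[label])   # the indicator keeps only the true class
    data_term /= len(labels)
    penalty = weight_decay * sum(w * w for w in weights)
    return data_term + penalty


# Toy mini-batch: 2 samples, 3 classes, and 4 weights (all values hypothetical).
scores = [[2.0, 0.5, -1.0], [0.1, 0.2, 3.0]]
labels = [0, 2]
weights = [0.05, -0.02, 0.01, 0.03]
print(regularized_cross_entropy(scores, labels, weights))
```

With a small weighting parameter the penalty term stays far below the data term unless the weights grow large.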

In the training phase, we divide the default training set into a training set and a validation set at a ratio of 4 to 1. We calculate the validation loss every epoch and stop training when the validation loss has converged over 5 epochs.
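The training schedule can be sketched as follows. The step decay and the 4-to-1 split follow the description in the text, while the convergence test in `should_stop`, its tolerance, and all function names are assumptions made for illustration.

```python
def learning_rate(epoch: int, initial: float = 1e-4, drop_every: int = 20) -> float:
    """Step decay: divide the learning rate by 10 every `drop_every` epochs."""
    return initial / (10 ** (epoch // drop_every))


def should_stop(val_losses, patience: int = 5, tolerance: float = 1e-3) -> bool:
    """Stop when the last `patience` validation losses vary by less than `tolerance`.

    The tolerance and this particular convergence test are assumptions.
    """
    if len(val_losses) < patience:
        return False
    recent = val_losses[-patience:]
    return max(recent) - min(recent) < tolerance


def split_train_val(samples, ratio: int = 4):
    """Split samples (assumed already shuffled) into parts of proportion `ratio`:1."""
    cut = len(samples) * ratio // (ratio + 1)
    return samples[:cut], samples[cut:]


train_ids, val_ids = split_train_val(list(range(3991)))  # ModelNet10 training size
print(len(train_ids), len(val_ids))
print([learning_rate(e) for e in (0, 19, 20, 40)])
print(should_stop([0.90, 0.52, 0.41, 0.4004, 0.4003, 0.4002, 0.4001, 0.4000]))
```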

| Name | Train split | Test split | Total |
|---|---|---|---|
| ModelNet10 | 3991 | 908 | 4899 |
| ModelNet40 | 9843 | 2468 | 12311 |

TABLE II: Details of the ModelNet subsets.

IV-B Exploring viewpoint dimension of kernel

A small receptive field is appropriate for 2D spatial feature learning according to the findings in VGGNet [30]. Thus, we fix the spatial dimension of the 3D convolution kernel to $3 \times 3$ and vary only the viewpoint dimension to find the optimal 3D convolution kernel size. Moreover, we set the number of views to 12, which is consistent with the existing methods.

During the experiment, we first assume that all convolution kernels have the same viewpoint dimension, and evaluate 4 different 3D kernel sizes whose viewpoint dimension is fixed to 1, 3, 5, or 7 from the first to the eighth convolution layer. Then we let the viewpoint dimension vary across convolution layers. For this setting, we test two types of networks, with the viewpoint dimension in decreasing order and in increasing order, respectively, and choose the one with the best performance to compare with the other settings. In particular, we choose 7-5-5-5-3-3-1-1 to represent the decreasing order and 1-1-3-3-5-5-7-7 to represent the increasing order.

The networks are trained on the training set of ModelNet10 and tested on its testing set. FIGURE IV-B shows the experimental results. The 3D convolution kernel whose viewpoint dimension is fixed to 3 gives the best performance. Therefore, we use $3 \times 3 \times 3$ kernels in the following experiments. Moreover, an interesting observation is that the performance is the worst when the viewpoint dimension is equal to 1. This is expected, since such a kernel is essentially equivalent to a 2D convolution kernel and hence cannot capture multi-view features. This suggests that 3D CNNs can effectively learn spatially correlated features between multi-view images and improve the classification results.

FIGURE IV-B: Exploring viewpoint dimension of 3D convolution kernel on ModelNet10.

IV-C Over-sampling

Class imbalance can significantly affect the performance and generalization ability of models [35, 36]. As shown in FIGURE III, the number of instances in each category varies greatly. To reduce the influence of this data bias, we randomly select object instances from the categories with fewer instances and add each as a new instance of the same category, increasing the number of instances in each category to 500. This scheme significantly reduces the class imbalance in the training set. The classification accuracies on ModelNet10 and ModelNet40 after oversampling are shown in TABLE III; the model performance is slightly improved.
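Random oversampling of this kind can be sketched as below. The function name `oversample`, the fixed seed, and the toy class sizes are assumptions; the sketch also assumes that categories already holding at least 500 instances are left unchanged, which the text does not specify.

```python
import random
from collections import defaultdict


def oversample(samples, target_per_class=500, seed=0):
    """Duplicate randomly chosen instances until each class has `target_per_class`.

    `samples` is a list of (instance_id, label) pairs. Classes that already
    have at least `target_per_class` instances are left unchanged (assumption).
    """
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for instance, label in samples:
        by_label[label].append(instance)

    balanced = list(samples)
    for label, instances in by_label.items():
        missing = target_per_class - len(instances)
        for _ in range(max(0, missing)):
            balanced.append((rng.choice(instances), label))
    return balanced


# Toy, imbalanced training list (hypothetical class sizes).
toy = [(f"chair_{i}", "chair") for i in range(300)] + \
      [(f"bowl_{i}", "bowl") for i in range(60)]
balanced = oversample(toy)
counts = {label: sum(1 for _, l in balanced if l == label) for label in ("chair", "bowl")}
print(counts)
```

Duplicated instances carry no new information, so the benefit comes only from rebalancing how often each category is seen during training.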

| Method | ModelNet10 | ModelNet40 |
|---|---|---|
| No sampling | 90.5% | 89.8% |
| Oversampling | 91.1% | 90.1% |

TABLE III: Performance with and without oversampling.

FIGURE III: The number of instances for each category in ModelNet10 and ModelNet40.

IV-D Pre-training

In 2D object classification applications, model performance can be significantly improved by pre-training on ImageNet [5]. Similarly, the existing 3D multi-view based methods [14, 37], which are based on 2D CNNs such as VGG-M [38], can also improve their classification accuracy when pre-trained on ImageNet. Unfortunately, our MV-C3D model cannot be pre-trained on ImageNet because it lacks multi-view images. Therefore, we employ UCF101 [39], an action classification dataset collected from YouTube, to pre-train our 3D CNN. It has been demonstrated that 3D CNNs can effectively learn correlated features between different video frames [40]. As shown in TABLE IV, our MV-C3D also achieves a better classification accuracy when pre-trained on UCF101 and then fine-tuned.

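| Setting | Pre-training | Oversampling | ModelNet10 | ModelNet40 |
|---|---|---|---|---|
| 1 | No | No | 90.5% | 89.8% |
| 2 | No | Yes | 91.1% | 90.1% |
| 3 | Yes | No | 91.7% | 91.0% |
| 4 | Yes | Yes | 92.0% | 91.5% |

TABLE IV: Performance with different pre-processing methods.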

IV-E Effect of the number of view images

In this section, we explore the impact of the number of view images on the classification performance. FIGURE IV-E shows the performance of our MV-C3D with the number of view images varying from 1 to 36 on ModelNet40. The performance is poor when the number of view images is below 4 because of the lack of sufficient spatially correlated features. As the number of views increases, the classification accuracy improves rapidly. Our MV-C3D technique achieves 89.3% classification accuracy with only 10 views and 93.2% with 16 views. The performance converges to 93.9% (±0.1%) with more than 20 views.

FIGURE IV-E: Classification result versus number of view images on ModelNet40.

IV-F Experiment on ModelNet

We compare our MV-C3D technique with other methods based on varying input modalities (e.g., voxel, point cloud). As shown in TABLE V, our proposed MV-C3D model achieves an accuracy of 93.9% and an mAP of 90.5% on ModelNet40, which matches or outperforms all the other methods based on voxels and point clouds. Meanwhile, for multi-view input, it improves on the MVCNN technique by 3.8% and 11% on the classification and retrieval tasks, respectively. Among the multi-modal methods, our method outperforms all the others except for the Spherical Projections [17] technique, which is 0.3% better. However, that approach requires depth information, which is impractical and requires more resources to process. Moreover, MV-C3D achieves the best result of 94.5% on the classification task on ModelNet10.

| Method | Input Modality | ModelNet40 Classification (Accuracy) | ModelNet40 Retrieval (mAP) | ModelNet10 Classification (Accuracy) | ModelNet10 Retrieval (mAP) |
|---|---|---|---|---|---|
| Beam Search [41] | Voxel | 81.26% | - | 88% | - |
| VoxNet [10] | Voxel | 83% | - | 92% | - |
| 3D-GAN [23] | Voxel | 83.3% | - | 91% | - |
| LightNet [42] | Voxel | 86.9% | - | 93.39% | - |
| FPNN [43] | Voxel | 88.4% | - | - | - |
| MVCNN-MultiRes [13] | Voxel | 91.4% | - | - | - |
| 3D ShapeNets [22] | Point Cloud | 77.32% | 49.23% | 83.54% | 68.26% |
| PointNet [44] | Point Cloud | - | - | 77.6% | - |
| PointNet++ [45] | Point Cloud | 91.9% | - | - | - |
| Set-convolution [46] | Point Cloud | 90% | - | - | - |
| Angular Triplet-Center [47] | Point Cloud | - | 86.11% | - | 92.07% |
| Geo-CNN [48] | Point Cloud | 93.9% | - | - | - |
| DeepPano [49] | Multi-view | 77.63% | 76.81% | 85.45% | 84.18% |
| MVCNN [14] | Multi-view | 90.1% | 79.5% | - | - |
| FusionNet [37] | Multi-view + Voxel | 90.8% | - | 93.11% | - |
| PRVNet [50] | Multi-view + Point Cloud | 93.6% | 90.5% | - | - |
| Multiple Depth [51] | Multi-view + Depth | 87.8% | - | 91.5% | - |
| DSCNN [16] | Multi-view + Depth | 93.8% | - | - | - |
| Spherical Projections [17] | Multi-view + Depth | 94.24% | - | - | - |
| MV-C3D (proposed) | Multi-view | 93.9% | 90.5% | 94.5% | 91.4% |

TABLE V: Comparison of classification accuracy and retrieval mean Average Precision (mAP).

IV-G Replicability

To demonstrate the replicability of our method, we repeated the experiment on ModelNet10 for 20 trials. FIGURE IV-G shows the accuracy curve with an error band. The accuracies have high variance at the beginning of training. However, as the number of training epochs increases, the results converge and stabilize at 94.5% (±0.15%) after 30 training epochs. This result suggests that our method has good replicability.

FIGURE IV-G: Results on ModelNet10 with error band.

IV-H Visualization of learned features

For a better understanding of how MV-C3D works, we visualize the learned features by using the method from [52]. FIGURE IV-H shows the deconvolution of one learned feature map. We can see that MV-C3D focuses on the empennage in all view images in the first example. The second focuses on the empennage and the airfoil simultaneously. The third focuses on the chair legs and the fourth on the chair handle. The fact that images from different angles focus on the same feature during the learning process suggests that MV-C3D can effectively capture correlated features between multi-view images.

FIGURE IV-H: Visualization of learned features between multi-view images. In the first example, the feature focuses on the empennage in all view images. The second focuses on the empennage and the airfoil simultaneously. The third focuses on the chair legs and the fourth on the chair handle.

IV-I Comparison with Multi-view based methods

In this section, we compare our technique with the multi-view based method MVCNN [14] and the multi-modal based method DSCNN [16] on the ModelNet40 classification task. As shown in TABLE VI, MV-C3D achieves 91.4% accuracy when taking all 12 view images at an interval of 30°, which outperforms MVCNN (89.5%) but is 0.8% lower than DSCNN under the same input setting. We believe that the correlated feature information between views is weak when the views are highly dissimilar because of the large interval. However, when we take 12 contiguous view images at an interval of 10° as input, the performance of MV-C3D increases to 91.9%, while that of the other methods decreases. As the number of views increases, the performance of MV-C3D improves and reaches 93.9% with 20 views, which outperforms the other methods under the same input setting. The reason may be that the prior multi-view based methods mainly rely on global contour information, whereas MV-C3D only needs the correlated information between multi-view images. This is an advantage when MV-C3D is applied in real-world scenarios, where objects are usually captured from a limited range of angles rather than from all directions.

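| Method | View Interval | #Views | Accuracy |
|---|---|---|---|
| MVCNN | 30° | 12 | 89.5% |
| MVCNN | 10° | 8 | 80.1% |
| MVCNN | 10° | 12 | 82.7% |
| MVCNN | 10° | 16 | 84.1% |
| MVCNN | 10° | 20 | 85.3% |
| DSCNN | 30° | 12 | 92.2% |
| DSCNN | 10° | 8 | 87.6% |
| DSCNN | 10° | 12 | 90.3% |
| DSCNN | 10° | 16 | 91.4% |
| DSCNN | 10° | 20 | 92.1% |
| MV-C3D | 30° | 12 | 91.4% |
| MV-C3D | 10° | 8 | 90.5% |
| MV-C3D | 10° | 12 | 91.9% |
| MV-C3D | 10° | 16 | 93.2% |
| MV-C3D | 10° | 20 | 93.9% |

TABLE VI: Comparison between MV-C3D and other multi-view based methods under different input settings on ModelNet40.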

IV-J Experiment on 3D rotated image dataset

In this section, we test our technique on a 3D rotated real image dataset, "Multi-view Images of Rotated Objects (MIRO)" [15]. In the previous experiments, we assume that the viewpoints are uniformly distributed along a circle. However, in real-world applications, objects are often observed from arbitrary directions, a setting that MIRO matches more closely. MIRO consists of 120 object instances in 12 categories, and each instance has 160 images (10 different elevation angles and 16 different azimuth angles) captured from viewpoints that are approximately equally distributed in spherical space. FIGURE IV-J shows an example object and the corresponding multi-view images. We randomly select 12 contiguous views (in both the elevation and azimuth directions) as input to test our model, which is trained on the ModelNet40 dataset. The accuracy on each category is reported in TABLE VII. In almost all cases, MV-C3D outperforms the other multi-view based methods. This suggests that MV-C3D is accurate, robust, and more practical.

FIGURE IV-J: A clock exemplar and the multi-view images in the MIRO dataset.

| Method | bus | car | cleanser | clock | cup | headphones | mouse | scissors | shoe | stapler | sunglasses | tape cutter | Mean |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| MVCNN | 80% | 70% | 90% | 90% | 100% | 70% | 80% | 60% | 90% | 100% | 80% | 90% | 83.3% |
| DSCNN | 90% | 80% | 100% | 90% | 100% | 80% | 80% | 70% | 90% | 90% | 90% | 90% | 87.5% |
| MV-C3D | 90% | 90% | 100% | 100% | 100% | 90% | 90% | 80% | 100% | 90% | 90% | 100% | 93.3% |

TABLE VII: Classification accuracy on each category in MIRO.
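The Mean column of TABLE VII can be reproduced from the per-category values with a short check. The sketch assumes the mean is the unweighted average over the 12 categories, which is consistent with MIRO having the same number of instances per category if the 120 instances are spread evenly.

```python
# Per-category accuracies (%) copied from TABLE VII, in column order.
accuracies = {
    "MVCNN":  [80, 70, 90, 90, 100, 70, 80, 60, 90, 100, 80, 90],
    "DSCNN":  [90, 80, 100, 90, 100, 80, 80, 70, 90, 90, 90, 90],
    "MV-C3D": [90, 90, 100, 100, 100, 90, 90, 80, 100, 90, 90, 100],
}


def mean_accuracy(values):
    """Unweighted mean over categories, rounded to one decimal place."""
    return round(sum(values) / len(values), 1)


means = {method: mean_accuracy(vals) for method, vals in accuracies.items()}
print(means)
```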

IV-K Ablation study

Ablation study on convolution pattern. Prior work employs 2D convolution to extract features from each view independently. To evaluate the 3D convolution operation in MV-C3D, we build the same neural network as MV-C3D but use 2D convolution instead. As shown in FIGURE IV-K, each view image is processed with individual 2D convolution kernels. TABLE VIII shows the performance of the two convolution patterns on ModelNet10 and ModelNet40. 3D convolution outperforms 2D convolution significantly, which demonstrates the importance of 3D convolution.

FIGURE IV-K: Extracting features from multi-view images independently with 2D convolution.

| Method | ModelNet10 | ModelNet40 |
|---|---|---|
| 2D conv. | 81.8% | 75.3% |
| 3D conv. | 94.5% | 93.9% |

TABLE VIII: Ablation study on convolution pattern. Classification results on ModelNet with different convolution patterns.

Ablation study on model complexity. Most of the parameters in the MV-C3D model come from the last three fully connected layers. The number of parameters can be estimated as $F \cdot D_1 + D_1 \cdot D_2 + D_2 \cdot C$, where $F$ is determined by the number of channels and the size of the feature maps in the last 3D max-pooling layer, $C$ is the number of categories, and $D_1$ and $D_2$ are the dimensions of the first and second fully connected layers, respectively. We increase $D_1$ and $D_2$ from 1024 to 4096 to estimate the impact of model complexity. The experimental results are shown in TABLE IX. For the "MV-C3D-S" model, $D_1$ and $D_2$ are set to 1024. For the "MV-C3D-M" model, both are 2048. We observe that increasing the model complexity does not improve performance much. Increasing the number of input view images has a much more significant impact.
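The dependence of the parameter count on the width of the fully connected layers can be sketched as follows. The function `fc_parameters` ignores biases, and the flattened feature size `FLAT` is a placeholder chosen for illustration, so the printed counts are not expected to reproduce the model sizes in TABLE IX.

```python
def fc_parameters(flat_features: int, dim1: int, dim2: int, num_classes: int) -> int:
    """Weights in three stacked fully connected layers (biases ignored)."""
    return flat_features * dim1 + dim1 * dim2 + dim2 * num_classes


# Hypothetical flattened size of the last pooling output: 512 channels x 7 x 7 x 1.
FLAT = 512 * 7 * 7 * 1
for name, width in (("S", 1024), ("M", 2048), ("full", 4096)):
    # 40 output classes, as in ModelNet40
    print(name, fc_parameters(FLAT, width, width, 40))
```

The first product dominates whenever the flattened feature size is much larger than the layer width, so the count grows roughly linearly with the width in that regime.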

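| Method | Model Size | #Views | Accuracy |
|---|---|---|---|
| MV-C3D-S | 142M | 8 | 82.0% |
| MV-C3D-S | 142M | 12 | 88.7% |
| MV-C3D-S | 142M | 16 | 92.7% |
| MV-C3D-M | 186M | 8 | 82.9% |
| MV-C3D-M | 186M | 12 | 89.3% |
| MV-C3D-M | 186M | 16 | 93.0% |
| MV-C3D | 299M | 8 | 83.7% |
| MV-C3D | 299M | 12 | 91.9% |
| MV-C3D | 299M | 16 | 93.2% |

TABLE IX: Comparison of models with varying complexity.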

V Conclusion

In this paper, we propose MV-C3D, a multi-view based 3D convolutional neural network that can classify 3D objects using multi-view images captured from only partial angles within a limited range. MV-C3D can effectively learn 3D object representations by using 3D convolution layers and max-pooling layers to aggregate the spatially correlated features of images from different viewpoints. Experiments on the ModelNet10 and ModelNet40 benchmarks show that MV-C3D outperforms the state-of-the-art multi-view based methods while using only RGB images from partial viewpoints, which can easily be captured by surveillance cameras or moving cameras. Furthermore, the outstanding results on the real image dataset MIRO suggest that our technique can be applied to real-world multi-view classification tasks. In future work, we plan to explore different architectures to further reduce the number of parameters of the 3D convolution based model while maintaining the classification accuracy.