I Introduction
Recently, deep learning technologies have been widely applied to industrial manufacturing processes [1, 2, 3, 4, 5]. The development of convolutional neural networks (CNNs) has enabled dramatic progress in 3D object recognition, which has a wide range of applications, e.g., autonomous driving [6], robotics [7], and civil monitoring [8]. In this work, we propose a novel 3D CNN architecture that requires only multiple images from limited viewpoints and can still achieve satisfactory classification results.
Currently, most CNN architectures are designed specifically for 2D images [9]. Therefore, to classify 3D models, we need to represent them with voxels or 2D images. In voxel based approaches, 3D models are organized as volumetric occupancy grids [10, 11, 12]. The main advantage of the voxel representation is that it maintains the full geometric information of the original 3D objects. However, these approaches suffer from resolution loss and exponentially growing computational cost [13].
Previously, researchers developed multi-view based methods, which can achieve comparable results with much lower computational cost [14, 15, 16]. However, these methods require multiple images rendered from predefined viewpoints covering the whole circumference, which is quite impractical for real-world applications. Thus, it is much more desirable to perform 3D recognition from multi-view images covering only limited viewpoints. The existing multi-view image approaches [14, 16] treat each view image as an independent variable and feed the images into 2D CNNs separately; the final classifications are derived by aggregating the feature vectors with view-pooling or clustering. These approaches can easily lead to inferior results because they neglect the spatial correlations between the multi-view images.
To address these problems, we propose a multi-view based 3D CNN, or MV-C3D. As illustrated in the overview figure, our technique takes the multi-view images of an object as input and predicts the corresponding category label. Unlike existing multi-view based methods, our model uses multi-view images from only partial angles with a limited range, which makes it more adaptable to real-world applications. Moreover, our technique considers the different viewpoint images as a joint variable instead of independent variables. With the help of 3D convolution and 3D max-pooling layers, the MV-C3D architecture can take advantage of the spatial correlations between multi-view images to learn distinguishing features of different objects.
The main contributions of this work are summarized as follows.
-
We propose MV-C3D, a novel multi-view based 3D convolutional neural network, which requires only partial multi-view images from limited viewpoints and outperforms the current multi-view based state-of-the-art classification methods on the ModelNet benchmark.
-
We combine the images of different views as a joint variable to learn spatially correlated features by using 3D convolution and 3D max-pooling layers. The visualization of feature maps shows that our network focuses on the same part of an object across different view images.
-
We demonstrate experimentally that MV-C3D achieves higher classification accuracy as the number of contiguous view images within a limited angular range increases.
-
We test MV-C3D on MIRO, a 3D rotated real-image dataset whose multiple images are captured from arbitrary but contiguous viewpoints, to demonstrate its performance in real-world scenarios.
II Related Work
Previously, researchers mainly relied on local or global descriptors that map 3D shape information into feature vectors [18, 19, 20, 21]. With the breakthrough of CNNs, neural network based approaches have become more and more popular. Existing works can be generalized into two categories: voxel based methods and multi-view image based methods.
II-A Voxel based method
Wu et al. [22]
constructed a five-layer 3D convolutional deep belief network (CDBN), namely 3D ShapeNet, to learn the probability distribution of 3D voxel grids. Sedaghat et al.
[11] considered 3D object classification as a multi-task problem by introducing object orientation prediction. This model achieved excellent performance, demonstrating that orientation is also an important aspect of 3D object classification [10, 23].

II-B Multi-view images based method
2D image based methods are also important for the 3D object classification problem. Su et al. [14]
proposed a multi-view CNN (MVCNN) based technique to aggregate multiple images into concise descriptors in a view pooling layer, which lies in the middle of a 2D CNN framework pre-trained on ImageNet
[24]. Multi-view images are also used in 3D object retrieval applications [25]. Qi et al. [13] conducted a comprehensive study of voxel based and multi-view based CNNs for 3D object classification. According to these works, two important factors affect model performance: architecture and volume resolution. Therefore, two distinct volumetric networks and a multi-resolution filtering technique were proposed. In particular, Feng et al. [26] proposed a group-view convolutional framework composed of a hierarchical view-group-shape architecture for correlation modeling towards discriminative 3D shape description. Currently, the state-of-the-art method is DSCNN [16], which learns feature vectors from multiple views by using a recurrent clustering strategy. In addition to the above methods, approaches that work directly on point cloud data are attracting increasing attention [27, 28], but their performance is still worse than that of multi-view image based approaches.

FIGURE II-B. Comparison of 2D and 3D convolutions: (a) 2D convolution and (b) 3D convolution. In (b), the kernel size is 3×3×3, which means that it computes correlated features across 3 contiguous views.
Specifically, MVCNN [14]
treats each view image as an independent variable and feeds it into a 2D CNN to compute feature maps. Then it directly performs a full-stride channel-wise max pooling on these feature maps to generate a unified feature vector. Although MVCNN achieves great success in classification, this operation may destroy the spatial correlation information and thus could be further improved. Moreover, DSCNN [16] uses a clustering strategy that may also cause the loss of viewpoint-dimension information. Different from these existing methods, our MV-C3D technique treats the multi-view images as a joint variable and uses 3D convolution and 3D max-pooling to learn both the spatial features and the intrinsic correlations among multi-view images simultaneously.

III MV-C3D Method
III-A Multi-view based 3D convolution
In 2D CNN applications, features are computed only from the 2D spatial dimensions, and thus a single-view image of an object is sufficient. However, for the 3D object recognition problem, 3D object information must be encoded from 3D spatial dimensions, where the different viewpoint images are treated as the third dimension. Compared to 2D CNNs, 3D CNNs can be more efficient and accurate for multi-view feature learning. In 3D CNNs, 3D convolution is performed by applying a 3D kernel to the view images. FIGURE II-B shows the difference between 2D and 3D convolutions. A 2D convolution kernel is applied to an image along the 2D spatial directions only; thus, it cannot capture 3D view information. In contrast, a 3D convolution kernel preserves the spatial correlation information between different view images. Moreover, unlike voxel based 3D CNNs, which focus on learning geometric features, multi-view based 3D CNNs capture the correlated features between multi-view images. Formally, the value at position $(x, y, z)$ on the $j$th feature map in the $i$th layer is given by:
$v_{ij}^{xyz} = f\left(b_{ij} + \sum_{m}\sigma_{ijm}^{xyz}\right)$  (1)

$\sigma_{ijm}^{xyz} = \sum_{p=0}^{P_i-1}\sum_{q=0}^{Q_i-1}\sum_{r=0}^{R_i-1} w_{ijm}^{pqr}\, v_{(i-1)m}^{(x+p)(y+q)(z+r)}$  (2)

where $b_{ij}$ is the bias, $\sigma_{ijm}^{xyz}$ is the result of the convolution with the $m$th feature map, $P_i \times Q_i \times R_i$ is the size of the 3D convolution kernel, $(p, q, r)$ is the offset, $R_i$ is the viewpoint dimension, $w_{ijm}^{pqr}$ is the kernel connected to the $m$th feature map in the $(i-1)$th layer, and $v_{(i-1)m}$ is the $m$th feature map in the $(i-1)$th layer.
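To make the joint treatment of views concrete, the following minimal sketch (our illustration, not the authors' released code) applies a single 3D convolution layer to a stack of view images in TensorFlow, the platform named later in the paper. The batch size, the number of views (12), the image size (112), and the variable names are assumed placeholder values.

```python
# A sketch of multi-view 3D convolution: a stack of view images is treated as
# one 5-D tensor and convolved with a 3x3x3 kernel, so features are shared
# across 3 contiguous views. Shapes here are illustrative assumptions.
import tensorflow as tf

m, s = 12, 112                                  # assumed number of views and image size
views = tf.random.normal([1, m, s, s, 3])       # [batch, views, height, width, RGB]

conv3d = tf.keras.layers.Conv3D(
    filters=64,                 # first-layer channel count stated in the paper
    kernel_size=(3, 3, 3),      # (view, height, width) receptive field
    strides=(1, 1, 1),
    padding="same",             # keep view and spatial sizes unchanged
    activation="relu")          # f(.) in Eq. (1)

feature_cube = conv3d(views)    # shape: [1, m, s, s, 64]
print(feature_cube.shape)
```

With "same" padding and unit strides, the view and spatial dimensions are preserved, consistent with the later statement that the feature-map size stays constant after the convolution layers.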
III-B Partial multi-view images setup
Typically, 3D models in online repositories are stored as polygon meshes, which are collections of vertices, edges, and faces that define the shape of a polyhedral object. We employ the Phong reflection model [29] to render the 3D models from different predefined viewpoints. We assume that the 3D objects are upright oriented along the z-axis [26, 16]. As shown in FIGURE III-B, we fix the z-axis as the rotation axis and place viewpoints separated by 10° around it, elevated above the ground plane. As a result, we generate 36 view images for each object. Unlike the existing omnibearing viewpoint based methods, only a portion of contiguous images from limited viewpoints is required.
Moreover, 2D images with higher resolution preserve more information, which can lead to better performance at the cost of computational time. To balance computational cost and performance, we set each image to a fixed size of $s \times s$ pixels.
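The viewpoint layout described above can be sketched as follows; the rendering itself (Phong shading of the meshes) is omitted, and the camera radius and elevation angle are assumed placeholder values, since the elevation is not specified here.

```python
# A sketch (assumptions, not the authors' rendering pipeline) of the viewpoint
# layout: 36 azimuths spaced 10 degrees apart around the z-axis at a fixed
# elevation, from which a contiguous subset of m views is selected.
import numpy as np

def camera_positions(radius=2.0, elevation_deg=30.0, n_views=36):
    """Return (n_views, 3) camera centers on a circle around the z-axis."""
    az = np.deg2rad(np.arange(n_views) * 360.0 / n_views)   # 10-degree steps
    el = np.deg2rad(elevation_deg)                           # assumed elevation
    x = radius * np.cos(el) * np.cos(az)
    y = radius * np.cos(el) * np.sin(az)
    z = np.full_like(az, radius * np.sin(el))
    return np.stack([x, y, z], axis=1)

def contiguous_subset(n_views=36, m=12, start=0):
    """Indices of m contiguous viewpoints starting at an arbitrary azimuth."""
    return [(start + i) % n_views for i in range(m)]

cams = camera_positions()
print(cams.shape, contiguous_subset(start=7)[:5])
```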
FIGURE III-B. Different input representation setups.
III-C Network Architecture
Based on the structure of VGGNet [30], we propose MV-C3D, which is essentially a 3D CNN capable of processing contiguous multi-view images (FIGURE III-C). The input is a stacked cube of multi-view RGB images of size $m \times s \times s \times 3$, where $m$ is the number of views and $s$ is the height and width of a single image. Unlike the existing methods [14, 16], we do not compute 2D spatial features on each view image independently. Instead, we take these images as a single instance and learn spatially correlated features between the multi-view images.
Our network architecture contains eight 3D convolution layers, five 3D max-pooling layers, three fully-connected layers, and a softmax function to estimate the output distribution. For the fully-connected layers, the dimensions of the first two layers are equal to 4096, while that of the third layer is determined by the number of classes. The activation function of the 3D convolution and fully-connected layers is rectified linear units (ReLUs). We also implement a dropout layer following the first two fully-connected layers to reduce overfitting.
FIGURE III-C. The architecture of MV-C3D. $m$ is the number of view images. White layers denote the 3D convolution operation, green layers denote 3D max-pooling, and yellow layers denote fully connected layers. $\lfloor\cdot\rfloor$ is the floor function, and the number in parentheses denotes the number of channels of the feature cube.
Name | Type | Filter size / stride | Output size
---|---|---|---
Input | - | - | m × s × s × 3
Conv1 | Convolution | 3×3×3 / 1×1×1 | m × s × s × 64
Pool1 | Max pooling | 1×2×2 / 1×2×2 | m × s/2 × s/2 × 64
Conv2 | Convolution | 3×3×3 / 1×1×1 | m × s/2 × s/2 × 128
Pool2 | Max pooling | 2×2×2 / 2×2×2 | ⌊m/2⌋ × s/4 × s/4 × 128
Conv3_a | Convolution | 3×3×3 / 1×1×1 | ⌊m/2⌋ × s/4 × s/4 × 256
Conv3_b | Convolution | 3×3×3 / 1×1×1 | ⌊m/2⌋ × s/4 × s/4 × 256
Pool3 | Max pooling | 2×2×2 / 2×2×2 | ⌊m/4⌋ × s/8 × s/8 × 256
Conv4_a | Convolution | 3×3×3 / 1×1×1 | ⌊m/4⌋ × s/8 × s/8 × 512
Conv4_b | Convolution | 3×3×3 / 1×1×1 | ⌊m/4⌋ × s/8 × s/8 × 512
Pool4 | Max pooling | 2×2×2 / 2×2×2 | ⌊m/8⌋ × s/16 × s/16 × 512
Conv5_a | Convolution | 3×3×3 / 1×1×1 | ⌊m/8⌋ × s/16 × s/16 × 512
Conv5_b | Convolution | 3×3×3 / 1×1×1 | ⌊m/8⌋ × s/16 × s/16 × 512
Pool5 | Max pooling | 2×2×2 / 2×2×2 | ⌊m/16⌋ × s/32 × s/32 × 512
Fc1 | Fully connected | 4096 | 4096
Fc2 | Fully connected | 4096 | 4096
Fc3 | Fully connected | C | C
Softmax | Softmax | - | C

(m: number of views; s: height and width of the input image; C: number of classes.)
The spatial size of the convolution kernels is fixed to 3×3, the same as in 2D CNNs [30]. It has been demonstrated that a small 3×3 spatial receptive field can increase the performance of DNN models in 2D recognition. Therefore, we set the spatial kernel size to 3×3 to compute the 2D spatial features within each view image and set the third, viewpoint dimension to 3 to aggregate spatially correlated features between view images. The number of filters in the eight convolution layers is 64, 128, 256, 256, 512, 512, 512, and 512, respectively. We add padding in both the spatial and view dimensions in all convolution layers, so that the size of the feature maps remains constant after these layers.
For the pooling layers, to preserve the 2D spatial features in the single-view images, we set the kernel size to 1×2×2 with a stride of 1×2×2 in the first pooling layer. In other words, we apply 2D spatial max-pooling to each view image. Apart from the first pooling layer, the remaining pooling layers implement 3D max-pooling with a kernel size of 2×2×2 and a stride of 2×2×2. Therefore, the spatial size of the output feature maps is scaled down by a factor of 32 (2^5) compared with the original input. Meanwhile, the viewpoint dimension is scaled down by a factor of 16 (2^4), as shown in TABLE I.
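Putting the stated layer counts, channel widths, kernel sizes, and pooling sizes together, a Keras sketch of the backbone might look as follows. This is an approximation under assumptions: the input resolution, the dropout rate, and the use of "same" padding in the 3D pooling layers (so that odd view counts do not collapse to zero) are not taken from the paper.

```python
# A sketch of the described MV-C3D layout: eight 3x3x3 convolutions, five
# max-pooling layers (the first one spatial-only), and three fully connected
# layers. Input size and padding details are assumptions.
import tensorflow as tf
from tensorflow.keras import layers, models

def build_mv_c3d(num_views=12, image_size=112, num_classes=40):
    conv = lambda c: layers.Conv3D(c, (3, 3, 3), padding="same", activation="relu")
    pool = lambda: layers.MaxPooling3D((2, 2, 2), strides=(2, 2, 2), padding="same")
    return models.Sequential([
        layers.InputLayer(input_shape=(num_views, image_size, image_size, 3)),
        conv(64),
        layers.MaxPooling3D((1, 2, 2), strides=(1, 2, 2)),  # 2D spatial pooling only
        conv(128), pool(),
        conv(256), conv(256), pool(),
        conv(512), conv(512), pool(),
        conv(512), conv(512), pool(),
        layers.Flatten(),
        layers.Dense(4096, activation="relu"), layers.Dropout(0.5),
        layers.Dense(4096, activation="relu"), layers.Dropout(0.5),
        layers.Dense(num_classes, activation="softmax"),
    ])

model = build_mv_c3d()
model.summary()
```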
IV Experiment
IV-A Experimental setup
Dataset. We evaluate our MV-C3D model on the 3D ModelNet benchmark [22]. It is a comprehensive collection of 3D CAD models, containing 127,915 models in 662 categories. As shown in TABLE II, two subsets of ModelNet are widely used: ModelNet10, with 4,899 object instances in 10 categories, and ModelNet40, with 12,311 object instances in 40 categories. Both are fully labeled and used in many state-of-the-art studies [16, 14, 10, 31, 32]. The datasets also provide default training and testing splits; ModelNet10 has 3,991 training and 908 testing samples, and ModelNet40 has 9,843 training and 2,468 testing samples. We use the default splits in our experiments.
Training details.
We perform experiments on a machine with an NVIDIA TITAN X Pascal GPU, an Intel Core i7-6700K CPU, and 32 GB of RAM. Our proposed model is implemented on the TensorFlow [33] platform, a popular deep learning library from Google. The neural network is trained using Adam [34]
optimization. The initial learning rate is set to 0.0001 and is divided by 10 every 20 epochs during training. The loss function is cross-entropy with weight regularization, as shown in the following equation:

$\mathcal{L} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{c=1}^{C}\mathbb{1}\{y_i = c\}\log p_{i,c} + \lambda\sum_{k=1}^{K}\lVert w_k\rVert_2^2$  (3)
where $N$ is the mini-batch size, $C$ is the number of categories (e.g., $C = 10$ for ModelNet10 and $C = 40$ for ModelNet40), $y_i$ and $p_{i,c}$ represent the true label and the prediction score, respectively, $\mathbb{1}\{\cdot\}$ is the indicator function, $\lambda$ is the weighting parameter, which is set to 0.0005 empirically, $w_k$ denotes the filter parameters, initialized from a zero-mean Gaussian distribution with a standard deviation of 0.05, and $K$ is the total number of weight parameters. In the training phase, we divide the default training data into a training set and a validation set at a ratio of 4 to 1. We calculate the validation loss every epoch and stop the training once the validation loss has converged over 5 consecutive epochs.
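A sketch of the stated objective and schedule is given below: softmax cross-entropy plus L2 weight regularization with λ = 0.0005, optimized by Adam with an initial learning rate of 1e-4 divided by 10 every 20 epochs. The variable names and the restriction of the regularizer to kernel weights are our assumptions, not the authors' training script.

```python
# Cross-entropy with L2 weight regularization and a step learning-rate decay,
# as described in the text. Names and details are illustrative assumptions.
import tensorflow as tf

LAMBDA = 5e-4                                    # weighting parameter from the text

def total_loss(model, logits, labels):
    ce = tf.reduce_mean(
        tf.keras.losses.sparse_categorical_crossentropy(
            labels, logits, from_logits=True))
    l2 = tf.add_n([tf.nn.l2_loss(w) for w in model.trainable_weights
                   if "kernel" in w.name])       # regularize filter weights only
    return ce + LAMBDA * l2

def lr_schedule(epoch, base_lr=1e-4):
    return base_lr * (0.1 ** (epoch // 20))      # divided by 10 every 20 epochs

optimizer = tf.keras.optimizers.Adam(learning_rate=lr_schedule(0))
```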
Name | Train split | Test split | Total |
---|---|---|---|
ModelNet10 | 3991 | 908 | 4899 |
ModelNet40 | 9843 | 2468 | 12311 |
IV-B Exploring viewpoint dimension of kernel
A small receptive field is appropriate for 2D spatial feature learning, according to the findings of VGGNet [30]. Thus, we fix the spatial dimensions of the 3D convolution kernel to 3×3 and vary only the viewpoint dimension to find the optimal 3D kernel size. Moreover, we set the number of views to 12, which is consistent with existing methods.
During the experiment, we first assume that all convolution kernels have the same viewpoint dimension, and evaluate four different 3D kernel sizes with the viewpoint dimension fixed to 1, 3, 5, or 7 from the first to the eighth convolution layer. Then we let the viewpoint dimension vary across convolution layers. For this setting, we test two types of networks, with the viewpoint dimension in decreasing and increasing order, respectively, and choose the better-performing one to compare with the other settings. In particular, we choose 7-5-5-5-3-3-1-1 to represent the decreasing order and 1-1-3-3-5-5-7-7 for the increasing order.
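The compared kernel configurations can be expressed compactly as below; this is an illustrative sketch that only varies the viewpoint dimension of each 3D kernel while keeping the 3×3 spatial size, with the channel widths taken from the architecture description.

```python
# Build the eight convolution layers with a per-layer viewpoint dimension,
# e.g. all-3, 7-5-5-5-3-3-1-1 (decreasing), or 1-1-3-3-5-5-7-7 (increasing).
import tensorflow as tf
from tensorflow.keras import layers

def conv_stack(view_dims, channels=(64, 128, 256, 256, 512, 512, 512, 512)):
    return [layers.Conv3D(c, (v, 3, 3), padding="same", activation="relu")
            for v, c in zip(view_dims, channels)]

uniform_3 = conv_stack([3] * 8)
decreasing = conv_stack([7, 5, 5, 5, 3, 3, 1, 1])
increasing = conv_stack([1, 1, 3, 3, 5, 5, 7, 7])
```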
The networks are trained on the training set of ModelNet10 and tested on the testing set. FIGURE IV-B shows the experimental results. The 3D convolution kernel with the viewpoint dimension fixed to 3 gives the best performance. Therefore, we use 3×3×3 kernels in the following experiments. Moreover, an interesting observation is that when the viewpoint dimension is equal to 1, the performance is the worst among all settings. This is expected, since such a kernel is essentially equivalent to a 2D convolution kernel and hence cannot capture multi-view features. This suggests that 3D CNNs can effectively learn spatially correlated features between multi-view images and improve the classification results.
FIGURE IV-B. Exploring the viewpoint dimension of the 3D convolution kernel on ModelNet10.
IV-C Over-sampling
Class imbalance can significantly affect the performance and generalization ability of models [35, 36]. As shown in FIGURE III, the number of instances in each category varies greatly. To eliminate the influence of this data bias, we randomly select object instances from the under-represented categories and duplicate them as new instances of the same categories, so that the number of instances in each category is balanced. Specifically, we increase the number of instances in each category to 500 to create a more balanced training set. Thus, our scheme significantly reduces the imbalanced data problem. After applying this strategy, the classification accuracies on ModelNet10 and ModelNet40 shown in TABLE III indicate that the model performance is slightly improved.
Method | ModelNet10 | ModelNet40 |
---|---|---|
No sampling | 90.5% | 89.8% |
Oversampling | 91.1% | 90.1% |
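A minimal sketch of the oversampling strategy described above, under the assumption that balancing is done by randomly duplicating instances of the under-represented categories until each category reaches 500 training instances:

```python
# Random oversampling of minority categories to a fixed target count.
# The helper name and data layout are illustrative assumptions.
import random
from collections import defaultdict

def oversample(instances, labels, target=500):
    """instances: list of samples; labels: parallel list of category ids."""
    by_class = defaultdict(list)
    for x, y in zip(instances, labels):
        by_class[y].append(x)
    out_x, out_y = [], []
    for y, xs in by_class.items():
        picked = list(xs)
        while len(picked) < target:              # duplicate random existing items
            picked.append(random.choice(xs))
        out_x.extend(picked)
        out_y.extend([y] * len(picked))
    return out_x, out_y
```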
FIGURE III. The number of instances for each category in ModelNet10 and ModelNet40.
IV-D Pre-training
In 2D object classification applications, model performance can be significantly improved by pre-training on ImageNet [5]. Similarly, the existing 3D multi-view based methods [14, 37], which are built on 2D CNNs such as VGG-M [38], can also improve their classification accuracy when pre-trained on ImageNet. Unfortunately, our MV-C3D model cannot be pre-trained on ImageNet because it lacks multi-view images. Therefore, we employ UCF101 [39], an action classification dataset collected from YouTube, to pre-train our 3D CNN. It has been demonstrated that 3D CNNs can effectively learn the relevance between different video frames [40]. As shown in TABLE IV, MV-C3D also achieves better classification accuracy when pre-trained on UCF101 and then fine-tuned.
# | Pre-training | Oversampling | ModelNet10 | ModelNet40
---|---|---|---|---
1 | | | 90.5% | 89.8%
2 | | ✓ | 91.1% | 90.1%
3 | ✓ | | 91.7% | 91.0%
4 | ✓ | ✓ | 92.0% | 91.5%
IV-E Effect of the number of view images
In this section, we explore the impact of the number of view images on classification performance. FIGURE IV-E shows the performance of MV-C3D on ModelNet40 with the number of view images varying from 1 to 36. The performance is poor when the number of view images is below 4, because of the lack of sufficient spatially correlated features. As the number increases, the classification accuracy improves rapidly. MV-C3D achieves 89.3% classification accuracy with only 10 views and 93.2% with 16 views. The performance converges to 93.9% (±0.1%) with more than 20 views.
FIGURE IV-E. Classification accuracy versus the number of view images on ModelNet40.
IV-F Experiment on ModelNet
We compare our MV-C3D technique with other methods based on various input modalities (e.g., voxels, point clouds). As shown in TABLE V, our proposed MV-C3D model achieves 93.9% classification accuracy and 90.5% retrieval mAP on ModelNet40, which outperforms all the other methods based on voxels and point clouds. Meanwhile, for multi-view input, it improves upon the MVCNN technique by 3.8% and 11% on the classification and retrieval tasks, respectively. Among the multi-modal methods, our method outperforms all approaches except Spherical Projections [17], which is 0.3% better. However, that approach requires depth information, which is impractical to obtain and requires more resources to process. Moreover, MV-C3D achieves the best result of 94.5% on the ModelNet10 classification task.
Method | Input Modality | ModelNet40 Classification (Accuracy) | ModelNet40 Retrieval (mAP) | ModelNet10 Classification (Accuracy) | ModelNet10 Retrieval (mAP)
---|---|---|---|---|---
Beam Search [41] | Voxel | 81.26% | - | 88% | -
VoxNet [10] | Voxel | 83% | - | 92% | -
3D-GAN [23] | Voxel | 83.3% | - | 91% | -
LightNet [42] | Voxel | 86.9% | - | 93.39% | -
FPNN [43] | Voxel | 88.4% | - | - | -
MVCNN-MultiRes [13] | Voxel | 91.4% | - | - | -
3D ShapeNets [22] | Point Cloud | 77.32% | 49.23% | 83.54% | 68.26%
PointNet [44] | Point Cloud | - | - | 77.6% | -
PointNet++ [45] | Point Cloud | 91.9% | - | - | -
Set-convolution [46] | Point Cloud | 90% | - | - | -
Angular Triplet-Center [47] | Point Cloud | - | 86.11% | - | 92.07%
Geo-CNN [48] | Point Cloud | 93.9% | - | - | -
DeepPano [49] | Multi-view | 77.63% | 76.81% | 85.45% | 84.18%
MVCNN [14] | Multi-view | 90.1% | 79.5% | - | -
FusionNet [37] | Multi-view + Voxel | 90.8% | - | 93.11% | -
PVRNet [50] | Multi-view + Point Cloud | 93.6% | 90.5% | - | -
Multiple Depth [51] | Multi-view + Depth | 87.8% | - | 91.5% | -
DSCNN [16] | Multi-view + Depth | 93.8% | - | - | -
Spherical Projections [17] | Multi-view + Depth | 94.24% | - | - | -
MV-C3D (proposed) | Multi-view | 93.9% | 90.5% | 94.5% | 91.4%
IV-G Replicability
To demonstrate the replicability of our method, we repeated the experiment on ModelNet10 for 20 trials. FIGURE IV-G shows the accuracy curve with an error band. The accuracies have high variance at the beginning of training. However, as the number of training epochs increases, the results converge and stabilize at 94.5% (±0.15%) after 30 training epochs. This result suggests that our method has good replicability.

FIGURE IV-G. Results on ModelNet10 with error band.
IV-H Visualization of learned features
For a better understanding of how MV-C3D works, we visualize the learned features using the method from [52]. FIGURE IV-H shows the deconvolution of one learned feature map. We can see that in the first example MV-C3D focuses on the empennage in all view images. In the second example, it focuses on the empennage and airfoil simultaneously. The third focuses on the chair legs and the fourth on the chair handle. The fact that, during learning, images from different angles focus on the same part suggests that MV-C3D can effectively capture correlated features between multi-view images.
FIGURE IV-H. Visualization of learned features across multi-view images. In the first example, the feature focuses on the empennage in all view images. The second focuses on the empennage and airfoil simultaneously. The third focuses on the chair legs and the fourth on the chair handle.
IV-I Comparison with Multi-view based methods
In this section, we compare our technique with the multi-view based method MVCNN [14] and the multi-modal method DSCNN [16] on the ModelNet40 classification task. As shown in TABLE VI, MV-C3D achieves 91.4% accuracy when taking all view images with a 30° interval (12 views), which outperforms MVCNN by 0.9% but is 0.8% worse than DSCNN under the same input setting. We believe that the correlated feature information between views is weak when they are highly dissimilar because of the large interval. However, when we take 12 contiguous view images with a 10° interval as input, our performance increases to 91.9% while that of the other methods decreases. As the number of views grows, the performance of MV-C3D improves further and reaches 93.9% with 20 views, outperforming the other methods under the same input setting. The reason may be that prior multi-view based methods mainly rely on global contour information, whereas MV-C3D only needs the correlated information between multi-view images. This is an advantage when MV-C3D is applied in real-world scenarios, where objects are usually captured within a limited angle rather than omnidirectionally.
Method | View Interval | #Views | Accuracy
---|---|---|---
MVCNN | 30° | 12 | 89.5%
MVCNN | 10° | 8 | 80.1%
MVCNN | 10° | 12 | 82.7%
MVCNN | 10° | 16 | 84.1%
MVCNN | 10° | 20 | 85.3%
DSCNN | 30° | 12 | 92.2%
DSCNN | 10° | 8 | 87.6%
DSCNN | 10° | 12 | 90.3%
DSCNN | 10° | 16 | 91.4%
DSCNN | 10° | 20 | 92.1%
MV-C3D | 30° | 12 | 91.4%
MV-C3D | 10° | 8 | 90.5%
MV-C3D | 10° | 12 | 91.9%
MV-C3D | 10° | 16 | 93.2%
MV-C3D | 10° | 20 | 93.9%
IV-J Experiment on 3D rotated image dataset
In this section, we test our technique on a 3D rotated real-image dataset, "Multi-view Images of Rotated Objects (MIRO)" [15]. In the previous experiments, we assumed that the viewpoints are uniformly distributed along a circle. However, in real-world applications, objects are often observed from arbitrary directions, which is closer to the setting of MIRO. MIRO consists of 120 object instances in 12 categories, and each instance has 160 images (10 elevation angles and 16 azimuth angles) captured from viewpoints approximately equally distributed over the sphere. FIGURE IV-J shows an example object and its corresponding multi-view images. We randomly select 12 contiguous views (in both the elevation and azimuth directions) as input to test our model, which is trained on the ModelNet40 dataset. The accuracy for each category is reported in TABLE VII. In almost all cases, MV-C3D outperforms the other multi-view based methods, which suggests that MV-C3D is accurate, robust, and more practical.

FIGURE IV-J. A clock exemplar and its multi-view images in the MIRO dataset.
Method | bus | car | cleanser | clock | cup | headphones | mouse | scissors | shoe | stapler | sunglasses | tape cutter | Mean
---|---|---|---|---|---|---|---|---|---|---|---|---|---
MVCNN | 80% | 70% | 90% | 90% | 100% | 70% | 80% | 60% | 90% | 100% | 80% | 90% | 83.3%
DSCNN | 90% | 80% | 100% | 90% | 100% | 80% | 80% | 70% | 90% | 90% | 90% | 90% | 87.5%
MV-C3D | 90% | 90% | 100% | 100% | 100% | 90% | 90% | 80% | 100% | 90% | 90% | 100% | 93.3%
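For reference, one possible way to implement the contiguous-view selection on MIRO's 10 × 16 elevation-azimuth grid is sketched below; the 3 × 4 window shape is purely an assumption, since the text only states that the 12 selected views are contiguous in both directions.

```python
# Draw 12 contiguous (elevation, azimuth) indices from a 10 x 16 view grid,
# wrapping around in azimuth. The window shape is an illustrative assumption.
import random

def contiguous_miro_views(n_elev=10, n_azim=16, rows=3, cols=4):
    e0 = random.randrange(n_elev - rows + 1)
    a0 = random.randrange(n_azim)
    return [(e0 + i, (a0 + j) % n_azim)
            for i in range(rows) for j in range(cols)]   # 12 (elev, azim) pairs

print(contiguous_miro_views())
```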
IV-K Ablation study
Ablation study on convolution pattern. Prior work employs 2D convolution to extract features from each view independently. To evaluate the 3D convolution operation in MV-C3D, we build the same neural network as MV-C3D but with 2D convolution operations instead. As shown in FIGURE IV-K, each view image is processed by its own 2D convolution kernels. TABLE VIII shows the performance of the different convolution patterns on ModelNet10 and ModelNet40. 3D convolution outperforms 2D convolution significantly, which demonstrates the importance of 3D convolution.
FIGURE IV-K. Extracting features from multi-view images independently using 2D convolution.
Method | ModelNet10 | ModelNet40 |
---|---|---|
2D conv. | 81.8% | 75.3% |
3D conv. | 94.5% | 93.9% |
Ablation study on model complexity. Most of the parameters in the MV-C3D model come from the last three fully connected layers. Their number can be estimated as $P \approx D d_1 + d_1 d_2 + d_2 C$, where $D$ is determined by the number of channels and the size of the feature maps in the last 3D max-pooling layer, $C$ is the number of categories, and $d_1$ and $d_2$ are the dimensions of the first two fully connected layers, respectively. We vary $d_1$ and $d_2$ from 1024 to 4096 to estimate the impact of model complexity. The experimental results are shown in TABLE IX. For the "MV-C3D-S" model, $d_1$ and $d_2$ are set to 1024; for the "MV-C3D-M" model, both are 2048. We observe that increasing the model complexity does not improve performance much, whereas increasing the number of input view images has a much more significant impact.
Method | Model Size | #Views | Accuracy
---|---|---|---
MV-C3D-S | 142M | 8 | 82.0%
MV-C3D-S | 142M | 12 | 88.7%
MV-C3D-S | 142M | 16 | 92.7%
MV-C3D-M | 186M | 8 | 82.9%
MV-C3D-M | 186M | 12 | 89.3%
MV-C3D-M | 186M | 16 | 93.0%
MV-C3D | 299M | 8 | 83.7%
MV-C3D | 299M | 12 | 91.9%
MV-C3D | 299M | 16 | 93.2%
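The parameter estimate above can be checked with a few lines of arithmetic; the flattened feature size D used here is an assumed value, so the resulting numbers are indicative only.

```python
# Rough count of the fully connected weights P ~ D*d1 + d1*d2 + d2*C for the
# three model variants. D is an assumed flattened feature size, not a value
# taken from the paper.
def fc_params(D, d1, d2, num_classes=40):
    return D * d1 + d1 * d2 + d2 * num_classes

D = 1 * 4 * 4 * 512          # assumed: views x height x width x channels after Pool5
for name, d in [("MV-C3D-S", 1024), ("MV-C3D-M", 2048), ("MV-C3D", 4096)]:
    print(name, f"{fc_params(D, d, d) / 1e6:.1f}M FC weights")
```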
V Conclusion
In this paper, we propose MV-C3D, a multi-view based 3D convolutional neural network that performs 3D object classification using multi-view images captured from only partial angles with a limited range. MV-C3D effectively learns 3D object representations by using 3D convolution and 3D max-pooling layers to aggregate the spatially correlated features of different viewpoint images. Experiments on the ModelNet10 and ModelNet40 benchmarks show that MV-C3D outperforms state-of-the-art multi-view based methods using only RGB images from partial viewpoints, which can easily be captured by surveillance or moving cameras. Furthermore, the strong results on the real-image dataset MIRO suggest that our technique can be applied to real-world multi-view classification tasks. In future work, we plan to explore different architectures to further reduce the parameters of the 3D convolution based model while maintaining classification accuracy.
References
- [1] Q. Xuan, Z. Chen, Y. Liu, H. Huang, G. Bao, and D. Zhang, “Multiview generative adversarial network and its application in pearl classification,” IEEE Trans. Ind. Electron., vol. 66, no. 10, pp. 8244–8252, 2019.
- [2] Y. Liu, Y. Fan, and J. Chen, “Flame images for oxygen content prediction of combustion systems using dbn,” Energy Fuels, vol. 31, no. 8, pp. 8776–8783, 2017.
- [3] Y. Liu, C. Yang, Z. Gao, and Y. Yao, “Ensemble deep kernel learning with application to quality prediction in industrial polymerization processes,” Chemometr. Intell. Lab. Syst., vol. 174, pp. 15–21, 2018.
- [4] Q. Xuan, B. Fang, Y. Liu, J. Wang, J. Zhang, Y. Zheng, and G. Bao, “Automatic pearl classification machine based on a multistream convolutional neural network,” IEEE Trans. Ind. Electron., vol. 65, no. 8, pp. 6538–6547, 2018.
- [5] Q. Xuan, H. Xiao, C. Fu, and Y. Liu, “Evolving convolutional neural network and its application in fine-grained visual categorization,” IEEE Access, vol. 6, pp. 31 110–31 116, 2018.
- [6] M. Simon, S. Milz, K. Amende, and H.-M. Gross, “Complex-yolo: Real-time 3d object detection on point clouds,” Mar. 2018, [Online]. Available: https://arxiv.org/abs/1803.06199.
- [7] P. Tsarouchi, S.-A. Matthaiakis, G. Michalos, S. Makris, and G. Chryssolouris, “A method for detection of randomly placed objects for robotic handling,” CIRP J. Manuf. Sci. Technol., vol. 14, pp. 20–27, 2016.
- [8] S. Ji, W. Xu, M. Yang, and K. Yu, “3d convolutional neural networks for human action recognition,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 35, no. 1, pp. 221–231, 2013.
-
[9]
J. Hu, L. Shen, and G. Sun, “Squeeze-and-excitation networks,” in
Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR)
, Jun. 2018, pp. 7132–7141. - [10] D. Maturana and S. Scherer, “Voxnet: A 3d convolutional neural network for real-time object recognition,” in Proc. Rep. U. S. (IROS), Sep. 2015, pp. 922–928.
- [11] N. Sedaghat, M. Zolfaghari, E. Amiri, and T. Brox, “Orientation-boosted voxel nets for 3d object recognition,” Apr. 2016, [Online]. Available: https://arxiv.org/abs/1604.03351.
- [12] S. Zhi, Y. Liu, X. Li, and Y. Guo, “Toward real-time 3d object recognition: A lightweight volumetric cnn framework using multitask learning,” Computers Graphics, vol. 71, pp. 199–207, 2018.
- [13] C. R. Qi, H. Su, M. Nießner, A. Dai, M. Yan, and L. J. Guibas, “Volumetric and multi-view cnns for object classification on 3d data,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 5648–5656.
- [14] H. Su, S. Maji, E. Kalogerakis, and E. Learned-Miller, “Multi-view convolutional neural networks for 3d shape recognition,” in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Dec. 2015, pp. 945–953.
- [15] A. Kanezaki, Y. Matsushita, and Y. Nishida, “Rotationnet: Joint object categorization and pose estimation using multiviews from unsupervised viewpoints,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2018, pp. 5010–5019.
- [16] C. Wang, M. Pelillo, and K. Siddiqi, “Dominant set clustering and pooling for multi-view 3d object recognition,” in Proc. Brit. Mach. Vis. Conf. (BMVC), Sep. 2017.
- [17] Z. Cao, Q. Huang, and R. Karthik, “3d object classification via spherical projections,” in Proc. Int. Conf. 3D Vis. (3DV), Oct. 2017, pp. 566–574.
- [18] Y. Guo, F. Sohel, M. Bennamoun, M. Lu, and J. Wan, “Rotational projection statistics for 3d local surface description and object recognition,” Int. J. Comput. Vis., vol. 105, no. 1, pp. 63–86, 2013.
- [19] K. Tang, P. Song, and X. Chen, “3d object recognition in cluttered scenes with robust shape description and correspondence selection,” IEEE Access, vol. 5, pp. 1833–1845, 2017.
- [20] F. Tombari, S. Salti, and L. Di Stefano, “Unique signatures of histograms for local surface description,” in Proc. Eur. Conf. Comput. Vis. (ECCV), Sep. 2010, pp. 356–369.
- [21] J. Knopp, M. Prasad, G. Willems, R. Timofte, and L. Van Gool, “Hough transform and 3d surf for robust three dimensional classification,” in Proc. Eur. Conf. Comput. Vis. (ECCV), Sep. 2010, pp. 589–602.
- [22] Z. Wu, S. Song, A. Khosla, F. Yu, L. Zhang, X. Tang, and J. Xiao, “3d shapenets: A deep representation for volumetric shapes,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2015, pp. 1912–1920.
- [23] J. Wu, C. Zhang, T. Xue, B. Freeman, and J. Tenenbaum, “Learning a probabilistic latent space of object shapes via 3d generative-adversarial modeling,” in Proc. Adv. Neural Inf. Process. Syst. (NIPS), Dec. 2016, pp. 82–90.
- [24] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “Imagenet: A large-scale hierarchical image database,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2009, pp. 248–255.
- [25] S. Bai, X. Bai, Z. Zhou, Z. Zhang, and L. Jan Latecki, “Gift: A real-time and scalable 3d shape search engine,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 5023–5032.
- [26] Y. Feng, Z. Zhang, X. Zhao, R. Ji, and Y. Gao, “Gvcnn: Group-view convolutional neural networks for 3d shape recognition,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2018, pp. 264–272.
- [27] J. Li, B. M. Chen, and G. H. Lee, “So-net: Self-organizing network for point cloud analysis,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2018, pp. 9397–9406.
- [28] R. Q. Charles, H. Su, M. Kaichun, and L. J. Guibas, “Pointnet: Deep learning on point sets for 3d classification and segmentation,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 77–85.
- [29] B. T. Phong, “Illumination for computer generated pictures,” Commun. ACM, vol. 18, no. 6, pp. 311–317, 1975.
- [30] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” Sep. 2014, [Online]. Available: https://arxiv.org/abs/1409.1556.
- [31] Z. Han, M. Shang, Z. Liu, C.-M. Vong, Y.-S. Liu, M. Zwicker, J. Han, and C. P. Chen, “Seqviews2seqlabels: Learning 3d global features via aggregating sequential views by rnn with attention,” IEEE Trans. on Image Process., vol. 28, no. 2, pp. 658–672, 2019.
- [32] M. Yavartanoo, E. Y. Kim, and K. M. Lee, “Spnet: Deep 3d object classification and retrieval using stereographic projection,” Nov. 2018, [Online]. Available: https://arxiv.org/abs/1811.01571.
-
[33]
M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin,
S. Ghemawat, G. Irving, M. Isard et al.
, “Tensorflow: a system for large-scale machine learning.” in
Proc. USENIX Symp. Oper. Syst. Des. Implement (OSDI), Nov. 2016, pp. 265–283. - [34] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” Dec. 2014, [Online]. Available: https://arxiv.org/abs/1412.6980.
- [35] M. Buda, A. Maki, and M. A. Mazurowski, “A systematic study of the class imbalance problem in convolutional neural networks,” Oct. 2017, [Online]. Available: https://arxiv.org/abs/1710.05381.
- [36] B. Krawczyk, "Learning from imbalanced data: open challenges and future directions," Prog. Artif. Intell., vol. 5, no. 4, pp. 221–232, 2016.
- [37] V. Hegde and R. Zadeh, “Fusionnet: 3d object classification using multiple data representations,” Jul. 2016, [Online]. Available: https://arxiv.org/abs/1607.05695.
- [38] K. Chatfield, K. Simonyan, A. Vedaldi, and A. Zisserman, “Return of the devil in the details: Delving deep into convolutional nets,” May 2014, [Online]. Available: https://arxiv.org/abs/1405.3531.
- [39] K. Soomro, A. R. Zamir, and M. Shah, “Ucf101: A dataset of 101 human actions classes from videos in the wild,” Dec. 2012, [Online]. Available: https://arxiv.org/abs/1212.0402.
- [40] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri, “Learning spatiotemporal features with 3d convolutional networks,” in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Dec. 2015, pp. 4489–4497.
- [41] X. Xu and S. Todorovic, “Beam search for learning a deep convolutional neural network of 3d shapes,” Dec. 2016, [Online]. Available: https://arxiv.org/abs/1612.04774.
- [42] S. Zhi, Y. Liu, X. Li, and Y. Guo, “Lightnet: A lightweight 3d convolutional neural network for real-time 3d object recognition,” in Proc. Eurographics Workshop on 3D Object Retr. (3DOR), Apr. 2017.
- [43] Y. Li, S. Pirk, H. Su, C. R. Qi, and L. J. Guibas, “Fpnn: Field probing neural networks for 3d data,” in Proc. Adv. Neural Inf. Process. Syst. (NIPS), Dec. 2016, pp. 307–315.
- [44] A. Garcia-Garcia, F. Gomez-Donoso, J. Garcia-Rodriguez, S. Orts-Escolano, M. Cazorla, and J. Azorin-Lopez, “Pointnet: A 3d convolutional neural network for real-time object class recognition,” in Proc. Int. Jt. Conf. Neural Netw. (IJCNN), Jul. 2016, pp. 1578–1584.
- [45] C. R. Qi, L. Yi, H. Su, and L. J. Guibas, “Pointnet++: Deep hierarchical feature learning on point sets in a metric space,” in Proc. Adv. Neural Inf. Process. Syst. (NIPS), Dec. 2017, pp. 5099–5108.
- [46] S. Ravanbakhsh, J. Schneider, and B. Poczos, “Deep learning with sets and point clouds,” Nov. 2016, [Online]. Available: https://arxiv.org/abs/1611.04500.
- [47] Z. Li, C. Xu, and B. Leng, “Angular triplet-center loss for multi-view 3d shape retrieval,” Nov. 2018, [Online]. Available: https://arxiv.org/abs/1811.08622.
- [48] S. Lan, R. Yu, G. Yu, and L. S. Davis, “Modeling local geometric structure of 3d point clouds using geo-cnn,” Nov. 2018, [Online]. Available: https://arxiv.org/abs/1811.07782.
- [49] B. Shi, S. Bai, Z. Zhou, and X. Bai, “Deeppano: Deep panoramic representation for 3-d shape recognition,” IEEE Signal Process. Lett., vol. 22, no. 12, pp. 2339–2343, 2015.
- [50] H. You, Y. Feng, X. Zhao, C. Zou, R. Ji, and Y. Gao, “Pvrnet: Point-view relation neural network for 3d shape recognition,” Dec. 2018, [Online]. Available: https://arxiv.org/abs/1812.00333.
- [51] P. Zanuttigh and L. Minto, “Deep learning for 3d shape classification from multiple depth maps,” in Proc. Int. Conf. Image Proc. (ICIP), Sep. 2017, pp. 3615–3619.
- [52] M. D. Zeiler and R. Fergus, “Visualizing and understanding convolutional networks,” in Proc. Eur. Conf. Comput. Vis. (ECCV), Sep. 2014, pp. 818–833.