I Introduction
Recently, deep learning technologies have been widely applied to several industrial manufacturing processes [1, 2, 3, 4, 5]. The development of convolutional neural networks (CNNs) has enabled dramatic progress in 3D object recognition, which has a wide range of applications, e.g., autonomous driving [6], robotics [7], and civil monitoring [8]. In this work, we propose a novel 3D CNN architecture that requires only multiple images from limited viewpoints and still achieves satisfactory classification results.
Currently, most CNN architectures are designed specifically for 2D images [9]. Therefore, to classify 3D models, we first need to convert them into voxel or 2D image representations. For voxel based approaches, 3D models are organized as volumetric occupancy grids [10, 11, 12]. The main advantage of the voxel representation is that it maintains the full geometrical information of the original 3D objects. However, those approaches also suffer from resolution loss and a computational cost that grows rapidly with the grid resolution [13].
Previously, researchers developed multiview based methods, which can derive comparable results with much lower computational cost [14, 15, 16]. However, these methods require multiple images taken from predefined viewpoints around the whole circumference of the object, which is impractical for real-world applications. Thus, it is much more desirable to perform 3D recognition from multiview images taken from limited viewpoints. The existing multiview image approaches [14, 16] treat each view image as an independent variable and feed the images into 2D CNNs separately, and the final classifications are derived by aggregating the feature vectors with view pooling or clustering. Because they neglect the spatial correlations between the multiview images, those approaches can easily lead to inferior results.
Thus, to address these problems, we propose a multiview based 3D CNN, namely MVC3D. As shown in the overview figure, our technique takes the multiview images of an object as input and predicts the corresponding category label. Unlike the existing multiview based methods, our model uses multiview images from only a partial range of angles, which makes it more adaptable to real-world applications. Moreover, our technique considers the different viewpoint images as a joint variable instead of independent variables. With the help of 3D convolution and 3D max-pooling layers, our MVC3D architecture can take advantage of the spatial correlations between multiview images to learn features that distinguish different objects.
The main contributions of this work are summarized as follows.

We propose, for the first time, a multiview based 3D convolutional neural network, namely MVC3D, which requires only partial multiview images from limited viewpoints and outperforms the current state-of-the-art multiview based classification methods on the ModelNet benchmark.

We combine the images of different views as a joint variable to learn spatially correlated features by using 3D convolution and 3D max-pooling layers. The visualization of feature maps shows that our network can focus on the same part of an object in different view images.

We demonstrate experimentally that the classification accuracy of MVC3D increases with the number of contiguous view images taken from a partial range of angles.

We test MVC3D on MIRO, a dataset of real images of 3D rotated objects in which multiple images are captured from arbitrary but contiguous viewpoints, to demonstrate its performance in real-world scenarios.
II Related Work
Previously, researchers mainly relied on local or global descriptors that map 3D shape information into feature vectors [18, 19, 20, 21]. With the breakthrough of CNNs, neural network based approaches have become increasingly popular. Existing works fall into two categories: voxel based methods and multiview image based methods.
II-A Voxel based method
Wu et al. [22] constructed a five-layer 3D convolutional deep belief network (CDBN), namely 3D ShapeNets, to learn the probability distribution of 3D voxel grids. Sedaghat et al. [11] considered 3D object classification as a multitask problem by introducing object orientation prediction. This model achieved excellent performance, which demonstrates that orientation is also an important aspect of 3D object classification [10, 23].
II-B Multiview images based method
2D image based methods are also important for the 3D object classification problem. Su et al. [14] proposed a multiview CNN (MVCNN) based technique that aggregates multiple images into concise descriptors in a view pooling layer, which lies in the middle of a 2D CNN framework pretrained on ImageNet [24]. Multiview images are also used in 3D object retrieval applications [25]. Qi et al. [13] conducted a comprehensive study of voxel based and multiview based CNNs for 3D object classification. According to this study, two important factors affect model performance: architecture and volume resolution. Therefore, two distinct volumetric networks and a multiresolution filtering technique were proposed in [13]. Feng et al. [26] proposed a group-view convolutional framework composed of a hierarchical view-group-shape architecture for correlation modeling towards discriminative 3D shape description. Currently, the state-of-the-art method is DSCNN [16], which learns feature vectors from multiple views by using a recurrent clustering strategy. In addition to the above methods, methods that work directly on point cloud data are attracting increasing attention [27, 28], but their performance is still worse than that of multiview image based approaches.
[Figure: Comparison of 2D and 3D convolutions: (a) 2D convolution and (b) 3D convolution. In (b), the kernel spans 3 views along the viewpoint dimension, so it computes the correlated features between 3 contiguous views.]
Specifically, MVCNN [14] treats each view image as an independent variable and feeds it into 2D CNNs to compute feature maps. It then performs a full stride channel-wise max pooling on these feature maps to generate a unified feature vector. Although MVCNN achieves great success in pearl classification, this operation may destroy the spatial correlation information and thus could be further improved. Moreover, DSCNN [16] uses a clustering strategy that may also lose the information in the viewpoint dimension. Different from these existing methods, our MVC3D technique treats the multiview images as a joint variable and uses 3D convolution and 3D max-pooling to learn both the spatial features and the intrinsic correlations among multiview images simultaneously.
III MVC3D Method
III-A Multiview based 3D convolution
In 2D CNN applications, we only need to compute features from the 2D spatial dimensions, and thus a single-view image of the object is sufficient. For the 3D object recognition problem, however, 3D object information must be encoded from three dimensions, where the different viewpoint images form the third dimension. Compared to 2D CNNs, 3D CNNs can be more efficient and accurate for multiview feature learning. In 3D CNNs, 3D convolution is performed by applying a 3D kernel to the stack of view images. FIGURE IIB shows the difference between 2D and 3D convolutions. A 2D convolution kernel is applied to an image in the two spatial directions only, so it cannot include information across views. A 3D convolution kernel, on the other hand, can preserve the spatial correlation information between different view images. Moreover, unlike voxel-based 3D CNNs, which focus on learning geometrical features, multiview based 3D CNNs can capture the correlated features between multiview images. Formally, the value at position on the th feature map in the th layer is given by:
(1) 
(2) 
where is the bias, is the result of convolution with the th feature map, is the size of 3D convolution kernel, is the offset, is the viewpoint dimension, is the kernel connected to the th feature map in the th layer, and is the th feature map in the th layer.
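As a reference point, a 3D convolution matching the description above can be written in the commonly used form below. The notation is an assumption made here for illustration only: $(x, y, z)$ is a position with $z$ along the viewpoint dimension, $i$ indexes the layer, $j$ the feature map in that layer, $m$ the feature maps of the previous layer, $P \times Q \times R$ is the kernel size, and $f$ is the activation function (ReLU in this network, according to the architecture description).

```latex
\begin{align}
v_{ij}^{xyz} &= f\Big( b_{ij} + \sum_{m} c_{ijm}^{xyz} \Big), \\
c_{ijm}^{xyz} &= \sum_{p=0}^{P-1} \sum_{q=0}^{Q-1} \sum_{r=0}^{R-1}
  w_{ijm}^{pqr} \, v_{(i-1)m}^{(x+p)(y+q)(z+r)}.
\end{align}
```

In this sketch, $b_{ij}$ is the bias, $c_{ijm}^{xyz}$ is the contribution of the $m$th input feature map, $(p, q, r)$ is the offset inside the kernel, and $w_{ijm}^{pqr}$ is the kernel weight. Setting $R = 1$ removes every term that mixes neighbouring views, which reduces the operation to a per-view 2D convolution.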
III-B Partial multiview images setup
Typically, 3D models in online repositories are stored as polygon meshes, which are collections of vertices, edges, and faces that define the shape of a polyhedral object. We employ the Phong reflection model [29] to render 3D models at different predefined viewpoints. We assume that the 3D objects are upright oriented along with axis [26, 16]. As shown in FIGURE IIIB, we fix the axis as the rotation axis and then place viewpoints separated by an angle of 10° around the axis. The viewpoints are elevated by from the ground plane. As a result, we generate 36 view images. Unlike the existing methods based on omnidirectional viewpoints, our method requires only a portion of contiguous images from limited viewpoints.
Moreover, 2D images with higher resolution preserve more information, which can lead to better performance at the cost of computational time. To balance the computational cost and performance, we set the size of each image to .
[Figure: Different input representation setup.]
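The viewpoint layout described above can be sketched as follows. This is a minimal illustration rather than the rendering pipeline used in the paper: the choice of z as the vertical axis, the 30° elevation default, the unit radius, and both function names are assumptions introduced here.

```python
import math

def camera_positions(num_views=36, elevation_deg=30.0, radius=1.0):
    """Camera centres on a circle around the vertical axis (assumed to be z).

    elevation_deg and radius are illustrative defaults, not values from the text.
    """
    step = 360.0 / num_views                 # 10 degrees for 36 views
    elev = math.radians(elevation_deg)
    positions = []
    for k in range(num_views):
        azim = math.radians(k * step)
        positions.append((
            radius * math.cos(elev) * math.cos(azim),
            radius * math.cos(elev) * math.sin(azim),
            radius * math.sin(elev),
        ))
    return positions

def contiguous_views(num_total, num_selected, start=0):
    """Indices of num_selected neighbouring views, wrapping around the circle."""
    return [(start + k) % num_total for k in range(num_selected)]

cams = camera_positions()
subset = contiguous_views(len(cams), 12, start=30)
print(subset)
```

Selecting a contiguous run of indices, instead of evenly spaced ones, is what restricts the input to a partial range of angles.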
III-C Network Architecture
Based on the structure of VGGNet [30], we propose MVC3D, which is essentially a 3D CNN capable of processing contiguous multiview images (FIGURE IIIC). The input is a stacked cube of multiview RGB images with the size of , where is the number of views, is the height and width of a single image. Unlike the existing methods [14, 16], we do not compute 2D spatial features on different view images independently. Instead, we take these images as a single instance and learn spatially correlated features between multiview images.
Our network architecture contains eight 3D convolution layers, five 3D max-pooling layers, three fully connected layers, and a softmax function to estimate the output distribution. The first two fully connected layers have 4096 units each, while the size of the third is determined by the number of classes. The activation function of the 3D convolution and fully connected layers is the rectified linear unit (ReLU). We also implement a dropout layer following the first two fully connected layers to reduce overfitting.
[Figure: The architecture of MVC3D. White layers represent 3D convolution, green layers represent 3D max-pooling, and yellow layers represent fully connected layers. The number in parentheses is the number of channels of the feature cube.]
Name  Type  Filter size/stride  Output size 

Input  
Conv1  Convolution  
Pool1  Max pooling  
Conv2  Convolution  
Pool2  Max pooling  
Conv3_a  Convolution  
Conv3_b  Convolution  
Pool3  Max pooling  
Conv4_a  Convolution  
Conv4_b  Convolution  
Pool4  Max pooling  
Conv5_a  Convolution  
Conv5_b  Convolution  
Pool5  Max pooling  
Fc1  Fully connected  
Fc2  Fully connected  
Fc3  Fully connected  
Softmax 
The size of the convolution kernels is fixed to 3×3×3, which follows the 3×3 kernels of the 2D CNNs in [30]. It has been demonstrated that a small spatial receptive field of 3×3 can increase the performance of DNN models in 2D recognition. Therefore, we set the spatial kernel size to 3×3 to compute the 2D spatial features in each view image and set the third, viewpoint dimension to 3 to aggregate spatially correlated features between view images. The numbers of filters in the eight convolution layers are 64, 128, 256, 256, 512, 512, 512, and 512, respectively. We add padding to both the spatial and view dimensions in all convolution layers, so that the size of the feature maps remains constant after these layers.
For the pooling layers, to preserve the 2D spatial features in the single-view images, the first pooling layer uses a kernel size and stride of 2×2 in the spatial dimensions and 1 in the viewpoint dimension. In other words, we apply 2D spatial max-pooling on each view image. The remaining pooling layers implement 3D max-pooling with a kernel size and stride of 2 in all three dimensions. Therefore, the spatial size of the output feature maps is scaled down by a factor of 32 (2^5) compared with the original input. Meanwhile, the viewpoint dimension is scaled down by a factor of 16 (2^4), as shown in TABLE I.
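To give a sense of scale for the convolutional part of the network, the sketch below counts the parameters of eight stacked 3D convolution layers with the filter counts listed above. It assumes 3×3×3 kernels, a 3-channel RGB input, and one bias per filter; the function names are introduced here for illustration.

```python
# Filter counts of the eight 3D convolution layers, as listed in the text.
FILTERS = [64, 128, 256, 256, 512, 512, 512, 512]

def conv3d_params(in_channels, out_channels, kernel=(3, 3, 3)):
    """Weights plus one bias per output channel for a single 3D convolution layer."""
    kd, kh, kw = kernel
    return kd * kh * kw * in_channels * out_channels + out_channels

def total_conv_params(filters=FILTERS, in_channels=3, kernel=(3, 3, 3)):
    """Sum the parameter counts layer by layer, starting from the RGB input."""
    total = 0
    for out_channels in filters:
        total += conv3d_params(in_channels, out_channels, kernel)
        in_channels = out_channels
    return total

print(total_conv_params())
```

Under these assumptions the convolution layers hold roughly 27.7 million parameters, and the count is independent of the number of input views because the same kernels slide along the viewpoint dimension.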
IV Experiment
IV-A Experimental setup
Dataset. We evaluate our MVC3D model on the ModelNet benchmark [22], a comprehensive collection of 3D CAD models that contains 127,915 models divided into 662 categories. As shown in TABLE II, two subsets of ModelNet are widely used: ModelNet10, with 4,899 object instances in 10 categories, and ModelNet40, with 12,311 object instances in 40 categories. Both are fully labeled and used in many state-of-the-art studies [16, 14, 10, 31, 32]. The datasets also provide default training and testing splits: ModelNet10 has 3,991 training and 908 testing samples, and ModelNet40 has 9,843 training and 2,468 testing samples. We use these default splits in our experiments.
Training detail. We perform experiments on a machine with an NVIDIA TITAN X Pascal GPU, an Intel Core i7-6700K CPU, and 32GB RAM. Our proposed model is implemented in TensorFlow [33], a popular deep learning library from Google. The neural network is trained using Adam [34] optimization. The initial learning rate is set to 0.0001 and divided by 10 every 20 epochs during training. The loss function is cross-entropy with weight regularization, as shown in the following equation:
(3)
where is the minibatch size, is the number of categories (e.g., for ModelNet10, and for ModelNet40), and represent the true label and the prediction score, respectively, is the indicator function, is the weighting parameter which is set to 0.0005 empirically, is the filter parameters initialized with a zero-mean Gaussian distribution with standard deviation of 0.05, and is the total number of parameters.
In the training phase, we divide the default training set into a training set and a validation set in a ratio of 4 to 1. We calculate the validation loss every epoch and stop the training when validation loss converges in 5 epochs (, ), with defined by
(4) 
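The training recipe above can be illustrated with a small sketch of the step learning-rate schedule and a regularized cross-entropy loss. The code is not the authors' implementation: the L2 form of the weight penalty, the averaging over the minibatch, and the function names are assumptions made here, and the early-stopping rule is left out.

```python
import math

def learning_rate(epoch, base_lr=1e-4, drop_every=20, factor=10.0):
    """Step schedule: divide the learning rate by `factor` every `drop_every` epochs."""
    return base_lr / (factor ** (epoch // drop_every))

def regularized_cross_entropy(probs, labels, weights, lam=5e-4):
    """Mean cross-entropy over a minibatch plus an (assumed) L2 penalty on the weights.

    probs:   list of per-sample class probability lists (each sums to 1)
    labels:  list of integer class indices
    weights: flat list of filter parameters
    """
    data_term = -sum(math.log(p[y]) for p, y in zip(probs, labels)) / len(labels)
    penalty = lam * sum(w * w for w in weights)
    return data_term + penalty

for epoch in (0, 19, 20, 45):
    print(epoch, learning_rate(epoch))

loss = regularized_cross_entropy([[0.5, 0.5], [0.25, 0.75]], [0, 1], [1.0, -2.0])
print(loss)
```

The indicator function in the loss is realized here by indexing the predicted probability of the true class, `p[y]`, so only that class contributes to the sum.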
Name  Train split  Test split  Total 

ModelNet10  3991  908  4899 
ModelNet40  9843  2468  12311 
IV-B Exploring viewpoint dimension of kernel
According to the findings in VGGNet [30], a small receptive field of the convolution kernel is appropriate for 2D spatial feature learning. Thus, we fix the spatial dimensions of the 3D convolution kernel to 3×3 and vary only the viewpoint dimension to find the optimal 3D convolution kernel size. Moreover, we set the number of views to 12, which is consistent with the existing methods.
During the experiment, we first assume that all convolution kernels have the same viewpoint dimension. Thus, we evaluate 4 different 3D kernel sizes in which the viewpoint dimension is fixed to 1, 3, 5, or 7 from the first to the eighth convolution layer. Then we let the dimension vary across the convolution layers. For this setting, we test two types of networks, with the viewpoint dimension of the kernel in decreasing order and in increasing order, respectively, and choose the better one to compare with the other settings. In particular, we choose 75553311 to represent the decreasing order and 11335577 for the increasing order.
The networks are trained on the training set of ModelNet10 and tested on its testing set. FIGURE IVB shows the experimental results. The 3D convolution kernel whose viewpoint dimension is fixed to 3 gives the best performance. Therefore, we use 3×3×3 kernels in the following experiments. Moreover, an interesting observation is that the performance is the worst when the viewpoint dimension is equal to 1. This is expected, since such a kernel is essentially equivalent to a 2D convolution kernel and hence cannot capture multiview features. This suggests that 3D CNNs can effectively learn spatially correlated features between multiview images and improve the classification results.
[Figure: Exploring viewpoint dimension of 3D convolution kernel on ModelNet10.]
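The remark that a viewpoint dimension of 1 is equivalent to 2D convolution can be checked on toy data. The sketch below uses a naive, unpadded, single-channel 3D cross-correlation written in plain Python; the helper name and the random toy inputs are illustrative assumptions and have nothing to do with the trained network.

```python
import random

def conv3d_valid(volume, kernel):
    """Naive 'valid' 3D cross-correlation on nested lists.

    volume: [views][height][width], kernel: [kd][kh][kw].
    """
    kd, kh, kw = len(kernel), len(kernel[0]), len(kernel[0][0])
    d, h, w = len(volume), len(volume[0]), len(volume[0][0])
    return [[[sum(kernel[p][q][r] * volume[z + p][y + q][x + r]
                  for p in range(kd) for q in range(kh) for r in range(kw))
              for x in range(w - kw + 1)]
             for y in range(h - kh + 1)]
            for z in range(d - kd + 1)]

random.seed(0)
views = [[[random.random() for _ in range(6)] for _ in range(6)] for _ in range(5)]
k2d = [[random.random() for _ in range(3)] for _ in range(3)]
k3d = [[[random.random() for _ in range(3)] for _ in range(3)] for _ in range(3)]

# Viewpoint dimension 1: identical to filtering each view separately in 2D.
depth1 = conv3d_valid(views, [k2d])
per_view = [conv3d_valid([v], [k2d])[0] for v in views]

# Viewpoint dimension 3: each output slice combines 3 neighbouring views.
depth3 = conv3d_valid(views, k3d)

print(depth1 == per_view, len(depth1), len(depth3))
```

With a depth-1 kernel every output slice depends on exactly one view, so no cross-view correlation can be learned; with depth 3, the five input views yield three output slices that each mix three contiguous views.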
IV-C Oversampling
Class imbalance can significantly affect the performance and generalization ability of models [35, 36]. As shown in FIGURE III, the number of instances in each category varies greatly. To reduce the influence of this data bias, we randomly select object instances from the categories with fewer instances and add each selected instance to the same category as a new instance. To create a more balanced training set, we increase the number of instances in each category to 500. The classification accuracies on ModelNet10 and ModelNet40 after applying this strategy are shown in TABLE III; the model performance is slightly improved.
Method  ModelNet10  ModelNet40 

No sampling  90.5%  89.8% 
Oversampling  91.1%  90.1% 
[Figure: The number of instances for each category in ModelNet10 and ModelNet40.]
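The oversampling strategy of this section can be sketched as random duplication of instances in the smaller categories. The function below is a minimal illustration under stated assumptions: categories that already reach the target size are left unchanged, duplicates are drawn with replacement, and the function name and toy data are hypothetical.

```python
import random
from collections import defaultdict

def oversample(samples, target=500, seed=0):
    """Duplicate randomly chosen instances until every class has at least `target` items.

    samples: list of (instance_id, label) pairs. Classes that already have
    `target` or more instances are left unchanged (an assumption made here).
    """
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for item in samples:
        by_label[item[1]].append(item)
    balanced = []
    for label, items in by_label.items():
        extra = max(0, target - len(items))
        balanced.extend(items)
        balanced.extend(rng.choice(items) for _ in range(extra))
    return balanced

toy = [(f"a{i}", "bathtub") for i in range(3)] + [(f"b{i}", "chair") for i in range(8)]
balanced = oversample(toy, target=5)
counts = {label: sum(1 for _, l in balanced if l == label) for label in ("bathtub", "chair")}
print(counts)
```

Only the training split should be passed to such a function, so that duplicated instances never leak into validation or testing.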
IV-D Pretraining
In 2D object classification applications, model performance can be significantly improved by pretraining on ImageNet [5]. Similarly, the existing 3D multiview based methods [14, 37], which are based on 2D CNNs such as VGG-M [38], also achieve higher classification accuracy when pretrained on ImageNet. Unfortunately, our MVC3D model cannot be pretrained on ImageNet, because ImageNet does not provide multiview images. Therefore, we pretrain our 3D CNN on UCF101 [39], an action classification dataset collected from YouTube. It has been demonstrated that 3D CNNs can effectively learn correlated features between different video frames [40]. As shown in TABLE IV, our MVC3D also achieves better classification accuracy when it is pretrained on UCF101 and then fine-tuned.
Setting  Pretraining  Oversampling  ModelNet10  ModelNet40

1  No  No  90.5%  89.8%
2  No  Yes  91.1%  90.1%
3  Yes  No  91.7%  91.0%
4  Yes  Yes  92.0%  91.5%
IV-E Effect of the number of view images
In this section, we explore the impact of the number of view images on the classification performance. FIGURE IVE shows the performance of our MVC3D with the number of view images varying from 1 to 36 on ModelNet40. The performance is poor when the number of view images is below 4 because of the lack of sufficient spatially correlated features. As the number of views increases, the classification accuracy improves rapidly. Our MVC3D technique achieves 89.3% classification accuracy with only 10 views and 93.2% with 16 views. The performance converges to 93.9% (±0.1%) with more than 20 views.
[Figure: Classification result versus number of view images on ModelNet40.]
IV-F Experiment on ModelNet
We compare our MVC3D technique with other methods based on varying input modalities (e.g., voxel, point cloud). As shown in TABLE V, our proposed MVC3D model achieves a classification accuracy of 93.9% and a retrieval mAP of 90.5% on ModelNet40, which matches or outperforms all the listed voxel and point cloud based methods. Meanwhile, among multiview methods, it improves on MVCNN by 3.8% and 11% on the classification and retrieval tasks, respectively. Compared with multimodal methods, our method outperforms all the other methods except for the Spherical Projections [17] technique, which is 0.3% better. However, that approach requires depth information, which is less practical to obtain and requires more resources to process. Moreover, MVC3D achieves the best classification result, 94.5%, on ModelNet10.
Method  Input Modality  ModelNet40 Classification (Accuracy)  ModelNet40 Retrieval (mAP)  ModelNet10 Classification (Accuracy)  ModelNet10 Retrieval (mAP)
Beam Search [41]  Voxel  81.26%    88%   
VoxNet [10]  83%    92%    
3DGAN [23]  83.3%    91%    
LightNet [42]  86.9%    93.39%    
FPNN [43]  88.4%        
MVCNNMultiRes [13]  91.4%        
3D ShapeNets [22]  Point Cloud  77.32%  49.23%  83.54%  68.26% 
PointNet [44]      77.6%    
PointNet++ [45]  91.9%        
Setconvolution [46]  90%        
Angular TripletCenter [47]    86.11%    92.07%  
GeoCNN [48]  93.9%        
DeepPano [49]  Multiview  77.63%  76.81%  85.45%  84.18% 
MVCNN [14]  90.1%  79.5%      
FusionNet [37]  Multiview + Voxel  90.8%    93.11%   
PRVNet [50]  Multiview + Point Cloud  93.6%  90.5%     
Multiple Depth [51]  Multiview + Depth  87.8%    91.5%   
DSCNN [16]  Multiview + Depth  93.8%       
Spherical Projections [17]  Multiview + Depth  94.24%       
MVC3D (proposed)  Multiview  93.9%  90.5%  94.5%  91.4% 
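The retrieval columns above report mean average precision (mAP). The text does not spell out the retrieval protocol, so the sketch below only shows the standard definition of the metric; it assumes that every relevant item appears somewhere in the ranked list, and the function names and toy rankings are introduced here for illustration.

```python
def average_precision(ranked_relevance):
    """Average precision for one query.

    ranked_relevance: list of booleans, one per retrieved item in rank order,
    True when the item has the same category as the query.
    """
    hits, precision_sum = 0, 0.0
    for rank, relevant in enumerate(ranked_relevance, start=1):
        if relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0

def mean_average_precision(all_rankings):
    """Mean of the per-query average precisions."""
    return sum(average_precision(r) for r in all_rankings) / len(all_rankings)

queries = [
    [True, False, True, False],   # AP = (1/1 + 2/3) / 2
    [False, True, False, False],  # AP = (1/2) / 1
]
print(round(mean_average_precision(queries), 4))
```

A ranking that places all same-category shapes first scores 1.0, so mAP rewards descriptors that keep objects of the same class close together.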
IV-G Replicability
To demonstrate the replicability of our method, we repeated the experiment on ModelNet10 for 20 trials. FIGURE IVG shows the accuracy curve with an error band. The accuracies have high variance at the beginning of training. However, as the number of training epochs increases, the results converge and are stable at 94.5% (±0.15%) after 30 training epochs. This result suggests that our method has good replicability.
[Figure: Results on ModelNet10 with error band.]
IV-H Visualization of learned features
For a better understanding of how MVC3D works, we visualize the learned features by using the method from [52]. FIGURE IVH shows the deconvolution of one learned feature map of the layer. In the first example, MVC3D focuses on the empennage in all view images. The second focuses on the empennage and airfoil simultaneously. The third focuses on the chair legs, and the fourth focuses on the chair handle. The fact that, during the learning process, images from different angles focus on the same feature suggests that MVC3D can effectively capture correlated features between multiview images.
[Figure: Visualization of learned features between multiview images. In the first example, the feature focuses on the empennage in all view images. The second focuses on the empennage and airfoil simultaneously. The third focuses on the chair leg and the fourth focuses on the chair handle.]
IV-I Comparison with Multiview based methods
In this section, we compare our technique with the multiview based method MVCNN [14] and the multimodal method DSCNN [16] on the ModelNet40 classification task. As shown in TABLE VI, MVC3D achieves 91.4% accuracy by taking all view images with interval 30° (°), which outperforms MVCNN by 1.9% but is 0.8% worse than DSCNN under the same input setting. We believe that the correlated feature information between views is weak when they are highly dissimilar because of the large interval. However, when we take 12 contiguous view images with interval 10° (°) as input, the performance of MVC3D increases to 91.9%, while that of the other methods decreases. As the number of views increases, the performance of MVC3D improves and reaches 93.9% with 20 views, which outperforms the other methods under the same input setting. The reason may be that the prior multiview based methods mainly rely on global contour information, while our MVC3D only needs the correlated information between multiview images. This is an advantage when MVC3D is applied in real-world scenarios, where objects are usually captured from a limited range of angles instead of from all directions.
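Method  View Interval  #Views  Accuracy

MVCNN  30°  12  89.5%
MVCNN  10°  8  80.1%
MVCNN  10°  12  82.7%
MVCNN  10°  16  84.1%
MVCNN  10°  20  85.3%
DSCNN  30°  12  92.2%
DSCNN  10°  8  87.6%
DSCNN  10°  12  90.3%
DSCNN  10°  16  91.4%
DSCNN  10°  20  92.1%
MVC3D  30°  12  91.4%
MVC3D  10°  8  90.5%
MVC3D  10°  12  91.9%
MVC3D  10°  16  93.2%
MVC3D  10°  20  93.9%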
IV-J Experiment on 3D rotated image dataset
In this section, we test our technique on a real image dataset of 3D rotated objects, "Multiview Images of Rotated Objects (MIRO)" [15]. In the previous experiments, we assumed that the viewpoints are uniformly distributed along a circle. However, in real-world applications, objects are often observed from arbitrary directions, which is closer to the setting of MIRO. MIRO consists of 120 object instances in 12 categories, and each instance has 160 images (10 different elevation angles and 16 different azimuth angles) captured from viewpoints approximately equally distributed in spherical space. FIGURE IVJ shows an example object and the corresponding multiview images. We randomly select 12 contiguous views (in both the elevation and azimuth directions) as input to test our model, which is trained on the ModelNet40 dataset. The accuracy on each category is reported in TABLE VII. MVC3D matches or outperforms the other multiview based methods in every category except stapler, and achieves the highest mean accuracy. This suggests that MVC3D is accurate, robust, and more practical.
[Figure: A clock exemplar and the multiview images in MIRO dataset.]
Method  bus  car  cleanser  clock  cup  headphones  mouse  scissors  shoe  stapler  sunglasses  tape cutter  Mean

MVCNN  80%  70%  90%  90%  100%  70%  80%  60%  90%  100%  80%  90%  83.3% 
DSCNN  90%  80%  100%  90%  100%  80%  80%  70%  90%  90%  90%  90%  87.5% 
MVC3D  90%  90%  100%  100%  100%  90%  90%  80%  100%  90%  90%  100%  93.3% 
IV-K Ablation study
Ablation study on convolution pattern. Prior work employs 2D convolution to extract features from each view independently. To evaluate the 3D convolution operation in MVC3D, we build the same neural network as MVC3D but use 2D convolution instead. As shown in FIGURE IVK, each view image is processed using individual 2D convolution kernels. TABLE VIII shows the performance of the two types of convolution on ModelNet10 and ModelNet40. 3D convolution outperforms 2D convolution significantly, by 12.7% on ModelNet10 and 18.6% on ModelNet40, which demonstrates the importance of 3D convolution.
[Figure: Extracting features on multiview images by applying 2D convolution independently.]
Method  ModelNet10  ModelNet40 

2D conv.  81.8%  75.3% 
3D conv.  94.5%  93.9% 
Ablation study on model complexity. Most of the parameters in the MVC3D model come from the last three fully connected layers. The parameter number can be estimated as , where is determined by the channel numbers and the size of feature maps in the last 3D max-pooling layer, is the number of categories, , are the dimensions of the second and third fully connected layer, respectively. We increase and from 1024 to 4096 to estimate the impact of model complexity. The experimental results are shown in TABLE IX. For the "MVC3DS" model, , are set to 1024. For the "MVC3DM" model, both are 2048. We observe that increasing the model complexity does not improve performance much. Increasing the number of input view images has a much more significant impact: with 16 views, the accuracy rises only from 92.7% to 93.2% between the smallest and the largest model, whereas going from 8 to 16 views improves every model by more than 9%.
Method  Model Size  #Views  Accuracy 

MVC3DS  142M  8  82.0% 
12  88.7%  
16  92.7%  
MVC3DM  186M  8  82.9% 
12  89.3%  
16  93.0%  
MVC3D  299M  8  83.7% 
12  91.9%  
16  93.2% 
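To illustrate how the width of the fully connected layers drives the model size, the sketch below counts the weights and biases of three stacked fully connected layers. The flattened input size `D_IN` is a hypothetical placeholder, not a value from the paper, and the exact estimation formula used by the authors may differ from this count.

```python
def fc_params(d_in, d1, d2, num_classes):
    """Weights and biases of three stacked fully connected layers."""
    layers = [(d_in, d1), (d1, d2), (d2, num_classes)]
    return sum(n_in * n_out + n_out for n_in, n_out in layers)

# D_IN is a hypothetical flattened feature size, not a value from the paper.
D_IN = 8192
for name, width in [("small", 1024), ("medium", 2048), ("large", 4096)]:
    print(name, fc_params(D_IN, width, width, num_classes=40))
```

Because the first layer contributes `d_in * d1` weights, the total grows quickly with both the layer width and the size of the last pooled feature cube, which is consistent with the observation that these layers dominate the parameter count.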
V Conclusion
In this paper, we propose MVC3D, a multiview based 3D convolutional neural network that performs 3D object classification using multiview images captured from only a partial range of angles. MVC3D effectively learns 3D object representations by using 3D convolution and max-pooling layers to aggregate the spatially correlated features of different viewpoint images. Experiments on the ModelNet10 and ModelNet40 benchmarks show that MVC3D outperforms the state-of-the-art multiview based methods while using only RGB images from partial viewpoints, which can easily be captured by surveillance cameras or moving cameras. Furthermore, the results on the real image dataset MIRO suggest that our technique can be applied to real-world multiview classification tasks. In future work, we plan to explore different architectures to further reduce the number of parameters of the 3D convolution based model while maintaining classification accuracy.
References
 [1] Q. Xuan, Z. Chen, Y. Liu, H. Huang, G. Bao, and D. Zhang, “Multiview generative adversarial network and its application in pearl classification,” IEEE Trans. Ind. Electron., vol. 66, no. 10, pp. 8244–8252, 2019.
 [2] Y. Liu, Y. Fan, and J. Chen, “Flame images for oxygen content prediction of combustion systems using dbn,” Energy Fuels, vol. 31, no. 8, pp. 8776–8783, 2017.
 [3] Y. Liu, C. Yang, Z. Gao, and Y. Yao, “Ensemble deep kernel learning with application to quality prediction in industrial polymerization processes,” Chemometr. Intell. Lab. Syst., vol. 174, pp. 15–21, 2018.
 [4] Q. Xuan, B. Fang, Y. Liu, J. Wang, J. Zhang, Y. Zheng, and G. Bao, “Automatic pearl classification machine based on a multistream convolutional neural network,” IEEE Trans. Ind. Electron., vol. 65, no. 8, pp. 6538–6547, 2018.
 [5] Q. Xuan, H. Xiao, C. Fu, and Y. Liu, “Evolving convolutional neural network and its application in finegrained visual categorization,” IEEE Access, vol. 6, pp. 31 110–31 116, 2018.
 [6] M. Simon, S. Milz, K. Amende, and H.M. Gross, “Complexyolo: Realtime 3d object detection on point clouds,” Mar. 2018, [Online]. Available: https://arxiv.org/abs/1803.06199.
 [7] P. Tsarouchi, S.A. Matthaiakis, G. Michalos, S. Makris, and G. Chryssolouris, “A method for detection of randomly placed objects for robotic handling,” CIRP J. Manuf. Sci. Technol., vol. 14, pp. 20–27, 2016.
 [8] S. Ji, W. Xu, M. Yang, and K. Yu, “3d convolutional neural networks for human action recognition,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 35, no. 1, pp. 221–231, 2013.

 [9] J. Hu, L. Shen, and G. Sun, “Squeeze-and-excitation networks,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2018, pp. 7132–7141.
 [10] D. Maturana and S. Scherer, “Voxnet: A 3d convolutional neural network for real-time object recognition,” in Proc. IEEE/RSJ Int. Conf. Intell. Robots Syst. (IROS), Sep. 2015, pp. 922–928.
 [11] N. Sedaghat, M. Zolfaghari, E. Amiri, and T. Brox, “Orientationboosted voxel nets for 3d object recognition,” Apr. 2016, [Online]. Available: https://arxiv.org/abs/1604.03351.
 [12] S. Zhi, Y. Liu, X. Li, and Y. Guo, “Toward real-time 3d object recognition: A lightweight volumetric cnn framework using multitask learning,” Computers & Graphics, vol. 71, pp. 199–207, 2018.
 [13] C. R. Qi, H. Su, M. Nießner, A. Dai, M. Yan, and L. J. Guibas, “Volumetric and multiview cnns for object classification on 3d data,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 5648–5656.
 [14] H. Su, S. Maji, E. Kalogerakis, and E. LearnedMiller, “Multiview convolutional neural networks for 3d shape recognition,” in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Dec. 2015, pp. 945–953.
 [15] A. Kanezaki, Y. Matsushita, and Y. Nishida, “Rotationnet: Joint object categorization and pose estimation using multiviews from unsupervised viewpoints,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2018, pp. 5010–5019.
 [16] C. Wang, M. Pelillo, and K. Siddiqi, “Dominant set clustering and pooling for multiview 3d object recognition,” in Proc. Brit. Mach. Vis. Conf. (BMVC), Sep. 2017.
 [17] Z. Cao, Q. Huang, and R. Karthik, “3d object classification via spherical projections,” in Proc. Int. Conf. 3D Vis. (3DV), Oct. 2017, pp. 566–574.
 [18] Y. Guo, F. Sohel, M. Bennamoun, M. Lu, and J. Wan, “Rotational projection statistics for 3d local surface description and object recognition,” Int. J. Comput. Vis., vol. 105, no. 1, pp. 63–86, 2013.
 [19] K. Tang, P. Song, and X. Chen, “3d object recognition in cluttered scenes with robust shape description and correspondence selection,” IEEE Access, vol. 5, pp. 1833–1845, 2017.
 [20] F. Tombari, S. Salti, and L. Di Stefano, “Unique signatures of histograms for local surface description,” in Proc. Eur. Conf. Comput. Vis. (ECCV), Sep. 2010, pp. 356–369.
 [21] J. Knopp, M. Prasad, G. Willems, R. Timofte, and L. Van Gool, “Hough transform and 3d surf for robust three dimensional classification,” in Proc. Eur. Conf. Comput. Vis. (ECCV), Sep. 2010, pp. 589–602.
 [22] Z. Wu, S. Song, A. Khosla, F. Yu, L. Zhang, X. Tang, and J. Xiao, “3d shapenets: A deep representation for volumetric shapes,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2015, pp. 1912–1920.
 [23] J. Wu, C. Zhang, T. Xue, B. Freeman, and J. Tenenbaum, “Learning a probabilistic latent space of object shapes via 3d generative-adversarial modeling,” in Proc. Adv. Neural Inf. Process. Syst. (NIPS), Dec. 2016, pp. 82–90.
 [24] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “Imagenet: A large-scale hierarchical image database,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2009, pp. 248–255.
 [25] S. Bai, X. Bai, Z. Zhou, Z. Zhang, and L. Jan Latecki, “Gift: A real-time and scalable 3d shape search engine,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 5023–5032.
 [26] Y. Feng, Z. Zhang, X. Zhao, R. Ji, and Y. Gao, “Gvcnn: Group-view convolutional neural networks for 3d shape recognition,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2018, pp. 264–272.
 [27] J. Li, B. M. Chen, and G. H. Lee, “So-net: Self-organizing network for point cloud analysis,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2018, pp. 9397–9406.
 [28] R. Q. Charles, H. Su, M. Kaichun, and L. J. Guibas, “Pointnet: Deep learning on point sets for 3d classification and segmentation,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 77–85.
 [29] B. T. Phong, “Illumination for computer generated pictures,” Commun. ACM, vol. 18, no. 6, pp. 311–317, 1975.
 [30] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” Sep. 2014, [Online]. Available: https://arxiv.org/abs/1409.1556.
 [31] Z. Han, M. Shang, Z. Liu, C.-M. Vong, Y.-S. Liu, M. Zwicker, J. Han, and C. P. Chen, “Seqviews2seqlabels: Learning 3d global features via aggregating sequential views by rnn with attention,” IEEE Trans. Image Process., vol. 28, no. 2, pp. 658–672, 2019.
 [32] M. Yavartanoo, E. Y. Kim, and K. M. Lee, “Spnet: Deep 3d object classification and retrieval using stereographic projection,” Nov. 2018, [Online]. Available: https://arxiv.org/abs/1811.01571.

 [33] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard et al., “Tensorflow: A system for large-scale machine learning,” in Proc. USENIX Symp. Oper. Syst. Des. Implement. (OSDI), Nov. 2016, pp. 265–283.
 [34] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” Dec. 2014, [Online]. Available: https://arxiv.org/abs/1412.6980.
 [35] M. Buda, A. Maki, and M. A. Mazurowski, “A systematic study of the class imbalance problem in convolutional neural networks,” Oct. 2017, [Online]. Available: https://arxiv.org/abs/1710.05381.
 [36] B. Krawczyk, “Learning from imbalanced data: Open challenges and future directions,” Prog. Artif. Intell., vol. 5, no. 4, pp. 221–232, 2016.
 [37] V. Hegde and R. Zadeh, “Fusionnet: 3d object classification using multiple data representations,” Jul. 2016, [Online]. Available: https://arxiv.org/abs/1607.05695.
 [38] K. Chatfield, K. Simonyan, A. Vedaldi, and A. Zisserman, “Return of the devil in the details: Delving deep into convolutional nets,” May 2014, [Online]. Available: https://arxiv.org/abs/1405.3531.
 [39] K. Soomro, A. R. Zamir, and M. Shah, “Ucf101: A dataset of 101 human actions classes from videos in the wild,” Dec. 2012, [Online]. Available: https://arxiv.org/abs/1212.0402.
 [40] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri, “Learning spatiotemporal features with 3d convolutional networks,” in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Dec. 2015, pp. 4489–4497.
 [41] X. Xu and S. Todorovic, “Beam search for learning a deep convolutional neural network of 3d shapes,” Dec. 2016, [Online]. Available: https://arxiv.org/abs/1612.04774.
 [42] S. Zhi, Y. Liu, X. Li, and Y. Guo, “Lightnet: A lightweight 3d convolutional neural network for real-time 3d object recognition,” in Proc. Eurographics Workshop on 3D Object Retr. (3DOR), Apr. 2017.
 [43] Y. Li, S. Pirk, H. Su, C. R. Qi, and L. J. Guibas, “Fpnn: Field probing neural networks for 3d data,” in Proc. Adv. Neural Inf. Process. Syst. (NIPS), Dec. 2016, pp. 307–315.
 [44] A. Garcia-Garcia, F. Gomez-Donoso, J. Garcia-Rodriguez, S. Orts-Escolano, M. Cazorla, and J. Azorin-Lopez, “Pointnet: A 3d convolutional neural network for real-time object class recognition,” in Proc. Int. Jt. Conf. Neural Netw. (IJCNN), Jul. 2016, pp. 1578–1584.
 [45] C. R. Qi, L. Yi, H. Su, and L. J. Guibas, “Pointnet++: Deep hierarchical feature learning on point sets in a metric space,” in Proc. Adv. Neural Inf. Process. Syst. (NIPS), Dec. 2017, pp. 5099–5108.
 [46] S. Ravanbakhsh, J. Schneider, and B. Poczos, “Deep learning with sets and point clouds,” Nov. 2016, [Online]. Available: https://arxiv.org/abs/1611.04500.
 [47] Z. Li, C. Xu, and B. Leng, “Angular triplet-center loss for multi-view 3d shape retrieval,” Nov. 2018, [Online]. Available: https://arxiv.org/abs/1811.08622.
 [48] S. Lan, R. Yu, G. Yu, and L. S. Davis, “Modeling local geometric structure of 3d point clouds using geo-cnn,” Nov. 2018, [Online]. Available: https://arxiv.org/abs/1811.07782.
 [49] B. Shi, S. Bai, Z. Zhou, and X. Bai, “Deeppano: Deep panoramic representation for 3d shape recognition,” IEEE Signal Process. Lett., vol. 22, no. 12, pp. 2339–2343, 2015.
 [50] H. You, Y. Feng, X. Zhao, C. Zou, R. Ji, and Y. Gao, “Pvrnet: Point-view relation neural network for 3d shape recognition,” Dec. 2018, [Online]. Available: https://arxiv.org/abs/1812.00333.
 [51] P. Zanuttigh and L. Minto, “Deep learning for 3d shape classification from multiple depth maps,” in Proc. Int. Conf. Image Proc. (ICIP), Sep. 2017, pp. 3615–3619.
 [52] M. D. Zeiler and R. Fergus, “Visualizing and understanding convolutional networks,” in Proc. Eur. Conf. Comput. Vis. (ECCV), Sep. 2014, pp. 818–833.