Deep learning has changed the research landscape in visual object recognition over the last few years. Since their spectacular success in recognizing object categories, convolutional neural networks have become the new off-the-shelf state of the art in visual classification. The robot vision community has also attempted to take advantage of the deep learning trend, as the ability of robots to reliably understand what they see is critical for their deployment in the wild. A critical issue when trying to transfer results from computer to robot vision is that robot perception is tightly coupled with robot action; hence, pure RGB visual recognition is not enough.
A strong emerging trend here is that of using Convolutional Neural Networks (CNNs) pre-trained over ImageNet, by colorizing the depth channel. The approach has proved successful, especially when coupled with fine tuning and/or spatial pooling strategies [8, 9, 10] (for a review of recent work we refer to section II). These results suggest that the filters learned by CNNs from ImageNet are able to capture information also from depth images, regardless of their perceptual difference.
Is this the best we can do? If one were to train a CNN from scratch over a very large scale 2.5D object categorization database, wouldn't the filters learned be more suitable for object recognition from depth images? RGB images are perceptually very rich, generally with a strong presence of textured patterns, especially in ImageNet. Features learned from RGB data most likely focus on those aspects, while depth images contain more information about the shape and the silhouette of objects. Unfortunately, as of today a 2.5D object categorization database large enough to train a CNN does not exist. A likely reason is that gathering such a data collection is a daunting challenge: capturing the same variability as ImageNet over the same number of object categories would require the coordination of very many laboratories over an extended period of time.
In this paper we follow an alternative route. Rather than acquiring a 2.5D object categorization database, we propose to use synthetic data as a proxy for training a deep learning architecture specialized in learning depth-specific features. To this end, we construct the VANDAL database, a collection of millions of depth images generated from thousands of 3D object models, spanning 319 categories. The depth images are generated starting from 3D CAD models, downloaded from the Web, through a protocol developed to extract the maximum information from the models. VANDAL is used as input to train a deep learning architecture from scratch, obtaining a pre-trained model able to act as a depth-specific feature extractor. Visualizations of the filters learned by the first layer of the architecture show that the filters we obtain are indeed very different from those learned from ImageNet with the very same convolutional neural network (figure 1). As such, they are able to capture different facets of the perceptual information available from real depth images, more suitable for the recognition task in that domain. We call our pre-trained architecture DepthNet.
Experimental results on two publicly available databases confirm this: when using only depth, our DepthNet features achieve better performance than previous methods based on a CNN pre-trained over ImageNet, without using fine tuning or spatial pooling. The combination of the DepthNet features with the descriptors obtained from the CNN pre-trained over ImageNet, on both depth and RGB images, leads to strong results on the Washington database, and to results competitive with fine-tuning and/or sophisticated spatial pooling approaches on the JHUIT database. To the best of our knowledge, this is the first work that uses synthetically generated depth data to train a depth-specific convolutional neural network. Upon acceptance of the paper, all the VANDAL data, the protocol and software for generating new depth images, as well as the pre-trained DepthNet, will be made publicly available.
The rest of the paper is organized as follows. After a review of the recent literature (section II), we introduce the VANDAL database, describing its generation protocol and showcasing the obtained depth images (section III). Section IV describes the deep architecture used and section V reports our experimental findings. The paper concludes with a summary and a discussion on future research.
II Related Work
Early work on object recognition from depth data relied on hand-crafted features, combined together through vector quantization in a Bag-of-Words encoding. This heuristic approach has been surpassed by end-to-end feature learning architectures, able to define suitable features in a data-driven fashion [14, 3, 15]. All these methods were designed to cope with a limited amount of training data (orders of magnitude less than ImageNet-scale collections), and are thus able to only partially exploit the generalization abilities of deep networks as feature extractors observed in the computer vision community [1, 16], where large databases of RGB images like ImageNet or Places are available.
An alternative route is that of re-using deep learning architectures trained on ImageNet, either through pre-defined encodings or through colorization of the depth channel. Since the work of Schwarz et al. re-defined the state of the art in the field, this last approach has been actively and successfully investigated. Eitel et al. proposed a parallel CNN architecture, one network for the depth channel and one for the RGB channel, combined in the final layers through a late fusion scheme. Other approaches coupled non-linear learning methods with various forms of spatial encodings [10, 9, 4, 12]. Hasan et al. pushed this multi-modal approach further, proposing an architecture merging together RGB, depth and 3D point cloud information; another notable feature of their work is the encoding of an implicit multi-scale representation through a rich coarse-to-fine feature extraction approach.
All these works build on top of CNNs pre-trained over ImageNet, for all modal channels; thus, the very same filters are used to extract features from each of them. As empirically successful as this might be, it is a questionable strategy, as RGB and depth images are perceptually very different, and as such they would benefit from approaches able to learn data-specific features (figure 1). Our method addresses this challenge, learning RGB features from RGB data and depth features from synthetically generated depth data, within a deep learning framework. The use of realistic synthetic data in conjunction with deep learning architectures is a promising emerging trend [19, 20, 21]. We are not aware of previous work attempting to use synthetic data to learn depth representations, with or without deep learning techniques.
III The VANDAL Database
In this section we present VANDAL and the protocol followed for its creation. Consisting entirely of synthetic images, it is, to the best of our knowledge, the largest existing depth database for object recognition. Section III-A describes the criteria used to select the object categories composing the database and the protocol followed to obtain the 3D CAD models from Web resources. Section III-B illustrates the procedure used to generate depth images from the 3D CAD models.
III-A Selecting and Generating the 3D Models
CNNs trained on ImageNet have been shown to generalize well when used on other object-centric datasets. Following this reasoning, we defined a list of object categories as a subset of the ILSVRC2014 list, removing by hand all scenery classes, as well as objects without a clear default shape such as clothing items or animals. This resulted in a first list of roughly 480 categories, which was used to query public 3D CAD model repositories like 3D Warehouse, Yeggi, Archive3D, and many others. Five volunteers (graduate students from the MARR program at DIAG, Sapienza Rome University) manually downloaded the models, removing all irrelevant items like floors or other supporting surfaces, people standing next to the object and so forth, and ran a script to harmonize the size of all models (some of them were originally over 1GB per file). They were also asked to create significantly morphed variations of the original 3D CAD models, whenever suitable. Figure 2 shows examples of morphed models for the object category coffee cup. Finally, we removed all categories with fewer than two models, ending up with 319 object categories with an average of 30 models per category, for a total of roughly 9,500 CAD object models. Figure 3, left, gives a word cloud visualization of the VANDAL dataset, while on the right it shows examples of 3D models for the 6 most populated object categories.
III-B From 3D Models to 2.5D Depth Images
All depth renderings were created using Blender (www.blender.org), with a Python script fully automating the procedure, and then saved as grayscale .png files, using the convention that black is close and white is far.
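The black-close/white-far convention amounts to a linear mapping of metric depth onto an 8-bit range. A minimal sketch of such a mapping follows; the `near`/`far` clipping parameters and their values are illustrative assumptions, not taken from the actual Blender script:

```python
import numpy as np

def depth_to_grayscale(depth, near, far):
    """Map metric depth to 8-bit grayscale: black = close, white = far.

    `near` and `far` stand in for the virtual camera's clipping distances
    (hypothetical names); values outside the range are clamped.
    """
    d = np.clip(depth, near, far)
    gray = (d - near) / (far - near) * 255.0
    # Round before casting: truncation would be off by one for values
    # that land just below an integer due to floating-point error.
    return np.rint(gray).astype(np.uint8)

# A toy 2x2 depth buffer (meters): the closest point maps to 0 (black),
# the farthest to 255 (white).
buf = np.array([[1.0, 2.0], [3.0, 4.0]])
img = depth_to_grayscale(buf, near=1.0, far=4.0)
```

The same helper can be reused unchanged when rendering from any depth source, since only the clipping range is scene-dependent.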
The depth data generation protocol was designed to extract as much information as possible from the available 3D CAD models; concretely, this means obtaining the greatest possible variability across renderings. The approach commonly used by real RGB-D datasets consists of fixing the camera at a given angle and then using a turntable to get all possible viewpoints of the object [11, 12]. We tested a similar approach here, but found out, using perceptual hashing, that a significant number of object categories had more than 50% nearly identical images.
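Perceptual hashing of the kind used for this duplicate check can be sketched with a simple average hash; the hash size and Hamming-distance threshold below are illustrative choices, not the ones actually used:

```python
import numpy as np

def average_hash(img, hash_size=8):
    # Downsample by block-averaging to hash_size x hash_size, then
    # threshold at the mean: a compact binary fingerprint of the image.
    h, w = img.shape
    bh, bw = h // hash_size, w // hash_size
    small = img[:bh * hash_size, :bw * hash_size] \
        .reshape(hash_size, bh, hash_size, bw).mean(axis=(1, 3))
    return (small > small.mean()).flatten()

def near_duplicates(h1, h2, max_dist=4):
    # Two renderings count as "nearly identical" when the Hamming
    # distance between their hashes is small.
    return np.count_nonzero(h1 != h2) <= max_dist

a = np.zeros((64, 64)); a[:, 32:] = 255   # half black, half white
b = a.copy(); b[0, 0] = 10                # imperceptibly perturbed copy
c = a.T                                   # same pattern, rotated 90 degrees
```

Here `a` and `b` hash identically (a near-duplicate pair a turntable protocol would produce), while the rotated view `c` is correctly kept as distinct.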
We defined instead a configuration space consisting of: (a) the object's distance from the camera, (b) the focal length of the camera, (c) the camera position on the sphere defined by that distance, and (d) slight random morphs along the axes of the model. Figure 4 illustrates the described configuration space. This protocol ensured that almost none of the resulting images were identical. We sampled this configuration space with a fixed budget of depth images per model, obtaining a total of several million images. Preliminary experiments showed that increasing the sampling rate in the configuration space did lead to growing percentages of nearly identical images.
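Sampling this configuration space can be sketched as follows. All numeric ranges (distances, focal lengths, morph magnitude) are placeholder assumptions for illustration; only the four dimensions (a)-(d) come from the protocol above:

```python
import math
import random

def sample_render_config(rng,
                         dist_range=(1.5, 4.0),     # (a) object-camera distance, hypothetical range
                         focal_range=(25.0, 50.0),  # (b) focal length, hypothetical range
                         morph_scale=0.05):         # (d) magnitude of slight axis morphs
    # One point in the rendering configuration space: distance, focal
    # length, camera position on the sphere of that radius, and a small
    # random per-axis morph of the model.
    theta = rng.uniform(0.0, 2.0 * math.pi)         # (c) azimuth
    phi = rng.uniform(0.0, math.pi / 2.0)           # (c) elevation, upper hemisphere
    return {
        "distance": rng.uniform(*dist_range),
        "focal_length": rng.uniform(*focal_range),
        "camera_angles": (theta, phi),
        "axis_morph": [1.0 + rng.uniform(-morph_scale, morph_scale)
                       for _ in range(3)],
    }

rng = random.Random(0)
configs = [sample_render_config(rng) for _ in range(100)]
```

Because every dimension is drawn continuously at random, identical configurations (and hence identical renderings) are vanishingly unlikely, in contrast to the turntable protocol.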
The rendered depth images consist of objects always centered on a white background. This is done on purpose, as it allows us the maximum freedom to perform various types of data augmentation at training time, as is standard practice when training convolutional neural networks. This is even more relevant here than usual, as synthetically generated data are intrinsically less informative perceptually than real data. The data augmentation methods we used are: image cropping; occlusion (1/4 of the image is randomly occluded to simulate gaps in the sensor scan); contrast/brightness variations; in-depth views, corresponding to scaling the Z axis and shifting the objects along it; background substitution (replacing the white background with one randomly chosen farther away than the object's center of mass); random uniform noise (as in film grain); and image shearing (a slanting transform). Figure 5 shows some examples of data augmentation images obtained with this protocol.
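A subset of these augmentations (cropping, occlusion, contrast/brightness jitter, and film-grain noise) can be sketched as below; crop fraction, occlusion size and noise levels are illustrative choices, and the background-substitution, Z-scaling and shearing steps are omitted for brevity:

```python
import numpy as np

def augment(img, rng):
    """Apply a few of the augmentations listed above to a grayscale
    depth rendering (uint8, object on white background). A simplified
    sketch, not the actual training-time pipeline."""
    img = img.astype(np.float32)
    h, w = img.shape
    # Random crop to 7/8 of each side (illustrative fixed fraction).
    y, x = rng.integers(0, h // 8), rng.integers(0, w // 8)
    img = img[y:y + 7 * h // 8, x:x + 7 * w // 8]
    h, w = img.shape
    # Occlusion: blank out one random quarter to simulate sensor gaps.
    if rng.random() < 0.5:
        qy, qx = rng.integers(0, 2) * (h // 2), rng.integers(0, 2) * (w // 2)
        img[qy:qy + h // 2, qx:qx + w // 2] = 255  # white = far / empty
    # Contrast/brightness jitter.
    img = img * rng.uniform(0.8, 1.2) + rng.uniform(-10, 10)
    # Film-grain style uniform noise.
    img = img + rng.uniform(-5, 5, size=img.shape)
    return np.clip(img, 0, 255).astype(np.uint8)

rng = np.random.default_rng(0)
src = np.full((64, 64), 255, dtype=np.uint8)  # white background
src[16:48, 16:48] = 100                       # a toy "object" blob
out = augment(src, rng)
```

Keeping the clean object-on-white rendering as the source, as the text notes, is what makes composing these transforms freely at training time straightforward.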
IV Learning Deep Depth Filters
Once the VANDAL database has been generated, it is possible to use it to train any kind of convolutional deep architecture. In order to allow for a fair comparison with previous work, we opted for CaffeNet, a slight variation of AlexNet. Although more modern networks have been proposed in recent years [22, 23, 24], it still represents the most popular choice among practitioners, and the most used in robot vision (preliminary experiments using the VGG, Inception and Wide Residual networks on the VANDAL database did not give stable results and need further investigation). Its well-known architecture consists of 5 convolutional layers, interwoven with pooling, normalization and ReLU layers, plus three fully connected layers. CaffeNet differs from AlexNet in that pooling is done before normalization; it usually performs slightly better and has thus gained wide popularity.
Although the standard choice in robot vision is to use the output of the seventh activation layer as feature descriptor, several studies in the vision community show that lower layers, like the sixth and the fifth, tend to have stronger generalization properties. We followed this trend, and opted for the vectorized fifth layer as our deep depth feature descriptor (an ablation study supporting this choice is reported in section V). In the following, we refer to the CaffeNet architecture trained on VANDAL, with the fifth layer as output feature, as DepthNet, and to the same architecture trained over ImageNet as Caffe-ImageNet.
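For concreteness, the dimensionalities involved in this choice can be sketched as follows. This is shape bookkeeping only, with a random array standing in for a real pool5 activation; actual feature extraction would go through the Caffe Python interface:

```python
import numpy as np

def pool5_descriptor(activation):
    # In CaffeNet/AlexNet the fifth pooling layer (pool5) outputs a
    # 256-channel 6x6 map; vectorizing it yields a 9216-dimensional
    # descriptor, versus the 4096 dimensions of FC6/FC7.
    assert activation.shape == (256, 6, 6)
    return activation.reshape(-1)

# A random stand-in for one image's pool5 activation.
feat = pool5_descriptor(np.random.rand(256, 6, 6).astype(np.float32))
```

The resulting vector is what gets fed to the linear SVM in the experiments of section V.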
Once DepthNet has been trained, it can be used as any other depth feature descriptor, alone or in conjunction with Caffe-ImageNet for the classification of RGB images. We explore this last option, proposing a system for RGB-D object categorization that combines the two feature representations through a multi-kernel learning classifier. Figure 6 gives an overview of the overall RGB-D classification system. Note that DepthNet can be combined with any other RGB and/or 3D point cloud descriptor, and that the integration of the modal representations can be achieved through any other cue integration approach. This underlines the versatility of DepthNet, as opposed to recent work where the depth component was tightly integrated within the proposed overall framework, and as such unusable outside of it [7, 8, 4, 12].
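The core idea of kernel-based cue integration can be sketched as a convex combination of one kernel per modality. This is only the generic baseline, with a fixed illustrative weight; the actual classifier is a multi-kernel learning solver that learns the weighting from data:

```python
import numpy as np

def linear_kernel(X, Z):
    # Gram matrix between two sets of row-vector features.
    return X @ Z.T

def combined_kernel(rgb_X, rgb_Z, depth_X, depth_Z, beta=0.5):
    # Convex combination of one kernel per cue. An MKL solver would
    # learn `beta`; 0.5 here is just an illustrative fixed weight.
    return (beta * linear_kernel(rgb_X, rgb_Z)
            + (1.0 - beta) * linear_kernel(depth_X, depth_Z))

rgb = np.eye(3)          # toy RGB features for 3 samples
dep = np.ones((3, 2))    # toy depth features for the same samples
K = combined_kernel(rgb, rgb, dep, dep)
```

A convex combination of valid kernels is itself symmetric and positive semi-definite, so the combined matrix can be handed directly to any kernel classifier.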
V Experiments
We assessed DepthNet, as well as the associated RGB-D framework of figure 6, on two publicly available databases. Section V-A describes our experimental setup and the databases used in our experiments. Section V-B reports a set of experiments assessing the performance of DepthNet on depth images, compared to Caffe-ImageNet, while in section V-C we assess the performance of the whole RGB-D framework with respect to previous approaches.
V-A Experimental setup
We conducted experiments on the Washington RGB-D and the JHUIT-50 object datasets. The first consists of RGB-D images of 300 object instances divided into 51 classes. Each object instance was positioned on a turntable and captured from three different viewpoints while rotating. Since two consecutive views are extremely similar, only 1 frame out of 5 is used for evaluation purposes. We performed experiments in the object categorization setting, following the standard evaluation protocol for this dataset. The second is a challenging recent dataset that focuses on the problem of fine-grained recognition. It contains 50 object instances, often very similar to each other (e.g. 9 different kinds of screwdrivers). As such, it presents different classification challenges compared to the Washington database.
All experiments, as well as the training of DepthNet, were done using the publicly available Caffe framework, together with the NVIDIA Deep Learning GPU Training System (DIGITS). As described above, we obtained DepthNet by training a CaffeNet over the VANDAL database. The network was trained using Stochastic Gradient Descent for 50 epochs. The learning rate started at 0.01, with gamma set to 0.5 (halving the learning rate at each step). We used a variable step-down policy, where the first step took 25 epochs, the next 25/2 epochs, the third 25/4 epochs, and so on. These parameters were chosen to make sure that the test loss on the VANDAL test data had stabilized at each learning rate. Weight decay and momentum were left at their standard values of 0.0005 and 0.9.
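The variable step-down schedule just described can be sketched as a small helper; the guard on vanishing learning rates is an added safety net, since the step boundaries converge to epoch 50:

```python
def learning_rate(epoch, base_lr=0.01, gamma=0.5, first_step=25.0):
    # Step-down schedule from the text: the first step lasts 25 epochs,
    # the next 25/2, then 25/4, ..., with the learning rate halved
    # (gamma = 0.5) at each boundary. Intended for epoch < 50.
    lr, step = base_lr, first_step
    boundary = first_step
    while epoch >= boundary and lr > 1e-12:
        lr *= gamma
        step /= 2.0
        boundary += step
    return lr
```

For example, the rate is 0.01 throughout the first 25 epochs, 0.005 until epoch 37.5, 0.0025 until epoch 43.75, and so on.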
To assess the quality of the DepthNet features, we performed three sets of experiments:
Object classification using depth only: features were extracted with DepthNet and a linear SVM (Liblinear: http://www.csie.ntu.edu.tw/~cjlin/liblinear/) was trained on them. We also examined how performance varies when extracting features from different layers of the network, comparing against Caffe-ImageNet used for depth classification.
For all experiments we used the training/testing splits originally proposed for each dataset. For the linear SVM, we set the regularization parameter C by cross validation. When using MKL, we left the numbers of iterations for the online and batch steps at their default values, and set the remaining hyperparameters by cross validation.
Previous works using Caffe-ImageNet as a feature extractor for depth apply some kind of input preprocessing [6, 7, 8]. While we do compare against the published baselines, we also found that simply normalizing each image (mapping its minimum to 0 and its maximum to 255) achieves very competitive results. Moreover, since DepthNet is trained on depth data, it does not need any preprocessing of the depth images, obtaining strong results over raw data. Because of this, in all experiments reported in the following we only consider raw depth images and normalized depth images.
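The simple per-image normalization mentioned above amounts to a min-max contrast stretch; a minimal sketch (the constant-image fallback is our own added guard):

```python
import numpy as np

def normalize_depth(img):
    # Stretch each depth image so its minimum maps to 0 and its
    # maximum to 255, as described in the text.
    img = img.astype(np.float32)
    lo, hi = img.min(), img.max()
    if hi == lo:                      # degenerate constant image
        return np.zeros_like(img, dtype=np.uint8)
    return np.rint((img - lo) / (hi - lo) * 255.0).astype(np.uint8)

raw = np.array([[50, 100], [150, 200]], dtype=np.uint8)
norm = normalize_depth(raw)
```

Each image is stretched independently, so no dataset-wide statistics need to be estimated.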
V-B Assessing the performance of the DepthNet architecture
We present here an ablation study, aiming at understanding the impact of choosing features from the last pooling layer as opposed to the more popular fully connected layers, and of using normalized depth images instead of raw data. By comparing our results with those obtained by Caffe-ImageNet, we also aim at illustrating to what extent the features learned from VANDAL differ from those derived from ImageNet.
Figure 7 shows results obtained on the Washington database, with normalized and raw depth data, using as features the activations of the fifth pooling layer (pool5), of the sixth fully connected layer (FC6), and of the seventh fully connected layer (FC7). Note that this last set of activations is the standard choice in the literature. We see that in all settings pool5 achieves the best performance, followed by FC6 and FC7. This seems to confirm recent findings on RGB data, indicating that pool5 activations offer stronger generalization capabilities when used as features, compared to the more popular FC7. The best performance overall is obtained by DepthNet, with pool5 activations over raw depth data. DepthNet also achieves better results than Caffe-ImageNet over normalized data. To get a better sense of how performance varies when using DepthNet or Caffe-ImageNet, we plotted the per-class accuracies obtained using pool5 and raw depth data, sorted in descending order according to the Caffe-ImageNet scores (figure 8).
While there seems to be a bulk of objects where both features perform well (left), DepthNet seems to have an advantage on challenging objects like apple, onion, ball, lime and orange (right), where the round shape tends to be more informative than the specific object texture. This trend is confirmed by a t-SNE visualization of all the Washington classes belonging to the high-level categories 'fruit' and 'device' (figure 9). We see that in general the DepthNet features tend to cluster the single categories tightly while at the same time separating them very well. For some classes, like dry battery and banana, the difference between the two representations is very marked. This does not imply that DepthNet features are always better than those computed by Caffe-ImageNet: figure 8 shows that Caffe-ImageNet features obtain significantly better performance than DepthNet on the classes binder and mushroom, to name just a few. The features learned by the two networks seem to focus on different perceptual aspects of the images. This is most likely due to the different sets of samples used during training, and the consequently different filters learned (figure 1).
From these figures we can draw the following conclusions: (a) DepthNet provides the overall strongest descriptor for depth images, regardless of the activation layer chosen and of any preprocessing of the input depth data; (b) the features derived from the two networks tend to capture different properties of the data, and as such are complementary. As we show in the next section, this last point leads to very strong results when combining the two with a principled cue integration algorithm.
V-C Assessing the performance of the RGB-D architecture
In this section we present experiments on RGB-D data, from both the Washington and JHUIT databases, assessing the performance of our DepthNet-based framework of figure 6 against previous approaches. Table I shows our results in the top row, followed by results obtained by Caffe-ImageNet using the pool5 activations as features, as well as results from the recent literature based on convolutional neural networks. First, we see that the results in the RGB column stress once more the strength of the pool5 activations as features: they achieve the best performance without any form of fine tuning, spatial pooling or sophisticated non-linear learning, as done instead in other approaches [7, 8, 4]. Second, DepthNet on raw depth data achieves the best performance among CNN-based approaches with or without fine tuning [6, 7], but it is surpassed by approaches that explicitly encode spatial information through pooling strategies, and/or that use a more advanced classifier than our linear SVM. We would like to stress that we did not incorporate any of those strategies in our framework on purpose, to better assess the sheer power of training a given convolutional architecture on perceptually different databases. Still, nothing prevents future work from merging DepthNet with the best practices in spatial pooling and non-linear classifiers, with a very probable further increase in performance. Lastly, we see that in spite of the lack of such powerful tools, our framework achieves the best performance on RGB-D data. This clearly underlines that the representations learned by DepthNet are both powerful and able to extract different nuances from the data than Caffe-ImageNet. Rather than the actual overall accuracy reported in the table, we believe this is the breakthrough result we offer to the community in this paper.
Experiments over the JHUIT database confirm the findings obtained over the Washington collection (table II). Here our RGB-D framework obtains the second best result, with the state of the art achieved by the proposers of the database using a non CNN-based approach. Note that this database focuses on the fine-grained classification problem, as opposed to the object categorization explored in the experiments above. While the results reported in Table II on Caffe-ImageNet using FC7 seem to indicate that the choice of pool5 remains valid, the explicit encoding of local information is very important for this kind of task [30, 31]. We are inclined to attribute the superior performance of the state of the art to this; future work incorporating spatial pooling in our framework, as well as further experiments on the object identification task in the Washington database and on other RGB-D data collections, will explore this issue.
VI Conclusions
In this paper we focused on object classification from depth images using convolutional neural networks. We argued that, as effective as the filters learned from ImageNet are, the perceptual features of 2.5D images are different, and that it would be desirable to have deep architectures able to capture them. To this purpose, we created VANDAL, the first synthetically generated depth image database, and we showed experimentally that the features derived from such data, using the very same CaffeNet architecture widely used over ImageNet, are stronger and at the same time complementary to those learned from ImageNet. This result, together with the public release of the database, the trained architecture, and the protocol for generating new synthetic depth images, is the contribution of this paper.
We see this work as the very beginning of a long research thread. By its very nature, DepthNet could be plugged into all previous work using CNNs pre-trained over ImageNet for extracting depth features. It might substitute that module, or it might complement it; the open issue is when this will prove beneficial in terms of spatial pooling approaches, learning methods and classification problems. A second issue we plan to investigate is the impact of the deep architecture on the filters learned from VANDAL. While in this work we chose on purpose not to deviate from CaffeNet, it is not clear that this architecture, which was heavily optimized over ImageNet, is able to best exploit our synthetic depth database. While preliminary investigations with existing architectures have not been satisfactory, we believe that architecture surgery might lead to better results. Finally, we believe that the possibility of using synthetic data as a proxy for real images opens up a wide array of possibilities: for instance, given prior knowledge about the classification task of interest, would it be possible to generate on the fly a task-specific synthetic database, containing the object categories of interest under very similar imaging conditions, and train an end-to-end deep network on it? How would performance change compared to the use of network activations as done today? Future work will focus on these issues.
- A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Proc. NIPS, 2012.
- K. Lai, L. Bo, X. Ren, and D. Fox, "A large-scale hierarchical multi-view RGB-D object dataset," in Proc. ICRA, 2011, pp. 1817-1824.
- R. Socher, B. Huval, B. Bath, C. D. Manning, and A. Ng, "Convolutional-recursive deep learning for 3D object classification," in Proc. NIPS, 2012, pp. 665-673.
- Y. Cheng, R. Cai, C. Zhang, et al., "Convolutional Fisher kernels for RGB-D object recognition," in Proc. 3DV, 2015.
- O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al., "ImageNet large scale visual recognition challenge," arXiv:1409.0575, 2014.
- M. Schwarz, H. Schulz, and S. Behnke, "RGB-D object recognition and pose estimation based on pre-trained convolutional neural network features," in Proc. ICRA, 2015.
- A. Eitel, J. T. Springenberg, L. Spinello, M. Riedmiller, and W. Burgard, "Multimodal deep learning for robust RGB-D object recognition," in Proc. IROS, 2015.
- H. F. M. Zaki, F. Shafait, and A. Mian, "Convolutional hypercube pyramid for accurate RGB-D object category and instance recognition," in Proc. ICRA, 2016.
- Y. Cheng, X. Zhao, K. Huang, and T. Tan, "Semi-supervised learning for RGB-D object recognition," in Proc. ICPR, 2014.
- Y. Cheng, X. Zhao, K. Huang, and T. Tan, "Semi-supervised learning and feature evaluation for RGB-D object recognition," Computer Vision and Image Understanding, 2015.
- K. Lai, L. Bo, X. Ren, and D. Fox, "A large-scale hierarchical multi-view RGB-D object dataset," in Proc. ICRA, 2011.
- C. Li, A. Reiter, and G. D. Hager, "Beyond spatial pooling: fine-grained representation learning in multiple domains," in Proc. CVPR, 2015.
- D. G. Lowe, "Distinctive image features from scale-invariant keypoints," International Journal of Computer Vision, vol. 60, no. 2, pp. 91-110, 2004.
- M. Blum, J. T. Springenberg, J. Wulfing, and M. Riedmiller, "A learned feature descriptor for object recognition in RGB-D data," in Proc. ICRA, 2012, pp. 1298-1303.
- U. Asif, M. Bennamoun, and F. Sohel, "Efficient RGB-D object categorization using cascaded ensembles of randomized decision trees," in Proc. ICRA, 2015, pp. 1295-1302.
- A. S. Razavian, H. Azizpour, J. Sullivan, and S. Carlsson, "CNN features off-the-shelf: an astounding baseline for recognition," in Proc. CVPR Workshops, 2014, pp. 512-519.
- S. Gupta, R. Girshick, P. Arbelaez, and J. Malik, "Learning rich features from RGB-D images for object detection and segmentation," in Proc. ECCV, 2014, pp. 345-360.
- J. Papon and M. Schoeler, "Semantic pose using deep networks trained on synthetic RGB-D," in Proc. ICCV, 2015.
- D. Maturana and S. Scherer, "VoxNet: a 3D convolutional neural network for real-time object recognition," in Proc. IROS, 2015.
- Z. Wu, S. Song, A. Khosla, F. Yu, L. Zhang, X. Tang, and J. Xiao, "3D ShapeNets: a deep representation for volumetric shape modeling," in Proc. CVPR, 2015.
- K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," arXiv:1409.1556, 2014.
- C. Szegedy, W. Liu, Y. Jia, et al., "Going deeper with convolutions," in Proc. CVPR, 2015.
- K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," arXiv:1512.03385, 2015.
- L. Zheng, Y. Zhao, S. Wang, J. Wang, and Q. Tian, "Good practice in CNN feature transfer," arXiv:1604.00133, 2016.
- F. Orabona, J. Luo, and B. Caputo, "Online-batch strongly convex multi kernel learning," in Proc. CVPR, 2010.
- Y. Jia, E. Shelhamer, J. Donahue, et al., "Caffe: convolutional architecture for fast feature embedding," in Proc. ACM Multimedia, 2014.
- L. van der Maaten and G. Hinton, "Visualizing data using t-SNE," Journal of Machine Learning Research, vol. 9, pp. 2579-2605, 2008.
- R. Socher, B. Huval, B. Bath, C. D. Manning, and A. Ng, "Convolutional-recursive deep learning for 3D object classification," in Proc. NIPS, 2012.
- N. Zhang, R. Farrell, and T. Darrell, "Pose pooling kernels for sub-category recognition," in Proc. CVPR, 2012.
- A. Angelova and P. M. Long, "Benchmarking large-scale fine-grained categorization," in Proc. WACV, 2014.