Robots need to recognize the objects around them in order to act and interact with them. Recognition must be carried out both in the RGB domain, which captures mostly the visual appearance of things as determined by their reflectance properties, and in the depth domain, which provides information about the shape and silhouette of objects and supports both recognition and interaction. The current state-of-the-art approaches to object recognition are based on Convolutional Neural Networks (CNNs, [NIPS1989_293-CNN]), end-to-end architectures that perform feature learning and classification at the same time. Notable advantages of these networks are the much higher accuracies they reach on essentially any visual recognition problem compared to heuristic methods, their domain independence, and their conceptual simplicity. Despite these advantages, they also present limitations, among them a high computational cost, long training times and the demand for large datasets.
This last issue has so far proved crucial in attempts to build, in the depth domain, on the spectacular success of CNNs in RGB-based object categorization [googlenet, krizhevsky2012imagenet]. Since CNNs are data-hungry algorithms, the availability of very large scale annotated data collections is crucial to their success, and architectures trained on ImageNet [deng2009imagenet] are the cornerstone of the vast majority of CNN-based recognition methods. With the notable exception of [carlucci2016deep], the mainstream approach for using CNNs in depth-based object classification has been transfer learning, in the form of a mapping that makes the depth input channel compatible with the data distribution expected by RGB architectures. Following recent efforts in transfer learning [Transfer1, Transfer2, Transfer3] that made it possible to use depth data with CNNs pre-trained on a database of a different modality, several authors proposed hand-crafted mappings to colorize depth data, obtaining impressive improvements in classification on the Washington database [washington], which has become the golden reference benchmark in this field [Schwarz, eitel2015multimodal].
We argue that this strategy is sub-optimal. By hand-crafting the mapping for depth colorization, one has to make strong assumptions on what information, and to which extent, should be preserved in the transfer towards the RGB modality. While some choices might be valid for some classes of problems and settings, it is questionable whether the family of algorithms based on this approach can combine high recognition accuracy with robustness across different settings and databases. Inspired by recent works on the colorization of gray-scale photographs [IizukaSIGGRAPH2016, larsson2016learning, cheng2015deep], we tackle the problem by exploiting the power of end-to-end convolutional networks, proposing a deep depth colorization architecture able to learn the optimal depth-to-RGB mapping for any given pre-trained convnet. Our deep colorization network takes advantage of the residual approach [ResNet], learning how to map between the two modalities by leveraging a reference database (Figure 1, top), for any given architecture. After this training stage, the colorization network can be added on top of its reference pre-trained architecture, for any object classification task (Figure 1, bottom). We call our network (DE)2CO: DEep DEpth COlorization.
We assess the performance of (DE)2CO in several ways. A first qualitative analysis, comparing the colorized depth images obtained by (DE)2CO and by other state-of-the-art hand-crafted approaches, gives intuitive insights on the advantages of learning the mapping as opposed to choosing it, over several databases. We deepen this analysis with an experimental evaluation of ours and other existing transfer learning methods on the depth channel only, using four different deep architectures and three different public databases, with and without fine-tuning. Finally, we tackle the RGB-D object recognition problem, combining (DE)2CO with off-the-shelf state-of-the-art RGB deep networks and benchmarking it against the current state of the art in the field. In all these experiments, results clearly support the value of our algorithm. All the (DE)2CO modules, for all architectures employed in this paper, are available at https://github.com/fmcarlucci/de2co.
II Related Work
Since the spectacular success of AlexNet [krizhevsky2012imagenet] in 2012, CNNs have become the dominant learning paradigm in visual recognition. Several architectures have been proposed in recent years, each bringing new flavors to the community. Simonyan and Zisserman [VGG] investigated the effect of increasing the network depth. GoogLeNet [googlenet] also increased the depth and width of the network while restraining the computational budget, with a dramatic reduction in the number of parameters. He et al. [ResNet] proposed a residual learning approach using batch normalization layers and special skip connections to train deeper architectures, showing impressive success in ILSVRC2015. All these architectures will be used in this work to assess its generality.
Lately, several authors have attempted to take advantage of pre-trained CNNs to perform RGB-D detection and recognition. Colorization of depth images can be seen as a transfer learning process across modalities, and several works explored this avenue within the deep learning framework. In the context of RGB-D object detection, a recent stream of works explicitly addressed cross-modal transfer learning through the sharing of weights across architectures [hoffman2016icra, hoffman2016learning, Gupta_2016_CVPR]. This last work is conceptually close to our approach, as it proposes to learn how to transfer RGB-extracted information to the depth domain through distillation [hinton2015distilling]. While [Gupta_2016_CVPR] has proved very successful in the object detection realm, it presents some constraints that might be problematic in object recognition, from the requirement of paired RGB-D images to detailed data preprocessing and preparation for training. As opposed to this, our algorithm does not require explicit pairing of images in the two modalities, can be applied successfully to raw pixel data, and does not require further data preparation for training.
Within the RGB-D classification literature, [Bo] converts the depth map to surface normals and then re-interprets it as RGB values, while Aekerberg et al. [aakerberg2017depth] build on this approach and suggest an effective preprocessing pipeline to increase performance. Another method, HHA, was proposed by Gupta et al. [gupta2014learning]: it is a mapping where one channel encodes the horizontal disparity, one the height above ground and the third the pixelwise angle between the surface normal and the gravity vector. Schwarz et al. [Behnke] proposed a colorization pipeline where colors are assigned to the image pixels according to the distance of the vertexes of a rendered mesh from the center of the object. Apart from the naive grayscale method, all the mentioned colorization schemes are computationally expensive. Eitel et al. [eitel2015multimodal] used a simple color mapping technique known as ColorJet, showing this simple method to be competitive with more sophisticated approaches.
All these works, and many others [zaki2016convolutional, carlucci2016deep], make use of an ad-hoc mapping for converting depth images into three channels. This conversion is vital, as the input has to be compatible with the pre-trained CNN. Depth data is encoded as a 2D array where each element represents an approximate distance between the sensor and the object, and is often depicted and stored as a single monochrome image. Compared to regular RGB cameras, the depth resolution is relatively low, especially when the frame is cropped to focus on a particular object. In addressing this issue, we avoid heuristic choices and rely instead on an end-to-end, residual-based deep architecture to learn the optimal mapping for the cross-modal knowledge transfer.
Most of the object recognition works against which we compare our method are evaluated on a single database, with Washington being the standard choice in the robot vision literature. This raises concerns about the generality of these methods, especially considering their hand-crafted nature. We circumvent this issue by evaluating (DE)2CO on three different databases.
Our work is also related to the colorization of grayscale images using deep nets. Cheng et al. [cheng2015deep] proposed a colorization pipeline based on three different hand-designed feature extractors that determine features from different levels of an input image. Larsson et al. [larsson2016learning] used an architecture consisting of two parts: the first is a fully convolutional version of VGG-16 used as a feature extractor, and the second is a fully-connected layer with 1024 channels predicting the hue and chroma distributions for each pixel given its feature descriptors from the previous level. Iizuka et al. [IizukaSIGGRAPH2016] proposed an end-to-end network able to learn global and local features, exploiting the classification labels for better image colorization; their architecture consists of several networks followed by a fusion layer for the colorization task. Sun et al. [sun2017weakly] propose to use large-scale CAD-rendered data to leverage depth information without using low-level features or colorization. Asif et al. [asif2017rgb] used hierarchical cascaded forests for computing grasp poses and performing object classification, exploiting several different features such as orientation angle maps, surface normals and depth information colored with the Jet method. Our work differs from this last research thread in the specific architecture proposed and in its main goal, as here we are interested in learning the optimal mapping for categorization rather than for the colorization of grayscale images.
III Colorization of Depth Images
Although depth and RGB are modalities with significant differences, they share enough similarities (edges, gradients, shapes) to make it plausible that convolutional filters learned from RGB data can be re-used effectively for representing colorized depth images. The approach currently adopted in the literature consists of designing ad-hoc colorization algorithms, as reviewed in the previous section. We refer to this kind of approach as hand-crafted depth colorization. Specifically, we choose ColorJet [eitel2015multimodal], SurfaceNormals [Bo] and SurfaceNormals++ [aakerberg2017depth] as the baselines against which we assess our data-driven approach, because of their popularity and effectiveness.
In the rest of the section we first briefly summarize ColorJet (section III-A), SurfaceNormals and SurfaceNormals++ (section III-B). We then describe our deep approach to depth colorization (section III-C). To the best of our knowledge, (DE)2CO is the first deep colorization architecture applied successfully to depth images.
III-A Hand-Crafted Depth Colorization: ColorJet
ColorJet works by assigning different colors to different depth values. The original depth map is first normalized to the 0-255 range. The lowest value is then mapped to the blue channel and the highest value to the red channel; middle values are mapped to green, and the intermediate values are arranged accordingly [eitel2015multimodal]. The resulting image exploits the full RGB spectrum, with the intent of best leveraging the filters learned by deep networks trained on very large scale RGB datasets like ImageNet. Although simple, the approach gave very strong results when tested on the Washington database and when deployed on a robot platform. Still, ColorJet was not designed to create realistic-looking RGB images of the objects depicted in the original depth data (Figure 3, bottom row). This raises the question of whether this mapping, although more effective than other methods presented in the literature, might be sub-optimal. In section III-C we show that by fully embracing the end-to-end philosophy at the core of deep learning, it is indeed possible to achieve significantly higher recognition performance while at the same time producing more realistic colorized images.
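As a reference, the jet-style mapping just described can be sketched with a minimal NumPy re-implementation. This is a piecewise-linear approximation of a jet colormap written for illustration; the actual experiments use OpenCV's COLORMAP_JET, whose exact color stops differ slightly.

```python
import numpy as np

def colorjet(depth):
    """Map a single-channel depth image to a jet-like 3-channel RGB image.

    Minimal sketch: lowest depth values become blue, middle values green,
    highest values red, with linear ramps in between.
    """
    d = depth.astype(np.float64)
    d = (d - d.min()) / max(d.max() - d.min(), 1e-8)  # normalize to [0, 1]
    # Piecewise-linear jet approximation: blue -> green -> red
    r = np.clip(1.5 - np.abs(4.0 * d - 3.0), 0.0, 1.0)
    g = np.clip(1.5 - np.abs(4.0 * d - 2.0), 0.0, 1.0)
    b = np.clip(1.5 - np.abs(4.0 * d - 1.0), 0.0, 1.0)
    return (np.stack([r, g, b], axis=-1) * 255).astype(np.uint8)
```

Applied to a depth crop, the result is a three-channel image whose value range matches what an ImageNet-pretrained network expects as input.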
III-B Hand-Crafted Depth Colorization: SurfaceNormals(++)
The SurfaceNormals mapping has often been used to convert depth images to RGB [Bo, wang2016correlated, eitel2015multimodal]. The process is straightforward: for each pixel in the original image the corresponding surface normal is computed as a normalized vector, which is then treated as an RGB color. Due to the inherent noisiness of the depth channel, such a direct conversion results in noisy images in the color space. To address this issue, the mapping we call SurfaceNormals++ was introduced by Aakerberg [aakerberg2017depth]: first, a recursive median filter reconstructs missing depth values; subsequently, a bilateral filter smooths the image to reduce noise while preserving edges. Next, surface normals are computed for each pixel in the depth image. Finally, the image is sharpened with an unsharp mask filter to increase contrast around edges and other high-frequency components.
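A minimal sketch of the plain SurfaceNormals mapping (without the median/bilateral filtering and sharpening steps of SurfaceNormals++) could look as follows, assuming the depth map is a 2D NumPy array and using finite differences to estimate the surface gradient:

```python
import numpy as np

def surface_normals(depth):
    """Convert a depth map to a surface-normal RGB image (plain SurfaceNormals).

    The normal of the surface z = f(x, y) is proportional to
    (-df/dx, -df/dy, 1); after unit normalization, the three components are
    remapped from [-1, 1] to [0, 255] and reinterpreted as RGB.
    """
    dz_dy, dz_dx = np.gradient(depth.astype(np.float64))
    n = np.dstack([-dz_dx, -dz_dy, np.ones_like(dz_dx)])
    n /= np.linalg.norm(n, axis=2, keepdims=True)
    return ((n + 1.0) * 127.5).astype(np.uint8)
```

On a flat surface the output is a uniform image dominated by the third channel, since the normal points straight at the sensor.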
III-C Deep Depth Colorization: (DE)2CO
(DE)2CO consists of feeding the depth maps, normalized into grayscale images, to a colorization network linked to a standard CNN architecture, pre-trained on ImageNet.
Given the success of deep colorization networks for grayscale images, we first tested existing architectures in this context [zhang2016colorful]. Extensive experiments showed that while the visual appearance of the colorized images was very good, the recognition performance obtained when combining such a network with pre-trained RGB architectures was not competitive. Inspired by the generator network of [bousmalis2016unsupervised], we propose here a residual convolutional architecture (Figure 2). By design [ResNet], this architecture is robust and allows for deeper training. This is helpful here, as (DE)2CO requires stacking two networks together, which for not very deep architectures might lead to vanishing gradient issues. Furthermore, residual blocks work at the pixel level [bousmalis2016unsupervised], helping to preserve locality.
Our architecture works as follows: the 1x228x228 input depth map, reduced to 64x57x57 by a conv&pool layer, passes through a sequence of 8 residual blocks, each composed of 2 small convolutions with batch normalization layers and leakyReLU nonlinearities. The output of the last residual block is convolved with a final three-filter convolution to form the 3-channel image output, whose resolution is brought back to 228x228 by a deconvolution (upsampling) layer.
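For illustration only, the flow above can be sketched in PyTorch. The original implementation is in Caffe; hyperparameters such as the leaky slope, the exact conv&pool configuration and the use of bilinear upsampling in place of a learned deconvolution are our assumptions, chosen so the tensor shapes match the reported 1x228x228 -> 64x57x57 -> 3x228x228 flow.

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    """Two 3x3 convolutions with batch norm and leakyReLU, plus identity skip."""
    def __init__(self, ch=64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.BatchNorm2d(ch), nn.LeakyReLU(0.2),
            nn.Conv2d(ch, ch, 3, padding=1), nn.BatchNorm2d(ch))

    def forward(self, x):
        return x + self.body(x)

class DE2CO(nn.Module):
    def __init__(self, n_blocks=8, ch=64):
        super().__init__()
        # conv&pool: 1x228x228 -> 64x114x114 -> 64x57x57
        self.head = nn.Sequential(
            nn.Conv2d(1, ch, 3, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.MaxPool2d(2))
        self.blocks = nn.Sequential(*[ResBlock(ch) for _ in range(n_blocks)])
        self.to_rgb = nn.Conv2d(ch, 3, 3, padding=1)  # 3-channel image output
        # stand-in for the deconvolution layer: 57x57 -> 228x228
        self.up = nn.Upsample(scale_factor=4, mode='bilinear', align_corners=False)

    def forward(self, depth):
        return self.up(self.to_rgb(self.blocks(self.head(depth))))
```

The colorized output can then be fed directly to any RGB network pre-trained on ImageNet.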
Our whole system for object recognition in the depth domain using deep networks pre-trained on RGB images can be summarized as follows: the entire network, composed of (DE)2CO and the classification network of choice, is trained on an annotated reference depth image dataset. The weights of the chosen classification network are kept frozen in their pre-trained state, as the only layer that needs to be retrained is the last fully connected layer, connected to the softmax layer. Meanwhile, the weights of (DE)2CO are updated until convergence.
After this step, the depth colorization network has learned the mapping that maximizes the classification accuracy on the reference training dataset. It can now be used to colorize any depth image, from any data collection. Figure 3, top rows, shows exemplar images colorized with (DE)2CO trained over different reference databases, in combination with two different architectures (CaffeNet, an implementation variant of AlexNet, and VGG-16 [VGG]). Compared to the images obtained with ColorJet and SurfaceNormals++, our colorization technique emphasizes the objects' contours and salient features while flattening the background, whereas the other methods either introduce high-frequency noise (SurfaceNormals++) or emphasize background gradients instead of focusing mainly on the objects (ColorJet). In the next section we show how this qualitative advantage also translates into a numerical one, i.e. how learning (DE)2CO on one dataset and performing depth-based object recognition on another leads to a very significant increase in performance in several settings, compared to hand-crafted color mappings.
IV Experiments

We evaluated our colorization scheme in three main settings: an ablation study of how different depth mappings perform when the network weights are kept frozen (section IV-B), a comparison of depth performance with network finetuning (section IV-C) and, finally, an assessment of (DE)2CO when used in RGB-D object recognition tasks (section IV-D). Before reporting on our findings, we illustrate the databases and baselines we used (section IV-A).
IV-A Experimental Setup
Databases We considered three datasets: the Washington RGB-D [washington], the JHUIT-50 [jhuit] and the BigBIRD [singh2014bigbird] object datasets, which are the main public datasets for RGB-D object recognition. The first consists of RGB-D images of 300 object instances divided into 51 classes. We performed experiments in the object categorization setting, following the evaluation protocol defined in [washington]. JHUIT-50 is a challenging recent dataset that focuses on the problem of fine-grained classification. It contains 50 object instances, often very similar to each other (e.g. 9 different kinds of screwdrivers), and as such it presents different recognition challenges compared to the Washington database. Here we followed the evaluation procedure defined in [jhuit]. BigBIRD is the biggest of the datasets we considered in terms of object instances and images. Unfortunately, it is an extremely unforgiving dataset for evaluating depth features: many objects are extremely similar, and many are boxes, which are indistinguishable without texture information. To partially mitigate this, we grouped together all classes annotated with the same first word: for example, nutrigrain apple cinnamon and nutrigrain blueberry were grouped into nutrigrain. With this procedure we reduced the number of classes to 61 (while keeping all of the samples). As items are quite small, we used the object masks provided by [singh2014bigbird] to crop around the object. Evaluation-wise, we followed the protocol defined in [jhuit].
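The class-grouping step used for BigBIRD amounts to keying instances on the first word of their label. A minimal sketch, assuming underscore-separated instance names as in the BigBIRD annotations:

```python
from collections import defaultdict

def group_by_first_word(labels):
    """Group instance labels that share the same first word into one class.

    Assumes underscore-separated names (e.g. 'nutrigrain_apple_cinnamon');
    the grouping key is everything before the first underscore.
    """
    groups = defaultdict(list)
    for label in labels:
        groups[label.split('_')[0]].append(label)
    return dict(groups)
```

For instance, 'nutrigrain_apple_cinnamon' and 'nutrigrain_blueberry' both fall under the 'nutrigrain' class.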
Hand-crafted Mappings According to previous works [eitel2015multimodal, aakerberg2017depth], the two most effective mappings are ColorJet [eitel2015multimodal] and SurfaceNormals [Bo, aakerberg2017depth]. For ColorJet we normalized the data between 0 and 255 and then applied the mapping using the OpenCV libraries ("COLORMAP_JET", from http://opencv.org/). For the SurfaceNormals mapping we considered two versions: the straightforward conversion of the depth map to surface normals [Bo], and the enhanced version SurfaceNormals++ [aakerberg2017depth], which uses extensive pre-processing and generally performs better (the authors graciously gave us their code for our experiments).
IV-B Ablation Study
In this setting we compared our (DE)2CO method against hand-crafted mappings, using pre-trained networks as feature extractors and retraining only the last classification layer. We did this on the three datasets described above, over four architectures: CaffeNet (a slight variant of AlexNet [NIPS2012-AlexNet]), VGG16 [VGG] and GoogLeNet [googlenet] were chosen because of their popularity within the robot vision community. We also tested the recent ResNet50 [ResNet], which, although not yet widely used in the robotics domain, has some promising properties. In all cases we considered models pre-trained on ImageNet [deng2009imagenet], which we retrieved from Caffe's Model Zoo (https://github.com/BVLC/caffe/wiki/Model-Zoo).
Training (DE)2CO means minimizing the multinomial logistic loss of a network trained on RGB images. Our network is attached between the depth images and the pre-trained network, whose weights are all frozen except for the last layer, which is relearned from scratch (see Figure 1). We trained each network-dataset combination for a fixed number of epochs using the Nesterov solver [nesterov1983method], with a starting learning rate that was stepped down during training. During this phase we used the whole source datasets, leaving aside only a small fraction of the samples for validation purposes.
When the dataset on which we train the colorizer is different from the test one, we simply retrain the new final layer (freezing all the rest) for the new classes.
Effectively, we are using the pre-trained networks as feature extractors, as done in [Behnke, eitel2015multimodal, zaki2016convolutional] and many others; for a performance analysis in the case of network finetuning we refer to section IV-C. In this setting we used the Nesterov (for Washington and JHUIT-50) and ADAM (for BigBIRD) solvers. As we were only training the last fully connected layer, we learned a small handful of parameters with a very low risk of overfitting.
Table II reports the results of the ablation, while Figure 4 focuses on the per-class recall for a specific experiment. For every architecture, we report the results obtained using ColorJet, SurfaceNormals (plain and enhanced), (DE)2CO learned on a single reference database (either Washington or JHUIT-50), and (DE)2CO learned on the combination of Washington and JHUIT-50. For the CaffeNet and VGG networks we also present results on simple grayscale images. We also attempted to learn (DE)2CO from BigBIRD alone, and in combination with one (or both) of the other two databases. Results on BigBIRD alone were disappointing, and adding it to the other two databases did not change the overall performance. We attribute this to the relatively small variability of objects in BigBIRD with respect to depth, and for the sake of readability we omit these results.
We see that, for all architectures and all reference databases, (DE)2CO achieves higher results; the smallest gain is obtained with CaffeNet on the Washington database, and the largest with VGG16 on JHUIT-50. JHUIT-50 is the testbed database where, regardless of the chosen architecture, (DE)2CO achieves the strongest gains in performance compared to hand-crafted mappings. Washington is, for all architectures, the database where hand-crafted mappings perform best, with the combination of Washington and CaffeNet being the most favorable to the shallow mapping. On average, CaffeNet appears to be the architecture that performs best on these datasets; still, it should be noted that here we are using all architectures as feature extractors rather than as classifiers. On this type of task, both ResNet and GoogLeNet-like networks are known to perform worse than CaffeNet [azizpour2016factors], hence our results are consistent with what is reported in the literature. In Table III we report a second ablation study on the width and depth of the (DE)2CO architecture. Starting from the standard (DE)2CO made of 8 residual blocks with 64 filters in each convolutional layer (which we found to be the best all-around architecture), we performed additional experiments doubling and halving the number of residual blocks and the number of filters per convolutional layer. As can be seen, the (DE)2CO architecture is quite robust, but can potentially be finetuned to each target dataset to further increase performance. In Table I we report runtimes for the considered networks. As the results show, while (DE)2CO requires some extra computation time, in practice this is offset by the fact that only a fraction of the data is moved to the GPU.
IV-C Finetuning

In our finetuning experiments we focused on the best performing network from the ablation, CaffeNet (which is also used by the current competitors [eitel2015multimodal, aakerberg2017depth]), to see to what degree the network could exploit a given mapping. The protocol was simple: all layers were equally free to move, the starting learning rate was stepped down during training and the solver was SGD. Training went on for a fixed number of epochs for the Washington and JHUIT-50 datasets, and for a different number for BigBIRD (a longer training was detrimental in all settings). To ensure a fair comparison with the static mapping methods, the (DE)2CO part of the network was kept frozen during finetuning.
Results are reported in Table IV. Here the gap between hand-crafted and learned colorization methods is reduced (very likely the network compensates for existing weaknesses). SurfaceNormals++ performs quite well on Washington, but less so on the other two datasets (it is actually the worst on BigBIRD). Surprisingly, the simple grayscale conversion is the one that performs best on BigBIRD, but it lags clearly behind in all other settings. (DE)2CO, on the other hand, performs comparably to the best mapping in every single setting and has a lead on JHUIT-50; we argue that it is always possible to find a shallow mapping that performs very well on a specific dataset, but there is no guarantee it will generalize.
IV-D RGB-D Object Recognition

While this paper focuses on how best to perform recognition in the depth modality using convnets, we also provide a reference value for RGB-D object classification using (DE)2CO on the depth channel. To classify RGB images we follow [aakerberg2017depth] and use a pre-trained VGG16 finetuned on the target dataset (using the previously defined protocol). RGB-D classification is then performed, without further learning, by computing the weighted average (with a cross-validated weight factor) of the fc8 layers of the RGB and depth networks and simply selecting the most likely class (the one with the highest activation). This cue integration scheme can be seen as one of the simplest off-the-shelf algorithms for classification with two different modalities [TommasiOC08]. We excluded BigBIRD from this setting, due to the lack of competing works to compare against.
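The cue integration step can be sketched as follows; `fuse_predictions` is a hypothetical helper, and the 0.5 default merely stands in for the cross-validated weight factor.

```python
import numpy as np

def fuse_predictions(rgb_scores, depth_scores, w=0.5):
    """Late fusion of class activations (e.g. fc8 outputs) from the RGB and
    depth networks: weighted average, then pick the most likely class.

    w is the weight given to the RGB stream; in the paper it would be chosen
    by cross-validation, 0.5 here is only a placeholder.
    """
    fused = w * rgb_scores + (1.0 - w) * depth_scores
    return int(np.argmax(fused))
```

Since no further learning is involved, the fusion adds essentially no cost at test time.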
Results are reported in Tables V-VI. We see that (DE)2CO produces results on par with or slightly superior to the current state of the art, even while using an extremely simple feature fusion method. This is remarkable, as competitors like [aakerberg2017depth, eitel2015multimodal] use sophisticated, deep learning based cue integration methods. Hence, our ability to compete in this setting is entirely due to the (DE)2CO colorization mapping, clearly more powerful than the other baselines. It is worth stressing that, in spite of the different cue integration and depth mapping approaches compared in Tables V-VI, convnet results on RGB are already very high, hence in this setting the advantage brought by superior performance on the depth channel tends to be masked. Still, on Washington we achieve the second best result, and on JHUIT-50 we set the new state of the art.
V Conclusions

This paper presented a network for learning deep colorization mappings. Our architecture follows the residual philosophy, learning how to map depth data to RGB images for a given pre-trained convolutional neural network. By using our (DE)2CO algorithm, as opposed to the hand-crafted colorization mappings commonly used in the literature, we obtained a significant jump in performance over three different benchmark databases, using four popular deep networks pre-trained on ImageNet. The visualization of the colorized images further confirms how our algorithm captures the rich informative content and the different facets of depth data. All the deep depth mappings presented in this paper are available at https://github.com/fmcarlucci/de2co. Future work will further investigate the effectiveness and generality of our approach, testing it on other RGB-D classification and detection problems, with various fine-tuning strategies, on several deep networks pre-trained over different RGB databases, and in combination with RGB convnets through more advanced multimodal fusion approaches.