When CNNs Meet Random RNNs: Towards Multi-Level Analysis for RGB-D Object and Scene Recognition

04/26/2020 ∙ by Ali Caglayan, et al. ∙ Hacettepe University

Recognizing objects and scenes are two challenging but essential tasks in image understanding. In particular, the use of RGB-D sensors in handling these tasks has emerged as an important area of focus for better visual understanding. Meanwhile, deep neural networks, specifically convolutional neural networks (CNNs), have become widespread and have been applied to many visual tasks by replacing hand-crafted features with effective deep features. However, it is an open problem how to exploit deep features from a multi-layer CNN model effectively. In this paper, we propose a novel two-stage framework that extracts discriminative feature representations from multi-modal RGB-D images for object and scene recognition tasks. In the first stage, a pretrained CNN model has been employed as a backbone to extract visual features at multiple levels. The second stage maps these features into high level representations with a fully randomized structure of recursive neural networks (RNNs) efficiently. In order to cope with the high dimensionality of CNN activations, a random weighted pooling scheme has been proposed by extending the idea of randomness in RNNs. Multi-modal fusion has been performed through a soft voting approach by computing weights based on individual recognition confidences (i.e. SVM scores) of RGB and depth streams separately. This produces consistent class label estimation in final RGB-D classification performance. Extensive experiments verify that fully randomized structure in RNN stage encodes CNN activations to discriminative solid features successfully. Comparative experimental results on the popular Washington RGB-D Object and SUN RGB-D Scene datasets show that the proposed approach significantly outperforms state-of-the-art methods both in object and scene recognition tasks.




1 Introduction

Convolutional neural networks (CNNs) have attracted researchers to many visual recognition tasks since their breakthrough emergence. However, building an effective model can be quite challenging due to the lack of labeled training data, limited time and computational resources, and the need for well-defined hyperparameter settings for good generalization capability. Especially in many real-world tasks, it is not preferable to train a model from scratch. Luckily, CNNs offer highly efficient solutions with their transferable off-the-shelf features. Consequently, many approaches take advantage of these features to propose new solutions for object recognition (e.g. [1, 2]), scene recognition (e.g. [3, 4]), object detection (e.g. [5, 6]), and semantic segmentation (e.g. [5, 7]) due to their high representation ability and capability of generalization among different tasks when trained with large scale datasets. The most common and straightforward strategy among these methods is to utilize the features obtained from the final layers, which provide semantically rich information with smaller dimensions compared to the earlier layers [1, 2, 5, 6, 7]. However, one concern about these semantics is that, as features evolve towards the final layers, they become increasingly dependent on the chosen dataset and task [8], which might diminish their generalization capability when transferred. Moreover, this strategy ignores the locally activated distinctive information of the earlier layers, which is less sensitive to semantics [9, 10]. One of the main challenges in the earlier layers of deep CNNs is the high dimensionality of the extracted features. In addition, using these features as is makes the feature space intractable. Eventually, while features are transformed from low-level general to high-level specific representations throughout the network, the relational information is distributed across the network at different levels [8, 9]. However, it remains unclear how to exploit this information effectively.

Fig. 1: General overview of the proposed framework. The framework accepts RGB and depth images and it first colorizes depth inputs. In the CNN-Stage activations at different levels of a pretrained model are extracted. In the RNN-Stage, first, CNN activations are converted to reasonable dimensions and appropriate input requirements for RNNs by preprocessing operations. Then, multiple random RNNs are applied to map these inputs into high level representations. Finally, multiple level fusion and classification steps are deployed for recognition tasks.

In this paper, we aim to present an effective deep feature extraction framework to derive powerful image representations through transfer learning. The proposed pipeline relies on two key insights. The first one is to employ a pre-trained CNN as the backbone model and exploit activations at different layers of the network to cover the predominant information of the underlying localities. The second one is to implement multiple random recursive neural networks (RNNs) on top of CNNs to encode the CNN activations into a robust representation with reduced dimensionality and sufficient descriptiveness.

In developing our framework, we particularly deal with the RGB-D object and scene recognition problems, which are challenging yet crucial tasks, especially with today's wider application of robotics technologies. Moreover, the multi-modality of RGB-D sensors raises additional difficulties in the representation of input data, such as handling different modalities and devising solutions that capture complementary information from both RGB and depth data effectively. Besides these, alleviating limitations on time and memory consumption is another challenge to deal with. To address these challenges, we propose a novel framework that gathers feature representations at different levels in a compact and representative feature vector for both RGB and depth data. After obtaining CNN activations, we first apply a preprocessing operation to the activation maps of each level through reshaping or randomized pooling. This not only provides a generic structure for each level by fixing an RNN tree but also allows us to improve recognition accuracy through multi-level fusion. We then give the outputs of these operations to multiple random RNNs to acquire higher level compact feature representations. Incorporating multiple fixed RNNs together with the pre-trained CNN models allows feature transition at different levels to preserve both the semantic and the spatial structure of objects. In order to transfer learning from a pre-trained CNN model for the depth modality, we embed depth data into the RGB domain with a highly efficient depth colorization technique based on surface normals. As for the multi-modal fusion of RGB and depth modalities, we explore different fusion techniques. Moreover, we present an approach that provides a decisive fusion of RGB and depth modalities based on the modality importance through a weighting scheme (see Sec. 3.4). Our implementation is in Python using the PyTorch (https://github.com/pytorch/pytorch) and NumPy (https://github.com/numpy/numpy) libraries. All the source code, together with system requirements and documentation, will be released to the community on GitHub.

The proposed framework is evaluated with exhaustive experiments on two popular public datasets: (i) the Washington RGB-D Object dataset [12] for the RGB-D object recognition task and (ii) the SUN RGB-D Scene dataset [13] for the RGB-D scene recognition task. The experimental results demonstrate the effectiveness of our approach in terms of accuracy by achieving superior performance over the current state-of-the-art methods. A preliminary version of this work appeared in [14] for RGB-D object recognition. In this work, we present an extended and enhanced version of our work in [14] with a novel framework and contribute to RGB-D object and scene recognition as follows:

  • We present a novel framework for deep features with two-stage organization where information at different levels is encoded by incorporation of multiple random RNNs with a pre-trained CNN model for RGB-D object and scene recognition (see Sect. 3). The framework is applicable to a variety of pre-trained CNN models including AlexNet [15], VGGNet [16], ResNet [17], and DenseNet [18]. The overall structure has been designed in a modular and extendable way through a unified CNN and RNN process. Thus, it offers easy and flexible use. These also can easily be extended with new capabilities and combined with different setups and other models for implementing new ideas. In fact, our preliminary approach has been already successfully applied to another challenging robotics task in a SLAM system [19].

  • We extend the idea of randomness in RNNs as a novel pooling strategy to cope with the high dimensionality of CNN activations from different levels (see Sec. 3.3.1). This strategy has been applied as a preprocessing stage before RNNs and it allows us to evaluate and utilize multiple level information in deep models such as ResNet [17] and DenseNet [18] models. In addition, we give the experimental results of different pooling strategies in terms of accuracy and show the effectiveness of our pooling strategy over other pooling methods (see Sec. 4.2.4).

  • We study several aspects of transfer learning through an empirical investigation including comparative profiling results of different baselines (see Sec. 4.2.1), level-wise analysis of different baselines (see Sec. 4.2.3), the effects of finetuning over fixed pretrained CNN models (see Sec. 4.2.6), and different approaches to multi-level and multi-modality data fusion (see Sec. 4.2.7). In regard to multi-modal fusion, unlike our previous work using concatenation of features, we propose a soft voting approach based on individual SVM confidences of RGB and depth streams (see Sec. 3.4) and show the strength of our approach experimentally (see Sec. 4.2.7). We also give (i) an empirical evaluation of the randomness to see whether random RNNs are stable enough (see Sec. 4.2.2), (ii) an experimental analysis of multi-level RNNs (see Sec. 4.2.5), and (iii) comparative results of different pooling strategies against the proposed random pooling (see Sec. 4.2.4). Finally, we provide experimental results demonstrating that our approach improves the state-of-the-art results on two of the most comprehensive and challenging real-world public datasets: the Washington RGB-D Object dataset for RGB-D object recognition (see Sec. 4.3) and the SUN RGB-D Scene dataset for RGB-D scene recognition (see Sec. 4.4).

2 Related Work

The proposed work is related to several areas, such as multi-modal CNN based approaches, transfer learning based approaches, and random recursive neural networks. In this section, we narrow our focus to RGB-D based recognition and give a brief review of the relevant approaches to situate the current work in the literature.

2.1 Multi-Modal CNN based Approaches

Following their success in computer vision, CNN-based solutions have replaced conventional methods such as the works in [20], [21], and [22] in the field of RGB-D object recognition, as in many other areas. For instance, Wang et al. [23, 24] present CNN-based multi-modal learning systems motivated by the intuition of common patterns shared between RGB and depth modalities. They force their systems to correlate features of the two modalities in a multi-modal fusion layer with a pretrained model [23] and their custom network [24], respectively. Li et al. [25] extend the idea of considering the multi-modal intrinsic relationship with intra-class and inter-class similarities for indoor scene classification by providing a two-stage training approach. In [26], a three-stream multi-modal CNN architecture has been proposed in which depth images are represented with two different encoding methods in two streams and the remaining stream is used for RGB images. Despite the extra burden, this has naturally increased the depth accuracy in particular. A similar multi-representational approach has been proposed by Zia et al. in [27], where a hybrid 2D/3D CNN model initialized with pre-trained 2D CNNs is employed together with 3D CNNs for depth images. Cheng et al. [28] propose the convolutional Fisher kernel (CFK) method, which integrates a single CNN layer with Fisher kernel encoding and utilizes Gaussian mixture models for feature distribution. The drawback of their approach is the very high dimensionality of the feature space.

2.2 Transfer Learning based Approaches

Deep learning algorithms require a significant amount of annotated training data, and obtaining such data can be difficult and expensive. Therefore, it is important to leverage transfer learning to obtain a high-performance learner on the target domain and the task at hand. In particular, applying a trained deep network and then fine-tuning its parameters can speed up the learning process or improve the classification performance [29]. Furthermore, many works show that a CNN pre-trained on a large-scale dataset can generate good generic representations that can effectively be used for other visual recognition tasks as well [1, 8, 30, 31, 32]. This is particularly important in vision tasks on RGB-D datasets, for which labeled data are hard to collect and the amount of data is generally much smaller than that of labeled RGB datasets.

There are many successful approaches that use transfer learning in the field of RGB-D object recognition. Schwarz et al. use the activations of two fully connected layers, namely fc7 and fc8, extracted from the pre-trained AlexNet [15] for RGB-D object recognition and pose estimation. Gupta et al. [33] study the problem of object detection and segmentation on RGB-D data and present a depth encoding approach, referred to as HHA, to utilize a CNN model pre-trained on RGB datasets. Asif et al. introduce a cascaded architecture of random forests together with the fc7 features of the pre-trained models of [34] and [16] to encode the appearance and structural information of objects in their works [35] and [36], respectively. Carlucci et al. [37] propose a colorization network architecture and use a pre-trained model as a feature extractor after fine-tuning it. They also make use of the final fully connected layer in their approach. Thus, the above-mentioned studies mainly focus on the outputs of the fully connected layers.

On the other hand, many studies [38, 10, 39, 40, 14] have concluded that using fully connected layers from pre-trained or fine-tuned networks might not be the optimum approach to capture discriminating properties in visual recognition tasks. Moreover, combining the activations obtained at different levels of the same modality enhances recognition performance further, especially for multi-modal representations, where earlier layers capture modality-specific patterns [41, 40, 14]. Hence, utilizing information at different levels in the works of [41, 10, 39, 40, 14, 42] yields better performance. The more recent approach of Loghmani et al. [43] utilizes the pre-trained model of residual networks [17] to extract features from multiple layers and combines them through a recurrent neural network. Their experimental results also verify that multi-level feature fusion provides better performance than single-level features. While their approach is based on a gated recurrent unit (GRU) with a number of memory neurons, our approach employs multiple random neural networks that do not necessarily need training. A different related approach is proposed by Asif et al. in [45]. They handle the classification task by dividing it into image-level and pixel-level branches and fusing them through a Fisher encoding branch. Eitel et al. [46] and Tang et al. [47] employ two-stream CNNs, one for each modality of RGB and depth channels, and each stream uses the model of [15] pre-trained on ImageNet. In these works [46, 47], the two streams are finally connected by a fully connected fusion layer and a canonical correlation analysis (CCA) module, respectively. While feature fusion approaches (e.g. concatenation) may provide good accuracy for the visual recognition task, feature fusion may not be the only solution for a multi-level decision process, since an enlarged feature space may harm recognition when only a small amount of data is available. We show experimentally that voting on the SVM confidence scores for selected levels can also provide reliable and improved performance. Moreover, this also enables us to assign confidence score based importance to the RGB and depth domains in multi-modal fusion.

2.3 Random Recursive Neural Networks

Randomization in neural networks has been researched for a long time in various studies [48, 49, 50, 51, 52, 53, 11] due to its benefits, such as simplicity and computational cheapness compared to optimization [54]. Since a complete overview of these variations is beyond the scope of this paper, we give an overview that focuses specifically on random recursive neural networks [11]. Recursive neural networks (RNNs) [55, 56, 57] are graphs that process a given input into recursive tree structures to make high-level reasoning possible in a part-whole hierarchy by repeating the same process over the trees. RNNs have been employed for various research purposes in computer vision including image super-resolution [58], semantic segmentation [57, 59], and RGB-D object recognition [11, 60, 61]. In [11], Socher et al. have introduced a two-stage RGB-D object recognition architecture where the first stage is a single CNN layer using a set of k-means centroids as the convolution filters and the second stage is multiple random recursive neural networks that process the outputs of the first stage. Bai et al. [60] propose a subset based approach of the pioneering work in [11], where they use a sparse auto-encoder instead of k-means clustering for the convolution filters. Cheng et al. [61] employ the same architecture as Socher et al. [11] for a semi-supervised learning system, with a modification that adds spatial pyramid pooling to prevent potential performance degradation when resizing input images. Bui et al. [62] have replaced the single CNN layer in [11] with a pre-trained CNN model for RGB object recognition and achieved impressive results. Following their success, in our preliminary work [14], we proposed an approach that aims to improve on this idea by gathering feature representations at different levels in a compact and representative feature vector for both RGB and depth data. To this end, we reshape CNN activations in each layer, which provides a generic structure for each layer by fixing the tree structure without hurting performance and allows us to improve recognition accuracy by combining feature vectors at different levels. In this work, we propose a pooling strategy to handle high dimensional CNN activations by extending the idea of randomness in RNNs. This is related to the stochastic pooling of Zeiler and Fergus [63], which picks the normalized activations of a region according to a multinomial distribution by computing the probabilities within the region. Instead of using probabilities, our pooling approach is a form of averaging based on uniformly distributed random weights.

3 Proposed Approach

The proposed pipeline has two main stages. In the first stage, a pre-trained CNN model is employed as the underlying feature extractor. In this work, we have examined several models in this stage. The second stage transforms convolutional features through a randomized recursive neural network based structure that aims to acquire more compact representations. In order to cope with the high dimensionality of CNN activations, a pooling strategy based on random weights is proposed. The final representations are passed through a linear SVM classifier for categorization of objects and scenes. The overall pipeline can be viewed as a deeper analogue of [64], where a proper architecture with random weights for the object recognition task has been explored. In the following, we describe each stage of our approach in detail.

Fig. 2: Schematic overview of CNN models and their level-wise extraction points based structures. Each level of schematic view shows name of the level, operations performed in the level with the number of them if exist (for ResNet [17] and DenseNet [18] models), and dimensions of the activation output.

3.1 Data Preparation

In order to use pre-trained CNN models, it is important to process input images appropriately. To this end, following common practices for preprocessing, we resize RGB images to 256x256 using bilinear interpolation and apply center cropping to get 224x224 images. Then, we apply the commonly used z-score standardization on the input data by using the mean and standard deviation of ImageNet [65]. We do not perform any other practices such as data augmentation.
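The center cropping and z-score standardization steps described above can be sketched in NumPy as follows. This is a minimal illustration rather than the released implementation: the function names are hypothetical, the input is assumed to be already resized to 256x256, and the mean and standard deviation are the per-channel ImageNet statistics commonly used with pre-trained models for inputs scaled to [0, 1].

```python
import numpy as np

# Per-channel ImageNet statistics commonly used with pre-trained models
# (RGB order, for pixel values scaled to [0, 1]).
IMAGENET_MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
IMAGENET_STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)


def center_crop(img, size=224):
    """Crop the central size x size window from an H x W x C array."""
    h, w = img.shape[:2]
    top, left = (h - size) // 2, (w - size) // 2
    return img[top:top + size, left:left + size]


def standardize(img, mean=IMAGENET_MEAN, std=IMAGENET_STD):
    """Z-score standardization per channel; returns a C x H x W float array."""
    x = (img.astype(np.float32) / 255.0 - mean) / std
    return np.transpose(x, (2, 0, 1))


def preprocess_rgb(img_256):
    """img_256: uint8 array of shape 256 x 256 x 3 (assumed already resized)."""
    return standardize(center_crop(img_256, 224))
```

Calling `preprocess_rgb` on a 256x256x3 `uint8` image yields a 3x224x224 float array in the channel-first layout that pre-trained PyTorch models expect.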

As for the depth domain, we first need an appropriate RGB-like representation of depth data to leverage the power of CNN models pre-trained on the large-scale RGB dataset of ImageNet. There are several ways to represent depth data as RGB-like images, such as the HHA method of Gupta et al. [33] (i.e. using horizontal and vertical observation values and angle of the normal to common surface), the ColorJet work of Eitel et al. [46] (i.e. mapping depth values to different RGB color values), or the commonly used surface normal based colorization as in [66, 14]. In this work, we prefer the colorization technique based on surface normals, as it proved effective in our previous work [14]. However, unlike [14], where surface normals are estimated from depth maps without camera parameters, we estimate surface normals more accurately on 3D point clouds computed from depth maps and camera intrinsic values. To address the issue of missing depth values, we first apply a fast vectorized depth interpolation that uses a median filter over a local neighborhood to reconstruct missing values in noisy depth inputs. Then, 3D point cloud estimation using the camera intrinsic constants and surface normal calculation on the point clouds follow, respectively. After this, the common approach is to scale the surface normals to the range [0, 255] to fit RGB image processing. However, since such a mapping from floating point to integer values leads to a loss of information, we use these normal vectors as is without performing further quantization or scaling. Furthermore, unlike in RGB input processing, we resize these RGB-like depth data using nearest neighbor interpolation rather than bilinear interpolation, because the latter may lead to more distortion in the geometric structure of a scene. Moreover, nearest neighbor interpolation is more suitable to the characteristics of depth data, as it provides better separability between foreground and background in a scene. When applying z-score standardization to the depth domain, we use the standard deviation of ImageNet as in the RGB domain. However, we use zero mean instead of the ImageNet mean, as the normal vectors already lie in the range [-1, 1] and need no zero-mean shifting.
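The point cloud and surface normal steps described above can be sketched as follows. This is a hedged illustration under stated assumptions, not the authors' procedure: a pinhole camera model with hypothetical intrinsics `fx`, `fy`, `cx`, `cy` is assumed, normals are estimated from the cross product of finite-difference tangent vectors (one common estimator among several), and the median-filter interpolation of missing depth values is omitted. The last function applies the zero-mean standardization with the ImageNet standard deviation mentioned in the text.

```python
import numpy as np

# ImageNet per-channel standard deviation (assumed standard values).
IMAGENET_STD = np.array([0.229, 0.224, 0.225], dtype=np.float64)


def depth_to_points(depth, fx, fy, cx, cy):
    """Back-project an H x W depth map to an H x W x 3 point cloud (pinhole model)."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return np.stack([x, y, depth], axis=-1)


def surface_normals(points):
    """Unit normals from the cross product of the two image-space tangent vectors."""
    du = np.gradient(points, axis=1)  # change along image columns
    dv = np.gradient(points, axis=0)  # change along image rows
    n = np.cross(du, dv)
    norm = np.linalg.norm(n, axis=-1, keepdims=True)
    return n / np.maximum(norm, 1e-12)


def standardize_normals(normals, std=IMAGENET_STD):
    """Zero-mean standardization: divide by the ImageNet std only; returns 3 x H x W."""
    return np.transpose(normals / std, (2, 0, 1))
```

For a fronto-parallel plane, every estimated normal points along the optical axis, which is a quick sanity check for the intrinsics and axis conventions.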

3.2 CNN-Stage

The backbone of our approach is a pre-trained CNN model. Since available RGB-D datasets are much smaller than RGB datasets, it is important to make use of efficient knowledge transfer from models pre-trained on large RGB datasets. In addition, it saves time by eliminating the need for training from scratch. In the previous work [14], the available pre-trained CNN model of [34], named VGG_f, in the MatConvNet toolbox [67] has been used. In this work, we employ several pre-trained models available in PyTorch, including AlexNet [15], VGGNet [16] (specifically the VGGNet-16 model with batch normalization), ResNet [17] (specifically the ResNet-50 and ResNet-101 models), and DenseNet [18]. We extract features from seven different levels of the CNN models. The models investigated in this study, together with their feature extraction levels, are shown in Fig. 2. For AlexNet, the outputs of the five successive convolutional layers and the following two fully-connected (FC) layers have been considered, while for VGGNet, the first two FC layers are taken into account together with the output of each convolution block, which includes several convolutions and a final max pooling operation. Unlike AlexNet and VGGNet, ResNet and DenseNet models consist of blocks, such as residual, dense, or transition blocks, that contain multiple layers. While ResNet extends the sequential behaviour of AlexNet and VGGNet with the introduction of skip connections, DenseNet goes one step further by concatenating the incoming activations rather than summing them. The ResNet models consist of five stages followed by an average pooling and an FC layer. Therefore, the outputs of the five successive stages and the output of the final average pooling have been taken as six of the seven extraction points. As for the remaining extraction level for these models (ResNet-50 and ResNet-101), the middle point of the third block (which is the largest block) has been taken. Similarly, for the DenseNet model, the outputs of all four dense blocks (for the last dense block, the output of the normalization that follows it has been taken) and of the transition blocks between them have been considered as the extraction points. Since the common and straightforward AlexNet model has the minimum depth, with a seven-layer stack, the above-mentioned extraction points for each model are selected to evaluate and compare level-wise model performances. In addition, these levels are also related to the CNN model in the previous work [14], which we improve on by considering the intrinsic reasoning behind the use of blocks and the approximate distance differences.

3.3 RNN-Stage

Fig. 3: Graphical representation of a single recursive neural network (RNN). The same random weights have been applied to compute each node and level.

Random recursive neural networks offer a feasible solution by randomly fixing the network connections, which eliminates the need for selection in the parameter space. Motivated by this, we employ multiple random RNNs whose inputs are the activation maps of a pre-trained CNN model. RNNs map a given 3D matrix input into a vector of higher level representations by applying the same operations recursively in a tree structure. In each layer, adjacent blocks are merged into a parent vector with tied weights, where the objective is to map inputs into a lower dimensional space through multiple levels. Then, the output of a parent vector is passed through a nonlinear function. A typical choice for this purpose is the tanh function. In our previous work [14], we give the comparative results of different activation functions in terms of accuracy and show that hyperbolic functions work well. Therefore, in this work, we employ the tanh activation function as in [11, 14]. Fig. 3 shows a graphical representation of a pooled CNN output and an RNN structure with 3 levels, in which blocks of child nodes are merged (note that this figure is inspired by the RNN graphical representation of [11]).

In our case, inputs of RNNs are activation maps obtained from different levels of the underlying CNN model. Let $I$ be an input image that passes through a given CNN model, where $l = 1, \dots, 7$ are the extraction levels and $x^l$ denotes the output at level $l$, where the output convolution maps are either a 3D matrix $x^l \in \mathbb{R}^{K \times s \times s}$ for convolutional layers or a 1D vector $x^l \in \mathbb{R}^{K}$ for FC layers/global average pooling. Since an RNN requires a 3D input of size $K \times s \times s$, we first process the convolution maps at each level to ensure the required form. Moreover, by applying this step, we ensure that RNNs are able to handle inputs fast and effectively by reducing the receptive field area and/or the number of activation maps of high-dimensional feature levels (e.g. the outputs of early levels for models such as VGGNet [16], ResNet [17], DenseNet [18], etc.). In addition, we apply preprocessing to obtain output structures similar to those of the previous work [14]. However, reshaping alone was sufficient in the previous work due to the lower dimensionality of the layers in the VGG_f model. In this work, we introduce random weighted pooling, which copes with the high dimensionality of the layers in the underlying deeper models such as ResNet [17] and DenseNet [18]. Our pooling mechanism can downsample CNN activations in both the number and the spatial dimension of maps. After applying the preprocessing step to obtain suitable forms for RNNs, we compute the parent vector as

$$p^l = g\left(W^l \begin{bmatrix} x^l_1 \\ \vdots \\ x^l_{b^2} \end{bmatrix}\right) \qquad (1)$$

where, for each CNN extraction level $l$, $x^l_1, \dots, x^l_{b^2}$ are the $K$-dimensional child vectors of a block, $g$ is a nonlinearity function, which is tanh in this study, and $b$ is the block size of an RNN. Instead of a multi-level structured RNN, an RNN in this study is of one level with a single parent vector. In fact, our experiments have shown that the single-level structure provides better or comparable accuracy compared to the multi-level structure (see Sec. 4.2.5). Moreover, the single-level structure is more efficient, with less computational burden. Thus, the block size is actually the receptive field size in an RNN. In Eq. 1, the parameter weight matrix is $W^l \in \mathbb{R}^{K \times b^2 K}$ and it is randomly generated from a predefined distribution that satisfies the following probability density function

$$W^l \sim h(w \mid \alpha, \beta) \qquad (2)$$

where $h$ is a predefined distribution and $\alpha$ and $\beta$ are the boundaries of the distribution. In our case, the weights are set to be uniform random values in $[\alpha, \beta]$, with bounds assigned by following our previous work [14] and specifically with the aim of preventing a possible explosion of tensor values due to our aggregating pooling strategy. Note that, in order to obtain sufficient descriptive power from the randomness, we need to generate enough samples from this range. In [11], it has been demonstrated experimentally that increasing the number of random RNNs improves performance up to a certain number. In [14], it has also been verified that such a number of random RNN weights can be generated for feature encoding with high classification performance on both RGB and depth data. Therefore, as a standard usage in this work, we encode CNN features using $N$ random RNNs with $K$-channel representations, which leads to an $N \times K$ dimensional feature vector at each level of a model.
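The single-level random RNN encoding described above can be illustrated with a short NumPy sketch. It is an assumption-laden illustration, not the released code: the function name, the argument names, and the example bounds of the uniform distribution are hypothetical, and a single parent vector per RNN is computed by merging all child vectors of one level with a fixed random weight matrix followed by tanh.

```python
import numpy as np


def random_rnn_encode(x, num_rnns, low, high, seed=0):
    """Encode a K x s x s activation tensor with single-level random RNNs.

    Each RNN merges all s*s child vectors (each K-dimensional) into one
    K-dimensional parent vector through a fixed random weight matrix and tanh.
    Returns a vector of length num_rnns * K.
    """
    k, s, _ = x.shape
    rng = np.random.default_rng(seed)
    # Stack the child vectors (one per spatial location) into one long vector.
    children = x.reshape(k, s * s).T.reshape(-1)           # length s*s*K
    # One untrained weight matrix of shape K x (s*s*K) per random RNN.
    w = rng.uniform(low, high, size=(num_rnns, k, s * s * k))
    parents = np.tanh(w @ children)                         # num_rnns x K
    return parents.reshape(-1)


# Illustrative sizes and bounds only (not the paper's settings).
features = random_rnn_encode(np.ones((4, 3, 3)), num_rnns=5, low=-0.1, high=0.1)
```

Because the weights are never trained, fixing the seed makes the encoding reproducible, and the same weights can be reused for every image at a given level.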

The reason why random weights work well for object recognition tasks seems to lie in the fact that particular convolutional pooling architectures can naturally produce frequency selective and translational invariant features [68]. As stated before, in analogy to the convolutional-pooling architecture in [64], our approach intuitively incorporates both selectivity due to the CNN stage and translational invariance due to the RNN stage. Moreover, we have to point out that there is biological plausibility lies in the use of randomness as well. In [69], Rigotti et al. have shown that random connections between inter-layer neurons are needed to implement mixed selectivity for optimal performance during complex cognitive tasks. Before concluding this section, we give details of our random pooling approach, where we extend the idea of random RNN as a downsampling mechanism.

3.3.1 Random Weighted Pooling

Fig. 4: Illustration of random weighted pooling over number of maps (top) and size of maps (below).

In our previous work [14], we give CNN outputs to RNNs after a reshaping process. However, due to the high dimensional output size of the models used in this study, it is necessary to process CNN activations further. In this work, we propose a random pooling strategy to reduce the dimension in either size of the activation maps ( block size or receptive field area of an RNN) or number of maps () at CNN levels where reshaping is insufficient. In our random weighted pooling approach, we aggregate the CNN activation maps by sampling from a uniform distribution as in Eq. 2 from each pooling area. More precisely, for extraction level, the pooling reduces activations by mapping into region as where and in Eq. 3.


where is pooling region, convolutional activations, is the index of each element within the pooling, and is random weights. and when pooling is over number of maps whereas and when pooling is over size of maps. Fig. 4 illustrates the proposed random weighted pooling for downsampling both the number of maps and the size of maps. By extending the randomness in RNNs along the pipeline with the proposed pooling strategy, we aim to show that randomness can work quite effectively. In fact, as the comparative results show (see Sec. 4.2.4), random weighted pooling generally performs as well as or better than other common pooling methods such as max pooling and average pooling.
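The pooling described above can be sketched in NumPy as a weighted sum over each pooling region with weights drawn from a uniform distribution. This is a minimal illustration and not the authors' implementation: the function names, the `[0, 1)` sampling range, and the choice to share one weight set across all pooling regions are assumptions made for the example.

```python
import numpy as np


def random_weighted_pool_maps(acts, group, rng):
    """Pool over the number of maps: (C, H, W) -> (C // group, H, W)."""
    c, h, w = acts.shape
    assert c % group == 0, "channel count must be divisible by the group size"
    # Assumption: one weight per position in a region, sampled from U[0, 1)
    # and shared by all regions.
    weights = rng.uniform(0.0, 1.0, size=group)
    grouped = acts.reshape(c // group, group, h, w)
    # Weighted sum over the `group` consecutive maps in each region.
    return np.einsum("k,gkhw->ghw", weights, grouped)


def random_weighted_pool_size(acts, block, rng):
    """Pool over the size of maps: (C, H, W) -> (C, H // block, W // block)."""
    c, h, w = acts.shape
    assert h % block == 0 and w % block == 0, "map size must be divisible by block"
    # Assumption: one weight per position of a block x block region,
    # shared by all regions and channels.
    weights = rng.uniform(0.0, 1.0, size=(block, block))
    tiles = acts.reshape(c, h // block, block, w // block, block)
    # Weighted sum over the two within-block axes.
    return np.einsum("ij,chiwj->chw", weights, tiles)


rng = np.random.default_rng(0)
activations = rng.standard_normal((8, 4, 4))  # hypothetical CNN activations
fewer_maps = random_weighted_pool_maps(activations, group=2, rng=rng)
smaller_maps = random_weighted_pool_size(activations, block=2, rng=rng)
```

With these hypothetical sizes, `fewer_maps` has shape `(4, 4, 4)` and `smaller_maps` has shape `(8, 2, 2)`. Replacing the random weights with a constant `1 / group` or `1 / block**2` would turn either function into plain average pooling, which is one way to read the comparison in Sec. 4.2.4.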

3.4 Fusion and Classification

After obtaining encoded features from the RNN-Stage, we investigate multi-level fusions to capture more distinctive information at different levels and further improve recognition performance. To minimize the cross-entropy error between output predictions and target values, we could feed the multi-level outputs to fully connected layers and back-propagate through them. However, following the success of our previous study [14], we perform classification with a linear SVM using the scikit-learn (https://github.com/scikit-learn/scikit-learn) [70] implementation. In our previous work [14], we performed straightforward feature concatenation on various combinations of the best mid-level representations. In this work, in addition to feature concatenation, we also apply soft voting by averaging SVM confidence scores on the best trio of levels. Finally, RGB and depth features are fused to evaluate the combined RGB-D accuracy. Shiny, transparent, or thin surfaces may corrupt depth information, since depth sensors do not properly handle reflections from such surfaces; RGB therefore performs better in such cases. On the other hand, depth sensors work well within a certain range and are insensitive to changes in lighting conditions. Therefore, to take full advantage of both modalities in a complementary way, a compact multi-modal combination that accounts for how well each input type performs is important in devising the best-performing fusion. To this end, we present a decision mechanism using weighted soft voting based on the confidence scores obtained from the RGB and depth streams. Weighting the modalities in this way compensates for imbalance between them and lets each complement the other's decision. Once the modality-specific branches have been processed, we combine their predictions through the weighted SVM scores as follows. Let represents SVM confidence scores of each category class , where is number of classes, and indicates RGB and depth modalities.
Then, weights are computed as in Eq. 4.


where is normalized squared magnitudes for each modality and defined as:


Finally, multi-modal RGB-D predictions are estimated as follows, in Eq. 6:


where is a category class. Concretely, if RGB and depth results are balanced in confidence scores, then the final soft voting decision is based on equal contribution from each stream similar to averaging.
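To make the weighting idea concrete, the following NumPy sketch fuses the RGB and depth score vectors of one sample. It is only one plausible instantiation and should not be read as the exact form of Eqs. 4 to 6: the softmax over normalized squared magnitudes, the function name, and the example scores are all assumptions. It does preserve the property stated above, namely that equally confident streams contribute equally, as in averaging.

```python
import numpy as np


def weighted_soft_vote(scores_rgb, scores_depth):
    """Fuse the per-class SVM confidence scores of two modalities for one sample.

    Returns the predicted class index and the two modality weights.
    """
    scores = np.stack([
        np.asarray(scores_rgb, dtype=float),
        np.asarray(scores_depth, dtype=float),
    ])  # shape: (2, n_classes)

    # Squared magnitude of each modality's score vector, normalized to sum to 1.
    squared = np.sum(scores ** 2, axis=1)
    total = squared.sum()
    ratios = squared / total if total > 0 else np.full(2, 0.5)

    # Assumed weighting: softmax over the normalized squared magnitudes.
    weights = np.exp(ratios) / np.exp(ratios).sum()

    fused = weights @ scores  # weighted sum of the two score vectors
    return int(np.argmax(fused)), weights


# Hypothetical scores: RGB is confident about class 0, depth weakly prefers class 1.
label, weights = weighted_soft_vote([2.0, -1.0, -1.0], [-0.1, 0.2, -0.1])
```

In this hypothetical case the RGB stream receives the larger weight and the fused decision follows it. When both score vectors have the same magnitude, the two weights are equal and the result reduces to average voting.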

4 Experimental Evaluation

The proposed framework has been evaluated on two challenging benchmarks (Sec. 4.1) for two tasks: (i) RGB-D object recognition (Sec. 4.3) using the Washington RGB-D object dataset [12] and (ii) RGB-D scene recognition (Sec. 4.4) using the SUN RGB-D scene dataset [13]. To evaluate the effects of various model parameters and setup properties in our framework, we carry out extensive experiments (Sec. 4.2) on the challenging Washington RGB-D object dataset, which is a larger-scale RGB-D dataset compared to other RGB-D benchmarks. Finally, we compare our results with state-of-the-art results for both benchmarks. Results of other methods are taken from the original papers.

4.1 Dataset and Setup

4.1.1 Washington RGB-D Object Dataset

Washington RGB-D object dataset includes a total of images for each modality under object categories and category instances. Categories are commonly used household objects such as cups, camera, keyboards, vegetables, fruits, etc. Each instance of a category has images taken from , and elevation angles. The dataset provides train/test splits where in each split, one instance for each category is used for testing and the remaining instances are for training. Thus, for a single split run, a total of category instances (roughly images) are used at testing and the remaining instances (roughly images) are used at training phase. We evaluate the proposed work on the provided cropped images with the same setup in [12] for the 10 splits and average accuracy results are reported for the comparison to the related works.

4.1.2 SUN RGB-D Scene Dataset

The SUN RGB-D scene dataset is the largest real-world RGB-D scene understanding benchmark to date and contains RGB-D images of indoor scenes. Following the publicly available configuration of the dataset, we choose scene categories with a total of images for training and images for testing. We use the same train/test split as Song et al. [13] to evaluate the proposed work for scene recognition.

4.2 Model Ablation

We first analyse and validate the proposed framework through extensive experiments with a variety of architectural configurations on the popular Washington RGB-D benchmark. This section presents the analysis and evaluation of these ablation experiments. The developmental experiments are carried out on two splits of the Washington RGB-D Object dataset for both modalities in order to obtain more stable results, and the average results are analysed. In some experiments, more runs have been carried out; this is clearly stated in the related sections. The best-performing models are then compared with the state-of-the-art methods using the exact evaluation setups provided. We assess the proposed framework on a desktop PC with an AMD Ryzen 9 3900X 12-core processor (3.8 GHz base), 128 GB of DDR4 RAM at 2666 MHz, and an NVIDIA GeForce GTX 1080 Ti graphics card with 11 GB of memory.

Fig. 7: Effect of randomness on the accuracy results for each level (L1 to L7). Values indicate standard deviations.
Fig. 10: Level-wise average accuracy performance of different baseline models on all the 10-splits of Washington RGB-D dataset.

4.2.1 Computation Time and Memory Profiling on Different Models

| Model | Feature extraction time, CNN-RNN stages (hh:mm:ss) | Classification time, SVMs (hh:mm:ss) | Overall time (hh:mm:ss) | CNN-Stage memory (GPU) | RNN-Stage pool weights (CPU) | RNN-Stage RNN weights (CPU) | Overall memory (CPU) |
|---|---|---|---|---|---|---|---|
| AlexNet | 00:07:41 | 00:28:33 | 00:36:14 | 1115 MB | 772.1 kB | 4.2 GB | 12.6 GB |
| VGGNet-16 | 00:21:21 | 00:36:42 | 00:58:03 | 9259 MB | 8.6 MB | 4.8 GB | 11.8 GB |
| ResNet-50 | 00:16:23 | 00:38:36 | 00:54:59 | 6067 MB | 9.6 MB | 5.1 GB | 10.8 GB |
| ResNet-101 | 00:19:08 | 00:40:33 | 00:59:41 | 8795 MB | 9.6 MB | 5.1 GB | 11.8 GB |
| DenseNet-121 | 00:17:02 | 00:26:47 | 00:43:49 | 8821 MB | 8.3 MB | 5.4 GB | 13.3 GB |

TABLE I: Average computational time and memory overhead for overall data processing and model learning on two splits of Washington RGB-D dataset. Results cover both of train and test phases together.

We first evaluate different baseline CNN models within our framework in terms of computational time and memory requirements. We evaluate the proposed framework in two parts: (i) feature extraction, which contains the CNN-RNN stages, and (ii) classification, where a model is learnt from the extracted features to distinguish the different classes. The batch size is set to 64 for all the models. Table I reports computational times and memory workspaces for the whole data processing ( images) on Washington RGB-D dataset. The results are averages over two splits on RGB images. Depth data incurs an additional cost because it must be colorized first. The results in this table cover the overall processing and classification of the features from all levels. Moreover, the classification time covers both training and testing, of which training carries the main computational burden. Therefore, the main cost in terms of processing time comes from training the SVM models, which run on the CPU for times. Processing only a single optimum level would reduce the computational time by a factor of approximately seven. Hence, using a single optimum level or a fusion of selected levels can be efficient enough in terms of time and memory requirements while still providing sufficient representations.

4.2.2 Empirical Evaluation of the Effect of Randomness

The use of random weights in both the pooling and RNN structures raises the question of how stable the results are. Thus, we experimentally investigate whether there is a decisive difference between runs that generate and use new random weights. We run the pipeline with different random weights on two splits, 5 times for each. Fig. 7 reports average results with their standard deviations for each level. The figure clearly shows that randomness does not cause any instability in the model and produces similar results with very small deviations.

4.2.3 Level-wise Performance of Different Models

Fig. 10 shows the level-wise average accuracy of all the baseline models for both RGB and depth modalities on all 10 evaluation splits. The graphs show a similar trend, with a clear rise at the beginning and a decline at the end. Although the levels at which optimum performance is obtained vary by model, what all models have in common is that intermediate-level representations, rather than final-level representations, give the best results. These experiments also verify that while deep models transform attributes from general to specific through the network [1, 71], intermediate layers present the optimal representations. This makes sense because early layers respond to low-level raw features such as corners and edges, whereas late layers extract more object-specific features of the training dataset. This is clearer in the depth plot in Fig. 10, where the domain difference between the datasets is pronounced. We should state that RNN encoding of features extracted from FC layers with less than dimension might not be efficient, since these features are already compact enough. Encoding the outputs of these layers into a larger feature space through RNNs might therefore lead to redundancy in the representations. This might be another reason for the drop in accuracy at these layers (e.g. see L7 in Fig. 10). In addition, the depth plot contains more fluctuations and irregularities than the RGB plot, since models pretrained on RGB ImageNet data are used as fixed feature extractors without finetuning. As for the comparison of baseline models, ResNet-101 and DenseNet-121 perform similarly in terms of accuracy and are better than the others.

4.2.4 Comparative Results of Random Weighted Pooling

Fig. 11: Average accuracy performance of different pooling methods on RGB and depth data for the baseline model of DenseNet-121 on two splits of Washington RGB-D dataset.

In our approach, we extend the idea of randomness into a pooling strategy to cope with the high dimensionality of CNN activations. We employ random pooling in particular to confirm that randomness works well throughout the RNN-Stage, in the pooling strategy as well as in the random RNNs. To this end, we compare the accuracy of random pooling with that of average pooling and max pooling. We use the DenseNet-121 model, where pooling is used extensively at each level (except level 4), and we conduct the experiments using the same RNN weights for a fair comparison. Fig. 11 shows average accuracy results of two splits for each pooling method on both RGB and depth data. As seen from the figure, random weighted pooling generally performs similarly to average pooling, while it performs better than max pooling. Moreover, random pooling achieves better results especially at the middle and late levels (L4-L7), which present more stable and meaningful representations compared to the early levels.

4.2.5 Effect of Multi-Level RNN Structure

Fig. 12: Comparison of single-level and multi-level RNNs on two different CNN activations (L6 and L7) of AlexNet. The horizontal axis shows average accuracy performances (%) on two splits of Washington RGB-D dataset.

An RNN in this study has a one-level structure with a single parent computation, which is computationally faster than multi-level RNN structures. Furthermore, it is easy to use, since no further processing is needed to fix the required input forms. However, in order to test the performance of single-level RNNs against multi-level RNNs, we compare the accuracy of 1-level RNNs with that of 3-level RNNs (see Fig. 3). To this end, we conduct experiments on the two CNN activation levels with the highest semantic information (L6 and L7) of the AlexNet baseline model. The average results of two splits for both RGB and depth data are shown in Fig. 12. The results show that the 1-level RNN performs better than the 3-level RNN on RGB data, while the 3-level RNN is better than the 1-level RNN on depth data. The better performance of the 3-level RNN on depth data might be due to the use of a CNN model pretrained on the RGB data of ImageNet; further processing might then provide more representative information for depth data. Therefore, this difference might diminish or turn in favor of 1-level RNNs when finetuned CNNs are used for the depth modality as well. Overall, considering RGB and depth data together, 1-level RNNs are also better in terms of accuracy.
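The one-level structure with a single parent computation can be sketched as follows. This is a minimal NumPy illustration rather than the authors' code: it assumes that all child vectors of the receptive field are stacked and multiplied by a fixed random matrix followed by a `tanh` nonlinearity, and that the outputs of several such RNNs are concatenated. The function name, the uniform initialization range, and the sizes are hypothetical.

```python
import numpy as np


def random_rnn_encode(acts, n_rnns, rng):
    """Encode (C, H, W) activations with `n_rnns` one-level random RNNs.

    Each RNN merges all H * W child vectors (C values each) into one
    C-dimensional parent vector; the parents are then concatenated.
    """
    c, h, w = acts.shape
    # Stack every child of the single receptive field into one long vector.
    children = acts.reshape(c * h * w)
    # Assumption: weights are sampled once from U[-0.1, 0.1] and never trained.
    weights = rng.uniform(-0.1, 0.1, size=(n_rnns, c, c * h * w))
    # Single parent computation per RNN: p = tanh(W @ children).
    parents = np.tanh(weights @ children)  # shape: (n_rnns, C)
    return parents.reshape(n_rnns * c)


rng = np.random.default_rng(0)
activations = rng.standard_normal((16, 4, 4))  # hypothetical pooled CNN output
encoded = random_rnn_encode(activations, n_rnns=8, rng=rng)
```

With these hypothetical sizes, `encoded` is a vector of length `8 * 16 = 128`. A multi-level variant would instead merge small blocks of children repeatedly until a single parent remains, which requires the map size to match the block structure.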

4.2.6 Contribution of Fine-tuning

Fig. 15: Level-wise average accuracy performance of finetuned CNN models together with fixed models on all the 10-splits of Washington RGB-D dataset.

So far, we have not used any training or fine-tuning for feature extraction in the experiments. Although impressive results are obtained on RGB data, the same success is not achieved on depth data. The reason for this difference is that the baseline CNN models are pretrained on the RGB data of ImageNet. Therefore, as the next step, we analyze how the accuracy of the RGB and depth modalities changes when the baseline CNN models are fine-tuned. To this end, we first carry out a systematic search for the optimal fine-tuning hyper-parameters over a predefined set of values, using only one split of the Washington RGB-D dataset as a validation set for the AlexNet and DenseNet-121 models. Then, the models are fine-tuned by stochastic gradient descent (SGD) with momentum. The hyper-parameters (momentum, learning rate, batch size, learning rate decay factor and decay step size, and number of epochs, respectively) are set as follows:

and are used for AlexNet on RGB and depth data, respectively, whereas and are used for DenseNet-121. Apart from these two models, we also fine-tune the ResNet-101 model. We use the fine-tuning hyper-parameters of DenseNet-121 for ResNet-101 as well, since the two have a similar architectural structure. Fig. 15 shows the average accuracy of the finetuned CNN models together with the fixed models on all 10 evaluation splits of the Washington RGB-D object dataset. The plot shows a clear improvement on depth data, as expected. However, there is generally a loss of accuracy when fine-tuning is performed on RGB data. The Washington RGB-D object dataset contains a subset of the categories in ImageNet. Accordingly, models pretrained on ImageNet already capture a highly correlated distribution on RGB data, so fine-tuning on RGB data is not needed. In contrast, fine-tuning is required for depth data because of the domain difference between the depth inputs and the data on which the models were pretrained.

4.2.7 Empirical Performance of Different Fusion Strategies

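| Fusion | Levels | AlexNet RGB | AlexNet Depth | DenseNet-121 RGB | DenseNet-121 Depth | ResNet-101 RGB | ResNet-101 Depth |
|---|---|---|---|---|---|---|---|
| Single | LB1 | 81.4 ± 1.8 | 83.5 ± 2.2 | 89.7 ± 1.0 | 85.0 ± 2.1 | 89.2 ± 1.3 | 85.5 ± 2.2 |
| Single | LB2 | 81.1 ± 2.1 | 83.3 ± 2.2 | 91.0 ± 1.2 | 86.2 ± 2.3 | 91.1 ± 1.0 | 87.1 ± 2.7 |
| Single | LB3 | 79.2 ± 2.4 | 83.2 ± 2.3 | 89.5 ± 1.5 | 86.8 ± 2.1 | 90.5 ± 1.6 | 86.9 ± 2.6 |
| Concats | LB1 + LB2 | 83.0 ± 1.9 | 84.0 ± 2.4 | 90.4 ± 1.0 | 85.5 ± 2.1 | 91.1 ± 1.1 | 87.1 ± 2.7 |
| Concats | LB1 + LB3 | 82.2 ± 2.0 | 83.8 ± 2.4 | 90.0 ± 1.5 | 86.9 ± 2.1 | 91.1 ± 1.4 | 86.9 ± 2.6 |
| Concats | LB2 + LB3 | 81.0 ± 2.0 | 83.4 ± 2.3 | 89.6 ± 1.5 | 86.8 ± 2.1 | 91.1 ± 1.5 | 87.0 ± 2.7 |
| Concats | LB1 + LB2 + LB3 | 82.5 ± 2.0 | 83.8 ± 2.3 | 90.0 ± 1.5 | 86.9 ± 2.1 | 91.5 ± 1.3 | 87.0 ± 2.7 |
| SVM Avg Voting | LB1 + LB2 | 82.8 ± 1.9 | 84.1 ± 2.3 | 91.2 ± 1.0 | 86.0 ± 2.2 | 91.0 ± 1.1 | 87.0 ± 2.5 |
| SVM Avg Voting | LB1 + LB3 | 82.5 ± 2.0 | 84.1 ± 2.4 | 91.2 ± 1.0 | 86.8 ± 2.2 | 91.8 ± 1.2 | 87.1 ± 2.5 |
| SVM Avg Voting | LB2 + LB3 | 81.1 ± 2.0 | 83.4 ± 2.3 | 91.3 ± 1.3 | 86.8 ± 2.2 | 92.2 ± 1.0 | 87.0 ± 2.7 |
| SVM Avg Voting | LB1 + LB2 + LB3 | 82.7 ± 2.1 | 84.0 ± 2.4 | 91.5 ± 1.1 | 86.7 ± 2.2 | 92.3 ± 1.0 | 87.2 ± 2.5 |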

TABLE II: Average accuracy performance of different fusion combinations on the best three levels using Washington RGB-D dataset (%).

We have shown that a fixed pretrained CNN model together with random RNNs already achieves impressive results on a single level. Likewise, when such pretrained models are fine-tuned on depth data, the results improve greatly. The best single levels for RGB and depth data, respectively, are L4, L5 for AlexNet; L5, L6 for ResNet-101; and L6, L7 for DenseNet-121. Next, to further improve accuracy, we empirically analyse multi-level fusions using fixed pretrained CNN models on RGB data and fine-tuned CNN models on depth data. In this work, in addition to feature concatenation as in our previous work [14], we also apply average voting based on SVM confidence scores on the best-performing levels. Table II reports the average accuracy on all 10 train/test splits of the Washington RGB-D dataset for AlexNet, DenseNet-121, and ResNet-101. The table shows the top three level results (best levels) for each modality and their fusion combinations. The best level triples (LB1, LB2, LB3) for both AlexNet and ResNet-101 are (L4, L5, L6) on RGB data and (L5, L6, L7) on depth data, while for DenseNet-121 these levels are (L5, L6, L7) on both RGB and depth data. As can be seen from the table, a single level already produces very good results. Since both the model structures and the data modality characteristics differ, the best result in each column generally varies with the data type and the model used. Nevertheless, in general, average voting on SVM confidence scores gives better results compared to feature concatenation.
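The two multi-level fusion strategies compared in Table II differ in where the levels are combined. The sketch below contrasts them with NumPy, assuming each level provides a feature matrix of shape `(n_samples, n_features)` and a score matrix of shape `(n_samples, n_classes)`, for example the output of `decision_function` of a scikit-learn `LinearSVC` trained on that level. The function names and the numbers are hypothetical.

```python
import numpy as np


def concat_levels(level_features):
    """Feature-level fusion: join per-level feature matrices along the feature axis.

    A single classifier is then trained on the concatenated features.
    """
    return np.concatenate(level_features, axis=1)


def average_vote(level_scores):
    """Decision-level fusion: average per-level SVM score matrices, then take the top class."""
    mean_scores = np.mean(np.stack(level_scores), axis=0)  # (n_samples, n_classes)
    return np.argmax(mean_scores, axis=1)


# Hypothetical confidence scores of two levels for two samples and two classes.
scores_level_a = np.array([[0.2, 0.1], [-0.5, 0.4]])
scores_level_b = np.array([[-0.6, 0.3], [0.1, 0.0]])
predictions = average_vote([scores_level_a, scores_level_b])
```

Concatenation increases the dimensionality seen by one SVM, whereas average voting keeps one SVM per level and only combines their confidence scores.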

Finally, we provide combined RGB-D results for all three models in Table III, based on the SVM confidences. The table reports average results for the fusion of the best levels of RGB and depth, and of the best trio of levels. We evaluate two variants of soft voting: our proposed weighted vote and the average vote. The proposed weighted vote improves accuracy compared to the average vote for all the models, both for the multi-modal fusion of the best single levels and for that of the best trio of levels of the RGB and depth streams. The results also confirm the strength of our multi-modal voting approach, which combines the RGB and depth modalities effectively.

| Voting | AlexNet | DenseNet-121 | ResNet-101 |
|---|---|---|---|
| Avg Vote | 90.2 ± 1.3 | 92.9 ± 1.4 | 92.7 ± 1.6 |
| Weighted Vote | 90.2 ± 1.2 | 93.5 ± 1.0 | 93.8 ± 1.1 |
| Avg Vote | 90.6 ± 1.6 | 92.6 ± 1.4 | 93.0 ± 1.3 |
| Weighted Vote | 90.9 ± 1.3 | 93.5 ± 1.0 | 94.1 ± 1.0 |

TABLE III: Average accuracy performance of RGB-D (RGB + Depth) with different fusion combinations on Washington RGB-D dataset (%).

4.3 Object Recognition Performance

Fig. 16: Per-category average accuracy performances of ResNet101-RNN on Washington RGB-D Object dataset.

| Method | RGB | Depth | RGB-D |
|---|---|---|---|
| Kernel SVM [12] | 74.5 ± 3.1 | 64.7 ± 2.2 | 83.9 ± 3.5 |
| KDES [66] | 77.7 ± 1.9 | 78.8 ± 2.7 | 86.2 ± 2.1 |
| CNN-RNN [11] | 80.8 ± 4.2 | 78.9 ± 3.8 | 86.8 ± 3.3 |
| CaRFs [35] | - | - | 88.1 ± 2.4 |
| MMDL [24] | 74.6 ± 2.9 | 75.5 ± 2.7 | 86.9 ± 2.6 |
| Subset-RNN [60] | 82.8 ± 3.4 | 81.8 ± 2.6 | 88.5 ± 3.1 |
| CNN Features [2] | 83.1 ± 2.0 | - | 89.4 ± 1.3 |
| CNN-SPM-RNN [61] | 85.2 ± 1.2 | 83.6 ± 2.3 | 90.7 ± 1.1 |
| CFK [28] | 86.8 ± 2.7 | 85.8 ± 2.3 | 91.2 ± 1.4 |
| Fus-CNN [46] | 84.1 ± 2.7 | 83.8 ± 2.7 | 91.3 ± 1.4 |
| AlexNet-RNN [62] | 89.7 ± 1.7 | - | - |
| Fusion 2D/3D CNNs [27] | 89.0 ± 2.1 | 78.4 ± 2.4 | 91.8 ± 0.9 |
| STEM-CaRFs [36] | 88.8 ± 2.0 | 80.8 ± 2.1 | 92.2 ± 1.3 |
| MM-LRF-ELM [72] | 84.3 ± 3.2 | 82.9 ± 2.5 | 89.6 ± 2.5 |
| VGG_f-RNN [14] | 89.9 ± 1.6 | 84.0 ± 1.8 | 92.5 ± 1.2 |
| DECO [37] | 89.5 ± 1.6 | 84.0 ± 2.3 | 93.6 ± 0.9 |
| MDSI-CNN [45] | 89.9 ± 1.8 | 84.9 ± 1.7 | 92.8 ± 1.2 |
| HP-CNN [42] | 87.6 ± 2.2 | 85.0 ± 2.1 | 91.1 ± 1.4 |
| RCFusion [43] | 89.6 ± 2.2 | 85.9 ± 2.7 | 94.4 ± 1.4 |
| This work - AlexNet-RNN | 83.0 ± 1.9 | 84.1 ± 2.3 | 90.9 ± 1.3 |
| This work - DenseNet121-RNN | 91.5 ± 1.1 | 86.9 ± 2.1 | 93.5 ± 1.0 |
| This work - ResNet101-RNN | 92.3 ± 1.0 | 87.2 ± 2.5 | 94.1 ± 1.0 |

TABLE IV: Average accuracy comparison of our approach with the related methods on Washington RGB-D Object dataset (%). Red: Best result, Blue: Second best result, Green: Third best result.

Table IV shows average accuracy performance of our approach along with the state-of-the-art methods for object recognition on Washington RGB-D object benchmark. Our approach greatly improves the previous state-of-the-art results for both of RGB and depth modalities with a margin of and , respectively. As for the combined RGB-D results, our approach surpasses all the other methods except that of [43], which is slightly better than ours (). These results emphasize the importance of deep features in a unified framework based on the incorporation of CNNs and random RNNs.

We also present the average accuracy of individual object categories on the 10 evaluation splits of the Washington RGB-D Object dataset using the best-performing structure, ResNet101-RNN. As shown in Fig. 16, our approach recognizes most of the object categories with high accuracy. The categories with lower accuracy are mushroom, peach, and pitcher. The common reason for the lower performance in these categories seems to be their small number of instances. In particular, these categories have only instances, which is the minimum number for any category in the dataset. Considering the other categories with up to instances, this imbalance of the data may have biased the learning in favor of categories with more examples. Moreover, the accuracy of our combined RGB and depth results, based on the weighted confidences of the modalities, indicates that fusing RGB and depth data in this way can provide strong discrimination capability for object categories.

4.4 Scene Recognition Performance

To test the generalization ability of our approach, we also carry out a comparative analysis of our best-performing model, namely ResNet101-RNN, on the SUN RGB-D Scene [13] dataset for scene recognition, a more challenging scene understanding task. To this end, we first apply the pretrained ResNet101 model without finetuning, namely Fixed ResNet101-RNN, to both RGB and depth modalities. Then, we finetune the pretrained CNN model on the SUN RGB-D Scene dataset using the same hyper-parameters as in the object recognition task (see Sec. 4.2.6). The results of these experiments, together with the state-of-the-art results on this dataset, are reported in Table V. Our best system outperforms the state-of-the-art methods for all of the data types with impressive improvement of , , and for RGB, depth, and RGB-D, respectively, over the previous best performing results. It is worth mentioning that we use a CNN model pretrained on the object-centric ImageNet dataset [65], which is less commonly used for scene recognition than models pretrained on scene-centric datasets such as Places [73]. Nevertheless, our approach outperforms existing state-of-the-art methods for RGB-D scene recognition. Moreover, it is interesting that our system is already discriminative enough even with a fixed pretrained CNN model and achieves impressive accuracy. Contrary to our findings on the Washington RGB-D Object dataset, finetuning provides much better results not only for the depth domain but also for the RGB domain. This is what we expect, as scene recognition is a cross-domain task for our approach, whose backbone is a CNN model pretrained on the object-centric ImageNet. Specifically, finetuning on depth data boosts the accuracy greatly by providing both domain and modality adaptation.

| Method | RGB | Depth | RGB-D |
|---|---|---|---|
| Places CNN-Lin SVM [73] | 35.6 | 25.5 | 37.2 |
| Places CNN-RBF SVM [73] | 38.1 | 27.7 | 39.0 |
| SS-CNN-R6 [3] | 36.1 | - | 41.3 |
| DMFF [74] | 37.0 | - | 41.5 |
| Places CNN-RCNN [75] | 40.4 | 36.3 | 48.1 |
| MSMM [40] | 41.5 | 40.1 | 52.3 |
| RGB-D-CNN [76] | 42.7 | 42.4 | 52.4 |
| MDSI-CNN [45] | 39.6 | 35.2 | 45.2 |
| DF2Net [25] | - | - | 54.6 |
| HP-CNN-T [42] | 38.8 | 28.5 | 42.2 |
| RGB-D-OB [4] | - | 42.4 | 53.8 |
| G-L-SOOR [77] | 50.5 | 44.1 | 55.5 |
| This work - Fix ResNet101-RNN | 50.8 | 38.6 | 53.1 |
| This work - Finetuned ResNet101-RNN | 58.5 | 50.1 | 60.7 |

TABLE V: Accuracy comparison of our approach with the related methods on SUN RGB-D Scene dataset (%). Red: Best result, Blue: Second best result, Green: Third best result.
Fig. 17: RGB-D confusion matrix of ResNet101-RNN on SUN RGB-D Scene dataset (best viewed with magnification).

Fig. 17 shows the confusion matrix of our approach with fine-tuning over the categories of SUN RGB-D Scene dataset for RGB-D. The matrix demonstrates the degree of confusion between pairs of scene categories and implies the similarity between scenes on this dataset. The largest misclassification errors happen to be between extremely similar scene categories such as computer room - office, conference room-classroom, discussion area-rest space, lecture theatre-classroom, study space-classroom, lab-office, etc. In addition to the inter-class similarity, other reasons for poor performance might be intra-class variations of the scenes and lack of getting enough representative knowledge transfer from the ImageNet models.

Fig. 18: Top-5 RGB-D predictions of our system using sample test images of frequently confused scene categories on SUN RGB-D Scene dataset.

| Accuracy | RGB | Depth | RGB-D |
|---|---|---|---|
| top-1 | 58.5 | 50.1 | 60.7 |
| top-3 | 81.0 | 71.5 | 83.6 |
| top-5 | 88.5 | 80.9 | 89.9 |

TABLE VI: Scene recognition accuracy of top-1, top-3, and top-5 on SUN RGB-D Scene dataset (%).

To further analyse the performance of our system, we give top-3 and top-5 classification accuracy together with the top-1 results in Table VI. While the top-1 accuracy is the percentage of test images whose predicted class exactly matches the true class, the top-3 and top-5 accuracies indicate the percentage of test images whose true class is among the 3 and 5 top-ranked predictions, respectively. The top-3 and top-5 results demonstrate the effectiveness of our system more clearly, since they largely overcome the ambiguity among scene categories. Fig. 18 depicts some test examples of scene categories that are frequently confused with each other on the SUN RGB-D Scene dataset. As shown in the figure, these scene categories have similar appearances that make them hard to distinguish, even for a human expert, without sufficient contextual knowledge. Nevertheless, our approach is able to identify the scene category labels among the top-3 and top-5 predictions with high accuracy.
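The top-k measure used in Table VI can be computed as in the following plain-Python sketch. The function name and the example scores are hypothetical, and ties between classes are broken by class index, which is an arbitrary choice for this illustration.

```python
def top_k_accuracy(scores, labels, k):
    """Percentage of samples whose true label is among the k highest-scoring classes.

    scores: one list of per-class confidence scores per sample.
    labels: the true class index of each sample.
    """
    hits = 0
    for row, label in zip(scores, labels):
        # Class indices sorted from the highest to the lowest score.
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        if label in ranked[:k]:
            hits += 1
    return 100.0 * hits / len(labels)


# Hypothetical scores for three samples over three classes.
example_scores = [[0.1, 0.7, 0.2], [0.5, 0.3, 0.2], [0.2, 0.3, 0.5]]
example_labels = [1, 1, 0]
top1 = top_k_accuracy(example_scores, example_labels, k=1)
top2 = top_k_accuracy(example_scores, example_labels, k=2)
```

In this toy example only the first sample is correct at top-1, while the second sample is also counted at top-2 because its true class has the second-highest score.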

4.5 Discussion

Our framework presents an effective and efficient solution for deep feature extraction by integrating a pretrained CNN model with RNNs based on random weights. Randomization throughout our RNN-Stage raises the question of whether the results are stable enough. The carefully implemented experiments in Sec. 4.2.2 provide an empirical justification for the stability of random weights. On the other hand, our multi-level analysis shows that the best single-level performance comes from an intermediate level for all the models, with or without finetuning, for both RGB and depth modalities. The only exception is the finetuned DenseNet-121 model on depth data. This is an interesting finding, because one expects the final layers to have better representation capabilities, especially when finetuned models are used. Yet, as expected, performance generally increases from the first level to the last level throughout the networks when the underlying CNN models are finetuned. Since the Washington RGB-D Object [12] dataset includes a subset of the object categories in ImageNet [65], finetuning does not improve accuracy on RGB data. In contrast, the accuracy gain on depth data is significant due to the need for domain adaptation. This also shows that using an appropriate technique to handle depth data, as in our approach (Sec. 3.1), leads to an impressive performance improvement through knowledge transfer between modalities.

In this study, although we have explored different techniques for fusing representations of multiple levels to further increase classification accuracy, a single optimum level may actually be sufficient for many tasks. In this way, especially for tasks where computational time is more critical, results can be obtained much faster without sacrificing accuracy. Another point of interest is that the data imbalance in the Washington RGB-D Object dataset results in poor performance for the individual categories with fewer instances and consequently leads to a drop in the overall accuracy of the system. This imbalance might be overcome by applying data augmentation to the categories with fewer instances. However, it is worth noting that we do not perform any data augmentation in this study for either task.

The success of our approach in RGB-D scene recognition confirms the generalization ability of the proposed framework. Unlike in object recognition, when the underlying CNN models are finetuned, accuracy in both RGB and depth modalities increases significantly in the scene recognition task. This is due to the need for cross-domain task adaptation of the object-centric pretrained models. Therefore, findings similar to those in object recognition could be observed if scene-centric pretrained models (e.g., Places [73]) were employed for scene recognition. Moreover, such pretrained models could improve the results further within our framework. Another possible way to improve scene recognition is to embed contextual knowledge by jointly employing an attention mechanism such as [78] in our structure.

5 Conclusion

In this paper, we have presented a framework that incorporates pretrained CNN models together with multiple random recursive neural networks. The proposed approach greatly improves RGB-D object and scene recognition performance over the state-of-the-art results in the literature on the widely used Washington RGB-D Object and SUN RGB-D Scene datasets. The proposed randomized pooling scheme allows us to deal effectively with the high-dimensional activations of CNN models. The extensive experimental analysis of various parameters and setup properties shows that incorporating multiple random RNNs with a pretrained CNN model provides a robust and effective general solution for both RGB-D object and scene recognition tasks. Utilizing depth data by mapping it into an RGB-like image domain allows effective knowledge transfer from RGB pretrained CNN models. The generic design and the generalization capability of the proposed framework allow it to be used for other visual recognition tasks. Thus, we will release our code along with the models to the community in order to help future studies.


This paper is based on the results obtained from a project commissioned by the New Energy and Industrial Technology Development Organization (NEDO).


  • [1] A. Sharif Razavian, H. Azizpour, J. Sullivan, and S. Carlsson, “Cnn features off-the-shelf: an astounding baseline for recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition workshops, 2014, pp. 806–813.
  • [2] M. Schwarz, H. Schulz, and S. Behnke, “Rgb-d object recognition and pose estimation based on pre-trained convolutional neural network features,” in Robotics and Automation (ICRA), 2015 IEEE International Conference on.   IEEE, 2015, pp. 1329–1335.
  • [3] Y. Liao, S. Kodagoda, Y. Wang, L. Shi, and Y. Liu, “Understand scene categories by objects: A semantic regularized scene classifier using convolutional neural networks,” in 2016 IEEE international conference on robotics and automation (ICRA).   IEEE, 2016, pp. 2318–2325.
  • [4] X. Song, S. Jiang, L. Herranz, and C. Chen, “Learning effective rgb-d representations for scene recognition,” IEEE Transactions on Image Processing, vol. 28, no. 2, pp. 980–993, 2019.
  • [5] R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich feature hierarchies for accurate object detection and semantic segmentation,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2014, pp. 580–587.
  • [6] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun, “Overfeat: Integrated recognition, localization and detection using convolutional networks,” in International Conference on Learning Representations (ICLR), 2014.
  • [7] C. Farabet, C. Couprie, L. Najman, and Y. LeCun, “Learning hierarchical features for scene labeling,” IEEE transactions on pattern analysis and machine intelligence, vol. 35, no. 8, pp. 1915–1929, 2013.
  • [8] J. Yosinski, J. Clune, Y. Bengio, and H. Lipson, “How transferable are features in deep neural networks?” in Advances in neural information processing systems, 2014, pp. 3320–3328.
  • [9] B. Hariharan, P. Arbeláez, R. Girshick, and J. Malik, “Hypercolumns for object segmentation and fine-grained localization,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 447–456.
  • [10] H. F. Zaki, F. Shafait, and A. Mian, “Convolutional hypercube pyramid for accurate rgb-d object category and instance recognition,” in Robotics and Automation (ICRA), 2016 IEEE International Conference on.   IEEE, 2016, pp. 1685–1692.
  • [11] R. Socher, B. Huval, B. Bath, C. D. Manning, and A. Y. Ng, “Convolutional-recursive deep learning for 3d object classification,” in Advances in neural information processing systems, 2012, pp. 656–664.
  • [12] K. Lai, L. Bo, X. Ren, and D. Fox, “A large-scale hierarchical multi-view rgb-d object dataset,” in Robotics and Automation (ICRA), 2011 IEEE International Conference on.   IEEE, 2011, pp. 1817–1824.
  • [13] S. Song, S. P. Lichtenberg, and J. Xiao, “Sun rgb-d: A rgb-d scene understanding benchmark suite,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015.
  • [14] A. Caglayan and A. Burak Can, “Exploiting multi-layer features using a cnn-rnn approach for rgb-d object recognition,” in The European Conference on Computer Vision (ECCV) Workshops, September 2018.
  • [15] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in neural information processing systems, 2012, pp. 1097–1105.
  • [16] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” in International Conference on Learning Representations (ICLR), 2015.
  • [17] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
  • [18] G. Huang, Z. Liu, K. Q. Weinberger, and L. van der Maaten, “Densely connected convolutional networks,” in Proceedings of the IEEE conference on computer vision and pattern recognition, vol. 1, no. 2, 2017, p. 3.
  • [19] O. Guclu, A. Caglayan, and A. Burak Can, “Rgb-d indoor mapping using deep features,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, June 2019.
  • [20] L. Bo, X. Ren, and D. Fox, “Depth kernel descriptors for object recognition,” in 2011 IEEE/RSJ International Conference on Intelligent Robots and Systems.   IEEE, 2011, pp. 821–826.
  • [21] L. Bo, K. Lai, X. Ren, and D. Fox, “Object recognition with hierarchical kernel descriptors,” in Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on.   IEEE, 2011, pp. 1729–1736.
  • [22] S. Tang, X. Wang, X. Lv, T. X. Han, J. Keller, Z. He, M. Skubic, and S. Lao, “Histogram of oriented normal vectors for object recognition with a depth sensor,” in Asian conference on computer vision.   Springer, 2012, pp. 525–538.
  • [23] A. Wang, J. Cai, J. Lu, and T.-J. Cham, “Mmss: Multi-modal sharable and specific feature learning for rgb-d object recognition,” in The IEEE International Conference on Computer Vision (ICCV), December 2015.
  • [24] A. Wang, J. Lu, J. Cai, T. Cham, and G. Wang, “Large-margin multi-modal deep learning for rgb-d object recognition,” IEEE Transactions on Multimedia, vol. 17, no. 11, pp. 1887–1898, Nov 2015.
  • [25] Y. Li, J. Zhang, Y. Cheng, K. Huang, and T. Tan, “Df2net: Discriminative feature learning and fusion network for rgb-d indoor scene classification,” in Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
  • [26] M. M. Rahman, Y. Tan, J. Xue, and K. Lu, “Rgb-d object recognition with multimodal deep convolutional neural networks,” in 2017 IEEE International Conference on Multimedia and Expo (ICME).   IEEE, 2017, pp. 991–996.
  • [27] S. Zia, B. Yuksel, D. Yuret, and Y. Yemez, “Rgb-d object recognition using deep convolutional neural networks,” in 2017 IEEE International Conference on Computer Vision Workshop (ICCVW).   IEEE, 2017, pp. 887–894.
  • [28] Y. Cheng, R. Cai, X. Zhao, and K. Huang, “Convolutional fisher kernels for rgb-d object recognition,” in 3D Vision (3DV), 2015 International Conference on.   IEEE, 2015, pp. 135–143.
  • [29] L. Wang, S. Guo, W. Huang, Y. Xiong, and Y. Qiao, “Knowledge guided disambiguation for large-scale scene classification with multi-resolution cnns,” IEEE Transactions on Image Processing, vol. 26, no. 4, pp. 2055–2068, 2017.
  • [30] M. Oquab, L. Bottou, I. Laptev, and J. Sivic, “Learning and transferring mid-level image representations using convolutional neural networks,” in Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on.   IEEE, 2014, pp. 1717–1724.
  • [31] H. Azizpour, A. Sharif Razavian, J. Sullivan, A. Maki, and S. Carlsson, “From generic to specific deep representations for visual recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition workshops, 2015, pp. 36–45.
  • [32] H. Azizpour, A. S. Razavian, J. Sullivan, A. Maki, and S. Carlsson, “Factors of transferability for a generic convnet representation,” IEEE transactions on pattern analysis and machine intelligence, vol. 38, no. 9, pp. 1790–1802, 2015.
  • [33] S. Gupta, R. Girshick, P. Arbeláez, and J. Malik, “Learning rich features from rgb-d images for object detection and segmentation,” in European Conference on Computer Vision.   Springer, 2014, pp. 345–360.
  • [34] K. Chatfield, K. Simonyan, A. Vedaldi, and A. Zisserman, “Return of the devil in the details: Delving deep into convolutional nets,” in British Machine Vision Conference (BMVC), 2014.
  • [35]

    U. Asif, M. Bennamoun, and F. Sohel, “Efficient rgb-d object categorization using cascaded ensembles of randomized decision trees,” in

    2015 IEEE International Conference on Robotics and Automation (ICRA).   IEEE, 2015, pp. 1295–1302.
  • [36] U. Asif, M. Bennamoun, and F. A. Sohel, “Rgb-d object recognition and grasp detection using hierarchical cascaded forests,” IEEE Transactions on Robotics, vol. 33, no. 3, pp. 547–564, 2017.
  • [37] F. M. Carlucci, P. Russo, and B. Caputo, “(de)co: Deep depth colorization,” IEEE Robotics and Automation Letters, vol. 3, no. 3, pp. 2386–2393, 2018.
  • [38] L. Liu, C. Shen, and A. van den Hengel, “The treasure beneath convolutional layers: Cross-convolutional-layer pooling for image classification,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 4749–4757.
  • [39] H. F. Zaki, F. Shafait, and A. Mian, “Learning a deeply supervised multi-modal rgb-d embedding for semantic scene and object category recognition,” Robotics and Autonomous Systems, vol. 92, pp. 41–52, 2017.
  • [40] X. Song, S. Jiang, and L. Herranz, “Combining models from multiple sources for rgb-d scene recognition,” in Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, (IJCAI-17), 2017, pp. 4523–4529. [Online]. Available: https://doi.org/10.24963/ijcai.2017/631
  • [41] S. Yang and D. Ramanan, “Multi-scale recognition with dag-cnns,” in Proceedings of the IEEE international conference on computer vision, 2015, pp. 1215–1223.
  • [42] H. F. Zaki, F. Shafait, and A. Mian, “Viewpoint invariant semantic object and scene categorization with rgb-d sensors,” Autonomous Robots, vol. 43, no. 4, pp. 1005–1022, 2019.
  • [43] M. R. Loghmani, M. Planamente, B. Caputo, and M. Vincze, “Recurrent convolutional fusion for rgb-d object recognition,” IEEE Robotics and Automation Letters, vol. 4, no. 3, pp. 2878–2885, 2019.
  • [44] K. Cho, B. Van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio, “Learning phrase representations using rnn encoder-decoder for statistical machine translation,” in

    Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP)

    , 2014, pp. 1724–1734.
  • [45] U. Asif, M. Bennamoun, and F. A. Sohel, “A multi-modal, discriminative and spatially invariant cnn for rgb-d object labeling,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 40, no. 9, pp. 2051–2065, 2018.
  • [46] A. Eitel, J. T. Springenberg, L. Spinello, M. Riedmiller, and W. Burgard, “Multimodal deep learning for robust rgb-d object recognition,” in Intelligent Robots and Systems (IROS), 2015 IEEE/RSJ International Conference on.   IEEE, 2015, pp. 681–687.
  • [47] L. Tang, Z.-X. Yang, and K. Jia, “Canonical correlation analysis regularization: An effective deep multiview learning baseline for rgb-d object recognition,” IEEE Transactions on Cognitive and Developmental Systems, vol. 11, no. 1, pp. 107–118, 2019.
  • [48]

    W. F. Schmidt, M. A. Kraaijveld, and R. P. Duin, “Feed forward neural networks with random weights,” in

    Proceedings of the 11th IAPR International Conference on Pattern Recognition.   IEEE, 1992, pp. 1–4.
  • [49] Y.-H. Pao and Y. Takefuji, “Functional-link net computing: theory, system architecture, and functionalities,” Computer, vol. 25, no. 5, pp. 76–79, 1992.
  • [50] Y.-H. Pao, G.-H. Park, and D. J. Sobajic, “Learning and generalization characteristics of the random vector functional-link net,” Neurocomputing, vol. 6, no. 2, pp. 163–180, 1994.
  • [51] B. Igelnik and Y.-H. Pao, “Stochastic choice of basis functions in adaptive function approximation and the functional-link net,” IEEE Transactions on Neural Networks, vol. 6, no. 6, pp. 1320–1329, 1995.
  • [52] G.-B. Huang, L. Chen, C. K. Siew et al., “Universal approximation using incremental constructive feedforward networks with random hidden nodes,” IEEE Trans. Neural Networks, vol. 17, no. 4, pp. 879–892, 2006.
  • [53] A. Rahimi and B. Recht, “Random features for large-scale kernel machines,” in Advances in neural information processing systems, 2008, pp. 1177–1184.
  • [54] A. Rahimi and B. Recht, “Weighted sums of random kitchen sinks: Replacing minimization with randomization in learning,” in Advances in neural information processing systems, 2009, pp. 1313–1320.
  • [55]

    J. B. Pollack, “Recursive distributed representations,”

    Artificial Intelligence, vol. 46, no. 1-2, pp. 77–105, 1990.
  • [56] G. E. Hinton, “Mapping part-whole hierarchies into connectionist networks,” Artificial Intelligence, vol. 46, no. 1-2, pp. 47–75, 1990.
  • [57] R. Socher, C. C. Lin, C. Manning, and A. Y. Ng, “Parsing natural scenes and natural language with recursive neural networks,” in

    Proceedings of the 28th international conference on machine learning (ICML-11)

    , 2011, pp. 129–136.
  • [58] J. Kim, J. Kwon Lee, and K. Mu Lee, “Deeply-recursive convolutional network for image super-resolution,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 1637–1645.
  • [59] A. Sharma, O. Tuzel, and M.-Y. Liu, “Recursive context propagation network for semantic scene labeling,” in Advances in Neural Information Processing Systems, 2014, pp. 2447–2455.
  • [60] J. Bai, Y. Wu, J. Zhang, and F. Chen, “Subset based deep learning for rgb-d object recognition,” Neurocomputing, vol. 165, pp. 280–292, 2015.
  • [61] Y. Cheng, X. Zhao, K. Huang, and T. Tan, “Semi-supervised learning and feature evaluation for rgb-d object recognition,” Computer Vision and Image Understanding, vol. 139, pp. 149–160, 2015.
  • [62] H. M. Bui, M. Lech, E. Cheng, K. Neville, and I. S. Burnett, “Object recognition using deep convolutional features transformed by a recursive network structure,” IEEE Access, vol. 4, pp. 10 059–10 066, 2016.
  • [63] M. D. Zeiler and R. Fergus, “Stochastic pooling for regularization of deep convolutional neural networks,” in International Conference on Learning Representations (ICLR), 2013.
  • [64] K. Jarrett, K. Kavukcuoglu, M. Ranzato, and Y. LeCun, “What is the best multi-stage architecture for object recognition?” in 2009 IEEE 12th international conference on computer vision.   IEEE, 2009, pp. 2146–2153.
  • [65] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “Imagenet: A large-scale hierarchical image database,” in Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on.   IEEE, 2009, pp. 248–255.
  • [66] L. Bo, X. Ren, and D. Fox, “Depth kernel descriptors for object recognition,” in 2011 IEEE/RSJ International Conference on Intelligent Robots and Systems.   IEEE, 2011, pp. 821–826.
  • [67] A. Vedaldi and K. Lenc, “Matconvnet: Convolutional neural networks for matlab,” in Proceedings of the 23rd ACM international conference on Multimedia.   ACM, 2015, pp. 689–692.
  • [68] A. M. Saxe, P. W. Koh, Z. Chen, M. Bhand, B. Suresh, and A. Y. Ng, “On random weights and unsupervised feature learning.” in ICML, vol. 2, no. 3, 2011, p. 6.
  • [69] M. Rigotti, D. D. Ben Dayan Rubin, X.-J. Wang, and S. Fusi, “Internal representation of task rules by recurrent dynamics: the importance of the diversity of neural responses,” Frontiers in computational neuroscience, vol. 4, p. 24, 2010.
  • [70] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg et al., “Scikit-learn: Machine learning in python,” Journal of machine learning research, vol. 12, no. Oct, pp. 2825–2830, 2011.
  • [71] M. D. Zeiler and R. Fergus, “Visualizing and understanding convolutional networks,” in European conference on computer vision.   Springer, 2014, pp. 818–833.
  • [72] H. Liu, F. Li, X. Xu, and F. Sun, “Multi-modal local receptive field extreme learning machine for object recognition,” Neurocomputing, vol. 277, pp. 4–11, 2018.
  • [73] B. Zhou, A. Lapedriza, J. Xiao, A. Torralba, and A. Oliva, “Learning deep features for scene recognition using places database,” in Advances in neural information processing systems, 2014, pp. 487–495.
  • [74] H. Zhu, J.-B. Weibel, and S. Lu, “Discriminative multi-modal feature fusion for rgbd indoor scene recognition,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
  • [75] A. Wang, J. Cai, J. Lu, and T.-J. Cham, “Modality and component aware feature fusion for rgb-d scene classification,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
  • [76] X. Song, L. Herranz, and S. Jiang, “Depth cnns for rgb-d scene recognition: Learning from scratch better than transferring from rgb-cnns,” in Thirty-First AAAI Conference on Artificial Intelligence, 2017.
  • [77] X. Song, S. Jiang, B. Wang, C. Chen, and G. Chen, “Image representations with spatial object-to-object relations for rgb-d scene recognition,” IEEE Transactions on Image Processing, vol. 29, pp. 525–537, 2020.
  • [78] H. Fukui, T. Hirakawa, T. Yamashita, and H. Fujiyoshi, “Attention branch network: Learning of attention mechanism for visual explanation,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.