I Introduction
Data comes from different sources and in different forms: images, videos, texts, and audio. Each modality may complement the others in information content; thus, multiple data modalities are usually more informative for a task than a single one. With the wide availability of electronic and digital multimedia devices, huge volumes of multimodal content are generated on the Internet daily. However, for real-world applications, these modalities must first be well integrated and appropriately fused to yield more comprehensive information. Many methods have been developed in multimodal learning to exploit both the distinct characteristics of and the shared relationships between different data modalities in order to perform various tasks [65, 10, 63, 29, 43, 2].
One of the key milestones in developing many vision-based applications is scene understanding. It is mainly applied to understand the contents of an image or video prior to performing the target task (e.g., large-scale video retrieval [29]). In particular, scene labeling (i.e., semantic segmentation or image parsing) is a crucial part of understanding an outdoor or indoor scene image. The task is to classify each pixel of an input scene image into its semantic category (belonging to some object or stuff). In this paper, we tackle the problem of RGB-D indoor scene labeling, where we process two different data modalities: RGB color channels and depth planes. Indoor RGB-D scene labeling is one of the most challenging visual classification problems [55, 56]. Many applications are built on understanding the surrounding scene, e.g., social behavior understanding [49] and object detection and recognition [70].
This problem is usually addressed as a multimodal learning problem, where the task is to exploit and fuse both the RGB and depth modalities to better label each pixel. Depth planes provide informative representations where the RGB representations are ambiguous. For example, Figure 1 shows how depth information can help distinguish locations with similar appearance in the RGB image.
Scene labeling in general is a challenging classification problem, since a scene image tends to contain multiple cluttered objects. These objects may also vary due to factors affecting their appearance and geometry in the image. One key useful strategy is to leverage the neighborhood/contextual information of each pixel within each modality [15, 41, 1, 3]. Typically, the feature representation of a pixel is extracted from a local patch (cropped from the scene image) containing that target pixel and used for classification. Long-range/global contextual information (distant image patches) is important as well for local pixel classification. However, both local and global contextual information should be utilized adequately to maintain a good balance between the discriminative features and the abstract/top-level features of the pixel's feature representation.
Recently, Recurrent Neural Networks (RNNs) have been shown to be very successful at encoding contextual information into local feature representations [72, 54, 6]. Recurrent models have feedback connections so that the current state is engaged in the calculation of the future state. Thus, RNNs are effectively used in tasks that require modeling long- and short-range dependencies within the input sequence, e.g., speech recognition and natural language processing [21, 32, 19, 18]. We use RNNs to model the contextual information within each modality. Traditionally, however, an RNN is only used for single-modality signals. In this paper, we introduce a new multimodal RNN method that encodes contextual information into local representations from multimodal RGB-D data simultaneously. In our work, we first train Convolutional Neural Networks (CNNs) [37] to extract features from local RGB-D image patches (from both the RGB images and the depth planes). These convolutional local features form the input to our multimodal RNNs, which further contextualize them and select informative patterns across the modalities. Our model can be easily extended to perform prediction tasks with more modalities. Our new multimodal RNN method is built on the basic quad-directional 2D-RNN structure [18, 54]. A quad-directional 2D-RNN contains four hidden states, each dedicated to traversing the image in one of four possible 2D directions (top-left, top-right, bottom-left and bottom-right).
To process the two modalities, the RGB image and the depth plane, our model has two RNNs, each assigned to learn the representations of a single input modality. To connect the two modality RNNs and allow information fusion, we develop information transfer layers that cross-connect the RNNs. The transfer layers are responsible for learning to select and transfer the relevant patterns from one modality to the other, and vice versa. Concretely, for every patch in the input image, while learning the RNN hidden representations for one modality, our method not only encodes the contextual information within that modality, but also learns to encode relevant contextual patterns from the other modality. As a result, our method can learn powerful context-aware and multimodal features for local pixel representations.
Our method is different from existing deep multimodal learning methods [45, 59, 58, 38]. They usually concatenate the inputs at the beginning, or concatenate learned mid-level features, to extract high-level common features as the representation of the multimodal data. These methods mainly focus on discovering common patterns between different modalities. Although common patterns are important to extract, these methods are prone to missing important modality-specific information that is highly discriminative within a single modality, e.g., texture patterns inside the RGB channels. In contrast, our model retains modality-specific information by assigning one RNN to learn features from each modality, while still allowing information sharing between modalities through the information transfer layers, which adaptively transfer relevant cross-modality patterns.
Our model is trained end-to-end, and the transfer layers learn to extract relevant cross-modality information for each patch of the image. We perform experiments on two popular RGB-D benchmarks, NYU V1 and V2, and achieve performance comparable with other state-of-the-art methods. Additionally, the proposed method significantly outperforms its counterpart baselines (e.g., concatenation of the RGB-D data as the input of RNN models).
The remainder of the paper is organized as follows: We first discuss the related work in Section II. Our proposed model, alongside the framework details, is presented in Section III. Experiments on popular RGB-D benchmarks and the results are demonstrated in Section IV. Finally, we conclude the paper in Section V.
II Related Work
Because this work is mainly related to RGB-D scene labeling and RNNs, we briefly review the most recent literature.
Indoor RGB-D Scene Labeling: Many papers propose different methods to solve RGB-D scene labeling [55, 56, 22, 8, 51, 42]. The most popular indoor scene datasets are the NYU depth datasets V1 [55] and V2 [56]. The initial results [55] were generated by extracting SIFT features from the RGB color images in addition to depth maps. Their results show that depth information can improve prediction performance.
Further improvements on NYU V1 are made by the work of [51], which adapts a framework of kernel descriptors that converts local similarities (kernels) into patch descriptors. They further use a superpixel Markov Random Field (MRF) and a segmentation tree for contextual modeling. Other works [8, 11] explore depth information through feature learning: [8] learns features using a convolutional neural network on four channels, three from the RGB image and a fourth from the depth image. Wang et al. [64] adopt an unsupervised feature learning approach, performing learning and encoding simultaneously to boost performance. Another interesting work, by Tang et al. [61], proposes a new feature vector called the Histogram of Oriented Normal Vectors (HONV), designed specifically to capture local geometric properties for object recognition with a depth sensor.
Meanwhile, some works in RGB image parsing utilize label dependency modeling approaches, such as graphical models (Conditional Random Fields, CRFs [25]), while others perform feature learning to generate hierarchical multiscale features capable of capturing the input context, successfully applied with the aid of deep convolutional neural networks [14, 48]. Our multimodal recurrent neural networks capture short- and long-range context both within and between the input modalities.
Multimodal Learning: In contrast to single-view approaches, most multimodal learning methods introduce a separate function to model each data modality and jointly optimize them together [66, 67, 4, 69]. Deep networks have been applied to learn features over multiple views [45, 59, 58, 38]. Multimodal learning improves classification performance by exploiting the sharable knowledge of the data across modalities.
Figure 2 shows the general way multimodal learning is applied. The key idea is to allow fusion/sharing between the modalities at some point, so that the joint space can capture the relationships between the input modalities. Typical methods in the literature can be categorized by whether they combine at the feature level [9, 50, 62] or at the classifier level [52, 40]. Some works perform preprocessing (module-level) steps on the modalities to generate helpful cues before learning, as in [16, 13, 39]. Fusion can be done by combining all features into one high-dimensional vector, or by jointly training multiple classifiers to maximize the mutual agreement between distinct modalities of the input data [30, 24]. These methods employ typical regularizations to explore shared visual knowledge, such as group structured sparsity [36].
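To make the distinction concrete, here is a minimal NumPy sketch of feature-level ("early") versus classifier-level ("late") fusion for two modalities; the averaging rule in `late_fusion` is one common choice for illustration, not the specific scheme of any cited work:

```python
import numpy as np

def early_fusion(f_rgb, f_depth):
    """Feature-level fusion: concatenate per-patch features into one
    higher-dimensional vector, then train a single classifier on it."""
    return np.concatenate([f_rgb, f_depth], axis=-1)

def late_fusion(scores_rgb, scores_depth):
    """Classifier-level fusion: combine the class likelihoods of two
    separately trained models (here, by simple averaging)."""
    return 0.5 * (scores_rgb + scores_depth)

f = early_fusion(np.ones(64), np.zeros(64))          # 128-dim fused feature
s = late_fusion(np.array([0.9, 0.1]), np.array([0.5, 0.5]))
```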
We can also group multimodal methods according to their training procedures into three main categories: co-training [66, 69], multiple kernel learning [67, 69], and subspace learning [31, 4, 68, 69]. Co-training algorithms train alternately to maximize the mutual agreement on two distinct views of the data. Multiple kernel learning approaches improve performance by exploiting different types of kernels that correspond naturally to different views, combining these kernels either linearly or nonlinearly. Meanwhile, subspace learning methods share the goal of obtaining a latent subspace common to multiple modalities. Another interesting line of work is proposed by Gupta et al. [53], who transfer supervision between images from different modalities; their model learns representations for unlabeled modalities and can serve as a pre-training procedure for new modalities with limited labeled data. In our work, different from all previous methods, we introduce information transfer layers between two RNN modalities to perform the multimodal learning task simultaneously.
Recurrent Neural Networks (RNNs): A recurrent model is a model whose connections between units form a directed cycle, for example, a feedback connection through which the current state is engaged in the calculation of the future state. This structure creates an internal state for the network, which then allows it to exhibit dynamic temporal behavior. RNNs are effectively used in tasks that require sequence modeling, such as speech recognition, handwriting recognition, and other natural language processing tasks [21, 32, 19, 18].
One major drawback of the standard RNN is the vanishing gradient problem [27]. This drawback limits the context range of the input data, because the model's capacity to capture long dependencies is limited. To address this problem, Hochreiter and Schmidhuber [28] propose Long Short-Term Memory (LSTM), which treats the hidden layer as multiple recurrently connected subnets, known as memory blocks, thus allowing the network to store and access information over long periods of time. Graves et al. extend the unidirectional LSTM network into bidirectional networks, which show good improvements over unidirectional ones [20, 17]. Graves et al. also extend the one-dimensional RNN into a multidimensional one [18, 21], as shown in Figure 3. The key idea is to replace the single recurrent connection of the 1D-RNN with as many recurrent connections as there are dimensions in the input data. Another interesting work from Graves et al. investigates deep structures of RNNs [19], which have also been successfully applied to opinion mining by Irsoy et al. [32]. In our work, we evaluate both deep and LSTM structures alongside the basic quad-directional 2D-RNN. We found no significant difference in label prediction performance after stacking multiple layers of the network or adding LSTM units to the basic quad-directional 2D-RNN. Thus, we only report the results of our multimodal RNN model using the basic quad-directional 2D-RNN.
III Model Framework
We first extract convolutional features from local RGB-D patches using our trained CNN models. Then, our multimodal RNNs further learn context-aware and multimodal features based on the convolutional features. Afterwards, a softmax classifier is trained to classify each patch into its semantic category. Different from a traditional single-modality RNN, our multimodal RNNs also have transfer layers that learn to extract relevant contextual information across both modalities at each time step. Below, we first introduce traditional RNNs.
1D-RNN and 2D-RNN: The popular Elman-type 1D-RNN [12] and its 2D version are designed to capture the dynamic behavior of a signal over time, so that the hidden representations capture the contextual information from the first time step up to the current one. Its forward pass is formulated as follows:
$$h_{t} = f(Ux_{t} + Wh_{t-1} + b), \qquad y_{t} = g(Vh_{t} + c) \tag{1}$$

where $x_{t}$, $y_{t}$ and $h_{t}$ are the input, output and hidden neurons at time $t$, respectively. The functions $f$ and $g$ are element-wise nonlinear functions with bias terms $b$ and $c$, and the matrices $U$, $W$ and $V$ are the input-to-hidden, hidden-to-hidden and hidden-to-output weights, respectively. The 2D-RNN [18, 72, 54] generalizes the 1D-RNN so that data propagation comes from two-dimensional neighbors instead of one; the formulation becomes:

$$h_{i,j} = f\big(Ux_{i,j} + W(h_{i-1,j} + h_{i,j-1}) + b\big), \qquad y_{i,j} = g(Vh_{i,j} + c) \tag{2}$$

The propagation is now over a 2D plane, starting from the top-left region and flowing until the end of the 2D sequence (in our case, the sequence of image patches), where $(i,j)$ denotes the location of pixels or patches.
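To make the recurrence concrete, the following is a minimal NumPy sketch of the Elman 1D forward pass in Equation 1, with a tanh hidden activation and a softmax output; all names, shapes and initializations here are illustrative, not the paper's implementation:

```python
import numpy as np

def elman_forward(x_seq, U, W, V, b, c):
    """Elman 1D-RNN forward pass: h_t = f(U x_t + W h_{t-1} + b),
    y_t = g(V h_t + c), with f = tanh and g = softmax.
    Shapes: x_seq (T, d_in), U (d_h, d_in), W (d_h, d_h), V (K, d_h)."""
    T_steps = x_seq.shape[0]
    h = np.zeros(U.shape[0])
    hs, ys = [], []
    for t in range(T_steps):
        h = np.tanh(U @ x_seq[t] + W @ h + b)   # hidden state carries the context
        z = V @ h + c
        y = np.exp(z - z.max())
        y /= y.sum()                            # softmax class likelihoods
        hs.append(h)
        ys.append(y)
    return np.stack(hs), np.stack(ys)

rng = np.random.default_rng(0)
T_steps, d_in, d_h, K = 5, 8, 16, 4
hs, ys = elman_forward(rng.normal(size=(T_steps, d_in)),
                       rng.normal(size=(d_h, d_in)) * 0.1,
                       rng.normal(size=(d_h, d_h)) * 0.1,
                       rng.normal(size=(K, d_h)) * 0.1,
                       np.zeros(d_h), np.zeros(K))
```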
We first show a conceptual illustration of our proposed model using 1D-RNNs in Figure 5. Concretely, in this paper we adapt the 2D-RNN to learn hidden representations for local RGB-D image patches. Figure 4 shows one scan direction of the 2D-RNN, covering only the top-left sequence. Scanning the image in only one direction leaves some patches in the top-left sequence uninformed of the contextual information from the bottom-right patches during the forward pass on test images. Following [54], we approximate images using four directional 2D sequences of patches; the other three directions are the top-right, bottom-left and bottom-right sequences. We combine the features learned from these four directions to obtain the final features of the image patches.
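The four scan orders can be sketched as follows; this is an illustrative enumeration of patch coordinates for an H-by-W grid of patches (names are ours):

```python
import numpy as np

def quad_scan_orders(H, W):
    """Return the four patch-visit orders used by a quad-directional 2D-RNN.
    Each order starts at one image corner, so that combined, every patch
    receives context from all four sides of the image."""
    grid = [(i, j) for i in range(H) for j in range(W)]
    return {
        "top_left":     grid,                                   # rows top->bottom, cols left->right
        "top_right":    [(i, W - 1 - j) for i, j in grid],      # mirror columns
        "bottom_left":  [(H - 1 - i, j) for i, j in grid],      # mirror rows
        "bottom_right": [(H - 1 - i, W - 1 - j) for i, j in grid],
    }

orders = quad_scan_orders(2, 3)
```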
III-A Multimodal RNNs via Information Transfer Layers
Traditional RNNs are developed to model single-modality signals. In this work, we extend RNNs to represent the RGB-D signal with our multimodal RNNs, where we have a pair of single RNNs, each assigned to process one modality (either RGB or depth). In addition, we propose transfer layers that connect the hidden planes of one RNN model to those of the other, and vice versa. These transfer layers learn to adaptively extract relevant patterns from one modality to enhance the feature representations of the other. If the Depth-RNN is processing the depth patch at location $(i,j)$ in the sequence, the RGB-RNN is processing the corresponding RGB patch at the same location. The Depth-RNN is also fed with the processed hidden-state values obtained from the RGB-RNN, and vice versa. The RGB-D data flows concurrently through both models with synchronized internal processing clocks, so that at each time step we process a pair of RGB and depth local patches simultaneously.
The architecture of our model is summarized in Figure 6. It is an end-to-end learning framework: recurrent layers and transfer layers are automatically learned to maximize the labeling performance on the training RGB-D data. Compared to the baseline that concatenates RGB-D data and thus mixes the multimodal information (both relevant and irrelevant), our method retains modality-specific information and shares only the relevant cross-modality information.
Given one 2D direction in the RNN (the first hidden plane of the quad-directional hidden planes), as shown in Figure 6, the current state of the network being processed at $(i,j)$ depends on four main previous states: two obtained from the network itself, and two obtained through the transfer layers from the previous states processed in the other modality's network, in addition to the input patch features from either the RGB or the depth image at location $(i,j)$. Both networks are synchronized and process the input modalities simultaneously.
Given an RGB image $I^{c}$ (where $c$ refers to 'color'), processed by the RGB-RNN, and a depth image $I^{d}$ (where $d$ refers to 'depth'), processed by the Depth-RNN, we first extract multiple patches from the images and generate their corresponding convolutional feature vectors to form the input to our multimodal RNN model. We denote the corresponding convolutional feature vector as $x^{c}_{i,j}$ or $x^{d}_{i,j}$ for each patch in $I^{c}$ or $I^{d}$. Concretely, the forward propagation formulation for one hidden plane out of the four (quad) directional hidden planes is as follows:
$$
\begin{aligned}
h^{c}_{i,j} &= f\big(U^{c}x^{c}_{i,j} + W^{c}(h^{c}_{i-1,j} + h^{c}_{i,j-1}) + T^{dc}(h^{d}_{i-1,j} + h^{d}_{i,j-1}) + b^{c}\big)\\
h^{d}_{i,j} &= f\big(U^{d}x^{d}_{i,j} + W^{d}(h^{d}_{i-1,j} + h^{d}_{i,j-1}) + T^{cd}(h^{c}_{i-1,j} + h^{c}_{i,j-1}) + b^{d}\big)\\
y^{c}_{i,j} &= g\big(V^{c}h^{c}_{i,j} + b^{c}_{y}\big), \qquad y^{d}_{i,j} = g\big(V^{d}h^{d}_{i,j} + b^{d}_{y}\big)
\end{aligned} \tag{3}
$$
where $x^{c}_{i,j}$ is the feature vector of the patch at location $(i,j)$ in the RGB image and $x^{d}_{i,j}$ is the feature vector of the patch at location $(i,j)$ in the depth image. $h^{c}_{i,j}$ and $h^{d}_{i,j}$ are the hidden states inside the RGB-RNN and the Depth-RNN, respectively. The weight matrices $U^{c}$ and $U^{d}$ are responsible for the input-to-hidden mapping in the RGB and depth modalities, respectively. On the other hand, there are two types of hidden-to-hidden transformation matrices: within-modality and across-modality. $W^{c}$ and $W^{d}$ are the within-modality hidden-to-hidden mappings inside the RGB-RNN and the Depth-RNN, respectively. Meanwhile, $T^{dc}$ is a transformation weight matrix that transforms features from the depth hidden states to the RGB hidden states (from the Depth-RNN to the RGB-RNN), and $T^{cd}$ transforms features from the RGB hidden states to the depth hidden states (from the RGB-RNN to the Depth-RNN); i.e., these weight matrices act as the information transfer layers that cross-connect the RGB and depth hidden states. Finally, $V^{c}$ and $V^{d}$ are the hidden-to-output transformation matrices of the corresponding modalities.
$T^{dc}$ and $T^{cd}$ are learned to extract shared patterns between the modalities. Notice that the weight matrix that transforms from the top-side hidden state in the RGB-RNN and the one that transforms from the left-side hidden state are shared: both are $W^{c}$. Similarly, $W^{d}$ in the Depth-RNN is shared (transforming from $h^{d}_{i-1,j}$ and from $h^{d}_{i,j-1}$). The same holds for the transfer layers: $T^{dc}$ transforms from $h^{d}_{i-1,j}$ and from $h^{d}_{i,j-1}$, while $T^{cd}$ transforms from $h^{c}_{i-1,j}$ and from $h^{c}_{i,j-1}$. The nonlinear function $f$ is a ReLU in our implementation, the function $g$ is the typical softmax, and $b^{c}$, $b^{d}$, $b^{c}_{y}$ and $b^{d}_{y}$ are biases.
Since we adopt the quad directions, the forward pass for the remaining directions beside the top-left sequence (i.e., top-right, bottom-left and bottom-right) is similar to Equation 3. To facilitate readability and easily distinguish all quad/four directions, we use arrow notation: $\searrow$ refers to the top-left processing sequence, $\swarrow$ indicates the top-right sequence, $\nearrow$ is the bottom-left, and $\nwarrow$ is the bottom-right sequence. Now, the full forward propagation pass in the RGB-RNN becomes:
$$
\begin{aligned}
h^{c,\searrow}_{i,j} &= f\big(U^{c,\searrow}x^{c}_{i,j} + W^{c,\searrow}(h^{c,\searrow}_{i-1,j}+h^{c,\searrow}_{i,j-1}) + T^{dc,\searrow}(h^{d,\searrow}_{i-1,j}+h^{d,\searrow}_{i,j-1}) + b^{c,\searrow}\big)\\
h^{c,\swarrow}_{i,j} &= f\big(U^{c,\swarrow}x^{c}_{i,j} + W^{c,\swarrow}(h^{c,\swarrow}_{i-1,j}+h^{c,\swarrow}_{i,j+1}) + T^{dc,\swarrow}(h^{d,\swarrow}_{i-1,j}+h^{d,\swarrow}_{i,j+1}) + b^{c,\swarrow}\big)\\
h^{c,\nearrow}_{i,j} &= f\big(U^{c,\nearrow}x^{c}_{i,j} + W^{c,\nearrow}(h^{c,\nearrow}_{i+1,j}+h^{c,\nearrow}_{i,j-1}) + T^{dc,\nearrow}(h^{d,\nearrow}_{i+1,j}+h^{d,\nearrow}_{i,j-1}) + b^{c,\nearrow}\big)\\
h^{c,\nwarrow}_{i,j} &= f\big(U^{c,\nwarrow}x^{c}_{i,j} + W^{c,\nwarrow}(h^{c,\nwarrow}_{i+1,j}+h^{c,\nwarrow}_{i,j+1}) + T^{dc,\nwarrow}(h^{d,\nwarrow}_{i+1,j}+h^{d,\nwarrow}_{i,j+1}) + b^{c,\nwarrow}\big)\\
y^{c}_{i,j} &= g\big(V^{c,\searrow}h^{c,\searrow}_{i,j} + V^{c,\swarrow}h^{c,\swarrow}_{i,j} + V^{c,\nearrow}h^{c,\nearrow}_{i,j} + V^{c,\nwarrow}h^{c,\nwarrow}_{i,j} + b^{c}_{y}\big)
\end{aligned} \tag{4}
$$
As mentioned, we use the quad-directional 2D-RNN to approximate each image. The arrows indicate the quad directions (top-left $\searrow$, top-right $\swarrow$, bottom-left $\nearrow$ and bottom-right $\nwarrow$), and $h^{c,\searrow}$, $h^{c,\swarrow}$, $h^{c,\nearrow}$ and $h^{c,\nwarrow}$ are the corresponding four hidden planes. Each hidden plane has its own weight matrices besides the introduced transfer layers, e.g., $U^{c,\searrow}$ (input-to-hidden mapping), $W^{c,\searrow}$ (hidden-to-hidden mapping), $T^{dc,\searrow}$ (the transfer layer from the Depth-RNN hidden plane to the RGB-RNN hidden plane), $V^{c,\searrow}$ (hidden-to-output mapping), and the bias term $b^{c,\searrow}$. Note that the connection between each Depth-RNN hidden plane and the corresponding RGB-RNN hidden plane is weighted by the corresponding transfer matrix $T^{dc}$, which is learned to extract shared patterns between the modalities.
Simultaneously, the forward pass in the Depth-RNN model is as follows:
$$
\begin{aligned}
h^{d,\searrow}_{i,j} &= f\big(U^{d,\searrow}x^{d}_{i,j} + W^{d,\searrow}(h^{d,\searrow}_{i-1,j}+h^{d,\searrow}_{i,j-1}) + T^{cd,\searrow}(h^{c,\searrow}_{i-1,j}+h^{c,\searrow}_{i,j-1}) + b^{d,\searrow}\big)\\
h^{d,\swarrow}_{i,j} &= f\big(U^{d,\swarrow}x^{d}_{i,j} + W^{d,\swarrow}(h^{d,\swarrow}_{i-1,j}+h^{d,\swarrow}_{i,j+1}) + T^{cd,\swarrow}(h^{c,\swarrow}_{i-1,j}+h^{c,\swarrow}_{i,j+1}) + b^{d,\swarrow}\big)\\
h^{d,\nearrow}_{i,j} &= f\big(U^{d,\nearrow}x^{d}_{i,j} + W^{d,\nearrow}(h^{d,\nearrow}_{i+1,j}+h^{d,\nearrow}_{i,j-1}) + T^{cd,\nearrow}(h^{c,\nearrow}_{i+1,j}+h^{c,\nearrow}_{i,j-1}) + b^{d,\nearrow}\big)\\
h^{d,\nwarrow}_{i,j} &= f\big(U^{d,\nwarrow}x^{d}_{i,j} + W^{d,\nwarrow}(h^{d,\nwarrow}_{i+1,j}+h^{d,\nwarrow}_{i,j+1}) + T^{cd,\nwarrow}(h^{c,\nwarrow}_{i+1,j}+h^{c,\nwarrow}_{i,j+1}) + b^{d,\nwarrow}\big)\\
y^{d}_{i,j} &= g\big(V^{d,\searrow}h^{d,\searrow}_{i,j} + V^{d,\swarrow}h^{d,\swarrow}_{i,j} + V^{d,\nearrow}h^{d,\nearrow}_{i,j} + V^{d,\nwarrow}h^{d,\nwarrow}_{i,j} + b^{d}_{y}\big)
\end{aligned} \tag{5}
$$
where $x^{d}_{i,j}$ is the feature vector of the patch at location $(i,j)$ in the depth image. $h^{d,\searrow}$, $h^{d,\swarrow}$, $h^{d,\nearrow}$ and $h^{d,\nwarrow}$ are the quad hidden planes in the Depth-RNN. Each hidden plane has its own weight matrices $U^{d}$ (input-to-hidden mapping), $W^{d}$ (hidden-to-hidden mapping), $T^{cd}$ (the transfer layer from the RGB-RNN hidden plane to the Depth-RNN hidden plane), $V^{d}$ (hidden-to-output mapping) and its own bias term $b^{d}$, per direction. The remaining terms in the quad planes are similar to the case of the first hidden plane. The function $f$ is a nonlinear ReLU unit, the function $g$ is the typical softmax, and $b^{d}_{y}$ is a bias term.
We can see that the cross-connections through the transfer layers are applied in each processing direction (all four sequences). In this way, the processing of a specific patch relies on the previous hidden neighbors from its own modality in addition to those of the other modality, thus learning more contextually-aware hidden representations of the RGB-D image patches from both modalities.
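As an illustrative sketch (not the paper's implementation), one time step of the coupled update for the top-left plane can be written in NumPy as follows; the parameter names (`U_c`, `W_c`, `T_dc`, etc.) mirror the transfer-layer formulation described above and are hypothetical:

```python
import numpy as np

def multimodal_step(x_c, x_d, h_c_top, h_c_left, h_d_top, h_d_left, P):
    """One top-left-plane step of coupled RGB/Depth 2D-RNNs: each modality
    sums its own two previous hidden states (shared weight) plus the other
    modality's previous states mapped through a transfer layer; f = ReLU."""
    relu = lambda a: np.maximum(a, 0.0)
    h_c = relu(P["U_c"] @ x_c + P["W_c"] @ (h_c_top + h_c_left)
               + P["T_dc"] @ (h_d_top + h_d_left) + P["b_c"])   # depth -> RGB transfer
    h_d = relu(P["U_d"] @ x_d + P["W_d"] @ (h_d_top + h_d_left)
               + P["T_cd"] @ (h_c_top + h_c_left) + P["b_d"])   # RGB -> depth transfer
    return h_c, h_d

d_in, d_h = 6, 8
rng = np.random.default_rng(1)
P = {k: rng.normal(size=(d_h, d_in)) * 0.1 for k in ("U_c", "U_d")}
P.update({k: rng.normal(size=(d_h, d_h)) * 0.1 for k in ("W_c", "W_d", "T_dc", "T_cd")})
P.update({"b_c": np.zeros(d_h), "b_d": np.zeros(d_h)})
z = np.zeros(d_h)   # zero hidden states at the image border
h_c, h_d = multimodal_step(rng.normal(size=d_in), rng.normal(size=d_in), z, z, z, z, P)
```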
Our labeling task is a typical supervised classification problem. We aggregate the cross-entropy losses from both modalities and calculate the loss for every patch. The error signal for an image is averaged over all valid patches (those that are semantically labeled), which is mathematically formulated as follows:
$$L = -\frac{1}{N}\sum_{n \in \mathcal{V}}\sum_{k=1}^{K}\mathbb{1}(t_{n}=k)\big(\log y^{c}_{n,k} + \log y^{d}_{n,k}\big) \tag{6}$$
where $\mathbb{1}(\cdot)$ is the indicator function, $\mathcal{V}$ is the set of valid (semantically labeled) patches with $N = |\mathcal{V}|$, $K$ is the number of semantic classes, $t_{n}$ is the ground-truth label of the $n$-th RGB-D patch representation (the same as that of its center pixel), and $y^{c}_{n}$ and $y^{d}_{n}$ are the class likelihoods generated by the RGB-RNN and the Depth-RNN for the patch representations $x^{c}_{n}$ and $x^{d}_{n}$, respectively; both are $K$-dimensional vectors. Note that we ignore the contribution of unlabeled (invalid) patches in the loss calculation.
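A minimal NumPy sketch of this masked, two-modality cross-entropy follows; the `-1` convention for marking unlabeled patches is our assumption for illustration:

```python
import numpy as np

def masked_multimodal_ce(y_c, y_d, labels):
    """Cross-entropy summed over both modality predictions, averaged over
    valid patches only. y_c, y_d: (N, K) class likelihoods; labels: (N,)
    integer class indices, with -1 marking unlabeled patches to be skipped."""
    valid = labels >= 0
    n = valid.sum()
    idx = labels[valid]
    rows = np.arange(n)
    # Pick the likelihood of the true class in each modality and sum the logs.
    return -(np.log(y_c[valid][rows, idx]) + np.log(y_d[valid][rows, idx])).sum() / n

y_c = np.array([[0.7, 0.3], [0.5, 0.5], [0.2, 0.8]])
y_d = np.array([[0.6, 0.4], [0.5, 0.5], [0.1, 0.9]])
labels = np.array([0, -1, 1])   # the middle patch is unlabeled
loss = masked_multimodal_ce(y_c, y_d, labels)
```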
Optimization of the multimodal RNN model: To learn the multimodal RNN parameters in Equations 4 and 5, we optimize the objective function in Equation 6 with a stochastic gradient-based method. Both the RGB-RNN and the Depth-RNN are optimized simultaneously using Back Propagation Through Time (BPTT) [12]. We unfold both networks in time and calculate their gradients, which are backpropagated at each time step through both networks. This is similar to propagation in a typical multilayer feedforward neural network, except that the weights are shared across time steps as defined by the architecture of the recurrent network. The whole model is differentiable and trained end-to-end, and the gradients propagate through the RGB-RNN and the Depth-RNN simultaneously.
In more detail, we provide the backward pass for the first plane (the top-left sequence) of the RGB-RNN. The backward-pass derivations for the remaining sequences (the other three planes) can easily be inferred by following a similar derivation strategy, accounting for the change of sequence direction (top-right, bottom-left and bottom-right). Similarly, the full derivations for the Depth-RNN are straightforward and exactly mirror those of the RGB-RNN, with the notations $c$ and $d$ swapped.
In the RGB-RNN top-left sequence, we generate the gradients of the loss function in Equation 6 by differentiating it with respect to the model's internal parameters, i.e., the top-left-sequence weight matrices $U^{c,\searrow}$, $W^{c,\searrow}$, $T^{dc,\searrow}$, $V^{c,\searrow}$, and the biases $b^{c,\searrow}$ and $b^{c}_{y}$. Notice that since the weight matrix $W^{c,\searrow}$ is shared between $h^{c,\searrow}_{i-1,j}$ and $h^{c,\searrow}_{i,j-1}$, and the transfer-layer weight matrix $T^{dc,\searrow}$ is shared between $h^{d,\searrow}_{i-1,j}$ and $h^{d,\searrow}_{i,j-1}$, we rewrite the forward pass as follows to further clarify the weight-sharing concept and the backward pass:
$$
\begin{aligned}
h^{c,\searrow}_{i,j} &= f\big(U^{c,\searrow}x^{c}_{i,j} + W^{c,\searrow}h^{c,\searrow}_{i-1,j} + W^{c,\searrow}h^{c,\searrow}_{i,j-1} + T^{dc,\searrow}h^{d,\searrow}_{i-1,j} + T^{dc,\searrow}h^{d,\searrow}_{i,j-1} + b^{c,\searrow}\big)\\
y^{c}_{i,j} &= g\big(V^{c,\searrow}h^{c,\searrow}_{i,j} + b^{c}_{y}\big)
\end{aligned} \tag{7}
$$
Notice that the remaining terms originally present in Equation 4 (i.e., $V^{c,\swarrow}h^{c,\swarrow}_{i,j}$, $V^{c,\nearrow}h^{c,\nearrow}_{i,j}$ and $V^{c,\nwarrow}h^{c,\nwarrow}_{i,j}$) do not involve any of the internal parameters of the top-left sequence, hence we omit them.
Before formulating the backward pass, Figure 7 shows an illustration of the forward and backward passes in the top-left plane sequence. Notice that the derivatives computed in the backward pass at each hidden state at a specific location are processed in the reverse order of the forward propagation sequence.
For better readability, we omit the $\searrow$ (southeast) arrow sign in the following derivations for the top-left sequence.
Notice that there are now two types of error signals: a direct one, reachable directly through the loss via the output $y^{c}_{i,j}$, and indirect ones, i.e., the error signals coming from the neighboring future states at locations $(i+1,j)$ and $(i,j+1)$.
Concretely, given that the derivative of the loss function with respect to the softmax input is $y^{c}_{i,j} - t_{i,j}$ (and similarly $y^{d}_{i,j} - t_{i,j}$ for the depth stream), where $t_{i,j}$ is the one-hot ground-truth vector, the backward pass of the RGB-RNN at location $(i,j)$ (derivatives w.r.t. all internal parameters) is formulated as follows:
$$
\begin{aligned}
\delta^{c}_{i,j} &= \Big[(V^{c})^{\top}\big(y^{c}_{i,j}-t_{i,j}\big) + (W^{c})^{\top}\delta^{c}_{i+1,j} + (W^{c})^{\top}\delta^{c}_{i,j+1}\Big] \odot f'\big(h^{c}_{i,j}\big)\\
\frac{\partial L}{\partial V^{c}} &= \sum_{i,j}\big(y^{c}_{i,j}-t_{i,j}\big)\,\big(h^{c}_{i,j}\big)^{\top}, \qquad
\frac{\partial L}{\partial b^{c}_{y}} = \sum_{i,j}\big(y^{c}_{i,j}-t_{i,j}\big)\\
\frac{\partial L}{\partial U^{c}} &= \sum_{i,j}\delta^{c}_{i,j}\,\big(x^{c}_{i,j}\big)^{\top}, \qquad
\frac{\partial L}{\partial W^{c}} = \sum_{i,j}\delta^{c}_{i,j}\,\big(h^{c}_{i-1,j}+h^{c}_{i,j-1}\big)^{\top}\\
\frac{\partial L}{\partial T^{dc}} &= \sum_{i,j}\delta^{c}_{i,j}\,\big(h^{d}_{i-1,j}+h^{d}_{i,j-1}\big)^{\top}, \qquad
\frac{\partial L}{\partial b^{c}} = \sum_{i,j}\delta^{c}_{i,j}
\end{aligned} \tag{8}
$$
The term $\delta^{c}_{i,j}$ is defined to allow the model to propagate local error information internally. The $\odot$ sign is the Hadamard (element-wise) product. Through the term $\delta^{c}_{i,j}$, we can see that there are two sources of gradients: one generated directly from the current hidden state, and the others generated indirectly from the bottom and right locations (corresponding to the three terms: the direct error through $V^{c}$, the indirect error from the bottom neighbor $\delta^{c}_{i+1,j}$, and the indirect error from the right neighbor $\delta^{c}_{i,j+1}$). These two types of error signals (direct and indirect) are due to the weight-sharing effect. It is also important to mention that our derivations show only the error signals coming from the immediate bottom and right neighbors; in the real implementation, the equations are applied recurrently, propagating the errors recursively from all potential future neighbors back to the currently referenced patch location.
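The softmax/cross-entropy gradient used in the direct error term can be checked numerically; the following sketch (illustrative, not the paper's implementation) verifies that the analytic derivative, the predicted likelihoods minus the one-hot target, matches a central finite difference of the loss:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Verify numerically that d(cross-entropy)/dz = softmax(z) - t
# for logits z and one-hot target t.
z = np.array([0.5, -1.2, 2.0])
t = np.array([0.0, 0.0, 1.0])            # one-hot ground truth
analytic = softmax(z) - t

eps = 1e-6
numeric = np.zeros_like(z)
for k in range(z.size):
    zp, zm = z.copy(), z.copy()
    zp[k] += eps
    zm[k] -= eps
    # cross-entropy loss: -log(probability of the true class)
    numeric[k] = (-np.log(softmax(zp) @ t) + np.log(softmax(zm) @ t)) / (2 * eps)
```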
Similarly, the backward passes for the remaining three planes in the RGB-RNN are straightforward to derive following Equation 8, paying attention to the directions of the processing sequences. Additionally, the backward pass in the Depth-RNN follows exactly similar derivations, with the notations $c$ and $d$ swapped. Notice that the derivation w.r.t. the shared transfer-layer matrix proceeds in the same way in the other three planes/sequences.
To recap, we adopt BPTT to train both basic quad-directional 2D-RNNs (the RGB-RNN and the Depth-RNN). We use the cross-entropy loss in our implementation, and the error gradients are calculated by the chain rule as described above in detail.
IV Experiments and Results
IV-A Datasets
We evaluate our model on the benchmark NYU datasets, versions 1 and 2 [55, 56]. The NYU V1 dataset contains 2284 RGB-D indoor scene images labeled with around 13 categories. The NYU V2 dataset also comprises video sequences of indoor scenes, recorded by both the RGB and depth cameras of the Microsoft Kinect; it contains 1449 densely labeled pairs of aligned RGB and depth images. We follow the settings in [56]: the first task is to predict each pixel's label out of four semantic classes (Ground, Furniture, Props and Structure), and the second task is to predict the label out of 14 categories. For comparison, accuracy is reported as total pixel-wise accuracy, average class-wise accuracy, and average Intersection over Union (IoU) across the semantic classes.
IV-B CNNs and RNNs Training
We train our convolutional neural networks (CNNs) on both NYU V1 and V2 images. We follow the network structure proposed by [54], without considering any spatial-information channels (pixel locations), and we do not perform the hybrid sampling mentioned in their implementation details. The CNN model consists mainly of three convolutional layers (with max-pooling and ReLU layers in between) and two final fully-connected layers (FC1 and FC2), followed by a softmax loss layer over the semantic classes. The CNN models are trained on patches associated with the labels of their center pixels. Each CNN model produces a 64-dimensional vector per patch for its modality, and we use these CNN features as the input to our RNN models. In CNN training, we start with an initial learning rate and decrease it periodically by a constant factor; the momentum is initialized once and remains the same throughout training. We perform typical normalization as a preprocessing step: we subtract the mean image and divide by the standard deviation. The CNN models nearly converge after 50 epochs (around 6 hours on an NVIDIA Tesla K40 GPU); however, in most cases we train the models for up to 100 epochs.
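The normalization step above can be sketched as a dataset-level mean/std standardization in NumPy (the epsilon guard against division by zero is our addition):

```python
import numpy as np

def normalize_images(images):
    """Subtract the dataset mean image and divide by the per-pixel standard
    deviation, as in the preprocessing described above.
    images: (N, H, W, C) array of training images."""
    mean = images.mean(axis=0)
    std = images.std(axis=0) + 1e-8   # epsilon guards against constant pixels
    return (images - mean) / std

imgs = np.random.default_rng(2).normal(loc=3.0, scale=2.0, size=(10, 4, 4, 3))
norm = normalize_images(imgs)
```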
We train our multimodal RNNs using back propagation through time (BPTT), adopting stochastic gradient descent (SGD) throughout training. We initialize the learning rate and decrease it after each epoch; the momentum and the internal parameter dimension of the networks are fixed. We apply gradient clipping [5, 46] with a fixed threshold. Although ReLU can potentially cause gradient explosion, it plays a critical role in mitigating the gradient vanishing problem in RNNs; thus, we mainly use ReLU in all of our RNN models. The other RNN model parameters are initialized either randomly or with zeros. Notice that we held out pixels with zero (unknown) labels from training. Our model converges much faster than the other competitive baselines; it converges at almost the same speed as a single RNN model, with the benefit of processing multiple input modalities simultaneously.
IV-C Baselines
To show the effectiveness of our proposed model, we develop the following baselines for comparison:

CNN-RGB: in this baseline, we train a CNN on RGB images (input: three RGB channels) for label prediction.

CNN-Depth: we train a CNN as in 'CNN-RGB' but using depth images only (input: one depth channel).

CNN-RGBD: we train a CNN as in 'CNN-RGB' with the extra depth image (input: four RGB-D channels), similar to the work presented in [8].

RNN-RGB: in this baseline, we follow the quad-directional 2D-RNN structure proposed by [54]. We use the 'CNN-RGB' model to extract RGB features and input only these RGB features to train the RNN for label prediction.

RNN-Depth: we extract depth features using the trained 'CNN-Depth' model and use only these features to train a quad-directional 2D-RNN as in 'RNN-RGB'.

RNN-RGBD: we use the trained 'CNN-RGBD' model for feature extraction, and use these RGB-D features to train a quad-directional 2D-RNN, similar to the 'RNN-RGB' baseline, to perform label prediction.

RNN-Classifiers-Combined: or, as we call it, 'post-fusion'. Here we train an RGB-RNN and a Depth-RNN for label prediction and combine their classification scores at the classifier level. We use the trained 'CNN-RGB' and 'CNN-Depth' models for feature extraction.

RNNFeaturesCombined: or ‘prefusion’, here we use both trained models in ‘CNNRGB’ and ‘CNNDepth’ for feature extraction. We concatenate both RGB and Depth features (to form higher dimensional feature vectors) and train one quad 2DRNN similar to ‘RNNRGB’ baseline to perform label prediction.

RNNHiddensCombined: or ‘middlefusion’, we fuse the hidden representations of both RNNs just before classification and train them jointly.

MultimodalRNNsOurs in this sitting we implement our proposed multimodal RNNs structure. Here, we have two internal RNN models one is responsible for processing the RGB features and the other is responsible for processing the Depth features. Both models are optimized simultaneously using BPTT where we combine their classification scores to finally obtain the label map per RGBD image.

MultimodalRNNsOursMultiscale: similar to our ‘MultimodalRNNsOurs’, but in this setting we applied our proposed structure on multiscale convolutional features as proposed by [14]. We follow similar training sittings to train multiscale CNN models on different image sizes, then we use them to extract multiscale features. We concatenate these features together to form our final input patch representation to our multimodalRNNs.
We also compare our results with other state-of-the-art methods.
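The three fusion baselines above differ only in where the two modality streams are merged. The merge operations can be sketched as follows; this is a minimal NumPy illustration with our own function names, and the equal-weight score averaging in post-fusion is an assumption, since the exact combination rule is not restated here:

```python
import numpy as np

def pre_fusion(rgb_feat, depth_feat):
    # 'RNN-Features-Combined': concatenate per-patch features and feed
    # the joint vector to a single shared RNN.
    return np.concatenate([rgb_feat, depth_feat], axis=-1)

def middle_fusion(rgb_hidden, depth_hidden):
    # 'RNN-Hiddens-Combined': fuse the two RNNs' hidden representations
    # just before the classifier.
    return np.concatenate([rgb_hidden, depth_hidden], axis=-1)

def post_fusion(rgb_scores, depth_scores):
    # 'RNN-Classifiers-Combined': average the per-class scores produced
    # by the two independently trained RNNs (equal weights assumed).
    return 0.5 * (rgb_scores + depth_scores)
```

In contrast, the proposed multimodal RNNs keep one RNN per modality and exchange information through learned transfer layers rather than through any of these fixed merge points.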
IV-D Results
Given the input RGB images and their corresponding Depth images, we divide each image into non-overlapping patches. The extracted local CNN features are used as the input to our multimodal RNNs: the RGB-RNN and the Depth-RNN models process the RGB and Depth local patch features, respectively.
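The patch extraction step can be sketched as below. This is a minimal NumPy sketch; the helper name is ours, and the patch size is left as a parameter since the exact value is not restated in this text (it is assumed to divide the image dimensions evenly):

```python
import numpy as np

def to_patches(image, patch):
    """Split an H x W x C image into non-overlapping patch x patch blocks.

    Returns an array of shape (H // patch, W // patch, patch, patch, C),
    i.e. a grid of local patches ready for per-patch feature extraction.
    """
    h, w, c = image.shape
    grid = image.reshape(h // patch, patch, w // patch, patch, c)
    return grid.transpose(0, 2, 1, 3, 4)  # (rows, cols, patch, patch, C)
```

Each patch would then be passed through the trained CNN to produce the local feature vector that the RNNs consume.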
Results on NYU V1 - 13 categories: In this dataset the task is to predict pixel labels from 12 categories plus an unknown category. Table I shows the results of our baselines alongside the multimodal RNNs model and other state-of-the-art methods. The RNN models outperform the CNN baselines by a large margin, and our multimodal RNNs further improve the accuracy over the RNN baselines. Our model achieves comparable results to the other state-of-the-art methods, and the accuracy comparison with all baselines shows the effectiveness of our method with the proposed transfer layers. Figure 8 shows some qualitative results generated by our method and the most competitive baseline, 'RNN-Features-Combined'. In most cases, our multimodal RNNs model correctly classifies many regions that the baseline misclassifies.
Algorithm | Pixel Acc | Class Acc | IOU
CNN-RGB | 69.21% | 56.23% | 45.17%
CNN-Depth | 58.56% | 32.82% | 22.21%
CNN-RGBD | 69.17% | 57.83% | 45.24%
RNN-RGB | 71.38% | 64.43% | 52.80%
RNN-Depth | 60.67% | 46.88% | 28.24%
RNN-RGBD | 71.09% | 67.01% | 57.87%
RNN-Features-Combined | 72.94% | 69.06% | 60.54%
RNN-Hiddens-Combined | 71.50% | 67.00% | 55.59%
RNN-Classifiers-Combined | 71.28% | 66.37% | 54.79%
Multimodal-RNNs-Ours | 74.68% | 72.54% | 62.53%
Multimodal-RNNs-Ours-Multiscale | 78.89% | 75.73% | 65.70%
Wang et al. [64] | - | 61.71% | -
gradient KDES [51] | - | 51.84% | -
color KDES [51] | - | 53.27% | -
spin/surface normal KDES [51] | - | 40.28% | -
depth gradient KDES [51] | - | 53.56% | -
Silberman et al. [55] | - | 53.00% | -
Pei et al. [47] | - | 50.50% | -
Ren et al. [51] | - | 71.40% | -
Eigen et al. [11] | 75.40% | 66.90% | -
Results on NYU V2 - 4 categories: Table II shows the comparison on this task. Eigen et al. [11], for example, achieve higher results than our model here, while we outperform their model on the NYU V1 task. Their multiscale CNN structure is also well designed for the problem; however, their network is much more complex than ours, with a higher number of computational operations (many more convolutions). We still believe our work is potentially orthogonal and complementary to their method if trained end-to-end with it.
Likewise, Silberman et al. [56] use many types of features, including SIFT features, histograms of surface normals, 2D and 3D bounding box dimensions, color histograms, relative depth, and support features. In our model, we only use the transfer layers alongside the basic quad-directional 2D-RNN structures, yet achieve much better performance. Figure 9 also shows some qualitative results generated by our method and the most competitive baseline, 'RNN-Features-Combined'. In most cases, our multimodal RNNs model correctly classifies many regions that the baseline misclassifies.
Algorithm | Pixel Acc | Class Acc | IOU
CNN-RGB | 65.55% | 62.07% | 45.42%
CNN-Depth | 69.61% | 65.96% | 49.13%
CNN-RGBD | 71.60% | 69.89% | 54.19%
RNN-RGB | 68.14% | 65.96% | 51.05%
RNN-Depth | 70.95% | 67.76% | 52.21%
RNN-RGBD | 74.18% | 69.99% | 56.63%
RNN-Features-Combined | 74.36% | 72.80% | 60.08%
RNN-Hiddens-Combined | 73.70% | 71.10% | 56.00%
RNN-Classifiers-Combined | 73.39% | 70.64% | 55.34%
Multimodal-RNNs-Ours | 75.74% | 75.01% | 62.10%
Multimodal-RNNs-Ours-Multiscale | 78.60% | 76.69% | 65.09%
Wang et al. [64] | - | 65.30% | -
Couprie et al. [8] | 64.50% | 63.50% | -
Stuckler et al. [60] | 70.90% | 67.00% | -
Khan et al. [35] | 69.20% | 65.60% | -
Mueller et al. [44] | 72.30% | 71.90% | -
Gupta et al. [23] | 78.00% | - | 64.00%
Cadena and Kosecka [7] | - | 64.10% | -
Eigen et al. [11] | 83.20% | 82.00% | -

Results on NYU V2 - 14 categories: On NYU V2, we also evaluate our model on labeling image pixels with one of 13 categories plus an unknown category. Table III shows the comparison on this task with various baselines and state-of-the-art methods; this is the most competitive benchmark presented on this dataset. Notice that Cadena and Kosecka [7] use many RGB image and 3D features and formulate the problem in a CRF framework, while Wang et al. [64] adapt an existing unsupervised feature learning technique to directly learn features: they stack their basic learning structure to learn hierarchical features, combine the higher-level features with low-level features, and train linear SVM classifiers to perform labeling. Compared to these methods, our model is much simpler and achieves better performance.
Figure 10 also shows some qualitative results generated by our method and the most competitive baseline, 'RNN-Features-Combined'. In most cases, our multimodal RNNs model correctly classifies many regions that the baseline misclassifies. We also show our per-class accuracy in this setting in Figure 11. The improvement gain of our multimodal RNNs over the CNN-RGBD and RNN-RGBD models is significant, which is evidence that our model can effectively learn powerful context-aware and multimodal features.
We also study the effect of increasing the dimensionality of the internal hidden layer on single RNN performance. We notice that the RNN performs almost the same, but slightly better, as the hidden layer dimension increases, while training becomes extremely slow and takes much longer to converge. We therefore choose a moderate hidden layer dimension. All RNN models in our baselines achieve good performance and converge in a reasonable amount of time. Figure 12 shows the relationship between the hidden layer dimensionality of a single RNN and the accuracy (in terms of global pixel-wise accuracy) when trained on NYU V2 to predict semantic categories.
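The trade-off above is easy to see from a rough parameter count: in a vanilla RNN cell the recurrent weight matrix grows quadratically with the hidden dimension, so training cost rises much faster than capacity. A small sketch (the function and the cell decomposition are our own illustration, not the paper's exact architecture):

```python
def rnn_param_count(input_dim, hidden_dim, num_classes):
    """Rough parameter count of one vanilla RNN layer plus a linear classifier.

    Illustrates why large hidden layers slow convergence: W_hh grows
    quadratically with the hidden dimension.
    """
    recurrent = hidden_dim * hidden_dim              # W_hh
    input_proj = input_dim * hidden_dim              # W_xh
    biases = hidden_dim
    classifier = hidden_dim * num_classes + num_classes
    return recurrent + input_proj + biases + classifier
```

Doubling the hidden dimension roughly quadruples the recurrent term, which matches the observed slowdown for only a slight accuracy gain.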
Algorithm | Pixel Acc | Class Acc | IOU
CNN-RGB | 50.84% | 35.37% | 21.72%
CNN-Depth | 56.08% | 36.82% | 23.41%
CNN-RGBD | 54.40% | 35.48% | 22.07%
RNN-RGB | 56.22% | 42.41% | 29.19%
RNN-Depth | 62.27% | 45.99% | 33.62%
RNN-RGBD | 61.95% | 48.15% | 35.74%
RNN-Features-Combined | 64.30% | 51.54% | 39.31%
RNN-Hiddens-Combined | 64.50% | 51.70% | 37.30%
RNN-Classifiers-Combined | 64.44% | 50.27% | 37.42%
Multimodal-RNNs-Ours | 66.23% | 53.06% | 40.59%
Multimodal-RNNs-Ours-Multiscale | 67.90% | 54.67% | 43.27%
Wang et al. [64] | - | 42.20% | -
Couprie et al. [8] | 52.40% | 36.20% | -
Hermans et al. [26] | 54.20% | 48.00% | -
Khan et al. [35] | 58.30% | 45.10% | -
Examining our multimodal RNNs with other CNN features - VGG features on NYU V2 - 14 categories: We also examine our proposed multimodal RNNs while replacing the input CNN features with features extracted from the VGG-16 pretrained model (from the Conv5-3 layer) [57]. We focus mainly on the most competitive of our addressed tasks, i.e., classifying the 14 classes in NYU V2. The purpose of these experiments is to validate whether our fusion structure is network-independent, and thus concretely orthogonal to other CNN networks. In other words, replacing the CNN models with a more powerful network like VGG [57], ResNet [34], FCNs [33], Dilated Networks [71] and others can boost the overall performance while the relative improvement of our proposed cross-connectivity fusion is maintained. Throughout all of our experiments, we observe that replacing our CNN features with VGG features results in a consistent overall increase in accuracy across all of our RNN models (including the baselines) of around 5% in terms of IOU (the most competitive metric).
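For reference, the geometry of these VGG features can be worked out directly: Conv5-3 sits after four 2x2 max-pooling stages of VGG-16, so the feature map is 1/16 of the input resolution with 512 channels. A small sketch (the helper is ours; the 480x640 resolution in the test is the standard NYU sensor resolution, used purely for illustration):

```python
def conv53_feature_shape(height, width, channels=512):
    """Spatial size of VGG-16 Conv5-3 features for a given input image.

    Conv5-3 follows four 2x2 max-pooling stages, so each spatial
    dimension is reduced by a factor of 16; the map has 512 channels.
    """
    return channels, height // 16, width // 16
```

This determines how many feature vectors per image the RNN fusion layers must process when VGG features replace the local CNN features.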
Notice that in this paper, we did not perform joint training between the CNN layers and our multimodal RNN layers. Instead, we perform stage-wise training: we first train the CNN for feature extraction and then train our RNN model for final local classification. This makes the performance comparisons on the NYU V2 tasks between our model and other state-of-the-art models not entirely fair (many works perform end-to-end joint training of various CNN or multiscale CNN and CRF-based methods, and even with RNN/LSTM). Note also that end-to-end training with a CNN allows training efficient deconvolution layers to upsample the output feature maps to their original resolution, while we use simple bilinear interpolation to upsample the final output map produced by the RNNs. Thus, to fairly examine the full performance of our multimodal RNN model combined with other recent CNN models like ResNets [34], FCNs [33] and Dilated Networks [71], joint end-to-end training is required. We did not perform this joint training between the CNN and the RNN models as it is not our main contribution, but we consider it promising future work.

IV-E Observations
Modeling contextual dependencies between patches using RNNs helps: The improvement gained by our RNN models over the CNN models is significant. CNN features are learned locally through convolutions and thus fail to encode long-range contextual information. RNNs are powerful at modeling short- and long-range dependencies between patches within the image and can learn context-aware features effectively.
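To make the sweeping scheme concrete, here is a toy NumPy sketch of how a quad-directional 2D-RNN propagates context over the patch grid: four sweeps start from the four image corners so every patch receives hidden-state information from all directions. This is a simplified tanh cell with shared weights for illustration only, not the paper's exact cell from [54]:

```python
import numpy as np

def quad_2drnn_hidden(feats, W_in, W_h, b):
    """Hidden maps of a quad-directional 2D-RNN over a grid of patch features.

    feats: (H, W, D) patch features. Returns four (H, W, K) hidden maps,
    one per sweep direction, so each patch sees context from every corner.
    """
    H, W, _ = feats.shape
    K = b.shape[0]
    outs = []
    for flip_y, flip_x in [(False, False), (False, True), (True, False), (True, True)]:
        x = feats[::-1] if flip_y else feats          # reverse rows for this sweep
        x = x[:, ::-1] if flip_x else x               # reverse columns for this sweep
        h = np.zeros((H, W, K))
        for i in range(H):
            for j in range(W):
                top = h[i - 1, j] if i > 0 else np.zeros(K)
                left = h[i, j - 1] if j > 0 else np.zeros(K)
                h[i, j] = np.tanh(x[i, j] @ W_in + (top + left) @ W_h + b)
        h = h[::-1] if flip_y else h                  # undo the flips so maps align
        h = h[:, ::-1] if flip_x else h
        outs.append(h)
    return outs
```

In the actual model, one such quad-directional RNN runs per modality, with the transfer layers exchanging hidden information between the two.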
Sharing information between RNNs helps: We design the baseline 'RNN-Classifiers-Combined' to combine two RNN models at the classifier level. Our multimodal RNNs model outperforms this baseline as it benefits from the shared contextual information extracted through the transfer layers.
Learning transfer layers to connect RNNs helps: The baselines 'RNN-RGBD' and 'RNN-Features-Combined' mix the RGB and Depth data modalities before learning the features. Our model outperforms these baselines because it assigns a single RNN to each modality to retain the modality-specific information, and because the transfer layers are learned to adaptively extract only the relevant multimodal shared information.
V Conclusion
This paper presents a new method for RGBD scene semantic segmentation. We introduce information transfer layers between two quad-directional 2D-RNNs. The transfer layers extract relevant contextual information across the modalities and help each modality learn context-aware features that capture shared information. In future work, we will evaluate the effectiveness and scalability of the transfer layers on more than two modalities.
Copyright (c) 2013 IEEE. Personal use of this material is permitted. However, permission to use this material for any other purposes must be obtained from the IEEE by sending a request to pubspermissions@ieee.org.
Acknowledgment
The authors would like to thank NVIDIA Corporation for their donation of Tesla K40 GPUs used in this research at the RapidRich Object Search Lab. This research was carried out at both the Advanced Digital Sciences Center (ADSC), Illinois at Singapore Pt Ltd, Singapore, and at the RapidRich Object Search (ROSE) Lab at the Nanyang Technological University, Singapore. This work is supported by the research grant for ADSC from A*STAR. The ROSE Lab is supported by the National Research Foundation, Singapore, under its Interactive & Digital Media (IDM) Strategic Research Programme.
References
 [1] A. H. Abdulnabi, B. Shuai, S. Winkler, and G. Wang. Episodic camn: Contextual attention-based memory networks with iterative feedback for scene labeling. In CVPR, 2017.
 [2] A. H. Abdulnabi, G. Wang, J. Lu, and K. Jia. Multi-task cnn model for attribute prediction. TMM, 2015.
 [3] A. H. Abdulnabi, S. Winkler, and G. Wang. Beyond forward shortcuts: Fully convolutional master-slave networks (msnets) with backward skip connections for semantic segmentation. arXiv preprint arXiv:1707.05537, 2017.
 [4] G. Andrew, R. Arora, J. Bilmes, and K. Livescu. Deep canonical correlation analysis. In ICML, 2013.
 [5] Y. Bengio, N. BoulangerLewandowski, and R. Pascanu. Advances in optimizing recurrent networks. In ICASSP, 2013.
 [6] W. Byeon, T. M. Breuel, F. Raue, and M. Liwicki. Scene labeling with lstm recurrent neural networks. In CVPR, 2015.
 [7] C. Cadena and J. Kosecka. Semantic parsing for priming object detection in rgbd scenes. In Workshop on Semantic Perception, Mapping and Exploration, pages 582–597, 2013.
 [8] C. Couprie, C. Farabet, L. Najman, and Y. LeCun. Indoor semantic segmentation using depth information. arXiv preprint arXiv:1301.3572, 2013.
 [9] N. Dalal, B. Triggs, and C. Schmid. Human detection using oriented histograms of flow and appearance. In ECCV, 2006.

 [10] C. Ding and D. Tao. Robust face recognition via multimodal deep face representation. In IEEE Transactions on Multimedia, volume 17, pages 2049–2058, Nov 2015.
 [11] D. Eigen and R. Fergus. Predicting depth, surface normals and semantic labels with a common multiscale convolutional architecture. In ICCV, 2015.
 [12] J. L. Elman. Finding structure in time. In Cognitive Science, 1990.
 [13] A. Ess, B. Leibe, and L. V. Gool. Depth and appearance for mobile scene analysis. In ICCV, pages 1–8, 2007.
 [14] C. Farabet, C. Couprie, L. Najman, and Y. LeCun. Scene parsing with multiscale feature learning, purity trees, and optimal covers. In ICML, 2012.
 [15] J. Fu, H. Zheng, and T. Mei. Look closer to see better: Recurrent attention convolutional neural network for fine-grained image recognition. In CVPR, 2017.
 [16] D. Gavrila and S. Munder. Multi-cue pedestrian detection and tracking from a moving vehicle. In IJCV, 2007.
 [17] A. Graves, S. Fernandez, and J. Schmidhuber. Bidirectional LSTM networks for improved phoneme classification and recognition. In ICANN, pages 799–804, 2005.
 [18] A. Graves, S. Fernandez, and J. Schmidhuber. Multidimensional recurrent neural networks. In ICANN, 2007.
 [19] A. Graves, A. Mohamed, and G. E. Hinton. Speech recognition with deep recurrent neural networks. In ICASSP, 2013.
 [20] A. Graves and J. Schmidhuber. Framewise phoneme classification with bidirectional lstm and other neural network architectures. In Neural Networks, 2005.
 [21] A. Graves and J. Schmidhuber. Offline handwriting recognition with multidimensional recurrent neural networks. In NIPS, 2008.
 [22] S. Gupta, P. Arbelaez, R. Girshick, and J. Malik. Indoor scene understanding with rgbd images: Bottomup segmentation, object detection and semantic segmentation. In IJCV, 2015.
 [23] S. Gupta, P. Arbelaez, and J. Malik. Perceptual organization and recognition of indoor scenes from rgbd images. In CVPR, 2013.
 [24] J. He and R. Lawrence. A graph-based framework for multi-task multi-view learning. In ICML, pages 25–32, 2011.
 [25] X. He, R. Zemel, and M. CarreiraPerpindn. Multiscale conditional random fields for image labeling. In CVPR, 2004.
 [26] A. Hermans, G. Floros, and B. Leibe. Dense 3d semantic mapping of indoor scenes from rgbd images. In ICRA, 2014.
 [27] S. Hochreiter, Y. Bengio, P. Frasconi, and J. Schmidhuber. Gradient flow in recurrent nets: the difficulty of learning long-term dependencies. In A Field Guide to Dynamical Recurrent Neural Networks, IEEE Press, 2001.
 [28] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
 [29] S. C. H. Hoi and M. R. Lyu. A multimodal and multilevel ranking scheme for large-scale video retrieval. IEEE Transactions on Multimedia, 10(4):607–619, June 2008.
 [30] Z. Hong, X. Mei, D. Prokhorov, and D. Tao. Tracking via robust multitask multiview joint sparse representation. In ICCV, 2013.
 [31] H. Hotelling. Relations between two sets of variates. In Biometrika, 1936.
 [32] O. Irsoy and C. Cardie. Opinion mining with deep recurrent neural networks. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, 2014.
 [33] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In CVPR, pages 640–651, 2015.
 [34] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.
 [35] S. H. Khan, M. Bennamoun, F. Sohel, and R. Togneri. Geometry driven semantic labeling of indoor scenes. In ECCV, 2014.
 [36] S. Kim and E. P. Xing. Tree-guided group lasso for multi-task regression with structured sparsity. In ICML, 2010.
 [37] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, pages 1097–1105, 2012.
 [38] H. Lee, C. Ekanadham, and A. Y. NG. Sparse deep belief net model for visual area v2. In NIPS, 2007.
 [39] B. Leibe, N. Cornelis, K. Cornelis, and L. V. Gool. Dynamic 3d scene analysis from a moving vehicle. In CVPR, pages 1–8, 2007.
 [40] I. Lenz, H. Lee, and A. Saxena. Deep learning for detecting robotic grasps. In IJRR, 2014.
 [41] J. Li, X. Liang, Y. Wei, T. Xu, J. Feng, and S. Yan. Perceptual generative adversarial networks for small object detection. 2017.

 [42] W. Liu, Y. Zhang, S. Tang, J. Tang, R. Hong, and J. Li. Accurate estimation of human body orientation from rgbd sensors. In IEEE Trans. Cybernetics, 43(5):1442–1452, 2013.
 [43] S. S. Mukherjee and N. M. Robertson. Deep head pose: Gaze-direction estimation in multimodal video. In IEEE Transactions on Multimedia, volume 17, pages 2094–2107, Nov 2015.
 [44] A. C. Muller and S. Behnke. Learning depth-sensitive conditional random fields for semantic segmentation of rgbd images. In ICRA, 2014.
 [45] J. Ngiam, A. Khosla, M. Kim, J. Nam, H. Lee, and A. Y. Ng. Multimodal deep learning. In ICML, 2011.
 [46] R. Pascanu, T. Mikolov, and Y. Bengio. On the difficulty of training recurrent neural networks. In arXiv preprint arXiv:1211.5063, 2012.
 [47] D. Pei, H. Liu, Y. Liu, and F. Sun. Unsupervised multimodal feature learning for semantic image segmentation. IJCNN, pages 1–6, 2013.
 [48] P. H. O. Pinheiro and R. Collobert. Recurrent convolutional neural networks for scene labeling. In ICML, 2014.
 [49] H. Pirsiavash and D. Ramanan. Detecting activities of daily living in first-person camera views. In CVPR, 2012.
 [50] M. Rapus, S. Munder, G. Baratoff, and J. Denzler. Pedestrian recognition using combined low-resolution depth and intensity images. In Intelligent Vehicles Symposium, 2008.
 [51] X. Ren, L. Bo, and D. Fox. Rgb(d) scene labeling: Features and algorithms. In CVPR, 2012.

 [52] M. Rohrbach, M. Enzweiler, and D. M. Gavrila. High-level fusion of depth and intensity for pedestrian classification. In DAGM 2009: Pattern Recognition, pages 101–110, 2009.
 [53] S. Gupta, J. Hoffman, and J. Malik. Cross modal distillation for supervision transfer. In CVPR, 2016.
 [54] B. Shuai, Z. Zuo, and G. Wang. Quad-directional 2d-recurrent neural networks for image labeling. In Signal Processing Letters, 2015.
 [55] N. Silberman and R. Fergus. Indoor scene segmentation using a structured light sensor. In ICCV Workshops, 2011.
 [56] N. Silberman, D. Hoiem, P. Kohli, and R. Fergus. Indoor segmentation and support inference from rgbd images. In ECCV, 2012.
 [57] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556v6, 2015.
 [58] K. Sohn, W. Shang, and H. Lee. Improved multimodal deep learning with variation of information. In NIPS, 2014.

 [59] N. Srivastava and R. Salakhutdinov. Multimodal learning with deep boltzmann machines. In NIPS, pages 2949–2980, 2012.
 [60] J. Stuckler, B. Waldvogel, H. Schulz, and S. Behnke. Dense real-time mapping of object-class semantics from rgbd video. In J. Real-Time Image Processing, 2014.
 [61] S. Tang, X. Wang, X. Lv, T. Han, J. Keller, Z. He, M. Skubic, and S. Lao. Histogram of oriented normal vectors for object recognition with a depth sensor. In ACCV, 2013.
 [62] W. van der Mark and D. M. Gavrila. Real-time dense stereo for intelligent vehicles. In IEEE Transactions on Intelligent Transportation Systems, 2006.
 [63] A. Wang, J. Lu, J. Cai, T. J. Cham, and G. Wang. Largemargin multimodal deep learning for rgbd object recognition. In IEEE Transactions on Multimedia, volume 17, pages 1887–1898, Nov 2015.
 [64] A. Wang, J. Lu, G. Wang, J. Cai, and T. Cham. Multimodal unsupervised feature learning for rgbd scene labeling. In ECCV, 2014.
 [65] D. Wang, P. Cui, M. Ou, and W. Zhu. Learning compact hash codes for multimodal representations using orthogonal deep structure. In IEEE Transactions on Multimedia, volume 17, pages 1404–1416, Sept 2015.
 [66] W. Wang and Z.-H. Zhou. A new analysis of co-training. In ICML, 2010.
 [67] Z. Wang, S. Chen, and T. Sun. Multikmhks: A novel multiple kernel learning algorithm. In TPAMI, 2008.
 [68] H. Wolfgang and S. Leopold. Canonical correlation analysis. In Applied Multivariate Statistical Analysis, 2007.
 [69] C. Xu, D. Tao, and C. Xu. A survey on multi-view learning. arXiv preprint arXiv:1304.5634, 2013.
 [70] Y. J. Lee, J. Ghosh, and K. Grauman. Discovering important people and objects for egocentric video summarization. In CVPR, 2012.
 [71] F. Yu and V. Koltun. Multiscale context aggregation by dilated convolutions. In ICLR, pages 1–9, 2016.
 [72] Z. Zuo, B. Shuai, G. Wang, X. Liu, X. Wang, B. Wang, and Y. Chen. Convolutional recurrent neural networks: Learning spatial dependencies for image representation. In CVPR Workshops, 2015.