I Introduction
Various sensing modalities can be used concurrently to enhance the performance of scene understanding systems. For instance, high-resolution 2D images provide useful textural information about the objects, and 3D point cloud data reveals the 3D structure and size of the objects. In the context of scene labeling, where the goal is to assign a class label to the elements of each modality, such as image pixels and 3D points, this has been shown to consistently yield higher accuracy than relying on a single domain
[1, 2, 3, 4, 5, 6].

In this paper, we propose a multimodal model that can leverage the potential of various modalities simultaneously, such that the classification in each modality can be enhanced using the information from the other sensing modalities (Figure 1). Taking into account multiple modalities that contain different types of information and cover different sets of object categories is a challenging task, which we address in this work. Only a limited number of works have exploited multimodal sensing, and they have generally assumed that the corresponding elements in the various modalities must take identical class labels. This assumption is encoded either explicitly, by having a single label variable for all modalities [1, 3, 4], or implicitly, by penalizing label differences between the domains [2, 5, 6]. This assumption, however, is very restrictive, if not unrealistic, given different data modalities with their own specific properties and object categories. For example, Grass in 2D data may correspond to the class of horizontal plane in 3D data, and Sky, which is a frequent class in 2D images of outdoor scenes, cannot be recorded by 3D sensors. In addition, the different modalities are typically not perfectly aligned/registered in practice. Furthermore, in dynamic scenes, moving objects may not easily be captured by some sensors, such as 3D Lidar, due to their lower acquisition speed. Note that a Lidar system captures 3D data continuously using a rotating sensor, unlike snapshot sensors, where the image data is captured instantaneously. To give a concrete example, in the DATA61/2D3D dataset employed in our experiments, 17% of the connections between the two modalities correspond to inconsistent labels. As a consequence, existing methods fail to model these inconsistencies and, hence, produce wrong labels in at least one modality.
Given the dissimilarities in the classes of different modalities, as well as the inherent misalignments between the domains, these modalities should either be studied separately, or be connected such that each of them can correctly exploit the incoming information from the other modalities. To this end, as shown in Figure 2, we formulate multimodal scene parsing as inference in a Conditional Random Field (CRF), and introduce latent nodes to handle conflicting evidence between the different domains. The benefit of these latent nodes is twofold: First, they can leverage information from both domains to improve their respective labelings. Second, and perhaps more importantly, these nodes allow us to cut the edges between regions in different modalities when the local evidence of the domains is inconsistent. As a result, our approach lets us correctly assign different labels to the modalities. In our formulation, different modalities can cover different sets of class labels and still leverage the information of the other modalities to enhance the performance of the scene parsing system.
More specifically, each connection between two domains is encoded by a latent node, which can take either a label from the same set as the regular nodes, or an additional label that explicitly represents a broken link. We then model the connections between the latent nodes and the different modalities with potential functions that allow us to handle inconsistencies. While many such connections exist, they come at little cost, because the only cases of interest are when the latent node and the regular node have the same label, and when the latent node indicates a broken edge. By contrast, having direct links between the two modalities would require considering potential functions for every combination of two labels, i.e., a number of parameters quadratic in the number of labels, versus linear in our model. The connections between modalities that do not have identical label spaces are also governed by the latent nodes, which have access to the features of both modalities. If these features match, the latent nodes take class labels that are consistent with the labels of the nodes at the two ends of their respective connections. For example, the class Grass for a latent node is consistent with both horizontal plane in one modality and Grass in another (grass usually grows on horizontal surfaces). However, in case of a mismatch between the features of the two modalities, the latent node breaks the link between them.
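To make the latent-node construction concrete, its label space can be sketched as follows (an illustrative Python fragment of ours, not from any released implementation; it assumes class labels are 1-based so that 0 is free to mark a broken link):

```python
CUT = 0  # extra label: the evidence of the two modalities for this link disagrees

def latent_label_space(num_classes):
    """Labels a latent node can take: the cut label plus the regular class
    labels (assumed 1-based here so that 0 stays free for CUT)."""
    return [CUT] + list(range(1, num_classes + 1))
```

With 14 semantic classes, for instance, a latent node thus ranges over 15 states.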
Note that our method enables us to incorporate additional modalities with their own sets of categories. To investigate this ability of our model, we use a 2D-3D dataset and take each modality into account twice, using its corresponding geometric and semantic annotations, and model their relationships. This in turn improves the performance of the system, with negligible impact on its runtime. Furthermore, we also model intra-domain connections with potential functions that encode some notion of label compatibility and thus let us model more accurately the relationships between different class labels. Altogether, these connections allow information to be transferred across the domains, thus encoding the fact that some classes may be easier to recognize in one modality than in the others. Since such general potential functions cannot realistically be tuned manually, we propose to learn them from training data. To this end, we make use of the truncated tree-reweighted (TRW) learning algorithm of [7]. The resulting method therefore incorporates local evidence from each domain, intra-domain relationships, and inter-domain compatibility via our latent nodes.
We demonstrate the effectiveness of our approach on two publicly available 2D-3D scene analysis datasets: the DATA61/2D3D dataset [6] and the CMU/VMR dataset [5]. Our experiments evidence the benefits of the latent nodes and of augmenting the multiple modalities with their semantic and geometric annotations. They also indicate the advantage of learning the potentials for multimodal scene parsing. In particular, our approach outperforms the state-of-the-art on both datasets.
II Related Work
Scene parsing has been an important and challenging problem in computer vision in recent years. In particular, semantic labeling of 2D image data has been studied extensively, yielding increasingly accurate results
[8, 9, 10, 11, 12, 13]. With the advent of 3D depth sensors, such as laser range sensors (Lidar) [14, 15] and RGB-D cameras (e.g., Kinect) [16, 17, 18, 19], it seems natural to leverage these additional sources of information to further increase the level of scene understanding [20, 21, 22, 23].

In fact, more recently, several works have focused on integrating 2D imagery and 3D point clouds for scene parsing [1, 3, 2, 4, 24, 5, 6]. In particular, [1, 3, 24] designed models based on variables corresponding to only one visual domain and then augmented them with visual cues extracted from the other modality. This approach, however, assumes that the same regions of the scene are observed in both domains, which is virtually never the case in practice. By contrast, the model of [4] incorporates variables for the two domains, but still relies on a single variable for the corresponding regions in both modalities. As a result, this approach still assumes a perfect alignment between the different visual domains. This, unfortunately, can typically not be achieved in practice, and the above-mentioned techniques will thus misclassify some regions in at least one of the domains.
This assumption has been relaxed in some approaches by dedicating separate variables to the scene elements in the two modalities, even for matched regions. More specifically, [5] proposed a hierarchical segmentation framework that performs parsing in the two domains alternately. However, since each modality transfers its labeling results to facilitate labeling in the other modality (depending on the overlap area of the 2D region and the projection of the 3D segment onto it), this method implicitly assumes that the regions that correspond to each other in the two domains should take identical labels. In [2], a framework to train a joint 2D-3D graph from unlabeled data was proposed. Similar to [5], this method also propagates the labeling cues from one domain to the other, thus implicitly assuming that corresponding nodes in the 2D and 3D data should take the same labels. Likewise, [6] introduced a multimodal graphical model where each domain is represented by separate nodes. This approach, however, relies on Potts-model pairwise potentials for both intra-domain and inter-domain edges. As a result, the assumption of assigning identical labels to the matched nodes in the 2D and 3D domains is implicitly encoded.
Here, by contrast, we propose to introduce latent nodes in a CRF to explicitly model the inconsistencies between the two modalities. Furthermore, our approach lets us learn the intra-domain and inter-domain relationships from training data. Learning the parameters of CRFs for semantic labeling has been tackled by a number of works, such as [25, 26] with mean-field inference, [27] with TRW, and [28] with loopy belief propagation. Of more specific interest to us is the problem of learning label compatibility, as studied by [26] for 2D images and by [29] for 3D data. Here, we consider label compatibility within and across domains. To the best of our knowledge, this is the first time such a learning approach has been employed for multimodal scene parsing.
There are other works that focus on semantic labeling and 3D reconstruction [30, 31, 32]. However, none of these works deal with the misalignment problem between natural 2D and 3D data. In particular, [32] is formulated based on only a single modality (RGB images) as input, and [30, 31] reconstruct 3D data synthetically from stereo images in their frameworks. Zhang et al. [33] also addressed the problem of multimodal 2D-3D semantic labeling, by independently parsing the 2D and 3D data and fusing their classification results. They, however, fail to account for the misalignment issue, which is a challenging problem in natural multimodal datasets. The closest work to this paper is [34], where the authors addressed the domain mismatch problem by designing a specific cardinality loss function within an SSVM framework. However, the higher-order potentials in their model make their approach computationally demanding, particularly when dealing with large-scale datasets. By contrast, our graphical model is scalable and can easily be generalized to larger sets of modalities and classes. Furthermore, unlike other graph-based approaches, the set of edges in our graph is flexible and can vary depending on how well the data modalities are aligned.
Xie et al. [35] presented a multimodal dataset for outdoor scene understanding, though only 3D ground-truth annotations are provided with the dataset. The authors then used a dense 2D-3D graph to transfer the 3D label information to all 2D pixels. Gould et al. [36] integrated semantic and geometric cues into their 2D scene understanding system and decomposed the scene into semantically and geometrically meaningful regions. Following [36], Tighe and Lazebnik [37] incorporated geometric information into their region-wise scene parsing system (Superparsing), where they enforced coherence between the semantic labels (building, car, person, etc.) and geometric labels (sky, ground, vertical surfaces).
Inspired by the above, we propose to use the semantic and geometric information of both the 2D and 3D data simultaneously. To this end, we build our model upon different nodes, which represent the semantic and geometric labels of each modality separately. These nodes are then linked together, as shown in Figure 3, for a simultaneous inference procedure. Our evaluation illustrates the superiority of this method over previous work.
III A General Multimodal CRF
In this section, we present our multimodal graphical model. Let $\mathbf{x}^m = \{x^m_i\}$, $1 \leq m \leq M$, be the set of features extracted from the elements of the $m^{\text{th}}$ modality, and $\mathbf{y}^m = \{y^m_i\}$ be the set of variables encoding the labels of the nodes in that modality, where each variable can take a label in the set $\mathcal{L}^m = \{1, \dots, L^m\}$. Then the joint distribution of all modalities conditioned on the features can be expressed as
$$P(\mathbf{y}^1, \dots, \mathbf{y}^M \mid \mathbf{x}^1, \dots, \mathbf{x}^M) = \frac{1}{Z}\exp\Bigg(-\sum_{m=1}^{M}\sum_{i}\phi^m\!\left(y^m_i, \mathbf{x}^m\right) - \sum_{m=1}^{M}\sum_{(i,j)\in\mathcal{E}^m}\psi^m\!\left(y^m_i, y^m_j, \mathbf{x}^m\right) - \sum_{m\neq n}\sum_{(i,j)\in\mathcal{E}^{mn}}\psi^{mn}\!\left(y^m_i, y^n_j, \mathbf{x}^m, \mathbf{x}^n\right)\Bigg) \quad (1)$$

where $Z$ is the partition function, and $\phi^m$ denotes the unary potentials of modality $m$. $\psi^m$ and $\psi^{mn}$ denote pairwise potentials defined over the sets of edges $\mathcal{E}^m$ (intra-domain) and $\mathcal{E}^{mn}$ (inter-domain), respectively. The potential functions in Equation 1 are built such that they intuitively model the correlation between the class probabilities and the local information of each node (unary potentials), as well as the contextual relationships between pairs of adjacent nodes in the graph (pairwise potentials).
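As an illustration, the energy inside the exponential of this distribution is simply a sum of unary, intra-domain pairwise, and inter-domain pairwise costs. A minimal sketch of its evaluation for a fixed labeling follows (all data structures are hypothetical conveniences of ours, not the paper's implementation):

```python
import numpy as np

def multimodal_energy(labels, unary, intra_edges, psi_intra, inter_edges, psi_inter):
    """Sum of unary, intra-domain and inter-domain pairwise costs.

    labels:      dict modality -> (n,) int array of node labels
    unary:       dict modality -> (n, L) array of per-node label costs
    intra_edges: dict modality -> list of (i, j) node pairs
    psi_intra:   dict modality -> (L, L) label-compatibility cost table
    inter_edges: dict (m, n) -> list of (i, j) cross-modal node pairs
    psi_inter:   dict (m, n) -> (Lm, Ln) cost table
    """
    e = 0.0
    for m, y in labels.items():
        e += unary[m][np.arange(len(y)), y].sum()     # unary terms
        for i, j in intra_edges.get(m, []):
            e += psi_intra[m][y[i], y[j]]             # intra-domain terms
    for (m, n), edges in inter_edges.items():
        for i, j in edges:
            e += psi_inter[(m, n)][labels[m][i], labels[n][j]]  # inter-domain terms
    return e
```

The joint probability is then proportional to the exponential of the negated energy; inference searches over labelings rather than evaluating a single one.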
In [6], handcrafted potentials were used for the multimodal graphical model, where the pairwise potential function is defined so as to penalize dissimilar class labels for two adjacent regions whose feature vectors are very similar. The contributions of the handcrafted potentials to the inference process are determined via a set of weighting parameters. These parameters are then adjusted through a validation step, so as to produce the lowest error on the validation data.
A drawback of handcrafted potentials based on a Potts model is that they do not convey any information about the compatibility of different objects and class labels. As an example, take the scenario where a superpixel in the 2D domain is classified as Grass and has connections with two different 3D segments, one labeled as a flat object, e.g., Road or Grass, and the other predicted to be a cylindrical object, such as Powerpole. Assigning the same weight to these pairwise links, even if they have the same amount of 2D-3D overlap, may not be the right decision: in the former case, the predicted 2D class is compatible with the predicted class in the 3D domain, whereas in the latter, the difference in shape of the predicted classes demands a more tuned, class-specific pairwise weight. This problem can be addressed by considering different weights for different class combinations of the nodes on a pairwise edge, e.g., 2D:Grass-3D:Grass, 2D:Grass-3D:Road, or 2D:Grass-3D:Tree Trunk. Therefore, we assign a set of label compatibility parameters to all possible class combinations and learn them from data.

Moreover, assigning a fixed set of weights to the unary potentials of different modalities overlooks the fact that some classes are recognized better using one data modality, while other object classes can be described more precisely using another modality. For instance, when it is deduced from the 3D data that the object of interest has a flat shape, the labeling algorithm should trust the 3D information more and put the object in one of the flat categories. If, in this case, the 2D data describes the object as a green entity, e.g., Grass, Bushes, Tree top, the classifier should ideally pick Grass as the class label.
Our goal is to construct and train our graphical model based on a set of potentials that describe: I) the reliability of the local information of each domain per class, and II) the cost of various intradomain and interdomain class neighborhoods (a.k.a. the label compatibility). To obtain a labeling, we perform inference in our CRF by making use of the truncated TRW algorithm of [7].
III-A Potential Definition
The CRF formulation in Equation 1 includes several unary and pairwise potentials, which we define here. The unary potential of a node is generally computed from its local information and indicates the cost of assigning a class label to the node. We define the cost of assigning label $l$ to variable $y^m_i$ as
$$\phi^m\!\left(y^m_i = l, \mathbf{x}^m\right) = \left(\mathbf{w}^m_l\right)^{\top} x^m_i \quad (2)$$

where $W^m \in \mathbb{R}^{L^m \times d^m}$ is the parameter matrix for the unary potential in modality $m$, with $\mathbf{w}^m_l$ the row of $W^m$ corresponding to label $l$. Since these parameters directly act on the local features $x^m_i$, this matrix encodes how much each feature dimension should be relied on to predict a specific label. Note that $d^m$ refers to the dimension of the feature vector in modality $m$.
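Concretely, stacking the per-label parameter vectors row-wise lets the unary costs for all nodes and labels be computed with a single matrix product (a sketch of ours, not the paper's code; `W` and `X` are illustrative names):

```python
import numpy as np

def unary_costs(W, X):
    """W: (L, d) parameter matrix, one row per label; X: (n, d) node features.
    Entry [i, l] of the result is the cost of assigning label l to node i,
    i.e. the dot product of row l of W with the features of node i."""
    return X @ W.T
```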
Pairwise potentials express the cost of all possible joint label assignments for two adjacent nodes in the graph. The handcrafted potentials are limited to simply encouraging the nodes to share the same labels. By contrast, here, we define general pairwise potentials that let us encode sophisticated label compatibilities. For the intradomain edges, these potentials are defined as
$$\psi^m\!\left(y^m_i = l, y^m_j = l', \mathbf{x}^m\right) = \left(\mathbf{w}^m_{l,l'}\right)^{\top} x^m_{ij} \quad (3)$$

where $P^m$ is a parameter matrix whose $L^m \times L^m$ rows represent all possible combinations of two labels, and $\mathbf{w}^m_{l,l'}$ is the row of $P^m$ corresponding to the combination of label $l$ with label $l'$. In this case, we set the edge features $x^m_{ij}$ to be the norm of the difference of a subset of the original node features $x^m_i$ and $x^m_j$, which will be discussed in Section VI-A1.
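For illustration, with the label-pair parameters flattened into a lookup table and a scalar edge feature (the norm of the feature difference), such a potential can be sketched as follows (hypothetical code of ours):

```python
import numpy as np

def intra_pairwise_cost(P, l, lp, xi, xj, num_labels):
    """P: (L*L, 1) parameter table, one row per ordered label pair (l, l').
    The edge feature is the norm of the difference of the node features,
    so compatible-looking neighbors with dissimilar features cost more or
    less depending on the learned pair weight."""
    edge_feat = np.linalg.norm(xi - xj)
    return float(P[l * num_labels + lp, 0] * edge_feat)
```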
Similarly, the inter-domain pairwise potential between modality $m$ and modality $n$ is defined as
$$\psi^{mn}\!\left(y^m_i = l, y^n_j = l', \mathbf{x}^m, \mathbf{x}^n\right) = \left(\mathbf{w}^{mn}_{l,l'}\right)^{\top} x^{mn}_{ij} \quad (4)$$

where $x^{mn}_{ij}$ is the concatenation of a subset of the original node features in $\mathbf{x}^m$ and $\mathbf{x}^n$.
IV General Multimodal CRF with Latent Nodes
We now address the problem of inconsistencies across the modalities by introducing latent nodes into our model. The latent nodes are placed between pairs of corresponding nodes in the two modalities. This breaks each between-modality edge into two edges, one linking the node in modality $m$ to the latent node, and one linking the node in modality $n$ to the latent node. In other words, no edge directly connects modality $m$ to modality $n$. Our latent nodes can either take a label from the same space as the $m$ or $n$ nodes (when $m$ and $n$ have different label spaces, the latent node can take a label from one of them), or another label indicating that the link between the two modalities should be cut.
Formally, let $\mathbf{y}^m = \{y^m_i\}$ be the set of variables encoding the node labels in modality $m$. Each of these variables can take a label in the set $\mathcal{L}^m = \{1, \dots, L^m\}$. Furthermore, let $K$ be the number of pairs of corresponding nodes in modality $m$ and modality $n$, found in the manner described in Section VI-A1. We then denote by $\mathbf{h} = \{h_k\}$, $1 \leq k \leq K$, the latent nodes associated with these correspondences. These variables can be assigned a label from the space $\{0, 1, \dots, L\}$, where label $0$ represents a broken link, meaning that the nodes do not influence each other.
Given $\mathbf{x}^m$ as the features extracted from the elements in modality $m$, the joint probability distribution of all data nodes and latent nodes conditioned on the features can be expressed as
$$P(\mathbf{y}^1, \dots, \mathbf{y}^M, \mathbf{h} \mid \mathbf{x}^1, \dots, \mathbf{x}^M) = \frac{1}{Z}\exp\Bigg(-\sum_{m}\sum_{i}\phi^m\!\left(y^m_i, \mathbf{x}^m\right) - \sum_{m}\sum_{(i,j)\in\mathcal{E}^m}\psi^m\!\left(y^m_i, y^m_j, \mathbf{x}^m\right) - \sum_{k}\phi^h\!\left(h_k, \mathbf{x}^m, \mathbf{x}^n\right) - \sum_{m}\sum_{(i,k)\in\mathcal{E}^{mh}}\psi^{mh}\!\left(y^m_i, h_k, \mathbf{x}^m, \mathbf{x}^n\right)\Bigg) \quad (5)$$

where $\phi^h$ denotes the unary potential of the latent nodes and $\psi^{mh}$ denotes the pairwise potentials defined over the set of edges $\mathcal{E}^{mh}$ between the nodes of modality $m$ and the latent nodes. To obtain a labeling, as in Section III, we use the TRW method to perform inference in our CRF. In the remainder of this section, we describe the latent potentials in Equation 5.
IV-A Unary Potentials of Latent Nodes
Similar to the data modality nodes, the unary potential of the latent nodes is defined as
$$\phi^h\!\left(h_k = l, \mathbf{x}^m, \mathbf{x}^n\right) = \left(\mathbf{w}^h_l\right)^{\top} x^h_k \quad (6)$$

where $W^h$ is, again, a parameter matrix, which this time contains $L+1$ rows, to reflect the fact that a latent node can take an additional label to cut the connection between the two modalities. The feature vector $x^h_k$ of a latent node is constructed by concatenating the features of the corresponding $m$ and $n$ nodes, i.e., $x^h_k = [x^m_i; x^n_j]$. Having access to both the $m$ and $n$ features allows this unary to detect mismatches between the $m$ and $n$ observations and, in that event, favor cutting the corresponding edge.
IV-B Inter-domain Pairwise Potentials with Latent Nodes
The inter-domain pairwise potentials associated with the latent nodes that connect two modalities are defined as
$$\psi^{mh}\!\left(y^m_i = l, h_k = l', \mathbf{x}^m, \mathbf{x}^n\right) = \left(\mathbf{w}^{mh}_{l,l'}\right)^{\top} x^{mh}_{ik} \quad (7)$$

and

$$\psi^{nh}\!\left(y^n_j = l, h_k = l', \mathbf{x}^m, \mathbf{x}^n\right) = \left(\mathbf{w}^{nh}_{l,l'}\right)^{\top} x^{nh}_{jk} \quad (8)$$

where the parameter matrices now have $L \times (L+1)$ rows to account for the extra label of the latent nodes. In practice, we set the edge features $x^{mh}_{ik}$ and $x^{nh}_{jk}$ to 1, thus resulting in $L(L+1)$ parameters. Note, however, that the effective number of parameters corresponding to these potentials is much smaller. The reason is that the only cases of interest are when the latent node and the regular node take the same label, and when the latent node indicates a broken link. The cost of the other label combinations should be heavily penalized, since they never occur in practice. This therefore truly results in $2L$ parameters for each of these potentials.
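This parameter tying can be illustrated by building the full cost table and pinning the never-occurring label combinations to a large constant (a sketch of ours; `BIG` stands in for the heavy penalty, and the column layout is our own convention):

```python
import numpy as np

BIG = 1e6  # stands in for the heavy penalty on combinations that never occur

def latent_pairwise_table(w_same, w_cut, num_labels):
    """(L, L+1) cost table between a regular node (rows, classes 0..L-1) and
    a latent node (columns; column 0 = cut, column l+1 = class l). Only the
    'same label' diagonal and the 'cut' column carry learnable weights."""
    T = np.full((num_labels, num_labels + 1), BIG)
    T[:, 0] = w_cut                                       # latent node cuts the edge
    T[np.arange(num_labels), np.arange(num_labels) + 1] = w_same  # labels agree
    return T
```

Only the `w_same` and `w_cut` entries are free, which matches the observation that the effective parameter count is linear in the number of labels.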
V Training our Multimodal Latent CRF
Our multimodal CRF contains many parameters, which thus cannot be tuned manually. Here, we propose to learn these parameters from training data. To this end, we make use of the direct loss minimization method of [7].
More specifically, let $\{(\mathbf{x}^{(t)}, \mathbf{y}^{(t)})\}$, $1 \leq t \leq T$, be a set of labeled training examples, where, with a slight abuse of notation compared to Section III and Section IV, $\mathbf{x}^{(t)}$, resp. $\mathbf{y}^{(t)}$, encompasses the features, resp. ground-truth labels, of all the nodes in the $t^{\text{th}}$ training sample, including the latent nodes. In practice, to obtain the ground-truth labels of the latent nodes, we simply check whether the ground-truth labels of the corresponding $m$ and $n$ nodes agree, and set the label of the latent node to that shared label if they do, and to 0 otherwise (note that our nodes are latent in the sense that they do not correspond to physical entities, not in the sense that we do not have access to their ground truth during training).
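This rule for deriving the latent ground truth can be sketched directly (illustrative code of ours, assuming 1-based class labels so that 0 is free for the cut label):

```python
CUT = 0  # label marking a broken 2D-3D link

def latent_ground_truth(y2d, y3d):
    """Ground-truth labels for latent nodes: the shared label when the linked
    2D and 3D nodes agree, CUT otherwise. Class labels are assumed 1-based
    here so that 0 stays free for CUT."""
    return [a if a == b else CUT for a, b in zip(y2d, y3d)]
```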
Learning the parameters of our model is then achieved by minimizing the empirical risk
$$\min_{\mathbf{w}} \; \frac{1}{T}\sum_{t=1}^{T} \Delta\!\left(\mathbf{y}^{(t)}, \mathbf{x}^{(t)}; \mathbf{w}\right) \quad (9)$$

w.r.t. the model parameters $\mathbf{w}$, where $\Delta$ is a loss function.
Here, we use a marginal-based loss function, which measures how well the marginals obtained via inference in the model match the ground-truth labels. In particular, we rely on a loss function defined on the clique marginals [38]. This can be expressed as $\Delta(\mathbf{y}, \mathbf{x}; \mathbf{w}) = -\sum_{c}\log \mu_c(\mathbf{y}_c; \mathbf{x}, \mathbf{w})$, where the sum runs over all the cliques in the CRF, i.e., all the inter-domain and intra-domain pairwise cliques in our case, $\mathbf{y}_c$ denotes the variables of $\mathbf{y}$ involved in a particular clique $c$, and $\mu_c$ indicates the marginals of clique $c$ obtained by performing inference with parameters $\mathbf{w}$.
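A minimal sketch of such a clique-based loss follows (illustrative only; in practice the marginals come from truncated TRW inference inside the framework of [7], and the exact loss of [38] may differ in normalization):

```python
import numpy as np

def clique_loss(clique_marginals, gt_pairs):
    """clique_marginals: list of (L, L) pairwise marginal tables, one per
    clique, as produced by approximate inference; gt_pairs: ground-truth
    label pair (l_i, l_j) for each clique. Returns the negative log-likelihood
    of the ground truth under the clique marginals."""
    return -sum(np.log(m[i, j]) for m, (i, j) in zip(clique_marginals, gt_pairs))
```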
We use the publicly available implementation of [7] with truncated TRW as inference method. This method was shown to converge to stable parameters in only a few iterations. In practice, we run a maximum of 5 iterations of this algorithm.
VI Special Cases
In this section, we demonstrate how our general multimodal model can be applied to two special cases: I) 2D-3D multimodal data, and II) 2D-3D semantic and geometric multimodal data, both augmented with latent nodes.
VI-A 2D-3D CRF with Latent Nodes
Since 2D imagery and 3D data are often the most popular modalities used for semantic labeling, here we focus the discussion on these two visual domains. Nevertheless, our approach generalizes to other modalities, such as infrared or hyperspectral data.
Our model assigns separate nodes to the 2D regions (i.e., superpixels) and 3D regions (i.e., 3D segments). More details about these regions are provided in Section VI-A1. We also consider latent nodes that enable us to take into account inconsistencies between the different modalities. To this end, and as illustrated in Figure 2, we incorporate one such latent node between each pair of corresponding 2D and 3D nodes. This results in edges between either a 2D node and a latent node, or a 3D node and a latent node, but no edges directly connecting a 2D node to a 3D node. Our latent nodes can then either take a label from the same space as the 2D and 3D nodes, or take another label indicating that the link between the two modalities should be cut (label 0). Figure 4 illustrates, through an example, how latent nodes operate in case of a misalignment between 2D and 3D data for narrow objects. In Figure 5, we show that multimodal data is prone to errors due to moving objects, such as a vehicle. In each case, the latent nodes exploit the 2D and 3D information and either assist the linked 2D-3D regions in finding their class labels or cut the link between them.
Formally, let $\mathbf{y}^{2D} = \{y^{2D}_{f,i}\}$ be the set of variables encoding the labels of the 2D nodes in $F$ frames, with frame $f$ containing $N_f$ 2D regions. Similarly, let $\mathbf{y}^{3D} = \{y^{3D}_j\}$ be the set of variables encoding the labels of the 3D nodes. Each of these variables, either 2D or 3D, can take a label in the set $\mathcal{L} = \{1, \dots, L\}$. Furthermore, let $K$ be the number of pairs of corresponding 2D and 3D nodes, found in the manner described in Section VI-A1. We then denote by $\mathbf{h} = \{h_k\}$ the latent nodes associated with these correspondences. These variables can be assigned a label from the space $\{0, 1, \dots, L\}$.
Given the features extracted from the 2D and 3D regions, $\mathbf{x}^{2D}$ and $\mathbf{x}^{3D}$, respectively, the joint distribution of the 2D, 3D and latent nodes conditioned on the features can be expressed as
$$P(\mathbf{y}^{2D}, \mathbf{y}^{3D}, \mathbf{h} \mid \mathbf{x}^{2D}, \mathbf{x}^{3D}) = \frac{1}{Z}\exp\Bigg(-\sum_{i}\phi^{2D}\!\left(y^{2D}_i\right) - \sum_{j}\phi^{3D}\!\left(y^{3D}_j\right) - \sum_{k}\phi^{h}\!\left(h_k\right) - \sum_{(i,i')\in\mathcal{E}^{2D}}\psi^{2D}\!\left(y^{2D}_i, y^{2D}_{i'}\right) - \sum_{(j,j')\in\mathcal{E}^{3D}}\psi^{3D}\!\left(y^{3D}_j, y^{3D}_{j'}\right) - \sum_{(i,k)\in\mathcal{E}^{2Dh}}\psi^{2Dh}\!\left(y^{2D}_i, h_k\right) - \sum_{(j,k)\in\mathcal{E}^{3Dh}}\psi^{3Dh}\!\left(y^{3D}_j, h_k\right)\Bigg) \quad (10)$$

where the dependence of the potentials on the features is omitted for brevity. $\phi^{2D}$, $\phi^{3D}$, and $\phi^{h}$ denote the unary potentials of the 2D, 3D and latent nodes, respectively. $\psi^{2D}$, $\psi^{3D}$, $\psi^{2Dh}$, and $\psi^{3Dh}$ denote pairwise potentials defined over the sets of edges $\mathcal{E}^{2D}$, $\mathcal{E}^{3D}$, $\mathcal{E}^{2Dh}$, and $\mathcal{E}^{3Dh}$, respectively. All the unary and pairwise potentials are computed following the formulations in Section III and Section IV. Below, we provide some details regarding our features and potentials.
VI-A1 Features and Potentials
3D nodes: We extracted the following 3D shape features from the point cloud data: the fast point feature histogram (FPFH [39]), which describes the local point distribution based on the point distances and the relative orientations of their surface normal vectors; eigenvalue features, which model the shape of the spatial distribution of the points; the deviation of the surface normal vectors from the vertical axis; and the height of the points. The 3D segments were obtained from these features by first classifying the points using an SVM classifier, partitioning them into different groups given their class labels, and then performing k-means clustering on each group of points based on their spatial coordinates. We then further leveraged the SVM results and used the negative logarithm of the multiclass SVM probabilities as features in our unary potentials. The probabilities for a segment were obtained by averaging over the points belonging to the segment. We also used three eigenvalue descriptors and the vertical-axis deviation as additional features for the segments.
2D nodes: As 2D regions, we used superpixels extracted by the mean-shift algorithm [40]. We utilized a histogram of SIFT features [41], GLCM features (entropy, homogeneity and contrast, each computed in both the horizontal and vertical directions), and RGB values to train an SVM classifier, and used the negative logarithm of the SVM probabilities as features in our unary potentials. We augmented these features with the six GLCM features and three RGB values.
Latent nodes: The features of the latent nodes were obtained by concatenating the features of their respective 2D and 3D nodes, described above. Furthermore, we augmented these features with the normalized overlap area of the projection of the 3D segment onto the 2D superpixel.
Edges: For the intra-domain potentials, we employed the norm of the difference of a subset of the local feature vectors (RGB for 2D-2D edges and vertical-axis deviation for 3D-3D edges) as pairwise features; the feature vector of a 2D-2D or 3D-3D edge thus reduces to a single value. In the case of the 2D-3D CRF with no latent nodes, however, the feature vector of the 2D-3D edges was constructed by concatenating the RGB values of the 2D node with the eigenvalue features and vertical-axis deviation of the 3D node, as well as with the same normalized overlap area used for the unary potentials of the latent nodes. We selected these features through an ablation study conducted on the validation set. As evidenced by our results, they yield better accuracies than employing all the features, which causes overfitting. Note that we obtain the 2D-3D edges by projecting the 3D clusters onto the 2D regions and then linking the pairs of 2D-3D elements that have a considerable projection overlap with each other, i.e., an intersection over union of more than 0.2.
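The linking step can be sketched as follows (illustrative code of ours; sets of pixel indices stand in for the actual superpixel masks and projected 3D segments, and the 0.2 threshold is the one stated above):

```python
def link_2d3d(sp_pixels, seg_pixels, iou_thresh=0.2):
    """sp_pixels: superpixel id -> set of image pixel indices; seg_pixels:
    3D segment id -> set of pixel indices covered by its 2D projection.
    Links every 2D-3D pair whose projection overlap (intersection over
    union) exceeds the threshold."""
    edges = []
    for s, sp in sp_pixels.items():
        for g, proj in seg_pixels.items():
            inter = len(sp & proj)
            union = len(sp | proj)
            if union and inter / union > iou_thresh:
                edges.append((s, g))
    return edges
```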
VI-B Simultaneous Inference of Semantic and Geometric Classes in both 2D and 3D
Fusing geometric and semantic cues has been shown to enhance scene parsing results [36], [37]. This procedure becomes even more promising when the geometric labels are computed from 3D data, rather than from 2D data as in [36], [37]. Figure 6 shows the results of the semantic and geometric labeling of wires and tree leaves. In the semantic labeling, the wires were wrongly labeled as tree leaves; using the geometric labeling, however, they were distinguished from the tree leaves and correctly assigned to the wire and scattered categories, respectively. This can help us improve the semantic labeling. In this paper, we use the 2D and 3D semantic labelings, as well as the 2D and 3D geometric labelings, collaboratively, and leverage their information through a concurrent inference process to improve the labeling results in each of them. [36], [37] picked three categories, horizontal, vertical and sky, as geometric classes in their methods. Having access to 3D point cloud data enables us to expand this list by taking into account the cylindrical and scattered categories in both the 2D and 3D data, as explained in more detail in Section VI-B1. In our semantic-geometric mapping, each semantic class belongs to only one of the geometric classes, e.g., all the roads are assigned a horizontal label and all the vehicles are given a vertical label.
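This many-to-one mapping can be illustrated as a simple lookup (the road and vehicle entries follow the text above; the remaining class names are examples of ours, and the full mapping is given in Table III):

```python
# Illustrative subset of a many-to-one semantic-to-geometric mapping.
SEM_TO_GEO = {
    "Road": "horizontal plane",
    "Grass": "horizontal plane",
    "Vehicle": "vertical plane",
    "Tree Trunk": "cylindrical",
    "Power Pole": "cylindrical",
    "Tree Top": "scattered",
    "Sky": "sky",
    "Wire": "wire",
    "Person": "person",
}

def geometric_label(semantic_label):
    """Every semantic class belongs to exactly one geometric class."""
    return SEM_TO_GEO[semantic_label]
```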
Let $\mathbf{y}^{2D_s}$, $\mathbf{y}^{3D_s}$, $\mathbf{y}^{2D_g}$, and $\mathbf{y}^{3D_g}$ be the variables encoding the 2D semantic, 3D semantic, 2D geometric and 3D geometric class labels, respectively. We can then define the joint distribution of the 2D semantic, 2D geometric, 3D semantic, 3D geometric and latent nodes, conditioned on the node features, similarly to the definition in Equation 5. Note that the label sets of the geometric nodes and the semantic nodes are different.
Given that the geometric nodes represent the same sets of 2D and 3D regions that were previously produced for semantic labeling, the 2D-3D geometric edges are similar to the 2D-3D semantic edges. Furthermore, note that the latent nodes that link a semantic node and a geometric node representing the same 2D region (or 3D segment) cannot cut their corresponding edges, even though their class labels are different. The reason is that they connect two visually identical regions (segments). Instead, they try to find a coherent pair of semantic and geometric class labels that sufficiently fits the 2D and 3D features of the region (segment). The truncated TRW method is used for inference, as described in Section V. The inference time remains quite short, despite the considerable increase in the size of the graph (number of nodes and edges). Table II presents the training and inference times for the DATA61/2D3D and CMU/VMR datasets.
Our method learns all the compatibility parameters between the semantic and geometric class labels, in contrast with the Superparsing method [37], where only one parameter is embedded in the cost function to enforce consistency between these two groups of classes. Note that we used the same features as in Section VI-A1 for the geometric nodes.
VI-B1 Semantic and Geometric Classes
In order to best exploit the geometric cues, particularly given the 3D point cloud data, the data is clustered into different structural classes including horizontal plane, vertical plane, scattered and cylindrical (in addition to three other groups for specifically representing sky, person and wire). Table III provides the mapping between the geometric and semantic classes.


TABLE I: Dimensions of the learned parameter matrices (Rows × Columns) for the two datasets. Row descriptions are inferred from the matrix dimensions.

| Parameter matrix | DATA61/2D3D: Rows | DATA61/2D3D: Columns | CMU/VMR: Rows | CMU/VMR: Columns |
|---|---|---|---|---|
| 2D unary | 14 classes | 14 2D-probabilities + 6 GLCM + 3 RGB | 19 classes | 19 2D-probabilities + 6 GLCM + 3 RGB |
| 3D unary | 13 classes | 13 3D-probabilities + 3 eigenvalues + 1 deviation | 19 classes | 19 3D-probabilities + 3 eigenvalues + 1 deviation |
| Latent unary | 14 classes + 1 edge-cut | 23 2D features + 17 3D features + 1 2D-3D overlap | 19 classes + 1 edge-cut | 28 2D features + 23 3D features + 1 2D-3D overlap |
| 2D-2D pairwise | 14 × 14 classes | 1 | 19 × 19 classes | 1 |
| 3D-3D pairwise | 13 × 13 classes | 1 | 19 × 19 classes | 1 |
| 2D-latent pairwise | 14 × 15 classes | 1 | 19 × 20 classes | 1 |
| 3D-latent pairwise | 13 × 15 classes | 1 | 19 × 20 classes | 1 |
| 2D-3D pairwise (no latent nodes) | 14 × 13 classes | 1 | 19 × 19 classes | 1 |
| 2D-3D pairwise (no latent nodes, selected features) | 14 × 13 classes | 3 RGB + 3 eigenvalues + 1 deviation + 1 2D-3D overlap | 19 × 19 classes | 3 RGB + 3 eigenvalues + 1 deviation + 1 2D-3D overlap |
| 2D-3D pairwise (no latent nodes, all features) | 14 × 13 classes | 3 RGB + 3 eigenvalues + 1 deviation + 1 2D-3D overlap + 6 GLCM + 14 2D-probabilities + 13 3D-probabilities | 19 × 19 classes | 3 RGB + 3 eigenvalues + 1 deviation + 1 2D-3D overlap + 6 GLCM + 19 2D-probabilities + 19 3D-probabilities |
VII Experiments
We evaluate our method on two publicly available 2D-3D multimodal datasets (DATA61/2D3D [6] and CMU/VMR [5]). KITTI [42] is another well-known multimodal dataset for outdoor scene understanding and object detection, and it has recently been adapted as a benchmark for the semantic labeling task [4]. However, the point cloud data in this dataset has a rather small vertical field of view [35], and as a result a large portion of the 2D images has no correspondence in the 3D data. In our application, where we are interested in investigating the 2D-3D links, this dataset is therefore less useful, and we evaluate our method on the two aforementioned multimodal datasets.
We provide the results of the 2D-3D CRF with and without latent nodes, as well as the simultaneous inference of semantic and geometric classes in both 2D and 3D. We also compare against the state-of-the-art algorithms of [6] and [5]. The experiment on the 2D-3D CRF without latent nodes is a special case of the general multimodal CRF (Section III). In addition, we report the results of the pairwise models with learned potentials acting on a single domain, either 2D or 3D, referred to as Pairwise 2D (learned) and Pairwise 3D (learned). We followed the evaluation protocol of [6] and partitioned the data into 4 non-overlapping folds, using three folds for training and the remaining fold as the test set.
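The 4-fold protocol above can be sketched as follows (scene identifiers are illustrative; the actual fold assignment follows [6]):

```python
# Minimal sketch of a 4-fold evaluation: partition the scenes into 4
# non-overlapping folds, train on three, test on the fourth. Round-robin
# assignment is an assumption for illustration only.
def make_folds(scenes, n_folds=4):
    folds = [scenes[i::n_folds] for i in range(n_folds)]
    for i in range(n_folds):
        test = folds[i]
        train = [s for j, fold in enumerate(folds) if j != i for s in fold]
        yield train, test

scenes = [f"scene_{i:02d}" for i in range(12)]   # 12 outdoor scenes
for train, test in make_folds(scenes):
    assert len(train) == 9 and len(test) == 3
    assert not set(train) & set(test)            # folds never overlap
```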
VII-A Results on DATA61/2D3D
The DATA61/2D3D dataset contains 12 outdoor scenes, each described by a 3D point cloud block together with 10–20 panoramic images. The number of 3D points per scene varies from 1 to 2 million. The dataset comprises 14 classes (13 for 3D, where Sky was removed), which determines the sizes of the parameter matrices of the 2D-3D CRF with latent nodes, as well as the different parameter matrices of the 2D-3D CRF without latent nodes. Table I lists these parameter matrices and describes how their sizes relate to the selected set of features and class labels in the experiments.
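As a quick sanity check of the unary matrix sizes listed in Table I, using the class and feature counts from that table (the variable names are ours, not the authors'):

```python
# Illustrative check of the unary parameter-matrix sizes in Table I for
# DATA61/2D3D (class and feature counts as given in the table).
n_2d_classes = 14
n_3d_classes = 13                         # Sky is removed in the 3D domain
n_glcm, n_rgb = 6, 3                      # 2D texture and color features
n_eig, n_dev = 3, 1                       # 3D eigenvalue and deviation features

unary_2d = (n_2d_classes, n_2d_classes + n_glcm + n_rgb)   # rows x columns
unary_3d = (n_3d_classes, n_3d_classes + n_eig + n_dev)
assert unary_2d == (14, 23) and unary_3d == (13, 17)
```

The resulting 23 2D and 17 3D feature counts are exactly the ones that reappear in the columns of the latent-node unary in Table I.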
Table IV and Table V compare the results, as F1-scores, of the 2D-3D CRF model with hand-crafted and with learned potentials, and with and without latent nodes. Note that no results for [5] are available on this dataset. The results in these tables evidence the benefits of using latent nodes, especially on the narrow classes that suffer most from misalignment. On average, our approach with latent nodes clearly outperforms the model without latent nodes, and thus achieves state-of-the-art results on this dataset. Moreover, note that the 2D-3D CRF without latent nodes that utilizes fewer features (selected features) for the 2D-3D edges is less prone to overfitting and yields better results than the CRF model with the full set of features. Furthermore, the results of the 2D-3D CRF without latent nodes, where the feature vector of the 2D-3D edges was set to a single value of 1, are reported in Table IV and Table V for comparison (no feature). In Figure 7, we illustrate the influence of our latent nodes with two examples. As shown in the figure, cutting the edge between non-matching 2D and 3D nodes (which were connected because of misalignment) helps predict the correct class labels. Figure 8 shows the results of our approach on one of the scenes in this dataset, compared to the results of [6].
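For reference, the per-class F1-score reported in these tables can be computed as below (a generic sketch with toy labels, not the authors' evaluation code):

```python
# Per-class F1-score: harmonic mean of precision and recall for one label,
# computed over paired predicted and ground-truth label sequences.
def f1_per_class(pred, gt, label):
    tp = sum(p == label and g == label for p, g in zip(pred, gt))
    fp = sum(p == label and g != label for p, g in zip(pred, gt))
    fn = sum(p != label and g == label for p, g in zip(pred, gt))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Toy example: one false positive for "Road" gives precision 0.5, recall 1.0.
pred = ["Road", "Road", "Grass", "Sky"]
gt   = ["Road", "Grass", "Grass", "Sky"]
assert abs(f1_per_class(pred, gt, "Road") - 2 / 3) < 1e-9
```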
Our results on DATA61/2D3D indicate that, while our latent nodes are generally beneficial thanks to their ability to cut incorrect connections, they occasionally yield lower performance than a model without such nodes. We observed that this is mainly due to inaccurate ground-truth (which is inevitable because of the imperfect 3D-2D projection of the ground-truth labels, particularly at the boundaries of narrow objects), or to the fact that, sometimes, even though the 2D and 3D features seem inconsistent (e.g., due to challenging viewing conditions), they still belong to the same category. In these circumstances, the stronger smoothness imposed by the model without latent nodes is able to address this problem.
2D-3D multimodal scene parsing on semantic and geometric classes can be seen as a special case of our multimodal model with four modalities. We considered six geometric classes in the DATA61/2D3D dataset (Table III) and followed a similar procedure as in the semantic labeling to find their regions and node features. The 2D and 3D geometric data augment the semantic model as two separate data modalities, and their simultaneous inference is carried out given the semantic and geometric cues of the 2D and 3D data. Tables IV and V report the results of the 2D and 3D semantic scene parsing using the proposed semantic and geometric 2D/3D multimodal model. As reported in these tables, leveraging the geometric cues led to 4% and 5% improvements in the F1-scores of the 2D and 3D data, respectively. The results of the geometric labeling of the 2D and 3D data are shown in Table VI.
Furthermore, the panoramic images in DATA61/2D3D make it possible to observe an object in successive image frames, so that multiple 2D features can be recorded for each object. We linked these corresponding 2D nodes, with a latent node in each connection, to provide more information in the labeling process, and gained a 2% improvement in 2D performance, as shown in Table IV. Figure 10 shows some sample results of our semantic and geometric labeling on the DATA61/2D3D dataset.
VII-B Results on CMU/VMR
The CMU/VMR dataset comprises 372 pairs of urban images and corresponding 3D point cloud data, with on average 31,000 3D points per image. Importantly, the ground-truth of this dataset is such that the labels of corresponding 2D and 3D nodes are always the same (note, however, that by examining the dataset one can easily verify that its ground-truth is often erroneous, due to the inaccurate projection and misalignment problem). In other words, this dataset is not particularly well-suited to our approach. However, it remains a standard benchmark, and no dataset other than DATA61/2D3D explicitly evidencing the misalignment problem is available. The CMU/VMR dataset contains 19 classes, which determines the sizes of the parameter matrices of the 2D-3D CRF with latent nodes, as well as the alternative matrices of the 2D-3D CRF without latent nodes. Table I lists these parameter matrices and describes how their sizes relate to the selected set of features and class labels in the experiments.
We compare the results of the 2D-3D CRF model with hand-crafted and with learned potentials, and with and without latent nodes, in Table VII and Table VIII for the 2D and 3D domains, respectively. In this case, while our approach still yields the best F1-scores on average, the difference between our results with latent nodes and those of the model without latent nodes is smaller. This is easily explained by the fact that, as mentioned above, the ground-truth labels of corresponding nodes in 2D and 3D are always the same. In addition, our method does not perform very well on rare categories with an insufficient number of training samples, e.g., the last five classes in the tables. This outcome is not surprising, since our training strategy heavily relies on the training data. Figure 9 provides a qualitative comparison.
Six geometric classes are considered in the CMU/VMR dataset (Table III). As for the DATA61/2D3D dataset, the 2D and 3D geometric data augment the semantic model as two separate data modalities, and their simultaneous inference is carried out given the semantic and geometric cues of the 2D and 3D data. Tables VII and VIII report the results of the 2D and 3D semantic scene parsing using the proposed semantic and geometric 2D/3D multimodal model, which improves the F1-scores of both the 2D and 3D data. The results of the geometric labeling of the 2D and 3D data are shown in Table IX. Figure 11 shows some sample results of our semantic and geometric labeling on the CMU/VMR dataset.
Table II. Training and inference times.

Method | Training time (DATA61/2D3D) | Inference time (DATA61/2D3D) | Training time (CMU/VMR) | Inference time (CMU/VMR)
2D-3D CRF with latent nodes | 6 hr 45 min | 0.85 s | 4 hr 40 min | 0.47 s
Simultaneous inference of semantic and geometric classes in both 2D and 3D | 19 hr 20 min | 2.3 s | 24 hr 15 min | 1.2 s

Table III. Mapping between the geometric and semantic classes.

Geometric class | Semantic classes (DATA61/2D3D dataset) | Semantic classes (CMU/VMR dataset)
Horizontal Plane | Grass, Road, Sidewalk | Road, Sidewalk, Ground, Stairs
Vertical Plane | Building, Vehicle | Building, Small Vehicle, Big Vehicle
Cylindrical | Tree Trunk, Pole, Sign, Post, Barrier | Barrier, Bus Stop, Tree Trunk, Tall Light, Post, Sign, Utility Pole, Traffic Signal
Scattered | Tree Leaves, Bush | Shrub, Tree Top
Sky | Sky | —
Person | — | Person
Wire | Wire | Wire

Table IV. F1-scores (%) of 2D semantic labeling on the DATA61/2D3D dataset.

Method | Grass | Building | Tree trunk | Tree leaves | Vehicle | Road | Bush | Pole | Sign | Post | Barrier | Wire | Sidewalk | Sky | avg
Unary | 80 | 33 | 14 | 80 | 49 | 95 | 16 | 28 | 3 | 0 | 0 | 29 | 15 | 98 | 38
Pairwise 2D (learned) | 85 | 57 | 17 | 85 | 55 | 95 | 18 | 30 | 0 | 0 | 3 | 34 | 20 | 99 | 43
2D-3D hand-crafted potentials, Namin [6] | 74 | 56 | 21 | 82 | 58 | 92 | 23 | 33 | 19 | 8 | 5 | 32 | 29 | 97 | 45
2D-3D learned potentials (no feature) | 94 | 58 | 12 | 83 | 72 | 64 | 31 | 34 | 6 | 0 | 13 | 37 | 48 | 97 | 46
2D-3D learned potentials (full features) | 90 | 63 | 10 | 91 | 68 | 96 | 31 | 43 | 1 | 0 | 0 | 44 | 53 | 99 | 49
2D-3D learned potentials (selected features) | 92 | 64 | 18 | 92 | 69 | 98 | 36 | 34 | 3 | 0 | 28 | 40 | 60 | 99 | 52
2D-3D learned potentials with latent nodes | 95 | 71 | 28 | 93 | 76 | 97 | 44 | 44 | 10 | 5 | 21 | 38 | 68 | 99 | 56
Semantic results with semantic & geometric model (selected features) | 92 | 70 | 26 | 93 | 72 | 97 | 32 | 49 | 17 | 0 | 0 | 63 | 65 | 99 | 55
Semantic results with semantic & geometric model and latent nodes | 93 | 79 | 45 | 95 | 77 | 98 | 34 | 55 | 22 | 0 | 0 | 63 | 83 | 99 | 60
Semantic results with semantic & geometric model (connected 2D frames) | 95 | 82 | 52 | 90 | 78 | 99 | 78 | 99 | 33 | 60 | 20 | 61 | 92 | 99 | 62

Table V. F1-scores (%) of 3D semantic labeling on the DATA61/2D3D dataset ("—": Sky is not present in the 3D data).

Method | Grass | Building | Tree trunk | Tree leaves | Vehicle | Road | Bush | Pole | Sign | Post | Barrier | Wire | Sidewalk | Sky | avg
Unary | 52 | 61 | 27 | 87 | 58 | 82 | 10 | 24 | 19 | 43 | 19 | 74 | 0 | — | 43
Pairwise 3D (learned) | 58 | 80 | 50 | 97 | 56 | 76 | 16 | 62 | 32 | 40 | 0 | 89 | 0 | — | 50
2D-3D hand-crafted potentials, Namin [6] | 63 | 81 | 41 | 96 | 70 | 76 | 21 | 38 | 28 | 47 | 23 | 87 | 0 | — | 52
2D-3D learned potentials (no feature) | 68 | 81 | 31 | 92 | 67 | 83 | 69 | 43 | 37 | 25 | 16 | 75 | 10 | — | 54
2D-3D learned potentials (full features) | 72 | 75 | 27 | 95 | 77 | 90 | 42 | 62 | 31 | 9 | 0 | 89 | 0 | — | 52
2D-3D learned potentials (selected features) | 60 | 92 | 45 | 97 | 75 | 79 | 61 | 58 | 49 | 29 | 27 | 82 | 0 | — | 58
2D-3D learned potentials with latent nodes | 66 | 94 | 49 | 95 | 79 | 83 | 51 | 62 | 54 | 43 | 25 | 89 | 8 | — | 61
Semantic results with semantic & geometric model (selected features) | 71 | 88 | 51 | 97 | 76 | 84 | 56 | 60 | 51 | 49 | 6 | 92 | 21 | — | 62
Semantic results with semantic & geometric model and latent nodes | 79 | 91 | 64 | 99 | 77 | 93 | 60 | 61 | 50 | 58 | 0 | 96 | 34 | — | 66
Semantic results with semantic & geometric model (connected 2D frames) | 80 | 92 | 65 | 98 | 75 | 93 | 65 | 59 | 49 | 62 | 0 | 93 | 32 | — | 66

Table VI. F1-scores (%) of geometric labeling on the DATA61/2D3D dataset ("—": Sky is not present in the 3D data).

Method | Horizontal plane | Vertical plane | Cylindrical | Scattered | Wire | Sky | avg
2D geometric results with semantic & geometric model | 98 | 76 | 25 | 94 | 43 | 99 | 72
3D geometric results with semantic & geometric model | 99 | 91 | 62 | 99 | 95 | — | 89

Table VII. F1-scores (%) of 2D semantic labeling on the CMU/VMR dataset.

Method | Road | Sidewalk | Ground | Building | Barrier | Bus stop | Stairs | Shrub | Tree trunk | Tree top | Small vehicle | Big vehicle | Person | Tall light | Post | Sign | Utility pole | Wire | Traffic signal | avg
Unary | 95 | 81 | 75 | 56 | 29 | 17 | 32 | 50 | 31 | 53 | 32 | 49 | 29 | 16 | 15 | 16 | 33 | 41 | 29 | 41
Pairwise 2D (learned) | 89 | 77 | 74 | 84 | 25 | 17 | 40 | 62 | 37 | 89 | 78 | 57 | 38 | 1 | 5 | 3 | 16 | 12 | 9 | 43
Munoz [5] | 96 | 90 | 70 | 83 | 50 | 16 | 33 | 62 | 30 | 86 | 84 | 50 | 47 | 2 | 9 | 16 | 14 | 2 | 17 | 45
2D-3D hand-crafted potentials, Namin [6] | 94 | 87 | 79 | 74 | 45 | 22 | 40 | 54 | 27 | 84 | 67 | 24 | 38 | 13 | 2 | 10 | 37 | 35 | 40 | 46
2D-3D learned potentials (no feature) | 95 | 84 | 78 | 70 | 58 | 18 | 57 | 68 | 43 | 84 | 81 | 52 | 55 | 9 | 3 | 2 | 15 | 5 | 8 | 47
2D-3D learned potentials (full features) | 93 | 85 | 83 | 88 | 60 | 4 | 61 | 67 | 41 | 87 | 79 | 61 | 45 | 0 | 3 | 2 | 12 | 9 | 2 | 46
2D-3D learned potentials (selected features) | 93 | 80 | 80 | 87 | 60 | 1 | 70 | 67 | 37 | 90 | 84 | 67 | 54 | 7 | 4 | 4 | 21 | 15 | 3 | 49
2D-3D learned potentials with latent nodes | 94 | 84 | 84 | 84 | 65 | 4 | 75 | 64 | 43 | 89 | 84 | 58 | 52 | 11 | 6 | 2 | 25 | 18 | 3 | 50
Semantic results with semantic & geometric model (selected features) | 94 | 87 | 82 | 82 | 61 | 26 | 59 | 68 | 43 | 89 | 74 | 60 | 55 | 0 | 4 | 4 | 27 | 15 | 8 | 49
Semantic results with semantic & geometric model and latent nodes | 94 | 87 | 84 | 81 | 58 | 28 | 63 | 66 | 47 | 87 | 78 | 64 | 56 | 0 | 6 | 5 | 38 | 17 | 10 | 51

Table VIII. F1-scores (%) of 3D semantic labeling on the CMU/VMR dataset.

Method | Road | Sidewalk | Ground | Building | Barrier | Bus stop | Stairs | Shrub | Tree trunk | Tree top | Small vehicle | Big vehicle | Person | Tall light | Post | Sign | Utility pole | Wire | Traffic signal | avg
Unary | 70 | 49 | 62 | 67 | 34 | 2 | 19 | 26 | 11 | 67 | 34 | 4 | 13 | 2 | 0 | 1 | 2 | 0 | 0 | 24
Pairwise 3D (learned) | 78 | 52 | 67 | 78 | 15 | 1 | 32 | 31 | 1 | 73 | 44 | 14 | 9 | 1 | 0 | 0 | 0 | 0 | 0 | 26
Munoz [5] | 82 | 73 | 68 | 87 | 46 | 11 | 38 | 63 | 28 | 88 | 73 | 56 | 26 | 10 | 0 | 0 | 0 | 0 | 0 | 39
2D-3D hand-crafted potentials, Namin [6] | 92 | 85 | 81 | 85 | 50 | 16 | 42 | 55 | 29 | 82 | 70 | 16 | 43 | 6 | 2 | 7 | 29 | 9 | 23 | 43
2D-3D learned potentials (no feature) | 92 | 84 | 85 | 87 | 64 | 3 | 59 | 64 | 32 | 77 | 70 | 19 | 42 | 5 | 2 | 3 | 7 | 3 | 9 | 42
2D-3D learned potentials (full features) | 90 | 86 | 87 | 90 | 59 | 2 | 64 | 69 | 31 | 79 | 70 | 29 | 47 | 1 | 1 | 0 | 5 | 0 | 0 | 43
2D-3D learned potentials (selected features) | 90 | 85 | 85 | 89 | 62 | 2 | 63 | 68 | 29 | 86 | 78 | 46 | 53 | 3 | 1 | 0 | 15 | 0 | 0 | 45
2D-3D learned potentials with latent nodes | 92 | 88 | 84 | 88 | 64 | 7 | 66 | 66 | 31 | 86 | 75 | 42 | 53 | 8 | 7 | 0 | 17 | 10 | 0 | 47
Semantic results with semantic & geometric model (selected features) | 93 | 86 | 85 | 92 | 66 | 12 | 62 | 68 | 39 | 86 | 80 | 47 | 56 | 0 | 2 | 2 | 21 | 10 | 0 | 48
Semantic results with semantic & geometric model and latent nodes | 94 | 86 | 87 | 90 | 71 | 18 | 60 | 70 | 44 | 87 | 78 | 43 | 58 | 0 | 2 | 2 | 28 | 13 | 0 | 50

Table IX. F1-scores (%) of geometric labeling on the CMU/VMR dataset.

Method | Horizontal plane | Vertical plane | Cylindrical | Scattered | Person | Wire | avg
2D geometric results with semantic & geometric model | 97 | 85 | 44 | 88 | 56 | 52 | 70
3D geometric results with semantic & geometric model | 96 | 91 | 60 | 87 | 56 | 19 | 68

VII-C Scalability
In this section, we discuss the scalability of our model. In the proposed model with multiple modalities, each node corresponds to only a few nodes in the other modalities; hence, the number of latent nodes in the graph grows linearly with the total number of nodes across all modalities.
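A back-of-the-envelope sketch of this linear growth (the correspondence count per node is an assumed small constant for illustration, not a measured value):

```python
# Illustrative upper bound: one latent node per cross-modal link, with each
# node linked to at most a few nodes in other modalities, so the latent node
# count is bounded by a constant multiple of the total node count.
def count_latent_nodes(num_nodes_per_modality, links_per_node=3):
    total_nodes = sum(num_nodes_per_modality)
    return total_nodes * links_per_node   # linear in the total node count

for scale in (1, 10, 100):
    n = count_latent_nodes([1000 * scale, 800 * scale])
    assert n == 3 * 1800 * scale          # grows linearly with total nodes
```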
Augmenting our model with latent nodes introduces new potential functions between the latent nodes and the 2D/3D nodes, with their own parameter matrices. However, as explained in Sec. IV, only a limited number of parameters must be trained for each potential function.
To demonstrate the scalability of the proposed method, and since no dataset with more than two visual modalities was publicly available, we considered the geometric class labels as another form of visual modality. Note that using a real visual sensor as another modality might pose additional challenges in finding node correspondences between modalities. Nonetheless, in terms of computational complexity, the problem is no different from our case, where the geometric classes are treated as a domain.
The training and inference times reported in Table II show that, even though the training time grows to some extent with the number of modalities, the inference times remain short.
VIII Conclusion
In this paper, we have presented a general multimodal model that can simultaneously accommodate multiple modalities. We have also addressed the problem of domain inconsistencies in multimodal semantic labeling, an important issue whenever multimodal data is concerned. Such inconsistencies typically cause undesirable connections between two modalities, which in turn lead to poor labeling performance. We have therefore proposed a latent CRF model, in which latent nodes supervise the pairwise edges between each pair of domains. Having access to the information of both modalities, these nodes can either improve the labeling in both domains or cut the links between inconsistent regions. Furthermore, we presented a new set of data-driven learned potentials, which can model complex relationships between the latent nodes and the modalities. In addition, our general model enables us to jointly consider the geometric and semantic classes for both 2D and 3D data and to perform concurrent inference on them to further improve the 2D and 3D semantic labeling results. Thanks to our general model, latent nodes and learned potentials, our approach achieved state-of-the-art results on two publicly available datasets.
Acknowledgment
The authors would like to thank Justin Domke for his assistance in implementing the learned potentials.
References
 [1] I. Posner, M. Cummins, and P. Newman, “Fast probabilistic labeling of city maps,” in RSS, 2008.
 [2] H. Zhang, J. Wang, T. Fang, and L. Quan, “Joint segmentation of images and scanned point cloud in large-scale street scenes with low-annotation cost,” IEEE TIP, vol. 23, no. 11, pp. 4763–4772, 2014.
 [3] B. Douillard, A. Brooks, and F. Ramos, “A 3D laser and vision based classifier,” in ISSNIP, 2009.
 [4] C. Cadena and J. Kosecka, “Semantic segmentation with heterogeneous sensor coverages,” in ICRA, 2014.
 [5] D. Munoz, J. A. Bagnell, and M. Hebert, “Co-inference for multi-modal scene analysis,” in ECCV, 2012.
 [6] S. T. Namin, M. Najafi, M. Salzmann, and L. Petersson, “A multimodal graphical model for scene analysis,” in WACV, 2015.
 [7] J. Domke, “Learning graphical model parameters with approximate marginal inference,” PAMI, vol. 35, no. 10, pp. 2454–2467, 2013.
 [8] G. Singh and J. Kosecka, “Nonparametric scene parsing with adaptive feature relevance and semantic context,” in CVPR, 2013.
 [9] J. Xiao and L. Quan, “Multiple view semantic segmentation for street view images.” in ICCV, 2009.
 [10] L. Ladicky, P. Sturgess, K. Alahari, C. Russell, and P. H. S. Torr, “What, where and how many? combining object detectors and crfs,” in ECCV, 2010.