Soft Correspondences in Multimodal Scene Parsing

09/28/2017 ∙ by Sarah Taghavi Namin, et al. ∙ EPFL ∙ CSIRO ∙ Australian National University

Exploiting multiple modalities for semantic scene parsing has been shown to improve accuracy over the single-modality scenario. However, multimodal datasets often suffer from problems such as data misalignment and label inconsistencies, whereas existing methods assume that corresponding regions in two modalities must have identical labels. We propose to address this issue by formulating multimodal semantic labeling as inference in a CRF and introducing latent nodes to explicitly model inconsistencies between two modalities. These latent nodes allow us not only to leverage information from both domains to improve their labeling, but also to cut the edges between inconsistent regions. We propose to learn intra-domain and inter-domain potential functions from training data to avoid hand-tuning of the model parameters. We evaluate our approach on two publicly available datasets containing 2D and 3D data. Thanks to our latent nodes and our learning strategy, our method outperforms the state-of-the-art in both cases. Moreover, to highlight the benefits of the geometric information and the potential of our method in simultaneous 2D/3D semantic and geometric inference, we performed simultaneous inference of semantic and geometric classes in both 2D and 3D, which improved the labeling results on both datasets.

I Introduction

Various sensing modalities can be used concurrently to enhance the performance of scene understanding systems. For instance, high-resolution 2D images provide useful textural information about the objects, and 3D point cloud data reveal the 3D structure and size of the objects. In the context of scene labeling, where the goal is to assign a class label to the elements of each modality, such as image pixels and 3D points, this has been shown to consistently yield increased accuracy over relying on a single domain [1, 2, 3, 4, 5, 6].

In this paper, we propose a multimodal model that leverages several modalities simultaneously, so that the classification of each modality can be enhanced using the information of the other sensing modalities (Figure 1). Taking into account multiple modalities that contain different types of information and cover different sorts of object categories is a challenging task, which we address in this work. Only a limited number of works use multimodal sensing, and they generally assume that the corresponding elements in the various modalities must take identical class labels. This assumption is encoded either explicitly by having a single label variable for all modalities [1, 3, 4], or implicitly by penalizing label differences between the domains [2, 5, 6]. This assumption, however, is very restrictive, if not infeasible, given different data modalities with their own specific properties and object categories. For example, Grass in 2D data may correspond to the class horizontal plane in 3D data, and Sky, which is a frequent class in 2D images of outdoor data, cannot be recorded in 3D data. In addition, the different modalities are typically not perfectly aligned/registered in practice. Furthermore, in dynamic scenes, moving objects may not easily be captured by some sensors, such as 3D Lidar, due to their lower acquisition speed. Note that a Lidar system captures 3D data continuously using a rotating sensor, unlike snapshot sensors where the image data are captured instantaneously. To give a concrete example, in the DATA61/2D3D dataset employed in our experiments, 17% of the connections between the two modalities correspond to inconsistent labels. As a consequence, existing methods fail to model these inconsistencies and, hence, produce wrong labels in at least one modality.

Fig. 1: The proposed multimodal graphical model. The dots represent the nodes of more modalities, the intra-domain connections are represented by colored lines and the inter-domain connections are denoted by gray lines. The latent nodes exist between each inter-modality connection, though they have not been illustrated in this figure to avoid any confusion.

Given the dissimilarities in the classes of different modalities and also the inherent misalignments between the domains, these modalities should be either studied separately, or connected such that each one of them could simultaneously utilize the incoming information of other modalities correctly. To this end, as shown in Figure 2, we formulate multimodal scene parsing as inference in a Conditional Random Field (CRF), and introduce latent nodes to handle conflicting evidence between the different domains. The benefit of these latent nodes is twofold: First, they can leverage information from both domains to improve their respective labeling. Second, and maybe more importantly, these nodes allow us to cut the edges between regions in different modalities when the local evidence of the domains is inconsistent. As a result, our approach lets us correctly assign different labels to the modalities. In our formulation, different modalities can cover different sets of class labels and still leverage the information of other modalities to enhance the performance of the scene parsing system.

More specifically, each connection between two domains is encoded by a latent node, which can take either a label from the same set as the regular nodes, or an additional label that explicitly represents a broken link. We then model the connections between the latent nodes and the different modalities with potential functions that allow us to handle inconsistencies. While many such connections exist, they come at little cost, because the only cases of interest are when the latent node and the regular node have the same label, and when the latent node indicates a broken edge. By contrast, having direct links between two modalities would require considering potential functions for each combination of two labels (i.e., $L^2$ for $L$ labels, vs $2L$ in our model). The connections between the modalities that do not have identical label spaces are also governed by the latent nodes, which have access to the features of both modalities. If these features match, the latent nodes take the class labels that are consistent with the labels of the nodes at the two ends of their respective connections. For example, the class Grass for a latent node is consistent with both horizontal plane in one modality and Grass in another one (Grass usually grows on horizontal surfaces). However, in case of a mismatch between the features of two modalities, the latent node breaks the link between them.

Note that our method enables us to apply additional modalities with their own set of categories. To investigate this ability of our model, we use a 2D-3D dataset and take into account each modality twice using its corresponding geometric and semantic annotations and model their relationships. This in turn improves the performance of the system, with negligible impact on its run-time. Furthermore, we also model intra-domain connections with potential functions that encode some notion of label compatibility and thus let us model more accurately the relationships between different class labels. Altogether, these connections allow the information to be transferred across the domains, thus encoding the fact that some classes may be easier to recognize in one modality than in the others. Since such general potential functions cannot realistically be manually tuned, we propose to learn them from training data. To this end, we make use of the truncated tree-reweighted (TRW) learning algorithm of [7]. The resulting method therefore incorporates local evidence from each domain, intra-domain relationships and inter-domain compatibility via our latent nodes.

We demonstrate the effectiveness of our approach on two publicly available 2D-3D scene analysis datasets: The DATA61/2D3D dataset [6] and the CMU/VMR dataset [5]. Our experiments evidence the benefits of the latent nodes and augmentation of the multiple modalities with their semantic and geometric annotations. It also indicates the advantage of learning the potentials for multimodal scene parsing. In particular, our approach outperforms the state-of-the-art on both datasets.

II Related Work

Scene parsing has been an important and challenging problem in computer vision in recent years. In particular, semantic labeling of 2D image data has been studied to a large extent, yielding increasingly accurate results [8, 9, 10, 11, 12, 13]. With the advent of 3D depth sensors, such as laser range sensors (Lidar) [14, 15] and RGB-D cameras (e.g., Kinect) [16, 17, 18, 19], it seems natural to leverage these additional sources of information to further increase the level of scene understanding [20, 21, 22, 23].

In fact, more recently, several works have focussed on integrating 2D imagery and 3D point clouds for scene parsing [1, 3, 2, 4, 24, 5, 6]. In particular, [1, 3, 24] designed models based on variables corresponding to only one visual domain and then augmented them with visual cues extracted from the other modality. This approach, however, assumes that the same regions of the scene are observed in both domains, which is virtually never the case in practice. On the contrary, the model of [4] incorporates variables for the two domains, but still relies on a single variable for the corresponding regions in both modalities. As a result, this approach still assumes that there is a perfect alignment between different visual domains. This, unfortunately, can typically not be achieved in practice, and the above-mentioned techniques will thus misclassify some regions in at least one of the domains.

This assumption has been relaxed in some approaches by dedicating separate variables to the scene elements in the two modalities, even for matched regions. More specifically, [5] came up with a hierarchical segmentation framework that performs parsing in the two domains alternately. However, since each modality transfers its labeling results to facilitate labeling in the other modality (depending on the overlap area of the 2D region and the projection of the 3D segment onto the 2D region), this method implicitly assumes that the regions that correspond with each other in the two domains should take identical labels. In [2], a framework to train a joint 2D-3D graph from unlabeled data was proposed. Similar to [5], this method also propagates the labeling cues from one domain to the other, thus implicitly assuming that corresponding nodes in 2D and 3D data should take the same labels. Likewise, [6] introduced a multimodal graphical model where each domain was represented by separate nodes. This approach, however, uses the Potts model as pairwise potentials for both intra-domain and inter-domain edges. As a result, the assumption of assigning identical labels to the matched nodes in the 2D and 3D domains is implicitly encoded.

Here, by contrast, we propose to introduce latent nodes in a CRF to explicitly model the inconsistencies between two modalities. Furthermore, our approach lets us learn the intra-domain and inter-domain relationships from training data. Learning the parameters of CRFs for semantic labeling has been tackled by a number of works, such as [25, 26] with mean-field inference, [27] with TRW, and [28] with loopy belief propagation. Of more specific interest to us is the problem of learning label compatibility [26], as studied by [26] for 2D images and by [29] for 3D data. Here, we consider label compatibility within and across domains. To the best of our knowledge, this is the first time such a learning approach is employed for multimodal scene parsing.

There are other works that focus on semantic labeling and 3D reconstruction [30, 31, 32]. However, none of these works deals with the misalignment problem between natural 2D and 3D data. In particular, [32] is formulated based on only a single modality (RGB image) as input, and [30, 31] reconstruct 3D data synthetically from stereo images in their framework. Zhang et al. [33] also addressed the problem of multimodal 2D-3D semantic labeling by independently parsing the 2D and 3D data, and fusing their classification results. They, however, fail to account for the misalignment issue, which is a challenging problem in natural multimodal datasets. The closest work to this paper is [34], where the authors addressed the domain mismatch problem by designing a specific cardinality loss function within an SSVM framework. However, the higher-order potentials in their model make their approach computationally demanding, particularly when dealing with large-scale datasets. On the contrary, our graphical model is scalable and can be easily generalized to a larger set of modalities and classes. Furthermore, unlike other graph-based approaches, the set of edges in our graph is flexible and can vary depending on how well aligned the data modalities are in the problem.

Xie et al. [35] presented a multimodal dataset for outdoor scene understanding, though only 3D ground-truth annotation is provided with the dataset. The authors then used a dense 2D-3D graph to transfer the 3D label information to all 2D pixels. Gould et al. [36] integrated semantic and geometric cues into their 2D scene understanding system and decomposed the scene into semantically and geometrically meaningful regions. Following [36], Tighe and Lazebnik [37] incorporated geometric information into their region-wise scene parsing system (Superparsing), where they enforced coherence between the semantic labels (building, car, person, etc.) and geometric labels (sky, ground, vertical surfaces).

Inspired by the above, we propose to use the semantic and geometric information of both 2D and 3D data simultaneously. To this end, we build our model upon different nodes which represent the semantic and geometric labels of each modality separately. These nodes are then linked together as seen in Figure 3 for a simultaneous inference procedure. The evaluation results illustrate the superiority of this method over the previous work.

Fig. 2: Top: Existing approaches typically directly connect corresponding regions in different modalities and penalize these regions for taking different labels, thus producing wrong labeling in the presence of data misalignment, or other causes of label disagreement. Bottom: Here, we introduce latent nodes that are placed between each connected pair of 2D and 3D nodes in the graph. They explicitly let us account for such inconsistencies, and potentially cut edges between the different domains. Circles denote the nodes in one domain (e.g., 3D) and squares denote the nodes in another domain (e.g., 2D). The latent nodes are depicted by triangles.

III A General Multimodal CRF

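In this section, we present our multimodal graphical model. Let $\mathbf{x}^m = \{x_i^m\}$, $1 \le i \le N_m$, be the set of features extracted from the elements of the $m$-th modality and $\mathbf{y}^m = \{y_i^m\}$, $1 \le i \le N_m$, be the set of variables encoding the labels of the nodes in that modality, where each variable can take a label in the set $\mathcal{L}^m = \{1, \dots, L_m\}$. Then the joint distribution of all $M$ modalities conditioned on the features can be expressed as

$$
P(\mathbf{y}^1, \dots, \mathbf{y}^M \mid \mathbf{x}^1, \dots, \mathbf{x}^M) = \frac{1}{Z} \exp\Big( -\sum_{m} \sum_{i} \Phi^m(y_i^m) - \sum_{m} \sum_{(i,j) \in \mathcal{E}^m} \Psi^m(y_i^m, y_j^m) - \sum_{m < n} \sum_{(i,j) \in \mathcal{E}^{m,n}} \Psi^{m,n}(y_i^m, y_j^n) \Big) \tag{1}
$$

where $Z$ is the partition function, and $\Phi^m$ denotes the unary potentials of modality $m$. $\Psi^m$ and $\Psi^{m,n}$ denote pairwise potentials defined over the set of edges $\mathcal{E}^m$ (intra-domain) and $\mathcal{E}^{m,n}$ (inter-domain), respectively. The potential functions in Equation 1 are built such that they intuitively model the correlation between the class probabilities and the local information of each node (unary potentials), as well as the contextual relationships between the pairs of adjacent nodes in the graph (pairwise potentials).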

In [6], handcrafted potentials were used for the multimodal graphical model, where the pairwise potential function is defined in a way that penalizes dissimilar class labels for two adjacent regions if their feature vectors are very similar. The contributions of the handcrafted potentials in the inference process are determined via a set of weighting parameters. These parameters are then adjusted through a validation step, so as to produce the lowest error on the validation data.

A drawback of the handcrafted potentials that are based on a Potts model is that they do not convey any information on the compatibility of different objects and class labels. As an example, take the scenario where a superpixel in the 2D domain is classified as Grass and has connections with two different 3D segments, one labeled as a flat object, e.g., Road or Grass, and the other predicted to be a cylindrical object such as Powerpole. Assigning the same weight to these pairwise links, even if they have the same amount of 2D-3D overlap, might not be the right decision because, in the former case, the predicted 2D class is compatible with the predicted class in the 3D domain. In the latter case, however, the difference in shape of the predicted classes demands a more finely tuned, class-specific pairwise weight. This problem can be addressed by considering different weights for different class combinations of the nodes in a pairwise edge, e.g., 2D:Grass-3D:Grass, 2D:Grass-3D:Road, or 2D:Grass-3D:Tree Trunk. Therefore, we assign a set of label compatibility parameters for all possible class combinations and learn them from data.

Moreover, assigning a fixed set of weights to the unary potentials of different modalities overlooks the fact that some of the classes are recognized better using one data modality, while other object classes can be described more precisely using the other modality. For instance, when it is deduced from the 3D data that the object of interest has a flat shape, the labeling algorithm should trust the 3D information more and put the object in one of the flat categories. If, in this case, the 2D data describes the object as a green entity, e.g., Grass, Bushes, Tree top, the classifier should ideally pick Grass as the class label.

Our goal is to construct and train our graphical model based on a set of potentials that describe: I) the reliability of the local information of each domain per class, and II) the cost of various intra-domain and inter-domain class neighborhoods (a.k.a. the label compatibility). To obtain a labeling, we perform inference in our CRF by making use of the truncated TRW algorithm of [7].

Fig. 3: Top: Our model which considers 2D semantic, 3D semantic, 2D geometric and 3D geometric nodes that are connected to each other via latent nodes. This model enables us to do inference on all the nodes using the semantic and geometric information simultaneously. Different colors represent different modalities. The latent nodes are represented by triangles.

III-A Potential Definition

The CRF formulation in Equation 1 includes several unary and pairwise potentials that are defined here. The unary potential of a node is generally computed via its local information and indicates the cost of assigning a class label to the node. We define the cost of assigning label $l$ to the corresponding variable $y_i^m$ as

$$
\Phi^m(y_i^m = l) = \big(A^m_{(l)}\big)^\top x_i^m \tag{2}
$$

where $A^m \in \mathbb{R}^{L_m \times D_m}$ is the parameter matrix for the unary potential in modality $m$, with the row $A^m_{(l)}$ of $A^m$ corresponding to label $l$. Since its rows directly act on the local features $x_i^m$, this matrix encodes how much each feature dimension should be relied on to predict a specific label. Note that $D_m$ refers to the dimension of the feature vector in modality $m$.
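As a minimal sketch of the linear unary cost in Equation 2, the snippet below scores each label as the dot product between that label's parameter row and the node's feature vector. The function name `unary_cost`, the 1-based label convention, and the toy weights are illustrative assumptions; in the model the parameters are learned rather than hand-picked.

```python
from typing import Sequence


def unary_cost(params: Sequence[Sequence[float]],
               features: Sequence[float],
               label: int) -> float:
    """Cost of assigning `label` (1-based) to a node with the given features."""
    row = params[label - 1]  # one parameter row per label
    return sum(w * f for w, f in zip(row, features))


# Hypothetical 2-label, 2-feature example; weights are hand-picked, not learned.
params = [[1.0, 0.0],   # row for label 1
          [0.0, 1.0]]   # row for label 2
features = [0.2, 1.5]
costs = {label: unary_cost(params, features, label) for label in (1, 2)}
cheapest = min(costs, key=costs.get)  # label with the lowest unary cost
```

Because the potentials are costs, the preferred label for an isolated node is the one with the smallest value.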

Pairwise potentials express the cost of all possible joint label assignments for two adjacent nodes in the graph. The handcrafted potentials are limited to simply encouraging the nodes to share the same labels. By contrast, here, we define general pairwise potentials that let us encode sophisticated label compatibilities. For the intra-domain edges, these potentials are defined as

$$
\Psi^m(y_i^m = l, y_j^m = l') = \big(B^m_{(l,l')}\big)^\top x_{ij}^m \tag{3}
$$

where $B^m$ is a parameter matrix with $L_m^2$ rows representing all possible combinations of two labels, and $B^m_{(l,l')}$ is the row of $B^m$ corresponding to the combination of label $l$ with label $l'$. In this case, we set the edge features $x_{ij}^m$ to be the norm of the difference of a subset of the original node features $x_i^m$ and $x_j^m$, which will be discussed in Section VI-A1.

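Similarly, the inter-domain pairwise potential between modality $m$ and modality $n$ is defined as

$$
\Psi^{m,n}(y_i^m = l, y_j^n = l') = \big(B^{m,n}_{(l,l')}\big)^\top x_{ij}^{m,n} \tag{4}
$$

where $x_{ij}^{m,n}$ is the concatenation of a subset of the original node features in $m$ and $n$.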
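The pairwise potentials can be sketched in the same way: one parameter row per ordered label pair, applied to an edge feature vector. In this illustrative sketch the intra-domain edge feature is the Euclidean norm of the feature difference, which is an assumption (the text only says "norm"); the row layout, function names, and toy weights are likewise hypothetical.

```python
import math
from typing import List, Sequence


def edge_feature(x_i: Sequence[float], x_j: Sequence[float]) -> List[float]:
    """One-dimensional edge feature: norm of the node feature difference.

    Assumption: Euclidean distance is used as the norm.
    """
    return [math.dist(x_i, x_j)]


def pairwise_cost(params: Sequence[Sequence[float]], num_labels: int,
                  label_i: int, label_j: int,
                  edge_features: Sequence[float]) -> float:
    """Cost of the joint assignment (label_i, label_j), both 1-based."""
    # Assumed layout: rows ordered as (1,1), (1,2), ..., (L,L).
    row = params[(label_i - 1) * num_labels + (label_j - 1)]
    return sum(w * f for w, f in zip(row, edge_features))


# Hypothetical 2-label compatibility weights; not learned values.
params = [[0.0], [2.0], [2.0], [0.0]]
x_ij = edge_feature([0.0, 0.0, 0.0], [3.0, 4.0, 0.0])
same = pairwise_cost(params, 2, 1, 1, x_ij)
different = pairwise_cost(params, 2, 1, 2, x_ij)
```

With a full table of rows, nothing forces the diagonal to be the cheapest entry, which is what lets such potentials encode label compatibility rather than only label agreement.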

IV General Multimodal CRF with Latent Nodes

We now address the problem of inconsistencies across the modalities by introducing latent nodes to our model. The latent nodes are placed between the pairs of corresponding nodes in two modalities $m$ and $n$. This breaks down each between-modality edge into two edges, one linking the node in $m$ to the latent node and one linking the node in $n$ to the latent node. In other words, no edge directly connects $m$ to $n$. Our latent nodes can either take a label from the same space as the label space of the $m$ or the $n$ nodes (when $m$ and $n$ have different label spaces, the latent node can take a label from one of them), or another label indicating that the link between the two modalities should be cut.

Formally, let $\mathbf{y}^m = \{y_i^m\}$, $1 \le i \le N_m$, be the set of variables encoding the node labels in modality $m$. Each of these variables can take a label in the set $\mathcal{L} = \{1, \dots, L\}$. Furthermore, let $N^\Delta$ be the number of pairs of corresponding nodes in modality $m$ and modality $n$, found in the manner described in Section VI-A1. We then denote by $\mathbf{y}^\Delta = \{y_k^\Delta\}$, $1 \le k \le N^\Delta$, the latent nodes associated with these correspondences. These variables can be assigned a label from the space $\mathcal{L}^\Delta = \{0, 1, \dots, L\}$, where label $0$ represents a broken link, which means the nodes do not influence each other.

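Given $\mathbf{x}^m$ as the features extracted from the elements in modality $m$, the joint probability distribution of all data nodes and latent nodes conditioned on the features can be expressed as

$$
P(\mathbf{y}, \mathbf{y}^\Delta \mid \mathbf{x}) = \frac{1}{Z} \exp\Big( -\sum_{m} \sum_{i} \Phi^m(y_i^m) - \sum_{k} \Phi^\Delta(y_k^\Delta) - \sum_{m} \sum_{(i,j) \in \mathcal{E}^m} \Psi^m(y_i^m, y_j^m) - \sum_{m} \sum_{(i,k) \in \mathcal{E}^{m-\Delta}} \Psi^{m-\Delta}(y_i^m, y_k^\Delta) \Big) \tag{5}
$$

where $\Phi^\Delta$ denotes the unary potential of the latent nodes and $\Psi^{m-\Delta}$ denotes the pairwise potentials defined over the set of edges $\mathcal{E}^{m-\Delta}$. To obtain a labeling, as in Section III, we use the TRW method to perform inference in our CRF. In the remainder of this section, the latent potentials in Equation 5 are described.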
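To make the role of the latent node concrete, the following toy sketch evaluates the energy (the sum of unary and node-to-latent pairwise costs) of every joint labeling for a single pair of corresponding nodes and picks the cheapest one. Exhaustive enumeration is used here only because the example is tiny; it stands in for, and is not, the TRW inference used in the paper. All cost values, the large `PENALTY` constant, and the function names are hypothetical.

```python
from itertools import product

BROKEN = 0      # latent label meaning "cut the link"
PENALTY = 1e6   # assumed large cost for label pairs that should never occur


def link_cost(node_label: int, latent_label: int, cut_cost: float = 0.2) -> float:
    """Pairwise cost on an edge between a regular node and its latent node."""
    if latent_label == BROKEN:
        return cut_cost
    return 0.0 if latent_label == node_label else PENALTY


def energy(y_m, y_n, y_latent, unary_m, unary_n, unary_latent) -> float:
    """Total cost of one joint labeling of (node in m, node in n, latent node)."""
    return (unary_m[y_m] + unary_n[y_n] + unary_latent[y_latent]
            + link_cost(y_m, y_latent) + link_cost(y_n, y_latent))


def brute_force_map(unary_m, unary_n, unary_latent):
    """Lowest-energy labeling by exhaustive search (toy substitute for TRW)."""
    candidates = product(sorted(unary_m), sorted(unary_n), sorted(unary_latent))
    return min(candidates,
               key=lambda y: energy(*y, unary_m, unary_n, unary_latent))


# Conflicting local evidence: modality m prefers label 1, modality n prefers label 2.
unary_m = {1: 0.1, 2: 2.3}
unary_n = {1: 2.3, 2: 0.1}
unary_latent = {BROKEN: 0.5, 1: 1.0, 2: 1.0}
best = brute_force_map(unary_m, unary_n, unary_latent)
```

In this toy setting the cheapest configuration keeps each node's own preferred label and assigns the latent node the broken-link label, which is the behavior the latent nodes are meant to allow.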

IV-A Unary Potentials of Latent Nodes

Similar to data modality nodes, the unary potential for the latent nodes is defined as

$$
\Phi^\Delta(y_k^\Delta = l) = \big(A^\Delta_{(l)}\big)^\top x_k^\Delta \tag{6}
$$

where $A^\Delta$ is, again, a parameter matrix, which this time contains $L+1$ rows to represent the fact that a latent node can take an additional label to cut the connection between two modalities. The feature vector of a latent node is constructed by concatenating the features of the corresponding $m$ and $n$ nodes, i.e., $x_k^\Delta = [(x_i^m)^\top, (x_j^n)^\top]^\top$. Having access to both $m$ and $n$ features allows this unary to detect mismatches in the $m$ and $n$ observations, and in that event, favor cutting the corresponding edge.

IV-B Inter-domain Pairwise Potentials with Latent Nodes

The inter-domain pairwise potentials associated with the latent nodes that connect two modalities are defined as

$$
\Psi^{m-\Delta}(y_i^m = l, y_k^\Delta = l') = \big(B^{m-\Delta}_{(l,l')}\big)^\top x_{ik}^{m-\Delta} \tag{7}
$$

and

$$
\Psi^{n-\Delta}(y_j^n = l, y_k^\Delta = l') = \big(B^{n-\Delta}_{(l,l')}\big)^\top x_{jk}^{n-\Delta} \tag{8}
$$

where the parameter matrices now have $L(L+1)$ rows to account for the extra label of the latent nodes. In practice, we set the edge features $x_{ik}^{m-\Delta}$ and $x_{jk}^{n-\Delta}$ to 1, thus resulting in $L(L+1)$ parameters. Note, however, that the effective number of parameters corresponding to these potentials is much smaller. The reason is that the only cases of interest are when the latent node and the regular node take the same label, and when the latent node indicates a broken link. The cost of the other label combinations should be heavily penalized since they never occur in practice. This therefore truly results in $2L$ parameters for each of these potentials.
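The parameter count discussed above can be checked with a small sketch that builds the full cost table for one node-to-latent edge and marks the combinations that are not of interest with a large cost. The table layout, the `PENALTY` value, and the split into `same_cost` and `cut_cost` lists are illustrative assumptions about how such a table could be stored, not the paper's implementation.

```python
BROKEN = 0      # latent label for a broken link
PENALTY = 1e6   # assumed large cost for combinations that should not occur


def latent_pairwise_table(same_cost, cut_cost):
    """Cost table over (regular label, latent label) pairs.

    same_cost[l - 1]: cost when the latent node shares label l with the node.
    cut_cost[l - 1]:  cost when the node has label l and the link is broken.
    """
    num_labels = len(same_cost)
    table = {}
    for node_label in range(1, num_labels + 1):
        for latent_label in range(0, num_labels + 1):
            if latent_label == BROKEN:
                table[(node_label, latent_label)] = cut_cost[node_label - 1]
            elif latent_label == node_label:
                table[(node_label, latent_label)] = same_cost[node_label - 1]
            else:
                table[(node_label, latent_label)] = PENALTY
    return table


# Hypothetical costs for L = 3 labels.
table = latent_pairwise_table(same_cost=[0.0, 0.0, 0.0],
                              cut_cost=[0.3, 0.3, 0.3])
free_entries = [pair for pair, cost in table.items() if cost != PENALTY]
```

For three labels the table has 3 × 4 = 12 entries, of which only 6 (that is, twice the number of labels) carry a cost that would need to be learned.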

V Training our Multimodal Latent CRF

Our multimodal CRF contains many parameters, which thus cannot be tuned manually. Here, we propose to learn these parameters from training data. To this end, we make use of the direct loss minimization method of [7].

More specifically, let $\mathcal{D} = \{d^t\}$, $1 \le t \le T$, be a set of labeled training examples, such that $d^t = (\mathbf{x}^{m,t}, \mathbf{x}^{n,t}, \mathbf{y}^{m,t}, \mathbf{y}^{n,t}, \mathbf{y}^{\Delta,t})$, where, with a slight abuse of notation compared to Section III and Section IV, $\mathbf{x}^{m,t}$, resp. $\mathbf{y}^{m,t}$, encompasses the features, resp. ground-truth labels, of all the nodes in the $t$-th training sample for modality $m$, and similarly for the other terms in $d^t$. In practice, to obtain the ground-truth labels of the latent nodes $\mathbf{y}^{\Delta,t}$, we simply check if the ground-truth labels of the corresponding $m$ and $n$ nodes agree, and set the label of the latent node to the same label if they do, and to 0 otherwise. (Note that our nodes are latent in the sense that they do not correspond to physical entities, not in the sense that we do not have access to their ground-truth during training.)
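The rule for deriving latent ground-truth labels is simple enough to state in a few lines. The sketch below assumes both modalities share one label space with labels numbered from 1, and that correspondences are given as index pairs; the function name and data layout are hypothetical.

```python
from typing import List, Sequence, Tuple

BROKEN = 0  # latent label used when the two ground-truth labels disagree


def latent_ground_truth(labels_m: Sequence[int],
                        labels_n: Sequence[int],
                        correspondences: Sequence[Tuple[int, int]]) -> List[int]:
    """Ground-truth latent label for each corresponding (m, n) node pair."""
    return [
        labels_m[i] if labels_m[i] == labels_n[j] else BROKEN
        for i, j in correspondences
    ]


# Toy example: three nodes in modality m, two in modality n.
labels_m = [1, 2, 3]
labels_n = [1, 3]
pairs = [(0, 0), (1, 1), (2, 1)]
latent_labels = latent_ground_truth(labels_m, labels_n, pairs)
```

Here the second pair disagrees (label 2 vs label 3), so its latent node receives the broken-link label, while the other two inherit the shared label.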

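Learning the parameters of our model is then achieved by minimizing the empirical risk

$$
R(\theta) = \sum_{t=1}^{T} \ell(\theta, \mathbf{x}^t, \mathbf{y}^t) \tag{9}
$$

w.r.t. the model parameters $\theta$, where $\ell(\cdot)$ is a loss function.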

Here, we use a marginal-based loss function, which measures how well the marginals obtained via inference in the model match the ground-truth labels. In particular, we rely on a loss function defined on the clique marginals [38]. This can be expressed as $\ell(\theta, \mathbf{x}, \mathbf{y}) = -\sum_{c} \log \mu(\mathbf{y}_c; \mathbf{x}, \theta)$, where $c$ sums over all the cliques in the CRF, i.e., all the inter-domain and intra-domain pairwise cliques in our case, $\mathbf{y}_c$ denotes the variables of $\mathbf{y}$ involved in a particular clique $c$, and $\mu(\mathbf{y}_c; \mathbf{x}, \theta)$ indicates the marginals of clique $c$ obtained by performing inference with parameters $\theta$.
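As an illustration of a clique-wise marginal loss, the sketch below assumes the loss is the negative sum of the log-marginals that inference assigns to the ground-truth configuration of each pairwise clique. The marginals are supplied directly as toy numbers; computing them (e.g., with truncated TRW) is outside the scope of this sketch, and the dictionary layout is hypothetical.

```python
import math
from typing import Dict, Hashable, Tuple

JointLabel = Tuple[int, int]


def clique_loss(marginals: Dict[Hashable, Dict[JointLabel, float]],
                ground_truth: Dict[Hashable, JointLabel]) -> float:
    """Negative log-marginal of the ground-truth labels, summed over cliques."""
    return -sum(math.log(marginals[c][ground_truth[c]]) for c in ground_truth)


# Toy marginals for two pairwise cliques (each distribution sums to 1).
marginals = {
    "edge_a": {(1, 1): 0.5, (1, 2): 0.5},
    "edge_b": {(2, 0): 0.25, (2, 2): 0.75},
}
ground_truth = {"edge_a": (1, 1), "edge_b": (2, 0)}
loss = clique_loss(marginals, ground_truth)
```

The loss shrinks toward zero as the marginals place more mass on the ground-truth configuration of every clique.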

We use the publicly available implementation of [7] with truncated TRW as inference method. This method was shown to converge to stable parameters in only a few iterations. In practice, we run a maximum of 5 iterations of this algorithm.

VI Special Cases

In this section, we demonstrate how our general multimodal model can be used for modeling two special cases of I) 2D-3D multimodal data, and II) 2D-3D semantic and geometric multimodal data, both accompanied with latent nodes.

VI-A 2D-3D CRF with Latent Nodes

Since 2D imagery and 3D data are often the most popular modalities used for semantic labeling, here we focus the discussion on these two visual domains. Nevertheless, our approach generalizes to other modalities, such as infrared or hyper-spectral data.

Our model specifies separate nodes to 2D regions (i.e., superpixels) and 3D regions (i.e., 3D segments). More details about these regions are provided in Section VI-A1. We also consider latent nodes that enable us to take into account inconsistencies between the different modalities. To this end, and as illustrated in Figure 2, we incorporate one such latent node between each pair of corresponding 2D and 3D nodes. This results in edges between either a 2D node and a latent node, or a 3D node and a latent node, but no edges directly connecting a 2D node to a 3D node. Our latent nodes can then either take a label from the same space as the 2D and 3D nodes, or take another label indicating that the link between the two modalities should be cut (label 0). Figure 4 illustrates through an example how latent nodes operate in case of a misalignment between 2D and 3D data for narrow objects. In Figure 5 we show that multimodal data is prone to errors due to moving objects like a vehicle. In each case, latent nodes utilize the 2D and 3D information and either assist the linked 2D-3D regions to find their class label or cut off the link between them.

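Formally, let $\mathbf{y}^{2D} = \{y_i^{2D}\}$, $1 \le i \le \sum_{f=1}^{F} N_f^{2D}$, be the set of variables encoding the labels of the 2D nodes in $F$ frames, with frame $f$ containing $N_f^{2D}$ 2D regions. Similarly, let $\mathbf{y}^{3D} = \{y_j^{3D}\}$, $1 \le j \le N^{3D}$, be the set of variables encoding the labels of the 3D nodes. Each of these variables, either 2D or 3D, can take a label in the set $\mathcal{L} = \{1, \dots, L\}$. Furthermore, let $N^\Delta$ be the number of pairs of corresponding 2D and 3D nodes, found in the manner described in Section VI-A1. We then denote by $\mathbf{y}^\Delta = \{y_k^\Delta\}$, $1 \le k \le N^\Delta$, the latent nodes associated with these correspondences. These variables can be assigned a label from the space $\mathcal{L}^\Delta = \{0, 1, \dots, L\}$.

Given features extracted from the 2D and 3D regions, $\mathbf{x}^{2D}$ and $\mathbf{x}^{3D}$, respectively, the joint distribution of the 2D, 3D and latent nodes conditioned on the features can be expressed as

$$
\begin{aligned}
P(\mathbf{y}^{2D}, \mathbf{y}^{3D}, \mathbf{y}^\Delta \mid \mathbf{x}^{2D}, \mathbf{x}^{3D}) = \frac{1}{Z} \exp\Big( & -\sum_{i} \Phi^{2D}(y_i^{2D}) - \sum_{j} \Phi^{3D}(y_j^{3D}) - \sum_{k} \Phi^\Delta(y_k^\Delta) \\
& - \sum_{(i,i') \in \mathcal{E}^{2D}} \Psi^{2D}(y_i^{2D}, y_{i'}^{2D}) - \sum_{(j,j') \in \mathcal{E}^{3D}} \Psi^{3D}(y_j^{3D}, y_{j'}^{3D}) \\
& - \sum_{(i,k) \in \mathcal{E}^{2D-\Delta}} \Psi^{2D-\Delta}(y_i^{2D}, y_k^\Delta) - \sum_{(j,k) \in \mathcal{E}^{3D-\Delta}} \Psi^{3D-\Delta}(y_j^{3D}, y_k^\Delta) \Big)
\end{aligned} \tag{10}
$$

where $\Phi^{2D}$, $\Phi^{3D}$, and $\Phi^\Delta$ denote the unary potentials of the 2D, 3D and latent nodes, respectively. $\Psi^{2D}$, $\Psi^{3D}$, $\Psi^{2D-\Delta}$ and $\Psi^{3D-\Delta}$ denote pairwise potentials defined over the sets of edges $\mathcal{E}^{2D}$, $\mathcal{E}^{3D}$, $\mathcal{E}^{2D-\Delta}$ and $\mathcal{E}^{3D-\Delta}$, respectively. All the unary and pairwise potentials are calculated based on the formulations in Section III and Section IV. Below, we provide some details regarding our features and potentials.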

VI-A1 Features and Potentials

3D nodes: We extracted the following 3D shape features from the point cloud data: the fast point feature histogram (FPFH [39]), which describes the local point distributions based on the point distances and the orientations of their surface normal vectors w.r.t. each other; eigenvalue features that model the shape of the spatial distribution of the points; the deviation of the surface normal vectors from the vertical axis; and the height of the points. The 3D segments were obtained from these features by first classifying the points using an SVM classifier, partitioning them into different groups given their class labels, and then performing k-means clustering on each group of points based on their spatial coordinates. We then further leveraged the SVM results and used the negative logarithm of the multiclass SVM probabilities as features in our unary potentials. The probabilities for a segment were obtained by averaging over the points belonging to the segment. We also used three eigenvalue descriptors and the vertical-axis deviation as additional features for the segments.
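The segment-level unary features described above (average the per-point class probabilities within a segment, then take the negative logarithm) can be sketched as follows. The per-point probabilities are toy numbers standing in for SVM outputs, and the small `eps` floor that guards against `log(0)` is an added assumption, not something stated in the text.

```python
import math
from typing import List, Sequence


def segment_unary_features(point_probs: Sequence[Sequence[float]],
                           eps: float = 1e-12) -> List[float]:
    """Negative log of class probabilities averaged over a segment's points."""
    num_points = len(point_probs)
    num_classes = len(point_probs[0])
    mean_probs = [
        sum(p[c] for p in point_probs) / num_points for c in range(num_classes)
    ]
    # eps (assumed) avoids log(0) when a class gets zero averaged probability.
    return [-math.log(max(p, eps)) for p in mean_probs]


# Two points in one segment, two classes (toy classifier outputs).
features = segment_unary_features([[0.8, 0.2], [0.6, 0.4]])
```

A class with a high averaged probability thus yields a small feature value, in line with the cost interpretation of the unary potentials.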

2D nodes: As 2D regions, we used superpixels extracted by the mean-shift algorithm [40]. We utilized histogram of SIFT features [41], GLCM features (entropy, homogeneity and contrast, each computed in both horizontal and vertical directions), and RGB values to train an SVM classifier, and used the negative logarithm of the SVM probabilities as features in our unary potentials. We augmented these features with six GLCM features and three RGB features.

Latent nodes: The features of the latent nodes were obtained by concatenating the features of their respective 2D and 3D nodes, described above. Furthermore, we augmented these features with the normalized overlap area of the projection of the 3D segment onto the 2D superpixel.

Edges: For the intra-domain potentials, we employed the norm of the difference of a subset of the local feature vectors (RGB for 2D-2D edges and vertical-axis deviation for 3D-3D edges) as pairwise features. The feature vectors of the 2D-$\Delta$ and 3D-$\Delta$ edges were set to a single value of 1. In the case of the 2D-3D CRF with no latent nodes, however, the feature vector of the 2D-3D edges was constructed by concatenating the RGB values of the 2D node with the eigenvalue features and the vertical-axis deviation of the 3D node, as well as with the same normalized overlap area used for the unary of the latent nodes. We selected these features through an ablation study conducted on the validation set. As evidenced by our results, the selected features yield better accuracy than employing all available features, which causes overfitting. Note that we obtain the 2D-3D edges by projecting the 3D clusters onto the 2D regions and then linking the pairs of 2D-3D elements that have a considerable projection overlap with each other, i.e., an intersection over union of more than 0.2.
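The 2D-3D linking rule (intersection over union above 0.2) can be illustrated with a short sketch in which each superpixel and each projected 3D segment is represented as a set of pixel coordinates. This representation, the function names, and the toy regions are assumptions made for illustration; the projection step itself is not shown.

```python
from typing import List, Sequence, Set, Tuple

Pixel = Tuple[int, int]


def iou(a: Set[Pixel], b: Set[Pixel]) -> float:
    """Intersection over union of two pixel sets."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def link_regions(superpixels: Sequence[Set[Pixel]],
                 projected_segments: Sequence[Set[Pixel]],
                 threshold: float = 0.2) -> List[Tuple[int, int]]:
    """Index pairs (superpixel, segment) whose overlap exceeds the threshold."""
    return [
        (i, j)
        for i, sp in enumerate(superpixels)
        for j, seg in enumerate(projected_segments)
        if iou(sp, seg) > threshold
    ]


# Toy regions: the first pair overlaps well (IoU = 1/3), the second barely (0.1).
superpixels = [{(0, 0), (0, 1), (1, 0), (1, 1)}, {(5, 5), (5, 6)}]
segments = [{(0, 1), (1, 1), (0, 2), (1, 2)},
            {(5, c) for c in range(6, 15)}]
edges = link_regions(superpixels, segments)
```

Each returned pair would then receive one latent node, with the overlap itself available as an extra latent feature.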

Fig. 4: Latent nodes for data misalignment. Left: The projection of a pole from 3D to 2D covers some regions of sky, which creates a connection between the corresponding 3D and 2D nodes. Having access to both 3D and 2D features, the latent node should detect the mismatch and cut this connection, thus allowing the nodes to take different labels. Right: In this example, we have an accurate projection. As a result, the features of the 2D and 3D nodes are both congruent with the category label pole. Hence, the latent node between them preserves an active edge between the nodes and predicts the same label.

Fig. 5: Latent nodes for moving objects. Left: A vehicle was observed in the 2D image, but is missing from the 3D data, since the 3D laser sensor had not covered that area when the vehicle passed. As a result, the 3D points in that region are labeled as road. By relying on both 2D and 3D features, the latent node should predict that this connection must be cut. Middle: This represents the opposite scenario, where the image depicts an empty road, while the 3D points were acquired when a vehicle was passing. Here again, the latent node should cut the edge, thus allowing the nodes to take different labels. Right: In contrast, here, the 2D and 3D regions belong to the same class and thus have coherent features. The latent node should therefore leverage this information to facilitate prediction of the correct label vehicle.

VI-B Simultaneous Inference of Semantic and Geometric Classes both in 2D and 3D

Fusing geometric and semantic cues has been shown to improve scene parsing results [36][37]. This strategy becomes more promising when the geometric labels are computed from 3D data, rather than from 2D data alone as in [36][37]. Figure 6 shows the semantic and geometric labeling of wires and tree leaves. In the semantic labeling, the wires were wrongly labeled as tree leaves, whereas in the geometric labeling they were distinguished from the tree leaves, and the two were correctly assigned to the wire and scattered categories, respectively. This can help us improve the semantic labeling. In this paper, we use the 2D and 3D semantic labelings together with the 2D and 3D geometric labelings and leverage their information through a concurrent inference process to improve the labeling results in each of them. The methods of [36][37] use three geometric classes: horizontal, vertical and sky. Having access to 3D point cloud data enables us to expand this list with the cylindrical and scattered categories in both 2D and 3D data, as explained in more detail in Section VI-B1. In our semantic-geometric mapping, each semantic class belongs to exactly one geometric class, e.g., all roads are assigned the horizontal label and all vehicles are given the vertical label.

Let four sets of variables encode the 2D semantic, 3D semantic, 2D geometric and 3D geometric class labels, respectively. We can then define the joint distribution of the 2D semantic, 2D geometric, 3D semantic, 3D geometric and latent nodes, conditioned on the node features, similarly to the definition in Equation 5. Note that the label sets of the geometric nodes and the semantic nodes are different.

Given that the geometric nodes represent the same set of 2D and 3D regions that were previously produced for semantic labeling, the 2D-3D geometric edges are similar to the 2D-3D semantic edges. Furthermore, note that the latent nodes that link a semantic node and a geometric node representing the same 2D region (or 3D segment) cannot cut their corresponding edges, even though the class labels of the two nodes differ. The reason is that they connect two visually identical regions (segments). Instead, they try to find a coherent pair of semantic and geometric class labels that fits the 2D and 3D features of the region (segment) well. As described in Section V, the truncated TRW method is used for inference. Despite the considerable increase in the size of the graph (number of nodes and edges), the inference time remains short. Table II presents the training and inference times for the DATA61/2D3D and CMU/VMR datasets.

Our method trains all the compatibility parameters between the semantic and geometric class labels, which contrasts with the Superparsing method [37], where only one parameter is embedded in the cost function to enforce consistency between these two groups of classes. Note that we used the same features as in Sec. VI-A1 for the geometric nodes.

VI-B1 Semantic and Geometric Classes

In order to best exploit the geometric cues, particularly given the 3D point cloud data, the data is clustered into different structural classes including horizontal plane, vertical plane, scattered and cylindrical (in addition to three other groups for specifically representing sky, person and wire). Table III provides the mapping between the geometric and semantic classes.
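The many-to-one relation between semantic and geometric classes can be written down as a simple lookup. The sketch below encodes the DATA61/2D3D column of Table III and inverts it, checking that every semantic class is assigned to exactly one geometric class. The variable and function names are illustrative only.

```python
from typing import Dict, List

# Geometric class -> semantic classes (DATA61/2D3D column of Table III).
GEOMETRIC_TO_SEMANTIC_DATA61: Dict[str, List[str]] = {
    "horizontal plane": ["grass", "road", "sidewalk"],
    "vertical plane": ["building", "vehicle"],
    "cylindrical": ["tree trunk", "pole", "sign", "post", "barrier"],
    "scattered": ["tree leaves", "bush"],
    "sky": ["sky"],
    "wire": ["wire"],
}


def invert_mapping(geo_to_sem: Dict[str, List[str]]) -> Dict[str, str]:
    """Build semantic -> geometric, rejecting any class listed twice."""
    sem_to_geo: Dict[str, str] = {}
    for geometric, semantics in geo_to_sem.items():
        for semantic in semantics:
            if semantic in sem_to_geo:
                raise ValueError(f"{semantic!r} maps to two geometric classes")
            sem_to_geo[semantic] = geometric
    return sem_to_geo


SEMANTIC_TO_GEOMETRIC = invert_mapping(GEOMETRIC_TO_SEMANTIC_DATA61)
print(SEMANTIC_TO_GEOMETRIC["road"], "/", SEMANTIC_TO_GEOMETRIC["vehicle"])
```

The inverted table covers the 14 semantic classes of the dataset and six geometric classes, which matches the counts given in the text.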

 

| Parameter matrix | Rows (DATA61/2D3D) | Columns (DATA61/2D3D) | Rows (CMU/VMR) | Columns (CMU/VMR) |
| --- | --- | --- | --- | --- |
| 2D unary | 14 classes | 14 2D-probabilities + 6 GLCM + 3 RGB | 19 classes | 19 2D-probabilities + 6 GLCM + 3 RGB |
| 3D unary | 13 classes | 13 3D-probabilities + 3 Eigenvalues + 1 vertical-axis deviation | 19 classes | 19 3D-probabilities + 3 Eigenvalues + 1 vertical-axis deviation |
| Latent unary | 14 classes + 1 edge-cut | 23 2D-features + 17 3D-features + 1 2D-3D overlap | 19 classes + 1 edge-cut | 28 2D-features + 23 3D-features + 1 2D-3D overlap |
| 2D-2D edges | 14 × 14 classes | 1 | 19 × 19 classes | 1 |
| 3D-3D edges | 13 × 13 classes | 1 | 19 × 19 classes | 1 |
| 2D-latent edges | 14 × 15 classes | 1 | 19 × 20 classes | 1 |
| 3D-latent edges | 13 × 15 classes | 1 | 19 × 20 classes | 1 |
| 2D-3D edges (no feature) | 14 × 13 classes | 1 | 19 × 19 classes | 1 |
| 2D-3D edges (selected features) | 14 × 13 classes | 3 RGB + 3 Eigenvalues + 1 vertical-axis deviation + 1 2D-3D overlap | 19 × 19 classes | 3 RGB + 3 Eigenvalues + 1 vertical-axis deviation + 1 2D-3D overlap |
| 2D-3D edges (full features) | 14 × 13 classes | 3 RGB + 3 Eigenvalues + 1 vertical-axis deviation + 1 2D-3D overlap + 6 GLCM + 14 2D-probabilities + 13 3D-probabilities | 19 × 19 classes | 3 RGB + 3 Eigenvalues + 1 vertical-axis deviation + 1 2D-3D overlap + 6 GLCM + 19 2D-probabilities + 19 3D-probabilities |

TABLE I: Different parameter matrices used in our experiments for the two multimodal datasets. The feature sets used in each case are listed.
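The feature counts in Table I follow from simple additions over the class counts of each dataset. The short sketch below reproduces this arithmetic; the function and key names are hypothetical, and only the individual counts (6 GLCM, 3 RGB, 3 eigenvalues, 1 deviation, 1 overlap) are taken from the table.

```python
from typing import Dict


def feature_dimensions(n_classes_2d: int, n_classes_3d: int) -> Dict[str, int]:
    """Column counts of the parameter matrices, following Table I."""
    glcm, rgb = 6, 3                 # 2D texture and colour features
    eigenvalues, deviation = 3, 1    # 3D shape features
    overlap = 1                      # normalized 2D-3D overlap area
    feats_2d = n_classes_2d + glcm + rgb
    feats_3d = n_classes_3d + eigenvalues + deviation
    selected = rgb + eigenvalues + deviation + overlap
    return {
        "2d_unary": feats_2d,
        "3d_unary": feats_3d,
        "latent_unary": feats_2d + feats_3d + overlap,
        "latent_states": n_classes_2d + 1,  # class labels plus the edge-cut state
        "2d3d_selected": selected,
        "2d3d_full": selected + glcm + n_classes_2d + n_classes_3d,
    }


data61 = feature_dimensions(n_classes_2d=14, n_classes_3d=13)
cmu_vmr = feature_dimensions(n_classes_2d=19, n_classes_3d=19)
print(data61)
print(cmu_vmr)
```

For DATA61/2D3D this gives 23 2D features, 17 3D features and 41 latent-unary features, and for CMU/VMR 28, 23 and 52, in agreement with the table.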

VII Experiments

We evaluate our method on two publicly available 2D-3D multimodal datasets (DATA61/2D3D [6] and CMU/VMR [5]). KITTI [42] is another well-known multimodal dataset used for outdoor scene understanding and object detection, and it has recently been adapted as a benchmark for the semantic labeling task [4]. However, the point cloud data in this dataset has a narrow vertical field of view [35] and, as a result, a large portion of the 2D images does not have a correspondence in the 3D data. Since our application focuses on the 2D-3D links, this dataset is less useful, and we therefore evaluate our method on the two above-mentioned multimodal datasets.

We provide the results of the 2D-3D CRF with and without latent nodes, as well as of the simultaneous inference of semantic and geometric classes in both 2D and 3D. We also compare the results to the state-of-the-art algorithms of [6] and [5]. The experiment on the 2D-3D CRF without latent nodes is a special case of the general multimodal CRF (Section III). In addition, we provide the results of the pairwise models with learned potentials acting on a single domain, either 2D or 3D. These models are referred to as Pairwise 2D (learned) and Pairwise 3D (learned). We followed the evaluation protocol of [6] and partitioned the data into 4 non-overlapping folds. We then used three of the folds for training and the remaining fold as the test set.
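A split into four non-overlapping folds can be sketched as follows. This is only an illustration: the round-robin assignment of items to folds and the rotation of the held-out fold are assumptions, since the text does not state how the folds of [6] were formed or whether every fold served as the test set in turn.

```python
from typing import Iterator, List, Sequence, Tuple


def make_folds(items: Sequence[int], k: int = 4) -> List[List[int]]:
    """Assign items to k disjoint folds in round-robin order (assumption)."""
    return [list(items[i::k]) for i in range(k)]


def train_test_splits(
    items: Sequence[int], k: int = 4
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train, test) pairs: k - 1 folds for training, one for testing."""
    folds = make_folds(items, k)
    for held_out in range(k):
        test = folds[held_out]
        train = [x for j, fold in enumerate(folds) if j != held_out for x in fold]
        yield train, test


# Example with 12 scene indices, as in a dataset of 12 scenes.
scenes = list(range(12))
splits = list(train_test_splits(scenes, k=4))
for train, test in splits:
    print("train:", train, "test:", test)
```

Each split uses nine items for training and three for testing, and no item appears in both sets.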

VII-A Results on DATA61/2D3D

The DATA61/2D3D dataset contains 12 outdoor scenes, where each scene is described by a 3D point cloud block together with 10-20 panoramic images. The number of 3D points per scene varies from 1 to 2 million. The dataset comprises 14 classes (13 in 3D, where sky was removed), which determines the sizes of the parameter matrices of the 2D-3D CRF with latent nodes, as well as those of the alternative 2D-3D edge matrices used by the 2D-3D CRF with no latent nodes. Table I lists these parameter matrices and describes how their sizes relate to the selected set of features and class labels in the experiments.

Table IV and Table V compare the results, as F1-scores, of the 2D-3D CRF model with handcrafted and learned potentials, with and without latent nodes. Note that no results for [5] are available on this dataset. The results in these tables evidence the benefits of using latent nodes, especially on the narrow classes that suffer more from misalignment. On average, our approach with latent nodes clearly outperforms the model with no latent nodes, and thus achieves state-of-the-art results on this dataset. Moreover, note that the 2D-3D CRF with no latent nodes that uses fewer features (selected features) for the 2D-3D edges is less likely to overfit and yields better results than the CRF model with the full set of features. For comparison, Table IV and Table V also report the results of the 2D-3D CRF with no latent nodes where the feature vector of the 2D-3D edges was set to a single value of 1 (no feature).
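For reference, a per-class F1-score can be computed from predicted and ground-truth labels as in the sketch below. It uses the standard definition F1 = 2TP / (2TP + FP + FN); the level at which the scores are accumulated (pixels, points or regions) and the way the "avg" column is formed are not specified here, so the unweighted mean used below is an assumption.

```python
from typing import Dict, Sequence


def per_class_f1(y_true: Sequence[str], y_pred: Sequence[str]) -> Dict[str, float]:
    """Per-class F1 = 2TP / (2TP + FP + FN); 0.0 if a class never occurs."""
    scores: Dict[str, float] = {}
    for cls in sorted(set(y_true) | set(y_pred)):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p != cls)
        denom = 2 * tp + fp + fn
        scores[cls] = 2 * tp / denom if denom else 0.0
    return scores


def macro_average(scores: Dict[str, float]) -> float:
    """Unweighted mean over classes (assumed averaging convention)."""
    return sum(scores.values()) / len(scores)


# Toy labels (hypothetical).
truth = ["road", "road", "pole", "sky", "pole"]
pred = ["road", "pole", "pole", "sky", "road"]
f1 = per_class_f1(truth, pred)
print(f1, macro_average(f1))
```

In this toy case road and pole each have one true positive, one false positive and one false negative, giving an F1-score of 0.5, while sky is predicted perfectly.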

In Figure 7, we illustrate the influence of our latent nodes with two examples. As shown in the figure, cutting the edge between non-matching 2D and 3D nodes (which were connected because of misalignment) helps predict the correct class labels. Figure 8 shows the results of our approach in one of the scenes in this dataset, compared to the results of [6].

[Figure 6. Legend: Wire-sem, Tree leaves, Wire-geo, Scattered.]

Fig. 6: Semantic labeling vs. geometric labeling. Left: semantic labeling. Right: geometric labeling. This sample image shows that geometric labeling, in contrast to semantic labeling, can distinguish between wire and tree leaves.

Our results on DATA61/2D3D indicate that, while our latent nodes are in general beneficial, thanks to their ability to cut incorrect connections, they still occasionally yield lower performance than a model without such nodes. We observed that this is mainly due either to inaccurate ground-truth (which is inevitable because of the imperfect 3D-2D projection of the ground-truth labels, particularly at the boundaries of narrow objects), or to the fact that, sometimes, even though the 2D and 3D features seem inconsistent (e.g., due to challenging viewing conditions), the corresponding regions still belong to the same category. In these circumstances, the stronger smoothness imposed by the model without latent nodes is able to address this problem.

2D-3D multimodal scene parsing on semantic and geometric classes can be seen as a special case of our multimodal model with four modalities. We considered six geometric classes in the DATA61/2D3D dataset (Table III) and followed the same procedures as in semantic labeling to obtain their regions and node features. The 2D and 3D geometric data augment the semantic model as two separate data modalities, and simultaneous inference is carried out given the semantic and geometric cues of the 2D and 3D data. Tables IV and V report the results of 2D and 3D semantic scene parsing using the proposed semantic and geometric 2D/3D multimodal model. As reported in these tables, leveraging the geometric cues led to improvements of 4% and 5% in the average F1-scores of the 2D and 3D data, respectively. The results of the geometric labeling of the 2D and 3D data are shown in Table VI.

Furthermore, the panoramic images in the DATA61/2D3D dataset make it possible to observe an object in successive image frames, so that multiple 2D features can be recorded for each object. We linked these corresponding 2D nodes together, with a latent node on each connection, to provide more information in the labeling process, and gained a 2% improvement in 2D performance, as shown in Table IV. Figure 10 shows some sample results of our semantic and geometric labeling on the DATA61/2D3D dataset.

VII-B Results on CMU/VMR

The CMU/VMR dataset comprises 372 pairs of urban images and corresponding 3D point cloud data, with on average 31,000 3D points per image. Importantly, the ground-truth of this data is such that the labels of corresponding 2D and 3D nodes are always the same. (Footnote 3: Note that by examining the dataset, one can easily verify that its ground-truth is often erroneous, due to inaccurate projection and the misalignment problem.) In other words, this dataset is not particularly well-suited to our approach. However, it remains a standard benchmark, and no other dataset, except the DATA61/2D3D dataset, that explicitly evidences the misalignment problem is available. The CMU/VMR dataset contains 19 classes, which determines the sizes of the parameter matrices of the 2D-3D CRF with latent nodes, as well as those of the alternative 2D-3D edge matrices used by the 2D-3D CRF with no latent nodes. Table I lists these parameter matrices and describes how their sizes relate to the selected set of features and class labels in the experiments.

We compare the results of the 2D-3D CRF model with handcrafted and learned potentials, with and without latent nodes, in Table VII and Table VIII for the 2D and 3D domains, respectively. In this case, while our approach still yields the best F1-scores on average, the difference between our results with latent nodes and those of the model without latent nodes is smaller. This can easily be explained by the fact that, as mentioned above, the ground-truth labels of corresponding nodes in 2D and 3D are always the same. In addition, it can be seen that our method does not perform very well on the rare categories with an insufficient number of training samples, e.g., the last five classes in the tables. This outcome is not surprising, since our training strategy relies heavily on the training data. Figure 9 provides a qualitative comparison.

Six geometric classes are considered in the CMU/VMR dataset (Table III). As for the DATA61/2D3D dataset, the 2D and 3D geometric data augment the semantic model as two separate data modalities, and simultaneous inference is carried out given the semantic and geometric cues of the 2D and 3D data. Tables VII and VIII report the results of 2D and 3D semantic scene parsing using the proposed semantic and geometric 2D/3D multimodal model. With latent nodes, it improves the average F1-score from 50 to 51 in 2D and from 47 to 50 in 3D. The results of the geometric labeling of the 2D and 3D data are shown in Table IX. Figure 11 shows some sample results of our semantic and geometric labeling on the CMU/VMR dataset.

[Figure 7. Panels: Image, 2D ground-truth, 3D ground-truth, 3D-2D projection, 2D results [6], 3D results [6], our 2D results, our 3D results. Legend: Grass, Building, Tree trunk, Tree leaves, Vehicle, Road, Pole, Wire.]

Fig. 7: Examples of how our latent nodes improve the labeling in practice. As shown in the 3D-2D projection, data misalignment and object motion have caused 3D points labeled as leaves to cover the pole (top) and 3D points labeled as road to project onto the vehicles (bottom). Consequently, with the approach in [6], which encourages the corresponding nodes in two modalities to take identical class labels, the pole was segmented as leaves in the image domain and the vehicle was labeled as road in the 3D data (highlighted by a white arrow). In contrast, thanks to our latent nodes, which can cut inconsistent edges, our method produces the correct labels.

| Model | Training time (DATA61/2D3D) | Inference time (DATA61/2D3D) | Training time (CMU/VMR) | Inference time (CMU/VMR) |
| --- | --- | --- | --- | --- |
| 2D-3D CRF with latent nodes | 6hr45min | 0.85s | 4hr40min | 0.47s |
| Simultaneous Inference of Semantic and Geometric Classes both in 2D and 3D | 19hr20min | 2.3s | 24hr15min | 1.2s |

TABLE II: Training and inference time for DATA61/2D3D and CMU/VMR datasets.

| Geometric Classes | Semantic Classes (DATA61/2D3D dataset) | Semantic Classes (CMU/VMR dataset) |
| --- | --- | --- |
| Horizontal Plane | Grass - Road - Sidewalk | Road - Sidewalk - Ground - Stairs |
| Vertical Plane | Building - Vehicle | Building - Small Vehicle - Big Vehicle |
| Cylindrical | Tree Trunk - Pole - Sign - Post - Barrier | Barrier - Bus Stop - Tree Trunk - Tall Light - Post - Sign - Utility Pole - Traffic Signal |
| Scattered | Tree Leaves - Bush | Shrub - Tree Top |
| Sky | Sky | |
| Person | | Person |
| Wire | Wire | Wire |

TABLE III: Mapping table between the geometric and semantic classes for DATA61/2D3D dataset and CMU/VMR dataset.

| Method | Grass | Building | Tree trunk | Tree leaves | Vehicle | Road | Bush | Pole | Sign | Post | Barrier | Wire | Sidewalk | Sky | avg |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Unary | 80 | 33 | 14 | 80 | 49 | 95 | 16 | 28 | 3 | 0 | 0 | 29 | 15 | 98 | 38 |
| Pairwise 2D (learned) | 85 | 57 | 17 | 85 | 55 | 95 | 18 | 30 | 0 | 0 | 3 | 34 | 20 | 99 | 43 |
| 2D-3D handcrafted potentials, Namin [6] | 74 | 56 | 21 | 82 | 58 | 92 | 23 | 33 | 19 | 8 | 5 | 32 | 29 | 97 | 45 |
| 2D-3D learned potentials (no feature) | 94 | 58 | 12 | 83 | 72 | 64 | 31 | 34 | 6 | 0 | 13 | 37 | 48 | 97 | 46 |
| 2D-3D learned potentials (full features) | 90 | 63 | 10 | 91 | 68 | 96 | 31 | 43 | 1 | 0 | 0 | 44 | 53 | 99 | 49 |
| 2D-3D learned potentials (selected features) | 92 | 64 | 18 | 92 | 69 | 98 | 36 | 34 | 3 | 0 | 28 | 40 | 60 | 99 | 52 |
| 2D-3D learned potentials with latent nodes | 95 | 71 | 28 | 93 | 76 | 97 | 44 | 44 | 10 | 5 | 21 | 38 | 68 | 99 | 56 |
| Semantic results with semantic - geometric model (selected features) | 92 | 70 | 26 | 93 | 72 | 97 | 32 | 49 | 17 | 0 | 0 | 63 | 65 | 99 | 55 |
| Semantic results with semantic - geometric model and latent nodes | 93 | 79 | 45 | 95 | 77 | 98 | 34 | 55 | 22 | 0 | 0 | 63 | 83 | 99 | 60 |
| Semantic results with semantic - geometric model (Connected 2D frames) | 95 | 82 | 52 | 90 | 78 | 99 | 78 | 99 | 33 | 60 | 20 | 61 | 92 | 99 | 62 |

TABLE IV: Per class F1-scores for the 2D domain in the DATA61/2D3D dataset. We present the results for unary, pairwise model learned on the 2D domain only, the method of [6] with handcrafted potentials, the 2D-3D learned potentials, the 2D-3D learned potentials with latent nodes, semantic results with semantic - geometric model with and without latent nodes.

| Method | Grass | Building | Tree trunk | Tree leaves | Vehicle | Road | Bush | Pole | Sign | Post | Barrier | Wire | Sidewalk | Sky | avg |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Unary | 52 | 61 | 27 | 87 | 58 | 82 | 10 | 24 | 19 | 43 | 19 | 74 | 0 | # | 43 |
| Pairwise 3D (learned) | 58 | 80 | 50 | 97 | 56 | 76 | 16 | 62 | 32 | 40 | 0 | 89 | 0 | # | 50 |
| 2D-3D handcrafted potentials, Namin [6] | 63 | 81 | 41 | 96 | 70 | 76 | 21 | 38 | 28 | 47 | 23 | 87 | 0 | # | 52 |
| 2D-3D learned potentials (no feature) | 68 | 81 | 31 | 92 | 67 | 83 | 69 | 43 | 37 | 25 | 16 | 75 | 10 | # | 54 |
| 2D-3D learned potentials (full features) | 72 | 75 | 27 | 95 | 77 | 90 | 42 | 62 | 31 | 9 | 0 | 89 | 0 | # | 52 |
| 2D-3D learned potentials (selected features) | 60 | 92 | 45 | 97 | 75 | 79 | 61 | 58 | 49 | 29 | 27 | 82 | 0 | # | 58 |
| 2D-3D learned potentials with latent nodes | 66 | 94 | 49 | 95 | 79 | 83 | 51 | 62 | 54 | 43 | 25 | 89 | 8 | # | 61 |
| Semantic results with semantic - geometric model (selected features) | 71 | 88 | 51 | 97 | 76 | 84 | 56 | 60 | 51 | 49 | 6 | 92 | 21 | # | 62 |
| Semantic results with semantic - geometric model and latent nodes | 79 | 91 | 64 | 99 | 77 | 93 | 60 | 61 | 50 | 58 | 0 | 96 | 34 | # | 66 |
| Semantic results with semantic - geometric model (Connected 2D frames) | 80 | 92 | 65 | 98 | 75 | 93 | 65 | 59 | 49 | 62 | 0 | 93 | 32 | # | 66 |

TABLE V: Per class F1-scores for the 3D domain in the DATA61/2D3D dataset. We present the results for unary, pairwise model learned on the 3D domain only, the method of [6] with handcrafted potentials, the 2D-3D learned potentials, the 2D-3D learned potentials with latent nodes, semantic results with semantic - geometric model with and without latent nodes.

| Method | Horizontal plane | Vertical plane | Cylindrical | Scattered | Wire | Sky | avg |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 2D geometric results with semantic - geometric model | 98 | 76 | 25 | 94 | 43 | 99 | 72 |
| 3D geometric results with semantic - geometric model | 99 | 91 | 62 | 99 | 95 | # | 89 |

TABLE VI: Per class F1-scores for geometric results with semantic - geometric model and latent nodes in the DATA61/2D3D dataset.

| Method | Road | Sidewalk | Ground | Building | Barrier | Bus stop | Stairs | Shrub | Tree trunk | Tree top | Small vehicle | Big vehicle | Person | Tall light | Post | Sign | Utility pole | Wire | Traffic signal | avg |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Unary | 95 | 81 | 75 | 56 | 29 | 17 | 32 | 50 | 31 | 53 | 32 | 49 | 29 | 16 | 15 | 16 | 33 | 41 | 29 | 41 |
| Pairwise 2D (learned) | 89 | 77 | 74 | 84 | 25 | 17 | 40 | 62 | 37 | 89 | 78 | 57 | 38 | 1 | 5 | 3 | 16 | 12 | 9 | 43 |
| Munoz [5] | 96 | 90 | 70 | 83 | 50 | 16 | 33 | 62 | 30 | 86 | 84 | 50 | 47 | 2 | 9 | 16 | 14 | 2 | 17 | 45 |
| 2D-3D handcrafted potentials, Namin [6] | 94 | 87 | 79 | 74 | 45 | 22 | 40 | 54 | 27 | 84 | 67 | 24 | 38 | 13 | 2 | 10 | 37 | 35 | 40 | 46 |
| 2D-3D learned potentials (no feature) | 95 | 84 | 78 | 70 | 58 | 18 | 57 | 68 | 43 | 84 | 81 | 52 | 55 | 9 | 3 | 2 | 15 | 5 | 8 | 47 |
| 2D-3D learned potentials (full features) | 93 | 85 | 83 | 88 | 60 | 4 | 61 | 67 | 41 | 87 | 79 | 61 | 45 | 0 | 3 | 2 | 12 | 9 | 2 | 46 |
| 2D-3D learned potentials (selected features) | 93 | 80 | 80 | 87 | 60 | 1 | 70 | 67 | 37 | 90 | 84 | 67 | 54 | 7 | 4 | 4 | 21 | 15 | 3 | 49 |
| 2D-3D learned potentials with latent nodes | 94 | 84 | 84 | 84 | 65 | 4 | 75 | 64 | 43 | 89 | 84 | 58 | 52 | 11 | 6 | 2 | 25 | 18 | 3 | 50 |
| Semantic results with semantic - geometric model (selected features) | 94 | 87 | 82 | 82 | 61 | 26 | 59 | 68 | 43 | 89 | 74 | 60 | 55 | 0 | 4 | 4 | 27 | 15 | 8 | 49 |
| Semantic results with semantic - geometric model and latent nodes | 94 | 87 | 84 | 81 | 58 | 28 | 63 | 66 | 47 | 87 | 78 | 64 | 56 | 0 | 6 | 5 | 38 | 17 | 10 | 51 |

TABLE VII: Per class F1-scores for the 2D domain in the CMU/VMR dataset. We present the results for unary, pairwise model learned on the 2D domain only, the method of [5], the method of [6] with handcrafted potentials, the 2D-3D learned potentials, the 2D-3D learned potentials with latent nodes, semantic results with semantic - geometric model with and without latent nodes.

| Method | Road | Sidewalk | Ground | Building | Barrier | Bus stop | Stairs | Shrub | Tree trunk | Tree top | Small vehicle | Big vehicle | Person | Tall light | Post | Sign | Utility pole | Wire | Traffic signal | avg |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Unary | 70 | 49 | 62 | 67 | 34 | 2 | 19 | 26 | 11 | 67 | 34 | 4 | 13 | 2 | 0 | 1 | 2 | 0 | 0 | 24 |
| Pairwise 3D (learned) | 78 | 52 | 67 | 78 | 15 | 1 | 32 | 31 | 1 | 73 | 44 | 14 | 9 | 1 | 0 | 0 | 0 | 0 | 0 | 26 |
| Munoz [5] | 82 | 73 | 68 | 87 | 46 | 11 | 38 | 63 | 28 | 88 | 73 | 56 | 26 | 10 | 0 | 0 | 0 | 0 | 0 | 39 |
| 2D-3D handcrafted potentials, Namin [6] | 92 | 85 | 81 | 85 | 50 | 16 | 42 | 55 | 29 | 82 | 70 | 16 | 43 | 6 | 2 | 7 | 29 | 9 | 23 | 43 |
| 2D-3D learned potentials (no feature) | 92 | 84 | 85 | 87 | 64 | 3 | 59 | 64 | 32 | 77 | 70 | 19 | 42 | 5 | 2 | 3 | 7 | 3 | 9 | 42 |
| 2D-3D learned potentials (full features) | 90 | 86 | 87 | 90 | 59 | 2 | 64 | 69 | 31 | 79 | 70 | 29 | 47 | 1 | 1 | 0 | 5 | 0 | 0 | 43 |
| 2D-3D learned potentials (selected features) | 90 | 85 | 85 | 89 | 62 | 2 | 63 | 68 | 29 | 86 | 78 | 46 | 53 | 3 | 1 | 0 | 15 | 0 | 0 | 45 |
| 2D-3D learned potentials with latent nodes | 92 | 88 | 84 | 88 | 64 | 7 | 66 | 66 | 31 | 86 | 75 | 42 | 53 | 8 | 7 | 0 | 17 | 10 | 0 | 47 |
| Semantic results with semantic - geometric model (selected features) | 93 | 86 | 85 | 92 | 66 | 12 | 62 | 68 | 39 | 86 | 80 | 47 | 56 | 0 | 2 | 2 | 21 | 10 | 0 | 48 |
| Semantic results with semantic - geometric model and latent nodes | 94 | 86 | 87 | 90 | 71 | 18 | 60 | 70 | 44 | 87 | 78 | 43 | 58 | 0 | 2 | 2 | 28 | 13 | 0 | 50 |

TABLE VIII: Per class F1-scores for the 3D domain in the CMU/VMR dataset. We present the results for unary, pairwise model learned on the 3D domain only, the method of [5], the method of [6] with handcrafted potentials, the 2D-3D learned potentials, the 2D-3D learned potentials with latent nodes, semantic results with semantic - geometric model with and without latent nodes.

| Method | Horizontal plane | Vertical plane | Cylindrical | Scattered | Person | Wire | avg |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 2D geometric results with semantic-geometric model | 97 | 85 | 44 | 88 | 56 | 52 | 70 |
| 3D geometric results with semantic-geometric model | 96 | 91 | 60 | 87 | 56 | 19 | 68 |

TABLE IX: Per class F1-scores for geometric results with semantic-geometric model and latent nodes in the CMU/VMR dataset.

[Figure 8. Legend: Grass, Building, Tree trunk, Tree leaves, Vehicle, Road, Bush, Pole, Sign, Post, Barrier, Wire, Sidewalk, Sky.]

Fig. 8: Sample results on the DATA61/2D3D dataset. 1st row: Left: 2D ground-truth; Middle: 2D results of [6]; Right: our 2D results. 2nd row: Left: 3D ground-truth; Middle: 3D results of [6]; Right: our 3D results. Our approach (the right figures) has managed to correct some of the mislabelings in the results of [6], e.g., the vehicles and wires in the 3D domain, and the poles and tree trunks in the 2D domain. In fact, these are the categories that are most likely to be affected by misalignments.

[Figure 9. Legend: Road, Sidewalk, Ground, Building, Barrier, Bus stop, Stairs, Shrub, Tree trunk, Tree top, Small vehicle, Big vehicle, Person, Tall light, Post, Sign, Utility pole, Wire, Traffic signal.]

Fig. 9: Sample results of two scenes in the CMU/VMR dataset. 1st row in each scene: Left: 2D ground-truth; Middle: the results of [6]; Right: our 2D results. 2nd row: ground-truth of the 3D data; 3rd row: the results of [6]; 4th row: our 3D results. The circles highlight mislabeling in the 3D ground-truth of this dataset, which happened because of misalignments between 2D images and 3D data, and indicate how our approach has enhanced the results in those challenging parts compared to [6].

[Figure 10. Semantic legend: Grass, Building, Tree trunk, Tree leaves, Vehicle, Road, Bush, Pole, Sign, Post, Barrier, Wire, Sidewalk, Sky. Geometric legend: Horizontal, Vertical, Cylindrical, Scattered, Wire, Sky.]

Fig. 10: Example results of semantic and geometric labeling in the DATA61/2D3D dataset. 1st row: image, 2nd row: 2D semantic ground-truth, 3rd row: 2D geometric ground-truth, 4th row: 2D semantic results, 5th row: 2D geometric results, 6th row: 3D semantic results, 7th row: 3D geometric results.

[Figure 11. Semantic legend: Road, Sidewalk, Ground, Building, Barrier, Bus stop, Stairs, Shrub, Tree trunk, Tree top, Small vehicle, Big vehicle, Person, Tall light, Post, Sign, Utility pole, Wire, Traffic signal. Geometric legend: Horizontal, Vertical, Cylindrical, Scattered, Person, Wire.]

Fig. 11: Example results of semantic and geometric labeling in the CMU/VMR dataset. 1st row: image, 2nd row: 2D semantic ground-truth, 3rd row: 2D geometric ground-truth, 4th row: 2D semantic results, 5th row: 2D geometric results, 6th row: 3D semantic ground-truth, 7th row: 3D semantic results, 8th row: 3D geometric results.

VII-C Scalability

In this section we discuss the scalability of our model. In the proposed model with multiple modalities, each node has correspondence with only a few nodes in other modalities and hence, the number of latent nodes in the graph grows linearly with the total number of nodes in all modalities.

Augmenting our model with latent nodes introduces new potential functions between the latent nodes and the 2D/3D nodes, each with its own parameter matrix (see Table I for the sizes used in our experiments). However, as explained in Sec. IV, only a subset of these parameters needs to be trained for each potential function.

In order to show the scalability of the proposed method, and since no dataset with more than two visual modalities was publicly available, we considered geometric class labels as another form of visual modality. Note that using a real visual sensor as another modality might make it challenging to find node correspondences between modalities. Nonetheless, in terms of computational complexity, the problem is no different from our case, where geometric classes are taken as a domain.

The training and inference times reported in Table II show that, even though the training time increases as the number of modalities grows, inference remains fast, taking at most 2.3s in our experiments.

VIII Conclusion

In this paper, we have presented a general multimodal model that can simultaneously accommodate multiple modalities. We have also addressed the problem of domain inconsistencies in multimodal semantic labeling, which is an important issue when multimodal data is concerned. Such inconsistencies typically cause undesirable connections between two modalities, which in turn lead to poor labeling performance. We have, therefore, proposed a latent CRF model, in which latent nodes supervise the pairwise edges between each pair of domains. Having access to the information of both modalities, these nodes can either improve the labeling in both domains or cut the links between inconsistent regions. Furthermore, we presented a new set of data-driven learned potentials, which can model complex relationships between the latent nodes and the modalities. In addition, our general model enables us to jointly consider the geometric and semantic classes for both 2D and 3D data and to perform concurrent inference on them to further improve the 2D and 3D semantic labeling results. Thanks to our general model, latent nodes and learned potentials, our method achieved state-of-the-art results on two publicly available datasets.

Acknowledgment

The authors would like to thank Justin Domke for his assistance in implementing the learned potentials.

References

  • [1] I. Posner, M. Cummins, and P. Newman, “Fast probabilistic labeling of city maps,” in RSS, 2008.
  • [2] H. Zhang, J. Wang, T. Fang, and L. Quan, “Joint segmentation of images and scanned point cloud in large-scale street scenes with low-annotation cost,” IEEE TIP, vol. 23, no. 11, pp. 4763–4772, 2014.
  • [3] B. Douillard, A. Brooks, and F. Ramos, “A 3D laser and vision based classifier,” in ISSNIPC, 2009.
  • [4] C. Cadena and J. Košecká, “Semantic Segmentation with Heterogeneous Sensor Coverages,” in ICRA, 2014.
  • [5] D. Munoz, J. A. Bagnell, and M. Hebert, “Co-inference for multi-modal s