xMUDA: Cross-Modal Unsupervised Domain Adaptation for 3D Semantic Segmentation

by   Maximilian Jaritz, et al.

Unsupervised Domain Adaptation (UDA) is crucial to tackle the lack of annotations in a new domain. There are many multi-modal datasets, but most UDA approaches are uni-modal. In this work, we explore how to learn from multi-modality and propose cross-modal UDA (xMUDA) where we assume the presence of 2D images and 3D point clouds for 3D semantic segmentation. This is challenging as the two input spaces are heterogeneous and can be impacted differently by domain shift. In xMUDA, modalities learn from each other through mutual mimicking, disentangled from the segmentation objective, to prevent the stronger modality from adopting false predictions from the weaker one. We evaluate on new UDA scenarios including day-to-night, country-to-country and dataset-to-dataset, leveraging recent autonomous driving datasets. xMUDA brings large improvements over uni-modal UDA on all tested scenarios, and is complementary to state-of-the-art UDA techniques.






1 Introduction

Three-dimensional scene understanding is required in numerous applications, in particular robotics, autonomous driving and virtual reality. Among the tasks involved, 3D semantic segmentation is gaining traction as new datasets are released [7, 1, 5]. Like other perception tasks, 3D semantic segmentation can suffer from domain shift between supervised training and test time, for example between day and night, different countries or datasets. Domain adaptation aims at addressing this gap, but existing work mostly concerns 2D semantic segmentation [12, 34, 27, 17] and rarely 3D [32]. We also observe that previous domain adaptation work focuses on a single modality, whereas 3D datasets are often multi-modal, consisting of 3D point clouds and 2D images. While the complementarity between these two modalities is already exploited by both human annotators and learned models to localize objects in 3D scenes [18, 20], we consider it from a new angle, asking the question: If 3D and 2D data are available in the source and target domain, can we capitalize on multi-modality to address Unsupervised Domain Adaptation (UDA)?

We coin our method Cross-Modal UDA, ‘xMUDA’ in short, and consider 3 real-to-real adaptation scenarios with different lighting conditions (day-to-night), environments (country-to-country) and sensor setups (dataset-to-dataset). It is a challenging task for various reasons. The heterogeneous input spaces (2D and 3D) make the pipeline complex, as they require heterogeneous network architectures and 2D-3D projections. In fusion, if two sensors register the same scene, there is shared information between both, but each sensor also has private (or exclusive) information. Thanks to the latter, one modality can be stronger than the other in a certain case, but it can be the other way around in another, depending on class, context, resolution, etc. This makes selecting the “best” sensor based on prior knowledge infeasible. Additionally, each modality can be affected differently by the domain shift. For example, the camera is deeply impacted by the day-to-night domain change, while LiDAR is relatively robust to it, as shown on the left in Fig. 1.

In order to address these challenges, we propose a Cross-Modal UDA (‘xMUDA’) framework where information can be exchanged between 2D and 3D in order to learn from each other for UDA (see right side of Fig. 1). We use a disentangled 2-stream architecture to address the domain gap individually in each modality. Our learning scheme allows robust balancing of the Cross-Modal and segmentation objectives. In addition, xMUDA is complementary to self-training with pseudo-labels [17], a popular UDA technique, as it exploits a different source of knowledge. Finally, it is common practice in supervised learning to use feature fusion (e.g., early or late fusion) when multiple modalities are available [9, 26, 18]: our framework can be extended to fusion while maintaining a disentangled Cross-Modal objective.

Our contributions can be summarized as follows:

  • We define new UDA scenarios and propose corresponding splits on recently published 2D-3D datasets.

  • We design an architecture that enables Cross-Modal learning by disentangling private and shared information in 2D and 3D.

  • We propose a novel UDA learning scheme where modalities can learn from each other in balance with the main objective. It can be applied on top of state-of-the-art self-training techniques to boost performance.

  • We showcase how our framework can be extended to late fusion and produce superior results.

On the different proposed benchmarks we outperform the single-modality state-of-the-art UDA techniques by a significant margin. Thereby, we show that the exploitation of multi-modality for UDA is a powerful tool that can benefit a wide range of multi-sensor applications.

Figure 2: Overview of our xMUDA framework for 3D semantic segmentation. The architecture comprises a 2D stream which takes an image as input and uses a U-Net-style 2D ConvNet [22], and a 3D stream which takes the point cloud as input and uses a U-Net-style 3D SparseConvNet [8]. Feature outputs of both streams have the same length $N$, equal to the number of 3D points. To achieve that, we project the 3D points into the image and sample the 2D features at the corresponding pixel locations. The 4 segmentation outputs consist of the main predictions $P_{2D}$, $P_{3D}$ and the mimicry predictions $P_{2D \to 3D}$, $P_{3D \to 2D}$. We transfer knowledge across modalities using KL divergence, $D_{KL}(P_{3D} \,\|\, P_{2D \to 3D})$, where the objective of the 2D mimicry head is to estimate the main 3D output and vice versa, $D_{KL}(P_{2D} \,\|\, P_{3D \to 2D})$.


2 Related Work

In this section, rather than thoroughly going through the literature, we review representative works for each focus.

Unsupervised Domain Adaptation.

The past few years have seen an increasing interest in unsupervised domain adaptation techniques for complex perception tasks like object detection and semantic segmentation. Under the hood of such methods lies the same spirit of learning domain-invariant representations, i.e., features coming from different domains should exhibit insignificant discrepancy. Some works promote adversarial training to minimize the source-target distribution shift, either on pixel- [12], feature- [13] or output-space [25, 27]. Revisited from semi-supervised learning [15], self-training with pseudo-labels has also recently been proven effective for UDA [17, 34].

While most existing works consider UDA in the 2D world, very few tackle the 3D counterpart. Wu et al. [32] adopted activation correlation alignment [19] for UDA in 3D segmentation from LiDAR point clouds. In this work, we investigate the same task, but differently: our system operates on multi-modal input data, i.e., RGB + LiDAR.

To the best of our knowledge, there are no previous UDA works in 2D/3D semantic segmentation for multi-modal scenarios. Only a few consider an extra modality, e.g. depth, that is solely available at training time on the source domain, and leverage such privileged information to boost adaptation performance [16, 28]. In contrast, we assume that all modalities are available at train and test time on both source and target domains.

Multi-Modality Learning.

In a supervised setting, performance can naturally be improved by fusing features from multiple sources. The geometrically simplest case is RGB-Depth fusion with dense pixel-to-pixel correspondence for 2D segmentation [9, 26]. It is harder to fuse a 3D point cloud with a 2D image, because they live in different metric spaces. One solution is to project 2D and 3D features into a ‘bird eye view’ for object detection [18]. Another possibility is to lift 2D features from multi-view images to the 3D point cloud to enable joint 2D-3D processing for 3D semantic segmentation [23, 14, 3]. We are closer to the last series of works: we share the same goal of 3D semantic segmentation. However, we focus on how to exploit multi-modality for UDA instead of supervised learning and only use single view images and their corresponding point clouds.

3D networks for semantic segmentation.

While images are dense tensors, 3D point clouds can be represented in multiple ways, which leads to competing network families that evolve in parallel. Voxels are very similar to pixels, but very memory intensive as most of them are empty. Graham et al. [8] and a similar implementation [4] address this problem by using hash tables to convolve only on active voxels. This allows for very high resolution with typically only one point per voxel. Point-based networks perform computation in continuous 3D space and can thus directly accept point clouds as input. PointNet++ [21] uses point-wise convolution, max-pooling to compute global features and local neighborhood aggregation for hierarchical learning akin to CNNs. Many improvements have been proposed in this direction, such as continuous convolutions [29] and deformable kernels [24]. Graph-based networks convolve on the edges of a point cloud [30]. In this work, we select SparseConvNet [8] as 3D network, which is the state of the art on the ScanNet benchmark [5].


3 xMUDA

The aim of Cross-Modal UDA (xMUDA) is to exploit multi-modality by enabling controlled information exchange between modalities so that they can learn from each other. This is achieved by letting them mutually mimic each other’s outputs, so that they can both benefit from their counterpart’s strengths.

Specifically, we investigate xMUDA using point cloud (3D modality) and image (2D modality) on the task of 3D semantic segmentation. An overview is depicted in Fig. 2. We first describe the architecture in Sec. 3.1, our learning scheme in Sec. 3.2, and later showcase its extension to the special case of fusion.

In the following, we consider a source dataset $\mathcal{S}$, where each sample consists of a 2D image $x_s^{2D}$, a 3D point cloud $x_s^{3D}$ and 3D segmentation labels $y_s^{3D}$, as well as a target dataset $\mathcal{T}$, lacking annotations, where each sample only consists of an image $x_t^{2D}$ and a point cloud $x_t^{3D}$. Images $x^{2D}$ are of spatial size $(H, W, 3)$ and point clouds $x^{3D}$ of size $(N, 3)$, where $N$ is the number of 3D points within the camera field of view.

3.1 Architecture

To allow Cross-Modal learning, it is crucial to extract features specific to each modality. In contrast to 2D-3D architectures where 2D features are lifted to 3D [18], we use a 2-stream architecture with independent 2D and 3D branches that do not share features (see Fig. 2).

We use SparseConvNet [8] for 3D and a modified version of U-Net [22] with ResNet34 [10] for 2D. Even though each stream has a specific network architecture, it is important that the outputs are of same size to allow Cross-Modal learning. Implementation details are provided in Sec. 4.2.

Dual Segmentation Head.

We call segmentation head the last linear layer in the network that transforms the output features into logits followed by a softmax function to produce the class probabilities. For xMUDA, we establish a link between 2D and 3D with a ‘mimicry’ loss between the output probabilities, i.e., each modality should predict the other modality’s output. This allows us to explicitly control the Cross-Modal learning.

In a naive approach, each modality has a single segmentation head and a Cross-Modal optimization objective aligns the outputs of both modalities. Unfortunately, this leads to only using information that is shared between the two modalities, while discarding private information that is exclusive to each sensor (more details in the ablation study in Sec. 5.1). This is an important limitation, as we want to leverage both private and shared information, in order to obtain the best possible performance.

To preserve private information while benefiting from shared knowledge, we introduce an additional segmentation head to uncouple the mimicry objective from the main segmentation objective. This means that the 2D and 3D streams both have two segmentation heads: one main head for the best possible prediction, and one mimicry head to estimate the other modality’s output.

The outputs of the 4 segmentation heads (see Fig. 2) are of size $(N, C)$, where $C$ is equal to the number of classes, such that we obtain a vector of class probabilities for each 3D point. The two main heads produce the best possible predictions, $P_{2D}$ and $P_{3D}$ respectively for each branch. The two mimicry heads estimate the other modality’s output: 2D estimates 3D ($P_{2D \to 3D}$) and 3D estimates 2D ($P_{3D \to 2D}$).
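The dual-head layout can be illustrated in a few lines of plain Python. This is a minimal sketch and not the authors' implementation: the helper names `softmax`, `linear` and `dual_head`, the toy weights and the feature size are assumptions made for illustration. Each head is a single linear layer followed by a softmax, applied to the feature vector of one 3D point.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def linear(features, weights, bias):
    """One linear layer; `weights` holds one row per class."""
    return [sum(w * f for w, f in zip(row, features)) + b
            for row, b in zip(weights, bias)]

def dual_head(features, main_params, mimicry_params):
    """Apply two independent heads to the same per-point feature vector.

    The main head is trained for segmentation; the mimicry head is trained
    to estimate the other modality's main output.
    """
    main_probs = softmax(linear(features, *main_params))
    mimicry_probs = softmax(linear(features, *mimicry_params))
    return main_probs, mimicry_probs

# Toy example: 3 feature channels, 2 classes (hypothetical values).
feat_2d = [0.5, -1.0, 2.0]
main_2d = ([[1.0, 0.0, 0.5], [0.0, 1.0, -0.5]], [0.0, 0.0])
mimic_2d = ([[0.2, 0.1, 0.0], [0.0, 0.3, 0.4]], [0.1, -0.1])
p_main, p_mimic = dual_head(feat_2d, main_2d, mimic_2d)
```

Because the two heads have separate weights, the mimicry output can move toward the other modality without forcing the main output to change.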

Figure 3: Proposed UDA setup. xMUDA learns from supervision on the source domain (plain lines) and self-supervision on the target domain (dashed lines), while benefiting from the Cross-Modal predictions of 2D/3D modalities.

3.2 Learning Scheme

The goal of our Cross-Modal learning scheme is to exchange information between the modalities in a controlled manner to teach them to be aware of each other. This auxiliary objective can effectively improve the performance of each modality and does not require any annotations, which enables its use for UDA on the target dataset $\mathcal{T}$. In the following we define the basic supervised learning setup, our Cross-Modal loss $\mathcal{L}_{xM}$, and the additional pseudo-label learning method. The loss flows are depicted in Fig. 3.

Supervised Learning.

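The main goal of 3D segmentation is learned through cross-entropy in a classical supervised fashion on the source data. We can write the segmentation loss for each network stream (2D and 3D) as:

$$\mathcal{L}_{seg}(x_s, y_s^{3D}) = -\frac{1}{N} \sum_{n=1}^{N} \sum_{c=1}^{C} y_s^{(n,c)} \log P_{x_s}^{(n,c)}, \tag{1}$$

where $x_s$ is either $x_s^{2D}$ or $x_s^{3D}$, and $P_{x_s}$ is the main prediction of the corresponding stream.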
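As a worked example of the point-wise cross-entropy described above, the following sketch averages the negative log-probability of the labeled class over all points. The function name `segmentation_loss` and the toy values are assumptions; the class weighting mentioned in the implementation details is omitted for brevity.

```python
import math

def segmentation_loss(probs, labels):
    """Mean cross-entropy over N points.

    probs:  list of N lists, each with C class probabilities.
    labels: list of N integer class indices (equivalent to one-hot targets).
    """
    eps = 1e-12  # guards against log(0)
    total = 0.0
    for p, y in zip(probs, labels):
        total += -math.log(p[y] + eps)
    return total / len(labels)

# Two points, three classes (hypothetical predictions).
probs = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
labels = [0, 1]
loss = segmentation_loss(probs, labels)
```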

Cross-Modal Learning.

The objective of unsupervised learning across modalities is twofold. Firstly, we want to transfer knowledge from one modality to the other on the target dataset. For example, let one modality be sensitive and the other more robust to the domain shift, then the robust modality should teach the sensitive modality the correct class in the target domain where no labels are available. Secondly, we want to design an auxiliary objective on source and target, where the task is to estimate the other modality’s prediction. By mimicking not only the class with maximum probability, but the whole distribution, more information is exchanged, leading to softer labels.

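We choose KL divergence for the Cross-Modal loss $\mathcal{L}_{xM}$ and define it as follows:

$$\mathcal{L}_{xM}(x) = D_{KL}\left(P_x^{(n,c)} \,\|\, Q_x^{(n,c)}\right) \tag{2}$$

$$= \frac{1}{N} \sum_{n=1}^{N} \sum_{c=1}^{C} P_x^{(n,c)} \log \frac{P_x^{(n,c)}}{Q_x^{(n,c)}}, \tag{3}$$

with $(P, Q) \in \{(P_{2D}, P_{3D \to 2D}), (P_{3D}, P_{2D \to 3D})\}$, where $P$ is the target distribution from the main prediction which is to be estimated by the mimicking prediction $Q$. This loss is applied on the source and the target domain as it does not require ground truth labels and is the key to our proposed domain adaptation framework. For source, $\mathcal{L}_{xM}$ can be seen as an auxiliary mimicry loss in addition to the main segmentation loss $\mathcal{L}_{seg}$.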
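The mimicry loss can be sketched as a point-wise KL divergence between a main prediction (the target distribution) and the other stream's mimicry prediction. The function name `cross_modal_loss` and the toy distributions are assumptions. In a framework with automatic differentiation, the main prediction would be treated as a constant so that gradients only flow into the mimicking stream; plain Python has no gradients, so that step is only noted in a comment.

```python
import math

def cross_modal_loss(main_probs, mimicry_probs):
    """Mean KL(P || Q) over N points.

    main_probs:    P, the main prediction of one modality (treated as a
                   fixed target, i.e. no gradient would flow through it).
    mimicry_probs: Q, the mimicry prediction of the other modality.
    """
    eps = 1e-12  # guards against log(0) and division by zero
    total = 0.0
    for p, q in zip(main_probs, mimicry_probs):
        total += sum(pc * math.log((pc + eps) / (qc + eps))
                     for pc, qc in zip(p, q))
    return total / len(main_probs)

# One point, two classes (hypothetical values).
p_3d_main = [[0.9, 0.1]]
p_2d_mimics_3d = [[0.6, 0.4]]
kl = cross_modal_loss(p_3d_main, p_2d_mimics_3d)
```

Matching the whole distribution rather than only the most likely class is what passes the "soft label" information between modalities.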

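The complete optimization objective for each network stream (2D and 3D) is the combination of the segmentation loss $\mathcal{L}_{seg}$ on source and the Cross-Modal loss $\mathcal{L}_{xM}$ on source and target:

$$\min_{\theta} \left[ \frac{1}{|\mathcal{S}|} \sum_{x_s \in \mathcal{S}} \left( \mathcal{L}_{seg}(x_s, y_s^{3D}) + \lambda_s \mathcal{L}_{xM}(x_s) \right) + \frac{1}{|\mathcal{T}|} \sum_{x_t \in \mathcal{T}} \lambda_t \mathcal{L}_{xM}(x_t) \right], \tag{4}$$

where $\lambda_s, \lambda_t$ are hyperparameters to weight $\mathcal{L}_{xM}$ on source and target respectively and $\theta$ are the network weights of either the 2D or the 3D stream.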
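For a single training iteration, the per-stream objective reduces to a weighted sum of three scalars. The sketch below assumes the individual loss values have already been computed on one source batch and one target batch; the function name `stream_objective` and the numeric values are illustrative only.

```python
def stream_objective(seg_source, xm_source, xm_target, lambda_s, lambda_t):
    """Combine the losses of one stream (2D or 3D) for one iteration.

    seg_source: supervised segmentation loss on the source batch.
    xm_source:  Cross-Modal (mimicry) loss on the source batch.
    xm_target:  Cross-Modal (mimicry) loss on the target batch.
    """
    return seg_source + lambda_s * xm_source + lambda_t * xm_target

# Illustrative loss values and weights.
total = stream_objective(0.8, 0.3, 0.5, lambda_s=1.0, lambda_t=0.1)
```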

There are parallels between the Cross-Modal learning and model distillation which also adopts KL divergence as mimicry loss, but with the goal to transfer knowledge from a large network to a smaller one in a supervised setting [11]. Recently Zhang et al. introduced Deep Mutual Learning [33] where an ensemble of uni-modal networks are jointly trained to learn from each other in collaboration. Though to some extent, our Cross-Modal learning is of similar nature to those strategies, we tackle a different distillation angle, i.e. across modalities (2D/3D) and not in the supervised, but in the UDA setting.

Self-training with Pseudo-Labels.

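Cross-Modal learning is complementary to the pseudo-labelling strategy [15] used originally in semi-supervised learning and recently in UDA [17, 34]. In detail, once a model has been optimized with Eq. 4, we extract pseudo-labels offline, selecting highly confident labels based on the predicted class probability. Then, we train again from scratch, using the produced pseudo-labels for an additional segmentation loss on the target training set. Effectively, the optimization problem reads:

$$\min_{\theta} \left[ \frac{1}{|\mathcal{S}|} \sum_{x_s \in \mathcal{S}} \left( \mathcal{L}_{seg}(x_s, y_s^{3D}) + \lambda_s \mathcal{L}_{xM}(x_s) \right) + \frac{1}{|\mathcal{T}|} \sum_{x_t \in \mathcal{T}} \left( \lambda_{PL} \mathcal{L}_{seg}(x_t, \hat{y}^{3D}) + \lambda_t \mathcal{L}_{xM}(x_t) \right) \right], \tag{5}$$

where $\lambda_{PL}$ weights the pseudo-label segmentation loss, $\hat{y}^{3D}$ are the pseudo-labels, and all other terms are as in Eq. 4.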
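Offline pseudo-label extraction can be sketched as confidence filtering of the predicted class probabilities. The fixed threshold of 0.9, the `IGNORE` marker and the function name are assumptions for illustration; the selection rule of the referenced self-training method [17] may differ, for example by using per-class thresholds.

```python
IGNORE = -100  # hypothetical marker for points left without a pseudo-label

def extract_pseudo_labels(probs, threshold=0.9):
    """Keep the most likely class of each point only if it is confident enough.

    probs: list of N lists, each with C class probabilities.
    Returns a list of N class indices, with IGNORE for uncertain points.
    """
    labels = []
    for p in probs:
        best = max(range(len(p)), key=lambda c: p[c])
        labels.append(best if p[best] >= threshold else IGNORE)
    return labels

# Three target points, two classes (hypothetical predictions).
target_probs = [[0.95, 0.05], [0.55, 0.45], [0.02, 0.98]]
pseudo = extract_pseudo_labels(target_probs)
```

Points marked with `IGNORE` would simply be skipped when the pseudo-label segmentation loss is computed.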

4 Experiments

4.1 Datasets

To evaluate xMUDA, we identified three real-to-real adaptation scenarios. In the day-to-night case, LiDAR has a small domain gap, because it is an active sensing technology, sending out laser beams which are mostly invariant to lighting conditions. In contrast, camera has a large domain gap as its passive sensing suffers from lack of light sources, leading to drastic changes in object appearance. The second scenario is country-to-country adaptation, where the domain gap can be larger for LiDAR or camera: for some classes the 3D shape might change more than the visual appearance or vice versa. The third scenario, dataset-to-dataset, comprises changes in the sensor setup, such as camera optics, but most importantly a higher LiDAR resolution on target. 3D networks are sensitive to varying point cloud density and the image could help to guide and stabilize adaptation.

We leverage the recently published autonomous driving datasets nuScenes [2], A2D2 [7] and SemanticKitti [1], in which LiDAR and camera are synchronized and calibrated, which allows computing the projection between a 3D point and its corresponding 2D image pixel. The chosen datasets contain 3D annotations. For simplicity and consistency across datasets, we only use the front camera image and the LiDAR points that project into it.

For nuScenes, the annotations are 3D bounding boxes and we obtain the point-wise labels for 3D semantic segmentation by assigning the corresponding object label if a point lies inside a 3D box; otherwise the point is labeled as background. We use the meta data to generate the splits for two UDA scenarios: Day/Night and USA/Singapore.
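The box-to-point labeling can be sketched as a point-in-box test. This is a simplified illustration under stated assumptions: boxes are described by a hypothetical dictionary with a center, a size (length, width, height) and a yaw angle around the vertical axis, which is not the dataset's native annotation format.

```python
import math

BACKGROUND = "background"

def point_in_box(point, box):
    """Check whether a 3D point lies inside a yaw-rotated box."""
    cx, cy, cz = box["center"]
    length, width, height = box["size"]
    dx, dy, dz = point[0] - cx, point[1] - cy, point[2] - cz
    c, s = math.cos(box["yaw"]), math.sin(box["yaw"])
    # Rotate the offset into the box frame (inverse yaw rotation).
    local_x = c * dx + s * dy
    local_y = -s * dx + c * dy
    return (abs(local_x) <= length / 2
            and abs(local_y) <= width / 2
            and abs(dz) <= height / 2)

def label_points(points, boxes):
    """Assign the label of the first enclosing box, else background."""
    labels = []
    for pt in points:
        label = BACKGROUND
        for box in boxes:
            if point_in_box(pt, box):
                label = box["label"]
                break
        labels.append(label)
    return labels

# One box rotated by 90 degrees, so its long side runs along the y-axis.
boxes = [{"center": (10.0, 0.0, 1.0), "size": (4.0, 2.0, 2.0),
          "yaw": math.pi / 2, "label": "vehicle"}]
points = [(10.0, 1.5, 1.0), (12.0, 0.0, 1.0)]
point_labels = label_points(points, boxes)
```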

A2D2 and SemanticKitti provide segmentation labels. For UDA, we define 10 shared classes between the two datasets. The LiDAR setup is the main difference: in A2D2, there are three LiDARs with 16 layers which generate a rather sparse point cloud and in SemanticKitti, there is one high-resolution LiDAR with 64 layers.

We provide more details about the data splits in Sec. A.

4.2 Implementation Details

| Method | Day/Night 2D | Day/Night 3D | Day/Night softmax avg | USA/Singapore 2D | USA/Singapore 3D | USA/Singapore softmax avg | A2D2/SemanticKitti 2D | A2D2/SemanticKitti 3D | A2D2/SemanticKitti softmax avg |
|---|---|---|---|---|---|---|---|---|---|
| Baseline (source only) | 42.2 | 41.2 | 47.8 | 53.4 | 46.5 | 61.3 | 36.0 | 36.6 | 41.8 |
| UDA Baseline (PL) [17] | 43.7 | 45.1 | 48.6 | 55.5 | 51.8 | 61.5 | 37.4 | 44.8 | 47.7 |
| xMUDA w/o PL | 46.2 | 44.2 | 50.0 | 59.3 | 52.0 | 62.7 | 36.8 | 43.3 | 42.9 |
| xMUDA | 47.1 | 46.7 | 50.8 | 61.1 | 54.1 | 63.2 | 43.7 | 48.5 | 49.1 |
| Oracle | 48.6 | 47.1 | 55.2 | 66.4 | 63.8 | 71.6 | 58.3 | 71.0 | 73.7 |

Table 1: mIoU on the respective target sets for 3D semantic segmentation in different Cross-Modal UDA scenarios. We report the result for each network stream (2D and 3D) as well as the ensembling result (‘softmax avg’).

2D Network.

We use a modified version of U-Net [22] with a ResNet34 [10] encoder where we add dropout after the 3rd and 4th layer and initialize with ImageNet pretrained weights provided by PyTorch. In the decoder, each layer consists of a transposed convolution, concatenation with encoder features of the same resolution (skip connection) and another convolution to mix the features. The network takes an image $x^{2D}$ as input and produces an output feature map with equal spatial dimensions $(H, W, F_{2D})$, where $F_{2D}$ is the number of feature channels. In order to lift the 2D features to 3D, we sample them at sparse pixel locations where the 3D points project into the feature map, and obtain the final two-dimensional feature matrix of size $(N, F_{2D})$.
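The lifting of 2D features to the 3D points can be sketched with a pinhole camera model. The intrinsics (`fx`, `fy`, `cx`, `cy`), the function names and the nearest-pixel (floor) sampling are assumptions for illustration; points are assumed to be already expressed in the camera frame, with z pointing forward.

```python
import math

def project_points(points_cam, fx, fy, cx, cy, width, height):
    """Project camera-frame 3D points to pixels.

    Returns (point_index, u, v) for every point that lies in front of the
    camera and projects inside the image.
    """
    hits = []
    for i, (x, y, z) in enumerate(points_cam):
        if z <= 0:  # behind the camera
            continue
        u = math.floor(fx * x / z + cx)
        v = math.floor(fy * y / z + cy)
        if 0 <= u < width and 0 <= v < height:
            hits.append((i, u, v))
    return hits

def sample_features(feature_map, hits):
    """Gather one feature vector per projected point.

    feature_map[v][u] is a list of F channels; the result has one row per
    visible point, i.e. shape (N, F).
    """
    return [feature_map[v][u] for _, u, v in hits]

# Toy 3x4 feature map whose features simply encode the pixel position.
height, width = 3, 4
feature_map = [[[v, u] for u in range(width)] for v in range(height)]
points_cam = [(0.0, 0.0, 2.0), (1.0, 0.0, 1.0), (0.0, 0.0, -1.0), (-1.0, 0.5, 2.0)]
hits = project_points(points_cam, fx=2.0, fy=2.0, cx=2.0, cy=1.0,
                      width=width, height=height)
sampled = sample_features(feature_map, hits)
```

In this toy example the second point projects outside the image and the third lies behind the camera, so only two of the four points receive 2D features.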

3D Network.

For SparseConvNet [8] we leverage the official PyTorch implementation and a U-Net architecture with 6 times downsampling. We use a voxel size of 5 cm, which is small enough to have only one 3D point per voxel.
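The idea of storing only occupied voxels in a hash table can be sketched with a Python dictionary keyed by integer voxel coordinates. This only illustrates the data structure and is not the SparseConvNet API; the function name `voxelize` and the toy points are assumptions.

```python
import math

def voxelize(points, voxel_size=0.05):
    """Map points (in meters) to active voxels.

    Returns a dict from integer voxel coordinates to the list of point
    indices that fall into that voxel; empty voxels are never stored.
    """
    voxels = {}
    for idx, (x, y, z) in enumerate(points):
        key = (math.floor(x / voxel_size),
               math.floor(y / voxel_size),
               math.floor(z / voxel_size))
        voxels.setdefault(key, []).append(idx)
    return voxels

# The first two points are closer than 5 cm and share a voxel.
points = [(0.01, 0.01, 0.01), (0.02, 0.03, 0.04), (1.02, 2.03, 0.51)]
active = voxelize(points)
```

Memory then scales with the number of occupied voxels rather than with the volume of the scene.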


Training.

For data augmentation we employ horizontal flipping and color jitter in 2D, and x-axis flipping, scaling and rotation in 3D. Due to the wide-angle image in SemanticKitti, we crop a fixed-size rectangle randomly on the horizontal image axis to reduce memory during training. Log-smoothed class weights are used in all experiments to address class imbalance. For the KL divergence of the Cross-Modal loss in PyTorch, we detach the target variable to only backpropagate in either the 2D or the 3D network. We use a batch size of 8, the Adam optimizer, and an iteration-based learning schedule where the learning rate is divided by 10 at 80k and 90k iterations; the training finishes at 100k. We jointly train the 2D and 3D stream and, at each iteration, accumulate gradients computed on a source and a target batch. All trainings fit into a single GPU with 11GB RAM.
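The step schedule can be written as a small function of the iteration count. The base learning rate is left as a parameter because its value is not recoverable here; the function name `learning_rate` is an assumption.

```python
def learning_rate(iteration, base_lr, milestones=(80_000, 90_000), factor=0.1):
    """Step schedule: multiply the rate by `factor` at each milestone reached."""
    lr = base_lr
    for milestone in milestones:
        if iteration >= milestone:
            lr *= factor
    return lr

# Relative learning rate (base_lr = 1.0) at three points of training.
schedule = [learning_rate(it, base_lr=1.0) for it in (0, 85_000, 95_000)]
```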

There are two training stages. First, we train with Eq. 4, where we apply the segmentation loss using ground truth labels on source and the Cross-Modal loss on source and target. Once trained, we generate pseudo-labels as in [17] from the last model. Note that we do not select the best weights on the validation set, but rather use the last checkpoint to generate the pseudo-labels in order to prevent any supervised learning signal. In the second training stage, with the objective of Eq. 5, an additional segmentation loss on target using the pseudo-labels is added; the network weights are reinitialized from scratch. The 2D and 3D networks are trained jointly and optimized on source and target at each iteration.

4.3 Main Experiments

We evaluate xMUDA on the three proposed Cross-Modal UDA scenarios and compare against a state-of-the-art uni-modal UDA method [17].

We report mean Intersection over Union (mIoU) results for 3D segmentation in Tab. 1 on the target test set for the 3 UDA scenarios. We evaluate on the test set using the checkpoint that achieved the best score on the validation set. In addition to the scores of the 2D and 3D model, we show the ensembling result (‘softmax avg’) which is obtained by taking the mean of the predicted 2D and 3D probabilities after softmax. The baseline is trained on source only and the oracle on target only, except the Day/Night oracle, where we used batches of 50%/50% Day/Night to prevent overfitting. PL is applied separately on each modality (2D pseudo-labels to train 2D, 3D pseudo-labels to train 3D).
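The ensembling and the metric can be sketched as follows. The function names and toy predictions are assumptions, and the choice to skip classes that appear in neither prediction nor ground truth is one common convention rather than something stated in the text.

```python
def softmax_average(probs_2d, probs_3d):
    """Average the 2D and 3D class probabilities point by point."""
    return [[(a + b) / 2 for a, b in zip(p2, p3)]
            for p2, p3 in zip(probs_2d, probs_3d)]

def argmax(p):
    """Index of the largest probability."""
    return max(range(len(p)), key=lambda c: p[c])

def mean_iou(pred, target, num_classes):
    """Mean Intersection over Union over the classes that occur."""
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:  # skip classes absent from both prediction and labels
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0

# Three points, two classes (hypothetical predictions).
probs_2d = [[0.6, 0.4], [0.7, 0.3], [0.2, 0.8]]
probs_3d = [[0.2, 0.8], [0.9, 0.1], [0.3, 0.7]]
ensemble = [argmax(p) for p in softmax_average(probs_2d, probs_3d)]
miou = mean_iou(ensemble, [1, 0, 0], num_classes=2)
```

For the first point the two streams disagree, and the more confident 3D prediction decides the averaged result.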

In all 3 UDA scenarios xMUDA significantly outperforms both the source-only and the UDA baseline, which demonstrates the benefit of exchanging information between modalities. We observe that Cross-Modal learning and self-training with pseudo-labels (PL) are complementary, as the best score is always achieved by combining both (‘xMUDA’). This is expected as they represent different concepts: uni-modal self-training with PL reinforces confident predictions, while Cross-Modal learning enables knowledge sharing between modalities and can be seen as an auxiliary task.

We observe that Cross-Modal learning consistently improves both modalities. Thus, even the strong modality can learn from the weaker one, thanks to decoupling of main and mimicking prediction.

Qualitative results are presented in Fig. 6 and show the versatility of xMUDA across all proposed UDA scenarios. Fig. 7 depicts the individual 2D/3D outputs to illustrate their respective strengths and weaknesses, e.g. at night 3D works much better than 2D. We also provide a video of the A2D2 to SemanticKitti scenario at http://tiny.cc/xmuda.

4.4 Extension to Fusion

Figure 4: Architectures for fusion. (a) In Vanilla Fusion the 2D and 3D features are concatenated, fed into a linear layer with ReLU to mix the features and followed by another linear layer and softmax to obtain a fused prediction $P_{fuse}$. (b) In xMUDA Fusion, we add two uni-modal outputs $P_{2D \to fuse}$ and $P_{3D \to fuse}$ that are used to mimic the fusion output $P_{fuse}$.

In Sec. 4.3 we show how each modality can be improved with xMUDA and consequently, the softmax average also increases. However, how can we obtain the best possible results by 2D and 3D feature fusion?

A common fusion architecture is late fusion where the features from different sources are concatenated (see Fig. 4(a)). However, for xMUDA we need modality independence in the features, as the mimicking task becomes trivial otherwise. Therefore, we propose xMUDA Fusion (see Fig. 4(b)) where each modality has a uni-modal prediction output which is used to mimic the fusion prediction.

In Tab. 2 we show results for different fusion approaches. ‘xMUDA Fusion w/o PL’ outperforms Vanilla Fusion thanks to Cross-Modal learning. We can improve over ‘Vanilla Fusion + PL’ with ‘Distilled Vanilla Fusion’ where we use the xMUDA model of the main experiments reported in Tab. 1 to generate pseudo-labels from the softmax average and train the Vanilla Fusion network. The best performance can be achieved with xMUDA, combining Cross-Modal learning and PL, analogously to the main experiments.

| Method | mIoU |
|---|---|
| Vanilla Fusion (no UDA) | 59.9 |
| xMUDA Fusion w/o PL | 61.9 |
| Vanilla Fusion + PL | 65.2 |
| Distilled Vanilla Fusion | 65.8 |
| xMUDA Fusion | 66.6 |
| Oracle | 72.2 |

Table 2: mIoU for fusion, USA/Singapore scenario.

5 Ablation Studies

Figure 5: Single vs. dual segmentation head. (a) Single head architecture: main and mimicry prediction are not uncoupled as in xMUDA of Fig. 2. (b) Mean probabilities (vehicle): we compare mean predicted probabilities on points where the true label is vehicle. In the naive single head approach, 2D/3D probabilities are simply aligned, slightly decreasing the performance of the stronger 3D prediction, while the disentangled architecture of xMUDA with 2 segmentation heads uncouples the 2D improvement from the 3D result. Day/Night scenario.

5.1 Segmentation Heads

In the following we justify our design choice of two segmentation heads per modality stream as opposed to a single one in a naive approach (see Fig. 5(a)).

In the single head architecture the mimicking objective is directly applied between the 2 main predictions, which leads to an increase of probability in the weaker and a decrease in the stronger modality, as can be seen for the vehicle class in Fig. 5(b). There is shared information between 2D/3D, but also private information in each modality. An unwanted solution to reduce the Cross-Modal loss is that the networks discard private information, so that they both only use shared information, making it easier to align their outputs. However, the best performance can only be achieved if the private information is also used. By separating the main from the mimicking prediction with dual segmentation heads, we can effectively decouple the two optimization objectives: the main head outputs the best possible prediction to optimize the segmentation loss, while the mimicking head can align with the other modality.

The experiments only include the first training step without PL, as we want to benchmark pure Cross-Modal learning. From the results in Tab. 3, xMUDA has much better performance than the single head architecture and is also much more robust when it comes to choosing a good hyperparameter, specifically the weight of the Cross-Modal loss on target, $\lambda_t$. As the loss weight becomes too large, single head performance decreases as the network resorts to the trivial solution of predicting the most frequent class to align 2D/3D outputs, while xMUDA performance is robust. Note that we fix $\lambda_s$ optimally for each architecture, 0.1 for single head and 1.0 for xMUDA.

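| $\lambda_t$ | Single head w/o PL 2D | Single head w/o PL 3D | Single head w/o PL softmax avg | xMUDA w/o PL 2D | xMUDA w/o PL 3D | xMUDA w/o PL softmax avg |
|---|---|---|---|---|---|---|
| 0.001 | 52.8 | 47.9 | 60.5 | 51.4 | 47.9 | 59.1 |
| 0.01 | 52.4 | 48.8 | 58.4 | 52.4 | 49.1 | 60.4 |
| 0.1 | 43.9 | 40.8 | 46.5 | 59.3 | 52.0 | 62.7 |
| 1.0 | 24.7 | 23.1 | 21.5 | 54.6 | 49.1 | 57.0 |

Table 3: mIoU without PL of Single head and xMUDA (Dual head), while varying the weight of the Cross-Modal loss on target, $\lambda_t$. USA/Singapore scenario.
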
Figure 6: Qualitative results on the three proposed splits. We show the ensembling result obtained from averaging the softmax output of 2D and 3D on the UDA Baseline (PL) and xMUDA.
A2D2/SemanticKitti: xMUDA helps to stabilize and refine segmentation performance when there are sensor changes (3x16 layer LiDAR with different angles to 64 layer LiDAR).
USA/Singapore: Delivery motorcycles with a storage box on the back are common in Singapore, but not in USA. The 3D shape might resemble a vehicle. However, 2D appearance information is leveraged in xMUDA to improve the recognition.
Day/Night: The visual appearance of a car at night with headlights turned on is very different from its appearance during the day. The uni-modal UDA baseline is not able to learn this new appearance. However, if information between camera and robust-at-night LiDAR is exchanged in xMUDA, it is possible to detect the car correctly at night.

5.2 Cross-Modal Learning on Source

In Eq. 4, the Cross-Modal loss $\mathcal{L}_{xM}$ is applied on source and target, although we already have the supervised segmentation loss $\mathcal{L}_{seg}$ on source. We observe an improvement of 4.8 mIoU on 2D and 4.4 on 3D when adding $\mathcal{L}_{xM}$ on source as opposed to applying it on target only. This shows that it is important to train the mimicking head on source, stabilizing the predictions, which can be exploited during adaptation on target.

5.3 Cross-Modal Learning for Oracle Training

We have shown that Cross-Modal learning is very effective for UDA. However, it can also be used in a purely supervised setting. When training the oracle with the Cross-Modal loss $\mathcal{L}_{xM}$, we can improve over the baseline, see Tab. 4. We conjecture that $\mathcal{L}_{xM}$ is a beneficial auxiliary loss that can help to regularize training and prevent overfitting.

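| Method | 2D | 3D | softmax avg |
|---|---|---|---|
| w/o $\mathcal{L}_{xM}$ | 65.8 | 63.2 | 71.1 |
| with $\mathcal{L}_{xM}$ | 66.4 | 63.8 | 71.6 |

| Method | fusion |
|---|---|
| Vanilla Fusion | 71.0 |
| Fusion + $\mathcal{L}_{xM}$ | 72.2 |

Table 4: Cross-Modal loss in supervised setting for oracle training. mIoU on USA/Singapore.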

6 Conclusion

We propose xMUDA, Cross-Modal Unsupervised Domain Adaptation, where modalities learn from each other to improve performance on the target domain. For Cross-Modal learning we introduce mutual mimicking between the modalities, achieved through KL divergence. We design an architecture with separate main and mimicking heads to disentangle the segmentation from the Cross-Modal learning objective. Experiments on 3D semantic segmentation in new UDA scenarios using 2D/3D datasets show that xMUDA largely outperforms uni-modal UDA and is complementary to the pseudo-label strategy. An analogous performance boost is observed on fusion.

We think that Cross-Modal learning could be useful in a wide variety of settings and tasks, not limited to UDA. Particularly, it should be beneficial for supervised learning and other modalities than image and point cloud.

Appendix A Dataset Splits

A.1 nuScenes

The nuScenes dataset [2] consists of 1000 driving scenes, each of 20 seconds, which corresponds to 40k annotated keyframes taken at 2Hz. The scenes are split into train (28,130 keyframes), validation (6,019 keyframes) and hidden test set. The point-wise 3D semantic labels are obtained from 3D boxes like in [31]. We propose the following splits destined for domain adaptation with the respective source/target domains: Day/Night and Boston/Singapore. Therefore, we use the official validation split as test set and divide the training set into train/val for the target set (see Tab. 5 for the number of frames in each split). As the number of object instances in the target split can be very small (e.g. for night), we merge the objects into 5 categories: vehicle (car, truck, bus, trailer, construction vehicle), pedestrian, bike (motorcycle, bicycle), traffic boundary (traffic cone, barrier) and background.

Split                   Source train   Source test   Target train   Target val   Target test
Day - Night             24,745         5,417         2,779          606          602
Boston - Singapore      15,695         3,090         9,665          2,770        2,929
A2D2 - SemanticKitti    27,695         942           18,029         1,101        4,071
Table 5: Number of frames for the 3 splits.
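
As a sanity check, the frame counts in Table 5 can be reconciled with the dataset totals quoted in the text. The sketch below assumes, following A.1, that source train, target train and target val are all drawn from the official nuScenes training split, and that both test sets come from the official validation split. The A2D2 total is the one stated in A.2. The dictionary layout and function name are illustrative.

```python
# Official nuScenes keyframe counts quoted in A.1, and the A2D2 total from A.2.
NUSCENES_TRAIN = 28_130
NUSCENES_VAL = 6_019
A2D2_FRAMES = 28_637

# Frame counts copied from Table 5.
TABLE_5 = {
    "Day - Night": {
        "source": {"train": 24_745, "test": 5_417},
        "target": {"train": 2_779, "val": 606, "test": 602},
    },
    "Boston - Singapore": {
        "source": {"train": 15_695, "test": 3_090},
        "target": {"train": 9_665, "val": 2_770, "test": 2_929},
    },
    "A2D2 - SemanticKitti": {
        "source": {"train": 27_695, "test": 942},
        "target": {"train": 18_029, "val": 1_101, "test": 4_071},
    },
}


def nuscenes_pools(row):
    """Return (frames from the official train split, frames from official val)."""
    from_train = row["source"]["train"] + row["target"]["train"] + row["target"]["val"]
    from_val = row["source"]["test"] + row["target"]["test"]
    return from_train, from_val


for name in ("Day - Night", "Boston - Singapore"):
    assert nuscenes_pools(TABLE_5[name]) == (NUSCENES_TRAIN, NUSCENES_VAL), name

a2d2 = TABLE_5["A2D2 - SemanticKitti"]["source"]
assert a2d2["train"] + a2d2["test"] == A2D2_FRAMES
```

Both nuScenes splits partition the 28,130 training and 6,019 validation keyframes exactly, and the A2D2 train and test counts add up to the 28,637 frames of the dataset.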

A.2 A2D2 and SemanticKitti

The A2D2 dataset [7] features 20 drives, corresponding to 28,637 frames. The point cloud comes from three 16-layer front LiDARs (left, center, right), of which the left and right ones are inclined. The semantic labeling was carried out in the 2D image for 38 classes, and we compute the 3D labels by projecting the point cloud into the labeled image. We keep scene 20180807_145028 as test set and use the rest for training.
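
The label transfer by projection can be sketched as follows. This is a simplified illustration, not the procedure used for A2D2: it assumes points already expressed in the camera frame (z pointing forward), an ideal pinhole camera without lens distortion, and hypothetical intrinsics `fx`, `fy`, `cx`, `cy`. Points that fall behind the camera or outside the image receive an ignore label.

```python
import math


def project_point(point, fx, fy, cx, cy):
    """Project a 3D point in the camera frame to integer pixel coordinates.

    Returns None if the point lies behind (or on) the image plane.
    """
    x, y, z = point
    if z <= 0:
        return None
    u = fx * x / z + cx
    v = fy * y / z + cy
    return int(math.floor(u)), int(math.floor(v))


def label_points(points, label_image, fx, fy, cx, cy, ignore_label=-1):
    """Assign each 3D point the 2D label of the pixel it projects to."""
    height = len(label_image)
    width = len(label_image[0])
    labels = []
    for point in points:
        pixel = project_point(point, fx, fy, cx, cy)
        if pixel is None:
            labels.append(ignore_label)
            continue
        u, v = pixel
        if 0 <= u < width and 0 <= v < height:
            labels.append(label_image[v][u])  # row index is v, column is u
        else:
            labels.append(ignore_label)
    return labels


# Toy 4x4 label image and four points (all values are made up).
label_image = [
    [0, 0, 0, 0],
    [0, 0, 0, 0],
    [1, 1, 2, 3],
    [1, 1, 1, 1],
]
points = [(0.0, 0.0, 1.0), (1.0, 0.0, 2.0), (0.0, 0.0, -1.0), (10.0, 0.0, 1.0)]
print(label_points(points, label_image, fx=2.0, fy=2.0, cx=2.0, cy=2.0))
```

A real pipeline would additionally transform the LiDAR points into the camera frame with the sensor extrinsics before projecting.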

The SemanticKitti dataset [1] provides 3D point cloud labels for the Odometry dataset of Kitti [6], which features a large-angle front camera and a 64-layer LiDAR. The annotation of the 28 classes has been carried out directly in 3D. We use the scenes as train set, as validation and as test set.

We select 10 classes shared between the 2 datasets by merging or ignoring the original classes (see Tab. 6). The 10 final classes are car, truck, bike, person, road, parking, sidewalk, building, nature, other-objects.


  • [1] J. Behley, M. Garbade, A. Milioto, J. Quenzel, S. Behnke, C. Stachniss, and J. Gall (2019) SemanticKITTI: A dataset for semantic scene understanding of LiDAR sequences. In ICCV, Cited by: §A.2, §1, §4.1.
  • [2] H. Caesar, V. Bankiti, A. H. Lang, S. Vora, V. E. Liong, Q. Xu, A. Krishnan, Y. Pan, G. Baldan, and O. Beijbom (2019) nuScenes: a multimodal dataset for autonomous driving. arXiv preprint arXiv:1903.11027. Cited by: §A.1, §4.1.
  • [3] H. Chiang, Y. Lin, Y. Liu, and W. H. Hsu (2019) A unified point-based framework for 3d segmentation. In 3DV, Cited by: §2.
  • [4] C. Choy, J. Gwak, and S. Savarese (2019) 4D spatio temporal convnet: minkowski convolutional neural networks. In CVPR, Cited by: §2.
  • [5] A. Dai, A. X. Chang, M. Savva, M. Halber, T. Funkhouser, and M. Nießner (2017) Scannet: richly-annotated 3d reconstructions of indoor scenes. In CVPR, Cited by: §1, §2.
  • [6] A. Geiger, P. Lenz, and R. Urtasun (2012) Are we ready for autonomous driving? the KITTI vision benchmark suite. In CVPR, Cited by: §A.2.
  • [7] J. Geyer, Y. Kassahun, M. Mahmudi, X. Ricou, R. Durgesh, A. S. Chung, L. Hauswald, V. H. Pham, M. Mühlegg, S. Dorn, T. Fernandez, M. Jänicke, S. Mirashi, C. Savani, M. Sturm, O. Vorobiov, and P. Schuberth (2019) A2D2: AEV autonomous driving dataset. Audi Electronics Venture GmbH. Note: http://www.a2d2.audi Cited by: §A.2, §1, §4.1.
  • [8] B. Graham, M. Engelcke, and L. van der Maaten (2018) 3D semantic segmentation with submanifold sparse convolutional networks. In CVPR, Cited by: Figure 2, §2, §3.1, §4.2.
  • [9] C. Hazirbas, L. Ma, C. Domokos, and D. Cremers (2016) Fusenet: incorporating depth into semantic segmentation via fusion-based cnn architecture. In ACCV, Cited by: §1, §2.
  • [10] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In CVPR, Cited by: §3.1, §4.2.
  • [11] G. Hinton, O. Vinyals, and J. Dean (2014) Distilling the knowledge in a neural network. In NIPS Workshops, Cited by: §3.2.
  • [12] J. Hoffman, E. Tzeng, T. Park, J. Zhu, P. Isola, K. Saenko, A. Efros, and T. Darrell (2018) CyCADA: cycle-consistent adversarial domain adaptation. In ICML, Cited by: §1, §2.
  • [13] J. Hoffman, D. Wang, F. Yu, and T. Darrell (2016) FCNs in the wild: pixel-level adversarial and constraint-based adaptation. arXiv:1612.02649. Cited by: §2.
  • [14] M. Jaritz, J. Gu, and H. Su (2019) Multi-view pointnet for 3d scene understanding. In ICCV Workshops, Cited by: §2.
  • [15] D. Lee (2013) Pseudo-label: the simple and efficient semi-supervised learning method for deep neural networks. In ICML Workshops, Cited by: §2, §3.2.
  • [16] K. Lee, G. Ros, J. Li, and A. Gaidon (2019) Spigan: privileged adversarial learning from simulation. In ICLR, Cited by: §2.
  • [17] Y. Li, L. Yuan, and N. Vasconcelos (2019) Bidirectional learning for domain adaptation of semantic segmentation. In CVPR, Cited by: Figure 1, §1, §1, §2, §3.2, §4.2, §4.3, Table 1.
  • [18] M. Liang, B. Yang, S. Wang, and R. Urtasun (2018) Deep continuous fusion for multi-sensor 3d object detection. In ECCV, Cited by: §1, §1, §2, §3.1.
  • [19] P. Morerio, J. Cavazza, and V. Murino (2018) Minimal-entropy correlation alignment for unsupervised deep domain adaptation. In ICLR, Cited by: §2.
  • [20] C. R. Qi, W. Liu, C. Wu, H. Su, and L. J. Guibas (2018) Frustum pointnets for 3d object detection from rgb-d data. In CVPR, Cited by: §1.
  • [21] C. R. Qi, L. Yi, H. Su, and L. J. Guibas (2017) PointNet++: deep hierarchical feature learning on point sets in a metric space. In NeurIPS, Cited by: §2.
  • [22] O. Ronneberger, P. Fischer, and T. Brox (2015) U-net: convolutional networks for biomedical image segmentation. In MICCAI, Cited by: Figure 2, §3.1, §4.2.
  • [23] H. Su, V. Jampani, D. Sun, S. Maji, E. Kalogerakis, M. Yang, and J. Kautz (2018) Splatnet: sparse lattice networks for point cloud processing. In CVPR, Cited by: §2.
  • [24] H. Thomas, C. R. Qi, J. Deschaud, B. Marcotegui, F. Goulette, and L. J. Guibas (2019) KPConv: flexible and deformable convolution for point clouds. In ICCV, Cited by: §2.
  • [25] Y. Tsai, W. Hung, S. Schulter, K. Sohn, M. Yang, and M. Chandraker (2018) Learning to adapt structured output space for semantic segmentation. In CVPR, Cited by: §2.
  • [26] A. Valada, R. Mohan, and W. Burgard (2019) Self-supervised model adaptation for multimodal semantic segmentation. IJCV. Cited by: §1, §2.
  • [27] T. Vu, H. Jain, M. Bucher, M. Cord, and P. Pérez (2019) Advent: adversarial entropy minimization for domain adaptation in semantic segmentation. In CVPR, Cited by: §1, §2.
  • [28] T. Vu, H. Jain, M. Bucher, M. Cord, and P. Pérez (2019) DADA: depth-aware domain adaptation in semantic segmentation. In ICCV, Cited by: §2.
  • [29] S. Wang, S. Suo, W. Ma, A. Pokrovsky, and R. Urtasun (2018) Deep parametric continuous convolutional neural networks. In CVPR, Cited by: §2.
  • [30] Y. Wang, Y. Sun, Z. Liu, S. E. Sarma, M. M. Bronstein, and J. M. Solomon (2019) Dynamic graph cnn for learning on point clouds. ACM TOG. Cited by: §2.
  • [31] B. Wu, A. Wan, X. Yue, and K. Keutzer (2018) Squeezeseg: convolutional neural nets with recurrent crf for real-time road-object segmentation from 3d LiDAR point cloud. In ICRA, Cited by: §A.1.
  • [32] B. Wu, X. Zhou, S. Zhao, X. Yue, and K. Keutzer (2019) Squeezesegv2: improved model structure and unsupervised domain adaptation for road-object segmentation from a lidar point cloud. In ICRA, Cited by: §1, §2.
  • [33] Y. Zhang, T. Xiang, T. M. Hospedales, and H. Lu (2018) Deep mutual learning. In CVPR, Cited by: §3.2.
  • [34] Y. Zou, Z. Yu, X. Liu, B.V.K. V. Kumar, and J. Wang (2019) Confidence regularized self-training. In ICCV, Cited by: §1, §2, §3.2.