Cross-modal Learning for Domain Adaptation in 3D Semantic Segmentation

by   Maximilian Jaritz, et al.

Domain adaptation is an important task to enable learning when labels are scarce. While most works focus only on the image modality, there are many important multi-modal datasets. In order to leverage multi-modality for domain adaptation, we propose cross-modal learning, where we enforce consistency between the predictions of two modalities via mutual mimicking. We constrain our network to make correct predictions on labeled data and consistent predictions across modalities on unlabeled target-domain data. Experiments in unsupervised and semi-supervised domain adaptation settings demonstrate the effectiveness of this novel domain adaptation strategy. Specifically, we evaluate on the task of 3D semantic segmentation using the image and point cloud modalities. We leverage recent autonomous driving datasets to produce a wide variety of domain adaptation scenarios including changes in scene layout, lighting, sensor setup and weather, as well as the synthetic-to-real setup. Our method significantly improves over previous uni-modal adaptation baselines on all adaptation scenarios. Code will be made available.




1 Introduction

Scene understanding plays a central role in a number of modern applications and, among other tasks, semantic segmentation from images has been extensively studied. However, for applications that involve interaction with the 3D world, e.g., robots, self-driving cars or virtual reality, scenes should be understood in 3D. In this context, 3D semantic segmentation is gaining attention and an increasing number of datasets provide jointly annotated 3D point clouds and 2D images. The modalities are complementary since point clouds provide geometry while images capture texture and color.

Human labeling of segmentation is tedious in images [8], but even more so in 3D point clouds, because the annotator has to inspect the scene from different viewpoints [4]. This results in a high annotation cost. Unfortunately, the question whether sufficient ground truth can be obtained to train a large neural network can make or break a computer vision system.

Our goal in this work is to alleviate this problem with transfer learning, in particular domain adaptation (DA), where a model is trained by leveraging data from a different source domain to improve performance on the desired target domain, while benefiting from multi-modal 2D/3D data.

Fig. 1: Overview of the proposed cross-modal learning for domain adaptation. Here, a 2D and a 3D network take an image and a point cloud as input respectively and predict 3D segmentation labels. Note that the 2D predictions are lifted to 3D. The proposed cross-modal learning enforces consistency between the 2D and 3D predictions via mutual mimicking, which proves beneficial in both unsupervised and semi-supervised domain adaptation.

We consider both unsupervised and semi-supervised DA, that is, when labels are available in the source domain, but not (or only partially) in the target domain. Most of the DA literature investigates the image modality [19, 20, 42, 40, 24], and only a few works address the point-cloud modality [47]. Different from these, we perform DA on images and point clouds simultaneously with the aim of explicitly exploiting multi-modality for the DA goal.

We use self-driving data from cameras and LiDAR point-cloud sensors, and want to profit from the fact that the domain gaps differ across these sensors. For example, a LiDAR is more robust to lighting changes (e.g., day/night) than a camera. On the other hand, LiDAR sensing density varies with the sensor setup while cameras always output dense images. Our work takes advantage of the cross-modal discrepancies while preserving the best performance of each sensor – thus avoiding that the limitations of one modality negatively affect the other modality’s performance.

We propose a cross-modal loss which enforces consistency between multi-modal predictions, as depicted in Fig. 1. Our specifically designed dual-head architecture enables robust training by decoupling the supervised main segmentation loss from the unsupervised cross-modal loss.

We demonstrate that our cross-modal framework can be applied either in the unsupervised setting (coined xMUDA) or in the semi-supervised setting (coined xMoSSDA).

This paper is an extension of our work [21], which covered only UDA evaluated on three scenarios. Besides the significant expansion of the experimental evaluation (Sec. 4), including the addition of two new DA scenarios (see Fig. 9), we add a completely new use case of semi-supervised DA (SSDA) in Secs. 3.3 and 4.4. The original code-base of [21] will be extended with the new experiments and the SSDA setup.

In summary our contributions are:

  • We introduce new domain adaptation scenarios (4 unsupervised and 2 semi-supervised), for the task of 3D semantic segmentation, leveraging recent 2D-3D driving datasets with cameras and LiDARs.

  • We propose a new DA approach with an unsupervised cross-modal loss which enforces multi-modal consistency and is complementary to other existing unsupervised techniques [22].

  • We design a robust dual-head architecture which uncouples the cross-modal loss from the main segmentation objective.

  • We evaluate xMUDA and xMoSSDA, our unsupervised and semi-supervised DA methods respectively, and demonstrate their superior performance.

Fig. 2: Our architecture for cross-modal unsupervised learning for domain adaptation. There are two independent network streams: a 2D stream which takes an image as input and uses a U-Net-style 2D ConvNet [33], as well as a 3D stream which takes a point cloud as input and uses a U-Net-style 3D SparseConvNet [15]. The size of the first dimension of the feature output tensors of both streams is $N$, equal to the number of 3D points. To achieve this equality, we project the 3D points into the image and sample the 2D features at the corresponding pixel locations. The four segmentation outputs consist of the main predictions $P_{2D}$, $P_{3D}$ and the mimicry predictions $P_{2D \to 3D}$, $P_{3D \to 2D}$. We transfer knowledge across modalities using KL divergences: $D_{KL}(P_{3D} \,\|\, P_{2D \to 3D})$, where the objective of the 2D mimicry prediction is to estimate the main 3D prediction, and, vice versa, $D_{KL}(P_{2D} \,\|\, P_{3D \to 2D})$.


2 Related Work

2.1 Unsupervised Domain Adaptation

The past few years have seen an increasing interest in unsupervised domain adaptation (UDA) for complex perception tasks like object detection and semantic segmentation. Under the hood of such methods lies the same spirit of learning domain-invariant representations, i.e., features coming from different domains should introduce insignificant discrepancies. Some works promote adversarial training to minimize the source-target distribution shift, either on pixel- [19], feature- [20] or output-space [40, 42]. Revisited from semi-supervised learning [22], self-training with pseudo-labels has also been recently proven effective for UDA [24, 51, 35].

Recent works start addressing UDA in the 3D world, i.e., for point clouds. PointDAN [31] proposes to jointly align local and global features used for classification. Achituve et al. [1] improve UDA performance using self-supervised learning. Wu et al. [47] adopt activation correlation alignment [29] for UDA in 3D segmentation from LiDAR point clouds. Yi et al. [49] address the domain discrepancies induced by different LiDAR sensors by recovering the canonical 3D surfaces, on top of which the segmentation downstream task is performed. In this work, we investigate the same task but in a different setting, since our system operates on multi-modal input data, i.e., RGB + LiDAR.

To the best of our knowledge, there are no previous UDA works in 2D/3D semantic segmentation for multi-modal scenarios. Only some works consider an extra modality, e.g., depth, that is solely available at training time on the source domain and leverage such privileged information to boost adaptation performance [23, 43]. In contrast, we assume that all modalities are available at train and test time on both source and target domains.

2.2 Semi-supervised Domain Adaptation

While UDA has become an active research topic, semi-supervised domain adaptation (SSDA) has so far been little-studied despite being highly relevant in practical applications. In SSDA, we would like to transfer knowledge from a source domain with labeled data to a target domain with partially labeled data.

Early approaches based on SVM [9] have addressed SSDA in image classification and object detection [11, 48, 3]; little has been done for deep networks. Recently, Saito et al. [34] proposed an adversarial SSDA learning scheme to optimize a few-shot deep classification model with minimax entropy. Wang et al. [45] extended UDA techniques in 2D semantic segmentation to the SSDA setting by additionally aligning feature prototypes of labeled source and target samples. Our work is the first to address SSDA in point cloud segmentation.

2.3 Cross-modality learning

In our context, we define cross-modality learning as knowledge transfer between modalities. This is different from multi-modal fusion where a single model is trained supervisedly to combine complementary inputs, such as RGB-D [16, 41] or LiDAR and camera [25, 26, 28].

Castrejón et al. [6] address the task of cross-modal scene retrieval. Their goal is to learn a joint high-level feature representation which is agnostic to the input modality (real image, clip art, text, etc.). This is achieved by fine-tuning the shared weights of the final network’s layers and enforcing similar statistics across modalities. Gupta et al. adapt the more direct feature alignment technique of distillation [18] in a cross-modal setup. They apply an L2 loss between multi-modal features to transfer knowledge from a supervisedly trained RGB network to unlabeled depth or optical flow.

Self-supervised learning is the task of learning useful representations in the absence of labels. This can be achieved by forcing networks with different input modalities to predict a similar output. Sayed et al. [36] minimize the cosine distance between RGB and optical flow features. Instead, Alwassel et al. [2] use clustering to generate pseudo labels and use them to mutually train an audio and video network.

Different from those related works, we address the task of domain adaptation, specifically in point cloud segmentation, using the modalities of RGB and LiDAR.

2.4 Point cloud segmentation

While images are dense tensors, 3D point clouds can be represented in multiple ways, which leads to competing network families evolving in parallel.

Voxels are similar to pixels, but very memory-intensive in their dense representation as most of them are usually empty. Some 3D CNNs [32, 38] rely on octrees [27] to reduce memory usage but without addressing the problem of manifold dilation. Graham et al. [15] and a similar implementation [7] address the latter by using hash tables to convolve only on active voxels. This allows for very high resolution with typically only one point per voxel.

Point-based networks perform computation in continuous 3D space and can thus directly accept point clouds as input. PointNet++ [30] uses point-wise convolution, max-pooling to compute global features and local neighborhood aggregation for hierarchical learning akin to CNNs. Many improvements have been proposed in this direction, such as continuous convolutions [44] and deformable kernels [39].

In this work, we select SparseConvNet [15] as our 3D network. It is top performing on the ScanNet benchmark [10].

3 Cross-modal Learning for Domain Adaptation

Our aim is to exploit multi-modality as a source of knowledge for unsupervised learning in domain adaptation. Therefore, we propose a cross-modal learning objective, implemented as a mutual mimicking game between modalities, that drives toward consistency across predictions from different modalities.

Specifically, we investigate the modalities of 2D images and 3D point clouds for the task of 3D semantic segmentation as it is a core task for machine vision.

We present the network architecture in Sec. 3.1, our framework for the challenging cross-modal unsupervised domain adaptation, coined ‘xMUDA’, in Sec. 3.2 and its semi-supervised version, analogously called ‘xMoSSDA’, in Sec. 3.3.

3.1 Architecture

Our architecture predicts point-wise segmentation labels. It consists of two independent streams which respectively take a 2D image and a 3D point cloud as inputs, and output features of size $(N, F_{2D})$ and $(N, F_{3D})$ respectively, where $N$ is the number of 3D points within the camera field of view. An overview is depicted in Fig. 2.

As network backbones, we use SparseConvNet [15] for 3D and a modified version of U-Net [33] for 2D. Further implementation details are provided in Sec. 4.2.

3.1.1 Dual Segmentation Head

Fig. 5: Single-head vs. dual-head architecture. (a) Single head: naive way of enforcing consistency directly between main segmentation heads. (b) Dual head: our proposal of a dual-head architecture to uncouple the mimicry from the main segmentation head for more robustness.

We call segmentation head (depicted as ‘classify’ arrows in the architecture figures) the last linear layer in the network that transforms the output features into logits followed by a softmax function to produce class probabilities.

For cross-modal learning, we establish a mimicking game between the 2D and 3D output probabilities, i.e., each modality should predict the other modality’s output. The overall objective drives the two modalities toward an agreement, thus enforcing consistency between outputs.

In a naive approach, each modality has a single segmentation head (as depicted in Fig. 5a) and the cross-modal optimization objective aligns the outputs of both modalities. Unfortunately, this setup is not robust as the mimicking objective is in direct competition with the main segmentation objective. This is why, in practice, one needs to down-weight the mimicry loss relative to the segmentation loss to observe a performance gain. However, this is a serious limitation, because down-weighting the mimicry loss also decreases its adaptation effect.

In order to address this problem, we propose to disentangle the mimicry from the main segmentation objective. Therefore, we propose a dual-head architecture as depicted in Figs. 2 and 5b. In this setup, the 2D and 3D streams both have two segmentation heads: one main head for the best possible prediction, and one mimicry head to estimate the other modality’s output.

The outputs of the four segmentation heads (see Fig. 2) are of size $(N, C)$, where $C$ is the number of classes such that we obtain a vector of class probabilities for each 3D point. The two main heads produce the best possible segmentation predictions, $P_{2D}$ and $P_{3D}$ respectively for each branch. The two mimicry heads estimate the other modality’s output: 2D estimates 3D ($P_{2D \to 3D}$) and 3D estimates 2D ($P_{3D \to 2D}$).
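As an illustration of the dual-head design, the following is a minimal, framework-free Python sketch and not the released implementation. The class name `DualHead`, the random initialization and the toy sizes are assumptions made for this example; in practice each head would be a linear layer of a deep-learning framework applied to the per-point features of one stream.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over one vector of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def linear(features, weight, bias):
    """Apply y = W x + b to a single feature vector."""
    return [sum(w * f for w, f in zip(row, features)) + b
            for row, b in zip(weight, bias)]


class DualHead:
    """Two independent linear classifiers on top of the same point features.

    Hypothetical helper: 'main' gives the stream's own prediction,
    'mimic' estimates the other modality's main prediction.
    """

    def __init__(self, feat_dim, num_classes, rng):
        def init_layer():
            weight = [[rng.uniform(-0.1, 0.1) for _ in range(feat_dim)]
                      for _ in range(num_classes)]
            return weight, [0.0] * num_classes

        self.main = init_layer()
        self.mimic = init_layer()

    def __call__(self, point_features):
        main = [softmax(linear(f, *self.main)) for f in point_features]
        mimic = [softmax(linear(f, *self.mimic)) for f in point_features]
        return main, mimic


# Toy usage: N = 4 points, F = 3 feature channels, C = 5 classes.
rng = random.Random(0)
features_2d = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(4)]
head_2d = DualHead(feat_dim=3, num_classes=5, rng=rng)
main_probs, mimic_probs = head_2d(features_2d)
```

Both outputs have one probability vector per point. Because the heads do not share their last layer, the mimicry objective can be optimized without directly pulling on the main prediction.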

In the following, we introduce how we use the described architecture for cross-modal learning in unsupervised (Sec. 3.2) and semi-supervised (Sec. 3.3) domain adaptation, respectively.

3.2 Unsupervised Domain Adaptation (xMUDA)

Fig. 8: Details of the proposed cross-modal training with adaptation. (a) Proposed UDA training setup: xMUDA learns from supervision on the source domain (plain lines) and self-supervision on the target domain (dashed lines) thanks to cross-modal learning between 2D/3D. (b) UDA on multi-modal data: we consider four data subsets: Source 2D, Target 2D, Source 3D and Target 3D. In contrast to existing techniques, xMUDA introduces a cross-modal self-training mechanism for UDA.

We propose xMUDA, cross-modal unsupervised domain adaptation, which considers a source-domain dataset $\mathcal{S}$, where each sample consists of a 2D image $x_s^{2D}$, a 3D point cloud $x_s^{3D}$ and 3D segmentation labels $y_s^{3D}$ with $C$ classes, as well as a target-domain dataset $\mathcal{T}$, lacking annotations, where each sample only consists of an image $x_t^{2D}$ and a point cloud $x_t^{3D}$.

In the following, we define the usual supervised learning setup, our cross-modal loss $\mathcal{L}_{\text{xM}}$, and an additional variant ‘xMUDAPL’ that further uses pseudo-labels to boost performance. An overview of the learning setup is depicted in Fig. 8a. The difference between our proposed cross-modal learning and existing uni-modal UDA techniques, such as Pseudo-labels [22], MinEnt [42] or Deep logCORAL [29], is visualized in Fig. 8b.

3.2.1 Supervised Learning

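The main goal of 3D segmentation is learned through cross-entropy in a classical supervised fashion on the source-domain data. Denoting $P_x$ the soft-classification map associated by the segmentation model to the 3D points of interest, for a given input $x$, the segmentation loss $\mathcal{L}_{\text{seg}}$ of each network stream (2D and 3D) for a given training sample in $\mathcal{S}$ reads:

$$\mathcal{L}_{\text{seg}}(x_s, y_s^{3D}) = -\frac{1}{N} \sum_{n=1}^{N} \sum_{c=1}^{C} y_s^{(n,c)} \log P_{x_s}^{(n,c)}, \tag{1}$$

where $x_s$ is either $x_s^{2D}$ or $x_s^{3D}$ and $P_{x_s}$ accordingly equals the main prediction $P_{2D}$ or $P_{3D}$. We denote tensor entries’ indices as superscript.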
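The point-wise cross-entropy described above can be sketched in a few lines of plain Python. This is an illustrative sketch under the assumption that labels are stored as integer class ids, so that the one-hot double sum over points and classes reduces to picking the probability of the labeled class; the function name and the `eps` safeguard are not from the paper.

```python
import math


def segmentation_loss(probs, labels, eps=1e-12):
    """Mean cross-entropy over N points.

    probs:  list of N lists holding C class probabilities each
    labels: list of N integer class ids (assumed encoding)
    """
    n = len(probs)
    # With one-hot labels, only the log-probability of the true class remains.
    return -sum(math.log(p[y] + eps) for p, y in zip(probs, labels)) / n


# Two points, two classes.
toy_probs = [[0.5, 0.5], [0.25, 0.75]]
toy_labels = [0, 1]
toy_loss = segmentation_loss(toy_probs, toy_labels)
```

The same function would be evaluated once per stream, with the 2D and the 3D main predictions respectively, against the same 3D labels.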

3.2.2 Cross-Modal Learning

The objective of unsupervised learning across modalities is twofold. Firstly, we want to transfer knowledge from one modality to the other on the target-domain dataset. For example, let one modality be sensitive and the other more robust to the domain shift, then the robust modality should teach the sensitive modality the correct class in the target domain where no labels are available. Secondly, we want to design an auxiliary objective on source and target domains, where the task is to estimate the other modality’s prediction. By mimicking not only the class with maximum probability, but the whole distribution like in teacher-student distillation [18], more information is exchanged, leading to softer labels.

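We choose the KL divergence for the cross-modal loss $\mathcal{L}_{\text{xM}}$ and define it as follows:

$$\mathcal{L}_{\text{xM}}(x) = D_{KL}\big(P_x^{(n,c)} \,\|\, Q_x^{(n,c)}\big) \tag{2}$$

$$= \frac{1}{N} \sum_{n=1}^{N} \sum_{c=1}^{C} P_x^{(n,c)} \log \frac{P_x^{(n,c)}}{Q_x^{(n,c)}}, \tag{3}$$

with $(P, Q) \in \{(P_{2D}, P_{3D \to 2D}), (P_{3D}, P_{2D \to 3D})\}$, where $P$ is the target distribution from the main prediction which is to be estimated by the mimicking prediction $Q$. This loss is applied on the source and the target domain as it does not require ground-truth labels and is the key to our proposed domain adaptation framework. In the source domain, $\mathcal{L}_{\text{xM}}$ can be seen as an auxiliary mimicry loss in addition to the main segmentation loss $\mathcal{L}_{\text{seg}}$.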
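To make the mimicry loss concrete, here is a small plain-Python sketch of a KL divergence averaged over points. The function name, the `eps` safeguard and the toy values are assumptions for illustration; the target distribution is treated as a constant, which in an autodiff framework would correspond to not propagating gradients through it.

```python
import math


def cross_modal_loss(target_probs, mimic_probs, eps=1e-12):
    """KL(target || mimic), averaged over the N points.

    target_probs: main prediction of one modality (N x C), treated as fixed
    mimic_probs:  mimicry prediction of the other modality (N x C)
    """
    n = len(target_probs)
    total = 0.0
    for p_row, q_row in zip(target_probs, mimic_probs):
        for p, q in zip(p_row, q_row):
            if p > 0.0:  # terms with p = 0 contribute nothing
                total += p * math.log(p / (q + eps))
    return total / n


main_3d = [[0.9, 0.1], [0.2, 0.8]]
mimic_2d_to_3d = [[0.6, 0.4], [0.3, 0.7]]
loss_2d = cross_modal_loss(main_3d, mimic_2d_to_3d)   # trains the 2D mimicry head
loss_same = cross_modal_loss(main_3d, main_3d)        # perfect mimicry
```

The loss is zero (up to `eps`) when the mimicry head reproduces the target exactly and positive otherwise. It is not symmetric, which is why each direction (2D estimating 3D, and 3D estimating 2D) has its own term.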

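The complete objective for each network stream (2D and 3D) is the combination of the segmentation loss $\mathcal{L}_{\text{seg}}$ on source-domain data and the cross-modal loss $\mathcal{L}_{\text{xM}}$ on both domains:

$$\min_{\theta} \left[ \frac{1}{|\mathcal{S}|} \sum_{x_s \in \mathcal{S}} \Big( \mathcal{L}_{\text{seg}}(x_s, y_s^{3D}) + \lambda_s \mathcal{L}_{\text{xM}}(x_s) \Big) + \frac{1}{|\mathcal{T}|} \sum_{x_t \in \mathcal{T}} \lambda_t \mathcal{L}_{\text{xM}}(x_t) \right], \tag{4}$$

where $\lambda_s, \lambda_t$ are hyperparameters to weight $\mathcal{L}_{\text{xM}}$ on source and target domain respectively and $\theta$ are the network weights of either the 2D or the 3D stream.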
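The way the terms of the complete objective are combined can be illustrated with a short sketch. It assumes that per-sample loss values have already been computed; the function name and all numeric values below (including the weights) are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def xmuda_objective(source_seg, source_xm, target_xm, lambda_s, lambda_t):
    """Combine per-sample losses of one stream into a single scalar.

    source_seg: segmentation losses on source samples
    source_xm:  cross-modal losses on the same source samples
    target_xm:  cross-modal losses on (unlabeled) target samples
    """
    source_term = sum(seg + lambda_s * xm
                      for seg, xm in zip(source_seg, source_xm)) / len(source_seg)
    target_term = sum(lambda_t * xm for xm in target_xm) / len(target_xm)
    return source_term + target_term


# Hypothetical loss values and weights.
objective = xmuda_objective(
    source_seg=[1.0, 3.0],
    source_xm=[0.5, 0.5],
    target_xm=[2.0, 4.0],
    lambda_s=1.0,
    lambda_t=0.1,
)
```

Note that the target term contains no segmentation loss, since no target labels are available in the unsupervised setting.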

There are parallels between our approach and Deep Mutual Learning [50] in training two networks in collaboration and using the KL divergence as mimicry loss. However, unlike the aforementioned work, our cross-modal learning establishes consistency across modalities (2D/3D) without supervision.

3.2.3 Self-training with Pseudo-Labels

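Cross-modal learning is complementary to pseudo-labeling [22], used originally in semi-supervised learning and recently in UDA [24, 51]. To benefit from both, once having optimized a model with Eq. 4, we extract pseudo-labels offline, selecting highly-confident labels based on the predicted class probability. Then, we train again from scratch using the produced pseudo-labels for an additional segmentation loss on the target-domain training set. The optimization problem reads:

$$\min_{\theta} \left[ \frac{1}{|\mathcal{S}|} \sum_{x_s \in \mathcal{S}} \Big( \mathcal{L}_{\text{seg}}(x_s, y_s^{3D}) + \lambda_s \mathcal{L}_{\text{xM}}(x_s) \Big) + \frac{1}{|\mathcal{T}|} \sum_{x_t \in \mathcal{T}} \Big( \lambda_{PL} \mathcal{L}_{\text{seg}}(x_t, \hat{y}^{3D}) + \lambda_t \mathcal{L}_{\text{xM}}(x_t) \Big) \right], \tag{5}$$

where $\lambda_{PL}$ weights the pseudo-label segmentation loss and $\hat{y}^{3D}$ are the pseudo-labels. For clarity, we will refer to the xMUDA variant that uses additional self-training with pseudo-labels as xMUDAPL.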
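The offline selection of confident pseudo-labels can be sketched as follows. The text only states that highly-confident labels are selected based on the predicted class probability; the single global threshold, its value of 0.9 and the `IGNORE` marker used here are assumptions for illustration and may differ from the selection rule actually used.

```python
IGNORE = -100  # hypothetical marker for points excluded from the loss


def extract_pseudo_labels(probs, threshold=0.9):
    """Keep the arg-max class of each point only if it is confident enough.

    probs: list of N lists holding C class probabilities each
    """
    labels = []
    for row in probs:
        confidence = max(row)
        labels.append(row.index(confidence) if confidence >= threshold else IGNORE)
    return labels


target_probs = [[0.95, 0.05], [0.60, 0.40], [0.02, 0.98]]
pseudo_labels = extract_pseudo_labels(target_probs)
```

Points marked with `IGNORE` would simply not contribute to the additional segmentation loss on the target-domain training set.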

3.3 Semi-supervised Domain Adaptation (xMoSSDA)

Cross-modal learning can also be used in semi-supervised domain adaptation, thus benefiting from a small portion of labeled data in the target domain.

Formally, we consider in xMoSSDA a labeled source-domain dataset $\mathcal{S}$ where each sample contains an image $x_s^{2D}$, a point cloud $x_s^{3D}$ and labels $y_s^{3D}$. Different from unsupervised learning, the target-domain set consists of a usually small labeled part $\mathcal{T}_\ell$ where each sample holds an image $x_{t\ell}^{2D}$, a point cloud $x_{t\ell}^{3D}$ and labels $y_{t\ell}^{3D}$, as well as an often larger unlabeled part $\mathcal{T}_u$ where each sample consists only of an image $x_{tu}^{2D}$ and a point cloud $x_{tu}^{3D}$.

3.3.1 Supervised Learning

Unlike xMUDA, we apply the segmentation loss of Eq. 1 not only on the source-domain dataset $\mathcal{S}$, but also on the labeled target-domain dataset $\mathcal{T}_\ell$. The segmentation loss in Eq. 1 thus applies both to samples in $\mathcal{S}$ and to samples in $\mathcal{T}_\ell$. Note that, in practice, we train on source and target domains at the same time by concatenating examples from both in a batch.

3.3.2 Cross-Modal Learning

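We apply the unsupervised cross-modal loss of Eq. 2 on all datasets, i.e., the (labeled) source-domain dataset $\mathcal{S}$, the labeled target-domain dataset $\mathcal{T}_\ell$, and the unlabeled target-domain dataset $\mathcal{T}_u$. The latter, $\mathcal{T}_u$, is typically a large portion of unlabeled data compared to a usually much smaller labeled portion $\mathcal{T}_\ell$. Consequently, it is beneficial to also exploit $\mathcal{T}_u$ with an unsupervised loss, such as cross-modal learning. The complete objective is a combination of the supervised segmentation loss where labels are available (i.e., on $\mathcal{S}$ and $\mathcal{T}_\ell$), and the unsupervised cross-modal loss everywhere (i.e., on $\mathcal{S}$ and on both target sets $\mathcal{T}_\ell$ and $\mathcal{T}_u$), which enforces consistency between the 2D and 3D predictions. It reads:

$$\min_{\theta} \left[ \frac{1}{|\mathcal{S}|} \sum_{x_s \in \mathcal{S}} \Big( \mathcal{L}_{\text{seg}}(x_s, y_s^{3D}) + \lambda_s \mathcal{L}_{\text{xM}}(x_s) \Big) + \frac{1}{|\mathcal{T}_\ell|} \sum_{x_{t\ell} \in \mathcal{T}_\ell} \Big( \mathcal{L}_{\text{seg}}(x_{t\ell}, y_{t\ell}^{3D}) + \lambda_{t\ell} \mathcal{L}_{\text{xM}}(x_{t\ell}) \Big) + \frac{1}{|\mathcal{T}_u|} \sum_{x_{tu} \in \mathcal{T}_u} \lambda_{tu} \mathcal{L}_{\text{xM}}(x_{tu}) \right], \tag{6}$$

where $\lambda_s$, $\lambda_{t\ell}$ and $\lambda_{tu}$ are the weighting hyperparameters for $\mathcal{L}_{\text{xM}}$. In practice we choose $\lambda_{t\ell} = \lambda_s$ for simplicity.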

3.3.3 Self-training with Pseudo-Labels

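As in the unsupervised setting, we extend semi-supervised cross-modal learning to also benefit from pseudo-labels. After having trained a model with Eq. 6, we use the model to generate predictions on the unlabeled target-domain dataset $\mathcal{T}_u$ and extract highly-confident pseudo-labels which are used to train again from scratch with the following objective:

$$\min_{\theta} \left[ \frac{1}{|\mathcal{S}|} \sum_{x_s \in \mathcal{S}} \Big( \mathcal{L}_{\text{seg}}(x_s, y_s^{3D}) + \lambda_s \mathcal{L}_{\text{xM}}(x_s) \Big) + \frac{1}{|\mathcal{T}_\ell|} \sum_{x_{t\ell} \in \mathcal{T}_\ell} \Big( \mathcal{L}_{\text{seg}}(x_{t\ell}, y_{t\ell}^{3D}) + \lambda_{t\ell} \mathcal{L}_{\text{xM}}(x_{t\ell}) \Big) + \frac{1}{|\mathcal{T}_u|} \sum_{x_{tu} \in \mathcal{T}_u} \Big( \lambda_{PL} \mathcal{L}_{\text{seg}}(x_{tu}, \hat{y}^{3D}) + \lambda_{tu} \mathcal{L}_{\text{xM}}(x_{tu}) \Big) \right], \tag{7}$$

where $\lambda_{PL}$ weights the pseudo-label segmentation loss and $\hat{y}^{3D}$ are the pseudo-labels. We call this variant xMoSSDAPL.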

4 Experiments

For evaluation, we identified five domain adaptation (DA) scenarios relevant to autonomous driving, shown in Fig. 9, and evaluated our cross-modal proposals against recent baselines.

In the following, we first describe the datasets (Sec. 4.1), the implementation backbone and training details (Sec. 4.2), and then evaluate xMUDA (Sec. 4.3) and xMoSSDA (Sec. 4.4). Finally, we extend our cross-modal framework to fusion (Sec. 4.5), demonstrating its global benefit.

4.1 Datasets

To compose our domain adaptation scenarios displayed in Fig. 9, we leveraged the public datasets nuScenes [5], VirtualKITTI [12], SemanticKITTI [4], A2D2 [14] and Waymo Open Dataset (Waymo OD) [37]. The split details are in Tab. I. Our scenarios cover typical DA challenges: a change in scene layout between right- and left-hand-side driving (nuScenes: USA/Singapore); lighting changes between day and night (nuScenes: Day/Night); synthetic-to-real data, from simulated depth and RGB to real LiDAR and camera (VirtualKITTI/SemanticKITTI); different sensor setups and characteristics such as resolution and FoV (A2D2/SemanticKITTI); and weather changes between sunny San Francisco, Phoenix and Mountain View and rainy Kirkland (Waymo OD: SF,PHX,MTV/KRK).

In all datasets, the LiDAR and the camera are synchronized and calibrated, allowing 2D/3D projections. For consistency across datasets, we only use the front camera’s images, even when multiple cameras are available.

Waymo OD and nuScenes do not provide point-wise 3D segmentation labels, but we produce them by leveraging the 3D object bounding-box labels (the nuScenes LiDAR segmentation dataset was not released at the time of our work). Points lying inside a box are labeled with the class of that box and points outside of all boxes are labeled as background.
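The box-based labeling rule can be sketched as a point-in-box test. The box format below (a center, a size given as length, width and height, a yaw angle around the vertical axis and a class id) and the choice of 0 as background id are assumptions for this example; the actual annotation formats of the datasets differ in detail.

```python
import math

BACKGROUND = 0  # hypothetical id for points outside all boxes


def point_in_box(point, box):
    """Test whether a 3D point lies inside a yaw-rotated bounding box."""
    cx, cy, cz = box["center"]
    length, width, height = box["size"]
    dx, dy, dz = point[0] - cx, point[1] - cy, point[2] - cz
    cos_y, sin_y = math.cos(box["yaw"]), math.sin(box["yaw"])
    # Express the horizontal offset in the box frame (inverse yaw rotation).
    local_x = cos_y * dx + sin_y * dy
    local_y = -sin_y * dx + cos_y * dy
    return (abs(local_x) <= length / 2
            and abs(local_y) <= width / 2
            and abs(dz) <= height / 2)


def label_points(points, boxes):
    """Assign each point the class of the first box containing it."""
    labels = []
    for point in points:
        label = BACKGROUND
        for box in boxes:
            if point_in_box(point, box):
                label = box["class_id"]
                break
        labels.append(label)
    return labels


# A 4 m x 2 m x 2 m box rotated by 90 degrees: its long side lies along y.
boxes = [{"center": (0.0, 0.0, 0.0), "size": (4.0, 2.0, 2.0),
          "yaw": math.pi / 2, "class_id": 1}]
points = [(0.0, 1.5, 0.0), (1.5, 0.0, 0.0), (10.0, 0.0, 0.0)]
point_labels = label_points(points, boxes)
```

Only the first point falls inside the rotated box; the second would be inside an unrotated box of the same size, which shows why the yaw has to be taken into account.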

As some classes have too few samples, we merge or ignore them. If class definitions differ between source and target domains (e.g., VirtualKITTI/SemanticKITTI), we define a custom class mapping. Note that VirtualKITTI contains depth maps in the camera’s reference frame. In order to simulate LiDAR scanning, we uniformly sample points in the depth map.
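Simulating a LiDAR scan from a depth map can be sketched as uniform pixel sampling followed by back-projection. The pinhole camera model with intrinsics `fx`, `fy`, `cx`, `cy`, the convention that non-positive depth marks invalid pixels, and the toy values are assumptions made for this illustration.

```python
import random


def sample_points_from_depth(depth, fx, fy, cx, cy, num_points, rng):
    """Uniformly sample pixels of a depth map and back-project them to 3D.

    depth: H x W nested list of depths along the camera z axis
    Returns a list of (x, y, z) points in the camera reference frame.
    """
    height, width = len(depth), len(depth[0])
    valid = [(v, u) for v in range(height) for u in range(width)
             if depth[v][u] > 0]
    chosen = rng.sample(valid, min(num_points, len(valid)))
    points = []
    for v, u in chosen:
        z = depth[v][u]
        # Inverse pinhole projection of pixel (u, v) at depth z.
        points.append(((u - cx) * z / fx, (v - cy) * z / fy, z))
    return points


# Toy 4 x 4 depth map with a constant depth of 2 m.
toy_depth = [[2.0] * 4 for _ in range(4)]
sampled = sample_points_from_depth(toy_depth, fx=1.0, fy=1.0, cx=0.0, cy=0.0,
                                   num_points=5, rng=random.Random(0))
```

Uniform sampling in image space yields a point density that differs from a real rotating LiDAR, which is one of the factors contributing to the synthetic-to-real gap.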

We provide further details about the datasets in App. A. The training data and splits can be reproduced with our code.

[Figure 9 panels, each with a Source and a Target example: nuScenes [5]: USA/Singapore; nuScenes [5]: Day/Night; Virt.KITTI [12] / Sem.KITTI [4]; A2D2 [14] / Sem.KITTI [4]; Waymo OD [37]: SF,PHX,MTV/KRK]

Fig. 9: Proposed DA scenarios for 3D semantic segmentation. Overview of the five DA scenarios used in our experiments. We generate the nuScenes [5] splits using metadata. The third and fourth DA scenarios both use SemanticKITTI [4] as target-domain dataset and either the synthetic VirtualKITTI [12] or the real A2D2 dataset [14] as source-domain dataset. Note that we show the A2D2/SemanticKITTI scenario with LiDAR overlay to visualize the density difference and resulting domain gap. Last, Waymo OD [37] features a source-domain dataset in the cities of San Francisco (SF), Phoenix (PHX) and Mountain View (MTV) and a target-domain dataset in Kirkland (KRK). We evaluate xMUDA on scenarios 1-4 and xMoSSDA on scenarios 4-5.
UDA
Scenario                    Source Train   Target Train   Target Val/Test
nuSc: USA/Singap.                 15,695          9,665       2,770/2,929
nuSc: Day/Night                   24,745          2,779           606/602
Virt.KITTI/Sem.KITTI               2,126         18,029       1,101/4,071
A2D2/Sem.KITTI                    27,695         18,029       1,101/4,071

SSDA
Scenario                    Source Train   Target Train (labeled)   Target Train (unlabeled)   Target Val/Test
A2D2/Sem.KITTI                    27,695                    5,642                     32,738       1,101/4,071
Waymo OD: SF,PHX,MTV/KRK         158,081                   11,853                     94,624       3,943/3,932

TABLE I: Size of the splits in frames for all proposed DA scenarios. While there is a single target-domain training set in UDA ($\mathcal{T}$), there are two in SSDA: a labeled target-domain training set $\mathcal{T}_\ell$ and a (much larger) unlabeled set $\mathcal{T}_u$.

4.2 Implementation Details

In the following, we briefly introduce our implementation. Please refer to our code for further details.

2D Network. We use a modified version of U-Net [33] with a ResNet34 [17] encoder and a decoder with transposed convolutions and skip connections. To lift the 2D features to 3D, we subsample the output feature map of size $(H, W, F_{2D})$ at the pixel locations where the 3D points project. Hence, the 2D network takes an image as input and outputs features of size $(N, F_{2D})$.
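The lifting step can be sketched as a projection followed by a lookup in the feature map. The pinhole intrinsics `fx`, `fy`, `cx`, `cy`, the nearest-pixel lookup via flooring, and both function names are assumptions for this illustration; the points are assumed to be given in the camera reference frame.

```python
import math


def project_to_pixels(points, fx, fy, cx, cy, height, width):
    """Project camera-frame 3D points to pixels, keeping those inside the image.

    Returns the (row, col) pixel of each kept point and the kept point indices.
    """
    pixels, kept = [], []
    for idx, (x, y, z) in enumerate(points):
        if z <= 0:  # behind the camera
            continue
        col = math.floor(fx * x / z + cx)
        row = math.floor(fy * y / z + cy)
        if 0 <= row < height and 0 <= col < width:
            pixels.append((row, col))
            kept.append(idx)
    return pixels, kept


def lift_2d_features(feature_map, pixels):
    """Sample an H x W x F feature map at N pixel locations, giving N x F."""
    return [feature_map[row][col] for row, col in pixels]


# Toy 2 x 3 feature map whose feature vector encodes its own (row, col).
toy_features = [[[r, c] for c in range(3)] for r in range(2)]
toy_points = [(0.0, 0.0, 1.0), (1.0, 1.0, 1.0), (5.0, 0.0, 1.0), (0.0, 0.0, -1.0)]
pixels, kept = project_to_pixels(toy_points, fx=1.0, fy=1.0, cx=1.0, cy=0.0,
                                 height=2, width=3)
lifted = lift_2d_features(toy_features, pixels)
```

After this step, the 2D and 3D streams both hold one feature vector per retained 3D point, so their predictions can be compared point by point.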

3D Network. We use the official SparseConvNet [15] implementation and a U-Net architecture with 6 times downsampling. The voxel size is set to 5 cm, which is small enough to only have one 3D point per voxel. Thus, the 3D network takes a point cloud as input and outputs features of size $(N, F_{3D})$.


We employ standard 2D/3D data augmentation and log-smoothed class weights to address class imbalance. In PyTorch, to compute the KL divergence for the cross-modal loss, we detach the target variable to only backpropagate in either the 2D or the 3D network. We train with a batch size of 8 and the Adam optimizer with $\beta_1 = 0.9$, $\beta_2 = 0.999$, and train 30k iterations for the scenario with the small VirtualKITTI dataset and 100k iterations for all other scenarios. At each iteration we compute and accumulate gradients on the source and target batch, jointly training the 2D and 3D stream. To fit the training into a single GPU with 11GB of memory, we resize the images and additionally crop them in VirtualKITTI and SemanticKITTI.

For the pseudo-label variants, xMUDAPL and xMoSSDAPL, we generate the pseudo-labels offline as in [24] with trained models xMUDA and xMoSSDA, respectively. Then, we retrain from scratch, additionally using the pseudo-labels, optimizing Eqs. 5 and 7, respectively. Importantly, we only use the last checkpoint to generate the pseudo-labels – as opposed to using the best weights which would provide a supervised signal.

4.3 xMUDA

                          nuSc: USA/Singap.        nuSc: Day/Night   Virt.KITTI/Sem.KITTI         A2D2/Sem.KITTI
Method                     2D     3D  2D+3D       2D     3D  2D+3D       2D     3D  2D+3D       2D     3D  2D+3D
Baseline (src only)      53.4   46.5   61.3     42.2   41.2   47.8     26.8   42.0   42.2     34.2   35.9   40.4
Deep logCORAL [29]       52.6   47.1   59.1     41.4   42.8   51.8    41.4*   36.8  47.0*    35.1*   41.0  42.2*
MinEnt [42]              53.4   47.0   59.7     44.9   43.5   51.3     39.2   43.3   47.1     37.8   39.6   42.6
PL [24]                  55.5   51.8   61.5     43.7   45.1   48.6     21.5   44.3   35.6     34.7   41.7   45.2
xMUDA                    59.3   52.0   62.7     46.2   44.2   50.0     42.1   46.7   48.2     38.3   46.0   44.0
xMUDAPL                  61.1   54.1   63.2     47.1   46.7   50.8     45.8   51.4   52.0     41.2   49.8   47.5
Oracle                   66.4   63.8   71.6     48.6   47.1   55.2     66.3   78.4   80.1     59.3   71.9   73.6
Domain gap (O-B)         12.9   17.3   10.3      6.5    5.9    7.4     39.5   36.4   37.9     25.1   36.0   33.2

* The 2D network is trained with batch size 6 instead of 8 to fit into GPU memory.

TABLE II: xMUDA experiments on 3D semantic segmentation. We report the mIoU result (with best and 2nd best) on the target set for each network stream (2D and 3D) as well as the ensembling result taking the mean of the 2D and 3D probabilities (‘2D+3D’). We provide the lower bound ‘Baseline (src only)’ which is trained on the source set $\mathcal{S}$, but not on the target set $\mathcal{T}$, as well as the upper bound ‘Oracle’ which is trained supervisedly on the target set $\mathcal{T}$ using labels. We further indicate the ‘Domain gap’ which is the difference between the Oracle and Baseline score. ‘Deep logCORAL’ [29], ‘MinEnt’ [42] and ‘PL’ [24] are uni-modal UDA baselines. The two variants ‘xMUDA’ and ‘xMUDAPL’ are our methods. We evaluate on four different cross-modal UDA scenarios (see Fig. 9). For the nuScenes dataset [5] (‘nuSc’), we generate the splits with different locations (USA/Singapore) and different time (Day/Night). The domain gap between the two real datasets A2D2 [14] and SemanticKITTI [4] (‘Sem.KITTI’) lies mainly in sensor resolution. Adaptation from VirtualKITTI [12] (‘Virt.KITTI’) to SemanticKITTI [4] explores the challenging scenario of synthetic-to-real adaptation.

We evaluate xMUDA on four unsupervised domain adaptation scenarios and compare against uni-modal UDA methods: Deep logCORAL [29], entropy minimization (MinEnt) [42] and pseudo-labeling (PL) [24]. For [24], the image-to-image translation part was excluded due to its instability, high training complexity and incompatibility with LiDAR data. Regarding the two other uni-modal techniques, we adapt the published implementations to our settings. For all methods, we searched for the best respective hyperparameters.

We report mean Intersection over Union (mIoU) of the target test set for 3D segmentation in Tab.  II. We evaluate on the test set using the checkpoint that achieved the best score on the validation set. In addition to the scores of the 2D and 3D model, we show the ensembling result (‘2D+3D’) which is obtained by taking the mean of the predicted 2D and 3D probabilities after softmax. The uni-modal UDA baselines [24, 29, 42] are applied separately on each modality.
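The ensembling and the metric can be sketched in plain Python. This is an illustrative sketch with toy values: the function names are hypothetical, and skipping classes that occur in neither the predictions nor the labels is an assumption about how empty classes are handled. In a full evaluation, intersections and unions would be accumulated over the whole test set before averaging.

```python
def ensemble_predict(probs_2d, probs_3d):
    """Average the 2D and 3D softmax outputs and take the arg-max per point."""
    preds = []
    for p2, p3 in zip(probs_2d, probs_3d):
        mean = [(a + b) / 2 for a, b in zip(p2, p3)]
        preds.append(mean.index(max(mean)))
    return preds


def mean_iou(preds, labels, num_classes):
    """Mean over classes of intersection / union of predicted and true points."""
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, y in zip(preds, labels) if p == c and y == c)
        union = sum(1 for p, y in zip(preds, labels) if p == c or y == c)
        if union > 0:  # skip classes absent from both predictions and labels
            ious.append(inter / union)
    return sum(ious) / len(ious)


probs_2d = [[0.9, 0.1], [0.6, 0.4], [0.2, 0.8]]
probs_3d = [[0.8, 0.2], [0.1, 0.9], [0.3, 0.7]]
ensemble_preds = ensemble_predict(probs_2d, probs_3d)
score = mean_iou(ensemble_preds, labels=[0, 0, 1], num_classes=2)
```

For the second point the two streams disagree, and the more confident 3D prediction decides the ensemble output.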

Furthermore, we provide the results of a lower bound, ‘Baseline (src only)’, which is only trained on the source-domain dataset and an upper bound, ‘Oracle’, trained only on target with labels (except for the Day/Night oracle, where we use batches of 50%/50% source/target to prevent overfitting due to the small target set size). We also indicate the ‘Domain gap (O-B)’, computed as the difference between Oracle and Baseline. It shows that the intra-dataset domain gaps (nuScenes: USA/Singapore, Day/Night) are much smaller than the inter-dataset domain gaps (A2D2/SemanticKITTI, VirtualKITTI/SemanticKITTI). It suggests that a change in sensor setup (A2D2/SemanticKITTI) is actually a very hard domain adaptation problem, similar to the synthetic-to-real case (VirtualKITTI/SemanticKITTI). Note that the scores are not comparable between A2D2/SemanticKITTI and VirtualKITTI/SemanticKITTI, because they use a different number of classes, 10 and 6 respectively.

xMUDA, which uses the cross-modal loss but not PL, brings a significant adaptation effect on all four UDA scenarios compared to ‘Baseline’ and almost always outperforms the uni-modal UDA baselines. xMUDAPL achieves the best score everywhere, with the sole exception of Day/Night 2D+3D. Further, cross-modal learning and self-training with pseudo-labels (PL) are complementary, as their combination in xMUDAPL consistently yields a higher score than each separate technique. The 2D/3D oracle scores indicate that camera (2D) is the strongest modality on nuScenes, but LiDAR (3D) is better on SemanticKITTI, probably thanks to the high LiDAR resolution. However, xMUDA consistently improves both modalities (2D and 3D), i.e., even the strong modality can learn from the weaker one. The dual-head architecture might be key here: each modality can improve its main segmentation head independently from the other modality, because the consistency is achieved indirectly through the mimicking heads.

We also observe a regularization effect thanks to xMUDA. For example, on VirtualKITTI/SemanticKITTI, the methods ‘Baseline’ and ‘PL’ perform very poorly on the 2D modality due to overfitting on the very small VirtualKITTI dataset, while 3D is more stable. In contrast, xMUDA performs better, as 3D can regularize 2D. Furthermore, this regularization makes pseudo-labels beneficial: xMUDAPL achieves an even better score.

Fig. 10: Qualitative results for xMUDA. We show the ensembling result (2D+3D) on the target test set for UDA Baseline (PL) and xMUDAPL.
– nuScenes: USA/Singapore: Delivery motorcycles with a storage box on the back are common in Singapore, but not in USA. The 3D shape might resemble a vehicle. However, 2D appearance information is leveraged in xMUDAPL to improve the recognition.
– nuScenes: Day/Night: The visual appearance of a car at night with headlights turned on is very different from its appearance during the day. The uni-modal UDA baseline is not able to learn this new appearance. However, if information is exchanged between the camera and the robust-at-night LiDAR in xMUDAPL, it is possible to detect the car correctly at night.
– A2D2/SemanticKITTI: xMUDAPL helps to stabilize and increase segmentation performance when there are sensor changes (3x16-layer LiDAR with different angles to 64-layer LiDAR).
– VirtualKITTI/SemanticKITTI: The UDA baseline (PL) has difficulty segmenting the building and road while xMUDAPL succeeds.

Qualitative results are presented in Fig.  10, showing the versatility of xMUDA across all proposed UDA scenarios. We provide additional qualitative results in App. B and a video at

4.4 xMoSSDA

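| Method | Train set | A2D2/Sem.KITTI 2D | A2D2/Sem.KITTI 3D | A2D2/Sem.KITTI 2D+3D | Waymo OD 2D | Waymo OD 3D | Waymo OD 2D+3D |
|---|---|---|---|---|---|---|---|
| Baseline (src only) | src | 37.9 | 32.8 | 43.3 | 61.4 | 50.8 | 64.4 |
| Baseline (lab. trg only) | lab. trg | 51.3 | 57.7 | 59.2 | 56.5 | 57.1 | 60.3 |
| Baseline (src and lab. trg) | src + lab. trg | 54.8 | 62.4 | 66.2 | 64.5 | 56.3 | 69.3 |
| xMUDA | src + unlab. trg | 38.6 | 44.5 | 44.4 | 61.8 | 54.0 | 66.7 |
| xMUDAPL | src + unlab. trg | 41.4 | 49.5 | 48.6 | 68.3 | 55.2 | 71.9 |
| Deep logCORAL [29] | src + lab. trg + unlab. trg | 55.1* | 62.2 | 64.7* | 61.4 | 56.5 | 66.1 |
| MinEnt [42] | src + lab. trg + unlab. trg | 56.3 | 62.5 | 65.0 | 64.3 | 56.6 | 69.1 |
| PL [24] | src + lab. trg + unlab. trg | 57.2 | 66.9 | 68.5 | 67.4 | 56.7 | 70.2 |
| xMoSSDA | src + lab. trg + unlab. trg | 56.5 | 63.4 | 65.9 | 65.2 | 57.4 | 69.4 |
| xMoSSDAPL | src + lab. trg + unlab. trg | 59.1 | 68.2 | 70.7 | 70.1 | 58.5 | 73.1 |
| Unsupervised advantage | | 4.3 | 5.8 | 4.5 | 5.6 | 2.2 | 3.8 |
| (relative) | | (+7.8%) | (+9.3%) | (+6.8%) | (+8.7%) | (+3.9%) | (+5.5%) |

* The 2D network is trained with batch size 6 instead of 8 to fit into GPU memory.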

TABLE III: xMoSSDA experiments on 3D semantic segmentation. We report the mIoU result (with best and 2nd best) on the target set for each network stream (2D and 3D) as well as the ensembling result taking the mean of the 2D and 3D probabilities (2D+3D). In the semi-supervised domain adaptation (SSDA) scenario, we are provided with a source set as in UDA, while, unlike in UDA, the target-domain dataset has a small labeled part and a large unlabeled part. We provide three baselines, trained either on source only, on labeled target only, or on both in a ratio of 50%/50% in each training batch. For comparison, we report the ‘xMUDA’ and ‘xMUDAPL’ results that do not make use of the labeled part of the target-domain dataset. The three uni-modal SSDA baselines ‘Deep logCORAL’ [29], ‘MinEnt’ [42] and ‘PL’ [24] as well as our cross-modal methods ‘xMoSSDA’ and ‘xMoSSDAPL’ are trained with supervision on the source and labeled target sets in a ratio of 50%/50% in each training batch, and without supervision on the unlabeled target set. Note that it is impossible to train an Oracle as in the UDA experiments in Tab. II, because there are no labels available on the unlabeled target set. Instead, we report the ‘Unsupervised advantage’, which is the difference between xMoSSDAPL and ‘Baseline (src and lab. trg)’, as well as the relative improvement. We evaluate on two cross-modal SSDA scenarios: A2D2 [14]/SemanticKITTI [4] and Waymo OD [37].
Fig. 11: Qualitative results for xMoSSDA. We show the ensembling result (2D+3D) on the target test set for the supervised baseline (trained on source and labeled target), xMUDAPL (trained on source and unlabeled target) and xMoSSDAPL (trained on source, labeled target and unlabeled target).
– A2D2/SemanticKITTI: The supervised baseline does not distinguish the bike in the center from the ‘Nature’ background. xMUDAPL separates it from the background but still assigns the wrong class, while xMoSSDAPL classifies it correctly.
– Waymo OD: SF,PHX,MTV/KRK: Segmentation of the pedestrian with xMUDAPL is better than with the supervised baseline while it is best with xMoSSDAPL.

We evaluate semi-supervised cross-modal learning on two domain adaptation scenarios (A2D2/SemanticKITTI, Waymo OD) and compare xMoSSDA against eight baselines.

Three baselines are purely supervised, trained either on source only, on labeled target only, or on source and labeled target. The latter is trained with 50%/50% examples from the source and labeled target sets, i.e., a training batch of size 8 contains 4 random examples from each.

Additionally, we report two UDA baselines, xMUDA and xMUDAPL, which use source and unlabeled target.

Last, we report three SSDA baselines adapted from uni-modal UDA baselines [29, 42, 24] as follows: we train similarly to the supervised baseline on source and labeled target with 50%/50% batches, but add the respective domain adaptation loss on the unlabeled target set. Our semi-supervised proposals, xMoSSDA and xMoSSDAPL, are also trained in this manner.
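The 50%/50% batch composition used for the supervised part of training can be illustrated with a small sketch. It is only an illustration under stated assumptions: the function name, the use of integer sample indices and the `("src", i)` / `("trg", i)` tagging are hypothetical and not taken from the paper's code.

```python
import random


def mixed_batches(source_ids, labeled_target_ids, batch_size=8,
                  num_batches=3, seed=0):
    """Build batches with half source and half labeled-target examples.

    source_ids and labeled_target_ids are lists of sample indices
    (hypothetical data layout). Sampling is random within each batch.
    """
    rng = random.Random(seed)
    half = batch_size // 2
    batches = []
    for _ in range(num_batches):
        src = rng.sample(source_ids, half)          # 4 of 8 from source
        trg = rng.sample(labeled_target_ids, half)  # 4 of 8 from labeled target
        batches.append([("src", i) for i in src] + [("trg", i) for i in trg])
    return batches
```

Because the labeled target set is much smaller than the source set, fixing the ratio per batch means that labeled target examples are revisited far more often than source examples.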

Note that it is impossible to train an Oracle as in Tab. II, as there are no labels available on the unlabeled target set, and consequently we cannot compute the domain gap. Instead, we answer the question: “How much can we improve over the supervised baseline by additionally training on the unlabeled target-domain data?”. We call this the ‘unsupervised advantage’ in the following and compute it as the difference between xMoSSDAPL trained on all data (source, labeled target and unlabeled target) and the supervised baseline trained on source and labeled target.
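As a worked example of this definition, the sketch below recomputes the absolute and relative ‘unsupervised advantage’ from the xMoSSDAPL and ‘Baseline (src and lab. trg)’ scores listed in Tab. III. The function name is hypothetical, and rounding to one decimal is assumed to match the precision of the table.

```python
def unsupervised_advantage(score_full, score_supervised):
    """Absolute and relative (in percent) gain over the supervised baseline."""
    absolute = score_full - score_supervised
    relative = 100.0 * absolute / score_supervised
    return round(absolute, 1), round(relative, 1)


# Scores from Tab. III: (xMoSSDAPL, Baseline (src and lab. trg)).
first_scenario_2d = unsupervised_advantage(59.1, 54.8)       # (4.3, 7.8)
second_scenario_2d3d = unsupervised_advantage(73.1, 69.3)    # (3.8, 5.5)
```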

We report the mIoU for 3D segmentation in Tab. III. We observe, similar to the xMUDA experiments, that the gap between ‘Baseline (src only)’ and ‘Baseline (src and lab. trg)’ is much larger in the inter-dataset adaptation scenario A2D2/SemanticKITTI than in intra-dataset adaptation on Waymo OD.

As expected, xMUDA and xMUDAPL improve over ‘Baseline (src only)’, but are (with two exceptions) worse than the baseline that uses the small labeled target-domain dataset.

Similar to Tab. II, xMoSSDAPL outperforms all baselines, including the uni-modal baselines [29, 42, 24].

Qualitative results are shown in Fig. 11.

4.5 Extension to Fusion

(a) Vanilla Fusion
(b) xMUDA Fusion
Fig. 14: Architectures for fusion. (a) In Vanilla Fusion, the 2D and 3D features are concatenated, fed into a linear layer with ReLU to mix them and followed by another linear layer and softmax to obtain a fused prediction. (b) In xMUDA Fusion, we add two uni-modal outputs that are used to mimic the fusion output.

So far, we used an architecture with independent 2D/3D streams. However, can xMUDA also be applied in a fusion setup where both modalities make a joint prediction? A common fusion architecture is late fusion, where the features from different sources are concatenated (see Fig. 14(a)). However, when merging the main 2D/3D branches into a single fused head, we can no longer apply cross-modal learning between the two streams as before. To address this problem, we propose ‘xMUDA Fusion’, where we add a segmentation head to both the 2D and 3D network streams prior to the fusion layer, with the purpose of mimicking the central fusion head (see Fig. 14(b)). Note that this idea could also be applied on top of other fusion architectures.
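The forward pass of the fusion architecture described here can be sketched in NumPy as follows. This is a rough illustration under assumptions, not the authors' implementation: the feature dimensions, the hidden size, the random initialization and the choice of a single linear layer for each uni-modal head are all hypothetical.

```python
import numpy as np


def softmax(x):
    shifted = x - x.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)


def linear(x, weight, bias):
    return x @ weight + bias


def init_params(dim_2d, dim_3d, hidden, num_classes, seed=0):
    """Random parameters for the four linear layers (illustrative only)."""
    rng = np.random.default_rng(seed)

    def layer(n_in, n_out):
        return rng.normal(0.0, 0.1, (n_in, n_out)), np.zeros(n_out)

    return {
        "mix": layer(dim_2d + dim_3d, hidden),   # linear + ReLU on concatenation
        "fuse": layer(hidden, num_classes),      # fused segmentation head
        "head_2d": layer(dim_2d, num_classes),   # uni-modal 2D head
        "head_3d": layer(dim_3d, num_classes),   # uni-modal 3D head
    }


def xmuda_fusion_forward(feat_2d, feat_3d, params):
    """Return the fused prediction and the two uni-modal predictions.

    feat_2d: (num_points, dim_2d), feat_3d: (num_points, dim_3d).
    """
    fused = np.concatenate([feat_2d, feat_3d], axis=1)
    hidden = np.maximum(linear(fused, *params["mix"]), 0.0)  # ReLU
    p_fuse = softmax(linear(hidden, *params["fuse"]))
    # The uni-modal heads branch off before the fusion layer; during
    # training they would be optimized to mimic p_fuse.
    p_2d = softmax(linear(feat_2d, *params["head_2d"]))
    p_3d = softmax(linear(feat_3d, *params["head_3d"]))
    return p_fuse, p_2d, p_3d
```

Dropping the two uni-modal heads from this sketch yields the Vanilla Fusion variant, which has only the fused output.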

| Method | Archi. | nuSc: USA/Singap. | A2D2/Sem.KITTI |
|---|---|---|---|
| Baseline (src only) | Vanilla | 59.9 | 34.2 |
| Deep logCORAL [29] | Vanilla | 58.2 | 36.2 |
| MinEnt [42] | Vanilla | 60.8 | 39.8 |
| PL [24] | Vanilla | 65.2 | 38.6 |
| xMUDA Fusion | xMUDA | 61.9 | 42.6 |
| xMUDAPL Fusion | xMUDA | 66.6 | 42.2 |
| Oracle | xMUDA | 72.2 | 65.7 |

TABLE IV: Comparison of the fusion methods. Performance in mIoU for the two UDA scenarios: nuScenes [5]: USA/Singapore and A2D2 [14]/SemanticKITTI [4]. We adapt the supervised baseline ‘Baseline (src only)’ and the UDA baselines (‘Deep logCORAL’, ‘MinEnt’, ‘PL’) to the vanilla fusion architecture depicted in Fig. 14(a). We propose ‘xMUDA Fusion’, which uses the architecture of Fig. 14(b).

In Tab. IV we show results for different fusion approaches, where we specify which architecture was used (Vanilla late fusion from Fig. 14(a) or xMUDA Fusion from Fig. 14(b)). We observe that the xMUDA fusion architecture leads to better results than the UDA baselines with the Vanilla architecture. This demonstrates how cross-modal learning can be applied effectively in fusion setups.

5 Ablation Studies

5.1 Single vs. Dual Segmentation Head

Fig. 15: Single vs. Dual Head Architecture. mIoU of both architectures on nuScenes: USA/Singapore for different values of the target loss weight while fixing .

In the following we justify our dual head over the simpler single-head architecture. Both are shown in Fig. 5.

In the single-head architecture (Fig. 5(a)), the cross-modal loss is directly applied between the 2D and 3D main heads. This enforces consistency by aligning the two outputs in addition to the supervised segmentation loss. Thus, the heads must satisfy two objectives, segmentation and consistency, at the same time. To showcase the disadvantage of this architecture, we train xMUDA (as in Eq. 4) and vary the weight of the cross-modal loss on target, which is the main driver for UDA. The results in Fig. 15 for the single-head architecture show that increasing this weight from 0.001 to 0.01 slightly improves the mIoU, but that increasing it further to 0.1 and 1.0 has a hugely negative effect. In the extreme case, 2D and 3D always predict the same class, thus satisfying only the consistency objective, but not the segmentation objective.

The dual-head architecture (Fig. 5(b)) addresses this problem by introducing a secondary mimicking head whose purpose is to mimic the main head of the other modality during training and which can be discarded afterwards. This effectively disentangles the mimicking objective, which is applied to the mimicking head, from the segmentation objective, which is applied to the main head. Fig. 15 shows that increasing the weight to 0.1 for the dual-head architecture produces the best results overall, better than any weight for the single-head architecture, and that the results remain robust even at the largest tested weight.
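To make the dual-head idea concrete, the following NumPy sketch computes a KL-based cross-modal loss in which each mimicking head is compared against the main head of the other modality. It is a simplified illustration: the function names are hypothetical, the direction of the KL divergence (main prediction as target distribution) is an assumption, and a real training setup would additionally stop gradients from flowing into the target distribution.

```python
import numpy as np


def kl_divergence(p_target, q_mimic, eps=1e-8):
    """Mean over points of D_KL(p_target || q_mimic).

    Both inputs are per-point class probabilities of shape
    (num_points, num_classes).
    """
    p = np.clip(p_target, eps, 1.0)
    q = np.clip(q_mimic, eps, 1.0)
    return float(np.mean(np.sum(p * (np.log(p) - np.log(q)), axis=1)))


def cross_modal_losses(main_2d, mimic_2d, main_3d, mimic_3d):
    """Cross-modal losses for the dual-head architecture.

    The 2D mimicking head imitates the 3D main head and vice versa, so
    the main heads are only driven by the segmentation objective.
    """
    loss_2d = kl_divergence(main_3d, mimic_2d)
    loss_3d = kl_divergence(main_2d, mimic_3d)
    return loss_2d, loss_3d
```

In the single-head variant, the same divergence would instead be applied directly between the two main outputs, which is what couples the segmentation and consistency objectives.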

5.2 Cross-Modal Learning on Source

In Eq. 4, the cross-modal loss is applied on the source and target domains, although we already have the supervised segmentation loss on the source domain. We observe a gain of 4.8 mIoU on 2D and 4.4 on 3D when adding the cross-modal loss on the source domain as opposed to applying it on the target domain only. This shows that it is important to train the mimicking head on source-domain data, stabilizing the predictions, which can be exploited during adaptation on target-domain inputs.

5.3 Cross-modal Supervised Learning

| Loss | nuScenes: Singapore 2D | nuScenes: Singapore 3D | nuScenes: Singapore 2D+3D | Waymo OD: KRK 2D | Waymo OD: KRK 3D | Waymo OD: KRK 2D+3D |
|---|---|---|---|---|---|---|
| Segmentation loss only | 65.79 | 63.21 | 71.14 | 51.3 | 57.7 | 59.2 |
| Segmentation + cross-modal loss | 66.37 | 63.77 | 71.61 | 57.4 | 57.6 | 61.1 |

TABLE V: Benefit of the proposed cross-modal loss in supervised learning. Performance in mIoU of supervised learning with and without cross-modal loss on nuScenes [5] (Singapore) and on Waymo OD [37] (KRK), using only the labeled target-domain dataset. In the nuScenes: Singapore experiment, the model trained with the cross-modal loss amounts to the oracle on this dataset in Tab. II.

To evaluate the possible benefits of cross-modal learning for purely supervised settings, we conducted experiments with and without the cross-modal loss on two different target-domain datasets: nuScenes [5] and Waymo OD [37]. The results in Tab. V show a performance gain when adding the cross-modal loss. We hypothesize that the extra cross-modal objective can be beneficial, similar to multi-task learning. On the Waymo OD dataset, we observe a strong improvement on 2D. The validation curves during training show that cross-modal learning reduces overfitting in 2D. We hypothesize that 3D, which suffers less from overfitting, can have a regularizing effect on 2D.

6 Conclusion

In this work, we proposed cross-modal learning for domain adaptation in unsupervised (xMUDA) and semi-supervised (xMoSSDA) settings. To this end, we designed a two-stream, dual-head architecture and applied a cross-modal loss to the image and point-cloud modalities in the task of 3D semantic segmentation. The cross-modal loss consists of KL divergence applied between the predictions of the two modalities and thereby enforces consistency.

Experiments on four unsupervised and two semi-supervised domain adaptation scenarios show that cross-modal learning outperforms uni-modal adaptation baselines and is complementary to learning with pseudo-labels.

We think that cross-modal learning could generalize to many tasks that involve multi-modal input data and is constrained neither to domain adaptation tasks nor to image and point-cloud modalities.

In the following we provide details about the dataset splits and additional qualitative results.

Appendix A Dataset Splits

A.1 nuScenes (UDA)

The nuScenes dataset [5] consists of 1000 driving scenes, each of 20 seconds, which corresponds to 40k annotated keyframes taken at 2Hz. The scenes are split into train (28,130 keyframes), validation (6,019 keyframes) and hidden test set. The point-wise 3D semantic labels are obtained from 3D boxes like in [46]. We propose the following splits destined for domain adaptation with the respective source/target domains: Day/Night and Boston/Singapore. Therefore, we use the official validation split as test set and divide the training set into train/val for the target set. As the number of object instances in the target split can be very small (e.g. for night), we merge the objects into 5 categories: vehicle (car, truck, bus, trailer, construction vehicle), pedestrian, bike (motorcycle, bicycle), traffic boundary (traffic cone, barrier) and background.
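The class merging described above amounts to a simple lookup. The sketch below is illustrative only: the dictionary keys follow the class names listed in the text, written with underscores, and both the integer ids and the identifier `traffic_boundary` are assumptions rather than values from the authors' code.

```python
# Merged categories, with 'background' for points outside all boxes.
MERGED_CLASSES = ["vehicle", "pedestrian", "bike", "traffic_boundary", "background"]

# Mapping from object class to merged category, as listed in the text.
NUSCENES_TO_MERGED = {
    "car": "vehicle",
    "truck": "vehicle",
    "bus": "vehicle",
    "trailer": "vehicle",
    "construction_vehicle": "vehicle",
    "pedestrian": "pedestrian",
    "motorcycle": "bike",
    "bicycle": "bike",
    "traffic_cone": "traffic_boundary",
    "barrier": "traffic_boundary",
}


def merged_class_id(box_class):
    """Return the merged category id; unknown classes map to background."""
    merged = NUSCENES_TO_MERGED.get(box_class, "background")
    return MERGED_CLASSES.index(merged)
```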

A.2 VirtualKITTI/SemanticKITTI (UDA)

VirtualKITTI (v.1.3.1) [12] consists of 5 driving scenes which were created with the Unity game engine by real-to-virtual cloning of the scenes 1, 2, 6, 18 and 20 of the real KITTI dataset [13], i.e. bounding box annotations of the real dataset were used to place cars in the virtual world. Unlike real KITTI, VirtualKITTI does not simulate LiDAR, but rather provides a dense depth map, alongside semantic, instance and flow ground truth. Each of the 5 scenes contains between 233 and 837 frames, i.e. 2126 frames in total. Each frame is rendered with 6 different weather/lighting variants (clone, morning, sunset, overcast, fog, rain), all of which we use. Note that we do not use the renderings with different horizontal rotations. We use the whole VirtualKITTI dataset as source training set.

The SemanticKITTI dataset [4] provides 3D point cloud labels for the Odometry dataset of KITTI [13], which features a large-angle front camera and a 64-layer LiDAR. The annotation of the 28 classes has been carried out directly in 3D.

We use scenes as train set, as validation and as test set.

We select 6 shared classes between the 2 datasets by merging or ignoring them (see Tab. VI). The 6 final classes are vegetation_terrain, building, road, object, truck, car.

| VirtualKITTI class | mapped class | SemanticKITTI class | mapped class |
|---|---|---|---|
| Terrain | vegetation_terrain | unlabeled | ignore |
| Tree | vegetation_terrain | outlier | ignore |
| Vegetation | vegetation_terrain | car | car |
| Building | building | bicycle | ignore |
| Road | road | bus | ignore |
| TrafficSign | object | motorcycle | ignore |
| TrafficLight | object | on-rails | ignore |
| Pole | object | truck | truck |
| Misc | object | other-vehicle | ignore |
| Truck | truck | person | ignore |
| Car | car | bicyclist | ignore |
| Van | ignore | motorcyclist | ignore |
| Don’t care | ignore | road | road |
| | | parking | ignore |
| | | sidewalk | ignore |
| | | other-ground | ignore |
| | | building | building |
| | | fence | object |
| | | other-structure | ignore |
| | | lane-marking | road |
| | | vegetation | vegetation_terrain |
| | | trunk | vegetation_terrain |
| | | terrain | vegetation_terrain |
| | | pole | object |
| | | traffic-sign | object |
| | | other-object | object |
| | | moving-car | car |
| | | moving-bicyclist | ignore |
| | | moving-person | ignore |
| | | moving-motorcyclist | ignore |
| | | moving-on-rails | ignore |
| | | moving-bus | ignore |
| | | moving-truck | truck |
| | | moving-other-vehicle | ignore |
TABLE VI: Class mapping for VirtualKITTI/SemanticKITTI UDA scenario.

A.3 A2D2/SemanticKITTI (UDA+SSDA)

The A2D2 dataset [14] features 20 drives, which corresponds to 28,637 frames. The point cloud comes from three 16-layer front LiDARs (left, center, right), where the left and right front LiDARs are inclined. The semantic labeling was carried out in the 2D image for 38 classes, and we compute the 3D labels by projecting the point cloud into the labeled image. We keep scene 20180807_145028 as test set and use the rest for training.
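Transferring 2D labels to 3D points by projection can be sketched as below. This is a minimal pinhole-camera illustration under assumptions: the points are taken to be already transformed into the camera frame (z pointing forward), the function name and the `ignore_index` value are hypothetical, and lens distortion and occlusion handling are left out.

```python
import numpy as np


def labels_from_image(points_cam, intrinsics, label_image, ignore_index=-100):
    """Assign each 3D point the label of the pixel it projects to.

    points_cam:  (N, 3) points in the camera frame, z pointing forward.
    intrinsics:  (3, 3) pinhole camera matrix.
    label_image: (H, W) integer semantic labels.
    Points behind the camera or outside the image get ignore_index.
    """
    height, width = label_image.shape
    labels = np.full(len(points_cam), ignore_index, dtype=np.int64)

    in_front = points_cam[:, 2] > 0
    # Use a dummy depth of 1 for points behind the camera to avoid
    # dividing by zero; these points are masked out below anyway.
    depth = np.where(in_front, points_cam[:, 2], 1.0)

    projected = points_cam @ intrinsics.T
    u = np.floor(projected[:, 0] / depth).astype(np.int64)  # column
    v = np.floor(projected[:, 1] / depth).astype(np.int64)  # row

    visible = in_front & (u >= 0) & (u < width) & (v >= 0) & (v < height)
    labels[visible] = label_image[v[visible], u[visible]]
    return labels
```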

Please refer to Sec. A.2 for details on SemanticKITTI. For UDA, we use the same split as in VirtualKITTI/SemanticKITTI, i.e. scenes as train set, as validation and as test set. For SSDA, we use the scenes as labeled train set , as unlabeled train set , as validation and as test set.

We select 10 shared classes between the 2 datasets by merging or ignoring them (see Tab. VII). The 10 final classes are car, truck, bike, person, road, parking, sidewalk, building, nature, other-objects.

| A2D2 class | mapped class | SemanticKITTI class | mapped class |
|---|---|---|---|
| Car 1 | car | unlabeled | ignore |
| Car 2 | car | outlier | ignore |
| Car 3 | car | car | car |
| Car 4 | car | bicycle | bike |
| Bicycle 1 | bike | bus | ignore |
| Bicycle 2 | bike | motorcycle | bike |
| Bicycle 3 | bike | on-rails | ignore |
| Bicycle 4 | bike | truck | truck |
| Pedestrian 1 | person | other-vehicle | ignore |
| Pedestrian 2 | person | person | person |
| Pedestrian 3 | person | bicyclist | bike |
| Truck 1 | truck | motorcyclist | bike |
| Truck 2 | truck | road | road |
| Truck 3 | truck | parking | parking |
| Small vehicles 1 | bike | sidewalk | sidewalk |
| Small vehicles 2 | bike | other-ground | ignore |
| Small vehicles 3 | bike | building | building |
| Traffic signal 1 | other-objects | fence | other-objects |
| Traffic signal 2 | other-objects | other-structure | ignore |
| Traffic signal 3 | other-objects | lane-marking | road |
| Traffic sign 1 | other-objects | vegetation | nature |
| Traffic sign 2 | other-objects | trunk | nature |
| Traffic sign 3 | other-objects | terrain | nature |
| Utility vehicle 1 | ignore | pole | other-objects |
| Utility vehicle 2 | ignore | traffic-sign | other-objects |
| Sidebars | other-objects | other-object | other-objects |
| Speed bumper | other-objects | moving-car | car |
| Curbstone | sidewalk | moving-bicyclist | bike |
| Solid line | road | moving-person | person |
| Irrelevant signs | other-objects | moving-motorcyclist | bike |
| Road blocks | other-objects | moving-on-rails | ignore |
| Tractor | ignore | moving-bus | ignore |
| Non-drivable street | ignore | moving-truck | truck |
| Zebra crossing | road | moving-other-vehicle | ignore |
| Obstacles / trash | other-objects | | |
| Poles | other-objects | | |
| RD restricted area | road | | |
| Animals | other-objects | | |
| Grid structure | other-objects | | |
| Signal corpus | other-objects | | |
| Drivable cobbleston | road | | |
| Electronic traffic | other-objects | | |
| Slow drive area | road | | |
| Nature object | nature | | |
| Parking area | parking | | |
| Sidewalk | sidewalk | | |
| Ego car | car | | |
| Painted driv. instr. | road | | |
| Traffic guide obj. | other-objects | | |
| Dashed line | road | | |
| RD normal street | road | | |
| Sky | ignore | | |
| Buildings | building | | |
| Blurred area | ignore | | |
| Rain dirt | ignore | | |
TABLE VII: Class mapping for A2D2/SemanticKITTI UDA and SSDA scenario.

A.4 Waymo OD (SSDA)

The Waymo Open Dataset (v.1.2.0) provides 1150 scenes of 20s each. For simplicity and consistency with other UDA scenarios, we only use the top, but not the 4 side LiDARs, and only the front, but not the 4 side cameras. Similar to nuScenes (Sec. A.1), we obtain segmentation labels from 3D bounding boxes.
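Deriving point-wise segmentation labels from 3D bounding boxes can be sketched as follows. The snippet is a simplified illustration, not the authors' pipeline: the box representation (center, size, yaw around the vertical axis), the dictionary keys and the background label id are assumptions.

```python
import numpy as np


def points_in_box(points, center, size, yaw):
    """Mask of points inside an upright 3D box rotated by yaw around z.

    points: (N, 3); center: (3,); size: (length, width, height).
    """
    local = points - np.asarray(center)
    # Rotate by -yaw to express the points in the box coordinate frame.
    c, s = np.cos(-yaw), np.sin(-yaw)
    x = c * local[:, 0] - s * local[:, 1]
    y = s * local[:, 0] + c * local[:, 1]
    half = np.asarray(size) / 2.0
    return (
        (np.abs(x) <= half[0])
        & (np.abs(y) <= half[1])
        & (np.abs(local[:, 2]) <= half[2])
    )


def labels_from_boxes(points, boxes, background=0):
    """Label each point with the class of the box that contains it.

    boxes is a list of dicts with hypothetical keys 'center', 'size',
    'yaw' and 'label'. Points outside all boxes keep the background id.
    """
    labels = np.full(len(points), background, dtype=np.int64)
    for box in boxes:
        mask = points_in_box(points, box["center"], box["size"], box["yaw"])
        labels[mask] = box["label"]
    return labels
```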

There is a main dataset, which we use as source dataset, and a partially labeled domain adaptation dataset, of which we use the labeled part as labeled target set and the unlabeled part as unlabeled target set.

We ignore the cyclist class, because there are no cyclist labels available in the target data, i.e. we only keep the classes vehicle, pedestrian, sign, unknown.

Appendix B Additional Qualitative Results (UDA)

We provide additional qualitative results for UDA for the scenarios nuScenes: Day/Night and A2D2/SemanticKITTI in Fig. 16, where we show the output of the 2D and 3D stream individually to illustrate their respective strengths and weaknesses, e.g. that 3D works much better at night.

Fig. 16: Qualitative results on two UDA scenarios. For UDA Baseline (PL) and xMUDAPL, we separately show the predictions of the 2D and 3D network stream.
A2D2/SemanticKITTI: For the uni-modal UDA baseline (PL), the 2D prediction lacks consistency on the road and 3D is unable to recognize the bike and the building on the left correctly. In xMUDAPL, both modalities can stabilize each other and obtain better performance on the bike, the road, the sidewalk and the building.
Day/Night: For the UDA Baseline, 2D can only partly recognize one car out of three while the 3D prediction is almost correct, with one false positive car on the left. With xMUDAPL, the 2D and 3D predictions are both correct.


  • [1] I. Achituve, H. Maron, and G. Chechik (2021) Self-supervised learning for domain adaptation on point clouds. WACV. Cited by: §2.1.
  • [2] H. Alwassel, D. Mahajan, B. Korbar, L. Torresani, B. Ghanem, and D. Tran (2020) Self-supervised learning by cross-modal audio-video clustering. NeurIPS. Cited by: §2.3.
  • [3] S. Ao, X. Li, and C. X. Ling (2017) Fast generalized distillation for semi-supervised domain adaptation. In AAAI, Cited by: §2.2.
  • [4] J. Behley, M. Garbade, A. Milioto, J. Quenzel, S. Behnke, C. Stachniss, and J. Gall (2019) SemanticKITTI: A dataset for semantic scene understanding of LiDAR sequences. In ICCV, Cited by: §A.2, §1, Fig. 9, §4.1, TABLE II, TABLE III, TABLE IV.
  • [5] H. Caesar, V. Bankiti, A. H. Lang, S. Vora, V. E. Liong, Q. Xu, A. Krishnan, Y. Pan, G. Baldan, and O. Beijbom (2020) nuScenes: a multimodal dataset for autonomous driving. CVPR. Cited by: §A.1, Fig. 9, §4.1, TABLE II, TABLE IV, §5.3, TABLE V.
  • [6] L. Castrejon, Y. Aytar, C. Vondrick, H. Pirsiavash, and A. Torralba (2016) Learning aligned cross-modal representations from weakly aligned data. In CVPR, Cited by: §2.3.
  • [7] C. Choy, J. Gwak, and S. Savarese (2019) 4D spatio temporal convnet: minkowski convolutional neural networks. In CVPR, Cited by: §2.4.
  • [8] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele (2016) The cityscapes dataset for semantic urban scene understanding. In CVPR, Cited by: §1.
  • [9] C. Cortes and V. Vapnik (1995) Support-vector networks. Machine learning. Cited by: §2.2.
  • [10] A. Dai, A. X. Chang, M. Savva, M. Halber, T. Funkhouser, and M. Nießner (2017) Scannet: richly-annotated 3d reconstructions of indoor scenes. In CVPR, Cited by: §2.4.
  • [11] J. Donahue, J. Hoffman, E. Rodner, K. Saenko, and T. Darrell (2013) Semi-supervised domain adaptation with instance constraints. In CVPR, Cited by: §2.2.
  • [12] A. Gaidon, Q. Wang, Y. Cabon, and E. Vig (2016) Virtual worlds as proxy for multi-object tracking analysis. In CVPR, Cited by: §A.2, Fig. 9, §4.1, TABLE II.
  • [13] A. Geiger, P. Lenz, and R. Urtasun (2012) Are we ready for autonomous driving? the KITTI vision benchmark suite. In CVPR, Cited by: §A.2, §A.2.
  • [14] J. Geyer, Y. Kassahun, M. Mahmudi, X. Ricou, R. Durgesh, A. S. Chung, L. Hauswald, V. H. Pham, M. Mühlegg, S. Dorn, T. Fernandez, M. Jänicke, S. Mirashi, C. Savani, M. Sturm, O. Vorobiov, and P. Schuberth (2019) A2D2: AEV autonomous driving dataset. Audi Electronics Venture GmbH. Cited by: §A.3, Fig. 9, §4.1, TABLE II, TABLE III, TABLE IV.
  • [15] B. Graham, M. Engelcke, and L. van der Maaten (2018) 3D semantic segmentation with submanifold sparse convolutional networks. In CVPR, Cited by: Fig. 2, §2.4, §2.4, §3.1, §4.2.
  • [16] C. Hazirbas, L. Ma, C. Domokos, and D. Cremers (2016) Fusenet: incorporating depth into semantic segmentation via fusion-based cnn architecture. In ACCV, Cited by: §2.3.
  • [17] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In CVPR, Cited by: §4.2.
  • [18] G. Hinton, O. Vinyals, and J. Dean (2014) Distilling the knowledge in a neural network. In NIPS Workshops, Cited by: §2.3, §3.2.2.
  • [19] J. Hoffman, E. Tzeng, T. Park, J. Zhu, P. Isola, K. Saenko, A. Efros, and T. Darrell (2018) CyCADA: cycle-consistent adversarial domain adaptation. In ICML, Cited by: §1, §2.1.
  • [20] J. Hoffman, D. Wang, F. Yu, and T. Darrell (2016) FCNs in the wild: pixel-level adversarial and constraint-based adaptation. arXiv:1612.02649. Cited by: §1, §2.1.
  • [21] M. Jaritz, T. Vu, R. d. Charette, E. Wirbel, and P. Pérez (2020) XMUDA: cross-modal unsupervised domain adaptation for 3d semantic segmentation. In CVPR, Cited by: §1.
  • [22] D. Lee (2013) Pseudo-label: the simple and efficient semi-supervised learning method for deep neural networks. In ICML Workshops, Cited by: 2nd item, §2.1, §3.2.3, §3.2.
  • [23] K. Lee, G. Ros, J. Li, and A. Gaidon (2019) Spigan: privileged adversarial learning from simulation. In ICLR, Cited by: §2.1.
  • [24] Y. Li, L. Yuan, and N. Vasconcelos (2019) Bidirectional learning for domain adaptation of semantic segmentation. In CVPR, Cited by: §1, §2.1, §3.2.3, §4.2, §4.3, §4.3, §4.4, §4.4, TABLE II, TABLE III, TABLE IV.
  • [25] M. Liang, B. Yang, Y. Chen, R. Hu, and R. Urtasun (2019) Multi-task multi-sensor fusion for 3D object detection. In CVPR, Cited by: §2.3.
  • [26] M. Liang, B. Yang, S. Wang, and R. Urtasun (2018) Deep continuous fusion for multi-sensor 3d object detection. In ECCV, Cited by: §2.3.
  • [27] D. Meagher (1982) Geometric modeling using octree encoding. Computer graphics and image processing 19 (2), pp. 129–147. Cited by: §2.4.
  • [28] G. P. Meyer, J. Charland, D. Hegde, A. Laddha, and C. Vallespi-Gonzalez (2019) Sensor fusion for joint 3d object detection and semantic segmentation. In CVPR Workshop, Cited by: §2.3.
  • [29] P. Morerio, J. Cavazza, and V. Murino (2018) Minimal-entropy correlation alignment for unsupervised deep domain adaptation. In ICLR, Cited by: §2.1, §3.2, §4.3, §4.3, §4.4, §4.4, TABLE II, TABLE III, TABLE IV.
  • [30] C. R. Qi, L. Yi, H. Su, and L. J. Guibas (2017) PointNet++: deep hierarchical feature learning on point sets in a metric space. In NeurIPS, Cited by: §2.4.
  • [31] C. Qin, H. You, L. Wang, C. J. Kuo, and Y. Fu (2019) PointDAN: a multi-scale 3d domain adaption network for point cloud representation. In Advances in Neural Information Processing Systems, pp. 7192–7203. Cited by: §2.1.
  • [32] G. Riegler, A. Osman Ulusoy, and A. Geiger (2017) Octnet: learning deep 3d representations at high resolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3577–3586. Cited by: §2.4.
  • [33] O. Ronneberger, P. Fischer, and T. Brox (2015) U-net: convolutional networks for biomedical image segmentation. In MICCAI, Cited by: Fig. 2, §3.1, §4.2.
  • [34] K. Saito, D. Kim, S. Sclaroff, T. Darrell, and K. Saenko (2019) Semi-supervised domain adaptation via minimax entropy. In CVPR, Cited by: §2.2.
  • [35] A. Saporta, T. Vu, M. Cord, and P. Pérez (2020) ESL: entropy-guided self-supervised learning for domain adaptation in semantic segmentation. In CVPR Workshop, Cited by: §2.1.
  • [36] N. Sayed, B. Brattoli, and B. Ommer (2018) Cross and learn: cross-modal self-supervision. In German Conference on Pattern Recognition, Cited by: §2.3.
  • [37] P. Sun, H. Kretzschmar, X. Dotiwalla, A. Chouard, V. Patnaik, P. Tsui, J. Guo, Y. Zhou, Y. Chai, B. Caine, et al. (2020) Scalability in perception for autonomous driving: waymo open dataset. In CVPR, Cited by: Fig. 9, §4.1, TABLE III, §5.3, TABLE V.
  • [38] M. Tatarchenko, A. Dosovitskiy, and T. Brox (2017) Octree generating networks: efficient convolutional architectures for high-resolution 3d outputs. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2088–2096. Cited by: §2.4.
  • [39] H. Thomas, C. R. Qi, J. Deschaud, B. Marcotegui, F. Goulette, and L. J. Guibas (2019) KPConv: flexible and deformable convolution for point clouds. In ICCV, Cited by: §2.4.
  • [40] Y. Tsai, W. Hung, S. Schulter, K. Sohn, M. Yang, and M. Chandraker (2018) Learning to adapt structured output space for semantic segmentation. In CVPR, Cited by: §1, §2.1.
  • [41] A. Valada, R. Mohan, and W. Burgard (2019) Self-supervised model adaptation for multimodal semantic segmentation. IJCV. Cited by: §2.3.
  • [42] T. Vu, H. Jain, M. Bucher, M. Cord, and P. Pérez (2019) Advent: adversarial entropy minimization for domain adaptation in semantic segmentation. In CVPR, Cited by: §1, §2.1, §3.2, §4.3, §4.3, §4.4, §4.4, TABLE II, TABLE III, TABLE IV.
  • [43] T. Vu, H. Jain, M. Bucher, M. Cord, and P. Pérez (2019) DADA: depth-aware domain adaptation in semantic segmentation. In ICCV, Cited by: §2.1.
  • [44] S. Wang, S. Suo, W. Ma, A. Pokrovsky, and R. Urtasun (2018) Deep parametric continuous convolutional neural networks. In CVPR, Cited by: §2.4.
  • [45] Z. Wang, Y. Wei, R. Feris, J. Xiong, W. Hwu, T. S. Huang, and H. Shi (2020) Alleviating semantic-level shift: a semi-supervised domain adaptation method for semantic segmentation. In CVPR Workshops, Cited by: §2.2.
  • [46] B. Wu, A. Wan, X. Yue, and K. Keutzer (2018) Squeezeseg: convolutional neural nets with recurrent crf for real-time road-object segmentation from 3d LiDAR point cloud. In ICRA, Cited by: §A.1.
  • [47] B. Wu, X. Zhou, S. Zhao, X. Yue, and K. Keutzer (2019) Squeezesegv2: improved model structure and unsupervised domain adaptation for road-object segmentation from a lidar point cloud. In ICRA, Cited by: §1, §2.1.
  • [48] T. Yao, Y. Pan, C. Ngo, H. Li, and T. Mei (2015) Semi-supervised domain adaptation with subspace learning for visual recognition. In CVPR, Cited by: §2.2.
  • [49] L. Yi, B. Gong, and T. Funkhouser (2020) Complete & label: a domain adaptation approach to semantic segmentation of lidar point clouds. arXiv preprint arXiv:2007.08488. Cited by: §2.1.
  • [50] Y. Zhang, T. Xiang, T. M. Hospedales, and H. Lu (2018) Deep mutual learning. In CVPR, Cited by: §3.2.2.
  • [51] Y. Zou, Z. Yu, X. Liu, B.V.K. V. Kumar, and J. Wang (2019) Confidence regularized self-training. In ICCV, Cited by: §2.1, §3.2.3.