Scene understanding plays a central role in a number of modern applications and, among other tasks, semantic segmentation from images has been extensively studied. However, for applications that involve interaction with the 3D world, e.g., robots, self-driving cars or virtual reality, scenes should be understood in 3D. In this context, 3D semantic segmentation is gaining attention and an increasing number of datasets provide jointly annotated 3D point clouds and 2D images. The modalities are complementary since point clouds provide geometry while images capture texture and color.
Our goal in this work is to alleviate this problem with transfer learning, in particular domain adaptation (DA), where a model is trained by leveraging data from a different source domain to improve performance on the desired target domain, while benefiting from multi-modal 2D/3D data.
We consider both unsupervised and semi-supervised DA, that is, when labels are available in the source domain but not (or only partially) in the target domain. Most of the DA literature investigates the image modality [19, 20, 42, 40, 24], and only a few works address the point-cloud modality. Unlike these, we perform DA on images and point clouds simultaneously, with the aim of explicitly exploiting multi-modality for the DA goal.
We use self-driving data from cameras and LiDAR point-cloud sensors, and want to profit from the fact that the domain gaps differ across these sensors. For example, a LiDAR is more robust to lighting changes (e.g., day/night) than a camera. On the other hand, LiDAR sensing density varies with the sensor setup while cameras always output dense images. Our work takes advantage of the cross-modal discrepancies while preserving the best performance of each sensor, thus preventing the limitations of one modality from degrading the other modality’s performance.
We propose a cross-modal loss which enforces consistency between multi-modal predictions, as depicted in Fig. 1. Our specifically designed dual-head architecture enables robust training by decoupling the supervised main segmentation loss from the unsupervised cross-modal loss.
We demonstrate that our proposed cross-modal framework can be applied in either the unsupervised setting (coined xMUDA) or the semi-supervised setting (coined xMoSSDA).
This paper is an extension of our earlier work, which covered only UDA evaluated on three scenarios. Besides significantly expanding the experimental evaluation (Sec. 4), including the addition of two new DA scenarios (see Fig. 9), we add a completely new use case of semi-supervised DA (SSDA) in Secs. 3.3 and 4.4. The original code base will be extended with the new experiments and the SSDA setup.
In summary our contributions are:
We introduce new domain adaptation scenarios (4 unsupervised and 2 semi-supervised), for the task of 3D semantic segmentation, leveraging recent 2D-3D driving datasets with cameras and LiDARs.
We propose a new DA approach with an unsupervised cross-modal loss which enforces multi-modal consistency and is complementary to other existing unsupervised techniques.
We design a robust dual-head architecture which uncouples the cross-modal loss from the main segmentation objective.
We evaluate xMUDA and xMoSSDA, our unsupervised and semi-supervised DA methods respectively, and demonstrate their superior performance.
2 Related Work
2.1 Unsupervised Domain Adaptation
The past few years have seen increasing interest in unsupervised domain adaptation (UDA) for complex perception tasks like object detection and semantic segmentation. Underlying such methods is the shared idea of learning domain-invariant representations, i.e., features coming from different domains should exhibit only insignificant discrepancies. Some works promote adversarial training to minimize the source-target distribution shift, either in pixel space, feature space or output space [40, 42]. Revisited from semi-supervised learning, self-training with pseudo-labels has also recently proven effective for UDA [24, 51, 35].
Recent works have started addressing UDA in the 3D world, i.e., for point clouds. PointDAN proposes to jointly align local and global features used for classification. Achituve et al. improve UDA performance using self-supervised learning. Wu et al. adopt activation correlation alignment for UDA in 3D segmentation from LiDAR point clouds. Yi et al. address the domain discrepancies induced by different LiDAR sensors by recovering the canonical 3D surfaces, on top of which the downstream segmentation task is performed. In this work, we investigate the same task, but our system operates on multi-modal input data, i.e., RGB + LiDAR.
To the best of our knowledge, there are no previous UDA works in 2D/3D semantic segmentation for multi-modal scenarios. Only some consider an extra modality, e.g., depth, that is solely available at training time on the source domain, and leverage such privileged information to boost adaptation performance [23, 43]. In contrast, we assume that all modalities are available at train and test time on both source and target domains.
2.2 Semi-supervised Domain Adaptation
While UDA has become an active research topic, semi-supervised domain adaptation (SSDA) has so far been little-studied despite being highly relevant in practical applications. In SSDA, we would like to transfer knowledge from a source domain with labeled data to a target domain with partially labeled data.
Early approaches based on SVM have addressed SSDA in image classification and object detection [11, 48, 3]; little has been done for deep networks. Recently, Saito et al. propose an adversarial SSDA learning scheme to optimize a few-shot deep classification model with minimax entropy. Wang et al. extend UDA techniques in 2D semantic segmentation to the SSDA setting by additionally aligning feature prototypes of labeled source and target samples. Our work is the first to address SSDA in point cloud segmentation.
2.3 Cross-modality learning
In our context, we define cross-modality learning as knowledge transfer between modalities. This is different from multi-modal fusion, where a single model is trained with supervision to combine complementary inputs, such as RGB-D [16, 41] or LiDAR and camera [25, 26, 28].
Castrejón et al. address the task of cross-modal scene retrieval. Their goal is to learn a joint high-level feature representation which is agnostic to the input modality (real image, clip art, text, etc.). This is achieved by fine-tuning the shared weights of the network’s final layers and enforcing similar statistics across modalities. Gupta et al. adapt the more direct feature alignment technique of distillation in a cross-modal setup. They apply an L2 loss between multi-modal features to transfer knowledge from an RGB network trained with supervision to unlabeled depth or optical flow.
Self-supervised learning is the task of learning useful representations in the absence of labels. This can be achieved by forcing networks with different input modalities to predict a similar output. Sayed et al.  minimize the cosine distance between RGB and optical flow features. Instead, Alwassel et al.  use clustering to generate pseudo labels and use them to mutually train an audio and video network.
Different from those related works, we address the task of domain adaptation, specifically in point cloud segmentation, using the modalities of RGB and LiDAR.
2.4 Point cloud segmentation
While images are dense tensors, 3D point clouds can be represented in multiple ways, which leads to competing network families evolving in parallel.
Voxels are similar to pixels, but very memory-intensive in their dense representation, as most of them are usually empty. Some 3D CNNs [32, 38] rely on OctTree to reduce memory usage, but without addressing the problem of manifold dilation. Graham et al. and a similar implementation address the latter by using hash tables to convolve only on active voxels. This allows for very high resolution with typically only one point per voxel.
Point-based networks perform computation in continuous 3D space and can thus directly accept point clouds as input. PointNet++ uses point-wise convolution, max-pooling to compute global features, and local neighborhood aggregation for hierarchical learning akin to CNNs. Many improvements have been proposed in this direction, such as continuous convolutions and deformable kernels.
3 Cross-modal Learning for Domain Adaptation
Our aim is to exploit multi-modality as a source of knowledge for unsupervised learning in domain adaptation. Therefore, we propose a cross-modal learning objective, implemented as a mutual mimicking game between modalities, that drives toward consistency across predictions from different modalities.
Specifically, we investigate the modalities of 2D images and 3D point clouds for the task of 3D semantic segmentation as it is a core task for machine vision.
We present the network architecture in Sec. 3.1, our framework for the challenging cross-modal unsupervised domain adaptation, coined ‘xMUDA’, in Sec. 3.2 and its semi-supervised version, analogously called ‘xMoSSDA’, in Sec. 3.3.
3.1 Architecture
Our architecture predicts point-wise segmentation labels. It consists of two independent streams which respectively take a 2D image and a 3D point cloud as inputs, and output features of size and respectively, where is the number of 3D points within the camera field of view. An overview is depicted in Fig. 2.
3.1.1 Dual Segmentation Head
We call segmentation head (depicted as ‘classify’ arrows in Fig.2
For cross-modal learning, we establish a mimicking game between the 2D and 3D output probabilities, i.e., each modality should predict the other modality’s output. The overall objective drives the two modalities toward an agreement, thus enforcing consistency between outputs.
In a naive approach, each modality has a single segmentation head (as depicted in Fig. 5a) and the cross-modal optimization objective aligns the outputs of both modalities. Unfortunately, this setup is not robust, as the mimicking objective is in direct competition with the main segmentation objective. This is why, in practice, one needs to down-weight the mimicry loss relative to the segmentation loss to observe a performance gain. However, this is a serious limitation, because down-weighting the mimicry loss also decreases its adaptation effect.
To address this problem, we disentangle the mimicry from the main segmentation objective with a dual-head architecture, as depicted in Figs. 2 and 5b. In this setup, the 2D and 3D streams both have two segmentation heads: one main head for the best possible prediction, and one mimicry head to estimate the other modality’s output.
The outputs of the four segmentation heads (see Fig. 2) are of size , where is the number of classes such that we obtain a vector of class probabilities for each 3D point. The two main heads produce the best possible segmentation predictions, and respectively for each branch. The two mimicry heads estimate the other modality’s output: 2D estimates 3D () and 3D estimates 2D ().
3.2 Unsupervised Domain Adaptation (xMUDA)
We propose xMUDA, cross-modal unsupervised domain adaptation, which considers a source-domain dataset , where each sample consists of a 2D image , a 3D point cloud and 3D segmentation labels with classes, as well as a target-domain dataset , lacking annotations, where each sample only consists of an image and a point cloud .
In the following, we define the usual supervised learning setup, our cross-modal loss , and an additional variant ‘xMUDAPL’ that further uses pseudo-labels to boost performance. An overview of the learning setup is depicted in Fig. (a)a. The difference between our proposed cross-modal learning and existing uni-modal UDA techniques, such as pseudo-labels, MinEnt or Deep logCORAL, is visualized in Fig. (b)b.
3.2.1 Supervised Learning
The main goal of 3D segmentation is learned through cross-entropy in a classical supervised fashion on the source-domain data. Denoting the soft-classification map associated by the segmentation model to the 3D points of interest, for a given input , the segmentation loss of each network stream (2D and 3D) for a given training sample in reads:
where is either or and equals . We denote tensor entries’ indices as superscript.
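As an illustration of the supervised term described above, the following is a minimal sketch of a point-wise cross-entropy averaged over the 3D points. The function name `segmentation_loss`, the list-of-lists layout (one row of class probabilities per point) and the `eps` stabilizer are assumptions made for this example; the class weighting mentioned in the implementation details is omitted for brevity.

```python
import math


def segmentation_loss(probs, labels, eps=1e-12):
    """Mean cross-entropy over N points (illustrative, unweighted).

    probs:  list of N rows, each holding C class probabilities for one 3D point.
    labels: list of N integer class indices (ground truth for each point).
    """
    total = 0.0
    for point_probs, label in zip(probs, labels):
        # Only the probability assigned to the true class contributes.
        total -= math.log(point_probs[label] + eps)
    return total / len(labels)


# Two points, three classes.
probs = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
labels = [0, 1]
loss = segmentation_loss(probs, labels)
```

The same function would be applied separately to the 2D and the 3D stream, since each stream produces its own probabilities for the same set of points.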
3.2.2 Cross-Modal Learning
The objective of unsupervised learning across modalities is twofold. Firstly, we want to transfer knowledge from one modality to the other on the target-domain dataset. For example, let one modality be sensitive and the other more robust to the domain shift, then the robust modality should teach the sensitive modality the correct class in the target domain where no labels are available. Secondly, we want to design an auxiliary objective on source and target domains, where the task is to estimate the other modality’s prediction. By mimicking not only the class with maximum probability, but the whole distribution like in teacher-student distillation , more information is exchanged, leading to softer labels.
We choose the KL divergence for the cross-modal loss and define it as follows:
with where is the target distribution from the main prediction which is to be estimated by the mimicking prediction . This loss is applied on the source and the target domain as it does not require ground-truth labels and is the key to our proposed domain adaptation framework. In the source domain, can be seen as an auxiliary mimicry loss in addition to the main segmentation loss .
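The cross-modal loss described above can be sketched as a KL divergence between the main prediction of one modality (the target) and the mimicry prediction of the other modality, averaged over the 3D points. This is a minimal sketch in plain Python, not the reference implementation: the function name, the data layout and the `eps` stabilizer are assumptions, and the standard non-negative form of the KL divergence is used. The target is treated as a constant here, which corresponds to detaching it in an automatic-differentiation framework, as noted in the implementation details.

```python
import math


def cross_modal_loss(target_probs, mimic_probs, eps=1e-12):
    """Mean over points of KL(target || mimic).

    target_probs: N rows of C probabilities from a main head (held fixed).
    mimic_probs:  N rows of C probabilities from the other stream's mimicry head.
    """
    total = 0.0
    for p_row, q_row in zip(target_probs, mimic_probs):
        for p, q in zip(p_row, q_row):
            if p > 0.0:  # terms with p == 0 contribute nothing
                total += p * math.log(p / (q + eps))
    return total / len(target_probs)


# The 2D stream's mimicry head tries to estimate the 3D main prediction.
main_3d = [[0.9, 0.1], [0.5, 0.5]]
mimic_2d_to_3d = [[0.6, 0.4], [0.5, 0.5]]
loss_2d = cross_modal_loss(main_3d, mimic_2d_to_3d)
```

The symmetric term for the 3D stream would swap the roles, using the 2D main prediction as the target and the 3D mimicry prediction as the estimate. Because no labels appear, the same function can be evaluated on source and target samples alike.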
The complete objective for each network stream (2D and 3D) is the combination of the segmentation loss on source-domain data and the cross-modal loss on both domains:
Here, two hyperparameters weight the cross-modal loss on the source and target domains, respectively, and the minimization is over the network weights of either the 2D or the 3D stream.
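For reference, the objective described in this subsection could be written out as below. This is a sketch under assumed notation, since the symbols are not fixed by the surrounding text: $\mathcal{S}$ and $\mathcal{T}$ denote the source and target sets, $\mathcal{L}_{\mathrm{seg}}$ and $\mathcal{L}_{\mathrm{xM}}$ the segmentation and cross-modal losses, $\lambda_s$ and $\lambda_t$ the two weights, and $\theta$ the weights of one stream. The normalization by set size is likewise an assumption.

```latex
\min_{\theta} \;
\frac{1}{|\mathcal{S}|} \sum_{x_s \in \mathcal{S}}
  \Big( \mathcal{L}_{\mathrm{seg}}(x_s, y_s) + \lambda_s \, \mathcal{L}_{\mathrm{xM}}(x_s) \Big)
\; + \;
\frac{1}{|\mathcal{T}|} \sum_{x_t \in \mathcal{T}}
  \lambda_t \, \mathcal{L}_{\mathrm{xM}}(x_t)
```

The first sum combines the supervised and cross-modal terms on source samples, while the second sum contains only the cross-modal term, since target samples carry no labels.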
There are parallels between our approach and Deep Mutual Learning  in training two networks in collaboration and using the KL divergence as mimicry loss. However, unlike the aforementioned work, our cross-modal learning establishes consistency across modalities (2D/3D) without supervision.
3.2.3 Self-training with Pseudo-Labels
Cross-modal learning is complementary to pseudo-labeling, originally used in semi-supervised learning and recently in UDA [24, 51]. To benefit from both, after optimizing a model with Eq. 4, we extract pseudo-labels offline, selecting highly confident labels based on the predicted class probability. Then, we train again from scratch, using the produced pseudo-labels for an additional segmentation loss on the target-domain training set. The optimization problem reads:
where weights the pseudo-label segmentation loss and are the pseudo-labels. For clarity, we will refer to the xMUDA variant that uses additional self-training with pseudo-labels as xMUDAPL.
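The offline pseudo-label extraction can be illustrated with a simple confidence filter. This is a minimal sketch assuming a single global threshold; the exact selection rule used in the experiments is not specified here, and `IGNORE_LABEL`, the threshold value of 0.9 and the function name are hypothetical choices for this example.

```python
IGNORE_LABEL = -100  # hypothetical marker for points excluded from the loss


def extract_pseudo_labels(probs, threshold=0.9):
    """Keep the predicted class only where the model is highly confident.

    probs: N rows of C class probabilities predicted on unlabeled target points.
    Returns a list of N labels, with IGNORE_LABEL for low-confidence points.
    """
    labels = []
    for point_probs in probs:
        # Index of the most probable class for this point.
        best_class = max(range(len(point_probs)), key=point_probs.__getitem__)
        if point_probs[best_class] >= threshold:
            labels.append(best_class)
        else:
            labels.append(IGNORE_LABEL)
    return labels


pseudo = extract_pseudo_labels([[0.95, 0.05], [0.55, 0.45], [0.02, 0.98]])
```

Points marked with `IGNORE_LABEL` would simply be skipped when computing the additional segmentation loss on the target-domain training set.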
3.3 Semi-supervised Domain Adaptation (xMoSSDA)
Cross-modal learning can also be used in semi-supervised domain adaptation, thus benefiting from a small portion of labeled data in the target domain.
Formally, in xMoSSDA we consider a labeled source-domain dataset where each sample contains an image , a point cloud and labels . Unlike in the unsupervised setting, the target-domain set consists of a usually small labeled part where each sample holds an image , a point cloud and labels , as well as an often larger unlabeled part where each sample consists only of an image and a point cloud .
3.3.1 Supervised Learning
Unlike xMUDA, we do not only apply the segmentation loss of Eq. 1 on the source-domain dataset , but also on the labeled target-domain dataset : The segmentation loss in Eq. 1 thus applies both to samples in and to samples in . Note that, in practice, we train on source and target domains at the same time by concatenating examples from both in a batch.
3.3.2 Cross-Modal Learning
We apply the unsupervised cross-modal loss of Eq. 2 on all datasets, i.e., (labeled) source-domain , labeled target-domain , and unlabeled target-domain dataset . The latter, , is typically a large portion of unlabeled data compared to a usually much smaller labeled portion . Consequently, it is beneficial to also exploit with an unsupervised loss, such as cross-modal learning. The complete objective is a combination of the supervised segmentation loss where labels are available (i.e., in sets and ), and the unsupervised cross-modal loss everywhere (i.e., on and ), which enforces consistency between the 2D and 3D predictions. It reads:
where , and are the weighting hyperparameters for . In practice we choose for simplicity.
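One way to write the semi-supervised objective described above is sketched below. The notation is assumed rather than taken from the text: $\mathcal{S}$, $\mathcal{T}_{\ell}$ and $\mathcal{T}_{u}$ denote the source, labeled target and unlabeled target sets, $\mathcal{L}_{\mathrm{seg}}$ and $\mathcal{L}_{\mathrm{xM}}$ the segmentation and cross-modal losses, and $\lambda_s$, $\lambda_{t\ell}$, $\lambda_{tu}$ the three weighting hyperparameters. Normalizing each sum by its set size is also an assumption.

```latex
\min_{\theta} \;
\frac{1}{|\mathcal{S}|} \sum_{x_s \in \mathcal{S}}
  \Big( \mathcal{L}_{\mathrm{seg}}(x_s, y_s) + \lambda_s \, \mathcal{L}_{\mathrm{xM}}(x_s) \Big)
+ \frac{1}{|\mathcal{T}_{\ell}|} \sum_{x_{t\ell} \in \mathcal{T}_{\ell}}
  \Big( \mathcal{L}_{\mathrm{seg}}(x_{t\ell}, y_{t\ell}) + \lambda_{t\ell} \, \mathcal{L}_{\mathrm{xM}}(x_{t\ell}) \Big)
+ \frac{1}{|\mathcal{T}_{u}|} \sum_{x_{tu} \in \mathcal{T}_{u}}
  \lambda_{tu} \, \mathcal{L}_{\mathrm{xM}}(x_{tu})
```

The segmentation term appears only in the two sums over labeled data, whereas the cross-modal term appears in all three.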
3.3.3 Self-training with Pseudo-Labels
As in the unsupervised setting, we extend semi-supervised cross-modal learning to also benefit from pseudo-labels. After having trained a model with Eq. 6, we use the model to generate predictions on the unlabeled target-domain dataset and extract highly-confident pseudo-labels which are used to train again from scratch with the following objective:
where is weighting the pseudo-label segmentation loss and are the pseudo-labels. We call this variant xMoSSDAPL.
4 Experiments
For evaluation, we identified five domain adaptation (DA) scenarios relevant to autonomous driving, shown in Fig. 9, and evaluated our cross-modal proposals against recent baselines.
In the following, we first describe the datasets (Sec. 4.1), the implementation backbone and training details (Sec. 4.2), and then evaluate xMUDA (Sec. 4.3) and xMoSSDA (Sec. 4.4). Finally, we extend our cross-modal framework to fusion (Sec. 4.5), demonstrating its global benefit.
4.1 Datasets
To compose the domain adaptation scenarios displayed in Fig. 9, we leveraged the public datasets nuScenes, VirtualKITTI, SemanticKITTI, A2D2 and Waymo Open Dataset (Waymo OD). The split details are in Tab. I. Our scenarios cover typical DA challenges: a change in scene layout, between right- and left-hand-side driving, in nuScenes: USA/Singapore; lighting changes, between day and night, in nuScenes: Day/Night; synthetic-to-real data, from simulated depth and RGB to real LiDAR and camera, in VirtualKITTI/SemanticKITTI; different sensor setups and characteristics, like resolution and FoV, in A2D2/SemanticKITTI; and weather changes, between sunny San Francisco, Phoenix and Mountain View and rainy Kirkland, in Waymo OD: SF,PHX,MTV/KRK.
In all datasets, the LiDAR and the camera are synchronized and calibrated, allowing 2D/3D projections. For consistency across datasets, we only use the front camera’s images, even when multiple cameras are available.
Waymo OD and nuScenes do not provide point-wise 3D segmentation labels, so we produce them by leveraging the 3D object bounding-box labels (the nuScenes LiDAR segmentation dataset was not released at the time of our work). Points lying inside a box are labeled as that box’s class and points outside of all boxes are labeled as background.
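The box-based labeling rule can be sketched as follows. For simplicity, this example assumes axis-aligned boxes given by their lower and upper corners, whereas 3D object annotations in driving datasets are generally oriented boxes that would require an additional rotation into the box frame. The function name, the tuple layout and the `BACKGROUND` index are illustrative assumptions.

```python
BACKGROUND = 0  # assumed index of the background class


def label_points_from_boxes(points, boxes):
    """Assign each 3D point the class of the first box that contains it.

    points: list of (x, y, z) tuples.
    boxes:  list of (class_id, lower_corner, upper_corner), axis-aligned.
    """
    labels = []
    for point in points:
        label = BACKGROUND  # default for points outside of all boxes
        for class_id, lower, upper in boxes:
            inside = all(lo <= v <= hi for v, lo, hi in zip(point, lower, upper))
            if inside:
                label = class_id
                break
        labels.append(label)
    return labels


boxes = [(1, (0.0, 0.0, 0.0), (2.0, 4.0, 1.5))]  # one box of class 1
points = [(1.0, 2.0, 0.5), (5.0, 2.0, 0.5)]      # inside, outside
box_labels = label_points_from_boxes(points, boxes)
```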
As some classes have too few samples, we merge or ignore them. If class definitions differ between source and target domains (e.g., VirtualKITTI/SemanticKITTI), we define a custom class mapping. Note that VirtualKITTI contains depth maps in the camera’s reference frame. In order to simulate LiDAR scanning, we uniformly sample points in the depth map.
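Simulating a sparse scan from a dense depth map can be sketched as uniform sampling of pixels followed by back-projection. The pinhole camera model and the intrinsics `fx`, `fy`, `cx`, `cy` are assumptions made for this illustration, as are the function name and the nested-list depth map; the text only states that points are sampled uniformly in the depth map.

```python
import random


def sample_points_from_depth(depth_map, num_points, fx, fy, cx, cy, rng=random):
    """Uniformly sample pixels and back-project them to 3D camera coordinates.

    depth_map: H rows of W depth values (distance along the optical axis).
    Returns a list of (x, y, z) points in the camera reference frame.
    """
    height, width = len(depth_map), len(depth_map[0])
    pixels = [(row, col) for row in range(height) for col in range(width)]
    points = []
    # Sampling without replacement: each pixel is used at most once.
    for row, col in rng.sample(pixels, num_points):
        z = depth_map[row][col]
        x = (col - cx) * z / fx  # assumed pinhole back-projection
        y = (row - cy) * z / fy
        points.append((x, y, z))
    return points


depth_map = [[1.0, 2.0], [3.0, 4.0]]
sampled = sample_points_from_depth(
    depth_map, 3, fx=1.0, fy=1.0, cx=0.5, cy=0.5, rng=random.Random(0)
)
```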
We provide further details about the datasets in App. A. The training data and splits can be reproduced with our code.
|nuScenes: USA/Singapore||nuScenes: Day/Night||Virt.KITTI/ Sem.KITTI||A2D2/ Sem.KITTI||Waymo OD: SF,PHX,MTV/KRK|
|Waymo OD: SF,PHX,MTV/KRK||158,081||11,853 94,624||3,943/3,932|
4.2 Implementation Details
In the following, we briefly introduce our implementation. Please refer to our code for further details.
2D Network. We use a modified version of U-Net  with ResNet34  encoder and a decoder with transposed convolutions and skip connections. To lift the 2D features to 3D, we subsample the output feature map of size at the pixel locations where the 3D points project. Hence, the 2D network takes an image as input and outputs features of size .
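The lifting step, which samples the 2D feature map at the pixels where the 3D points project, amounts to a gather operation. The sketch below uses nested lists and integer pixel coordinates; both the function name and the assumption that projections are already rounded to valid pixel indices are made for this example only.

```python
def lift_features(feature_map, pixel_coords):
    """Gather one feature vector per 3D point from a 2D feature map.

    feature_map:  H x W grid where each cell holds a list of feature values.
    pixel_coords: list of (row, col) pixel locations, one per projected 3D point.
    """
    return [feature_map[row][col] for row, col in pixel_coords]


# A 2 x 3 feature map with 2 features per pixel.
feature_map = [
    [[0.0, 0.1], [1.0, 1.1], [2.0, 2.1]],
    [[3.0, 3.1], [4.0, 4.1], [5.0, 5.1]],
]
# Three 3D points; two of them happen to project to the same pixel.
pixel_coords = [(0, 2), (1, 0), (1, 0)]
point_features = lift_features(feature_map, pixel_coords)
```

After this step the 2D stream has one feature vector per 3D point, so its predictions can be compared point by point with those of the 3D stream.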
3D Network. We use the official SparseConvNet  implementation and a U-Net architecture with 6 times downsampling. The voxel size is set to 5cm which is small enough to only have one 3D point per voxel. Thus, the 3D network takes a point cloud as input and outputs features of size .
We employ standard 2D/3D data augmentation and log-smoothed class weights to address class imbalance. In PyTorch, to compute the KL divergence for the cross-modal loss, we detach the target variable to only backpropagate in either the 2D or the 3D network. We train with a batch size of 8 and the Adam optimizer, for 30k iterations in the scenario with the small VirtualKITTI dataset and for 100k iterations in all other scenarios. At each iteration, we compute and accumulate gradients on the source and target batch, jointly training the 2D and 3D streams. To fit the training into a single GPU with 11GB of memory, we resize the images and additionally crop them in VirtualKITTI and SemanticKITTI.
For the pseudo-label variants, xMUDAPL and xMoSSDAPL, we generate the pseudo-labels offline with the trained xMUDA and xMoSSDA models, respectively. Then, we retrain from scratch, additionally using the pseudo-labels, optimizing Eqs. 5 and 7, respectively. Importantly, we only use the last checkpoint to generate the pseudo-labels, as opposed to the best weights, which would provide a supervised signal.
| Method | nuSc: USA/Singap. | | | nuSc: Day/Night | | | Virt.KITTI/Sem.KITTI | | | A2D2/Sem.KITTI | | |
| | 2D | 3D | 2D+3D | 2D | 3D | 2D+3D | 2D | 3D | 2D+3D | 2D | 3D | 2D+3D |
| Baseline (src only) | 53.4 | 46.5 | 61.3 | 42.2 | 41.2 | 47.8 | 26.8 | 42.0 | 42.2 | 34.2 | 35.9 | 40.4 |
| Deep logCORAL | 52.6 | 47.1 | 59.1 | 41.4 | 42.8 | 51.8 | 41.4* | 36.8 | 47.0* | 35.1* | 41.0 | 42.2* |
| Domain gap (O-B) | 12.9 | 17.3 | 10.3 | 6.5 | 5.9 | 7.4 | 39.5 | 36.4 | 37.9 | 25.1 | 36.0 | 33.2 |
* The 2D network is trained with batch size 6 instead of 8 to fit into GPU memory.
4.3 xMUDA
We evaluate xMUDA on four unsupervised domain adaptation scenarios and compare against uni-modal UDA methods: Deep logCORAL , entropy minimization (MinEnt)  and pseudo-labeling (PL) . For  the image-2-image translation part was excluded due to its instability, high training complexity and incompatibility with LiDAR data. Regarding the two other uni-modal techniques, we adapt the published implementations to our settings. For all, we searched for the best respective hyperparameters.
We report mean Intersection over Union (mIoU) of the target test set for 3D segmentation in Tab. II. We evaluate on the test set using the checkpoint that achieved the best score on the validation set. In addition to the scores of the 2D and 3D model, we show the ensembling result (‘2D+3D’) which is obtained by taking the mean of the predicted 2D and 3D probabilities after softmax. The uni-modal UDA baselines [24, 29, 42] are applied separately on each modality.
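The ‘2D+3D’ ensembling described above, i.e., averaging the two softmax outputs and taking the most probable class, can be sketched as follows. The function name and the nested-list layout are assumptions made for this example.

```python
def ensemble_predictions(probs_2d, probs_3d):
    """Average 2D and 3D class probabilities per point and take the argmax.

    probs_2d, probs_3d: N rows of C probabilities (after softmax), same points.
    Returns a list of N predicted class indices.
    """
    classes = []
    for p2d, p3d in zip(probs_2d, probs_3d):
        mean_probs = [(a + b) / 2.0 for a, b in zip(p2d, p3d)]
        classes.append(max(range(len(mean_probs)), key=mean_probs.__getitem__))
    return classes


probs_2d = [[0.6, 0.4], [0.3, 0.7]]
probs_3d = [[0.1, 0.9], [0.4, 0.6]]
fused = ensemble_predictions(probs_2d, probs_3d)
```

In this toy example the 2D stream alone would predict class 0 for the first point, but the more confident 3D prediction dominates the average, so the ensemble predicts class 1.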
Furthermore, we provide the results of a lower bound, ‘Baseline (src only)’, which is only trained on the source-domain dataset, and an upper bound, ‘Oracle’, trained only on target with labels (except for the Day/Night oracle, where we use batches of 50%/50% source/target to prevent overfitting due to the small target set size). We also indicate the ‘Domain gap (O-B)’, computed as the difference between Oracle and Baseline. It shows that the intra-dataset domain gaps (nuScenes: USA/Singapore, Day/Night) are much smaller than the inter-dataset domain gaps (A2D2/SemanticKITTI, VirtualKITTI/SemanticKITTI). This suggests that a change in sensor setup (A2D2/SemanticKITTI) is actually a very hard domain adaptation problem, similar to the synthetic-to-real case (VirtualKITTI/SemanticKITTI). Note that the scores are not comparable between A2D2/SemanticKITTI and VirtualKITTI/SemanticKITTI, because they use a different number of classes, 10 and 6 respectively.
xMUDA, which uses the cross-modal loss but not PL, brings a significant adaptation effect on all four UDA scenarios compared to ‘Baseline’ and almost always outperforms the uni-modal UDA baselines. xMUDAPL achieves the best score everywhere, with the only exception of Day/Night 2D+3D. Further, cross-modal learning and self-training with pseudo-labels (PL) are complementary, as their combination in xMUDAPL consistently yields a higher score than each separate technique. The 2D/3D oracle scores indicate that camera (2D) is the strongest modality on nuScenes, but LiDAR (3D) is better on SemanticKITTI, probably thanks to the high LiDAR resolution. However, xMUDA consistently improves both modalities (2D and 3D), i.e., even the strong modality can learn from the weaker one. The dual-head architecture might be key here: each modality can improve its main segmentation head independently from the other modality, because the consistency is achieved indirectly through the mimicking heads.
We also observe a regularization effect thanks to xMUDA. For example on VirtualKITTI/SemanticKITTI, the methods ‘Baseline’ and ‘PL’ perform very poorly on the 2D modality due to overfitting on the very small VirtualKITTI dataset, while 3D is more stable. In contrast, xMUDA performs better as 3D can regularize 2D. Furthermore, this regularization enables the benefit of pseudo-labels, because xMUDAPL achieves an even better score.
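| Method | Train data | A2D2/Sem.KITTI | | | Waymo OD: SF,PHX,MTV/KRK | | |
| | | 2D | 3D | 2D+3D | 2D | 3D | 2D+3D |
| Baseline (src only) | source | 37.9 | 32.8 | 43.3 | 61.4 | 50.8 | 64.4 |
| Baseline (lab. trg only) | labeled target | 51.3 | 57.7 | 59.2 | 56.5 | 57.1 | 60.3 |
| Baseline (src and lab. trg) | source + labeled target | 54.8 | 62.4 | 66.2 | 64.5 | 56.3 | 69.3 |
| Deep logCORAL | source + labeled target + unlabeled target | 55.1* | 62.2 | 64.7* | 61.4 | 56.5 | 66.1 |
| MinEnt | source + labeled target + unlabeled target | 56.3 | 62.5 | 65.0 | 64.3 | 56.6 | 69.1 |
| PL | source + labeled target + unlabeled target | 57.2 | 66.9 | 68.5 | 67.4 | 56.7 | 70.2 |
* The 2D network is trained with batch size 6 instead of 8 to fit into GPU memory.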
4.4 xMoSSDA
We evaluate semi-supervised cross-modal learning on two domain adaptation scenarios (A2D2/SemanticKITTI, Waymo OD) and compare xMoSSDA against eight baselines.
Three baselines are purely supervised, trained on source only, on labeled target only, or on both source and labeled target. The latter is trained with 50%/50% examples from the source and labeled target sets, i.e., a training batch of size 8 contains 4 random examples from each.
Additionally, we report two UDA baselines, xMUDA and xMUDAPL, which use source and unlabeled target data.
Last, we report three SSDA baselines (trained on source, labeled target and unlabeled target data) adapted from uni-modal UDA baselines [29, 42, 24] as follows: we train similarly to the supervised baseline on source and labeled target data with 50%/50% batches, but add the respective domain adaptation loss on the unlabeled target data. Our semi-supervised proposals, xMoSSDA and xMoSSDAPL, are also trained in this manner.
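The 50%/50% batch composition can be sketched as drawing half of each batch from the source set and half from the labeled target set. The function name, the placeholder sample tuples and the choice of sampling without replacement within a batch are assumptions made for this illustration.

```python
import random


def mixed_batch(source, labeled_target, batch_size=8, rng=random):
    """Build one training batch with half source and half labeled-target samples."""
    half = batch_size // 2
    # Draw each half independently; samples within a half are distinct.
    return rng.sample(source, half) + rng.sample(labeled_target, batch_size - half)


# Placeholder datasets: each sample is tagged with its domain for inspection.
source = [("src", i) for i in range(100)]
labeled_target = [("trg", i) for i in range(20)]
batch = mixed_batch(source, labeled_target, rng=random.Random(0))
```

With the default batch size of 8, each batch holds 4 source and 4 labeled-target samples regardless of how different the two set sizes are, so the small labeled target set is not drowned out by the larger source set.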
Note that it is impossible to train an Oracle like in Tab. II, as there are no labels available on the unlabeled target set, and consequently we cannot compute the domain gap. Instead, we answer the question: “How much can we improve over the supervised baseline by additionally training on the unlabeled target-domain data?”. We call this the ‘unsupervised advantage’ in the following and compute it as the difference between xMoSSDAPL trained on all data (source, labeled target and unlabeled target) and the supervised baseline trained on source and labeled target data.
We report the mIoU for 3D segmentation in Tab. III. We observe, similar to the xMUDA experiments, that the gap between ‘Baseline (src only)’ and ‘Baseline (src and lab. trg)’ is much larger in the inter-dataset adaptation scenario A2D2/SemanticKITTI than in intra-dataset adaptation on Waymo OD.
As expected, xMUDA and xMUDAPL improve over ‘Baseline (src only)’, but are (with two exceptions) worse than the baseline that uses the small labeled target-domain dataset.
Qualitative results are shown in Fig. 11.
4.5 Extension to Fusion
(a) In Vanilla Fusion, the 2D and 3D features are concatenated, fed into a linear layer with ReLU to mix them, and followed by another linear layer and softmax to obtain a fused prediction. (b) In xMUDA Fusion, we add two uni-modal outputs, one per stream, that are used to mimic the fusion output.
So far, we used an architecture with independent 2D/3D streams. However, can xMUDA also be applied in a fusion setup where both modalities make a joint prediction? A common fusion architecture is late fusion where the features from different sources are concatenated (see Fig. (a)a). However, when merging the main 2D/3D branches into a unique fused head, we can no longer apply cross-modal learning (as in Fig. (a)a). To address this problem, we propose ‘xMUDA Fusion’ where we add an additional segmentation head to both 2D and 3D network streams prior to the fusion layer with the purpose of mimicking the central fusion head (see Fig. (b)b). Note that this idea could also be applied on top of other fusion architectures.
|Baseline (src only)||Vanilla||59.9||34.2|
|Deep logCORAL ||Vanilla||58.2||36.2|
In Tab. IV we show results for different fusion approaches where we specify which architecture was used (Vanilla late fusion from Fig. (a)a or xMUDA Fusion from Fig. (b)b). We observe that the xMUDA fusion architecture leads to better results than the UDA baselines with the Vanilla architecture. This demonstrates how cross-modal learning can be applied effectively in fusion setups.
5 Ablation Studies
5.1 Single vs. Dual Segmentation Head
In the following we justify our dual head over the simpler single-head architecture. Both are shown in Fig. 5.
In the single-head architecture (Fig. 5a), the cross-modal loss is directly applied between the 2D and 3D main heads. This enforces consistency by aligning the two outputs in addition to the supervised segmentation loss. Thus, the heads must satisfy the two objectives, segmentation and consistency, at the same time. To showcase the disadvantage of this architecture, we train xMUDA (as in Eq. 4) and vary the weight of the cross-modal loss on target, which is the main driver for UDA. The results in Fig. 15 for the single-head architecture (Fig. 5a) show that increasing this weight from 0.001 to 0.01 slightly improves the mIoU, but that increasing it further to 0.1 and 1.0 has a strongly negative effect. In the extreme case, 2D and 3D always predict the same class, thus only satisfying the consistency objective, but not the segmentation objective.
The dual-head architecture (Fig. 5b) addresses this problem by introducing a secondary mimicking head whose purpose is to mimic the main head of the other modality during training and which can be discarded afterwards. This effectively disentangles the mimicking objective, which is applied to the mimicking head, from the segmentation objective, which is applied to the main head. Fig. 15 shows that increasing the weight to 0.1 for the dual head produces the best results overall, better than any value for the single head, and that the results are robust even at .
5.2 Cross-Modal Learning on Source
In Eq. 4, the cross-modal loss is applied on source and target domains, although we already have the supervised segmentation loss on the source domain. We observe a gain of 4.8 mIoU on 2D and 4.4 on 3D when adding the cross-modal loss on the source domain, as opposed to applying it on the target domain only. This shows that it is important to train the mimicking head on source-domain data, which stabilizes the predictions and can be exploited during adaptation on target-domain inputs.
5.3 Cross-modal Supervised Learning
|nuScenes: Singapore ()||Waymo OD: KRK ()|
To evaluate the possible benefits of cross-modal learning for purely supervised settings, we conducted experiments with and without adding the cross-modal loss on two different target-domain datasets: nuScenes and Waymo OD. The results in Tab. V show a performance gain when adding the cross-modal loss. We hypothesize that the extra cross-modal objective can be beneficial, similar to multi-task learning. On the Waymo OD dataset, we observe a strong improvement on 2D. The validation curve during training shows that cross-modal learning reduces overfitting in 2D. We hypothesize that 3D, which suffers less from overfitting, can have a regularizing effect on 2D.
6 Conclusion
In this work, we proposed cross-modal learning for domain adaptation in unsupervised (xMUDA) and semi-supervised (xMoSSDA) settings. To this end, we designed a two-stream, dual-head architecture and applied a cross-modal loss to the image and point-cloud modalities in the task of 3D semantic segmentation. The cross-modal loss consists of KL divergence applied between the predictions of the two modalities and thereby enforces consistency.
Experiments on four unsupervised and two semi-supervised domain adaptation scenarios show that cross-modal learning outperforms uni-modal adaptation baselines and is complementary to learning with pseudo-labels.
We think that cross-modal learning could generalize to many tasks that involve multi-modal input data and is constrained neither to domain adaptation tasks nor to image and point-cloud modalities.
In the following we provide details about the dataset splits and additional qualitative results.
Appendix A Dataset Splits
A.1 nuScenes (UDA)
The nuScenes dataset consists of 1000 driving scenes of 20 seconds each, which corresponds to 40k annotated keyframes taken at 2Hz. The scenes are split into train (28,130 keyframes), validation (6,019 keyframes) and a hidden test set. The point-wise 3D semantic labels are obtained from 3D boxes, as in SqueezeSeg. We propose the following splits for domain adaptation, with the respective source/target domains: Day/Night and Boston/Singapore. To this end, we use the official validation split as test set and divide the training set into train/val for the target set. As the number of object instances in the target split can be very small (e.g. for night), we merge the objects into 5 categories: vehicle (car, truck, bus, trailer, construction vehicle), pedestrian, bike (motorcycle, bicycle), traffic boundary (traffic cone, barrier) and background.
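The category merge described above can be written as a simple lookup. The sketch below uses the class names exactly as they appear in the text, which is an assumption: the identifiers in the actual nuScenes annotation files may be spelled differently. The helper names `merge_class` and `count_merged` are hypothetical.

```python
from collections import Counter

# Fine-grained object class -> merged category, as listed in the text.
NUSCENES_MERGE = {
    "car": "vehicle",
    "truck": "vehicle",
    "bus": "vehicle",
    "trailer": "vehicle",
    "construction vehicle": "vehicle",
    "pedestrian": "pedestrian",
    "motorcycle": "bike",
    "bicycle": "bike",
    "traffic cone": "traffic boundary",
    "barrier": "traffic boundary",
}


def merge_class(name):
    """Return the merged category; anything not listed falls into background."""
    return NUSCENES_MERGE.get(name, "background")


def count_merged(object_classes):
    """Count instances per merged category, e.g. to inspect a small target split."""
    return Counter(merge_class(name) for name in object_classes)


print(count_merged(["car", "bus", "bicycle", "barrier", "car"]))
```

Counting instances after the merge is one way to check that every category is still represented in a small split such as night.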
A.2 VirtualKITTI/SemanticKITTI (UDA)
VirtualKITTI (v.1.3.1) consists of 5 driving scenes, which were created with the Unity game engine by real-to-virtual cloning of scenes 1, 2, 6, 18 and 20 of the real KITTI dataset, i.e. bounding box annotations of the real dataset were used to place cars in the virtual world. Unlike real KITTI, VirtualKITTI does not simulate LiDAR, but rather provides a dense depth map, alongside semantic, instance and flow ground truth. Each of the 5 scenes contains between 233 and 837 frames, for a total of 2126 frames. Each frame is rendered with 6 different weather/lighting variants (clone, morning, sunset, overcast, fog, rain), all of which we use. Note that we do not use the renderings with different horizontal rotations. We use the whole VirtualKITTI dataset as source training set.
The SemanticKITTI dataset provides 3D point cloud labels for the Odometry dataset of KITTI, which features a large-angle front camera and a 64-layer LiDAR. The annotation of the 28 classes has been carried out directly in 3D.
We use scenes as train set, as validation and as test set.
We select 6 shared classes between the 2 datasets by merging or ignoring them (see Tab. VI). The 6 final classes are vegetation_terrain, building, road, object, truck, car.
|class VirtualKITTI||mapped class||class SemanticKITTI||mapped class|
A.3 A2D2/SemanticKITTI (UDA+SSDA)
The A2D2 dataset features 20 drives, which correspond to 28,637 frames. The point cloud comes from three 16-layer front LiDARs (left, center, right), where the left and right front LiDARs are inclined. The semantic labeling was carried out in the 2D image for 38 classes, and we compute the 3D labels by projecting the point cloud into the labeled image. We keep scene 20180807_145028 as test set and use the rest for training.
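Computing 3D labels by projection can be illustrated with a pinhole camera model. This is a minimal sketch under stated assumptions, not the A2D2 tooling: points are assumed to be already expressed in the camera frame (x right, y down, z forward), lens distortion is ignored, and the intrinsics `fx`, `fy`, `cx`, `cy` as well as the function name `project_labels` are hypothetical.

```python
import math


def project_labels(points_cam, label_image, fx, fy, cx, cy):
    """Assign each 3D point the label of the pixel it projects to.

    points_cam:  list of (x, y, z) tuples in the camera frame.
    label_image: 2D list indexed as label_image[row][column].
    Returns one label per point, or None if the point is not visible.
    """
    height = len(label_image)
    width = len(label_image[0])
    labels = []
    for x, y, z in points_cam:
        if z <= 0:
            # Points behind the camera cannot be projected.
            labels.append(None)
            continue
        u = math.floor(fx * x / z + cx)  # pixel column
        v = math.floor(fy * y / z + cy)  # pixel row
        if 0 <= u < width and 0 <= v < height:
            labels.append(label_image[v][u])
        else:
            # Points outside the image get no label.
            labels.append(None)
    return labels


# Toy 4x4 label image: left half road, right half car.
image = [["road", "road", "car", "car"] for _ in range(4)]
points = [(0.0, 0.0, 1.0), (-0.5, 0.0, 1.0), (0.0, 0.0, -1.0), (5.0, 0.0, 1.0)]
print(project_labels(points, image, fx=2.0, fy=2.0, cx=2.0, cy=2.0))
```

Points that fall outside the image or behind the camera receive no label, which matches the fact that only points visible in the labeled image can be annotated this way.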
Please refer to Sec. A.2 for details on SemanticKITTI. For UDA, we use the same split as in VirtualKITTI/SemanticKITTI, i.e. scenes as train set, as validation and as test set. For SSDA, we use the scenes as labeled train set , as unlabeled train set , as validation and as test set.
We select 10 shared classes between the 2 datasets by merging or ignoring them (see Tab. VII). The 10 final classes are car, truck, bike, person, road, parking, sidewalk, building, nature, other-objects.
|A2D2 class||mapped class||SemanticKITTI class||mapped class|
|Small vehicles 1||bike||sidewalk||sidewalk|
|Small vehicles 2||bike||other-ground||ignore|
|Small vehicles 3||bike||building||building|
|Traffic signal 1||other-objects||fence||other-objects|
|Traffic signal 2||other-objects||other-structure||ignore|
|Traffic signal 3||other-objects||lane-marking||road|
|Traffic sign 1||other-objects||vegetation||nature|
|Traffic sign 2||other-objects||trunk||nature|
|Traffic sign 3||other-objects||terrain||nature|
|Utility vehicle 1||ignore||pole||other-objects|
|Utility vehicle 2||ignore||traffic-sign||other-objects|
|Obstacles / trash||other-objects|
|RD restricted area||road|
|Slow drive area||road|
|Painted driv. instr.||road|
|Traffic guide obj.||other-objects|
|RD normal street||road|
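As an illustration of how such a mapping table could be applied, the sketch below remaps SemanticKITTI class names to indices of the 10 final classes. It only contains the SemanticKITTI rows visible in the table above, so it is incomplete by construction. The index order (taken from the order in which the final classes are listed in the text), the ignore value `-100` and the function name `remap_labels` are assumptions.

```python
# Final classes in the order given in the text; using this order as the index is an assumption.
FINAL_CLASSES = [
    "car", "truck", "bike", "person", "road",
    "parking", "sidewalk", "building", "nature", "other-objects",
]

# Assumed marker for points that are excluded from training and evaluation.
IGNORE_INDEX = -100

# Only the SemanticKITTI rows that appear in the table above.
SEMANTICKITTI_TO_MERGED = {
    "sidewalk": "sidewalk",
    "other-ground": "ignore",
    "building": "building",
    "fence": "other-objects",
    "other-structure": "ignore",
    "lane-marking": "road",
    "vegetation": "nature",
    "trunk": "nature",
    "terrain": "nature",
    "pole": "other-objects",
    "traffic-sign": "other-objects",
}


def remap_labels(class_names, mapping, ignore_index=IGNORE_INDEX):
    """Map original class names to final class indices or the ignore index."""
    class_to_index = {name: i for i, name in enumerate(FINAL_CLASSES)}
    remapped = []
    for name in class_names:
        merged = mapping[name]  # raises KeyError for classes missing from the mapping
        remapped.append(ignore_index if merged == "ignore" else class_to_index[merged])
    return remapped


print(remap_labels(["sidewalk", "other-ground", "trunk"], SEMANTICKITTI_TO_MERGED))
```

Raising an error for unmapped classes, rather than silently ignoring them, makes gaps in the mapping visible early.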
A.4 Waymo OD (SSDA)
The Waymo Open Dataset (v.1.2.0) provides 1150 scenes of 20s each. For simplicity and consistency with the other UDA scenarios, we only use the top LiDAR, but not the 4 side LiDARs, and only the front camera, but not the 4 side cameras. Similar to nuScenes (Sec. A.1), we obtain segmentation labels from 3D bounding boxes.
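Deriving point-wise labels from 3D bounding boxes amounts to a point-in-box test. The following is a minimal sketch, assuming each box is given by a center, a size (length, width, height) and a yaw angle around the vertical axis; the dictionary layout, the default background name and the function names are hypothetical and do not reflect the dataset's actual API.

```python
import math


def point_in_box(point, center, size, yaw):
    """Check whether a 3D point lies inside a box rotated by `yaw` around the z axis."""
    dx = point[0] - center[0]
    dy = point[1] - center[1]
    dz = point[2] - center[2]
    # Rotate the offset into the box frame (inverse rotation by yaw).
    c, s = math.cos(yaw), math.sin(yaw)
    local_x = c * dx + s * dy
    local_y = -s * dx + c * dy
    length, width, height = size
    return (
        abs(local_x) <= length / 2
        and abs(local_y) <= width / 2
        and abs(dz) <= height / 2
    )


def label_points(points, boxes, background="background"):
    """Give each point the label of the first box containing it, else the background label."""
    labels = []
    for point in points:
        label = background
        for box in boxes:
            if point_in_box(point, box["center"], box["size"], box["yaw"]):
                label = box["label"]
                break
        labels.append(label)
    return labels


boxes = [{"center": (10.0, 0.0, 1.0), "size": (4.0, 2.0, 2.0), "yaw": 0.0, "label": "vehicle"}]
points = [(11.0, 0.5, 1.0), (13.0, 0.0, 1.0)]
print(label_points(points, boxes))
```

Points that fall in no box keep the background label; the name of that class depends on the dataset and is passed in as a parameter here.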
There is a main dataset, which we use as source dataset, and a partially labeled domain adaptation dataset, of which we use the labeled part as labeled target set and the unlabeled part as unlabeled target set.
We ignore the cyclist class, because there are no cyclist labels available in the target data, i.e. we only keep the classes vehicle, pedestrian, sign, unknown.
Appendix B Additional Qualitative Results (UDA)
We provide additional qualitative results for UDA for the scenarios nuScenes: Day/Night and A2D2/SemanticKITTI in Fig. 16, where we show the output of the 2D and 3D stream individually to illustrate their respective strengths and weaknesses, e.g. that 3D works much better at night.
- (2021) Self-supervised learning for domain adaptation on point clouds. In WACV.
- (2020) Self-supervised learning by cross-modal audio-video clustering. In NeurIPS.
- (2017) Fast generalized distillation for semi-supervised domain adaptation. In AAAI.
- (2019) SemanticKITTI: a dataset for semantic scene understanding of LiDAR sequences. In ICCV.
- (2020) nuScenes: a multimodal dataset for autonomous driving. In CVPR.
- (2016) Learning aligned cross-modal representations from weakly aligned data. In CVPR.
- 4D spatio temporal convnet: minkowski convolutional neural networks. In CVPR.
- The cityscapes dataset for semantic urban scene understanding. In CVPR.
- (1995) Support-vector networks. Machine learning.
- (2017) Scannet: richly-annotated 3d reconstructions of indoor scenes. In CVPR.
- (2013) Semi-supervised domain adaptation with instance constraints. In CVPR.
- (2016) Virtual worlds as proxy for multi-object tracking analysis. In CVPR.
- (2012) Are we ready for autonomous driving? the KITTI vision benchmark suite. In CVPR.
- (2019) A2D2: AEV autonomous driving dataset. Audi Electronics Venture GmbH. Note: http://www.a2d2.audi
- (2018) 3D semantic segmentation with submanifold sparse convolutional networks. In CVPR.
- (2016) Fusenet: incorporating depth into semantic segmentation via fusion-based cnn architecture. In ACCV.
- (2016) Deep residual learning for image recognition. In CVPR.
- (2014) Distilling the knowledge in a neural network. In NIPS Workshops.
- (2018) CyCADA: cycle-consistent adversarial domain adaptation. In ICML.
- (2016) FCNs in the wild: pixel-level adversarial and constraint-based adaptation. arXiv:1612.02649.
- (2020) XMUDA: cross-modal unsupervised domain adaptation for 3d semantic segmentation. In CVPR.
- (2013) Pseudo-label: the simple and efficient semi-supervised learning method for deep neural networks. In ICML Workshops.
- (2019) Spigan: privileged adversarial learning from simulation. In ICLR.
- (2019) Bidirectional learning for domain adaptation of semantic segmentation. In CVPR.
- (2019) Multi-task multi-sensor fusion for 3D object detection. In CVPR.
- (2018) Deep continuous fusion for multi-sensor 3d object detection. In ECCV.
- (1982) Geometric modeling using octree encoding. Computer graphics and image processing 19 (2), pp. 129–147.
- (2019) Sensor fusion for joint 3d object detection and semantic segmentation. In CVPR Workshop.
- (2018) Minimal-entropy correlation alignment for unsupervised deep domain adaptation. In ICLR.
- (2017) PointNet++: deep hierarchical feature learning on point sets in a metric space. In NeurIPS.
- (2019) PointDAN: a multi-scale 3d domain adaption network for point cloud representation. In Advances in Neural Information Processing Systems, pp. 7192–7203.
- Octnet: learning deep 3d representations at high resolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3577–3586.
- (2015) U-net: convolutional networks for biomedical image segmentation. In MICCAI.
- (2019) Semi-supervised domain adaptation via minimax entropy. In CVPR.
- (2020) ESL: entropy-guided self-supervised learning for domain adaptation in semantic segmentation. In CVPR Workshop.
- (2018) Cross and learn: cross-modal self-supervision. In German Conference on Pattern Recognition.
- (2020) Scalability in perception for autonomous driving: waymo open dataset. In CVPR.
- (2017) Octree generating networks: efficient convolutional architectures for high-resolution 3d outputs. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2088–2096.
- (2019) KPConv: flexible and deformable convolution for point clouds. In ICCV.
- (2018) Learning to adapt structured output space for semantic segmentation. In CVPR.
- (2019) Self-supervised model adaptation for multimodal semantic segmentation. IJCV.
- (2019) Advent: adversarial entropy minimization for domain adaptation in semantic segmentation. In CVPR.
- (2019) DADA: depth-aware domain adaptation in semantic segmentation. In ICCV.
- (2018) Deep parametric continuous convolutional neural networks. In CVPR.
- (2020) Alleviating semantic-level shift: a semi-supervised domain adaptation method for semantic segmentation. In CVPR Workshops.
- (2018) Squeezeseg: convolutional neural nets with recurrent crf for real-time road-object segmentation from 3d LiDAR point cloud. In ICRA.
- (2019) Squeezesegv2: improved model structure and unsupervised domain adaptation for road-object segmentation from a lidar point cloud. In ICRA.
- (2015) Semi-supervised domain adaptation with subspace learning for visual recognition. In CVPR.
- (2020) Complete & label: a domain adaptation approach to semantic segmentation of lidar point clouds. arXiv preprint arXiv:2007.08488.
- (2018) Deep mutual learning. In CVPR.
- (2019) Confidence regularized self-training. In ICCV.