CALI: Coarse-to-Fine ALIgnments Based Unsupervised Domain Adaptation of Traversability Prediction for Deployable Autonomous Navigation

by   Zheng Chen, et al.
Indiana University

Traversability prediction is a fundamental perception capability for autonomous navigation. The diversity of data across domains creates significant performance gaps for perception models. In this work, we aim to reduce these gaps by proposing a novel coarse-to-fine unsupervised domain adaptation (UDA) model, CALI. Our goal is to transfer the perception model with high data efficiency, eliminate prohibitively expensive data labeling, and improve generalization during adaptation from easy-to-obtain source domains to various challenging target domains. We prove that a coarse alignment and a fine alignment can be mutually beneficial, and accordingly design a first-coarse-then-fine alignment process. The proposed work bridges theoretical analysis and algorithm design, leading to an efficient UDA model that is easy and stable to train. We show the advantages of our proposed model over multiple baselines in several challenging domain adaptation setups. To further validate its effectiveness, we combine our perception model with a visual planner to build a navigation system and show the high reliability of our model in complex natural environments where no labeled data is available.





I Introduction

We consider the deployment of autonomous robots in real-world unstructured field environments, which can be extremely complex, involving random obstacles (e.g., big rocks, tree stumps, man-made objects), cross-domain terrains (e.g., combinations of gravel, sand, and wet, uneven surfaces), as well as dense vegetation (tall and low grasses, shrubs, trees). Whenever a robot is deployed in such an environment, it needs to understand which area of the captured scene is navigable. A typical solution to this problem is visual traversability prediction, which can be achieved by learning scene semantic segmentation.

Visual traversability prediction has been tackled with deep neural networks, where models are typically trained offline on well-labeled datasets covering limited scenarios. However, there is a gap between the data used to train the model and the real world. It is usually difficult for existing datasets to approximate the true distributions of the unseen target environments where the robot is deployed. Even incrementally collecting and adding new training data on the fly cannot guarantee that the target environments are well covered in-distribution. In addition, annotating labels for dense predictions, e.g., semantic segmentation, is prohibitively expensive. Therefore, developing a generalization-aware deep model is crucial for robotic systems, considering the demands of practical deployment of deep perception models and the costs/limits of collecting new data in many robotic applications, e.g., autonomous driving, search and rescue, and environmental monitoring.

Fig. 1: Transferring models from the available domain to the target domain. The existing available data might come from either a simulator or data collected in certain environments, at a certain time, and with certain sensors. In contrast, the target deployment might involve significantly different environments, times, and sensors.

To tackle this challenge, a broadly studied framework is transfer learning [24], which aims to transfer models between two domains, a source domain and a target domain, that have related but different data distributions. Prediction on the target domain can be considered a strong form of generalization, since testing data (in the target domain) might fall outside the independent and identically distributed (i.i.d.) assumption and follow a very different distribution than the training data (in the source domain). The "transfer" process is significant for our model development since we can view available public datasets [29, 8, 35, 17] as the source domain and treat the data in the to-be-deployed environments as the target domain. In this case, we have access to images and corresponding labels in the source domain and images in the target domain, but no access to labels in the target domain. Transferring models in this setup is called Unsupervised Domain Adaptation (UDA) [36, 40].

Domain Alignment (DA) [10, 13, 12, 32, 33] and Class Alignment (CA) [31] are two conventional ways to tackle the UDA problem. DA treats the deep features as a whole. It works well for image-level tasks such as image classification, but has issues with pixel-level tasks such as semantic segmentation: aligning whole distributions ignores class-level features and might misalign class distributions, even when the whole features from the source domain and target domain are already well aligned. CA is proposed to solve this issue for dense predictions with multiple classes.

It is natural and necessary to use CA to tackle UDA of semantic segmentation since we need to align class features. However, CA can be problematic: it might fail to outperform DA for segmentation and, in the worst case, might suffer unacceptable negative transfer, meaning the performance with adaptation is even worse than without adaptation. Intuitively, CA must consider more alignments than DA, so the search space is more complicated, and training might be more unstable and harder to converge to an expected minimum, leading to larger prediction errors.

To solve this issue of CA, we investigate the relationship between the upper bounds of the target-domain prediction error under DA and CA, provide a theoretical analysis of these upper bounds in the UDA setup, and bridge the theoretical analysis and algorithm design for UDA of traversability prediction.

In summary, our contributions include

  • We prove that with proper assumptions, the upper bound of CA is upper bounded by the upper bound of DA. This indicates that constraining the training of CA using DA can be beneficial. We then propose a novel concept of pseudo-trilateral game structure (PTGS) for integrating DA and CA.

  • We propose an efficient coarse-to-fine alignments based UDA model, named CALI, for traversability prediction. The new proposal includes a trilateral network structure, novel training losses, and an alternative training process. Our model design is well supported by theoretical analysis. It is also easy and stable to train and converge.

  • We show significant advantages of our proposed model compared to several baselines in multiple challenging public datasets and one self-collected dataset. We combine the proposed segmentation model and a visual planner to build a visual navigation system. The results show high safety and effectiveness of our model.

II Related Work

Semantic Segmentation: Semantic segmentation aims to predict a unique human-defined semantic class for each pixel in a given image. With the prosperity of deep neural networks, the performance of semantic segmentation has been boosted significantly, especially by the advent of FCN [20], which first proposed to use deep convolutional neural nets to predict segmentation. Subsequent works improve on FCN through multiple proposals, e.g., using different kernel sizes or dilation rates to aggregate multi-scale features [6, 7, 38]; building image pyramids to create multi-resolution inputs [41]; applying probabilistic graphs to smooth the prediction [19]; compensating features at deeper levels with an encoder-decoder structure [30]; and employing attention mechanisms to capture long-range dependencies among pixels in a more compact and efficient way [28]. The strength of current semantic segmentation SOTA performance is also evident in very recently released work [42, 37]. However, all of these methods are fully supervised, and their performance might degrade catastrophically when a domain shift exists between the training data and the data seen at deployment. Considering possible domain shift and developing adaptation-aware models is therefore extremely practical and urgent.

Unsupervised Domain Adaptation: The main approaches to UDA include adversarial training (a.k.a. distribution alignment) [10, 13, 12, 32, 31, 33, 21, 34] and self-training [43, 39, 23, 16]. Although self-training is becoming another dominant method for segmentation UDA in terms of empirical results, it still lacks a sound theoretical foundation. In this paper, we focus only on alignment-based methods, which not only remain close to the UDA state-of-the-art (SOTA) performance but are also well supported by sound theoretical analyses [2, 3, 1].

The alignment-based methods adapt models by aligning the distributions of the source domain and target domain in an adversarial training process, i.e., making the deep features of source images and target images indistinguishable to a discriminator net. Typical approaches include Domain Alignment (DA) [10, 13, 12, 32, 33], which aligns the two domains using global features (aligning the feature tensor from source or target as a whole), and Class Alignment (CA) [31, 21, 34], which only considers aligning the features of each class from source and target, no matter whether the domain distributions are aligned or not. In [31], the authors are inspired by the theoretical analysis of [1] and propose a discrepancy-based model for aligning class features, with a clear relation between the theory [1] and the design of the network, loss, and training method. Some recent works [21, 34] are similar in spirit to the proposed work and show improved results compared to [31], but it remains unclear how to relate those algorithms to theory and to understand why their structures/losses/training are designed as presented.

III Background and Preliminary Materials

We consider segmentation tasks where the input space is X, representing the input RGB images, and the label space is Y, representing the ground-truth K-class segmentation images, where the label for a single pixel is denoted by a one-hot vector whose elements are 0-valued by default except at the location of the true class, which is labeled as 1. Domain adaptation has two domain distributions over X × Y, named the source domain D_S and the target domain D_T. In the setting of UDA for segmentation, we have access to i.i.d. samples with labels from D_S and i.i.d. samples without labels from D_T.

In the UDA problem, we need to reduce the prediction error on the target domain. A hypothesis is a function h: X → Y. We denote the space of h as H. With a loss function ℓ(·, ·), the expected error of h on D_S is defined as

ε_S(h) = E_{(x,y)∼D_S}[ℓ(h(x), y)].   (1)

Similarly, we can define the expected error of h on D_T as

ε_T(h) = E_{(x,y)∼D_T}[ℓ(h(x), y)].   (2)
Two important upper bounds related to the source and target error are given in [1].

Theorem 1 For a hypothesis h,

ε_T(h) ≤ ε_S(h) + d_1(D_S, D_T) + C,   (3)

where d_1(·, ·) is the L1 divergence between two distributions, and the constant term C does not depend on any h. However, it is claimed in [1] that the bound with the L1 divergence cannot be accurately estimated from finite samples, and that using the L1 divergence can unnecessarily inflate the bound. Another divergence measure is thus introduced to replace the L1 divergence, and a new bound is derived.

Definition 1 Given two domain distributions D_S and D_T over X, and a hypothesis space H_d of binary domain classifiers with finite VC dimension, the H_d-divergence between D_S and D_T is defined as

d_{H_d}(D_S, D_T) = 2 sup_{η∈H_d} | Pr_{x∼D_S}[η(x) = 1] − Pr_{x∼D_T}[η(x) = 1] |,   (4)

where Pr_{x∼D_S}[η(x) = 1] represents the probability of a sample from D_S being classified as 1 by η, and similarly for Pr_{x∼D_T}[η(x) = 1].

The H_d-divergence resolves the issues of the L1 divergence. If we replace d_1 in Eq. (3) with d_{H_d}, then a new upper bound for ε_T(h), named B1, can be written as

B1: ε_T(h) ≤ ε_S(h) + d_{H_d}(D_S, D_T) + C.   (5)


An approach to compute the empirical H_d-divergence is also proposed in [1].

Lemma 1 For a symmetric hypothesis class H_d (one where for every η ∈ H_d, the inverse hypothesis 1 − η is also in H_d) and two sample sets S ∼ (D_S)^n and T ∼ (D_T)^n,

d̂_{H_d}(S, T) = 2 ( 1 − min_{η∈H_d} [ (1/n) Σ_{x∈S} I[η(x) = 0] + (1/n) Σ_{x∈T} I[η(x) = 1] ] ),   (6)

where I[a] is an indicator function that is 1 if a is true, and 0 otherwise.
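To make Lemma 1 concrete, the empirical divergence can be evaluated for a single trained domain classifier η, which gives a lower bound on the quantity attained by the minimum over H_d. Below is a minimal Python sketch, assuming the convention that η should output 1 on source samples and 0 on target samples:

```python
def empirical_h_divergence(eta_on_source, eta_on_target):
    """Empirical H_d-divergence of Lemma 1, evaluated for a single binary
    domain classifier eta (hence a lower bound on the sup over H_d).
    Inputs are lists of 0/1 predictions; the assumed labeling convention
    is 1 for source samples and 0 for target samples."""
    n, m = len(eta_on_source), len(eta_on_target)
    # classifier error: source samples labeled 0 plus target samples labeled 1
    err = (sum(1 for p in eta_on_source if p == 0) / n
           + sum(1 for p in eta_on_target if p == 1) / m)
    return 2.0 * (1.0 - err)

# a perfect discriminator implies maximally separated domains
print(empirical_h_divergence([1, 1, 1, 1], [0, 0, 0, 0]))  # 2.0
# a discriminator at chance level implies indistinguishable domains
print(empirical_h_divergence([1, 0], [1, 0]))  # 0.0
```

A perfect discriminator yields the maximum value of 2 (well-separated domains), while a discriminator at chance yields 0, matching the intuition that well-aligned features leave the discriminator no better than chance.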

The second upper bound is based on a new hypothesis space called the symmetric difference hypothesis space.

Definition 2 For a hypothesis space H, the symmetric difference hypothesis space HΔH is the set of hypotheses

HΔH = { g : g(x) = h(x) ⊕ h′(x), h, h′ ∈ H },   (7)

where ⊕ denotes an XOR operation. Then we can define the HΔH-distance as

d_{HΔH}(D_S, D_T) = 2 sup_{h,h′∈H} | Pr_{x∼D_S}[h(x) ≠ h′(x)] − Pr_{x∼D_T}[h(x) ≠ h′(x)] |.   (8)

Similar to Eq. (5), if we replace d_1 with d_{HΔH}, the second upper bound for ε_T(h), named B2, can be expressed as

B2: ε_T(h) ≤ ε_S(h) + (1/2) d_{HΔH}(D_S, D_T) + C,   (9)

where C is the same constant term as in Eq. (3).

A standard way to achieve the alignment for deep models is to use the adversarial training method, which is also used in Generative Adversarial Networks (GANs). Therefore we explain the key concepts of adversarial training using the example of GANs.

GAN is proposed to learn the distribution of a set of given data in an adversarial manner. The architecture consists of two networks: a generator G and a discriminator D. The generator G is responsible for generating fake data (with distribution p_g) from random noise z to fool the discriminator D, which instead tries to accurately distinguish between the fake data and the given data (with distribution p_data). Optimization of a GAN involves a minimax game over a joint loss for G and D:

min_G max_D V(G, D) = E_{x∼p_data}[log D(x)] + E_{z∼p_z}[log(1 − D(G(z)))],   (10)

where we use 1 as the real label and 0 as the fake label. Training with Eq. (10) is a bilateral game in which the generated distribution p_g is aligned with the data distribution p_data.
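For illustration, the value of the GAN objective in Eq. (10) can be evaluated directly from discriminator outputs; this is a plain-Python sketch for intuition, not part of the original paper:

```python
import math

def gan_value(d_real, d_fake):
    """Value of the minimax objective in Eq. (10), given discriminator
    outputs d_real = D(x) on real data and d_fake = D(G(z)) on fakes,
    all in (0, 1). D ascends this value while G descends it."""
    real_term = sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term
```

When D outputs 0.5 everywhere (it cannot tell real from fake), the value is 2 log(0.5) = −log 4, the well-known equilibrium value of the minimax game.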

The two bounds (Eq. (5) and Eq. (9)) for the target domain error are separately given in [1]. It has been independently demonstrated that domain alignment corresponds to optimizing over B1 [10], where optimization over the upper bound (Eq. (5) with the divergence of Eq. (6)) is proved to be equivalent to Eq. (10) combined with supervised learning in the source domain, and that class alignment corresponds to optimizing over B2 [31], where the d_{HΔH} term is approximated by the discrepancy between two different classifiers.

Training DA is straightforward since we can easily define binary labels for each domain, e.g., 1 as the source domain label and 0 as the target domain label; adversarial training over the domain labels achieves domain alignment. CA, however, is difficult to implement: we have no target labels, so the target class features are completely unknown to us, making naive adversarial training over each class impossible. The existing theoretically supported way to perform CA [31] is to indirectly align class features by devising two different classifier hypotheses. The two classifiers have to be well trained on the source domain and able to classify different classes in the source domain with different decision boundaries. Then, considering the shift between the source and target domains, the two trained classifiers might disagree on target domain classes. Since the two classifiers are already well trained on the source domain, their agreements on the target domain correspond to features that are close to the source domain, while disagreements indicate a large shift between source and target. We use the disagreements to approximate the distance between source and target: if we minimize the disagreements of the two classifiers, the features of each class are enforced to be well aligned between source and target.

Fig. 2: An ideal iterative training process by integrating DA and CA.

IV Methodology

In this work we further investigate the relation between B1 and B2 and prove that B1 is an upper bound of B2, meaning DA can serve as a necessary constraint on CA. This is also consistent with our intuition: DA aligns features globally in a coarse way while CA aligns features locally in a finer way, so constraining CA with DA is effectively a coarse-to-fine process, which makes the alignment efficient and stable. By carefully studying the internal structure of existing DA and CA work, we propose a novel concept, the pseudo-trilateral game structure (PTGS), for efficiently integrating DA and CA. We follow our theoretical analysis and the proposed PTGS to guide the development of CALI, including the designs of the model structure, losses, and training process.

Fig. 3: Pseudo-trilateral game structure (PTGS). Three players are in the game: a feature extractor F, a domain discriminator D, and a family of classifiers C. The game between C and F is the CA, while the game between D and F is the DA. The DA and CA are connected by sharing the same feature extractor F. Both D and C try to adjust F such that the features generated by F are well aligned between source and target, both globally and locally.

The notation used in this paper is as follows. We denote the segmentation model as h = C ∘ F, which consists of a feature extractor F parameterized by θ_f and a classifier C parameterized by θ_c, and x is a sample from D_S or D_T. If multiple classifiers are used, we denote the i-th classifier as C_i. We denote the discriminator as D, parameterized by θ_d.

IV-A Bounds Relation

We start by examining the relationship between the DA and the CA from the perspective of target error bound. We propose to use this relation to improve the segmentation performance of class alignment, which is desired for dense prediction tasks. We provide the following theorem:

Theorem 2 If we assume there is a hypothesis space H for segmentation models h and a hypothesis space H_d for domain classifiers η such that HΔH ⊆ H_d, then we have

B2 ≤ B1.   (11)

The proof of this theorem is provided in Appendix VI-A.

Essentially, we limit the hypothesis spaces H and H_d in Eq. (11) to the space of deep neural networks. Directly optimizing over B2 might be hard to converge since B2 is a tighter upper bound on the prediction error of the target domain. The bounds relation in Eq. (11) shows that B1 is an upper bound of B2. This gives us a clue for improving the training process of class alignment: domain alignment can serve as a global constraint and narrow down the search space for class alignment. It also implies that integrating domain alignment and class alignment might boost the training efficiency as well as the prediction performance of UDA. An ideal training process is illustrated in Fig. 2, where the search space of B2 (CA) is constantly bounded by that of B1 (DA), ensuring the whole training process converges stably. This inspires us to design a new model; we next explain in detail our model structure, losses, and training process.

IV-B CALI Structure

The existing DA or CA works usually involve a bilateral game. In CA, the game is between a feature extractor and a family of classifiers; the two players (the two classifiers together on one side, the feature extractor on the other) are optimized over the discrepancy of the two classifiers in opposite directions. In DA, the game is between a segmentation net and a domain discriminator, optimized over the domain discrimination in opposite directions. It has been empirically shown [33, 32] that DA performs well if the domain alignment is applied to the prediction probability (after Softmax). However, according to the relation identified in Eq. (11), the two upper bounds B1 and B2 need to use the same features, hence we connect the domain alignment and class alignment using a shared feature extractor and propose a novel concept called PTGS (see Fig. 3) to illustrate an interesting structure for integrating DA and CA. Both D and C play a game with F, but there is no game between D and C, hence we call this a pseudo-trilateral game. Furthermore, as defined in Eq. (8), h and h′ are two different hypotheses, thus we have to ensure the classifiers in C remain different during training.

Following the concept of PTGS, we design the structure of our CALI model as shown in Fig. 4. Four networks are involved: a shared feature extractor F, a domain discriminator D, and two classifiers C_1 and C_2. F(x) represents the shared features; P_1 and P_2 are the probability/class predictions of C_1 and C_2, respectively; 1 and 0 represent the source domain label and target domain label; and d(·, ·) represents the distance measure between two probability distributions. The one-way solid arrows indicate the forward propagation of the data flow while the two-way dashed arrows indicate where losses are generated. The red arrows represent the source-related data while the blue ones represent the target-related data. The orange two-way dashed line indicates the structural regularization loss between C_1 and C_2.

Fig. 4: CALI network structure. See Section IV-B for more details.

IV-C CALI Losses

We denote raw images from the source or target domain as x_s and x_t, and the label from the source domain as y_s. We use semantic labels in the source domain to train all of the nets except the domain discriminator in a supervised way (see the solid red one-way arrow in Fig. 4). We need to minimize the supervised segmentation loss since Eq. (11) and the related equations suggest that the source prediction error is also part of the upper bound of the target error. The supervised segmentation loss for training CALI is defined as

L_seg(F, C_1, C_2) = − E_{(x_s, y_s)∼D_S} Σ_{h,w} Σ_k ( y_s ⊙ log P_s )^{(h,w,k)},   (12)

where P_s is the predicted probability map and ⊙ represents the element-wise multiplication between two tensors.
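The supervised loss of Eq. (12) is a standard pixel-wise cross-entropy with one-hot labels; the following is a small reference sketch over flattened pixels, for clarity only (not the paper's code):

```python
import math

def seg_loss(prob_maps, onehot_labels):
    """Pixel-wise cross-entropy of Eq. (12): the per-pixel sum over
    classes of y * log(p), averaged over pixels. prob_maps and
    onehot_labels are lists of per-pixel probability / one-hot vectors."""
    total = 0.0
    for p, y in zip(prob_maps, onehot_labels):
        # the y > 0 guard skips zero entries of the one-hot vector
        total -= sum(yk * math.log(pk) for pk, yk in zip(p, y) if yk > 0)
    return total / len(prob_maps)
```

For a perfectly confident correct prediction the loss is 0; for a uniform two-class prediction against a one-hot label it is log 2.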

To perform domain alignment, we need to define the joint loss function for F and D:

min_{θ_f} max_{θ_d} L_adv(F, D) = L_d^s + L_d^t,   (13)

where no segmentation labels but domain labels are used, and we use the standard cross-entropy to compute the domain classification loss for both the source data (L_d^s) and the target data (L_d^t). We have

L_d^s = E_{x_s∼D_S}[ log D(F(x_s)) ],   (14)

L_d^t = E_{x_t∼D_T}[ log(1 − D(F(x_t))) ].   (15)

Note we include F in Eq. (14) since both the source data and target data are passed through the feature extractor. This is different from a standard GAN, where the real data is fed directly to D without passing through the generator.
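The domain-alignment objective mirrors the GAN loss, except that both data streams pass through the feature extractor, as noted above. A hedged sketch with scalar discriminator outputs:

```python
import math

def domain_alignment_value(d_src, d_tgt):
    """Joint domain-classification objective: cross-entropy with source
    label 1 and target label 0. d_src = D(F(x_s)) and d_tgt = D(F(x_t))
    are discriminator outputs in (0, 1). D ascends this value while F
    descends it, pushing source and target features to be
    indistinguishable."""
    src_term = sum(math.log(p) for p in d_src) / len(d_src)
    tgt_term = sum(math.log(1.0 - p) for p in d_tgt) / len(d_tgt)
    return src_term + tgt_term
```

At the adversarial equilibrium D outputs 0.5 on both streams and the value is −log 4, exactly as in the GAN case.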

To perform class alignment, we need to define the joint loss function for F, C_1, and C_2:

min_{θ_f} max_{θ_{c_1}, θ_{c_2}} L_dis(F, C_1, C_2) = E_{x_t∼D_T}[ d(C_1(F(x_t)), C_2(F(x_t))) ],   (16)

where d(·, ·) is the distance measure between the two distributions from the two classifiers. In this paper, we use the same distance as in [31], thus d(p, q) = (1/K) Σ_{k=1}^{K} |p_k − q_k|, where p and q are two distributions and K is the number of label classes.
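The per-pixel distance between the two classifiers' outputs, as used in [31] and in Eq. (16), reduces to a mean absolute difference over class probabilities; a minimal sketch:

```python
def discrepancy(p, q):
    """L1 discrepancy d(p, q) = (1/K) * sum_k |p_k - q_k| between the
    per-class probability vectors of the two classifiers."""
    assert len(p) == len(q)
    return sum(abs(a - b) for a, b in zip(p, q)) / len(p)
```

Identical predictions give 0, and completely contradictory one-hot predictions over K classes give the maximum value 2/K.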

To prevent C_1 and C_2 from converging to the same network throughout the training, we use the cosine similarity as a weight regularization to maximize the difference of the weights of C_1 and C_2, i.e.,

L_w(C_1, C_2) = (w_1 · w_2) / (‖w_1‖ ‖w_2‖),   (17)

where w_1 and w_2 are the flattened weight vectors of C_1 and C_2, respectively.
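The regularizer of Eq. (17) is the plain cosine similarity between the two classifiers' flattened weight vectors; minimizing it keeps C_1 and C_2 from collapsing into the same function. A sketch:

```python
import math

def weight_cosine(w1, w2):
    """Cosine similarity between two flattened weight vectors; values
    near 1 mean near-identical classifiers, near 0 means orthogonal."""
    dot = sum(a * b for a, b in zip(w1, w2))
    n1 = math.sqrt(sum(a * a for a in w1))
    n2 = math.sqrt(sum(b * b for b in w2))
    return dot / (n1 * n2)
```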

IV-D CALI Training

We integrate the training processes of domain alignment and class alignment to systematically train our CALI model. To be consistent with Eq. (11), we adopt an iterative mechanism that alternates between domain alignment and class alignment. Our training process is pseudo-coded in Algorithm 1.

1  Input: Source dataset S; Target dataset T; Initial model parameters θ_f, θ_c1, θ_c2, and θ_d; Maximum iterations M; Iteration interval K.
2  Output: Updated model parameters θ_f, θ_c1, θ_c2, and θ_d.
3  Initialization: is_domain = True; is_class = False;
4  for m ← 1 to M do
5         if m > 1 and (m − 1) mod K = 0 then
6                is_domain = not is_domain;
7                is_class = not is_class;
8         Update θ_f, θ_c1, θ_c2 with the supervised segmentation loss; // Eq. (12)
9         Update θ_c1, θ_c2 with the weight regularization; // Eq. (17)
10        if is_domain then
11               Update θ_f and then θ_d with the adversarial domain loss; // Eq. (13)
12        if is_class then
13               Update θ_f and θ_c1, θ_c2 with the class alignment loss; // Eq. (16)
14 Return θ_f, θ_c1, θ_c2, and θ_d;
Algorithm 1 CALI Training Process

Note the adversarial training order in Algorithm 1 is F first, then D, instead of the reverse, meaning in each training iteration we first train the feature extractor and then the discriminator. The reason for this order is that we empirically find the features from F are relatively easy for D to discriminate; hence if we train D first, D might become an accurate discriminator in the early stage of training, leaving no adversarial signal for training F and making the whole training fail. The same order applies to the training of the classifier pair C_1, C_2 with F.
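The alternation in Algorithm 1, together with the F-before-D order discussed above, can be sketched as follows. The exact switching condition is not preserved in the text, so flipping the flags every K iterations is an assumed interpretation:

```python
def training_schedule(max_iters, interval):
    """Sketch of the phase flags in Algorithm 1: start in domain
    alignment and flip between domain and class alignment every
    `interval` iterations (assumed interpretation of the switching
    condition). Returns the phase active at each iteration."""
    is_domain, is_class = True, False
    phases = []
    for m in range(1, max_iters + 1):
        if m > 1 and (m - 1) % interval == 0:
            is_domain, is_class = not is_domain, not is_class
        # within each iteration: supervised loss (Eq. 12), weight
        # regularization (Eq. 17), then the phase-specific adversarial
        # update with F trained before D (or before C1/C2)
        phases.append("domain" if is_domain else "class")
    return phases

print(training_schedule(6, 2))
# ['domain', 'domain', 'class', 'class', 'domain', 'domain']
```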

Fig. 5: Qualitative results on adaptation GTA5→Cityscapes. Results of our proposed model are listed in the second-to-last column. GT represents the ground-truth labels.

Iv-E Visual Planner

We design a visual receding horizon planner to achieve feasible visual navigation by incorporating the learned image segmentation. Specifically, we first compute a library of motion primitives [15, 14] T = {τ_1, τ_2, …, τ_n}, where each τ_i is a single primitive. We use p to denote a robot pose. Then we project the motion primitives onto the image plane and compute the navigation cost function for each primitive based on an evaluation of collision risk in image space and of target progress. Finally, we select the primitive with minimal cost to execute. The trajectory selection problem can be defined as:

τ* = argmin_{τ∈T} w_c · c_c(τ) + w_t · c_t(τ),   (18)

where c_c(τ) and c_t(τ) are the collision cost and target cost of one primitive τ, and w_c, w_t are the corresponding weights, respectively.

To efficiently evaluate the collision risk in the learned segmentation images, we first classify the classes in terms of their navigability; e.g., in off-road environments, grass and mulch are classified as navigable while tree and bush are classified as non-navigable. In this case, we are able to extract the boundary between the navigable space and the non-navigable space. We treat the part of the boundary close to the bottom line of the image as the obstacle boundary. We further use the obstacle boundary to generate a Scaled Euclidean Distance Field (SEDF), whose values fall in [0, 1], representing the risk level at each pixel position. Examples of SEDFs with different scale factors can be seen in Fig. 6.
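The SEDF construction can be sketched with a brute-force distance transform. The linear decay and the `scale` parameter here are illustrative assumptions, since the paper specifies the scaling only through its figures:

```python
import math

def scaled_edf(height, width, boundary_pixels, scale):
    """Brute-force Scaled Euclidean Distance Field sketch: risk is 1 on
    the obstacle boundary and decays linearly with pixel distance at the
    given scale, clipped to [0, 1]. (Hypothetical decay profile.)"""
    field = []
    for i in range(height):
        row = []
        for j in range(width):
            # distance to the nearest obstacle-boundary pixel
            d = min(math.hypot(i - bi, j - bj) for bi, bj in boundary_pixels)
            row.append(max(0.0, 1.0 - scale * d))
        field.append(row)
    return field
```

In practice a proper distance transform (e.g., a two-pass algorithm) would replace the brute-force minimum, which is quadratic in the image size.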

Assume p is a pose in one primitive and its image coordinates are (u, v); then the collision risk for p is

c_c(p) = I_SEDF(u, v),   (19)

where I_SEDF represents the SEDF image.

Fig. 6: Different SEDFs with varying scale factors in (a), (b), and (c). Values range from 0 to 1, indicated by color from blue to yellow.
Fig. 7: Qualitative results on adaptation RUGD→RELLIS. Results of our proposed model are listed in the second-to-last column. GT represents the ground-truth labels.

To evaluate target progress during navigation, we propose to use the distance on SE(3) as the metric. We define three types of frames: the world frame W, the primitive pose frame P, and the goal frame G. The transformation of P in W is denoted as T_P^W while that of G in W is T_G^W. A typical approach to represent the distance is to split a pose into a position t ∈ R^3 and an orientation R ∈ SO(3) and define two distances, one on R^3 and one on SO(3). The two distances can then be fused in a weighted manner with two strictly positive scaling factors a and b and an exponent parameter q [5]:

d(T_1, T_2) = ( a · d_t(t_1, t_2)^q + b · d_R(R_1, R_2)^q )^{1/q}.   (20)

We use the Euclidean distance as d_t, the Riemannian distance over SO(3) as d_R, and set q as 2. Then the distance (target cost) between two transformation matrices can be defined [25] as:

d(T_1, T_2) = sqrt( a ‖t_2 − t_1‖^2 + b ‖log(R_1^T R_2)‖^2 ),   (21)

where log(·) maps a rotation matrix to its Lie algebra.
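The fused SE(3) distance above can be computed directly: the rotation part is the geodesic distance on SO(3), i.e., the angle of the relative rotation. A self-contained sketch (rotation matrices as nested lists; `a` and `b` are the scaling factors):

```python
import math

def pose_distance(t1, R1, t2, R2, a=1.0, b=1.0):
    """Weighted SE(3) distance: Euclidean on positions plus the SO(3)
    geodesic (rotation angle of R1^T R2), fused with exponent 2."""
    dt2 = sum((x - y) ** 2 for x, y in zip(t1, t2))
    # relative rotation R = R1^T R2; its angle is the Riemannian distance
    rel = [[sum(R1[k][i] * R2[k][j] for k in range(3)) for j in range(3)]
           for i in range(3)]
    trace = rel[0][0] + rel[1][1] + rel[2][2]
    angle = math.acos(max(-1.0, min(1.0, (trace - 1.0) / 2.0)))
    return math.sqrt(a * dt2 + b * angle ** 2)
```

With identical rotations the distance reduces to the scaled Euclidean distance between positions; with identical positions it reduces to the scaled rotation angle.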
V Experiments

Fig. 8: Qualitative results on adaptation RUGD→MESH. Results of our proposed model are listed in the last column.

V-A Datasets

We evaluate CALI together with several baseline methods on a few challenging domain adaptation scenarios involving several public datasets, e.g., GTA5 [29], Cityscapes [8], RUGD [35], and RELLIS [17], as well as a small self-collected dataset named MESH (see the first column of Fig. 8). The GTA5 dataset contains synthesized high-resolution images of urban environments from a video game with pixel-wise semantic annotations of 33 classes. The Cityscapes dataset consists of finely annotated images whose labels are given for 19 commonly seen categories in urban environments, e.g., road, sidewalk, tree, person, car, etc. RUGD and RELLIS are two recently released datasets that aim to evaluate segmentation performance in off-road environments; they contain 24 and 20 classes, respectively, and cover various scenes like trails, creeks, parks, villages, and puddle terrains. Our dataset, MESH, includes features like grass, trees (particularly challenging in winter due to foliage loss and monochromatic colors), mulch, etc. It helps us further validate the performance of our proposed model for traversability prediction in challenging scenes, particularly off-road environments.

Class        SO      DA      CA      CALI
Road         38.86   52.80   78.56   75.36
Sidewalk     17.47   18.95   2.79    27.12
Building     63.60   61.73   43.51   67.00
Sky          58.08   54.35   46.59   60.49
Vegetation   67.21   64.69   41.48   67.50
Terrain      7.63    7.04    8.37    9.56
Person       16.89   15.45   13.48   15.03
Car          30.32   43.41   31.64   52.25
Pole         11.61   12.38   9.68    11.91
mIoU*        34.63   36.76   30.68   42.91
TABLE I: Quantitative comparison (per-class IoU) of Source-Only (SO), DA, CA, and CALI in UDA of GTA5→Cityscapes. mIoU* represents the average mIoU over all classes.

V-B Implementation Details

To be consistent with our theoretical analysis, the implementation of CALI adopts only what is necessarily indicated by Eq. (11). First, Eq. (11) requires that the inputs of the two upper bounds (one for DA and the other for CA) be the same. Second, nothing but domain classification and hypothesis discrepancy is involved in Eq. (11) and the related analyses (Eq. (3) - Eq. (9)). Accordingly, we strictly follow the guidance of our theoretical analyses. First, CALI performs DA at the intermediate-feature level (F(x) in Fig. 4), instead of the output-feature level used in [33]. Second, we exclude the multiple additional tricks, e.g., the entropy-based and multi-level-feature-based alignment and class-ratio priors in [33], and the multi-step training for the feature extractor in [31]. We also implement the baseline methods without those techniques for a fair comparison. To avoid possible performance degradation brought by class imbalance in the used datasets, we regroup rare classes into classes with a higher pixel ratio. For example, we treat the building, wall, and fence as the same class, and the person and rider as the same class, in the adaptation of GTA5→Cityscapes. In the adaptation of RUGD→RELLIS, we treat the tree, bush, and log as the same class, and the rock and rockbed as the same class. Details about the remapping can be seen in Fig. 14 and Fig. 15 in Appendix VI-B.

Class        SO      DA      CA      CALI
Dirt         0.00    0.53    3.23    0.01
Grass        64.78   61.63   65.35   67.08
Tree         40.79   45.93   41.51   55.80
Sky          45.07   67.00   2.31    72.99
Building     10.90   12.29   10.91   10.28
mIoU*        32.31   37.48   24.66   41.23
TABLE II: Quantitative comparison (per-class IoU) of Source-Only (SO), DA, CA, and CALI in UDA of RUGD→RELLIS. mIoU* is the average mIoU over all classes.
Fig. 9: Target discrepancy changes during the training process of (a) GTA5→Cityscapes; (b) RUGD→RELLIS; and (c) RUGD→MESH.
Fig. 10: Using minmax can cause the collapse of training.
Fig. 11: An example of collapsed trained model using minmax.

We use the PyTorch framework for implementation. Training images from the source and target domains are cropped to half of their original image dimensions. The batch size is set to 1 and the weights of all batch normalization layers are fixed. We use ResNet-101 pretrained on ImageNet [9] as the model for extracting features. We use the ASPP module in DeepLab-V2 [6] as the structure for the classifiers C_1 and C_2. We use a structure similar to [27] as the discriminator D, which consists of 5 convolution layers with stride 2. Each convolution layer is followed by a Leaky-ReLU parameterized by 0.2, and only the last convolution layer is followed by a Sigmoid function. During training, we use SGD [4] as the optimizer for F, C_1, and C_2 with a momentum of 0.9, and use Adam [18] to optimize D. We set a weight decay for all SGD optimizers. The initial learning rates of the SGD optimizers used for domain alignment and of the Adam optimizer are set separately from the initial SGD learning rate used for class alignment. All of the learning rates are decayed by a poly learning rate policy. All experiments are conducted on a single Nvidia GeForce RTX 2080 Super GPU.
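The specific rates are omitted above, but the poly decay policy itself has the standard form below; the power of 0.9 is the common choice in DeepLab-style training and is assumed here, not stated in the text:

```python
def poly_lr(base_lr, cur_iter, max_iter, power=0.9):
    """Poly learning-rate policy: base_lr * (1 - iter/max_iter) ** power.
    The power value is an assumed default."""
    return base_lr * (1.0 - cur_iter / max_iter) ** power

print(poly_lr(1.0, 0, 100))    # 1.0
print(poly_lr(1.0, 100, 100))  # 0.0
```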

V-C Comparative Studies

We present comparative experimental results of our proposed model, CALI, against different baseline methods: the Source-Only (SO) method, the Domain-Alignment (DA) [33] method, and the Class-Alignment (CA) [31] method. Specifically, we first perform evaluations on a sim2real UDA in city-like environments, where the source domain is GTA5 and the target domain is Cityscapes. We then consider a real2real transfer in forest environments, where the source and target domains are RUGD and RELLIS, respectively. All models are trained with full access to the images and labels in the source domain and with access only to the images in the target domain. The labels in the target datasets are used only for evaluation. Finally, we further validate our model's performance when adapting from RUGD to our self-collected dataset MESH.

To ensure a fair comparison, all methods use the same feature extractor; DA and CALI use the same domain discriminator; CA and CALI use the same two classifiers. We also use the same optimizers and optimization-related hyperparameters, wherever applicable, for all models under comparison.

We use the mean Intersection over Union (mIoU) as the metric to evaluate per-class and overall segmentation performance on the testing images. IoU is computed as IoU = TP / (TP + FP + FN), where TP, TN, FP, and FN are the numbers of true positive, true negative, false positive, and false negative pixels, respectively.
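The metric above can be computed directly from per-class pixel counts. The following is a generic sketch of the standard definition, not the authors' evaluation script:

```python
def iou(tp, fp, fn):
    """IoU = TP / (TP + FP + FN) for one class, from pixel counts."""
    denom = tp + fp + fn
    return tp / denom if denom > 0 else 0.0

def mean_iou(per_class_counts):
    """mIoU: unweighted average of per-class IoUs.
    per_class_counts maps class name -> (tp, fp, fn)."""
    scores = [iou(*counts) for counts in per_class_counts.values()]
    return sum(scores) / len(scores)
```

Note that TN does not enter the IoU formula; it is listed above only for completeness of the confusion-matrix terms.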

V-C1 GTA5→Cityscapes

Quantitative comparison results of GTA5→Cityscapes are shown in Table I, where segmentations are evaluated on 9 classes (as regrouped in Fig. 14). Our proposed method shows significant advantages over the baseline methods for most categories and in overall performance (mIoU*).

Fig. 12: Navigation behaviors in the MESH environment. Left-most column: top-down view of the environment; purple triangle: the starting point; blue star: the target point. We also show the segmentation (top row) and planning results (bottom row) at four different moments during the navigation, from the second column to the last.

Fig. 13: Navigation behaviors in the MESH environment. Same legend as Fig. 12.

In our testing case, SO achieves the highest score for the class person even without any domain adaptation. One possible reason is that the deep features of the source person and the target person, from the model trained solely on the source domain, are already well-aligned. If we interfere with this well-aligned relation using unnecessary additional efforts, the target prediction error might increase (see the mIoU values of person for the other three methods). We call this phenomenon negative transfer; it also occurs for other classes if we compare SO with DA/CA, e.g., sidewalk, building, sky, and vegetation. In contrast, CALI maintains improved performance compared to SO, DA, and CA. The comparison between CALI and the baselines validates our analysis of DA and CA (Section IV-A): either DA or CA alone is problematic for semantic segmentation, particularly when we strictly follow what the theory supports and do not include other training tricks (which might increase training complexity and make training unstable). It further implies that integrating DA and CA benefits both, yielding significant improvements; moreover, CALI is well supported theoretically, and its training is easy and stable.

Fig. 5 shows examples of the qualitative comparison for UDA of GTA5→Cityscapes. CALI predictions are less noisy than those of the baseline methods (see the sidewalk and the car on the road in the second and third columns) and show better completeness (the baselines miss part of the car; see the fourth column).

V-C2 RUGD→RELLIS

We show quantitative results of RUGD→RELLIS in Table II, where only 5 classes are evaluated; the other classes (in Fig. 15) that frequently appear in the source domain (RUGD) are extremely rare in the target domain (RELLIS), hence no prediction for those classes occurs, especially considering the domain shift. Table II shows the same trend as Table I. Both tables show that CA suffers from negative transfer (compared with SO) in both sim2real and real2real UDA. However, if we constrain the training of CA with DA, as in our proposed model, the performance is remarkably improved. Some qualitative results are shown in Fig. 7.

V-C3 RUGD→MESH

Our MESH dataset contains only unlabeled images, which restricts us to a qualitative comparison for the UDA of RUGD→MESH, shown in Fig. 8. We collected data in winter forest environments, which differ significantly from the images in the source domain (RUGD), collected in a different season, e.g., summer or spring. These cross-season scenarios make the prediction more challenging. However, evaluating UDA performance across seasons is the more practical test: we might have to deploy our robot at any time, even in extreme weather, while our available datasets are far from covering every season and every weather condition. From Fig. 8, we can still see the obvious advantages of our proposed CALI model over the baselines.

V-D Discussions

In this section, we discuss our model's behavior in more detail. First, we explain the advantages of CALI over CA from the perspective of the training process. Second, we show the detrimental effect of using the wrong order of adversarial training.

The most important part of CA is the discrepancy between the two classifiers, which is the only training force behind the functionality of CA. It has been empirically shown in [31] that target prediction accuracy increases as the target discrepancy decreases, hence the discrepancy is also an indicator of whether training is on the right track. We compare the target discrepancy of CALI and the CA baseline in Fig. 9, where the curves for the three UDA scenarios are presented in (a) to (c), showing only the data before iteration 30k. Before around iteration 2k, the target discrepancy of both CALI and CA decreases drastically, but thereafter the discrepancy of CA starts to increase. In contrast, if we impose a DA constraint over the same CA (iteratively), leading to our proposed CALI, the target discrepancy keeps decreasing as expected. This validates that integrating DA and CA makes the training of CA more stable, thus improving target prediction accuracy.
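As a concrete reference, the classifier discrepancy of [31] (the quantity monitored in Fig. 9) is commonly measured as the mean absolute difference between the two classifiers' class-probability outputs. The function below is a framework-free sketch of that quantity for a single pixel or sample, not the authors' exact implementation:

```python
def classifier_discrepancy(probs1, probs2):
    """Mean absolute difference between two classifiers' per-class
    probabilities for one pixel/sample (the L1 discrepancy used in
    MCD-style class alignment [31], up to a constant factor).
    Both inputs are probability vectors of equal length."""
    assert len(probs1) == len(probs2)
    return sum(abs(a - b) for a, b in zip(probs1, probs2)) / len(probs1)
```

In training, this scalar is averaged over all target pixels; the classifiers maximize it while the feature extractor minimizes it.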

As mentioned in Algorithm 1, we must use the stated adversarial training order rather than its reverse. The reason is related to our network structure. Following the guidance of Eq. (11), we feed the same input to the two classifiers and the domain discriminator, hence the discriminator receives the intermediate-level feature as input. If we use the reversed order in CALI, the outputs of the discriminator behave as in Fig. 10: the domain discriminator quickly converges to its optimal state and can accurately discriminate whether a feature comes from the source or the target domain. In this case, the adversarial loss for updating the feature extractor is near 0, and the whole training fails. This is validated by the target discrepancy curve in Fig. 10, where the discrepancy decreases slightly in the first few iterations and then quickly rises to a high level, showing that the training diverges and the model collapses. This is also verified by the prediction results at (and after) around iteration 1k, shown in Fig. 11, where the first row shows source images and the second row target images.

V-E Navigation Missions

To further show the effectiveness of our proposed model in real deployments, we build a navigation system by combining the proposed CALI segmentation model (trained with the RUGD→MESH set-up) with our visual planner. We test the behavior of our navigation system in two different forest environments (named MESH in Fig. 12 and MESH in Fig. 13), where it shows high reliability. In the navigation tasks, the image resolution is , and the inference speed for pure segmentation is around frames per second (FPS). However, since a complete perception system requires several post-processing steps, such as navigability definition, noise filtering, Scaled Euclidean Distance Field computation, and motion primitive evaluation, the response rate of the whole perception pipeline (in Python) is around FPS without any engineering optimization. The segmentation inference for navigation is performed on an Nvidia Tesla T4 GPU. We set the linear velocity as and control the angular velocity to track the selected motion primitive. The path length is in Fig. 12 and in Fig. 13. Although the motion speed in the navigation tasks is slow, as a proof of concept with a very basic motion planner, the system behaves as expected, and we have validated that the proposed CALI model can accomplish navigation tasks in unstructured environments.

VI Conclusion

We present CALI, a novel unsupervised domain adaptation model specifically designed for semantic segmentation, which requires fine-grained alignment at the level of class features. We carefully investigate the theoretical relationship between a coarse alignment and a fine alignment. The theoretical analysis guides the design of the model structure, the losses, and the training process. We have validated that the coarse alignment can serve as a constraint on the fine alignment, and that integrating the two alignments boosts UDA performance for segmentation. The resulting model shows significant advantages over the baselines in various challenging UDA scenarios, e.g., sim2real and real2real. We also demonstrate that the proposed segmentation model integrates well with our visual planner to enable highly efficient navigation in off-road environments.


  • [1] S. Ben-David, J. Blitzer, K. Crammer, A. Kulesza, F. Pereira, and J. W. Vaughan (2010) A theory of learning from different domains. Machine learning 79 (1), pp. 151–175. Cited by: §II, §II, §III, §III, §III, §III, §VI-A, §VI-A.
  • [2] S. Ben-David, J. Blitzer, K. Crammer, F. Pereira, et al. (2007) Analysis of representations for domain adaptation. Advances in neural information processing systems 19, pp. 137. Cited by: §II.
  • [3] J. Blitzer, K. Crammer, A. Kulesza, F. Pereira, and J. Wortman (2008) Learning bounds for domain adaptation. Cited by: §II.
  • [4] L. Bottou (2010) Large-scale machine learning with stochastic gradient descent. In Proceedings of COMPSTAT'2010, pp. 177–186. Cited by: §V-B.
  • [5] R. Brégier, F. Devernay, L. Leyrit, and J. L. Crowley (2018) Defining the pose of any 3d rigid object and an associated distance. International Journal of Computer Vision 126 (6), pp. 571–596. Cited by: §IV-E.
  • [6] L. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille (2017) Deeplab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE transactions on pattern analysis and machine intelligence 40 (4), pp. 834–848. Cited by: §II, §V-B.
  • [7] L. Chen, G. Papandreou, F. Schroff, and H. Adam (2017) Rethinking atrous convolution for semantic image segmentation. arXiv preprint arXiv:1706.05587. Cited by: §II.
  • [8] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele (2016) The cityscapes dataset for semantic urban scene understanding. In Proc. of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Cited by: §I, §V-A.
  • [9] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei (2009) Imagenet: a large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pp. 248–255. Cited by: §V-B.
  • [10] Y. Ganin, E. Ustinova, H. Ajakan, P. Germain, H. Larochelle, F. Laviolette, M. Marchand, and V. Lempitsky (2016) Domain-adversarial training of neural networks. The journal of machine learning research 17 (1), pp. 2096–2030. Cited by: §I, §II, §II, §III.
  • [11] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §V-B.
  • [12] J. Hoffman, E. Tzeng, T. Park, J. Zhu, P. Isola, K. Saenko, A. Efros, and T. Darrell (2018) Cycada: cycle-consistent adversarial domain adaptation. In International conference on machine learning, pp. 1989–1998. Cited by: §I, §II, §II.
  • [13] J. Hoffman, D. Wang, F. Yu, and T. Darrell (2016) Fcns in the wild: pixel-level adversarial and constraint-based adaptation. arXiv preprint arXiv:1612.02649. Cited by: §I, §II, §II.
  • [14] T. M. Howard, C. J. Green, A. Kelly, and D. Ferguson (2008) State space sampling of feasible motions for high-performance mobile robot navigation in complex environments. Journal of Field Robotics 25 (6-7), pp. 325–345. Cited by: §IV-E.
  • [15] T. M. Howard and A. Kelly (2007) Optimal rough terrain trajectory generation for wheeled mobile robots. The International Journal of Robotics Research 26 (2), pp. 141–166. Cited by: §IV-E.
  • [16] L. Hoyer, D. Dai, and L. Van Gool (2021) DAFormer: improving network architectures and training strategies for domain-adaptive semantic segmentation. arXiv preprint arXiv:2111.14887. Cited by: §II.
  • [17] P. Jiang, P. Osteen, M. Wigness, and S. Saripalli (2020) RELLIS-3d dataset: data, benchmarks and analysis. External Links: 2011.12954 Cited by: §I, §V-A.
  • [18] D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §V-B.
  • [19] Z. Liu, X. Li, P. Luo, C. C. Loy, and X. Tang (2017) Deep learning markov random field for semantic segmentation. IEEE transactions on pattern analysis and machine intelligence 40 (8), pp. 1814–1828. Cited by: §II.
  • [20] J. Long, E. Shelhamer, and T. Darrell (2015) Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3431–3440. Cited by: §II.
  • [21] Y. Luo, L. Zheng, T. Guan, J. Yu, and Y. Yang (2019) Taking a closer look at domain shift: category-level adversaries for semantics consistent domain adaptation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2507–2516. Cited by: §II, §II.
  • [22] A. L. Maas, A. Y. Hannun, A. Y. Ng, et al. (2013) Rectifier nonlinearities improve neural network acoustic models. In Proc. icml, Vol. 30, pp. 3. Cited by: §V-B.
  • [23] K. Mei, C. Zhu, J. Zou, and S. Zhang (2020) Instance adaptive self-training for unsupervised domain adaptation. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXVI 16, pp. 415–430. Cited by: §II.
  • [24] S. J. Pan and Q. Yang (2009) A survey on transfer learning. IEEE Transactions on knowledge and data engineering 22 (10), pp. 1345–1359. Cited by: §I.
  • [25] F. C. Park (1995) Distance metrics on the rigid-body motions with applications to mechanism design. Cited by: §IV-E.
  • [26] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, et al. (2019) Pytorch: an imperative style, high-performance deep learning library. Advances in neural information processing systems 32, pp. 8026–8037. Cited by: §V-B.
  • [27] A. Radford, L. Metz, and S. Chintala (2015) Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434. Cited by: §V-B.
  • [28] R. Ranftl, A. Bochkovskiy, and V. Koltun (2021) Vision transformers for dense prediction. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 12179–12188. Cited by: §II.
  • [29] S. R. Richter, V. Vineet, S. Roth, and V. Koltun (2016) Playing for data: ground truth from computer games. In European conference on computer vision, pp. 102–118. Cited by: §I, §V-A.
  • [30] O. Ronneberger, P. Fischer, and T. Brox (2015) U-net: convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention, pp. 234–241. Cited by: §II.
  • [31] K. Saito, K. Watanabe, Y. Ushiku, and T. Harada (2018) Maximum classifier discrepancy for unsupervised domain adaptation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3723–3732. Cited by: §I, §II, §II, §III, §III, §IV-C, §V-B, §V-C, §V-D.
  • [32] Y. Tsai, W. Hung, S. Schulter, K. Sohn, M. Yang, and M. Chandraker (2018) Learning to adapt structured output space for semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 7472–7481. Cited by: §I, §II, §II, §IV-B.
  • [33] T. Vu, H. Jain, M. Bucher, M. Cord, and P. Pérez (2019) Advent: adversarial entropy minimization for domain adaptation in semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2517–2526. Cited by: §I, §II, §II, §IV-B, §V-B, §V-C.
  • [34] H. Wang, T. Shen, W. Zhang, L. Duan, and T. Mei (2020) Classes matter: a fine-grained adversarial approach to cross-domain semantic segmentation. In European Conference on Computer Vision, pp. 642–659. Cited by: §II, §II.
  • [35] M. Wigness, S. Eum, J. G. Rogers, D. Han, and H. Kwon (2019) A rugd dataset for autonomous navigation and visual perception in unstructured outdoor environments. In International Conference on Intelligent Robots and Systems (IROS), Cited by: §I, §V-A.
  • [36] G. Wilson and D. J. Cook (2020) A survey of unsupervised deep domain adaptation. ACM Transactions on Intelligent Systems and Technology (TIST) 11 (5), pp. 1–46. Cited by: §I.
  • [37] E. Xie, W. Wang, Z. Yu, A. Anandkumar, J. M. Alvarez, and P. Luo (2021) SegFormer: simple and efficient design for semantic segmentation with transformers. arXiv preprint arXiv:2105.15203. Cited by: §II.
  • [38] F. Yu and V. Koltun (2015) Multi-scale context aggregation by dilated convolutions. arXiv preprint arXiv:1511.07122. Cited by: §II.
  • [39] Y. Zhang, P. David, and B. Gong (2017) Curriculum domain adaptation for semantic segmentation of urban scenes. In Proceedings of the IEEE international conference on computer vision, pp. 2020–2030. Cited by: §II.
  • [40] Y. Zhang (2021) A survey of unsupervised domain adaptation for visual recognition. arXiv preprint arXiv:2112.06745. Cited by: §I.
  • [41] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia (2017) Pyramid scene parsing network. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2881–2890. Cited by: §II.
  • [42] S. Zheng, J. Lu, H. Zhao, X. Zhu, Z. Luo, Y. Wang, Y. Fu, J. Feng, T. Xiang, P. H. Torr, et al. (2021) Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6881–6890. Cited by: §II.
  • [43] Y. Zou, Z. Yu, B. Kumar, and J. Wang (2018) Unsupervised domain adaptation for semantic segmentation via class-balanced self-training. In Proceedings of the European conference on computer vision (ECCV), pp. 289–305. Cited by: §II.


VI-A Proof of Theorem 2

For a hypothesis h, the bound holds through a chain of inequalities, where λ denotes the combined error of the ideal joint hypothesis h* (see Definition 2 in Section 4.2 of [1]). In the derivation, one line holds because of Lemma 3 of [1]; another line because of Theorem 2 of [1]; and the second-to-last line because of Lemma 2 of [1].
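For background, the classical bound of Ben-David et al. [1] on which this style of proof builds can be stated as follows; this is the standard result reproduced for context, not the exact inequality chain of the theorem above, and its notation may differ slightly from the paper's:

```latex
% Classical domain-adaptation bound of Ben-David et al. [1]:
\epsilon_T(h) \;\le\; \epsilon_S(h)
  \;+\; \tfrac{1}{2}\, d_{\mathcal{H}\Delta\mathcal{H}}(\mathcal{D}_S, \mathcal{D}_T)
  \;+\; \lambda,
\qquad
\lambda \;=\; \min_{h^\ast \in \mathcal{H}}
  \big[\, \epsilon_S(h^\ast) + \epsilon_T(h^\ast) \,\big],
```

where ε_S and ε_T are the source and target errors, and the middle term is the HΔH-divergence between the source and target distributions.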

VI-B Remapping of Label Space

We regroup the original label classes according to the semantic similarities among classes. In GTA5 and Cityscapes, we cluster building, wall, and fence as one category; traffic light, traffic sign, and pole as one group; car, train, bicycle, motorcycle, bus, and truck as one class; and person and rider as one class. See Fig. 14. Similarly, we regroup the classes in RUGD and RELLIS, as shown in Fig. 15.

Fig. 14: Label remapping for GTA5→Cityscapes. The name of each new group is marked in bold.
Fig. 15: Label remapping for RUGD→RELLIS and RUGD→MESH. The name of each new group is marked in bold.