I. Introduction
We consider the deployment of autonomous robots in real-world unstructured field environments, which can be extremely complex, involving random obstacles (e.g., big rocks, tree stumps, man-made objects), cross-domain terrains (e.g., combinations of gravel, sand, wet, and uneven surfaces), as well as dense vegetation (tall and low grasses, shrubs, trees). Whenever a robot is deployed in such an environment, it needs to understand which area of the captured scene is navigable. A typical solution to this problem is visual traversability prediction, which can be achieved by learning scene semantic segmentation.
Visual traversability prediction has been tackled using deep neural networks, where the models are typically trained offline with well-labeled datasets covering limited scenarios. However, there is a gap between the data used to train the model and the real world. It is usually challenging for existing datasets to approximate well the true distributions of the unseen target environments where the robot is deployed. Even incrementally collecting and adding new training data on the fly cannot guarantee that the target environments are well covered in-distribution. In addition, annotating labels for dense predictions, e.g., semantic segmentation, is prohibitively expensive. Therefore, developing a generalization-aware deep model is crucial for robotic systems, considering the demands of practical deployment of deep perception models and the costs/limits of collecting new data in many robotic applications, e.g., autonomous driving, search and rescue, and environmental monitoring.

To tackle this challenge, a broadly studied framework is transfer learning [24], which aims to transfer models between two domains (a source domain and a target domain) that have related but different data distributions. Prediction on the target domain can be considered a strong form of generalization, since the testing data (in the target domain) might fall outside the independently and identically distributed (i.i.d.) assumption and follow a very different distribution than the training data (in the source domain). The "transfer" process is significant for our model development since we can view the available public datasets [29, 8, 35, 17] as the source domain and treat the data in the to-be-deployed environments as the target domain. In this case, we have access to images and corresponding labels in the source domain and images in the target domain, but no access to labels in the target domain. Transferring models in this setup is called Unsupervised Domain Adaptation (UDA) [36, 40].
Domain Alignment (DA) [10, 13, 12, 32, 33] and Class Alignment (CA) [31] are two conventional ways to tackle the UDA problem. DA treats the deep features as a whole. It works well for image-level tasks such as image classification, but has issues with pixel-level tasks such as semantic segmentation: aligning the whole distributions ignores the class-wise features and might misalign class distributions, even when the global features from the source domain and target domain are already well aligned. CA is proposed to solve this issue for dense predictions with multiple classes.
It is natural and necessary to use CA to tackle the UDA of semantic segmentation, as we need to consider aligning class features. However, CA can be problematic: it might fail to outperform DA for segmentation and, in the worst case, might suffer unacceptable negative transfer, meaning the performance with adaptation is even worse than that without adaptation. Intuitively, we need to consider more alignments in CA than in DA, so the search space is more complicated, and training might be more unstable and harder to converge to an expected minimum, leading to larger prediction errors.
To solve this issue of CA, we investigate the relationship between the upper bounds of the target-domain prediction error used by DA and CA, provide a theoretical analysis of these upper bounds in the UDA setup, and bridge the theoretical analysis and algorithm design for UDA of traversability prediction.
In summary, our contributions include:

- We prove that, with proper assumptions, the upper bound of CA is itself upper bounded by the upper bound of DA. This indicates that constraining the training of CA using DA can be beneficial. We then propose a novel concept, the pseudo-trilateral game structure (PTGS), for integrating DA and CA.

- We propose an efficient coarse-to-fine alignments based UDA model, named CALI, for traversability prediction. The new proposal includes a trilateral network structure, novel training losses, and an alternating training process. Our model design is well supported by theoretical analysis, and it is easy and stable to train and converge.

- We show significant advantages of our proposed model compared to several baselines on multiple challenging public datasets and one self-collected dataset. We combine the proposed segmentation model and a visual planner to build a visual navigation system. The results show high safety and effectiveness of our model.
II. Related Work
Semantic Segmentation: Semantic segmentation aims to predict a unique human-defined semantic class for each pixel in a given image. With the prosperity of deep neural networks, the performance of semantic segmentation has been boosted significantly, especially since the advent of FCN [20], which first proposed using deep convolutional neural nets to predict segmentation. Follow-up works improve on FCN through multiple proposals, e.g., using different sizes of kernels or dilation rates to aggregate multi-scale features [6, 7, 38]; building image pyramids to create multi-resolution inputs [41]; applying probabilistic graphs to smooth the prediction [19]; compensating features at deeper levels with an encoder-decoder structure [30]; and employing attention mechanisms to capture the long-range dependencies among pixels in a more compact and efficient way [28]. The strength of the current semantic segmentation SOTA performance is also evident in very recently released work [42, 37]. However, all of those methods rely on fully-supervised learning, and their performance might degrade catastrophically when a domain shift exists between the training data and the deployment data. Considering the possible domain shift, developing adaptation-aware models is extremely practical and urgent.

Unsupervised Domain Adaptation: The main approaches to tackle UDA include adversarial training (a.k.a. distribution alignment) [10, 13, 12, 32, 31, 33, 21, 34] and self-training [43, 39, 23, 16]. Although self-training is becoming another dominant method for segmentation UDA in terms of empirical results, it still lacks a sound theoretical foundation. In this paper, we focus only on the alignment-based methods, which not only stay close to the UDA state-of-the-art (SOTA) performance but are also well supported by sound theoretical analyses [2, 3, 1].
The alignment-based methods adapt models by aligning the distributions of the source domain and target domain in an adversarial training process, i.e., making the deep features of source images and target images indistinguishable to a discriminator net. Typical approaches include Domain Alignment (DA) [10, 13, 12, 32, 33], which aligns the two domains using global features (aligning the feature tensor from source or target as a whole), and Class Alignment (CA) [31, 21, 34], which only considers aligning the features of each class from source and target, no matter whether the domain distributions are aligned or not. In [31], the authors are inspired by the theoretical analysis of [1] and propose a discrepancy-based model for aligning class features, with a clear relation between the theory guidance [1] and the design of the network, loss, and training method. Some recent works [21, 34] are similar in spirit to the proposed work and show improved results compared to [31], but it remains unclear how their algorithms relate to the theory and why the structure/loss/training is designed in the presented way.

III. Background and Preliminary Materials
We consider segmentation tasks where the input space is $\mathcal{X}$, representing the input RGB images, and the label space is $\mathcal{Y}$, representing the ground-truth class segmentation images, where the label for a single pixel at position $(i, j)$ is denoted by a one-hot vector $y^{(i,j)} \in \{0, 1\}^{K}$ whose elements are by default 0-valued except at the location corresponding to its class, which is labeled as 1. Domain adaptation has two domain distributions over $\mathcal{X}$, named the source domain $\mathcal{D}_S$ and the target domain $\mathcal{D}_T$. In the setting of UDA for segmentation, we have access to i.i.d. samples with labels $\{(x_s, y_s)\}$ from $\mathcal{D}_S$ and i.i.d. samples without labels $\{x_t\}$ from $\mathcal{D}_T$.

In the UDA problem, we need to reduce the prediction error on the target domain. A hypothesis is a function $h: \mathcal{X} \to \mathcal{Y}$. We denote the space of $h$ as $\mathcal{H}$. With the loss function $\ell$, the expected error of $h$ on $\mathcal{D}_S$ is defined as

$\epsilon_S(h) = \mathbb{E}_{x \sim \mathcal{D}_S}\left[\ell(h(x), y)\right]$.  (1)

Similarly, we can define the expected error of $h$ on $\mathcal{D}_T$ as

$\epsilon_T(h) = \mathbb{E}_{x \sim \mathcal{D}_T}\left[\ell(h(x), y)\right]$.  (2)
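As a concrete illustration (not part of the paper's method, and with function names of our choosing), the expected errors in Eqs. (1) and (2) can be estimated on finite labeled samples as an empirical 0-1 risk:

```python
def empirical_error(hypothesis, samples, labels):
    """Monte-Carlo estimate of the expected error eps(h) = E[l(h(x), y)]
    under the 0-1 loss: the fraction of samples that h misclassifies."""
    mistakes = sum(hypothesis(x) != y for x, y in zip(samples, labels))
    return mistakes / len(samples)
```

With labeled source samples this estimates $\epsilon_S(h)$; the same estimator on labeled target samples would estimate $\epsilon_T(h)$, which is exactly the quantity UDA cannot compute during training because target labels are unavailable.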
Two important upper bounds related to the source and target error are given in [1]. Basically,

Theorem 1. For a hypothesis $h$,

$\epsilon_T(h) \le \epsilon_S(h) + d_1(\mathcal{D}_S, \mathcal{D}_T) + C$,  (3)

where $d_1$ is the $L^1$ divergence for two distributions, and the constant term $C$ does not depend on any $h$. However, it is claimed in [1] that the bound with the $L^1$ divergence cannot be accurately estimated from finite samples, and that using the $L^1$ divergence can unnecessarily inflate the bound. Another divergence measure, the $\mathcal{H}$-divergence, is thus introduced to replace the $L^1$ divergence, with a new bound derived.

Definition 1. Given two domain distributions $\mathcal{D}_S$ and $\mathcal{D}_T$ over $\mathcal{X}$, and a hypothesis space $\mathcal{H}$ that has finite VC dimension, the $\mathcal{H}$-divergence between $\mathcal{D}_S$ and $\mathcal{D}_T$ is defined as
$d_{\mathcal{H}}(\mathcal{D}_S, \mathcal{D}_T) = 2 \sup_{h \in \mathcal{H}} \left| \Pr_{x \sim \mathcal{D}_S}\left[h(x) = 1\right] - \Pr_{x \sim \mathcal{D}_T}\left[h(x) = 1\right] \right|$,  (4)

where $\Pr_{x \sim \mathcal{D}_S}[h(x) = 1]$ represents the probability of a sample from $\mathcal{D}_S$ being labeled 1 by $h$; the same holds for $\mathcal{D}_T$. The $\mathcal{H}$-divergence resolves the issues in the $L^1$ divergence. If we replace $d_1$ in Eq. (3) with $d_{\mathcal{H}}$, then a new upper bound for $\epsilon_T(h)$, named $B_{\mathcal{H}}$, can be written as

$\epsilon_T(h) \le \epsilon_S(h) + d_{\mathcal{H}}(\mathcal{D}_S, \mathcal{D}_T) + C \triangleq B_{\mathcal{H}}$.  (5)
An approach to compute the empirical $\mathcal{H}$-divergence is also proposed in [1].
Lemma 1. For a symmetric hypothesis class $\mathcal{H}$ (one where for every $h \in \mathcal{H}$, the inverse hypothesis $1 - h$ is also in $\mathcal{H}$) and two sample sets $S = \{x_i\}_{i=1}^{m} \sim \mathcal{D}_S$ and $T = \{x_j\}_{j=1}^{m'} \sim \mathcal{D}_T$,

$\hat{d}_{\mathcal{H}}(S, T) = 2\left(1 - \min_{h \in \mathcal{H}}\left[\frac{1}{m}\sum_{i=1}^{m} \mathbb{I}\left[h(x_i) = 0\right] + \frac{1}{m'}\sum_{j=1}^{m'} \mathbb{I}\left[h(x_j) = 1\right]\right]\right)$,  (6)

where $\mathbb{I}[a]$ is an indicator function which is 1 if $a$ is true, and 0 otherwise.
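As an illustrative sketch (our own naming), the min over $\mathcal{H}$ in Eq. (6) is commonly approximated by a single trained domain classifier, which for balanced sample sets gives the plug-in estimate $\hat{d} = 2(1 - 2\epsilon)$, where $\epsilon$ is the classifier's balanced domain-classification error:

```python
def empirical_h_divergence(preds_source, preds_target):
    """Plug-in estimate of Eq. (6) using one trained domain classifier
    (predictions: 0 = "source", 1 = "target") instead of the min over H.
    err_s counts source samples labeled as target; err_t the converse."""
    err_s = sum(p == 1 for p in preds_source) / len(preds_source)
    err_t = sum(p == 0 for p in preds_target) / len(preds_target)
    balanced_error = 0.5 * (err_s + err_t)
    return 2.0 * (1.0 - 2.0 * balanced_error)
```

A perfectly separable pair of domains gives the maximal value 2, while domains the classifier cannot distinguish (chance-level error 0.5) give 0.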
The second upper bound is based on a new hypothesis space called the symmetric difference hypothesis space.
Definition 2. For a hypothesis space $\mathcal{H}$, the symmetric difference hypothesis space $\mathcal{H}\Delta\mathcal{H}$ is the set of hypotheses

$\mathcal{H}\Delta\mathcal{H} = \left\{ g : g(x) = h(x) \oplus h'(x), \; h, h' \in \mathcal{H} \right\}$,  (7)

where $\oplus$ denotes an XOR operation. Then we can define the $\mathcal{H}\Delta\mathcal{H}$-distance as

$d_{\mathcal{H}\Delta\mathcal{H}}(\mathcal{D}_S, \mathcal{D}_T) = 2 \sup_{h, h' \in \mathcal{H}} \left| \Pr_{x \sim \mathcal{D}_S}\left[h(x) \ne h'(x)\right] - \Pr_{x \sim \mathcal{D}_T}\left[h(x) \ne h'(x)\right] \right|$.  (8)

Similar to Eq. (5), if we replace $d_1$ with $d_{\mathcal{H}\Delta\mathcal{H}}$, the second upper bound for $\epsilon_T(h)$, named $B_{\mathcal{H}\Delta\mathcal{H}}$, can be expressed as

$\epsilon_T(h) \le \epsilon_S(h) + d_{\mathcal{H}\Delta\mathcal{H}}(\mathcal{D}_S, \mathcal{D}_T) + C \triangleq B_{\mathcal{H}\Delta\mathcal{H}}$,  (9)

where $C$ is the same constant term as in Eq. (3).
A standard way to achieve alignment in deep models is adversarial training, which is also used in Generative Adversarial Networks (GANs). We therefore explain the key concepts of adversarial training using the example of GANs.
A GAN is proposed to learn the distribution $p_{data}$ of a set of given data in an adversarial manner. The architecture consists of two networks: a generator $G$ and a discriminator $D$. $G$ is responsible for generating fake data (with distribution $p_g$) from random noises $z \sim p_z$ to fool the discriminator $D$, which must instead accurately distinguish between the fake data and the given data. Optimization of a GAN involves a mini-maximization over a joint loss for $G$ and $D$:

$\min_G \max_D \; \mathbb{E}_{x \sim p_{data}}\left[\log D(x)\right] + \mathbb{E}_{z \sim p_z}\left[\log\left(1 - D(G(z))\right)\right]$,  (10)

where we use 1 as the real label and 0 as the fake label. Training with Eq. (10) is a bilateral game in which the fake distribution $p_g$ is aligned with the data distribution $p_{data}$.
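For illustration, the value of the joint objective in Eq. (10) can be estimated from the discriminator's probability outputs on a batch of real and generated samples (a sketch with names of our choosing):

```python
import math

def gan_value(d_real, d_fake):
    """Batch estimate of the GAN objective in Eq. (10):
    E[log D(x)] + E[log(1 - D(G(z)))], where d_real / d_fake are the
    discriminator's probability outputs on real / generated data.
    D ascends this value while G descends the second term."""
    real_term = sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term
```

At the classical equilibrium the discriminator outputs 0.5 everywhere, giving the value $2\log(1/2) = -\log 4$.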
The two bounds (Eq. (5) and Eq. (9)) for the target domain error are given separately in [1]. It has been independently demonstrated that domain alignment corresponds to optimizing over $B_{\mathcal{H}}$ [10], where optimization over the upper bound (Eq. (5) with the empirical divergence Eq. (6)) is proved equivalent to Eq. (10) combined with supervised learning in the source domain, and that class alignment corresponds to optimizing over $B_{\mathcal{H}\Delta\mathcal{H}}$ [31], where $d_{\mathcal{H}\Delta\mathcal{H}}$ is approximated by the discrepancy between two different classifiers.
Training DA is straightforward since we can easily define binary labels for each domain, e.g., 1 as the source domain label and 0 as the target domain label; adversarial training over the domain labels can then achieve domain alignment. CA, however, is difficult to implement: we have no target labels, hence the target class features are completely unknown to us, making naive adversarial training over each class impossible. The existing theory-supported way to perform CA [31] is to align class features indirectly by devising two different classifier hypotheses. The two classifiers have to be well trained on the source domain so that they classify the different classes of the source domain with different decision boundaries. Then, considering the shift between the source and target domains, the two trained classifiers might disagree on target domain classes. Note that since the two classifiers are already well trained on the source domain, their agreements represent those features in the target domain that are close to the source domain, while in contrast, the features where disagreements happen indicate a large shift between source and target. We use the disagreements to approximate the distance between source and target. If we are able to minimize the disagreements of the two classifiers, then the features of each class will be enforced to be well aligned between source and target.
IV. Methodology
In this work we further investigate the relation between $B_{\mathcal{H}}$ and $B_{\mathcal{H}\Delta\mathcal{H}}$ and prove that $B_{\mathcal{H}}$ turns out to be an upper bound of $B_{\mathcal{H}\Delta\mathcal{H}}$, meaning DA can serve as a necessary constraint on CA. This is also consistent with our intuition: DA aligns features globally in a coarse way while CA aligns features locally in a finer way. Constraining CA with DA is thus a coarse-to-fine process, which makes the alignment efficient and stable. By carefully studying the internal structure of existing DA and CA work, we propose a novel concept, the pseudo-trilateral game structure (PTGS), for efficiently integrating DA and CA. We follow our theoretical analysis and the proposed PTGS to guide the development of CALI, including the designs of the model structure, losses, and training process.
The notation used in this paper is as follows. We denote the segmentation model as $C \circ F$, which consists of a feature extractor $F$ parameterized by $\theta_F$ and a classifier $C$ parameterized by $\theta_C$, and $x$ is a sample from $\mathcal{D}_S$ or $\mathcal{D}_T$. If multiple classifiers are used, we denote the $i$-th classifier as $C_i$. We denote the discriminator as $D$, parameterized by $\theta_D$.
IV-A. Bounds Relation
We start by examining the relationship between DA and CA from the perspective of the target error bound. We propose to use this relation to improve the segmentation performance of class alignment, which is desired for dense prediction tasks. We provide the following theorem:
Theorem 2. If we assume there is a hypothesis space $\mathcal{H}$ for the segmentation models $h$ and a hypothesis space $\tilde{\mathcal{H}}$ for the domain classifiers $\tilde{h}$, and $\mathcal{H}\Delta\mathcal{H} \subseteq \tilde{\mathcal{H}}$, then we have

$\epsilon_T(h) \le B_{\mathcal{H}\Delta\mathcal{H}} \le B_{\tilde{\mathcal{H}}}$.  (11)
The proof of this theorem is provided in Appendix VI-A.
Essentially, we limit the hypothesis spaces $\mathcal{H}$ and $\tilde{\mathcal{H}}$ in Eq. (11) to the space of deep neural networks. Directly optimizing over $B_{\mathcal{H}\Delta\mathcal{H}}$ might be hard to converge since it is a tighter upper bound for the prediction error on the target domain. The bounds relation in Eq. (11) shows that $B_{\tilde{\mathcal{H}}}$ is an upper bound of $B_{\mathcal{H}\Delta\mathcal{H}}$. This gives us a clue for improving the training process of class alignment: domain alignment can act as a global constraint and narrow down the search space of class alignment. It also implies that integrating domain alignment and class alignment might boost both the training efficiency and the prediction performance of UDA. An ideal training process is illustrated in Fig. 2, where the search space of $B_{\mathcal{H}\Delta\mathcal{H}}$ (CA) is constantly bounded by that of $B_{\tilde{\mathcal{H}}}$ (DA), ensuring the whole training process converges stably. This inspires us to design a new model; we explain next in detail our model structure, losses, and training process.
IV-B. CALI Structure
The existing DA or CA works usually involve a bilateral game. In CA, the game is between a feature extractor and a family of classifiers: the two players (the pair of classifiers vs. the feature extractor) are optimized over the discrepancy of the two classifiers in opposite directions. In DA, the game happens between a segmentation net and a domain discriminator, optimized over the domain discrimination in an opposite way. It has been empirically shown [33, 32] that DA performs well if the domain alignment is applied to the prediction probability (after the Softmax). However, according to the identified relation in Eq. (11), the two upper bounds $B_{\tilde{\mathcal{H}}}$ and $B_{\mathcal{H}\Delta\mathcal{H}}$ need to use the same features, hence we connect the domain alignment and class alignment with a shared feature extractor and propose a novel concept called PTGS (see Fig. 3) to illustrate an interesting structure for integrating DA and CA. Both the discriminator and the classifiers play a game with the shared feature extractor, but there is no game between the discriminator and the classifiers, hence we call this a pseudo-trilateral game. Furthermore, as defined in Eq. (8), $h$ and $h'$ are two different hypotheses, thus we have to ensure the two classifiers remain different during training.
Following the concept of PTGS, we design the structure of our CALI model as shown in Fig. 4. Four networks are involved: a shared feature extractor $F$, a domain discriminator $D$, and two classifiers $C_1$ and $C_2$. $F(x)$ represents the shared features; $P_1$ and $P_2$ are the probability/class predictions of $C_1$ and $C_2$, respectively; 1 and 0 represent the source domain label and the target domain label; and $d(P_1, P_2)$ represents the distance measure between the two probability distributions. The one-way solid arrows indicate the forward propagation of the data flow while the two-way dashed arrows indicate where losses are generated. The red arrows represent the source-related data while the blue ones represent the target-related data. The orange two-way dashed line indicates the structural regularization loss between $C_1$ and $C_2$.

IV-C. CALI Losses
We denote raw images from the source or target domain as $x_s$ and $x_t$, and the labels from the source domain as $y_s$. We use the semantic labels in the source domain to train all of the nets but the domain discriminator in a supervised way; see the solid red one-way arrow in Fig. 4. We need to minimize the supervised segmentation loss since Eq. (11) and the other related equations suggest that the source prediction error is also part of the upper bound of the target error. The supervised segmentation loss for training CALI is defined as

$\mathcal{L}_{seg}(F, C_i) = -\mathbb{E}_{(x_s, y_s)}\left[\sum_{h, w, k} \left(y_s \odot \log C_i(F(x_s))\right)^{(h, w, k)}\right]$,  (12)

where $\odot$ represents the element-wise multiplication between two tensors and the sum runs over pixel positions $(h, w)$ and classes $k$.
To perform domain alignment, we need to define the joint loss function for $F$ and $D$:

$\mathcal{L}_{da}(F, D) = \mathcal{L}_{D}^{s} + \mathcal{L}_{D}^{t}$,  (13)

where no segmentation labels but domain labels are used, and we use the standard cross-entropy to compute the domain classification loss for both the source data ($\mathcal{L}_{D}^{s}$) and the target data ($\mathcal{L}_{D}^{t}$). We have

$\mathcal{L}_{D}^{s} = -\mathbb{E}_{x_s}\left[\log D(F(x_s))\right]$  (14)

and

$\mathcal{L}_{D}^{t} = -\mathbb{E}_{x_t}\left[\log\left(1 - D(F(x_t))\right)\right]$.  (15)

Note that we include $F$ in Eq. (14) since both the source data and the target data are passed through the feature extractor. This differs from a standard GAN, where the real data is fed directly to $D$ without passing through the generator.
To perform class alignment, we need to define the joint loss function for $F$, $C_1$, and $C_2$:

$\mathcal{L}_{ca}(F, C_1, C_2) = \mathbb{E}_{x_t}\left[d\left(C_1(F(x_t)), C_2(F(x_t))\right)\right]$,  (16)

where $d(\cdot, \cdot)$ is the distance measure between the two distributions produced by the two classifiers. In this paper, we use the same distance as in [31], i.e., $d(p, q) = \frac{1}{K}\sum_{k=1}^{K}\left|p_k - q_k\right|$, where $p$ and $q$ are two distributions and $K$ is the number of label classes.
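The per-pixel discrepancy measure can be sketched as follows (illustrative code with our own naming):

```python
def classifier_discrepancy(p, q):
    """d(p, q) = (1/K) * sum_k |p_k - q_k|: the mean absolute difference
    between the two classifiers' class-probability vectors at one pixel."""
    assert len(p) == len(q), "distributions must share the class count K"
    return sum(abs(a - b) for a, b in zip(p, q)) / len(p)
```

Averaging this quantity over all pixels of the target images yields the class-alignment loss of Eq. (16): $F$ is updated to decrease it, while $C_1$ and $C_2$ are updated to increase it.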
To prevent $C_1$ and $C_2$ from converging to the same network throughout the training, we use the cosine similarity as a weight regularization to maximize the difference of the weights of $C_1$ and $C_2$, i.e.,

$\mathcal{L}_{w} = \dfrac{\mathbf{w}_1 \cdot \mathbf{w}_2}{\left\|\mathbf{w}_1\right\| \left\|\mathbf{w}_2\right\|}$,  (17)

where $\mathbf{w}_1$ and $\mathbf{w}_2$ are the flattened weight vectors of $C_1$ and $C_2$, respectively.
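Eq. (17) amounts to the standard cosine similarity of the two flattened weight vectors; a minimal sketch (our naming):

```python
import math

def weight_regularizer(w1, w2):
    """Cosine similarity between the flattened weight vectors of C1 and
    C2 (Eq. (17)); minimizing it drives the two classifiers apart."""
    dot = sum(a * b for a, b in zip(w1, w2))
    norm1 = math.sqrt(sum(a * a for a in w1))
    norm2 = math.sqrt(sum(b * b for b in w2))
    return dot / (norm1 * norm2)
```

Orthogonal weight vectors give 0, while parallel ones give 1, so minimizing this term pushes the two classifiers toward different decision boundaries.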
IV-D. CALI Training
We integrate the training processes of domain alignment and class alignment to train our CALI model systematically. To be consistent with Eq. (11), we adopt an iterative mechanism that alternates between domain alignment and class alignment. Our training process is pseudo-coded in Algorithm 1.
Note that the adversarial training order in Algorithm 1 is $F \to D$, instead of $D \to F$, meaning in each training iteration we first train the feature extractor and then the discriminator. The reason for this order is that we empirically find the features from $F$ are relatively easy for $D$ to discriminate; hence, if we train $D$ first, then $D$ might become an accurate discriminator in the early stage of training and there will be no adversarial signal for training $F$, making the whole training fail. The same order applies to the training of the classifier pair $C_1, C_2$ with $F$.
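The ordering constraint can be made concrete with a schematic iteration (the callback names are placeholders for the actual optimizer steps, and the exact sub-steps are our simplification of Algorithm 1):

```python
def cali_iteration(update_f, update_d, update_classifiers):
    """One schematic CALI iteration with the F-first ordering: for the
    domain-alignment pair, the feature extractor F is updated before
    its adversary D; for the class-alignment pair, F is likewise
    updated before the classifiers C1/C2."""
    update_f("domain")        # F learns to fool D first ...
    update_d()                # ... then D catches up
    update_f("class")         # F reduces the classifier discrepancy ...
    update_classifiers()      # ... then C1/C2 enlarge it again
```

Swapping the first two calls would correspond to the failing $D \to F$ order discussed in Section V-D.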
IV-E. Visual Planner
We design a visual receding-horizon planner that combines the learned image segmentation to achieve feasible visual navigation. Specifically, we first compute a library of motion primitives $\mathcal{T} = \{\tau_1, \tau_2, \ldots, \tau_n\}$ [15, 14], where each $\tau_i$ is a single primitive. We use $p$ to denote a robot pose. We then project the motion primitives onto the image plane and compute the navigation cost for each primitive based on an evaluation of the collision risk in image space and of the target progress. Finally, we select the primitive with minimal cost to execute. The trajectory selection problem can be defined as:

$\tau^{*} = \arg\min_{\tau \in \mathcal{T}} \; \lambda_{obs}\, c_{obs}(\tau) + \lambda_{goal}\, c_{goal}(\tau)$,  (18)

where $c_{obs}(\tau)$ and $c_{goal}(\tau)$ are the collision cost and target cost of one primitive $\tau$, and $\lambda_{obs}$, $\lambda_{goal}$ are the corresponding weights, respectively.
To efficiently evaluate the collision risk in the learned segmentation images, we first classify the classes in terms of their navigability; e.g., in off-road environments, grass and mulch are classified as navigable while tree and bush are classified as non-navigable. In this case, we are able to extract the boundary between the navigable space and the non-navigable space. We treat the part of the boundary close to the bottom line of the image as the obstacle boundary. We further use the obstacle boundary to generate a Scaled Euclidean Distance Field (SEDF), whose values fall in $[0, 1]$, representing the risk level at each pixel position. Examples of SEDFs with different scale factors can be seen in Fig. 6.
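A brute-force sketch of the SEDF construction (our own simplification; the paper only specifies that values lie in $[0, 1]$ with a tunable scale factor, so the linear fall-off below is an assumption):

```python
import math

def scaled_euclidean_distance_field(boundary_pixels, shape, scale):
    """For each pixel, find the Euclidean distance to the nearest
    obstacle-boundary pixel and map it to a risk value in [0, 1]:
    1 on the boundary, decaying linearly to 0 at `scale` pixels away."""
    rows, cols = shape
    field = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            d = min(math.hypot(r - br, c - bc) for br, bc in boundary_pixels)
            field[r][c] = max(0.0, 1.0 - d / scale)
    return field
```

The collision cost of Eq. (19) then reduces to a lookup of this field at a pose's projected image coordinates; in practice an exact distance transform would replace the brute-force minimum.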
Assume $p$ is a pose in one primitive and its image coordinates are $(u, v)$; then the collision risk for $p$ is

$c_{obs}(p) = E_{sedf}(u, v)$,  (19)

where $E_{sedf}$ represents the SEDF image.
To evaluate target progress during navigation, we propose to use the distance on $SE(3)$ as the metric. We define three types of frames: the world frame $\{W\}$, the primitive pose frame $\{P\}$, and the goal frame $\{G\}$. The transformation of $\{P\}$ in $\{W\}$ is denoted as $T_{WP}$ while that of $\{G\}$ in $\{W\}$ is $T_{WG}$. A typical approach to represent the distance is to split a pose into a position and an orientation and define two distances, $d_{trans}$ on $\mathbb{R}^3$ and $d_{rot}$ on $SO(3)$. The two distances can then be fused in a weighted manner with two strictly positive scaling factors $a$ and $b$ and with an exponent parameter $\rho$ [5]:

$d = \left(a \cdot d_{trans}^{\rho} + b \cdot d_{rot}^{\rho}\right)^{1/\rho}$.  (20)

We use the Euclidean distance as $d_{trans}$, the Riemannian distance over $SO(3)$ as $d_{rot}$, and set $\rho$ as 2. Then the distance (target cost) between the two transformation matrices can be defined [25] as:

$c_{goal} = \sqrt{a \left\|\mathbf{t}_{WP} - \mathbf{t}_{WG}\right\|^2 + b \left\|\log\left(R_{WP}^{\top} R_{WG}\right)\right\|^2}$,  (21)

where $\mathbf{t}$ and $R$ denote the translation and rotation components of the corresponding transformations.
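The fused distance of Eqs. (20) and (21) with $\rho = 2$ can be sketched as follows, recovering the $SO(3)$ geodesic distance (the norm of the rotation log, i.e., the relative rotation angle) from the trace of $R_1^{\top} R_2$; the function name and list-of-lists matrix format are our assumptions:

```python
import math

def pose_distance(t1, R1, t2, R2, a=1.0, b=1.0):
    """Eq. (20) with rho = 2: fuse the Euclidean distance between the
    translations with the Riemannian distance on SO(3), i.e. the angle
    of R1^T R2, recovered via trace(R) = 1 + 2*cos(theta)."""
    d_trans_sq = sum((x - y) ** 2 for x, y in zip(t1, t2))
    # trace(R1^T R2) equals the element-wise dot product of R1 and R2
    trace = sum(R1[i][j] * R2[i][j] for i in range(3) for j in range(3))
    cos_theta = max(-1.0, min(1.0, (trace - 1.0) / 2.0))
    d_rot = math.acos(cos_theta)
    return math.sqrt(a * d_trans_sq + b * d_rot ** 2)
```

With identity rotations the result is the ordinary Euclidean distance between positions; with identical positions it is the rotation angle scaled by $\sqrt{b}$.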
V. Experiments
V-A. Datasets
We evaluate CALI together with several baseline methods on a few challenging domain adaptation scenarios, using several public datasets, e.g., GTA5 [29], Cityscapes [8], RUGD [35], and RELLIS [17], as well as a small self-collected dataset, named MESH (see the first column of Fig. 8). The GTA5 dataset contains synthesized high-resolution images of urban environments from a video game, with pixel-wise semantic annotations of 33 classes. The Cityscapes dataset consists of finely annotated images labeled with 19 commonly seen categories in urban environments, e.g., road, sidewalk, tree, person, car, etc. RUGD and RELLIS are two recently released datasets that aim to evaluate segmentation performance in off-road environments; they contain 24 and 20 classes, respectively, and cover various scenes like trails, creeks, parks, villages, and puddle terrains. Our dataset, MESH, includes features like grass, trees (particularly challenging in winter due to foliage loss and monochromatic colors), mulch, etc. It helps us further validate the performance of our proposed model for traversability prediction in challenging scenes, particularly off-road environments.
Table I: Per-class IoU and mIoU* for GTA5→Cityscapes.

| Class      | SO    | DA    | CA    | CALI  |
|------------|-------|-------|-------|-------|
| Road       | 38.86 | 52.80 | 78.56 | 75.36 |
| Sidewalk   | 17.47 | 18.95 | 2.79  | 27.12 |
| Building   | 63.60 | 61.73 | 43.51 | 67.00 |
| Sky        | 58.08 | 54.35 | 46.59 | 60.49 |
| Vegetation | 67.21 | 64.69 | 41.48 | 67.50 |
| Terrain    | 7.63  | 7.04  | 8.37  | 9.56  |
| Person     | 16.89 | 15.45 | 13.48 | 15.03 |
| Car        | 30.32 | 43.41 | 31.64 | 52.25 |
| Pole       | 11.61 | 12.38 | 9.68  | 11.91 |
| mIoU*      | 34.63 | 36.76 | 30.68 | 42.91 |
V-B. Implementation Details
To be consistent with our theoretical analysis, the implementation of CALI adopts only what is necessarily indicated by Eq. (11). First, Eq. (11) requires that the inputs of the two upper bounds (one for DA and the other for CA) be the same. Second, nothing but domain classification and hypothesis discrepancy is involved in Eq. (11) and the other related analyses (Eq. (3) - Eq. (9)). Accordingly, we strictly follow the guidance of our theoretical analyses. First, CALI performs DA at the intermediate-feature level ($F(x)$ in Fig. 4), instead of the output-feature level used in [33]. Second, we exclude the multiple additional tricks, e.g., the entropy-based and multi-level-feature-based alignment and the class-ratio priors in [33], and the multi-step training of the feature extractor in [31]. We also implement the baseline methods without those techniques for a fair comparison. To avoid possible degraded performance brought by class imbalance in the used datasets, we regroup rare classes into classes with a higher pixel ratio. For example, in the adaptation of GTA5→Cityscapes we treat building, wall, and fence as the same class, and person and rider as the same class. In the adaptation of RUGD→RELLIS, we treat tree, bush, and log as the same class, and rock and rock-bed as the same class. Details about the remapping can be seen in Fig. 14 and Fig. 15 in Appendix VI-B.
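The class regrouping is a simple label lookup; a sketch (the numeric ids below are illustrative, not the datasets' actual ids):

```python
def remap_labels(label_image, id_map, ignore_id=255):
    """Regroup classes by replacing each original class id with the id
    of its merged group (e.g. building/wall/fence -> one id); ids that
    are not remapped fall back to an ignore value."""
    return [[id_map.get(v, ignore_id) for v in row] for row in label_image]
```

For example, with a hypothetical `id_map = {11: 2, 12: 2, 13: 2}`, the building/wall/fence ids would all collapse to a single class 2 before training and evaluation.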
Table II: Per-class IoU and mIoU* for RUGD→RELLIS.

| Class    | SO    | DA    | CA    | CALI  |
|----------|-------|-------|-------|-------|
| Dirt     | 0.00  | 0.53  | 3.23  | 0.01  |
| Grass    | 64.78 | 61.63 | 65.35 | 67.08 |
| Tree     | 40.79 | 45.93 | 41.51 | 55.80 |
| Sky      | 45.07 | 67.00 | 2.31  | 72.99 |
| Building | 10.90 | 12.29 | 10.91 | 10.28 |
| mIoU*    | 32.31 | 37.48 | 24.66 | 41.23 |
We use the PyTorch [26] framework for our implementation. Training images from the source and target domains are cropped to half of their original dimensions. The batch size is set to 1, and the weights of all batch normalization layers are fixed. We use a ResNet-101 [11] pretrained on ImageNet [9] as the model for extracting features. We use the ASPP module of DeepLabV2 [6] as the structure for $C_1$ and $C_2$. We use a structure similar to [27] for the discriminator $D$, which consists of 5 convolution layers with a stride of 2. Each convolution layer is followed by a LeakyReLU [22] parameterized by 0.2, and only the last convolution layer is followed by a Sigmoid function. During training, we use SGD [4] as the optimizer for the feature extractor and the classifiers with a momentum of 0.9, and use Adam [18] to optimize $D$. We apply the same weight decay to all SGD optimizers. The initial learning rates of the SGD and Adam optimizers for domain alignment, and of the SGD optimizers for class alignment, are set separately. All of the learning rates are decayed by a poly learning rate policy, in which the initial learning rate is multiplied by $\left(1 - \frac{iter}{iter_{max}}\right)^{power}$. All experiments are conducted on a single Nvidia GeForce RTX 2080 Super GPU.

V-C. Comparative Studies
We present comparative experimental results of our proposed model, CALI, against several baseline methods: the Source-Only (SO) method, the Domain-Alignment (DA) [33] method, and the Class-Alignment (CA) [31] method. Specifically, we first perform evaluations on a sim2real UDA in city-like environments, where the source domain is GTA5 and the target domain is Cityscapes. Then we consider a real2real transfer in forest environments, where the source domain and target domain are RUGD and RELLIS, respectively. All models are trained with full access to the images and labels in the source domain and with access only to the images in the target domain. The labels in the target datasets are used only for evaluation purposes. Finally, we further validate our model's performance when adapting from RUGD to our self-collected dataset MESH.
To ensure a fair comparison, all the methods use the same feature extractor $F$; both DA and CALI use the same domain discriminator $D$; and both CA and CALI use the same two classifiers $C_1$ and $C_2$. We also use the same optimizers and optimization-related hyper-parameters, wherever applicable, for all models under comparison.
We use the mean Intersection over Union (mIoU) as the metric to evaluate per-class and overall segmentation performance on the testing images. The IoU of a class is computed as $\text{IoU} = \frac{TP}{TP + FP + FN}$, where $TP$, $FP$, and $FN$ are the numbers of true positive, false positive, and false negative pixels, respectively.
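The metric can be sketched directly from its definition (our own helpers, operating on flattened label lists):

```python
def class_iou(pred, gt, cls):
    """IoU = TP / (TP + FP + FN) for a single class over flattened
    prediction and ground-truth label lists."""
    tp = sum(p == cls and g == cls for p, g in zip(pred, gt))
    fp = sum(p == cls and g != cls for p, g in zip(pred, gt))
    fn = sum(p != cls and g == cls for p, g in zip(pred, gt))
    return tp / (tp + fp + fn)

def mean_iou(pred, gt, classes):
    """mIoU: per-class IoUs averaged over the evaluated classes."""
    return sum(class_iou(pred, gt, c) for c in classes) / len(classes)
```

In practice the counts would be accumulated over all test images via a confusion matrix, but the formula per class is the same.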
V-C.1 GTA5→Cityscapes
Quantitative comparison results of GTA5→Cityscapes are shown in Table I, where segmentations are evaluated on 9 classes (as regrouped in Fig. 14). Our proposed method has significant advantages over the baseline methods for most categories and in overall performance (mIoU*).
In our testing case, SO achieves the highest score for the class person even without any domain adaptation. One possible reason for this is that the deep features of the source person and the target person, from the model trained solely on the source domain, are already well aligned. If we try to interfere with this well-aligned relation using unnecessary additional efforts, the target prediction error might increase (see the mIoU values of person for the other three methods). We call this phenomenon negative transfer, and it also happens to other classes if we compare SO with DA/CA, e.g., sidewalk, building, sky, vegetation, and so on. In contrast, CALI maintains improved performance compared to SO as well as DA/CA. The comparison between CALI and the baselines validates our analysis of DA and CA (Section IV-A): either DA or CA alone is problematic for semantic segmentation, particularly when we strictly follow what the theory supports and do not include other training tricks (which might increase the training complexity and make the training unstable). It also implies that integrating DA and CA is mutually beneficial, with significant improvements; more importantly, CALI is well supported theoretically, and its training process is easy and stable.
Fig. 5 shows examples of the qualitative comparison for UDA of GTA5→Cityscapes. We find that CALI predictions are less noisy than those of the baseline methods, as shown in the second and third columns (sidewalk or car on road), and show better completeness (part of the car is missing in the baselines' results; see the fourth column).
V-C.2 RUGD→RELLIS
We show quantitative results of RUGD→RELLIS in Table II, where only 5 classes are evaluated. (The other classes in Fig. 15 that frequently appear in the source domain (RUGD) are extremely rare in the target domain (RELLIS), hence no prediction for those classes occurs, especially considering the domain shift.) The results show the same trend as Table I. Both tables show that CA has the negative transfer issue (compared with SO) for both sim2real and real2real UDA. However, if we constrain the training of CA with DA, as in our proposed model, the performance is remarkably improved. Some qualitative results are shown in Fig. 7.
V-C.3 RUGD→MESH
Our MESH dataset contains only unlabeled images, which restricts us to a qualitative comparison for the UDA of RUGD→MESH, as shown in Fig. 8. We collected the data in winter forest environments, which are significantly different from the images in the source domain (RUGD), collected in a different season, e.g., summer or spring. These cross-season scenarios make the prediction more challenging. However, it is more practical to evaluate the UDA performance in cross-season scenarios, as we might have to deploy our robot at any time, even in extreme weather conditions, while our available datasets are far from covering every season and every weather condition. From Fig. 8, we can still see obvious advantages of our proposed CALI model over the other baselines.
V-D Discussions
In this section, we discuss our model's behaviors in more detail. Specifically, we first explain the advantages of CALI over CA from the perspective of the training process. Second, we show the vital influence of mistakenly using the wrong order of adversarial training.
The most important part of CA is the discrepancy between the two classifiers, which is the only training force for the functionality of CA. It has been empirically studied in [31] that the target prediction accuracy increases as the target discrepancy decreases; hence, the discrepancy is also an indicator of whether the training is on the right track. We compare the target discrepancy curves of CALI and our baseline CA in Fig. 9, where the curves for the three UDA scenarios are presented from (a) to (c) and we only show the data before iteration 30k. Before around iteration 2k, the target discrepancy of both CALI and CA decreases drastically, but thereafter the discrepancy of CA starts to increase. On the other hand, if we impose a DA constraint over the same CA (iteratively), leading to our proposed CALI, the target discrepancy keeps decreasing as expected. This validates that integrating DA and CA makes the training process of CA more stable, thus improving the target prediction accuracy.
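Concretely, the discrepancy in [31] is the mean absolute difference between the two classifiers' softmax outputs on target samples. A minimal numpy sketch (tensor shapes and names are illustrative, not our implementation):

```python
import numpy as np

def softmax(logits, axis):
    """Numerically stable softmax along the class axis."""
    z = logits - logits.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def discrepancy(logits1, logits2):
    """Mean L1 distance between the two classifier heads' probability maps.

    logits1, logits2: raw scores of shape (N, C) or (N, C, H, W),
    with the class dimension at axis 1.
    """
    p1 = softmax(logits1, axis=1)
    p2 = softmax(logits2, axis=1)
    return np.abs(p1 - p2).mean()
```

Logging this quantity on target batches during training gives the curves compared in Fig. 9.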
As mentioned in Algorithm 1, we have to use the adversarial training order of , instead of . The reason is related to our designed network structure. Following the guidance of Eq. (11), we feed the same input to the two classifiers and the domain discriminator, so the discriminator has to receive the intermediate-level feature as input. If we use the order of in CALI, the outputs of the discriminator will look like Fig. 10: the domain discriminator of CALI quickly converges to the optimal state and can accurately discriminate whether a feature comes from the source or the target domain. In this case, the adversarial loss for updating the feature extractor is near 0, so the whole training fails. This is validated by the target discrepancy curve in Fig. 10, where the discrepancy decreases slightly in the first few iterations and then quickly increases to a high level, showing that the training diverges and the model collapses. It is also verified by the prediction results at (and after) around iteration 1k, shown in Fig. 11, where the first row shows source images and the second row shows target images.
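One way to see this failure mode is through the saturating (minimax) form of the adversarial loss, a generic GAN argument rather than code from our implementation: once the discriminator is confidently correct on target features, the loss term that updates the feature extractor flattens toward 0.

```python
import math

def saturating_adv_loss(d_target):
    """Minimax adversarial term log(1 - D(f_t)) for the feature extractor.

    As a near-optimal discriminator drives D(f_t) -> 0 on target features,
    this term -> 0, so almost no gradient signal reaches the extractor.
    """
    return math.log(1.0 - d_target)

# Toy illustration: loss magnitude shrinks as the discriminator gets confident.
for d in (0.5, 0.1, 0.01, 0.001):
    print(f"D(f_t)={d:.3f}  adversarial loss={saturating_adv_loss(d):.4f}")
```

This is why the update order in Algorithm 1 matters: it keeps the discriminator from reaching its optimal state before the feature extractor receives a usable signal.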
V-E Navigation Missions
To further show the effectiveness of our proposed model for real deployments, we build a navigation system by combining the proposed CALI segmentation model (trained with the RUGD→MESH setup) with our visual planner. We test the behavior of our navigation system in two different forest environments (named MESH in Fig. 12 and MESH in Fig. 13), where the system shows high reliability. In the navigation tasks, the image resolution is , and the speed of pure segmentation inference is around frames per second (FPS). However, since a complete perception system requires several post-processing steps, such as navigability definition, noise filtering, Scaled Euclidean Distance Field computation, and motion primitive evaluation, the response time of the whole perception pipeline (in Python) is around FPS without any engineering optimization. The segmentation inference for navigation is performed on an Nvidia Tesla T4 GPU. We set the linear velocity as and control the angular velocity to track the selected motion primitive. The path length is in Fig. 12 and in Fig. 13. Although the motion speed in the navigation tasks is slow, as a proof of concept with a very basic motion planner the system behaves as expected, and we have validated that the proposed CALI model can accomplish navigation tasks in unstructured environments well.
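To illustrate the navigability-definition step that links segmentation to planning, here is a minimal sketch that turns a class-index map into a binary drivable mask (the class indices and the navigable set are hypothetical, not the labels used in our system):

```python
import numpy as np

# Hypothetical class indices for a forest-scene segmentation.
GRASS, TRAIL, TREE, ROCK = 0, 1, 2, 3
NAVIGABLE = (GRASS, TRAIL)  # classes the planner may drive on

def navigability_mask(seg):
    """Binary mask (1 = navigable) from an (H, W) class-index map."""
    return np.isin(seg, NAVIGABLE).astype(np.uint8)
```

Downstream steps (noise filtering, distance-field computation, motion primitive scoring) then operate on this mask rather than on raw class labels.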
VI Conclusion
We present CALI, a novel unsupervised domain adaptation model specifically designed for semantic segmentation, which requires fine-grained alignments at the level of class features. We carefully investigate the relationship between a coarse alignment and a fine alignment in theory. The theoretical analysis guides the design of the model structure, losses, and training process. We have validated that the coarse alignment can serve as a constraint on the fine alignment and that integrating the two alignments boosts UDA performance for segmentation. The resulting model shows significant advantages over baselines in various challenging UDA scenarios, e.g., sim2real and real2real. We also demonstrate that the proposed segmentation model integrates well with our proposed visual planner to enable highly efficient navigation in off-road environments.
References
 [1] (2010) A theory of learning from different domains. Machine Learning 79 (1), pp. 151–175. Cited by: §II, §II, §III, §III, §III, §III, §VI-A, §VI-A.
 [2] (2007) Analysis of representations for domain adaptation. Advances in neural information processing systems 19, pp. 137. Cited by: §II.
 [3] (2008) Learning bounds for domain adaptation. Cited by: §II.
 [4] (2010) Large-scale machine learning with stochastic gradient descent. In Proceedings of COMPSTAT'2010, pp. 177–186. Cited by: §V-B.
 [5] (2018) Defining the pose of any 3D rigid object and an associated distance. International Journal of Computer Vision 126 (6), pp. 571–596. Cited by: §IV-E.
 [6] (2017) DeepLab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Transactions on Pattern Analysis and Machine Intelligence 40 (4), pp. 834–848. Cited by: §II, §V-B.
 [7] (2017) Rethinking atrous convolution for semantic image segmentation. arXiv preprint arXiv:1706.05587. Cited by: §II.
 [8] (2016) The Cityscapes dataset for semantic urban scene understanding. In Proc. of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §I, §V-A.
 [9] (2009) ImageNet: a large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255. Cited by: §V-B.
 [10] (2016) Domain-adversarial training of neural networks. The Journal of Machine Learning Research 17 (1), pp. 2096–2030. Cited by: §I, §II, §II, §III.
 [11] (2016) Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778. Cited by: §V-B.
 [12] (2018) CyCADA: cycle-consistent adversarial domain adaptation. In International Conference on Machine Learning, pp. 1989–1998. Cited by: §I, §II, §II.
 [13] (2016) FCNs in the wild: pixel-level adversarial and constraint-based adaptation. arXiv preprint arXiv:1612.02649. Cited by: §I, §II, §II.
 [14] (2008) State space sampling of feasible motions for high-performance mobile robot navigation in complex environments. Journal of Field Robotics 25 (6-7), pp. 325–345. Cited by: §IV-E.
 [15] (2007) Optimal rough terrain trajectory generation for wheeled mobile robots. The International Journal of Robotics Research 26 (2), pp. 141–166. Cited by: §IV-E.
 [16] (2021) DAFormer: improving network architectures and training strategies for domain-adaptive semantic segmentation. arXiv preprint arXiv:2111.14887. Cited by: §II.
 [17] (2020) RELLIS-3D dataset: data, benchmarks and analysis. arXiv preprint arXiv:2011.12954. Cited by: §I, §V-A.
 [18] (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §VB.
 [19] (2017) Deep learning markov random field for semantic segmentation. IEEE transactions on pattern analysis and machine intelligence 40 (8), pp. 1814–1828. Cited by: §II.
 [20] (2015) Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3431–3440. Cited by: §II.
 [21] (2019) Taking a closer look at domain shift: category-level adversaries for semantics consistent domain adaptation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2507–2516. Cited by: §II, §II.
 [22] (2013) Rectifier nonlinearities improve neural network acoustic models. In Proc. ICML, Vol. 30, pp. 3. Cited by: §V-B.
 [23] (2020) Instance adaptive self-training for unsupervised domain adaptation. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXVI 16, pp. 415–430. Cited by: §II.
 [24] (2009) A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering 22 (10), pp. 1345–1359. Cited by: §I.
 [25] (1995) Distance metrics on the rigid-body motions with applications to mechanism design. Cited by: §IV-E.
 [26] (2019) Pytorch: an imperative style, highperformance deep learning library. Advances in neural information processing systems 32, pp. 8026–8037. Cited by: §VB.
 [27] (2015) Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434. Cited by: §VB.
 [28] (2021) Vision transformers for dense prediction. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 12179–12188. Cited by: §II.
 [29] (2016) Playing for data: ground truth from computer games. In European conference on computer vision, pp. 102–118. Cited by: §I, §VA.
 [30] (2015) U-Net: convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 234–241. Cited by: §II.
 [31] (2018) Maximum classifier discrepancy for unsupervised domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3723–3732. Cited by: §I, §II, §II, §III, §III, §IV-C, §V-B, §V-C, §V-D.
 [32] (2018) Learning to adapt structured output space for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7472–7481. Cited by: §I, §II, §II, §IV-B.
 [33] (2019) ADVENT: adversarial entropy minimization for domain adaptation in semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2517–2526. Cited by: §I, §II, §II, §IV-B, §V-B, §V-C.
 [34] (2020) Classes matter: a fine-grained adversarial approach to cross-domain semantic segmentation. In European Conference on Computer Vision, pp. 642–659. Cited by: §II, §II.
 [35] (2019) A RUGD dataset for autonomous navigation and visual perception in unstructured outdoor environments. In International Conference on Intelligent Robots and Systems (IROS), Cited by: §I, §V-A.
 [36] (2020) A survey of unsupervised deep domain adaptation. ACM Transactions on Intelligent Systems and Technology (TIST) 11 (5), pp. 1–46. Cited by: §I.
 [37] (2021) SegFormer: simple and efficient design for semantic segmentation with transformers. arXiv preprint arXiv:2105.15203. Cited by: §II.
 [38] (2015) Multi-scale context aggregation by dilated convolutions. arXiv preprint arXiv:1511.07122. Cited by: §II.
 [39] (2017) Curriculum domain adaptation for semantic segmentation of urban scenes. In Proceedings of the IEEE international conference on computer vision, pp. 2020–2030. Cited by: §II.
 [40] (2021) A survey of unsupervised domain adaptation for visual recognition. arXiv preprint arXiv:2112.06745. Cited by: §I.
 [41] (2017) Pyramid scene parsing network. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2881–2890. Cited by: §II.
 [42] (2021) Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6881–6890. Cited by: §II.
 [43] (2018) Unsupervised domain adaptation for semantic segmentation via class-balanced self-training. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 289–305. Cited by: §II.
Appendix
VI-A Proof of Theorem 2
For a hypothesis $h$,

$$\begin{aligned}
\epsilon_T(h) &\leq \epsilon_T(h^*) + \epsilon_T(h, h^*) \\
&\leq \epsilon_T(h^*) + \epsilon_S(h, h^*) + \left|\epsilon_T(h, h^*) - \epsilon_S(h, h^*)\right| \\
&\leq \epsilon_T(h^*) + \epsilon_S(h, h^*) + \tfrac{1}{2} d_{\mathcal{H}\Delta\mathcal{H}}(\mathcal{D}_S, \mathcal{D}_T) \\
&\leq \epsilon_T(h^*) + \epsilon_S(h) + \epsilon_S(h^*) + \tfrac{1}{2} d_{\mathcal{H}\Delta\mathcal{H}}(\mathcal{D}_S, \mathcal{D}_T) \\
&= \epsilon_S(h) + \tfrac{1}{2} d_{\mathcal{H}\Delta\mathcal{H}}(\mathcal{D}_S, \mathcal{D}_T) + \lambda, \quad (22)
\end{aligned}$$

where $\lambda = \epsilon_S(h^*) + \epsilon_T(h^*)$ and $h^* = \arg\min_h \big(\epsilon_S(h) + \epsilon_T(h)\big)$ is the ideal joint hypothesis (see Definition 2 in Section 4.2 of [1]).
VI-B Remapping of Label Space
We regroup the original label classes according to the semantic similarities among classes. In GTA5 and Cityscapes, we cluster building, wall, and fence into the same category; traffic light, traffic sign, and pole into the same group; car, train, bicycle, motorcycle, bus, and truck into the same class; and treat person and rider as the same one. See Fig. 14. Similarly, we also regroup classes in RUGD and RELLIS, as shown in Fig. 15.
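A regrouping of this kind can be implemented as a lookup table over class indices. A minimal sketch (the index values below are illustrative, not the actual GTA5/Cityscapes label IDs):

```python
import numpy as np

# Illustrative original -> regrouped mapping, e.g. hypothetical IDs 2, 3, 4
# (building, wall, fence) all collapsing into a single "construction" class.
REMAP = {0: 0, 1: 1, 2: 2, 3: 2, 4: 2, 5: 3, 6: 3}

def remap_labels(label_map, remap):
    """Apply an index remapping to an (H, W) integer label map via a LUT."""
    lut = np.arange(max(remap) + 1)
    for src, dst in remap.items():
        lut[src] = dst
    return lut[label_map]
```

Because the lookup is a single fancy-indexing operation, the remapping costs one pass over the label map regardless of how many classes are merged.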