Taking A Closer Look at Domain Shift: Category-level Adversaries for Semantics Consistent Domain Adaptation

09/25/2018 ∙ by Yawei Luo, et al. ∙ Huazhong University of Science & Technology

We consider the problem of unsupervised domain adaptation in semantic segmentation, in which the source domain is fully annotated and the target domain is unlabeled. The key to this task is to reduce the domain shift, i.e., to enforce the data distributions of the two domains to be similar. A popular strategy is to align the marginal distribution in the feature space through adversarial learning. However, this global alignment strategy does not consider the local category-level feature distribution. A possible consequence of the global movement is that some categories which are originally well aligned between the source and target may be incorrectly mapped. To address this problem, this paper introduces a category-level adversarial network, aiming to enforce local semantic consistency during the trend of global alignment. Our idea is to take a close look at the category-level data distribution and align each class with an adaptive adversarial loss. Specifically, we reduce the weight of the adversarial loss for category-level aligned features while increasing the adversarial force for those poorly aligned. In this process, we decide how well a feature is category-level aligned between source and target by a co-training approach. In two domain adaptation tasks, i.e., GTA5 → Cityscapes and SYNTHIA → Cityscapes, we validate that the proposed method matches the state of the art in segmentation accuracy.


1 Introduction

Semantic segmentation aims to assign each pixel of a photograph to a semantic class label. Current achievements come at the price of a large amount of dense pixel-level annotations obtained by expensive human labor [4, 23, 26]. An alternative is to resort to simulated data, such as computer-generated scenes [30, 31], so that an unlimited amount of labels is made available. However, models trained on the simulated images do not generalize well to realistic domains. The reason lies in the different data distributions of the two domains, typically known as domain shift [36]. To address this issue, domain adaptation approaches [34, 40, 14, 45, 17, 16, 13, 47] are proposed to bridge the gap between the source and target domains. A majority of recent methods [24, 39, 42, 41] aim to align the feature distributions of the different domains. Works along this line are based on the theoretical insight in [1] that minimizing the divergence between domains lowers the upper bound of the error on the target domain. Among this cohort of domain adaptation methods, a common and pivotal step is minimizing some distance metric between the source and target feature distributions [24, 39]. Another popular choice, which borrows the idea from adversarial learning [10], is to minimize the accuracy of domain prediction. Through a minimax game between two adversarial networks, the generator is trained to produce features that confuse the discriminator, while the latter is required to correctly classify which domain the features are generated from.

Although the works along the path of adversarial learning have led to impressive results [38, 15, 22, 19, 42, 35], they suffer from a major limitation: when the generator network can perfectly fool the discriminator, it merely aligns the global marginal distribution of the features in the two domains (i.e., P(F_s) ≈ P(F_t), where F_s and F_t denote the features of the source and target domain in the latent space) while ignoring the local joint distribution shift, i.e., the shift between P(F_s, Y_s) and P(F_t, Y_t), where Y_s and Y_t denote the categories of the features. This shift is closely related to the semantic consistency of each category. As a result, the de facto use of the adversarial loss may cause those target domain features, which are already well aligned to their semantic counterparts in the source domain, to be mapped to an incorrect semantic category (negative transfer). This side effect becomes more severe when a larger weight is put on the adversarial loss.

To address the limitation of global adversarial learning, we propose a category-level adversarial network (CLAN), prioritizing category-level alignment, which naturally leads to global distribution alignment. A cartoon comparison of traditional adversarial learning and the proposed method is shown in Fig. 1. The key idea of CLAN is two-fold. First, we identify those classes whose features are already well aligned between the source and target domains, and protect this category-level alignment from the side effect of adversarial learning. Second, we identify the classes whose features are distributed differently between the two domains and increase the weight of the adversarial loss on them during training. In this process, we utilize co-training [46], which enables high-confidence predictions with two diverse classifiers, to predict how well each feature is semantically aligned between the source and target domains. Specifically, if the two classifiers give consistent predictions, it indicates that the feature is predictive and achieves good semantic alignment. In such a case, we reduce the influence of the adversarial loss in order to encourage the network to generate invariant features that keep semantic consistency between domains. On the contrary, if the predictions disagree with each other, which indicates that the target feature is far from being correctly mapped, we increase the weight of the adversarial loss on that feature so as to accelerate the alignment. Note that 1) our adversarial learning scheme acts directly on the output space; by regarding the output predictions as features, the proposed method jointly promotes the optimization of both the classifier and the extractor; 2) our method does not guarantee rigorous joint distribution alignment between domains. Yet, compared with marginal distribution alignment, our method maps the target features closer (or, at worst, causes no negative transfer) to the source features of the same categories. The main contributions are summarized below.

  • By proposing to adaptively weight the adversarial loss for different features, we emphasize the importance of category-level feature alignment in reducing domain shift.

  • Our results are on par with the state-of-the-art UDA methods on two transfer learning tasks, i.e., GTA5 [30] → Cityscapes [8] and SYNTHIA [31] → Cityscapes.

2 Related Works

This section will focus on adversarial learning and co-training techniques for unsupervised domain adaptation, which form the two main motivations of our method.

Adversarial learning. Ben-David et al. [1] proved that the adaptation loss is bounded by three terms, i.e., the expected loss on the source domain, the domain divergence, and the shared error of the ideal joint hypothesis on the source and target domains. Because the first term corresponds to the well-studied supervised learning problem and the third term is considered sufficiently low to achieve an accurate adaptation, the majority of recent works lay emphasis on the second term. Adversarial adaptation methods are good examples of this type of approach and can be investigated on different levels. Some methods focus on the distribution shift in the latent feature space [38, 15, 22, 19, 42, 35]. For example, Hoffman et al. [15] appended category statistic constraints to the adversarial model, aiming to improve semantic consistency in the target domain. Other methods address the adaptation problem at the pixel level [21, 3], relating to style transfer approaches [48, 7] that make images indistinguishable across domains. A joint consideration of pixel- and feature-level domain adaptation is studied in [14]. Besides alignment in the bottom feature layers, Tsai et al. [40] found that directly aligning the output space is more effective in semantic segmentation. Domain adaptation in the output space enables joint optimization of both prediction and representation, so our method utilizes this advantage.

Co-training. Co-training [46] belongs to multi-view learning, in which learners are trained alternately on two distinct views with confident labels from the unlabeled data. In UDA, this line of methods [43, 5, 32, 25] is able to assign pseudo labels to unlabeled samples in the target domain, which enables direct measurement and minimization of the classification loss on the target domain. In general, co-training enforces the two classifiers to be diverse in the learned parameters, which can be achieved via dropout [33], consensus regularization [34] or parameter diversity [43], etc. Similar to co-training, tri-training keeps two classifiers producing pseudo labels and uses these pseudo labels to train an extra classifier [32, 43]. Apart from assigning pseudo labels to unlabeled data, Saito et al. [33, 34] maximized the consensus of two classifiers for domain adaptation.

Our work does not follow the strategy of global feature alignment [40, 15, 38] or classifier consensus maximization [33, 34]. Instead, category-level feature alignment is enforced through co-training. To our knowledge, we make an early attempt to adaptively weight the adversarial loss for features in the segmentation task according to the local alignment situation.

Figure 2: Overview of the proposed category-level adversarial network. It consists of a feature extractor E, two classifiers C_1 and C_2, and a discriminator D. C_1 and C_2 are fed with the deep feature map extracted by E and predict semantic labels for each pixel from diverse views. In the source flow, the sum of the two prediction maps is used to calculate a segmentation loss as well as an adversarial loss from D. In the target flow, the sum of the two prediction maps is forwarded to D to produce a raw adversarial loss map. Additionally, we adopt the discrepancy of the two prediction maps to produce a local alignment score map. This map evaluates the category-level alignment degree of each feature and is used to adaptively weight the raw adversarial loss map.

3 Method

3.1 Problem Settings

We focus on the problem of unsupervised domain adaptation (UDA) in semantic segmentation, where we have access to the source data X_S with pixel-level labels Y_S, and the target data X_T without labels. The goal is to learn a model G that can correctly predict the pixel-level labels for the target data X_T. Traditional adversarial networks (TAN) consider two aspects for domain adaptation. First, these methods train a model that distills knowledge from the labeled data in order to minimize the segmentation loss in the source domain, formalized as a fully supervised problem:

L_seg(G) = E_(x_s, y_s) [ ℓ(G(x_s), y_s) ],   (1)

where E denotes statistical expectation and ℓ(·, ·) is an appropriate loss function, such as multi-class cross entropy.

Second, adversaries-based UDA methods also train G to learn domain-invariant features by confusing a domain discriminator D that is able to distinguish between samples of the source and target domains. This property is achieved by minimaxing an adversarial loss:

L_adv(X_S, X_T) = E_(x_s) [ log D(G(x_s)) ] + E_(x_t) [ log(1 − D(G(x_t))) ].   (2)

However, as mentioned above, there is a major limitation of traditional adversarial learning methods: even under perfect alignment of the marginal distribution, there might be negative transfer, which causes samples from different domains but with the same class label to be mapped farther apart in the feature space. In some cases, some classes are already aligned between domains, but the adversarial loss might disrupt the existing local alignment when pursuing global marginal distribution alignment. In this paper, we call this phenomenon "lack of semantic consistency", which is a critical cause of performance degradation.
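To make the two terms above concrete, the following PyTorch-style sketch shows one way to compute the source segmentation loss (Eq. 1) and the output-space adversarial loss (Eq. 2) of a traditional adversarial network. The functions G and D, the tensor shapes, and the "source = 1" label convention are assumptions for illustration; this is not the authors' released implementation.

```python
import torch
import torch.nn.functional as F

def tan_losses(G, D, x_s, y_s, x_t):
    """The two terms of a traditional adversarial UDA objective (sketch).

    x_s, x_t: source / target images, shape (N, 3, H, W)
    y_s:      source ground-truth labels, shape (N, H, W), class indices
    G: segmentation network returning per-pixel logits of shape (N, C, H, W)
    D: fully-convolutional domain classifier returning logits (N, 1, h, w)
    """
    p_s = G(x_s)
    p_t = G(x_t)

    # Eq. 1: supervised segmentation loss on the labeled source domain
    loss_seg = F.cross_entropy(p_s, y_s)

    # Eq. 2: adversarial loss in the output space -- the generator tries to
    # make target predictions indistinguishable from source ones for D
    # (here "source" is labeled 1; the convention is arbitrary)
    d_t = D(F.softmax(p_t, dim=1))
    loss_adv = F.binary_cross_entropy_with_logits(d_t, torch.ones_like(d_t))

    return loss_seg, loss_adv
```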

3.2 Network Architecture

Our network architecture is illustrated in Fig. 2. It is composed of a generator G and a discriminator D. G can be any FCN-based segmentation network [37, 23, 4], and D is a CNN-based binary classifier with a fully-convolutional output [10]. As suggested in the standard co-training algorithm [46], the generator G is divided into a feature extractor E and two classifiers C_1 and C_2. E extracts features from input images; C_1 and C_2 classify the features generated by E into one of the pre-defined semantic classes, such as car, tree and road. Following the co-training practice, we enforce the weights of C_1 and C_2 to be diverse through a cosine distance loss. This provides us with two distinct views / classifiers to make semantic predictions for each feature. The final prediction map P is obtained by summing up the two diverse prediction tensors p^(1) and p^(2), and we call P an ensemble prediction.

Given a source domain image x_s, the feature extractor E outputs a feature map, which is input to the classifiers C_1 and C_2 to yield the pixel-level ensemble prediction P_s. On the one hand, P_s is used to calculate a segmentation loss under the supervision of the ground-truth label y_s. On the other hand, P_s is input to D to generate an adversarial loss.

Given a target domain image x_t, we also forward it to G and obtain an ensemble prediction P_t. Different from the source data flow, we additionally generate a discrepancy map out of the two predictions p_t^(1) and p_t^(2), denoted as M(p_t^(1), p_t^(2)), where M(·, ·) denotes some proper distance metric measuring the element-wise discrepancy between p_t^(1) and p_t^(2). Taking the cosine distance as an example, M forms an H × W shaped tensor whose (h, w)-th element equals 1 − cos(p_t^(1)(h, w), p_t^(2)(h, w)). Once D produces an adversarial loss map, an element-wise multiplication is performed between the discrepancy map and the adversarial loss map. As a result, the final adaptive adversarial loss on a target sample takes the form of a sum over all pixels of the weighted adversarial loss. In this manner, each pixel on the segmentation map is differently weighted w.r.t. the adversarial loss.
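The local alignment score map described above can be computed directly from the two classifier outputs. Below is a minimal sketch (our own illustration, assuming logits of shape N × C × H × W that are turned into softmax probability maps) of the per-pixel cosine-distance discrepancy:

```python
import torch.nn.functional as F

def local_alignment_score(p1, p2, eps=1e-8):
    """Per-pixel discrepancy map based on the cosine distance.

    p1, p2: logits from the two classifiers C_1 and C_2, shape (N, C, H, W).
    Returns an (N, H, W) map that is close to 0 where the classifiers agree
    (well-aligned features) and larger where they disagree.
    """
    q1 = F.softmax(p1, dim=1)
    q2 = F.softmax(p2, dim=1)
    cos_sim = F.cosine_similarity(q1, q2, dim=1, eps=eps)   # (N, H, W)
    return 1.0 - cos_sim                                     # cosine distance
```

Multiplying this map element-wise with the raw adversarial loss map then down-weights pixels whose features are already well aligned.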

3.3 Training Objective

The proposed network features three loss functions, i.e., the segmentation loss, the weight discrepancy loss and the self-adaptive adversarial loss. Given an image of shape H × W × 3 and a label map of shape H × W × C, where C is the number of semantic classes, the segmentation loss (multi-class cross-entropy loss) can be concretized from Eq. 1 as

L_seg = − E_(x_s, y_s) Σ_(h,w) Σ_(c∈C) y_s^(h,w,c) log p^(h,w,c),   (3)

where p^(h,w,c) denotes the predicted probability of class c on pixel (h, w), and y_s^(h,w,c) denotes the ground-truth probability of class c on pixel (h, w): y_s^(h,w,c) = 1 if pixel (h, w) belongs to class c, and y_s^(h,w,c) = 0 otherwise.
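For readers who prefer code to notation, the per-pixel cross-entropy of Eq. 3 can be written out explicitly as follows. This is a minimal sketch with assumed tensor shapes; in practice a single library call such as F.cross_entropy computes the same quantity.

```python
import torch.nn.functional as F

def segmentation_loss(logits, labels):
    """Multi-class cross-entropy of Eq. 3, written out per pixel.

    logits: (N, C, H, W) raw scores for C semantic classes
    labels: (N, H, W) integer class indices (the one-hot y in Eq. 3)
    """
    log_p = F.log_softmax(logits, dim=1)                      # log p^(h,w,c)
    one_hot = F.one_hot(labels, num_classes=logits.shape[1])  # (N, H, W, C)
    one_hot = one_hot.permute(0, 3, 1, 2).float()             # (N, C, H, W)
    # - sum_c y^(h,w,c) * log p^(h,w,c), averaged over pixels and images
    return -(one_hot * log_p).sum(dim=1).mean()
```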

For the second loss, as suggested in the standard co-training algorithm [46], the two classifiers C_1 and C_2 should have possibly diverse parameters in order to provide two different views on a feature. Otherwise, the training degenerates to self-training. Specifically, we enforce the divergence of the weights of the convolutional layers of the two classifiers by minimizing their cosine similarity. Therefore, we have the following weight discrepancy loss:

L_weight = (w_1 · w_2) / (‖w_1‖ ‖w_2‖),   (4)

where w_1 and w_2 are obtained by flattening and concatenating the weights of the convolution filters of C_1 and C_2.
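A rough sketch of Eq. 4 follows: flatten and concatenate the weights of the two classifiers and take their cosine similarity. For simplicity the sketch iterates over all parameters, whereas the paper restricts the term to the convolution filters; it also assumes C1 and C2 share the same architecture.

```python
import torch

def weight_discrepancy_loss(C1, C2, eps=1e-8):
    """Cosine similarity between the flattened weights of the two classifiers
    (Eq. 4). Minimizing it drives C1 and C2 towards diverse parameters.

    NOTE: all parameters are used here for brevity; the paper uses only the
    convolution filters. Assumes C1 and C2 have identical architectures.
    """
    w1 = torch.cat([p.view(-1) for p in C1.parameters()])
    w2 = torch.cat([p.view(-1) for p in C2.parameters()])
    return torch.dot(w1, w2) / (w1.norm() * w2.norm() + eps)
```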

Third, we adopt the discrepancy between the two predictions p^(1) and p^(2) as an indicator to weight the adversarial loss. The self-adaptive adversarial loss can be extended from the traditional adversarial loss (Eq. 2) as

L_adv(X_S, X_T) = E_(x_s) [ log D(G(x_s)) ] + E_(x_t) [ (λ_local · M(p_t^(1), p_t^(2)) + ε) · log(1 − D(G(x_t))) ],   (5)

where p_t^(1) and p_t^(2) are the predictions made by C_1 and C_2 respectively, M(·, ·) denotes the cosine distance, and λ_local controls the adaptive weight for the adversarial loss. Note that in Eq. 5, to stabilize the training process, we add a small number ε to the self-adaptive weight.
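Putting the pieces together, one plausible implementation of the self-adaptive adversarial term on the target flow is sketched below. The names lambda_local and epsilon mirror the symbols above; the tensor shapes, the "source = 1" label convention, and the interpolation step are assumptions of this sketch rather than the reference code.

```python
import torch
import torch.nn.functional as F

def adaptive_adversarial_loss(D, p1_t, p2_t, lambda_local, epsilon):
    """Self-adaptive adversarial loss on a target image (sketch of Eq. 5).

    p1_t, p2_t: target-domain logits from the two classifiers, (N, C, H, W).
    Pixels where the classifiers disagree (poor alignment) receive a larger
    weight and hence a stronger adversarial force.
    """
    q1, q2 = F.softmax(p1_t, dim=1), F.softmax(p2_t, dim=1)
    weight = (1.0 - F.cosine_similarity(q1, q2, dim=1)).detach()   # (N, H, W)

    ensemble = F.softmax(p1_t + p2_t, dim=1)        # ensemble prediction P_t
    d_out = D(ensemble)                             # domain logits, (N, 1, h, w)
    # raw per-pixel adversarial loss: push D towards the "source" label (= 1)
    raw = F.binary_cross_entropy_with_logits(
        d_out, torch.ones_like(d_out), reduction='none').squeeze(1)

    # match spatial sizes if D changes the resolution
    if weight.shape[-2:] != raw.shape[-2:]:
        weight = F.interpolate(weight.unsqueeze(1), size=raw.shape[-2:],
                               mode='bilinear', align_corners=False).squeeze(1)

    return ((lambda_local * weight + epsilon) * raw).mean()
```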

With the above loss terms, the overall loss function of our approach can be written as

L(X_S, X_T) = L_seg + λ_weight · L_weight + λ_adv · L_adv,   (6)

where λ_weight and λ_adv denote the hyper-parameters that control the relative importance of the three losses. The training objective of CLAN is

min_(E, C_1, C_2) max_D L(X_S, X_T).   (7)

We solve Eq. 7 by alternating between optimizing G and D until L(X_S, X_T) converges.
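The alternating optimization of Eq. 7 can be organized as one generator step followed by one discriminator step per iteration. The sketch below reuses the helper functions sketched earlier in this section (weight_discrepancy_loss and adaptive_adversarial_loss) and treats E, C1, C2, D and the two optimizers as given; it is an illustrative outline under those assumptions, not the reference training code.

```python
import torch
import torch.nn.functional as F

def train_step(E, C1, C2, D, opt_G, opt_D, x_s, y_s, x_t,
               lambda_weight, lambda_adv, lambda_local, epsilon):
    """One alternating CLAN update (sketch of Eq. 7). E, C1, C2 form G."""
    # ---- generator step: minimize Eq. 6 with D held fixed ----
    opt_G.zero_grad()
    f_s, f_t = E(x_s), E(x_t)
    p1_s, p2_s = C1(f_s), C2(f_s)
    p1_t, p2_t = C1(f_t), C2(f_t)

    loss_G = (F.cross_entropy(p1_s + p2_s, y_s)
              + lambda_weight * weight_discrepancy_loss(C1, C2)
              + lambda_adv * adaptive_adversarial_loss(D, p1_t, p2_t,
                                                       lambda_local, epsilon))
    loss_G.backward()
    opt_G.step()

    # ---- discriminator step: classify source vs. target predictions ----
    opt_D.zero_grad()
    d_s = D(F.softmax((p1_s + p2_s).detach(), dim=1))
    d_t = D(F.softmax((p1_t + p2_t).detach(), dim=1))
    loss_D = (F.binary_cross_entropy_with_logits(d_s, torch.ones_like(d_s))
              + F.binary_cross_entropy_with_logits(d_t, torch.zeros_like(d_t)))
    loss_D.backward()
    opt_D.step()
```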

Figure 3: A contrastive analysis of CLAN and the traditional adversarial network (TAN). (a): A target image; we focus on the poles and traffic signs in orange boxes. (b): A non-adapted segmentation result. Although the global segmentation result is poor, the poles and traffic signs are correctly segmented, indicating that some classes are originally aligned between domains, even without any domain adaptation. (c): Adapted result of TAN, in which a decent segmentation map is produced but poles and traffic signs are poorly segmented. The reason is that the global alignment strategy tends to assign a conservative prediction to a feature and would lead some features to be predicted as other prevalent classes [11, 18], thus causing those infrequent features to be negatively transferred. (d): Adapted result of CLAN. CLAN reduces the weight of the adversarial loss for those aligned features. As a result, the originally well-segmented classes are well preserved. We then map the high-dimensional features of (b), (c) and (d) to a 2-D space with t-SNE [28], shown in (e), (f) and (g). The comparison of feature distributions further proves that CLAN can enforce category-level alignment during the trend of global alignment. (For a clear illustration, we only show 4 related classes, i.e., building in blue, traffic sign in orange, pole in red and vegetation in green.)

3.4 Analysis

The major difference between the proposed framework and traditional adversarial learning lies in two aspects: the discrepancy loss and the category-level adversarial loss. Accordingly, our analysis focuses on these two differences.

First, the discrepancy (co-training) loss encourages E to learn domain-invariant semantics instead of domain-specific elements such as illumination. In our network, the classifiers C_1 and C_2 1) are encouraged to capture possibly different characteristics of a feature, which is ensured by the discrepancy loss, and 2) are enforced to make the same prediction on any feature (no matter source or target), which is required by the segmentation loss and the adversarial loss. The two forces together require that E should capture the essential aspect of a pixel across the source and target domains, which, as we are aware, is the pure semantics of a pixel, i.e., the domain-invariant aspect of a pixel. Without the discrepancy loss (co-training), force 1) is missing, and there is a weaker requirement for E to learn domain-invariant information. On the other side, in our simulated-to-real task, the two domains vary a lot at the visual level but overlap at the semantic level. If C_1 and C_2 are fed with visual-level features from E, their predictions tend to be inaccurate in the target domain and to differ from each other, which will be punished by large adversarial losses. As a result, once our algorithm converges, C_1 and C_2 will be fed with semantic-level features instead of visual-level features. That is, E is encouraged to learn domain-invariant semantics. Therefore, the discrepancy loss serves as an implicit contributing factor for the improved adaptation ability.

Second, as our major contribution, we extend the traditional adversarial loss with an adaptive weight. On the one hand, when the discrepancy M(p^(1), p^(2)) is large, feature maps of the same class do not have similar joint distributions between the two domains: they suffer from semantic inconsistency. Therefore, the weights are assigned so as to encourage G to fool D mainly on features that suffer from domain shift. On the other hand, when the discrepancy is small, the joint distribution has a large overlap across domains, indicating that the semantic inconsistency problem is not severe. Under this circumstance, G tends to ignore the adversarial punishment from D. From the view of D, the introduction of the adaptive weight encourages D to distill more knowledge from examples suffering from semantic inconsistency rather than from the well-aligned classes. As a result, CLAN is able to improve the category-level alignment degree during adversarial training. This can be regarded as an explicit contributing factor for the adaptation ability. We additionally give a contrastive analysis between the traditional adversarial network (TAN) and CLAN on their adaptation results in Fig. 3.

GTA5 → Cityscapes

Method | Arch. | Meth. | road | side. | buil. | wall | fence | pole | light | sign | vege. | terr. | sky | pers. | rider | car | truck | bus | train | motor | bike | mIoU | gain
Source only | V | - | 64.0 | 22.1 | 68.6 | 13.3 | 8.7 | 19.9 | 15.5 | 5.9 | 74.9 | 13.4 | 37.0 | 37.7 | 10.3 | 48.2 | 6.1 | 1.2 | 1.8 | 10.8 | 2.9 | 24.3 | -
CBST [49] | V | ST | 90.4 | 50.8 | 72.0 | 18.3 | 9.5 | 27.2 | 28.6 | 14.1 | 82.4 | 25.1 | 70.8 | 42.6 | 14.5 | 76.9 | 5.9 | 12.5 | 1.2 | 14.0 | 28.6 | 36.1 | 11.8
Source only | V | - | 25.9 | 10.9 | 50.5 | 3.3 | 12.2 | 25.4 | 28.6 | 13.0 | 78.3 | 7.3 | 63.9 | 52.1 | 7.9 | 66.3 | 5.2 | 7.8 | 0.9 | 13.7 | 0.7 | 24.9 | -
MCD [34] | V | AT | 86.4 | 8.5 | 76.1 | 18.6 | 9.7 | 14.9 | 7.8 | 0.6 | 82.8 | 32.7 | 71.4 | 25.2 | 1.1 | 76.3 | 16.1 | 17.1 | 1.4 | 0.2 | 0.0 | 28.8 | 3.9
Source only | V | - | 18.1 | 6.8 | 64.1 | 7.3 | 8.7 | 21.0 | 14.9 | 16.8 | 45.9 | 2.4 | 64.4 | 41.6 | 17.5 | 55.3 | 8.4 | 5.0 | 6.9 | 4.3 | 13.8 | 22.3 | -
CDA [44] | V | AT | 74.9 | 22.0 | 71.7 | 6.0 | 11.9 | 8.4 | 16.3 | 11.1 | 75.7 | 13.3 | 66.5 | 38.0 | 9.3 | 55.2 | 18.8 | 18.9 | 0.0 | 16.8 | 14.6 | 28.9 | 6.6
Source only | V | - | 26.0 | 14.9 | 65.1 | 5.5 | 12.9 | 8.9 | 6.0 | 2.5 | 70.0 | 2.9 | 47.0 | 24.5 | 0.0 | 40.0 | 12.1 | 1.5 | 0.0 | 0.0 | 0.0 | 17.9 | -
FCNs in the wild [15] | V | AT | 70.4 | 32.4 | 62.1 | 14.9 | 5.4 | 10.9 | 14.2 | 2.7 | 79.2 | 21.3 | 64.6 | 44.1 | 4.2 | 70.4 | 8.0 | 7.3 | 0.0 | 3.5 | 0.0 | 27.1 | 9.2
CyCADA (feature) [14] | V | AT | 85.6 | 30.7 | 74.7 | 14.4 | 13.0 | 17.6 | 13.7 | 5.8 | 74.6 | 15.8 | 69.9 | 38.2 | 3.5 | 72.3 | 16.0 | 5.0 | 0.1 | 3.6 | 0.0 | 29.2 | 11.3
Baseline (TAN) [40] | V | AT | 87.3 | 29.8 | 78.6 | 21.1 | 18.2 | 22.5 | 21.5 | 11.0 | 79.7 | 29.6 | 71.3 | 46.8 | 6.5 | 80.1 | 23.0 | 26.9 | 0.0 | 10.6 | 0.3 | 35.0 | 17.1
CLAN | V | AT | 88.0 | 30.6 | 79.2 | 23.4 | 20.5 | 26.1 | 23.0 | 14.8 | 81.6 | 34.5 | 72.0 | 45.8 | 7.9 | 80.5 | 26.6 | 29.9 | 0.0 | 10.7 | 0.0 | 36.6 | 18.7
Source only | R | - | 75.8 | 16.8 | 77.2 | 12.5 | 21.0 | 25.5 | 30.1 | 20.1 | 81.3 | 24.6 | 70.3 | 53.8 | 26.4 | 49.9 | 17.2 | 25.9 | 6.5 | 25.3 | 36.0 | 36.6 | -
Baseline (TAN) [40] | R | AT | 86.5 | 25.9 | 79.8 | 22.1 | 20.0 | 23.6 | 33.1 | 21.8 | 81.8 | 25.9 | 75.9 | 57.3 | 26.2 | 76.3 | 29.8 | 32.1 | 7.2 | 29.5 | 32.5 | 41.4 | 4.8
CLAN | R | AT | 87.0 | 27.1 | 79.6 | 27.3 | 23.3 | 28.3 | 35.5 | 24.2 | 83.6 | 27.4 | 74.2 | 58.6 | 28.0 | 76.2 | 33.1 | 36.7 | 6.7 | 31.9 | 31.4 | 43.2 | 6.6
Table 1: Adaptation from GTA5 [30] to Cityscapes [8]. We present per-class IoU and mean IoU. “V” and “R” represent the VGG16-FCN8s and ResNet101 backbones, respectively. “ST” and “AT” represent two lines of method, i.e., self training- and adversarial learning-based DA. We highlight the best result in each column in bold. To clearly showcase the effect of CLAN on infrequent classes, we highlight these classes in blue. Gain indicates the mIoU improvement over using the source only.
SYNTHIA → Cityscapes

Method | Arch. | Meth. | road | side. | buil. | light | sign | vege. | sky | pers. | rider | car | bus | motor | bike | mIoU | gain
Source only | V | - | 17.2 | 19.7 | 47.3 | 3.0 | 9.1 | 71.8 | 78.3 | 37.6 | 4.7 | 42.2 | 9.0 | 0.1 | 0.9 | 26.2 | -
CBST [49] | V | ST | 69.6 | 28.7 | 69.5 | 11.9 | 13.6 | 82.0 | 81.9 | 49.1 | 14.5 | 66.0 | 6.6 | 3.7 | 32.4 | 36.1 | 9.9
Source only | V | - | 6.4 | 17.7 | 29.7 | 0.0 | 7.2 | 30.3 | 66.8 | 51.1 | 1.5 | 47.3 | 3.9 | 0.1 | 0.0 | 20.2 | -
FCNs in the wild [15] | V | AT | 11.5 | 19.6 | 30.8 | 0.1 | 11.7 | 42.3 | 68.7 | 51.2 | 3.8 | 54.0 | 3.2 | 0.2 | 0.6 | 22.9 | 2.7
Cross-city [6] | V | AT | 62.7 | 25.6 | 78.3 | 1.2 | 5.4 | 81.3 | 81.0 | 37.4 | 6.4 | 63.5 | 16.1 | 1.2 | 4.6 | 35.7 | 15.2
Baseline (TAN) [40] | V | AT | 78.9 | 29.2 | 75.5 | 0.1 | 4.8 | 72.6 | 76.7 | 43.4 | 8.8 | 71.1 | 16.0 | 3.6 | 8.4 | 37.6 | 17.4
CLAN | V | AT | 80.4 | 30.7 | 74.7 | 1.4 | 8.0 | 77.1 | 79.0 | 46.5 | 8.9 | 73.8 | 18.2 | 2.2 | 9.9 | 39.3 | 19.1
Source only | R | - | 55.6 | 23.8 | 74.6 | 6.1 | 12.1 | 74.8 | 79.0 | 55.3 | 19.1 | 39.6 | 23.3 | 13.7 | 25.0 | 38.6 | -
Baseline (TAN) [40] | R | AT | 79.2 | 37.2 | 78.8 | 9.9 | 10.5 | 78.2 | 80.5 | 53.5 | 19.6 | 67.0 | 29.5 | 21.6 | 31.3 | 45.9 | 7.3
CLAN | R | AT | 81.3 | 37.0 | 80.1 | 16.1 | 13.7 | 78.2 | 81.5 | 53.4 | 21.2 | 73.0 | 32.9 | 22.6 | 30.7 | 47.8 | 9.2
Table 2: Adaptation from SYNTHIA [31] to Cityscapes [8]. We present per-class IoU and mean IoU for evaluation. CLAN and state-of-the-art domain adaptation methods are compared. For each backbone, the best accuracy is highlighted in bold. To clearly showcase the effect of CLAN on infrequent classes, we highlight these classes in blue. Gain indicates the mIoU improvement over using the source only.

4 Experiment

4.1 Datasets

We evaluate CLAN together with several state-of-the-art algorithms on two adaptation tasks, i.e., SYNTHIA [31] → Cityscapes [8] and GTA5 [30] → Cityscapes. Cityscapes is a real-world dataset with 5,000 street scenes, which we use as the target domain. GTA5 contains 24,966 high-resolution images compatible with the Cityscapes annotated classes. SYNTHIA contains 9,400 synthetic images. We use SYNTHIA or GTA5 as the source domain.

4.2 Implementation Details

We use PyTorch for implementation. We utilize the DeepLab-v2 [4] framework with ResNet-101 [12] pre-trained on ImageNet [9] as the source-only backbone for the generator G. We use the single-layer adversarial DA method proposed in [40] as the TAN baseline. For co-training, we duplicate two copies of the last classification module and arrange them in parallel after the feature extractor, as illustrated in Fig. 2. For a fair comparison to the methods with a VGG backbone, we also apply CLAN to the VGG-16 based FCN8s [23]. For the discriminator D, we adopt a structure similar to [29], which consists of 5 convolution layers, each followed by a Leaky-ReLU [27] except the last layer. Finally, we add an up-sampling layer after the last layer to rescale the output to the size of the input map, in order to match the size of the local alignment score map. During training, we use SGD [2] with momentum as the optimizer for G, and Adam [20] to optimize D; both optimizers use weight decay. For SGD, the initial learning rate is decayed by a poly learning rate policy; for Adam, the learning rate is fixed during training. We use cropped images during training, and for evaluation we up-sample the prediction map by a factor of 2 and then evaluate mIoU. In our best model, the stabilizing constant ε is set to 0.4.
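As a purely illustrative sketch of the discriminator described above, a five-layer fully-convolutional domain classifier in the style of [29, 40] could look as follows. The channel widths, kernel size, stride and Leaky-ReLU slope below are common choices for this kind of network and are assumptions on our part, not the paper's reported values.

```python
import torch.nn as nn
import torch.nn.functional as F

class OutputSpaceDiscriminator(nn.Module):
    """Fully-convolutional domain classifier over segmentation maps.

    NOTE: channel widths, kernel size, stride and the Leaky-ReLU slope are
    illustrative assumptions, not the configuration reported in the paper.
    """
    def __init__(self, num_classes, base=64, negative_slope=0.2):
        super().__init__()
        chans = [num_classes, base, base * 2, base * 4, base * 8]
        layers = []
        for c_in, c_out in zip(chans[:-1], chans[1:]):
            layers += [nn.Conv2d(c_in, c_out, kernel_size=4, stride=2, padding=1),
                       nn.LeakyReLU(negative_slope, inplace=True)]
        # fifth convolution: single-channel domain logit map, no activation
        layers.append(nn.Conv2d(chans[-1], 1, kernel_size=4, stride=2, padding=1))
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        # x: softmax segmentation map, shape (N, num_classes, H, W)
        out = self.net(x)
        # rescale to the input resolution so the logit map matches the
        # local alignment score map, as described in the text
        return F.interpolate(out, size=x.shape[-2:], mode='bilinear',
                             align_corners=False)
```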

4.3 Comparative Studies

We present the adaptation results on the task GTA5 → Cityscapes in Table 1, with comparisons to the state-of-the-art domain adaptation methods [34, 44, 15, 14, 40, 49]. We observe that CLAN significantly outperforms the corresponding source-only segmentation model, by 18.7% mIoU on VGG-16 and 6.6% on ResNet-101. Besides, CLAN also outperforms the state-of-the-art methods, improving the mIoU by over 7% compared with MCD [34], CDA [44] and CyCADA [14]. Compared to the traditional adversarial network (TAN) in the output space [40], CLAN brings over 1.5% improvement in mIoU on both the VGG-16 and ResNet-101 architectures. In some infrequent classes which are prone to suffer from the side effect of global alignment, e.g., fence, traffic light and pole, CLAN significantly outperforms TAN. Besides, we also compare CLAN with the self training-based methods, among which CBST [49] is the current state of the art. This series of explicit methods usually achieves higher mIoU than implicit feature alignment. In our experiment, we find that CLAN is on par with CBST. Some qualitative segmentation examples are shown in Fig. 5.

Table 2 provides the comparative results on the task SYNTHIA → Cityscapes. On VGG-16, our final model yields 39.3% mIoU, which significantly improves the non-adaptive segmentation result by 19.1%. Besides, CLAN outperforms the methods of [15] and [6] by 16.4% and 3.6%, respectively. On ResNet-101, CLAN brings a 9.2% improvement over the source-only segmentation model. Compared to TAN [40], the use of the adaptive adversarial loss also brings a 1.9% gain in mIoU. Likewise, CLAN is more effective on those infrequent classes which are prone to be negatively transferred, such as traffic light and traffic sign, bringing clear gains on both. On some prevalent classes, CLAN is also on par with the baseline method. Note that on the "train" class, the improvement is not stable; this is because very few training samples contain the "train" class. Finally, compared with the self training-based method, CLAN outperforms CBST by 3.2% in mIoU. These observations are consistent with our t-SNE analysis in Fig. 3, which further verifies that CLAN can actually boost the category-level alignment in the segmentation-based DA task.

Figure 4: Left: Cluster center distance variation as training goes on. Right: Mean IoU (bars, left y-axis) and convergence performance (lines, right y-axis) when training with different λ_local and ε.

4.4 Feature Distribution

To further verify that CLAN is able to decrease the negative transfer effect for well-aligned features, we design an experiment to take a closer look at the category-level alignment degree of each class. Specifically, we randomly select 1K source and 1K target images and calculate the cluster center distance (CCD) of the features of the same class between the two domains at each training epoch. The CCD at each epoch is normalized by the CCD of the pre-trained model without any fine-tuning, so that the un-adapted model always starts at a normalized CCD of 1. We report the normalized CCD in Fig. 4 (left subfigure), taking the class "wall" as an example. First, we observe that as training goes on, the CCD is monotonically decreasing in CLAN while not being monotone in TAN, suggesting that CLAN prevents the well-aligned features from being incorrectly mapped. Second, the CCD converges to a smaller value in CLAN than in TAN, suggesting that CLAN achieves better feature alignment at the semantic level.
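The cluster center distance can be approximated as follows: average the features of a given class over all of its pixels in each domain, then measure the distance between the two centers. The sketch below is our own reading of this procedure; the feature shapes and the choice of the Euclidean distance are assumptions.

```python
import torch

def cluster_center_distance(feat_s, label_s, feat_t, label_t, cls):
    """Distance between the source and target feature centers of class `cls`.

    feat_*:  feature maps of shape (N, D, H, W)
    label_*: per-pixel class maps of shape (N, H, W); for the target domain
             these can be predictions, since no target labels are available.
    """
    def center(feat, label):
        mask = (label == cls)                        # (N, H, W) boolean mask
        pixels = feat.permute(0, 2, 3, 1)[mask]      # (num_pixels, D)
        return pixels.mean(dim=0)                    # (D,) class center

    return torch.norm(center(feat_s, label_s) - center(feat_t, label_t), p=2)

# As described above, the distance at each epoch is divided by the distance
# obtained from the ImageNet-pretrained model (no fine-tuning), so every
# curve starts at a normalized CCD of 1.
```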

We further report the final CCD of each class in Fig. 6. We observe that CLAN achieves a smaller CCD in most cases, especially for those infrequent classes which are prone to be negatively transferred. These quantitative results, together with the qualitative t-SNE [28] analysis in Fig. 3, indicate that CLAN preferably aligns the two domains at the semantic level. Such a category-aligned feature distribution usually makes the subsequent classification easier.

Figure 5: Qualitative results of UDA segmentation for GTA5 → Cityscapes. For each target image, we show the non-adapted (source only) result, the adapted result with CLAN, and the ground truth label map, respectively.
Figure 6: Quantitative analysis of the feature joint distributions. For each class, we show the distance of the feature cluster centers between source domain and target domain. These results are from 1) the model pre-trained on ImageNet [9] without any fine-tuning, 2) the model fine-tuned with source images only, 3) the adapted model using TAN and 4) the adapted model using CLAN, respectively.

4.5 Parameter Studies

In this experiment, we aim to study two problems: 1) whether the adaptive adversarial loss causes instability (vanishing gradients) during adversarial training and 2) how much the adaptive adversarial loss affects the performance. For problem 1), we utilize the loss of D to indicate the convergence performance; a stable adversarial training is achieved if this loss converges around 0.5. First, we test our model with ε varying over the range {0.1, 0.2, 0.4, 0.8}. We do not use any ε larger than 0.8, since CLAN would degrade into TAN in that case. In this experiment, our model suffers from poor convergence when a very small ε is used, e.g., 0.1 or 0.2. It indicates that a proper choice of ε lies between 0.2 and 0.8. Motivated by this observation, we then test our model with λ_local varying over the range {10, 20, 40, 80}. We observe that the convergence performance is not very sensitive to λ_local, since the loss of D converges to proper values in all cases. The best performance is achieved with the ε = 0.4 setting used in our best model. Besides, we observe that the adaptation performance of CLAN steadily outperforms TAN when using parameters near the best values. We present the detailed performance variation in Fig. 4 (right subfigure). By comparing both the convergence and the segmentation results under these different parameter settings, we conclude that the proposed adaptive adversarial weight significantly affects and improves the adaptation performance.

5 Conclusion

In this paper, we introduce the category-level adversarial network (CLAN), aiming to address the problem of semantic inconsistency incurred by global feature alignment during unsupervised domain adaptation (UDA). By taking a close look at the category-level data distribution, CLAN adaptively weights the adversarial loss for each feature according to how well it is aligned at the category level. In this spirit, each class is aligned with an adaptive adversarial loss. Our method effectively prevents well-aligned features from being incorrectly mapped by the side effect of pure global distribution alignment. Experimental results validate the effectiveness of CLAN, which yields very competitive segmentation accuracy compared with state-of-the-art UDA approaches.

Acknowledgment. This work is partially supported by the National Natural Science Foundation of China (No. 61572211).

References