(ECCV 2020) Classes Matter: A Fine-grained Adversarial Approach to Cross-domain Semantic Segmentation
Despite great progress in supervised semantic segmentation, a large performance drop is usually observed when deploying the model in the wild. Domain adaptation methods tackle the issue by aligning the source domain and the target domain. However, most existing methods attempt to perform the alignment from a holistic view, ignoring the underlying class-level data structure in the target domain. To fully exploit the supervision in the source domain, we propose a fine-grained adversarial learning strategy for class-level feature alignment while preserving the internal structure of semantics across domains. We adopt a fine-grained domain discriminator that not only serves as a domain distinguisher, but also differentiates domains at the class level. The traditional binary domain labels are also generalized to domain encodings as the supervision signal to guide the fine-grained feature alignment. An analysis with Class Center Distance (CCD) validates that our fine-grained adversarial strategy achieves better class-level alignment compared to other state-of-the-art methods. Our method is easy to implement and its effectiveness is evaluated on three classical domain adaptation tasks, i.e., GTA5 to Cityscapes, SYNTHIA to Cityscapes and Cityscapes to Cross-City. Large performance gains show that our method outperforms other global feature alignment based and class-wise alignment based counterparts. The code is publicly available at https://github.com/JDAI-CV/FADA.
The success of semantic segmentation in recent years is mostly driven by a large amount of accessible labeled data. However, collecting massive densely annotated data for training is usually a labor-intensive task. Recent advances in computer graphics provide an alternative for replacing expensive human labor. Through physically based rendering, we can obtain photo-realistic images with the pixel-level ground truth readily available in an effortless way [23, 24].
However, a performance drop is observed when the model trained with synthetic data (a source domain) is applied in realistic scenarios (a target domain), because the data from different domains usually follow different distributions. This phenomenon is known as the domain shift problem, which poses a challenge to cross-domain tasks.
Domain adaptation aims to alleviate the domain shift problem by aligning the feature distributions of the source and the target domain. A group of works focus on adopting an adversarial framework, where a domain discriminator is trained to distinguish the target samples from the source ones, while the feature network tries to fool the discriminator by generating domain-invariant features [16, 30, 15, 38, 20, 25, 34, 8, 35].
Although impressive progress has been achieved in domain adaptive semantic segmentation, most prior works strive to align global feature distributions without paying much attention to the underlying structures among classes. However, as discussed in recent works [17, 3], matching the marginal feature distributions does not guarantee a small expected error on the target domain. The class conditional distributions should also be aligned, meaning that class-level alignment also plays an important role. As illustrated in Figure 1, the upper part shows the result of global feature alignment where the two domains are well-aligned but some samples are falsely mixed up. This motivates us to incorporate class information into the adversarial framework to enable fine-grained feature alignment. As illustrated in the bottom of Figure 1, features are expected to be aligned according to specific classes.
There have been some pioneering works [20, 7] trying to address this problem. Chen et al. propose to use several independent discriminators to perform class-wise alignment, but independent discriminators might fail to capture the relationships between classes. Luo et al. introduce a self-adaptive adversarial loss to apply different weights to each region. However, they do not explicitly incorporate class information in their methods, which might fail to promote class-level alignment.
Our motivation is to directly incorporate class information into the discriminator and encourage it to align features at a fine-grained level. Traditional adversarial training has been proven effective for aligning features by using a binary domain discriminator to model the distribution P(d | f), where d refers to the domain and f is the feature extracted from the input data. By confusing such a discriminator, i.e., expecting P(d = 0 | f) = P(d = 1 | f), where 0 stands for the source domain and 1 for the target domain, the features become domain invariant and well aligned. To further take classes into account, we split the output into multiple channels according to the K semantic classes. We directly model the discriminator over class-aware domain channels to formulate a fine-grained domain alignment task. Although in the setting of domain adaptation the category-level labels for the target domain are inaccessible, we find that the model predictions on the target domain also contain class information, and we show that it is possible to supervise the discriminator with the predictions on both domains. In the adversarial learning process, class information is incorporated and the features are expected to be aligned according to specific classes.
In this paper, we propose such a fine-grained adversarial learning framework for domain adaptive semantic segmentation (FADA). As illustrated in Figure 1, we represent the supervision of the traditional discriminator at a fine-grained semantic level, which enables our fine-grained discriminator to capture rich class-level information. The adversarial learning process is performed at a fine-grained level, so the features are expected to be adaptively aligned according to their corresponding semantic categories. The class mismatch problem, which broadly exists in global feature alignment, is expected to be further suppressed. Correspondingly, by incorporating class information, the binary domain labels are also generalized to a more complex form, called "domain encodings", to serve as the new supervision signal. Domain encodings can be extracted from the network's predictions on both domains. Different strategies for constructing domain encodings will be discussed. We conduct an analysis with Class Center Distance to demonstrate the effectiveness of our method regarding class-level alignment. Our method is also evaluated on three popular cross-domain benchmarks and presents new state-of-the-art results.
The main contributions of this paper are summarized below.
We propose a fine-grained adversarial learning framework for cross-domain semantic segmentation that explicitly incorporates class-level information.
The fine-grained learning framework enables class-level feature alignment, which is further verified by analysis using Class Center Distance.
We evaluate our methods with comprehensive experiments. Significant improvements compared to other state-of-the-art methods are achieved on popular domain adaptive segmentation tasks including GTA5 → Cityscapes, SYNTHIA → Cityscapes and Cityscapes → Cross-City.
Semantic segmentation is the task of predicting a unique semantic label for each pixel of the input image. With the advent of deep convolutional neural networks, the computer vision community has witnessed huge progress in this field. FCN triggered interest in applying deep learning to this task. Many follow-up methods have been proposed to enlarge the receptive fields to cover more context information [5, 4, 36, 6]. Among these works, the Deeplab family [5, 4, 6] attracts a lot of attention and has been widely applied in many works for its simplicity and effectiveness.
Domain adaptation strives to address the performance drop caused by the different distributions of training data and testing data. In recent years, several works have been proposed to approach this problem in image classification [25, 3]. Inspired by the theoretical upper bound of the risk in the target domain, some pioneering works suggest optimizing distance measurements between the two domains to align the features [18, 29]. Recently, motivated by GAN, adversarial training has become popular for its power to align features globally [30, 7, 25].
Unlike domain adaptation for the image classification task, domain adaptive semantic segmentation has received less attention due to its difficulty, even though it supports many important applications including autonomous driving in the wild [16, 8]. Based on the theoretical insight on domain adaptive classification, most works follow the path of shortening the discrepancy between the two domains. Large progress has been achieved through adversarial training or explicit domain discrepancy measures [30, 15, 16]. In the context of the domain adaptive semantic segmentation task, AdaptSegnet attempts to align the distributions in the output space. Inspired by CycleGAN, CyCADA suggests adapting the representation at the pixel level and the feature level. There are also many works focusing on aligning different properties between the two domains such as entropy and information.
Although huge progress has been made in this field, most existing methods share a common limitation: enforcing global feature alignment inevitably mixes samples with different semantic labels together when drawing the two domains closer, which usually results in a mismatch of classes from different domains. CLAN is a pioneering work to address category-level alignment. It suggests applying different adversarial weights to different regions, but it does not directly and explicitly incorporate the classes into the model.
Semantic segmentation aims to predict a unique per-pixel label for the input image. In an unsupervised domain adaptation setting for semantic segmentation, we have access to a collection of labeled data S = {(x_s, y_s)} in a source domain and unlabeled data T = {x_t} in a target domain, where N_s and N_t are the numbers of samples from the two domains. The source domain and the target domain share the same K semantic class labels. The goal is to learn a segmentation model G which achieves a low expected risk on the target domain. Generally, the segmentation network G can be divided into a feature extractor F and a multi-class classifier C, where G = C ∘ F.
Traditional feature-level adversarial training relies on a binary domain discriminator D to align the features extracted by F on both domains. Domain adaptation is tackled by alternately optimizing D and F with two steps:

(1) D is trained to distinguish features from different domains. This is usually achieved by fixing F and C and solving:

L_D = -E[log P(d = 0 | f_s)] - E[log P(d = 1 | f_t)],

where f_s = F(x_s) and f_t = F(x_t) are the features extracted by F on a source sample x_s and a target sample x_t; d refers to the domain variable, where 0 refers to the source domain and 1 refers to the target domain; P(d | f) is the probability output of the discriminator.

(2) F is trained with the task loss on the source domain and the adversarial loss on the target domain. This requires fixing D and updating F and C. The cross-entropy loss on the source domain minimizes the difference between the prediction and the ground truth, which helps to learn the task-specific knowledge:

L_seg = -Σ_i Σ_k y_{i,k} log p_{i,k},

where p_{i,k} is the probability of source sample x_s^i belonging to semantic class k as predicted by C, and y_{i,k} is the corresponding entry of the one-hot label. The adversarial loss is used to confuse the discriminator, encouraging F to generate domain-invariant features:

L_adv = -E[log P(d = 0 | f_t)].
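The two alternating objectives above can be sketched numerically. The following is a minimal numpy illustration of the binary case (function names and the 1e-8 stabilizer are our own choices, not from the paper):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def discriminator_loss(d_src_logits, d_tgt_logits):
    """Step (1): train D to classify source features as d=0 and target as d=1."""
    p_src = sigmoid(d_src_logits)  # P(d=1 | f_s), driven toward 0
    p_tgt = sigmoid(d_tgt_logits)  # P(d=1 | f_t), driven toward 1
    return (-np.mean(np.log(1.0 - p_src + 1e-8))
            - np.mean(np.log(p_tgt + 1e-8)))

def adversarial_loss(d_tgt_logits):
    """Step (2): train F so target features are classified as source (d=0)."""
    p_tgt = sigmoid(d_tgt_logits)
    return -np.mean(np.log(1.0 - p_tgt + 1e-8))
```

When D separates the domains well (large negative logits on source, large positive on target), the discriminator loss is near zero while the adversarial loss is large, which is exactly the signal that pushes F toward domain-invariant features.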
To incorporate the class information into the adversarial learning framework, we propose a novel discriminator and enable a fine-grained adversarial learning process. The whole pipeline is illustrated in Figure 2.
The traditional adversarial training strives to align the marginal distribution by confusing a binary discriminator. To make the discriminator not merely focus on distinguishing domains, we split each of the two output channels of the binary discriminator into K channels and encourage fine-grained adversarial learning. With this design, the predicted confidence for domains is represented as a confidence distribution over different classes, which enables the new fine-grained discriminator to model more complex underlying structures between classes, thus encouraging class-level alignment.
Correspondingly, the binary domain labels are also converted to a general form, namely domain encodings, to incorporate class information. Traditionally, the domain labels used for training the binary discriminator are 0 and 1 for the source and target domains respectively. The domain encodings are represented as the 2K-dimensional vectors z_s = [a_s, 0] and z_t = [0, a_t] for the two domains respectively, where a is the class knowledge extracted from the classifier C, represented by a K-dimensional vector, and 0 is an all-zero K-dimensional vector. The choices of how to generate the class knowledge will be discussed in Section 3.3.

During the training process, the discriminator not only tries to distinguish domains, but also learns to model class structures. The L_D in Equation 1 becomes:

L_D = -Σ_k a_s^k log P(source channel k | f_s) - Σ_k a_t^k log P(target channel k | f_t),

where a_s^k and a_t^k are the kth entries of the class knowledge for the source sample x_s and the target sample x_t. The adversarial loss used to confuse the discriminator and guide the generation of domain-invariant features in Equation 4 becomes:

L_adv = -Σ_k a_t^k log P(source channel k | f_t).
This adversarial loss is designed to maximize the probability of features from the target domain being considered as source features, without hurting the relationship between features and classes.
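A compact numpy sketch of the domain-encoding construction and the fine-grained discriminator loss may help make this concrete. Placing the source knowledge in the first K channels and the target knowledge in the last K is our reading of the channel layout; function names are illustrative:

```python
import numpy as np

def domain_encoding(a, is_source):
    """Build the 2K-dim supervision signal from K-dim class knowledge a:
    source samples fill the first K channels, target samples the last K."""
    K = a.shape[0]
    z = np.zeros(2 * K)
    if is_source:
        z[:K] = a
    else:
        z[K:] = a
    return z

def fine_grained_d_loss(d_probs, z):
    """Cross-entropy between the discriminator's 2K-way probability output
    d_probs and the domain encoding z used as a soft target."""
    return -np.sum(z * np.log(d_probs + 1e-8))
```

When the discriminator's output matches a one-hot encoding exactly, the loss is zero; confusing the discriminator amounts to supervising target features with source-half encodings instead.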
The overall network in Figure 2 is used in the training stage. During inference, the domain adaptation component is removed and one only needs to use the original segmentation network with the adapted weights.
Now that we have a fine-grained domain discriminator, which can adaptively align features according to the class-level information contained in domain encodings, another challenge arises: how do we get the class knowledge a_s and a_t in Equations 5 and 6 to construct the domain encoding for each sample? Considering that in the unsupervised domain adaptive semantic segmentation task no annotations in the target domain are accessible, it seems contradictory to use class knowledge on the target domain for guiding class-level alignment. However, during training, with ground-truth annotations from the source domain, the classifier learns to map features into the semantic classes. Considering that the source domain and the target domain share the same semantic classes, it is a natural choice to use the predictions of C as knowledge to supervise the discriminator.
As illustrated in Equations 5 and 6, the class knowledge for optimizing the fine-grained discriminator works as the supervision signal. The choices of a_s and a_t are open to many possibilities. For specific tasks, one could design different forms to produce class knowledge with prior knowledge. Here we discuss two general solutions to extract class knowledge from network predictions for constructing domain encodings. Because the class-level knowledge for different domains can be extracted in the same way, in the following discussion we use a_k to represent the kth entry for a single sample without differentiating the domain.
The one-hot hard labels are a straightforward solution for generating the knowledge, which can be denoted as:

a_k = 1 if k = argmax_j p_j, and a_k = 0 otherwise,

where p_k is the softmax probability output of C for class k. In this way, only the most confident class is selected. In practice, in order to remove the impact of noisy samples, we can select samples whose confidence is higher than a certain threshold and ignore those with low confidence.
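The hard-label rule with a confidence threshold can be sketched in a few lines of numpy (the threshold value 0.9 matches the one used in the experiments later; the function name is ours):

```python
import numpy as np

def hard_label_knowledge(probs, threshold=0.9):
    """One-hot knowledge a: 1 for the argmax class if its softmax confidence
    exceeds the threshold, otherwise an all-zero (ignored) vector."""
    a = np.zeros_like(probs)
    k = int(np.argmax(probs))
    if probs[k] > threshold:
        a[k] = 1.0
    return a
```

Low-confidence pixels thus contribute an all-zero encoding and are effectively excluded from the fine-grained discriminator's supervision.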
Another alternative is multi-channel soft labels, which have the following definition:

a_k = exp(z_k / T) / Σ_j exp(z_j / T),

where z_k is the kth entry of the logits and T is a temperature that encourages a soft probability distribution over classes. Note that during training, an additional regularization can also be applied. For example, we find in practice that clipping the values of the soft labels by a given threshold achieves more stable performance, because it prevents the model from overfitting to certain classes.
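A minimal numpy sketch of the soft-label variant, using the temperature T = 1.8 and clipping threshold 0.9 reported in the experiments (the max-subtraction for numerical stability is a standard implementation detail, not from the paper):

```python
import numpy as np

def soft_label_knowledge(logits, T=1.8, clip=0.9):
    """Temperature-softened softmax over the logits, with 'confidence clipping'
    truncating each entry at a given threshold."""
    z = logits / T
    z = z - z.max()                    # numerical stability
    p = np.exp(z) / np.exp(z).sum()    # softened probability distribution
    return np.minimum(p, clip)         # clip dominant classes at the threshold
```

With a very peaked logit vector, the dominant entry saturates at the clipping threshold instead of approaching 1, which is exactly the regularization effect described above.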
We present a comprehensive evaluation of our proposed method on three popular unsupervised domain adaptive semantic segmentation benchmarks, i.e., Cityscapes → Cross-City, SYNTHIA → Cityscapes, and GTA5 → Cityscapes.
Cityscapes. Cityscapes is a real-world urban scene dataset consisting of a training set with 2,975 images, a validation set with 500 images and a testing set with 1,525 images. Following the standard protocols [16, 15, 30], we use the 2,975 images from the Cityscapes training set as the unlabeled target domain training set and evaluate our adapted model on the 500 images from the validation set.

Cross-City. Cross-City is an urban scene dataset collected with Google Street View. It contains 3,200 unlabeled images and 100 annotated images for each of four different cities. The annotations of Cross-City share 13 classes with Cityscapes.

SYNTHIA. SYNTHIA is a synthetic urban scene dataset. We pick its subset SYNTHIA-RAND-CITYSCAPES, which shares 16 semantic classes with Cityscapes, as the source domain. In total, 9,400 images from the SYNTHIA dataset are used as source domain training data for the task.

GTA5. The GTA5 dataset is another synthetic dataset sharing 19 semantic classes with Cityscapes. 24,966 urban scene images are collected from the physically based rendered video game Grand Theft Auto V (GTAV) and are used as source training data.
[Partial rows of Tables 2 and 3: per-class IoU for "FCNs in the wild" and the feature-only baselines on SYNTHIA → Cityscapes (16 classes plus mIoU) and GTA5 → Cityscapes (19 classes plus mIoU); the class headers and remaining rows were lost in extraction.]
The metrics for evaluating our algorithm are consistent with the common semantic segmentation task. Specifically, we compute the PASCAL VOC intersection-over-union (IoU) of our prediction and the ground truth label: IoU = TP / (TP + FP + FN), where TP, FP and FN are the numbers of true positive, false positive and false negative pixels respectively. In addition to the IoU for each class, the mIoU, defined as the mean of the IoUs over all classes, is also reported.
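The per-class IoU can be computed efficiently from a confusion matrix; the following numpy sketch shows the standard recipe (the ignore-mask convention for out-of-range labels is a common evaluation detail, assumed here):

```python
import numpy as np

def per_class_iou(pred, gt, num_classes):
    """IoU_k = TP / (TP + FP + FN) for each class k, via a confusion matrix."""
    mask = gt < num_classes  # ignore pixels labeled outside the class range
    cm = np.bincount(num_classes * gt[mask] + pred[mask],
                     minlength=num_classes ** 2).reshape(num_classes, num_classes)
    tp = np.diag(cm)                 # correctly labeled pixels per class
    fp = cm.sum(axis=0) - tp         # predicted as k but labeled otherwise
    fn = cm.sum(axis=1) - tp         # labeled k but predicted otherwise
    return tp / np.maximum(tp + fp + fn, 1)

# mIoU is then simply per_class_iou(...).mean()
```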
Our pipeline is implemented in PyTorch. For fair comparison, we employ DeeplabV2 with VGG-16 and ResNet-101 as the segmentation base networks. All models are pre-trained on ImageNet. For the fine-grained discriminator, we adopt a simple structure consisting of 3 convolution layers, each followed by a leaky ReLU parameterized by 0.2 except for the last layer.
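Since the discriminator's convolutions act per spatial location, its forward pass can be sketched as a per-pixel stack of linear maps in numpy. The hidden widths and the 2K output size below are illustrative placeholders (the original channel numbers did not survive extraction), and biases are omitted for brevity:

```python
import numpy as np

def leaky_relu(x, slope=0.2):
    """Leaky ReLU with negative slope 0.2, as used between discriminator layers."""
    return np.where(x > 0, x, slope * x)

def fine_grained_discriminator(feat, w1, w2, w3):
    """Per-pixel sketch of the 3-layer discriminator.
    feat: (H*W, C) features; w1, w2, w3: weight matrices standing in for
    1x1-style convolutions; the last layer emits 2K domain-class logits."""
    h = leaky_relu(feat @ w1)
    h = leaky_relu(h @ w2)
    return h @ w3  # (H*W, 2K) logits, one score per (domain, class) pair
```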
To train the segmentation network, we use the Stochastic Gradient Descent (SGD) optimizer with momentum 0.9. The learning rate is decreased following a 'poly' learning rate policy with power 0.9. For training the discriminator, we adopt the Adam optimizer; the same 'poly' learning rate policy is used. The adversarial loss weight is constantly set to 0.001. The temperature T is set to 1.8 for all experiments.
Regarding the training procedure, the network is first trained on source data for 20k iterations and then fine-tuned using our framework for 40k iterations. The batch size is 8: four source images and four target images. Data augmentations including random flipping and color jittering are used to prevent overfitting.
Although our model is already able to achieve new state-of-the-art results, we further boost the performance by using self distillation [12, 1, 33] and multi-scale testing. A detailed ablation study is conducted in Section 4.5 to reveal the effect of each component, which, we hope, could provide more insights into the topic.
Small shift: cross-city adaptation. Adaptation between real images from different cities is a scenario with great potential for practical applications. Table 1 shows the results of domain adaptation on the Cityscapes → Cross-City dataset. Our method has different performance gains for the four cities. On average over the four cities, our FADA achieves an 8.5% improvement compared with the source-only baselines, and a 2.25% gain compared with the previous best method.
Large shift: synthetic-to-real adaptation. Tables 2 and 3 demonstrate the semantic segmentation performance on the SYNTHIA → Cityscapes and GTA5 → Cityscapes tasks in comparison with existing state-of-the-art domain adaptation methods. We observe that our FADA outperforms the existing methods by a large margin and obtains new state-of-the-art performance in terms of mIoU. Compared to the source model without any adaptation, gains of 16.4% and 13.9% are achieved for VGG-16 and ResNet-101 respectively on SYNTHIA → Cityscapes. FADA also obtains 15.5% and 12.4% improvements over the respective baselines on the GTA5 → Cityscapes task. Besides, compared to the state-of-the-art feature-level methods, a general improvement of over 4% is observed. Note that, as mentioned in prior work, the "train" images in Cityscapes are more visually similar to the "bus" in GTA5 than to the "train" in GTA5, which is a challenge for other methods as well. Qualitative results for the GTA5 → Cityscapes task are presented in Figure 5, showing that FADA also brings a significant visual improvement.
To verify whether our fine-grained adversarial framework aligns features at the class level, we design an experiment to investigate to what degree the class-level features are aligned. Considering that different networks map features to different feature spaces, it is necessary to find a stable metric. CLAN suggests using a Cluster Center Distance, defined as the ratio of the intra-class distance between the trained model and the initial model, to measure the degree of class-level alignment. To better evaluate the effectiveness of class-level feature alignment on the same scale, we propose to modify the Cluster Center Distance into the Class Center Distance (CCD) by taking the inter-class distance into account. The CCD for class k is defined as the mean distance of the features in class k to their class center, divided by the mean distance between the center of class k and the centers of the other classes:

CCD_k = (1/|S_k| Σ_{f ∈ S_k} ||f − c_k||) / (1/(K−1) Σ_{j ≠ k} ||c_j − c_k||),

where c_k is the class center for class k and S_k is the set of all features belonging to class k. With CCD, we measure the ratio of intra-class compactness to inter-class distance. A low CCD suggests that the features of the same class are clustered densely while the distance between different classes is relatively large. We randomly pick 2,000 source samples and 2,000 target samples, and compare the CCD values with other state-of-the-art methods: AdaptSegNet for global alignment and CLAN for class-wise alignment without explicitly modeling the class relationship. As shown in Figure 4, FADA achieves a much lower CCD on most classes and obtains the lowest mean CCD value of 1.1 compared to the other algorithms. With FADA, we achieve better class-level alignment and preserve consistent class structures between domains.
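The CCD computation can be sketched directly from this definition; the following numpy version follows our reconstructed formula (the exact normalization in the paper's figure may differ slightly):

```python
import numpy as np

def class_center_distance(feats, labels, k):
    """CCD for class k: mean intra-class distance to the class center of k,
    divided by the mean distance from center k to the other class centers."""
    classes = np.unique(labels)
    centers = {c: feats[labels == c].mean(axis=0) for c in classes}
    intra = np.linalg.norm(feats[labels == k] - centers[k], axis=1).mean()
    inter = np.mean([np.linalg.norm(centers[c] - centers[k])
                     for c in classes if c != k])
    return intra / inter
```

Tightly clustered classes that sit far apart yield a CCD near zero, while overlapping classes drive it above one, which is why a lower mean CCD indicates better class-level alignment.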
Analysis of different components. Table 4 presents the impact of each component on DeeplabV2 with ResNet-101 on the GTA5 → Cityscapes task. The fine-grained adversarial training brings an improvement of 10.1%, which already makes it the new state of the art. To further explore the potential of the model, the self-distillation strategy leads to an improvement of 2.3%, and multi-scale testing further boosts the performance by 0.7%.
Hard labels vs. soft labels. As discussed in Section 3.3, the knowledge extracted from the classifier C can be produced from hard labels or soft labels. Here we compare these two forms of labels on the GTA5 → Cityscapes and SYNTHIA → Cityscapes tasks with DeeplabV2 ResNet-101. For soft labels, we use "confidence clipping" with threshold 0.9 as regularization. For hard labels, we only keep high-confidence samples, ignoring the samples with confidence lower than 0.9. The results are reported in Table 5. Both choices give a great boost over the baseline global feature alignment model. We observe that the soft label is a more flexible choice and presents superior performance.
Impact of confidence clipping. In our experiments, we use "confidence clipping" as a regularizer to prevent overfitting on noisy soft labels. The confidence values are truncated by a given threshold, so they are not encouraged to fit heavily to a certain class. We test several thresholds and the results are shown in Table 6. Note that a threshold of 1.0 means no regularization is used. We observe a consistent performance gain from confidence clipping. The best result is found when the threshold is 0.9.
In this paper, we address the problem of domain adaptive semantic segmentation by proposing a fine-grained adversarial training framework. A novel fine-grained discriminator is designed to not only distinguish domains, but also capture category-level information to guide a fine-grained feature alignment. The binary domain labels used to supervise the discriminator are generalized to domain encodings correspondingly to incorporate class information. Comprehensive experiments and analysis validate the effectiveness of our method. Our method achieves new state-of-the-art results on three popular tasks, outperforming other methods by a large margin.
This work was partially supported by Beijing Academy of Artificial Intelligence (BAAI).
The cityscapes dataset for semantic urban scene understanding. In Proc. of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
A gift from knowledge distillation: fast optimization, network minimization and transfer learning. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 7130–7138.
Unpaired image-to-image translation using cycle-consistent adversarial networks. In 2017 IEEE International Conference on Computer Vision (ICCV).