Uncertainty Aware Curriculum Domain Adaptation. Code for The UIoU Dark Zurich Challenge @ Vision for All Seasons Workshop, CVPR 2020
We consider the unsupervised scene adaptation problem of learning from both labeled source data and unlabeled target data. Existing methods focus on minimizing the inter-domain gap between the source and target domains. However, the intra-domain knowledge and inherent uncertainty learned by the network are under-explored. In this paper, we propose an orthogonal method, called memory regularization in vivo, to exploit the intra-domain knowledge and regularize the model training. Specifically, we refer to the segmentation model itself as the memory module, and minimize the discrepancy between its two classifiers, i.e., the primary classifier and the auxiliary classifier, to reduce the prediction inconsistency. Without extra parameters, the proposed method is complementary to most existing domain adaptation methods and could generally improve their performance. Albeit simple, we verify the effectiveness of memory regularization on two semantic segmentation benchmarks, GTA5 -> Cityscapes and SYNTHIA -> Cityscapes, yielding +11.1% and +11.3% mIoU improvement over the baseline model, respectively.
Due to the unaffordable cost of segmentation annotation, unsupervised scene adaptation aims to adapt a learned model to a new domain without extra annotation. In contrast to conventional segmentation tasks, unsupervised scene adaptation is one step closer to real-world practice: the annotation of the target scene is usually hard to acquire, while abundant source data is easy to access. To improve model scalability on the unlabeled target domain, most researchers resort to transferring the common knowledge learned from the source domain to the target domain.
Existing scene adaptation methods typically focus on reducing the discrepancy between the source domain and the target domain. The alignment between the two domains could be conducted on different levels, such as the pixel level [Hoffman et al.2018, Wu et al.2018], feature level [Hoffman et al.2018, Huang et al.2018, Yue et al.2019, Luo et al.2019b, Zhang et al.2019a] and semantic level [Tsai et al.2018, Tsai et al.2019, Wang et al.2019]. Despite their success, such brute-force alignment drives the model to learn only the domain-agnostic features shared by both domains. We consider this line of methods sub-optimal in that it ignores domain-specific feature learning on the target domain, which compromises the final adaptation performance.
Since domain-specific knowledge is ignored for the unlabeled target data, regularization by the data itself does not aid the domain adaptation. To qualitatively verify this, we leverage the auxiliary classifier of the baseline model [Tsai et al.2018] as a probe to pinpoint the inconsistency. As shown in Fig. 1, the model produces consistent predictions on the labeled source data, while predictions on the unlabeled target data suffer from inconsistency: the prediction of the primary classifier differs from that of the auxiliary classifier, especially in the target domain. This implies that intra-domain consistency is not learned automatically when we only minimize the inter-domain discrepancy.
To effectively exploit the intra-domain knowledge and reduce the target prediction inconsistency, we introduce a memory mechanism into deep neural network training, called memory regularization in vivo. Different from previous works focusing on inter-domain alignment, the proposed method aligns different predictions within the same domain to regularize the training. As shown in Fig. 2(c), we consider the input as the key and the output prediction as the corresponding value. In other words, the proposed method deploys the model itself as the memory module, which memorizes the historical predictions and learns the key-value projection. Since we have the auxiliary classifier and the primary classifier, we obtain two values for one key. We note that the proposed method also differs from semi-supervised works deploying extra memory terms. Since it does not require additional parameters or modules, we use the term "in vivo" to differentiate our method from [Chen et al.2018, Tarvainen and Valpola2017, Zhang et al.2018b], which deploy external memory modules.
Our contribution is two-fold: (1) We propose to leverage the memory of model learning to pinpoint the prediction uncertainty and exploit the intra-domain knowledge. This is in contrast to most existing adaptation methods, which focus on inter-domain alignment. (2) We formulate the memory regularization in vivo as the internal prediction discrepancy between the two classifiers. Different from existing memory-based models, the proposed method needs no extra parameters and is compatible with most scene segmentation networks.
Most existing works focus on minimizing the domain discrepancy between the source domain and the target domain to learn the shared knowledge. Some pioneering works [Hoffman et al.2018, Wu et al.2018] apply an image generator to transfer the source data to the style of the target data, intending to reduce the low-level visual appearance difference. Similarly, Yue et al.[Yue et al.2019] and Wu et al.[Wu et al.2019] generate training images of different styles to learn domain-agnostic features. Adversarial losses are also widely studied. Tsai et al.[Tsai et al.2018, Tsai et al.2019] apply adversarial losses to different network layers to enforce domain alignment. Luo et al.[Luo et al.2019b] leverage an attention mechanism and a class-aware adversarial loss to further improve the performance. Besides, some works focus on mining the target-domain knowledge, which is close to our work. Zou et al.[Zou et al.2018, Zou et al.2019] leverage confident pseudo labels to fine-tune the model on the target domain, yielding a competitive benchmark. Recently, Shen et al.[Shen et al.2019] propose to utilize the discriminator to find confident pseudo labels. Different from the pseudo-label-based methods, the proposed method focuses on target-domain knowledge by mining the intrinsic uncertainty of model learning on the unlabeled target-domain data. We note that the proposed method is orthogonal to existing methods, including inter-domain alignment [Tsai et al.2018, Tsai et al.2019, Luo et al.2019b] and self-training with pseudo labels [Zou et al.2018, Zou et al.2019]. In Section 4.2, we show the proposed method can be integrated with other domain adaptation methods to further improve the performance.
As one of the early works, Weston et al.[Weston et al.2014] propose to use an external memory module to store long-term memory, so that the model can reason with related experience more effectively. Chen et al.[Chen et al.2018] further apply the memory to semi-supervised learning to learn from unlabeled data. In this work, we argue that the teacher model applied in many frameworks could also be viewed as a kind of external memory term, because the teacher model distills the knowledge of the original setting and memorizes the key concepts for the final prediction [Hinton et al.2015]. For instance, one early work, temporal ensemble [Laine and Aila2016], uses historical models to regularize the running model, yielding competitive performance. The training sample can be viewed as the key, and the historical models act as the memory that retrieves the corresponding value for the key. Since the historical models memorize the experience from previous training samples, the temporal ensemble provides stable and relatively accurate predictions for the unlabeled data. Beyond [Laine and Aila2016], there are different kinds of external memory models. Mean Teacher [Tarvainen and Valpola2017] leverages the weight moving average model as the memory model to regularize the training. Further, French et al.[French et al.2017] extend Mean Teacher to visual domain adaptation. Zhang et al.[Zhang et al.2018b] propose mutual learning, which distills knowledge across multiple student models.
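The weight moving average at the heart of Mean Teacher [Tarvainen and Valpola2017] fits in a few lines. The sketch below is illustrative rather than code from any cited work; weights are represented as a plain dict of floats, and `alpha` is the usual smoothing coefficient:

```python
def ema_update(teacher, student, alpha=0.99):
    """Exponential moving average of weights, Mean-Teacher style:
    the teacher tracks a smoothed copy of the student's parameters."""
    return {k: alpha * teacher[k] + (1 - alpha) * student[k] for k in teacher}
```

After each optimization step the teacher is refreshed from the student, so it effectively memorizes a temporal ensemble of past student states.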
Different from existing memory-based methods [Chen et al.2018, Tarvainen and Valpola2017, Zhang et al.2018b], the proposed method leverages the memory of the model itself to regularize the running model, introducing no extra parameters or external modules (see Fig. 2).
Formulation. We denote the images from the source domain and the target domain as X_s and X_t, with N_s and N_t the numbers of source and target images. Every source-domain image in X_s is annotated with a corresponding ground-truth segmentation map y_s. Given an unlabeled target-domain image x_t, we intend to learn a function that projects the image to a segmentation map y_t. Following the practice in [Tsai et al.2018, Luo et al.2019b], we adopt the modified DeepLabv2 as our baseline model, which contains one backbone model and two classifiers, i.e., the primary classifier and the auxiliary classifier. To simplify notation, we denote the two segmentation functions as F_p and F_a, where F_p projects the image to the prediction of the primary classifier and F_a maps the input to the prediction of the auxiliary classifier.
Overview. As shown in Fig. 3, the proposed method has two training stages, i.e., Stage-I and Stage-II, to progressively transfer the learned knowledge from the labeled source data to the unlabeled target data. In Stage-I, we follow conventional domain adaptation methods to minimize the inter-domain discrepancy between the source domain and the target domain, and regularize the training with the proposed memory regularization, which reduces the intra-domain inconsistency and yields a performance improvement. In Stage-II, we leverage the trained model to predict pseudo labels for the unlabeled target data and further fine-tune the model on the target domain. With the help of pseudo labels, the model can focus on learning domain-specific knowledge on the target domain. The pseudo labels inevitably contain noise, and the memory regularization in Stage-II prevents the model from overfitting to that noise. Next, we introduce the different objectives for model adaptation in detail. We divide the losses into two classes: (1) domain-agnostic learning, which learns the shared inter-domain features from the source domain; and (2) domain-specific learning, which learns the intra-domain knowledge, especially the features for the target domain.
Segmentation loss. First, we leverage the annotated source-domain data to learn the source-domain knowledge. The segmentation loss is widely applied, and could be formulated as the pixel-wise cross-entropy loss:
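The equation itself was lost in extraction; a plausible reconstruction, writing y_s for the one-hot ground truth of source image x_s and F_p, F_a for the primary and auxiliary class-probability maps, is:

```latex
\mathcal{L}_{seg} = -\sum_{h,w}\sum_{c=1}^{C} y_s^{(h,w,c)} \log F_p(x_s)^{(h,w,c)}
\;-\; \sum_{h,w}\sum_{c=1}^{C} y_s^{(h,w,c)} \log F_a(x_s)^{(h,w,c)}
```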
where the first loss is for the primary prediction and the second for the auxiliary prediction. H and W denote the height and the width of the input image, and C is the number of segmentation classes.
Adversarial loss. The segmentation loss only focuses on the source domain. We need an objective that minimizes the discrepancy between the target domain and the source domain, so that the model can transfer the source-domain knowledge to the target domain. We therefore introduce the adversarial loss [Tsai et al.2018]. The adversarial loss is applied to the predictions of both the primary classifier and the auxiliary classifier:
where D denotes the discriminator. In this work, we deploy two discriminators, D_p and D_a, for the primary prediction and the auxiliary prediction, respectively. The discriminator learns to judge whether the target prediction is close to the source prediction in the semantic space. By optimizing the adversarial loss, we force the model to bridge the inter-domain gap on the semantic level.
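The adversarial objective itself was dropped in extraction; a plausible reconstruction, writing D_p and D_a for the two discriminators and x_t for a target image, encourages target predictions to be classified as source-like:

```latex
\mathcal{L}_{adv} = -\sum_{h,w} \log D_p\big(F_p(x_t)\big)^{(h,w)}
\;-\; \sum_{h,w} \log D_a\big(F_a(x_t)\big)^{(h,w)}
```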
However, the segmentation loss and the adversarial loss do not resolve the intra-domain inconsistency, especially in the target domain. In Stage-I, we therefore leverage the uncertainty in the target domain and propose the memory regularization in vivo to enforce consistency. In Stage-II, we further utilize the memory to regularize the training and prevent the model from overfitting to the noisy pseudo labels.
Memory regularization. In this paper, we argue that the model itself can be viewed as a kind of memory module, in that it memorizes the historical experience. Without introducing extra parameters or external modules, we enforce the model to learn from itself. In particular, we view the input image as the key and the model as the memory module: given the input image (key), the model generates the value by simply feeding the key forward. We obtain two values, one from the primary classifier and one from the auxiliary classifier. To minimize the uncertainty of model learning on the target domain, we encourage the two values of the same key to be as close to each other as possible, so we deploy the KL-divergence loss:
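The KL term was lost in extraction; a plausible reconstruction, treating the auxiliary prediction F_a(x_t) and the primary prediction F_p(x_t) as the two values of the same key x_t, is:

```latex
\mathcal{L}_{mr} = \sum_{h,w}\sum_{c=1}^{C} F_a(x_t)^{(h,w,c)}
\log \frac{F_a(x_t)^{(h,w,c)}}{F_p(x_t)^{(h,w,c)}}
```

Whether the divergence is applied in one direction only or symmetrically is not recoverable from the text; the one-directional form is shown.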
We apply the memory regularization loss only on the target domain, asking the two mapping functions, i.e., the auxiliary classifier and the primary classifier, to generate consistent predictions on the unlabeled target data.
Discussion. 1. What is the advantage of the memory regularization? By using the memory regularization, we enable the model to learn intra-domain knowledge on the unlabeled target data with an explicit and complementary objective. As discussed in [Tarvainen and Valpola2017, Chen et al.2018], we cannot ensure that the memory always provides a correct class prediction for the unlabeled data. The memory mechanism acts more like a teacher model, providing a class distribution based on the historical experience. 2. Will the auxiliary classifier hurt the primary classifier? As shown in many semi-supervised methods [Zhang et al.2018b, Tarvainen and Valpola2017], even a weaker student model can provide essential information to a stronger one. Our experiment also verifies that the sub-optimal auxiliary classifier helps the primary classifier learning, and vice versa (see Section 4.2).
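A minimal NumPy sketch of this consistency term (illustrative, not the authors' released code): it converts the two classifiers' logits to probabilities and penalizes their pixel-wise KL divergence.

```python
import numpy as np

def memory_regularization_loss(logits_aux, logits_primary, eps=1e-8):
    """KL-divergence consistency loss between the auxiliary and primary
    class-probability maps, averaged over all pixels.

    logits_* : arrays of shape (C, H, W) holding raw class scores.
    """
    def softmax(z):
        z = z - z.max(axis=0, keepdims=True)   # numerical stability
        e = np.exp(z)
        return e / e.sum(axis=0, keepdims=True)

    p_aux = softmax(logits_aux)
    p_pri = softmax(logits_primary)
    # KL(p_aux || p_pri), summed over classes, averaged over pixels
    kl = (p_aux * (np.log(p_aux + eps) - np.log(p_pri + eps))).sum(axis=0)
    return kl.mean()
```

When the two heads agree exactly, the loss is zero; any disagreement on the unlabeled target data produces a positive penalty that the training minimizes.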
Self-training with pseudo labels. In Stage-II, we no longer use the source data. The model is fine-tuned on the unlabeled target data to mine target-domain knowledge. Following the self-training policy in [Zou et al.2018, Zou et al.2019], we retrain the model with pseudo labels, which combine the outputs of the primary and auxiliary classifiers of the model trained in Stage-I. The pseudo segmentation loss could be formulated as:
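The fusion rule and the pseudo segmentation loss were lost in extraction. A plausible reconstruction follows, where fusing the two classifiers by summing their probability maps is an assumption, and \hat{y}_t denotes the resulting one-hot pseudo label:

```latex
\hat{y}_t = \arg\max_{c}\Big(F_p(x_t) + F_a(x_t)\Big), \qquad
\mathcal{L}_{pseudo} = -\sum_{h,w}\sum_{c=1}^{C} \hat{y}_t^{(h,w,c)} \log F_p(x_t)^{(h,w,c)}
\;-\; \sum_{h,w}\sum_{c=1}^{C} \hat{y}_t^{(h,w,c)} \log F_a(x_t)^{(h,w,c)}
```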
We apply the pixel-wise cross-entropy loss as the supervised segmentation loss. Since most pseudo labels are correct, the model can still learn from the noisy labels. In Section 4.2, we show that self-training with pseudo labels further boosts the performance on the target domain despite the noise.
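In code, generating the hard pseudo labels from the two classifier outputs can be sketched as follows (NumPy; fusing by summing the two probability maps is an assumption, since the exact rule is elided in the text):

```python
import numpy as np

def generate_pseudo_labels(prob_primary, prob_aux):
    """Fuse two (C, H, W) class-probability maps from the Stage-I model
    into an (H, W) integer pseudo-label map by taking the argmax of
    their sum over the class axis."""
    return np.argmax(prob_primary + prob_aux, axis=0)
```

The resulting label map then serves as the supervision target for the pixel-wise cross-entropy loss in Stage-II.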
Discussion. What is the advantage of the memory regularization in Stage-II? In fact, we treat the pseudo labels as supervised annotations in Stage-II. However, the pseudo labels contain noise and may mislead the model into overfitting it. The proposed memory regularization in Stage-II works as a smoothing term that enforces consistency in the model prediction, rather than fitting the pseudo labels exactly.
We integrate the above-mentioned losses. The total loss of the Stage-I and Stage-II training could be formulated as:
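The combined objectives were lost in extraction; a plausible reconstruction, with λ_adv and λ_mr the loss weights, is:

```latex
\mathcal{L}_{Stage\text{-}I} = \mathcal{L}_{seg} + \lambda_{adv}\,\mathcal{L}_{adv} + \lambda_{mr}\,\mathcal{L}_{mr}, \qquad
\mathcal{L}_{Stage\text{-}II} = \mathcal{L}_{pseudo} + \lambda_{mr}\,\mathcal{L}_{mr}
```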
where λ_mr is the weight for the memory regularization. We follow the setting in PSPNet [Zhao et al.2017] and weight the segmentation loss on the auxiliary classifier by 0.4. For the adversarial losses, we follow the setting in [Tsai et al.2018, Luo et al.2019b] and select small weights for the adversarial loss terms. Besides, we fix the weight of the memory regularization to the same value for all experiments.
Network Architectures. We deploy the widely used DeepLab-v2 [Chen et al.2017] as the baseline model, which adopts ResNet-101 [He et al.2016] as the backbone. Since the auxiliary classifier has been widely adopted in scene segmentation frameworks such as PSPNet [Zhao et al.2017] and the modified DeepLab [Tsai et al.2018, Luo et al.2019b], for a fair comparison we also apply the auxiliary classifier in both our baseline model and the final full model. We further insert a dropout layer before each classifier layer. Besides, we follow PatchGAN [Isola et al.2017] and deploy a multi-scale discriminator model.
Implementation Details. The input image is resized and randomly cropped for training. We deploy the SGD optimizer for the segmentation model, and the Adam optimizer for the discriminator. Following [Zhao et al.2017, Zhang et al.2019b], both the segmentation model and the discriminator use the poly learning rate decay, and we adopt the early-stop policy. The model is first trained without the memory regularization to avoid the initial prediction noise, and the memory regularization is then added to the training. After Stage-I training, we further fine-tune the model in Stage-II. We also adopt the class balance policy of [Zou et al.2018] to increase the weight of rare classes and small-scale objects. At inference time, we combine the outputs of both classifiers. Our implementation is based on PyTorch, and we will release our code for reproducibility.
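The poly decay mentioned above is standard in the segmentation literature; a small sketch follows (the exponent 0.9 is the common default and an assumption here, since the exact value is elided in the text):

```python
def poly_lr(base_lr, cur_iter, max_iter, power=0.9):
    """Poly learning-rate schedule: scale the base rate by
    (1 - cur_iter / max_iter) ** power, decaying to zero at max_iter."""
    return base_lr * (1.0 - float(cur_iter) / max_iter) ** power
```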
We mainly evaluate the proposed method on two unsupervised scene adaptation settings, i.e., GTA5 [Richter et al.2016] → Cityscapes [Cordts et al.2016] and SYNTHIA [Ros et al.2016] → Cityscapes [Cordts et al.2016]. Both source datasets, i.e., GTA5 and SYNTHIA, are synthetic. The target dataset, Cityscapes, is collected in realistic scenarios and provides unlabeled training images. We follow the setting in [Tsai et al.2018, Luo et al.2019b, Zou et al.2018] and evaluate the model on the Cityscapes validation set. For the evaluation metric, we report the mean Intersection over Union (mIoU), averaged over all classes.
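For reference, mIoU can be computed from a confusion matrix; a minimal NumPy sketch (illustrative, not the benchmark's official evaluation script):

```python
import numpy as np

def mean_iou(pred, gt, num_classes):
    """Compute mIoU from integer label maps via a confusion matrix.

    pred, gt : (H, W) integer arrays with values in [0, num_classes).
    Classes absent from both prediction and ground truth are ignored
    in the average.
    """
    conf = np.bincount(
        num_classes * gt.reshape(-1) + pred.reshape(-1),
        minlength=num_classes ** 2,
    ).reshape(num_classes, num_classes)  # conf[gt_class, pred_class]
    inter = np.diag(conf)
    union = conf.sum(0) + conf.sum(1) - inter
    valid = union > 0
    return (inter[valid] / union[valid]).mean()
```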
Effect of the memory regularization. To investigate how the memory helps both classifiers, we report the results of each single classifier in Table 1. The observation suggests two points. First, the memory regularization helps both classifiers and improves their performance, especially that of the auxiliary classifier. Second, the accuracy of the primary classifier does not decrease due to the relatively poor results of the auxiliary classifier; the primary classifier also improves in mIoU. This verifies that the proposed memory regularization helps to reduce the inconsistency and mine intra-domain knowledge. Furthermore, we report the result of the full model after Stage-I training, which combines the predictions of both classifiers. The full model arrives at an mIoU slightly higher than that of the primary classifier alone, indicating that the predictions of the auxiliary classifier and the primary classifier are complementary.
GTA5 → Cityscapes results (per-class IoU and mIoU, %); all methods use the DeepLabv2 backbone:

| Method | road | side. | build. | wall | fence | pole | light | sign | veg. | terr. | sky | person | rider | car | truck | bus | train | motor | bike | mIoU |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CyCADA [Hoffman et al.2018] | 79.1 | 33.1 | 77.9 | 23.4 | 17.3 | 32.1 | 33.3 | 31.8 | 81.5 | 26.7 | 69.0 | 62.8 | 14.7 | 74.5 | 20.9 | 25.6 | 6.9 | 18.8 | 20.4 | 39.5 |
| MCD [Saito et al.2018] | 90.3 | 31.0 | 78.5 | 19.7 | 17.3 | 28.6 | 30.9 | 16.1 | 83.7 | 30.0 | 69.1 | 58.5 | 19.6 | 81.5 | 23.8 | 30.0 | 5.7 | 25.7 | 14.3 | 39.7 |
| AdaptSegNet [Tsai et al.2018] | 86.5 | 36.0 | 79.9 | 23.4 | 23.3 | 23.9 | 35.2 | 14.8 | 83.4 | 33.3 | 75.6 | 58.5 | 27.6 | 73.7 | 32.5 | 35.4 | 3.9 | 30.1 | 28.1 | 42.4 |
| SIBAN [Luo et al.2019a] | 88.5 | 35.4 | 79.5 | 26.3 | 24.3 | 28.5 | 32.5 | 18.3 | 81.2 | 40.0 | 76.5 | 58.1 | 25.8 | 82.6 | 30.3 | 34.4 | 3.4 | 21.6 | 21.5 | 42.6 |
| CLAN [Luo et al.2019b] | 87.0 | 27.1 | 79.6 | 27.3 | 23.3 | 28.3 | 35.5 | 24.2 | 83.6 | 27.4 | 74.2 | 58.6 | 28.0 | 76.2 | 33.1 | 36.7 | 6.7 | 31.9 | 31.4 | 43.2 |
| APODA [Yang et al.2020] | 85.6 | 32.8 | 79.0 | 29.5 | 25.5 | 26.8 | 34.6 | 19.9 | 83.7 | 40.6 | 77.9 | 59.2 | 28.3 | 84.6 | 34.6 | 49.2 | 8.0 | 32.6 | 39.6 | 45.9 |
| PatchAlign [Tsai et al.2019] | 92.3 | 51.9 | 82.1 | 29.2 | 25.1 | 24.5 | 33.8 | 33.0 | 82.4 | 32.8 | 82.2 | 58.6 | 27.2 | 84.3 | 33.4 | 46.3 | 2.2 | 29.5 | 32.3 | 46.5 |
| AdvEnt [Vu et al.2019] | 89.4 | 33.1 | 81.0 | 26.6 | 26.8 | 27.2 | 33.5 | 24.7 | 83.9 | 36.7 | 78.8 | 58.7 | 30.5 | 84.8 | 38.5 | 44.5 | 1.7 | 31.6 | 32.4 | 45.5 |
| FCAN [Zhang et al.2018a] | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | 46.6 |
| CBST [Zou et al.2018] | 91.8 | 53.5 | 80.5 | 32.7 | 21.0 | 34.0 | 28.9 | 20.4 | 83.9 | 34.2 | 80.9 | 53.1 | 24.0 | 82.7 | 30.3 | 35.9 | 16.0 | 25.9 | 42.8 | 45.9 |
| MRKLD [Zou et al.2019] | 91.0 | 55.4 | 80.0 | 33.7 | 21.4 | 37.3 | 32.9 | 24.5 | 85.0 | 34.1 | 80.8 | 57.7 | 24.6 | 84.1 | 27.8 | 30.1 | 26.9 | 26.0 | 42.3 | 47.1 |
SYNTHIA → Cityscapes results (per-class IoU, %). mIoU averages 16 classes; mIoU* averages 13 classes (excluding wall, fence, and pole). Entries marked "-" are not reported by the corresponding method:

| Method | road | side. | build. | wall | fence | pole | light | sign | veg. | sky | person | rider | car | bus | motor | bike | mIoU* | mIoU |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| MCD [Saito et al.2018] | 84.8 | 43.6 | 79.0 | 3.9 | 0.2 | 29.1 | 7.2 | 5.5 | 83.8 | 83.1 | 51.0 | 11.7 | 79.9 | 27.2 | 6.2 | 0.0 | 43.5 | 37.3 |
| SIBAN [Luo et al.2019a] | 82.5 | 24.0 | 79.4 | - | - | - | 16.5 | 12.7 | 79.2 | 82.8 | 58.3 | 18.0 | 79.3 | 25.3 | 17.6 | 25.9 | 46.3 | - |
| PatchAlign [Tsai et al.2019] | 82.4 | 38.0 | 78.6 | 8.7 | 0.6 | 26.0 | 3.9 | 11.1 | 75.5 | 84.6 | 53.5 | 21.6 | 71.4 | 32.6 | 19.3 | 31.7 | 46.5 | 40.0 |
| AdaptSegNet [Tsai et al.2018] | 84.3 | 42.7 | 77.5 | - | - | - | 4.7 | 7.0 | 77.9 | 82.5 | 54.3 | 21.0 | 72.3 | 32.2 | 18.9 | 32.3 | 46.7 | - |
| CLAN [Luo et al.2019b] | 81.3 | 37.0 | 80.1 | - | - | - | 16.1 | 13.7 | 78.2 | 81.5 | 53.4 | 21.2 | 73.0 | 32.9 | 22.6 | 30.7 | 47.8 | - |
| APODA [Yang et al.2020] | 86.4 | 41.3 | 79.3 | - | - | - | 22.6 | 17.3 | 80.3 | 81.6 | 56.9 | 21.0 | 84.1 | 49.1 | 24.6 | 45.7 | 53.1 | - |
| AdvEnt [Vu et al.2019] | 85.6 | 42.2 | 79.7 | 8.7 | 0.4 | 25.9 | 5.4 | 8.1 | 80.4 | 84.1 | 57.9 | 23.8 | 73.3 | 36.4 | 14.2 | 33.0 | 48.0 | 41.2 |
| CBST [Zou et al.2018] | 68.0 | 29.9 | 76.3 | 10.8 | 1.4 | 33.9 | 22.8 | 29.5 | 77.6 | 78.3 | 60.6 | 28.3 | 81.6 | 23.5 | 18.8 | 39.8 | 48.9 | 42.6 |
| MRKLD [Zou et al.2019] | 67.7 | 32.2 | 73.9 | 10.7 | 1.6 | 37.4 | 22.2 | 31.2 | 80.8 | 80.5 | 60.8 | 29.1 | 82.8 | 25.0 | 19.4 | 45.3 | 50.1 | 43.8 |
Effect of different losses in Stage-I. As shown in Table 2, the full model improves the mIoU over the baseline. When only using the adversarial loss, the model is equivalent to the widely used domain adaptation method [Tsai et al.2018]. We note that the model using only the memory regularization also achieves a significant improvement compared with the baseline model without adaptation. We speculate that the memory regularization helps to mine the target-domain knowledge, yielding better performance on the target domain. Combining all three loss terms, the full model achieves the highest mIoU on Cityscapes.
Effect of different losses in Stage-II. If we only deploy the pseudo segmentation loss, the model is equivalent to several previous self-training methods [Zou et al.2018, Zou et al.2019]. However, this line of methods usually demands a well-designed threshold on the label confidence. In contrast, we do not introduce any threshold, but apply the memory regularization to prevent overfitting to the noisy labels. As shown in Table 3, the full model with memory regularization arrives at a higher mIoU than the model trained only on the pseudo labels. This verifies that the proposed memory regularization also helps the model learn from noisy labels.
Hyperparameter Analysis. In this work, we introduce λ_mr as the weight of the memory regularization. As shown in Fig. 4, we evaluate different values of λ_mr and observe that the model is robust to this choice. However, when the value is too large or too small, the model may be misled into over- or under-enforcing the consistency. Therefore, without loss of generality, we use one fixed value of λ_mr for all experiments.
We compare the proposed method with different domain adaptation methods on GTA5 → Cityscapes (see Table 4). For a fair comparison, we mainly show results based on the same backbone, i.e., DeepLabv2. The proposed method achieves a higher mIoU than the competitive methods, e.g., pixel-level alignment [Hoffman et al.2018] and semantic-level alignment [Tsai et al.2018], as well as the self-training methods [Zou et al.2018, Zou et al.2019]. Compared with the strong source-only model, the proposed method yields a clear improvement. Besides, we observe a similar result on SYNTHIA → Cityscapes (see Table 5): the proposed method arrives at mIoU* and mIoU scores that are also competitive with other methods, improving the mIoU accuracy over the baseline.
We propose a memory regularization method for unsupervised scene adaptation. Our model leverages the intra-domain knowledge and reduces the uncertainty of model learning. Without introducing any extra parameters or external modules, we deploy the model itself as the memory module to regularize the training. Albeit simple, the proposed method is complementary to previous works and achieves competitive results on two benchmarks, i.e., GTA5 → Cityscapes and SYNTHIA → Cityscapes.
Semi-supervised deep learning with memory. In ECCV, 2018.
The Cityscapes dataset for semantic urban scene understanding. In CVPR, 2016.
Class-specific reconstruction transfer learning for visual recognition across domains. TIP, 2019.