What Can Be Transferred: Unsupervised Domain Adaptation for Endoscopic Lesions Segmentation

04/24/2020 ∙ by Jiahua Dong, et al. ∙ Huaqiao University ∙ University of Arkansas at Little Rock

Unsupervised domain adaptation has attracted growing research attention in semantic segmentation. However, 1) most existing models cannot be directly applied to lesion transfer in medical images, due to the diverse appearances of the same lesion across different datasets; 2) equal attention is paid to all semantic representations instead of neglecting irrelevant knowledge, which leads to negative transfer of untransferable knowledge. To address these challenges, we develop a new unsupervised semantic transfer model including two complementary modules (i.e., T_D and T_F) for endoscopic lesions segmentation, which alternately determine where and how to explore transferable domain-invariant knowledge between a labeled source lesions dataset (e.g., gastroscope) and an unlabeled target diseases dataset (e.g., enteroscopy). Specifically, T_D focuses on where to translate transferable visual information of medical lesions via a residual transferability-aware bottleneck, while neglecting untransferable visual characterizations. Furthermore, T_F highlights how to augment transferable semantic features of various lesions and automatically ignore untransferable representations, which explores domain-invariant knowledge and in return improves the performance of T_D. Finally, theoretical analysis and extensive experiments on a medical endoscopic dataset and several non-medical public datasets demonstrate the superiority of our proposed model.




1 Introduction

The success of unsupervised domain adaptation has extended to a large number of computer vision applications, e.g., semantic segmentation [Tsai_2019_ICCV, Dong_2019_ICCV]. Owing to their strong generalization capacity for segmenting unlabeled target data, numerous unsupervised domain adaptation methods [Lee_2019_CVPR, Li_2019_CVPR, Lian_2019_ICCV, exp:LtA, exp:CGAN] have been developed to narrow the distribution divergence between the labeled source dataset and the unlabeled target dataset.

Figure 1: Illustration of our unsupervised semantic transfer model, where two complementary modules T_D and T_F alternately explore where to translate transferable visual characterizations of medical lesions and how to augment transferable semantic features of various diseases, respectively.

However, most state-of-the-art models [Luo_2019_CVPR, Dong_2019_ICCV, Dou2018UCD3304415, Chen2019SynergisticIA] cannot efficiently address semantic transfer of medical lesions with various appearances, because it is difficult to determine which visual characterizations boost or cripple the performance of semantic transfer. Additionally, they fail to brush untransferable representations aside, and forcefully utilizing this irrelevant knowledge heavily degrades the transfer performance. Take clinical lesion diagnosis as an example: cancer and ulcer present diverse visual information (e.g., appearance, shape and texture) across gastroscope and enteroscopy datasets, which is a thorny diagnosis challenge due to the large distribution shift between datasets. Obviously, it is difficult to manually determine which lesion information promotes the transfer performance, i.e., to explore domain-invariant knowledge for various lesions. Therefore, how to automatically capture transferable visual characterizations and semantic representations while neglecting irrelevant knowledge across domains is our focus in this paper.

To address the above-mentioned challenges, as shown in Figure 1, we develop a new unsupervised semantic lesions transfer model to mitigate the domain gap between a labeled source lesions dataset (e.g., gastroscope) and an unlabeled target diseases dataset (e.g., enteroscopy). To be specific, the proposed model consists of two complementary modules, i.e., T_D and T_F, which automatically determine where and how to explore transferable knowledge from the source diseases dataset to assist the target lesions segmentation task. On one hand, motivated by information theory [45903], a residual transferability-aware bottleneck is developed for T_D to highlight where to translate transferable visual information while preventing irrelevant translation. On the other hand, a Residual Attention on Attention Block is proposed to encode domain-invariant knowledge with high transferability scores, which assists in exploring how to augment transferable semantic features and in return boosts the translation performance of module T_D. Meanwhile, target samples are progressively assigned confident pseudo pixel labels along the alternating training process of T_D and T_F, which further bridges the distribution shift in the retraining phase. Finally, a theoretical analysis of how our proposed model narrows the domain discrepancy between source and target datasets is elaborated. Extensive experiments on both a medical endoscopic dataset and several non-medical datasets are conducted to justify the effectiveness of our proposed model.

The main contributions of this paper are as follows:

  • A new unsupervised semantic representations transfer model is proposed for endoscopic lesions segmentation. To the best of our knowledge, this is one of the earliest attempts to automatically highlight transferable semantic knowledge for endoscopic lesions segmentation in the biomedical imaging field.

  • Two complementary modules T_D and T_F are developed to alternately explore transferable representations while neglecting untransferable knowledge, which can not only determine where to translate transferable visual information via T_D, but also highlight how to augment transferable representations via T_F.

  • A comprehensive theoretical analysis of how our model narrows the domain discrepancy is provided. Experiments are also conducted to validate the superiority of our model against state-of-the-art methods on the medical endoscopic dataset and several non-medical public datasets.

2 Related Work

This section reviews some related works about semantic lesions segmentation and unsupervised domain adaptation.

Semantic Segmentation of Lesions: Deep neural networks [wang2019laplacian, wang_TMM] have achieved significant success in numerous applications, e.g., medical lesions segmentation [Bozorgtabar2017, dezsampx001512018, xuLargeScaleTissue2017, article_Baillard, Dong_2019_ICCV]. Compared with traditional models [ChengComputerAided, HorschAutomaticSeg] that require handcrafted lesion features, they rely on a powerful lesion characterization capacity to boost the accuracy and efficiency of disease diagnosis, but need large-scale pixel labels. To reduce the annotation cost, unsupervised learning has been widely applied to medical lesions segmentation [DBLP-journals/corr/abs-1806-04972, Dou2018UCD3304415, Bozorgtabar2017, BowlesBrainLesion, Atlason2018UnsupervisedBL, pmlr-v102-baur19a]. However, these models require effective prior information [DBLP-journals/corr/abs-1806-04972] or a distribution hypothesis [Atlason2018UnsupervisedBL] to generalize to previously unseen diseases, and they produce only inaccurate and coarse lesion predictions. Thus, it remains a thorny challenge to perform well on unseen target lesions when training on source disease data [Dou2018UCD3304415, Chen2019SynergisticIA, Dong_2019_ICCV].

Unsupervised Domain Adaptation: After Hoffman et al. [exp:Wild] first utilized an adversarial network [Goodfellow:2014:GAN] to achieve domain adaptation for the semantic segmentation task, diverse variants based on the adversarial strategy [exp:LtA, exp:CCA, exp:LSD, exp:CGAN, Wu_2018_ECCV, Saito_2018_CVPR] were proposed to address the domain shift challenge. Different from these models, [exp:CL, Lian_2019_ICCV] employ curriculum learning to infer important properties of target images according to source samples. [Zou_2018_ECCV] designs a non-adversarial model to transfer semantic representations in a self-training manner. [Gong_2019_CVPR] presents domain flow translation to explore an expected intermediate domain. Li et al. [Li_2019_CVPR] propose a bidirectional learning model for target adaptation. In addition, novel adaptation losses [Lee_2019_CVPR, Vu_2019_CVPR, Luo_2019_CVPR] are designed to measure the discrepancy between datasets. [Dong_2019_ICCV] develops a pseudo pixel label generator to focus on hard-to-transfer target samples. [Tsai_2019_ICCV, Luo_2019_ICCV, ding2018robust, ding2018graph, NIPS2019_8940] explore discriminative semantic knowledge to narrow the distribution divergence.

Figure 2: Overview of the architecture of our proposed model, which is composed of two alternately trained complementary modules T_D and T_F. Specifically, T_D focuses on exploring where to translate transferable visual characterizations via a residual transferability-aware bottleneck. T_F highlights how to augment transferable semantic representations while neglecting untransferable knowledge, and incorporates multiple residual attention on attention blocks to capture domain-invariant features with high transferability.

3 The Proposed Model

In this section, we first present the overall framework of our proposed model and then introduce the detailed model formulation, followed by a comprehensive theoretical analysis.

3.1 Overview

We are given a source dataset (e.g., gastroscope) consisting of samples with pixel annotations and a target dataset (e.g., enteroscopy) consisting of images without pixel labels. Although existing semantic transfer models [Lee_2019_CVPR, Vu_2019_CVPR, Luo_2019_CVPR, Tsai_2019_ICCV, Luo_2019_ICCV] attempt to narrow the distribution shift between source and target datasets, semantic representations are not all transferable, and forcefully taking advantage of irrelevant knowledge can lead to negative transfer. Besides, the diverse appearances of various lesions make it difficult for these models to explore which visual characterizations will promote transfer performance. Therefore, we endeavor to automatically highlight the transferable representations between source and target datasets to improve the lesions segmentation performance on unlabeled target samples, while ignoring knowledge that is irrelevant for semantic transfer.

As depicted in Figure 2, the proposed model consists of two complementary modules, i.e., T_D and T_F, which alternately determine where and how to highlight transferable knowledge. Specifically, with the quantified transferability perception from the discriminator in T_F, the source samples are first passed into T_D to explore where to translate transferable visual characterizations, according to the style information of the target images. Afterwards, we forward the translated source samples along with the target images into T_F to determine how to augment transferable semantic features while ignoring untransferable representations. T_F further mitigates the domain gap in the feature space and in return promotes the translation performance of T_D. Our model can be regarded as a closed loop that alternately updates the parameters of T_D and T_F. Furthermore, along the alternating training process of T_D and T_F, our model progressively mines confident pseudo pixel labels for target samples, which are used to fine-tune the segmentation model in T_F to learn domain-invariant knowledge.
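The closed loop described above can be pictured as a simple alternating schedule. The sketch below is only an illustration under stated assumptions: the three callables (`update_td`, `update_tf`, `refresh_pseudo_labels`) and the round count are hypothetical placeholders, not the authors' training code.

```python
def alternate(update_td, update_tf, refresh_pseudo_labels, rounds):
    """Alternate between the two modules for a fixed number of rounds.

    All three callables are hypothetical stand-ins:
      update_td(round_idx)             -> translated source data (module T_D)
      update_tf(translated, pseudo)    -> updated feature/segmentation model (module T_F)
      refresh_pseudo_labels(model)     -> confident pseudo labels for target images
    """
    history = []
    pseudo = None  # no pseudo labels exist before the first round
    for r in range(rounds):
        translated = update_td(r)               # T_D: translate source toward target style
        model = update_tf(translated, pseudo)   # T_F: train on translated source + labeled target
        pseudo = refresh_pseudo_labels(model)   # mine pseudo labels for the next round
        history.append((r, translated, model, pseudo))
    return history
```

Each round feeds the pseudo labels mined in the previous round back into the T_F update, which is the "closed loop" aspect of the description.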

3.2 Quantified Transferability Perception

Intuitively, the domain uncertainty estimated by the discriminator in T_F can assist in identifying which representations can be transferred, cannot be transferred, or are already transferred. For example, source and target features that are already aligned across domains will fool the discriminator when it tries to distinguish whether the input comes from the source or the target dataset. In other words, we can readily tell whether the input source or target feature maps are transferable according to the output probabilities of the discriminator. Therefore, in order to highlight transferable representations, we utilize the uncertainty measure of information theory (i.e., the entropy criterion) to quantify the transferability perception of the corresponding semantic features. Taking the source samples as an example, given the output probability of the discriminator with its network weights, the transferability perception for the input source feature can be formally quantified as follows:


Similarly, Eq. (1) can also quantify the transferability of target features according to the corresponding discriminator output. Note that the quantified transferability for source and target features shares the same notation for simplicity.
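To make the entropy criterion concrete, the following minimal sketch computes the binary entropy of a discriminator output probability and rescales it to [0, 1]. It assumes the discriminator emits a single probability `d` per feature location; the function names are illustrative, and how the entropy is finally mapped to a transfer score (for example, used directly or inverted) is not fixed by the text above, so the sketch only reports the uncertainty itself.

```python
import math

def binary_entropy(d, eps=1e-12):
    """Entropy (in nats) of a Bernoulli variable with probability d."""
    d = min(max(d, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(d * math.log(d) + (1.0 - d) * math.log(1.0 - d))

def domain_uncertainty(d):
    """Entropy rescaled to [0, 1]; 1 means the discriminator cannot tell the domain."""
    return binary_entropy(d) / math.log(2.0)

# Hypothetical discriminator outputs for three feature locations.
probs = [0.5, 0.9, 0.99]
scores = [domain_uncertainty(p) for p in probs]
```

An output of 0.5 yields the maximum uncertainty, while outputs near 0 or 1 yield values near zero, matching the intuition that aligned features fool the discriminator.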

However, false transferability perception may hurt the semantic transfer task to some degree. Therefore, a residual transferability perception mechanism is designed to feed the positive transferability back into T_D in Section 3.3 and into the feature augmentor in Section 3.4, as shown in Figure 2.

3.3 Transferable Data Translation (T_D)

Different from the previous translation model [domain:class-preserve], our module T_D highlights where to translate transferable visual characterizations for better transfer performance. With the quantified transferability perception from Section 3.2, T_D pays more attention to selectively exploring transferable source-to-target and target-to-source mappings while preventing irrelevant translations with low transfer scores. Samples from both the source and target datasets are forwarded into T_D to train the translation model, which produces the corresponding translated source dataset and mapped target dataset. Each of the two translators has its own network parameters, and the target-to-source translator is the reverse of the source-to-target translator. Notice that each translated source image shares the same pixel annotations as its original image, although there is a large visual gap between them. To encourage the translated source images to have a distribution closer to that of the target images, an adversarial loss is employed to train the source-to-target translator, which can be written as follows:


where the target-domain discriminator, with its own network parameters, distinguishes between translated source images and real target samples. Likewise, a second adversarial loss is utilized to learn the translation from the target dataset to the source dataset, i.e.,


where the source-domain discriminator, with its own network weights, is defined similarly but discriminates whether the inputs are real source images or translated target samples. Additionally, semantic consistency between the input and reconstructed samples for both source and target data is ensured by a consistency loss:


As a result, the overall objective for training T_D is:


However, Eq. (5) cannot selectively capture important semantic knowledge with high transferability. Therefore, as shown in Figure 2, we develop a residual transferability-aware bottleneck, which determines where to translate transferable information by purifying semantic knowledge with high transfer scores. Specifically, building upon information theory [45903], we design an information constraint on the latent feature space, which is adaptively weighted by the quantified transferability perception in Eq. (1). It encourages the feature extractor in T_D to encode transferable representations. Formally, Eq. (5) can be reformulated as:


where the transferability weighting is applied as a channel-wise product. The marginal distribution of the latent features is taken to be the standard Gaussian distribution. The source and target datasets each have a transferability bottleneck threshold; the two are set to the same value in this paper and share one symbol for simplicity. The latent features of source and target samples are extracted by the feature extractor of T_D, which has its own network parameters. Taking source samples as an intuitive explanation of Eq. (6): a larger KL divergence between the latent feature distribution and the Gaussian marginal indicates a closer dependence between the latent features and the input samples, which pushes the feature extractor to encode more semantic representations from those samples. Obviously, these semantic representations are not all transferable for translation, and utilizing irrelevant knowledge leads to negative transfer. Thus, by constraining the KL divergence, weighted by the quantified transferability, to the threshold, untransferable representations from the samples can be neglected; the result is then regarded as the latent feature and forwarded into the decoder network. To optimize Eq. (6), we equivalently formulate it as Eq. (7) by employing two Lagrange multipliers for the source and target datasets:


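where the two Lagrange multipliers are updated iteratively according to their respective constraint terms. The last two terms of Eq. (7) are defined as the transferability constraint losses for the source and target datasets, and a shared step size controls the update of both multipliers.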
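As a rough illustration of how such a constrained bottleneck is commonly optimized, the sketch below computes the closed-form KL divergence between a diagonal Gaussian and the standard Gaussian, then applies a dual-ascent step to a Lagrange multiplier. It assumes the feature extractor outputs a mean and log-variance per latent dimension and that the transferability weight is a scalar; the names `kl_to_standard_normal`, `update_multiplier`, and the numeric values are hypothetical and not taken from the paper.

```python
import math

def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over latent dimensions."""
    return 0.5 * sum(m * m + math.exp(lv) - lv - 1.0 for m, lv in zip(mu, log_var))

def update_multiplier(lam, weighted_kl, threshold, step):
    """Dual ascent: raise lam when the constraint is violated, never below zero."""
    return max(0.0, lam + step * (weighted_kl - threshold))

# Hypothetical latent statistics and settings, for illustration only.
mu, log_var = [0.8, -0.3], [0.1, -0.2]
transfer_weight = 0.7   # stands in for the quantified transferability
threshold = 0.2         # stands in for the bottleneck threshold
lam = 0.0
weighted_kl = transfer_weight * kl_to_standard_normal(mu, log_var)
lam = update_multiplier(lam, weighted_kl, threshold, step=0.01)
```

When the weighted KL term exceeds the threshold the multiplier grows, which increases the penalty on over-informative latent features in the next update; when the constraint is satisfied the multiplier shrinks toward zero.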

3.4 Transferable Feature Augmentation (T_F)

Although T_D is designed to translate transferable visual characterizations, it cannot ensure that the feature distributions across domains are well aligned. Motivated by this observation, the transferable feature augmentation module T_F is developed to automatically determine how to augment transferable semantic features, which further mitigates the domain gap between datasets and in return boosts the performance of T_D. As depicted in Figure 2, the feature augmentor encodes transferable representations from low-level and high-level layers, which preserve informative details, by incorporating multiple residual attention on attention blocks. Each block focuses on highlighting the relevance of transferable representations; its details are presented as follows.

Figure 3: Detailed illustration of the residual attention on attention block.

As shown in Figure 3, given an input feature, we forward it into three convolutional blocks to produce three new features, each described by its height, width and number of channels. After the first two are reshaped so that the spatial dimensions are flattened into a single axis of pixel positions, an attention matrix with softmax activation is obtained. We then apply matrix multiplication between the transpose of the attention matrix and the reshaped third feature to output the attention feature map, where the attention matrix and the attention feature map can be formulated as:


where the features are indexed by pixel position. Even when there are no relevant transferable features, Eq. (8) still generates an averaged, weighted feature map, which could heavily degrade the transferability of semantic knowledge or even make it untransferable.

Therefore, we develop the attention on attention module to measure the relevance between the attention result and the input feature. The transferable information flow and the relevance gate are then produced via linear operations on the attention result and the input feature, i.e.,


where the linear operations are parameterized by transformation matrices. Afterwards, the information flow and the relevance gate are reshaped to the spatial layout of the input feature and combined by element-wise multiplication. We multiply the result by a scalar parameter and add it element-wise to the input feature to obtain the final feature:


where the scalar parameter is initialized as 0, and its value is adaptively learned during training.
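The block described above can be sketched on a toy scale in plain Python. This is a minimal illustration, not the authors' implementation: it assumes the three convolutional blocks act as per-pixel linear maps (`Wq`, `Wk`, `Wv`), that the information flow and gate are linear maps (`Wi`, `Wg`) applied to the concatenation of the attention result and the input, and that the gate uses a sigmoid. All of these names and choices are assumptions made for the example.

```python
import math

def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def residual_aoa(feats, Wq, Wk, Wv, Wi, Wg, gamma):
    """feats: list of N pixel features, each a list of C floats."""
    queries = [matvec(Wq, f) for f in feats]
    keys = [matvec(Wk, f) for f in feats]
    values = [matvec(Wv, f) for f in feats]
    channels = len(values[0])
    out = []
    for f, q in zip(feats, queries):
        # Attention weights of this pixel over all pixel positions.
        weights = softmax([sum(a * b for a, b in zip(q, k)) for k in keys])
        attended = [sum(w * v[c] for w, v in zip(weights, values)) for c in range(channels)]
        joint = attended + f                      # concatenate attention result and input
        info = matvec(Wi, joint)                  # information flow
        gate = [sigmoid(z) for z in matvec(Wg, joint)]  # relevance gate in (0, 1)
        # Residual connection scaled by the learnable scalar gamma.
        out.append([x + gamma * i * g for x, i, g in zip(f, info, gate)])
    return out

# Toy example: 2 pixels, 2 channels, identity projections.
I2 = [[1.0, 0.0], [0.0, 1.0]]
Wi = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]  # info = attention result
Wg = [[0.0] * 4, [0.0] * 4]                        # gate = 0.5 everywhere
feats = [[1.0, 0.0], [0.0, 1.0]]
unchanged = residual_aoa(feats, I2, I2, I2, Wi, Wg, gamma=0.0)
augmented = residual_aoa(feats, I2, I2, I2, Wi, Wg, gamma=1.0)
```

With the scalar at its initial value of 0 the block is an identity mapping, so training starts from the unmodified features and gradually mixes in the gated attention signal.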

With the quantified transferability perception from the discriminator, the feature augmentor can selectively augment transferable representations while preventing irrelevant augmentation, which helps the segmentation module learn domain-invariant knowledge and in return further improves the performance of T_D. The details of how to train T_F are as follows:

Step I: The translated source samples with pixel annotations and the target images with generated pseudo pixel labels are forwarded into the segmentation model, which has its own network weights. The segmentation loss for training it can be expressed as:


where the two probability terms denote the predicted probabilities of a given class at a source pixel and a target pixel, respectively, and the sums run over all classes. Confident pseudo labels are generated for target pixels for training, subject to a probability threshold.

Step II: In order to encourage the feature augmentor to synthesize new transferable features that resemble the features extracted from the source or target datasets (i.e., feature augmentation), the discriminator is employed to distinguish whether the input is an extracted or an augmented feature. Intuitively, with the assistance of the quantified transferability from the discriminator, the augmentor selectively augments transferable domain-invariant features while neglecting untransferable knowledge. Consequently, a dedicated loss is designed to train the augmentor while fixing the parameters of the segmentation model learned in Step I:


where the feature augmentor has its own parameters, and the noise samples are drawn from a Gaussian distribution.

Step III: The network weights of the feature augmentor learned in Step II are fixed in Step III. The discriminator is retrained to discriminate whether the input comes from the original datasets or from the augmented transferable representations. This encourages the segmentation network to explore a common feature space in which target features are indistinguishable from source features. As a result, the training objective in Eq. (13) is optimized in this step, which captures transferable domain-invariant knowledge while neglecting untransferable representations.


Notice that the quantified transferability perception in Section 3.2 is taken from the discriminator in Step III rather than Step II.
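The pseudo-label generation used in Step I can be sketched as a simple confidence threshold on per-pixel class probabilities. The sketch assumes each target pixel has a probability vector over classes and that unconfident pixels are marked with an ignore index; the function name, the `ignore_index` convention and the threshold value are illustrative assumptions.

```python
def pseudo_labels(pixel_probs, threshold, ignore_index=-1):
    """Assign the most likely class to each pixel only if its probability exceeds threshold."""
    labels = []
    for probs in pixel_probs:
        best = max(range(len(probs)), key=probs.__getitem__)  # index of the top class
        labels.append(best if probs[best] > threshold else ignore_index)
    return labels

# Three hypothetical target pixels with two classes (e.g., normal vs. disease).
example = pseudo_labels([[0.95, 0.05], [0.60, 0.40], [0.08, 0.92]], threshold=0.9)
```

Pixels marked with the ignore index would simply be excluded from the target part of the segmentation loss, so only confident predictions contribute during retraining.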

Method                  IoU of normal (%)  IoU of disease (%)  mIoU (%)
BL [net:deeplab]        74.47              32.65               53.56
LtA [exp:LtA]           81.04              40.35               60.70
CGAN [exp:CGAN]         79.75              40.52               60.13
CLAN [Luo_2019_CVPR]    81.74              41.33               61.54
ADV [Vu_2019_CVPR]      81.95              42.27               62.11
BDL [Li_2019_CVPR]      84.22              42.84               63.53
SWES [Dong_2019_ICCV]   83.96              42.63               63.29
DPR [Tsai_2019_ICCV]    83.23              42.11               62.67
PyCDA [Lian_2019_ICCV]  84.31              43.08               63.70
Ours                    85.48              43.67               64.58
Table 1: Performance comparison between our proposed model and several competing methods on medical endoscopic dataset.

3.5 Implementation Details

Network Architecture: For the transferable visual translation module T_D, CycleGAN [DBLP:journals/corr/ZhuPIE17] is employed as the baseline network. As depicted in Figure 2, the residual transferability-aware bottleneck is attached to the last convolutional block of the feature extractor in T_D. In the transferable feature augmentation module T_F, the segmentation network is DeepLab-v3 [net:deeplab] with ResNet-101 [net:resnet] as the backbone architecture, in which the strides of the last two convolutional blocks are changed from 2 to 1 for higher-resolution output. The feature augmentor encodes the features from the bottom and the last convolutional blocks of the segmentation network, which are first augmented with noise drawn from a Gaussian distribution. For the discriminator, we utilize 5 fully convolutional layers with channel numbers {16, 32, 64, 64, 1}, where each layer is activated by the leaky ReLU function with parameter 0.2, except the last convolution filter, which is activated by the sigmoid function.

Training and Testing: The two complementary modules T_D and T_F are alternately trained until convergence. When training T_D, the hyper-parameters are set following [DBLP:journals/corr/ZhuPIE17]. The learning rate is kept at its initial value for the first 10 epochs and linearly decreases to 0 over the later 5 epochs. The quantities in Eq. (7) are initialized to constants, and the Lagrange multipliers use a fixed updating step. For the DeepLab-v3 backbone [net:deeplab], we utilize the SGD optimizer with a learning rate decay power of 0.9. The Adam optimizer, with momentum terms 0.9 and 0.99, is employed for the adversarial training in T_F. In the testing stage, the target images (e.g., enteroscopy) are directly forwarded into the segmentation network for evaluation.
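The two learning-rate schedules mentioned above can be sketched as follows. The first function holds the rate constant for 10 epochs and decays it linearly to 0 over the next 5; the second assumes the "power 0.9" setting refers to the polynomial decay commonly used with DeepLab, which is an assumption on our part. The base learning rates passed in are placeholders, since no specific values are given here.

```python
def linear_decay_lr(epoch, base_lr, hold_epochs=10, decay_epochs=5):
    """Constant for hold_epochs, then linear decay reaching 0 at hold_epochs + decay_epochs."""
    if epoch < hold_epochs:
        return base_lr
    remaining = hold_epochs + decay_epochs - epoch
    return base_lr * max(0.0, remaining / decay_epochs)

def poly_lr(step, total_steps, base_lr, power=0.9):
    """Polynomial decay: base_lr * (1 - step / total_steps) ** power."""
    return base_lr * (1.0 - step / total_steps) ** power

# Placeholder base rate of 1.0 so the values read as multipliers.
translation_schedule = [linear_decay_lr(e, 1.0) for e in range(16)]
segmentation_schedule = [poly_lr(s, 100, 1.0) for s in (0, 50, 100)]
```

Using a base rate of 1.0 makes each schedule a multiplier that can be scaled by whatever initial learning rate is chosen.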

3.6 Theoretical Analysis

In this subsection, we elaborate the theoretical analysis of our model in terms of narrowing the domain discrepancy between the source and target distributions, with regard to the hypothesis set. As pointed out by [BenDavid2010], the expected error of any classifier from the hypothesis set on the target dataset has a theoretical upper bound, i.e.,


where the bound consists of an independent constant, the expected error of the classifier on source samples (which can be negligibly small under supervised training), and the divergence distance between the source and target distributions. Thus, the relationship between our model and the domain discrepancy is discussed next.
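For reference, the bound from [BenDavid2010] is commonly written in the following form. The symbols are our own choice for illustration ($\epsilon_S$ and $\epsilon_T$ for the source and target expected errors, $\mathcal{D}_S$ and $\mathcal{D}_T$ for the two distributions, $C$ for the constant), and the factor on the divergence term varies between statements of the result, so this should be read as a standard form rather than the exact expression used in this paper.

```latex
\epsilon_T(h) \;\le\; \epsilon_S(h)
  + \tfrac{1}{2}\, d_{\mathcal{H}\Delta\mathcal{H}}(\mathcal{D}_S, \mathcal{D}_T)
  + C,
\qquad \forall\, h \in \mathcal{H}
```

Since the source error is small under supervised training and the constant does not depend on the classifier, reducing the divergence term is what tightens the bound on the target error.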

As a distance metric between the source and target distributions, the divergence satisfies the following triangle inequality, i.e.,


where the intermediate distribution is the marginal distribution of the latent features.
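A triangle inequality consistent with this description can be sketched as below, assuming $d$ denotes the divergence, $\mathcal{D}_S$ and $\mathcal{D}_T$ the source and target distributions, and $r$ the marginal (standard Gaussian) distribution of the latent features. These symbols are illustrative and may differ from the paper's own notation.

```latex
d(\mathcal{D}_S, \mathcal{D}_T) \;\le\; d(\mathcal{D}_S, r) + d(r, \mathcal{D}_T)
```

Under this reading, pulling both feature distributions toward the same Gaussian marginal shrinks the two right-hand terms and therefore the discrepancy on the left.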

Recall that the two complementary modules T_D and T_F (Eq. (7) and Eq. (13)) alternately prevent the negative transfer of untransferable knowledge, which encourages the distributions of both source and target features to tend to the standard Gaussian. Consequently, our proposed model forces the last two terms of Eq. (15) to be near zero. In summary, our model can efficiently achieve a tighter upper bound on the target expected error and reduce the domain discrepancy.

Figure 4: The complementary effect of modules T_D and T_F on mIoU (left) and domain gap (right) on the endoscopic dataset.

4 Experiments

4.1 Datasets and Evaluation

Medical Endoscopic Dataset [Dong_2019_ICCV] is collected from various endoscopic lesions, i.e., cancer, polyp, gastritis, ulcer and bleeding. Specifically, it consists of 2969 gastroscope samples and 690 enteroscopy images. For the training phase, the 2969 gastroscope images with pixel annotations are regarded as the source data, and 300 enteroscopy samples without pixel labels are treated as the target data. In the testing stage, we use the other 390 enteroscopy samples for evaluation.

Cityscapes [data:city] is a real-world dataset of European urban street scenes, which is collected from 50 cities and has 34 defined categories in total. It is composed of three disjoint subsets with 2993, 503 and 1531 images for training, testing and validation, respectively.

GTA [data:GTA] consists of 24996 images generated from fictional city scenes of Los Santos in the computer game Grand Theft Auto V. The annotation categories are compatible with the Cityscapes dataset [data:city].

SYNTHIA [data:synthia] is a large-scale synthetic dataset whose urban scenes are collected from a virtual city that does not correspond to any real city. We utilize its subset called SYNTHIA-RANDCITYSCAPES in our experiments, which contains 9400 images with 12 automatically labeled object classes and some undefined categories.

Evaluation Metric: Intersection over union (IoU) is regarded as the basic evaluation metric. Besides, we utilize three derived metrics, i.e., mean IoU (mIoU), IoU of normal, and IoU of disease.
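For readers unfamiliar with these metrics, a minimal sketch of per-class IoU and mIoU on flattened label arrays follows. The function names are illustrative, and classes absent from both prediction and ground truth are skipped when averaging, which is one common convention rather than something specified here.

```python
def iou_per_class(pred, target, num_classes):
    """Intersection over union for each class, given flat lists of predicted and true labels."""
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        ious.append(inter / union if union else float("nan"))  # NaN marks an absent class
    return ious

def mean_iou(ious):
    """Average IoU over classes that actually occur (NaN entries are dropped)."""
    valid = [v for v in ious if v == v]  # NaN is the only value not equal to itself
    return sum(valid) / len(valid)

# Four pixels, two classes (0 = normal, 1 = disease).
ious = iou_per_class([0, 0, 1, 1], [0, 1, 1, 1], num_classes=2)
miou = mean_iou(ious)
```

With two classes, mIoU is simply the average of the normal and disease IoUs; for example, the "Ours" entry of Table 1 satisfies (85.48 + 43.67) / 2 ≈ 64.58.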

Notations: In all experiments, BL represents the baseline network DeepLab-v3 [net:deeplab] without semantic transfer.

4.2 Experiments on Medical Endoscopic Dataset

In our experiments, all the competing methods in Table 1 employ ResNet-101 [net:resnet] as the backbone architecture for a fair comparison. From the results presented in Table 1, we can observe that: 1) Our model improves mIoU by about 11.02% over the baseline BL [net:deeplab] (64.58% vs. 53.56%), significantly mitigating the domain gap between source and target datasets. 2) Existing transfer models [Dong_2019_ICCV, Vu_2019_CVPR, Lian_2019_ICCV, Tsai_2019_ICCV, Li_2019_CVPR] perform worse than our model, since they pay equal attention to all semantic representations instead of neglecting irrelevant knowledge, which causes negative transfer of untransferable knowledge.

Effect of Complementary Modules T_D and T_F: This subsection presents alternating iteration experiments to validate the effectiveness of the complementary modules T_D and T_F. As shown in Figure 4, T_D and T_F mutually promote each other and progressively narrow the domain gap along the alternating iteration process. After a few iterations (e.g., 3 for this medical dataset), the performance of our model converges. After T_D is used to translate transferable visual information, T_F can further automatically determine how to augment transferable semantic features and in return promote the translation performance of T_D. The experimental results are in accordance with the theoretical analysis in Section 3.6.

Variants      QT   PL   TKB  AA   mIoU (%)  Δ
Ours-w/oQT    no   yes  yes  yes  61.47     -3.11
Ours-w/oPL    yes  no   yes  yes  61.35     -3.23
Ours-w/oTKB   yes  yes  no   yes  62.73     -1.85
Ours-w/oAA    yes  yes  yes  no   63.06     -1.52
Ours          yes  yes  yes  yes  64.58     -
Table 2: Ablation experiments on the medical endoscopic dataset.
Figure 5: The effect of different numbers of residual attention on attention blocks (left, panel a), and the generation process of pseudo labels along the alternating iteration number of modules T_D and T_F (right, panel b) on the medical dataset.

Ablation Studies: To verify the importance of the different components of our proposed model, we conduct variant experiments that ablate individual components on the medical endoscopic dataset, i.e., quantified transferability (QT), pseudo labels (PL), the transferability-aware bottleneck (TKB) and attention on attention (AA) in the residual attention on attention block. Training the model without QT, PL, TKB and AA is denoted as Ours-w/oQT, Ours-w/oPL, Ours-w/oTKB and Ours-w/oAA, respectively. From the results presented in Table 2, we notice that the performance degrades by 1.52% to 3.23% mIoU after removing any component of our model, which justifies the rationality and effectiveness of each designed component. Besides, with the quantified transferability perception from the discriminator, our model can efficiently encode transferable semantic knowledge between source and target datasets while brushing irrelevant representations aside. Multiple residual attention on attention blocks play an essential role in capturing the relevance of transferable knowledge, and we set their number to 16, as illustrated in Figure 5 (a). Moreover, the distribution shift between datasets can be further bridged by confident pseudo labels, which are generated progressively along the iteration process, as depicted in Figure 5 (b).

Parameters Investigations: In this subsection, extensive experiments are conducted to investigate the effect of two hyper-parameters, which helps determine their optimal values. Two paired parameters share the same value in our experiments. Notice that our model achieves stable performance over a wide range of parameter values, as shown in Figure 6. Furthermore, this also validates that the residual transferability-aware bottleneck in Eq. (7) can efficiently purify the transferable semantic representations with high transfer scores.

Figure 6: The parameter investigations for the two hyper-parameters (left, panel a; right, panel b) on the medical dataset.
Figure 7: The complementary effect of modules T_D and T_F on mIoU (left) and domain gap (right) on several benchmark datasets.
Figure 8: Pseudo labels generated along the alternating iteration number of modules T_D and T_F on the GTA → Cityscapes task.
Method road sidewalk building wall fence pole light sign veg terrain sky person rider car truck bus train mbike bike mIoU(%)
LtA [exp:LtA] 86.5 36.0 79.9 23.4 23.3 23.9 35.2 14.8 83.4 33.3 75.6 58.5 27.6 73.7 32.5 35.4 3.9 30.1 28.1 42.4
MCD [Saito_2018_CVPR] 90.3 31.0 78.5 19.7 17.3 28.6 30.9 16.1 83.7 30.0 69.1 58.5 19.6 81.5 23.8 30.0 5.7 25.7 14.3 39.7
CGAN [exp:CGAN] 89.2 49.0 70.7 13.5 10.9 38.5 29.4 33.7 77.9 37.6 65.8 75.1 32.4 77.8 39.2 45.2 0.0 25.2 35.4 44.5
CBST [Zou_2018_ECCV] 88.0 56.2 77.0 27.4 22.4 40.7 47.3 40.9 82.4 21.6 60.3 50.2 20.4 83.8 35.0 51.0 15.2 20.6 37.0 46.2
CLAN [Luo_2019_CVPR] 87.0 27.1 79.6 27.3 23.3 28.3 35.5 24.2 83.6 27.4 74.2 58.6 28.0 76.2 33.1 36.7 6.7 31.9 31.4 43.2
SWD [Lee_2019_CVPR] 92.0 46.4 82.4 24.8 24.0 35.1 33.4 34.2 83.6 30.4 80.9 56.9 21.9 82.0 24.4 28.7 6.1 25.0 33.6 44.5
ADV [Vu_2019_CVPR] 89.4 33.1 81.0 26.6 26.8 27.2 33.5 24.7 83.9 36.7 78.8 58.7 30.5 84.8 38.5 44.5 1.7 31.6 32.5 45.5
BDL [Li_2019_CVPR] 91.0 44.7 84.2 34.6 27.6 30.2 36.0 36.0 85.0 43.6 83.0 58.6 31.6 83.3 35.3 49.7 3.3 28.8 35.6 48.5
SWLS [Dong_2019_ICCV] 92.7 48.0 78.8 25.7 27.2 36.0 42.2 45.3 80.6 14.6 66.0 62.1 30.4 86.2 28.0 45.6 35.9 16.8 34.7 47.2
DPR [Tsai_2019_ICCV] 92.3 51.9 82.1 29.2 25.1 24.5 33.8 33.0 82.4 32.8 82.2 58.6 27.2 84.3 33.4 46.3 2.2 29.5 32.3 46.5
PyCDA [Lian_2019_ICCV] 90.5 36.3 84.4 32.4 28.7 34.6 36.4 31.5 86.8 37.9 78.5 62.3 21.5 85.6 27.9 34.8 18.0 22.9 49.3 47.4
BL 75.8 16.8 77.2 12.5 21.0 25.5 30.1 20.1 81.3 24.6 70.3 53.8 26.4 49.9 17.2 25.9 6.5 25.3 36.0 36.6
Ours-w/oQT 89.0 40.0 83.4 34.0 23.7 32.2 36.6 33.1 84.0 39.3 74.3 58.9 27.2 78.8 32.6 35.1 0.1 28.4 37.4 45.7
Ours-w/oPL 90.6 40.8 84.1 31.3 22.7 32.0 39.0 33.7 84.3 39.5 80.7 58.4 28.7 82.8 27.4 48.1 1.0 27.0 28.5 46.4
Ours-w/oTKB 88.9 45.2 82.9 32.7 26.6 31.5 34.8 34.3 83.5 38.8 81.5 60.0 31.5 80.6 30.8 44.9 5.2 33.8 35.4 47.5
Ours-w/oAA 89.1 49.8 82.7 32.8 26.6 32.0 35.8 32.4 83.1 37.2 83.8 58.7 32.9 81.0 34.9 47.1 1.5 33.1 36.8 48.0
Ours 89.4 50.1 83.9 35.9 27.0 32.4 38.6 37.5 84.5 39.6 85.7 61.6 33.7 82.2 36.0 50.4 0.3 33.6 32.1 49.2
Table 3: Performance comparison of transferring semantic representations from GTA to Cityscapes.
Method road sidewalk building wall fence pole light sign veg sky person rider car bus mbike bike mIoU(%)
LSD [exp:LSD] 80.1 29.1 77.5 2.8 0.4 26.8 11.1 18.0 78.1 76.7 48.2 15.2 70.5 17.4 8.7 16.7 36.1
MCD [Saito_2018_CVPR] 84.8 43.6 79.0 3.9 0.2 29.1 7.2 5.5 83.8 83.1 51.0 11.7 79.9 27.2 6.2 0.0 37.3
CGAN [exp:CGAN] 85.0 25.8 73.5 3.4 3.0 31.5 19.5 21.3 67.4 69.4 68.5 25.0 76.5 41.6 17.9 29.5 41.2
DCAN [Wu_2018_ECCV] 82.8 36.4 75.7 5.1 0.1 25.8 8.0 18.7 74.7 76.9 51.1 15.9 77.7 24.8 4.1 37.3 38.4
CBST [Zou_2018_ECCV] 53.6 23.7 75.0 12.5 0.3 36.4 23.5 26.3 84.8 74.7 67.2 17.5 84.5 28.4 15.2 55.8 42.5
ADV [Vu_2019_CVPR] 85.6 42.2 79.7 8.7 0.4 25.9 5.4 8.1 80.4 84.1 57.9 23.8 73.3 36.4 14.2 33.0 41.2
SWLS [Dong_2019_ICCV] 68.4 30.1 74.2 21.5 0.4 29.2 29.3 25.1 80.3 81.5 63.1 16.4 75.6 13.5 26.1 51.9 42.9
DPR [Tsai_2019_ICCV] 82.4 38.0 78.6 8.7 0.6 26.0 3.9 11.1 75.5 84.6 53.5 21.6 71.4 32.6 19.3 31.7 40.0
PyCDA [Lian_2019_ICCV] 75.5 30.9 83.3 20.8 0.7 32.7 27.3 33.5 84.7 85.0 64.1 25.4 85.0 45.2 21.2 32.0 46.7
BL 55.6 23.8 74.6 9.2 0.2 24.4 6.1 12.1 74.8 79.0 55.3 19.1 39.6 23.3 13.7 25.0 33.5
Ours-w/oQT 69.4 30.9 79.8 21.3 0.5 30.2 31.0 22.7 82.3 82.6 66.4 15.2 79.1 20.5 26.7 48.2 44.2
Ours-w/oPL 70.3 32.1 77.8 22.9 0.8 29.6 32.4 24.3 81.7 80.1 62.9 22.0 75.4 26.2 25.3 51.0 44.7
Ours-w/oTKB 78.6 39.2 80.4 19.5 0.6 27.8 29.1 21.5 80.8 82.0 64.5 24.7 83.5 29.6 24.1 46.3 45.8
Ours-w/oAA 81.3 41.5 79.2 21.8 0.7 28.3 27.6 20.1 81.7 80.9 62.7 25.3 82.1 34.5 23.6 47.3 46.2
Ours 81.7 43.8 80.1 22.3 0.5 29.4 28.6 21.2 83.4 82.3 63.1 26.2 83.7 34.9 26.3 48.4 47.2
Table 4: Performance comparison of transferring semantic knowledge from SYNTHIA to Cityscapes.

4.3 Experiments on Benchmark Datasets

Extensive experiments on several non-medical benchmark datasets are also conducted to further illustrate the generalization performance of our model. For a fair comparison, we adopt the same experimental data configuration as the compared state-of-the-art methods [exp:CGAN, exp:LtA, Dong_2019_ICCV, Lee_2019_CVPR, Vu_2019_CVPR]. Specifically, in the training phase, GTA [data:GTA] and SYNTHIA [data:synthia] serve as the source datasets, and the training subset of Cityscapes [data:city] serves as the target dataset. We use the validation subset of Cityscapes [data:city] for evaluation. Table 3 and Table 4 report the results of transferring from GTA and SYNTHIA to Cityscapes, respectively. From these tables, we make the following observations: 1) Our model achieves the highest mIoU in both tables (49.2% on GTA → Cityscapes and 47.2% on SYNTHIA → Cityscapes), outperforming the existing advanced transfer models [exp:CGAN, exp:LtA, Dong_2019_ICCV, Saito_2018_CVPR, Vu_2019_CVPR], since the two complementary modules alternately explore where and how to highlight transferable knowledge to bridge the domain gap, as shown in Figure 7. 2) The ablation studies show that every component matters: each variant with one component removed (Ours-w/oQT, Ours-w/oPL, Ours-w/oTKB, Ours-w/oAA) obtains a lower mIoU than the full model in both tables, which indicates that each component helps highlight transferable domain-invariant knowledge and improves the transfer performance. 3) Our model achieves larger improvements on hard-to-transfer classes whose appearances vary across datasets (e.g., sidewalk, wall, motorbike, rider, sky and terrain) by selectively neglecting untransferable knowledge. In addition, Figure 8 presents the pseudo labels generated iteratively on the GTA → Cityscapes task, which progressively narrow the distribution divergence.
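As a reading aid for Table 3 and Table 4, the short sketch below recomputes the mIoU column as the unweighted mean of the per-class IoU values and derives the gain of the full model over the baseline (BL). It assumes the standard definition IoU = TP / (TP + FP + FN) per class; the function names `class_iou` and `mean_iou` are hypothetical helpers introduced for illustration and are not taken from the authors' evaluation code. All numbers are copied from the "Ours" and "BL" rows of the two tables.

```python
# Minimal sketch: recompute the mIoU column of Table 3 and Table 4.
# Assumption: mIoU is the unweighted mean of per-class IoU (standard definition).

def class_iou(tp, fp, fn):
    """IoU of one class from pixel counts: TP / (TP + FP + FN)."""
    denom = tp + fp + fn
    return tp / denom if denom > 0 else float("nan")

def mean_iou(per_class_iou):
    """Unweighted mean of per-class IoU values (%), rounded to one decimal."""
    return round(sum(per_class_iou) / len(per_class_iou), 1)

# Per-class IoU (%) of the "Ours" row in Table 3 (GTA -> Cityscapes, 19 classes).
OURS_GTA = [89.4, 50.1, 83.9, 35.9, 27.0, 32.4, 38.6, 37.5, 84.5, 39.6,
            85.7, 61.6, 33.7, 82.2, 36.0, 50.4, 0.3, 33.6, 32.1]

# Per-class IoU (%) of the "Ours" row in Table 4 (SYNTHIA -> Cityscapes, 16 classes).
OURS_SYNTHIA = [81.7, 43.8, 80.1, 22.3, 0.5, 29.4, 28.6, 21.2, 83.4, 82.3,
                63.1, 26.2, 83.7, 34.9, 26.3, 48.4]

# mIoU (%) of the baseline (BL) rows, taken directly from the mIoU column.
BL_GTA_MIOU = 36.6
BL_SYNTHIA_MIOU = 33.5

miou_gta = mean_iou(OURS_GTA)
miou_synthia = mean_iou(OURS_SYNTHIA)

# Gain of the full model over the baseline, in mIoU percentage points.
gain_gta = round(miou_gta - BL_GTA_MIOU, 1)
gain_synthia = round(miou_synthia - BL_SYNTHIA_MIOU, 1)

if __name__ == "__main__":
    print("GTA -> Cityscapes mIoU:", miou_gta, "gain over BL:", gain_gta)
    print("SYNTHIA -> Cityscapes mIoU:", miou_synthia, "gain over BL:", gain_synthia)
```

The recomputed means match the reported mIoU values of the full model in both tables, which suggests the mIoU column is a plain average over the listed classes (19 for GTA, 16 for SYNTHIA).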

5 Conclusion

In this paper, we develop a new unsupervised semantic transfer model including two complementary modules (T_D and T_F), which alternately explore transferable domain-invariant knowledge between a labeled source gastroscope lesions dataset and an unlabeled target enteroscopy diseases dataset. Specifically, T_D explores where to translate transferable visual characterizations while preventing untransferable translation. T_F highlights how to augment those semantic representations with high transferability scores, which in return promotes the translation performance of T_D. Comprehensive theoretical analysis and experiments on the medical endoscopic dataset and several non-medical benchmark datasets validate the effectiveness of our model.